BaseRelation — Collection of Tuples with Schema

BaseRelation is the contract of relations (aka collections of tuples) with a known schema.

"Data source", "relation" and "table" are often used as synonyms.
package org.apache.spark.sql.sources

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}

Table 1. (Subset of) BaseRelation Contract
Method        Description
schema        StructType that describes the schema of tuples
sqlContext    SQLContext

BaseRelation is "created" when DataSource is requested to resolve a relation.

BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.
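
To make the contract concrete, here is a minimal sketch of a custom relation turned into a DataFrame. It assumes Spark SQL is on the classpath (e.g. in spark-shell or a project that depends on spark-sql). PeopleRelation, PeopleRelationDemo and the two sample rows are hypothetical names and data used only for illustration; TableScan and SparkSession.baseRelationToDataFrame are part of Spark's public API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation: two fixed rows with a known schema.
class PeopleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // The schema of every tuple this relation produces.
  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)))

  // TableScan asks for all tuples as an RDD[Row].
  // needConversion is left at its default (true), so plain JVM values are fine here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "Ann"), Row(2, "Bob")))
}

object PeopleRelationDemo {
  // Wraps the relation in a DataFrame through the SparkSession.
  def toDataFrame(spark: SparkSession): DataFrame =
    spark.baseRelationToDataFrame(new PeopleRelation(spark.sqlContext))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("base-relation-sketch")
      .getOrCreate()
    toDataFrame(spark).show()
    spark.stop()
  }
}
```

The relation only describes the data (schema) and how to scan it (buildScan); everything else, including needConversion and sizeInBytes, falls back to the defaults described in this section.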

BaseRelation uses needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.

Custom data sources (outside Spark SQL) should leave the needConversion flag enabled, i.e. true.

BaseRelation can optionally give an estimated size (in bytes).

Table 2. BaseRelations
BaseRelation Description


Used in Spark Structured Streaming




Datasets with records from Apache Kafka

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method

needConversion: Boolean

needConversion flag is enabled (true) by default.

It is recommended to leave needConversion enabled for data sources outside Spark SQL.

needConversion is used exclusively when the DataSourceStrategy execution planning strategy is executed (and converts the scan result from RDD[Row] to RDD[InternalRow]).
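
As a small illustration of what "converting to Catalyst types" means for strings, the sketch below builds a UTF8String from a java.lang.String by hand. It assumes the Spark jars are on the classpath; ConversionSketch is a hypothetical name. It does not reproduce what DataSourceStrategy does internally, it only shows the two representations side by side.

```scala
import org.apache.spark.unsafe.types.UTF8String

object ConversionSketch {
  // External (JVM) representation, as a data source would put it inside a Row.
  val external: String = "spark"

  // Internal (Catalyst) representation of the same value.
  val internal: UTF8String = UTF8String.fromString(external)

  def main(args: Array[String]): Unit = {
    println(s"external=$external, internal bytes=${internal.numBytes}")
  }
}
```

With needConversion enabled, a custom data source can keep producing ordinary JVM values and leave this kind of conversion to Spark SQL.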

Finding Unhandled Filter Predicates — unhandledFilters Method

unhandledFilters(filters: Array[Filter]): Array[Filter]

unhandledFilters returns Filter predicates that the data source does not support (handle) natively.

unhandledFilters returns the input filters by default, as it is considered safe to evaluate filters twice regardless of whether the data source could handle them.

unhandledFilters is used exclusively when the DataSourceStrategy execution planning strategy is requested to selectFilters.
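
A data source that does evaluate some predicates itself can override unhandledFilters to say so. The following is a minimal sketch, assuming Spark SQL is on the classpath; IdRelation, its single id column and the rule "only id = <int> is handled natively" are hypothetical choices made for illustration. Filter, EqualTo and PrunedFilteredScan come from org.apache.spark.sql.sources.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation over the integers 1 to 10.
class IdRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType, nullable = false)))

  // Assumption: this source evaluates only `id = <int>` on its own.
  private def handled(filter: Filter): Boolean = filter match {
    case EqualTo("id", _: Int) => true
    case _                     => false
  }

  // Everything the source cannot evaluate is reported back as unhandled.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(handled)

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    // Apply the handled equality filters; ignore the rest.
    val wanted = filters.collect { case EqualTo("id", v: Int) => v }
    val ids = (1 to 10).filter(id => wanted.forall(_ == id))
    // The only column is id, so every requested column gets the id value.
    val rows = ids.map(id => Row.fromSeq(requiredColumns.toSeq.map(_ => id)))
    sqlContext.sparkContext.parallelize(rows)
  }
}
```

Returning a filter from unhandledFilters is always the safe choice: the predicate is then evaluated after the scan, so a source should drop a filter from the result only when buildScan really applies it.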

Requesting Estimated Size — sizeInBytes Method

sizeInBytes: Long

sizeInBytes is the estimated size of a relation (used in query planning).

sizeInBytes defaults to the value of the spark.sql.defaultSizeInBytes internal property (Long.MaxValue by default, i.e. effectively infinite).

sizeInBytes is used exclusively when LogicalRelation is requested to computeStats (and the statistics are not available in CatalogTable).
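
A relation that knows roughly how much data it holds can override sizeInBytes with a tighter estimate. Below is a minimal sketch, assuming Spark SQL is on the classpath; SmallRelation, the rowCount parameter and the fixed 8 bytes per row are hypothetical and stand in for whatever size information a real source has.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical relation with a single long column and a known row count.
class SmallRelation(override val sqlContext: SQLContext, rowCount: Long)
    extends BaseRelation {

  override def schema: StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  // Assumption: one long value (8 bytes) per row is a good enough estimate.
  private val bytesPerRow: Long = 8L

  // Replaces the default (spark.sql.defaultSizeInBytes) with a concrete estimate.
  override def sizeInBytes: Long = rowCount * bytesPerRow
}
```

Since the estimate feeds query planning, a realistic value is generally more useful to the planner than the effectively infinite default.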
