BaseRelation — Collection of Tuples with Schema

BaseRelation is the contract of relations (aka collections of tuples) with a known schema.

Note
"Data source", "relation" and "table" are often used as synonyms.
package org.apache.spark.sql.sources

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}
Table 1. (Subset of) BaseRelation Contract
Method Description

schema

StructType that describes the schema of tuples

sqlContext

SQLContext

BaseRelation is "created" when DataSource is requested to resolve a relation.

BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.

BaseRelation uses needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.

Note
It is recommended that custom data sources (outside Spark SQL) should leave needConversion flag enabled, i.e. true.

BaseRelation can optionally give an estimated size (in bytes).

Table 2. BaseRelations
BaseRelation Description

ConsoleRelation

Used in Spark Structured Streaming

HadoopFsRelation

JDBCRelation

KafkaRelation

Datasets with records from Apache Kafka

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method

needConversion: Boolean

needConversion flag is enabled (true) by default.

Note
It is recommended to leave needConversion enabled for data sources outside Spark SQL.
Note
needConversion is used exclusively when DataSourceStrategy execution planning strategy is executed (and does the RDD conversion from RDD[Row] to RDD[InternalRow]).

Finding Unhandled Filter Predicates — unhandledFilters Method

unhandledFilters(filters: Array[Filter]): Array[Filter]

unhandledFilters returns Filter predicates that the data source does not support (handle) natively.

Note
unhandledFilters returns the input filters by default as it is considered safe to double evaluate filters regardless whether they could be supported or not.
Note
unhandledFilters is used exclusively when DataSourceStrategy execution planning strategy is requested to selectFilters.

Estimated Size Of Relation (In Bytes) — sizeInBytes Method

sizeInBytes: Long

sizeInBytes is the estimated size of a relation (used in query planning).

Note
sizeInBytes defaults to spark.sql.defaultSizeInBytes internal property (i.e. infinite).
Note
sizeInBytes is used exclusively when LogicalRelation is requested to computeStats (and they are not available in CatalogTable).

results matching ""

    No results matching ""