BaseRelation — Collection of Tuples with Schema

BaseRelation is the contract of relations (aka collections of tuples) with a known schema.

Note	"Data source", "relation" and "table" are often used as synonyms.

package org.apache.spark.sql.sources

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}

Table 1. (Subset of) BaseRelation Contract
Method	Description
`schema`	StructType that describes the schema of tuples
`sqlContext`	SQLContext

BaseRelation is "created" when DataSource is requested to resolve a relation.

BaseRelation is transformed into a DataFrame when SparkSession is requested to create a DataFrame.

BaseRelation uses needConversion flag to control type conversion of objects inside Rows to Catalyst types, e.g. java.lang.String to UTF8String.

Note	It is recommended that custom data sources (outside Spark SQL) should leave needConversion flag enabled, i.e. `true`.

BaseRelation can optionally give an estimated size (in bytes).

Table 2. BaseRelations
BaseRelation	Description
`ConsoleRelation`	Used in Spark Structured Streaming
HadoopFsRelation
JDBCRelation
KafkaRelation	Datasets with records from Apache Kafka

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — `needConversion` Method

needConversion: Boolean

needConversion flag is enabled (true) by default.

Note	It is recommended to leave `needConversion` enabled for data sources outside Spark SQL.

Note	`needConversion` is used exclusively when `DataSourceStrategy` execution planning strategy is executed (and does the RDD conversion from `RDD[Row]` to `RDD[InternalRow]`).

Finding Unhandled Filter Predicates — `unhandledFilters` Method

unhandledFilters(filters: Array[Filter]): Array[Filter]

unhandledFilters returns Filter predicates that the data source does not support (handle) natively.

Note	`unhandledFilters` returns the input `filters` by default as it is considered safe to double evaluate filters regardless whether they could be supported or not.

Note	`unhandledFilters` is used exclusively when `DataSourceStrategy` execution planning strategy is requested to selectFilters.

Estimated Size Of Relation (In Bytes) — `sizeInBytes` Method

sizeInBytes: Long

sizeInBytes is the estimated size of a relation (used in query planning).

Note	`sizeInBytes` defaults to spark.sql.defaultSizeInBytes internal property (i.e. infinite).

Note	`sizeInBytes` is used exclusively when `LogicalRelation` is requested to computeStats (and they are not available in CatalogTable).

BaseRelation

BaseRelation — Collection of Tuples with Schema

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — `needConversion` Method

Finding Unhandled Filter Predicates — `unhandledFilters` Method

Estimated Size Of Relation (In Bytes) — `sizeInBytes` Method

results matching ""

No results matching ""

BaseRelation — Collection of Tuples with Schema

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — needConversion Method

Finding Unhandled Filter Predicates — unhandledFilters Method

Estimated Size Of Relation (In Bytes) — sizeInBytes Method

results matching ""

No results matching ""

Should JVM Objects Inside Rows Be Converted to Catalyst Types? — `needConversion` Method

Finding Unhandled Filter Predicates — `unhandledFilters` Method

Estimated Size Of Relation (In Bytes) — `sizeInBytes` Method