DataSourceReader Contract

DataSourceReader is the abstraction of data source readers in Data Source API V2 that can plan InputPartitions and know the schema for reading.

DataSourceReader is created to scan data from a data source, e.g. when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport.

DataSourceReader is used to create the StreamingDataSourceV2Relation leaf logical operator and the DataSourceV2ScanExec physical operator.
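A minimal sketch of how a DataSourceReader comes into existence, assuming a hypothetical NoopDataSource class (the class name, package and output are made up for illustration) that implements ReadSupport:

```scala
import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition}
import org.apache.spark.sql.types.StructType

// Hypothetical data source: ReadSupport.createReader hands Spark the DataSourceReader
// that the logical and physical scan operators then use
class NoopDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      // Schema Spark uses for the loaded DataFrame
      override def readSchema(): StructType = new StructType().add("id", "long")
      // No data to scan in this sketch
      override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
        Collections.emptyList()
    }
}
```

With such a source on the classpath, loading it through DataFrameReader is what triggers createReader (and, in turn, readSchema):

```scala
// format takes the fully-qualified class name of the (hypothetical) source
val df = spark.read.format("com.example.NoopDataSource").load()
df.printSchema()
```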

Note
It appears that all concrete data source readers are used in Spark Structured Streaming only.
Table 1. DataSourceReader Contract
Method Description

planInputPartitions

List<InputPartition<InternalRow>> planInputPartitions()

Used exclusively when DataSourceV2ScanExec leaf physical operator is requested for the input partitions (and simply delegates to the underlying DataSourceReader) to create the input RDD[InternalRow] (inputRDD)

readSchema

StructType readSchema()

Schema for reading (loading) data from a data source (see the sketch after this table)

Used when:

  • DataSourceV2Relation factory object is requested to create a DataSourceV2Relation (when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport)

  • DataSourceV2Strategy execution planning strategy is requested to apply column pruning optimization

  • Spark Structured Streaming’s MicroBatchExecution stream execution is requested to run a single streaming batch

  • Spark Structured Streaming’s ContinuousExecution stream execution is requested to run a streaming query in continuous mode

  • Spark Structured Streaming’s DataStreamReader is requested to "load" data (as a DataFrame)
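To make the two methods of the contract concrete, the following is a hedged sketch of a batch DataSourceReader (the RangeDataSourceReader and RangeInputPartition names are invented for this example) that reports a one-column schema and plans two InputPartitions over a range of numbers:

```scala
import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical reader that serves the numbers [0, end) split into two partitions
class RangeDataSourceReader(end: Long) extends DataSourceReader {

  // readSchema: the schema for reading (loading) data from this data source
  override def readSchema(): StructType =
    StructType(StructField("id", LongType, nullable = false) :: Nil)

  // planInputPartitions: one InputPartition per half of the range
  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
    val mid = end / 2
    List[InputPartition[InternalRow]](
      new RangeInputPartition(0, mid),
      new RangeInputPartition(mid, end)).asJava
  }
}

// Each InputPartition is sent to executors, where it creates the actual row reader
class RangeInputPartition(start: Long, end: Long) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
}
```

With a reader like this, DataSourceV2ScanExec would request the two InputPartitions above and use them to build a two-partition input RDD[InternalRow].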

Note

DataSourceReader is an Evolving contract: it is evolving towards becoming a stable API, but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 2. DataSourceReaders (Direct Implementations and Extensions Only)
DataSourceReader Description

ContinuousReader

DataSourceReaders for Continuous Stream Processing in Spark Structured Streaming

Consult The Internals of Spark Structured Streaming

MicroBatchReader

DataSourceReaders for Micro-Batch Stream Processing in Spark Structured Streaming

Consult The Internals of Spark Structured Streaming

SupportsPushDownFilters

DataSourceReaders that can push down filters to the data source and reduce the size of the data to be read

SupportsPushDownRequiredColumns

DataSourceReaders that can push down required columns to the data source and read only those columns during a scan, reducing the size of the data to be read (see the sketch after this table)

SupportsReportPartitioning

DataSourceReaders that can report data partitioning, so Spark can try to avoid a shuffle

SupportsReportStatistics

DataSourceReaders that can report statistics to Spark

SupportsScanColumnarBatch

DataSourceReaders that can output ColumnarBatch and make the scan faster
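For illustration, a reader can mix in the optional interfaces above. The sketch below (with an invented PrunedFilteredReader name) records the columns and filters Spark pushes down through SupportsPushDownRequiredColumns and SupportsPushDownFilters, and reports the pruned schema back via readSchema:

```scala
import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Hypothetical reader that accepts column pruning and filter pushdown
class PrunedFilteredReader(fullSchema: StructType)
    extends DataSourceReader
    with SupportsPushDownRequiredColumns
    with SupportsPushDownFilters {

  private var requiredSchema: StructType = fullSchema
  private var pushed: Array[Filter] = Array.empty

  // SupportsPushDownRequiredColumns: Spark tells the reader which columns it actually needs
  override def pruneColumns(requiredSchema: StructType): Unit =
    this.requiredSchema = requiredSchema

  // SupportsPushDownFilters: record the filters and (for this sketch) claim they are all
  // handled by the source, so none are returned for Spark to re-evaluate after the scan
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    Array.empty
  }
  override def pushedFilters(): Array[Filter] = pushed

  // readSchema reflects the pruned columns
  override def readSchema(): StructType = requiredSchema

  // No actual data in this sketch
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList()
}
```

A real source would return from pushFilters only the filters it cannot evaluate itself, so Spark re-applies those after the scan.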
