DataSourceReader Contract

DataSourceReader is the abstraction of data source readers in Data Source API V2 that can plan InputPartitions and know the schema for reading.

DataSourceReader is created to scan data from a data source, e.g. when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport.

DataSourceReader is used to create the StreamingDataSourceV2Relation leaf logical operator and the DataSourceV2ScanExec physical operator.
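A minimal sketch of how a DataSourceReader comes into existence, assuming a hypothetical NoopDataSource class (the class name, package and output are made up for illustration) that implements ReadSupport:

```scala
import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition}
import org.apache.spark.sql.types.StructType

// Hypothetical data source: ReadSupport.createReader hands Spark the DataSourceReader
// that the logical and physical scan operators then use
class NoopDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      // Schema Spark uses for the loaded DataFrame
      override def readSchema(): StructType = new StructType().add("id", "long")
      // No data to scan in this sketch
      override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
        Collections.emptyList()
    }
}
```

With such a source on the classpath, loading it through DataFrameReader is what triggers createReader (and, in turn, readSchema):

```scala
// format takes the fully-qualified class name of the (hypothetical) source
val df = spark.read.format("com.example.NoopDataSource").load()
df.printSchema()
```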

Note
It appears that all concrete data source readers are used in Spark Structured Streaming only.
Table 1. DataSourceReader Contract
Method Description

planInputPartitions

List<InputPartition<InternalRow>> planInputPartitions()

Used exclusively when DataSourceV2ScanExec leaf physical operator is requested for the input partitions (and simply delegates to the underlying DataSourceReader) to create the input RDD[InternalRow] (inputRDD)

readSchema

StructType readSchema()

Schema for reading (loading) data from a data source (see the sketch after this table)

Used when:

  • DataSourceV2Relation factory object is requested to create a DataSourceV2Relation (when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport)

  • DataSourceV2Strategy execution planning strategy is requested to apply column pruning optimization

  • Spark Structured Streaming’s MicroBatchExecution stream execution is requested to run a single streaming batch

  • Spark Structured Streaming’s ContinuousExecution stream execution is requested to run a streaming query in continuous mode

  • Spark Structured Streaming’s DataStreamReader is requested to "load" data (as a DataFrame)
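To make the two methods of the contract concrete, the following is a hedged sketch of a batch DataSourceReader (the RangeDataSourceReader and RangeInputPartition names are invented for this example) that reports a one-column schema and plans two InputPartitions over a range of numbers:

```scala
import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical reader that serves the numbers [0, end) split into two partitions
class RangeDataSourceReader(end: Long) extends DataSourceReader {

  // readSchema: the schema for reading (loading) data from this data source
  override def readSchema(): StructType =
    StructType(StructField("id", LongType, nullable = false) :: Nil)

  // planInputPartitions: one InputPartition per half of the range
  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
    val mid = end / 2
    List[InputPartition[InternalRow]](
      new RangeInputPartition(0, mid),
      new RangeInputPartition(mid, end)).asJava
  }
}

// Each InputPartition is sent to executors, where it creates the actual row reader
class RangeInputPartition(start: Long, end: Long) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
}
```

With a reader like this, DataSourceV2ScanExec would request the two InputPartitions above and use them to build a two-partition input RDD[InternalRow].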

Note

DataSourceReader is an Evolving contract: it is evolving towards becoming a stable API, but is not stable yet and can change from one feature release to another.

In other words, using the contract is like treading on thin ice.

Table 2. DataSourceReaders (Direct Implementations and Extensions Only)
DataSourceReader Description

ContinuousReader

DataSourceReaders for Continuous Stream Processing in Spark Structured Streaming

Consult The Internals of Spark Structured Streaming

MicroBatchReader

DataSourceReaders for Micro-Batch Stream Processing in Spark Structured Streaming

Consult The Internals of Spark Structured Streaming

SupportsPushDownFilters

DataSourceReaders that can push down filters to the data source and reduce the size of the data to be read

SupportsPushDownRequiredColumns

DataSourceReaders that can push down required columns to the data source and read only those columns during a scan, reducing the size of the data to be read (see the sketch after this table)

SupportsReportPartitioning

DataSourceReaders that can report data partitioning, so Spark can try to avoid a shuffle

SupportsReportStatistics

DataSourceReaders that can report statistics to Spark

SupportsScanColumnarBatch

DataSourceReaders that can output ColumnarBatch and make the scan faster
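For illustration, a reader can mix in the optional interfaces above. The sketch below (with an invented PrunedFilteredReader name) records the columns and filters Spark pushes down through SupportsPushDownRequiredColumns and SupportsPushDownFilters, and reports the pruned schema back via readSchema:

```scala
import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Hypothetical reader that accepts column pruning and filter pushdown
class PrunedFilteredReader(fullSchema: StructType)
    extends DataSourceReader
    with SupportsPushDownRequiredColumns
    with SupportsPushDownFilters {

  private var requiredSchema: StructType = fullSchema
  private var pushed: Array[Filter] = Array.empty

  // SupportsPushDownRequiredColumns: Spark tells the reader which columns it actually needs
  override def pruneColumns(requiredSchema: StructType): Unit =
    this.requiredSchema = requiredSchema

  // SupportsPushDownFilters: record the filters and (for this sketch) claim they are all
  // handled by the source, so none are returned for Spark to re-evaluate after the scan
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    Array.empty
  }
  override def pushedFilters(): Array[Filter] = pushed

  // readSchema reflects the pruned columns
  override def readSchema(): StructType = requiredSchema

  // No actual data in this sketch
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList()
}
```

A real source would return from pushFilters only the filters it cannot evaluate itself, so Spark re-applies those after the scan.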
