DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator that represents a DataSourceV2Relation logical operator at execution time.

Note	A DataSourceV2Relation logical operator is created exclusively when `DataFrameReader` is requested to load data (as a DataFrame) from a data source with ReadSupport.

DataSourceV2ScanExec supports ColumnarBatchScan with vectorized batch decoding when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

DataSourceV2ScanExec is also a DataSourceV2StringFormat, i.e….FIXME

DataSourceV2ScanExec is created exclusively when DataSourceV2Strategy execution planning strategy is executed (i.e. applied to a logical plan) and finds a DataSourceV2Relation logical operator.
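The planning step described above can be pictured with a small, self-contained sketch. It does not use Spark's classes: the plan node types below are hypothetical, heavily simplified stand-ins that only borrow the names from the text, and the real strategy may do more work than this model shows. The one convention relied on is that a planning strategy returns a sequence of physical plans for the nodes it recognizes and an empty sequence otherwise.

```scala
object PlanningSketch {
  // Hypothetical, simplified logical plan nodes (NOT Spark's classes).
  sealed trait LogicalPlan
  final case class DataSourceV2Relation(output: Seq[String]) extends LogicalPlan
  final case class OtherRelation(name: String) extends LogicalPlan

  // Hypothetical, simplified physical plan nodes (NOT Spark's classes).
  sealed trait PhysicalPlan
  final case class DataSourceV2ScanExec(output: Seq[String]) extends PhysicalPlan

  // The strategy plans the relation it recognizes and returns Nil for anything else.
  def dataSourceV2Strategy(plan: LogicalPlan): Seq[PhysicalPlan] = plan match {
    case DataSourceV2Relation(output) => Seq(DataSourceV2ScanExec(output))
    case _                            => Nil
  }
}
```

In this model, applying `dataSourceV2Strategy` to a `DataSourceV2Relation` yields exactly one `DataSourceV2ScanExec` with the same output, while any other node is left for other strategies to plan.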

DataSourceV2ScanExec gives the inputRDD as its only input RDD of internal rows (when the WholeStageCodegenExec physical operator is executed).

Executing Physical Operator (Generating RDD[InternalRow]) — `doExecute` Method

doExecute(): RDD[InternalRow]

Note	`doExecute` is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. `RDD[InternalRow]`).

doExecute…FIXME

`supportsBatch` Property

supportsBatch: Boolean

Note	`supportsBatch` is part of ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

Note	enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.
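The rule for `supportsBatch` can be sketched as a single pattern match. The traits below are hypothetical stand-ins defined only for this illustration (they are not Spark's DataSourceReader or SupportsScanColumnarBatch interfaces); the sketch assumes nothing beyond what the text states, namely a flag that is enabled by default and a result that is true only for a columnar-capable reader with that flag enabled.

```scala
object SupportsBatchSketch {
  // Hypothetical stand-ins for the reader interfaces (NOT the real Spark API).
  trait Reader
  trait ColumnarBatchReader extends Reader {
    // Mirrors the "enabled by default" flag described in the text.
    def enableBatchRead: Boolean = true
  }

  // Batch decoding is on only for a columnar-capable reader whose flag is enabled.
  def supportsBatch(reader: Reader): Boolean = reader match {
    case r: ColumnarBatchReader => r.enableBatchRead
    case _                      => false
  }

  // Three illustrative readers: row-only, columnar, and columnar that opted out.
  object RowOnly extends Reader
  object Columnar extends ColumnarBatchReader
  object ColumnarOptedOut extends ColumnarBatchReader {
    override def enableBatchRead: Boolean = false
  }
}
```

Of the three illustrative readers, only `Columnar` makes `supportsBatch` return true; the row-only reader and the reader that overrides the flag both fall back to false.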

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

Output schema (as a collection of AttributeReferences)
DataSourceReader

DataSourceV2ScanExec initializes the internal properties.

Creating Input RDD of Internal Rows — `inputRDD` Internal Property

inputRDD: RDD[InternalRow]

Note	`inputRDD` is a Scala lazy value which is computed once when accessed and cached afterwards.
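The lazy-value behaviour mentioned in the note is plain Scala and can be observed without Spark. In the minimal sketch below, the counter and the string value are hypothetical placeholders standing in for the real, more expensive computation.

```scala
object LazyValSketch {
  // Counts how many times the initializer below actually runs.
  var computations = 0

  // Hypothetical stand-in for an expensive value such as inputRDD.
  lazy val inputRDD: String = {
    computations += 1
    "input-rdd"
  }
}
```

Reading `LazyValSketch.inputRDD` any number of times runs the initializer once; before the first access `computations` is still 0.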

inputRDD branches off per the type of the DataSourceReader:

For a ContinuousReader in Spark Structured Streaming, inputRDD is a ContinuousDataSourceRDD that…FIXME
For a SupportsScanColumnarBatch with the enableBatchRead flag enabled, inputRDD is a DataSourceRDD with the batchPartitions.
For all other types of the DataSourceReader, inputRDD is a DataSourceRDD with the partitions.
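The three branches above can be modelled as an ordered pattern match. This is a sketch under stated assumptions: the reader and RDD types are hypothetical stand-ins declared locally (they reuse names from the text but are not Spark's classes), and the `source` field is only a label recording which partitions a branch would use.

```scala
object InputRddSketch {
  // Hypothetical stand-ins for the reader kinds named in the text (NOT Spark's types).
  sealed trait Reader
  case object Continuous extends Reader
  final case class Columnar(enableBatchRead: Boolean = true) extends Reader
  case object RowBased extends Reader

  // Hypothetical labels for the RDD each branch would create.
  sealed trait InputRdd
  case object ContinuousDataSourceRDD extends InputRdd
  final case class DataSourceRDD(source: String) extends InputRdd

  // Branches are tried in the order listed in the text.
  def inputRDD(reader: Reader): InputRdd = reader match {
    case Continuous     => ContinuousDataSourceRDD
    case Columnar(true) => DataSourceRDD("batchPartitions")
    case _              => DataSourceRDD("partitions")
  }

  // The operator exposes that single RDD as its only input RDD.
  def inputRDDs(reader: Reader): Seq[InputRdd] = Seq(inputRDD(reader))
}
```

Note that a columnar-capable reader with the flag turned off falls through to the last branch, so it is served from the row-based `partitions` like any other reader.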

Note	`inputRDD` is used when `DataSourceV2ScanExec` physical operator is requested for the input RDDs and to execute.

Internal Properties

Name	Description
`batchPartitions`	Input partitions of ColumnarBatches (`Seq[InputPartition[ColumnarBatch]]`)
`partitions`	Input partitions of InternalRows (`Seq[InputPartition[InternalRow]]`)
