DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator to represent DataSourceV2Relation logical operators at execution time.

A DataSourceV2Relation logical operator is created when…​FIXME

DataSourceV2ScanExec is a ColumnarBatchScan that supports vectorized batch decoding only when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

DataSourceV2ScanExec is also a DataSourceReaderHolder.

DataSourceV2ScanExec is created exclusively when the DataSourceV2Strategy execution planning strategy is executed and finds a DataSourceV2Relation logical operator in a logical query plan.

DataSourceV2ScanExec gives the inputRDD as the one and only input RDD of internal rows (when the WholeStageCodegenExec physical operator is executed).

Table 1. DataSourceV2ScanExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description


Collection of DataReaderFactory objects of UnsafeRows

Used when…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).


supportsBatch Property

supportsBatch: Boolean
supportsBatch is part of ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (i.e. true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.
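
The rule above can be illustrated with a minimal sketch. It does not use the real Spark classes: the DataSourceReader and SupportsScanColumnarBatch traits below are simplified stand-ins defined only for this example, and the supportsBatch helper and the three sample readers are hypothetical names. The sketch assumes only what the text states, i.e. that the flag is on by default and that the reader has to be a SupportsScanColumnarBatch.

```scala
// Simplified stand-ins for the Data Source API V2 reader types.
// These are NOT the real Spark classes, only models for this sketch.
trait DataSourceReader

trait SupportsScanColumnarBatch extends DataSourceReader {
  // Enabled by default, as described above; a reader may override it.
  def enableBatchRead(): Boolean = true
}

object SupportsBatchSketch {
  // Hypothetical helper that mirrors the rule described for supportsBatch:
  // true only for a SupportsScanColumnarBatch with enableBatchRead enabled.
  def supportsBatch(reader: DataSourceReader): Boolean = reader match {
    case r: SupportsScanColumnarBatch if r.enableBatchRead() => true
    case _ => false
  }

  // A row-based reader (no columnar batch support at all).
  val rowReader: DataSourceReader = new DataSourceReader {}

  // A columnar reader that keeps the default flag.
  val batchReader: DataSourceReader = new SupportsScanColumnarBatch {}

  // A columnar reader that explicitly turns batch reads off.
  val batchDisabledReader: DataSourceReader = new SupportsScanColumnarBatch {
    override def enableBatchRead(): Boolean = false
  }
}
```

Of the three sample readers, only batchReader satisfies both conditions; the other two fall through to the default case.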

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

DataSourceV2ScanExec initializes the internal registries and counters.

Creating Input RDD of Internal Rows — inputRDD Internal Property

inputRDD: RDD[InternalRow]
inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.
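
The lazy-value behaviour can be sketched in plain Scala. The ScanSketch class below is a hypothetical stand-in, not DataSourceV2ScanExec itself: Seq[String] takes the place of RDD[InternalRow], buildCount exists only to make the caching observable, and the bodies of inputRDDs and doExecute are deliberately reduced to returning the cached value.

```scala
object LazyInputSketch {
  // Hypothetical operator; Seq[String] stands in for RDD[InternalRow].
  class ScanSketch {
    // Counts how many times the stand-in input RDD is actually built.
    var buildCount: Int = 0

    // Computed on first access only, then cached for later accesses.
    lazy val inputRDD: Seq[String] = {
      buildCount += 1
      Seq("row-1", "row-2")
    }

    // Simplified entry points: both reuse the same cached value.
    def inputRDDs(): Seq[Seq[String]] = Seq(inputRDD)
    def doExecute(): Seq[String] = inputRDD
  }
}
```

Creating a ScanSketch builds nothing; the first call to either inputRDDs or doExecute triggers the computation, and every later call gets the same instance back.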


inputRDD is used when the DataSourceV2ScanExec physical operator is requested for the input RDDs and when it is executed (doExecute).

