DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator that represents a DataSourceV2Relation logical operator at execution time.

A DataSourceV2Relation logical operator is created exclusively when DataFrameReader is requested to "load" data (as a DataFrame) (from a data source with ReadSupport).

DataSourceV2ScanExec supports ColumnarBatchScan with vectorized batch decoding (when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled).

DataSourceV2ScanExec is also a DataSourceV2StringFormat, i.e…​.FIXME

DataSourceV2ScanExec is created exclusively when DataSourceV2Strategy execution planning strategy is executed (i.e. applied to a logical plan) and finds a DataSourceV2Relation logical operator.

DataSourceV2ScanExec gives the single input RDD as the only input RDD of internal rows (when WholeStageCodegenExec physical operator is executed).

Table 1. DataSourceV2ScanExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description


Input partitions (Seq[InputPartition[InternalRow]])

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).


supportsBatch Property

supportsBatch: Boolean
supportsBatch is part of ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

DataSourceV2ScanExec initializes the internal registries and counters.

Creating Input RDD of Internal Rows — inputRDD Internal Property

inputRDD: RDD[InternalRow]
inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.

inputRDD branches off per the type of the DataSourceReader:

  1. For a ContinuousReader in Spark Structured Streaming, inputRDD is a ContinuousDataSourceRDD that…​FIXME

  2. For a SupportsScanColumnarBatch with the enableBatchRead flag enabled, inputRDD is a DataSourceRDD with the batchPartitions

  3. For all other types of the DataSourceReader, inputRDD is a DataSourceRDD with the partitions.

inputRDD is used when DataSourceV2ScanExec physical operator is requested for the input RDDs and to execute.

results matching ""

    No results matching ""