DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator that represents a DataSourceV2Relation logical operator at execution time.

Note
A DataSourceV2Relation logical operator is created exclusively when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport.
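For example (a minimal sketch; the "myV2Source" format name is a hypothetical DataSourceV2 with ReadSupport):

```scala
// Loading from a hypothetical DataSourceV2 ("myV2Source") with ReadSupport
val df = spark.read
  .format("myV2Source") // resolves to a DataSourceV2 with ReadSupport (assumed name)
  .load()               // creates a DataSourceV2Relation logical operator

// At execution planning time, DataSourceV2Strategy turns the relation
// into a DataSourceV2ScanExec leaf physical operator
df.explain()
```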

DataSourceV2ScanExec supports ColumnarBatchScan with vectorized batch decoding (when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled).

DataSourceV2ScanExec is also a DataSourceV2StringFormat…FIXME

DataSourceV2ScanExec is created exclusively when DataSourceV2Strategy execution planning strategy is executed (i.e. applied to a logical plan) and finds a DataSourceV2Relation logical operator.

DataSourceV2ScanExec gives the single inputRDD as the only input RDD of internal rows (when WholeStageCodegenExec physical operator is executed).
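A sketch of that part of the contract (modeled on the Spark 2.4 sources):

```scala
// inputRDDs is part of the CodegenSupport contract (sketch per Spark 2.4 sources)
override def inputRDDs(): Seq[RDD[InternalRow]] = {
  Seq(inputRDD) // the single, lazily-computed inputRDD
}
```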

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]
Note
doExecute is part of SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute…FIXME
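In brief, a simplified sketch of the branching (modeled on the Spark 2.4 sources): with supportsBatch enabled, doExecute wraps the operator in WholeStageCodegenExec and executes it; otherwise it maps over the inputRDD, counting output rows.

```scala
// Simplified sketch of doExecute (per Spark 2.4 sources)
override protected def doExecute(): RDD[InternalRow] = {
  if (supportsBatch) {
    // Vectorized decoding: let whole-stage codegen drive the columnar scan
    WholeStageCodegenExec(this)(codegenStageId = 0).execute()
  } else {
    val numOutputRows = longMetric("numOutputRows")
    inputRDD.map { r =>
      numOutputRows += 1 // update the "number of output rows" metric
      r
    }
  }
}
```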

supportsBatch Property

supportsBatch: Boolean
Note
supportsBatch is part of ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.

supportsBatch is enabled (true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.

Note
enableBatchRead flag is enabled by default.

supportsBatch is disabled (i.e. false) otherwise.
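A sketch of the property (modeled on the Spark 2.4 sources):

```scala
// Sketch of supportsBatch (per Spark 2.4 sources)
override def supportsBatch: Boolean = reader match {
  case r: SupportsScanColumnarBatch if r.enableBatchRead() => true
  case _ => false
}
```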

Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:

  1. Output schema (as a collection of AttributeReference expressions)

  2. DataSourceV2

  3. Options (as Map[String, String])

  4. Pushed filters (as a collection of Expression)

  5. DataSourceReader

DataSourceV2ScanExec initializes the internal properties.
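Put together, a sketch of the declaration (parameter names and extended traits follow the Spark 2.4 sources; treat them as an assumption for other versions):

```scala
// Sketch of the case class declaration (per Spark 2.4 sources)
case class DataSourceV2ScanExec(
    output: Seq[AttributeReference],           // output schema
    @transient source: DataSourceV2,           // the data source
    @transient options: Map[String, String],   // data source options
    @transient pushedFilters: Seq[Expression], // filters pushed down to the reader
    @transient reader: DataSourceReader)       // the underlying DataSourceReader
  extends LeafExecNode with DataSourceV2StringFormat with ColumnarBatchScan
```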

Creating Input RDD of Internal Rows — inputRDD Internal Property

inputRDD: RDD[InternalRow]
Note
inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.

inputRDD branches off per the type of the DataSourceReader:

  1. For a ContinuousReader in Spark Structured Streaming, inputRDD is a ContinuousDataSourceRDD that…FIXME

  2. For a SupportsScanColumnarBatch with the enableBatchRead flag enabled, inputRDD is a DataSourceRDD with the batchPartitions.

  3. For all other types of the DataSourceReader, inputRDD is a DataSourceRDD with the partitions.

Note
inputRDD is used when DataSourceV2ScanExec physical operator is requested for the input RDDs and to execute.
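A condensed sketch of the branching above (modeled on the Spark 2.4 sources; the ContinuousReader branch is left elided, matching the FIXME above):

```scala
// Simplified sketch of inputRDD (per Spark 2.4 sources)
private lazy val inputRDD: RDD[InternalRow] = reader match {
  case _: ContinuousReader =>
    // Continuous processing (Spark Structured Streaming); details elided
    ???
  case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    // Vectorized path: input partitions of ColumnarBatches
    new DataSourceRDD(sparkContext, batchPartitions).asInstanceOf[RDD[InternalRow]]
  case _ =>
    // Row-based path: input partitions of InternalRows
    new DataSourceRDD(sparkContext, partitions)
}
```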

Internal Properties

Name             Description

batchPartitions  Input partitions of ColumnarBatches (Seq[InputPartition[ColumnarBatch]])

partitions       Input partitions of InternalRows (Seq[InputPartition[InternalRow]])
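A sketch of how the two properties could be computed (the planInputPartitions and planBatchInputPartitions calls follow the Spark 2.4 DataSourceReader and SupportsScanColumnarBatch APIs):

```scala
import scala.collection.JavaConverters._

// Sketch of the two partition properties (per Spark 2.4 sources)
private lazy val partitions: Seq[InputPartition[InternalRow]] =
  reader.planInputPartitions().asScala

private lazy val batchPartitions: Seq[InputPartition[ColumnarBatch]] =
  reader.asInstanceOf[SupportsScanColumnarBatch]
    .planBatchInputPartitions().asScala
```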
