DataSourceV2ScanExec Leaf Physical Operator

DataSourceV2ScanExec is a leaf physical operator that represents a DataSourceV2Relation logical operator at execution time.
Note: A DataSourceV2Relation logical operator is created exclusively when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport.
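For illustration, here is a minimal, self-contained sketch of a batch DataSourceV2 with ReadSupport, written against the Spark 2.4.x sources.v2 API. Every name introduced below (SimpleSource, SimplePartition, org.example) is hypothetical, not part of Spark itself:

import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types._

// Hypothetical one-column data source that serves a single partition of rows 0..4
class SimpleSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      override def readSchema(): StructType = new StructType().add("id", LongType)
      override def planInputPartitions(): JList[InputPartition[InternalRow]] =
        Seq[InputPartition[InternalRow]](new SimplePartition).asJava
    }
}

class SimplePartition extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var current = -1L
      override def next(): Boolean = { current += 1; current < 5 }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
}

Loading such a source plans a DataSourceV2Relation that DataSourceV2Strategy later turns into this physical operator:

// hypothetical spark-shell session (Spark 2.4.x); use the fully-qualified class name
val df = spark.read.format("org.example.SimpleSource").load()
df.queryExecution.executedPlan  // the leaf physical operator is a DataSourceV2ScanExec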
DataSourceV2ScanExec supports ColumnarBatchScan with vectorized batch decoding (when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled).
DataSourceV2ScanExec is also a DataSourceV2StringFormat, i.e. …FIXME
DataSourceV2ScanExec is created exclusively when the DataSourceV2Strategy execution planning strategy is executed (i.e. applied to a logical plan) and finds a DataSourceV2Relation logical operator.
DataSourceV2ScanExec gives its single inputRDD as the only input RDD of internal rows (when the WholeStageCodegenExec physical operator is executed).
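That is a one-line override of the CodegenSupport contract. A minimal sketch, assuming the Spark 2.4.x sources:

override def inputRDDs(): Seq[RDD[InternalRow]] = Seq(inputRDD)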
Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]
Note: doExecute is part of the SparkPlan Contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).
doExecute…FIXME
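As a sketch of the control flow, assuming the Spark 2.4.x sources: when supportsBatch is enabled, doExecute hands execution over to a WholeStageCodegenExec; otherwise it maps over the inputRDD and counts output rows:

override protected def doExecute(): RDD[InternalRow] = {
  if (supportsBatch) {
    // vectorized decoding: let whole-stage codegen drive the scan
    WholeStageCodegenExec(this)(codegenStageId = 0).execute()
  } else {
    // row-based scan: pass the rows through, bumping the numOutputRows metric
    val numOutputRows = longMetric("numOutputRows")
    inputRDD.map { r =>
      numOutputRows += 1
      r
    }
  }
}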
supportsBatch Property

supportsBatch: Boolean
Note: supportsBatch is part of the ColumnarBatchScan Contract to control whether the physical operator supports vectorized decoding or not.
supportsBatch is enabled (true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.
Note: The enableBatchRead flag is enabled by default.
supportsBatch is disabled (i.e. false) otherwise.
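Put together, a minimal sketch of supportsBatch (assuming the Spark 2.4.x sources) is a single pattern match on the DataSourceReader:

override def supportsBatch: Boolean = reader match {
  case r: SupportsScanColumnarBatch if r.enableBatchRead() => true
  case _ => false
}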
Creating DataSourceV2ScanExec Instance

DataSourceV2ScanExec takes the following when created:
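As a sketch of those arguments, assuming the Spark 2.4.x sources (the mixed-in traits match the description above):

case class DataSourceV2ScanExec(
    output: Seq[AttributeReference],           // output schema
    @transient source: DataSourceV2,           // the data source
    @transient options: Map[String, String],   // options given to the source
    @transient pushedFilters: Seq[Expression], // filter predicates pushed down to the source
    @transient reader: DataSourceReader)       // the DataSourceReader
  extends LeafExecNode with DataSourceV2StringFormat with ColumnarBatchScan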
DataSourceV2ScanExec initializes the internal properties.
Creating Input RDD of Internal Rows — inputRDD Internal Property

inputRDD: RDD[InternalRow]
Note: inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.
inputRDD branches off per the type of the DataSourceReader (see the sketch after this list):

- For a ContinuousReader in Spark Structured Streaming, inputRDD is a ContinuousDataSourceRDD that…FIXME

- For a SupportsScanColumnarBatch with the enableBatchRead flag enabled, inputRDD is a DataSourceRDD with the batchPartitions

- For all other types of the DataSourceReader, inputRDD is a DataSourceRDD with the partitions.
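A sketch of that branching, assuming the Spark 2.4.x sources (the ContinuousDataSourceRDD constructor arguments are elided here):

private lazy val inputRDD: RDD[InternalRow] = reader match {
  case _: ContinuousReader =>
    // continuous processing in Spark Structured Streaming…FIXME
    new ContinuousDataSourceRDD(/* sparkContext and further arguments elided */)
  case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    // vectorized scan over ColumnarBatch partitions
    new DataSourceRDD(sparkContext, batchPartitions).asInstanceOf[RDD[InternalRow]]
  case _ =>
    // row-based scan over InternalRow partitions
    new DataSourceRDD(sparkContext, partitions).asInstanceOf[RDD[InternalRow]]
}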
Note: inputRDD is used when the DataSourceV2ScanExec physical operator is requested for the input RDDs and to execute.
Internal Properties

Name | Description
---|---
batchPartitions | Input partitions of ColumnarBatches (InputPartition[ColumnarBatch])
partitions | Input partitions of InternalRows (InputPartition[InternalRow])
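As a sketch of how these two properties are computed, assuming the Spark 2.4.x sources (where the DataSourceReader plans input partitions as Java lists):

import scala.collection.JavaConverters._

// all input partitions of internal rows, as planned by the reader
private lazy val partitions: Seq[InputPartition[InternalRow]] =
  reader.planInputPartitions().asScala

// batch partitions, only defined for readers with vectorized decoding enabled
private lazy val batchPartitions: Seq[InputPartition[ColumnarBatch]] = reader match {
  case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    r.planBatchInputPartitions().asScala
}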