DataSourceV2ScanExec Leaf Physical Operator
DataSourceV2ScanExec is a leaf physical operator that represents a DataSourceV2Relation logical operator at execution time.
Note: A DataSourceV2Relation logical operator is created exclusively when DataFrameReader is requested to "load" data (as a DataFrame) from a data source with ReadSupport.
DataSourceV2ScanExec supports ColumnarBatchScan with vectorized batch decoding (when created for a DataSourceReader that supports it, i.e. the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled).
DataSourceV2ScanExec is also a DataSourceV2StringFormat, i.e. …FIXME
DataSourceV2ScanExec is created exclusively when DataSourceV2Strategy execution planning strategy is executed (i.e. applied to a logical plan) and finds a DataSourceV2Relation logical operator.
When the WholeStageCodegenExec physical operator is executed, DataSourceV2ScanExec gives the single inputRDD as the only input RDD of internal rows.
Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method
doExecute(): RDD[InternalRow]
Note: doExecute is part of the SparkPlan contract to generate the runtime representation of a structured query as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).
doExecute…FIXME
supportsBatch Property
supportsBatch: Boolean
Note: supportsBatch is part of the ColumnarBatchScan contract to control whether the physical operator supports vectorized decoding or not.
supportsBatch is enabled (true) only when the DataSourceReader is a SupportsScanColumnarBatch with the enableBatchRead flag enabled.
Note: The enableBatchRead flag is enabled by default.
supportsBatch is disabled (i.e. false) otherwise.
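The supportsBatch decision can be sketched in plain Scala. This is a simplified model, not Spark's actual code: the two traits below are hypothetical stand-ins for Spark's DataSourceReader and SupportsScanColumnarBatch interfaces.

```scala
// Stand-in for Spark's DataSourceReader (hypothetical, for illustration only)
trait DataSourceReader

// Stand-in for Spark's SupportsScanColumnarBatch mix-in;
// enableBatchRead mirrors Spark's default of being enabled
trait SupportsScanColumnarBatch extends DataSourceReader {
  def enableBatchRead: Boolean = true
}

// supportsBatch is true only for a SupportsScanColumnarBatch reader
// with the enableBatchRead flag enabled; false for any other reader
def supportsBatch(reader: DataSourceReader): Boolean = reader match {
  case r: SupportsScanColumnarBatch => r.enableBatchRead
  case _                            => false
}
```

A row-based reader (any plain DataSourceReader) therefore never gets vectorized decoding, regardless of other capabilities.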
Creating DataSourceV2ScanExec Instance
DataSourceV2ScanExec takes the following when created:
DataSourceV2ScanExec initializes the internal properties.
Creating Input RDD of Internal Rows — inputRDD Internal Property
inputRDD: RDD[InternalRow]
Note: inputRDD is a Scala lazy value which is computed once when accessed and cached afterwards.
inputRDD branches off per the type of the DataSourceReader:

- For a ContinuousReader in Spark Structured Streaming, inputRDD is a ContinuousDataSourceRDD that…FIXME

- For a SupportsScanColumnarBatch with the enableBatchRead flag enabled, inputRDD is a DataSourceRDD with the batchPartitions

- For all other types of the DataSourceReader, inputRDD is a DataSourceRDD with the partitions.
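The branching above can be sketched in plain Scala. The traits are hypothetical stand-ins for Spark's reader interfaces, and the function returns a descriptive string rather than an actual RDD; it only illustrates the dispatch order (continuous reader first, then columnar-batch reader, then everything else).

```scala
// Hypothetical stand-ins for Spark's reader interfaces (illustration only)
trait DataSourceReader
trait ContinuousReader extends DataSourceReader
trait SupportsScanColumnarBatch extends DataSourceReader {
  def enableBatchRead: Boolean = true
}

// Which kind of input RDD would be built for a given reader;
// the cases mirror the order of the branches described above
def inputRDDFor(reader: DataSourceReader): String = reader match {
  case _: ContinuousReader                               => "ContinuousDataSourceRDD"
  case r: SupportsScanColumnarBatch if r.enableBatchRead => "DataSourceRDD(batchPartitions)"
  case _                                                 => "DataSourceRDD(partitions)"
}
```

Note that a SupportsScanColumnarBatch reader with enableBatchRead disabled falls through to the last branch and is scanned row by row.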
Note: inputRDD is used when DataSourceV2ScanExec physical operator is requested for the input RDDs and to execute.
Internal Properties
| Name | Description |
|---|---|
| batchPartitions | Input partitions of ColumnarBatches |
| partitions | Input partitions of InternalRows |