ContinuousDataSourceRDD — Input RDD of DataSourceV2ScanExec Physical Operator with ContinuousReader

ContinuousDataSourceRDD is a specialized RDD (RDD[InternalRow]) that is used exclusively for the only input RDD (with the input rows) of DataSourceV2ScanExec leaf physical operator with a ContinuousReader.

ContinuousDataSourceRDD is created exclusively when DataSourceV2ScanExec leaf physical operator is requested for the input RDDs (which there is only one actually).

ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorQueueSize configuration property for the size of the data queue.

ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorPollIntervalMs configuration property for the epochPollIntervalMs.

ContinuousDataSourceRDD takes the following to be created:

SparkContext
Size of the data queue
epochPollIntervalMs
InputPartition[InternalRow]s

ContinuousDataSourceRDD uses InputPartition (of a ContinuousDataSourceRDDPartition) for preferred host locations (where the input partition reader can run faster).

Computing Partition — `compute` Method

compute(
  split: Partition,
  context: TaskContext): Iterator[InternalRow]

Note	`compute` is part of the RDD Contract to compute a given partition.

compute…FIXME

`getPartitions` Method

getPartitions: Array[Partition]

Note	`getPartitions` is part of the `RDD` Contract to specify the partitions to compute.

getPartitions…FIXME

ContinuousDataSourceRDD