DataSourceRDD — Input RDD Of DataSourceV2ScanExec Physical Operator

DataSourceRDD acts as a thin adapter between Spark SQL’s Data Source API V2 and Spark Core’s RDD API.

DataSourceRDD is an RDD that is created exclusively when DataSourceV2ScanExec physical operator is requested for the input RDD (when WholeStageCodegenExec physical operator is executed).

DataSourceRDD uses DataSourceRDDPartitions (mere wrappers of the InputPartitions) when requested for the partitions.

DataSourceRDD takes the following to be created:

SparkContext

InputPartitions (Seq[InputPartition[T]])
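A minimal sketch of the class skeleton and the DataSourceRDDPartition wrapper, modeled on the Spark 2.4 sources (the exact modifiers and field names are assumptions):

import scala.reflect.ClassTag

import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.sources.v2.reader.InputPartition

// A Spark Core Partition that is a mere wrapper of a single InputPartition
class DataSourceRDDPartition[T: ClassTag](
    val index: Int,
    val inputPartition: InputPartition[T])
  extends Partition with Serializable

class DataSourceRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val inputPartitions: Seq[InputPartition[T]])
  extends RDD[T](sc, Nil) {
  // compute, getPartitions and getPreferredLocations are sketched in the sections below
}

The method sketches in the following sections are assumed to live inside this class.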

DataSourceRDD as a Spark Core RDD overrides the following methods:

compute

getPartitions

getPreferredLocations

Requesting Preferred Locations (For Partition) — getPreferredLocations Method

getPreferredLocations(split: Partition): Seq[String]
Note
getPreferredLocations is part of Spark Core's RDD Contract to specify the placement preferences (preferred locations) of a partition.

getPreferredLocations simply requests the given split DataSourceRDDPartition for the InputPartition that in turn is requested for the preferred locations.
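A sketch of what that boils down to, assuming the class skeleton above:

override def getPreferredLocations(split: Partition): Seq[String] = {
  // Unwrap the InputPartition and delegate to its preferred locations
  split.asInstanceOf[DataSourceRDDPartition[T]].inputPartition.preferredLocations()
}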

RDD Partitions — getPartitions Method

getPartitions: Array[Partition]
Note
getPartitions is part of Spark Core's RDD Contract to specify the partitions of a distributed computation.

getPartitions simply creates a DataSourceRDDPartition for every InputPartition.
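Under the same assumptions, getPartitions is little more than a positional mapping:

override protected def getPartitions: Array[Partition] =
  inputPartitions.zipWithIndex.map { case (inputPartition, index) =>
    // One DataSourceRDDPartition per InputPartition, indexed by position
    new DataSourceRDDPartition(index, inputPartition)
  }.toArray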

Computing Partition (in TaskContext) — compute Method

compute(split: Partition, context: TaskContext): Iterator[T]
Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute requests the input DataSourceRDDPartition (the split partition) for the InputPartition that in turn is requested to create an InputPartitionReader.

compute registers a Spark Core TaskCompletionListener that requests the InputPartitionReader to close when a task completes.

compute returns a Spark Core InterruptibleIterator that requests the InputPartitionReader to proceed to the next record (when requested to hasNext) and to return the current record (when requested to next).
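Put together, a sketch of compute under the same assumptions (the adapter from InputPartitionReader to Scala's Iterator is simplified):

override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  // Unwrap the InputPartition and create a reader for this partition
  val reader = split.asInstanceOf[DataSourceRDDPartition[T]]
    .inputPartition.createPartitionReader()
  // Close the reader when the task completes
  context.addTaskCompletionListener[Unit](_ => reader.close())
  val iter = new Iterator[T] {
    private var valuePrepared = false
    override def hasNext: Boolean = {
      // Advance the reader lazily, at most once per record handed out
      if (!valuePrepared) valuePrepared = reader.next()
      valuePrepared
    }
    override def next(): T = {
      if (!hasNext) throw new NoSuchElementException("End of stream")
      valuePrepared = false
      reader.get()
    }
  }
  new InterruptibleIterator(context, iter)
}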
