ShuffledRowRDD

ShuffledRowRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]) that is created when:

  • ShuffleExchangeExec physical operator is requested to preparePostShuffleRDD

  • CollectLimitExec and TakeOrderedAndProjectExec physical operators are requested to doExecute

ShuffledRowRDD takes the following to be created:

  • ShuffleDependency[Int, InternalRow, InternalRow]

  • Optional partition start indices (Option[Array[Int]] , default: None)

Note
The dependency property is mutable so it can be cleared.

ShuffledRowRDD takes an optional partition start indices that is the number of post-shuffle partitions. When not specified (when executing CollectLimitExec and TakeOrderedAndProjectExec physical operators), the number of post-shuffle partitions is managed by the Partitioner of the input ShuffleDependency. Otherwise, when specified (when ExchangeCoordinator is requested to doEstimationIfNecessary), ShuffledRowRDD…​FIXME

Note
Post-shuffle partition is…​FIXME
Note
ShuffledRowRDD looks like ShuffledRDD, and the difference is in the type of the values to process, i.e. InternalRow and (K, C) key-value pairs, respectively.
Table 1. ShuffledRowRDD and RDD Contract
Name Description

getDependencies

A single-element collection with ShuffleDependency[Int, InternalRow, InternalRow].

partitioner

CoalescedPartitioner (with the Partitioner of the dependency)

getPreferredLocations

compute

numPreShufflePartitions Property

Caution
FIXME

Computing Partition (in TaskContext) — compute Method

compute(split: Partition, context: TaskContext): Iterator[InternalRow]
Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

Internally, compute makes sure that the input split is a ShuffledRowRDDPartition. It then requests ShuffleManager for a ShuffleReader to read InternalRows for the split.

Note
compute uses ShuffleHandle (of ShuffleDependency dependency) and the pre-shuffle start and end partition offsets.

Getting Placement Preferences of Partition — getPreferredLocations Method

getPreferredLocations(partition: Partition): Seq[String]
Note
getPreferredLocations is part of RDD contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

Internally, getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the input partition (for the single ShuffleDependency).

Note
getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster (which runs on the driver).

CoalescedPartitioner

Caution
FIXME

ShuffledRowRDDPartition

Caution
FIXME

Clearing Dependencies — clearDependencies Method

clearDependencies(): Unit
Note
clearDependencies is part of the RDD Contract to clear dependencies of the RDD (to enable the parent RDDs to be garbage collected).

clearDependencies simply requests the parent RDD to clearDependencies followed by clear the given dependency (i.e. set to null).

results matching ""

    No results matching ""