ShuffledRowRDD
ShuffledRowRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]) that is created when:

- ShuffleExchangeExec physical operator is requested to create one
- CollectLimitExec and TakeOrderedAndProjectExec physical operators are executed

ShuffledRowRDD takes the following to be created:

- ShuffleDependency[Int, InternalRow, InternalRow]
- Optional partition start indices (default: undefined)
> **Note**: The dependency property is mutable so it can be cleared.
ShuffledRowRDD takes an optional array of partition start indices that defines the post-shuffle partitions. When not specified, the number of post-shuffle partitions is managed by the Partitioner of the input ShuffleDependency. Otherwise, when specified (when ExchangeCoordinator is requested to doEstimationIfNecessary), ShuffledRowRDD…FIXME
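To build intuition for where partition start indices come from, here is a hedged sketch (the function name and logic are illustrative, not the actual ExchangeCoordinator code) of packing contiguous pre-shuffle partitions into post-shuffle partitions by a target byte size:

```scala
// Illustrative sketch (not Spark's actual code): pack contiguous pre-shuffle
// partitions into post-shuffle partitions until a target byte size is exceeded.
// The result is an array of the pre-shuffle partition indices at which each
// post-shuffle partition starts.
def estimateStartIndices(bytesByPartition: Array[Long], targetSize: Long): Array[Int] = {
  val indices = scala.collection.mutable.ArrayBuffer(0)
  var runningSize = bytesByPartition(0)
  for (i <- 1 until bytesByPartition.length) {
    if (runningSize + bytesByPartition(i) > targetSize) {
      indices += i                      // start a new post-shuffle partition at i
      runningSize = bytesByPartition(i)
    } else {
      runningSize += bytesByPartition(i)
    }
  }
  indices.toArray
}

// Pre-shuffle partitions of 10, 10, 10, 50 and 10 bytes with a 30-byte target
// coalesce into three post-shuffle partitions starting at indices 0, 3 and 4.
estimateStartIndices(Array(10L, 10L, 10L, 50L, 10L), 30L)
```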
> **Note**: Post-shuffle partition is…FIXME
> **Note**: ShuffledRowRDD is similar to Spark Core’s ShuffledRDD, with the difference being the type of the values to process, i.e. InternalRow and (K, C) key-value pairs, respectively.
| Name | Description |
|---|---|
| dependencies | A single-element collection with the given ShuffleDependency |
| partitioner | CoalescedPartitioner (with the Partitioner of the ShuffleDependency and the partitionStartIndices) |
numPreShufflePartitions Internal Property

numPreShufflePartitions: Int

numPreShufflePartitions is exactly the number of partitions of the Partitioner of the given ShuffleDependency.
> **Note**: numPreShufflePartitions is used when ShuffledRowRDD is requested for the partitionStartIndices (with no optional partition indices given) and partitions.
partitionStartIndices Internal Property
partitionStartIndices: Array[Int]
partitionStartIndices is the given optional partition start indices or, when undefined, an array of numPreShufflePartitions consecutive indices starting from 0 (which means that every post-shuffle partition corresponds to exactly one pre-shuffle partition).
> **Note**: partitionStartIndices is used when ShuffledRowRDD is requested for the partitions and Partitioner.
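The fallback (no partition start indices given) can be sketched as follows; this is a minimal illustration, not the actual implementation:

```scala
// When no partition start indices are specified, every post-shuffle partition
// maps to exactly one pre-shuffle partition.
val numPreShufflePartitions = 5
val specifiedPartitionStartIndices: Option[Array[Int]] = None
val partitionStartIndices: Array[Int] =
  specifiedPartitionStartIndices.getOrElse((0 until numPreShufflePartitions).toArray)
// partitionStartIndices: Array(0, 1, 2, 3, 4)
```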
part Internal Property
part: Partitioner
part is simply a CoalescedPartitioner (for the Partitioner of the given ShuffleDependency and the partitionStartIndices).
> **Note**: part is used when ShuffledRowRDD is requested for the Partitioner and the partitions.
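How a coalesced partitioner maps pre-shuffle partitions to post-shuffle partitions can be sketched with a simplified stand-in for CoalescedPartitioner (the class name and internals below are illustrative, not Spark's actual code):

```scala
// Simplified stand-in for CoalescedPartitioner: maps a parent (pre-shuffle)
// partition index to the post-shuffle partition that contains it.
class SimpleCoalescedPartitioner(parentPartitions: Int, startIndices: Array[Int]) {
  // parentToChild(i) is the post-shuffle partition of pre-shuffle partition i.
  private val parentToChild: Array[Int] = {
    val mapping = new Array[Int](parentPartitions)
    var child = 0
    for (i <- 0 until parentPartitions) {
      while (child < startIndices.length - 1 && i >= startIndices(child + 1)) {
        child += 1
      }
      mapping(i) = child
    }
    mapping
  }

  def numPartitions: Int = startIndices.length
  def getPartition(parentPartition: Int): Int = parentToChild(parentPartition)
}

// Five pre-shuffle partitions coalesced into three post-shuffle partitions
// starting at pre-shuffle indices 0, 2 and 3.
val partitioner = new SimpleCoalescedPartitioner(5, Array(0, 2, 3))
// partitioner.getPartition(4) == 2
```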
Computing Partition (in TaskContext) — compute Method
compute(
split: Partition,
context: TaskContext): Iterator[InternalRow]
> **Note**: compute is part of Spark Core’s RDD contract to compute a partition (in a TaskContext).
Internally, compute makes sure that the input split is a ShuffledRowRDDPartition. compute then requests the ShuffleManager for a ShuffleReader to read InternalRows for the input split partition.
> **Note**: compute uses Spark Core’s SparkEnv to access the current ShuffleManager.
> **Note**: compute uses the ShuffleHandle (of the ShuffleDependency) with the pre-shuffle start and end partition offsets of the ShuffledRowRDDPartition.
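The start and end offsets can be illustrated with a small sketch (names are illustrative): each post-shuffle partition reads a contiguous range of pre-shuffle partitions, delimited by consecutive entries of partitionStartIndices:

```scala
// Illustrative only: the pre-shuffle partition range [start, end) that a
// post-shuffle partition covers.
val numPreShufflePartitions = 6
val partitionStartIndices = Array(0, 2, 5)

def preShuffleRange(postShufflePartition: Int): (Int, Int) = {
  val start = partitionStartIndices(postShufflePartition)
  val end =
    if (postShufflePartition + 1 < partitionStartIndices.length)
      partitionStartIndices(postShufflePartition + 1)
    else
      numPreShufflePartitions
  (start, end)
}

// preShuffleRange(1) == (2, 5): the second post-shuffle partition reads
// pre-shuffle partitions 2, 3 and 4.
```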
Getting Placement Preferences of Partition — getPreferredLocations Method
getPreferredLocations(partition: Partition): Seq[String]
> **Note**: getPreferredLocations is part of the RDD contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.
Internally, getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the single ShuffleDependency and the input partition.
> **Note**: getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster (that runs on the driver).
Clearing Dependencies — clearDependencies Method
clearDependencies(): Unit
> **Note**: clearDependencies is part of the RDD contract to clear dependencies of the RDD (to enable the parent RDDs to be garbage collected).
clearDependencies simply requests the parent RDD to clearDependencies and then clears the given dependency (i.e. sets it to null).
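The mutable-dependency pattern can be sketched as follows (the class and member names below are illustrative only, not the actual ShuffledRowRDD code):

```scala
// Illustrative only: keeping the dependency in a var allows clearDependencies
// to null it out so the parent RDD can be garbage-collected.
class DemoShuffledRDD(private var dependency: AnyRef) {
  def currentDependency: AnyRef = dependency
  def clearDependencies(): Unit = {
    dependency = null // drop the reference to enable GC of the parent
  }
}

val rdd = new DemoShuffledRDD("a shuffle dependency")
rdd.clearDependencies()
// rdd.currentDependency == null
```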