SparkPlan — Contract of Physical Operators in Physical Query Plan of Structured Query

SparkPlan is the contract in Spark SQL for physical operators to build a physical query plan.

SparkPlan is a recursive data structure in Spark SQL’s Catalyst tree manipulation framework and as such represents a single physical operator in a physical execution query plan as well as a physical execution query plan itself (i.e. a tree of physical operators in a query plan of a structured query).

Figure 1. Physical Plan of Structured Query (i.e. Tree of SparkPlans)
Note
A structured query can be expressed using Spark SQL’s high-level strongly-typed Dataset API or good ol' SQL.
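
For example, both of the following describe the same structured query (a spark-shell snippet; the query itself is illustrative only):

// the same structured query in the Dataset API and in SQL
val fromDatasetApi = spark.range(5).filter("id > 2")

spark.range(5).createOrReplaceTempView("nums")
val fromSql = spark.sql("SELECT * FROM nums WHERE id > 2")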

A SparkPlan physical operator is a Catalyst tree node that may have zero or more child physical operators.

Note
A structured query is basically a single SparkPlan physical operator with child physical operators.
Note
Spark SQL uses the Catalyst tree manipulation framework to compose nodes into trees of (logical or physical) operators; in this particular case, it composes SparkPlan physical operator nodes into the physical execution plan tree of a structured query.

When executed, SparkPlan creates an RDD of internal binary rows (i.e. RDD[InternalRow]).

Figure 2. SparkPlan’s Execution (execute Method)
Caution
FIXME Picture between Spark SQL’s Dataset ⇒ Spark Core’s RDD
Note

execute is called when QueryExecution is requested for the RDD that is Spark Core’s physical execution plan (as a RDD lineage). Requesting that RDD triggers physical query planning (but not execution of the plan yet) and could be considered execution of a structured query.

The could part above refers to the fact that the final execution of a structured query happens only when an RDD action is executed on the RDD of a structured query. Hence the need for Spark SQL’s high-level Dataset API in which the Dataset operators simply execute an RDD action on the corresponding RDD. Easy, isn’t it?
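
To see that RDD yourself, request the executed physical plan of a Dataset and execute it explicitly (a quick exploratory snippet for spark-shell; the query itself is arbitrary):

// request the physical plan of an (arbitrary) structured query and execute it
val q = spark.range(5)
val plan = q.queryExecution.executedPlan

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
val rdd: RDD[InternalRow] = plan.execute()

// running an RDD action is what really executes the structured query
assert(rdd.count == 5)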

Tip

Use explain operator to see the execution plan of a structured query.

val q = spark.range(5) // replace with your own query
q.explain

You may also access the execution plan of a Dataset using its queryExecution property.

val q = spark.range(5) // replace with your own query
q.queryExecution.sparkPlan

The SparkPlan contract assumes that concrete physical operators define the doExecute method (with optional hooks like doPrepare), which is executed when the physical operator is executed.
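
A minimal sketch of such an operator is shown below. The operator, its name and its behaviour are hypothetical (they are not part of Spark); the overridden methods are the contract methods described on this page.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.LeafExecNode

// Hypothetical leaf physical operator that produces no rows
case class MyEmptySourceExec(output: Seq[Attribute]) extends LeafExecNode {

  // the only required method of the SparkPlan contract
  override protected def doExecute(): RDD[InternalRow] =
    sparkContext.emptyRDD[InternalRow]

  // optional hook, called (once) as part of prepare, before execution
  override protected def doPrepare(): Unit = {
    // e.g. start an asynchronous computation (as BroadcastExchangeExec does)
  }
}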

Caution
FIXME A picture with methods/hooks called.
Caution
FIXME SparkPlan is Serializable. Why?

SparkPlan has the following final methods that prepare the execution environment and pass calls on to the corresponding methods (that constitute the SparkPlan Contract).

Table 1. SparkPlan’s Final Methods
Name Description

execute

Executes a physical operator (and its children), preparing the query for execution first, and in the end creates an RDD of internal binary rows (i.e. RDD[InternalRow]).

final def execute(): RDD[InternalRow]

Used mostly when QueryExecution is requested for an RDD that represents the final execution plan.

Internally, execute calls the physical operator’s doExecute after preparing the query for execution.

Note
doExecute is executed in a named scope only after the operator has been prepared for execution and any subqueries have finished.

executeQuery

Executes a physical operator in a single RDD scope, i.e. all RDDs created during execution of the physical operator have the same scope.

protected final def executeQuery[T](query: => T): T

executeQuery executes the input query after the following methods (in order):

  • prepare

  • waitForSubqueries

Note

executeQuery is used when:

  • a physical operator is requested to execute or executeBroadcast

  • a CodegenSupport-enabled physical operator produces Java source code

prepare

Prepares a physical operator for execution

final def prepare(): Unit

prepare is used mainly when a physical operator is requested to execute a structured query

prepare is also used recursively for every child physical operator (down the physical plan) and when a physical operator is requested to prepare subqueries.

Note
prepare is idempotent, i.e. can be called multiple times with no change to the final result. It uses prepared internal flag to execute the physical operator once only.

Internally, prepare first requests the child physical operators to prepare, and then calls prepareSubqueries and doPrepare (see the sketch right after this table).

executeBroadcast

Calls doExecuteBroadcast
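
The interplay of prepare, prepareSubqueries, doPrepare and the prepared flag described above can be summarised with the following condensed sketch (illustrative pseudocode, not the exact Spark source):

// Condensed, illustrative sketch of the prepare algorithm described above
// (not the exact Spark source)
abstract class PrepareSketch {
  protected var prepared = false
  def children: Seq[PrepareSketch]
  protected def prepareSubqueries(): Unit
  protected def doPrepare(): Unit

  final def prepare(): Unit = {
    children.foreach(_.prepare())   // prepare child operators first (recursively)
    synchronized {
      if (!prepared) {              // the prepared flag makes prepare idempotent
        prepareSubqueries()
        doPrepare()
        prepared = true
      }
    }
  }
}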

Table 2. Physical Query Operators / Specialized SparkPlans
Name Description

BinaryExecNode

Binary physical operator with two child physical operators, left and right

LeafExecNode

Leaf physical operator with no children

By default, the set of all attributes that are produced is exactly the set of attributes that are output.

UnaryExecNode

Unary physical operator with one child physical operator

Note
The naming convention for physical operators in Spark’s source code is to have their names end with the Exec suffix, e.g. DebugExec or LocalTableScanExec. The suffix is however removed when the operator is displayed, e.g. in web UI.
Table 3. SparkPlan’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

prepared

Flag that ensures that prepare is executed only once.

outputOrdering Method

Caution
FIXME

decodeUnsafeRows Method

Caution
FIXME

prepareSubqueries Method

Caution
FIXME

getByteArrayRdd Internal Method

getByteArrayRdd(n: Int = -1): RDD[Array[Byte]]
Caution
FIXME

waitForSubqueries Method

Caution
FIXME

executeCollect Method

Caution
FIXME
Note
executeCollect does not convert data to JVM types.

executeToIterator Method

Caution
FIXME

SparkPlan Contract

SparkPlan contract requires that concrete physical operators define their own custom doExecute.

doExecute(): RDD[InternalRow]

doExecute produces the result of a structured query as an RDD of internal binary rows.

Table 4. SparkPlan’s Extension Hooks (in alphabetical order)
Name Description

doExecuteBroadcast

By default reports an UnsupportedOperationException:

[nodeName] does not implement doExecuteBroadcast

Executed exclusively as part of executeBroadcast to return the result of a structured query as a broadcast variable.

doPrepare

Prepares a physical operator for execution.

Executed exclusively as part of prepare and is supposed to set some state up before executing a query (e.g. BroadcastExchangeExec to broadcast asynchronously).

outputPartitioning

Specifies how data is partitioned across different nodes in the cluster

requiredChildDistribution

Required partition requirements (aka child output distributions) of the input data, i.e. how the output of the children physical operators is split across partitions.

requiredChildDistribution: Seq[Distribution]

Defaults to UnspecifiedDistribution for all of the physical operator’s children.

Used exclusively when EnsureRequirements physical preparation rule enforces partition requirements of a physical operator.

requiredChildOrdering

Specifies required sort ordering for each partition requirement (from children operators)

requiredChildOrdering: Seq[Seq[SortOrder]]

Defaults to no sort ordering for all of the physical operator’s children.

Used exclusively when EnsureRequirements physical preparation rule enforces sort requirements of a physical operator.
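
As an illustration of the last three hooks, the hypothetical unary operator below (not part of Spark) requires its child’s output to be clustered and sorted by the first output attribute; EnsureRequirements would then insert the necessary exchange and sort operators.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Distribution}
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Hypothetical operator that processes groups of rows sharing the first attribute
case class MyGroupedExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output

  // ask EnsureRequirements to co-locate rows with the same first attribute
  override def requiredChildDistribution: Seq[Distribution] =
    ClusteredDistribution(Seq(output.head)) :: Nil

  // ...and to sort every partition by that attribute
  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    Seq(Seq(SortOrder(output.head, Ascending)))

  // pass rows through unchanged (illustrative only)
  override protected def doExecute(): RDD[InternalRow] = child.execute()
}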

Preparing SparkPlan for Query Execution — executeQuery Final Method

executeQuery[T](query: => T): T

executeQuery executes the input query in a named scope (i.e. so that all RDDs created will have the same scope for visualization like web UI).

Internally, executeQuery calls prepare and waitForSubqueries followed by executing query.

Note
executeQuery is executed as part of execute, executeBroadcast and when a CodegenSupport-enabled physical operator produces Java source code.

Broadcasting Result of Structured Query — executeBroadcast Final Method

executeBroadcast[T](): broadcast.Broadcast[T]

executeBroadcast returns the result of a structured query as a broadcast variable.

Internally, executeBroadcast calls doExecuteBroadcast inside executeQuery.

Note
executeBroadcast is called in BroadcastHashJoinExec, BroadcastNestedLoopJoinExec and ReusedExchangeExec physical operators.
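
A hedged way to observe executeBroadcast from spark-shell is to build a join that the planner is likely to plan as a broadcast join (this depends on spark.sql.autoBroadcastJoinThreshold) and request the broadcast from the BroadcastExchangeExec operator:

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.execution.exchange.BroadcastExchangeExec

val small = spark.range(10)
val large = spark.range(100000)
val q = large.join(small, "id")

// find the broadcast exchange (if the planner chose a broadcast join)
q.queryExecution.executedPlan
  .collectFirst { case e: BroadcastExchangeExec => e }
  .foreach { exchange =>
    // the result of the structured query as a broadcast variable
    val result: Broadcast[Any] = exchange.executeBroadcast[Any]()
    println(result)
  }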

metrics Internal Registry

metrics: Map[String, SQLMetric] = Map.empty

metrics is a registry of supported SQLMetrics by their names.
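
A quick way to see which metrics the operators of an executed plan declare (an exploratory snippet for spark-shell; the query is arbitrary):

// list the SQLMetrics registered by every physical operator of a query
val q = spark.range(5).groupBy("id").count
q.queryExecution.executedPlan.foreach { op =>
  println(s"${op.nodeName}: ${op.metrics.keySet.mkString(", ")}")
}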

Taking First N UnsafeRows — executeTake Method

executeTake(n: Int): Array[InternalRow]

executeTake gives an array of at most the first n internal rows.

Figure 3. SparkPlan’s executeTake takes 5 elements

Internally, executeTake gets an RDD of byte arrays of up to n unsafe rows (using getByteArrayRdd) and scans the RDD partitions one by one until n rows have been collected or all partitions have been processed.

executeTake runs Spark jobs that take all the elements from the requested number of partitions, starting with the 0th partition and increasing the number of partitions to scan per job by the spark.sql.limit.scaleUpFactor property (but at least doubling it).

Note
executeTake uses SparkContext.runJob to run a Spark job.

In the end, executeTake decodes the unsafe rows.

Note
executeTake gives an empty collection when n is 0 (and no Spark job is executed).
Note
executeTake may take and decode more unsafe rows than really needed since all unsafe rows from a partition are read (if the partition is included in the scan).
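
The following spark-shell demo shows how the partition layout affects the number of Spark jobs that Dataset.take (and thereby executeTake) runs:
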
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, 10)

// 8 groups over 10 partitions
// only 6 partitions end up with numbers
val nums = spark.
  range(start = 0, end = 20, step = 1, numPartitions = 4).
  repartition($"id" % 8)

import scala.collection.Iterator
val showElements = (it: Iterator[java.lang.Long]) => {
  val ns = it.toSeq
  import org.apache.spark.TaskContext
  val pid = TaskContext.get.partitionId
  println(s"[partition: $pid][size: ${ns.size}] ${ns.mkString(" ")}")
}
// ordered by partition id manually for demo purposes
scala> nums.foreachPartition(showElements)
[partition: 0][size: 2] 4 12
[partition: 1][size: 2] 7 15
[partition: 2][size: 0]
[partition: 3][size: 0]
[partition: 4][size: 0]
[partition: 5][size: 5] 0 6 8 14 16
[partition: 6][size: 0]
[partition: 7][size: 3] 3 11 19
[partition: 8][size: 5] 2 5 10 13 18
[partition: 9][size: 3] 1 9 17

scala> println(spark.sessionState.conf.limitScaleUpFactor)
4

// Think how many Spark jobs will the following queries run?
// Answers follow
scala> nums.take(13)
res0: Array[Long] = Array(4, 12, 7, 15, 0, 6, 8, 14, 16, 3, 11, 19, 2)

// The number of Spark jobs = 3

scala> nums.take(5)
res34: Array[Long] = Array(4, 12, 7, 15, 0)

// The number of Spark jobs = 4

scala> nums.take(3)
res38: Array[Long] = Array(4, 12, 7)

// The number of Spark jobs = 2
Note

executeTake is used when:

  • CollectLimitExec is requested to executeCollect

  • AnalyzeColumnCommand is executed

results matching ""

    No results matching ""