Debugging Query Execution

The debug package object contains tools for debugging query execution, i.e. for a full analysis of structured queries (Datasets).

Table 1. Debugging Query Execution Tools (debug Methods)
Method	Description
debug	Debugs a structured query `debug(): Unit`
debugCodegen	Displays the Java source code generated for a structured query in whole-stage code generation (i.e. the output of each WholeStageCodegen subtree in a query plan). `debugCodegen(): Unit`

The debug package object belongs to the org.apache.spark.sql.execution.debug package, which you have to import before you can use the debug and debugCodegen methods.

// Import the package object
import org.apache.spark.sql.execution.debug._

// Every Dataset (incl. DataFrame) now has the debug and debugCodegen methods
val q: DataFrame = ...
q.debug
q.debugCodegen

Tip	Read up on Package Objects in the Scala programming language.

Internally, the debug package object uses the DebugQuery implicit class that "extends" the Dataset[_] Scala type with the debug methods.

implicit class DebugQuery(query: Dataset[_]) {
  def debug(): Unit = ...
  def debugCodegen(): Unit = ...
}

Tip	Read up on Implicit Classes in the official documentation of the Scala programming language.
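
The "extension" mechanism can be tried without Spark. The following is a minimal sketch only: FakeDataset, debugsketch, DebugOps and describe are hypothetical names made up for illustration, and just the implicit-class pattern itself mirrors what DebugQuery does for Dataset[_].

```scala
object ImplicitClassSketch {
  // Hypothetical stand-in for Dataset[_]; this is not a Spark type.
  final case class FakeDataset(name: String)

  // Plays the role of the debug package object: importing its members
  // brings the implicit class into scope.
  object debugsketch {
    implicit class DebugOps(query: FakeDataset) {
      // Hypothetical method, added only to show the "extension" at work.
      def describe(): String = s"debugging ${query.name}"
    }
  }

  def demo(): String = {
    // Without this import the call below does not compile,
    // because FakeDataset itself has no describe method.
    import debugsketch._
    FakeDataset("q").describe()
  }
}
```

The import inside demo plays the same role as importing org.apache.spark.sql.execution.debug._ before calling debug or debugCodegen on a Dataset.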

Debugging Dataset — `debug` Method

debug(): Unit

debug requests the QueryExecution (of the structured query) for the optimized physical query plan.

debug transforms the optimized physical query plan to add a new DebugExec physical operator for every physical operator.

debug requests the query plan to execute and then counts the number of rows in the result. It prints out the following message:

Results returned: [count]

In the end, debug requests every DebugExec physical operator (in the query plan) to dumpStats.

val q = spark.range(10).where('id === 4)

scala> :type q
org.apache.spark.sql.Dataset[Long]

// Extend Dataset[Long] with debug and debugCodegen methods
import org.apache.spark.sql.execution.debug._

scala> q.debug
Results returned: 1
== WholeStageCodegen ==
Tuples output: 1
 id LongType: {java.lang.Long}
== Filter (id#0L = 4) ==
Tuples output: 0
 id LongType: {}
== Range (0, 10, step=1, splits=8) ==
Tuples output: 0
 id LongType: {}
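
The steps of debug (wrap every operator, execute, count the rows, dump the per-operator statistics) can be sketched on a toy operator tree. This is an illustration under stated assumptions, not Spark's implementation: Op, RangeOp, FilterOp and DebugOp are hypothetical stand-ins for SparkPlan, Range, Filter and DebugExec, and the toy model has no whole-stage code generation, so its per-operator counts are not meant to match the output above.

```scala
object DebugSketch {
  // Toy stand-in for a physical operator; not Spark's SparkPlan.
  sealed trait Op {
    def name: String
    def children: Seq[Op]
    def execute(): Iterator[Long]
  }

  final case class RangeOp(start: Long, end: Long) extends Op {
    val name = "Range"
    val children: Seq[Op] = Nil
    def execute(): Iterator[Long] = (start until end).iterator
  }

  final case class FilterOp(predicate: Long => Boolean, child: Op) extends Op {
    val name = "Filter"
    val children: Seq[Op] = Seq(child)
    def execute(): Iterator[Long] = child.execute().filter(predicate)
  }

  // Toy counterpart of DebugExec: passes rows through and counts them.
  final case class DebugOp(child: Op) extends Op {
    var tupleCount = 0L
    val name = "Debug"
    val children: Seq[Op] = Seq(child)
    def execute(): Iterator[Long] =
      child.execute().map { row => tupleCount += 1; row }
  }

  // Step 1: add a DebugOp on top of every operator in the tree.
  def addDebug(op: Op): Op = op match {
    case d: DebugOp         => d
    case FilterOp(p, child) => DebugOp(FilterOp(p, addDebug(child)))
    case leaf               => DebugOp(leaf)
  }

  // Pre-order listing of all operators in a tree.
  def flatten(op: Op): Seq[Op] = op +: op.children.flatMap(flatten)

  // Steps 2 and 3: execute and count the rows, then collect
  // (wrapped operator name, tuples output) from every DebugOp.
  def debug(plan: Op): (Long, Seq[(String, Long)]) = {
    val debugPlan = addDebug(plan)
    val count = debugPlan.execute().size.toLong
    val stats = flatten(debugPlan).collect {
      case d: DebugOp => (d.child.name, d.tupleCount)
    }
    (count, stats)
  }
}
```

Calling DebugSketch.debug(DebugSketch.FilterOp(_ == 4L, DebugSketch.RangeOp(0, 10))) returns the overall row count together with one statistics entry per wrapped operator. The statistics have to be read after the plan has been executed, because the counters are only updated while rows flow through the wrappers.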

Displaying Java Source Code Generated for Structured Query in Whole-Stage Code Generation ("Debugging" Codegen) — `debugCodegen` Method

debugCodegen(): Unit

debugCodegen requests the QueryExecution (of the structured query) for the optimized physical query plan.

In the end, debugCodegen simply requests codegenString for the query plan and prints the result out to the standard output.

import org.apache.spark.sql.execution.debug._

scala> spark.range(10).where('id === 4).debugCodegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*Filter (id#29L = 4)
+- *Range (0, 10, splits=8)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
...
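
The layout of the printed text can be approximated with a small formatter. This is a sketch that only mirrors the output shown above, not the actual codegenString: render is a hypothetical helper, and the assumption that each pair holds a subtree description and its generated source code is mine.

```scala
object CodegenStringSketch {
  // Each pair is assumed to be (subtree description, generated source code).
  def render(subtrees: Seq[(String, String)]): String = {
    val header = s"Found ${subtrees.size} WholeStageCodegen subtrees."
    val sections = subtrees.zipWithIndex.map { case ((subtree, code), i) =>
      // Subtrees are numbered from 1 in the printed output.
      s"== Subtree ${i + 1} / ${subtrees.size} ==\n$subtree\n\nGenerated code:\n$code"
    }
    (header +: sections).mkString("\n")
  }
}
```

With a single pair as input, the result starts with the "Found 1 WholeStageCodegen subtrees." line followed by the "== Subtree 1 / 1 ==" header, as in the output above.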

Note

debugCodegen is equivalent to using the debug interface of QueryExecution.

val q = spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)
scala> q.queryExecution.debug.codegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*Project [(id#3L + 6) AS (((id + 1) + 2) + 3)#6L, (id#3L + 15) AS (((id + 4) + 5) + 6)#7L]
+- *Range (1, 1000, step=1, splits=8)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
...

`codegenToSeq` Method

codegenToSeq(): Seq[(String, String)]

codegenToSeq…FIXME

Note	`codegenToSeq` is used when…FIXME

`codegenString` Method

codegenString(plan: SparkPlan): String

codegenString…FIXME

Note	`codegenString` is used when…FIXME
