Analyzer — Logical Query Plan Analyzer

Analyzer (aka Spark Analyzer or Query Analyzer) is the logical query plan analyzer that semantically validates and transforms an unresolved logical plan to an analyzed logical plan.

Analyzer is a concrete RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]) with the logical evaluation rules.

Analyzer: Unresolved Logical Plan ==> Analyzed Logical Plan
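Since Analyzer is a RuleExecutor, its batches are applied according to a strategy, notably FixedPoint: rules run repeatedly until the plan stops changing or maxIterations is reached. The following is a minimal pure-Scala sketch of that fixed-point idea on a toy expression tree; it is illustrative only and is not Catalyst's actual RuleExecutor (all names here are made up).

```scala
// Minimal sketch of fixed-point rule execution (illustrative, NOT Catalyst's RuleExecutor).
// A "plan" is a toy arithmetic expression; rules rewrite it until it stops changing
// or maxIterations is reached -- the essence of the FixedPoint strategy.
sealed trait Expr
case class Num(n: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object FixedPointExecutor {
  type Rule = Expr => Expr

  // Constant-folds Add(Num, Num); recurses into children otherwise.
  val foldConstants: Rule = {
    case Add(Num(a), Num(b)) => Num(a + b)
    case Add(l, r)           => Add(foldConstants(l), foldConstants(r))
    case e                   => e
  }

  def execute(plan: Expr, rules: Seq[Rule], maxIterations: Int): Expr = {
    var current = plan
    var iteration = 0
    var changed = true
    while (changed && iteration < maxIterations) {
      val next = rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current
      current = next
      iteration += 1
    }
    current
  }
}

// Add(Add(1, 2), 3) folds to Num(6) after two passes; the third pass detects no change.
val result = FixedPointExecutor.execute(
  Add(Add(Num(1), Num(2)), Num(3)),
  Seq(FixedPointExecutor.foldConstants),
  maxIterations = 100)
```

The termination condition (no change, or maxIterations exhausted) is exactly what the fixedPoint registry below controls for the Hints, Substitution, Resolution and Cleanup batches.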

Analyzer uses SessionCatalog while resolving relational entities, e.g. databases, tables, columns.

Analyzer is created when SessionState is requested for the analyzer.

Figure 1. Creating Analyzer (image: spark sql Analyzer.png)

Analyzer is available as the analyzer property of a session-specific SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.analyzer
org.apache.spark.sql.catalyst.analysis.Analyzer

You can access the analyzed logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the analyzed logical plan using QueryExecution and its analyzed property (which, together with the numberedTreeString method, is a very good "debugging" tool).

val analyzedPlan = inventory.queryExecution.analyzed
scala> println(analyzedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))

Analyzer defines the extendedResolutionRules extension point for additional logical evaluation rules that a custom Analyzer can use to extend the Resolution rule batch. These rules are added at the end of the Resolution batch.

Note
SessionState uses its own Analyzer with custom extendedResolutionRules, postHocResolutionRules, and extendedCheckRules extension points.
Table 1. Analyzer’s Internal Registries and Counters

  • extendedResolutionRules: Additional rules for the Resolution batch. Empty by default.

  • fixedPoint: FixedPoint with maxIterations for the Hints, Substitution, Resolution and Cleanup batches. Set when Analyzer is created (either defined explicitly or through the optimizerMaxIterations configuration setting).

  • postHocResolutionRules: The only rules in the Post-Hoc Resolution batch, if defined (executed in one pass, i.e. with Once strategy). Empty by default.

Analyzer is used by QueryExecution to resolve the managed LogicalPlan (and, as a follow-up, to assert that a structured query has already been properly analyzed, i.e. that no failed, unresolved, or otherwise broken logical plan operators or expressions remain).

Tip

Enable TRACE or DEBUG logging levels for the respective session-specific loggers to see what happens inside Analyzer.

  • org.apache.spark.sql.internal.SessionState$$anon$1

  • org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 when Hive support is enabled

Add the following line to conf/log4j.properties:

# with no Hive support
log4j.logger.org.apache.spark.sql.internal.SessionState$$anon$1=TRACE

# with Hive support enabled
log4j.logger.org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1=DEBUG

Refer to Logging.


The reason for such weird-looking logger names is that the analyzer attribute is created as an anonymous subclass of the Analyzer class in the respective SessionStates.

Executing Logical Evaluation Rules — execute Method

Analyzer is a RuleExecutor that defines the logical evaluation rules (i.e. rules that resolve, remove, and in general modify a logical plan), e.g.

Table 2. Analyzer’s Batches and Logical Evaluation Rules (in the order of execution)

Hints (FixedPoint)

  • ResolveBroadcastHints: Resolves UnresolvedHint logical operators with BROADCAST, BROADCASTJOIN or MAPJOIN hints to ResolvedHint operators

  • ResolveCoalesceHints: Resolves UnresolvedHint logical operators with COALESCE or REPARTITION hints to ResolvedHint operators

  • RemoveAllHints: Removes all UnresolvedHint logical operators

Simple Sanity Check (Once)

  • LookupFunctions: Checks whether a function identifier (referenced by an UnresolvedFunction) exists in the function registry. Throws a NoSuchFunctionException if not.

Substitution (FixedPoint)

  • CTESubstitution: Resolves With operators (and substitutes named common table expressions, i.e. CTEs)

  • WindowsSubstitution: Substitutes an UnresolvedWindowExpression with a WindowExpression for WithWindowDefinition logical operators

  • EliminateUnions: Eliminates a Union of a single child into that child

  • SubstituteUnresolvedOrdinals: Replaces ordinals in Sort and Aggregate logical operators with UnresolvedOrdinal expressions
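The overall idea behind ordinal substitution is that in a query like SELECT name, age FROM people ORDER BY 2 the ordinal 2 stands for the second expression in the select list. The following pure-Scala sketch illustrates that substitution on toy types; it is an assumption-laden simplification, not Spark's SubstituteUnresolvedOrdinals rule (which marks ordinals as UnresolvedOrdinal expressions to be resolved later by ResolveOrdinalInOrderByAndGroupBy).

```scala
// Illustrative sketch of ordinal substitution (NOT Spark's actual rule):
// in "SELECT name, age ... ORDER BY 2", the ordinal 2 stands for the second
// projected column, so the rule replaces it with the matching expression.
sealed trait SortKey
case class Ordinal(position: Int) extends SortKey    // ORDER BY 2
case class ColumnRef(name: String) extends SortKey   // ORDER BY age

def substituteOrdinals(projectList: Seq[String], sortKeys: Seq[SortKey]): Seq[SortKey] =
  sortKeys.map {
    case Ordinal(pos) if pos >= 1 && pos <= projectList.size =>
      ColumnRef(projectList(pos - 1))                // SQL ordinals are 1-based
    case other => other                              // out-of-range or named keys kept as-is
  }

val substituted = substituteOrdinals(
  projectList = Seq("name", "age"),
  sortKeys    = Seq(Ordinal(2), ColumnRef("name")))
// substituted == Seq(ColumnRef("age"), ColumnRef("name"))
```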

Resolution (FixedPoint)

  • ResolveTableValuedFunctions: Replaces an UnresolvedTableValuedFunction with a table-valued function

  • ResolveRelations

  • ResolveReferences

  • ResolveCreateNamedStruct: Resolves CreateNamedStruct expressions (with NamePlaceholders) to use Literal expressions

  • ResolveDeserializer

  • ResolveNewInstance

  • ResolveUpCast

  • ResolveGroupingAnalytics: Resolves grouping expressions up in a logical plan tree:

      • Cube, Rollup and GroupingSets expressions

      • Filter with Grouping or GroupingID expressions

      • Sort with Grouping or GroupingID expressions

    Expects that all children of a logical operator are already resolved (and, since the rule belongs to a fixed-point batch, that will likely happen at some iteration).

    Fails analysis when the grouping__id Hive function is used:

    scala> sql("select grouping__id").queryExecution.analyzed
    org.apache.spark.sql.AnalysisException: grouping__id is deprecated; use grouping_id() instead;
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$apply$6.applyOrElse(Analyzer.scala:451)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$apply$6.applyOrElse(Analyzer.scala:448)

    Note: ResolveGroupingAnalytics is only for grouping function resolution, while ResolveAggregateFunctions is responsible for resolving the other aggregates.

  • ResolvePivot: Resolves a Pivot logical operator to a Project with an Aggregate unary logical operator (for supported data types in aggregates) or just a single Aggregate

  • ResolveOrdinalInOrderByAndGroupBy

  • ResolveMissingReferences

  • ExtractGenerator

  • ResolveGenerate

  • ResolveFunctions: Resolves functions using SessionCatalog. If a Generator is not found, ResolveFunctions reports the error:

    [name] is expected to be a generator. However, its class is [className], which is not a generator.

  • ResolveAliases: Replaces UnresolvedAlias expressions with concrete aliases:

      • NamedExpressions

      • MultiAlias (for GeneratorOuter and Generator)

      • Alias (for Cast and ExtractValue)

  • ResolveSubquery: Resolves subquery expressions (i.e. ScalarSubquery, Exists and In)

  • ResolveWindowOrder

  • ResolveWindowFrame: Resolves WindowExpression expressions

  • ResolveNaturalAndUsingJoin

  • ExtractWindowExpressions

  • GlobalAggregates: Resolves (i.e. replaces) Project operators whose AggregateExpressions are not WindowExpressions with Aggregate unary logical operators

  • ResolveAggregateFunctions: Resolves aggregate functions in Filter and Sort operators

    Note: ResolveAggregateFunctions skips (i.e. does not resolve) grouping functions that are resolved by the ResolveGroupingAnalytics rule.

  • TimeWindowing: Resolves TimeWindow expressions to Filter with Expand logical operators

  • ResolveInlineTables: Resolves UnresolvedInlineTable operators to LocalRelations

  • TypeCoercion.typeCoercionRules: Type coercion rules

  • extendedResolutionRules

Post-Hoc Resolution (Once)

  • postHocResolutionRules

View (Once)

  • AliasViewChild

Nondeterministic (Once)

  • PullOutNondeterministic

UDF (Once)

  • HandleNullInputsForUDF

FixNullability (Once)

  • FixNullability

ResolveTimeZone (Once)

  • ResolveTimeZone: Replaces a TimeZoneAwareExpression with no time zone with one with the session-local time zone

Cleanup (FixedPoint)

  • CleanupAliases
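The ResolveTimeZone idea can be illustrated with a pure-Scala sketch: expressions that need a time zone but carry none get the session-local one, while expressions with an explicit time zone are left alone. The types and function below are made-up toys, not Spark's TimeZoneAwareExpression machinery.

```scala
// Illustrative sketch of the ResolveTimeZone idea (NOT Spark's implementation):
// time-zone-aware expressions with no time zone get the session-local one.
case class TimestampCast(column: String, timeZoneId: Option[String])

def resolveTimeZone(exprs: Seq[TimestampCast], sessionTimeZone: String): Seq[TimestampCast] =
  exprs.map {
    case e @ TimestampCast(_, None) => e.copy(timeZoneId = Some(sessionTimeZone))
    case e                          => e  // an explicit time zone is kept as-is
  }

val resolved = resolveTimeZone(
  Seq(TimestampCast("ts", None), TimestampCast("ts2", Some("UTC"))),
  sessionTimeZone = "Europe/Warsaw")
// resolved.head.timeZoneId == Some("Europe/Warsaw"); the explicit "UTC" survives
```

In Spark itself the session-local time zone comes from the spark.sql.session.timeZone configuration property.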

Tip
Consult the sources of Analyzer for the up-to-date list of the evaluation rules.

Creating Analyzer Instance

Analyzer takes the following when created:

Analyzer initializes the internal registries and counters.

Note
Analyzer can also be created without specifying the maxIterations argument which is then configured using optimizerMaxIterations configuration setting.

resolver Method

resolver: Resolver

resolver requests CatalystConf for Resolver.

Note
Resolver is a mere function of two String parameters that returns true if both refer to the same entity (e.g. case-insensitive equality).
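Since a Resolver is just a (String, String) => Boolean function, the two flavors (case-sensitive and case-insensitive name resolution, chosen in Spark via the spark.sql.caseSensitive configuration property) can be sketched in a few lines of plain Scala:

```scala
// Sketch of the two Resolver flavors: functions of two Strings that decide
// whether both names refer to the same entity.
type Resolver = (String, String) => Boolean

val caseSensitiveResolution: Resolver   = (a, b) => a == b
val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

val r1 = caseInsensitiveResolution("ID", "id")   // true
val r2 = caseSensitiveResolution("ID", "id")     // false
```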

resolveExpression Method

resolveExpression(
  expr: Expression,
  plan: LogicalPlan,
  throws: Boolean = false): Expression

resolveExpression…​FIXME

Note
resolveExpression is a protected[sql] method.
Note
resolveExpression is used when…​FIXME

commonNaturalJoinProcessing Internal Method

commonNaturalJoinProcessing(
  left: LogicalPlan,
  right: LogicalPlan,
  joinType: JoinType,
  joinNames: Seq[String],
  condition: Option[Expression]): Project

commonNaturalJoinProcessing…​FIXME

Note
commonNaturalJoinProcessing is used when…​FIXME

executeAndCheck Method

executeAndCheck(plan: LogicalPlan): LogicalPlan

executeAndCheck…​FIXME

Note
executeAndCheck is used exclusively when QueryExecution is requested for the analyzed logical plan.
