Analyzer — Logical Query Plan Analyzer

Analyzer (aka Spark Analyzer or Query Analyzer) is the logical query plan analyzer that semantically validates and transforms an unresolved logical plan to an analyzed logical plan.

Analyzer is a concrete RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]) with the logical evaluation rules.

Analyzer: Unresolved Logical Plan ==> Analyzed Logical Plan
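Since Analyzer is a RuleExecutor, its batches are applied according to a strategy, notably FixedPoint: rules run repeatedly until the plan stops changing or maxIterations is reached. The following is a minimal pure-Scala sketch of that fixed-point idea on a toy expression tree; it is illustrative only and is not Catalyst's actual RuleExecutor (all names here are made up).

```scala
// Minimal sketch of fixed-point rule execution (illustrative, NOT Catalyst's RuleExecutor).
// A "plan" is a toy arithmetic expression; rules rewrite it until it stops changing
// or maxIterations is reached -- the essence of the FixedPoint strategy.
sealed trait Expr
case class Num(n: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object FixedPointExecutor {
  type Rule = Expr => Expr

  // Constant-folds Add(Num, Num); recurses into children otherwise.
  val foldConstants: Rule = {
    case Add(Num(a), Num(b)) => Num(a + b)
    case Add(l, r)           => Add(foldConstants(l), foldConstants(r))
    case e                   => e
  }

  def execute(plan: Expr, rules: Seq[Rule], maxIterations: Int): Expr = {
    var current = plan
    var iteration = 0
    var changed = true
    while (changed && iteration < maxIterations) {
      val next = rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current
      current = next
      iteration += 1
    }
    current
  }
}

// Add(Add(1, 2), 3) folds to Num(6) after two passes; the third pass detects no change.
val result = FixedPointExecutor.execute(
  Add(Add(Num(1), Num(2)), Num(3)),
  Seq(FixedPointExecutor.foldConstants),
  maxIterations = 100)
```

The termination condition (no change, or maxIterations exhausted) is exactly what the fixedPoint registry below controls for the Hints, Substitution, Resolution and Cleanup batches.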

Analyzer uses SessionCatalog while resolving relational entities, e.g. databases, tables, columns.

Analyzer is created when SessionState is requested for the analyzer.

Figure 1. Creating Analyzer (image: spark sql Analyzer.png)

Analyzer is available as the analyzer property of a session-specific SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.analyzer
org.apache.spark.sql.catalyst.analysis.Analyzer

You can access the analyzed logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the analyzed logical plan using QueryExecution and its analyzed property (which, together with the numberedTreeString method, is a very good "debugging" tool).

val analyzedPlan = inventory.queryExecution.analyzed
scala> println(analyzedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))

Analyzer defines the extendedResolutionRules extension point for additional logical evaluation rules that a custom Analyzer can use to extend the Resolution rule batch. These rules are added at the end of the Resolution batch.

Note
SessionState uses its own Analyzer with custom extendedResolutionRules, postHocResolutionRules, and extendedCheckRules extension points.
Table 1. Analyzer’s Internal Registries and Counters

  • extendedResolutionRules: Additional rules for the Resolution batch. Empty by default.

  • fixedPoint: FixedPoint with maxIterations for the Hints, Substitution, Resolution and Cleanup batches. Set when Analyzer is created (either defined explicitly or through the optimizerMaxIterations configuration setting).

  • postHocResolutionRules: The only rules in the Post-Hoc Resolution batch, if defined (executed in one pass, i.e. with Once strategy). Empty by default.

Analyzer is used by QueryExecution to resolve the managed LogicalPlan (and, as a follow-up, to assert that a structured query has already been properly analyzed, i.e. that no failed, unresolved, or otherwise broken logical plan operators or expressions remain).

Tip

Enable TRACE or DEBUG logging levels for the respective session-specific loggers to see what happens inside Analyzer.

  • org.apache.spark.sql.internal.SessionState$$anon$1

  • org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 when Hive support is enabled

Add the following line to conf/log4j.properties:

# with no Hive support
log4j.logger.org.apache.spark.sql.internal.SessionState$$anon$1=TRACE

# with Hive support enabled
log4j.logger.org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1=DEBUG

Refer to Logging.


The reason for such weird-looking logger names is that the analyzer attribute is created as an anonymous subclass of the Analyzer class in the respective SessionStates.

Executing Logical Evaluation Rules — execute Method

Analyzer is a RuleExecutor that defines the logical evaluation rules (i.e. rules that resolve, remove, and in general modify a logical plan), e.g.

Table 2. Analyzer’s Batches and Logical Evaluation Rules (in the order of execution)

Hints (FixedPoint)

  • ResolveBroadcastHints: Resolves UnresolvedHint logical operators with BROADCAST, BROADCASTJOIN or MAPJOIN hints to ResolvedHint operators

  • ResolveCoalesceHints: Resolves UnresolvedHint logical operators with COALESCE or REPARTITION hints to ResolvedHint operators

  • RemoveAllHints: Removes all UnresolvedHint logical operators

Simple Sanity Check (Once)

  • LookupFunctions: Checks whether a function identifier (referenced by an UnresolvedFunction) exists in the function registry. Throws a NoSuchFunctionException if not.

Substitution (FixedPoint)

  • CTESubstitution: Resolves With operators (and substitutes named common table expressions, i.e. CTEs)

  • WindowsSubstitution: Substitutes an UnresolvedWindowExpression with a WindowExpression for WithWindowDefinition logical operators

  • EliminateUnions: Eliminates a Union of a single child into that child

  • SubstituteUnresolvedOrdinals: Replaces ordinals in Sort and Aggregate logical operators with UnresolvedOrdinal expressions
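The overall idea behind ordinal substitution is that in a query like SELECT name, age FROM people ORDER BY 2 the ordinal 2 stands for the second expression in the select list. The following pure-Scala sketch illustrates that substitution on toy types; it is an assumption-laden simplification, not Spark's SubstituteUnresolvedOrdinals rule (which marks ordinals as UnresolvedOrdinal expressions to be resolved later by ResolveOrdinalInOrderByAndGroupBy).

```scala
// Illustrative sketch of ordinal substitution (NOT Spark's actual rule):
// in "SELECT name, age ... ORDER BY 2", the ordinal 2 stands for the second
// projected column, so the rule replaces it with the matching expression.
sealed trait SortKey
case class Ordinal(position: Int) extends SortKey    // ORDER BY 2
case class ColumnRef(name: String) extends SortKey   // ORDER BY age

def substituteOrdinals(projectList: Seq[String], sortKeys: Seq[SortKey]): Seq[SortKey] =
  sortKeys.map {
    case Ordinal(pos) if pos >= 1 && pos <= projectList.size =>
      ColumnRef(projectList(pos - 1))                // SQL ordinals are 1-based
    case other => other                              // out-of-range or named keys kept as-is
  }

val substituted = substituteOrdinals(
  projectList = Seq("name", "age"),
  sortKeys    = Seq(Ordinal(2), ColumnRef("name")))
// substituted == Seq(ColumnRef("age"), ColumnRef("name"))
```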

Resolution (FixedPoint)

  • ResolveTableValuedFunctions: Replaces an UnresolvedTableValuedFunction with a table-valued function

  • ResolveRelations

  • ResolveReferences

  • ResolveCreateNamedStruct: Resolves CreateNamedStruct expressions (with NamePlaceholders) to use Literal expressions

  • ResolveDeserializer

  • ResolveNewInstance

  • ResolveUpCast

  • ResolveGroupingAnalytics: Resolves grouping expressions up in a logical plan tree:

      • Cube, Rollup and GroupingSets expressions

      • Filter with Grouping or GroupingID expressions

      • Sort with Grouping or GroupingID expressions

    Expects that all children of a logical operator are already resolved (and, since the rule belongs to a fixed-point batch, that will likely happen at some iteration).

    Fails analysis when the grouping__id Hive function is used:

    scala> sql("select grouping__id").queryExecution.analyzed
    org.apache.spark.sql.AnalysisException: grouping__id is deprecated; use grouping_id() instead;
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$apply$6.applyOrElse(Analyzer.scala:451)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$apply$6.applyOrElse(Analyzer.scala:448)

    Note: ResolveGroupingAnalytics is only for grouping function resolution, while ResolveAggregateFunctions is responsible for resolving the other aggregates.

  • ResolvePivot: Resolves a Pivot logical operator to a Project with an Aggregate unary logical operator (for supported data types in aggregates) or just a single Aggregate

  • ResolveOrdinalInOrderByAndGroupBy

  • ResolveMissingReferences

  • ExtractGenerator

  • ResolveGenerate

  • ResolveFunctions: Resolves functions using SessionCatalog. If a Generator is not found, ResolveFunctions reports the error:

    [name] is expected to be a generator. However, its class is [className], which is not a generator.

  • ResolveAliases: Replaces UnresolvedAlias expressions with concrete aliases:

      • NamedExpressions

      • MultiAlias (for GeneratorOuter and Generator)

      • Alias (for Cast and ExtractValue)

  • ResolveSubquery: Resolves subquery expressions (i.e. ScalarSubquery, Exists and In)

  • ResolveWindowOrder

  • ResolveWindowFrame: Resolves WindowExpression expressions

  • ResolveNaturalAndUsingJoin

  • ExtractWindowExpressions

  • GlobalAggregates: Resolves (i.e. replaces) Project operators whose AggregateExpressions are not WindowExpressions with Aggregate unary logical operators

  • ResolveAggregateFunctions: Resolves aggregate functions in Filter and Sort operators

    Note: ResolveAggregateFunctions skips (i.e. does not resolve) grouping functions that are resolved by the ResolveGroupingAnalytics rule.

  • TimeWindowing: Resolves TimeWindow expressions to Filter with Expand logical operators

  • ResolveInlineTables: Resolves UnresolvedInlineTable operators to LocalRelations

  • TypeCoercion.typeCoercionRules: Type coercion rules

  • extendedResolutionRules

Post-Hoc Resolution (Once)

  • postHocResolutionRules

View (Once)

  • AliasViewChild

Nondeterministic (Once)

  • PullOutNondeterministic

UDF (Once)

  • HandleNullInputsForUDF

FixNullability (Once)

  • FixNullability

ResolveTimeZone (Once)

  • ResolveTimeZone: Replaces a TimeZoneAwareExpression with no time zone with one with the session-local time zone

Cleanup (FixedPoint)

  • CleanupAliases
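The ResolveTimeZone idea can be illustrated with a pure-Scala sketch: expressions that need a time zone but carry none get the session-local one, while expressions with an explicit time zone are left alone. The types and function below are made-up toys, not Spark's TimeZoneAwareExpression machinery.

```scala
// Illustrative sketch of the ResolveTimeZone idea (NOT Spark's implementation):
// time-zone-aware expressions with no time zone get the session-local one.
case class TimestampCast(column: String, timeZoneId: Option[String])

def resolveTimeZone(exprs: Seq[TimestampCast], sessionTimeZone: String): Seq[TimestampCast] =
  exprs.map {
    case e @ TimestampCast(_, None) => e.copy(timeZoneId = Some(sessionTimeZone))
    case e                          => e  // an explicit time zone is kept as-is
  }

val resolved = resolveTimeZone(
  Seq(TimestampCast("ts", None), TimestampCast("ts2", Some("UTC"))),
  sessionTimeZone = "Europe/Warsaw")
// resolved.head.timeZoneId == Some("Europe/Warsaw"); the explicit "UTC" survives
```

In Spark itself the session-local time zone comes from the spark.sql.session.timeZone configuration property.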

Tip
Consult the sources of Analyzer for the up-to-date list of the evaluation rules.

Creating Analyzer Instance

Analyzer takes the following when created:

Analyzer initializes the internal registries and counters.

Note
Analyzer can also be created without specifying the maxIterations argument which is then configured using optimizerMaxIterations configuration setting.

resolver Method

resolver: Resolver

resolver requests CatalystConf for Resolver.

Note
Resolver is a mere function of two String parameters that returns true if both refer to the same entity (e.g. case-insensitive equality).
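Since a Resolver is just a (String, String) => Boolean function, the two flavors (case-sensitive and case-insensitive name resolution, chosen in Spark via the spark.sql.caseSensitive configuration property) can be sketched in a few lines of plain Scala:

```scala
// Sketch of the two Resolver flavors: functions of two Strings that decide
// whether both names refer to the same entity.
type Resolver = (String, String) => Boolean

val caseSensitiveResolution: Resolver   = (a, b) => a == b
val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

val r1 = caseInsensitiveResolution("ID", "id")   // true
val r2 = caseSensitiveResolution("ID", "id")     // false
```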

resolveExpression Method

resolveExpression(
  expr: Expression,
  plan: LogicalPlan,
  throws: Boolean = false): Expression

resolveExpression…​FIXME

Note
resolveExpression is a protected[sql] method.
Note
resolveExpression is used when…​FIXME

commonNaturalJoinProcessing Internal Method

commonNaturalJoinProcessing(
  left: LogicalPlan,
  right: LogicalPlan,
  joinType: JoinType,
  joinNames: Seq[String],
  condition: Option[Expression]): Project

commonNaturalJoinProcessing…​FIXME

Note
commonNaturalJoinProcessing is used when…​FIXME

executeAndCheck Method

executeAndCheck(plan: LogicalPlan): LogicalPlan

executeAndCheck…​FIXME

Note
executeAndCheck is used exclusively when QueryExecution is requested for the analyzed logical plan.
