Catalyst Optimizer — Generic Logical Query Plan Optimizer

Optimizer (aka Catalyst Optimizer) is the base class of logical query plan optimizers. It defines the rule batches of logical optimizations, i.e. the rules that transform the logical plan of a structured query to produce an optimized logical plan.

Note
SparkOptimizer is the one and only direct implementation of the Optimizer Contract in Spark SQL.

Optimizer is a RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]).

Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan

Optimizer is available as the optimizer property of a session-specific SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer

You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or SQL's EXPLAIN EXTENDED command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good "debugging" tool).

val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
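Since Optimizer is a RuleExecutor[LogicalPlan], you can also run it directly on an analyzed logical plan. The following sketch assumes the `inventory` Dataset and the `spark` session from the example above:

```scala
// Run the session's Optimizer directly on an analyzed logical plan
// (assumes the `inventory` Dataset and `spark` session from above)
val analyzed = inventory.queryExecution.analyzed
val optimized = spark.sessionState.optimizer.execute(analyzed)
println(optimized.numberedTreeString)
```

The result should match what queryExecution.optimizedPlan gives you (modulo plan caching in QueryExecution).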

Optimizer defines the default rule batches that are considered the base rule batches that can be further refined (extended or with some rules excluded).

Table 1. Optimizer’s Default Optimization Rule Batches (in the order of execution)

| Batch Name | Strategy | Rules | Description |
|---|---|---|---|
| Eliminate Distinct | Once | EliminateDistinct | |
| Finish Analysis | Once | EliminateSubqueryAliases | Removes (eliminates) SubqueryAlias unary logical operators from a logical plan |
| | | EliminateView | Removes (eliminates) View unary logical operators from a logical plan and replaces them with their child logical operator |
| | | ReplaceExpressions | Replaces RuntimeReplaceable expressions with their single child expression |
| | | ComputeCurrentTime | |
| | | GetCurrentDatabase | |
| | | RewriteDistinctAggregates | |
| | | ReplaceDeduplicateWithAggregate | |
| Union | Once | CombineUnions | |
| LocalRelation early | FixedPoint | ConvertToLocalRelation | |
| | | PropagateEmptyRelation | |
| Pullup Correlated Expressions | Once | PullupCorrelatedPredicates | |
| Subquery | Once | OptimizeSubqueries | |
| Replace Operators | FixedPoint | RewriteExceptAll | |
| | | RewriteIntersectAll | |
| | | ReplaceIntersectWithSemiJoin | |
| | | ReplaceExceptWithFilter | |
| | | ReplaceExceptWithAntiJoin | |
| | | ReplaceDistinctWithAggregate | |
| Aggregate | FixedPoint | RemoveLiteralFromGroupExpressions | |
| | | RemoveRepetitionFromGroupExpressions | |
| operatorOptimizationBatch | | | The Operator Optimization batches (see Table 3) |
| Join Reorder | Once | CostBasedJoinReorder | Reorders Join logical operators |
| Remove Redundant Sorts | Once | RemoveRedundantSorts | |
| Decimal Optimizations | FixedPoint | DecimalAggregates | |
| Object Expressions Optimization | FixedPoint | EliminateMapObjects | |
| | | CombineTypedFilters | |
| LocalRelation | FixedPoint | ConvertToLocalRelation | |
| | | PropagateEmptyRelation | |
| Extract PythonUDF From JoinCondition | Once | PullOutPythonUDFInJoinCondition | |
| Check Cartesian Products | Once | CheckCartesianProducts | |
| RewriteSubquery | Once | RewritePredicateSubquery | |
| | | ColumnPruning | |
| | | CollapseProject | |
| | | RemoveRedundantProject | |
| UpdateAttributeReferences | Once | UpdateNullabilityInAttributeReferences | |

Tip
Consult the sources of the Optimizer class for the up-to-date list of the default optimization rule batches.

Optimizer defines the operator optimization rules, together with the extendedOperatorOptimizationRules extension point for additional optimizations, in the Operator Optimization batch.

Table 2. Optimizer’s Operator Optimization Rules (in the order of execution)

| Rule Name | Description |
|---|---|
| PushProjectionThroughUnion | |
| ReorderJoin | |
| EliminateOuterJoin | |
| PushPredicateThroughJoin | |
| PushDownPredicate | |
| LimitPushDown | |
| ColumnPruning | |
| CollapseRepartition | |
| CollapseProject | |
| CollapseWindow | Collapses two adjacent Window logical operators |
| CombineFilters | |
| CombineLimits | |
| CombineUnions | |
| NullPropagation | |
| ConstantPropagation | |
| FoldablePropagation | |
| OptimizeIn | |
| ConstantFolding | |
| ReorderAssociativeOperator | |
| LikeSimplification | |
| BooleanSimplification | |
| SimplifyConditionals | |
| RemoveDispensableExpressions | |
| SimplifyBinaryComparison | |
| PruneFilters | |
| EliminateSorts | |
| SimplifyCasts | |
| SimplifyCaseConversionExpressions | |
| RewriteCorrelatedScalarSubquery | |
| EliminateSerialization | |
| RemoveRedundantAliases | |
| RemoveRedundantProject | |
| SimplifyExtractValueOps | |
| CombineConcats | |
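You can watch some of these rules fire by comparing the analyzed and optimized plans of a query with a foldable expression. The following is a sketch that assumes a running SparkSession named `spark` (e.g. in spark-shell); ConstantFolding evaluates the constant subexpression at optimization time:

```scala
import org.apache.spark.sql.functions.{col, lit}

// `lit(1) + lit(2)` is foldable, so ConstantFolding replaces it with 3;
// the optimized plan shows (id + 3) instead of (id + (1 + 2))
val q = spark.range(3).select((col("id") + (lit(1) + lit(2))).as("sum"))
println(q.queryExecution.analyzed.numberedTreeString)
println(q.queryExecution.optimizedPlan.numberedTreeString)
```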

Optimizer defines Operator Optimization Batch that is simply a collection of rule batches with the operator optimization rules before and after InferFiltersFromConstraints logical rule.

Table 3. Optimizer’s Operator Optimization Batch (in the order of execution)

| Batch Name | Strategy | Rules |
|---|---|---|
| Operator Optimization before Inferring Filters | FixedPoint | Operator optimization rules |
| Infer Filters | Once | InferFiltersFromConstraints |
| Operator Optimization after Inferring Filters | FixedPoint | Operator optimization rules |

Optimizer uses spark.sql.optimizer.excludedRules configuration property to control what optimization rules in the defaultBatches should be excluded (default: none).
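For instance, to disable predicate pushdown you could exclude the PushDownPredicate rule by its fully-qualified class name. A sketch (the app name is arbitrary; non-excludable rules are kept even if listed):

```scala
import org.apache.spark.sql.SparkSession

// Exclude PushDownPredicate from the optimizer; rules are referenced by
// their fully-qualified class names (comma-separated for multiple rules)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ExcludedRulesDemo")
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
  .getOrCreate()
```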

Optimizer takes a SessionCatalog when created.

Note
Optimizer is a Scala abstract class and cannot be created directly. It is created indirectly through its concrete implementations (e.g. SparkOptimizer).

Optimizer defines the non-excludable optimization rules that are considered critical for query optimization and will never be excluded (even if they are specified in spark.sql.optimizer.excludedRules configuration property).

Table 4. Optimizer’s Non-Excludable Optimization Rules

| Rule Name |
|---|
| PushProjectionThroughUnion |
| EliminateDistinct |
| EliminateSubqueryAliases |
| EliminateView |
| ReplaceExpressions |
| ComputeCurrentTime |
| GetCurrentDatabase |
| RewriteDistinctAggregates |
| ReplaceDeduplicateWithAggregate |
| ReplaceIntersectWithSemiJoin |
| ReplaceExceptWithFilter |
| ReplaceExceptWithAntiJoin |
| RewriteExceptAll |
| RewriteIntersectAll |
| ReplaceDistinctWithAggregate |
| PullupCorrelatedPredicates |
| RewriteCorrelatedScalarSubquery |
| RewritePredicateSubquery |
| PullOutPythonUDFInJoinCondition |

Table 5. Optimizer’s Internal Registries and Counters

| Name | Initial Value | Description |
|---|---|---|
| fixedPoint | FixedPoint with the number of iterations as defined by spark.sql.optimizer.maxIterations | Used in the Replace Operators, Aggregate, Operator Optimizations, Decimal Optimizations, Typed Filter Optimization and LocalRelation batches (and also indirectly in the User Provided Optimizers rule batch in SparkOptimizer) |

Additional Operator Optimization Rules — extendedOperatorOptimizationRules Extension Point

extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]]

extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization batch.

Note
extendedOperatorOptimizationRules rules are appended to the default operator optimization rules and so are executed (at the end of) both the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.
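While extendedOperatorOptimizationRules is meant to be overridden by Optimizer implementations, user-defined rules can also be registered at runtime through ExperimentalMethods (they run in the User Provided Optimizers batch of SparkOptimizer). A minimal sketch of the runtime registration, with a hypothetical no-op rule:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule, used only to demonstrate the registration
object NoopOptimization extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Registered rules are picked up by SparkOptimizer on every query
spark.experimental.extraOptimizations = Seq(NoopOptimization)
```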

batches Final Method

batches: Seq[Batch]
Note
batches is part of the RuleExecutor Contract to define the rule batches to use when executed.

batches starts with the defaultBatches and removes the optimization rules listed in the spark.sql.optimizer.excludedRules configuration property (non-excludable rules are always kept, and a batch is dropped entirely when all of its rules have been excluded).
