Catalyst Optimizer — Generic Logical Query Plan Optimizer

Optimizer (aka Catalyst Optimizer) is the base class of logical query plan optimizers. It defines the rule batches of logical optimizations, i.e. the rules that transform the logical plan of a structured query to produce an optimized logical plan.

Note
SparkOptimizer is the one and only direct implementation of the Optimizer Contract in Spark SQL.

Optimizer is a RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]).

Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan

Optimizer is available as the optimizer property of a session-specific SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer

You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or SQL's EXPLAIN EXTENDED command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good "debugging" tool).

val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
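Since Optimizer is a RuleExecutor[LogicalPlan], you can also run it directly on an analyzed logical plan. The following sketch assumes the `inventory` Dataset and the `spark` session from the example above:

```scala
// Run the session's Optimizer directly on an analyzed logical plan
// (assumes the `inventory` Dataset and `spark` session from above)
val analyzed = inventory.queryExecution.analyzed
val optimized = spark.sessionState.optimizer.execute(analyzed)
println(optimized.numberedTreeString)
```

The result should match what queryExecution.optimizedPlan gives you (modulo plan caching in QueryExecution).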

Optimizer defines the default rule batches that are considered the base rule batches that can be further refined (extended or with some rules excluded).

Table 1. Optimizer’s Default Optimization Rule Batches (in the order of execution)

| Batch Name | Strategy | Rules | Description |
|---|---|---|---|
| Eliminate Distinct | Once | EliminateDistinct | |
| Finish Analysis | Once | EliminateSubqueryAliases | Removes (eliminates) SubqueryAlias unary logical operators from a logical plan |
| | | EliminateView | Removes (eliminates) View unary logical operators from a logical plan and replaces them with their child logical operator |
| | | ReplaceExpressions | Replaces RuntimeReplaceable expressions with their single child expression |
| | | ComputeCurrentTime | |
| | | GetCurrentDatabase | |
| | | RewriteDistinctAggregates | |
| | | ReplaceDeduplicateWithAggregate | |
| Union | Once | CombineUnions | |
| LocalRelation early | FixedPoint | ConvertToLocalRelation | |
| | | PropagateEmptyRelation | |
| Pullup Correlated Expressions | Once | PullupCorrelatedPredicates | |
| Subquery | Once | OptimizeSubqueries | |
| Replace Operators | FixedPoint | RewriteExceptAll | |
| | | RewriteIntersectAll | |
| | | ReplaceIntersectWithSemiJoin | |
| | | ReplaceExceptWithFilter | |
| | | ReplaceExceptWithAntiJoin | |
| | | ReplaceDistinctWithAggregate | |
| Aggregate | FixedPoint | RemoveLiteralFromGroupExpressions | |
| | | RemoveRepetitionFromGroupExpressions | |
| operatorOptimizationBatch | | | The Operator Optimization batches (see Table 3) |
| Join Reorder | Once | CostBasedJoinReorder | Reorders Join logical operators |
| Remove Redundant Sorts | Once | RemoveRedundantSorts | |
| Decimal Optimizations | FixedPoint | DecimalAggregates | |
| Object Expressions Optimization | FixedPoint | EliminateMapObjects | |
| | | CombineTypedFilters | |
| LocalRelation | FixedPoint | ConvertToLocalRelation | |
| | | PropagateEmptyRelation | |
| Extract PythonUDF From JoinCondition | Once | PullOutPythonUDFInJoinCondition | |
| Check Cartesian Products | Once | CheckCartesianProducts | |
| RewriteSubquery | Once | RewritePredicateSubquery | |
| | | ColumnPruning | |
| | | CollapseProject | |
| | | RemoveRedundantProject | |
| UpdateAttributeReferences | Once | UpdateNullabilityInAttributeReferences | |

Tip
Consult the sources of the Optimizer class for the up-to-date list of the default optimization rule batches.

Optimizer defines the operator optimization rules, together with the extendedOperatorOptimizationRules extension point for additional optimizations, in the Operator Optimization batch.

Table 2. Optimizer’s Operator Optimization Rules (in the order of execution)

| Rule Name | Description |
|---|---|
| PushProjectionThroughUnion | |
| ReorderJoin | |
| EliminateOuterJoin | |
| PushPredicateThroughJoin | |
| PushDownPredicate | |
| LimitPushDown | |
| ColumnPruning | |
| CollapseRepartition | |
| CollapseProject | |
| CollapseWindow | Collapses two adjacent Window logical operators |
| CombineFilters | |
| CombineLimits | |
| CombineUnions | |
| NullPropagation | |
| ConstantPropagation | |
| FoldablePropagation | |
| OptimizeIn | |
| ConstantFolding | |
| ReorderAssociativeOperator | |
| LikeSimplification | |
| BooleanSimplification | |
| SimplifyConditionals | |
| RemoveDispensableExpressions | |
| SimplifyBinaryComparison | |
| PruneFilters | |
| EliminateSorts | |
| SimplifyCasts | |
| SimplifyCaseConversionExpressions | |
| RewriteCorrelatedScalarSubquery | |
| EliminateSerialization | |
| RemoveRedundantAliases | |
| RemoveRedundantProject | |
| SimplifyExtractValueOps | |
| CombineConcats | |
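You can watch some of these rules fire by comparing the analyzed and optimized plans of a query with a foldable expression. The following is a sketch that assumes a running SparkSession named `spark` (e.g. in spark-shell); ConstantFolding evaluates the constant subexpression at optimization time:

```scala
import org.apache.spark.sql.functions.{col, lit}

// `lit(1) + lit(2)` is foldable, so ConstantFolding replaces it with 3;
// the optimized plan shows (id + 3) instead of (id + (1 + 2))
val q = spark.range(3).select((col("id") + (lit(1) + lit(2))).as("sum"))
println(q.queryExecution.analyzed.numberedTreeString)
println(q.queryExecution.optimizedPlan.numberedTreeString)
```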

Optimizer defines Operator Optimization Batch that is simply a collection of rule batches with the operator optimization rules before and after InferFiltersFromConstraints logical rule.

Table 3. Optimizer’s Operator Optimization Batch (in the order of execution)

| Batch Name | Strategy | Rules |
|---|---|---|
| Operator Optimization before Inferring Filters | FixedPoint | Operator optimization rules |
| Infer Filters | Once | InferFiltersFromConstraints |
| Operator Optimization after Inferring Filters | FixedPoint | Operator optimization rules |

Optimizer uses spark.sql.optimizer.excludedRules configuration property to control what optimization rules in the defaultBatches should be excluded (default: none).
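For instance, to disable predicate pushdown you could exclude the PushDownPredicate rule by its fully-qualified class name. A sketch (the app name is arbitrary; non-excludable rules are kept even if listed):

```scala
import org.apache.spark.sql.SparkSession

// Exclude PushDownPredicate from the optimizer; rules are referenced by
// their fully-qualified class names (comma-separated for multiple rules)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ExcludedRulesDemo")
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
  .getOrCreate()
```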

Optimizer takes a SessionCatalog when created.

Note
Optimizer is a Scala abstract class and cannot be created directly. It is created indirectly through its concrete implementations (e.g. SparkOptimizer).

Optimizer defines the non-excludable optimization rules that are considered critical for query optimization and will never be excluded (even if they are specified in spark.sql.optimizer.excludedRules configuration property).

Table 4. Optimizer’s Non-Excludable Optimization Rules

| Rule Name |
|---|
| PushProjectionThroughUnion |
| EliminateDistinct |
| EliminateSubqueryAliases |
| EliminateView |
| ReplaceExpressions |
| ComputeCurrentTime |
| GetCurrentDatabase |
| RewriteDistinctAggregates |
| ReplaceDeduplicateWithAggregate |
| ReplaceIntersectWithSemiJoin |
| ReplaceExceptWithFilter |
| ReplaceExceptWithAntiJoin |
| RewriteExceptAll |
| RewriteIntersectAll |
| ReplaceDistinctWithAggregate |
| PullupCorrelatedPredicates |
| RewriteCorrelatedScalarSubquery |
| RewritePredicateSubquery |
| PullOutPythonUDFInJoinCondition |

Table 5. Optimizer’s Internal Registries and Counters

| Name | Initial Value | Description |
|---|---|---|
| fixedPoint | FixedPoint with the number of iterations as defined by spark.sql.optimizer.maxIterations | Used in the Replace Operators, Aggregate, Operator Optimizations, Decimal Optimizations, Typed Filter Optimization and LocalRelation batches (and also indirectly in the User Provided Optimizers rule batch in SparkOptimizer) |

Additional Operator Optimization Rules — extendedOperatorOptimizationRules Extension Point

extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]]

extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization batch.

Note
extendedOperatorOptimizationRules rules are appended to the default operator optimization rules and so are executed (at the end of) both the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.
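While extendedOperatorOptimizationRules is meant to be overridden by Optimizer implementations, user-defined rules can also be registered at runtime through ExperimentalMethods (they run in the User Provided Optimizers batch of SparkOptimizer). A minimal sketch of the runtime registration, with a hypothetical no-op rule:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule, used only to demonstrate the registration
object NoopOptimization extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Registered rules are picked up by SparkOptimizer on every query
spark.experimental.extraOptimizations = Seq(NoopOptimization)
```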

batches Final Method

batches: Seq[Batch]
Note
batches is part of the RuleExecutor Contract to define the rule batches to use when executed.

batches starts with the defaultBatches and removes the optimization rules listed in the spark.sql.optimizer.excludedRules configuration property (non-excludable rules are always kept, and a batch is dropped entirely when all of its rules have been excluded).
