Optimizer — Generic Logical Query Plan Optimizer
Optimizer (aka Catalyst Optimizer) transforms an analyzed logical plan into an optimized logical plan. It is the base of logical query plan optimizers and defines the rule batches of logical optimizations, i.e. the rules that transform the logical plan of a structured query to produce the optimized logical plan.
Note
SparkOptimizer is the one and only direct implementation of the Optimizer contract in Spark SQL.
Optimizer is a RuleExecutor of LogicalPlan (i.e. RuleExecutor[LogicalPlan]).
Optimizer is available as the optimizer property of a session-specific SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer
You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or SQL's EXPLAIN EXTENDED command.
// sample structured query
val inventory = spark
.range(5)
.withColumn("new_column", 'id + 5 as "plus5")
// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
+- Range (0, 5, step=1, splits=Some(8))
== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))
== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))
== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)
Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good "debugging" tool).
val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
Optimizer defines the default rule batches that can be further refined (extended or with some rules excluded).
Only the batches and rules described on this page are shown below; consult the Optimizer sources for the complete list.

Batch Name | Strategy | Rules | Description
---|---|---|---
Finish Analysis | Once | EliminateSubqueryAliases | Removes (eliminates) SubqueryAlias unary logical operators from a logical plan
Finish Analysis | Once | EliminateView | Removes (eliminates) View unary logical operators from a logical plan and replaces them with their child logical operator
Finish Analysis | Once | ReplaceExpressions | Replaces RuntimeReplaceable expressions with their single child expression
Replace Operators | FixedPoint | RewriteIntersectAll |
Replace Operators | FixedPoint | ReplaceIntersectWithSemiJoin |
Replace Operators | FixedPoint | ReplaceDistinctWithAggregate |
Aggregate | FixedPoint | RemoveLiteralFromGroupExpressions |
Aggregate | FixedPoint | RemoveRepetitionFromGroupExpressions |
Join Reorder | Once | CostBasedJoinReorder | Reorders Join logical operators
Object Expressions Optimization | FixedPoint | EliminateMapObjects |
LocalRelation | FixedPoint | ConvertToLocalRelation |
Extract PythonUDF From JoinCondition | Once | PullOutPythonUDFInJoinCondition |
Check Cartesian Products | Once | CheckCartesianProducts |
UpdateAttributeReferences | Once | UpdateNullabilityInAttributeReferences |
Tip
Consult the sources of the Optimizer class for the up-to-date list of the default optimization rule batches.
Optimizer defines the operator optimization rules (with the extendedOperatorOptimizationRules extension point for additional optimizations) of the Operator Optimization batch.
The operator optimization rules are (in the order of execution):

* PushProjectionThroughUnion
* ReorderJoin
* EliminateOuterJoin
* PushPredicateThroughJoin
* PushDownPredicate
* LimitPushDown
* ColumnPruning
* CollapseRepartition
* CollapseProject
* CombineFilters
* CombineLimits
* CombineUnions
* NullPropagation
* ConstantPropagation
* FoldablePropagation
* OptimizeIn
* ConstantFolding
* ReorderAssociativeOperator
* LikeSimplification
* BooleanSimplification
* SimplifyConditionals
* RemoveDispensableExpressions
* SimplifyBinaryComparison
* PruneFilters
* EliminateSorts
* SimplifyCasts
* SimplifyCaseConversionExpressions
* RewriteCorrelatedScalarSubquery
* EliminateSerialization
* RemoveRedundantAliases
* RemoveRedundantProject
* SimplifyExtractValueOps
* CombineConcats
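To see one of these rules in action: ConstantFolding evaluates constant subexpressions at optimization time, which you can observe in the optimized plan. A spark-shell sketch (spark-shell auto-imports spark.implicits._ and org.apache.spark.sql.functions._):

```scala
// ConstantFolding in action: the constant subexpression (1 + 2) is
// evaluated during optimization, so the optimized plan should show
// the literal 3 rather than (1 + 2)
val q = spark.range(1).select(('id + (lit(1) + lit(2))) as "sum")
println(q.queryExecution.optimizedPlan.numberedTreeString)
```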
Optimizer defines the Operator Optimization Batch, which is simply a collection of rule batches with the operator optimization rules before and after the InferFiltersFromConstraints logical rule.
Batch Name | Strategy | Rules
---|---|---
Operator Optimization before Inferring Filters | FixedPoint | The operator optimization rules (incl. the extended ones)
Infer Filters | Once | InferFiltersFromConstraints
Operator Optimization after Inferring Filters | FixedPoint | The operator optimization rules (incl. the extended ones)
Optimizer uses the spark.sql.optimizer.excludedRules configuration property to control which optimization rules in the defaultBatches should be excluded (default: none).
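For example (a spark-shell sketch; ConstantFolding is used only for illustration, and excluding it is not generally recommended), a rule is excluded by its fully-qualified name:

```scala
// spark.sql.optimizer.excludedRules takes a comma-separated list of
// fully-qualified optimization rule names
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")
```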
Optimizer takes a SessionCatalog when created.

Note
Optimizer is a Scala abstract class and cannot be created directly; it is created indirectly when the concrete Optimizers (e.g. SparkOptimizer) are.
Optimizer considers some optimization rules non-excludable. The non-excludable optimization rules are critical for query optimization and will not be excluded even if they are specified in the spark.sql.optimizer.excludedRules configuration property.
Internal Properties:

Name | Initial Value | Description
---|---|---
fixedPoint | FixedPoint with the number of iterations as defined by spark.sql.optimizer.maxIterations | Used in the Replace Operators, Aggregate, Operator Optimizations, Decimal Optimizations, Typed Filter Optimization and LocalRelation batches (and also indirectly in the User Provided Optimizers rule batch in SparkOptimizer)
Additional Operator Optimization Rules — extendedOperatorOptimizationRules Extension Point

extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]]

The extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization batch.
Note
extendedOperatorOptimizationRules rules are executed right after the base operator optimization rules in both the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.
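Since Optimizer is abstract, overriding extendedOperatorOptimizationRules requires a custom Optimizer subclass. At the user level, a similar effect is available through the User Provided Optimizers batch (fed by spark.experimental.extraOptimizations). A minimal spark-shell sketch (the rule name and its no-op logic are made up for illustration):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule, named for illustration only
object NoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule; it runs in the User Provided Optimizers batch
spark.experimental.extraOptimizations = Seq(NoopRule)
```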
batches Final Method

batches: Seq[Batch]

Note
batches is part of the RuleExecutor contract to define the rule batches to use when executed.
batches
…FIXME