val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val plan = q.queryExecution.optimizedPlan
val stats = plan.stats
scala> :type stats
org.apache.spark.sql.catalyst.plans.logical.Statistics
scala> println(stats.simpleString)
sizeInBytes=213.0 B, hints=none
Statistics — Estimates of Plan Statistics and Query Hints
Statistics
holds the statistics estimates and query hints of a logical operator:
Note
|
Cost statistics, plan statistics or query statistics are all synonyms and used interchangeably. |
You can access statistics and query hints of a logical plan using stats property.
Note
|
Use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table. |
Note
|
Use ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL Command to generate column (equi-height) histograms of a table. |
Note
|
Use Dataset.hint or SELECT SQL statement with hints to specify query hints.
|
Statistics
is created when:
-
Leaf logical operators (specifically) and logical operators (in general) are requested for statistics estimates
-
HiveTableRelation and LogicalRelation are requested for statistics estimates (through CatalogStatistics)
Note
|
row count estimate is used in CostBasedJoinReorder logical optimization when cost-based optimization is enabled. |
Note
|
CatalogStatistics is a "subset" of all possible
|
Statistics
comes with simpleString
method that is used for the readable text representation (that is toString
with Statistics prefix).
import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.catalyst.plans.logical.HintInfo
val stats = Statistics(sizeInBytes = 10, rowCount = Some(20), hints = HintInfo(broadcast = true))
scala> println(stats)
Statistics(sizeInBytes=10.0 B, rowCount=20, hints=(broadcast))
scala> println(stats.simpleString)
sizeInBytes=10.0 B, rowCount=20, hints=(broadcast)