Statistics — Estimates of Plan Statistics and Query Hints

Statistics holds the statistics estimates and query hints of a logical operator:

Total (output) size (in bytes)
Estimated number of rows (aka row count)
Column attribute statistics (aka column (equi-height) histograms)
Query hints

Note	Cost statistics, plan statistics or query statistics are all synonyms and used interchangeably.

You can access statistics and query hints of a logical plan using stats property.

val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val plan = q.queryExecution.optimizedPlan
val stats = plan.stats

scala> :type stats
org.apache.spark.sql.catalyst.plans.logical.Statistics

scala> println(stats.simpleString)
sizeInBytes=213.0 B, hints=none

Note	Use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table.

Note	Use ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL Command to generate column (equi-height) histograms of a table.

Note	Use `Dataset.hint` or `SELECT` SQL statement with hints to specify query hints.

Statistics is created when:

Leaf logical operators (specifically) and logical operators (in general) are requested for statistics estimates
HiveTableRelation and LogicalRelation are requested for statistics estimates (through CatalogStatistics)

Note	row count estimate is used in CostBasedJoinReorder logical optimization when cost-based optimization is enabled.

Note

CatalogStatistics is a "subset" of all possible Statistics (as there are no concepts of attributes and query hints in metastore).

CatalogStatistics are statistics stored in an external catalog (usually a Hive metastore) and are often referred as Hive statistics while Statistics represents the Spark statistics.

Statistics comes with simpleString method that is used for the readable text representation (that is toString with Statistics prefix).

import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.catalyst.plans.logical.HintInfo
val stats = Statistics(sizeInBytes = 10, rowCount = Some(20), hints = HintInfo(broadcast = true))

scala> println(stats)
Statistics(sizeInBytes=10.0 B, rowCount=20, hints=(broadcast))

scala> println(stats.simpleString)
sizeInBytes=10.0 B, rowCount=20, hints=(broadcast)

Statistics

Statistics — Estimates of Plan Statistics and Query Hints

results matching ""

No results matching ""