Statistics — Estimates of Plan Statistics and Query Hints

Statistics holds the statistics estimates and query hints of a logical operator:

  • Total (output) size (in bytes)

  • Estimated number of rows (aka row count)

  • Column attribute statistics (aka column (equi-height) histograms)

  • Query hints

Cost statistics, plan statistics or query statistics are all synonyms and used interchangeably.

You can access statistics and query hints of a logical plan using stats property.

val q = spark.range(5).hint("broadcast").join(spark.range(1), "id")
val plan = q.queryExecution.optimizedPlan
val stats = plan.stats

scala> :type stats

scala> println(stats.simpleString)
sizeInBytes=213.0 B, hints=none
Use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute total size and row count statistics of a table.
Use Dataset.hint or SELECT SQL statement with hints to specify query hints.

Statistics is created when:

row count estimate is used in CostBasedJoinReorder logical optimization when cost-based optimization is enabled.

CatalogStatistics is a "subset" of all possible Statistics (as there are no concepts of attributes and query hints in metastore).

CatalogStatistics are statistics stored in an external catalog (usually a Hive metastore) and are often referred as Hive statistics while Statistics represents the Spark statistics.

Statistics comes with simpleString method that is used for the readable text representation (that is toString with Statistics prefix).

import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.catalyst.plans.logical.HintInfo
val stats = Statistics(sizeInBytes = 10, rowCount = Some(20), hints = HintInfo(broadcast = true))

scala> println(stats)
Statistics(sizeInBytes=10.0 B, rowCount=20, hints=(broadcast))

scala> println(stats.simpleString)
sizeInBytes=10.0 B, rowCount=20, hints=(broadcast)

results matching ""

    No results matching ""