scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog
// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
// Using low-level internal SessionCatalog interface to access CatalogTables
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
CatalogStatistics — Table Statistics From External Catalog (Metastore)
CatalogStatistics are table statistics that are stored in an external catalog (aka metastore):
-
Column statistics (i.e. column names and their statistics)
|
Note
|
|
CatalogStatistics can be converted to Spark statistics using toPlanStats method.
CatalogStatistics is created when:
-
AnalyzeColumnCommand,
AlterTableAddPartitionCommandandTruncateTableCommandcommands are executed (and store statistics in ExternalCatalog) -
CommandUtilsis requested for updating existing table statistics, the current statistics (if changed) -
HiveExternalCatalogis requested for restoring Spark statistics from properties (from a Hive Metastore) -
DetermineTableStats and PruneFileSourcePartitions logical optimizations are executed (i.e. applied to a logical plan)
-
HiveClientImplis requested for a table or partition statistics from Hive’s parameters
CatalogStatistics has a text representation.
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
Converting Metastore Statistics to Spark Statistics — toPlanStats Method
toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics
toPlanStats converts the table statistics (from an external metastore) to Spark statistics.
With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.
|
Note
|
Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true, and is disabled by default.
|
Otherwise, when cost-based optimization is disabled, toPlanStats creates a Statistics with just the mandatory sizeInBytes.
|
Caution
|
FIXME Why does toPlanStats compute sizeInBytes differently per CBO?
|
|
Note
|
|
|
Note
|
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.
|