CatalogStatistics — Table Statistics From External Catalog (Metastore)

CatalogStatistics are table statistics that are stored in an external catalog (aka metastore):

  • Physical total size (in bytes)

  • Estimated number of rows (aka row count)

  • Column statistics (i.e. column names and their statistics)

Note

CatalogStatistics is a "subset" of the statistics in Statistics (as a metastore has no concept of attributes or a broadcast hint).

CatalogStatistics are often stored in a Hive metastore and are referred to as Hive statistics, while Statistics are the Spark statistics.

CatalogStatistics can be converted to Spark statistics using the toPlanStats method.

CatalogStatistics is created when:

scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

// Using low-level internal SessionCatalog interface to access CatalogTables
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

CatalogStatistics has a text representation (available as simpleString).

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
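
The sessions above assume that statistics for t1 have already been recorded in the catalog. The following is a minimal sketch of how that could be done in spark-shell, assuming an existing table t1; the column name id is hypothetical and only serves as an illustration.

```scala
// Assumes spark-shell (with `spark` in scope) and an existing table t1

// Table-level statistics: total size and row count
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// Column-level statistics; `id` is a hypothetical column name
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")

// The Statistics row of the output should reflect the recorded values
spark.sql("DESCRIBE EXTENDED t1").show(100, false)
```

Once ANALYZE TABLE has run, metadata.stats in the sessions above should be defined rather than None.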

Converting Metastore Statistics to Spark Statistics — toPlanStats Method

toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats creates a Statistics with the estimated total (output) size, row count and column statistics.

Note
Cost-based optimization is enabled when the spark.sql.cbo.enabled configuration property is set to true. It is disabled by default.
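
As a quick illustration, the property can be switched on for the current session in spark-shell. This is only a sketch that assumes `spark` (a SparkSession) is in scope:

```scala
// Assumes spark-shell with `spark` in scope
// Turn cost-based optimization on for the current session
spark.conf.set("spark.sql.cbo.enabled", "true")

// Read the property back to confirm the current value
spark.conf.get("spark.sql.cbo.enabled")
```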

Otherwise, when cost-based optimization is disabled or row count statistics are not available, toPlanStats creates a Statistics with just the mandatory sizeInBytes.
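
The two branches described above can be sketched in plain Scala without Spark on the classpath. The types below (ColumnStat, PlanStats, CatalogStats) are simplified, hypothetical stand-ins and not Spark's classes: columns are identified by name rather than by Attribute, and the row-size model (the sum of average column lengths) is an assumption made for illustration only.

```scala
// Simplified, hypothetical stand-ins for Spark's classes (not the real API)
final case class ColumnStat(avgLen: Long)

final case class PlanStats(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    attributeStats: Map[String, ColumnStat] = Map.empty)

final case class CatalogStats(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    colStats: Map[String, ColumnStat] = Map.empty) {

  def toPlanStats(planOutput: Seq[String], cboEnabled: Boolean): PlanStats =
    rowCount match {
      case Some(rows) if cboEnabled =>
        // Keep statistics only for the columns that are part of the plan output
        val attrStats =
          planOutput.flatMap(name => colStats.get(name).map(name -> _)).toMap
        // Assumed row-size model: sum of the average lengths of those columns
        val rowSize = attrStats.values.map(_.avgLen).sum
        PlanStats(rows * BigInt(rowSize), Some(rows), attrStats)
      case _ =>
        // CBO disabled or no row count: propagate the size only
        PlanStats(sizeInBytes)
    }
}
```

With this model, a CatalogStats of 714 bytes, 2 rows and an 8-byte id column gives an estimated size of 16 bytes together with the row count and the id column statistics when CBO is on, and just the 714 bytes when CBO is off or the row count is missing.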

Caution
FIXME Why does toPlanStats compute sizeInBytes differently depending on whether CBO is enabled?

Note
toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

FIXME Example

Note
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.
