scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog
// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
// Using low-level internal SessionCatalog interface to access CatalogTables
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
CatalogStatistics — Table Statistics From External Catalog (Metastore)
CatalogStatistics
are table statistics that are stored in an external catalog (aka metastore):
-
Column statistics (i.e. column names and their statistics)
Note
|
|
CatalogStatistics
can be converted to Spark statistics using toPlanStats method.
CatalogStatistics
is created when:
-
AnalyzeColumnCommand,
AlterTableAddPartitionCommand
andTruncateTableCommand
commands are executed (and store statistics in ExternalCatalog) -
CommandUtils
is requested for updating existing table statistics, the current statistics (if changed) -
HiveExternalCatalog
is requested for restoring Spark statistics from properties (from a Hive Metastore) -
DetermineTableStats and PruneFileSourcePartitions logical optimizations are executed (i.e. applied to a logical plan)
-
HiveClientImpl
is requested for a table or partition statistics from Hive’s parameters
CatalogStatistics
has a text representation.
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
Converting Metastore Statistics to Spark Statistics — toPlanStats
Method
toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics
toPlanStats
converts the table statistics (from an external metastore) to Spark statistics.
With cost-based optimization enabled and row count statistics available, toPlanStats
creates a Statistics with the estimated total (output) size, row count and column statistics.
Note
|
Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true , and is disabled by default.
|
Otherwise, when cost-based optimization is disabled, toPlanStats
creates a Statistics with just the mandatory sizeInBytes.
Caution
|
FIXME Why does toPlanStats compute sizeInBytes differently per CBO?
|
Note
|
|
Note
|
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.
|