# ColumnStat — Column Statistics
ColumnStat holds the statistics of a table column (as part of the table statistics in a metastore).
| Name | Description |
|---|---|
| distinctCount | Number of distinct values |
| min | Minimum value |
| max | Maximum value |
| nullCount | Number of null values |
| avgLen | Average length of the values |
| maxLen | Maximum length of the values |
| histogram | Histogram of values (as Option[Histogram] which is empty by default) |
ColumnStat is computed (and created from the result row) using the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command (which SparkSqlAstBuilder translates to an AnalyzeColumnCommand logical command).

```scala
val cols = "id, p1, p2"
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)
```
ColumnStat may optionally hold a histogram of values, which is empty by default. With the spark.sql.statistics.histogram.enabled configuration property turned on, the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.
> **Note**: spark.sql.statistics.histogram.enabled is off by default.
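To have ANALYZE TABLE generate the histograms, enable the property first and recompute the column statistics. A minimal sketch (reusing the t1 table and the columns from the running example):

```scala
// Turn equi-height histograms on (they are off by default)
spark.conf.set("spark.sql.statistics.histogram.enabled", true)

// Recompute the column statistics so the histograms get computed and persisted
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1, p2")
```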
You can inspect the column statistics using the DESCRIBE EXTENDED SQL command.
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+
ColumnStat is part of the statistics of a table.
```scala
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get

scala> :type stats
org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val colStats = stats.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
```
ColumnStat is converted to properties (serialized) while persisting the table (statistics) to a metastore.
```scala
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

val colName = "p1"
val p1stats = colStats(colName)

scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat

import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)
```
ColumnStat is re-created from properties (deserialized) when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).
```scala
scala> :type props
Map[String,String]

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)

import org.apache.spark.sql.types.StructField
val p1 = $"p1".double

import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)

scala> :type colStatsOpt
Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
```
ColumnStat is also created when JoinEstimation is requested to estimateInnerOuterJoin for Inner, Cross, LeftOuter, RightOuter and FullOuter joins.
```scala
val tableName = "t1"

// Make the example reproducible
import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
  toDF("id", "p1", "p2").
  write.
  saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition files
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics
// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Fetch the table metadata (with column statistics) from a metastore
val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)

// The column statistics are part of the table statistics
val colStats = tableMeta.stats.get.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)
// the output may vary
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)
```
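These column statistics only feed JoinEstimation when cost-based optimization is enabled. A hedged sketch (assuming the t1 table from the example above and the standard spark.sql.cbo.enabled flag):

```scala
// JoinEstimation is only consulted with cost-based optimization on (off by default)
spark.conf.set("spark.sql.cbo.enabled", true)

// A self-join whose optimized plan carries estimated statistics
val q = spark.table("t1").as("a").join(spark.table("t1").as("b"), "id")

// Inspect the estimated statistics of the optimized (join) plan
println(q.queryExecution.optimizedPlan.stats)
```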
> **Note**: ColumnStat does not support minimum and maximum metrics for binary (i.e. Array[Byte]) and string types. That is why p2 (a string column) has None for both in the output above.
## Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method
```scala
toExternalString(v: Any, colName: String, dataType: DataType): String
```
toExternalString…FIXME
> **Note**: toExternalString is used exclusively when ColumnStat is requested for statistic properties.
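Until the description above is filled in, here is a sketch of the conversion based on the Spark 2.3 sources (not the authoritative implementation): the Catalyst-internal value is first mapped to its external Java counterpart and then rendered as a String.

```scala
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types._

// A sketch based on the Spark 2.3 sources
def toExternalString(v: Any, colName: String, dataType: DataType): String = {
  val externalValue = dataType match {
    case DateType => DateTimeUtils.toJavaDate(v.asInstanceOf[Int])            // days since epoch -> java.sql.Date
    case TimestampType => DateTimeUtils.toJavaTimestamp(v.asInstanceOf[Long]) // microseconds -> java.sql.Timestamp
    case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => v
    case _: DecimalType => v.asInstanceOf[Decimal].toJavaBigDecimal
    // min/max are not kept for other types, e.g. string and binary
    // (the real implementation throws an AnalysisException here)
    case _ => throw new IllegalArgumentException(
      s"Column statistics serialization is not supported for column $colName of data type: $dataType.")
  }
  externalValue.toString
}
```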
## supportsHistogram Method
```scala
supportsHistogram(dataType: DataType): Boolean
```
supportsHistogram…FIXME
> **Note**: supportsHistogram is used when…FIXME
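While the description is still a FIXME, the gist (based on the Spark 2.3 sources; a sketch, not the authoritative implementation) is that equi-height histograms are only supported for fixed-width, ordered column types:

```scala
import org.apache.spark.sql.types._

// A sketch based on the Spark 2.3 sources: histograms are supported for
// integral, floating-point, decimal, date and timestamp columns only.
def supportsHistogram(dataType: DataType): Boolean = dataType match {
  case ByteType | ShortType | IntegerType | LongType => true // integral types
  case FloatType | DoubleType => true
  case _: DecimalType => true
  case DateType | TimestampType => true
  case _ => false // e.g. string and binary columns get no histogram
}
```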
## Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method
```scala
toMap(colName: String, dataType: DataType): Map[String, String]
```
toMap converts a ColumnStat to a properties map (Map[String, String]) with the following entries:
| Key | Value |
|---|---|
| version | 1 |
| distinctCount | distinctCount |
| nullCount | nullCount |
| avgLen | avgLen |
| maxLen | maxLen |
| min | External representation of min (per toExternalString) |
| max | External representation of max (per toExternalString) |
| histogram | Serialized version of Histogram (using HistogramSerializer.serialize) |
> **Note**: toMap adds the min, max and histogram entries only if they are available.

> **Note**: Interestingly, the colName and dataType input parameters bring no value to toMap itself, but merely allow for more user-friendly error reporting when converting the min and max column statistics.

> **Note**: toMap is used exclusively when HiveExternalCatalog is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).
## Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method
```scala
fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]
```
fromMap creates a ColumnStat by fetching properties of every column statistic from the input map.
fromMap returns None when recovering the column statistics fails, for whatever reason, logging the following WARN message:

```text
Failed to parse column statistics for column [fieldName] in table [table]
```
> **Note**: Interestingly, the table input parameter brings no value to fromMap itself, but merely allows for more user-friendly error reporting when parsing column statistics fails.

> **Note**: fromMap is used exclusively when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).
## Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method
```scala
rowToColumnStat(
  row: InternalRow,
  attr: Attribute,
  rowCount: Long,
  percentiles: Option[ArrayData]): ColumnStat
```
rowToColumnStat creates a ColumnStat from the input row, reading the individual statistics from the following positions:

| Position | Statistic |
|---|---|
| 0 | distinctCount |
| 1 | min |
| 2 | max |
| 3 | nullCount |
| 4 | avgLen |
| 5 | maxLen |

If the field at position 6 (the number of distinct values per histogram bin) is not empty, rowToColumnStat uses it, together with the input percentiles, to create the equi-height histogram. A sketch of the basic mapping follows.
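The sketch below (based on the Spark 2.3 sources; not the authoritative implementation) shows how the first six fields map onto a ColumnStat:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat

// A sketch based on the Spark 2.3 sources: the basic (histogram-less) statistics
def basicColumnStat(row: InternalRow, attr: Attribute): ColumnStat =
  ColumnStat(
    distinctCount = BigInt(row.getLong(0)),
    min = Option(row.get(1, attr.dataType)), // null (hence None) for string/binary columns
    max = Option(row.get(2, attr.dataType)),
    nullCount = BigInt(row.getLong(3)),
    avgLen = row.getLong(4),
    maxLen = row.getLong(5))
```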
> **Note**: rowToColumnStat is used exclusively when AnalyzeColumnCommand is executed (to compute the statistics for specified columns).