ColumnStat — Column Statistics

ColumnStat holds the statistics of a table column (as part of the table statistics in a metastore).

Table 1. Column Statistics
Name           Description
distinctCount  Number of distinct values
min            Minimum value
max            Maximum value
nullCount      Number of null values
avgLen         Average length of the values
maxLen         Maximum length of the values
histogram      Histogram of values (as Histogram, empty by default)

ColumnStat is computed (and created from the result row) using the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command (which SparkSqlAstBuilder translates into an AnalyzeColumnCommand logical command).

val cols = "id, p1, p2"
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)

ColumnStat may optionally hold the histogram of values, which is empty by default. With the spark.sql.statistics.histogram.enabled configuration property turned on, the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.

Note
spark.sql.statistics.histogram.enabled is off by default.
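For example, you could turn histograms on for the current session and recompute the column statistics (a minimal sketch; t1 and its columns come from the examples on this page):

// Enable equi-height histograms (off by default)
spark.sql("SET spark.sql.statistics.histogram.enabled=true")
// Equivalent: spark.conf.set("spark.sql.statistics.histogram.enabled", true)

// Recompute the column statistics so the histograms get generated
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1, p2")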

You can inspect the column statistics using the DESCRIBE EXTENDED SQL command.

scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+

ColumnStat is part of the statistics of a table.

// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get

scala> :type stats
org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val colStats = stats.colStats
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

ColumnStat is converted to properties (serialized) while persisting the table (statistics) to a metastore.

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

val colName = "p1"

val p1stats = colStats(colName)
scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat

import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)
scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)

ColumnStat is re-created from properties (deserialized) when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).

scala> :type props
Map[String,String]

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)

import org.apache.spark.sql.types.StructField
val p1 = $"p1".double

import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)

scala> :type colStatsOpt
Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
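Since ColumnStat is a case class, you can verify the round trip with a plain equality check (assuming no histogram was computed):

// The deserialized column statistics should equal the original
assert(colStatsOpt.contains(p1stats))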

ColumnStat is also created when JoinEstimation is requested to estimateInnerOuterJoin for Inner, Cross, LeftOuter, RightOuter and FullOuter joins.

val tableName = "t1"

// Make the example reproducible
import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
  toDF("id", "p1", "p2").
  write.
  saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition files
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics
// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Fetch the table metadata (with column statistics) from a metastore
val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)

// The column statistics are part of the table statistics
val colStats = tableMeta.stats.get.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)
// the output may vary
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)
Note
ColumnStat does not support minimum and maximum metrics for binary (i.e. Array[Byte]) and string types (hence the None values for the string column p2 above).

Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method

toExternalString(v: Any, colName: String, dataType: DataType): String

toExternalString converts the value v of the column colName to its external (Java) representation per the Catalyst dataType (e.g. a java.sql.Date for DateType and a java.sql.Timestamp for TimestampType) and returns its string representation.
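The gist of the conversion could be sketched as follows (an illustrative sketch, not the actual private method):

import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types._

// An illustrative sketch of the per-type conversion
def toExternalStringSketch(v: Any, dataType: DataType): String = {
  val externalValue = dataType match {
    case DateType      => DateTimeUtils.toJavaDate(v.asInstanceOf[Int])       // days since epoch
    case TimestampType => DateTimeUtils.toJavaTimestamp(v.asInstanceOf[Long]) // microseconds since epoch
    case _             => v // boolean and numeric values are already in their external form
  }
  externalValue.toString
}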

Note
toExternalString is used exclusively when ColumnStat is requested for statistic properties.

supportsHistogram Method

supportsHistogram(dataType: DataType): Boolean

supportsHistogram is positive (true) for integral, decimal, float, double, date and timestamp data types, and negative (false) for all the other data types (e.g. string and binary types).
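Based on that, the check boils down to a pattern match on the data type (a sketch, with the integral types spelled out concretely):

import org.apache.spark.sql.types._

// A sketch of the data types that support (equi-height) histograms
def supportsHistogramSketch(dataType: DataType): Boolean = dataType match {
  case ByteType | ShortType | IntegerType | LongType => true // integral types
  case _: DecimalType                                => true
  case FloatType | DoubleType                        => true
  case DateType | TimestampType                      => true
  case _                                             => false // e.g. StringType, BinaryType
}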

Note
supportsHistogram is used when AnalyzeColumnCommand is executed (with spark.sql.statistics.histogram.enabled turned on) and when ColumnStat is requested for statExprs.

Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method

toMap(colName: String, dataType: DataType): Map[String, String]

toMap converts a ColumnStat to the following properties.

Table 2. ColumnStat.toMap’s Properties
Key            Value
version        1
distinctCount  distinctCount
nullCount      nullCount
avgLen         avgLen
maxLen         maxLen
min            External/Java representation of min
max            External/Java representation of max
histogram      Serialized version of Histogram (using HistogramSerializer.serialize)

Note
toMap adds min, max, histogram entries only if they are available.
Note
Interestingly, colName and dataType input parameters bring no value to toMap itself, but merely allow for a more user-friendly error reporting when converting min and max column statistics.
Note
toMap is used exclusively when HiveExternalCatalog is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).

Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method

fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]

fromMap creates a ColumnStat by fetching properties of every column statistic from the input map.

fromMap returns None when recovering the column statistics fails for whatever reason, printing out the following WARN message to the logs:

WARN Failed to parse column statistics for column [fieldName] in table [table]
Note
Interestingly, table input parameter brings no value to fromMap itself, but merely allows for a more user-friendly error reporting when parsing column statistics fails.
Note
fromMap is used exclusively when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).
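To see the None behaviour, feed fromMap a property map it cannot parse for the field's data type (badProps is a hypothetical name; props and p1 come from the earlier example):

// "min" cannot be parsed as a Double, so fromMap gives up (with a WARN message)
val badProps = props.updated("min", "not-a-number")
val failed = ColumnStat.fromMap(table = "t1", field = p1, map = badProps)
assert(failed.isEmpty)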

Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method

rowToColumnStat(
  row: InternalRow,
  attr: Attribute,
  rowCount: Long,
  percentiles: Option[ArrayData]): ColumnStat

rowToColumnStat creates a ColumnStat from the input row, with the values at the following positions:

Position  Statistic
0         distinctCount
1         min
2         max
3         nullCount
4         avgLen
5         maxLen

If the field at position 6 is not empty, rowToColumnStat uses it to create the histogram.
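A sketch of the basic (non-histogram) part, assuming a result row laid out as above:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.DataType

// A sketch of extracting the first six positions into a ColumnStat
def basicColumnStat(row: InternalRow, dataType: DataType): ColumnStat =
  ColumnStat(
    distinctCount = BigInt(row.getLong(0)),
    min = Option(row.get(1, dataType)), // null (hence None) for string/binary columns
    max = Option(row.get(2, dataType)),
    nullCount = BigInt(row.getLong(3)),
    avgLen = row.getLong(4),
    maxLen = row.getLong(5))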

Note
rowToColumnStat is used exclusively when AnalyzeColumnCommand is executed (to compute the statistics for specified columns).

statExprs Method

statExprs(
  col: Attribute,
  conf: SQLConf,
  colPercentiles: AttributeMap[ArrayData]): CreateNamedStruct

statExprs creates a CreateNamedStruct with the aggregate expressions to compute the statistics of the given column, i.e. the (approximate) number of distinct values, the min and max values, the number of nulls, the average and maximum length of the values and, when the column has percentiles in the colPercentiles map, the equi-height histogram.

Note
statExprs is used exclusively when AnalyzeColumnCommand is executed (to compute the statistics for the specified columns).
