ColumnStat — Column Statistics

ColumnStat holds the statistics of a table column (as part of the table statistics in a metastore).

Name | Description |
---|---|
distinctCount | Number of distinct values |
min | Minimum value |
max | Maximum value |
nullCount | Number of null values |
avgLen | Average length of the values |
maxLen | Maximum length of the values |
histogram | Histogram of values (as Option[Histogram], empty by default) |

ColumnStat is computed (and created from the result row) using the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command (that SparkSqlAstBuilder translates to the AnalyzeColumnCommand logical command).

val cols = "id, p1, p2"
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)

ColumnStat may optionally hold the histogram of values, which is empty by default. With the spark.sql.statistics.histogram.enabled configuration property turned on, the ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.
Note: spark.sql.statistics.histogram.enabled is off by default.
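To have histograms computed, turn the property on before running the ANALYZE command, e.g. (a minimal sketch for a spark-shell session):

// Enable (equi-height) histograms and re-run the analysis
spark.conf.set("spark.sql.statistics.histogram.enabled", true)
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1, p2")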
You can inspect the column statistics using DESCRIBE EXTENDED SQL command.
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+
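The same information can be pulled out programmatically, e.g. (assuming a spark-shell session with implicits in scope):

// Extract a single statistic from the DESC EXTENDED result
val maxColLen = sql("DESC EXTENDED t1 id")
  .where($"info_name" === "max_col_len")
  .select("info_value")
  .as[String]
  .head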
ColumnStat is part of the statistics of a table.
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get
scala> :type stats
org.apache.spark.sql.catalyst.catalog.CatalogStatistics
val colStats = stats.colStats
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
ColumnStat is converted to properties (serialized) while persisting the table (statistics) to a metastore.
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
val colName = "p1"
val p1stats = colStats(colName)
scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat
import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)
scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)
ColumnStat is re-created from properties (deserialized) when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).
scala> :type props
Map[String,String]
scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)
import org.apache.spark.sql.types.StructField
val p1 = $"p1".double
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)
scala> :type colStatsOpt
Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
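As a quick round-trip check (continuing the session above), serializing the recovered ColumnStat should give back the same properties as props:

// Serialize the deserialized ColumnStat back to properties
import org.apache.spark.sql.types.DoubleType
colStatsOpt.foreach(cs => println(cs.toMap(colName = "p1", dataType = DoubleType)))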
ColumnStat is also created when JoinEstimation is requested to estimateInnerOuterJoin for Inner, Cross, LeftOuter, RightOuter and FullOuter joins.
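For example, with cost-based optimization enabled (spark.sql.cbo.enabled is off by default), the optimized plan of such a join carries the estimated statistics. A sketch reusing the t1 table from above (the exact figures depend on the data):

// Enable cost-based optimization so JoinEstimation can use column statistics
spark.conf.set("spark.sql.cbo.enabled", true)
val q = spark.table("t1").join(spark.table("t1"), "id")
scala> println(q.queryExecution.optimizedPlan.stats)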
val tableName = "t1"
// Make the example reproducible
import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)
// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
toDF("id", "p1", "p2").
write.
saveAsTable(tableName)
// Since we drop and re-create the table immediately, we may run into unavailable partition files
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")
// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics
// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)
// Fetch the table metadata (with column statistics) from a metastore
val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)
// The column statistics are part of the table statistics
val colStats = tableMeta.stats.get.colStats
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)
// the output may vary
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)
Note: ColumnStat does not support minimum and maximum metrics for binary (i.e. Array[Byte]) and string types (hence min and max of the string column p2 above are None).
Converting Value to External/Java Representation (per Catalyst Data Type) — toExternalString Internal Method
toExternalString(v: Any, colName: String, dataType: DataType): String
toExternalString…FIXME
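While the details are left as FIXME above, the gist is mapping a Catalyst-internal value to its external (Java) representation before rendering it as String. A standalone sketch of the idea (an assumption, not Spark's actual code; the cases shown are illustrative):

import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.types._

// Sketch only: dates are stored internally as days since the epoch,
// while simple numeric and boolean values already match their external form
def toExternalStringSketch(v: Any, colName: String, dataType: DataType): String = {
  val external = dataType match {
    case DateType => Date.valueOf(LocalDate.ofEpochDay(v.asInstanceOf[Int].toLong))
    case BooleanType | ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType => v
    case _ => sys.error(
      s"Column statistics serialization is not supported for column $colName of data type $dataType")
  }
  external.toString
}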
Note: toExternalString is used exclusively when ColumnStat is requested for statistic properties.
supportsHistogram Method
supportsHistogram(dataType: DataType): Boolean
supportsHistogram…FIXME
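Since equi-height histograms only make sense for types with a numeric ordering, a plausible sketch of such a check looks as follows (an assumption, not Spark's actual code):

import org.apache.spark.sql.types._

// Sketch only: histograms presumably require numeric, date or timestamp types
def supportsHistogramSketch(dataType: DataType): Boolean = dataType match {
  case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => true
  case _: DecimalType => true
  case DateType | TimestampType => true
  case _ => false
}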
Note: supportsHistogram is used when…FIXME
Converting ColumnStat to Properties (ColumnStat Serialization) — toMap Method
toMap(colName: String, dataType: DataType): Map[String, String]
toMap converts a ColumnStat to properties (as a Map[String, String]).
Key | Value |
---|---|
version | 1 |
distinctCount | Number of distinct values |
nullCount | Number of null values |
avgLen | Average length of the values |
maxLen | Maximum length of the values |
min | External (Java) representation of the minimum value |
max | External (Java) representation of the maximum value |
histogram | Serialized version of Histogram (using HistogramSerializer.serialize) |
Note: toMap adds the min, max and histogram entries only if they are available.
Note: Interestingly, the colName input parameter brings no value to toMap itself, but merely allows for more user-friendly error reporting when converting the min and max column statistics (dataType, on the other hand, drives the conversion of min and max to their external representation).
Note: toMap is used exclusively when HiveExternalCatalog is requested for converting table statistics to properties (before persisting them as part of table metadata in a Hive metastore).
Re-Creating Column Statistics from Properties (ColumnStat Deserialization) — fromMap Method
fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]
fromMap creates a ColumnStat by fetching the properties of every column statistic from the input map.

fromMap returns None when recovering the column statistics fails for whatever reason, printing out the following warning to the logs:

WARN Failed to parse column statistics for column [fieldName] in table [table]
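For instance, feeding fromMap a corrupt properties map (the map below is made up; p1 is the StructField from the session above) ends up with None and the above warning in the logs:

// A made-up, unparseable properties map
val corrupt = Map("distinctCount" -> "oops")
val none = ColumnStat.fromMap(table = "t1", field = p1, map = corrupt)
// none: Option[ColumnStat] = None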
Note: Interestingly, the table input parameter brings no value to fromMap itself, but merely allows for more user-friendly error reporting when parsing column statistics fails.
Note: fromMap is used exclusively when HiveExternalCatalog is requested for restoring table statistics from properties (from a Hive Metastore).
Creating Column Statistics from InternalRow (Result of Computing Column Statistics) — rowToColumnStat Method
rowToColumnStat(
row: InternalRow,
attr: Attribute,
rowCount: Long,
percentiles: Option[ArrayData]): ColumnStat
rowToColumnStat creates a ColumnStat from the input row, reading the column statistics from the following positions:

Position | Name |
---|---|
0 | distinctCount |
1 | min |
2 | max |
3 | nullCount |
4 | avgLen |
5 | maxLen |

If the field at position 6 is not empty, rowToColumnStat uses it to create the histogram.
Note: rowToColumnStat is used exclusively when AnalyzeColumnCommand is executed (to compute the statistics for specified columns).