AnalyzeColumnCommand Logical Command for ANALYZE TABLE…COMPUTE STATISTICS FOR COLUMNS SQL Command

AnalyzeColumnCommand is a logical command for ANALYZE TABLE with a FOR COLUMNS clause (and no PARTITION specification).

ANALYZE TABLE tableName COMPUTE STATISTICS FOR COLUMNS columnNames
// Make the example reproducible
val tableName = "t1"
import org.apache.spark.sql.catalyst.TableIdentifier
val tableId = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tableId, ignoreIfNotExists = true, purge = true)
val df = Seq((0, 0.0, "zero"), (1, 1.4, "one")).toDF("id", "p1", "p2")
df.write.saveAsTable(tableName)
// AnalyzeColumnCommand represents ANALYZE TABLE...FOR COLUMNS SQL command
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS $allCols"
val plan = spark.sql(analyzeTableSQL).queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
val cmd = plan.asInstanceOf[AnalyzeColumnCommand]
scala> println(cmd)
AnalyzeColumnCommand `t1`, [id, p1, p2]
spark.sql(analyzeTableSQL)
val stats = sessionCatalog.getTableMetadata(tableId).stats.get
scala> println(stats.simpleString)
1421 bytes, 2 rows
scala> stats.colStats.map { case (c, ss) => s"$c: $ss" }.foreach(println)
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0.0),Some(1.4),0,8,8,None)
p2: ColumnStat(2,None,None,0,4,4,None)
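The values in the ColumnStat output above are, in order, the distinct count, min, max, null count, average column length, maximum column length, and histogram (field order per the ColumnStat case class in the Spark version used here; later versions restructure it). Individual fields are plain accessors:

// Access individual ColumnStat fields
val idStats = stats.colStats("id")
scala> println(s"distinctCount=${idStats.distinctCount}, min=${idStats.min}, nullCount=${idStats.nullCount}")
distinctCount=2, min=Some(0), nullCount=0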
// Use DESC EXTENDED for friendlier output
scala> sql(s"DESC EXTENDED $tableName id").show
+--------------+----------+
| info_name|info_value|
+--------------+----------+
| col_name| id|
| data_type| int|
| comment| NULL|
| min| 0|
| max| 1|
| num_nulls| 0|
|distinct_count| 2|
| avg_col_len| 4|
| max_col_len| 4|
| histogram| NULL|
+--------------+----------+
AnalyzeColumnCommand can generate column histograms when the spark.sql.statistics.histogram.enabled configuration property is turned on (it is disabled by default). AnalyzeColumnCommand supports column histograms for the following data types:

- IntegralType
- DecimalType
- DoubleType
- FloatType
- DateType
- TimestampType
Note: Histograms can provide better estimation accuracy. Spark currently supports only equi-height histograms. Collecting histograms comes at an extra cost: computing column statistics usually takes a single table scan, while generating an equi-height histogram requires one more.
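The property can also be turned on at runtime; spark.sql.statistics.histogram.numBins controls the number of bins (254 by default):

// Enable equi-height histograms for subsequent ANALYZE TABLE...FOR COLUMNS runs
spark.conf.set("spark.sql.statistics.histogram.enabled", true)
// Optionally adjust the number of bins
spark.conf.set("spark.sql.statistics.histogram.numBins", 10)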
// ./bin/spark-shell --conf spark.sql.statistics.histogram.enabled=true
// Use the above example to set up the environment
// Make sure that ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS was run with histogram enabled
// There are 254 bins by default
// Use spark.sql.statistics.histogram.numBins to control the bins
val descExtSQL = s"DESC EXTENDED $tableName p1"
scala> spark.sql(descExtSQL).show(truncate = false)
+--------------+-----------------------------------------------------+
|info_name |info_value |
+--------------+-----------------------------------------------------+
|col_name |p1 |
|data_type |double |
|comment |NULL |
|min |0.0 |
|max |1.4 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |8 |
|max_col_len |8 |
|histogram |height: 0.007874015748031496, num_of_bins: 254 |
|bin_0 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_1 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_2 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_3 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_4 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_5 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_6 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_7 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_8 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_9 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
+--------------+-----------------------------------------------------+
only showing top 20 rows
Note: AnalyzeColumnCommand is described by the analyze labeled alternative in the statement expression in SqlBase.g4 and parsed using SparkSqlAstBuilder.
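As a small demo of the parsing step, the session's SQL parser gives the logical command directly (parsePlan is part of the ParserInterface contract):

import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
val parsed = spark.sessionState.sqlParser.parsePlan(
  s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS id")
assert(parsed.isInstanceOf[AnalyzeColumnCommand])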
Note: AnalyzeColumnCommand is not supported on views.
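A quick way to see the restriction (a hedged demo; the exact exception message may differ across Spark versions):

// ANALYZE TABLE...FOR COLUMNS fails at execution time for (catalog) views
spark.sql(s"CREATE OR REPLACE VIEW v1 AS SELECT * FROM $tableName")
scala> spark.sql("ANALYZE TABLE v1 COMPUTE STATISTICS FOR COLUMNS id")
org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.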
Executing Logical Command — run Method
run(sparkSession: SparkSession): Seq[Row]
Note: run is part of the RunnableCommand Contract to execute (run) a logical command.
run calculates the following statistics:

- sizeInBytes
- stats for each column
Caution: FIXME
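A quick check (reusing the names from the example above) that run stores both pieces in the catalog:

// sizeInBytes (with the row count) and the per-column statistics land together
val tableStats = sessionCatalog.getTableMetadata(tableId).stats.get
assert(tableStats.sizeInBytes > 0)
assert(tableStats.colStats.keySet == Set("id", "p1", "p2"))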
Computing Statistics for Specified Columns — computeColumnStats Internal Method
computeColumnStats(
sparkSession: SparkSession,
tableIdent: TableIdentifier,
columnNames: Seq[String]): (Long, Map[String, ColumnStat])
computeColumnStats …FIXME
Note: computeColumnStats is used exclusively when AnalyzeColumnCommand is executed.
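While the details are left as FIXME above, the gist of computeColumnStats is a single-scan aggregation over the table. A minimal sketch of that idea with the public DataFrame API (not Spark's internal code):

// One table scan computing the row count plus the statistics for column "id"
import org.apache.spark.sql.functions._
val row = spark.table(tableName).select(
  count(lit(1)),                                // total row count
  min("id"), max("id"),                         // min and max values
  sum(when(col("id").isNull, 1).otherwise(0)),  // null count
  approx_count_distinct("id")                   // (approximate) distinct count
).first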
computePercentiles Internal Method
computePercentiles(
attributesToAnalyze: Seq[Attribute],
sparkSession: SparkSession,
relation: LogicalPlan): AttributeMap[ArrayData]
computePercentiles …FIXME
Note: computePercentiles is used exclusively when AnalyzeColumnCommand is executed (and computes column statistics).
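Conceptually, computePercentiles finds the (numBins + 1) equi-height bin endpoints per column (internally with the ApproximatePercentile aggregate). A hedged sketch of the same idea using the public approxQuantile API:

// Equi-height bin endpoints for column p1 at probabilities 0, 1/n, 2/n, ..., 1
val numBins = spark.conf.get("spark.sql.statistics.histogram.numBins").toInt
val probabilities = (0 to numBins).map(_.toDouble / numBins).toArray
val endpoints = spark.table(tableName).stat.approxQuantile("p1", probabilities, 0.0)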