CommandUtils — Utilities for Table Statistics

CommandUtils is a helper class that logical commands, e.g. InsertInto*, AlterTable*Command, LoadDataCommand, and CBO’s Analyze*, use to manage table statistics.

CommandUtils defines the following utilities: updateTableStats, calculateTotalSize, calculateLocationSize and compareAndGetNewStats.


Enable INFO logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/

Refer to Logging.

Updating Existing Table Statistics — updateTableStats Method

updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).

updateTableStats requests SessionCatalog to alterTableStats with the current total size (when spark.sql.statistics.size.autoUpdate.enabled property is turned on) or empty statistics (that effectively removes the recorded statistics completely).

When spark.sql.statistics.size.autoUpdate.enabled property is turned on, updateTableStats recalculates the total size of the table as part of the data change command. That can be expensive (and slow down data change commands) if the total number of files of the table is very large.

updateTableStats uses SparkSession to access the current SessionState that it then uses to access the session-scoped SessionCatalog.

updateTableStats is used when InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand, AlterTableSetLocationCommand and LoadDataCommand commands are executed.
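
The decision that updateTableStats makes can be modelled without Spark on the classpath. The following is a minimal sketch, not Spark's code: StatsAction, LeaveUntouched, ClearStats, SetSize and decide are hypothetical names introduced only for illustration. The two boolean flags stand in for "the table already has statistics in the metastore" and for spark.sql.statistics.size.autoUpdate.enabled.

```scala
object UpdateTableStatsSketch {
  // Hypothetical model of what would be sent to SessionCatalog.alterTableStats.
  sealed trait StatsAction
  case object LeaveUntouched extends StatsAction              // no alterTableStats call
  case object ClearStats extends StatsAction                  // empty statistics
  final case class SetSize(sizeInBytes: BigInt) extends StatsAction

  // currentTotalSize is by-name, so the (possibly expensive) size
  // calculation only runs when auto-update is actually enabled.
  def decide(
      hasStats: Boolean,
      autoUpdateEnabled: Boolean,
      currentTotalSize: => BigInt): StatsAction =
    if (!hasStats) LeaveUntouched
    else if (autoUpdateEnabled) SetSize(currentTotalSize)
    else ClearStats
}
```

The by-name parameter is a design choice of the sketch: it makes visible that the file-system scan is only paid for when the property is turned on.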

Calculating Total Size of Table (with Partitions) — calculateTotalSize Method

calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): BigInt

calculateTotalSize calculates the total file size of the input CatalogTable: the size of the table location (when the table has no partitions defined) or the sum of the sizes of all its partitions (that it finds through the session-scoped SessionCatalog).

calculateTotalSize uses the input SessionState to access the SessionCatalog.
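
The two cases (no partitions vs. partitions) can be sketched in plain Scala. This is an illustrative model under stated assumptions, not the Spark implementation: TableSketch, totalSize, listPartitionLocations and locationSize are hypothetical names. The two function parameters stand in for listing partitions through SessionCatalog and for calculateLocationSize, respectively.

```scala
object TotalSizeSketch {
  // Hypothetical stand-in for the parts of CatalogTable that matter here.
  final case class TableSketch(
      partitionColumnNames: Seq[String],
      location: Option[String])

  def totalSize(
      table: TableSketch,
      listPartitionLocations: () => Seq[Option[String]],
      locationSize: Option[String] => Long): BigInt =
    if (table.partitionColumnNames.isEmpty) {
      // Non-partitioned table: measure the table location itself.
      BigInt(locationSize(table.location))
    } else {
      // Partitioned table: add up the size of every partition location.
      listPartitionLocations().map(loc => BigInt(locationSize(loc))).sum
    }
}
```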

calculateTotalSize is used when:

Calculating Total File Size Under Path — calculateLocationSize Method

calculateLocationSize(
  sessionState: SessionState,
  identifier: TableIdentifier,
  locationUri: Option[URI]): Long

calculateLocationSize reads hive.exec.stagingdir configuration property for the staging directory (with .hive-staging being the default).

You should see the following INFO message in the logs:

INFO CommandUtils: Starting to calculate the total file size under path [locationUri].

calculateLocationSize calculates the sum of the length of all the files under the input locationUri.

calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and FileStatus.getLen to access a file and the length of the file (in bytes), respectively.
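
The recursive summing can be sketched as follows. To keep the example runnable with the JDK alone, it uses java.io.File in place of Hadoop's FileSystem and FileStatus, so it is a stand-in rather than the code Spark runs. LocationSizeSketch and pathSize are hypothetical names. Skipping entries whose name starts with the staging directory name is an assumption about why the hive.exec.stagingdir property is read.

```scala
import java.io.File

object LocationSizeSketch {
  // Sums the length (in bytes) of all files under `path`, recursively.
  // Entries whose name starts with `stagingDir` are skipped (assumption).
  def pathSize(path: File, stagingDir: String = ".hive-staging"): Long =
    if (path.isDirectory) {
      // listFiles() returns null on I/O errors, hence the Option wrapper.
      Option(path.listFiles()).getOrElse(Array.empty[File])
        .filterNot(_.getName.startsWith(stagingDir))
        .map(pathSize(_, stagingDir))
        .sum
    } else {
      path.length()
    }
}
```

With Hadoop available, the same shape would be expressed with FileSystem.getFileStatus, FileSystem.listStatus and FileStatus.getLen instead of the java.io.File calls.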

In the end, you should see the following INFO message in the logs:

INFO CommandUtils: It took [durationInMs] ms to calculate the total file size under path [locationUri].

calculateLocationSize is used when:

Creating CatalogStatistics with Current Statistics — compareAndGetNewStats Method

compareAndGetNewStats(
  oldStats: Option[CatalogStatistics],
  newTotalSize: BigInt,
  newRowCount: Option[BigInt]): Option[CatalogStatistics]

compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and newRowCount only when they are different from the oldStats.

compareAndGetNewStats is used when AnalyzePartitionCommand and AnalyzeTableCommand are executed.
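
The "only when they are different" rule can be illustrated with a small, Spark-free model. This is a simplified sketch: Stats and compareAndGetNewStatsSketch are hypothetical names, Stats only mimics the two fields of CatalogStatistics discussed here, and treating negative values as "not available" is an assumption of the sketch.

```scala
object CompareStatsSketch {
  // Hypothetical stand-in for CatalogStatistics (size and row count only).
  final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // Returns Some(new stats) only if the size or the row count changed,
  // and None when there is nothing new to record.
  def compareAndGetNewStatsSketch(
      oldStats: Option[Stats],
      newTotalSize: BigInt,
      newRowCount: Option[BigInt]): Option[Stats] = {
    val sizeChanged =
      newTotalSize >= 0 && !oldStats.exists(_.sizeInBytes == newTotalSize)
    val rowCountChanged =
      newRowCount.exists(rc => rc >= 0 && !oldStats.flatMap(_.rowCount).contains(rc))
    if (sizeChanged || rowCountChanged)
      Some(Stats(newTotalSize, if (rowCountChanged) newRowCount else None))
    else
      None
  }
}
```

Returning an Option lets the caller skip the metastore update entirely when the result is None.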
