CommandUtils — Utilities for Table Statistics

CommandUtils is a helper class that logical commands (e.g. InsertInto*, AlterTable*Command, LoadDataCommand, and the Analyze* commands of cost-based optimization) use to manage table statistics.

CommandUtils defines the following utilities: updateTableStats, calculateTotalSize, calculateLocationSize, and compareAndGetNewStats.

Tip

Enable INFO logging level for org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=INFO

Refer to Logging.

Updating Existing Table Statistics — updateTableStats Method

updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).

updateTableStats requests SessionCatalog to alterTableStats with the current total size (when spark.sql.statistics.size.autoUpdate.enabled property is turned on) or empty statistics (that effectively removes the recorded statistics completely).

Important

When spark.sql.statistics.size.autoUpdate.enabled property is turned on, updateTableStats recalculates the total size of the table on every data change. That can be expensive (and slow down data change commands) if the total number of files of a table is very large.

Note

updateTableStats uses SparkSession to access the current SessionState that it then uses to access the session-scoped SessionCatalog.

Note

updateTableStats is used when InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand, AlterTableSetLocationCommand and LoadDataCommand commands are executed.
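
The decision that updateTableStats makes can be sketched in plain Scala. This is a simplified model and not Spark's code: Stats and Table are hypothetical stand-ins for CatalogStatistics and CatalogTable, autoUpdateEnabled stands in for the spark.sql.statistics.size.autoUpdate.enabled property, and alterTableStats stands in for the SessionCatalog call.

```scala
object UpdateTableStatsSketch {
  // Hypothetical stand-ins for Spark's CatalogStatistics and CatalogTable.
  final case class Stats(sizeInBytes: BigInt)
  final case class Table(name: String, stats: Option[Stats])

  def updateTableStats(
      table: Table,
      autoUpdateEnabled: Boolean,
      currentTotalSize: => BigInt, // by-name: evaluated only when actually needed
      alterTableStats: (String, Option[Stats]) => Unit): Unit = {
    // Nothing happens unless statistics have already been recorded for the table.
    if (table.stats.nonEmpty) {
      val newStats =
        if (autoUpdateEnabled) Some(Stats(currentTotalSize)) // record the fresh total size
        else None                                            // drop the (now stale) statistics
      alterTableStats(table.name, newStats)
    }
  }
}
```

Passing the size as a by-name parameter in this sketch reflects the cost concern above: the (potentially slow) size calculation runs only when auto-update is enabled and the table already has statistics.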

Calculating Total Size of Table (with Partitions) — calculateTotalSize Method

calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): BigInt

calculateTotalSize calculates total file size for the entire input CatalogTable (when it has no partitions defined) or all its partitions (through the session-scoped SessionCatalog).

Note
calculateTotalSize uses the input SessionState to access the SessionCatalog.
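
The branching described above (whole table versus sum over partitions) could look as follows. This is only a sketch under stated assumptions: Table is a hypothetical stand-in for CatalogTable, listPartitionLocations stands in for asking the SessionCatalog for the partitions of the table, and locationSize stands in for calculateLocationSize.

```scala
object TotalSizeSketch {
  // Hypothetical stand-in for CatalogTable (only what the sketch needs).
  final case class Table(location: Option[String], partitionColumnNames: Seq[String])

  def calculateTotalSize(
      table: Table,
      listPartitionLocations: Table => Seq[Option[String]], // stand-in for the catalog lookup
      locationSize: Option[String] => Long): BigInt = {     // stand-in for calculateLocationSize
    if (table.partitionColumnNames.isEmpty) {
      // No partitions defined: measure the table location itself.
      BigInt(locationSize(table.location))
    } else {
      // Partitioned table: sum the sizes of all partition locations.
      listPartitionLocations(table).map(loc => BigInt(locationSize(loc))).sum
    }
  }
}
```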
Note

calculateTotalSize is used when:

Calculating Total File Size Under Path — calculateLocationSize Method

calculateLocationSize(
  sessionState: SessionState,
  identifier: TableIdentifier,
  locationUri: Option[URI]): Long

calculateLocationSize reads hive.exec.stagingdir configuration property for the staging directory (with .hive-staging being the default).

You should see the following INFO message in the logs:

INFO CommandUtils: Starting to calculate the total file size under path [locationUri].

calculateLocationSize calculates the sum of the length of all the files under the input locationUri.

Note
calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and FileStatus.getLen to access a file and the length of the file (in bytes), respectively.

In the end, you should see the following INFO message in the logs:

INFO CommandUtils: It took [durationInMs] ms to calculate the total file size under path [locationUri].
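
To illustrate the calculation, here is a minimal sketch that sums file lengths recursively. It deliberately uses java.io.File on the local file system instead of Hadoop's FileSystem API, and the name totalFileSize is hypothetical. It also assumes that the staging directory name is read so that entries whose names start with it can be skipped; the text above does not state that explicitly.

```scala
import java.io.File

object LocationSizeSketch {
  // Sums the length (in bytes) of all files under `root`.
  // Assumption: entries whose name starts with `stagingDir` are ignored.
  def totalFileSize(root: File, stagingDir: String = ".hive-staging"): Long = {
    def sizeOf(f: File): Long =
      if (f.isDirectory) {
        // listFiles returns null on I/O errors, hence the Option wrapper.
        val children = Option(f.listFiles).map(_.toSeq).getOrElse(Seq.empty)
        children
          .filterNot(_.getName.startsWith(stagingDir))
          .map(sizeOf)
          .sum
      } else {
        f.length
      }
    sizeOf(root)
  }
}
```

For example, LocationSizeSketch.totalFileSize(new File("/tmp/warehouse/t1")) would return the combined size of the data files under that (hypothetical) directory, leaving out anything inside a directory such as .hive-staging_xyz.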
Note

calculateLocationSize is used when:

Creating CatalogStatistics with Current Statistics — compareAndGetNewStats Method

compareAndGetNewStats(
  oldStats: Option[CatalogStatistics],
  newTotalSize: BigInt,
  newRowCount: Option[BigInt]): Option[CatalogStatistics]

compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and newRowCount only when they are different from the oldStats.

Note
compareAndGetNewStats is used when AnalyzePartitionCommand and AnalyzeTableCommand are executed.
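
The "only when they are different" rule can be modelled in a few lines. The following is a simplified sketch, not Spark's code: Stats is a hypothetical stand-in for CatalogStatistics, and two details are assumptions of this sketch, namely that -1 marks a missing old value and that the old size is kept when only the row count changes.

```scala
object CompareStatsSketch {
  // Hypothetical stand-in for CatalogStatistics (only the two fields used here).
  final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  def compareAndGetNewStats(
      oldStats: Option[Stats],
      newTotalSize: BigInt,
      newRowCount: Option[BigInt]): Option[Stats] = {
    // Assumption: -1 stands for "no previous value", so any valid new value is a change.
    val oldTotalSize = oldStats.map(_.sizeInBytes).getOrElse(BigInt(-1))
    val oldRowCount = oldStats.flatMap(_.rowCount).getOrElse(BigInt(-1))

    val sizeChanged = newTotalSize >= 0 && newTotalSize != oldTotalSize
    val changedRowCount = newRowCount.filter(c => c >= 0 && c != oldRowCount)

    (sizeChanged, changedRowCount) match {
      case (true, Some(_))  => Some(Stats(newTotalSize, changedRowCount))
      case (true, None)     => Some(Stats(newTotalSize))
      case (false, Some(_)) => Some(Stats(oldTotalSize, changedRowCount))
      case (false, None)    => None // nothing changed: no new statistics to record
    }
  }
}
```

Returning None when nothing changed lets a caller skip the metastore update altogether.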
