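CommandUtils — Utilities for Table Statistics
CommandUtils is a helper class that logical commands, e.g. InsertInto*, AlterTable*Command, LoadDataCommand, and CBO’s Analyze*, use to manage table statistics.
CommandUtils defines the following utilities: updateTableStats, calculateTotalSize, calculateLocationSize, and compareAndGetNewStats.
|
Tip
|
Enable INFO logging level for the org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.
Add the following line to your log4j configuration (conf/log4j.properties in a default Spark installation):
log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=INFO
Refer to Logging. |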
Updating Existing Table Statistics — updateTableStats Method
updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit
updateTableStats updates the table statistics of the input CatalogTable (only if the statistics are available in the metastore already).
updateTableStats requests SessionCatalog to alterTableStats with either the current total size of the table (when the spark.sql.statistics.size.autoUpdate.enabled property is turned on) or empty statistics (which effectively removes the recorded statistics).
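The decision described above can be sketched in plain Scala without a Spark dependency. This is a simplified model, not Spark's code: Stats, InMemoryCatalog and the autoSizeUpdateEnabled flag are hypothetical stand-ins for CatalogStatistics, SessionCatalog and the spark.sql.statistics.size.autoUpdate.enabled property.

```scala
// Hypothetical stand-in for Spark's CatalogStatistics (size only).
final case class Stats(sizeInBytes: BigInt)

// Hypothetical stand-in for the session-scoped SessionCatalog.
final class InMemoryCatalog {
  private var stats = Map.empty[String, Option[Stats]]

  def alterTableStats(table: String, newStats: Option[Stats]): Unit =
    stats = stats.updated(table, newStats)

  def statsOf(table: String): Option[Stats] = stats.getOrElse(table, None)
}

object UpdateTableStatsSketch {
  // currentTotalSize is passed by name so the (potentially expensive)
  // size calculation only runs when auto-update is turned on.
  def updateTableStats(
      catalog: InMemoryCatalog,
      table: String,
      autoSizeUpdateEnabled: Boolean,
      currentTotalSize: => BigInt): Unit = {
    // Do nothing unless statistics are already recorded for the table.
    if (catalog.statsOf(table).nonEmpty) {
      if (autoSizeUpdateEnabled) {
        // Record the freshly calculated total size.
        catalog.alterTableStats(table, Some(Stats(currentTotalSize)))
      } else {
        // Empty statistics effectively remove the recorded ones.
        catalog.alterTableStats(table, None)
      }
    }
  }
}
```

Passing the size by name mirrors the point that the recalculation is the costly part and is skipped entirely when the property is off.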
|
Important
|
When the spark.sql.statistics.size.autoUpdate.enabled property is turned on, updateTableStats recalculates the total size of the table every time a data change command is executed, which can be expensive (and slow down data change commands) if the total number of files of the table is very large.
|
|
Note
|
updateTableStats uses SparkSession to access the current SessionState that it then uses to access the session-scoped SessionCatalog.
|
|
Note
|
updateTableStats is used when InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand, AlterTableSetLocationCommand and LoadDataCommand commands are executed.
|
Calculating Total Size of Table (with Partitions) — calculateTotalSize Method
calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): BigInt
calculateTotalSize calculates the total file size of the input CatalogTable: the size under the table location (when the table has no partitions defined) or the sum of the sizes of all its partitions (which it lists through the session-scoped SessionCatalog).
|
Note
|
calculateTotalSize uses the input SessionState to access the SessionCatalog.
|
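The two cases of calculateTotalSize can be illustrated with a small model. It is a sketch under assumptions: Table and Partition are hypothetical simplifications of CatalogTable and its catalog partitions, the partitions are carried on the table instead of being listed through SessionCatalog, and locationSize stands in for calculateLocationSize.

```scala
// Hypothetical, simplified stand-ins for CatalogTable and its partitions.
final case class Partition(location: String)
final case class Table(
    location: String,
    partitionColumns: Seq[String],
    partitions: Seq[Partition])

object TotalSizeSketch {
  // locationSize plays the role of calculateLocationSize:
  // it returns the total file size (in bytes) under a location.
  def calculateTotalSize(table: Table, locationSize: String => Long): BigInt = {
    if (table.partitionColumns.isEmpty) {
      // No partitions defined: size of the table location itself.
      BigInt(locationSize(table.location))
    } else {
      // Partitioned table: sum the sizes of all partition locations.
      table.partitions.map(p => BigInt(locationSize(p.location))).sum
    }
  }
}
```

The result is a BigInt, as in the signature above, so the sum over many partitions cannot overflow a Long.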
Calculating Total File Size Under Path — calculateLocationSize Method
calculateLocationSize(
sessionState: SessionState,
identifier: TableIdentifier,
locationUri: Option[URI]): Long
calculateLocationSize reads hive.exec.stagingdir configuration property for the staging directory (with .hive-staging being the default).
You should see the following INFO message in the logs:
INFO CommandUtils: Starting to calculate the total file size under path [locationUri].
calculateLocationSize calculates the sum of the length of all the files under the input locationUri.
|
Note
|
calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and FileStatus.getLen to access a file and the length of the file (in bytes), respectively.
|
In the end, you should see the following INFO message in the logs:
INFO CommandUtils: It took [durationInMs] ms to calculate the total file size under path [locationUri].
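The recursive summation can be sketched as follows. This is not Spark's code: java.io.File is swapped in for Hadoop's FileSystem and FileStatus so that the example runs without Hadoop, and it assumes the staging directory name is read so that entries whose names start with it are left out of the sum. LocationSizeSketch and pathSize are hypothetical names.

```scala
import java.io.File

object LocationSizeSketch {
  // Sums the length (in bytes) of all files under path, skipping entries
  // whose name starts with the staging directory name (assumption: this is
  // why the hive.exec.stagingdir property is read).
  def pathSize(path: File, stagingDir: String = ".hive-staging"): Long = {
    if (path.isDirectory) {
      // listFiles returns null on I/O errors, so guard with Option.
      val children = Option(path.listFiles()).getOrElse(Array.empty[File])
      children
        .filterNot(_.getName.startsWith(stagingDir))
        .map(child => pathSize(child, stagingDir))
        .sum
    } else {
      // Counterpart of FileStatus.getLen for a single file.
      path.length()
    }
  }
}
```

For example, LocationSizeSketch.pathSize(new File("/tmp/warehouse/t1")) would return the total size of a (hypothetical) table directory, ignoring any staging subdirectories.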
Creating CatalogStatistics with Current Statistics — compareAndGetNewStats Method
compareAndGetNewStats(
oldStats: Option[CatalogStatistics],
newTotalSize: BigInt,
newRowCount: Option[BigInt]): Option[CatalogStatistics]
compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and newRowCount only when they are different from the oldStats.
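A minimal model of this comparison, in plain Scala, might look like the following. TableStats is a hypothetical simplification of CatalogStatistics, and the handling of negative values and of a missing oldStats (treated as -1) is an assumption of the sketch rather than something stated above.

```scala
// Hypothetical, simplified stand-in for Spark's CatalogStatistics.
final case class TableStats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object CompareStatsSketch {
  def compareAndGetNewStats(
      oldStats: Option[TableStats],
      newTotalSize: BigInt,
      newRowCount: Option[BigInt]): Option[TableStats] = {
    // Missing old values are modelled as -1 so any valid new value differs.
    val oldTotalSize = oldStats.map(_.sizeInBytes).getOrElse(BigInt(-1))
    val oldRowCount = oldStats.flatMap(_.rowCount).getOrElse(BigInt(-1))

    val sizeChanged = newTotalSize >= 0 && newTotalSize != oldTotalSize
    val rowCountChanged = newRowCount.exists(c => c >= 0 && c != oldRowCount)

    if (sizeChanged && rowCountChanged) Some(TableStats(newTotalSize, newRowCount))
    else if (sizeChanged) Some(TableStats(newTotalSize))
    else if (rowCountChanged) Some(TableStats(oldTotalSize, newRowCount))
    else None // nothing changed, so there is nothing to write back
  }
}
```

Returning None when nothing changed lets the caller skip the catalog update altogether.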
|
Note
|
compareAndGetNewStats is used when AnalyzePartitionCommand and AnalyzeTableCommand are executed.
|