CatalogTable — Table Specification (Metadata)

CatalogTable is the specification (metadata) of a table.

CatalogTable is stored in a SessionCatalog (session-scoped catalog of relational entities).

scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using high-level user-friendly catalog interface
scala> spark.catalog.listTables.filter($"name" === "t1").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
|  t1| default|       null|  MANAGED|      false|
+----+--------+-----------+---------+-----------+

// Using low-level internal SessionCatalog interface to access CatalogTables
val t1Tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val t1Metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(t1Tid)
scala> :type t1Metadata
org.apache.spark.sql.catalyst.catalog.CatalogTable

CatalogTable is created when:

SessionCatalog is requested for table metadata
HiveClientImpl is requested for looking up a table in a metastore
DataFrameWriter is requested to create a table
InsertIntoHiveDirCommand logical command is executed
SparkSqlAstBuilder is requested to visitCreateTable and visitCreateHiveTable
CreateTableLikeCommand logical command is executed
CreateViewCommand logical command is executed (and prepareTable)
CatalogImpl is requested to createTable

The readable text representation of a CatalogTable (aka simpleString) is…FIXME

Note	`simpleString` is used exclusively when `ShowTablesCommand` logical command is executed (with a partition specification).

CatalogTable uses the following text representation (i.e. toString)…FIXME

CatalogTable is created with the optional bucketing specification that is used for the following:

CatalogImpl is requested to list the columns of a table
FindDataSourceTable logical evaluation rule is requested to readDataSourceTable (when executed for data source tables)
CreateTableLikeCommand logical command is executed
DescribeTableCommand logical command is requested to describe detailed partition and storage information (when executed)
ShowCreateTableCommand logical command is executed
CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand logical commands are executed
CatalogTable is requested to convert itself to LinkedHashMap
HiveExternalCatalog is requested to doCreateTable, tableMetaToTableProps, doAlterTable, restoreHiveSerdeTable and restoreDataSourceTable
HiveClientImpl is requested to retrieve table metadata (if available) and toHiveTable
InsertIntoHiveTable logical command is executed
DataFrameWriter is requested to create a table (via saveAsTable)
SparkSqlAstBuilder is requested to visitCreateTable and visitCreateHiveTable

Creating CatalogTable Instance

CatalogTable takes the following to be created:

TableIdentifier
Table type
CatalogStorageFormat
Schema
Name of the table provider (optional)
Partition column names
Optional Bucketing specification (default: None)
Owner
Create time
Last access time
Create version
Properties
Optional table statistics
Optional view text
Optional comment
Unsupported features
tracksPartitionsInCatalog flag
schemaPreservesCase flag
Ignored properties
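
The following is a minimal sketch of creating a CatalogTable by hand, assuming the Spark Catalyst classes are on the classpath (e.g. in spark-shell). The table name demo and its two columns are made up for illustration. Only the identifier, table type, storage format and schema are given without defaults; everything else (including the bucketing specification) falls back to its default value.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical table: "demo" in the "default" database, partitioned by p1.
// Partition columns are placed at the end of the schema.
val demoTable = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", LongType).add("p1", StringType),
  provider = Some("parquet"),
  partitionColumnNames = Seq("p1"))

// The bucketing specification was not given, so it stays at its default (None).
println(demoTable.bucketSpec)
println(demoTable.tableType.name)
```

Building the instance does not register the table anywhere; it is only the in-memory specification.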

Table Type

The type of a table (CatalogTableType) can be one of the following:

EXTERNAL for external tables (EXTERNAL_TABLE in Hive)
MANAGED for managed tables (MANAGED_TABLE in Hive)
VIEW for views (VIRTUAL_VIEW in Hive)
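
The correspondence between Hive and Spark table types listed above can be sketched in plain Scala. The helper toSparkTableTypeName is hypothetical (it is not part of Spark) and works on type names as strings only, so that it runs without Spark on the classpath.

```scala
// Hypothetical helper, not Spark's code: translates a Hive table type name
// to the name of the matching CatalogTableType, following the list above.
def toSparkTableTypeName(hiveTableType: String): Option[String] =
  hiveTableType match {
    case "EXTERNAL_TABLE" => Some("EXTERNAL")
    case "MANAGED_TABLE"  => Some("MANAGED")
    case "VIRTUAL_VIEW"   => Some("VIEW")
    case _                => None // any other Hive type is not covered by the list
  }

println(toSparkTableTypeName("VIRTUAL_VIEW"))
```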

CatalogTableType is included when a TreeNode is requested for a JSON representation for…FIXME

Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)

You manage table metadata using the catalog interface (aka metastore). One of the management tasks is getting the statistics of a table (which are used for cost-based query optimization).

scala> t1Metadata.stats.foreach(println)
CatalogStatistics(714,Some(2),Map(p1 -> ColumnStat(2,Some(0),Some(1),0,4,4,None), id -> ColumnStat(2,Some(0),Some(1),0,4,4,None)))

scala> t1Metadata.stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows

Note	The CatalogStatistics are optional when `CatalogTable` is created.

Caution

FIXME When are stats specified? What if there are none?

Unless CatalogStatistics are available in a table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size specified by spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto broadcast the table).

Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.

Unless CatalogStatistics are available in a table metadata (in a catalog) for HiveTableRelation (and hive provider), the DetermineTableStats logical resolution rule either computes the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assumes spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).
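
The fallback order described above can be sketched as a small pure function. This is a simplified model, not Spark's implementation: the names plannedSizeInBytes and fitsBroadcast are hypothetical, and the comparison against spark.sql.autoBroadcastJoinThreshold (10 MB by default) is an assumption added here to show why a Long.MaxValue size rules out broadcasting.

```scala
// Hypothetical model of the size fallback (not Spark's code):
// 1. use the size from the catalog statistics when available,
// 2. otherwise ask the file system if the fallback is enabled,
// 3. otherwise assume the default size.
def plannedSizeInBytes(
    catalogSizeInBytes: Option[BigInt],
    fallBackToHdfs: Boolean,
    sizeOnFileSystem: () => Long,
    defaultSizeInBytes: Long = Long.MaxValue): BigInt =
  catalogSizeInBytes.getOrElse {
    if (fallBackToHdfs) BigInt(sizeOnFileSystem()) else BigInt(defaultSizeInBytes)
  }

// Assumption: a table is a broadcast candidate when its size does not exceed the threshold.
def fitsBroadcast(sizeInBytes: BigInt, thresholdInBytes: Long = 10L * 1024 * 1024): Boolean =
  sizeInBytes <= BigInt(thresholdInBytes)

val withStats = plannedSizeInBytes(Some(BigInt(714)), fallBackToHdfs = false, () => 0L)
val noStats   = plannedSizeInBytes(None, fallBackToHdfs = false, () => 0L)

println(fitsBroadcast(withStats)) // small table with statistics
println(fitsBroadcast(noStats))   // no statistics, default size assumed
```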

When requested to look up a table in a metastore, HiveClientImpl reads table or partition statistics directly from a Hive metastore.

You can use the AnalyzeColumnCommand, AnalyzePartitionCommand and AnalyzeTableCommand commands to record statistics in a catalog.

The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when spark.sql.statistics.size.autoUpdate.enabled property is turned on.

You can use the DESCRIBE SQL command to show the histogram of a column (if one is stored in a catalog).
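
As an illustration of recording and inspecting statistics, the sketch below runs the table-level and column-level ANALYZE TABLE statements and then describes a column. It assumes a local SparkSession and a made-up table stats_demo. It also assumes that histograms are switched on with spark.sql.statistics.histogram.enabled, a property not mentioned above. The partition-level form is left out because the demo table is not partitioned.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// In spark-shell, getOrCreate simply returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("stats-sketch").getOrCreate()

// Hypothetical demo table with a single id column and 100 rows.
spark.range(100).write.mode("overwrite").saveAsTable("stats_demo")

// Assumption: without this property no histogram is computed for the column.
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")

spark.sql("ANALYZE TABLE stats_demo COMPUTE STATISTICS")                // table-level statistics
spark.sql("ANALYZE TABLE stats_demo COMPUTE STATISTICS FOR COLUMNS id") // column-level statistics

// The statistics are now part of the table metadata in the catalog.
val demoStats = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("stats_demo"))
  .stats
println(demoStats.map(_.simpleString))

// DESCRIBE EXTENDED on a single column includes a histogram entry.
val described = spark.sql("DESCRIBE EXTENDED stats_demo id")
described.show(truncate = false)
```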

`dataSchema` Method

dataSchema: StructType

dataSchema…FIXME

Note	`dataSchema` is used when…FIXME

`partitionSchema` Method

partitionSchema: StructType

partitionSchema…FIXME

Note	`partitionSchema` is used when…FIXME

Converting Table Specification to LinkedHashMap — `toLinkedHashMap` Method

toLinkedHashMap: mutable.LinkedHashMap[String, String]

toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

Database with the database of the TableIdentifier
Table with the table of the TableIdentifier
Owner with the owner (if defined)
Created Time with the createTime
Created By with Spark and the createVersion
Type with the name of the CatalogTableType
Provider with the provider (if defined)
Bucket specification (of the BucketSpec if defined)
Comment with the comment (if defined)
View Text, View Default Database and View Query Output Columns for VIEW table type
Table Properties with the tableProperties (if not empty)
Statistics with the CatalogStatistics (if defined)
Storage specification (of the CatalogStorageFormat if defined)
Partition Provider with Catalog if the tracksPartitionsInCatalog flag is on
Partition Columns with the partitionColumns (if not empty)
Schema with the schema (if not empty)

Note	`toLinkedHashMap` is used when:

`DescribeTableCommand` is requested to describeFormattedTableInfo (when `DescribeTableCommand` is requested to run for a non-temporary table and the isExtended flag on)
`CatalogTable` is requested for either a simple or a catalog text representation
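
The idea behind toLinkedHashMap (fields are added in a fixed order, optional ones only when present) can be sketched in plain Scala. TableSpecSketch and toLinkedHashMapSketch are hypothetical stand-ins that cover a handful of the fields listed above; the value formatting is illustrative and may differ from what Spark prints.

```scala
import scala.collection.mutable

// Hypothetical, trimmed-down stand-in for CatalogTable (not Spark's class).
final case class TableSpecSketch(
    database: String,
    table: String,
    tableType: String,
    provider: Option[String] = None,
    comment: Option[String] = None,
    partitionColumns: Seq[String] = Nil)

def toLinkedHashMapSketch(spec: TableSpecSketch): mutable.LinkedHashMap[String, String] = {
  // A LinkedHashMap keeps the insertion order, so the output order is stable.
  val map = mutable.LinkedHashMap.empty[String, String]
  map.put("Database", spec.database)
  map.put("Table", spec.table)
  map.put("Type", spec.tableType)
  // Optional fields are added only if defined.
  spec.provider.foreach(p => map.put("Provider", p))
  spec.comment.foreach(c => map.put("Comment", c))
  // Collection-valued fields are added only if not empty.
  if (spec.partitionColumns.nonEmpty) {
    map.put("Partition Columns", spec.partitionColumns.mkString("[", ", ", "]"))
  }
  map
}

val entries = toLinkedHashMapSketch(
  TableSpecSketch("default", "t1", "MANAGED", provider = Some("parquet")))
entries.foreach { case (field, value) => println(s"$field: $value") }
```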

`database` Method

database: String

database simply returns the database (of the TableIdentifier) or throws an AnalysisException:

table [identifier] did not specify database

Note	`database` is used when…FIXME
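
The behaviour of database can be modelled in a few lines of plain Scala. TableIdSketch and databaseOf are hypothetical names, and IllegalArgumentException stands in for Spark's AnalysisException so that the sketch runs without Spark.

```scala
import scala.util.Try

// Hypothetical stand-in for TableIdentifier: the database part is optional.
final case class TableIdSketch(table: String, database: Option[String] = None)

// Returns the database of the identifier or fails when none was specified.
def databaseOf(id: TableIdSketch): String =
  id.database.getOrElse {
    throw new IllegalArgumentException(s"table $id did not specify database")
  }

println(databaseOf(TableIdSketch("t1", Some("default"))))
println(Try(databaseOf(TableIdSketch("t1"))).isFailure)
```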
