scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog
// Using high-level user-friendly catalog interface
scala> spark.catalog.listTables.filter($"name" === "t1").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
|  t1| default|       null|  MANAGED|      false|
+----+--------+-----------+---------+-----------+
// Using low-level internal SessionCatalog interface to access CatalogTables
val t1Tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val t1Metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(t1Tid)
scala> :type t1Metadata
org.apache.spark.sql.catalyst.catalog.CatalogTable
CatalogTable — Table Specification (Metadata)
CatalogTable is the specification (metadata) of a table.
CatalogTable is stored in a SessionCatalog (session-scoped catalog of relational entities).
CatalogTable is created when:
- SessionCatalog is requested for a table metadata
- HiveClientImpl is requested for looking up a table in a metastore
- DataFrameWriter is requested to create a table
- InsertIntoHiveDirCommand logical command is executed
- SparkSqlAstBuilder does visitCreateTable and visitCreateHiveTable
- CreateTableLikeCommand logical command is executed
- CreateViewCommand logical command is executed (and prepareTable)
- CatalogImpl is requested to createTable
The readable text representation of a CatalogTable (aka simpleString) is…FIXME
Note: simpleString is used exclusively when ShowTablesCommand logical command is executed (with a partition specification).
CatalogTable uses the following text representation (i.e. toString)…FIXME
CatalogTable is created with the optional bucketing specification that is used for the following:
- CatalogImpl is requested to list the columns of a table
- FindDataSourceTable logical evaluation rule is requested to readDataSourceTable (when executed for data source tables)
- CreateTableLikeCommand logical command is executed
- DescribeTableCommand logical command is requested to describe detailed partition and storage information (when executed)
- ShowCreateTableCommand logical command is executed
- CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand logical commands are executed
- CatalogTable is requested to convert itself to LinkedHashMap
- HiveExternalCatalog is requested to doCreateTable, tableMetaToTableProps, doAlterTable, restoreHiveSerdeTable and restoreDataSourceTable
- HiveClientImpl is requested to retrieve a table metadata (if available) and toHiveTable
- InsertIntoHiveTable logical command is executed
- DataFrameWriter is requested to create a table (via saveAsTable)
- SparkSqlAstBuilder is requested to visitCreateTable and visitCreateHiveTable
Creating CatalogTable Instance
CatalogTable takes the following to be created:
- Optional Bucketing specification (default: None)
- Optional table statistics
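The two optional arguments above can be seen in a minimal sketch like the following. It assumes only spark-catalyst on the classpath (no SparkSession). The table name, database, column and bucket count are hypothetical, and the remaining arguments (identifier, tableType, storage, schema) are the ones CatalogTable is assumed to require in the Spark version at hand.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical table used for illustration only
val bucketedTable = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", LongType),
  // Optional bucketing specification (None when left out)
  bucketSpec = Some(BucketSpec(
    numBuckets = 4,
    bucketColumnNames = Seq("id"),
    sortColumnNames = Seq.empty)))

// Dropping the bucketing specification gives back the default
val plainTable = bucketedTable.copy(bucketSpec = None)

println(bucketedTable.bucketSpec.map(_.numBuckets))
println(plainTable.bucketSpec.isDefined)
// Table statistics were never passed in, so they are not defined either
println(bucketedTable.stats.isDefined)
```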
Table Type
The type of a table (CatalogTableType) can be one of the following:
- EXTERNAL for external tables (EXTERNAL_TABLE in Hive)
- MANAGED for managed tables (MANAGED_TABLE in Hive)
- VIEW for views (VIRTUAL_VIEW in Hive)
CatalogTableType is included when a TreeNode is requested for a JSON representation for…FIXME
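The three table types and their Hive counterparts can be sketched as follows. The hiveName helper is hypothetical (it is not part of Spark) and only restates the mapping listed above. The sketch assumes spark-catalyst on the classpath.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

val tableTypes = Seq(
  CatalogTableType.EXTERNAL,
  CatalogTableType.MANAGED,
  CatalogTableType.VIEW)

// Hypothetical helper: maps a CatalogTableType to the Hive name given in the list above
def hiveName(tableType: CatalogTableType): String = tableType match {
  case CatalogTableType.EXTERNAL => "EXTERNAL_TABLE"
  case CatalogTableType.MANAGED  => "MANAGED_TABLE"
  case CatalogTableType.VIEW     => "VIRTUAL_VIEW"
  case other                     => other.name // fallback for any other type
}

tableTypes.foreach(t => println(s"${t.name} -> ${hiveName(t)}"))
```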
Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)
You manage table metadata using the catalog interface (aka metastore). One of the management tasks is getting the statistics of a table, which are used for cost-based query optimization.
scala> t1Metadata.stats.foreach(println)
CatalogStatistics(714,Some(2),Map(p1 -> ColumnStat(2,Some(0),Some(1),0,4,4,None), id -> ColumnStat(2,Some(0),Some(1),0,4,4,None)))
scala> t1Metadata.stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
Note: The CatalogStatistics are optional when CatalogTable is created.
Caution: FIXME When are stats specified? What if there are not?
Unless CatalogStatistics are available in a table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size specified by spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto broadcast the table).
Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.
Unless CatalogStatistics are available in a table metadata (in a catalog) for HiveTableRelation (and hive provider), the DetermineTableStats logical resolution rule can compute the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).
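The fallback can be tried out with a sketch like the one below. It assumes a spark-shell session (spark in scope) with Hive support and the table t1 from the example at the top of this page. Whether the setting changes the estimate depends on t1 being a Hive table with no statistics recorded. Comparing against the auto-broadcast threshold is an addition for illustration, and LogicalPlan.stats without arguments assumes a Spark version where it takes no parameters.

```scala
// Assumption: spark-shell with Hive support; t1 exists in the current database
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")

// The size estimate that query planning sees for t1
val sizeInBytes = spark.table("t1").queryExecution.optimizedPlan.stats.sizeInBytes

// Threshold below which a join side may be broadcast automatically
val threshold = spark.sessionState.conf.autoBroadcastJoinThreshold

println(s"estimated size: $sizeInBytes bytes, auto-broadcast threshold: $threshold bytes")
println(s"eligible for auto broadcast: ${threshold >= 0 && sizeInBytes <= threshold}")
```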
When requested to look up a table in a metastore, HiveClientImpl reads table or partition statistics directly from a Hive metastore.
You can use the AnalyzeColumnCommand, AnalyzePartitionCommand and AnalyzeTableCommand commands to record statistics in a catalog.
The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when the spark.sql.statistics.size.autoUpdate.enabled property is turned on.
You can use the DESCRIBE SQL command to show the histogram of a column if it is stored in a catalog.
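A minimal sketch of recording and inspecting statistics, assuming a spark-shell session (spark in scope) and the table t1 with columns id and p1 from the examples above. The ANALYZE TABLE statements are the SQL front end to the analyze commands. No output is shown since it depends on the data.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Assumption: spark-shell; t1 exists and has the columns id and p1
// Table-level statistics (size in bytes and row count)
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
// Column-level statistics for the listed columns
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1")

// Read the recorded statistics back from the table metadata
val stats = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t1")).stats
stats.map(_.simpleString).foreach(println)

// Column statistics (and the histogram, if one was recorded) for column id
spark.sql("DESCRIBE EXTENDED t1 id").show(truncate = false)
```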
partitionSchema Method
partitionSchema: StructType
partitionSchema…FIXME
Note: partitionSchema is used when…FIXME
Converting Table Specification to LinkedHashMap — toLinkedHashMap Method
toLinkedHashMap: mutable.LinkedHashMap[String, String]
toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:
- Database with the database of the TableIdentifier
- Table with the table of the TableIdentifier
- Owner with the owner (if defined)
- Created Time with the createTime
- Created By with Spark and the createVersion
- Type with the name of the CatalogTableType
- Provider with the provider (if defined)
- Bucket specification (of the BucketSpec if defined)
- Comment with the comment (if defined)
- View Text, View Default Database and View Query Output Columns for VIEW table type
- Table Properties with the tableProperties (if not empty)
- Statistics with the CatalogStatistics (if defined)
- Storage specification (of the CatalogStorageFormat if defined)
- Partition Provider with Catalog if the tracksPartitionsInCatalog flag is on
- Partition Columns with the partitionColumns (if not empty)
- Schema with the schema (if not empty)
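The fields above can be inspected with a sketch like the following, which assumes only spark-catalyst on the classpath. The table (name, database, column, provider) is hypothetical and built by hand, so only some of the fields are present.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical table used for illustration only
val demoTable = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", LongType),
  provider = Some("parquet"))

val fields = demoTable.toLinkedHashMap

// Fields come out in insertion order, e.g. Database first, then Table
fields.foreach { case (name, value) => println(s"$name: $value") }
```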
database Method
database: String
database simply returns the database (of the TableIdentifier) or throws an AnalysisException:
table [identifier] did not specify database
Note: database is used when…FIXME
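Both outcomes of database can be sketched as follows, assuming only spark-catalyst on the classpath. The table is hypothetical, and the failure is captured with Try so that the exact exception message, which may differ between Spark versions, is not relied upon.

```scala
import scala.util.Try

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.StructType

// Hypothetical table whose identifier names a database
val qualified = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType())

// The same table with the database left out of the identifier
val unqualified = qualified.copy(identifier = TableIdentifier("demo"))

println(qualified.database)

// database throws for the unqualified identifier, so Try yields a Failure
val result = Try(unqualified.database)
println(result.isFailure)
```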