scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog
// Using high-level user-friendly catalog interface
scala> spark.catalog.listTables.filter($"name" === "t1").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
| t1| default| null| MANAGED| false|
+----+--------+-----------+---------+-----------+
// Using low-level internal SessionCatalog interface to access CatalogTables
val t1Tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val t1Metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(t1Tid)
scala> :type t1Metadata
org.apache.spark.sql.catalyst.catalog.CatalogTable
CatalogTable — Table Specification (Metadata)
CatalogTable is the specification (metadata) of a table.
CatalogTable is stored in a SessionCatalog (session-scoped catalog of relational entities).
CatalogTable is created when:
- SessionCatalog is requested for a table metadata
- HiveClientImpl is requested for looking up a table in a metastore
- DataFrameWriter is requested to create a table
- InsertIntoHiveDirCommand logical command is executed
- SparkSqlAstBuilder does visitCreateTable and visitCreateHiveTable
- CreateTableLikeCommand logical command is executed
- CreateViewCommand logical command is executed (and prepareTable)
- CatalogImpl is requested to createTable
The readable text representation of a CatalogTable (aka simpleString) is…FIXME
Note: simpleString is used exclusively when ShowTablesCommand logical command is executed (with a partition specification).
CatalogTable uses the following text representation (i.e. toString)…FIXME
CatalogTable is created with the optional bucketing specification that is used for the following:
- CatalogImpl is requested to list the columns of a table
- FindDataSourceTable logical evaluation rule is requested to readDataSourceTable (when executed for data source tables)
- CreateTableLikeCommand logical command is executed
- DescribeTableCommand logical command is requested to describe detailed partition and storage information (when executed)
- ShowCreateTableCommand logical command is executed
- CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand logical commands are executed
- CatalogTable is requested to convert itself to LinkedHashMap
- HiveExternalCatalog is requested to doCreateTable, tableMetaToTableProps, doAlterTable, restoreHiveSerdeTable and restoreDataSourceTable
- HiveClientImpl is requested to retrieve a table metadata (if available) and toHiveTable
- InsertIntoHiveTable logical command is executed
- DataFrameWriter is requested to create a table (via saveAsTable)
- SparkSqlAstBuilder is requested to visitCreateTable and visitCreateHiveTable
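One way to see the bucketing specification end up in a CatalogTable is to write a bucketed table with DataFrameWriter and read its metadata back. The following is a minimal sketch only: the table name demo_bucketed is a hypothetical name chosen for the example, and the session is assumed to run locally (in spark-shell, getOrCreate simply reuses the existing session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Reuses the spark-shell session if there is one, otherwise starts a local one
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// demo_bucketed is a hypothetical table name used only in this sketch
spark.range(10).write
  .mode("overwrite")
  .bucketBy(4, "id")   // 4 buckets on the id column
  .sortBy("id")        // sort rows within every bucket by id
  .saveAsTable("demo_bucketed")

// Read the table specification back from the session catalog
val bucketedMetadata =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("demo_bucketed"))

// Option[BucketSpec]: defined for bucketed tables only
val bucketSpec = bucketedMetadata.bucketSpec
bucketSpec.foreach(println)
```

The bucketSpec of a table written without bucketBy is expected to be None.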
Creating CatalogTable Instance
CatalogTable takes the following to be created:
- Optional Bucketing specification (default: None)
- Optional table statistics
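As an illustration of the optional arguments, a CatalogTable can be built by hand using named arguments. This is a sketch under assumptions: the table name demo, the two-column schema and the parquet provider are made up for the example, and only the arguments needed here are passed (the others keep their defaults).

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical schema used only in this sketch
val demoSchema = new StructType().add("id", LongType).add("p1", StringType)

// Bucketing specification and statistics are left out, so they take their defaults
val plainTable = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = demoSchema,
  provider = Some("parquet"))

// The same table with an explicit bucketing specification (4 buckets on id, no sort columns)
val bucketedTable = plainTable.copy(bucketSpec = Some(BucketSpec(4, Seq("id"), Nil)))

println(plainTable.bucketSpec)    // no bucketing specification
println(plainTable.stats)         // no table statistics
println(bucketedTable.bucketSpec) // the explicit BucketSpec
```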
Table Type
The type of a table (CatalogTableType) can be one of the following:
- EXTERNAL for external tables (EXTERNAL_TABLE in Hive)
- MANAGED for managed tables (MANAGED_TABLE in Hive)
- VIEW for views (VIRTUAL_VIEW in Hive)
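The mapping between the table types and their Hive names listed above can be sketched as a pattern match. Note that hiveTableTypeName is a hypothetical helper written for this example and is not part of Spark.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Hypothetical helper (not a Spark API): maps a CatalogTableType
// to the Hive table type name given in the list above
def hiveTableTypeName(tableType: CatalogTableType): String = tableType match {
  case CatalogTableType.EXTERNAL => "EXTERNAL_TABLE"
  case CatalogTableType.MANAGED  => "MANAGED_TABLE"
  case CatalogTableType.VIEW     => "VIRTUAL_VIEW"
  case other =>
    throw new IllegalArgumentException(s"Unknown table type: ${other.name}")
}

Seq(CatalogTableType.EXTERNAL, CatalogTableType.MANAGED, CatalogTableType.VIEW)
  .foreach(t => println(s"${t.name} -> ${hiveTableTypeName(t)}"))
```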
CatalogTableType is included when a TreeNode is requested for a JSON representation for…FIXME
Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)
You manage table metadata using the catalog interface (aka metastore). One of the management tasks is getting the statistics of a table (which are used for cost-based query optimization).
scala> t1Metadata.stats.foreach(println)
CatalogStatistics(714,Some(2),Map(p1 -> ColumnStat(2,Some(0),Some(1),0,4,4,None), id -> ColumnStat(2,Some(0),Some(1),0,4,4,None)))
scala> t1Metadata.stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
Note: The CatalogStatistics are optional when CatalogTable is created.
Caution: FIXME When are stats specified? What if there are none?
Unless CatalogStatistics are available in a table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size specified by spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto broadcast the table).
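To see why the default size keeps a table without statistics from being broadcast, compare the two settings directly. This is a simplified sketch: the real planner check involves more than this single comparison, and the result below assumes that neither property has been changed from its default.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if there is one, otherwise starts a local one
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val sqlConf = spark.sessionState.conf

// Size assumed for a relation with no statistics (spark.sql.defaultSizeInBytes)
val defaultSize: Long = sqlConf.defaultSizeInBytes

// Upper size limit for auto broadcast joins (spark.sql.autoBroadcastJoinThreshold)
val broadcastThreshold: Long = sqlConf.autoBroadcastJoinThreshold

// Simplified version of the size comparison used to pick broadcast candidates
val broadcastCandidate = defaultSize >= 0 && defaultSize <= broadcastThreshold

println(s"defaultSizeInBytes=$defaultSize threshold=$broadcastThreshold candidate=$broadcastCandidate")
```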
Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.
Unless CatalogStatistics are available in a table metadata (in a catalog) for HiveTableRelation (and hive provider), DetermineTableStats logical resolution rule can compute the table size using HDFS (if spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (that effectively disables table broadcasting).
When requested to look up a table in a metastore, HiveClientImpl reads table or partition statistics directly from a Hive metastore.
You can use AnalyzeColumnCommand, AnalyzePartitionCommand and AnalyzeTableCommand commands to record statistics in a catalog.
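In SQL, these commands are reached through ANALYZE TABLE. The following is a minimal sketch that records table-level statistics and reads them back from the table metadata; demo_stats is a hypothetical table name used only here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Reuses the spark-shell session if there is one, otherwise starts a local one
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// demo_stats is a hypothetical two-row table used only in this sketch
spark.range(2).write.mode("overwrite").saveAsTable("demo_stats")

// Table-level statistics (size in bytes and row count)
spark.sql("ANALYZE TABLE demo_stats COMPUTE STATISTICS")

// The recorded statistics are part of the table metadata
val recordedStats =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("demo_stats")).stats

recordedStats.map(_.simpleString).foreach(println)
```

Adding FOR COLUMNS with a list of column names to the statement records column-level statistics as well.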
The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when spark.sql.statistics.size.autoUpdate.enabled property is turned on.
You can use DESCRIBE SQL command to show the histogram of a column if stored in a catalog.
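A possible way to try this out is sketched below. It assumes a Spark version that supports column-level DESCRIBE (2.3 or later) and that histograms are only collected when spark.sql.statistics.histogram.enabled is turned on before the columns are analyzed; demo_hist is a hypothetical table name.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if there is one, otherwise starts a local one
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Histograms are collected only when this property is on (assumption: off by default)
spark.conf.set("spark.sql.statistics.histogram.enabled", true)

// demo_hist is a hypothetical table used only in this sketch
spark.range(100).write.mode("overwrite").saveAsTable("demo_hist")

// Column-level statistics (including the histogram) for the id column
spark.sql("ANALYZE TABLE demo_hist COMPUTE STATISTICS FOR COLUMNS id")

// Column-level DESCRIBE lists the column statistics stored in the catalog
val described = spark.sql("DESCRIBE EXTENDED demo_hist id")
described.show(100, truncate = false)
```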
partitionSchema Method
partitionSchema: StructType
partitionSchema…FIXME
Note: partitionSchema is used when…FIXME
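Calling partitionSchema on a hand-built CatalogTable gives a feel for what it returns. This sketch assumes, based on observed behaviour rather than anything stated above, that the partition columns are expected to be the trailing fields of the schema; the table name and schema are made up for the example.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical table partitioned by p1 (the last field of the schema)
val partitionedTable = CatalogTable(
  identifier = TableIdentifier("demo_partitioned", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", LongType).add("p1", StringType),
  provider = Some("parquet"),
  partitionColumnNames = Seq("p1"))

// StructType with the partition columns only
val partSchema: StructType = partitionedTable.partitionSchema
println(partSchema.fieldNames.mkString(", "))
```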
Converting Table Specification to LinkedHashMap — toLinkedHashMap
Method
toLinkedHashMap: mutable.LinkedHashMap[String, String]
toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:
- Database with the database of the TableIdentifier
- Table with the table of the TableIdentifier
- Owner with the owner (if defined)
- Created Time with the createTime
- Created By with Spark and the createVersion
- Type with the name of the CatalogTableType
- Provider with the provider (if defined)
- Bucket specification (of the BucketSpec if defined)
- Comment with the comment (if defined)
- View Text, View Default Database and View Query Output Columns for VIEW table type
- Table Properties with the tableProperties (if not empty)
- Statistics with the CatalogStatistics (if defined)
- Storage specification (of the CatalogStorageFormat if defined)
- Partition Provider with Catalog if the tracksPartitionsInCatalog flag is on
- Partition Columns with the partitionColumns (if not empty)
- Schema with the schema (if not empty)
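The fields above can be inspected by converting a hand-built CatalogTable. This is a sketch only: the table name, schema and provider are hypothetical, and the exact set of keys may differ between Spark versions.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical table specification used only in this sketch
val describedTable = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", LongType),
  provider = Some("parquet"))

// Insertion-ordered pairs of field names and their text values
val tableProps = describedTable.toLinkedHashMap

tableProps.foreach { case (field, value) => println(s"$field: $value") }
```

Optional fields such as Owner or Comment only show up when the corresponding value is defined.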
database Method
database: String
database simply returns the database (of the TableIdentifier) or throws an AnalysisException:
table [identifier] did not specify database
Note: database is used when…FIXME
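Both outcomes can be reproduced with two hand-built CatalogTables, one with and one without a database in the TableIdentifier. A minimal sketch, with a made-up table name and schema:

```scala
import scala.util.Try

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical table whose identifier carries the database
val withDatabase = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", LongType))

// The same table with no database in the identifier
val withoutDatabase = withDatabase.copy(identifier = TableIdentifier("demo"))

println(withDatabase.database)

// Wrapped in Try since the call is expected to throw
val lookup = Try(withoutDatabase.database)
lookup.failed.foreach(e => println(e.getMessage))
```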