CatalogImpl
CatalogImpl is the Catalog in Spark SQL that…FIXME
Note: CatalogImpl is in the org.apache.spark.sql.internal package.
Creating Table — createTable Method
createTable(
  tableName: String,
  source: String,
  schema: StructType,
  options: Map[String, String]): DataFrame
Note: createTable is part of the Catalog Contract to…FIXME.
createTable…FIXME
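As a quick illustration, the following sketch creates a parquet-backed table with an explicit schema. It assumes a SparkSession available as spark; the demo table name and the schema are made up for the example.
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Illustrative schema for the hypothetical "demo" table
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// Creates a table backed by the parquet data source
val demo = spark.catalog.createTable("demo", "parquet", schema, Map.empty[String, String])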
getTable Method
getTable(tableName: String): Table
getTable(dbName: String, tableName: String): Table
Note: getTable is part of the Catalog Contract to…FIXME.
getTable…FIXME
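A minimal usage sketch, assuming the demo table from the createTable example exists in the default database:
// Look up the table metadata by database and table name
val t = spark.catalog.getTable("default", "demo")
println(s"${t.name} is ${t.tableType} (temporary: ${t.isTemporary})")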
getFunction Method
getFunction(
  functionName: String): Function
getFunction(
  dbName: String,
  functionName: String): Function
Note: getFunction is part of the Catalog Contract to…FIXME.
getFunction…FIXME
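A minimal usage sketch, assuming built-in functions are resolvable by name (abs is a built-in, so the lookup is expected to succeed):
val fn = spark.catalog.getFunction("abs")
println(s"${fn.name} -> ${fn.className} (temporary: ${fn.isTemporary})")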
functionExists Method
functionExists(
  functionName: String): Boolean
functionExists(
  dbName: String,
  functionName: String): Boolean
Note: functionExists is part of the Catalog Contract to…FIXME.
functionExists…FIXME
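A minimal usage sketch with a built-in function and a name that is (presumably) not registered:
assert(spark.catalog.functionExists("abs"))
assert(!spark.catalog.functionExists("no_such_function"))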
Caching Table or View In-Memory — cacheTable Method
cacheTable(tableName: String): Unit
Internally, cacheTable first creates a DataFrame for the table and then requests CacheManager to cache it.
Note: cacheTable uses the session-scoped SharedState to access the CacheManager.
Note: cacheTable is part of the Catalog contract.
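A minimal usage sketch, assuming a SparkSession available as spark; the t1 view is made up for the example:
spark.range(5).createOrReplaceTempView("t1")
spark.catalog.cacheTable("t1")
// isCached reports whether the table is currently cached
assert(spark.catalog.isCached("t1"))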
Removing All Cached Tables From In-Memory Cache — clearCache Method
clearCache(): Unit
clearCache requests CacheManager to remove all cached tables from the in-memory cache.
Note: clearCache is part of the Catalog contract.
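A minimal usage sketch, continuing with the cached t1 view from the cacheTable example:
spark.catalog.clearCache()
assert(!spark.catalog.isCached("t1"))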
Creating External Table From Path — createExternalTable Method
createExternalTable(tableName: String, path: String): DataFrame
createExternalTable(tableName: String, path: String, source: String): DataFrame
createExternalTable(
  tableName: String,
  source: String,
  options: Map[String, String]): DataFrame
createExternalTable(
  tableName: String,
  source: String,
  schema: StructType,
  options: Map[String, String]): DataFrame
createExternalTable creates an external table tableName from the given path and returns the corresponding DataFrame.
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val readmeTable = spark.catalog.createExternalTable("readme", "README.md", "text")
readmeTable: org.apache.spark.sql.DataFrame = [value: string]
scala> spark.catalog.listTables.filter(_.name == "readme").show
+------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+------+--------+-----------+---------+-----------+
|readme| default| null| EXTERNAL| false|
+------+--------+-----------+---------+-----------+
scala> sql("select count(*) as count from readme").show(false)
+-----+
|count|
+-----+
|99 |
+-----+
The source input parameter is the name of the data source provider for the table, e.g. parquet, json, text. If not specified, createExternalTable uses the spark.sql.sources.default configuration property to determine the data source format.
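You can check the current default in the session configuration (parquet unless overridden):
scala> spark.conf.get("spark.sql.sources.default")
res0: String = parquet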
Note: The source input parameter must not be hive as it leads to an AnalysisException.
createExternalTable sets the mandatory path option when specified explicitly in the input parameter list.
createExternalTable parses tableName into a TableIdentifier (using SparkSqlParser). It creates a CatalogTable and then executes (by toRDD) a CreateTable logical plan. The result DataFrame is a Dataset[Row] with the QueryExecution (after executing the SubqueryAlias logical plan) and a RowEncoder.
Note: createExternalTable is part of the Catalog contract.
Listing Tables in Database (as Dataset) — listTables Method
listTables(): Dataset[Table]
listTables(dbName: String): Dataset[Table]
Note: listTables is part of the Catalog Contract to get a list of tables in the specified database.
Internally, listTables requests SessionCatalog to list all tables in the specified dbName database and converts them to Tables.
In the end, listTables creates a Dataset with the tables.
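A minimal usage sketch; the tables listed depend on the current session state:
val tables = spark.catalog.listTables("default")
tables.select("name", "tableType", "isTemporary").show(truncate = false)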
Listing Columns of Table (as Dataset) — listColumns Method
listColumns(tableName: String): Dataset[Column]
listColumns(dbName: String, tableName: String): Dataset[Column]
Note: listColumns is part of the Catalog Contract to…FIXME.
listColumns requests SessionCatalog for the table metadata.
listColumns takes the schema from the table metadata and creates a Column for every field (with the optional comment as the description).
In the end, listColumns creates a Dataset with the columns.
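A minimal usage sketch, assuming the demo table from the createTable example:
spark.catalog.listColumns("demo")
  .select("name", "dataType", "nullable")
  .show(truncate = false)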
Converting TableIdentifier to Table — makeTable Internal Method
makeTable(tableIdent: TableIdentifier): Table
makeTable creates a Table using the input TableIdentifier and the table metadata (from the current SessionCatalog) if available.
Note: makeTable uses SparkSession to access SessionState that is then used to access SessionCatalog.
Note: makeTable is used when CatalogImpl is requested to listTables or getTable.
Creating Dataset from DefinedByConstructorParams Data — makeDataset Method
makeDataset[T <: DefinedByConstructorParams](
  data: Seq[T],
  sparkSession: SparkSession): Dataset[T]
makeDataset creates an ExpressionEncoder (from DefinedByConstructorParams) and encodes elements of the input data to internal binary rows.
makeDataset then creates a LocalRelation logical operator. makeDataset requests SessionState to execute the plan and creates the result Dataset.
Note: makeDataset is used when CatalogImpl is requested to list databases, tables, functions and columns.
Refreshing Analyzed Logical Plan of Table Query and Re-Caching It — refreshTable Method
refreshTable(tableName: String): Unit
Note: refreshTable is part of the Catalog Contract to…FIXME.
refreshTable requests SessionState for the SQL parser to parse a TableIdentifier given the table name.
Note: refreshTable uses SparkSession to access the SessionState.
refreshTable requests SessionCatalog for the table metadata.
refreshTable then creates a DataFrame for the table name.
For a temporary or persistent VIEW table, refreshTable requests the analyzed logical plan of the DataFrame (for the table) to refresh itself.
For other types of table, refreshTable requests SessionCatalog to refresh the table metadata (i.e. invalidate the table).
If the table has been cached, refreshTable requests CacheManager to uncache and cache the table DataFrame again.
Note: refreshTable uses SparkSession to access the SharedState that is used to access CacheManager.
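A minimal usage sketch; refreshing is useful after the files backing a table (e.g. the readme external table from the createExternalTable example) changed outside Spark:
spark.catalog.refreshTable("readme")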
refreshByPath Method
refreshByPath(resourcePath: String): Unit
Note: refreshByPath is part of the Catalog Contract to…FIXME.
refreshByPath…FIXME
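A minimal usage sketch; the path is made up for the example and should point at the data files of a file-based table:
spark.catalog.refreshByPath("/tmp/readme-data")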
dropGlobalTempView Method
dropGlobalTempView(viewName: String): Boolean
Note: dropGlobalTempView is part of the Catalog contract.
dropGlobalTempView…FIXME
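A minimal usage sketch, assuming a SparkSession available as spark; the gv view is made up for the example:
spark.range(1).createGlobalTempView("gv")
assert(spark.catalog.dropGlobalTempView("gv"))
// A second drop returns false since the view no longer exists
assert(!spark.catalog.dropGlobalTempView("gv"))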
listColumns Internal Method
listColumns(tableIdentifier: TableIdentifier): Dataset[Column]
listColumns…FIXME
Note: listColumns is used exclusively when CatalogImpl is requested to listColumns.