Hive Data Source

Spark SQL supports Apache Hive through the Hive data source: it can execute structured queries on Hive tables, use a persistent Hive metastore, and supports Hive SerDes and user-defined functions.

Consult Demo: Connecting Spark SQL to Hive Metastore (with Remote Metastore Server) for a more hands-on walkthrough.

In order to use Hive-related features in a Spark SQL application, a SparkSession has to be created with Builder.enableHiveSupport.
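As a minimal sketch (the application name is an illustrative assumption; Hive classes must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Create a SparkSession with Hive support enabled.
// enableHiveSupport switches the catalog implementation to "hive",
// so queries go through the Hive metastore-backed catalog.
val spark = SparkSession.builder()
  .appName("hive-demo") // assumed name for illustration
  .enableHiveSupport()
  .getOrCreate()

// Tables registered in the Hive metastore are now visible
spark.sql("SHOW TABLES").show()
```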

Hive Data Source uses custom Spark SQL configuration properties (in addition to Hive’s).

Hive Data Source uses HiveTableRelation to represent Hive tables. HiveTableRelation can be converted to a HadoopFsRelation based on spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertMetastoreOrc properties (and "disappears" from a logical plan when enabled).
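A sketch of toggling the Parquet conversion at the session level (the table name is hypothetical; an active SparkSession with Hive support is assumed):

```scala
// Disable the conversion so a Hive Parquet table stays a
// HiveTableRelation in the logical plan
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// With the conversion enabled (the default), the same table would be
// planned as a HadoopFsRelation instead; compare the plans with EXPLAIN
spark.sql("EXPLAIN EXTENDED SELECT * FROM parquet_table")
  .show(truncate = false)
```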

Hive Data Source uses HiveSessionStateBuilder (to build a Hive-specific SessionState) and HiveExternalCatalog.

Hive Data Source uses HiveClientImpl for metadata and DDL operations (using calls to a Hive metastore).

Hive Configuration Properties

Hive uses configuration properties that can be defined in configuration files and as system properties (which take precedence).

In Spark applications, use --driver-java-options or --conf spark.hadoop.[property] to define Hive configuration properties.

$SPARK_HOME/bin/spark-shell \
  --jars \
    $HIVE_HOME/lib/guava-14.0.1.jar \
  --conf spark.sql.hive.metastore.version=2.3 \
  --conf spark.sql.hive.metastore.jars=$HIVE_HOME"/lib/*" \
  --conf spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse \
  --driver-java-options="-Dhive.exec.scratchdir=/tmp/hive-scratchdir" \
  --conf spark.hadoop.hive.exec.scratchdir=/tmp/hive-scratchdir-conf
Table 1. HiveConf’s Configuration Properties

hive.exec.log4j.file

Hive log4j configuration file for execution mode (sub-command).

If the property is not set, logging is initialized using hive-exec-log4j2.properties found on the classpath.

If the property is set, the value must be a valid URI (e.g. file:///tmp/my-logging.xml).

hive.exec.scratchdir

HDFS root scratch directory for Hive jobs, created with write-all (733) permission. For each connecting user, an HDFS scratch directory hive.exec.scratchdir/<username> is created with hive.scratch.dir.permission.

Default: /tmp/hive

hive.exec.stagingdir

Temporary staging directory that is created inside table locations in order to support HDFS encryption.

Default: .hive-staging

This replaces hive.exec.scratchdir for query results, with the exception of read-only tables. In all cases hive.exec.scratchdir is still used for other temporary files, such as job plans.

hive.log4j.file

Hive log4j configuration file.

If the property is not set, logging is initialized using hive-log4j2.properties found on the classpath.

If the property is set, the value must be a valid URI (e.g. file:///tmp/my-logging.xml).

hive.metastore.connect.retries

Number of retries while opening a connection to the metastore

Default: 3

hive.metastore.port

Port of the Hive metastore

Default: 9083

hive.metastore.uris

Comma-separated list of Thrift URIs of remote metastores that metastore clients (incl. Spark SQL applications) use to connect

hive.scratch.dir.permission

The permission for the user-specific scratch directories that get created

Default: 700

javax.jdo.option.ConnectionDriverName

Driver class name for a JDBC metastore

Default: org.apache.derby.jdbc.EmbeddedDriver

javax.jdo.option.ConnectionPassword

Password to use against the metastore database

Default: mine

javax.jdo.option.ConnectionURL

JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL (e.g. jdbc:postgresql://myhost/db?ssl=true for PostgreSQL).

Default: jdbc:derby:;databaseName=metastore_db;create=true

javax.jdo.option.ConnectionUserName

Username to use against the metastore database

Default: APP

Hive Configuration Files

HiveConf loads hive-default.xml when on the classpath.

HiveConf loads and prints out the location of the hive-site.xml configuration file (when found on the classpath, in the $HIVE_CONF_DIR or $HIVE_HOME/conf directories, or in the directory of the jar file with the HiveConf class).
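As an illustration, a minimal hive-site.xml for a local setup (the URI and directory are assumed example values):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Remote Hive metastore to connect to (example value) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <!-- Root scratch directory for Hive jobs (example value) -->
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-scratchdir</value>
  </property>
</configuration>
```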

Enable ALL logging level for the org.apache.hadoop.hive.conf.HiveConf logger in conf/log4j.properties to see when the configuration files are loaded.

Execute spark.sharedState.externalCatalog.getTable("default", "t1") to have the following INFO message in the logs:

Found configuration file [url]

Spark SQL loads hive-site.xml found in $SPARK_HOME/conf, while Hive loads it from $HIVE_CONF_DIR or $HIVE_HOME/conf. Make sure there are not two conflicting configuration files, which could lead to hard-to-diagnose issues at runtime.
