Hive Data Source

Spark SQL supports Apache Hive through the Hive data source. It allows executing structured queries on Hive tables, persisting metadata in a Hive metastore, and using Hive SerDes and user-defined functions.

Tip
Consult Demo: Connecting Spark SQL to Hive Metastore (with Remote Metastore Server) for a more practical, hands-on approach.

In order to use Hive-related features in a Spark SQL application, a SparkSession has to be created with Builder.enableHiveSupport.
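
A minimal sketch in Scala (the application name is illustrative):

import org.apache.spark.sql.SparkSession

// Create a SparkSession with Hive support enabled
// (persistent Hive metastore, Hive SerDes and UDFs)
val spark = SparkSession.builder()
  .appName("hive-demo")
  .enableHiveSupport()
  .getOrCreate()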

Hive Data Source uses custom Spark SQL configuration properties (in addition to Hive’s).

Hive Data Source uses HiveTableRelation to represent Hive tables. HiveTableRelation can be converted to a HadoopFsRelation based on spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertMetastoreOrc properties (and "disappears" from a logical plan when enabled).
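
A hedged sketch to observe the conversion in spark-shell, assuming a Parquet-backed Hive table t1 exists:

// With the property enabled (the default), the Hive table is planned
// as a HadoopFsRelation
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.table("t1").queryExecution.optimizedPlan

// With the property disabled, HiveTableRelation stays in the plan
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.table("t1").queryExecution.optimizedPlan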

Hive Data Source uses HiveSessionStateBuilder (to build a Hive-specific SessionState) and HiveExternalCatalog.

Hive Data Source uses HiveClientImpl for metadata/DDL operations (using calls to a Hive metastore).
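
A quick sanity check in spark-shell (a sketch; the exact runtime type of the external catalog can vary across Spark versions):

// With Builder.enableHiveSupport, the catalog implementation is "hive"
spark.conf.get("spark.sql.catalogImplementation")

// The shared external catalog is backed by HiveExternalCatalog
spark.sharedState.externalCatalog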

Hive Configuration Properties

Hive uses configuration properties that can be defined in configuration files and as system properties (the latter take precedence).

In Spark applications, you can use --driver-java-options or --conf spark.hadoop.[property] to define Hive configuration properties:

$SPARK_HOME/bin/spark-shell \
  --jars \
    $HIVE_HOME/lib/hive-metastore-2.3.6.jar,\
    $HIVE_HOME/lib/hive-exec-2.3.6.jar,\
    $HIVE_HOME/lib/hive-common-2.3.6.jar,\
    $HIVE_HOME/lib/hive-serde-2.3.6.jar,\
    $HIVE_HOME/lib/guava-14.0.1.jar \
  --conf spark.sql.hive.metastore.version=2.3 \
  --conf spark.sql.hive.metastore.jars=$HIVE_HOME"/lib/*" \
  --conf spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse \
  --driver-java-options="-Dhive.exec.scratchdir=/tmp/hive-scratchdir" \
  --conf spark.hadoop.hive.exec.scratchdir=/tmp/hive-scratchdir-conf
Table 1. HiveConf’s Configuration Properties

hive.enforce.bucketing

hive.enforce.sorting

hive.exec.compress.output

Default: false

hive.exec.dynamic.partition

hive.exec.dynamic.partition.mode

hive.exec.log4j.file

Hive log4j configuration file for execution mode (sub-command).

If the property is not set, then logging will be initialized using hive-exec-log4j2.properties found on the classpath.

If the property is set, the value must be a valid URI (java.net.URI, e.g. file:///tmp/my-logging.xml).

hive.exec.scratchdir

HDFS root scratch directory for Hive jobs, created with write-all (733) permission. For each connecting user, an HDFS scratch directory hive.exec.scratchdir/<username> is created with hive.scratch.dir.permission.

Default: /tmp/hive

hive.exec.stagingdir

Temporary staging directory that will be created inside table locations in order to support HDFS encryption.

Default: .hive-staging

This replaces hive.exec.scratchdir for query results with the exception of read-only tables. In all cases hive.exec.scratchdir is still used for other temporary files, such as job plans.

hive.log4j.file

Hive log4j configuration file.

If the property is not set, then logging will be initialized using hive-log4j2.properties found on the classpath.

If the property is set, the value must be a valid URI (java.net.URI, e.g. file:///tmp/my-logging.xml).

hive.metastore.connect.retries

Number of retries while opening a connection to the metastore

Default: 3

hive.metastore.port

Port of the Hive metastore

Default: 9083

hive.metastore.uris

Comma-separated list of Thrift URIs of remote metastores that metastore clients (including Spark SQL applications) use to connect

hive.scratch.dir.permission

The permission for the user-specific scratch directories that get created.

Default: 700

javax.jdo.option.ConnectionDriverName

Driver class name for a JDBC metastore

Default: org.apache.derby.jdbc.EmbeddedDriver

javax.jdo.option.ConnectionPassword

Password to use against metastore database

Default: mine

javax.jdo.option.ConnectionURL

JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL (e.g. jdbc:postgresql://myhost/db?ssl=true for PostgreSQL).

Default: jdbc:derby:;databaseName=metastore_db;create=true

javax.jdo.option.ConnectionUserName

Username to use against metastore database

Default: APP
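
Hive properties defined with --conf spark.hadoop.[property] (as in the spark-shell example above) end up in the Hadoop configuration of the SparkContext and can be inspected at runtime; a sketch:

// Read a Hive property back from the Hadoop configuration
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.get("hive.exec.scratchdir")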

Hive Configuration Files

HiveConf loads hive-default.xml when it is available on the classpath.

HiveConf loads the hive-site.xml configuration file and prints out its location (when found on the classpath, in the $HIVE_CONF_DIR or $HIVE_HOME/conf directories, or in the directory of the jar file with the HiveConf class).

Enable the ALL logging level for org.apache.hadoop.hive in conf/log4j.properties:

log4j.logger.org.apache.hadoop.hive=ALL

Execute spark.sharedState.externalCatalog.getTable("default", "t1") to see the following INFO message in the logs:

Found configuration file [url]
Important
Spark SQL loads hive-site.xml found in $SPARK_HOME/conf, while Hive loads it from $SPARK_HOME. Make sure there are not two configuration files, as that could lead to hard-to-diagnose issues at runtime.
