Hive Data Source

Spark SQL supports Apache Hive using Hive data source. Spark SQL allows executing structured queries on Hive tables, a persistent Hive metastore, support for Hive serdes and user-defined functions.

Tip	Consult Demo: Connecting Spark SQL to Hive Metastore (with Remote Metastore Server) to learn in a more practical approach.

In order to use Hive-related features in a Spark SQL application a SparkSession has to be created with Builder.enableHiveSupport.

Hive Data Source uses custom Spark SQL configuration properties (in addition to Hive’s).

Hive Data Source uses HiveTableRelation to represent Hive tables. HiveTableRelation can be converted to a HadoopFsRelation based on spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertMetastoreOrc properties (and "disappears" from a logical plan when enabled).

Hive Data Source uses HiveSessionStateBuilder (to build a Hive-specific SessionState) and HiveExternalCatalog.

Hive Data Source uses HiveClientImpl for meta data/DDL operations (using calls to a Hive metastore).

Hive Configuration Properties

Hive uses configuration properties that can be defined in configuration files and as System properties (that have a higher precedence).

In Spark applications, you could use --driver-java-options or --conf spark.hadoop.[property] to define Hive configuration properties.

$SPARK_HOME/bin/spark-shell \
  --jars \
    $HIVE_HOME/lib/hive-metastore-2.3.6.jar,\
    $HIVE_HOME/lib/hive-exec-2.3.6.jar,\
    $HIVE_HOME/lib/hive-common-2.3.6.jar,\
    $HIVE_HOME/lib/hive-serde-2.3.6.jar,\
    $HIVE_HOME/lib/guava-14.0.1.jar \
  --conf spark.sql.hive.metastore.version=2.3 \
  --conf spark.sql.hive.metastore.jars=$HIVE_HOME"/lib/*" \
  --conf spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse \
  --driver-java-options="-Dhive.exec.scratchdir=/tmp/hive-scratchdir" \
  --conf spark.hadoop.hive.exec.scratchdir=/tmp/hive-scratchdir-conf

Table 1. HiveConf’s Configuration Properties
Configuration Property
hive.enforce.bucketing
hive.enforce.sorting
hive.exec.compress.output Default: `false`
hive.exec.dynamic.partition
hive.exec.dynamic.partition.mode
hive.exec.log4j.file Hive log4j configuration file for execution mode(sub command). If the property is not set, then logging will be initialized using `hive-exec-log4j2.properties` found on the classpath. If the property is set, the value must be a valid URI (`java.net.URI`, e.g. `file:///tmp/my-logging.xml`).
hive.exec.scratchdir HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: `hive.exec.scratchdir`/<username> is created, with hive.scratch.dir.permission. Default: `/tmp/hive`
hive.exec.stagingdir Temporary staging directory that will be created inside table locations in order to support HDFS encryption. Default: `.hive-staging` This replaces hive.exec.scratchdir for query results with the exception of read-only tables. In all cases hive.exec.scratchdir is still used for other temporary files, such as job plans.
hive.log4j.file Hive log4j configuration file. If the property is not set, then logging will be initialized using `hive-log4j2.properties` found on the classpath. If the property is set, the value must be a valid URI (`java.net.URI`, e.g. `file:///tmp/my-logging.xml`).
hive.metastore.connect.retries Number of retries while opening a connection to metastore Default: `3`
hive.metastore.port Port of the Hive metastore Default: `9083`
hive.metastore.uris Comma-separated list of the thrift URIs of remote metastores that is used by metastore clients to connect (incl. Spark SQL applications)
hive.scratch.dir.permission The permission for the user specific scratch directories that get created. Default: `700`
javax.jdo.option.ConnectionDriverName Driver class name for a JDBC metastore Default: `org.apache.derby.jdbc.EmbeddedDriver`
javax.jdo.option.ConnectionPassword Password to use against metastore database Default: `mine`
javax.jdo.option.ConnectionURL JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL (`jdbc:postgresql://myhost/db?ssl=true` for postgres database). Default: `jdbc:derby:;databaseName=metastore_db;create=true`
javax.jdo.option.ConnectionUserName Username to use against metastore database Default: `APP`

Hive Configuration Files

HiveConf loads hive-default.xml when on the classpath.

HiveConf loads and prints out the location of hive-site.xml configuration file (when on the classpath, in $HIVE_CONF_DIR or $HIVE_HOME/conf directories, or in the directory with the jar file with HiveConf class).

Enable ALL logging level in conf/log4j.properties:

log4j.logger.org.apache.hadoop.hive=ALL

Execute the following spark.sharedState.externalCatalog.getTable("default", "t1") to have the following INFO message in the logs:

Found configuration file [url]

Important

Spark SQL loads hive-site.xml found in $SPARK_HOME/conf while Hive in $SPARK_HOME. Make sure there are no two configuration files that could lead to hard to diagnose issues at runtime.

Hive Data Source

Hive Data Source

Hive Configuration Properties

Hive Configuration Files

results matching ""

No results matching ""