```
$SPARK_HOME/bin/spark-shell \
  --jars \
  $HIVE_HOME/lib/hive-metastore-2.3.6.jar,\
  $HIVE_HOME/lib/hive-exec-2.3.6.jar,\
  $HIVE_HOME/lib/hive-common-2.3.6.jar,\
  $HIVE_HOME/lib/hive-serde-2.3.6.jar,\
  $HIVE_HOME/lib/guava-14.0.1.jar \
  --conf spark.sql.hive.metastore.version=2.3 \
  --conf spark.sql.hive.metastore.jars=$HIVE_HOME"/lib/*" \
  --conf spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse \
  --driver-java-options="-Dhive.exec.scratchdir=/tmp/hive-scratchdir" \
  --conf spark.hadoop.hive.exec.scratchdir=/tmp/hive-scratchdir-conf
```
Hive Data Source
Spark SQL supports Apache Hive through the Hive data source: executing structured queries over Hive tables, using a persistent Hive metastore, Hive SerDes, and Hive user-defined functions.
Tip: Consult Demo: Connecting Spark SQL to Hive Metastore (with Remote Metastore Server) for a more hands-on walk-through.
In order to use Hive-related features in a Spark SQL application, a SparkSession has to be created with Builder.enableHiveSupport.
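A minimal sketch of such a session follows; the application name and warehouse location are example values only (the warehouse path simply mirrors the spark-shell command above):

```scala
import org.apache.spark.sql.SparkSession

// Example values only: appName and the warehouse directory are illustrative
val spark = SparkSession.builder()
  .appName("hive-enabled-session")
  .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
  .enableHiveSupport() // switches the catalog implementation to Hive
  .getOrCreate()

// With Hive support enabled, the catalog implementation should report "hive"
println(spark.conf.get("spark.sql.catalogImplementation"))
```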
Hive Data Source uses custom Spark SQL configuration properties (in addition to Hive’s).
Hive Data Source uses HiveTableRelation to represent Hive tables. HiveTableRelation can be converted to a HadoopFsRelation based on the spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertMetastoreOrc properties (and "disappears" from a logical plan when the conversion is enabled).
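To see the conversion from spark-shell, you can toggle the property and compare query plans; this is a sketch that assumes a Parquet-backed Hive table named t1 exists (the same table name is used in the logging example later on this page):

```scala
// Both conversion properties are regular (session-scoped) SQL properties
spark.conf.get("spark.sql.hive.convertMetastoreParquet")
spark.conf.get("spark.sql.hive.convertMetastoreOrc")

// With the conversion enabled (the default for Parquet), the optimized plan
// shows a file-based relation instead of HiveTableRelation
spark.table("t1").explain(true)

// Disable the Parquet conversion for this session and compare:
// the plan now keeps HiveTableRelation (the Hive SerDe is used to read the table)
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.table("t1").explain(true)
```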
Hive Data Source uses HiveSessionStateBuilder (to build a Hive-specific SessionState) and HiveExternalCatalog.
Hive Data Source uses HiveClientImpl for metadata/DDL operations (using calls to a Hive metastore).
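From spark-shell you can peek at these layers directly; the exact (wrapper) class names in the output vary across Spark versions, so the comments below are indicative only:

```scala
// The session catalog is Hive-aware (HiveSessionCatalog) when the session state
// was built by HiveSessionStateBuilder
println(spark.sessionState.catalog.getClass.getName)

// The external catalog is backed by the Hive metastore; depending on the Spark
// version, HiveExternalCatalog may be wrapped in an ExternalCatalogWithListener
println(spark.sharedState.externalCatalog.getClass.getName)
```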
Hive Configuration Properties
Hive uses configuration properties that can be defined in configuration files or as Java system properties (system properties take precedence).
In Spark applications, you can use --driver-java-options or --conf spark.hadoop.[property] to define Hive configuration properties, as the spark-shell command at the top of this page and the sketch below demonstrate.
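The same properties can also be set programmatically when building the SparkSession; this sketch mirrors the spark-shell command above (the scratch-dir path is an example value, and spark.hadoop.* settings have to be in place before the underlying SparkContext is created):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  // Equivalent of --conf spark.hadoop.hive.exec.scratchdir=... on the command line
  .config("spark.hadoop.hive.exec.scratchdir", "/tmp/hive-scratchdir-conf")
  .getOrCreate()
```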
Configuration Property | Description
---|---
`hive.exec.log4j.file` | Hive log4j configuration file for execution mode (sub command). If the property is not set, logging is initialized using `hive-exec-log4j2.properties` found on the classpath. If the property is set, the value must be a valid URI (for example, `file:///tmp/my-logging.xml`). Default: (not set)
`hive.exec.scratchdir` | HDFS root scratch dir for Hive jobs which gets created with write-all (733) permission. For each connecting user, an HDFS scratch dir `${hive.exec.scratchdir}/<username>` is created. Default: `/tmp/hive`
`hive.exec.stagingdir` | Temporary staging directory that will be created inside table locations in order to support HDFS encryption. This replaces `hive.exec.scratchdir` for query results with the exception of read-only tables. In all cases `hive.exec.scratchdir` is still used for other temporary files, such as job plans. Default: `.hive-staging`
`javax.jdo.option.ConnectionURL` | JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL. Default: `jdbc:derby:;databaseName=metastore_db;create=true`
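To check which of these values actually reached Spark, you can inspect the Hadoop configuration of the running session; this sketch assumes the session from the examples above (get returns null for properties that were never set via spark.hadoop.*, hive-site.xml, or similar):

```scala
// The Hadoop configuration is what the embedded Hive client eventually sees
val hadoopConf = spark.sparkContext.hadoopConfiguration

println(hadoopConf.get("hive.exec.scratchdir"))            // e.g. /tmp/hive-scratchdir-conf
println(hadoopConf.get("javax.jdo.option.ConnectionURL"))  // null unless set explicitly
```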
Hive Configuration Files
HiveConf loads hive-default.xml when it is available on the classpath.

HiveConf loads (and prints out the location of) the hive-site.xml configuration file when it is available on the classpath, in the $HIVE_CONF_DIR or $HIVE_HOME/conf directories, or in the directory with the jar file that contains the HiveConf class.
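A quick way to find out which hive-site.xml (if any) would be picked up from the classpath is a plain classloader lookup; this sketch works from spark-shell and only uses the standard Java API:

```scala
// Returns the URL of the first hive-site.xml on the classpath, or null if there is none
val hiveSiteUrl = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
println(Option(hiveSiteUrl).map(_.toString).getOrElse("hive-site.xml not found on the classpath"))
```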
Enable ALL logging level in conf/log4j.properties:

```
log4j.logger.org.apache.hadoop.hive=ALL
```
Execute, for example, spark.sharedState.externalCatalog.getTable("default", "t1") to see the following INFO message in the logs:

```
Found configuration file [url]
```
Important: Spark SQL loads hive-site.xml from $SPARK_HOME/conf, while Hive loads it from $HIVE_CONF_DIR or $HIVE_HOME/conf. Make sure there are not two different configuration files, as that could lead to hard-to-diagnose issues at runtime.