import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
.master("local[*]")
.appName("My Spark Application")
.config("spark.sql.warehouse.dir", "c:/Temp") (1)
.getOrCreate
Configuration Properties
Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.
You can set a configuration property in a SparkSession while creating a new instance using config method.
-
Sets spark.sql.warehouse.dir for the Spark SQL session
You can also set a property using SQL SET
command.
scala> spark.conf.getOption("spark.sql.hive.metastore.version")
res1: Option[String] = None
scala> spark.sql("SET spark.sql.hive.metastore.version=2.3.2").show(truncate = false)
+--------------------------------+-----+
|key |value|
+--------------------------------+-----+
|spark.sql.hive.metastore.version|2.3.2|
+--------------------------------+-----+
scala> spark.conf.get("spark.sql.hive.metastore.version")
res2: String = 2.3.2
Configuration Property | ||
---|---|---|
Enables adaptive query execution Default: Use SQLConf.adaptiveExecutionEnabled method to access the current value. |
||
(internal) The advisory minimal number of post-shuffle partitions for ExchangeCoordinator. Default: This setting is used in Spark SQL tests to have enough parallelism to expose issues that will not be exposed with a single partition. Only positive values are used. Use SQLConf.minNumPostShufflePartitions method to access the current value. |
||
Recommended size of the input data of a post-shuffle partition (in bytes) Default: Use SQLConf.targetPostShuffleInputSize method to access the current value. |
||
Maximum size (in bytes) for a table that will be broadcast to all worker nodes when performing a join. Default: If the size of the statistics of the logical plan of a table is at most the setting, the DataFrame is broadcast for join. Negative values or Use SQLConf.autoBroadcastJoinThreshold method to access the current value. |
||
The compression codec to use when writing Avro data to disk Default: The supported codecs are:
Use SQLConf.avroCompressionCodec method to access the current value. |
||
Timeout in seconds for the broadcast wait time in broadcast joins. Default: When negative, it is assumed infinite (i.e. Use SQLConf.broadcastTimeout method to access the current value. |
||
(internal) Controls whether the query analyzer should be case sensitive ( Default: It is highly discouraged to turn on case sensitive mode. Use SQLConf.caseSensitiveAnalysis method to access the current value. |
||
Enables cost-based optimization (CBO) for estimation of plan statistics when Default: Use SQLConf.cboEnabled method to access the current value. |
||
Enables join reorder for cost-based optimization (CBO). Default: Use SQLConf.joinReorderEnabled method to access the current value. |
||
Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization. Default: Use SQLConf.starSchemaDetection method to access the current value. |
||
Controls whether Default: |
||
(internal) Determines the codegen generator fallback behavior Default: Acceptable values: Used when |
||
(internal) Whether the whole stage codegen could be temporary disabled for the part of a query that has failed to compile generated code ( Default: Use SQLConf.wholeStageFallback method to access the current value. |
||
(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. Default: The default value Use SQLConf.hugeMethodLimit method to access the current value. |
||
(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix ( Default: Use SQLConf.wholeStageUseIdInClassName method to access the current value. |
||
(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen. Default: Use SQLConf.wholeStageMaxNumFields method to access the current value. |
||
(internal) Controls whether whole stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid oversized function that can miss the opportunity of JIT optimization. Default: Use SQLConf.wholeStageSplitConsumeFuncByOperator method to access the current value. |
||
(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method ( Default: Use SQLConf.wholeStageEnabled method to access the current value. |
||
(internal) Enables OffHeapColumnVector in ColumnarBatch ( Default: Use SQLConf.offHeapColumnVectorEnabled method to access the current value. |
||
(internal) When true, the query optimizer will infer and propagate data constraints in the query plan to optimize them. Constraint propagation can sometimes be computationally expensive for certain kinds of query plans (such as those with a large number of predicates and aliases) which might negatively impact overall runtime. Default: Use SQLConf.constraintPropagationEnabled method to access the current value. |
||
(internal) Estimated size of a table or relation used in query planning Default: Java’s Set to Java’s Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast. Use SQLConf.defaultSizeInBytes method to access the current value. |
||
(internal) When enabled (i.e. Default:
Use SQLConf.exchangeReuseEnabled method to access the current value. |
||
Enables ObjectHashAggregateExec when Aggregation execution planning strategy is executed. Default: Use SQLConf.useObjectHashAggregation method to access the current value. |
||
Controls whether to ignore corrupt files ( Default: Use SQLConf.ignoreCorruptFiles method to access the current value. |
||
Controls whether to ignore missing files ( Default: Use SQLConf.ignoreMissingFiles method to access the current value. |
||
The maximum number of bytes to pack into a single partition when reading files. Default: Use SQLConf.filesMaxPartitionBytes method to access the current value. |
||
(internal) The estimated cost to open a file, measured by the number of bytes could be scanned at the same time (to include multiple files into a partition). Default: It’s better to over estimate it, then the partitions with small files will be faster than partitions with bigger files (which is scheduled first). Use SQLConf.filesOpenCostInBytes method to access the current value. |
||
Maximum number of records to write out to a single file. If this value is Default: Use SQLConf.maxRecordsPerFile method to access the current value. |
||
(internal) Controls…FIXME Default: Use SQLConf.columnBatchSize method to access the current value. |
||
(internal) Controls…FIXME Default: Use SQLConf.useCompression method to access the current value. |
||
Enables vectorized reader for columnar caching. Default: Use SQLConf.cacheVectorizedReaderEnabled method to access the current value. |
||
(internal) Enables partition pruning for in-memory columnar tables Default: Use SQLConf.inMemoryPartitionPruning method to access the current value. |
||
(internal) Controls whether JoinSelection execution planning strategy prefers sort merge join over shuffled hash join. Default: Use SQLConf.preferSortMergeJoin method to access the current value. |
||
(internal) Enables propagation of SQL configurations when executing operations on the RDD that represents a structured query. This is the (buggy) behavior up to 2.4.4. Default: This is for cases not tracked by SQL execution, when a This config is deprecated and will be removed in 3.0.0. |
||
Enables resolving (mapping) the data source provider Default: Use SQLConf.replaceDatabricksSparkAvroEnabled method to access the current value. |
||
(internal) Minimal increase rate in the number of partitions between attempts when executing Default: Use SQLConf.limitScaleUpFactor method to access the current value. |
||
Comma-separated list of optimization rule names that should be disabled (excluded) in the optimizer. The optimizer will log the rules that have indeed been excluded. Default:
Use SQLConf.optimizerExcludedRules method to access the current value. |
||
(internal) The threshold of set size for Default: Use SQLConf.optimizerInSetConversionThreshold method to access the current value. |
||
(internal) When Default: |
||
Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. Default: Use SQLConf.isParquetBinaryAsString method to access the current value. |
||
The number of rows to include in a parquet vectorized reader batch (the capacity of VectorizedParquetRecordReader). Default: The number should be carefully chosen to minimize overhead and avoid OOMs while reading data. Use SQLConf.parquetVectorizedReaderBatchSize method to access the current value. |
||
Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision lost of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems. Default: Use SQLConf.isParquetINT96AsTimestamp method to access the current value. |
||
Enables vectorized parquet decoding. Default: Use SQLConf.parquetVectorizedReaderEnabled method to access the current value. |
||
Controls the filter predicate push-down optimization for data sources using parquet file format Default: Use SQLConf.parquetFilterPushDown method to access the current value. |
||
(internal) Enables parquet filter push-down optimization for Date (when spark.sql.parquet.filterPushdown is enabled) Default: Use SQLConf.parquetFilterPushDownDate method to access the current value. |
||
Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala. Default: This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark. Use SQLConf.isParquetINT96TimestampConversion method to access the current value. |
||
Enables Parquet’s native record-level filtering using the pushed down filters. Default:
Use SQLConf.parquetRecordFilterEnabled method to access the current value. |
||
Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions. Default: Use SQLConf.supportQuotedRegexColumnName method to access the current value. |
||
(internal) Controls whether to use radix sort ( Default: Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more). Use SQLConf.enableRadixSort method to access the current value. |
||
(internal) Fully-qualified class name of the Default: SQLHadoopMapReduceCommitProtocol Use SQLConf.fileCommitProtocolClass method to access the current value. |
||
Enables dynamic partition inserts when dynamic Default: When The default (STATIC) is to keep the same behavior of Spark prior to 2.3. Note that this config doesn’t affect Hive serde tables, as they are always overwritten with dynamic mode. Use SQLConf.partitionOverwriteMode method to access the current value. |
||
Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column) Default: Use SQLConf.dataFramePivotMaxValues method to access the current value. |
||
Regular expression to find options of a Spark SQL command with sensitive information Default: The values of the options matched will be redacted in the explain output. This redaction is applied on top of the global redaction configuration defined by Used exclusively when |
||
Regular expression to point at sensitive information in text output Default: When this regex matches a string part, it is replaced by a dummy value (i.e.
Use SQLConf.stringRedactionPattern method to access the current value. |
||
Controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators). Default: Use SQLConf.dataFrameRetainGroupColumns method to access the current value. |
||
(internal) Controls whether Spark SQL could use Default: Use SQLConf.runSQLonFile method to access the current value. |
||
Controls whether to resolve ambiguity in join conditions for self-joins automatically ( Default: |
||
The ID of session-local timezone, e.g. "GMT", "America/Los_Angeles", etc. Default: Java’s Use SQLConf.sessionLocalTimeZone method to access the current value. |
||
Number of partitions to use by default when shuffling data for joins or aggregations Default: Corresponds to Apache Hive’s mapred.reduce.tasks property that Spark considers deprecated. Use SQLConf.numShufflePartitions method to access the current value. |
||
Enables bucketing support. When disabled (i.e. Default: Use SQLConf.bucketingEnabled method to access the current value. |
||
Defines the default data source to use for DataFrameReader. Default: Used when:
|
||
Enables automatic calculation of table size statistic by falling back to HDFS if the table statistics are not available from table metadata. Default: This can be useful in determining if a table is small enough for auto broadcast joins in query planning. Use SQLConf.fallBackToHdfsForStatsEnabled method to access the current value. |
||
Enables generating histograms when computing column statistics Default:
Use SQLConf.histogramEnabled method to access the current value. |
||
(internal) The number of bins when generating histograms. Default:
Use SQLConf.histogramNumBins method to access the current value. |
||
(internal) Enables parallel file listing in SQL commands, e.g. Default: Use SQLConf.parallelFileListingInStatsComputation method to access the current value. |
||
Enables automatic update of the table size statistic of a table after the table has changed. Default:
Use SQLConf.autoSizeUpdateEnabled method to access the current value. |
||
(internal) Enables subexpression elimination Default: Use subexpressionEliminationEnabled method to access the current value. |
||
(internal) Disables setting back original permission and ACLs when re-creating the table/partition paths for TRUNCATE TABLE command. Default: Use truncateTableIgnorePermissionAcl method to access the current value. |
||
The number of SQLExecutionUIData entries to keep in Default: When a query execution finishes, the execution is removed from the internal |
||
(internal) Threshold for number of rows guaranteed to be held in memory by WindowExec physical operator. Default: Use windowExecBufferInMemoryThreshold method to access the current value. |
||
(internal) Threshold for number of rows buffered in a WindowExec physical operator. Default: Use windowExecBufferSpillThreshold method to access the current value. |