Configuration Properties

The following are the properties that you can use to fine-tune Spark Structured Streaming applications.

You can set them on a SparkSession at instantiation time using the config method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.streaming.metricsEnabled", true)
  .getOrCreate
Table 1. Structured Streaming’s Properties
Name / Default Value Description

spark.sql.streaming.aggregation.stateFormatVersion

2

(internal) State format version used by streaming aggregation operations in a streaming query.

Supported values: 1 or 2

State between versions tends to be incompatible, so the state format version should not be modified after a query has been run.

spark.sql.streaming.checkpointLocation

(empty)

Default checkpoint directory for storing checkpoint data for streaming queries
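For example, a default checkpoint directory could be set at session level so every streaming query in the session checkpoints there unless overridden per query (a sketch; the path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: set a session-wide default checkpoint directory for streaming queries.
// The local path below is illustrative only; use a reliable (e.g. HDFS) path in production.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("Checkpointed Streams")
  .config("spark.sql.streaming.checkpointLocation", "/tmp/streaming-checkpoints")
  .getOrCreate()
```

A per-query `checkpointLocation` option on `DataStreamWriter` takes precedence over this session-level default.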

spark.sql.streaming.disabledV2MicroBatchReaders

(empty)

(internal) A comma-separated list of fully-qualified class names of data source providers for which MicroBatchReadSupport is disabled. Reads from these sources will fall back to the V1 Sources.

spark.sql.streaming.disabledV2MicroBatchReaders=org.apache.spark.sql.kafka010.KafkaSourceProvider

Use SQLConf.disabledV2StreamingMicroBatchReaders to get the current value.

spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion

2

(internal) State format version used by flatMapGroupsWithState operation in a streaming query.

Supported values:

  • 1

  • 2

spark.sql.streaming.maxBatchesToRetainInMemory

2

(internal) The maximum number of batches retained in memory to avoid loading state from files.

In other words, the maximum count of state versions a State Store implementation should retain in memory.

The value adjusts the trade-off between memory usage and cache misses:

  • 2 covers both the success and the direct failure case

  • 1 covers only the success case

  • 0 or a negative value disables the cache entirely, freeing up executor memory

Used exclusively when HDFSBackedStateStoreProvider is requested to initialize.
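The retention behaviour can be sketched in plain Scala. This is a conceptual illustration of keeping at most N recent state versions in memory, not HDFSBackedStateStoreProvider's actual code; all names here are hypothetical:

```scala
import scala.collection.mutable

// Hypothetical sketch of the spark.sql.streaming.maxBatchesToRetainInMemory
// trade-off: retain at most `maxVersionsToRetain` most recent state versions,
// evicting the oldest. Not Spark's actual internals.
class VersionCache[S](maxVersionsToRetain: Int) {
  private val loadedVersions = mutable.TreeMap.empty[Long, S]

  def put(version: Long, state: S): Unit = {
    if (maxVersionsToRetain <= 0) return // 0 or negative disables caching
    loadedVersions.put(version, state)
    while (loadedVersions.size > maxVersionsToRetain) {
      loadedVersions.remove(loadedVersions.firstKey) // evict oldest version
    }
  }

  def get(version: Long): Option[S] = loadedVersions.get(version)

  def size: Int = loadedVersions.size
}
```

With the default of 2, a lookup of the previous version (the "direct failure" case, where a batch is retried) is served from memory; older versions have to be reloaded from checkpoint files.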

spark.sql.streaming.metricsEnabled

false

Flag that controls whether Dropwizard/CodaHale metrics are reported for active streaming queries

Use SQLConf.streamingMetricsEnabled to get the current value.

spark.sql.streaming.minBatchesToRetain

100

(internal) The minimum number of batches that must be retained and made recoverable.

Used…FIXME

spark.sql.streaming.multipleWatermarkPolicy

min

Policy to calculate the global watermark value when there are multiple watermark operators in a streaming query.

Supported values:

  • min (default) - chooses the minimum watermark reported across multiple operators

  • max - chooses the maximum watermark reported across multiple operators

Note
This configuration cannot be changed between query restarts from the same checkpoint location.
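The two policies can be illustrated in plain Scala. This is a conceptual sketch of how a global watermark would be derived from per-operator watermarks, not Spark's implementation:

```scala
// Conceptual sketch of spark.sql.streaming.multipleWatermarkPolicy.
// Operator watermarks are event-time values in milliseconds.
def globalWatermark(operatorWatermarks: Seq[Long], policy: String): Long =
  policy match {
    case "min" => operatorWatermarks.min // default: safest, advances slowest
    case "max" => operatorWatermarks.max // advances fastest, may drop more late data
    case other => throw new IllegalArgumentException(s"Unknown policy: $other")
  }
```

With min, no operator's data is considered late until the slowest watermark passes it; with max, the global watermark follows the fastest-advancing operator.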

spark.sql.streaming.numRecentProgressUpdates

100

Number of progress updates to retain for a streaming query

spark.sql.streaming.pollingDelay

10

(internal) Time delay (in ms) before StreamExecution polls for new data when no data was available in a batch.

spark.sql.streaming.stateStore.maintenanceInterval

60s

The initial delay and the interval between executions of StateStore's maintenance task.

spark.sql.streaming.stateStore.providerClass

org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider

(internal) The fully-qualified class name of the StateStoreProvider implementation that manages state data in stateful streaming queries. This class must have a zero-arg constructor.

Use SQLConf.stateStoreProviderClass to get the current value.

spark.sql.streaming.unsupportedOperationCheck

true

(internal) When enabled (true), StreamingQueryManager makes sure that the logical plan of a streaming query uses supported operations only.
