Spark Configuration Properties

Table 1. Spark Configuration Properties

spark.default.parallelism

The default number of partitions to use for HashPartitioner (and hence for RDD transformations like reduceByKey when the number of partitions is not given explicitly).

When not set, spark.default.parallelism corresponds to the default parallelism of the scheduler backend: the number of CPU cores in local mode and, for coarse-grained cluster modes, the larger of the total number of CPU cores in the cluster and 2.
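
A minimal sketch in local mode (the app name is hypothetical): with no explicit numSlices, sc.parallelize falls back to spark.default.parallelism for the number of partitions.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("default-parallelism-demo") // hypothetical app name
      .setMaster("local[4]")
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // No numSlices given, so the default parallelism (8) is used.
    val rdd = sc.parallelize(1 to 100)
    assert(rdd.getNumPartitions == 8)

    sc.stop()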

spark.diskStore.subDirectories

The number of subdirectories that DiskBlockManager creates inside each local directory (spark.local.dir) for hashing block files into (to avoid very large directories)

Default: 64

spark.driver.maxResultSize

The maximum total size of the serialized results of all tasks in a TaskSet. 0 means unlimited.

Default: 1g
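
A minimal sketch of raising the limit (the 2g value is an arbitrary example):

    import org.apache.spark.SparkConf

    // Sketch: raise the cap on the total serialized result size the driver
    // accepts from a single action, e.g. a large collect().
    val conf = new SparkConf()
      .set("spark.driver.maxResultSize", "2g")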


spark.executor.extraClassPath

User-defined class path for executors, i.e. URLs representing user-defined class path entries that are added to an executor’s class path. URLs are separated by the system-dependent path separator, i.e. : on Unix-like systems and ; on Microsoft Windows.

Default: (empty)
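
A minimal sketch (the JAR paths are hypothetical; note the Unix-style path separator):

    import org.apache.spark.SparkConf

    // Sketch: add two (hypothetical) JARs to every executor's class path,
    // separated by ':' on Unix-like systems.
    val conf = new SparkConf()
      .set("spark.executor.extraClassPath", "/opt/libs/custom.jar:/opt/libs/extra.jar")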


spark.file.transferTo

Controls whether to use Java NIO’s zero-copy transferTo when merging shuffle spill files (instead of copying through an intermediate buffer)

Default: true

spark.launcher.port

(internal) Port of the LauncherServer that a Spark application launched via SparkLauncher connects back to (to report its state)

spark.launcher.secret

(internal) Secret that a Spark application uses to authenticate its connection to the LauncherServer

spark.locality.wait

How long to wait to launch a data-local task before giving up and launching it on a less-local node. Used for the PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL TaskLocalities when the corresponding locality-specific setting is not set.

Default: 3s

spark.locality.wait.node

Scheduling delay for NODE_LOCAL TaskLocality

Default: The value of spark.locality.wait

spark.locality.wait.process

Scheduling delay for PROCESS_LOCAL TaskLocality

Default: The value of spark.locality.wait

spark.locality.wait.rack

Scheduling delay for RACK_LOCAL TaskLocality

Default: The value of spark.locality.wait
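
As a sketch of how the per-level settings fall back to the base value:

    import org.apache.spark.SparkConf

    // Sketch: relax the base locality wait but keep PROCESS_LOCAL strict.
    // spark.locality.wait.node and spark.locality.wait.rack are not set,
    // so they fall back to the 10s base value.
    val conf = new SparkConf()
      .set("spark.locality.wait", "10s")
      .set("spark.locality.wait.process", "1s")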

spark.logging.exceptionPrintInterval

How frequently to reprint duplicate exceptions in full (in millis).

Default: 10000

spark.master

Master URL to connect a Spark application to

spark.scheduler.allocation.file

Path to the configuration file of FairSchedulableBuilder

Default: fairscheduler.xml (on a Spark application’s class path)

spark.scheduler.executorTaskBlacklistTime

How long to wait before a task can be re-launched on the executor where it once failed. This prevents repeated task failures caused by a misbehaving executor.

Default: 0L

spark.scheduler.mode

Scheduling Mode of the TaskSchedulerImpl, i.e. the case-insensitive name of the scheduling mode that TaskSchedulerImpl uses to choose between the available SchedulableBuilders for scheduling tasks (of the jobs submitted for execution to the same SparkContext)

Default: FIFO

Supported values:

  • FAIR for fair sharing (of cluster resources)

  • FIFO (default) for queueing jobs one after another

Task scheduling is an algorithm that is used to assign cluster resources (CPU cores and memory) to tasks (that are part of jobs with one or more stages). Fair sharing allows for executing tasks of different jobs at the same time (that were all submitted to the same SparkContext). In FIFO scheduling mode, a single SparkContext submits a single job for execution at a time (regardless of how many cluster resources the job really uses, which could lead to inefficient utilization of cluster resources and a longer overall execution of the Spark application).

Scheduling mode is particularly useful in multi-tenant environments in which a single SparkContext could be shared across different users (to make cluster resource utilization more efficient).

Tip
Use the web UI to check the current scheduling mode (the Environment tab under Spark Properties, and the Jobs tab as Scheduling Mode).
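
A minimal sketch of enabling fair sharing and routing jobs to a pool. The app name and the pool name "reports" are hypothetical; the pool would be defined in the file given by spark.scheduler.allocation.file.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable FAIR scheduling mode for all jobs of this SparkContext.
    val conf = new SparkConf()
      .setAppName("fair-scheduling-demo") // hypothetical app name
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the (hypothetical) "reports" pool.
    sc.setLocalProperty("spark.scheduler.pool", "reports")
    sc.parallelize(1 to 100).count()

    // Back to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)
    sc.stop()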

spark.shuffle.file.buffer

Size of the in-memory buffer for each shuffle file output stream, in KiB unless otherwise specified. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.

Default: 32k

Must be greater than 0 and less than or equal to 2097151 ((Integer.MAX_VALUE - 15) / 1024)
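
A sketch of doubling the buffer (64k is an arbitrary example value within the allowed range):

    import org.apache.spark.SparkConf

    // Sketch: a larger in-memory buffer means fewer disk seeks and system
    // calls per shuffle file output stream, at the cost of more memory.
    val conf = new SparkConf()
      .set("spark.shuffle.file.buffer", "64k")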

spark.shuffle.manager

Specifies the fully-qualified class name or the alias of the ShuffleManager in a Spark application

Default: sort

The supported aliases:

  • sort

  • tungsten-sort

Used exclusively when the SparkEnv object is requested to create a "base" SparkEnv for the driver or an executor

spark.shuffle.minNumPartitionsToHighlyCompress

(internal) The minimum number of partitions (threshold) above which MapStatus creates a HighlyCompressedMapStatus (instead of a CompressedMapStatus) when requested to create one (for ShuffleWriters).

Default: 2000

Must be a positive integer (above 0)

spark.shuffle.sort.initialBufferSize

Initial buffer size for sorting

Default: 4096

Used exclusively when UnsafeShuffleWriter is requested to open (and creates a ShuffleExternalSorter)

spark.shuffle.sync

Controls whether DiskBlockObjectWriter should force outstanding writes to disk when committing a single atomic block, i.e. whether all operating system buffers should be synchronized with the disk to ensure that all changes to a file are in fact recorded in storage.

Default: false

Used exclusively when BlockManager is requested to get a DiskBlockObjectWriter

spark.shuffle.unsafe.file.output.buffer

Size of the buffered output stream that UnsafeShuffleWriter uses to write the shuffle output file after each partition is written, in KiB unless otherwise specified.

Default: 32k

Must be greater than 0 and less than or equal to 2097151 ((Integer.MAX_VALUE - 15) / 1024)

spark.starvation.timeout

Threshold above which Spark warns a user that an initial TaskSet may be starved

Default: 15s

spark.storage.exceptionOnPinLeak

Controls whether to throw an exception (true) or only log a warning (false) when a task finishes with block pins (read locks) that it has not released

Default: false

spark.task.cpus

The number of CPU cores to allocate for (schedule to) a single task

Default: 1


spark.task.maxFailures

The number of failures of a single task (within a TaskSet) before giving up on the entire TaskSet and, consequently, the job

Default: 4
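
A sketch combining the two task-level settings above (the values are arbitrary examples):

    import org.apache.spark.SparkConf

    // Sketch: reserve 2 CPU cores per task (e.g. for multi-threaded native
    // code) and tolerate up to 4 failures of a single task before failing
    // the job.
    val conf = new SparkConf()
      .set("spark.task.cpus", "2")
      .set("spark.task.maxFailures", "4")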

spark.unsafe.exceptionOnMemoryLeak

Controls whether to throw an exception (true) or only log a warning (false) when a task finishes without releasing all of its managed memory (a memory leak detected by TaskMemoryManager)

Default: false
