DiskBlockManager

DiskBlockManager creates and maintains the mapping between logical blocks and their physical locations on disk.

By default, one block maps to one file whose name is given by its BlockId. It is, however, possible for a block to map to only a segment of a file.

Block files are hashed among the local directories.
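The hashing scheme can be sketched as follows. This is a standalone approximation (Spark's DiskBlockManager uses Utils.nonNegativeHash of the file name; a plain non-negative hashCode stands in for it here):

```scala
// Sketch of how a block file name is hashed to a local directory and a
// subdirectory. BlockFileHashing is an illustrative name, not a Spark type.
object BlockFileHashing {
  def locate(
      filename: String,
      numLocalDirs: Int,
      subDirsPerLocalDir: Int): (Int, Int) = {
    val hash = filename.hashCode & Integer.MAX_VALUE          // non-negative hash
    val dirId = hash % numLocalDirs                           // which local directory
    val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir // which subdirectory
    (dirId, subDirId)
  }
}
```

The same file name always hashes to the same (directory, subdirectory) pair, so lookups and writes agree on where a block file lives.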

Note
DiskBlockManager is used exclusively by DiskStore and created when BlockManager is created (and passed to DiskStore).
Table 1. DiskBlockManager’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

localDirs

Local directories for storing block data

localDirs is initialized using createLocalDirs.

Note
There has to be at least one local directory or DiskBlockManager cannot be created.

subDirsPerLocalDir

The value of spark.diskStore.subDirectories Spark configuration property (default: 64).



Tip

Enable INFO or DEBUG logging levels for org.apache.spark.storage.DiskBlockManager logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.storage.DiskBlockManager=DEBUG

Refer to Logging.

Finding File — getFile Method

Caution
FIXME

createTempShuffleBlock Method

createTempShuffleBlock(): (TempShuffleBlockId, File)

createTempShuffleBlock creates a temporary block (identified by a TempShuffleBlockId) together with the file to back it.

Caution
FIXME
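Although the internals are marked FIXME above, the retry-until-unique idea can be sketched as follows (TempShuffleBlockIdLike is an illustrative stand-in for Spark's TempShuffleBlockId, and the flat tempDir layout is an assumption for the sketch):

```scala
import java.io.File
import java.util.UUID

// Sketch: keep generating random block ids until the backing file does not
// exist yet, then hand back the id together with the (not-yet-created) file.
final case class TempShuffleBlockIdLike(id: UUID) {
  def name: String = s"temp_shuffle_$id"
}

def createTempShuffleBlock(tempDir: File): (TempShuffleBlockIdLike, File) = {
  var blockId = TempShuffleBlockIdLike(UUID.randomUUID())
  while (new File(tempDir, blockId.name).exists()) {
    blockId = TempShuffleBlockIdLike(UUID.randomUUID())
  }
  (blockId, new File(tempDir, blockId.name))
}
```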

getAllFiles Method

getAllFiles(): Seq[File]

getAllFiles…​FIXME

Note
getAllFiles is used exclusively when DiskBlockManager is requested to getAllBlocks.
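Given the two-dimensional subDirs layout described later in this page, getAllFiles amounts to flattening that structure and listing every file in it. A minimal sketch (taking subDirs as a parameter so it is self-contained):

```scala
import java.io.File

// Sketch: flatten the per-localDir arrays of subdirectories and list every
// block file. Slots that were never created (null) are skipped.
def getAllFiles(subDirs: Array[Array[File]]): Seq[File] =
  subDirs.toSeq.flatten
    .filter(dir => dir != null && dir.isDirectory)
    .flatMap(dir => Option(dir.listFiles()).toSeq.flatten)
```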

Creating DiskBlockManager Instance

DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean)

When created, DiskBlockManager uses spark.diskStore.subDirectories to set subDirsPerLocalDir.

DiskBlockManager creates one or more local directories to store block data (as localDirs). When unsuccessful, you should see the following ERROR message in the logs and DiskBlockManager exits with error code 53.

ERROR DiskBlockManager: Failed to create any local dir.

DiskBlockManager initializes the internal subDirs collection — one array of subDirsPerLocalDir entries per local directory — that holds (and synchronizes access to) the subdirectories where block files are stored.

In the end, DiskBlockManager registers a shutdown hook to clean up the local directories for blocks.

Registering Shutdown Hook — addShutdownHook Internal Method

addShutdownHook(): AnyRef

addShutdownHook registers a shutdown hook to execute doStop at shutdown.

When executed, you should see the following DEBUG message in the logs:

DEBUG DiskBlockManager: Adding shutdown hook

The registered shutdown hook prints the following INFO message and executes doStop.

INFO DiskBlockManager: Shutdown hook called

Stopping DiskBlockManager (Removing Local Directories for Blocks) — doStop Internal Method

doStop(): Unit

doStop deletes the local directories recursively (only when the constructor’s deleteFilesOnStop is enabled and the parent directories are not registered to be removed at shutdown).

Note
doStop is used when DiskBlockManager is requested to shut down or stop.

Getting Local Directories for Spark to Write Files — Utils.getConfiguredLocalDirs Internal Method

getConfiguredLocalDirs(conf: SparkConf): Array[String]

getConfiguredLocalDirs returns the local directories where Spark can write files.

Internally, getConfiguredLocalDirs uses conf SparkConf to know if External Shuffle Service is enabled (using spark.shuffle.service.enabled).

getConfiguredLocalDirs checks if Spark runs on YARN and if so, returns LOCAL_DIRS-controlled local directories.

In non-YARN mode (or for the driver in yarn-client mode), getConfiguredLocalDirs checks the following environment variables (in that order) and returns the value of the first one that is set:

  1. SPARK_EXECUTOR_DIRS environment variable

  2. SPARK_LOCAL_DIRS environment variable

  3. MESOS_DIRECTORY environment variable (only when External Shuffle Service is not used)

In the end, when no earlier environment variables were found, getConfiguredLocalDirs uses spark.local.dir Spark property or falls back on java.io.tmpdir System property.
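The precedence above can be sketched as a chain of fallbacks. This standalone sketch takes the environment and the configuration as plain maps (in Spark they come from the process environment and SparkConf):

```scala
// Sketch of the non-YARN local-directory resolution order:
// SPARK_EXECUTOR_DIRS, then SPARK_LOCAL_DIRS, then MESOS_DIRECTORY (only
// without the external shuffle service), then spark.local.dir, then
// java.io.tmpdir.
def configuredLocalDirs(
    env: Map[String, String],
    conf: Map[String, String],
    shuffleServiceEnabled: Boolean): String =
  env.get("SPARK_EXECUTOR_DIRS")
    .orElse(env.get("SPARK_LOCAL_DIRS"))
    .orElse(if (!shuffleServiceEnabled) env.get("MESOS_DIRECTORY") else None)
    .orElse(conf.get("spark.local.dir"))
    .getOrElse(System.getProperty("java.io.tmpdir"))
```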


Getting Writable Directories in YARN — getYarnLocalDirs Internal Method

getYarnLocalDirs(conf: SparkConf): String

getYarnLocalDirs uses conf SparkConf to read LOCAL_DIRS environment variable with comma-separated local directories (that have already been created and secured so that only the user has access to them).

getYarnLocalDirs throws an Exception with the message Yarn Local dirs can’t be empty if LOCAL_DIRS environment variable was not set.
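A minimal sketch of this behavior, with the environment passed in as a plain map so the example is self-contained:

```scala
// Sketch: read the comma-separated, YARN-provided directories and fail fast
// when LOCAL_DIRS is absent or empty.
def yarnLocalDirs(env: Map[String, String]): String = {
  val localDirs = env.getOrElse("LOCAL_DIRS", "")
  if (localDirs.isEmpty) {
    throw new Exception("Yarn Local dirs can't be empty")
  }
  localDirs
}
```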

Checking If Spark Runs on YARN — isRunningInYarnContainer Internal Method

isRunningInYarnContainer(conf: SparkConf): Boolean

isRunningInYarnContainer uses conf SparkConf to read Hadoop YARN’s CONTAINER_ID environment variable to find out if Spark runs in a YARN container.

Note
CONTAINER_ID environment variable is exported by YARN NodeManager.
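Since the NodeManager exports CONTAINER_ID into every container's environment, the check reduces to testing for that variable. A sketch (again taking the environment as a map):

```scala
// Sketch: presence of a non-empty CONTAINER_ID signals a YARN container.
def isRunningInYarnContainer(env: Map[String, String]): Boolean =
  env.get("CONTAINER_ID").exists(_.nonEmpty)
```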

Getting All Blocks Stored On Disk — getAllBlocks Method

getAllBlocks(): Seq[BlockId]

getAllBlocks gets all the blocks stored on disk.

Internally, getAllBlocks takes the block files and returns their names (as BlockId).

Note
getAllBlocks is used exclusively when BlockManager is requested to find IDs of existing blocks for a given filter.
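In other words, getAllBlocks is a thin mapping from file names back to block ids. A sketch, with BlockIdLike as an illustrative stand-in for Spark's BlockId (which parses names like rdd_[rddId]_[splitIndex]):

```scala
import java.io.File

// Sketch: every block file is named after its block id, so listing the files
// and taking their names recovers the ids.
final case class BlockIdLike(name: String)

def getAllBlocks(allFiles: Seq[File]): Seq[BlockIdLike] =
  allFiles.map(file => BlockIdLike(file.getName))
```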

Creating Local Directories for Storing Block Data — createLocalDirs Internal Method

createLocalDirs(conf: SparkConf): Array[File]

createLocalDirs creates blockmgr-[random UUID] directory under local directories to store block data.

Internally, createLocalDirs reads local writable directories and creates a subdirectory blockmgr-[random UUID] under every configured parent directory.

If successful, you should see the following INFO message in the logs:

INFO DiskBlockManager: Created local directory at [localDir]

When failed to create a local directory, you should see the following ERROR message in the logs:

ERROR DiskBlockManager: Failed to create local dir in [rootDir]. Ignoring this directory.
Note
createLocalDirs is used exclusively when localDirs is initialized.
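The per-root behavior — create a blockmgr-[random UUID] subdirectory, skip roots where creation fails — can be sketched like this:

```scala
import java.io.File
import java.util.UUID

// Sketch: create one blockmgr-<random UUID> directory under every configured
// root; roots where mkdirs fails are ignored (matching the ERROR message
// above, which says "Ignoring this directory").
def createLocalDirs(rootDirs: Seq[String]): Array[File] =
  rootDirs.flatMap { rootDir =>
    val localDir = new File(rootDir, s"blockmgr-${UUID.randomUUID()}")
    if (localDir.mkdirs()) Some(localDir) else None
  }.toArray
```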

stop Internal Method

stop(): Unit

stop…​FIXME

Note
stop is used exclusively when BlockManager is requested to stop.

File Locks for Local Block Store Directories — subDirs Internal Property

subDirs: Array[Array[File]]

subDirs is a two-dimensional array: one row per local block store directory where DiskBlockManager stores block data, each row holding subDirsPerLocalDir entries. The entries double as locks that guard lazy creation of the corresponding subdirectories.

Note
subDirs(n) gives the subdirectories of the n-th local directory.
Note
subDirs is used when DiskBlockManager is requested to getFile or getAllFiles.
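The lock-and-lazily-create pattern can be sketched as follows (SubDirs is an illustrative wrapper, not a Spark class; the two-character hex naming of subdirectories is an assumption of the sketch):

```scala
import java.io.File

// Sketch: each row of the two-dimensional array is both a cache of created
// subdirectories and the monitor that serializes their creation.
class SubDirs(localDirs: Array[File], subDirsPerLocalDir: Int) {
  private val subDirs: Array[Array[File]] =
    Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

  def getSubDir(dirId: Int, subDirId: Int): File =
    subDirs(dirId).synchronized {
      val existing = subDirs(dirId)(subDirId)
      if (existing != null) existing
      else {
        val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
        newDir.mkdirs()
        subDirs(dirId)(subDirId) = newDir
        newDir
      }
    }
}
```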

Settings

Table 2. Spark Properties
Spark Property Default Value Description

spark.diskStore.subDirectories

64

The number of …​FIXME
