ShuffleExternalSorter — Cache-Efficient Sorter

ShuffleExternalSorter is a specialized cache-efficient sorter that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, ShuffleExternalSorter can fit more of the array into cache.

ShuffleExternalSorter is a MemoryConsumer.

Table 1. ShuffleExternalSorter’s Internal Properties
Name Initial Value Description

inMemSorter

(empty)

ShuffleInMemorySorter

Tip

Enable INFO or ERROR logging levels for org.apache.spark.shuffle.sort.ShuffleExternalSorter logger to see what happens in ShuffleExternalSorter.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.shuffle.sort.ShuffleExternalSorter=INFO

Refer to Logging.

getMemoryUsage Method

Caution
FIXME

closeAndGetSpills Method

Caution
FIXME

insertRecord Method

Caution
FIXME

freeMemory Method

Caution
FIXME

getPeakMemoryUsedBytes Method

Caution
FIXME

writeSortedFile Method

Caution
FIXME

cleanupResources Method

Caution
FIXME

Creating ShuffleExternalSorter Instance

ShuffleExternalSorter takes the following when created:

  1. memoryManager — TaskMemoryManager

  2. blockManager — BlockManager

  3. taskContext — TaskContext

  4. initialSize

  5. numPartitions

  6. SparkConf

  7. writeMetrics — ShuffleWriteMetrics

ShuffleExternalSorter initializes itself as a MemoryConsumer (with pageSize as the minimum of PackedRecordPointer.MAXIMUM_PAGE_SIZE_BYTES and pageSizeBytes, and Tungsten memory mode).

ShuffleExternalSorter uses spark.shuffle.file.buffer (for fileBufferSizeBytes) and spark.shuffle.spill.numElementsForceSpillThreshold (for numElementsForSpillThreshold) Spark properties.

ShuffleExternalSorter creates a ShuffleInMemorySorter (with spark.shuffle.sort.useRadixSort Spark property enabled by default).

ShuffleExternalSorter initializes the internal registries and counters.

Note
ShuffleExternalSorter is created when UnsafeShuffleWriter is open (which is when UnsafeShuffleWriter is created).

Freeing Execution Memory by Spilling To Disk — spill Method

long spill(long size, MemoryConsumer trigger)
throws IOException
Note
spill is part of MemoryConsumer contract to sort and spill the current records due to memory pressure.

spill frees execution memory, updates TaskMetrics, and in the end returns the spill size.

Note
spill returns 0 when ShuffleExternalSorter has no ShuffleInMemorySorter or the ShuffleInMemorySorter manages no records.

You should see the following INFO message in the logs:

INFO Thread [id] spilling sort data of [memoryUsage] to disk ([size] times so far)

spill writes sorted file (with isLastFile disabled).

spill frees memory and records the spill size.

spill resets the internal ShuffleInMemorySorter (that in turn frees up the underlying in-memory pointer array).

spill returns the spill size.

results matching ""

    No results matching ""