MapStatus Contract — Shuffle Map Output Status

MapStatus is the abstraction of shuffle map output statuses that are metadata of shuffle map outputs with estimated size for the reduce block and block location.

After a ShuffleMapTask has finished execution successfully, DAGScheduler is requested to handleTaskCompletion (of the ShuffleMapTask) that requests the MapOutputTrackerMaster to register the MapStatus.

Table 1. MapStatus Contract
Method Description

getSizeForBlock

getSizeForBlock(reduceId: Int): Long

Estimated size for the reduce block (in bytes)

Used when:

location

location: BlockManagerId

Block location, i.e. the BlockManager where a ShuffleMapTask ran and the result is stored.

Used when:

Table 2. MapStatuses
MapStatus Description

CompressedMapStatus

Default MapStatus that compresses the estimated map output size to 8 bits (Byte) for efficient reporting

HighlyCompressedMapStatus

Stores the average size of non-empty blocks, and a compressed bitmap for tracking which blocks are empty. Used when the number of partitions is above the spark.shuffle.minNumPartitionsToHighlyCompress threshold

MapStatus object uses spark.shuffle.minNumPartitionsToHighlyCompress internal configuration property for the minimum number of partitions threshold to create a HighlyCompressedMapStatus when requested to create a MapStatus.

Creating MapStatus — apply Factory Method

apply(
  loc: BlockManagerId,
  uncompressedSizes: Array[Long]): MapStatus

apply creates a concrete MapStatus per the size of the given uncompressedSizes array:

Note

apply is used when:

results matching ""

    No results matching ""