A TaskSet is a collection of tasks that belong to a single stage and a single stage attempt. It also has priority and properties attributes. The priority is used in FIFO scheduling mode (see Priority Field and FIFO Scheduling), while properties are the properties of the first job in the stage.

A TaskSet represents the missing partitions of a stage.

The pair of a stage id and a stage attempt id uniquely identifies a TaskSet, and that is what you see in the logs whenever a TaskSet is mentioned:

TaskSet [stageId].[stageAttemptId]
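
To make the naming concrete, here is a minimal sketch of how such an identifier could be built from the two ids. `TaskSetSketch` and `TaskSetIdDemo` are hypothetical names used only for illustration, and the `tasks` field is a placeholder; this is not Spark's actual `TaskSet` class.

```scala
import java.util.Properties

// Hypothetical, simplified stand-in for a TaskSet (NOT Spark's real class).
final case class TaskSetSketch(
    tasks: Seq[String],      // placeholder for the tasks, one per missing partition
    stageId: Int,
    stageAttemptId: Int,
    priority: Int,
    properties: Properties) {

  // The stage id and the stage attempt id together identify the task set.
  val id: String = s"$stageId.$stageAttemptId"

  // Matches the log format shown above: TaskSet [stageId].[stageAttemptId]
  override def toString: String = s"TaskSet $id"
}

object TaskSetIdDemo {
  def main(args: Array[String]): Unit = {
    val ts = TaskSetSketch(Seq("task-0", "task-1"), stageId = 3,
      stageAttemptId = 0, priority = 1, properties = new Properties)
    println(ts) // prints "TaskSet 3.0"
  }
}
```

A resubmitted attempt of the same stage would differ only in the attempt id, e.g. `TaskSet 3.1`.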

A TaskSet contains a fully independent sequence of tasks that can run right away based on the data that is already on the cluster, e.g. map output files from previous stages, though the tasks may fail if this data becomes unavailable.

A TaskSet can be submitted for execution (consult TaskScheduler Contract).


Where TaskSets are used

Priority Field and FIFO Scheduling

A TaskSet has a priority field that becomes the value of the priority field of its TaskSetManager (which is a Schedulable).

The priority field is used in FIFOSchedulingAlgorithm, in which the Schedulable with the lower priority value is scheduled first and, when priorities are equal, the one with the lower stage id gets the advantage.

FIFOSchedulingAlgorithm is only used for FIFO scheduling mode in a Pool (i.e. a schedulable collection of Schedulable objects).

Effectively, the priority field is the id of the first job this stage was part of (for FIFO scheduling).
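
The ordering can be illustrated with a small sketch. It assumes that FIFO ordering prefers the lower priority value (i.e. the earlier job) and falls back to the lower stage id on a tie. `SchedulableSketch`, `FifoOrderingSketch` and `comesBefore` are hypothetical names for illustration only, not Spark's API.

```scala
// Hypothetical stand-in for a Schedulable: only the fields the comparison needs.
final case class SchedulableSketch(name: String, priority: Int, stageId: Int)

object FifoOrderingSketch {
  // Returns true when s1 should be scheduled before s2.
  // Assumption: lower priority (job id) first, then lower stage id on a tie.
  def comesBefore(s1: SchedulableSketch, s2: SchedulableSketch): Boolean = {
    val byPriority = Integer.compare(s1.priority, s2.priority)
    if (byPriority != 0) byPriority < 0
    else Integer.compare(s1.stageId, s2.stageId) < 0
  }

  def main(args: Array[String]): Unit = {
    val queue = Seq(
      SchedulableSketch("TaskSet 5.0", priority = 2, stageId = 5),
      SchedulableSketch("TaskSet 3.0", priority = 1, stageId = 3),
      SchedulableSketch("TaskSet 2.0", priority = 1, stageId = 2))

    // Prints TaskSet 2.0, TaskSet 3.0, TaskSet 5.0 (one per line).
    queue.sortWith(comesBefore).foreach(s => println(s.name))
  }
}
```

Since the priority is a job id under this assumption, task sets from earlier jobs are served first, and within one job the earlier stages go first.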

