Partitioning — Specification of Physical Operator’s Output Partitions

Partitioning is the contract to hint the Spark Physical Optimizer for the number of partitions the output of a physical operator should be split across.

numPartitions: Int

numPartitions is used in:

Table 1. Partitioning Schemes (Partitionings) and Their Properties
Partitioning compatibleWith guarantees numPartitions satisfies

BroadcastPartitioning

BroadcastPartitioning with the same BroadcastMode

Exactly the same BroadcastPartitioning

1

BroadcastDistribution with the same BroadcastMode

HashPartitioning

  • clustering expressions

  • numPartitions

HashPartitioning (when their underlying expressions are semantically equal, i.e. deterministic and canonically equal)

HashPartitioning (when their underlying expressions are semantically equal, i.e. deterministic and canonically equal)

Input numPartitions

PartitioningCollection

  • partitionings

Any Partitioning that is compatible with one of the input partitionings

Any Partitioning that is guaranteed by any of the input partitionings

Number of partitions of the first Partitioning in the input partitionings

Any Distribution that is satisfied by any of the input partitionings

RangePartitioning

  • ordering collection of SortOrder

  • numPartitions

RangePartitioning (when semantically equal, i.e. underlying expressions are deterministic and canonically equal)

RangePartitioning (when semantically equal, i.e. underlying expressions are deterministic and canonically equal)

Input numPartitions

RoundRobinPartitioning

  • numPartitions

Always negative

Always negative

Input numPartitions

UnspecifiedDistribution

SinglePartition

Any Partitioning with exactly one partition

Any Partitioning with exactly one partition

1

Any Distribution except BroadcastDistribution

UnknownPartitioning

  • numPartitions

Always negative

Always negative

Input numPartitions

UnspecifiedDistribution

results matching ""

    No results matching ""