ClusteredDistribution

ClusteredDistribution is a Distribution that creates a HashPartitioning for the clustering expressions and a requested number of partitions.

ClusteredDistribution requires that the clustering expressions should not be empty (i.e. Nil).

ClusteredDistribution is created when the following physical operators are requested for a required child distribution:

  • MapGroupsExec, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec

  • Spark Structured Streaming’s FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec, StreamingSymmetricHashJoinExec, StreamingSymmetricHashJoinExec

  • SparkR’s FlatMapGroupsInRExec

  • PySpark’s FlatMapGroupsInPandasExec

ClusteredDistribution is used when:

  • DataSourcePartitioning, SinglePartition, HashPartitioning, and RangePartitioning are requested to satisfies

  • EnsureRequirements is requested to add an ExchangeCoordinator for Adaptive Query Execution

createPartitioning Method

createPartitioning(numPartitions: Int): Partitioning
Note
createPartitioning is part of Distribution Contract to create a Partitioning for a given number of partitions.

createPartitioning creates a HashPartitioning for the clustering expressions and the input numPartitions.

createPartitioning reports an AssertionError when the number of partitions is not the input numPartitions.

This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions].

Creating ClusteredDistribution Instance

ClusteredDistribution takes the following when created:

  • Clustering expressions

  • Required number of partitions (default: None)

Note
None for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property with the default of 200 partitions).

results matching ""

    No results matching ""