HashPartitioning

HashPartitioning is a Partitioning in which rows are distributed across partitions based on the MurMur3 hash of partitioning expressions (modulo the number of partitions).

HashPartitioning takes the following to be created:

Partitioning expressions
Number of partitions

HashPartitioning is an Expression that cannot be evaluated (and produce a value given an internal row).

HashPartitioning uses the MurMur3 Hash to compute the partitionId for data distribution (consistent for shuffling and bucketing that is crucial for joins of bucketed and regular tables).

val nums = spark.range(5)
val numParts = 200 // the default number of partitions
val partExprs = Seq(nums("id"))

val partitionIdExpression = pmod(hash(partExprs: _*), lit(numParts))
scala> partitionIdExpression.explain(extended = true)
pmod(hash(id#32L, 42), 200)

val q = nums.withColumn("partitionId", partitionIdExpression)
scala> q.show
+---+-----------+
| id|partitionId|
+---+-----------+
|  0|          5|
|  1|         69|
|  2|        128|
|  3|        107|
|  4|        140|
+---+-----------+

`satisfies0` Method

satisfies0(
  required: Distribution): Boolean

Note	`satisfies0` is part of the Partitioning contract.

satisfies0 is positive (true) when the following conditions all hold:

The base satisfies0 holds
For an input HashClusteredDistribution, the number of the given partitioning expressions and the HashClusteredDistribution’s are the same and semantically equal pair-wise
For an input ClusteredDistribution, the given partitioning expressions are among the ClusteredDistribution’s clustering expressions and they are semantically equal pair-wise

Otherwise, satisfies0 is negative (false).

`partitionIdExpression` Method

partitionIdExpression: Expression

partitionIdExpression creates (returns) a Pmod expression of a Murmur3Hash (with the partitioning expressions) and a Literal (with the number of partitions).

Note

partitionIdExpression is used when:

BucketingUtils utility is used for getBucketIdFromValue (for bucketing support)
FileFormatWriter utility is used for writing the result of a structured query out (for bucketing support)
ShuffleExchangeExec utility is used to prepare a ShuffleDependency

HashPartitioning