HashPartitioning is a Partitioning in which rows are distributed across partitions based on the Murmur3 hash of the partitioning expressions (modulo the number of partitions).

HashPartitioning takes the partitioning expressions and the number of partitions to be created.

HashPartitioning is an Expression that cannot be evaluated (that is, it cannot produce a value given an internal row).

HashPartitioning uses the Murmur3 hash to compute the partitionId for data distribution. The computation is consistent between shuffling and bucketing, which is crucial for joins of bucketed and regular tables.

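import org.apache.spark.sql.functions.{hash, lit, pmod}

val nums = spark.range(5)
val numParts = 200 // the default number of shuffle partitions (spark.sql.shuffle.partitions)
val partExprs = Seq(nums("id"))

// Same shape as partitionIdExpression: positive modulo of the Murmur3 hash
val partitionIdExpression = pmod(hash(partExprs: _*), lit(numParts))
scala> partitionIdExpression.explain(extended = true)
pmod(hash(id#32L, 42), 200)

val q = nums.withColumn("partitionId", partitionIdExpression)
scala> q.show
+---+-----------+
| id|partitionId|
+---+-----------+
|  0|          5|
|  1|         69|
|  2|        128|
|  3|        107|
|  4|        140|
+---+-----------+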
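
One way to check the claim that this expression drives data distribution is to compare it with the partition a row actually lands in after a hash repartition. The following is a minimal sketch, not part of Spark itself: the object name HashPartitioningCheck and the method mismatches are hypothetical, and it assumes that repartition with an explicit number of partitions is planned with HashPartitioning and that Spark is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod, spark_partition_id}

// Hypothetical helper for illustration only.
object HashPartitioningCheck {

  // Counts rows whose actual partition differs from pmod(hash(id), numParts).
  def mismatches(spark: SparkSession, numParts: Int): Long = {
    val expected = pmod(hash(col("id")), lit(numParts))
    spark.range(5)
      .repartition(numParts, col("id"))            // hash-repartition by id
      .withColumn("actual", spark_partition_id())  // partition the row is in
      .withColumn("expected", expected)            // partition we predict
      .where(col("actual") =!= col("expected"))
      .count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("HashPartitioningCheck")
      .getOrCreate()
    try {
      // Should print 0 if the assumption above holds.
      println(mismatches(spark, numParts = 200))
    } finally {
      spark.stop()
    }
  }
}
```

The number of partitions is passed explicitly so that the partition count stays fixed for the comparison.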

satisfies0 Method

satisfies0(
  required: Distribution): Boolean

satisfies0 is part of the Partitioning contract.

satisfies0 is positive (true) when the following conditions all hold:

Otherwise, satisfies0 is negative (false).

partitionIdExpression Method

partitionIdExpression: Expression

partitionIdExpression creates (returns) a Pmod expression of a Murmur3Hash (with the partitioning expressions) and a Literal (with the number of partitions).
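
A hash value is a signed 32-bit integer and can be negative, so a plain remainder could yield a negative partition ID. The sketch below illustrates, in plain Scala without Spark, why a positive modulo is the natural choice here. The object PmodSketch and its pmod helper are hypothetical stand-ins for the idea behind Pmod, and the sample values are arbitrary integers, not real Murmur3 outputs.

```scala
// Hypothetical, Spark-free illustration of positive modulo.
object PmodSketch {

  // Positive modulo for Ints: the result is always in [0, n) for n > 0.
  def pmod(a: Int, n: Int): Int = {
    val r = a % n          // may be negative when a is negative
    if (r < 0) r + n else r
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 200
    // Arbitrary sample "hashes" (assumed values for illustration only).
    val hashes = Seq(42, -42, Int.MinValue, Int.MaxValue)
    hashes.foreach { h =>
      println(s"hash=$h remainder=${h % numPartitions} pmod=${pmod(h, numPartitions)}")
    }
  }
}
```

For a negative input such as -42, the plain remainder is -42 while the positive modulo is 158, which is a valid partition ID for 200 partitions.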


partitionIdExpression is used when:
