HashPartitioning

HashPartitioning is a Partitioning in which rows are distributed across partitions based on the MurMur3 hash of partitioning expressions (modulo the number of partitions).

HashPartitioning takes the following to be created:

HashPartitioning is an Expression that cannot be evaluated (and produce a value given an internal row).

HashPartitioning uses the MurMur3 Hash to compute the partitionId for data distribution (consistent for shuffling and bucketing that is crucial for joins of bucketed and regular tables).

val nums = spark.range(5)
val numParts = 200 // the default number of partitions
val partExprs = Seq(nums("id"))

val partitionIdExpression = pmod(hash(partExprs: _*), lit(numParts))
scala> partitionIdExpression.explain(extended = true)
pmod(hash(id#32L, 42), 200)

val q = nums.withColumn("partitionId", partitionIdExpression)
scala> q.show
+---+-----------+
| id|partitionId|
+---+-----------+
|  0|          5|
|  1|         69|
|  2|        128|
|  3|        107|
|  4|        140|
+---+-----------+

satisfies0 Method

satisfies0(
  required: Distribution): Boolean
Note
satisfies0 is part of the Partitioning contract.

satisfies0 is positive (true) when the following conditions all hold:

Otherwise, satisfies0 is negative (false).

partitionIdExpression Method

partitionIdExpression: Expression

partitionIdExpression creates (returns) a Pmod expression of a Murmur3Hash (with the partitioning expressions) and a Literal (with the number of partitions).

Note

partitionIdExpression is used when:

results matching ""

    No results matching ""