val nums = spark.range(5)
val numParts = 200 // the default number of partitions
val partExprs = Seq(nums("id"))
val partitionIdExpression = pmod(hash(partExprs: _*), lit(numParts))
scala> partitionIdExpression.explain(extended = true)
pmod(hash(id#32L, 42), 200)
val q = nums.withColumn("partitionId", partitionIdExpression)
scala> q.show
+---+-----------+
| id|partitionId|
+---+-----------+
| 0| 5|
| 1| 69|
| 2| 128|
| 3| 107|
| 4| 140|
+---+-----------+
HashPartitioning
HashPartitioning
is a Partitioning in which rows are distributed across partitions based on the MurMur3 hash of partitioning expressions (modulo the number of partitions).
HashPartitioning
takes the following to be created:
-
Partitioning expressions
HashPartitioning
is an Expression that cannot be evaluated (and produce a value given an internal row).
HashPartitioning
uses the MurMur3 Hash to compute the partitionId for data distribution (consistent for shuffling and bucketing that is crucial for joins of bucketed and regular tables).
satisfies0
Method
satisfies0(
required: Distribution): Boolean
Note
|
satisfies0 is part of the Partitioning contract.
|
satisfies0
is positive (true
) when the following conditions all hold:
-
The base satisfies0 holds
-
For an input HashClusteredDistribution, the number of the given partitioning expressions and the HashClusteredDistribution’s are the same and semantically equal pair-wise
-
For an input ClusteredDistribution, the given partitioning expressions are among the ClusteredDistribution’s clustering expressions and they are semantically equal pair-wise
Otherwise, satisfies0
is negative (false
).
partitionIdExpression
Method
partitionIdExpression: Expression
partitionIdExpression
creates (returns) a Pmod
expression of a Murmur3Hash (with the partitioning expressions) and a Literal (with the number of partitions).
Note
|
|