val nums = spark.range(5)
val numParts = 200 // the default number of partitions
val partExprs = Seq(nums("id"))
val partitionIdExpression = pmod(hash(partExprs: _*), lit(numParts))
scala> partitionIdExpression.explain(extended = true)
pmod(hash(id#32L, 42), 200)
val q = nums.withColumn("partitionId", partitionIdExpression)
scala> q.show
+---+-----------+
| id|partitionId|
+---+-----------+
| 0| 5|
| 1| 69|
| 2| 128|
| 3| 107|
| 4| 140|
+---+-----------+
HashPartitioning
HashPartitioning is a Partitioning in which rows are distributed across partitions based on the MurMur3 hash of partitioning expressions (modulo the number of partitions).
HashPartitioning takes the following to be created:
-
Partitioning expressions
HashPartitioning is an Expression that cannot be evaluated (and produce a value given an internal row).
HashPartitioning uses the MurMur3 Hash to compute the partitionId for data distribution (consistent for shuffling and bucketing that is crucial for joins of bucketed and regular tables).
satisfies0 Method
satisfies0(
required: Distribution): Boolean
|
Note
|
satisfies0 is part of the Partitioning contract.
|
satisfies0 is positive (true) when the following conditions all hold:
-
The base satisfies0 holds
-
For an input HashClusteredDistribution, the number of the given partitioning expressions and the HashClusteredDistribution’s are the same and semantically equal pair-wise
-
For an input ClusteredDistribution, the given partitioning expressions are among the ClusteredDistribution’s clustering expressions and they are semantically equal pair-wise
Otherwise, satisfies0 is negative (false).
partitionIdExpression Method
partitionIdExpression: Expression
partitionIdExpression creates (returns) a Pmod expression of a Murmur3Hash (with the partitioning expressions) and a Literal (with the number of partitions).
|
Note
|
|