// Dataset[T]
groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]
KeyValueGroupedDataset — Streaming Aggregation
KeyValueGroupedDataset represents a grouped dataset produced by the Dataset.groupByKey operator, which groups records by the key that a grouping function computes for each record.
import java.sql.Timestamp
val numGroups = spark.
readStream.
format("rate").
load.
as[(Timestamp, Long)].
groupByKey { case (time, value) => value % 2 }
scala> :type numGroups
org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]
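The same grouping can be tried on a small batch Dataset, which is easier to inspect than a stream. The following is a minimal sketch, assuming a local Spark 2.x or 3.x session; the names `numbers`, `byParity` and `counts` are made up for illustration, and in spark-shell the `spark` value and its implicits are already available.

```scala
import org.apache.spark.sql.{KeyValueGroupedDataset, SparkSession}

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder.master("local[*]").appName("groupByKey-sketch").getOrCreate()
import spark.implicits._

// A small batch Dataset with the numbers 0 to 9
val numbers = spark.range(10).as[Long]

// Group by parity, mirroring `value % 2` in the streaming example
val byParity: KeyValueGroupedDataset[Long, Long] = numbers.groupByKey(_ % 2)

// count returns a Dataset[(K, Long)] with the number of records per key
val counts: Map[Long, Long] = byParity.count().collect().toMap
```

With ten consecutive numbers, both keys (0 and 1) end up with five records each.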
A KeyValueGroupedDataset is also created by the KeyValueGroupedDataset.keyAs and KeyValueGroupedDataset.mapValues operators.
scala> :type numGroups
org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]
scala> :type numGroups.keyAs[String]
org.apache.spark.sql.KeyValueGroupedDataset[String,(java.sql.Timestamp, Long)]
scala> :type numGroups
org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]
val mapped = numGroups.mapValues { case (ts, n) => s"($ts, $n)" }
scala> :type mapped
org.apache.spark.sql.KeyValueGroupedDataset[Long,String]
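To see what mapValues changes, it helps to follow it with an operator that consumes the mapped values. The sketch below is only an illustration on a batch Dataset, assuming a local Spark 2.x or 3.x session; `words`, `byInitial`, `lengths` and `totalLengths` are hypothetical names.

```scala
import org.apache.spark.sql.{KeyValueGroupedDataset, SparkSession}

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder.master("local[*]").appName("mapValues-sketch").getOrCreate()
import spark.implicits._

val words = Seq("apple", "avocado", "banana", "blueberry", "cherry").toDS

// Key every word by its first letter
val byInitial: KeyValueGroupedDataset[String, String] = words.groupByKey(_.take(1))

// mapValues keeps the keys and replaces every value with its length
val lengths: KeyValueGroupedDataset[String, Int] = byInitial.mapValues(_.length)

// mapGroups now sees Int values rather than the original strings
val totalLengths: Map[String, Int] =
  lengths.mapGroups { (initial, sizes) => (initial, sizes.sum) }.collect().toMap
```

The key type stays `String` throughout; only the value type changes from `String` to `Int`.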
KeyValueGroupedDataset works for both batch and streaming aggregations, but shines the most when used for Streaming Aggregation.
scala> :type numGroups
org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
numGroups.
mapGroups { case(group, values) => values.size }.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.seconds)).
start
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
|    3|
|    2|
+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+
|value|
+-----+
|    5|
|    5|
+-----+
// Eventually...
spark.streams.active.foreach(_.stop)
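Since KeyValueGroupedDataset also works for batch aggregations, the mapGroups step above can be checked without starting a streaming query. This is a minimal sketch, assuming a local Spark 2.x or 3.x session; `ids` and `sizes` are illustrative names. It returns the key together with the group size so that the result shows which group each count belongs to.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder.master("local[*]").appName("mapGroups-sketch").getOrCreate()
import spark.implicits._

// The numbers 0 to 6: four even and three odd values
val ids = spark.range(7).as[Long]

// `values` is an Iterator, so it can only be traversed once per group
val sizes: Map[Long, Int] = ids
  .groupByKey(_ % 2)
  .mapGroups { (key, values) => (key, values.size) }
  .collect()
  .toMap
```

Key 0 gets the four even numbers and key 1 the three odd ones.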
The most prestigious use case of KeyValueGroupedDataset, however, is Arbitrary Stateful Streaming Aggregation, which accumulates streaming state (by means of GroupState) using the mapGroupsWithState operator and the more advanced flatMapGroupsWithState operator.
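A running count per key is perhaps the smallest example of state kept in GroupState across micro-batches. The sketch below is not taken from the text above and rests on several assumptions: a local Spark 2.2+ or 3.x session, the `MemoryStream` test source (from Spark's internal `execution.streaming` package) as a controllable input, and the memory sink accepting the Update output mode. The names `updateCount`, `runningCounts` and `running_counts` are made up for illustration.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Assumption: a local session for experimentation (spark-shell already provides `spark`)
val spark = SparkSession.builder.master("local[2]").appName("mapGroupsWithState-sketch").getOrCreate()
import spark.implicits._
// MemoryStream (an internal testing source) expects an implicit SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext

// Hypothetical state function: adds the size of the current batch of values
// to the count stored in GroupState and emits (key, running count)
def updateCount(key: Long, values: Iterator[Long], state: GroupState[Long]): (Long, Long) = {
  val total = state.getOption.getOrElse(0L) + values.size
  state.update(total)
  (key, total)
}

val input = MemoryStream[Long]
val runningCounts = input.toDS
  .groupByKey(_ % 2)
  .mapGroupsWithState[Long, (Long, Long)](GroupStateTimeout.NoTimeout)(updateCount _)

// mapGroupsWithState requires the Update output mode
val query = runningCounts.writeStream
  .format("memory")
  .queryName("running_counts")
  .outputMode(OutputMode.Update)
  .start()

// Two rounds of input; the state should carry over from the first to the second
input.addData(1L, 2L, 3L)
query.processAllAvailable()
input.addData(4L, 5L)
query.processAllAvailable()

// The memory table keeps every emitted row, so take the highest count per key
val latest: Map[Long, Long] = spark.table("running_counts")
  .as[(Long, Long)]
  .collect()
  .groupBy(_._1)
  .map { case (key, rows) => key -> rows.map(_._2).max }

query.stop()
```

The second round of input does not start counting from zero: the function reads the previous total from `state.getOption` before adding the new values, which is what distinguishes mapGroupsWithState from a plain mapGroups.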
| Operator | Description |
|---|---|
| | Arbitrary Stateful Streaming Aggregation - streaming aggregation with explicit state and state timeout |
| | Creates a new |