(StateStore, Iterator[T]) => Iterator[U]
Arbitrary Stateful Streaming Aggregation
KeyValueGroupedDataset represents a grouped dataset as the result of the Dataset.groupByKey operator.
The mapGroupsWithState and flatMapGroupsWithState operators use GroupState as the streaming aggregation state of a group. A separate GroupState is created for every aggregation key and holds an aggregation state value of a user-defined type.
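As a minimal sketch of how groupByKey, flatMapGroupsWithState and GroupState fit together, the following application keeps a running count per key. The rate source, the Event and RunningCount classes, and the nextCount and updateCount functions are assumptions made for illustration only. The code needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object CountPerKeyDemo {
  // Hypothetical input and output types (not part of Spark)
  case class Event(key: String, value: Long)
  case class RunningCount(key: String, count: Long)

  // Pure part of the state logic: previous count (if any) plus the new events
  def nextCount(previous: Option[Long], newEvents: Int): Long =
    previous.getOrElse(0L) + newEvents

  // Called once per aggregation key and micro-batch; `state` is the GroupState of that key
  def updateCount(
      key: String,
      events: Iterator[Event],
      state: GroupState[Long]): Iterator[RunningCount] = {
    val current = nextCount(state.getOption, events.size)
    state.update(current) // the user-defined state value for this key
    Iterator(RunningCount(key, current))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("CountPerKeyDemo")
      .getOrCreate()
    import spark.implicits._

    // The built-in rate source emits (timestamp, value) rows; derive three keys from value
    val events = spark.readStream.format("rate").load()
      .select(($"value" % 3).cast("string").as("key"), $"value")
      .as[Event]

    val counts = events
      .groupByKey(_.key) // gives a KeyValueGroupedDataset[String, Event]
      .flatMapGroupsWithState[Long, RunningCount](
        OutputMode.Update(), GroupStateTimeout.NoTimeout())(updateCount)

    counts.writeStream
      .format("console")
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}
```

The state type (Long here) is chosen by the user; any type with an available Encoder could be used instead.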
One of the most important internal execution components of Arbitrary Stateful Streaming Aggregation is the FlatMapGroupsWithStateExec physical operator.
When requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow]),
FlatMapGroupsWithStateExec first validates that the requirements of the selected GroupStateTimeout are met.
FIXME: When are the above requirements met?
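The text does not list the individual requirements. As an assumption based on one reading of the Spark source (not confirmed by this text), processing-time timeouts need a batch timestamp, and event-time timeouts need both a watermark value and a watermark column in the input. The sketch below models that idea in plain Scala. TimeoutConf, OperatorInputs and validate are hypothetical names, not Spark APIs.

```scala
object TimeoutValidationSketch {
  // Hypothetical stand-ins for the GroupStateTimeout variants
  sealed trait TimeoutConf
  case object NoTimeout extends TimeoutConf
  case object ProcessingTimeTimeout extends TimeoutConf
  case object EventTimeTimeout extends TimeoutConf

  // Hypothetical view of what the operator knows when it is executed
  final case class OperatorInputs(
      batchTimestampMs: Option[Long],
      eventTimeWatermark: Option[Long],
      hasWatermarkColumn: Boolean)

  // Throws IllegalArgumentException when a requirement is not met (assumed checks)
  def validate(conf: TimeoutConf, in: OperatorInputs): Unit = conf match {
    case ProcessingTimeTimeout =>
      require(in.batchTimestampMs.nonEmpty, "batch timestamp must be defined")
    case EventTimeTimeout =>
      require(in.eventTimeWatermark.nonEmpty, "event-time watermark must be defined")
      require(in.hasWatermarkColumn, "input must have a watermark column")
    case NoTimeout =>
      () // nothing to validate
  }
}
```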
The FlatMapGroupsWithStateExec physical operator then calls mapPartitionsWithStateStore with a custom
storeUpdateFunction of the signature (StateStore, Iterator[T]) => Iterator[U].
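A function of the shape (StateStore, Iterator[T]) => Iterator[U] receives the state store of a partition together with the partition's input rows, and returns the output rows. The sketch below illustrates only that shape. ToyStateStore is a hypothetical in-memory stand-in for Spark's StateStore, and countPerKey is an illustrative function, not the one FlatMapGroupsWithStateExec actually passes.

```scala
import scala.collection.mutable

object StoreUpdateSketch {
  // Hypothetical in-memory stand-in for a per-partition state store
  final class ToyStateStore {
    private val data = mutable.Map.empty[String, Long]
    def get(key: String): Option[Long] = data.get(key)
    def put(key: String, value: Long): Unit = data.update(key, value)
  }

  // Same shape as (StateStore, Iterator[T]) => Iterator[U]
  type StoreUpdateFunction[T, U] = (ToyStateStore, Iterator[T]) => Iterator[U]

  // Reads the previous count of each key, stores the new one, and emits it
  val countPerKey: StoreUpdateFunction[String, (String, Long)] =
    (store, keys) =>
      keys.map { key =>
        val updated = store.get(key).getOrElse(0L) + 1L
        store.put(key, updated)
        key -> updated
      }

  def main(args: Array[String]): Unit = {
    val store = new ToyStateStore
    // The result is a lazy iterator; toList forces the state updates
    println(countPerKey(store, Iterator("a", "b", "a")).toList)
    // prints List((a,1), (b,1), (a,2))
  }
}
```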
While generating the recipe,
FlatMapGroupsWithStateExec uses the StateStoreOps extension methods to register a listener that is executed on task completion. The listener makes sure that all state changes of the given StateStore are either committed or aborted.
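The commit-or-abort guarantee can be modelled in plain Scala. This is a sketch under stated assumptions: ToyStateStore, ToyTaskContext and runTask are hypothetical and only mimic the idea of a completion listener that aborts a store whose changes were never committed. They are not Spark's StateStore or TaskContext.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object CommitOrAbortSketch {
  // Hypothetical store that only records whether it was committed or aborted
  final class ToyStateStore {
    private var committed = false
    private var aborted = false
    def commit(): Unit = { committed = true }
    def abort(): Unit = { aborted = true }
    def hasCommitted: Boolean = committed
    def hasAborted: Boolean = aborted
  }

  // Hypothetical task context that runs its listeners when the task completes
  final class ToyTaskContext {
    private val listeners = mutable.ArrayBuffer.empty[() => Unit]
    def addTaskCompletionListener(listener: () => Unit): Unit = { listeners += listener }
    def markTaskCompleted(): Unit = listeners.foreach(listener => listener())
  }

  // Runs `body` as a "task"; the listener aborts the store unless it was committed
  def runTask(store: ToyStateStore)(body: ToyStateStore => Unit): Unit = {
    val context = new ToyTaskContext
    context.addTaskCompletionListener(() => if (!store.hasCommitted) store.abort())
    try body(store)
    catch { case NonFatal(_) => () } // swallowed here only to keep the demo running
    finally context.markTaskCompleted()
  }

  def main(args: Array[String]): Unit = {
    val ok = new ToyStateStore
    runTask(ok)(_.commit())
    println(s"committed=${ok.hasCommitted}, aborted=${ok.hasAborted}")
    // prints committed=true, aborted=false

    val failed = new ToyStateStore
    runTask(failed)(_ => throw new IllegalStateException("boom"))
    println(s"committed=${failed.hasCommitted}, aborted=${failed.hasAborted}")
    // prints committed=false, aborted=true
  }
}
```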
In the end,
FlatMapGroupsWithStateExec creates a new StateStoreRDD and adds it to the RDD lineage.