Streaming Join
In Spark Structured Streaming, a streaming join is a streaming query that was described (build) using the high-level streaming operators:
-
SQL’s
JOIN
clause
Streaming joins can be stateless or stateful:
-
Joins of a streaming query and a batch query (stream-static joins) are stateless and no state management is required
-
Joins of two streaming queries (stream-stream joins) are stateful and require streaming state (with an optional join state watermark for state removal).
Stream-Stream Joins
Spark Structured Streaming supports stream-stream joins with the following:
-
Equality predicate (i.e. equi-joins that use only equality comparisons in the join predicate)
-
Inner
,LeftOuter
, andRightOuter
join types only
Stream-stream equi-joins are planned as StreamingSymmetricHashJoinExec physical operators of two ShuffleExchangeExec
physical operators (per Required Partition Requirements).
Join State Watermark for State Removal
Stream-stream joins may optionally define Join State Watermark for state removal (cf. Watermark Predicates for State Removal).
A join state watermark can be specified on the following:
-
Join keys (key state)
-
Columns of the left and right sides (value state)
A join state watermark can be specified on key state, value state or both.
IncrementalExecution — QueryExecution of Streaming Queries
Under the covers, the high-level operators create a logical query plan with one or more Join
logical operators.
Tip
|
Read up on Join Logical Operator in The Internals of Spark SQL online book. |
In Spark Structured Streaming IncrementalExecution is responsible for planning streaming queries for execution.
At query planning, IncrementalExecution
uses the StreamingJoinStrategy execution planning strategy for planning stream-stream joins as StreamingSymmetricHashJoinExec physical operators.
Further Reading Or Watching
-
Stream-stream Joins in the official documentation of Apache Spark for Structured Streaming
-
Introducing Stream-Stream Joins in Apache Spark 2.3 by Databricks
-
(video) Deep Dive into Stateful Stream Processing in Structured Streaming by Tathagata Das