StreamingSymmetricHashJoinExec Binary Physical Operator — Stream-Stream Joins

StreamingSymmetricHashJoinExec is a binary physical operator that represents a Join logical operator of two streaming queries at execution time.

Note

A binary physical operator (BinaryExecNode) is a physical operator with left and right child physical operators.

Read up on BinaryExecNode (and physical operators in general) in The Internals of Spark SQL book.

Note

Join logical operator represents Dataset.join operator in a logical query plan.

StreamingSymmetricHashJoinExec requires that the join type be Inner, LeftOuter, or RightOuter with the same data types of the left and the right keys.

StreamingSymmetricHashJoinExec is created exclusively when StreamingJoinStrategy execution planning strategy is requested to plan a logical query plan with a Join logical operator of two streaming queries with equality predicates (EqualTo and EqualNullSafe)

StreamingSymmetricHashJoinExec is a stateful physical operator that writes to a state store.

The output schema of StreamingSymmetricHashJoinExec is…​FIXME

The output partitioning of StreamingSymmetricHashJoinExec is…​FIXME

Creating StreamingSymmetricHashJoinExec Instance

StreamingSymmetricHashJoinExec takes the following to be created:

  • Catalyst expressions of the keys on the left side

  • Catalyst expressions of the keys on the right side

  • JoinType

  • Join condition (JoinConditionSplitPredicates)

  • StatefulOperatorStateInfo

  • Event-time watermark

  • State watermark (JoinStateWatermarkPredicates)

  • Physical operator on the left side (SparkPlan)

  • Physical operator on the right side (SparkPlan)

StreamingSymmetricHashJoinExec initializes the internal properties.

Performance Metrics

StreamingSymmetricHashJoinExec uses the performance metrics of StateStoreWriter.

Checking Out Whether Last Batch Execution Requires Another Non-Data Batch or Not — shouldRunAnotherBatch Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean
Note
shouldRunAnotherBatch is part of the StateStoreWriter Contract to indicate whether MicroBatchExecution should run another non-data batch (based on the updated OffsetSeqMetadata with the current event-time watermark and the batch timestamp).

shouldRunAnotherBatch…​FIXME

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]
Note
doExecute is part of SparkPlan Contract to generate the runtime representation of an physical operator as a recipe for distributed computation over internal binary rows on Apache Spark (RDD[InternalRow]).

doExecute first requests the StreamingQueryManager for the StateStoreCoordinatorRef to the StateStoreCoordinator RPC endpoint (for the driver).

doExecute then requests the SymmetricHashJoinStateManager for the names of the state stores for the left and right side of the streaming join.

In the end, doExecute requests the left and right physical operators to execute (generate an RDD) and then stateStoreAwareZipPartitions with processPartitions (and with the StateStoreCoordinatorRef and the state stores).

Processing Partitions — processPartitions Internal Method

processPartitions(
  leftInputIter: Iterator[InternalRow],
  rightInputIter: Iterator[InternalRow]): Iterator[InternalRow]

processPartitions…​FIXME

Note
processPartitions is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to execute.

Internal Properties

Name Description

hadoopConfBcast

Hadoop Configuration broadcast (to the Spark cluster)

joinStateManager

nullLeft

GenericInternalRow of the size of the output schema of the left physical operator

nullRight

GenericInternalRow of the size of the output schema of the right physical operator

storeConf

results matching ""

    No results matching ""