Extending Structured Streaming with New Data Sources

Spark Structured Streaming uses Spark SQL to plan streaming queries, that is, to prepare them for execution.

Spark SQL is migrating from the former Data Source API V1 to the new Data Source API V2, and so is Structured Streaming. That is the reason the BaseStreamingSource and BaseStreamingSink APIs exist: they serve as common base abstractions across the class hierarchies of the two Data Source APIs, for streaming sources and sinks, respectively.

Structured Streaming supports two stream execution engines, Micro-Batch and Continuous, each with its own APIs.

Micro-Batch Stream Processing supports both the old Data Source API V1 and the new Data Source API V2, with micro-batch-specific APIs for streaming sources and sinks.

Continuous Stream Processing supports only the new Data Source API V2, with continuous-specific APIs for streaming sources and sinks.
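
The support rules above amount to a small compatibility matrix between execution engines and Data Source API versions. The following is a minimal sketch of that matrix in plain Python. It does not use Spark, and all names in it (Engine, DataSourceApi, SUPPORTED_APIS, engines_for, is_supported) are hypothetical helpers made up for illustration, not Spark classes.

```python
from enum import Enum


class Engine(Enum):
    """The two stream execution engines named in the text."""
    MICRO_BATCH = "Micro-Batch Stream Processing"
    CONTINUOUS = "Continuous Stream Processing"


class DataSourceApi(Enum):
    """The two Data Source API generations named in the text."""
    V1 = "Data Source API V1"
    V2 = "Data Source API V2"


# Support matrix exactly as stated in the paragraphs above.
# Hypothetical lookup table for illustration; not part of Spark.
SUPPORTED_APIS = {
    Engine.MICRO_BATCH: {DataSourceApi.V1, DataSourceApi.V2},
    Engine.CONTINUOUS: {DataSourceApi.V2},
}


def is_supported(engine, api):
    """Return True if `engine` can run sources and sinks written against `api`."""
    return api in SUPPORTED_APIS[engine]


def engines_for(api):
    """Return the engines (in definition order) that support `api`."""
    return [engine for engine in Engine if is_supported(engine, api)]


if __name__ == "__main__":
    for api in DataSourceApi:
        names = ", ".join(engine.value for engine in engines_for(api))
        print(f"{api.value}: {names}")
```

Read this way, a source written against Data Source API V1 can only ever run in micro-batch mode, while targeting Data Source API V2 keeps both engines available, provided the engine-specific APIs are implemented.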

The following are the questions to consider (and answer) when planning a new data source for Structured Streaming. They should give you a sense of how much work and time it takes, as well as which Spark version to support (e.g. 2.2 vs 2.4).

Data Source API V1
Data Source API V2
Micro-Batch Stream Processing (Structured Streaming V1)
Continuous Stream Processing (Structured Streaming V2)
Read side (BaseStreamingSource)
Write side (BaseStreamingSink)
