Extending Structured Streaming with New Data Sources
Spark Structured Streaming uses Spark SQL for planning streaming queries (preparing for execution).
Spark SQL is migrating from the former Data Source API V1 to the new Data Source API V2, and so is Structured Streaming. This is why the BaseStreamingSource and BaseStreamingSink APIs exist: they bridge the class hierarchies of the two Data Source APIs for streaming sources and sinks, respectively.
Structured Streaming supports two stream execution engines (Micro-Batch and Continuous), each with its own APIs.
Micro-Batch Stream Processing supports the old Data Source API V1 and the new modern Data Source API V2 with micro-batch-specific APIs for streaming sources and sinks.
Continuous Stream Processing supports the new modern Data Source API V2 only with continuous-specific APIs for streaming sources and sinks.
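On the micro-batch side, the Data Source API V1 read contract is the Source abstraction (org.apache.spark.sql.execution.streaming.Source, which extends BaseStreamingSource). The following is a minimal, hypothetical sketch of such a source; the name CounterSource and its behavior are invented for illustration, and a production source would use Spark's internal helpers to create a DataFrame flagged as streaming rather than the plain sqlContext.range used here.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical micro-batch source (Data Source API V1) that emits
// an ever-growing sequence of longs, one value per millisecond.
class CounterSource(sqlContext: SQLContext) extends Source {

  // Fixed schema of the records this source produces.
  override def schema: StructType =
    StructType(StructField("value", LongType) :: Nil)

  // Latest offset available for processing; the micro-batch engine
  // calls this to decide whether a new batch should be planned.
  override def getOffset: Option[Offset] =
    Some(LongOffset(System.currentTimeMillis()))

  // Data between the two offsets as a DataFrame. NOTE: real sources
  // must return a DataFrame marked as streaming (internal API);
  // this sketch only shows the offset arithmetic.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.flatMap(LongOffset.convert).map(_.offset).getOrElse(0L)
    val to = LongOffset.convert(end).map(_.offset).getOrElse(0L)
    sqlContext.range(from, to).toDF("value")
  }

  // Release any resources held by the source.
  override def stop(): Unit = ()
}
```

Continuous sources, by contrast, implement the Data Source API V2 continuous-specific contracts and cannot reuse this V1 abstraction.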
Consider the following questions when planning a new data source for Structured Streaming. They should give you a sense of how much work and time the development takes, as well as which Spark versions you can support (e.g. 2.2 vs 2.4).
- Data Source API V1
- Data Source API V2
- Micro-Batch Stream Processing (Structured Streaming V1)
- Continuous Stream Processing (Structured Streaming V2)
- Read side (BaseStreamingSource)
- Write side (BaseStreamingSink)
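Whichever combination you choose, the data source has to be wired into Spark SQL's data source lookup. Under Data Source API V1 the read side is registered through the StreamSourceProvider and DataSourceRegister contracts. The sketch below is hypothetical: the "counter" short name and the trivial anonymous Source are invented for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical provider that makes a streaming source available
// under the short name "counter" (Data Source API V1).
class CounterSourceProvider extends StreamSourceProvider with DataSourceRegister {

  private val fixedSchema: StructType =
    StructType(StructField("value", LongType) :: Nil)

  // Short name used in spark.readStream.format("counter")
  override def shortName(): String = "counter"

  // Schema reported before the source itself is instantiated.
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), fixedSchema)

  // Factory method the micro-batch engine uses to create the source.
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = new Source {
    override def schema: StructType = fixedSchema
    override def getOffset: Option[Offset] = None // no data available yet
    override def getBatch(start: Option[Offset], end: Offset): DataFrame =
      sqlContext.emptyDataFrame
    override def stop(): Unit = ()
  }
}
```

With the provider on the classpath (and, for the short name to resolve, listed in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file), a query can use it as spark.readStream.format("counter").load(), or reference the provider's fully-qualified class name instead.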