val batchQuery = spark.
read. // <-- batch non-streaming query
assert(batchQuery.isStreaming == false)
val streamingQuery = spark.
readStream. // <-- streaming query
Spark Structured Streaming and Streaming Queries
Spark Structured Streaming (aka Structured Streaming or Spark Streams) is the module of Apache Spark for stream processing using streaming queries.
Streaming queries can be expressed using a high-level declarative streaming API (Dataset API) or good ol' SQL (SQL over stream / streaming SQL). The declarative streaming Dataset API and SQL are executed on the underlying highly-optimized Spark SQL engine.
The semantics of the Structured Streaming model is as follows (see the article Structured Streaming In Apache Spark):
At any time, the output of a continuous application is equivalent to executing a batch job on a prefix of the data.
|As of Spark 2.2.0, Structured Streaming has been marked stable and ready for production use. With that the other older streaming module Spark Streaming is considered obsolete and not used for developing new streaming applications with Apache Spark.
Spark Structured Streaming comes with two stream execution engines for executing streaming queries:
The goal of Spark Structured Streaming is to unify streaming, interactive, and batch queries over structured datasets for developing end-to-end stream processing applications dubbed continuous applications using Spark SQL’s Datasets API with additional support for the following features:
In Structured Streaming, Spark developers describe custom streaming computations in the same way as with Spark SQL. Internally, Structured Streaming applies the user-defined structured query to the continuously and indefinitely arriving data to analyze real-time streaming data.
Read up on Spark SQL, Datasets and logical plans in The Internals of Spark SQL book.
Structured Streaming models a stream of data as an infinite (and hence continuous) table that could be changed every streaming batch.
You can specify output mode of a streaming dataset which is what gets written to a streaming sink (i.e. the infinite result table) when there is a new data available.
Streaming Datasets use streaming query plans (as opposed to regular batch Datasets that are based on batch query plans).
From this perspective, batch queries can be considered streaming Datasets executed once only (and is why some batch queries, e.g. KafkaSource, can easily work in batch mode).
With Structured Streaming, Spark 2 aims at simplifying streaming analytics with little to no need to reason about effective data streaming (trying to hide the unnecessary complexity in your streaming analytics architectures).
Structured streaming is defined by the following data abstractions in
Structured Streaming follows micro-batch model and periodically fetches data from the data source (and uses the
DataFrame data abstraction to represent the fetched data for a certain batch).
With Datasets as Spark SQL’s view of structured data, structured streaming checks input sources for new data every trigger (time) and executes the (continuous) queries.
|The feature has also been called Streaming Spark SQL Query, Streaming DataFrames, Continuous DataFrame or Continuous Query. There have been lots of names before the Spark project settled on Structured Streaming.
The official Structured Streaming Programming Guide
(article) Structured Streaming In Apache Spark
(video) The Future of Real Time in Spark from Spark Summit East 2016 in which Reynold Xin presents the concept of Streaming DataFrames
(video) A Deep Dive Into Structured Streaming by Tathagata "TD" Das from Spark Summit 2016
(video) Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark by Burak Yavuz