Spark Streaming — Streaming RDDs

Spark Streaming is the incremental micro-batching stream processing framework for Spark.

Spark Streaming offers a data abstraction called DStream (discretized stream) that hides the complexity of dealing with a continuous stream of data and makes working with it as easy for programmers as working with a single RDD at a time.

That is why Spark Streaming is also called a micro-batching streaming framework: a batch is simply one RDD at a time.

Note
I think Spark Streaming shines at performing the T stage of ETL well, i.e. the transformation stage, while leaving the E (extract) and L (load) stages to more specialized tools like Apache Kafka or frameworks like Akka.

For a software developer, a DStream is similar to work with as an RDD, with the DStream API designed to match the RDD API. Interestingly, you can reuse your RDD-based code and apply it to a DStream, i.e. a stream of RDDs, with no changes at all (through foreachRDD).
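
For example, here is a minimal sketch (host, port, batch interval and application name are assumptions) that reuses classic RDD-based word-count code inside foreachRDD, executed once per micro-batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReuseRddCode {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ReuseRddCode")
    val ssc = new StreamingContext(conf, Seconds(5)) // a new RDD of input records every 5 seconds

    // A DStream of text lines read from a TCP socket (host and port are assumptions)
    val lines = ssc.socketTextStream("localhost", 9999)

    // The very same code you would write for a single RDD, applied to every micro-batch
    lines.foreachRDD { rdd =>
      val wordCounts = rdd.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
      wordCounts.collect().foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```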

Spark Streaming runs streaming jobs every batch interval to pull and process data (often called records) from one or many input streams.

For every batch, Spark Streaming generates an RDD from the data received on the input streams and submits a Spark job to compute the result. It does this over and over again until the streaming context is stopped (and the owning streaming application is terminated).

To avoid losing records in case of failure, Spark Streaming supports checkpointing, which writes received records to a highly-available HDFS-compatible storage and allows recovering from temporary downtimes.

Spark Streaming integrates with real-time data sources ranging from basic ones, like an HDFS-compatible file system or a socket connection, to more advanced ones, like Apache Kafka or Apache Flume.
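
A minimal sketch of creating input DStreams from the basic built-in sources (the directory path, host, port and batch interval are assumptions; Kafka and Flume integrations ship as separate modules with their own utility classes):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("InputSources")
val ssc = new StreamingContext(conf, Seconds(10))

// New files appearing in a monitored (HDFS-compatible) directory
val fromFiles = ssc.textFileStream("hdfs:///data/incoming")

// Text lines arriving over a TCP socket
val fromSocket = ssc.socketTextStream("localhost", 9999)
```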

Checkpointing is also the foundation of stateful and windowed operations.
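
A minimal sketch, assuming a socket source and an HDFS checkpoint directory, of how enabling checkpointing unlocks a stateful operation (updateStateByKey keeps a running count per key across micro-batches):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(10))

// Checkpointing to a highly-available, HDFS-compatible storage is required
// for stateful operations (the directory is an assumption)
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))

// Keep a running count per word across micro-batches
val runningCounts = words.map((_, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}
runningCounts.print()

ssc.start()
ssc.awaitTermination()
```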

About Spark Streaming from the official documentation (that pretty much nails what it offers):

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

Essential concepts in Spark Streaming (each described in its own section below):

  • Micro Batch

  • Batch Interval (aka batchDuration)

  • Streaming Job

Other concepts often used in Spark Streaming:

  • ingestion = the act of processing streaming data.

Micro Batch

A micro batch is a collection of input records, as collected by Spark Streaming over a single batch interval, that is later represented as an RDD.

A batch is internally represented as a JobSet.

Batch Interval (aka batchDuration)

Batch interval is a property of a streaming application that describes how often an RDD of input records is generated, i.e. the time over which input records are collected before they become a micro-batch.
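
For illustration, the batch interval is the second argument when creating a StreamingContext (the 5-second value and application name below are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval (batchDuration) is the second constructor argument:
// input records are collected for 5 seconds before they become a micro-batch (an RDD).
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchInterval")
val ssc = new StreamingContext(conf, Seconds(5))
```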

Streaming Job

A streaming Job represents a Spark computation with one or many Spark jobs.

It is identified (in the logs) as streaming job [time].[outputOpId] with outputOpId being the position in the sequence of jobs in a JobSet.

When executed, it runs the computation (the input func function).

Note
A collection of streaming jobs is generated for a batch using DStreamGraph.generateJobs(time: Time).

Internal Registries

  • nextInputStreamId - the current InputStream id
