Streaming Sources

A streaming source represents a continuous stream of data for a streaming query. It generates a batch, as a DataFrame, for a given pair of start and end offsets. For fault tolerance, a source must be able to replay data given a start offset.

A streaming source should be able to replay an arbitrary sequence of past data in a stream using a range of offsets. Sources like Apache Kafka and Amazon Kinesis, which assign an offset to every record, fit this model well. Structured Streaming relies on this assumption to achieve end-to-end exactly-once guarantees.
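The replay requirement can be illustrated with a minimal, Spark-free sketch. Everything in it is hypothetical: ReplayDemo, its log, and the replay function are illustrative stand-ins rather than Spark APIs, and the exclusive-start, inclusive-end reading of an offset range is an assumption of the sketch. The point is only that asking for the same range twice (for example, after a failure) yields the same records.

```scala
// Hypothetical stand-in for a replayable stream (not a Spark API).
object ReplayDemo {
  // The record at position i has offset i, mimicking per-record offsets.
  val log: Vector[String] = Vector("a", "b", "c", "d", "e")

  // Returns the records with offsets in (start, end].
  // A start of None means "from the very first record".
  def replay(start: Option[Long], end: Long): Seq[String] = {
    val from = start.map(_ + 1).getOrElse(0L).toInt
    log.slice(from, end.toInt + 1)
  }

  def main(args: Array[String]): Unit = {
    val firstAttempt = replay(Some(1L), 3L)
    // Simulate recovery after a failure: the same range is requested again.
    val retry = replay(Some(1L), 3L)
    assert(firstAttempt == retry)
    println(firstAttempt.mkString(","))
  }
}
```

Because a range of offsets always maps to the same records, a failed batch can be recomputed without losing or duplicating input.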

A streaming source is described by the Source contract in the org.apache.spark.sql.execution.streaming package.

import org.apache.spark.sql.execution.streaming.Source

There are the following Source implementations available:

Source Contract

The Source contract requires that a streaming source provide the following:

  1. The schema of the data, using the schema method.

  2. The maximum available offset (of type Offset), using the getOffset method.

  3. A batch (of type DataFrame) for given start and end offsets, using the getBatch method.
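The shape of the contract can be sketched without a Spark dependency. The model below is deliberately simplified and is not Spark's Source trait: SimpleSource and InMemorySource are hypothetical names, and the sketch substitutes column names for the schema type, Long for Offset, and Seq[String] for DataFrame. The method names schema, getOffset and getBatch mirror the three features of the contract.

```scala
// Simplified model of the contract. These are NOT Spark's actual types:
// column names stand in for the schema, Long for Offset, Seq[String] for DataFrame.
trait SimpleSource {
  def schema: Seq[String]
  // Maximum available offset, or None when no data has arrived yet.
  def getOffset: Option[Long]
  // Records with offsets in (start, end]; a start of None means "from the beginning".
  def getBatch(start: Option[Long], end: Long): Seq[String]
}

// Hypothetical in-memory source: the record at position i has offset i.
final class InMemorySource(records: Vector[String]) extends SimpleSource {
  val schema: Seq[String] = Seq("value")

  def getOffset: Option[Long] =
    if (records.isEmpty) None else Some(records.size - 1L)

  def getBatch(start: Option[Long], end: Long): Seq[String] = {
    val from = start.map(_ + 1).getOrElse(0L).toInt
    records.slice(from, end.toInt + 1)
  }
}

object SourceContractDemo {
  def main(args: Array[String]): Unit = {
    val source = new InMemorySource(Vector("a", "b", "c"))
    // Nothing has been processed yet, so there is no start offset.
    val lastProcessed: Option[Long] = None
    // Ask for the latest offset, then fetch everything up to it.
    source.getOffset.foreach { end =>
      println(source.getBatch(lastProcessed, end).mkString(","))
    }
  }
}
```

A driver built on such a contract would typically remember the end offset of each processed batch and pass it as the start of the next call, which is what makes replay after a failure possible.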

