Source Contract — Streaming Sources for Micro-Batch Stream Processing (Data Source API V1)

Source is the extension of the BaseStreamingSource contract for streaming sources that work with a continuous stream of data identified by offsets.

Source is part of Data Source API V1 and is used in Micro-Batch Stream Processing only.

For fault tolerance, Source must be able to replay an arbitrary sequence of past data in a stream using a range of offsets. This assumption is what allows Structured Streaming to achieve end-to-end exactly-once guarantees.

Table 1. Source Contract
Method Description

commit

commit(end: Offset): Unit

Commits data up to the end offset, i.e. informs the source that Spark has completed processing all data for offsets less than or equal to the end offset, and will only request offsets greater than the end offset in the future.

getBatch

getBatch(
  start: Option[Offset],
  end: Offset): DataFrame

Generates a streaming DataFrame with the data between the start and end offsets

The start offset can be undefined (None) to indicate that the batch should begin with the first record

Used when the MicroBatchExecution stream execution engine (Micro-Batch Stream Processing) is requested to run an activated streaming query

getOffset

getOffset: Option[Offset]

Latest (maximum) offset of the source (or None to denote no data)

schema

schema: StructType

Schema of the source

Note
schema seems to be used in tests only and duplicates StreamSourceProvider.sourceSchema.
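The contract methods above fit together as one micro-batch cycle. The following is a simplified sketch of that cycle, not the actual MicroBatchExecution code; the names `source`, `lastCommittedOffset`, and `runBatch` are hypothetical stand-ins for engine-internal state and logic.

```scala
// One micro-batch cycle (illustrative sketch only):
val available: Option[Offset] = source.getOffset      // 1. Is there new data?
available.foreach { end =>
  // None on the very first batch => start from the first record
  val start: Option[Offset] = lastCommittedOffset
  val batchDF = source.getBatch(start, end)           // 2. Replayable (start, end] range
  runBatch(batchDF)                                   // 3. Process the micro-batch
  source.commit(end)                                  // 4. Data up to `end` is done
}
```

If the query fails between steps 3 and 4, the engine simply calls getBatch again with the same offset range on restart, which is why the source must be able to replay any past range.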
Table 2. Sources
Source Description

FileStreamSource

Part of file-based data sources (FileFormat)

KafkaSource

Part of the kafka data source
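Besides the built-in sources above, the contract can be implemented directly. The following skeletal Source is a sketch only, serving an in-memory counter; a real source would read from an external system, and would build the batch DataFrame from InternalRows (e.g. via SQLContext.internalCreateDataFrame with isStreaming enabled) so the result is marked as streaming.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Illustrative sketch of a custom Data Source API V1 Source.
class DemoSource(sqlContext: SQLContext) extends Source {

  @volatile private var latest = -1L // highest value "received" so far

  override def schema: StructType =
    StructType(StructField("value", LongType) :: Nil)

  // Latest (maximum) offset, or None when no data has arrived yet
  override def getOffset: Option[Offset] =
    if (latest < 0) None else Some(LongOffset(latest))

  // Replays the records in (start, end]; a None start means "from the beginning".
  // NOTE: sqlContext.range gives a batch DataFrame; shown here only to keep the
  // sketch short -- see the caveat about streaming DataFrames above.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map { case LongOffset(n) => n + 1 }.getOrElse(0L)
    val to = end.asInstanceOf[LongOffset].offset
    sqlContext.range(from, to + 1).toDF("value")
  }

  // Spark is done with everything up to `end`; the source may discard it
  override def commit(end: Offset): Unit = ()

  override def stop(): Unit = ()
}
```

Note how commit is the only place a source learns that buffered data can safely be dropped; until then every offset range must stay replayable.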
