Streaming Data Source

A Streaming Data Source is a "continuous" stream of data that is described using the Source Contract.

Given an optional start offset and an end offset, a Source generates a streaming DataFrame (aka batch) with the records in that offset range.

For fault tolerance, a Source must be able to replay data given a start offset.

A Source should be able to replay an arbitrary range of past data in a stream using offsets. Streaming sources like Apache Kafka and Amazon Kinesis (with their per-record offsets) fit this model nicely. This assumption is what allows Structured Streaming to achieve end-to-end exactly-once guarantees.
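The replay requirement can be illustrated with a small, Spark-free model of an offset-addressable source (the `ToyOffsetLog` class and its record layout are hypothetical, for illustration only): because every record has a stable offset, re-requesting the same range after a failure yields exactly the same data.

```scala
// Toy, Spark-free model of an offset-addressable source (hypothetical, for
// illustration): every record has a monotonically increasing offset, so any
// past range (start, end] can be replayed deterministically after a failure.
final class ToyOffsetLog {
  // Append-only log; the index of each element is its offset.
  private var log = Vector.empty[String]

  def append(record: String): Unit = log = log :+ record

  // Latest available offset, if any data has arrived (mirrors Source.getOffset).
  def getOffset: Option[Int] = if (log.isEmpty) None else Some(log.size - 1)

  // Records in the range (start, end] (mirrors Source.getBatch);
  // a missing start means "from the beginning of the stream".
  def getBatch(start: Option[Int], end: Int): Vector[String] = {
    val from = start.map(_ + 1).getOrElse(0)
    log.slice(from, end + 1)
  }
}

object ToyOffsetLogDemo extends App {
  val source = new ToyOffsetLog
  Seq("a", "b", "c", "d").foreach(source.append)
  val firstRun = source.getBatch(start = Some(0), end = 2)
  val replay   = source.getBatch(start = Some(0), end = 2)
  // Replaying the same offset range yields identical data:
  println(firstRun == replay)
}
```

Deterministic replay of an offset range is precisely what lets a failed micro-batch be re-executed without producing duplicates or losing records.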

Table 1. Sources

  Format                 Source
  ------                 ------
  any FileFormat:        FileStreamSource
    • csv
    • hive
    • json
    • libsvm
    • orc
    • parquet
    • text
  kafka                  KafkaSource
  memory                 MemoryStream
  rate                   RateStreamSource
  socket                 TextSocketSource
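The format names in the table are the identifiers passed to DataStreamReader.format. A minimal sketch, assuming a running SparkSession named spark (not a complete application):

```scala
// Minimal sketch (assumes a SparkSession named `spark` is in scope):
// the format name from the table selects the underlying streaming source.
val rates = spark.readStream
  .format("rate")                    // backed by RateStreamSource
  .option("rowsPerSecond", 10)
  .load()

val lines = spark.readStream
  .format("socket")                  // backed by TextSocketSource
  .option("host", "localhost")
  .option("port", 9999)
  .load()
```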

Source Contract

package org.apache.spark.sql.execution.streaming

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

trait Source {
  def commit(end: Offset) : Unit = {}
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  def getOffset: Option[Offset]
  def schema: StructType
  def stop(): Unit
}
Table 2. Source Contract
Method Description

commit

Informs the source that all data up to the given end offset has been processed and can be discarded.

getBatch

Generates a DataFrame (with the new rows) for a given batch, described using the optional start offset and the end offset.

Used when StreamExecution runs a streaming batch and populates start offsets (populateStartOffsets).

getOffset

Returns the latest offset available (if any).

Note
Offset is…​FIXME

Used exclusively when StreamExecution runs streaming batches (to construct the next streaming batch for every streaming data source in a streaming Dataset)

schema

Schema of the data from this source

stop

Stops the source and releases any resources it holds.

Used when:
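How the contract's methods work together can be sketched with a simplified, Spark-free simulation of a micro-batch loop (SimpleSource, QueueSource, and the loop itself are hypothetical stand-ins for the real Source and StreamExecution):

```scala
// Spark-free simulation (hypothetical) of how a micro-batch loop drives the
// Source contract: getOffset discovers new data, getBatch fetches the range,
// commit tells the source that processed data may be discarded.
trait SimpleSource {
  def getOffset: Option[Long]
  def getBatch(start: Option[Long], end: Long): Seq[String]
  def commit(end: Long): Unit
  def stop(): Unit
}

final class QueueSource extends SimpleSource {
  private var data = Vector.empty[String]
  private var committed: Option[Long] = None

  def add(records: String*): Unit = data = data ++ records
  def getOffset: Option[Long] = if (data.isEmpty) None else Some(data.size - 1L)
  def getBatch(start: Option[Long], end: Long): Seq[String] = {
    val from = start.map(_ + 1).getOrElse(0L).toInt
    data.slice(from, end.toInt + 1)
  }
  def commit(end: Long): Unit = committed = Some(end)
  def stop(): Unit = ()
}

object MicroBatchLoopDemo extends App {
  val source = new QueueSource
  source.add("a", "b", "c")

  var lastEnd: Option[Long] = None
  // One iteration of the (simplified) micro-batch loop:
  source.getOffset.foreach { latest =>
    val batch = source.getBatch(lastEnd, latest)
    println(s"processing: $batch")
    source.commit(latest)
    lastEnd = Some(latest)
  }
  source.stop()
}
```

In Spark itself this loop lives in StreamExecution, which additionally persists the chosen offsets to a write-ahead log before calling getBatch, so that a restarted query re-requests exactly the same range.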
