Kafka Data Source

Spark SQL supports reading data from or writing data to one or more topics in Apache Kafka.


Apache Kafka is a distributed storage system that keeps streams of records in a format-independent, fault-tolerant and durable way.

Read up on Apache Kafka in the official documentation or in my other gitbook Mastering Apache Kafka.

Kafka Data Source supports options (Kafka consumer properties as well as source-specific options) to tune the performance of structured queries that use it.
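As a sketch of such a performance option (assuming Spark 2.4 or later, a broker at localhost:9092 and a topic1 topic), the minPartitions option asks the Kafka data source to divide the Kafka topic-partitions into at least that many Spark partitions:

```scala
// A sketch, not a tuning recommendation: minPartitions requests at least
// that many Spark partitions for the records read from Kafka.
val kafka = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .option("minPartitions", "30") // hypothetical value
  .load()
```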

Reading Data from Kafka Topics

As a Spark developer, you use DataFrameReader.format method to specify Apache Kafka as the external data source to load data from.

You use kafka (or org.apache.spark.sql.kafka010.KafkaSourceProvider) as the input data source format.

val kafka = spark.read.format("kafka").load

// Alternatively
val kafka = spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider").load

These one-liners create a DataFrame that represents the distributed process of loading data from one or many Kafka topics (with additional properties).
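The DataFrame has the fixed schema of the Kafka data source, with the record payload in the binary key and value columns alongside the topic, partition, offset and timestamp metadata:

```scala
kafka.printSchema
// root
//  |-- key: binary (nullable = true)
//  |-- value: binary (nullable = true)
//  |-- topic: string (nullable = true)
//  |-- partition: integer (nullable = true)
//  |-- offset: long (nullable = true)
//  |-- timestamp: timestamp (nullable = true)
//  |-- timestampType: integer (nullable = true)

// Since key and value are binary, cast them to work with them as text.
import org.apache.spark.sql.functions.col
val values = kafka.select(col("value").cast("string"))
```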

Writing Data to Kafka Topics

As a Spark developer,…​FIXME
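Until this section is filled in, a minimal sketch (assuming a broker at localhost:9092 and an existing topic1 topic): you use DataFrameWriter.format with kafka as the output data source format. The Kafka sink expects a value column (key and topic columns are optional), and the target topic can be given via the topic option instead of a topic column:

```scala
// A sketch, assuming a local broker and topic1; value must be string or binary.
import org.apache.spark.sql.functions.col

spark.range(5)
  .select(col("id").cast("string") as "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "topic1")
  .save()
```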
