KafkaUtils — Creating Kafka DStreams and RDDs

KafkaUtils is the object with the factory methods to create input DStreams and RDDs from records in Apache Kafka topics.

import org.apache.spark.streaming.kafka010.KafkaUtils
Tip

Enable WARN logging level for org.apache.spark.streaming.kafka010.KafkaUtils logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.streaming.kafka010.KafkaUtils=WARN

Refer to Logging.

Creating Kafka DStream — createDirectStream Method

createDirectStream[K, V](
  ssc: StreamingContext,
  locationStrategy: LocationStrategy,
  consumerStrategy: ConsumerStrategy[K, V]): InputDStream[ConsumerRecord[K, V]]

createDirectStream creates a DirectKafkaInputDStream from the given StreamingContext, LocationStrategy, and ConsumerStrategy.

Tip

Enable DEBUG logging level for org.apache.kafka.clients.consumer.KafkaConsumer logger to see what happens inside the Kafka consumer that is used to communicate with Kafka broker(s).

The following DEBUG messages are logged when a DirectKafkaInputDStream is started.

DEBUG KafkaConsumer: Starting the Kafka consumer
DEBUG KafkaConsumer: Kafka consumer created
DEBUG KafkaConsumer: Subscribed to topic(s): basic1, basic2, basic3

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.kafka.clients.consumer.KafkaConsumer=DEBUG

Refer to Logging.

Using KafkaUtils.createDirectStream to Connect to Kafka Brokers
// Include org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0-SNAPSHOT dependency in the CLASSPATH, e.g.
// $ ./bin/spark-shell --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0-SNAPSHOT

import org.apache.spark.streaming._
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate
val ssc = new StreamingContext(sc, Seconds(5))

import org.apache.spark.streaming.kafka010._

val preferredHosts = LocationStrategies.PreferConsistent
val topics = List("topic1", "topic2", "topic3")
import org.apache.kafka.common.serialization.StringDeserializer
val kafkaParams = Map(
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-notes",
  "auto.offset.reset" -> "earliest"
)
import org.apache.kafka.common.TopicPartition
val offsets = Map(new TopicPartition("topic3", 0) -> 2L)

val dstream = KafkaUtils.createDirectStream[String, String](
  ssc,
  preferredHosts,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))

dstream.foreachRDD { rdd =>
  // Get the offset ranges in the RDD
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} offsets: ${o.fromOffset} to ${o.untilOffset}")
  }
}

ssc.start

// the above code prints out the offset ranges per topic and partition
// every 5 seconds until you stop it.

ssc.stop(stopSparkContext = false)

Creating Kafka RDD — createRDD Method

def createRDD[K, V](
  sc: SparkContext,
  kafkaParams: java.util.Map[String, Object],
  offsetRanges: Array[OffsetRange],
  locationStrategy: LocationStrategy): RDD[ConsumerRecord[K, V]]

createRDD creates a KafkaRDD with the records in the given offset ranges (one RDD partition per OffsetRange).

Caution
FIXME
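
The following is a minimal sketch of createRDD in the Spark shell. The broker address, the topic name (topic1) and the offset range are assumptions, so adjust them to records that actually exist in your cluster.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import scala.collection.JavaConverters._

// createRDD takes the Kafka parameters as a java.util.Map
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-notes").asJava

// each OffsetRange pins a topic, a partition and the exact offsets to read
// (assumes topic1 has at least 10 records in partition 0)
val offsetRanges = Array(OffsetRange("topic1", 0, 0L, 10L))

val rdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

rdd.map(record => (record.key, record.value)).collect.foreach(println)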

fixKafkaParams Internal Method

fixKafkaParams(kafkaParams: ju.HashMap[String, Object]): Unit

fixKafkaParams adjusts Kafka parameters to avoid issues when communicating with Kafka brokers from Spark executors.

Caution
FIXME
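
The following sketch only illustrates the kind of executor-side overrides fixKafkaParams applies; the exact keys and values are an assumption, not the actual Spark implementation. The intent is that executor consumers never auto-commit, never reset offsets on their own, and stay out of the driver's consumer group.

import java.{util => ju}
import org.apache.kafka.clients.consumer.ConsumerConfig

// sketch only (assumption, not the actual Spark code)
def fixKafkaParamsSketch(kafkaParams: ju.HashMap[String, Object]): Unit = {
  // executors must not commit offsets on their own
  kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false: java.lang.Boolean)
  // executors read exactly the offsets the driver assigns, so resetting is an error
  kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none")
  // keep executor consumers in a group distinct from the driver's
  val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG)
  kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, s"spark-executor-$groupId")
}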
