Overview of Kafka
Apache Kafka is an open source project for a distributed publish-subscribe messaging system rethought as a distributed commit log.
Kafka stores messages in topics that are partitioned and replicated across multiple brokers in a cluster. Producers send messages to topics from which consumers read.
Language Agnostic — producers and consumers use binary protocol to talk to a Kafka cluster.
Messages are byte arrays (with String, JSON, and Avro being the most common formats). If a message has a key, Kafka makes sure that all messages of the same key are in the same partition.
Consumers may be grouped in a consumer group with multiple consumers. Each consumer in a consumer group will read messages from a unique subset of partitions in each topic they subscribe to. Each message is delivered to one consumer in the group, and all messages with the same key arrive at the same consumer.
Durability — Kafka does not track which messages were read by each consumer. Kafka keeps all messages for a finite amount of time, and it is consumers' responsibility to track their location per topic, i.e. offsets.
It is worth to note that Kafka is often compared to the following open source projects:
-
Apache ActiveMQ and RabbitMQ given they are message broker systems, too.
-
Apache Flume for its ingestion capabilities designed to send data to HDFS and Apache HBase.