The document presents an overview of Kafka, an open-source message broker designed for asynchronous communication between systems, highlighting its architecture, features, and use cases. It discusses Kafka's terminology, Zookeeper's role in coordination, and compares Kafka with RabbitMQ, emphasizing its scalability and message handling capabilities. Additionally, it includes a sample application for an e-shopping system that demonstrates Kafka's functionality.
Messaging Systems
• Asynchronous communication between systems
• Some use cases
  • Web application: respond quickly to the client and handle heavy processing tasks asynchronously
  • Balance load between workers
  • Decouple processing from data producers
• Models
  • Queuing: a pool of consumers may read from a server, and each message goes to one of them
  • Publish-subscribe: the message is broadcast to all consumers
[Diagram: Producer -> Messaging System -> Consumer]
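The web-application use case above can be sketched with Python's standard library alone. This is a minimal in-process illustration, not a real messaging system: `queue.Queue` stands in for the broker, and the names `handle_request`, `worker`, and `results` are hypothetical.

```python
import queue
import threading

tasks = queue.Queue()   # stands in for the messaging system
results = []            # filled in by the background worker

def handle_request(order_id):
    """Producer: enqueue the heavy work and answer the client immediately."""
    tasks.put(order_id)
    return f"accepted {order_id}"

def worker():
    """Consumer: process tasks asynchronously until a None sentinel arrives."""
    while True:
        order_id = tasks.get()
        if order_id is None:
            tasks.task_done()
            break
        results.append(f"processed {order_id}")  # placeholder for heavy work
        tasks.task_done()

thread = threading.Thread(target=worker)
thread.start()
responses = [handle_request(i) for i in range(3)]
tasks.put(None)   # tell the worker to stop
tasks.join()      # wait until every queued item has been handled
thread.join()
```

The caller gets its response before the work is done, which is the decoupling the slide describes; adding more worker threads would give the load-balancing case.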
Kafka
• Kafka is an open-source message broker project
• Distributed, replicated, scalable, durable, and high-throughput
• Aim: "central nervous system for data"
• The design is heavily influenced by transaction logs
• Built at LinkedIn with a specific purpose in mind: to serve as a central repository of data streams
Kafka
• With Kafka in place, LinkedIn reported the following figures as of March 2015:
  • 800B messages produced per day, almost 175 TB of data
  • 1100 Kafka brokers organized in 60 clusters
• As of September 2015: around 1.1 trillion messages a day
• Written in Scala, open-sourced in 2011 under the Apache Software Foundation
• Apache top-level project since 2012
Kafka Terminology
Kafka broker
• Designed for high availability (HA): there are no master nodes, and all nodes are interchangeable
• Data is replicated
• Messages are stored for a configurable period of time
Topic
• A topic is a category or feed name to which messages are published
• Topics are partitioned
Log
• Append-only
• Totally ordered sequence of records, ordered by time
• Records what happened and when
Kafka Terminology (cont.)
• Partitions
  • Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log
  • Each message in the partition is assigned a unique sequential ID, its offset
  • More partitions allow greater parallelism for consumption
  • Partitions allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic can handle an arbitrary amount of data
  • The number of partitions determines the maximum number of parallel workers (consumers in one group)
  • Each partition has one server that acts as the "leader" and zero or more servers that act as "followers"
  • The leader handles all read and write requests for the partition
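The partition and offset ideas above can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions (no replication, no leaders, no persistence); the class names `Partition` and `Topic` and their methods are hypothetical and are not Kafka's API.

```python
class Partition:
    """An append-only log; a record's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset assigned to the new record

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Topic:
    """A topic is a fixed set of independent partitions."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def append(self, partition_id, record):
        return self.partitions[partition_id].append(record)


topic = Topic(num_partitions=2)
first = topic.append(0, "a")
second = topic.append(0, "b")
other = topic.append(1, "c")   # offsets are per partition, so this starts at 0 again
replay = topic.partitions[0].read(offset=0)   # reading never removes records
```

Because reads only take an offset and never mutate the log, a consumer can rewind and replay at will, and ordering is guaranteed only within a single partition.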
Kafka Terminology (cont.)
Producers
• Send messages to topics synchronously or asynchronously
• They decide
  • how each message is routed: an explicit partition, a key, neither, or a custom Partitioner class
  • what sort of replication guarantees they want (acks setting)
  • batching and compression
Consumers and Consumer Groups
• Consumers label themselves with a consumer group name and subscribe to one or more topics
• Consumers pull messages
• They control their own read offset, so they can re-read messages without extra overhead on the broker
• Each consumer in a consumer group reads messages from a unique subset of partitions in each topic it subscribes to, so each message is delivered to one consumer in the group, and all messages with the same key arrive at the same consumer
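The "same key, same consumer" property follows from two mappings: key to partition, and partition to consumer. The sketch below makes both explicit. It is an illustration only: CRC32 is used as a stand-in stable hash (Kafka's own clients use a different hash), the round-robin assignment is one simple strategy, and the names `partition_for`, `assign`, and `consumer_for` are hypothetical.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(partitions, consumers):
    """Give each consumer in one group a disjoint subset of partitions (round-robin)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

NUM_PARTITIONS = 4
assignment = assign(list(range(NUM_PARTITIONS)), ["c1", "c2"])
owner = {p: c for c, ps in assignment.items() for p in ps}

def consumer_for(key):
    """The single consumer in the group that sees every message with this key."""
    return owner[partition_for(key, NUM_PARTITIONS)]
```

Since every partition has exactly one owner within the group, any two messages that hash to the same partition are read by the same consumer, in order.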
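Kafka Terminology: Consumer Groups
[Diagram: queue model vs. publish-subscribe model. Queue model: one topic read by consumers C1 to C4, split across ConsGroup1 and ConsGroup2, with each consumer receiving one of m1 or m2. Publish-subscribe model: one topic read by C1 and C2, one in ConsGroup1 and one in ConsGroup2, with each consumer receiving both m1 and m2.]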
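How consumer groups yield both models can be shown with a toy delivery function. This is a simplified sketch: real Kafka distributes by partition ownership, whereas the hypothetical `deliver` function below simply rotates messages among the members of each group.

```python
from collections import defaultdict

def deliver(messages, groups):
    """Deliver every message once per group, rotating among that group's members."""
    received = defaultdict(list)
    for i, message in enumerate(messages):
        for members in groups.values():
            received[members[i % len(members)]].append(message)
    return dict(received)

# Queue model: all consumers share one group, so each message goes to one of them.
queue_model = deliver(["m1", "m2"], {"ConsGroup1": ["C1", "C2"]})

# Publish-subscribe model: one group per consumer, so each consumer gets every message.
pubsub_model = deliver(["m1", "m2"], {"ConsGroup1": ["C1"], "ConsGroup2": ["C2"]})
```

The only difference between the two calls is how consumers are grouped; the topic and the messages are identical.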
Zookeeper
• ZooKeeper is a fast, highly available, fault-tolerant, distributed coordination service
  • helps with distributed synchronization
  • maintains configuration information
• Replicated: like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble
• Role in Kafka architecture
  • Coordinate cluster information
  • Store cluster metadata
  • Store consumer offsets
Differences with RabbitMQ
Feature by feature: Kafka compared with a JMS message broker / RabbitMQ
Dequeuing
• Kafka: the cluster retains all published messages, whether or not they have been consumed, for a configurable period of time
• RabbitMQ: when the consumer acknowledges
Consumer metadata
• Kafka: the only metadata retained on a per-consumer basis is the offset
• RabbitMQ: consumer acknowledgments per message
Ordering
• Kafka: strong ordering within a partition
• RabbitMQ: ordering of the messages is lost in the presence of parallel consumption. The "exclusive consumer" workaround sacrifices parallelism
Batching / Streaming
• Kafka: available for both producer and consumer; supports online and offline consumers
• RabbitMQ: consumers are mostly online
Scalability
• Kafka: client centric
• RabbitMQ: broker centric
Complex routing
• Kafka: needs to be programmed
• RabbitMQ: lots of options available with less work
Monitoring UI
• Kafka: needs work
• RabbitMQ: decent web UI available
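The dequeuing difference in the comparison above can be made concrete with two toy classes. This is a sketch only: `RetainingLog` and `AckQueue` are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and neither class reflects the real brokers' internals.

```python
class RetainingLog:
    """Kafka-style: records stay until they age out, consumed or not."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []   # list of (timestamp, message)

    def publish(self, message, now):
        self.records.append((now, message))

    def read_all(self):
        """Reading does not remove anything from the log."""
        return [m for _, m in self.records]

    def expire(self, now):
        """Drop records older than the retention period."""
        self.records = [(t, m) for t, m in self.records
                        if now - t < self.retention_seconds]


class AckQueue:
    """Queue-style: a message is removed once a consumer acknowledges it."""

    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)

    def consume_and_ack(self):
        return self.messages.pop(0)


log = RetainingLog(retention_seconds=10)
log.publish("a", now=0)
log.publish("b", now=5)
before = log.read_all()
log.expire(now=10)        # "a" is now 10 seconds old and is dropped
after = log.read_all()

q = AckQueue()
q.publish("a")
q.publish("b")
taken = q.consume_and_ack()
```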
Common Use Cases
• Messaging
• Website Activity Tracking
  • The original use case for Kafka, often very high volume
  • Site activity (page views, searches, etc.) -> published to central topics -> subscribed to by different consumers for various use cases: real-time processing, monitoring, and loading into Hadoop or offline processing and reporting
• Log Aggregation
• Stream Processing
  • Collect data from various sources
  • Aggregate the data as soon as it arrives
  • Feed it to systems such as Hadoop, databases, or other clients
Kafka 0.9 Features
• Security
  • Authenticate users using either Kerberos or TLS client certificates
  • Unix-like permission system to control which user can access which data
  • Encryption
• Kafka Connect
• User-defined quotas
• New consumer
  • New Java client
  • Group management facility
  • Faster rebalancing
  • Fully decouples clients from ZooKeeper
Bootstrapping
Bootstrapping for producers
1. Cycle through a list of "bootstrap" Kafka URLs until we find one we can connect to. Fetch cluster metadata.
2. Process fetch or produce requests, directing them to the appropriate broker based on the topic/partitions they send to or fetch from.
3. If we get an appropriate error, refresh the metadata and try again.
Bootstrapping of consumers
1. On startup or on coordinator failover, the consumer sends a ConsumerMetadataRequest to any of the brokers in the bootstrap.brokers list and receives the location of the coordinator for its group.
2. The consumer connects to the coordinator and sends a HeartbeatRequest.
3. If no error is returned in the HeartbeatResponse, the consumer continues fetching data for the list of partitions it last owned, without interruption.
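The first two producer bootstrapping steps can be sketched as follows. This is a minimal simulation, not client code: the `connect` callback, the `fake_connect` stand-in, the broker addresses, and the metadata shape (a dict from topic/partition to broker) are all assumptions made for illustration.

```python
def fetch_metadata(bootstrap_urls, connect):
    """Try each bootstrap URL in turn; return metadata from the first that answers."""
    for url in bootstrap_urls:
        try:
            return connect(url)
        except ConnectionError:
            continue   # this broker is unreachable, try the next one
    raise ConnectionError("no bootstrap broker reachable")

def broker_for(metadata, topic, partition):
    """Pick the broker that should receive requests for a topic partition."""
    return metadata[(topic, partition)]

def fake_connect(url):
    """Hypothetical stand-in for a network call: only broker2 is up."""
    if url != "broker2:9092":
        raise ConnectionError(url)
    return {("orders", 0): "broker2:9092", ("orders", 1): "broker3:9092"}

metadata = fetch_metadata(["broker1:9092", "broker2:9092"], fake_connect)
target = broker_for(metadata, "orders", 1)
```

The bootstrap list only needs one reachable broker; after that, requests go to whichever broker the metadata names, which may not be in the bootstrap list at all.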
Sample Application
• E-shopping system: simplified scenario
• Supports shipping in two cities
• Once an order is placed, we need to handle payment and shipping
• The shipping system is more efficient if requests are grouped by city
• See the simple architecture diagram on the next slide and check out the code
In the demo application, we will cover:
• ZooKeeper config
• Broker config
• Start two brokers
• Create a topic and describe / list
• Producer config
• Message delivery semantics
• Consumer config
• Consumer rebalancing
• Sample application code: https://github.com/teamclairvoyant/meetup-docs/tree/master/Meetup-Kafka
RabbitMQ
• Proven message broker that uses the Advanced Message Queuing Protocol (AMQP) for messaging
• Message flow and concepts in RabbitMQ
  • The producer publishes a message
  • The exchange receives the message and routes it into the queues
  • Routing can be based on different message attributes, such as the routing key, depending on the exchange type
  • A binding is a link between an exchange and a queue
  • The messages stay in the queue until they are handled by a consumer
  • The consumer handles the message
  • Channel: a virtual connection inside a connection. Publishing messages, consuming messages, and subscribing to a queue are all done over a channel
RabbitMQ (cont.)
• Types of Exchange
  • Direct: delivers messages to queues based on the message routing key: the queue's binding key == the routing key of the message
  • Fanout: routes messages to all of the queues that are bound to it
  • Topic: does a wildcard match between the routing key and the routing pattern specified in the binding
  • Headers: uses the message header attributes for routing
• CloudAMQP
  • Hosted RabbitMQ solution: just sign up for an account and create an instance. You do not need to set up and install RabbitMQ or care about cluster handling
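The direct, fanout, and topic exchange types listed above differ only in how a routing key is compared with each binding. The sketch below models that comparison in plain Python; it is not RabbitMQ client code, the function names `topic_matches` and `route` are hypothetical, and the headers exchange is left out. It assumes the AMQP topic wildcards, where `*` stands for exactly one word and `#` for zero or more words.

```python
def topic_matches(pattern, routing_key):
    """AMQP topic match: '*' is exactly one word, '#' is zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb zero words, or one word and stay available for more.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))

def route(exchange_type, bindings, routing_key):
    """Return the queues that receive a message; bindings are (queue, binding_key) pairs."""
    if exchange_type == "fanout":
        return [q for q, _ in bindings]          # routing key is ignored
    if exchange_type == "direct":
        return [q for q, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [q for q, key in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type: {exchange_type}")

# Illustrative bindings; queue names and keys are made up for this example.
bindings = [("pune", "order.pune"), ("all_orders", "order.*"), ("audit", "#")]
direct_queues = route("direct", bindings, "order.pune")
topic_queues = route("topic", bindings, "order.pune")
```

With the same bindings, a direct exchange delivers only to the exact match, while a topic exchange also delivers to every queue whose pattern covers the key.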
RabbitMQ (cont.)
• Management and Monitoring
  • Nice web UI for management and monitoring of your RabbitMQ server
  • Lets you create, delete, and list queues, monitor queue length, check message rates, add users and change their permissions, etc.
Upgrading from 0.8.0, 0.8.1.X, or 0.8.2.X to 0.9.0.0
• 0.9.0.0 has potential breaking changes (please review before upgrading) and an inter-broker protocol change from previous versions
• Java 1.6 and Scala 2.9 are no longer supported
• http://kafka.apache.org/documentation.html
• Kafka consumers in earlier releases store their offsets in ZooKeeper by default. It is possible to migrate these consumers to commit offsets into Kafka by following some steps
Kafka Terminology (cont.)
• Protocol
  • Requests to publish or fetch data must be sent to the broker that is currently acting as the leader for a given partition. This condition is enforced by the broker, so a request for a particular partition sent to the wrong broker will result in the NotLeaderForPartition error code
  • All Kafka brokers can answer a metadata request that describes the current state of the cluster:
    • what topics there are
    • which partitions those topics have
    • which broker is the leader for those partitions
    • the host and port information for these brokers
  • Good explanation:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
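The leader rule and the metadata request described above combine into a simple client pattern: send using cached metadata, and refresh it when the broker rejects the request. The sketch below simulates this in memory. The `Cluster` class, the `send` helper, and the exception class are hypothetical stand-ins; only the error name NotLeaderForPartition comes from the text.

```python
class NotLeaderForPartition(Exception):
    """Raised by the simulated broker that is not the leader for the partition."""

class Cluster:
    def __init__(self, leaders):
        self.leaders = dict(leaders)   # (topic, partition) -> broker id

    def metadata(self):
        """Any broker can answer this; it describes the current leaders."""
        return dict(self.leaders)

    def produce(self, broker_id, topic, partition, message):
        """The broker side enforces that only the leader accepts the request."""
        if self.leaders[(topic, partition)] != broker_id:
            raise NotLeaderForPartition((topic, partition))
        return f"broker {broker_id} stored {message!r}"

def send(cluster, cached, topic, partition, message):
    """Send using cached metadata; on a leadership error, refresh once and retry."""
    try:
        return cluster.produce(cached[(topic, partition)], topic, partition, message)
    except NotLeaderForPartition:
        cached.update(cluster.metadata())
        return cluster.produce(cached[(topic, partition)], topic, partition, message)

cluster = Cluster({("orders", 0): 1})
cached = cluster.metadata()
cluster.leaders[("orders", 0)] = 2   # leadership moves; the client's cache is now stale
result = send(cluster, cached, "orders", 0, "m1")
```

A real client would bound the number of retries and back off between them; a single retry keeps the sketch short.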
Kafka Adoption
Apache Kafka has become a popular messaging system in a short period of time, with a number of organizations using it in production systems, including:
• LinkedIn
• Tumblr
• PayPal
• Cisco
• Box
• Airbnb
• Netflix
• Square
• Spotify
• Pinterest
• Uber
• Goldman Sachs
• Yahoo
• Twitter