The document introduces Apache Kafka, a distributed streaming platform that facilitates high-throughput, fault-tolerant message delivery between producers and consumers. It highlights Kafka's features, including scalability, durability, and reliability, while explaining how messages are organized in topics and partitions. Additionally, it outlines Kafka's delivery semantics and use cases, demonstrating its utility in real-time data processing and integration within complex application architectures.
What is Kafka?
Kafka is a distributed streams processing system: messages are sent by distributed Producers to distributed Consumers via a distributed Kafka Cluster.
3.
Kafka Benefits
• Fast
• Scalable
• Reliable
• Durable
• Open Source
• Managed Service
4.
Kafka Benefits
• Fast - high throughput and low latency
• Scalable - horizontally scalable with nodes and partitions
• Reliable - distributed and fault tolerant
• Durable - zero data loss, messages persisted to disk with an immutable log
• Open Source - an Apache project
• Available as a Managed Service - on multiple cloud platforms
"Poste Restante"?
• Not a post office in a restaurant
• General delivery (in the US)
• The mail is delivered to a post office, and they hold it for you until you call
Benefits include
• Disconnected delivery - the consumer doesn't need to be available to receive messages
• Less effort for the messaging service - it only has to deliver to a few locations, not every consumer
• Can scale better and handle more complex delivery semantics!
Santa
North Pole Topic
Timestamp, offset, partition
Key -> Partition (optional)
Kafka Producers and Consumers need a serializer and de-serializer to write & read the key and value
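The serializer/de-serializer idea can be sketched in plain Python. This is an illustrative JSON codec, not any particular Kafka client's API, though clients such as kafka-python do accept similar callables as `value_serializer` / `value_deserializer`; the record contents here are invented for the Santa example.

```python
import json

def serialize(obj):
    # Encode a key or value to bytes - Kafka stores and ships raw byte arrays.
    return json.dumps(obj).encode("utf-8")

def deserialize(data):
    # Decode bytes back into an object on the consumer side.
    return json.loads(data.decode("utf-8"))

# A record's key and value both travel through Kafka as bytes:
key_bytes = serialize("Santa")
value_bytes = serialize({"to": "Santa", "body": "Dear Santa, ..."})

assert deserialize(key_bytes) == "Santa"
assert deserialize(value_bytes)["to"] == "Santa"
```

The point is that producer and consumer must agree on the codec: Kafka itself never interprets the bytes.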
30.
• Kafka doesn't look at the value
• The Consumer can read the value
• And try to make sense of the message
• What will Santa be delivering?!
Producer -> Topic "Parties" (Partition 1, Partition 2, ... Partition n)
Two Consumer Groups, each with multiple Consumers, subscribed to the topic
Consumers subscribed to a topic are allocated partitions.
They will only get messages from their allocated partitions.
53.
Producer -> Topic "Parties" (Partition 1, Partition 2, ... Partition n)
One Consumer Group with several Consumers
Consumers share work within groups: consumers in the same group share the work around, and each consumer gets only a subset of messages.
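Work-sharing within a group comes from partition assignment. A minimal sketch, assuming a simple round-robin spread (real Kafka uses pluggable assignors - range, round-robin, sticky - but the effect is the same: each partition is owned by exactly one consumer in the group):

```python
def assign_partitions(partitions, consumers):
    # Spread a topic's partitions across the consumers of ONE group,
    # round-robin; each partition ends up owned by exactly one consumer.
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# 4 partitions shared by 2 consumers in one group:
print(assign_partitions([0, 1, 2, 3], ["Bill", "Paul"]))
# {'Bill': [0, 2], 'Paul': [1, 3]}
```

Each consumer then polls only its own partitions, which is why it sees only a subset of the topic's messages.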
54.
Producer -> Topic "Parties" (Partition 1, Partition 2, ... Partition n)
Two Consumer Groups, each with multiple Consumers
Messages are duplicated across consumer groups: multiple groups enable message broadcasting, and each consumer group receives a copy of each message.
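Broadcasting across groups can be sketched the same way: every group gets a copy of a record, but within each group only the consumer that owns the record's partition sees it. The group and consumer names below are from the running example; the ownership tables are invented for illustration.

```python
def receivers_for(partition, groups):
    # For a record landing on `partition`, find the one consumer in each
    # group that owns that partition - one copy per group, not per consumer.
    out = {}
    for group, assignment in groups.items():
        for consumer, owned in assignment.items():
            if partition in owned:
                out[group] = consumer
    return out

groups = {
    "Nerds": {"Bill": [0, 1], "Paul": [2, 3]},  # work shared within the group
    "Hairy": {"Chewy": [0, 1, 2, 3]},           # single consumer owns everything
}
print(receivers_for(0, groups))
# {'Nerds': 'Bill', 'Hairy': 'Chewy'}
```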
55.
Key
Partition-based delivery
Which messages are delivered to which consumers?
If a message has a key, then Kafka uses partition-based delivery.
Messages with the same key are always sent to the same partition, and therefore to the same consumer.
And the order is guaranteed.
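The key-to-partition mapping is just a stable hash modulo the partition count. A sketch - real Kafka clients hash the key bytes with murmur2, while `zlib.crc32` is used here only because it is in the standard library and deterministic; the guarantee is the same: one key, one partition.

```python
import zlib

def partition_for(key, num_partitions):
    # Hash the key bytes to a partition index in [0, num_partitions).
    # Same key in, same partition out - every time.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("Cool pool party", 11)
p2 = partition_for("Cool pool party", 11)
assert p1 == p2  # same key always lands on the same partition
```

This is why ordering is guaranteed per key: all records with that key queue up in one partition, consumed by one consumer.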
56.
No Key
Round robin delivery
If the key is null, then Kafka uses round robin delivery.
Each message is delivered to the next partition.
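The null-key case can be sketched as a partitioner that simply rotates through the partitions. (Note: newer Kafka clients actually default to a "sticky" variant that fills a batch before moving to the next partition; classic round robin is what is shown here.)

```python
from itertools import cycle

class RoundRobinPartitioner:
    # With no key, each send goes to the next partition in rotation.
    def __init__(self, num_partitions):
        self._order = cycle(range(num_partitions))

    def partition(self):
        return next(self._order)

rr = RoundRobinPartitioner(3)
print([rr.partition() for _ in range(5)])
# [0, 1, 2, 0, 1]
```

Since successive messages land on different partitions, successive messages can reach different consumers in the same group - exactly what bites Bill and Paul in the story below.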
Consumer Group = Nerds: multiple consumers
Consumer Group = Hairy: single consumer
60.
Producer -> Topic "Parties" (Partition 1, Partition 2, ... Partition n)
Group "Nerds": Consumer 1 (Bill), Consumer 2 (Paul), ... Consumer n
Group "Hairy": Consumer 1 (Chewy)
All consumers subscribed to "Parties"
No Key - Round Robin: M1 goes to one partition, M2 to the next, and so on; every group sees a copy of each message.
Case 1: No Key
Each message (M1, M2, etc.) is sent to the next partition in turn.
All consumers allocated to that partition will receive the message when they next poll.
1. Both Groups subscribe to Topic "parties" (11 partitions, so 1 consumer per partition).
63.
1. Both Groups subscribe to Topic "parties" (11 partitions, so 1 consumer per partition).
2. Producer sends record <key=null, value="Cool pool party - Invitation"> to the "parties" topic (no key)
64.
1. Both Groups subscribe to Topic "parties" (11 partitions, so 1 consumer per partition).
2. Producer sends record "Cool pool party - Invitation" to the "parties" topic
3. Bill and Chewbacca receive a copy of the invitation and plan to attend
65.
4. Producer sends another record <key=null, value="Cool pool party - Cancelled"> to the "parties" topic
66.
4. Producer sends another record <key=null, value="Cool pool party - Cancelled"> to the "parties" topic
5. Paul and Chewbacca receive the cancellation.
Paul gets the message this time as it's round robin, but ignores it as he didn't get the invitation. Bill wastes his time trying to go to a cancelled party. The rest of the gang aren't surprised at not receiving any party invites and stay at home to do some hacking. Chewy is the only consumer in his group so he gets all messages, and plans something fun instead...
68.
Producer -> Topic "Parties" (Partition 1, Partition 2, ... Partition n)
Group "Nerds": Consumer 1 (Bill), Consumer 2 (Paul), ... Consumer n
Group "Hairy": Consumer 1 (Chewy)
All consumers subscribed to "Parties"
Key - hashed to partition: M1 and M2 (same key) go to one partition, M3 to another.
Case 2: If there is a Key
A key is hashed to a partition, and a message with that key is always sent to that partition.
Assume there are 3 messages, and messages 1 and 2 hash to the same partition.
69.
Here's what happens with a key: the key is the "title" of the message (e.g. "Cool pool party")
Same set-up as before:
1. Both Groups subscribe to Topic "parties" (11 partitions).
70.
1. Both Groups subscribe to Topic "parties" (11 partitions).
2. Producer sends record <key="Cool pool party", value="Invitation"> to the "parties" topic
71.
1. Both Groups subscribe to Topic "parties" (11 partitions).
2. Producer sends record <key="Cool pool party", value="Invitation"> to the "parties" topic
3. As before, Bill and Chewbacca receive a copy of the invitation and plan to attend
72.
4. Producer sends another record <key="Cool pool party", value="Cancelled"> to the "parties" topic
73.
4. Producer sends another record <key="Cool pool party", value="Cancelled"> to the "parties" topic
5. Bill and Chewbacca receive the cancellation (same consumers this time, as the key is identical)
74.
6. Producer sends another record <key="Horrible Halloween party", value="Invitation"> to the "parties" topic
75.
6. Producer sends another record <key="Horrible Halloween party", value="Invitation"> to the "parties" topic
7. Paul and Chewy receive the invitation
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is allocated to.
Chewy is the only consumer in his group, so he gets every record no matter what partition it's sent to.
Real-time data pipeline
Real-time data pipeline features:
• Ingestion of multiple heterogeneous sources
• Sending data to multiple heterogeneous sinks
• Acts as a buffer to smooth out load spikes
• Enables use cases which reprocess data (e.g. disaster recovery)
80.
Anomaly Detection Pipeline
Real-time event processing pipeline:
• Simple event-driven applications (if X then Y...)
• May write to and read from other data sources (e.g. Cassandra)
• New events sent back to Kafka or to other systems
• E.g. anomaly detection; check out my current blog series if you are interested in this example.
81.
Kafka Streams Processing (Kongo IoT blog series)
Streams processing features:
• Complex streams processing (multiple events and streams)
• Time, windows, and transformations
• Uses the Kafka Streams API, which includes a state store
• Visualization of the streams topology
• Continuously computes the loads for trucks and checks if they are overloaded.
82.
LinkedIn - Before Kafka (BK)
A real example from LinkedIn, who developed Kafka.
Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic application infrastructure to one based on microservices, which made the integration even more complex!
83.
After Kafka (AK)
Rather than maintaining and scaling each pipeline individually, they invested in the development of a single, distributed pub-sub platform - Kafka was born.
The main benefit was better service decoupling and independent scaling.
84.
The End (of the introduction) - Find out more
Apache Kafka: https://kafka.apache.org/
Instaclustr blogs
• Mix of Cassandra, Spark, Zeppelin and Kafka
https://www.instaclustr.com/paul-brebner/
• Kafka introduction
https://insidebigdata.com/2018/04/12/developing-deeper-understanding-apache-kafka-architecture/
https://insidebigdata.com/2018/04/19/developing-deeper-understanding-apache-kafka-architecture-part-2-
• Kongo - Kafka IoT logistics application blog series
https://www.instaclustr.com/instaclustr-kongo-iot-logistics-streaming-demo-application/
• Anomaly detection with Kafka and Cassandra (and Kubernetes), current blog series
https://www.instaclustr.com/anomalia-machina-1-massively-scalable-anomaly-detection-with-apache-kafka-
Instaclustr's Managed Kafka (Free trial)
https://www.instaclustr.com/solutions/managed-apache-kafka/
Editor's Notes
#3 What is Kafka? Kafka is a distributed streams processing system; it allows distributed producers to send messages to distributed consumers via a Kafka cluster.
#4 Kafka benefits
Fast - high throughput and low latency
Scalable - horizontally scalable, just add nodes and partitions
Reliable - distributed and fault tolerant
Zero data loss - messages are persisted to disk with an immutable log
Open Source - an Apache project
Available as a Managed Service - on multiple cloud platforms
#6 Back to the intro Kafka overview diagram; it's a bit monochrome and boring. This talk will be more colourful and it's going to be an extended story...
#8 Let's build a modern-day, fully electronic postal service - the Kafka Postal Service
#9 What does a postal service do? It sends messages from A to B (animated with sounds, click!)
#12 Actually, no. Due to the decline in "snail mail" volumes, direct deliveries have been cancelled.
#14 Instead we have "Poste Restante" - not a post office in a restaurant.
It's called general delivery (in the US).
The mail is delivered to a post office, and they hold it for you until you call for it.
#15 Consumers poll for messages by visiting the counter at the post office.
#17 Kafka topics act like a Post Office. What are the benefits?
Disconnected delivery - the consumer doesn't need to be available to receive messages.
Less effort for the messaging service - it only has to deliver to a few locations, not a larger number of consumer addresses.
Can scale better and handle more complex delivery semantics!
#19 First let's see how it scales. What if there are many consumers for a topic?
#20 A single counter introduces delays and limits concurrency.
#21 More counters increase concurrency and can reduce delays.
#22 Kafka Topics have 1 or more Partitions; partitions function like multiple counters and enable high concurrency.
#23 Before looking at delivery semantics, what does a message actually look like? In Kafka a message is called a Record, and it is like a letter.
#25 The "postmark" includes a timestamp, the offset in the topic, and the partition it was sent to. Time semantics are flexible: the time of event creation, ingestion, or processing.
#27 There's also a thing called a Key, which is optional.
It refines the destination, so it's a bit like the rest of the address. We want this letter sent to Santa, not just any Elf.
#29 The value is the contents (just a byte array). Kafka Producers and Consumers need to have a shared serializer and de-serializer for both the key and the value.
#31 Kafka doesn't look inside the value, but the Producer and Consumer can, and the Consumer can try to make sense of the message...
I wonder what Santa will be delivering?
#32 Next let's look at delivery semantics. For example, do we care if the message actually arrives or not?
#33 Yes we do! Guaranteed message delivery is desirable.
Homing pigeons got lost or eaten, so you needed to send the message with multiple pigeons.
#37 How does Kafka guarantee delivery?
A Message (M1) is written to a broker (2)
#41 The message is also replicated on multiple "brokers"; 3 is typical.
#42 This makes it resilient to the loss of most servers.
#43 Finally the producer gets an acknowledgement once the message is persisted and replicated (configurable for number, and sync or async).
The message is now available from more than one broker in case some fail.
This also increases read concurrency, as partitions are spread over multiple brokers.
#45 Now let's look at the 2nd aspect of delivery semantics.
Who gets the messages, and how many times are messages delivered?
Kafka is "pub-sub": it's loosely coupled, and producers and consumers don't know about each other.
#47 Filtering, or which consumers get which messages, is topic based.
Publishers send messages to topics.
Consumers subscribe to topics of interest, e.g. parties. When they poll, they only receive messages sent to those topics.
#51 Just a few more details and we can see how this works. Partitions and consumer groups enable sharing of work across multiple consumers; the more partitions a topic has, the more consumers it supports.
#52 Kafka supports delivery of the same message to multiple consumers. Kafka doesn't throw messages away as soon as they are delivered, so the same message can easily be delivered to multiple consumer groups.
#53 Consumers subscribed to the "parties" topic are allocated partitions. When they poll they will only get messages from their allocated partitions.
#54 This enables consumers in the same group to share the work around. Each consumer gets only a subset of the available messages.
#55 Multiple groups enable message broadcasting. Messages are duplicated across groups, as each consumer group receives a copy of each message.
#56 Which messages are delivered to which consumers? The final aspect of delivery semantics is to do with message keys.
If a message has a key, then Kafka uses partition-based delivery.
Messages with the same key are always sent to the same partition, and therefore to the same consumer.
And the order is guaranteed.
#57 If the key is null, then Kafka uses round robin delivery. Each message is delivered to the next partition.
#58 Let's look at an example with 2 consumer groups: Nerds, which has multiple consumers, and Hairy, which has a single consumer.
#61 Looking at the No Key case first, each message (1, 2, etc.) is sent to the next partition, and all consumers allocated to that partition will receive the message when they poll next.
#62 Here's what actually happens. We're not showing the producer or topic, for simplicity; you'll have to imagine them.
#63 Both Groups subscribe to Topic "parties" (11 partitions, so 1 consumer per partition).
#64 2. Producer sends a record with the value "Cool pool party - Invitation" to the "parties" topic (there's no key)
#65 3. Bill and Chewbacca receive a copy of the invitation and plan to attend
#66 4. Producer sends another record with the value "Cool pool party - Cancelled" to the "parties" topic
#67 In the Nerds group, Paul gets the message this time as it's round robin, and Chewy gets it as he's the only consumer in his group.
Paul ignores it as he didn't get the original invite.
Bill wastes his time trying to go.
The rest of the gang aren't surprised at not receiving any invites and stay home to do some hacking.
Chewy plans something else fun instead...
#69 How does it work if there is a Key? The key is hashed to a partition, and the message is sent to that partition.
Assume there are 3 messages, and messages 1 and 2 hash to the same partition.
#70 Here's what happens with a key, assuming that the key is the "title" of the message ("Cool pool party").
As before, both Groups subscribe to Topic "parties" (11 partitions).
Producer sends record <key="Cool pool party", value="Invitation"> to the "parties" topic
#72 As before, Bill and Chewbacca receive a copy of the invitation and plan to attend
#73 4. Producer sends another record with the same key but a cancellation value to the "parties" topic
#74 This time, Bill and Chewbacca receive the cancellation (the same consumers, as the key is identical)
#75 Producer sends out another invitation, to a Halloween party. The key is different this time.
#76 Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is allocated to.
Chewy is the only consumer in his group, so he gets every record no matter what partition it's sent to.
#77 This time Chewy gets dressed up and goes to the party.
#78 Who did you imagine was producing the invitations? Maybe this fellow.
#80 Here's a UML-like diagram with the main Kafka components. We've introduced Producers, Topics, Partitions, Consumer Groups and Consumers today.
There's a lot more to explore, including how Kafka provides replication, and the Connect and Streams APIs.
#81 To finish up, here are three important use cases for Kafka.
#82 Real-time data pipeline features:
Ingestion of multiple heterogeneous sources
Sending data to multiple heterogeneous sinks
Acts as a buffer to smooth out load spikes
Enables use cases which reprocess data (e.g. disaster recovery)
#83 Real-time event processing features:
Simple event-driven applications (if X then Y...)
May write to and read from other data sources (e.g. Cassandra)
New events sent back to Kafka or to other systems
E.g. anomaly detection; check out my current blog series if you are interested in this example.
#84 Streams processing features:
Complex streams processing (multiple events and streams)
Time, windows, and transformations
Streams API, which includes a state store
This is a visualization of the streams topology from the streams processing pipeline from my previous blog series, Kongo, a Kafka IoT logistics application.
It continuously computes the loads for trucks and checks if they are overloaded.
#85 Here's a real example from LinkedIn, who developed Kafka. Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic application infrastructure to one based on microservices, which made the integration even more complex.
#86 After Kafka
Rather than maintaining and scaling each pipeline individually, they invested in the development of a single, distributed pub-sub platform. Thus, Kafka was born. The main benefit was better service decoupling and independent scaling.