Confidential
A Tour of Apache Kafka
Matt Howlett
Engineer, Confluent Inc.
Agenda
1. Technical Overview of Apache Kafka
2. Use Cases
What is Apache Kafka?
Kafka is a streaming platform.
A distinct tool in your toolbox, like a relational database or a traditional
messaging system.
A streaming platform encourages architectures that have an emphasis on
events and changes to data (not data at rest).
Widely applicable. E.g. consider Walmart.
Who Uses Kafka Today?
● 35% of the Fortune 500 + thousands of companies worldwide use Kafka
● Across all industries
● High growth of usage within companies
Core Kafka Pt. 1
Traditional messaging: move data
Kafka: make data available
Core Kafka Pt. 2: Why Logs?
● Simple:
○ High performance
○ Robust horizontal scalability
● Suitable for real-time, streaming, and batch operations
● Ad-hoc consumption & reprocessing
● Immutable:
○ Easier to debug/reason about vs. ephemeral data
○ Auditable by default
Core Kafka Pt. 3 - Scaling
Kafka topics are partitioned logs. A message is a key/value pair.
Notes:
● Ordering is guaranteed per partition only
● Re-partitioning changes which partition a given key maps to
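The per-partition ordering guarantee follows from how keyed messages are placed: the producer hashes the key and takes it modulo the partition count, so all messages for one key land on one partition in send order. (The Java client's default partitioner uses murmur2; this sketch uses CRC-32 purely for illustration.)

```python
import zlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition. All messages with the same
    key land in the same partition, preserving their relative order."""
    return zlib.crc32(key) % num_partitions

# Messages for one key always go to the same partition...
p1 = assign_partition(b"user-42", 6)
p2 = assign_partition(b"user-42", 6)
assert p1 == p2

# ...but changing the partition count can move keys to different
# partitions, which is why re-partitioning disturbs ordering.
```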
Core Kafka Pt. 4 - Durability
Kafka topics are replicated, partitioned logs.
Notes:
● All reads and writes go to the leader replica
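A toy model (not the real protocol) of what "replicated, partitioned log with a leader" means: appends go to the leader, followers copy the leader's log, and reads are served by the leader too, so any in-sync follower can take over on failure.

```python
class ReplicatedPartition:
    """Toy model of one replicated partition: all reads and writes
    hit the leader; followers keep a copy of the leader's log."""

    def __init__(self, replication_factor: int = 3):
        self.leader = []
        self.followers = [[] for _ in range(replication_factor - 1)]

    def append(self, record) -> int:
        self.leader.append(record)      # write goes to the leader
        for f in self.followers:        # followers replicate the write
            f.append(record)
        return len(self.leader) - 1     # offset of the new record

    def read(self, offset: int):
        return self.leader[offset]      # reads are served by the leader

p = ReplicatedPartition(replication_factor=3)
off = p.append("order-1")
assert p.read(off) == "order-1"
# Every in-sync follower holds the full log, so a follower can
# become leader without data loss:
assert all(f == p.leader for f in p.followers)
```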
Core Kafka Pt. 5: How Scalable is Kafka?
● No bottleneck!
○ Many brokers
○ Many producers
○ Many consumers
● Limits?
○ Internet giants are driving the limits higher; you won’t need to worry.
○ e.g. LinkedIn pushes more than 1 trillion messages / day through its Kafka clusters.
○ 100 brokers / 2 billion messages a day is “straightforward” to operate.
○ Don’t over-partition (roughly < 100k partitions).
[Diagram: many producers → brokers → many consumers]
Components of Apache Kafka
Kafka Clients
Use cases:
● Integration with custom applications
○ Log application events
○ Invoke REST APIs
● Stateless stream processing (filter, transform)
Confluent supported clients:
● Java
● C (librdkafka)
○ C++
○ Python
○ C# / .NET
○ Go
● REST Proxy
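The "stateless stream processing (filter, transform)" use case above boils down to a pure per-message function that a client application runs inside its consume/produce loop. A hedged sketch of such a step (the function and field names are illustrative; the broker interaction itself is omitted):

```python
import json

def transform(raw: bytes):
    """Stateless per-message step: drop malformed events and keep
    only the fields downstream consumers need. Hypothetical schema:
    input events are JSON objects with a 'user_id' field."""
    try:
        event = json.loads(raw)
    except ValueError:
        return None                          # filter: drop bad records
    if "user_id" not in event:
        return None                          # filter: drop incomplete records
    return {"user_id": event["user_id"],     # transform: project fields
            "action": event.get("action")}

# In a real client app this would run on each consumed message,
# with non-None results produced to an output topic.
assert transform(b'{"user_id": 7, "action": "click"}') == \
       {"user_id": 7, "action": "click"}
assert transform(b"not json") is None
```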
Kafka Connect
● Off-the-shelf connectors
○ Confluent Hub
● Standardized framework
○ Scalable
○ Fault tolerant
● Stateless workers
○ Un-opinionated deployment
● REST API
● Transforms
● Exactly once
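A connector is configured declaratively and submitted to a worker via the REST API rather than written as code. As a sketch, a config for the FileStreamSource connector that ships with Apache Kafka (the name, file path, and topic here are illustrative):

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/var/log/app.log",
    "topic": "app-logs"
  }
}
```

POSTing this JSON to a Connect worker's `/connectors` endpoint starts a task that tails the file into the topic; the framework handles scaling and fault tolerance.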
Kafka Streams
● Just a library! A library that makes it easy to do stateful operations (joins, aggregations, windowing).
● Elastically scalable
○ Distributed!
● Fault tolerant
● Un-opinionated deployment
● State backed by Kafka, used as a changelog
● Exactly-once processing
● Record-at-a-time processing
● Complex topologies
○ (but keep it simple)
● JVM only (Java, Scala, etc.)
Confluent: A More Complete Streaming Platform
Use Cases
When Should You Use Kafka?
Scalability
● Quantity of Data
○ Simple Applications (or not)
● Complexity
○ Architectural
○ Organizational
Buffering Pt. 1
Kafka is a very good buffer:
● Write optimized
● Highly reliable
● Tolerates data spikes
● Tolerates downstream outages
● Used by Kafka Streams (no back-pressure problems)
Buffering Pt. 2
Move data to multiple locations
Data Integration
Explosion of Data Sources and Processing Frameworks
Point-to-Point vs …
… Hub-and-Spoke
● Number of connections: O(N) vs. potentially O(N²)
● Standardized, reliable data transport
● Standardized data format
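The O(N) vs. O(N²) claim is simple arithmetic: with S source systems and D destination systems, point-to-point integration needs a pipeline per (source, destination) pair, while hub-and-spoke needs just one connection per system, to the hub.

```python
def point_to_point(sources: int, destinations: int) -> int:
    """Every source system talks directly to every destination."""
    return sources * destinations

def hub_and_spoke(sources: int, destinations: int) -> int:
    """Every system connects once, to the hub (Kafka)."""
    return sources + destinations

# 10 sources and 10 destinations:
assert point_to_point(10, 10) == 100   # O(N^2) pipelines to maintain
assert hub_and_spoke(10, 10) == 20     # O(N) connections
```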
Dual Writes Problem
Without a log, there is a potential consistency problem: writing the same data independently to two systems can leave them disagreeing.
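A toy illustration of why a log avoids the dual-writes problem: downstream stores built by replaying the same ordered log always converge, because they see identical events in identical order, whereas independent dual writes give each store its own interleaving. (The event format here is invented for the sketch.)

```python
# A single ordered log of change events (op, key, value):
log = [("set", "x", 1), ("set", "x", 2), ("del", "x", None)]

def replay(events):
    """Materialize a store by applying log events in order."""
    store = {}
    for op, key, value in events:
        if op == "set":
            store[key] = value
        elif op == "del":
            store.pop(key, None)
    return store

# Two independent consumers (say, a cache and a search index)
# replay the same log and necessarily end up in the same state:
cache = replay(log)
search_index = replay(log)
assert cache == search_index
```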
Data Integration - Eventual Consistency
Change Data Capture
You can think of Kafka as a commit log for your entire organization.
“turning the database inside out”
Advanced ETL #1: PII Data Filter
● Kafka security: TLS encryption, flexible authentication + authorization
[Diagram: a Kafka Streams app filters PII, producing a secured topic]
Advanced ETL #2: Enriching stream data
Advanced ETL #2: Stream / Table Join in KSQL
CREATE STREAM enriched_weblog AS
  SELECT
    w.ip,
    w.text,
    g.location AS location
  FROM weblog w
  LEFT JOIN geo g ON w.ip = g.ip;
● The query is long-running on a KSQL cluster.
● Create the weblog stream and geo table (backed by Kafka topics) first.
● Currently, KSQL can interpret Avro, JSON, and CSV.
Stream Processing App #1: Anomaly Detection / Alerting
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 10 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
● Use Kafka Streams and/or additional input streams in a more sophisticated algorithm
● possible_fraud is a changelog stream whose key is [card_number, window_start]
[Diagram: authorization_attempts → possible_fraud → SMS Gateway]
Microservices
What are Microservices?
● Independently deployable, small units of functionality
○ (not a formal definition)
○ Primary motivation: decouple teams (scale in people terms)
○ Usually REST endpoints + commands/queries
Microservices can also be built on a backbone of events:
○ PII Filter
○ Weblog enricher
○ SMS fraud alert notifier
○ ... just the start
Microservices - Commands Pt. 1
Microservices - Commands Pt. 2
Adding new functionality that depends on placing orders requires changes to the Orders Service.
Microservices - Receiver Driven Flow Control
● Pricing Service team does not need to talk to Orders Service team
● Trade-off: no statement of overall behavior
Microservices - Queries
Note: Eventual Consistency
Microservices
● Decreased coupling: the Orders Service materializes the view it requires (Kafka can even act as the system of record)
● Often appropriate at larger scales
Microservices - Data Dichotomy
The Data Dichotomy
“Data systems are about exposing data. Services are about hiding it”
Thank You!
@matt_howlett
https://www.confluent.io/blog
We’re hiring!
