Apache Kafka at LinkedIn
Jay Kreps
Introduction to Apache Kafka
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Apache Kafka
A brief history of Apache Kafka
Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
– Partitioning model
Kafka is about logs
What is a log?
Logs: pub/sub done right
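The "pub/sub done right" idea can be sketched in a few lines: records are appended to an ordered, immutable log and assigned offsets, and each subscriber keeps its own read position. This is an illustrative toy, not Kafka's actual implementation — all names here are made up:

```python
# Toy append-only log with per-consumer offsets (illustrative sketch,
# not Kafka internals).

class Log:
    def __init__(self):
        self.records = []              # ordered, immutable once appended

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the record's offset

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

class Consumer:
    def __init__(self, log):
        self.log = log
        self.position = 0              # each subscriber tracks its own offset

    def poll(self):
        batch = self.log.read(self.position)
        self.position += len(batch)
        return batch

log = Log()
for event in ["click", "pageview", "click"]:
    log.append(event)

slow, fast = Consumer(log), Consumer(log)
fast.poll()                  # fast consumer reads all three events
log.append("impression")     # new data keeps arriving
# The slow consumer still sees every record, in order, from the beginning.
assert slow.poll() == ["click", "pageview", "click", "impression"]
```

Because consumption is just an offset into a shared log, adding a subscriber costs nothing and a slow subscriber never blocks a fast one — the property that distinguishes a log from a traditional message queue.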
Partitioning
Nodes Host Many Partitions
Producers Balance Load
Consumers Divide Up Partitions
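The two slides above can be sketched together: producers hash each message key to a partition (so per-key ordering is preserved while load spreads across nodes), and a consumer group divides the partition list so each partition is owned by exactly one consumer. A minimal sketch, with made-up names and round-robin assignment standing in for the real protocol:

```python
# Kafka-style partitioning sketch: key -> partition on the producer side,
# partition -> consumer on the consumer-group side. Illustrative only.
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Same key -> same partition, so per-key order is preserved.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign(partitions, consumers):
    # Round-robin assignment within a consumer group: each partition
    # is owned by exactly one consumer, so no record is read twice
    # within the group.
    return {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}

ownership = assign(range(NUM_PARTITIONS), ["c0", "c1", "c2"])
assert partition_for("user-42") == partition_for("user-42")
assert sorted(ownership) == list(range(NUM_PARTITIONS))  # every partition owned
```

Total ordering exists only within a partition; choosing the key (user id, entity id) is choosing which records must stay ordered relative to each other.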
End-to-End
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
Performance
• Producer (3x replication):
– Async: 786,980 records/sec (75.1 MB/sec)
– Sync: 421,823 records/sec (40.2 MB/sec)
• Consumer:
– 940,521 records/sec (89.7 MB/sec)
• End-to-end latency:
– 2 ms (median)
– 14 ms (99.9th percentile)
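The throughput and bandwidth figures above are mutually consistent if the benchmark records are roughly 100 bytes each. A quick back-of-the-envelope check (assuming "MB" means 2**20 bytes):

```python
# Sanity-check the benchmark numbers above: MB/sec divided by records/sec
# should give the record size. Assumes 1 MB = 2**20 bytes.
MB = 2 ** 20

async_bytes_per_record = 75.1 * MB / 786_980      # producer, async
consumer_bytes_per_record = 89.7 * MB / 940_521   # consumer

assert 95 < async_bytes_per_record < 105          # ~100-byte records
assert 95 < consumer_bytes_per_record < 105
```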
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Data Integration
Maslow’s Hierarchy
For Data
New Types of Data
• Database data
– Users, products, orders, etc
• Events
– Clicks, Impressions, Pageviews, etc
• Application metrics
– CPU usage, requests/sec
• Application logs
– Service calls, errors
New Types of Systems
• Live Stores
– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs
• Offline
– Hadoop
– Teradata
Bad
Good
Example: User views job
Comparing Data Transfer
Mechanisms
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Stream Processing
Stream processing is a
generalization
of batch processing
Stream Processing = Logs + Jobs
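"Logs + Jobs" can be made concrete with a toy: a job is just a loop that reads an input log in order, updates some local state, and appends derived records to an output log. This is an illustrative sketch, not the Samza API — the function and field names are invented:

```python
# "Stream processing = logs + jobs" as a toy: the log supplies ordering,
# the job supplies the logic. Illustrative sketch, not a real framework.

def run_job(input_log, process):
    output_log, state = [], {}
    for record in input_log:            # consume the input log in order
        out = process(record, state)    # job logic, with local state
        if out is not None:
            output_log.append(out)      # emit to the output log
    return output_log

# Example job: running count of pageviews per user.
def count_views(event, state):
    if event["type"] != "pageview":
        return None
    state[event["user"]] = state.get(event["user"], 0) + 1
    return {"user": event["user"], "views": state[event["user"]]}

events = [
    {"type": "pageview", "user": "alice"},
    {"type": "click",    "user": "alice"},
    {"type": "pageview", "user": "alice"},
]
out = run_job(events, count_views)
assert out[-1] == {"user": "alice", "views": 2}
```

The same loop covers the whole spectrum from the slides: run it over a bounded input and it is a batch job; run it continuously over an unbounded log and it is a stream job.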
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL
Frameworks Can Help
Samza Architecture
Log-centric Architecture
Kafka
http://kafka.apache.org
Samza
http://samza.incubator.apache.org
Log Blog
http://linkd.in/199iMwY
Benchmark:
http://t.co/40fkKJvanx
Me
http://www.linkedin.com/in/jaykreps
@jaykreps


Editor's Notes

  • #2 Who are you? What is this talk about? Exciting topic More
  • #4 Messaging system, like JMS (but different!) Producers, consumers distributed
  • #5 Start with state at LinkedIn, describe each pipeline 1 Pipeline for database data 1 Pipeline for metrics 1 Pipeline for events 1 JMS-based pipeline No pipeline for application logs 300 ActiveMQ brokers
  • #6 10,000 messages/sec * 100 byte messages = ~1MB/sec
  • #7 The log is fundamental abstraction Kafka provides You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
  • #8 What is a log? Traditional uses? Non-traditional uses…
  • #9 Time ordered Semi-structured
  • #10 Data structure not a text file List of changes Contents of record doesn’t matter Indexed by “time” Not application log (i.e. text file)
  • #11 Remotely accessible State machine replication
  • #12 Data model of Kafka: A topic Partitions can be spread over machines, replicated
  • #16 Path of a write Leadership failover Guarantees
  • #21 AKA ETL Many systems Event data Most important problem for data-centric companies Integration >> ML
  • #22 Maslow’s Hierarchy Abraham Maslow, Psychologist, 1943 Physiological – eat, drink, sleep Safety – Not being attacked Love/Belonging – friends and family Esteem – respect of others Self-Actualization – morality, creativity, spontaneity
  • #23 Want to do Deep Learning Instead finding that their CSV data ALSO has commas in it Copying files around Ugh The Caveman Data Warehousing has a bad reputation
  • #24 Two exacerbating factors 15 years ago, just the first one (transactional data) New categories are very high volume, maybe 100x the transactional data Look like events Internet of things
  • #25 One-size fits all
  • #26 Tell story: Started with Hadoop, added arrows to get data there Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data) Holy shit this is hard! Data is missing, data is late, computation runs on wrong data Hadoop without good data is just a very expensive space heater Never get to full connectivity
  • #27 Metcalfe’s law Each new system connects to get/give data All data in multi-subscriber, real-time logs The company is a big distributed system The data center is the distributed system
  • #29 Three dims: Throughput Guarantees Latency Advantages over messaging: Huge data backlog Order Advantages over files Real-time Advantage over both: principled notion of time
  • #31 Whole organization is big distributed system Commit log = data transfer Stream processing = triggers Batch is dominant paradigm for data processing, why?
  • #32 Service: One input = one output Batch job: All inputs = all outputs Stream computing: any window = output for that window
  • #33 No different from batch processing flow (instead of files/tables, logs)
  • #35 Storm and Samza About process management – both integrate with Kafka MapReduce and HDFS