Apache Kafka at LinkedIn
Jay Kreps
Introduction to Apache Kafka
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Apache Kafka
A brief history of Apache Kafka
Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
– Partitioning model
Kafka is about logs
What is a log?
Logs: pub/sub done right
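The "pub/sub done right" idea can be sketched in a few lines: records are appended to an ordered, immutable log and assigned offsets, and each subscriber keeps its own read position. This is an illustrative toy, not Kafka's actual implementation — all names here are made up:

```python
# Toy append-only log with per-consumer offsets (illustrative sketch,
# not Kafka internals).

class Log:
    def __init__(self):
        self.records = []              # ordered, immutable once appended

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the record's offset

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

class Consumer:
    def __init__(self, log):
        self.log = log
        self.position = 0              # each subscriber tracks its own offset

    def poll(self):
        batch = self.log.read(self.position)
        self.position += len(batch)
        return batch

log = Log()
for event in ["click", "pageview", "click"]:
    log.append(event)

slow, fast = Consumer(log), Consumer(log)
fast.poll()                  # fast consumer reads all three events
log.append("impression")     # new data keeps arriving
# The slow consumer still sees every record, in order, from the beginning.
assert slow.poll() == ["click", "pageview", "click", "impression"]
```

Because consumption is just an offset into a shared log, adding a subscriber costs nothing and a slow subscriber never blocks a fast one — the property that distinguishes a log from a traditional message queue.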
Partitioning
Nodes Host Many Partitions
Producers Balance Load
Consumers Divide Up Partitions
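The two slides above can be sketched together: producers hash each message key to a partition (so per-key ordering is preserved while load spreads across nodes), and a consumer group divides the partition list so each partition is owned by exactly one consumer. A minimal sketch, with made-up names and round-robin assignment standing in for the real protocol:

```python
# Kafka-style partitioning sketch: key -> partition on the producer side,
# partition -> consumer on the consumer-group side. Illustrative only.
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Same key -> same partition, so per-key order is preserved.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign(partitions, consumers):
    # Round-robin assignment within a consumer group: each partition
    # is owned by exactly one consumer, so no record is read twice
    # within the group.
    return {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}

ownership = assign(range(NUM_PARTITIONS), ["c0", "c1", "c2"])
assert partition_for("user-42") == partition_for("user-42")
assert sorted(ownership) == list(range(NUM_PARTITIONS))  # every partition owned
```

Total ordering exists only within a partition; choosing the key (user id, entity id) is choosing which records must stay ordered relative to each other.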
End-to-End
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
Performance
• Producer (3x replication):
– Async: 786,980 records/sec (75.1 MB/sec)
– Sync: 421,823 records/sec (40.2 MB/sec)
• Consumer:
– 940,521 records/sec (89.7 MB/sec)
• End-to-end latency:
– 2 ms (median)
– 14 ms (99.9th percentile)
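The throughput and bandwidth figures above are mutually consistent if the benchmark records are roughly 100 bytes each. A quick back-of-the-envelope check (assuming "MB" means 2**20 bytes):

```python
# Sanity-check the benchmark numbers above: MB/sec divided by records/sec
# should give the record size. Assumes 1 MB = 2**20 bytes.
MB = 2 ** 20

async_bytes_per_record = 75.1 * MB / 786_980      # producer, async
consumer_bytes_per_record = 89.7 * MB / 940_521   # consumer

assert 95 < async_bytes_per_record < 105          # ~100-byte records
assert 95 < consumer_bytes_per_record < 105
```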
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Data Integration
Maslow’s Hierarchy
For Data
New Types of Data
• Database data
– Users, products, orders, etc
• Events
– Clicks, Impressions, Pageviews, etc
• Application metrics
– CPU usage, requests/sec
• Application logs
– Service calls, errors
New Types of Systems
• Live Stores
– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs
• Offline
– Hadoop
– Teradata
Bad
Good
Example: User views job
Comparing Data Transfer
Mechanisms
The Plan
1. What is Apache Kafka?
2. Kafka and Data Integration
3. Kafka and Stream Processing
Stream Processing
Stream processing is a
generalization
of batch processing
Stream Processing = Logs + Jobs
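"Logs + Jobs" can be made concrete with a toy: a job is just a loop that reads an input log in order, updates some local state, and appends derived records to an output log. This is an illustrative sketch, not the Samza API — the function and field names are invented:

```python
# "Stream processing = logs + jobs" as a toy: the log supplies ordering,
# the job supplies the logic. Illustrative sketch, not a real framework.

def run_job(input_log, process):
    output_log, state = [], {}
    for record in input_log:            # consume the input log in order
        out = process(record, state)    # job logic, with local state
        if out is not None:
            output_log.append(out)      # emit to the output log
    return output_log

# Example job: running count of pageviews per user.
def count_views(event, state):
    if event["type"] != "pageview":
        return None
    state[event["user"]] = state.get(event["user"], 0) + 1
    return {"user": event["user"], "views": state[event["user"]]}

events = [
    {"type": "pageview", "user": "alice"},
    {"type": "click",    "user": "alice"},
    {"type": "pageview", "user": "alice"},
]
out = run_job(events, count_views)
assert out[-1] == {"user": "alice", "views": 2}
```

The same loop covers the whole spectrum from the slides: run it over a bounded input and it is a batch job; run it continuously over an unbounded log and it is a stream job.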
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL
Frameworks Can Help
Samza Architecture
Log-centric Architecture
Kafka
http://kafka.apache.org
Samza
http://samza.incubator.apache.org
Log Blog
http://linkd.in/199iMwY
Benchmark:
http://t.co/40fkKJvanx
Me
http://www.linkedin.com/in/jaykreps
@jaykreps


Editor's Notes

  • #2 Who are you? What is this talk about? Exciting topic More
  • #4 Messaging system, like JMS (but different!) Producers, consumers distributed
  • #5 Start with state at LinkedIn, describe each pipeline 1 Pipeline for database data 1 Pipeline for metrics 1 Pipeline for events 1 JMS-based pipeline No pipeline for application logs 300 ActiveMQ brokers
  • #6 10,000 messages/sec * 100 byte messages = ~1MB/sec
  • #7 The log is fundamental abstraction Kafka provides You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
  • #8 What is a log? Traditional uses? Non-traditional uses…
  • #9 Time ordered Semi-structured
  • #10 Data structure not a text file List of changes Contents of record doesn’t matter Indexed by “time” Not application log (i.e. text file)
  • #11 Remotely accessible State machine replication
  • #12 Data model of Kafka: A topic Partitions can be spread over machines, replicated
  • #16 Path of a write Leadership failover Guarantees
  • #21 AKA ETL Many systems Event data Most important problem for data-centric companies Integration >> ML
  • #22 Maslow’s Hierarchy Abraham Maslow, Psychologist, 1943 Physiological – eat, drink, sleep Safety – Not being attacked Love/Belonging – friends and family Esteem – respect of others Self-Actualization – morality, creativity, spontaneity
  • #23 Want to do Deep Learning Instead finding that their CSV data ALSO has commas in it Copying files around Ugh The Caveman Data Warehousing has a bad reputation
  • #24 Two exacerbating factors 15 years ago, just the first one (transactional data) New categories are very high volume, maybe 100x the transactional data Look like events Internet of things
  • #25 One-size fits all
  • #26 Tell story: Started with Hadoop, added arrows to get data there Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data) Holy shit this is hard! Data is missing, data is late, computation runs on wrong data Hadoop without good data is just a very expensive space heater Never get to full connectivity
  • #27 Metcalfe’s law Each new system connects to get/give data All data in multi-subscriber, real-time logs The company is a big distributed system The data center is the distributed system
  • #29 Three dims: Throughput Guarantees Latency Advantages over messaging: Huge data backlog Order Advantages over files Real-time Advantage over both: principled notion of time
  • #31 Whole organization is big distributed system Commit log = data transfer Stream processing = triggers Batch is dominant paradigm for data processing, why?
  • #32 Service: One input = one output Batch job: All inputs = all outputs Stream computing: any window = output for that window
  • #33 No different from batch processing flow (instead of files/tables, logs)
  • #35 Storm and Samza About process management – both integrate with Kafka MapReduce and HDFS