Processing IoT Data with Apache Kafka
1
Processing IoT Data with
Apache Kafka
Matt Howlett
Confluent Inc.
2
Pub Sub
Messaging Protocol
Pub Sub
Messaging System
(rethought as a distributed commit log)
Distributed Streaming Platform
● Pub Sub Messaging
● Event Storage
● Processing Framework
3
OBD-II Adapters
4
Problem Statement
Let’s build a system to:
• Transport OBD-II data over unreliable links from cars to the data center
• Capable of handling millions of devices*
• Extract information from + respond to this data in (near) real time (at scale)
• Handle surges in usage
• Potential for ad-hoc historical processing
* and also far fewer
Architecture / technology / methods applicable to many scenarios.
5
Publish / subscribe messaging protocol:
• Built on top of TCP/IP
• Features that make it well suited to poor connectivity / high latency scenarios
• Lightweight
• Efficient client implementations, low network overhead
• MQTT-SN for non IP networks (’virtual connections’)
• Many (open source) broker implementations
• Mosquitto, RabbitMQ, HiveMQ, VerneMQ
• Many Client Libraries
• C, C++, Java, C#, Python, Javascript, websockets, Arduino …
• Widely used (incl. phone apps!)
• Oil pipeline sensor via satellite link
• Facebook Messenger
• AWS IoT
MQTT Introduction
6
• Simple API
• Hierarchical topics
• myhome/kitchen/door/front/battery/level
• wildcard subscription: myhome/+/door/+/battery/level
• 3 qualities of service (on both produce and consume)
• At most once (QoS 0)
• At least once (QoS 1)
• Exactly once (QoS 2) [not universally supported]
• Persistent consumer sessions
• Important for QoS 1, QoS 2
• Last will and testament
• Last known good value
• Authorization, SSL/TLS
MQTT Features
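To make hierarchical topics and wildcard subscriptions concrete, here is a minimal Python sketch of topic filter matching. In MQTT, `+` matches exactly one topic level and `#` matches all remaining levels. The function name `topic_matches` is made up for illustration, and the sketch skips special cases such as topics beginning with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last level of the
            # filter; it matches the parent level and everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            # "+" matches any single level; anything else must be equal.
            return False
    return len(filter_levels) == len(topic_levels)


if __name__ == "__main__":
    print(topic_matches("myhome/+/door/+/battery/level",
                        "myhome/kitchen/door/front/battery/level"))
```

A real broker does this matching for every subscription; clients normally never implement it themselves.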
7
• Device Id
• GPS Location [lon, lat]
• Ignition on / off
• Speedometer reading
• Timestamp
• …plus a lot more
Assume: data sent via 3G wireless connection at ~30 second interval
OBD-II Data
8
Deficiencies:
• Single MQTT server can handle maybe ~100K
connections
• Can’t handle usage surges (no buffering)
• No storage of events or reprocess capability
MQTT
Server 1
Processor 1 Processor 2 ...
Ingest Architecture V1
topic: [deviceid]/obd
9
MQTT
Server
Coordinator
MQTT
Server 1
MQTT
Server 2
MQTT
Server 3
MQTT
Server 4
topic: [deviceid]/obd
http / REST
...
• Easily Shardable
• Treat MQTT server as
commodity service
Ingest Architecture V2
10
MQTT
Server
Coordinator
MQTT
Server 1
MQTT
Server 2
MQTT
Server 3
MQTT
Server 4
topic: [deviceid]/obd
Kafka Connect
OBD_Data
Stream
processing
kafka
OBD -> MQTT -> Kafka
11
Apache Kafka
Distributed Streaming Platform:
• Pub Sub Messaging
• (typically clients are within data-center)
• Data Store
• Messages not deleted after delivery
• Stream Processing
• Low or high level libraries
• Data re-processing
12
Apache Kafka adoption spans
companies across industries.
13
● Persisted
● Append only
● Immutable
● Delete earliest data based on time / size / never
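The log properties above (persisted, append only, immutable, oldest data deleted first) can be illustrated with a small in-memory Python sketch. The class `Log` and its method names are assumptions for illustration only; a real Kafka partition is persisted on disk in segments and applies retention by time or size.

```python
class Log:
    """In-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._records = []
        self._base_offset = 0  # offset of the oldest record still retained

    def append(self, record):
        """Append a record and return its offset. Records are never modified."""
        self._records.append(record)
        return self._base_offset + len(self._records) - 1

    def read(self, offset):
        """Read the record at an offset; delivery does not delete it."""
        if offset < self._base_offset:
            raise IndexError("offset already removed by retention")
        return self._records[offset - self._base_offset]

    def truncate_before(self, offset):
        """Retention: drop the earliest records, up to (not including) offset."""
        drop = max(0, min(offset - self._base_offset, len(self._records)))
        del self._records[:drop]
        self._base_offset += drop
```

Offsets stay stable after truncation, which is what lets a consumer resume from a stored position.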
14
• Allows topics to scale past constraints
of single server
• Message → partition_id mapping is deterministic.
Choose a partition key relevant to the application.
• Ordering guarantees per partition but
not across partitions
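A deterministic message-to-partition mapping can be sketched in a few lines of Python. This is an illustration only: Kafka's default Java partitioner hashes the key bytes with murmur2, whereas the sketch uses CRC32 from the standard library as a stand-in. The function name `partition_for` and the key format are assumptions.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition id deterministically."""
    # CRC32 is a stand-in hash here, not what Kafka actually uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    for device_id in ["device-1", "device-2", "device-1"]:
        print(device_id, "->", partition_for(device_id, 8))
```

Because the same key always maps to the same partition, all records for one device stay in order relative to each other, as long as the partition count does not change.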
15
Apache Kafka Replication
• cheap durability!
• choose the number of acks required
before a produced message is
confirmed
16
Apache Kafka Consumer Groups
partitions possibly across different brokers
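The idea behind a consumer group is that each partition is read by exactly one consumer in the group. A rough Python sketch of a round-robin assignment follows. The function name `assign_partitions` is hypothetical; in Kafka the assignment is negotiated through the group coordinator using a configurable assignor, so this only shows the resulting shape.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers so each partition has one owner."""
    members = sorted(consumers)
    assignment = {consumer: [] for consumer in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


if __name__ == "__main__":
    print(assign_partitions(range(1, 9), ["consumer-1", "consumer-2"]))
```

When a consumer joins or leaves, the assignment is recomputed, which is how the group balances load and recovers from failures.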
17
Kafka Connect
• Use client library producers / consumers in custom applications.
• Often want to bulk transfer data between standard systems:
• Don’t re-invent the wheel – configure Kafka Connect
• Narrow scope: move data into & out of Kafka
• Off-the-shelf connectors
• Fault Tolerant
• Auto-balances load
• Pluggable Serialization
• Standalone and distributed modes of operation
• Configuration / management via REST API
18
19
MQTT Connector
https://github.com/evokly/kafka-connect-mqtt
• Single Task
• Single MQTT Broker
• Source only
Either:
• Start a bunch of these connectors (in one connect cluster), one per server, or:
• Implement a new multi-task connector, one task per MQTT broker.
• Communicate with MQTT Controller
20
• user_id
• device_id
• name
• address
• phone_number
• speed_alert_level
• ...
SQL Db
User_Info
User Data
21
Example: Car Towed Alert
Detect movement of car when ignition off, send SMS alert
kafka
OBD_Data P1
OBD_Data P5
Consumer 1
Consumer 2
Broker 1
...
OBD_Data P3
OBD_Data P7
Broker 2
...
...
...
SMS Gateway
Last loc. in mem
KV store
Last loc. in mem
KV store
User Info
22
Consumer Implementation
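on_message(message m)
{
    var device_id = m.key;
    var obd_data = m.value;

    // Ignition on: the car is being driven. Drop the stored position so the
    // next ignition-off reading becomes the new parked baseline.
    if (obd_data.ignition_on) {
        kv_store.remove(device_id);
        return;
    }

    // First ignition-off reading: remember where the car is parked.
    if (!kv_store.contains(device_id)) {
        kv_store.add(device_id, obd_data.lon_lat);
        return;
    }

    // Compare with the previous ignition-off position.
    var prev_lon_lat = kv_store.get(device_id);
    var dist = calc_dist(obd_data.lon_lat, prev_lon_lat);
    kv_store.set(device_id, obd_data.lon_lat);

    if (dist > alert_max_dist) {
        // Infrequent, so a direct SQL lookup is acceptable here.
        send_alert(SQL.get_phone_number(device_id));
    }
}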
• Message can be from any partition
assigned to this consumer
• Ordering guaranteed per partition, but
not predictable across partitions
• All messages from a particular device
guaranteed to arrive at the same
consumer instance
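The handler relies on a `calc_dist` helper that the slide does not define. One plausible choice, shown here as a Python sketch, is the haversine great-circle distance. The kilometre unit, the mean Earth radius constant, and the `(lon, lat)` tuple layout in degrees are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def calc_dist(lon_lat_a, lon_lat_b):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = (math.radians(v) for v in lon_lat_a)
    lon2, lat2 = (math.radians(v) for v in lon_lat_b)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))
```

GPS readings from a parked car jitter by some metres, so `alert_max_dist` would need to sit comfortably above that noise.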
23
Example: Speed Alert
• Scenario: Parent wants to monitor son/daughter driving and be alerted if they exceed a
specified speed.
• In the Tow Alert example User_Info only needs to be queried in the event of an alert.
• In this example, the table needs to be queried for every OBD data record in every partition.
OBD_data
[can update
at any time]
User Info
table
Not scalable! Cache?
...
High frequency
P1
24
Table can be represented as a stream of updates:
Time  Update record                    Table after update (device_id speed_limit)
0     {device_id=1, speed_limit=60}    1 60
1     {device_id=2, speed_limit=80}    1 60 | 2 80
2     {device_id=3, speed_limit=70}    1 60 | 2 80 | 3 70
3     {device_id=1, speed_limit=80}    1 80 | 2 80 | 3 70
4     {device_id=1, speed_limit=65}    1 65 | 2 80 | 3 70
Log compaction!
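The table/stream duality can be sketched in Python: replaying a changelog in order and keeping only the latest value per key rebuilds the table, which is also why a compacted topic only needs to retain the most recent record for each key. The function name `materialize` is an assumption; treating a `None` value as a delete mirrors Kafka's tombstone convention.

```python
def materialize(changelog):
    """Rebuild a table from an ordered stream of (key, value) updates."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # tombstone: the key was deleted
        else:
            table[key] = value    # later updates overwrite earlier ones
    return table


# The update stream from the slide: (device_id, speed_limit)
updates = [(1, 60), (2, 80), (3, 70), (1, 80), (1, 65)]

if __name__ == "__main__":
    print(materialize(updates))
```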
25
Debezium
Kafka Connector that turns database tables into streams of update records.
debezium
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
...
MySQL
User Info
[key: userId]
User_Info
[changelog topic]
Partition by device_id
26
Stream / Table Join
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Partition 7
...
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
...
Consumer 1
Relevant subset of
User_Info
device_id speed_limit
1 80
3 70
User_Info
[ChangeLog, compacted]
OBD_Data
[Record Stream]
...
debezium
key:device_id
key:device_id
27
Speed Alert: Message handler
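on_message(message m)
{
    var device_id = m.key;
    var obd_data = m.value;

    // Local lookup in the materialized User_Info table; no remote query.
    var user_info = user_info_local.get(device_id);
    if (user_info == null)
        return; // no User_Info record seen yet for this device

    if (obd_data.speedometer > user_info.speed_limit) {
        alert_user(device_id, user_info);
    }
}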
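A runnable Python sketch of the same stream/table join may help. It keeps a local dictionary up to date from the User_Info changelog and consults it for every OBD record. The function names and the dictionary field names (`speedometer`, `speed_limit`) are assumptions chosen for illustration, and the two handlers stand in for consumers of the two topics.

```python
user_info_local = {}  # device_id -> latest User_Info record for owned partitions


def on_user_info_update(device_id, user_info):
    """Apply one changelog record from the compacted User_Info topic."""
    if user_info is None:
        user_info_local.pop(device_id, None)  # tombstone: record deleted
    else:
        user_info_local[device_id] = user_info


def on_obd_message(device_id, obd_data):
    """Return True if this OBD record should trigger a speed alert."""
    user_info = user_info_local.get(device_id)
    if user_info is None:
        return False  # no user record seen yet for this device
    return obd_data["speedometer"] > user_info["speed_limit"]
```

Both topics are keyed by device_id and need the same partition count, so that the consumer owning a device's OBD partition also owns the matching slice of User_Info.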
28
MQTT Phone Client Connectivity
MQTT
Server
Coordinator
MQTT
Server 1
MQTT
Server 2
[deviceid]/alert
...
Consumer 1 ...
MQTT
Server 3
...
[deviceid]/obd
29
Speed Limit Alert: Rate limiting
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Partition 7
...
app_state kafka topic
• Prefer to rate limit on server to minimize network overhead.
• Create new Kafka topic app_state, partitioned on
device_id.
• When alert triggered, store alert time in this topic.
• [can use this topic as general store for other per device
state info too]
• Materialize this change-log stream on consumers as
necessary.
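The rate limiting described above can be sketched in Python with a dictionary standing in for the materialized app_state topic. The function name `should_send_alert` and the 600-second minimum interval are assumptions; in a real deployment the update would be produced to the app_state topic rather than written straight into local state.

```python
MIN_ALERT_INTERVAL_S = 600  # assumed policy: at most one alert per 10 minutes


def should_send_alert(app_state, device_id, now_s,
                      min_interval_s=MIN_ALERT_INTERVAL_S):
    """Decide whether to alert, and record the alert time if we do."""
    last_alert_s = app_state.get(device_id)
    if last_alert_s is not None and now_s - last_alert_s < min_interval_s:
        return False  # alerted too recently: suppress
    # Stand-in for producing {device_id: now_s} to the app_state topic.
    app_state[device_id] = now_s
    return True
```

Keeping the last alert time per device in a compacted topic means a replacement consumer can rebuild this state after a failure or rebalance.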
30
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Partition 7
...
Partition 1
Partition 2
Partition 3
...
Consumer 1
Relevant
subset of
User_Info
...
OBD_Data
[Record Stream]
User_Info
[ChangeLog, compacted]
Partition 4
Partition 1
Partition 2
Partition 3
...
Partition 4
App_State
[compacted]
Relevant
subset of
App_State
31
Example: Location Based Special Offers
When a car enters a specific region, send available special offers to the user’s phone.
Require:
• User_Info
• Address – so we know whether they are local to their current location or not
• App_state
• Use to persist already sent offers
• Special_Offer_Info
• Table that stores the list of all special offers.
32
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31 32 33 34 35
36 37 38 39 40 41 42
Regions
• Regions may be simple (as depicted
here) or complex
• F(lon, lat) -> locationId.
• Note: could also implement ride-share
surge pricing using similar partitioning.
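For the simple case depicted, F(lon, lat) -> locationId could look like the following Python sketch. It maps a point to a cell of a 7 x 6 grid numbered 1 to 42, left to right and top to bottom. The bounding box parameters and the function name `location_id` are assumptions, and points outside the box return `None`.

```python
def location_id(lon, lat, west, south, east, north, cols=7, rows=6):
    """Map a (lon, lat) point to a 1-based grid cell id, or None if outside."""
    if not (west <= lon <= east and south <= lat <= north):
        return None
    # Clamp so points exactly on the east or south edge fall in the last cell.
    col = min(cols - 1, int((lon - west) / (east - west) * cols))
    row = min(rows - 1, int((north - lat) / (north - south) * rows))
    return row * cols + col + 1  # row 0 is the northernmost row
```

Complex regions (for example polygons) would need a point-in-polygon lookup instead, but the output is still a single location id usable as a partition key.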
33
Special Offer Change-log Stream
debezium
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
...
MySQL
Special Offer
Info
Special_Offers
[changelog,
compacted]
Partition by location_id
34
Multi-stage Data Pipeline
OBD_Data App_State
[offers already sent]
User_Info
[address]
K: device_id
V: OBD record
consume enrich
K: device_id
V: OBD record
address
K: device_id
V: OBD record
Address
offers_sent
enrich
35
Multi-stage Data Pipeline (continued)
K: [device_id]
V: OBD record
Address
offers_sent
K: location_id
V: OBD record
Address
offers_sent
OBD_Data_By_Location
P1
……
…
Repartition by location_id
P2
P1
P3
Data from given device will still all be on the same partition
(except when region changes)
36
Multi-stage Data Pipeline (continued)
K: location_id
V: OBD record
Address
offers_sent
Special_Offers
K: location_id
V: OBD record
address
offers_sent
available_offers
re-partition
enrich
37
Multi-stage Data Pipeline (continued)
Special offer available in
location
Special offer not already
sent
User address near location?
MQTT
Server
filter
filter
filter
...
[deviceId]/alert
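The three filter stages can be sketched as one Python function over an enriched record. The record field names follow the pipeline slides (`available_offers`, `offers_sent`, `address`, `location_id`), but their exact shapes are assumptions. The slides also do not say whether the last filter should keep users whose address is near or far from the location, so the test is passed in as a hypothetical `address_is_near` callable and this sketch keeps the near ones.

```python
def offers_to_send(record, address_is_near):
    """Apply the three filters and return the offers to publish, if any."""
    # Filter 1: is any special offer available in this location?
    available = record.get("available_offers", [])
    if not available:
        return []
    # Filter 2: drop offers that were already sent to this device.
    already_sent = set(record.get("offers_sent", []))
    unsent = [offer for offer in available if offer not in already_sent]
    if not unsent:
        return []
    # Filter 3: is the user's address near the current location?
    if not address_is_near(record["address"], record["location_id"]):
        return []
    return unsent
```

Whatever survives the filters would then be published to the device's `[deviceId]/alert` MQTT topic.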
38
39
40
Discount code: kafcom17
Use the Apache Kafka community discount code to get $50 off
www.kafka-summit.org
Kafka Summit New York: May 8
Kafka Summit San Francisco: August 28
Presented by
41
Thank You
@matt_howlett
@confluentinc
