KSQL: Open source streaming for Apache Kafka®
@tlberglund
Kafka Architecture
[Diagram: producers writing to and consumers reading from a cluster of Kafka brokers]
Scalable Consumption
[Diagram: a producer writing to a partitioned topic (partitions 1-3); consumer groups A and B each consume all partitions in parallel]
Logs and Pub/Sub
[Diagram: a producer appends records to an ordered log (offsets from the first record to the latest record); consumers A and B read from their own positions]
Streaming Platform
[Diagram: a streaming platform connecting applications, databases, offline systems (DWH, HDFS/Spark), stream processors, and real-time analytics]
KSQL is a Declarative Stream Processing Language
KSQL is the Streaming SQL Engine for Apache Kafka
Stream Processing by Analogy
[Diagram: Connect API | Stream Processing | Connect API over a Kafka Cluster, the Kafka analog of the Unix pipeline below]
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
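As a rough KSQL analog of that Unix pipe (a sketch, not from the deck; it assumes a source stream named lines over the input topic, with a single VARCHAR column text):
-- cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt, rewritten as a persistent query
CREATE STREAM out_lines AS
  SELECT UCASE(text) AS text   -- tr a-z A-Z
  FROM lines                   -- cat < in.txt
  WHERE text LIKE '%ksql%';    -- grep "ksql"
Here the source and sink topics play the role of in.txt and out.txt, and the persistent query is the pipeline in between.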
What are some KSQL use cases?
KSQL for Data Exploration
SELECT status, bytes
FROM clickstream
WHERE user_agent =
'Mozilla/5.0 (compatible; MSIE 6.0)';
An easy way to inspect data in a running cluster
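A few companion statements cover the same kind of interactive inspection (standard KSQL, though not shown on the slide):
SHOW STREAMS;                                 -- list the registered streams
DESCRIBE clickstream;                         -- show the stream's columns and types
PRINT 'my_clickstream_topic' FROM BEGINNING;  -- dump raw records from the underlying topic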
KSQL for Streaming ETL
• Kafka is popular for data pipelines.
• KSQL enables easy transformations of data within the pipe.
• Transforming data while moving from Kafka to another system (see the format-conversion sketch after the example below).
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
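When the other system expects a different encoding, the format conversion can happen in the same pipe. A minimal sketch (not from the slides; the stream name vip_actions_avro is illustrative, and Avro output assumes Schema Registry is available):
CREATE STREAM vip_actions_avro
  WITH (VALUE_FORMAT='AVRO') AS   -- re-encode the enriched events for a downstream sink connector
  SELECT * FROM vip_actions;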
KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
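To watch the flagged cards as they are produced (not on the slide), a continuous query against the derived table works from the CLI:
SELECT * FROM possible_fraud;   -- emits each card_number whose 5-second count exceeds 3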
KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
KSQL for Data Transformation
CREATE STREAM views_by_userid
WITH (PARTITIONS=6,
VALUE_FORMAT='JSON',
TIMESTAMP='view_time') AS
SELECT * FROM clickstream PARTITION BY user_id;
Make simple derivations of existing topics from the command line
Where is KSQL not such a great fit?
BI reports (Tableau etc.)
• No indexes
• No JDBC (most BI tools are not good with continuous results!)
Ad-hoc queries
• Limited span of time usually retained in Kafka
• No indexes
CREATE STREAM clickstream (
time BIGINT,
url VARCHAR,
status INTEGER,
bytes INTEGER,
userid VARCHAR,
agent VARCHAR)
WITH (
value_format = 'JSON',
kafka_topic = 'my_clickstream_topic'
);
Creating a Stream
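Once registered, the stream can be queried straight away; an illustrative sanity check (not on the slide; LIMIT terminates the otherwise continuous query):
SELECT url, status, bytes
FROM clickstream
WHERE status >= 400
LIMIT 5;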
CREATE TABLE users (
user_id INTEGER,
registered_at BIGINT,
username VARCHAR,
name VARCHAR,
city VARCHAR,
level VARCHAR)
WITH (
key = 'user_id',
kafka_topic = 'clickstream_users',
value_format = 'JSON');
Creating a Table
CREATE STREAM vip_actions AS
SELECT userid, fullname, url, status
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
Joins for Enrichment
Demo
How to run KSQL #2: CLIENT-SERVER
[Diagram: a KSQL CLI connected to one of several KSQL Server JVMs, all backed by a Kafka Cluster]
• Start any number of server nodes
  bin/ksql-server-start
• Start one or more CLIs and point them to a server
  bin/ksql-cli remote https://myksqlserver:8090
• All servers share the processing load
  Technically, they are instances of the same Kafka Streams application
  Scale up/down without restart
How to run KSQL #3: AS A STANDALONE APPLICATION
[Diagram: several KSQL Server JVMs backed by a Kafka Cluster, with no CLI attached]
• Start any number of server nodes
  Pass a file of KSQL statements to execute (a sketch of such a file follows below)
  bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
  Version-control your queries and transformations as code
• All running engines share the processing load
  Technically, they are instances of the same Kafka Streams application
  Scale up/down without restart
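What such a query file might contain (an illustrative sketch, not from the deck; it reuses the clickstream registration shown earlier plus a hypothetical derived stream named errors):
-- foo/bar.sql: executed by each standalone KSQL engine on startup
CREATE STREAM clickstream (
  time BIGINT, url VARCHAR, status INTEGER,
  bytes INTEGER, userid VARCHAR, agent VARCHAR)
WITH (value_format = 'JSON', kafka_topic = 'my_clickstream_topic');

CREATE STREAM errors AS
  SELECT url, status FROM clickstream WHERE status >= 400;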
Resources and Next Steps
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql
Thank you!
@tlberglund tim@confluent.io
