This document provides an overview of KSQL, an open source streaming SQL engine for Apache Kafka. It describes the core concepts of Kafka and KSQL, including how KSQL can be used for streaming ETL, anomaly detection, real-time monitoring, and data transformation. It also discusses how KSQL fits into a streaming platform and can be run in both client-server and standalone modes.
KSQL for Data Exploration
SELECT status, bytes
FROM clickstream
WHERE user_agent = 'Mozilla/5.0 (compatible; MSIE 6.0)';
An easy way to inspect data in a running cluster
KSQL for Streaming ETL
• Kafka is popular for data pipelines.
• KSQL enables easy transformations of data within the pipe.
• Transform data as it moves from Kafka to another system.
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data, surfaced in milliseconds
KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
KSQL for Data Transformation
CREATE STREAM views_by_userid
WITH (PARTITIONS=6,
VALUE_FORMAT='JSON',
TIMESTAMP='view_time') AS
SELECT * FROM clickstream PARTITION BY user_id;
Make simple derivations of existing topics from the command line
Where is KSQL not such a great fit?
BI reports (Tableau etc.)
• No indexes
• No JDBC (most BI tools are not good with continuous results!)
Ad-hoc queries
• Limited span of time usually retained in Kafka
• No indexes
CREATE STREAM clickstream(
time BIGINT,
url VARCHAR,
status INTEGER,
bytes INTEGER,
userid VARCHAR,
agent VARCHAR)
WITH (
value_format = 'JSON',
kafka_topic = 'my_clickstream_topic'
);
Creating a Stream
CREATE TABLE users(
user_id INTEGER,
registered_at BIGINT,
username VARCHAR,
name VARCHAR,
city VARCHAR,
level VARCHAR)
WITH (
key = 'user_id',
kafka_topic='clickstream_users',
value_format='JSON');
Creating a Table
CREATE STREAM vip_actions AS
SELECT c.userid, u.name, c.url, c.status
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
Joins for Enrichment
How to run KSQL
#2 CLIENT-SERVER
[Diagram: a KSQL CLI connects to one of several KSQL Server JVMs, all of which read from and write to the Kafka cluster]
• Start any number of server nodes
bin/ksql-server-start
• Start one or more CLIs and point them to a server
bin/ksql-cli remote https://myksqlserver:8090
• All servers share the processing load
Technically, instances of the same Kafka Streams application
Scale up/down without restart
How to run KSQL #2 CLIENT-SERVER
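Once a CLI is connected, its statements are executed by the shared server pool. For example, you can inspect what the pool already knows about (a minimal sketch using standard KSQL CLI statements; exact output varies by version):
-- List the streams, tables, and persistent queries registered on the server pool
SHOW STREAMS;
SHOW TABLES;
SHOW QUERIES;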
How to run KSQL
#3 AS A STANDALONE APPLICATION
[Diagram: several KSQL Server JVMs, each reading from and writing to the Kafka cluster; no CLI is attached]
• Start any number of server nodes
Pass a file of KSQL statements to execute
bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
Version-control your queries and transformations as code
• All running engines share the processing load
Technically, instances of the same Kafka Streams application
Scale up/down without restart
How to run KSQL #3 AS A STANDALONE APPLICATION
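The query file passed to each node (foo/bar.sql above) simply holds the KSQL statements the engines should run. A hypothetical sketch, reusing the clickstream example from the earlier slides (topic name and columns are illustrative):
-- foo/bar.sql: statements each standalone KSQL node executes at startup
CREATE STREAM clickstream (
  time BIGINT, url VARCHAR, status INTEGER,
  bytes INTEGER, userid VARCHAR, agent VARCHAR)
WITH (value_format = 'JSON', kafka_topic = 'my_clickstream_topic');
-- Persistent query: count HTTP error responses per minute
CREATE TABLE error_counts_by_status AS
  SELECT status, count(*)
  FROM clickstream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE status >= 400
  GROUP BY status;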
Resources and Next Steps
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql