KSQL: Open source streaming for Apache Kafka®
@tlberglund
Kafka Architecture
[Diagram: producers writing to and consumers reading from a cluster of Kafka brokers]
Scalable Consumption
[Diagram: a producer writing to a partitioned topic (partitions 1-3); consumer groups A and B each consume all partitions in parallel]
Logs and Pub/Sub
[Diagram: a producer appends records to an ordered log (offsets from the first record to the latest record); consumers A and B read from their own positions]
Streaming Platform
[Diagram: a streaming platform connecting applications, databases, offline systems (DWH, HDFS/Spark), stream processors, and real-time analytics]
KSQL is a Declarative Stream Processing Language
KSQL is the Streaming SQL Engine for Apache Kafka
Stream Processing by Analogy
[Diagram: Connect API | Stream Processing | Connect API over a Kafka Cluster, the Kafka analog of the Unix pipeline below]
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
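As a rough KSQL analog of that Unix pipe (a sketch, not from the deck; it assumes a source stream named lines over the input topic, with a single VARCHAR column text):
-- cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt, rewritten as a persistent query
CREATE STREAM out_lines AS
  SELECT UCASE(text) AS text   -- tr a-z A-Z
  FROM lines                   -- cat < in.txt
  WHERE text LIKE '%ksql%';    -- grep "ksql"
Here the source and sink topics play the role of in.txt and out.txt, and the persistent query is the pipeline in between.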
What are some KSQL use cases?
KSQL for Data Exploration
SELECT status, bytes
FROM clickstream
WHERE user_agent =
'Mozilla/5.0 (compatible; MSIE 6.0)';
An easy way to inspect data in a running cluster
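A few companion statements cover the same kind of interactive inspection (standard KSQL, though not shown on the slide):
SHOW STREAMS;                                 -- list the registered streams
DESCRIBE clickstream;                         -- show the stream's columns and types
PRINT 'my_clickstream_topic' FROM BEGINNING;  -- dump raw records from the underlying topic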
KSQL for Streaming ETL
• Kafka is popular for data pipelines.
• KSQL enables easy transformations of data within the pipe.
• Transforming data while moving from Kafka to another system (see the format-conversion sketch after the example below).
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
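When the other system expects a different encoding, the format conversion can happen in the same pipe. A minimal sketch (not from the slides; the stream name vip_actions_avro is illustrative, and Avro output assumes Schema Registry is available):
CREATE STREAM vip_actions_avro
  WITH (VALUE_FORMAT='AVRO') AS   -- re-encode the enriched events for a downstream sink connector
  SELECT * FROM vip_actions;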
KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
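To watch the flagged cards as they are produced (not on the slide), a continuous query against the derived table works from the CLI:
SELECT * FROM possible_fraud;   -- emits each card_number whose 5-second count exceeds 3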
KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
KSQL for Data Transformation
CREATE STREAM views_by_userid
WITH (PARTITIONS=6,
VALUE_FORMAT='JSON',
TIMESTAMP='view_time') AS
SELECT * FROM clickstream PARTITION BY user_id;
Make simple derivations of existing topics from the command line
Where is KSQL not such a great fit?
BI reports (Tableau etc.)
• No indexes
• No JDBC (most BI tools are not good with continuous results!)
Ad-hoc queries
• Limited span of time usually retained in Kafka
• No indexes
CREATE STREAM clickstream (
time BIGINT,
url VARCHAR,
status INTEGER,
bytes INTEGER,
userid VARCHAR,
agent VARCHAR)
WITH (
value_format = 'JSON',
kafka_topic = 'my_clickstream_topic'
);
Creating a Stream
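Once registered, the stream can be queried straight away; an illustrative sanity check (not on the slide; LIMIT terminates the otherwise continuous query):
SELECT url, status, bytes
FROM clickstream
WHERE status >= 400
LIMIT 5;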
CREATE TABLE users (
user_id INTEGER,
registered_at BIGINT,
username VARCHAR,
name VARCHAR,
city VARCHAR,
level VARCHAR)
WITH (
key = 'user_id',
kafka_topic = 'clickstream_users',
value_format = 'JSON');
Creating a Table
CREATE STREAM vip_actions AS
SELECT userid, fullname, url, status
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
Joins for Enrichment
Demo
How to run KSQL #2: CLIENT-SERVER
[Diagram: a KSQL CLI connected to one of several KSQL Server JVMs, all backed by a Kafka Cluster]
• Start any number of server nodes
  bin/ksql-server-start
• Start one or more CLIs and point them to a server
  bin/ksql-cli remote https://myksqlserver:8090
• All servers share the processing load
  Technically, they are instances of the same Kafka Streams application
  Scale up/down without restart
How to run KSQL #3: AS A STANDALONE APPLICATION
[Diagram: several KSQL Server JVMs backed by a Kafka Cluster, with no CLI attached]
• Start any number of server nodes
  Pass a file of KSQL statements to execute (a sketch of such a file follows below)
  bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
  Version-control your queries and transformations as code
• All running engines share the processing load
  Technically, they are instances of the same Kafka Streams application
  Scale up/down without restart
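What such a query file might contain (an illustrative sketch, not from the deck; it reuses the clickstream registration shown earlier plus a hypothetical derived stream named errors):
-- foo/bar.sql: executed by each standalone KSQL engine on startup
CREATE STREAM clickstream (
  time BIGINT, url VARCHAR, status INTEGER,
  bytes INTEGER, userid VARCHAR, agent VARCHAR)
WITH (value_format = 'JSON', kafka_topic = 'my_clickstream_topic');

CREATE STREAM errors AS
  SELECT url, status FROM clickstream WHERE status >= 400;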
Resources and Next Steps
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql
Thank you!
@tlberglund tim@confluent.io
