Stream-Native Processing with Pulsar Functions
1
Lightweight Computing
with Pulsar Functions
Sanjeev Kulkarni, Sijie Guo
2
Event Driven Architectures
The rise of Real-Time
Big Data began with Batch
HDFS/MapReduce/Hive
Reaction Times became important
Reduce time between data arrival and data analysis/action
Emergence of Real-Time Streaming Systems
3
What do we really mean by Real-Time?
Aims
Aim is to react to events as they happen in real time
Where do Events happen/arrive?
Message Bus
What's a reaction?
An action/transformation/function
4
Compute Representation
Abstract View
Incoming Messages -> f(x) -> Output Messages
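The abstract view above can be sketched in plain Java, assuming each incoming message is handled independently and produces one output message. The class name `AbstractViewDemo` and the helper `process` are hypothetical and exist only for this illustration; nothing here uses a messaging system.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical demo class: a function f applied to each incoming message.
class AbstractViewDemo {
    // Applies f to every incoming message, producing one output message per input.
    static <I, O> List<O> process(List<I> incoming, Function<I, O> f) {
        return incoming.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = process(Arrays.asList("a", "b"), s -> s.toUpperCase());
        System.out.println(out); // prints [A, B]
    }
}
```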
5
Traditional Compute representation
DAG
[Diagram: a DAG with Source 1 and Source 2, several Action nodes, and Sink 1 and Sink 2]
6
Traditional Compute API
Stitching all of this together is left to programmers
public static class SplitSentence extends BaseBasicBolt {
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Each emitted tuple carries a single field named "word".
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null; // no component-specific configuration
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        // Split the incoming sentence on spaces and emit one tuple per word.
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String w : words) {
            basicOutputCollector.emit(new Values(w));
        }
    }
}
7
Traditional Compute API
Stitching all of this together is left to programmers
public static class WordCount extends BaseBasicBolt {
    // Running count per word, kept in memory on this bolt instance.
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        // Emit the updated count downstream.
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
8
Compute API 2.0
Functional
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    // "\\s+" splits each sentence on runs of whitespace
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);
9
Compute API 2.0
Characteristics
Compact
Complicated
Map vs FlatMap
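The "Map vs FlatMap" distinction can be illustrated with the standard `java.util.stream` API rather than a streaming DSL. This is a minimal sketch; the class and method names are hypothetical. `map` yields exactly one result per input, while `flatMap` yields zero or more results per input and flattens them into one stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapVsFlatMapDemo {
    // map: exactly one output element per input element (a word count per sentence).
    static List<Integer> wordCountsPerSentence(List<String> sentences) {
        return sentences.stream()
                .map(s -> s.split("\\s+").length)
                .collect(Collectors.toList());
    }

    // flatMap: zero or more output elements per input element, flattened into one stream.
    static List<String> allWords(List<String> sentences) {
        return sentences.stream()
                .flatMap(s -> Arrays.stream(s.toLowerCase().split("\\s+")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("Hello world", "one two three");
        System.out.println(wordCountsPerSentence(sentences)); // prints [2, 3]
        System.out.println(allWords(sentences)); // prints [hello, world, one, two, three]
    }
}
```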
10
Traditional Real-Time Systems
Separate
[Diagram: Messaging and Compute shown as separate systems]
11
Traditional Real-Time Systems
Developer Experience
Powerful API but complicated
Does everyone really need to learn functional programming?
Configurable/Scalable but management overhead
Edge systems have resource/manageability constraints
12
Traditional Real-Time Systems
Operational Experience
Another system to operate is one too many
IoT deployments routinely have thousands of edge systems
Semantic difference
Mismatch/Duplication between Systems
Creates Developer and Operator Friction
13
Lessons learnt
Use Cases
A significant percentage of transformations are simple
ETL
Reactive Services
Classification
Real-time Aggregation
Event Routing
Microservices
14
Meanwhile
The world of Cloud
The emergence of Serverless
Simple Function API
Functions are submitted to the system
Run per event
Composition APIs to do complex things
Wildly popular
15
Serverless vs Streaming
What's really the difference?
Both are event driven architectures
Both can be used for analytics/serving
Both have composition APIs
Conf based for Serverless vs DSL based for Streaming
Serverless typically doesn't care about ordering
Really a function of the underlying source
Pay per action
Really a product billing interface
16
What's needed: Stream-Native Compute
Insight gained from serverless
Simplest possible API
Method/Procedure/Function
Multi-Language API
Scale developers
Message bus native concepts
Input/Output/Log as topics
Flexible runtime
Simple standalone applications vs system managed applications
17
Introducing Pulsar Functions
18
Apache Pulsar
19
What is Apache Pulsar?
Ordering: Guaranteed ordering
Multi-tenancy: A single cluster can support many tenants and use cases
High throughput: Can reach 1.8 M messages/s in a single partition
Durability: Data replicated and synced to disk
Geo-replication: Out-of-the-box support for geographically distributed applications
Unified messaging model: Supports both Streaming and Queuing in a single model
Delivery Guarantees: At least once, at most once and effectively once
Low Latency: Low publish latency of 5 ms at the 99th percentile
Highly scalable: Can support millions of topics
20
Pulsar Architecture
[Diagram: Producer and Consumer connect to three Pulsar brokers (Apache Pulsar), backed by Bookies 1 to 5 (Apache BookKeeper)]
Stateless Serving
BROKER
Clients interact only with brokers
No state is stored in brokers
BOOKIES
Apache BookKeeper as the storage
Storage is append only
Provides high performance, low latency
Durability
No data loss. fsync before acknowledgement
21
Pulsar Architecture
Separation of Storage and Serving
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
22
Segment Centric Storage
23
Flexible Messaging Model
24
Multi Tenancy
25
Multi Cluster Replication
[Diagram: Topic T1 replicated across Data Centers A, B and C, with Producers P1, P2, P3 and Consumers C1, C2 on Subscription S1]
26
Back to Pulsar Functions
27
Pulsar Functions
API
SDK-less API
import java.util.function.Function;

public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}
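Because the SDK-less form depends only on `java.util.function.Function`, the logic can be exercised locally without a running cluster. The sketch below repeats the slide's function (without the `public` modifier, so both classes fit in one file) and adds a hypothetical `ExclamationFunctionCheck` class that calls it directly.

```java
import java.util.function.Function;

// Same logic as the slide's function, repeated so the sketch compiles on its own.
class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}

// Hypothetical local check: no broker or topic involved, just a method call.
class ExclamationFunctionCheck {
    public static void main(String[] args) {
        Function<String, String> f = new ExclamationFunction();
        System.out.println(f.apply("hello")); // prints hello!
    }
}
```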
28
Pulsar Functions
API
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;

public class ExclamationFunction implements PulsarFunction<String, String> {
    @Override
    public String process(String input, Context context) {
        return input + "!";
    }
}
29
Pulsar Functions
Input and Output
Function executed for every message of input topic
Supports multiple topics as inputs
Function Output goes to the output topic
Function Output can be void/null
SerDe takes care of serialization/deserialization of messages
Custom SerDe can be provided by the users
Integrates with Schema Registry
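To make the custom SerDe idea concrete, here is a minimal sketch of a serializer/deserializer pair for a sensor reading. The interface `SimpleSerDe`, the `Reading` type and the comma-separated wire format are all assumptions made for this illustration; they are stand-ins, not the Pulsar SerDe interface itself, whose exact shape should be taken from the Pulsar documentation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in interface for this sketch; not the Pulsar API.
interface SimpleSerDe<T> {
    byte[] serialize(T value);
    T deserialize(byte[] bytes);
}

// Hypothetical message type.
final class Reading {
    final String sensorId;
    final double value;

    Reading(String sensorId, double value) {
        this.sensorId = sensorId;
        this.value = value;
    }
}

// Encodes a Reading as UTF-8 text "sensorId,value".
// Assumes sensorId itself contains no comma.
class ReadingSerDe implements SimpleSerDe<Reading> {
    @Override
    public byte[] serialize(Reading r) {
        return (r.sensorId + "," + r.value).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Reading deserialize(byte[] bytes) {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(",", 2);
        return new Reading(parts[0], Double.parseDouble(parts[1]));
    }
}
```

A round trip (`deserialize(serialize(r))`) should give back the same sensor id and value, which is a convenient property to check in a unit test.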
30
Pulsar Functions
Processing Guarantees
ATMOST_ONCE
Message is acked to Pulsar as soon as we receive it
ATLEAST_ONCE
Message is acked to Pulsar after the function completes
Default behaviour: not many people want to lose data
EFFECTIVELY_ONCE
Uses Pulsar’s inbuilt effectively-once semantics
Controlled at runtime by user
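The difference between acking on receipt and acking after completion can be shown with a small simulation. This sketch does not use Pulsar: the queue, the `failOnce` set and the class `GuaranteeDemo` are hypothetical, and redelivery is modeled simply by putting an unacked message back at the head of the queue. It also does not model the duplicate outputs that at-least-once can produce when a failure happens after output but before the ack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GuaranteeDemo {
    enum Guarantee { ATMOST_ONCE, ATLEAST_ONCE }

    // Simulates a function that crashes the first time it sees each message in failOnce.
    static List<String> run(List<String> messages, Set<String> failOnce, Guarantee g) {
        Deque<String> queue = new ArrayDeque<>(messages);
        Set<String> alreadyFailed = new HashSet<>();
        List<String> outputs = new ArrayList<>();
        while (!queue.isEmpty()) {
            String msg = queue.poll();
            // ATMOST_ONCE: the message is acked on receipt, before processing.
            boolean ackedEarly = (g == Guarantee.ATMOST_ONCE);
            boolean fails = failOnce.contains(msg) && alreadyFailed.add(msg);
            if (fails) {
                if (!ackedEarly) {
                    queue.addFirst(msg); // not acked yet, so it is delivered again
                }
                continue; // acked early: the message is lost
            }
            outputs.add(msg + "!");
            // ATLEAST_ONCE: the ack would be sent here, after the function completes.
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> msgs = Arrays.asList("a", "b", "c");
        Set<String> failOnce = new HashSet<>(Arrays.asList("b"));
        System.out.println(run(msgs, failOnce, Guarantee.ATMOST_ONCE));  // prints [a!, c!]
        System.out.println(run(msgs, failOnce, Guarantee.ATLEAST_ONCE)); // prints [a!, b!, c!]
    }
}
```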
31
Pulsar Functions
Built-in State
Functions can store state in StreamStore
Framework provides a simple library around this
Supports server-side operations like counters
Simplified application development
No need to stand up an extra system
32
Pulsar Functions
WordCount Topology
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;

public class CounterFunction implements PulsarFunction<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // split() takes a regex, so the period must be escaped to match literally.
        for (String word : input.split("\\.")) {
            // Server-side counter increment, stored by the framework.
            context.incrCounter(word, 1);
        }
        return null;
    }
}
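The counting logic can be tried out locally with an in-memory stand-in for the state store. In this sketch `CounterContext` is a hypothetical interface that mirrors only the `incrCounter` call used on the slide, and `InMemoryCounterContext` keeps counters in a `HashMap` instead of BookKeeper. The delimiter is written as an escaped literal period, since `String.split` interprets its argument as a regular expression.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for the one Context method the slide uses.
interface CounterContext {
    void incrCounter(String key, long amount);
}

// In-memory implementation for local testing only.
class InMemoryCounterContext implements CounterContext {
    final Map<String, Long> counters = new HashMap<>();

    @Override
    public void incrCounter(String key, long amount) {
        counters.merge(key, amount, Long::sum);
    }
}

class CounterLogic {
    // Same loop as the slide's function, run against the stand-in context.
    static void process(String input, CounterContext context) {
        for (String word : input.split("\\.")) {
            context.incrCounter(word, 1);
        }
    }

    public static void main(String[] args) {
        InMemoryCounterContext ctx = new InMemoryCounterContext();
        process("a.b.a", ctx);
        System.out.println(ctx.counters.get("a")); // prints 2
        System.out.println(ctx.counters.get("b")); // prints 1
    }
}
```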
33
Built-in State Management
Pulsar uses BookKeeper as its stream storage
Functions can store State in BookKeeper
Framework provides the Context object for users to access State
Supports server-side operations like Counters
Simplified application development
No need to stand up an extra system to develop/test/integrate/operate
34
State Storage w/ BookKeeper
The built-in state management is powered by Table Service in BookKeeper
BP-30: Table Service
Originated as built-in metadata management within BookKeeper
Exposed for general usage, e.g. state management for Pulsar Functions
Developer Preview
Pulsar Functions in Pulsar 2.0
Direct usage in BookKeeper 4.7
35
State Storage w/ BookKeeper
Updates are written to the log streams in BookKeeper
Materialized into a key/value table view
The key/value table is indexed with RocksDB for fast lookup
The source of truth is the log streams in BookKeeper
RocksDB instances are transient key/value indexes
RocksDB instances are incrementally checkpointed and stored in BookKeeper for
fast recovery
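The log-plus-index design above can be sketched in a few lines of plain Java. Everything here is a simplification under stated assumptions: an `ArrayList` stands in for the BookKeeper log stream, a `HashMap` stands in for the RocksDB index, and the checkpoint is a full copy of the map rather than an incremental one. The class `LogBackedTable` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LogBackedTable {
    // Append-only update log: the source of truth (stand-in for the log streams).
    private final List<String[]> log = new ArrayList<>();
    // Materialized key/value view: a transient index (stand-in for RocksDB).
    private Map<String, Long> view = new HashMap<>();
    // Last checkpoint: a copy of the view plus the log offset it covers.
    private Map<String, Long> checkpoint = new HashMap<>();
    private int checkpointOffset = 0;

    void increment(String key, long amount) {
        log.add(new String[] {key, Long.toString(amount)}); // write to the log first
        view.merge(key, amount, Long::sum);                 // then update the index
    }

    Long get(String key) {
        return view.get(key);
    }

    void takeCheckpoint() {
        checkpoint = new HashMap<>(view);
        checkpointOffset = log.size();
    }

    // Simulates losing the index: restore the checkpoint, then replay only
    // the log entries written after it.
    void recover() {
        view = new HashMap<>(checkpoint);
        for (int i = checkpointOffset; i < log.size(); i++) {
            String[] entry = log.get(i);
            view.merge(entry[0], Long.parseLong(entry[1]), Long::sum);
        }
    }
}
```

Recovery only replays the tail of the log after the checkpoint, which is why checkpointing the index shortens recovery time even though the log alone would be enough to rebuild it.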
36
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
Runs as a standalone process
Run as many instances as you want. Framework automatically balances data
Run and manage via Mesos/Kubernetes/Nomad/your favorite tool
37
Pulsar Functions
Running inside Pulsar cluster
‘Create’ and ‘Delete’ Functions in a Pulsar Cluster
Pulsar brokers run functions as either threads/processes/docker containers
Unifies Messaging and Compute clusters into one, significantly improving
manageability
Ideal match for Edge or small startup environments
Serverless in a jar
38
Pulsar Functions
Stepping back: Where Pulsar Functions belong
Powerful/Complicated systems have their place
Data Centers/Cloud
Complex analysis
A significant percentage of analytics/actions are mundane
ETL/Counting/Routing
Use simple tools for simple things
39
Pulsar Functions: Use Cases
Edge Computing
Sensor devices generate tons of data
We need local actions
Simple filtering, threshold detection, regex matching, etc.
Manageability is a big concern
The fewer moving parts, the better
Resource Constrained
Limited scope for full-blown schedulers/Job Managers
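A threshold-detection action of the kind listed above fits the SDK-less function shape. The sketch below is illustrative only: the input format `sensorId,value`, the threshold of 75.0 and the class name `ThresholdAlertFunction` are assumptions. It returns `null` for readings that should produce no output, relying on the earlier statement that function output can be void/null.

```java
import java.util.function.Function;

// Hypothetical edge function: input is assumed to be "sensorId,value".
class ThresholdAlertFunction implements Function<String, String> {
    private static final double THRESHOLD = 75.0; // assumed limit for illustration

    @Override
    public String apply(String input) {
        String[] parts = input.split(",", 2);
        if (parts.length != 2) {
            return null; // malformed message: emit nothing
        }
        double value;
        try {
            value = Double.parseDouble(parts[1].trim());
        } catch (NumberFormatException e) {
            return null; // non-numeric reading: emit nothing
        }
        // Only readings above the threshold produce an output message.
        return value > THRESHOLD ? "ALERT " + parts[0] + " " + value : null;
    }
}
```

For example, `apply("s1,80.5")` returns `"ALERT s1 80.5"`, while `apply("s1,20")` and `apply("garbage")` both return `null`.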
40
Pulsar Functions: Use Cases
Model Serving
Models computed via offline analysis
Incoming requests should be classified using the model
Function is a natural representation for the classification action
Model itself can be stored in BookKeeper
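As a rough illustration of classification as a function, the sketch below wraps a tiny linear model in a `java.util.function.Function`. The model form, the labels and the class name `LinearClassifierFunction` are assumptions made for this example; in the setup described above the weights would come from offline analysis and be loaded from storage rather than passed to a constructor.

```java
import java.util.function.Function;

// Hypothetical classifier: a linear model whose weights were computed offline.
class LinearClassifierFunction implements Function<double[], String> {
    private final double[] weights;
    private final double bias;

    LinearClassifierFunction(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    @Override
    public String apply(double[] features) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException("expected " + weights.length + " features");
        }
        // Score is the dot product of weights and features, plus the bias.
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score >= 0 ? "positive" : "negative";
    }
}
```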
41
Roadmap
More language support: Go, JavaScript, C++
Cross Functions: Function Composition API
More State operations exposed to Functions
42
Conclusion
Stream-Native Compute (aka Functions) is the new paradigm in Messaging Systems
Stream-Native Storage (aka State) is the new paradigm in Storage Systems
Pulsar Functions brings lightweight computing capability into messaging and
storage systems, which is the trend that streaming applications need
https://pulsar.incubator.apache.org/docs/latest/functions/quickstart/
43
Questions and Thank You!
