Low Latency Execution for Apache Spark
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
Who am I?
PhD candidate, AMPLab, UC Berkeley
Dissertation: system design for large-scale machine learning
Apache Spark PMC member. Contributions to Spark core, MLlib, SparkR.
Low latency: Spark Streaming
Low latency: execution
Scheduler decisions must take only a few milliseconds, even on large clusters.
Goal: a low-latency execution engine for iterative workloads such as machine learning and stream processing, where state is carried across iterations.
This talk
Execution models
Centralized task scheduling with lineage-based, parallel recovery, e.g., Microsoft Dryad.
Execution models: batch processing
Each iteration runs as a set of tasks: a shuffle moves intermediate data between tasks over the network, a broadcast distributes shared data, and the driver coordinates shuffles and data sharing via control messages.
[Figure: one iteration with shuffle and broadcast phases, coordinated by the driver; legend: task, control message, network transfer]
Execution models: parallel operators
Long-lived operators with checkpointing-based fault tolerance, e.g., Naiad. Each iteration flows through pre-placed operators connected by shuffles, avoiding per-iteration driver coordination.
[Figure: one iteration through long-lived operators with two shuffle phases; legend: task, control message, driver, network transfer]
Low latency: scaling behavior
[Figure: median-task time breakdown (network wait, compute, task fetch, scheduler delay) vs. number of machines, 4 to 128; total time grows to roughly 250 ms]
Cluster: 4-core r3.xlarge machines. Workload: sum of 10k numbers per core.
Can we achieve low latency with Apache Spark?
Design insight
Fine-grained execution with coarse-grained scheduling
Drizzle
The per-iteration structure (shuffle, broadcast) is kept, while three techniques remove coordination from the critical path:
- Batch scheduling
- Pre-scheduling shuffles
- Distributed shared variables
DAG scheduling
Assign tasks to hosts using (a) locality preferences, (b) straggler mitigation, (c) fair sharing, etc. The scheduler uses host metadata to make assignments, and the driver serializes and launches the tasks on each host.
[Figure: driver assigning tasks to Host1 and Host2]
Batch scheduling
The DAG structure is the same for many iterations, so scheduling decisions can be reused: schedule a batch of iterations at once, with fault tolerance and scheduling handled at batch boundaries. Example: b = 2, with one stage in each iteration. A sketch of the idea follows.
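To make the reuse concrete, here is a minimal, self-contained sketch (hypothetical names, not Drizzle's actual internals) of computing a placement once and reusing it for every iteration in a batch:

  object BatchSchedulingSketch {
    type Host = String
    case class Task(iter: Int, partition: Int)

    // Stand-in for locality-, fairness-, and straggler-aware placement.
    def assignPartitions(partitions: Int, hosts: Seq[Host]): Map[Int, Host] =
      (0 until partitions).map(p => p -> hosts(p % hosts.length)).toMap

    // Decide placement once, then reuse it for all b iterations in the batch.
    def scheduleBatch(b: Int, partitions: Int, hosts: Seq[Host]): Seq[(Task, Host)] = {
      val placement = assignPartitions(partitions, hosts)
      for {
        iter <- 0 until b
        p    <- 0 until partitions
      } yield (Task(iter, p), placement(p))
    }

    def main(args: Array[String]): Unit =
      // b = 2, one stage per iteration, as in the example above.
      scheduleBatch(b = 2, partitions = 4, hosts = Seq("host1", "host2"))
        .foreach { case (t, h) => println(s"$t -> $h") }
  }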
How much does this help?
[Figure: time per iteration (ms, log scale) vs. machines (4 to 128) for Apache Spark and Drizzle with batch sizes 10, 50, and 100]
Workload: sum of 10k numbers per core; single-stage job, 100 iterations, varying Drizzle batch size.
Drizzle: pre-scheduling shuffles
Coordinating shuffles: Apache Spark
The driver sends metadata; tasks pull intermediate data.
[Figure: driver-mediated shuffle; legend: task, control message, data message, intermediate data]
Coordinating shuffles: pre-scheduling
Pre-schedule downstream tasks on executors and trigger them once their dependencies are met: push metadata, pull data. Metadata is coalesced across cores, and the data transfer happens during the downstream stage. A sketch of the trigger mechanism follows.
[Figure: pre-scheduled downstream tasks receiving metadata messages; legend: task, control message, data message, metadata message, intermediate data]
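A minimal sketch of the trigger mechanism (hypothetical classes, not Drizzle's code): the downstream task is registered on the executor up front and fires once metadata from all upstream tasks has been pushed to it.

  import scala.collection.mutable

  // A downstream task registered before its upstream stage runs. Upstream
  // tasks push shuffle metadata directly to it; when all dependencies have
  // reported, the task triggers and pulls the actual data.
  class PreScheduledTask(numUpstream: Int, run: Seq[String] => Unit) {
    private val metadata = mutable.ArrayBuffer.empty[String]

    def onUpstreamMetadata(meta: String): Unit = synchronized {
      metadata += meta
      if (metadata.size == numUpstream) run(metadata.toSeq)
    }
  }

  object PreSchedulingSketch extends App {
    val reduce = new PreScheduledTask(numUpstream = 2,
      run = blocks => println(s"pulling shuffle data from $blocks"))
    reduce.onUpstreamMetadata("map0@host1")
    reduce.onUpstreamMetadata("map1@host2") // last push triggers the reduce
  }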
Micro-benchmark: two stages
[Figure: time per iteration (ms, 0 to 350) vs. machines (4 to 128), Apache Spark vs. Drizzle-Pull]
Workload: sum of 10k numbers per core; 100 iterations.
Drizzle: distributed shared variables
Shared variables
Fine-grained updates to shared data. Existing approaches to distributed shared variables:
- MPI, using MPI_AllReduce
- Bloom, CRDTs
- Parameter servers, key-value stores
- Reduce followed by broadcast
Drizzle: replicated shared variables
Custom merge strategies for commutative updates, e.g., vector addition: each task applies updates to its local replica, and replicas are then merged.
[Figure: tasks updating replicated shared variables, followed by a merge step]
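As an illustration of a commutative merge strategy (assumed semantics, not Drizzle's actual API), element-wise vector addition makes replica merges order-independent:

  // Each task updates a local replica; replicas are merged with element-wise
  // addition, which is commutative, so the merge order does not matter.
  class ReplicatedVector(size: Int) {
    val values: Array[Double] = Array.fill(size)(0.0)

    def add(update: Array[Double]): Unit =
      for (i <- values.indices) values(i) += update(i)

    def merge(other: ReplicatedVector): ReplicatedVector = {
      val out = new ReplicatedVector(size)
      out.add(values)
      out.add(other.values)
      out
    }
  }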
Enabling async updates
Within a batch, updates are asynchronous: tasks may read stale versions of the shared variable. Synchronous semantics are enforced at batch boundaries, where stale replicas are merged.
[Figure: asynchronous updates producing stale versions within a batch, merged at the boundary]
Evaluation
Micro-benchmarks
- Single stage
- Multiple stages
- Shared variables
End-to-end experiments
- Streaming benchmarks
- Logistic regression
Implemented on Apache Spark 1.6.1, with integrations for Spark Streaming and MLlib.
Using the Drizzle API
DAGScheduler.scala (internal):

  def runJob(
      rdd: RDD[T],
      func: Iterator[T] => U)

  def runBatchJob(
      rdds: Seq[RDD[T]],
      funcs: Seq[Iterator[T] => U])
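A schematic call-site comparison, assuming the simplified signatures above (the real runJob takes more arguments; dagScheduler and sc here are illustrative, pre-existing handles):

  val rdd = sc.parallelize(1 to 10000, 4)       // sc: an existing SparkContext
  val sum: Iterator[Int] => Int = _.sum

  dagScheduler.runJob(rdd, sum)                 // one scheduling round per job

  // One scheduling round amortized over 10 iterations of the same job.
  dagScheduler.runBatchJob(Seq.fill(10)(rdd), Seq.fill(10)(sum))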
Using Drizzle
StreamingContext.scala, as in Spark Streaming:

  def this(
      sc: SparkContext,
      batchDuration: Duration)

Drizzle adds a constructor with a batch-size parameter:

  def this(
      sc: SparkContext,
      batchDuration: Duration,
      jobsPerBatch: Int)
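A hedged usage sketch (jobsPerBatch exists only in the Drizzle branch, not in upstream Spark): group ten 100 ms micro-batches into one scheduling batch.

  import org.apache.spark.SparkContext
  import org.apache.spark.streaming.{Duration, StreamingContext}

  val sc  = new SparkContext("local[4]", "drizzle-example")
  val ssc = new StreamingContext(sc, Duration(100), jobsPerBatch = 10)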
Choosing batch size
b = 1 → batch processing: higher overhead, smaller window for fault tolerance.
b = N → parallel operators: lower overhead, larger window for fault tolerance.
In practice: pick the largest batch such that scheduling overhead stays below a fixed threshold; e.g., for 128 machines, a batch of a few seconds is enough.
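A minimal sketch of this policy (measureOverhead is a hypothetical profiling hook, not a real API):

  // Pick the largest candidate batch size whose measured scheduling
  // overhead stays below the fixed threshold; fall back to b = 1.
  def chooseBatchSize(candidates: Seq[Int],
                      measureOverhead: Int => Double,
                      threshold: Double): Int =
    candidates.filter(b => measureOverhead(b) < threshold)
      .sorted.lastOption.getOrElse(1)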
Evaluation: end-to-end experiments
Streaming benchmark
[Figure: CDF of event latency (ms); Drizzle: 90 ms vs. Spark Streaming: 560 ms]
Yahoo Streaming Benchmark: 1M JSON ad events per second, 64 machines.
Event latency: the difference between window end and processing end.
MLlib: logistic regression
[Figure: time per iteration (ms, 0 to 600) vs. machines (4 to 128), Apache Spark vs. Drizzle]
Sparse logistic regression on RCV1: 677K examples, 47K features.
Work in progress
- Automatic batch-size tuning
- Open-source release
- Apache Spark JIRA to discuss a potential contribution
Conclusion
Apache Spark incurs scheduling overheads on streaming and ML workloads.
Drizzle: a low-latency execution engine that
- decouples execution from centralized scheduling
- achieves millisecond-scale latency for iterative workloads
Workloads, questions, and contributions are welcome.
Shivaram Venkataraman
shivaram@cs.berkeley.edu
