Low Latency Execution for Apache Spark
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
Who am I?
PhD candidate, AMPLab, UC Berkeley
Dissertation: system design for large-scale machine learning
Apache Spark PMC member. Contributions to Spark core, MLlib, SparkR.
Low latency: Spark Streaming
Low latency: execution
Scheduler decisions must take only a few milliseconds, even on large clusters.
Goal: a low-latency execution engine for iterative workloads such as machine learning and stream processing, where state is carried across iterations.
This talk
Execution models
Centralized task scheduling with lineage-based, parallel recovery, e.g., Microsoft Dryad.
Execution models: batch processing
Each iteration runs as a set of tasks: a shuffle moves intermediate data between tasks over the network, a broadcast distributes shared data, and the driver coordinates shuffles and data sharing via control messages.
[Figure: one iteration with shuffle and broadcast phases, coordinated by the driver; legend: task, control message, network transfer]
Execution models: parallel operators
Long-lived operators with checkpointing-based fault tolerance, e.g., Naiad. Each iteration flows through pre-placed operators connected by shuffles, avoiding per-iteration driver coordination.
[Figure: one iteration through long-lived operators with two shuffle phases; legend: task, control message, driver, network transfer]
Low latency: scaling behavior
[Figure: median-task time breakdown (network wait, compute, task fetch, scheduler delay) vs. number of machines, 4 to 128; total time grows to roughly 250 ms]
Cluster: 4-core r3.xlarge machines. Workload: sum of 10k numbers per core.
Can we achieve low latency with Apache Spark?
Design insight
Fine-grained execution with coarse-grained scheduling
Drizzle
The per-iteration structure (shuffle, broadcast) is kept, while three techniques remove coordination from the critical path:
- Batch scheduling
- Pre-scheduling shuffles
- Distributed shared variables
DAG scheduling
Assign tasks to hosts using (a) locality preferences, (b) straggler mitigation, (c) fair sharing, etc. The scheduler uses host metadata to make assignments, and the driver serializes and launches the tasks on each host.
[Figure: driver assigning tasks to Host1 and Host2]
Batch scheduling
The DAG structure is the same for many iterations, so scheduling decisions can be reused: schedule a batch of iterations at once, with fault tolerance and scheduling handled at batch boundaries. Example: b = 2, with one stage in each iteration. A sketch of the idea follows.
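To make the reuse concrete, here is a minimal, self-contained sketch (hypothetical names, not Drizzle's actual internals) of computing a placement once and reusing it for every iteration in a batch:

  object BatchSchedulingSketch {
    type Host = String
    case class Task(iter: Int, partition: Int)

    // Stand-in for locality-, fairness-, and straggler-aware placement.
    def assignPartitions(partitions: Int, hosts: Seq[Host]): Map[Int, Host] =
      (0 until partitions).map(p => p -> hosts(p % hosts.length)).toMap

    // Decide placement once, then reuse it for all b iterations in the batch.
    def scheduleBatch(b: Int, partitions: Int, hosts: Seq[Host]): Seq[(Task, Host)] = {
      val placement = assignPartitions(partitions, hosts)
      for {
        iter <- 0 until b
        p    <- 0 until partitions
      } yield (Task(iter, p), placement(p))
    }

    def main(args: Array[String]): Unit =
      // b = 2, one stage per iteration, as in the example above.
      scheduleBatch(b = 2, partitions = 4, hosts = Seq("host1", "host2"))
        .foreach { case (t, h) => println(s"$t -> $h") }
  }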
How much does this help?
[Figure: time per iteration (ms, log scale) vs. machines (4 to 128) for Apache Spark and Drizzle with batch sizes 10, 50, and 100]
Workload: sum of 10k numbers per core; single-stage job, 100 iterations, varying Drizzle batch size.
Drizzle: pre-scheduling shuffles
Coordinating shuffles: Apache Spark
The driver sends metadata; tasks pull intermediate data.
[Figure: driver-mediated shuffle; legend: task, control message, data message, intermediate data]
Coordinating shuffles: pre-scheduling
Pre-schedule downstream tasks on executors and trigger them once their dependencies are met: push metadata, pull data. Metadata is coalesced across cores, and the data transfer happens during the downstream stage. A sketch of the trigger mechanism follows.
[Figure: pre-scheduled downstream tasks receiving metadata messages; legend: task, control message, data message, metadata message, intermediate data]
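A minimal sketch of the trigger mechanism (hypothetical classes, not Drizzle's code): the downstream task is registered on the executor up front and fires once metadata from all upstream tasks has been pushed to it.

  import scala.collection.mutable

  // A downstream task registered before its upstream stage runs. Upstream
  // tasks push shuffle metadata directly to it; when all dependencies have
  // reported, the task triggers and pulls the actual data.
  class PreScheduledTask(numUpstream: Int, run: Seq[String] => Unit) {
    private val metadata = mutable.ArrayBuffer.empty[String]

    def onUpstreamMetadata(meta: String): Unit = synchronized {
      metadata += meta
      if (metadata.size == numUpstream) run(metadata.toSeq)
    }
  }

  object PreSchedulingSketch extends App {
    val reduce = new PreScheduledTask(numUpstream = 2,
      run = blocks => println(s"pulling shuffle data from $blocks"))
    reduce.onUpstreamMetadata("map0@host1")
    reduce.onUpstreamMetadata("map1@host2") // last push triggers the reduce
  }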
Micro-benchmark: two stages
[Figure: time per iteration (ms, 0 to 350) vs. machines (4 to 128), Apache Spark vs. Drizzle-Pull]
Workload: sum of 10k numbers per core; 100 iterations.
Drizzle: distributed shared variables
Shared variables
Fine-grained updates to shared data. Existing approaches to distributed shared variables:
- MPI, using MPI_AllReduce
- Bloom, CRDTs
- Parameter servers, key-value stores
- Reduce followed by broadcast
Drizzle: replicated shared variables
Custom merge strategies for commutative updates, e.g., vector addition: each task applies updates to its local replica, and replicas are then merged.
[Figure: tasks updating replicated shared variables, followed by a merge step]
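As an illustration of a commutative merge strategy (assumed semantics, not Drizzle's actual API), element-wise vector addition makes replica merges order-independent:

  // Each task updates a local replica; replicas are merged with element-wise
  // addition, which is commutative, so the merge order does not matter.
  class ReplicatedVector(size: Int) {
    val values: Array[Double] = Array.fill(size)(0.0)

    def add(update: Array[Double]): Unit =
      for (i <- values.indices) values(i) += update(i)

    def merge(other: ReplicatedVector): ReplicatedVector = {
      val out = new ReplicatedVector(size)
      out.add(values)
      out.add(other.values)
      out
    }
  }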
Enabling async updates
Within a batch, updates are asynchronous: tasks may read stale versions of the shared variable. Synchronous semantics are enforced at batch boundaries, where stale replicas are merged.
[Figure: asynchronous updates producing stale versions within a batch, merged at the boundary]
Evaluation
Micro-benchmarks
- Single stage
- Multiple stages
- Shared variables
End-to-end experiments
- Streaming benchmarks
- Logistic regression
Implemented on Apache Spark 1.6.1, with integrations for Spark Streaming and MLlib.
Using the Drizzle API
DAGScheduler.scala (internal):

  def runJob(
      rdd: RDD[T],
      func: Iterator[T] => U)

  def runBatchJob(
      rdds: Seq[RDD[T]],
      funcs: Seq[Iterator[T] => U])
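A schematic call-site comparison, assuming the simplified signatures above (the real runJob takes more arguments; dagScheduler and sc here are illustrative, pre-existing handles):

  val rdd = sc.parallelize(1 to 10000, 4)       // sc: an existing SparkContext
  val sum: Iterator[Int] => Int = _.sum

  dagScheduler.runJob(rdd, sum)                 // one scheduling round per job

  // One scheduling round amortized over 10 iterations of the same job.
  dagScheduler.runBatchJob(Seq.fill(10)(rdd), Seq.fill(10)(sum))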
Using Drizzle
StreamingContext.scala, as in Spark Streaming:

  def this(
      sc: SparkContext,
      batchDuration: Duration)

Drizzle adds a constructor with a batch-size parameter:

  def this(
      sc: SparkContext,
      batchDuration: Duration,
      jobsPerBatch: Int)
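A hedged usage sketch (jobsPerBatch exists only in the Drizzle branch, not in upstream Spark): group ten 100 ms micro-batches into one scheduling batch.

  import org.apache.spark.SparkContext
  import org.apache.spark.streaming.{Duration, StreamingContext}

  val sc  = new SparkContext("local[4]", "drizzle-example")
  val ssc = new StreamingContext(sc, Duration(100), jobsPerBatch = 10)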
Choosing batch size
b = 1 → batch processing: higher overhead, smaller window for fault tolerance.
b = N → parallel operators: lower overhead, larger window for fault tolerance.
In practice: pick the largest batch such that scheduling overhead stays below a fixed threshold; e.g., for 128 machines, a batch of a few seconds is enough.
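A minimal sketch of this policy (measureOverhead is a hypothetical profiling hook, not a real API):

  // Pick the largest candidate batch size whose measured scheduling
  // overhead stays below the fixed threshold; fall back to b = 1.
  def chooseBatchSize(candidates: Seq[Int],
                      measureOverhead: Int => Double,
                      threshold: Double): Int =
    candidates.filter(b => measureOverhead(b) < threshold)
      .sorted.lastOption.getOrElse(1)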
Evaluation: end-to-end experiments
Streaming benchmark
[Figure: CDF of event latency (ms); Drizzle: 90 ms vs. Spark Streaming: 560 ms]
Yahoo Streaming Benchmark: 1M JSON ad events per second, 64 machines.
Event latency: the difference between window end and processing end.
MLlib: logistic regression
[Figure: time per iteration (ms, 0 to 600) vs. machines (4 to 128), Apache Spark vs. Drizzle]
Sparse logistic regression on RCV1: 677K examples, 47K features.
Work in progress
- Automatic batch-size tuning
- Open-source release
- Apache Spark JIRA to discuss a potential contribution
Conclusion
Apache Spark incurs scheduling overheads on streaming and ML workloads.
Drizzle: a low-latency execution engine that
- decouples execution from centralized scheduling
- achieves millisecond-scale latency for iterative workloads
Workloads, questions, and contributions are welcome.
Shivaram Venkataraman
shivaram@cs.berkeley.edu
