Stefan Marr, Daniele Bonetta
2016
Seminar on
Parallel and Concurrent Programming
Agenda
1. Modus Operandi
2. Introduction to
Concurrent Programming Models
3. Seminar Paper Overview
MODUS OPERANDI
Tasks and Deadlines
• Talk on selected paper (student 1)
– 30min with slides (+ 15min discussion)
• to be discussed with us 1 week before
– Summary (max. 500 words)
• 2 days before seminar, 11:59am
• Questions on assigned paper (student 2)
– Min. 5 questions
– 2 days before seminar, 11:59am
Report
Category 1: Theoretical treatment
• Focus on paper, related work, state of the art
of the field
• Detailed discussion
Category 2: Practical treatment of the topic, for
instance:
• Reproduce experiments/results
• Extend experiments
• Experiment with variations
Report
• paper summary (500 words)
• outline, content, and experiments to be
discussed with us
• Cat. 1: ca. 4000 words (excl. references)
– state of the art, context in field, and specific
technique from paper
• Cat. 2: ca. 2000 words (excl. references)
– Discuss experiments, insights gained,
limitations found, etc.
Deadline: Feb. 6th
Consultations
• For alternative paper proposals
• To prepare the presentation!
• To agree on focus of report/experiments
– Mandatory for experiments
Grading
• Required attendance: 80% of all meetings
• 50% slides, presentation, and discussion
• 50% write-up/experiments
Timeline
Oct. 5th Introduction to Concurrent
Programming Models
Oct. 10th Deadline: List of ranked papers
Oct. 12th Runtime Techniques for Big Data
and Parallelism
Week 3-5 Preparations and Consultations
Week 6-12 Presentations
Feb. 6th Deadline for Report
Got Background in
Concurrency/Parallelism?
Show of Hands!
Multicore is the Norm
• 8 cores: 200-euro phones
• 24 cores: workstations
• ≥72 cores: embedded systems
Problem: Power Wall at ca. 5 GHz
CPUs don’t get Faster But Multiply
[Chart: Intel CPU clock frequency, 1990–2015: ~0.2 GHz (1990), 1.5 GHz (~1995), 3.8 GHz (~2005), then flat at 3.33–3.8 GHz while single cores give way to 4, 6, 12, … cores]
Based on the Clock Frequency of Intel Processors
Power ≈ Voltage² × Frequency
Voltage −15%, Frequency −15% ⇒ Power ≈ 1, Performance ≈ 1.8
(Two slightly slower cores consume roughly the power of one fast core, but together deliver almost twice the performance.)
Problem: Memory Wall
[Chart, log scale 1–10,000, 1980–2005: CPU frequency improves far faster than DRAM speeds, opening an ever-widening relative performance gap]
Source: Sun World Wide Analyst Conference Feb. 25, 2003
Multicore Transition
Work around physical limitations
Power Wall and Memory Wall
[Diagram: multiple cores, each with local memory, attached to shared main memory]
For a brief bit of history:
ENIAC’s recessive gene
Mitch Marcus and Atsushi Akera. Penn Printout (March 1996)
http://www.upenn.edu/computing/printout/archive/v12/4/pdf/gene.pdf
ENIAC's main control panel, U. S. Army Photo
Decades of Research
and Solutions for Everything
…
But no Silver Bullet:
CSP, Locks & Monitors, Fork/Join, Transactional Memory, Data Flow, Actors
A Rough Categorization
• Threads and Locks
• Coordinating Threads
• Communicating Isolates
• Data Parallelism
Marr, S. (2013), 'Supporting Concurrency Abstractions in High-level Language Virtual Machines', PhD thesis, Software Languages Lab, Vrije Universiteit Brussel.
THREADS AND LOCKS
Powerful but hard
Uniform Shared Memory
A model for the machines we used to have (C/C++)
Threads
• Sequences of instructions
• Unit of scheduling
– Preemptive and concurrent
– Or parallel
[Diagram: threads interleaved over time]
A Snake Game
• Multiple players
• Compete for ‘apples’
• Shared board
Race Conditions and Data Races
Race Condition
• Result depends on the timing of operations
Data Race
• A race condition on memory
• Synchronization absent or incomplete
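To make this concrete, here is a minimal Java sketch (ours, not from the slides): two threads increment a shared counter without synchronization. The increment is a read-modify-write, so updates interleave and get lost, and the final result depends on timing.

public class DataRaceDemo {
    static int count = 0;  // shared, unsynchronized state

    public static void main(String[] args) throws InterruptedException {
        Runnable increment = () -> {
            for (int i = 0; i < 100_000; i++) {
                count++;  // read-modify-write: three steps that can interleave
            }
        };
        Thread t1 = new Thread(increment);
        Thread t2 = new Thread(increment);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Almost always prints less than 200000: updates were lost.
        System.out.println("count = " + count);
    }
}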
Locks
synchronized (board) {
  board.moveLeft(snake);
}
Optimized Locking for more Parallelism
synchronized (board[3][3]) {
  synchronized (board[3][2]) {
    board.moveLeft(snake);
  }
}
Strategy: Lock only cells you need to update
What could go wrong?
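One answer: deadlock. If one player locks cell (3,3) and then (3,2) while another locks (3,2) and then (3,3), both can block forever. A hedged sketch of the classic remedy, acquiring locks in one global order (the helper and its names are illustrative, not from the slides):

// Hypothetical helper: locks two board cells in a single global order
// (by coordinates), so two threads that need the same pair of cells
// cannot deadlock by locking them in opposite orders.
static void withCellsLocked(Object[][] board, int x1, int y1,
                            int x2, int y2, Runnable move) {
    boolean firstIsLower = x1 < x2 || (x1 == x2 && y1 <= y2);
    Object first  = firstIsLower ? board[x1][y1] : board[x2][y2];
    Object second = firstIsLower ? board[x2][y2] : board[x1][y1];
    synchronized (first) {
        synchronized (second) {
            move.run();  // e.g. board.moveLeft(snake), protected by both cells
        }
    }
}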
Common Issues
• Lack of Progress
– Deadlock
– Livelock
• Race Condition
– Data race
– Atomicity violation
• Performance
– Sequential bottlenecks
– False sharing
Basic Concepts
Shared Memory with Threads and Locks
• Threads
• Synchronization
• No safety guarantees
– Data Races
– Deadlocks
P1.9 The Linux Scheduler: A Decade of Wasted Cores, J.-P. Lozi et al.
P2.1 Optimistic Concurrency with OPTIK, R. Guerraoui, V. Trigonakis
P2.9 OCTET: Capturing and Controlling Cross-Thread Dependences Efficiently, M. Bond et al.
P2.10 Efficient and Thread-Safe Objects for Dynamically-Typed Languages, B. Daloze et al.
Questions?
COORDINATING THREADS
Making Coordination Explicit
Communicating Threads
Shared Memory with Explicit Coordination
• Raising the abstraction level
• Libraries available for most languages
Two Main Variants
• Temporal isolation: Transactional Memory
• Explicit communication: channel- or message-based
Transactional Memory
atomic {
  board.moveLeft(snake)
}
Coordinated by
Runtime System
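The atomic block is pseudocode; plain Java has no built-in STM. As a rough sketch of the optimistic mechanism such a runtime implements, the following keeps the board in an immutable snapshot and commits updates with compare-and-set, retrying on conflict. Board, Snake, and withMoveLeft are illustrative stand-ins, not a real API.

import java.util.concurrent.atomic.AtomicReference;

// Illustrative immutable types; withMoveLeft returns an updated snapshot.
record Snake(int x, int y) {}
record Board(java.util.List<Snake> snakes) {
    Board withMoveLeft(Snake s) {
        return new Board(snakes);  // placeholder for the real pure update
    }
}

class TxBoard {
    private final AtomicReference<Board> state =
        new AtomicReference<>(new Board(java.util.List.of()));

    // "Transaction": read a snapshot, compute a new one, commit with
    // compare-and-set; on conflict (someone committed first), retry.
    void moveLeft(Snake snake) {
        while (true) {
            Board current = state.get();
            Board updated = current.withMoveLeft(snake);
            if (state.compareAndSet(current, updated)) return;  // committed
        }
    }
}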
Transactional Memory
Simple Programming Model
• No Data Races (within transactions)
• No Deadlocks
Issues
• Performance overhead
• Still experimental
• Livelocks
• Inter-transactional race conditions
• I/O semantics
Some Issues
atomic {
  dataArray = getData();
  fork { compute(dataArray[0]); }
  compute(dataArray[1]);
}
P2.2 Transactional Tasks: Parallelism in Software Transactions, J. Swalens et al.
P1.1 Transactional Data Structure Libraries, A. Spiegelman et al.
P1.2 Type-Aware Transactions for Faster Concurrent Code, N. Herman et al.
What happens to the forked thread when the transaction aborts?
Channel-based Communication
Player Thread:
coordChannel ! (#moveLeft, snake)

Coordinator Thread:
for i in players():
  msg ? coordChannels[i]
  match msg:
    (#moveLeft, snake):
      board[…,…] = …

[Diagram: player threads send over channels; the coordinator thread receives and updates the board]
High-level communication
but no safety guarantees
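The same exchange in plain Java, as a hedged sketch that uses a BlockingQueue as the channel; the Move type is made up, and put/take stand in for the slides' ! and ? operators.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ChannelDemo {
    // Illustrative message type standing in for the (#moveLeft, snake) tuple.
    record Move(String kind, int snakeId) {}

    public static void main(String[] args) {
        BlockingQueue<Move> coordChannel = new LinkedBlockingQueue<>();

        // Player thread: "coordChannel ! (#moveLeft, snake)"
        Thread player = new Thread(() -> {
            try {
                coordChannel.put(new Move("moveLeft", 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Coordinator thread: "msg ? coordChannel" plus the match.
        Thread coordinator = new Thread(() -> {
            try {
                Move msg = coordChannel.take();
                if (msg.kind().equals("moveLeft")) {
                    System.out.println("moving snake " + msg.snakeId() + " left");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        player.start();
        coordinator.start();
    }
}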
Coordinating Threads
Transactional Memory
• Transactions
• Simple Programming Model
• Practical Issues
Channel/Message Communication
• Explicit coordination
– Channels or message sending
– Higher abstraction level
• No safety guarantees
P1.4 Why Do Scala Developers Mix the Actor Model with other Concurrency Models?, S. Tasharofi et al. (conc-model, ECOOP'13)
P1.6 The Asynchronous Partitioned Global Address Space Model, V. Saraswat et al. (conc-model, AMP'10)
Questions?
COMMUNICATING ISOLATES
Communication is Everything
Explicit Communication Only
Absence of Low-level Data Races
All Interactions Explicit
[Diagram: Actor A and Actor B interact only via explicit messages (the actor principle)]
Many, Many Variations
• Channel based
– Communicating Sequential Processes
• Message based
– Actor models
P1.3 43 Years of Actors: a Taxonomy of Actor Models and Their Key Properties, J. De Koster et al. (conc-model, Agere'16)
Communicating Event Loops
[Diagrams: Actor A and Actor B as communicating event loops]
• One message at a time
• Actors contain objects
• Interacting via messages
Message-based Communication
[Diagram: player actors asynchronously send messages to the board actor]
board <- moveLeft(snake)
class Board {
  private array;
  public moveLeft(snake) {
    array[snake.x][snake.y] = ...
  }
}
Main Program:
actors.create(Board)
actors.create(Snake)
actors.create(Snake)
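A minimal sketch of this style in plain Java (not a real actor library): each actor owns its state plus a single-threaded executor as its event loop, so it processes one message at a time and its state is never shared. BoardActor and sendMoveLeft are illustrative names.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Each actor owns its state and a single-threaded executor as its event
// loop: messages are tasks on the queue, processed one at a time, so the
// actor's state is never touched by two threads at once.
class BoardActor {
    private final int[][] cells = new int[10][10];  // actor-private state
    private final ExecutorService eventLoop =
        Executors.newSingleThreadExecutor();

    // Asynchronous send, like "board <- moveLeft(snake)": returns at once.
    void sendMoveLeft(int x, int y) {
        eventLoop.execute(() -> cells[x][y] = 1);  // runs on the actor's thread
    }

    void shutdown() { eventLoop.shutdown(); }
}

public class ActorDemo {
    public static void main(String[] args) {
        BoardActor board = new BoardActor();
        board.sendMoveLeft(3, 2);  // async message; no shared-memory access
        board.shutdown();
    }
}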
Communicating Isolates
Message or Channel Based
• Explicit communication
• No shared memory
• Still potential for
– Behavioral deadlocks
– Livelocks
– Bad message interleavings
– Message protocol violations
P1.3 43 Years of Actors: a Taxonomy of Actor Models and Their Key Properties, J. De Koster et al. (conc-model, Agere'16)
P1.11 Distributed Debugging for Mobile Networks, E. Gonzalez Boix et al. (tooling, JSS'14)
Questions?
DATA PARALLELISM
Parallelism for Structured Problems
DATA PARALLELISM WITH FORK/JOIN
Just one Example
Fork/Join with Work-Stealing
• Recursive
divide-and-conquer
• Automatic and efficient
parallel scheduling
• Widely available for C++,
Java, and .NET
Blumofe, R. D.; Joerg, C. F.; Kuszmaul, B. C.; Leiserson, C. E.; Randall, K. H. & Zhou, Y. (1995),
'Cilk: An Efficient Multithreaded Runtime System', SIGPLAN Not. 30 (8), 207-216.
Typical Applications
• Recursive Algorithms¹
– Mergesort
– List and tree traversals
• Parallel prefix, pack, and
sorting problems²
• Irregular and unbalanced
computation
– On directed acyclic graphs
(DAGs)
– Ideally tree-shaped
1) More material can be found at: http://homes.cs.washington.edu/~djg/teachingMaterials/spac/
2) Prefix Sums and Their Applications: http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
Tiny Example: Summing a large Array
• Simple array with numbers
• Recursively divide
– Every split is a parallel fork
• Then do the additions
– Every merge is a join
Note: This example is academic, and could be better expressed with a parallel map/reduce
library, such as Scala’s Parallel Collections, Java 8 Streams, or Microsoft’s PLINQ.
[Diagram: the array is split recursively in half; forked tasks sum their halves, and each join combines two partial sums until the total remains]
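The summing example as a runnable Java sketch on the JDK's fork/join framework (cf. P1.5, Lea's A Java Fork/Join Framework): below a threshold a task sums its range sequentially; above it, it forks the left half and joins the partial sums.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final int[] array;
    private final int lo, hi;

    SumTask(int[] array, int lo, int hi) {
        this.array = array; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // small enough: sum directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += array[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(array, lo, mid);
        left.fork();                          // parallel fork: left half
        long right = new SumTask(array, mid, hi).compute();
        return left.join() + right;           // join: combine partial sums
    }
}

public class ForkJoinSum {
    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        long total = ForkJoinPool.commonPool()
                                 .invoke(new SumTask(data, 0, data.length));
        System.out.println("sum = " + total);  // 1000000
    }
}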
Data Parallelism with Fork/Join
• Parallel programming
technique
• Recursive divide-and-conquer
• Automatic and efficient
load-balancing
P1.5 A Java Fork/Join Framework, D. Lea (conc-model, runtime, Java'00)
CONCLUSION: CONCURRENCY MODELS
Four Rough Categories
• Threads and Locks
• Coordinating Threads
• Communicating Isolates
• Data Parallelism
SEMINAR PAPERS
These are Suggestions
Please feel free to propose papers that interest you.
(Papers need to be approved by us.)
Topics of Interest
• High-level language
concurrency models
– Actors, Communicating
Sequential Processes,
STM, Stream Processing,
...
• Tooling
– Debugging
– Profiling
• Implementation and
runtime systems
– Communication
mechanisms
– Data/object
representation
– System-level aspects
• Big Data Frameworks
– Programming models
– Runtime level problems
Papers without Artifacts
P1.1 Transactional Data Structure Libraries, A. Spiegelman et al. (conc-model, PLDI'16)
P1.2 Type-Aware Transactions for Faster Concurrent Code, N. Herman et al. (conc-model, runtime, EuroSys'16)
P1.3 43 Years of Actors: a Taxonomy of Actor Models and Their Key Properties, J. De Koster et al. (conc-model, Agere'16)
P1.4 Why Do Scala Developers Mix the Actor Model with other Concurrency Models?, S. Tasharofi et al. (conc-model, ECOOP'13)
P1.5 A Java Fork/Join Framework, D. Lea (conc-model, runtime, Java'00)
P1.6 The Asynchronous Partitioned Global Address Space Model, V. Saraswat et al. (conc-model, AMP'10)
P1.7 Pydron: Semi-Automatic Parallelization for Multi-Core and the Cloud, S. C. Müller et al. (conc-model, runtime, OSDI'14)
P1.8 Fast Splittable Pseudorandom Number Generators, G. L. Steele et al. (runtime, OOPSLA'14)
P1.9 The Linux Scheduler: A Decade of Wasted Cores, J.-P. Lozi et al. (runtime, EuroSys'16)
P1.10 Application-Assisted Live Migration of Virtual Machines with Java Applications, K.-Y. Hou et al. (runtime, EuroSys'15)
P1.11 Distributed Debugging for Mobile Networks, E. Gonzalez Boix et al. (tooling, JSS'14)
Papers with Artifacts
P2.1 Optimistic Concurrency with OPTIK, R. Guerraoui, V. Trigonakis (conc-model, PPoPP'16)
P2.2 Transactional Tasks: Parallelism in Software Transactions, J. Swalens et al. (conc-model, ECOOP'16)
P2.3 StreamJIT: a commensal compiler for high-performance stream programming, J. Bosboom et al. (conc-model, runtime, OOPSLA'14)
P2.4 An Efficient Synchronization Mechanism for Multi-core Systems, M. Aldinucci et al. (conc-model, runtime, EuroPar'12)
P2.5 Parallel parsing made practical, A. Barenghi et al. (runtime, SCP'15)
P2.6 SparkR: Scaling R Programs with Spark, S. Venkataraman et al. (conc-model, bigdata, SIGMOD'16)
P2.7 Spark SQL: Relational Data Processing in Spark, M. Armbrust et al. (bigdata, runtime, SIGMOD'15)
P2.8 Twitter Heron: Stream Processing at Scale, S. Kulkarni et al. (bigdata, SIGMOD'15)
P2.9 OCTET: Capturing and Controlling Cross-Thread Dependences Efficiently, M. D. Bond et al. (tooling, OOPSLA'13)
P2.10 Efficient and Thread-Safe Objects for Dynamically-Typed Languages, B. Daloze et al. (runtime, OOPSLA'16)


Editor's Notes

  • #2 Talk: 18min + 5min questions
  • #12 Multicore is everywhere. Only single-processor systems are shown here; workstations usually have two processors, servers even more. Embedded systems already use manycore processors. Whatever notebook or computer you buy today, it is multicore.
  • #13 GHz == consumed power == produced heat. Cooling becomes too complex; there is no way to put such chips into portable devices.
  • #14 So why manycore? Unfortunately, CPUs are not becoming faster anymore; clock rates peaked around 2005, and some CPUs are actually slower now (simplifying). Notes: show graph 1990, 2000, 2005, 2010 (GHz, core counts, red line = power wall). Data points: 1989, Intel486 DX: 50/33/25 MHz; November 1, 1995, Pentium Pro: 200/180/166/150 MHz; November 20, 2000, Pentium 4: 1.50/1.40 GHz; February 2005, Pentium 4 Extreme Edition with HT Technology: 3.80 GHz (570); Core i7-980X Extreme Edition: 3.33 GHz (boost to 3.6 GHz).
  • #15 Decreasing the clock a bit and putting another core on the chip keeps power consumption stable. The theoretical speedup is 1.8x, but each core has lower sequential performance.
  • #19 ENIAC
  • #27 AI players can consume as much CPU as they like; the presentation can be done on a different core.
  • #52 - Efficient load balancing
  • #59 Good fit for tree recursion and irregular computational complexity.