Lecture 1:

Why Parallelism?
Why Efficiency?
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2019
Hi!

Randy Bryant
Nathan Beckmann

Plus . . . an evolving collection of teaching assistants
Getting into the Class
▪ Status (Mon Jan. 14, 09:30)
- 127 students enrolled
- 142 on wait list
- 144 max. enrollment
▪ If you are registered
- Do Assignment 1
- Due Jan. 30
- If you find it too challenging, then please drop by Jan. 28
▪ Clearing Wait List
- Complete Assignment 1 by Jan. 23, 23:59
- No Autolab account required
- 15 slots (maybe more?)
- We will enroll top-performing students
- It’s that simple!
- You will know by Jan. 28


What will you be doing in this course?


Assignments
▪ Four programming assignments
- First assignment is done individually, the rest will be done in pairs
- Each uses a different parallel programming environment

Assignment 1: SIMD and multi-core parallelism
Assignment 2: CUDA programming on NVIDIA GPUs
Assignment 3: Parallel Programming via a Shared-Address Space Model
Assignment 4: Parallel Programming via a Message Passing Model


Final project
▪ 6-week self-selected final project
▪ Performed in groups (by default, 2 people per group)
▪ Keep thinking about your project ideas starting TODAY!
▪ Poster session at end of term

▪ Check out previous projects:

http://15418.courses.cs.cmu.edu/spring2016/competition

http://15418.courses.cs.cmu.edu/fall2017/article/10

http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15418-s18/www/15418-s18-projects.pdf


Exercises
▪ Six homework exercises
- Scheduled throughout term
- Designed to prepare you for the exams
- We will grade your work to give you feedback, but only a
participation grade will go into the gradebook


Grades

40% Programming assignments (4)
30% Exams (2)
24% Final project
6% Exercises

Each student gets up to five late days on programming assignments (see syllabus for details)


Getting started
▪ Visit course home page
- http://www.cs.cmu.edu/~418/
▪ Sign up for the course on Piazza
- http://piazza.com/cmu/spring2019/1541815618
▪ Textbook
- There is no course textbook, but please see web site for
suggested references
▪ Find a Partner
- Assignments 2–4, final project


Regarding the class meeting times
▪ Class MWF 1:30–2:50
- Lectures (mostly)
- Some designated “Recitations”
- Targeted toward things you need to know for an upcoming
assignment
▪ No classes last part of the term
- Let you focus on projects


Collaboration (Acceptable & Unacceptable)
▪ Do
- Become familiar with course policy
http://www.cs.cmu.edu/~418/academicintegrity.html
- Talk with instructors, TAs, partner

C.mmp at CMU (1971): 16 PDP-11 processors
Cray XMP (circa 1984): 4 vector processors
Thinking Machines CM-2 (circa 1987): 65,536 1-bit processors + 2048 floating-point co-processors
Bridges at the Pittsburgh Supercomputer Center: 800+ compute nodes, heterogeneous structure
A Brief History of Parallel Computing
▪ Initial Focus (starting in 1970s): “Supercomputers” for Scientific Computing
▪ Another Driving Application (starting in early ‘90s): Databases

Sun Enterprise 10000 (circa 1997): 16 UltraSPARC-II processors
Oracle Supercluster M6-32 (today): 32 SPARC M2 processors


Setting Some Context
▪ Before we continue our multiprocessor story, let’s pause to consider:
- Q: what had been happening with single-processor performance?
▪ A: since forever, they had been getting exponentially faster
- Why?

Image credit: Olukotun and Hammond, ACM Queue 2005

A Brief History of Processor Performance
▪ Wider data paths
- 4 bit → 8 bit → 16 bit → 32 bit → 64 bit
▪ More efficient pipelining
- e.g., 3.5 Cycles Per Instruction (CPI) → 1.1 CPI
▪ Exploiting instruction-level parallelism (ILP)
- “Superscalar” processing: e.g., issue up to 4 instructions/cycle
- “Out-of-order” processing: extract parallelism from instruction
stream
▪ Faster clock rates
- e.g., 10 MHz → 200 MHz → 3 GHz
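
The CPI and clock-rate bullets above combine through the standard relation: execution time = instruction count × CPI ÷ clock rate. The sketch below plugs in the slide's example figures to show how much each improvement contributes on its own. The instruction count `N` and the function name are assumptions made for illustration, and holding the other factor fixed is a simplification.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds needed to run `instructions` at the given CPI and clock rate."""
    return instructions * cpi / clock_hz

N = 1_000_000_000  # hypothetical instruction count, chosen only for illustration

# Pipelining example from the slide: 3.5 CPI -> 1.1 CPI at a fixed 200 MHz clock.
pipelining_gain = execution_time(N, 3.5, 200e6) / execution_time(N, 1.1, 200e6)

# Clock-rate example from the slide: 10 MHz -> 3 GHz at a fixed CPI.
clock_gain = execution_time(N, 1.1, 10e6) / execution_time(N, 1.1, 3e9)

print(f"CPI improvement alone:   {pipelining_gain:.2f}x")
print(f"Clock improvement alone: {clock_gain:.0f}x")
```

Under these assumptions the CPI change is worth roughly 3.2x and the clock change 300x, which is why the end of frequency scaling mattered so much.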

▪ During the 80s and 90s: large exponential performance gains

- and then…
A Brief History of Parallel Computing
▪ Inflection point in 2004: Intel hits the Power Density Wall

Pat Gelsinger, ISSCC 2001

From the New York Times
Intel's Big Shift After Hitting Technical Wall
The warning came first from a group of hobbyists that tests the speeds of computer chips. This
year, the group discovered that the Intel Corporation's newest microprocessor was running
slower and hotter than its predecessor.

What they had stumbled upon was a major threat to Intel's longstanding approach to dominating
the semiconductor industry - relentlessly raising the clock speed of its chips.

Then two weeks ago, Intel, the world's largest chip maker, publicly acknowledged that it had hit
a "thermal wall" on its microprocessor line. As a result, the company is changing its product
strategy and disbanding one of its most advanced design groups. Intel also said that it would
abandon two advanced chip development projects, code-named Tejas and Jayhawk.

Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more
computing power by stamping multiple processors on a single chip rather than straining to
increase the speed of a single processor.
… John Markoff, New York Times, May 17, 2004


ILP tapped out + end of frequency scaling

Processor clock rate stops increasing
No further benefit from ILP

[Figure: trends in transistor density, clock frequency, power, and instruction-level parallelism (ILP)]
Image credit: “The Free Lunch Is Over” by Herb Sutter, Dr. Dobb’s 2005
Programmer’s Perspective on Performance
Question: How do you make your program run faster?

Answer before 2004:

- Just wait 6 months, and buy a new machine!
- (Or if you’re really obsessed, you can learn about parallelism.)

Answer after 2004:

- You need to write parallel software.


Parallel Machines Today
Examples from Apple’s product line:

Mac Pro: 8 Intel Xeon E5 cores
iPhone XR: 4 CPU cores, 6 GPU cores
iMac Pro: 18 Intel Xeon W cores
MacBook Pro Retina 15”: 6 Intel Core i9 cores
(images from apple.com)


Intel Skylake (2015) (aka “6th generation Core i7”)
Quad-core CPU + multi-core GPU integrated on one chip

[Die diagram: four CPU cores and an integrated GPU]


Intel Xeon Phi 7120A “coprocessor”
▪ 61 “simple” x86 cores (1.3 GHz, derived from Pentium)
▪ Targeted as an accelerator for supercomputing applications


NVIDIA GV100 Volta GPU (2017)
80 major processing blocks
(but much, much more parallelism available... details coming soon)


Mobile parallel processing
Power constraints heavily influence design of mobile systems

Apple A12 (in iPhone XR): 4 CPU cores, 4 GPU cores, neural net engine, + much more
NVIDIA Tegra K1: Quad-core ARM A57 CPU + 4 ARM A53 CPUs + NVIDIA GPU + image processor...


Supercomputing
▪ Today: clusters of multi-core CPUs + GPUs
▪ Oak Ridge National Laboratory: Summit (#1 supercomputer in world)
- 4,608 nodes
- Each with two 22-core CPUs + 6 GPUs
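
To get a feel for the scale these per-node figures imply, the short calculation below multiplies them out. It uses only the numbers on the slide; the constant names are chosen here for illustration.

```python
# Node count and per-node figures are taken from the slide above.
NODES = 4608
CPUS_PER_NODE = 2
CORES_PER_CPU = 22
GPUS_PER_NODE = 6

total_cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
total_gpus = NODES * GPUS_PER_NODE

print(f"CPU cores: {total_cpu_cores:,}")  # 202,752
print(f"GPUs:      {total_gpus:,}")       # 27,648
```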


What is a parallel computer?


One common definition
A parallel computer is a collection of processing elements
that cooperate to solve problems quickly

We care about performance *
We care about efficiency
We’re going to use multiple processors to get it

* Note: different motivation from “concurrent programming” using pthreads in 15-213
DEMO 1
(This semester’s first parallel program)


Speedup
One major motivation of using parallel processing: achieve a speedup

For a given problem:

$$\text{speedup}(P \text{ processors}) = \frac{\text{execution time (using 1 processor)}}{\text{execution time (using } P \text{ processors)}}$$
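
As a small worked instance of this definition, the sketch below computes speedup from a table of run times. The timings are hypothetical values invented for illustration, not measurements from the course.

```python
def speedup(time_one_processor, time_p_processors):
    """Speedup as defined above: T(1) / T(P)."""
    return time_one_processor / time_p_processors

# Hypothetical wall-clock times in seconds, keyed by processor count.
measured = {1: 12.0, 2: 6.5, 4: 3.6, 8: 2.4}

for p, t in measured.items():
    print(f"P={p}: speedup = {speedup(measured[1], t):.2f}x")
```

With these made-up timings, 8 processors give a 5x speedup rather than 8x, the kind of gap the demos that follow explore.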


Class observations from demo 1

▪ Communication limited the maximum speedup achieved

- In the demo, the communication was telling each other the partial sums

▪ Minimizing the cost of communication improves speedup

- Moving students (“processors”) closer together (or let them shout)
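
These observations can be put into a rough cost model. The sketch below assumes two people each add half of the numbers, one tells the other a partial sum, and a final addition combines them. The model, the function name, and all timings are assumptions for illustration; the slides do not give the demo's numbers.

```python
def two_person_speedup(n_numbers, t_add, t_tell):
    """Speedup of two people summing n_numbers versus one person alone.

    Assumed model: each person adds half the numbers, one tells the other
    their partial sum (t_tell), then a single final addition combines them.
    """
    serial = n_numbers * t_add
    parallel = (n_numbers / 2) * t_add + t_tell + t_add
    return serial / parallel

# Hypothetical costs: 1 s per addition, 16 numbers in total.
print(two_person_speedup(16, t_add=1.0, t_tell=4.0))  # partners far apart
print(two_person_speedup(16, t_add=1.0, t_tell=1.0))  # partners side by side
```

In this model the speedup stays below 2 even when telling is free, and it improves as `t_tell` shrinks, matching the effect of moving the "processors" closer together.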


DEMO 2
(scaling up to four “processors”)


Class observations from demo 2

▪ Imbalance in work assignment limited speedup

- Some students (“processors”) ran out of work to do (went idle),
while others were still working on their assigned task

▪ Improving the distribution of work improved speedup
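
The effect of work imbalance can be sketched by treating parallel run time as the load of the busiest worker. The example below compares a naive contiguous split with a greedy "largest task to the least-loaded worker" heuristic. The task costs and both assignment functions are hypothetical, chosen for illustration rather than taken from the demo.

```python
import heapq

def block_assignment(costs, workers):
    """Give each worker one contiguous chunk of tasks; return per-worker load."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(workers)]

def greedy_assignment(costs, workers):
    """Hand the next-largest task to whichever worker is currently least loaded."""
    loads = [0] * workers
    heapq.heapify(loads)
    for cost in sorted(costs, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + cost)
    return sorted(loads)

def speedup(costs, loads):
    """Serial time is the total work; parallel time is the busiest worker's load."""
    return sum(costs) / max(loads)

costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical task costs
print(speedup(costs, block_assignment(costs, 4)))   # 2.4: loads are 15, 11, 7, 3
print(speedup(costs, greedy_assignment(costs, 4)))  # 4.0: every worker gets 9
```

The total work is identical in both cases; only the distribution changes, yet the speedup on four workers goes from 2.4x to the ideal 4x.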


DEMO 3
(massively parallel execution)


Class observations from demo 3

▪ The problem I just gave you has a significant amount of communication compared to computation

▪ Communication costs can dominate a parallel computation, severely limiting speedup
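
One way to see communication dominating is a cost model in which each processor sums its share locally and the partial sums are then combined pairwise in a tree. The slides do not describe the demo's problem in detail, so the model, the function names, and the costs (one message assumed to cost 50 additions) are assumptions for illustration.

```python
import math

def parallel_time(n_items, processors, t_add=1.0, t_comm=50.0):
    """Assumed cost model: local sums, then a tree of pairwise combines."""
    local = math.ceil(n_items / processors) * t_add
    rounds = math.ceil(math.log2(processors))  # 0 rounds when processors == 1
    return local + rounds * (t_comm + t_add)

def speedup(n_items, processors, **costs):
    """T(1) / T(P) under the cost model above."""
    return parallel_time(n_items, 1, **costs) / parallel_time(n_items, processors, **costs)

for p in (1, 2, 16, 128, 1024):
    print(f"P={p:5d}: speedup = {speedup(1024, p):.2f}x")
```

Under these assumptions, adding processors beyond a modest count makes the run slower: with 1,024 items, 16 processors beat 1,024 processors because the combine rounds outweigh the shrinking local work.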


Course theme 1:
Designing and writing parallel programs ... that scale!

▪ Parallel thinking
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication/synchronization between the processors so
that it does not limit speedup

▪ Abstractions/mechanisms for performing the above tasks

- Writing code in popular parallel programming languages
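
The three steps of parallel thinking can be sketched on a simple sum. This is a minimal illustration using Python's standard `concurrent.futures` thread pool, not the course's own tooling (the assignments use other environments). The function name `parallel_sum` is hypothetical. Because of CPython's global interpreter lock, this version shows the structure only and should not be expected to run faster than a plain `sum`.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, workers=4):
    # 1. Decompose: split the input into independent chunks.
    chunk = max(1, -(-len(values) // workers))  # ceiling division, at least 1
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # 2. Assign: hand each chunk to a worker in the pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # 3. Communicate/synchronize: gather the partial sums and combine them.
    return sum(partial_sums)

print(parallel_sum(list(range(1000))))
```

The chunks share no data, so step 1 is safe here; the only synchronization point is gathering the partial sums in step 3.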


Course theme 2:
Parallel computer hardware implementation: how parallel
computers work

▪ Mechanisms used to implement abstractions efficiently

- Performance characteristics of implementations
- Design trade-offs: performance vs. convenience vs. cost

▪ Why do I need to know about hardware?

- Because the characteristics of the machine really matter
(recall speed of communication issues in earlier demos)
- Because you care about efficiency and performance
(you are writing parallel programs after all!)


Course theme 3:
Thinking about efficiency
▪ FAST != EFFICIENT

▪ Just because your program runs faster on a parallel computer, it does not mean it is using the hardware efficiently
- Is a 2x speedup on a computer with 10 processors a good result?
▪ Programmer’s perspective: make use of provided machine capabilities

▪ HW designer’s perspective: choosing the right capabilities to put in the system (performance/cost, cost = silicon area?, power?, etc.)
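
One common way to quantify the "2x on 10 processors" question is parallel efficiency, defined as speedup divided by processor count. That definition is standard but is not stated on the slide, so treat it as an assumption here; the function name is also chosen for illustration.

```python
def efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors

print(f"{efficiency(2.0, 10):.0%}")  # 20%
```

By this measure the program is faster, but only a fifth of the machine's potential is being used.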


Fundamental Shift in CPU Design Philosophy
Before 2004:
- within the chip area budget, maximize performance
- increasingly aggressive speculative execution for ILP
After 2004:
- area within the chip matters (limits # of cores/chip):
- maximize performance per area
- power consumption is critical (battery life, data centers)
- maximize performance per Watt
- upshot: major focus on efficiency of cores


Summary
▪ Today, single-thread performance is improving very slowly
- To run programs significantly faster, programs must utilize multiple
processing elements
- Which means you need to know how to write parallel code

▪ Writing parallel programs can be challenging

- Requires problem partitioning, communication, synchronization
- Knowledge of machine characteristics is important

▪ I suspect you will find that modern computers have tremendously more processing power than you might realize, if you just use it!

▪ Welcome to 15-418!
