CS 133 Parallel & Distributed Computing
Course Instructor: Adam Kaplan (kaplan@cs.ucla.edu)
Lecture #1: 4/2/2012
Vanishing from your Desktops: The Uniprocessor
Uniprocessor
Single processor plus associated cache(s) on a chip
Traditional computer until 2003
Supports the traditional sequential programming model
Where can you still find them?
[Figure: a uniprocessor (single CPU with its cache). From Herlihy & Shavit, Art of Multiprocessor Programming]
Traditional Server: Shared Memory Multiprocessor (SMP)
Multi-chip computer systems
High-performance computing
Servers
Supercomputers
[Figure: SMP organization (CPU + cache chips connected by a bus to shared memory)]
Each processor chip had a CPU and cache
Multiple chips connected by a bus to a shared main memory
From Herlihy & Shavit, Art of Multiprocessor Programming
Your New Server or Desktop: The Chip Multiprocessor (CMP)
All on the same chip
[Figure: CMP organization (multiple CPUs and caches sharing an on-chip bus and memory)]
Sun T2000 Niagara
From Herlihy & Shavit, Art of Multiprocessor Programming
How did this happen?
Moore's Law: every 18 months, the number of transistors on a chip doubles
Until the early 2000s, single-processor performance got better all the time
The same sequential code would automatically get faster on new hardware
Computer marketing was all about the MHz/GHz
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Application performance was increasing by 52% per year, as measured by the widely used SPECint benchmark suite
Gains came partly from transistor density and partly from architecture changes, e.g., Instruction-Level Parallelism (ILP)
VAX: 25%/year, 1978 to 1986
RISC + x86: 52%/year, 1986 to 2002
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
1990s: How to make a faster processor
Increase the clock speed (frequency scaling)
Deeper pipelines: more, shorter stages
BUT eventually chips get too hot
Speculative Superscalar (SS)
Multiple instructions can execute at a time (instruction-level parallelism, ILP)
Hardware finds independent instructions in a sequential program that can execute simultaneously
Hardware predicts which branches will be taken
Executes instructions from a likely execution path before it is known whether the path will be taken
BUT eventually diminishing returns
Nice feature: programmers did not need to know or care about any of this
Chip density grows by 2x every 18 mos
Clock speed does not
2000s: How to make a faster processor
Diminishing returns seen by speculative superscalar
Only so much ILP to exploit
Use additional transistors to create more/simpler processors on chip
BUT can application software keep them busy?
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
How can simpler processors help?
Potential performance the same
Source: Intel
What is Parallel Computing?
Parallel computing
Using multiple processors in parallel to solve problems more quickly than with a single processor
How to realize speedup
Divide a single task into subtasks
Execute these subtasks simultaneously on multiple processors (see the sketch below)
What can go wrong for the programmer?
The task does not divide easily or evenly into subtasks
Subtasks need to take turns accessing shared resources
Heisenbugs: different runs produce different results
And so on
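To make the divide-into-subtasks idea concrete, here is a minimal POSIX-threads sketch (my illustration, not course code) that sums an array by giving each thread one contiguous chunk; the array size, thread count, and contents are arbitrary choices.

/* Minimal pthreads sketch: sum an array by dividing it into
 * one contiguous chunk per thread. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];

struct chunk { int lo, hi; double partial; };

static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    c->partial = 0.0;
    for (int i = c->lo; i < c->hi; i++)
        c->partial += data[i];                 /* each thread touches only its own chunk */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];

    for (int i = 0; i < N; i++) data[i] = 1.0; /* expected sum: N */

    for (int t = 0; t < NTHREADS; t++) {       /* divide the task into subtasks */
        chunks[t].lo = t * (N / NTHREADS);
        chunks[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {       /* wait, then combine partial results */
        pthread_join(tid[t], NULL);
        total += chunks[t].partial;
    }
    printf("sum = %f\n", total);
    return 0;
}

Because each thread writes only its own partial sum, and the partials are combined only after pthread_join, the sketch avoids the kind of shared-update race that produces the Heisenbugs listed above.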
Parallel Computing: Not Always Popular?
[Figure: percentage of multiprocessor papers in ISCA, 1973 to 2007 (y-axis 0 to 100%)]
Source: Mark Hill, 2/2007
Why Parallel Computing Waned?
We have been using parallel computing for decades
Mostly used in computational science and engineering
Problems too large to solve on one computer? Use 100s or 1000s
Many companies in the 80s/90s gambled on parallel computing and lost
Computers got faster too quickly
Parallel platforms quickly became obsolete as they were outperformed by better uniprocessors
Why bother with parallel programming?
Just wait a year or two for a faster uniprocessor
Why Parallel Computing Thrives Again
"We are dedicating all of our future product development to multicore designs. This is a sea change in computing."
Paul Otellini, President, Intel (2005)
The entire computing industry has bet on parallelism
There is now a desperate need for parallel programmers
Parallelism must be exposed to and managed by software
Unfortunately, most programmers have been trained to think sequentially about software
Multicore Products
All microprocessor companies switch to MP (2X CPUs / 2 yrs)
Manufacturer/Year   Processors/chip   Threads/Processor   Threads/chip
AMD/05              2                 1                   2
Intel/06            2                 2                   4
IBM/04              2                 2                   4
Sun/07              8                 16                  128
And at the same time:
The STI Cell processor (PS3) has 1 main core + 8 helper cores
The latest NVIDIA Graphics Processing Unit (GPU), the GTX 680, has 1,536 small cores
Intel has demonstrated an 80-core research chip
Looking Ahead
All major players are producing multicore chips
Every machine will soon be a parallel machine
Will all programmers be parallel programmers?!
New software model: hide the cost of new features; first speed up the code
Will all programmers be performance programmers?!
Some overhead may eventually be hidden in libraries, compilers, and higher-level languages
But a lot of work is needed to get there
Big open questions:
What will be the killer apps for multicore machines?
How should the chips be designed and programmed?
Why writing (fast) parallel programs is hard
Finding enough parallelism (Amdahl's Law)
Granularity
Locality
Load balance
Coordination and synchronization
Performance modeling
All of these make parallel programming harder than sequential programming.
Finding Enough Parallelism
Suppose only part of an application can be parallelized
Amdahl's Law:
Let s be the fraction of work done sequentially, so (1 - s) is the fraction that is parallelizable
P = number of processors
Speedup(P) = Time(1)/Time(P) = 1 / (s + (1 - s)/P) ~= 1/s as P approaches infinity
Even if the parallel part speeds up perfectly, performance is limited by the sequential part
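As a quick sanity check on the formula, here is a tiny sketch (the sequential fraction s = 0.1 is just an example value) that evaluates Speedup(P) = 1 / (s + (1 - s)/P) for a few processor counts.

/* Amdahl's Law: Speedup(P) = 1 / (s + (1 - s) / P).
 * s is the sequential fraction; 0.1 is an arbitrary example. */
#include <stdio.h>

int main(void) {
    const double s = 0.1;
    const int procs[] = {1, 2, 4, 16, 64, 1024};
    const int nprocs = sizeof(procs) / sizeof(procs[0]);

    for (int i = 0; i < nprocs; i++) {
        double speedup = 1.0 / (s + (1.0 - s) / procs[i]);
        printf("P = %4d   speedup = %5.2f\n", procs[i], speedup);
    }
    printf("P -> infinity: speedup -> %.2f (= 1/s)\n", 1.0 / s);
    return 0;
}

With only 10% of the work sequential, even 1024 processors give a speedup just under 10, the 1/s ceiling.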
Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting the desired speedup
Parallelism overheads include:
cost of starting a thread or process
cost of communicating shared data
cost of synchronizing
extra (redundant) computation
Each can be in the range of milliseconds on some systems
Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
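To put a rough number on the first overhead in that list, here is a small sketch (my own measurement harness, not from the lecture) that times creating and joining threads that do no work; the trial count is arbitrary and the result varies widely across systems.

/* Rough estimate of thread start/join overhead: create and join
 * NTRIALS do-nothing threads and report the average cost per thread.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTRIALS 1000

static void *noop(void *arg) { return arg; }

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NTRIALS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, noop, NULL);
        pthread_join(tid, NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("average create+join cost: %.1f microseconds per thread\n", us / NTRIALS);
    return 0;
}

Any subtask that does less work than this per-thread cost is too fine-grained to be worth spawning a thread for.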
Locality and Parallelism
[Figure: conventional storage hierarchy (per-processor caches, L2 and L3 caches, memories, and potential interconnects)]
Large memories are slow; fast memories are small
Storage hierarchies are large and fast on average
Parallel processors, collectively, have large, fast caches
The slow accesses to remote data are what we call communication
Algorithm should do most work on local data
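As a generic illustration of working on local data (not tied to the figure above), the sketch below traverses the same matrix twice: the row-order loop walks memory sequentially and stays in cache, while the column-order loop strides across whole rows and misses far more often. The matrix size is an arbitrary choice.

/* Locality sketch: C stores 2-D arrays row-major, so the row-order loop
 * has good spatial locality while the column-order loop does not. */
#include <stdio.h>

#define N 2048
static double a[N][N];

int main(void) {
    double sum_row = 0.0, sum_col = 0.0;

    for (int i = 0; i < N; i++)          /* row-major order: cache-friendly */
        for (int j = 0; j < N; j++)
            sum_row += a[i][j];

    for (int j = 0; j < N; j++)          /* column order: strided, cache-hostile */
        for (int i = 0; i < N; i++)
            sum_col += a[i][j];

    printf("%f %f\n", sum_row, sum_col); /* same answer, very different speed */
    return 0;
}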
Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
insufficient parallelism (during that phase)
unequal-size tasks, e.g.:
adapting to interesting parts of a domain
tree-structured computations
fundamentally unstructured problems
Parallel algorithm/platform needs to balance load
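One common remedy for unequal-size tasks is to hand work out dynamically rather than in fixed blocks. The OpenMP sketch below (my example, with a deliberately lopsided per-iteration cost) shows the idea; compile with an OpenMP-capable compiler, e.g., gcc -fopenmp.

/* Load-balancing sketch: iterations do very different amounts of work,
 * so schedule(dynamic) gives iterations to threads as they become free
 * instead of assigning each thread a fixed, possibly unlucky, block. */
#include <stdio.h>

#define NTASKS 64

static double work(int i) {              /* artificial unequal-size task */
    double x = 0.0;
    for (long k = 0; k < (long)i * 1000000L; k++)
        x += 1e-9;
    return x;
}

int main(void) {
    double total = 0.0;

    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (int i = 0; i < NTASKS; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}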
What makes parallel programming easier?
Standardized parallel programming platforms
OpenMP
Message Passing Interface (MPI)
POSIX Threads (pthreads)
Java thread model
Compute Unified Device Architecture (CUDA)
Why do they help?
Longer life-cycle for parallel programs
Code works across platforms
Automatic scaling?
(A minimal OpenMP example follows below.)
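For a taste of why these platforms help, here is a minimal OpenMP sketch (my example, not course code): a single pragma parallelizes the loop, and the same source still compiles and runs sequentially on a compiler without OpenMP support, since the pragma is then ignored.

/* Portability sketch: one OpenMP pragma parallelizes the loop.
 * Compile with gcc -fopenmp for the parallel version; without the flag
 * the pragma is ignored and the program runs sequentially. */
#include <stdio.h>

#define N 1000000

static double x[N], y[N], z[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)          /* iterations are independent: safe to split */
        z[i] = x[i] + y[i];

    printf("z[N-1] = %f\n", z[N - 1]);
    return 0;
}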
New Adventures In Parallel Computing
Internet can be seen as a large parallel/distributed computing environment
The cloud
A set of computers on the internet available on demand, like a public utility
Google's MapReduce
A software framework enabling the computing of large data sets on clusters of computers
Can map a parallel algorithm to worker nodes in the cloud
Reduce results from worker nodes to a single output/answer
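The map/reduce pattern itself is small enough to sketch in plain C. The toy below is purely illustrative (it is not Google's framework or API) and runs its "workers" one after another: the input is split into chunks, a function is mapped over each chunk, and the per-chunk partial results are reduced to a single answer. In a real deployment each chunk would go to a different worker node.

/* Toy illustration of the map/reduce pattern (not Google's MapReduce):
 * split the input into chunks, map a function over each chunk to get a
 * partial result, then reduce the partials into one final answer. */
#include <stdio.h>

#define N       1000
#define NCHUNKS 4

static long map_chunk(const int *in, int len) {   /* map + local combine */
    long partial = 0;
    for (int i = 0; i < len; i++)
        partial += (long)in[i] * in[i];           /* map: square each element */
    return partial;
}

int main(void) {
    int data[N];
    long partials[NCHUNKS];

    for (int i = 0; i < N; i++) data[i] = 1;      /* toy input */

    int chunk = N / NCHUNKS;
    for (int c = 0; c < NCHUNKS; c++)             /* one "worker" per chunk */
        partials[c] = map_chunk(&data[c * chunk],
                                (c == NCHUNKS - 1) ? N - c * chunk : chunk);

    long result = 0;
    for (int c = 0; c < NCHUNKS; c++)             /* reduce: combine partial results */
        result += partials[c];

    printf("result = %ld\n", result);             /* 1000 for this toy input */
    return 0;
}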