Introduction to Parallel Processing
Shantanu Dutt, University of Illinois at Chicago
Acknowledgements
Ashish Agrawal, IIT Kanpur, Fundamentals of Parallel Processing (slides), w/ some modifications and augmentations by Shantanu Dutt
John Urbanic, Parallel Computing: Overview (slides), w/ some modifications and augmentations by Shantanu Dutt
John Mellor-Crummey, COMP 422 Parallel Computing: An Introduction, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt
Outline
Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Classification of parallel computations
Classification of parallel architectures - distributed and shared memory
Simple examples of parallel processing
Example applications
Future advances
Summary
Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur
Moore's Law & the Need for Parallel Processing
Chip performance doubles every 18-24 months.
Power consumption is proportional to frequency.
Limits of serial computing: heating issues, limits to transmission speeds, leakage currents, limits to miniaturization.
Multi-core processors are already commonplace, and most high-performance servers are already parallel.
Quest for Performance
Pipelining
Superscalar architecture
Out-of-order execution
Caches
Instruction set design advancements
Parallelism - multi-core processors, clusters, grids: this is the future
Pipelining
Illustration of a pipeline using the fetch, load, execute, store stages.
At the start of execution: wind-up; at the end of execution: wind-down.
Pipeline stalls due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction hurt performance and speedup.
Pipeline depth = number of stages, i.e., the number of instructions that can be in execution simultaneously (Intel Pentium 4: 35 stages).
Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
Pipelining
T_pipe(n), the pipelined time to process n instructions, = fill-time + n*max{t_i}, where t_i = execution time of the i-th stage and fill-time = (k-1)*max{t_i} for a k-stage pipeline.
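As a worked example (assuming, for illustration, a k = 5 stage pipeline whose slowest stage takes max{t_i} = 1 ns): fill-time = (5-1)*1 ns = 4 ns, so T_pipe(100) = 4 + 100*1 = 104 ns. Executing the same 100 instructions without pipelining would take about 100*5 = 500 ns, a speedup of roughly 4.8, which approaches the pipeline depth k as n grows.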
Cache
Desire for fast, cheap, and non-volatile memory.
Memory speed grows at ~7% per annum while processor speed grows at ~50% p.a.
Cache: fast, small memory; L1 and L2 caches.
Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
Cache hits and misses. Prefetching is used to avoid cache misses at the start of program execution. Cache lines are used to hide latency in case of a cache miss (see the sketch after this list).
Order of search: L1 cache -> L2 cache -> RAM -> disk.
Cache coherency: correctness of data; important for distributed parallel computing.
Limit to cache improvement: improving cache performance will at most bring memory efficiency up to match processor efficiency.
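To make the effect of cache lines and locality concrete, here is a small, hedged C sketch (not from the original slides; the matrix size and the use of clock() for timing are arbitrary choices). The row-by-row loop walks consecutive addresses and uses every element of each fetched cache line, while the column-by-column loop strides across rows and wastes most of each line, so it typically runs several times slower.

/* cache_demo.c: row-major vs. column-major traversal of a matrix.
   Illustrative only; N is an arbitrary size chosen for the demo. */
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];          /* stored row-major in C */

int main(void) {
    double sum = 0.0;
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)         /* row-by-row: consecutive addresses, */
        for (int j = 0; j < N; j++)     /* good use of each cache line        */
            sum += a[i][j];
    t1 = clock();
    printf("row-major    : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)         /* column-by-column: large stride,     */
        for (int i = 0; i < N; i++)     /* most of each fetched line is unused */
            sum += a[i][j];
    t1 = clock();
    printf("column-major : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)sum;  /* keep the compiler from optimizing the loops away */
}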
[Slide figure annotations: instruction-level parallelism - degree generally low and dependent on how the sequential code has been written, so not very effective; single-instruction multiple-data (SIMD) units; examples of limited data parallelism; examples of limited & low-level functional parallelism]
Thus we need to develop explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit this parallelism with minimum interaction/communication between the parallel parts.
[Slide figure annotations: simultaneous multithreading; multi-threading]
Applications of Parallel Processing
Example problems & solutions
Easy parallel situation: each data part is independent; no communication is required between the execution units solving two different parts (see the figure below).
Heat equation: the initial temperature is zero on the boundaries and high in the middle; the boundary temperature is held at zero. The calculation of an element is dependent upon its neighbor elements.
[Figure: independent data parts data1, data2, ..., data N, each solved by a separate execution unit]
Master/worker pseudocode (the master can be one of the workers):

   find out if I am MASTER or WORKER
   if I am MASTER
      initialize array
      send each WORKER starting info and subarray
      do until all WORKERS converge
         gather from all WORKERS convergence data
         broadcast to all WORKERS convergence signal
      end do
      receive results from each WORKER
   else if I am WORKER
      receive from MASTER starting info and subarray
      do until solution converged {
         update time
         non-blocking send neighbors my border info
         non-blocking receive neighbors border info
         update interior of my portion of solution array
         wait for non-blocking communication to complete
         update border of my portion of solution array
         determine if my solution has converged
         if so { send MASTER convergence signal
                 recv. from MASTER convergence signal }
      } end do
      send MASTER results
   endif

Serial code for the update over the problem grid:

   do iy = 2, ny-1
      do ix = 2, nx-1
         u2(ix,iy) = u1(ix,iy) + cx*( u1(ix+1,iy) + u1(ix-1,iy) - 2*u1(ix,iy) )
                               + cy*( u1(ix,iy+1) + u1(ix,iy-1) - 2*u1(ix,iy) )
      enddo
   enddo

[Figure: the problem grid, partitioned among the workers]
Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
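Below is a hedged C/MPI sketch of the WORKER's inner loop from the pseudocode above, showing how the non-blocking border exchange overlaps with the interior update. The 1-D strip decomposition, the array and constant names (NX, LOCAL_ROWS, u_old, u_new, cx, cy), and the fixed step count standing in for the convergence test are illustrative assumptions, not the original course code.

/* heat_worker_sketch.c: a minimal sketch of the WORKER halo exchange,
   assuming a 1-D strip decomposition of the grid across ranks. */
#include <mpi.h>
#include <string.h>

#define NX 128          /* columns of the global grid         */
#define LOCAL_ROWS 32   /* interior rows owned by this worker */

static double u_old[LOCAL_ROWS + 2][NX];  /* +2 ghost rows for neighbor borders */
static double u_new[LOCAL_ROWS + 2][NX];

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    double cx = 0.1, cy = 0.1;

    for (int step = 0; step < 100; step++) {            /* stand-in for "until converged" */
        MPI_Request req[4];
        /* non-blocking send of my border rows, non-blocking receive of ghost rows */
        MPI_Isend(u_old[1],              NX, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(u_old[LOCAL_ROWS],     NX, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(u_old[0],              NX, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[2]);
        MPI_Irecv(u_old[LOCAL_ROWS + 1], NX, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

        /* update interior rows (which need no ghost data) while communication proceeds */
        for (int i = 2; i < LOCAL_ROWS; i++)
            for (int j = 1; j < NX - 1; j++)
                u_new[i][j] = u_old[i][j]
                    + cx * (u_old[i+1][j] + u_old[i-1][j] - 2.0 * u_old[i][j])
                    + cy * (u_old[i][j+1] + u_old[i][j-1] - 2.0 * u_old[i][j]);

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);        /* wait for the ghost rows */

        /* now update the two border rows that needed the neighbors' data */
        for (int i = 1; i <= LOCAL_ROWS; i += LOCAL_ROWS - 1)
            for (int j = 1; j < NX - 1; j++)
                u_new[i][j] = u_old[i][j]
                    + cx * (u_old[i+1][j] + u_old[i-1][j] - 2.0 * u_old[i][j])
                    + cy * (u_old[i][j+1] + u_old[i][j-1] - 2.0 * u_old[i][j]);

        memcpy(u_old, u_new, sizeof u_old);
    }
    MPI_Finalize();
    return 0;
}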
How to interconnect the multiple cores/processors is a major consideration in a parallel architecture
Parallelism - A simplistic understanding
A classification of parallelism (multiple tasks at once; distribute work onto multiple execution units):
Data parallelism - divide the dataset and solve each part similarly on a separate execution unit.
Functional or control parallelism - divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right? (A code sketch of both forms follows the figure note below.)
[Figure: sequential execution vs. data-parallel execution of the same work]
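As a minimal sketch of the two forms in C with OpenMP (not part of the slides; the array, its contents, and the min/max tasks are made-up examples): the parallel-for loop splits one dataset across threads (data parallelism), while the sections construct assigns different tasks to different threads (functional parallelism).

/* parallelism_forms.c: data vs. functional parallelism, illustrative only.
   Compile with: gcc -fopenmp parallelism_forms.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)i;

    /* Data parallelism: the same operation on different parts of the dataset. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * x[i];

    /* Functional parallelism: different tasks run on different execution units. */
    double minv = x[0], maxv = x[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 1; i < N; i++) if (x[i] < minv) minv = x[i]; }  /* task 1 */
        #pragma omp section
        { for (int i = 1; i < N; i++) if (x[i] > maxv) maxv = x[i]; }  /* task 2 */
    }

    printf("sum=%g min=%g max=%g\n", sum, minv, maxv);
    return 0;
}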
[Figure: data parallelism vs. functional parallelism]
Flynn's Classification
Flynn's classical taxonomy:
Single Instruction, Single Data (SISD) - your single-core uniprocessor PC.
Single Instruction, Multiple Data (SIMD) - special-purpose, low-granularity multi-processor machine with a single control unit relaying the same instruction to all processors (with different data) every clock cycle.
Multiple Instruction, Single Data (MISD) - pipelining is a major example.
Multiple Instruction, Multiple Data (MIMD) - the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is very different from SIMD. Why? (See the sketch below.)
Note that data vs. control parallelism is another classification, independent of the above.
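One way to see why SPMD differs from SIMD: every SPMD process runs the same program but follows its own control flow at its own pace, rather than executing one common instruction per clock cycle under a single control unit. A minimal, hypothetical C/MPI sketch (the 'master'/'worker' roles are just an illustration):

/* spmd_sketch.c: one program, many processes, different control paths. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* this process takes the "master" branch ... */
        printf("master: coordinating %d workers\n", size - 1);
    } else {
        /* ... while the others take the "worker" branch, each on its own data;
           there is no lockstep single control unit as in SIMD */
        printf("worker %d: computing on my own subdomain\n", rank);
    }

    MPI_Finalize();
    return 0;
}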
Flynn's Classification (contd.)
[Figures illustrating the taxonomy categories]
Data parallelism: SIMD and SPMD fall into this category.
Functional parallelism: MISD falls into this category.
Parallel Arch. Classification
Multi-processor architectures:
Distributed memory - the most prevalent architecture model for # processors > 8
   Indirect interconnection networks
   Direct interconnection networks
Shared memory
   Uniform Memory Access (UMA)
   Non-Uniform Memory Access (NUMA) - distributed shared memory
Distributed-Memory (Message-Passing) Architectures
Each processor P (with its own local cache C) is connected to exclusive local memory, i.e., no other CPU has direct access to it.
Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
Non-blocking vs. blocking communication.
[Figures: example of a 2x4 mesh network (a direct connection network); direct vs. indirect communication/interconnection networks]
The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)
Has 56 compute nodes/computers and a master node
Master here has a different meaning - generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes - than the master node in a parallel algorithm (e.g., the one we saw for the finite-element heat distribution problem), which would actually be one of the compute nodes, and generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may also additionally perform a part of the computation.
Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level network. Each node (compute or master) has 2 processors; the processors are single-core on some nodes and dual-core on others. See http://accc.uic.edu/service/arg/nodes
System Computational Actions in a Message-Passing Program
(a) Two basic parallel processes X and Y, and their data dependency:
   Proc. X: a := b+c;        Proc. Y: b := x*y;        (X needs the value b produced by Y)

Message-passing mapping:
   Proc. X: recv(P2, b); a := b+c;        Proc. Y: b := x*y; send(P1, b);

(b) Their mapping to a message-passing multicomputer: P(X) is the processor/core containing X, P(Y) the processor/core containing Y; a (direct or indirect) link between the two processors carries the message passing of data item b.
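The mapping in (b) can be written almost directly in C with MPI. A minimal sketch, assuming P(X) is rank 0, P(Y) is rank 1, and making up values for c, x, and y:

/* xy_message_passing.c: the X/Y example above as a 2-rank MPI program. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                 /* P(Y): produces b and sends it to P(X) */
        double x = 2.0, y = 5.0;     /* made-up operands */
        double b = x * y;            /* b := x*y  */
        MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);               /* send(P1, b) */
    } else if (rank == 0) {          /* P(X): needs b before it can compute a */
        double b, c = 1.0, a;
        MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* recv(P2, b) */
        a = b + c;                   /* a := b+c  */
        printf("a = %f\n", a);
    }

    MPI_Finalize();
    return 0;
}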
Distributed Shared Memory Arch.: UMA
Flat memory model.
Memory bandwidth and latency are the same for all processors and all memory locations.
Simplest example: a dual-core processor.
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Cache-coherent UMA: consistent cached values of the same data item in the caches of different processors/cores.
[Figure: dual-core and quad-core processors with L1 and L2 caches]
System Computational Actions in a Shared-Memory Program
(a) Two basic parallel processes X and Y, and their data dependency:
   Proc. X: a := b+c;        Proc. Y: b := x*y;

Shared-memory mapping:
   Proc. Y: b := x*y;
   Possible actions by the O.S. for Y's write: (i) since b is a shared data item (e.g., designated by the compiler or programmer), check b's location to see if it can be written to (all previous reads done: read_cntr for b = 0); (ii) if so, write b to its location, mark its status bit as written by Y, and initialize read_cntr for b to a pre-determined value.

   Proc. X: a := b+c;
   Possible actions by the O.S. for X's read: (i) since b is a shared data item, check b's location to see if it has been written to by Y (or by any process, if we don't care about the writing process); (ii) if so {read b and decrement read_cntr for b}, else go to (i) and busy-wait (check periodically).

(b) Their mapping to a shared-memory multiprocessor: P(X) and P(Y) both access data item b in the shared memory.
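The O.S. actions sketched above amount to a flag/counter protocol on the shared location of b. Below is a minimal user-level analogue in C with POSIX threads and C11 atomics (an assumption-laden sketch for the single-reader case with made-up operand values; a real system would also track multiple readers with the read counter and rely on hardware cache coherence):

/* shared_b_sketch.c: busy-wait analogue of the shared-memory mapping above.
   Single reader; compile with: gcc -pthread shared_b_sketch.c */
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

static double b;                         /* the shared data item         */
static atomic_int b_written = 0;         /* status bit: "written by Y"   */

static void *proc_Y(void *arg) {         /* P(Y): b := x*y               */
    double x = 2.0, y = 5.0;             /* made-up operands             */
    b = x * y;
    atomic_store(&b_written, 1);         /* mark b's location as written */
    return NULL;
}

static void *proc_X(void *arg) {         /* P(X): a := b+c               */
    double c = 1.0, a;
    while (atomic_load(&b_written) == 0) /* busy-wait until Y has written b */
        ;                                /* (check periodically)         */
    a = b + c;
    printf("a = %f\n", a);
    return NULL;
}

int main(void) {
    pthread_t tx, ty;
    pthread_create(&tx, NULL, proc_X, NULL);
    pthread_create(&ty, NULL, proc_Y, NULL);
    pthread_join(tx, NULL);
    pthread_join(ty, NULL);
    return 0;
}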
Distributed Shared Memory Arch.: NUMA
Memory is physically distributed but logically shared. The physical layout is similar to the distributed-memory (message-passing) case.
The aggregated memory of the whole system appears as one single address space.
Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (local vs. remote access).
Two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing architectures, only here these links are used by the O.S. to transmit read/write non-local data to/from processor/non-local memory).
Advantage: scalability (compared to UMAs).
Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among processors instead of thinking of all of it as occupying a large monolithic address space that each processor can access).
[Figure: 2x2 mesh connection]
Most text from Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
An example of an SPMD message-passing parallel program
SPMD message-passing parallel program (contd.)
[Code excerpt: communication partner computed as node xor D]
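The surviving fragment "node xor D" suggests hypercube-style partner selection, where in dimension-exchange step D (a power of two) each node pairs with the node whose id differs in exactly that bit. The C/MPI sketch below is a hypothetical reconstruction of that pattern, not the original program; the summed value and the use of MPI_Sendrecv are assumptions.

/* hypercube_xor_sketch.c: pairwise exchange where partner = node xor D,
   D = 1, 2, 4, ... (one bit per hypercube dimension). Run with 2^d ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int node, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_val = (double)node, partner_val;

    for (int D = 1; D < size; D <<= 1) {         /* one step per dimension  */
        int partner = node ^ D;                  /* "node xor D"            */
        MPI_Sendrecv(&my_val, 1, MPI_DOUBLE, partner, 0,
                     &partner_val, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        my_val += partner_val;                   /* e.g., an all-reduce sum */
    }

    printf("node %d: sum over all nodes = %f\n", node, my_val);
    MPI_Finalize();
    return 0;
}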
Summary
Serial computers / microprocessors will probably not get much faster - parallelization is unavoidable.
Pipelining, caches, and other optimization strategies for serial computers are reaching a plateau.
Data and functional parallelism.
Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
Intro to parallel architectures:
   Distributed memory
   Shared memory: Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA)
Application examples.
Parallel program/algorithm examples.
Most text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur
Additional References
Computer Organization and Design - Patterson & Hennessy
Modern Operating Systems - Tanenbaum
Concepts of High Performance Computing - Georg Hager & Gerhard Wellein
Cramming More Components onto Integrated Circuits - Gordon Moore, 1965
Introduction to Parallel Computing - https://computing.llnl.gov/tutorials/parallel_comp
The Landscape of Parallel Computing Research: A View from Berkeley - 2006