Introduction to Parallel Processing
Shantanu Dutt, University of Illinois at Chicago
Acknowledgements
Ashish Agrawal, IIT Kanpur, Fundamentals of Parallel Processing (slides), w/ some modifications and augmentations by Shantanu Dutt
John Urbanic, Parallel Computing: Overview (slides), w/ some modifications and augmentations by Shantanu Dutt
John Mellor-Crummey, COMP 422 Parallel Computing: An Introduction, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt
Outline
Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Classification of parallel computations
Classification of parallel architectures - distributed and shared memory
Simple examples of parallel processing
Example applications
Future advances
Summary
Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur
Moore's Law & the Need for Parallel Processing
Chip performance doubles every 18-24 months.
Power consumption is proportional to frequency.
Limits of serial computing: heating issues, limits to transmission speeds, leakage currents, limits to miniaturization.
Multi-core processors are already commonplace, and most high-performance servers are already parallel.
Quest for Performance
Pipelining
Superscalar architecture
Out-of-order execution
Caches
Instruction set design advancements
Parallelism - multi-core processors, clusters, grids: this is the future
Pipelining
Illustration of a pipeline using the fetch, load, execute, store stages.
At the start of execution: wind-up; at the end of execution: wind-down.
Pipeline stalls due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction hurt performance and speedup.
Pipeline depth = number of stages, i.e., the number of instructions that can be in execution simultaneously (Intel Pentium 4: 35 stages).
Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
Pipelining
T_pipe(n), the pipelined time to process n instructions, = fill-time + n*max{t_i}, where t_i = execution time of the i-th stage and fill-time = (k-1)*max{t_i} for a k-stage pipeline.
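As a worked example (assuming, for illustration, a k = 5 stage pipeline whose slowest stage takes max{t_i} = 1 ns): fill-time = (5-1)*1 ns = 4 ns, so T_pipe(100) = 4 + 100*1 = 104 ns. Executing the same 100 instructions without pipelining would take about 100*5 = 500 ns, a speedup of roughly 4.8, which approaches the pipeline depth k as n grows.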
Cache
Desire for fast, cheap, and non-volatile memory.
Memory speed grows at ~7% per annum while processor speed grows at ~50% p.a.
Cache: fast, small memory; L1 and L2 caches.
Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
Cache hits and misses. Prefetching is used to avoid cache misses at the start of program execution. Cache lines are used to hide latency in case of a cache miss (see the sketch after this list).
Order of search: L1 cache -> L2 cache -> RAM -> disk.
Cache coherency: correctness of data; important for distributed parallel computing.
Limit to cache improvement: improving cache performance will at most bring memory efficiency up to match processor efficiency.
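To make the effect of cache lines and locality concrete, here is a small, hedged C sketch (not from the original slides; the matrix size and the use of clock() for timing are arbitrary choices). The row-by-row loop walks consecutive addresses and uses every element of each fetched cache line, while the column-by-column loop strides across rows and wastes most of each line, so it typically runs several times slower.

/* cache_demo.c: row-major vs. column-major traversal of a matrix.
   Illustrative only; N is an arbitrary size chosen for the demo. */
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];          /* stored row-major in C */

int main(void) {
    double sum = 0.0;
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)         /* row-by-row: consecutive addresses, */
        for (int j = 0; j < N; j++)     /* good use of each cache line        */
            sum += a[i][j];
    t1 = clock();
    printf("row-major    : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)         /* column-by-column: large stride,     */
        for (int i = 0; i < N; i++)     /* most of each fetched line is unused */
            sum += a[i][j];
    t1 = clock();
    printf("column-major : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)sum;  /* keep the compiler from optimizing the loops away */
}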
[Slide figure annotations: instruction-level parallelism - degree generally low and dependent on how the sequential code has been written, so not very effective; single-instruction multiple-data (SIMD) units; examples of limited data parallelism; examples of limited & low-level functional parallelism]
Thus we need to develop explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit this parallelism with minimum interaction/communication between the parallel parts.
[Slide figure annotations: simultaneous multithreading; multi-threading]
Applications of Parallel Processing
Example problems & solutions
Easy parallel situation: each data part is independent; no communication is required between the execution units solving two different parts (see the figure below).
Heat equation: the initial temperature is zero on the boundaries and high in the middle; the boundary temperature is held at zero. The calculation of an element is dependent upon its neighbor elements.
[Figure: independent data parts data1, data2, ..., data N, each solved by a separate execution unit]
Master/worker pseudocode (the master can be one of the workers):

   find out if I am MASTER or WORKER
   if I am MASTER
      initialize array
      send each WORKER starting info and subarray
      do until all WORKERS converge
         gather from all WORKERS convergence data
         broadcast to all WORKERS convergence signal
      end do
      receive results from each WORKER
   else if I am WORKER
      receive from MASTER starting info and subarray
      do until solution converged {
         update time
         non-blocking send neighbors my border info
         non-blocking receive neighbors border info
         update interior of my portion of solution array
         wait for non-blocking communication to complete
         update border of my portion of solution array
         determine if my solution has converged
         if so { send MASTER convergence signal
                 recv. from MASTER convergence signal }
      } end do
      send MASTER results
   endif

Serial code for the update over the problem grid:

   do iy = 2, ny-1
      do ix = 2, nx-1
         u2(ix,iy) = u1(ix,iy) + cx*( u1(ix+1,iy) + u1(ix-1,iy) - 2*u1(ix,iy) )
                               + cy*( u1(ix,iy+1) + u1(ix,iy-1) - 2*u1(ix,iy) )
      enddo
   enddo

[Figure: the problem grid, partitioned among the workers]
Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
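Below is a hedged C/MPI sketch of the WORKER's inner loop from the pseudocode above, showing how the non-blocking border exchange overlaps with the interior update. The 1-D strip decomposition, the array and constant names (NX, LOCAL_ROWS, u_old, u_new, cx, cy), and the fixed step count standing in for the convergence test are illustrative assumptions, not the original course code.

/* heat_worker_sketch.c: a minimal sketch of the WORKER halo exchange,
   assuming a 1-D strip decomposition of the grid across ranks. */
#include <mpi.h>
#include <string.h>

#define NX 128          /* columns of the global grid         */
#define LOCAL_ROWS 32   /* interior rows owned by this worker */

static double u_old[LOCAL_ROWS + 2][NX];  /* +2 ghost rows for neighbor borders */
static double u_new[LOCAL_ROWS + 2][NX];

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    double cx = 0.1, cy = 0.1;

    for (int step = 0; step < 100; step++) {            /* stand-in for "until converged" */
        MPI_Request req[4];
        /* non-blocking send of my border rows, non-blocking receive of ghost rows */
        MPI_Isend(u_old[1],              NX, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(u_old[LOCAL_ROWS],     NX, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(u_old[0],              NX, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[2]);
        MPI_Irecv(u_old[LOCAL_ROWS + 1], NX, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

        /* update interior rows (which need no ghost data) while communication proceeds */
        for (int i = 2; i < LOCAL_ROWS; i++)
            for (int j = 1; j < NX - 1; j++)
                u_new[i][j] = u_old[i][j]
                    + cx * (u_old[i+1][j] + u_old[i-1][j] - 2.0 * u_old[i][j])
                    + cy * (u_old[i][j+1] + u_old[i][j-1] - 2.0 * u_old[i][j]);

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);        /* wait for the ghost rows */

        /* now update the two border rows that needed the neighbors' data */
        for (int i = 1; i <= LOCAL_ROWS; i += LOCAL_ROWS - 1)
            for (int j = 1; j < NX - 1; j++)
                u_new[i][j] = u_old[i][j]
                    + cx * (u_old[i+1][j] + u_old[i-1][j] - 2.0 * u_old[i][j])
                    + cy * (u_old[i][j+1] + u_old[i][j-1] - 2.0 * u_old[i][j]);

        memcpy(u_old, u_new, sizeof u_old);
    }
    MPI_Finalize();
    return 0;
}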
How to interconnect the multiple cores/processors is a major consideration in a parallel architecture
Parallelism - A simplistic understanding
A classification of parallelism (multiple tasks at once; distribute work onto multiple execution units):
Data parallelism - divide the dataset and solve each part similarly on a separate execution unit.
Functional or control parallelism - divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right? (A code sketch of both forms follows the figure note below.)
[Figure: sequential execution vs. data-parallel execution of the same work]
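As a minimal sketch of the two forms in C with OpenMP (not part of the slides; the array, its contents, and the min/max tasks are made-up examples): the parallel-for loop splits one dataset across threads (data parallelism), while the sections construct assigns different tasks to different threads (functional parallelism).

/* parallelism_forms.c: data vs. functional parallelism, illustrative only.
   Compile with: gcc -fopenmp parallelism_forms.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)i;

    /* Data parallelism: the same operation on different parts of the dataset. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * x[i];

    /* Functional parallelism: different tasks run on different execution units. */
    double minv = x[0], maxv = x[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 1; i < N; i++) if (x[i] < minv) minv = x[i]; }  /* task 1 */
        #pragma omp section
        { for (int i = 1; i < N; i++) if (x[i] > maxv) maxv = x[i]; }  /* task 2 */
    }

    printf("sum=%g min=%g max=%g\n", sum, minv, maxv);
    return 0;
}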
[Figure: data parallelism vs. functional parallelism]
Flynn's Classification
Flynn's classical taxonomy:
Single Instruction, Single Data (SISD) - your single-core uniprocessor PC.
Single Instruction, Multiple Data (SIMD) - special-purpose, low-granularity multi-processor machine with a single control unit relaying the same instruction to all processors (with different data) every clock cycle.
Multiple Instruction, Single Data (MISD) - pipelining is a major example.
Multiple Instruction, Multiple Data (MIMD) - the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is very different from SIMD. Why? (See the sketch below.)
Note that data vs. control parallelism is another classification, independent of the above.
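One way to see why SPMD differs from SIMD: every SPMD process runs the same program but follows its own control flow at its own pace, rather than executing one common instruction per clock cycle under a single control unit. A minimal, hypothetical C/MPI sketch (the 'master'/'worker' roles are just an illustration):

/* spmd_sketch.c: one program, many processes, different control paths. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* this process takes the "master" branch ... */
        printf("master: coordinating %d workers\n", size - 1);
    } else {
        /* ... while the others take the "worker" branch, each on its own data;
           there is no lockstep single control unit as in SIMD */
        printf("worker %d: computing on my own subdomain\n", rank);
    }

    MPI_Finalize();
    return 0;
}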
Flynn's Classification (contd.)
[Figures illustrating the taxonomy categories]
Data parallelism: SIMD and SPMD fall into this category.
Functional parallelism: MISD falls into this category.
Parallel Arch. Classification
Multi-processor architectures:
Distributed memory - the most prevalent architecture model for # processors > 8
   Indirect interconnection networks
   Direct interconnection networks
Shared memory
   Uniform Memory Access (UMA)
   Non-Uniform Memory Access (NUMA) - distributed shared memory
Distributed-Memory (Message-Passing) Architectures
Each processor P (with its own local cache C) is connected to exclusive local memory, i.e., no other CPU has direct access to it.
Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
Non-blocking vs. blocking communication.
[Figures: example of a 2x4 mesh network (a direct connection network); direct vs. indirect communication/interconnection networks]
The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)
Has 56 compute nodes/computers and a master node
Master here has a different meaning - generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes - than the master node in a parallel algorithm (e.g., the one we saw for the finite-element heat distribution problem), which would actually be one of the compute nodes, and generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may also additionally perform a part of the computation.
Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level network. Each node (compute or master) has 2 processors; the processors are single-core on some nodes and dual-core on others. See http://accc.uic.edu/service/arg/nodes
System Computational Actions in a Message-Passing Program
(a) Two basic parallel processes X and Y, and their data dependency:
   Proc. X: a := b+c;        Proc. Y: b := x*y;        (X needs the value b produced by Y)

Message-passing mapping:
   Proc. X: recv(P2, b); a := b+c;        Proc. Y: b := x*y; send(P1, b);

(b) Their mapping to a message-passing multicomputer: P(X) is the processor/core containing X, P(Y) the processor/core containing Y; a (direct or indirect) link between the two processors carries the message passing of data item b.
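The mapping in (b) can be written almost directly in C with MPI. A minimal sketch, assuming P(X) is rank 0, P(Y) is rank 1, and making up values for c, x, and y:

/* xy_message_passing.c: the X/Y example above as a 2-rank MPI program. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                 /* P(Y): produces b and sends it to P(X) */
        double x = 2.0, y = 5.0;     /* made-up operands */
        double b = x * y;            /* b := x*y  */
        MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);               /* send(P1, b) */
    } else if (rank == 0) {          /* P(X): needs b before it can compute a */
        double b, c = 1.0, a;
        MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* recv(P2, b) */
        a = b + c;                   /* a := b+c  */
        printf("a = %f\n", a);
    }

    MPI_Finalize();
    return 0;
}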
Distributed Shared Memory Arch.: UMA
Flat memory model.
Memory bandwidth and latency are the same for all processors and all memory locations.
Simplest example: a dual-core processor.
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Cache-coherent UMA: consistent cached values of the same data item in the caches of different processors/cores.
[Figure: dual-core and quad-core processors with L1 and L2 caches]
System Computational Actions in a Shared-Memory Program
(a) Two basic parallel processes X and Y, and their data dependency:
   Proc. X: a := b+c;        Proc. Y: b := x*y;

Shared-memory mapping:
   Proc. Y: b := x*y;
   Possible actions by the O.S. for Y's write: (i) since b is a shared data item (e.g., designated by the compiler or programmer), check b's location to see if it can be written to (all previous reads done: read_cntr for b = 0); (ii) if so, write b to its location, mark its status bit as written by Y, and initialize read_cntr for b to a pre-determined value.

   Proc. X: a := b+c;
   Possible actions by the O.S. for X's read: (i) since b is a shared data item, check b's location to see if it has been written to by Y (or by any process, if we don't care about the writing process); (ii) if so {read b and decrement read_cntr for b}, else go to (i) and busy-wait (check periodically).

(b) Their mapping to a shared-memory multiprocessor: P(X) and P(Y) both access data item b in the shared memory.
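The O.S. actions sketched above amount to a flag/counter protocol on the shared location of b. Below is a minimal user-level analogue in C with POSIX threads and C11 atomics (an assumption-laden sketch for the single-reader case with made-up operand values; a real system would also track multiple readers with the read counter and rely on hardware cache coherence):

/* shared_b_sketch.c: busy-wait analogue of the shared-memory mapping above.
   Single reader; compile with: gcc -pthread shared_b_sketch.c */
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

static double b;                         /* the shared data item         */
static atomic_int b_written = 0;         /* status bit: "written by Y"   */

static void *proc_Y(void *arg) {         /* P(Y): b := x*y               */
    double x = 2.0, y = 5.0;             /* made-up operands             */
    b = x * y;
    atomic_store(&b_written, 1);         /* mark b's location as written */
    return NULL;
}

static void *proc_X(void *arg) {         /* P(X): a := b+c               */
    double c = 1.0, a;
    while (atomic_load(&b_written) == 0) /* busy-wait until Y has written b */
        ;                                /* (check periodically)         */
    a = b + c;
    printf("a = %f\n", a);
    return NULL;
}

int main(void) {
    pthread_t tx, ty;
    pthread_create(&tx, NULL, proc_X, NULL);
    pthread_create(&ty, NULL, proc_Y, NULL);
    pthread_join(tx, NULL);
    pthread_join(ty, NULL);
    return 0;
}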
Distributed Shared Memory Arch.: NUMA
Memory is physically distributed but logically shared. The physical layout is similar to the distributed-memory (message-passing) case.
The aggregated memory of the whole system appears as one single address space.
Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (local vs. remote access).
Two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing architectures, only here these links are used by the O.S. to transmit read/write non-local data to/from processor/non-local memory).
Advantage: scalability (compared to UMAs).
Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among processors instead of thinking of all of it as occupying a large monolithic address space that each processor can access).
[Figure: 2x2 mesh connection]
Most text from Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
An example of an SPMD message-passing parallel program
SPMD message-passing parallel program (contd.)
[Code excerpt: communication partner computed as node xor D]
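The surviving fragment "node xor D" suggests hypercube-style partner selection, where in dimension-exchange step D (a power of two) each node pairs with the node whose id differs in exactly that bit. The C/MPI sketch below is a hypothetical reconstruction of that pattern, not the original program; the summed value and the use of MPI_Sendrecv are assumptions.

/* hypercube_xor_sketch.c: pairwise exchange where partner = node xor D,
   D = 1, 2, 4, ... (one bit per hypercube dimension). Run with 2^d ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int node, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_val = (double)node, partner_val;

    for (int D = 1; D < size; D <<= 1) {         /* one step per dimension  */
        int partner = node ^ D;                  /* "node xor D"            */
        MPI_Sendrecv(&my_val, 1, MPI_DOUBLE, partner, 0,
                     &partner_val, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        my_val += partner_val;                   /* e.g., an all-reduce sum */
    }

    printf("node %d: sum over all nodes = %f\n", node, my_val);
    MPI_Finalize();
    return 0;
}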
Summary
Serial computers / microprocessors will probably not get much faster - parallelization is unavoidable.
Pipelining, caches, and other optimization strategies for serial computers are reaching a plateau.
Data and functional parallelism.
Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
Intro to parallel architectures:
   Distributed memory
   Shared memory: Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA)
Application examples.
Parallel program/algorithm examples.
Most text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur
Additional References
Computer Organization and Design - Patterson & Hennessy
Modern Operating Systems - Tanenbaum
Concepts of High Performance Computing - Georg Hager & Gerhard Wellein
Cramming More Components onto Integrated Circuits - Gordon Moore, 1965
Introduction to Parallel Computing - https://computing.llnl.gov/tutorials/parallel_comp
The Landscape of Parallel Computing Research: A View from Berkeley - 2006