Module 1
UCS645
Lec - 1
Saif Nalband
Contents
Parallelism Fundamentals:
● Scope and issues of parallel and distributed computing,
● Parallelism,
● Goals of parallelism,
● Parallelism and concurrency,
● Multiple simultaneous computations.
One common definition
A parallel computer is a collection of processing elements
that cooperate to solve problems quickly
▪ Parallel thinking (a minimal sketch of these three steps in C follows below):
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication/synchronization between the processors so that it does not limit speedup
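A minimal sketch of the three steps in C, assuming POSIX threads (the array size, chunking scheme, and NUM_THREADS are illustrative choices, not from the slides):

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NUM_THREADS 4

static int data[N];
static long partial[NUM_THREADS];

typedef struct { int id; } ThreadArg;

/* Step 1 (decompose): each thread sums an independent chunk of 'data'.
   Step 2 (assign): chunk i is assigned to thread i. */
static void* sum_chunk(void* p) {
    int id = ((ThreadArg*)p)->id;
    int chunk = N / NUM_THREADS;
    int begin = id * chunk;
    int end = (id == NUM_THREADS - 1) ? N : begin + chunk;
    long s = 0;
    for (int i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;   /* no sharing: each thread writes its own slot */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1;

    pthread_t threads[NUM_THREADS];
    ThreadArg args[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].id = i;
        pthread_create(&threads[i], NULL, sum_chunk, &args[i]);
    }

    /* Step 3 (synchronize): joining is the only coordination point,
       so communication overhead does not limit speedup. */
    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        total += partial[i];
    }
    printf("total = %ld\n", total);
    return 0;
}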
Until ~15 years ago: two significant reasons for processor performance improvement (increasing clock frequency and exploiting instruction-level parallelism)
[Figure: processor performance vs. year. Image credit: Olukotun and Hammond, ACM Queue 2005]
What is a program? (from a processor's perspective)
A program is just a list of processor instructions!

int main(int argc, char** argv)
{
    int x = 1;
    for (int i = 0; i < 10; i++)
    {
        x = x + x;
    }
    printf("%d\n", x);
    return 0;
}

Compiled code:

_main:
100000f10: pushq  %rbp
100000f11: movq   %rsp, %rbp
100000f14: subq   $32, %rsp
100000f18: movl   $0, -4(%rbp)
100000f1f: movl   %edi, -8(%rbp)
100000f22: movq   %rsi, -16(%rbp)
100000f26: movl   $1, -20(%rbp)
100000f2d: movl   $0, -24(%rbp)
100000f34: cmpl   $10, -24(%rbp)
100000f38: jge    23 <_main+0x45>
100000f3e: movl   -20(%rbp), %eax
100000f41: addl   -20(%rbp), %eax
100000f44: movl   %eax, -20(%rbp)
100000f47: movl   -24(%rbp), %eax
100000f4a: addl   $1, %eax
100000f4d: movl   %eax, -24(%rbp)
100000f50: jmp    -33 <_main+0x24>
100000f55: leaq   58(%rip), %rdi
100000f5c: movl   -20(%rbp), %esi
100000f5f: movb   $0, %al
100000f61: callq  14
100000f66: xorl   %esi, %esi
100000f68: movl   %eax, -28(%rbp)
100000f6b: movl   %esi, %eax
100000f6d: addq   $32, %rsp
100000f71: popq   %rbp
100000f72: retq
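(Aside: a listing like this can be reproduced by compiling without optimization and disassembling the binary, e.g. clang -O0 program.c followed by objdump -d a.out; the exact addresses and instruction choices will vary by compiler and platform.)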
What does a processor do?
A processor executes instructions.

A very simple processor has three parts:
- Fetch/Decode: determines what instruction to run next
- ALU (execution unit): performs the operation described by an instruction, which may modify values in the processor's registers or the computer's memory
- Execution context: registers (R0, R1, R2, R3) maintain program state: they store values of variables used as inputs and outputs to operations
One example instruction: add two numbers

Step 1: Fetch/Decode. The processor gets the next program instruction from memory (figures out what it should do next):
    add R0 ← R0, R1
"Please add the contents of register R0 to the contents of register R1, and put the result of the addition into register R0."

Step 2: Get the operation's inputs from the registers.
    Contents of R0, input to execution unit: 32
    Contents of R1, input to execution unit: 64

Step 3: Perform the addition operation.
    The execution unit performs the arithmetic; the result is: 96

Step 4: Store the result, 96, back to register R0.

Register state after execution: R0: 96, R1: 64, R2: 0xff681080, R3: 0x80486412
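The same four steps can be mimicked in software. A minimal sketch in C of a register file and the execution of an 'add' instruction (the register values follow the slide; the struct and function names are illustrative, not a real processor API):

#include <stdio.h>
#include <inttypes.h>

/* A toy model of the "very simple processor" register file. */
typedef struct {
    uint32_t R[4];   /* R0..R3 */
} Context;

/* Execute "add Rdst <- Ra, Rb": read the inputs, perform the
   arithmetic, and store the result back to the register file. */
static void exec_add(Context* ctx, int dst, int a, int b) {
    uint32_t in0 = ctx->R[a];      /* Step 2: read operand from Ra */
    uint32_t in1 = ctx->R[b];      /* Step 2: read operand from Rb */
    uint32_t result = in0 + in1;   /* Step 3: ALU performs the add */
    ctx->R[dst] = result;          /* Step 4: write result to Rdst */
}

int main(void) {
    Context ctx = { .R = { 32, 64, 0xff681080, 0x80486412 } };
    exec_add(&ctx, 0, 0, 1);       /* add R0 <- R0, R1 */
    printf("R0 = %" PRIu32 "\n", ctx.R[0]);   /* prints 96, as in the slide */
    return 0;
}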
Execute program
My very simple processor executes one instruction per clock. Instruction stream:

ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0

(Each instruction passes through Fetch/Decode, then the execution unit (ALU), and may read or update the execution context.)
Review of how computers work…
What is a computer program? (from a processor’s perspective)
It is a list of instructions to execute!
What is an instruction?
It describes an operation for a processor to perform.
Executing an instruction typically modifies the computer’s
state.
What does it mean for our parallel scheduling of instructions to "respect program order"?
(Any instruction that depends on a result must execute after the instruction that produces that result.)
What about three instructions at once?
a = x*x + y*y + z*z
Processor 1    Processor 2    Processor 3
Assume registers R0 = x, R1 = y, R2 = z
Instruction level parallelism (ILP) example
▪ a = x*x + y*y + z*z

Dependency tree, by level:
    ILP = 3: the three multiplies x*x, y*y, z*z are mutually independent
    ILP = 1: the first add needs the results of x*x and y*y
    ILP = 1: the final add needs the first sum and z*z, producing a
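The same dependency structure, written out in C with temporaries added for illustration:

/* a = x*x + y*y + z*z, split to expose the dependency levels. */
float ilp_example(float x, float y, float z) {
    float xx = x * x;   /* level 1: these three multiplies are   */
    float yy = y * y;   /* independent of one another (ILP = 3)  */
    float zz = z * z;
    float s  = xx + yy; /* level 2: needs xx and yy (ILP = 1)    */
    return s + zz;      /* level 3: needs s and zz (ILP = 1)     */
}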
Superscalar processor execution
a = x*x + y*y + z*z
Assume registers R0 = x, R1 = y, R2 = z

Idea #1: Superscalar execution: the processor automatically finds independent instructions in an instruction sequence and executes them in parallel on multiple execution units!

1  mul R0, R0, R0
2  mul R1, R1, R1
3  mul R2, R2, R2
4  add R0, R0, R1
5  add R3, R0, R2

(Diagram: two Fetch/Decode units and two execution units sharing one execution context, so two independent instructions can issue per clock.)
Diminishing returns of superscalar execution
Most available ILP is exploited by a processor capable of issuing four instructions per clock
(Little performance benefit from building a processor that can issue more)
[Figure: speedup vs. instruction issue capability of the processor (instructions/clock, 0 to 16). The curve flattens beyond roughly four instructions per clock.]
ILP tapped out + end of frequency scaling
[Figure: historical scaling trends. Legend: transistor density, clock frequency, power, instruction-level parallelism (ILP). Processor clock rate stops increasing; no further benefit from ILP.]
The "power wall"
Power consumed by a transistor:
- Dynamic power ∝ capacitive load × voltage² × frequency
- Static power: transistors burn power even when inactive, due to leakage
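A worked example of the dynamic-power relation in C (the 15% frequency/voltage reduction is an illustrative assumption, not a datasheet value): running two slower cores can deliver more throughput than one fast core at a modest increase in power.

#include <stdio.h>

/* Dynamic power ~ C * V^2 * f (a proportionality, arbitrary units). */
static double dynamic_power(double cap, double voltage, double freq) {
    return cap * voltage * voltage * freq;
}

int main(void) {
    double base = dynamic_power(1.0, 1.0, 1.0);
    /* Lowering frequency often permits a lower supply voltage.
       Assumption: two cores, each at 0.85x frequency and 0.85x voltage. */
    double two_cores = 2.0 * dynamic_power(1.0, 0.85, 0.85);
    printf("baseline:  %.3f\n", base);       /* 1.000 */
    printf("two cores: %.3f\n", two_cores);  /* ~1.228 power for ~1.7x throughput */
    return 0;
}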
AMD Ryzen Threadripper 3990X
64 cores, 4.3 GHz
Eight 8-core chiplets
● ~32-40x faster!
NVIDIA AD102 GPU
GeForce RTX 4090 (2022)
76 billion transistors

Apple A15 Bionic (in iPhone 13, 14)
15 billion transistors
6-core CPU: 2 "big" CPU cores + 4 "small" CPU cores
Multi-core GPU (5 GPU blocks)
Image credit: TechInsights Inc.
Mobile parallel processing
Raspberry Pi 3
Quad-core ARM Cortex-A53 CPU
High Performance Computing Projects
1. Early Warning and Flood Prediction for River Basins of India - CDAC, CWC
2. A HPC Software Suite for Seismic Imaging to Aid Oil & Gas Exploration - CDAC, ONGC, NGRI, IITR
3. NSM Urban Modeling Project - CDAC, CPCB
4. NSM Platform for Genomics and Drug Discovery (NPGDD) - CDAC, IISc Bangalore, NII Delhi, IIT Delhi, NCBS Bangalore, NIBGM, Ministry of AYUSH
5. Materials and Computational Chemistry - CDAC, IITK, IACS Kolkata, IISER Bhopal, SPPU Pune
6. Design & Development of DCLC Based System - CDAC, CMET, IITB
7. MPP Lab Project - CDAC, CDOT, IISc
But in modern computing, software must be more than just parallel…
Memory
A program's memory address space

▪ A computer's memory is organized as an array of bytes
▪ In this example the address space is 32 bytes (so valid addresses range from 0x0 to 0x1F)

Address | Value
0x0     | 16
0x1     | 255
0x2     | 14
0x3     | 0
...     | ...
0x1F    | 0
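A small C illustration of memory as an array of bytes (the values match the table above; the printed addresses are offsets into the array, not the absolute addresses a real program would see):

#include <stdio.h>

int main(void) {
    /* A 32-byte "address space" with the values from the table;
       unspecified elements default to 0. */
    unsigned char mem[32] = { 16, 255, 14, 0 };

    for (unsigned addr = 0; addr < 4; addr++)
        printf("address 0x%X: value %u\n", addr, mem[addr]);
    printf("address 0x1F: value %u\n", mem[0x1F]);
    return 0;
}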
Terminology
▪ Memory access latency
- The amount of time it takes the memory system to provide data to the processor
- Example: 100 clock cycles, or 100 nsec (on a 1 GHz processor, 100 cycles take 100 nsec)
[Figure: a data request travels from the processor to memory; in the illustration the latency is ~2 sec.]
Stalls
▪ A processor "stalls" (can't make progress) when it cannot run the next instruction in an instruction stream, because that instruction depends on a previous instruction that is not yet complete.
▪ Accessing memory is a major source of stalls:

ld  r0, mem[r2]
ld  r1, mem[r3]
add r0, r0, r1

Dependency: the 'add' instruction cannot execute until the data from mem[r2] and mem[r3] has been loaded from memory.
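A sketch in C of why dependent memory accesses stall: chasing pointers serializes the loads (each load's address comes from the previous load), while an array sum issues independent loads that can overlap. This is illustrative only; a real demonstration would use a working set larger than the caches and randomize the chain so the prefetcher cannot hide the latency.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

typedef struct Node { struct Node* next; long pad[7]; } Node;

int main(void) {
    /* Dependent loads: the processor must wait for each load to
       complete before it even knows the next address to fetch. */
    Node* nodes = malloc(N * sizeof(Node));
    for (long i = 0; i < N - 1; i++) nodes[i].next = &nodes[i + 1];
    nodes[N - 1].next = NULL;

    long hops = 0;
    for (Node* p = nodes; p != NULL; p = p->next)  /* serialized chain */
        hops++;

    /* Independent loads: all addresses are known in advance, so
       multiple loads can be in flight at once (far fewer stalls). */
    long* a = malloc(N * sizeof(long));
    for (long i = 0; i < N; i++) a[i] = 1;
    long sum = 0;
    for (long i = 0; i < N; i++) sum += a[i];

    printf("hops = %ld, sum = %ld\n", hops, sum);
    free(nodes); free(a);
    return 0;
}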
Processor
L1 cache (32 KB)