Morgan Kaufmann Publishers, December 14, 2016
Chapter 4 (Part IV)
The Processor:
Datapath and Control
(Parallelism and ILP)
陳瑞奇(J.C. Chen)
Department of Computer Science and Information Engineering, Asia University
Adapted from class notes by Prof. M.J. Irwin, PSU and Prof. D. Patterson, UCB
§4.10 Parallelism and Advanced Instruction Level Parallelism
4.10 Instruction-Level Parallelism (ILP)
ILP: executing multiple instructions in parallel
To increase ILP
Deeper pipeline (superpipelining)
Less work per stage ⇒ shorter clock cycle
Increase the depth of the pipeline to increase the clock rate
Multiple issue
Fetch more than one instruction at a time
Replicate pipeline stages ⇒ multiple pipelines
Start multiple instructions per clock cycle
But dependencies reduce this in practice
Chapter 4 — The Processor — 2
There are 10 birds in a tree. You shoot one dead. How many are left?
How can we improve efficiency?
Two birds with one arrow?
http://imgs.ntdtv.com/pic/2015/6-24/p6481351a175614304.jpg
Multiple issue
Fetch two or more instructions at the same time
Like loading several rounds at once
http://p3.pstatp.com/large/3792/19517668
MIPS with Static Dual Issue
Two-issue packets
One ALU/branch instruction
One load/store instruction
64-bit aligned
ALU/branch, then load/store
Pad an unused slot with nop
Address   Instruction type   Pipeline stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB
p. 323 (Chinese edition p. 335), Fig. 4.68
Instruction-Level Parallelism (cont.)
Issuing multiple instructions per clock cycle allows the instruction execution rate, CPI, to be less than 1
So instead we use IPC: instructions per clock cycle
E.g., a 6 GHz, four-way multiple-issue processor can execute at a peak rate of 24 billion instructions per second, with a best-case CPI of 0.25 or, equivalently, a best-case IPC of 4
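The peak figures above follow directly from the clock rate and the issue width. A minimal sketch of that arithmetic, assuming an ideal machine that fills every issue slot every cycle (the function name `peak_rate` is hypothetical, chosen only for this illustration):

```python
def peak_rate(clock_hz: float, issue_width: int) -> dict:
    """Best-case figures for an issue_width-way multiple-issue processor.

    Assumes every issue slot is filled every cycle (no hazards, no stalls).
    """
    ipc = issue_width                 # at most issue_width instructions start per cycle
    cpi = 1 / ipc                     # CPI is the reciprocal of IPC
    return {"ipc": ipc, "cpi": cpi, "instr_per_sec": clock_hz * ipc}


figures = peak_rate(clock_hz=6e9, issue_width=4)
print(figures["ipc"], figures["cpi"], figures["instr_per_sec"])
# 4 0.25 24000000000.0
```

Real programs fall short of this peak because dependencies leave issue slots empty.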
Multiple Issue Processor Styles
Static multiple issue (aka VLIW)
Compiler groups instructions to be issued together
Packages them into “issue slots”
Compiler detects and avoids hazards (at compile time)
E.g., Intel Itanium and Itanium 2 for the IA-64 ISA: EPIC (Explicitly Parallel Instruction Computing)
http://i1111.photobucket.com/albums/h466/kazorptb/Informatica/Intel-itanium-2-microprocessor-chipsss.png
Multiple Issue Processor Styles
Dynamic multiple issue (aka superscalar)
CPU examines instruction stream and chooses
instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards at run time, in hardware, using advanced techniques
E.g., IBM Power 2, Pentium 4, MIPS R10K
http://cdn.shopclues.net/images/detailed/316/northwoodp413micron_1361196559.jpg
Static Multiple Issue
Compiler groups instructions into “issue packets”
Group of instructions that can be issued in a single cycle
Determined by pipeline resources required
Think of an issue packet as a very long instruction
Specifies multiple concurrent operations
Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
Compiler must remove some/all hazards
Reorder instructions into issue packets
No dependencies within a packet
Possibly some dependencies between packets
Varies between ISAs; compiler must know!
Pad with nop if necessary
MIPS with Static Dual Issue
[Figure 4.69, p. 324 (Chinese edition p. 336): static dual-issue datapath; annotations: store, load, effective address, rd]
Hazards in the Dual-Issue MIPS
More instructions executing in parallel
EX data hazard
Forwarding avoided stalls with single-issue
Now can’t use ALU result in load/store in same packet
add $t0, $s0, $s1
lw $s2, 0($t0)
Split into two packets, effectively a stall
Load-use hazard
Still one cycle of use latency, but that cycle now covers two instructions
More aggressive scheduling required
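The EX data hazard above amounts to a simple rule a scheduler can check before pairing two instructions. The sketch below is illustrative only: `Instr` and `can_share_packet` are hypothetical names, and it assumes the ALU instruction comes first in program order, so the load/store would need the ALU result in the same cycle.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None (e.g. sw, bne, nop)
    srcs: Tuple[str, ...]      # registers read


def can_share_packet(alu: Instr, mem: Instr) -> bool:
    """True if mem does not read the register that alu writes.

    Assumes alu precedes mem in program order (a read-after-write check).
    """
    return alu.dest is None or alu.dest not in mem.srcs


add = Instr("add", "$t0", ("$s0", "$s1"))
lw = Instr("lw", "$s2", ("$t0",))          # address depends on $t0
lw_other = Instr("lw", "$s2", ("$s3",))    # independent load

print(can_share_packet(add, lw))        # False: must be split into two packets
print(can_share_packet(add, lw_other))  # True
```

When the check fails, the compiler places the two instructions in consecutive packets and pads the empty slots with nop.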
Scheduling Example (Fig. 4.70, p. 325; Chinese edition p. 338)
Schedule this for dual-issue MIPS
Loop:   lw   $t0, 0($s1)        # $t0 = array element
        addu $t0, $t0, $s2      # add scalar in $s2
        sw   $t0, 0($s1)        # store result
        addi $s1, $s1, -4       # decrement pointer
        bne  $s1, $zero, Loop   # branch if $s1 != 0

        ALU/branch              Load/store           Cycle
Loop:   nop                     lw   $t0, 0($s1)     1
        addi $s1, $s1, -4       nop                  2
        addu $t0, $t0, $s2      nop                  3
        bne  $s1, $zero, Loop   sw   $t0, 4($s1)     4

IPC = 5/4 = 1.25 (cf. peak IPC = 2)
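The IPC figure can be reproduced by counting the non-nop slots in the schedule and dividing by the number of cycles. A small sketch, with the schedule transcribed from the table above (the helper name `ipc` is hypothetical):

```python
def ipc(schedule):
    """schedule: one (alu_slot, mem_slot) pair of strings per cycle."""
    useful = sum(slot != "nop" for packet in schedule for slot in packet)
    return useful / len(schedule)


schedule = [
    ("nop",                  "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",    "nop"),
    ("addu $t0, $t0, $s2",   "nop"),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

loop_ipc = ipc(schedule)
print(loop_ipc)  # 1.25
```

Three of the eight slots hold nop, which is why the loop reaches only 1.25 of the possible 2 instructions per cycle.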
Loop Unrolling
Replicate loop body to expose more
parallelism
Reduces loop-control overhead
Use different registers per replication
Called “register renaming”
Loop Unrolling Example (Fig. 4.71, p. 326; Chinese edition p. 338)
        ALU/branch              Load/store           Cycle
Loop:   addi $s1, $s1, -16      lw   $t0, 0($s1)     1
        nop                     lw   $t1, 12($s1)    2
        addu $t0, $t0, $s2      lw   $t2, 8($s1)     3
        addu $t1, $t1, $s2      lw   $t3, 4($s1)     4
        addu $t2, $t2, $s2      sw   $t0, 16($s1)    5
        addu $t3, $t3, $s2      sw   $t1, 12($s1)    6
        nop                     sw   $t2, 8($s1)     7
        bne  $s1, $zero, Loop   sw   $t3, 4($s1)     8

IPC = 14/8 = 1.75
Closer to 2, but at cost of registers and code size
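At the source level, the same transformation looks like the sketch below. It is only an illustration of the idea: the function names are hypothetical, the array is walked upward rather than downward as in the MIPS code, and the unrolled version assumes the array length is a multiple of 4 so that no cleanup loop is needed.

```python
def add_scalar(a, s):
    """Original loop: one element and one loop test per iteration."""
    for i in range(len(a)):
        a[i] += s


def add_scalar_unrolled4(a, s):
    """Unrolled by 4. Assumes len(a) is a multiple of 4 (no cleanup loop)."""
    for i in range(0, len(a), 4):
        # Four independent temporaries play the role of the renamed
        # registers $t0..$t3, so the four additions do not depend on each other.
        t0 = a[i] + s
        t1 = a[i + 1] + s
        t2 = a[i + 2] + s
        t3 = a[i + 3] + s
        a[i], a[i + 1], a[i + 2], a[i + 3] = t0, t1, t2, t3


x = list(range(8))
y = list(range(8))
add_scalar(x, 10)
add_scalar_unrolled4(y, 10)
print(x == y)  # True
```

The unrolled loop performs one loop test per four elements, which is the reduction in loop-control overhead, and pays for it with four temporaries and a longer loop body.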
Dynamic Multiple Issue
“Superscalar” processors
CPU (hardware) decides whether to issue 0, 1, 2, … instructions each cycle
Avoiding structural and data hazards
Avoids the need for compiler scheduling
Though it may still help
Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out
of order to avoid stalls
But commit result to registers in order
Example
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
Can start sub while addu is waiting for lw
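Which instructions must wait for the load can be found by following register dependences. The sketch below is a simplified illustration, not how real hardware is organized: `waits_on` is a hypothetical helper, each instruction is written as (opcode, destination, sources), and only read-after-write dependences through registers are tracked.

```python
program = [
    ("lw",   "$t0", ["$s2"]),
    ("addu", "$t1", ["$t0", "$t2"]),
    ("sub",  "$s4", ["$s4", "$t3"]),
    ("slti", "$t5", ["$s4"]),
]


def waits_on(program, producer_index):
    """Indices of later instructions that depend, directly or transitively,
    on the result of program[producer_index]."""
    pending = {program[producer_index][1]}   # registers whose value is not ready
    waiting = []
    for i in range(producer_index + 1, len(program)):
        _, dest, srcs = program[i]
        if any(reg in pending for reg in srcs):
            waiting.append(i)
            pending.add(dest)        # its result is not ready either
        else:
            pending.discard(dest)    # overwritten by a value that is ready
    return waiting


stalled = waits_on(program, 0)
free = [program[i][0] for i in range(1, len(program)) if i not in stalled]
print([program[i][0] for i in stalled])  # ['addu']
print(free)                              # ['sub', 'slti']
```

Only addu needs $t0, so sub can start while addu waits; slti depends on sub, not on the load, so it can follow sub.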
Speculation
Predict branch and continue issuing
Don’t commit until branch outcome
determined
Load speculation
Avoid load and cache miss delay
Predict the effective address
Predict loaded value
Load before completing outstanding stores
Bypass stored values to load unit
Don’t commit load until speculation cleared
Why Do Dynamic Scheduling?
Why not just let the compiler schedule
code?
Not all stalls are predictable
e.g., cache misses
Can’t always schedule around branches
Branch outcome is dynamically determined
Different implementations (hardware) of an
ISA have different latencies and hazards
§4.14 Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput
using parallelism
More instructions completed per second
Latency for each instruction not reduced
Hazards: structural, data, control
Multiple issue and dynamic scheduling (ILP)
Dependencies limit achievable parallelism
Complexity leads to the power wall
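The throughput and latency points above can be checked with a little arithmetic. In this sketch the stage latencies are hypothetical values chosen for illustration, and the pipeline is assumed to be ideal (no hazards or stalls).

```python
stage_ps = [200, 100, 200, 200, 100]    # hypothetical IF, ID, EX, MEM, WB latencies

single_cycle_ps = sum(stage_ps)         # unpipelined: one instruction every 800 ps
clock_ps = max(stage_ps)                # pipelined clock is set by the slowest stage
latency_ps = clock_ps * len(stage_ps)   # one instruction still takes 5 cycles


def total_time_ps(n):
    """Time to finish n instructions in an ideal pipeline with no hazards."""
    return (len(stage_ps) + n - 1) * clock_ps


n = 1_000_000
speedup = n * single_cycle_ps / total_time_ps(n)
print(single_cycle_ps, latency_ps)   # 800 1000
print(round(speedup, 2))             # 4.0
```

Throughput improves about fourfold here, yet each instruction's latency rises from 800 ps to 1000 ps because every stage is stretched to the slowest stage's time.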
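Homework 4: Exercises from the second half of Chapter 4 (due in 2 weeks)
4.9 In this exercise, we examine how data dependences affect execution in the basic 5-stage pipeline described in Section 4.5. Problems in this exercise refer to the following sequence of instructions:
or $s1,$s2,$s3
or $s2,$s1,$s4
or $s1,$s1,$s2
Also, assume the following cycle times for each of the forwarding options:
Without forwarding   With full forwarding   With ALU-ALU forwarding only
250ps                300ps                  290ps
4.9.1 (10%) Indicate all dependences and their types.
4.9.2 (10%) Assume there is no forwarding in this pipelined processor. Indicate all hazards and add nop instructions to eliminate them.
4.9.3 (10%) Assume there is full forwarding. Indicate all hazards and add nop instructions to eliminate them.
4.9.5 (10%) Assume there is ALU-ALU forwarding only (no forwarding from the MEM to the EX stage). Add nop instructions to this code to eliminate hazards.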
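4.10 In this exercise, we examine how data hazards, control hazards, and instruction set architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:
sw $s2,12($s6)
lw $s2,8($s6)
beq $s5,$s4,Label # assume $s5 != $s4
add $s5,$s1,$s4
slt $s5,$s3,$s4
Assume that the individual pipeline stages have the following latencies:
IF      ID      EX      MEM     WB
200ps   120ps   150ps   190ps   100ps
4.10.1 (10%) For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we have only one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that has only one memory? We have seen that data hazards can be eliminated by adding nops to the code. Can you do the same with this structural hazard? Why?
4.13 This exercise is intended to help you understand the relationship between forwarding, hazard detection, and ISA design. Problems in this exercise refer to the following sequence of instructions, and assume that it is executed on a 5-stage pipelined datapath:
add $s5,$s2,$s1
lw $s3,4($s5)
lw $s2,0($s2)
or $s3,$s5,$s3
sw $s3,0($s5)
4.13.1 (10%) If there is no forwarding or hazard detection, insert nops to ensure correct execution.
4.13.2 (10%) Repeat 4.13.1, but now use nops only when a hazard cannot be avoided by changing or rearranging these instructions. You can assume register $s7 can be used to hold temporary values in your modified code.
4.13.3 (10%) If the processor has forwarding, but we forgot to implement the hazard detection unit, what happens when this code executes?
4.13.4 (10%) If there is forwarding, for the first five cycles during the execution of this code, specify which signals are asserted in each cycle by the hazard detection and forwarding units in Figure 4.60 (shown on the next page).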
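Figure 4.60
4.16 This exercise examines the accuracy of various branch predictors for the following repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT.
4.16.1 (5%) What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes?
4.16.2 (5%) What is the accuracy of the 2-bit predictor for the first four branches in this pattern, assuming that the predictor starts off in the bottom left state of Figure 4.63 (predict not taken)?
4.16.3 (10%) What is the accuracy of the 2-bit predictor if this pattern is repeated forever?
Figure 4.63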