
Advanced Processor Techniques

This document discusses instruction-level parallelism (ILP) and how it can be achieved through deeper pipelines, multiple instruction issue, and static or dynamic multiple issue processor designs. It provides examples of MIPS with static dual issue where instructions are issued in two-instruction packets. The compiler must schedule instructions to avoid hazards either through reordering or inserting NOP instructions. Both static and dynamic multiple issue designs aim to increase parallelism and the instruction execution rate.

Uploaded by

許藝蓁

Morgan Kaufmann Publishers, 14 December 2016 (ROC year 105)

Chapter 4 (Part IV)


The Processor:
Datapath and Control
(Parallelism and ILP)
J.C. Chen (陳瑞奇)
Department of Computer Science and Information Engineering, Asia University (亞洲大學資訊工程學系)
Adapted from class notes by Prof. M.J. Irwin, PSU, and Prof. D. Patterson, UCB

§4.10 Parallelism and Advanced Instruction Level Parallelism

4.10 Instruction-Level Parallelism (ILP)

• ILP: executing multiple instructions in parallel
• To increase ILP
    • Deeper pipeline (superpipelining)
        • Less work per stage → shorter clock cycle
        • Increase the depth of the pipeline to increase the clock rate
    • Multiple issue
        • Fetch more than one instruction at a time
        • Replicate pipeline stages → multiple pipelines
        • Start multiple instructions per clock cycle
        • But dependencies reduce this in practice
Chapter 4 — The Processor — 2


There are 10 birds in a tree. Shoot one, and how many are left?

How can we improve efficiency?
Kill two birds with one stone?

http://imgs.ntdtv.com/pic/2015/6-24/p6481351a175614304.jpg

Multiple issue

Fetch two or more instructions at once

Like loading several rounds of ammunition at once

http://p3.pstatp.com/large/3792/19517668


MIPS with Static Dual Issue

• Two-issue packets
    • One ALU/branch instruction
    • One load/store instruction
    • 64-bit aligned
        • ALU/branch, then load/store
        • Pad an unused instruction with nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF   ID   EX   MEM  WB
n + 4     Load/store         IF   ID   EX   MEM  WB
n + 8     ALU/branch              IF   ID   EX   MEM  WB
n + 12    Load/store              IF   ID   EX   MEM  WB
n + 16    ALU/branch                   IF   ID   EX   MEM  WB
n + 20    Load/store                   IF   ID   EX   MEM  WB

Fig. 4.68, p. 323 (p. 335 in the Chinese edition)
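As a rough illustration of the packet format above, the following Python sketch lays a short instruction sequence out as 64-bit-aligned two-issue packets and pads empty slots with nop. The `LOAD_STORE` set and the function names `slot_kind` and `pack_dual_issue` are assumptions made for this example. The sketch only pairs adjacent instructions and ignores data dependences, which a real compiler would also have to check.

```python
# Hypothetical sketch: lay out instructions as static dual-issue packets.
LOAD_STORE = {"lw", "sw"}  # assumed subset of load/store mnemonics

def slot_kind(instr):
    """Return 'mem' for a load/store, 'alu' for anything else (ALU/branch)."""
    return "mem" if instr.split()[0] in LOAD_STORE else "alu"

def pack_dual_issue(instrs, base=0):
    """Greedily pair an ALU/branch op with the load/store that follows it.

    Returns a list of (address, alu_slot, mem_slot). Each packet occupies
    8 bytes, so packet addresses stay 64-bit aligned whenever base is.
    """
    packets, i = [], 0
    while i < len(instrs):
        alu, mem = "nop", "nop"
        if slot_kind(instrs[i]) == "alu":
            alu = instrs[i]
            i += 1
            # Fill the second slot only if the next instruction is a load/store.
            if i < len(instrs) and slot_kind(instrs[i]) == "mem":
                mem = instrs[i]
                i += 1
        else:
            mem = instrs[i]
            i += 1
        packets.append((base + 8 * len(packets), alu, mem))
    return packets

packets = pack_dual_issue(
    ["add $t0, $s0, $s1", "lw $t1, 0($s2)", "sw $t3, 4($s2)"]
)
for addr, alu, mem in packets:
    print(f"n+{addr:<3} {alu:<20} {mem}")
```

The third instruction has no ALU/branch partner, so its packet carries a nop in the first slot.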

Instruction-Level Parallelism (cont.)

• Launching multiple instructions per stage lets the instruction execution rate exceed the clock rate, i.e., CPI can be less than 1
    • So instead we use IPC: instructions per clock cycle
    • E.g., a 6 GHz, four-way multiple-issue processor can execute at a peak rate of 24 billion instructions per second, with a best-case CPI of 0.25 or a best-case IPC of 4
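The peak-rate arithmetic in the example above can be checked with a few lines of Python. The function name `peak_rate` is an assumption for illustration; the numbers are the ones from the slide.

```python
def peak_rate(clock_hz, issue_width):
    """Return (peak instructions per second, best-case IPC, best-case CPI)."""
    ipc = issue_width          # at best, every issue slot is filled each cycle
    cpi = 1 / issue_width      # CPI is the reciprocal of IPC
    return clock_hz * ipc, ipc, cpi

rate, ipc, cpi = peak_rate(6e9, 4)   # 6 GHz, four-way multiple issue
print(f"{rate / 1e9:.0f} billion instructions/s, IPC = {ipc}, CPI = {cpi}")
```

These are peak figures only. Dependences and hazards keep sustained IPC below the issue width.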

Multiple Issue Processor Styles

• Static multiple issue (aka VLIW)
    • Compiler groups instructions to be issued together
    • Packages them into “issue slots”
    • Compiler detects and avoids hazards at compile time
    • E.g., Intel Itanium and Itanium 2 for the IA-64 ISA: EPIC (Explicitly Parallel Instruction Computer)


http://i1111.photobucket.com/albums/h466/kazorptb/Informatica/Intel-itanium-2-microprocessor-chipsss.png

Multiple Issue Processor Styles

• Dynamic multiple issue (aka superscalar)
    • CPU examines the instruction stream and chooses instructions to issue each cycle
    • Compiler can help by reordering instructions
    • CPU resolves hazards using advanced techniques at run time, in hardware
    • E.g., IBM Power 2, Pentium 4, MIPS R10K


http://cdn.shopclues.net/images/detailed/316/northwoodp413micron_1361196559.jpg


Static Multiple Issue

• Compiler groups instructions into “issue packets”
    • Group of instructions that can be issued in a single cycle
    • Determined by pipeline resources required
• Think of an issue packet as a very long instruction
    • Specifies multiple concurrent operations
    • → Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue

• Compiler must remove some/all hazards
    • Reorder instructions into issue packets
    • No dependencies within a packet
    • Possibly some dependencies between packets
        • Varies between ISAs; compiler must know!
    • Pad with nop if necessary



MIPS with Static Dual Issue

[Figure 4.69, p. 324 (p. 336 in the Chinese edition): static dual-issue datapath. Surviving labels: Store, Load, effective addr., rd]


Hazards in the Dual-Issue MIPS

• More instructions executing in parallel
• EX data hazard
    • Forwarding avoided stalls with single-issue
    • Now can’t use ALU result in load/store in same packet
          add $t0, $s0, $s1
          lw  $s2, 0($t0)
    • Split into two packets, effectively a stall
• Load-use hazard
    • Still one cycle use latency, but now two instructions
• More aggressive scheduling required

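The EX data hazard above amounts to a read-after-write check between the two slots of a packet. The Python sketch below is one way to express it, under simplifying assumptions: the helper names (`regs`, `dest_and_sources`, `packet_has_raw`) are made up for this example, the first register operand is taken to be the destination, and only sw, branches, and nop are treated as writing no register.

```python
import re

NO_DEST = {"sw", "beq", "bne", "nop"}  # assumed: these write no register

def regs(text):
    """All register names ($xx) mentioned in an operand string."""
    return re.findall(r"\$\w+", text)

def dest_and_sources(instr):
    """Split an instruction into (destination register or None, source registers)."""
    op, _, operands = instr.partition(" ")
    r = regs(operands)
    if op in NO_DEST or not r:
        return None, r
    return r[0], r[1:]

def packet_has_raw(alu_instr, mem_instr):
    """True if the load/store slot reads the register the ALU slot writes."""
    dest, _ = dest_and_sources(alu_instr)
    _, sources = dest_and_sources(mem_instr)
    return dest is not None and dest in sources

# The pair from the slide: lw needs $t0, which add produces in the same packet.
print(packet_has_raw("add $t0, $s0, $s1", "lw $s2, 0($t0)"))
# An independent pair can share a packet.
print(packet_has_raw("add $t0, $s0, $s1", "lw $s2, 0($s3)"))
```

A scheduler that finds such a dependence has to place the two instructions in separate packets, which is the effective stall the slide describes.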

Scheduling Example (Fig. 4.70, p. 325; p. 338 in the Chinese edition)

• Schedule this for dual-issue MIPS

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

       ALU/branch              Load/store          Cycle
Loop:  nop                     lw  $t0, 0($s1)     1
       addi $s1, $s1, -4       nop                 2
       addu $t0, $t0, $s2      nop                 3
       bne  $s1, $zero, Loop   sw  $t0, 4($s1)     4

• The sw offset becomes 4 because addi has already decremented $s1
• IPC = 5/4 = 1.25 (cf. peak IPC = 2)


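The IPC figure for this schedule comes from counting the useful (non-nop) slots and dividing by the number of cycles. A minimal Python sketch of that count follows; the `schedule` list transcribes the table from the example, and the function name `ipc` is an assumption.

```python
# One (ALU/branch slot, load/store slot) pair per cycle, as in the example.
schedule = [
    ("nop",                  "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",    "nop"),
    ("addu $t0, $t0, $s2",   "nop"),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

def ipc(schedule):
    """Useful (non-nop) instructions issued per cycle."""
    useful = sum(slot != "nop" for packet in schedule for slot in packet)
    return useful / len(schedule)

print(ipc(schedule))  # 5 useful instructions over 4 cycles
```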


Loop Unrolling

• Replicate loop body to expose more parallelism
    • Reduces loop-control overhead
• Use different registers per replication
    • Called “register renaming”

Loop Unrolling Example (Fig. 4.71, p. 326; p. 338 in the Chinese edition)

       ALU/branch              Load/store           Cycle
Loop:  addi $s1, $s1, -16      lw  $t0, 0($s1)      1
       nop                     lw  $t1, 12($s1)     2
       addu $t0, $t0, $s2      lw  $t2, 8($s1)      3
       addu $t1, $t1, $s2      lw  $t3, 4($s1)      4
       addu $t2, $t2, $s2      sw  $t0, 16($s1)     5
       addu $t3, $t3, $s2      sw  $t1, 12($s1)     6
       nop                     sw  $t2, 8($s1)      7
       bne  $s1, $zero, Loop   sw  $t3, 4($s1)      8

• IPC = 14/8 = 1.75
    • Closer to 2, but at cost of registers and code size
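The same transformation can be sketched at a higher level. The Python below is an analogy rather than a translation of the MIPS code: it adds a scalar to every array element, once with a one-element loop body and once unrolled four times with a separate temporary per copy. The function names are assumptions, and the unrolled version assumes the array length is a multiple of 4.

```python
def add_scalar_rolled(a, s):
    """One element per iteration, as in the original loop."""
    a = list(a)
    i = len(a) - 1
    while i >= 0:
        t0 = a[i]        # lw
        t0 = t0 + s      # addu
        a[i] = t0        # sw
        i -= 1           # addi + bne: loop overhead paid for every element
    return a

def add_scalar_unrolled4(a, s):
    """Four elements per iteration; assumes len(a) is a multiple of 4."""
    a = list(a)
    i = len(a) - 1
    while i >= 0:
        # Distinct temporaries (t0..t3) keep the four copies independent,
        # mirroring the register renaming described on the slide.
        t0, t1, t2, t3 = a[i], a[i - 1], a[i - 2], a[i - 3]
        t0, t1, t2, t3 = t0 + s, t1 + s, t2 + s, t3 + s
        a[i], a[i - 1], a[i - 2], a[i - 3] = t0, t1, t2, t3
        i -= 4           # loop overhead paid once per four elements
    return a

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(add_scalar_rolled(data, 10) == add_scalar_unrolled4(data, 10))
```

Both versions produce the same result. The unrolled one trades extra temporaries and a larger body for fewer loop-control operations.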


Dynamic Multiple Issue

• “Superscalar” processors
• CPU (hardware) decides whether to issue 0, 1, 2, … instructions each cycle
    • Avoiding structural and data hazards
• Avoids the need for compiler scheduling
    • Though it may still help
    • Code semantics ensured by the CPU

Dynamic Pipeline Scheduling

• Allow the CPU to execute instructions out of order to avoid stalls
    • But commit result to registers in order
• Example
      lw   $t0, 20($s2)
      addu $t1, $t0, $t2
      sub  $s4, $s4, $t3
      slti $t5, $s4, 20
    • Can start sub while addu is waiting for lw

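To see why sub can start early in the example above, the Python sketch below builds the read-after-write dependences of the four instructions and lists which ones may begin while lw is still outstanding. The helper names (`parse`, `raw_deps`, `ready`) are assumptions. The sketch tracks only true (RAW) dependences, takes the first register operand as the destination, and does not model in-order commit or any real issue logic.

```python
import re

program = [
    "lw $t0, 20($s2)",
    "addu $t1, $t0, $t2",
    "sub $s4, $s4, $t3",
    "slti $t5, $s4, 20",
]

def parse(instr):
    """Return (dest, sources); assumes the first register is the destination."""
    r = re.findall(r"\$\w+", instr)
    return r[0], r[1:]

def raw_deps(program):
    """Map each instruction index to the earlier indices whose results it reads."""
    last_writer, deps = {}, {}
    for i, instr in enumerate(program):
        dest, sources = parse(instr)
        deps[i] = {last_writer[s] for s in sources if s in last_writer}
        last_writer[dest] = i
    return deps

def ready(program, issued, done):
    """Indices not yet issued whose producers have all finished."""
    deps = raw_deps(program)
    return [i for i in deps if i not in issued and deps[i] <= done]

# lw (index 0) has issued but its result is not back yet.
print(ready(program, issued={0}, done=set()))
# Once sub (index 2) has finished, slti becomes ready; addu still waits for lw.
print(ready(program, issued={0, 2}, done={2}))
```

In the first call only sub (index 2) is ready, because addu reads $t0 from the pending lw and slti reads $s4 from sub.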


Speculation

• Predict branch and continue issuing
    • Don’t commit until branch outcome determined
• Load speculation
    • Avoid load and cache miss delay
        • Predict the effective address
        • Predict loaded value
        • Load before completing outstanding stores
        • Bypass stored values to load unit
    • Don’t commit load until speculation cleared

Why Do Dynamic Scheduling?

• Why not just let the compiler schedule code?
• Not all stalls are predictable
    • e.g., cache misses
• Can’t always schedule around branches
    • Branch outcome is dynamically determined
• Different implementations (hardware) of an ISA have different latencies and hazards


§4.14 Concluding Remarks


Concluding Remarks

• ISA influences design of datapath and control
• Datapath and control influence design of ISA
• Pipelining improves instruction throughput using parallelism
    • More instructions completed per second
    • Latency for each instruction not reduced
• Hazards: structural, data, control
• Multiple issue and dynamic scheduling (ILP)
    • Dependencies limit achievable parallelism
    • Complexity leads to the power wall

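Homework 4: Exercises for the second half of Chapter 4 (Due in 2 weeks)

4.9 In this exercise, we examine how data dependences affect execution in the basic 5-stage pipeline described in Section 4.5. Problems in this exercise refer to the following sequence of instructions:
or $s1,$s2,$s3
or $s2,$s1,$s4
or $s1,$s1,$s2
Also, assume the following cycle times for each forwarding option:
Without forwarding   With full forwarding   With ALU-ALU forwarding only
250ps                300ps                  290ps

4.9.1 (10%) Indicate all dependences and their types.

4.9.2 (10%) Assume there is no forwarding in this pipelined processor. Indicate all hazards and add nop instructions to eliminate them.

4.9.3 (10%) Assume there is full forwarding. Indicate all hazards and add nop instructions to eliminate them.

4.9.5 (10%) Assume there is ALU-ALU forwarding only (no forwarding from the MEM stage to the EX stage). Add nop instructions to this code to eliminate hazards.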


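4.10 In this exercise, we examine how data hazards, control hazards, and instruction set architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:
sw $s2,12($s6)
lw $s2,8($s6)
beq $s5,$s4,Label # assume $s5 != $s4
add $s5,$s1,$s4
slt $s5,$s3,$s4
Assume that the individual pipeline stages have the following latencies:
IF      ID      EX      MEM     WB
200ps   120ps   150ps   190ps   100ps

4.10.1 (10%) For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we have only one memory (holding both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that has only one memory? We have seen that data hazards can be eliminated by adding nops to the code. Can you do the same with this structural hazard? Why?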

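4.13 This exercise is intended to help you understand the relationship between forwarding, hazard detection, and ISA design. Problems in this exercise refer to the following sequence of instructions, and assume that it is executed on a 5-stage pipelined datapath:
add $s5,$s2,$s1
lw $s3,4($s5)
lw $s2,0($s2)
or $s3,$s5,$s3
sw $s3,0($s5)

4.13.1 (10%) If there is no forwarding or hazard detection, insert nops to ensure correct execution.

4.13.2 (10%) Repeat 4.13.1, but now use nops only when a hazard cannot be avoided by changing or rearranging these instructions. You can assume register $s7 can be used to hold temporary values in your modified code.

4.13.3 (10%) If the processor has forwarding, but we forgot to implement the hazard detection unit, what happens when this code executes?

4.13.4 (10%) If there is forwarding, for the first five cycles during the execution of this code, specify which signals are asserted in each cycle by the hazard detection and forwarding units in Figure 4.60 (see next page).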


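Figure 4.60

4.16 This exercise examines the accuracy of various branch predictors for the following repeating pattern of branch outcomes (e.g., a loop branch): T, NT, T, T, NT.

4.16.1 (5%) What is the accuracy of the always-taken and always-not-taken predictors for this sequence of branch outcomes?

4.16.2 (5%) What is the accuracy of the 2-bit predictor for the first four branches in this pattern, assuming that the predictor starts off in the bottom-left state of Figure 4.63 (predict not taken)?

4.16.3 (10%) What is the accuracy of the 2-bit predictor if this pattern is repeated forever?

Figure 4.63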
