Pipelining Basic Concepts
Partition instruction execution into smaller functions (called stages)
and give each stage its own concurrently operating hardware, so that at
any given time several instructions are in various stages of execution.
Instruction fetch (IF) -> Operand fetch (OF) -> Execute (EX)
Typical Non-Pipelined Execution
Cycle: 1    2    3    4    5    6    7    8    9    10
IF     I0   -    -    I1   -    -    I2   -    -    I3
OF     -    I0   -    -    I1   -    -    I2   -    -
EX     -    -    I0   -    -    I1   -    -    I2   -
Time to execute n instructions: 3nt (t = time per pipeline stage)
Ideal Pipelined Execution
Cycle: 1    2    3    4    5    6    7    8    9    10
IF     I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
OF     -    I0   I1   I2   I3   I4   I5   I6   I7   I8
EX     -    -    I0   I1   I2   I3   I4   I5   I6   I7
Time to execute n instructions: (2 + n)t
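The two formulas can be checked with a quick calculation (a sketch; the
three-stage pipeline and per-stage time t follow the slides above):

```python
def non_pipelined_time(n, t, stages=3):
    # Every instruction occupies all stages before the next one starts.
    return stages * n * t

def pipelined_time(n, t, stages=3):
    # (stages - 1) cycles to fill the pipeline, then one completion per cycle.
    return (stages - 1 + n) * t

n, t = 1000, 1
print(non_pipelined_time(n, t))  # 3000
print(pipelined_time(n, t))      # 1002
# The speedup approaches the number of stages (3) as n grows.
```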
Pipeline Turbulence
Consider the presence of a branch instruction in the pipeline:
Cycle: 1    2    3    4    5    6    7    8    9    10
IF     I0   I1   I2   I3   Ibr  -    -    Ik   Ik+1 Ik+2
OF     -    I0   I1   I2   I3   Ibr  -    -    Ik   Ik+1
EX     -    -    I0   I1   I2   I3   Ibr  -    -    Ik
(fetch of the branch target Ik cannot begin until Ibr resolves in EX
during cycle 7, creating a two-cycle bubble)
Branch instructions introduce control hazards into the pipeline,
negatively impacting pipeline performance.
Multicycle Execution Units
[Figure: the DLX pipeline IF-ID-EX-MEM-WB, with the EX stage replicated
as four parallel functional units: an integer unit, an FP/integer
multiplier, an FP adder, and an FP/integer divider.]
FIGURE 3.42 The DLX pipeline with three additional unpipelined,
floating-point functional units.
SuperScalar/Multiple Issue
[Figure: an issue unit feeds an instruction buffer; instructions are
dispatched to a floating-point unit and two integer units, which reach
two memory modules through a crossbar switch.]
Hazards
Structural Hazards: arising from resource conflicts when the
hardware cannot support all possible combinations of instructions in
simultaneous overlapped execution.
Data Hazards: arising when an instruction in the pipeline depends on
the results of a previous instruction still in the pipeline.
Control Hazards: arising from the pipelining of branches or other
instructions that change the PC.
Hazards can always be resolved by stalling, but stalls reduce pipeline
efficiency.
Structural hazards can be removed by adding hardware resources, a
cost/performance tradeoff.
Load/store architectures substantially reduce the complexity of hazard
detection and resolution.
Structural Hazards
• Split I/D caches
• Pipelined Execute Units
• Buffers on Execute Units
• Hardware replication
Pipelined & Buffered Execution Units
[Figure: the DLX pipeline IF-ID-...-MEM-WB in which the single EX stage
is replaced by an integer unit (EX), a fully pipelined FP/integer
multiplier (stages M1-M7), a pipelined FP adder (stages A1-A4), and an
unpipelined FP/integer divider (DIV).]
FIGURE 3.44 A pipeline that supports multiple outstanding FP operations.
Data Hazards
RAW: read after write (an instruction reads an operand before an earlier
instruction has written it; a true dependence)
WAW: write after write (only present when writes occur at different
stages in a pipeline)
WAR: write after read (only possible when writes may occur earlier than
some reads)
Note: RAR (read after read) is not a hazard.
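A minimal sketch of detecting the three hazard types between a pair of
instructions, each represented as a destination register plus its source
registers (the register-tuple encoding is an illustrative assumption):

```python
def classify_hazards(first, second):
    """Return the data hazards 'second' has on an earlier instruction 'first'.

    Each instruction is (dest_reg, [src_regs]); dest_reg may be None
    (e.g. for branches, which write no register).
    """
    d1, srcs1 = first
    d2, srcs2 = second
    hazards = []
    if d1 in srcs2:                       # second reads what first writes
        hazards.append("RAW")
    if d1 == d2 and d1 is not None:       # both write the same register
        hazards.append("WAW")
    if d2 in srcs1:                       # second writes what first reads
        hazards.append("WAR")
    return hazards

# ADD r1, r2, r3  followed by  SUB r4, r1, r5  -> RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))  # ['RAW']
```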
Data Hazards: Potential Solutions
• Pipeline interlocking
• Forwarding
• Compiler optimizations
Pipeline Interlocking
• Stall pipeline until data hazard is eliminated
• Loses performance benefits of pipelining
Pipeline Interlocking
Assume RAW conflict between I3 and I4.
Cycle: 1    2    3    4    5    6    7    8    9    10
IF     I0   I1   I2   I3   I4   I5   I5   I6   I7   I8
OF     -    I0   I1   I2   I3   I4   I4   I5   I6   I7
EX     -    -    I0   I1   I2   I3   -    I4   I5   I6
(I4 is held in OF for one cycle until I3's result is available; the
bubble propagates through EX, and I5 repeats in IF during the stall)
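The cost of interlock stalls can be folded into the earlier timing
formula; a sketch (the stall count is assumed to be given):

```python
def pipelined_time_with_stalls(n, t, stalls, stages=3):
    # Each stall inserts one bubble: time = (stages - 1 + n + stalls) * t
    return (stages - 1 + n + stalls) * t

# One RAW stall, as in the diagram above, adds exactly one cycle:
print(pipelined_time_with_stalls(9, 1, 1))  # 12 instead of the ideal 11
```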
Data Forwarding
• Organize the data path with routes back from later pipe stages into
earlier pipe stages.
• Efficient operation is best supported when operand encoding is simple,
so source and destination register fields are easy to locate and compare.
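The comparison the forwarding hardware performs can be sketched as
follows (rd/rs1/rs2 follow the slides' register-field naming; the
register-file dict is an illustrative assumption):

```python
def forward_operands(ex_rd, ex_result, of_rs1, of_rs2, regfile):
    # Compare the destination (rd) of the instruction in EX against the
    # sources (rs1, rs2) of the instruction in OF; on a match, bypass
    # the register file and use the in-flight result directly.
    op1 = ex_result if of_rs1 == ex_rd else regfile[of_rs1]
    op2 = ex_result if of_rs2 == ex_rd else regfile[of_rs2]
    return op1, op2

regs = {"r1": 10, "r2": 20}
# Instruction in EX is writing 99 into r1; instruction in OF reads r1, r2.
print(forward_operands("r1", 99, "r1", "r2", regs))  # (99, 20)
```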
Data Forwarding
[Figure: the rd field of the instruction in the Execute stage is compared
against the rs1 and rs2 fields of the instruction in Operand fetch; on a
match, the Execute-stage result is forwarded directly to Operand fetch.]
Static Scheduling
• A compiler optimization that orders instructions to avoid data hazards.
• Organize code into basic blocks (single entry/single exit).
• Within each block, schedule instructions to minimize data hazards
between adjacent instructions.
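A toy sketch of scheduling within a basic block: greedily pick a ready
instruction that does not read the result of the instruction just placed
(the (dest, srcs) tuples, single-assignment assumption, and greedy
heuristic are all illustrative assumptions, not the slides' algorithm):

```python
def schedule(block):
    """Greedy list scheduling within a basic block (illustrative sketch).

    Each instruction is (dest, srcs); registers are assumed written at
    most once in the block, so dependences form a DAG.
    """
    producers = {d for d, _ in block}
    remaining, done, out = list(block), set(), []
    while remaining:
        # Candidates: every in-block producer of the sources is scheduled.
        cands = [(d, ss) for d, ss in remaining
                 if all(s in done or s not in producers for s in ss)]
        # Prefer a candidate that does not read the previous result.
        pick = next((c for c in cands if not (out and out[-1][0] in c[1])),
                    cands[0])
        out.append(pick)
        done.add(pick[0])
        remaining.remove(pick)
    return out

# Two dependent chains (r0->r1->r2 and r3->r4->r5) get interleaved.
block = [("r1", ["r0"]), ("r2", ["r1"]), ("r4", ["r3"]), ("r5", ["r4"])]
print([d for d, _ in schedule(block)])  # ['r1', 'r4', 'r2', 'r5']
```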
Dynamic Scheduling
• Scoreboarding
• Tomasulo’s Algorithm
• VLIW
An Architecture for Scoreboarding
[Figure: registers and data buses serve an integer unit, two FP
multipliers, an FP divider, and an FP adder; the scoreboard exchanges
control/status information with the registers and every functional unit.]
FIGURE 4.3 The basic structure of a DLX processor with a scoreboard.
An Architecture for Tomasulo
[Figure: a floating-point operation queue (fed from the instruction unit)
and six load buffers (fed from memory) supply the FP registers; operand
buses deliver values to three reservation stations in front of the FP
adders and two in front of the FP multipliers; three store buffers hold
results bound for memory; every unit broadcasts its result on the common
data bus (CDB).]
FIGURE 4.8 The basic structure of a DLX FP unit using Tomasulo's
algorithm.
Structural/Data Hazards and the Original Pentium
[Figure: an issue unit feeds an instruction buffer; instructions are
dispatched to a floating-point unit and two integer units, which reach
two memory modules through a crossbar switch.]
Control Hazards
• Evaluate branches earlier
• Stalls/pipelined interlocks
• Delayed Branch
• Statically predict taken/not-taken & flush when not
• Canceling (or nullifying) branch: compiler prediction/flush
• Dynamic (hardware) prediction
• Conditional Instructions
Delayed Branch
Redefine the architecture so that a branch takes effect only after the n
instructions that follow it.
• Where to get instructions to fill the branch delay slot?
  – A no-op
  – From before the branch
  – From after the branch (from both destinations or, if needed/possible,
    only one)
• Compiler effectiveness
  – Fills about 60% of delay slots (when there is one slot)
  – About 80% of the instructions placed in delay slots do useful work
• Complications: deep pipelines and superscalar implementations
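The "from before the branch" strategy can be sketched as follows: move
the latest instruction that neither the branch nor any later instruction
depends on into the delay slot (the (dest, srcs) encoding and the
single-delay-slot assumption are illustrative, not the slides' method):

```python
def fill_delay_slot(block, branch):
    """Try to fill the delay slot with an instruction from before the branch.

    block: (dest, srcs) instructions preceding the branch;
    branch: (None, srcs) conditional branch reading some registers.
    Returns (shortened block, delay-slot instruction or None).
    """
    _, br_srcs = branch
    for i in range(len(block) - 1, -1, -1):
        dest, srcs = block[i]
        later = block[i + 1:]
        safe = (dest not in br_srcs                    # branch doesn't need it
                and all(dest != d and dest not in ss   # no later use of dest
                        and d not in srcs              # sources not overwritten
                        for d, ss in later))
        if safe:
            return block[:i] + later, block[i]
    return block, None  # no safe candidate: fall back to a no-op

before = [("r5", ["r6"]), ("r2", ["r1"])]
print(fill_delay_slot(before, (None, ["r2"])))
# the r5 instruction is independent of the branch condition and moves down
```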
Predict Taken/Not-Taken, Statically or Dynamically
Assume I3 is a conditional branch, resolved in the EX stage.
Correct Prediction:
Cycle: 1    2    3    4    5    6    7    8    9    10
IF     I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
OF     -    I0   I1   I2   I3   I4   I5   I6   I7   I8
EX     -    -    I0   I1   I2   I3   I4   I5   I6   I7
Incorrect Prediction:
Cycle: 1    2    3    4    5    6    7    8    9    10
IF     I0   I1   I2   I3   I4   I5   Ik   Ik+1 Ik+2 Ik+3
OF     -    I0   I1   I2   I3   I4   -    Ik   Ik+1 Ik+2
EX     -    -    I0   I1   I2   I3   -    -    Ik   Ik+1
(the wrongly fetched I4 and I5 are squashed when the branch resolves at
the end of cycle 6)
Branch Prediction
[Figure: a four-state machine with two "predict taken" states and two
"predict not taken" states; each taken outcome moves one state toward
strongly taken, each not-taken outcome one state toward strongly not
taken, so two consecutive mispredictions are needed to change the
prediction.]
FIGURE 4.13 The states in a two-bit prediction scheme.
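The two-bit scheme can be modeled as a saturating counter (a minimal
sketch; the 0-3 state encoding and initial state are implementation
choices, not taken from the figure):

```python
class TwoBitPredictor:
    # States 0,1 predict not taken; states 2,3 predict taken.
    # Two consecutive mispredictions are needed to flip the prediction.
    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, False, True]:   # e.g. a loop branch
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 3 of 4 predicted correctly: one not-taken does not flip it
```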
Branch Prediction w/ History
[Figure: four low-order bits of the branch address select an entry of
four 2-bit per-branch predictors; a 2-bit global branch history selects
which of the four predictors provides the prediction.]
FIGURE 4.20 A (2,2) branch-prediction buffer uses a two-bit global
history to choose from among four predictors for each branch address.
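The (2,2) lookup can be sketched as a table of saturating counters: the
branch address picks the row, the global history picks the column (the
table size, initial counter values, and modulo indexing are illustrative
assumptions):

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 global history bits choose among four 2-bit
    counters per branch-address entry."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [[2] * 4 for _ in range(entries)]  # start weakly taken
        self.history = 0  # outcomes of the last two branches, as 2 bits

    def predict(self, pc):
        return self.table[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        h = self.history
        row[h] = min(3, row[h] + 1) if taken else max(0, row[h] - 1)
        self.history = ((h << 1) | int(taken)) & 0b11

# An alternating taken/not-taken branch is learned perfectly, because
# each history pattern gets its own counter.
p = CorrelatingPredictor()
for i in range(20):
    p.update(0x40, i % 2 == 0)
```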
Branch Target Buffers
[Figure: the PC of the instruction to fetch is compared against the
entries of the branch-target buffer. If no entry matches, the
instruction is not predicted to be a branch and fetch proceeds normally.
If an entry matches, the instruction is a branch, the stored predicted
PC is used as the next PC, and the branch is predicted taken or untaken.]
FIGURE 4.22 A branch-target buffer.
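The fetch-stage lookup can be sketched with a dictionary standing in for
the buffer's tag-matched entries (the dict, the 4-byte instruction size,
and the fall-through computation are modeling assumptions):

```python
def next_fetch_pc(pc, btb, instr_size=4):
    """Return the PC to fetch next: the predicted target on a BTB hit,
    otherwise the fall-through PC."""
    if pc in btb:            # entry found: predicted-taken branch
        return btb[pc]       # use the stored predicted PC as the next PC
    return pc + instr_size   # not predicted to be a branch: proceed normally

btb = {0x1000: 0x2000}       # branch at 0x1000 predicted to jump to 0x2000
print(hex(next_fetch_pc(0x1000, btb)))  # 0x2000
print(hex(next_fetch_pc(0x1004, btb)))  # 0x1008
```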
Step-by-step use of Branch Target Buffers
[Figure: IF - send the PC to memory and to the branch-target buffer; if
an entry is found, send out the predicted PC. ID - determine whether the
fetched instruction is actually a taken branch. EX - if the branch was
not in the buffer but is taken, enter its PC and next PC into the
branch-target buffer; if it was in the buffer but mispredicted, kill the
fetched instruction, restart the fetch at the other target, and delete
the entry from the target buffer; if correctly predicted, continue
execution with no stalls.]
FIGURE 4.23 The steps involved in handling an instruction with a
branch-target buffer.
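The EX-stage cases can be sketched as follows (the dict-based buffer,
boolean flags, and the 2-cycle penalty value are modeling assumptions):

```python
def resolve_branch(btb, pc, was_in_btb, predicted_taken, actually_taken,
                   target):
    """EX-stage BTB bookkeeping; returns the fetch penalty in cycles."""
    if not was_in_btb:
        if actually_taken:
            btb[pc] = target   # enter branch PC and next PC into the buffer
            return 2           # fall-through was fetched; restart at target
        return 0               # normal instruction execution
    if predicted_taken == actually_taken:
        return 0               # correctly predicted: continue with no stalls
    # Mispredicted: kill the fetched instruction, restart at the other target.
    del btb[pc]                # delete the entry from the target buffer
    return 2
```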
Exceptions
1. Synchronous versus Asynchronous
2. User requested versus coerced
3. User maskable (or not)
4. Within or between instructions
5. Resumption versus Termination
6. Precise/Imprecise