Advanced Topics in Computer Architecture
• Pipelining
• Superscalar processors
• Multithreading (explicit and implicit)
• Multicore Machines
• Clusters
Superscalar Processors
• Definition of Superscalar
• Design Issues:
- Instruction Issue Policy
- Register renaming
- Machine parallelism
- Branch Prediction
- Execution
• Pentium 4 example
What is Superscalar?
A Superscalar machine executes multiple independent
instructions in parallel.
They are pipelined as well.
• “Common” instructions (arithmetic, load/store, conditional branch)
can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in
RISC machines.
• The order of execution is usually assisted by the compiler.
Example of Superscalar Organization
• 2 Integer ALU pipelines,
• 2 FP ALU pipelines,
• 1 memory pipeline (?)
Superscalar v Superpipelined
Limitations of Superscalar
• Dependent upon:
- Instruction level parallelism possible
- Compiler based optimization
- Hardware support
• Limited by
— Data dependency
— Procedural dependency
— Resource conflicts
(Recall) True Data Dependency
(Must W before R)
ADD r1, r2 r1+r2 🡺 r1
MOVE r3, r1 r1 🡺 r3
• Can fetch and decode second instruction in parallel with
first
LOAD r1, X x (memory) 🡺 r1
MOVE r3, r1 r1🡺 r3
• Can NOT execute second instruction until first is
finished
Second instruction is dependent on first (R after W)
(Recall) Antidependency (Must R before W)
ADD R4, R3, 1 R3 + 1 🡺 R4
ADD R3, R5, 1 R5 + 1 🡺 R3
• Cannot complete the second instruction before the first has
read R3
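The two register dependencies recalled above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` record and the `classify` helper are hypothetical names, and each instruction is assumed to have one destination register and a set of source registers. The output (write after write) case is included because it appears later under register renaming.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Hypothetical record: one destination register and a set of source registers."""
    text: str
    dest: str
    srcs: frozenset

def classify(first: Instr, second: Instr) -> set:
    """Return the register dependencies of `second` on `first` (program order)."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("true (read after write)")
    if second.dest in first.srcs:
        kinds.add("anti (write after read)")
    if first.dest == second.dest:
        kinds.add("output (write after write)")
    return kinds

# The examples from the slides above.
add = Instr("ADD r1, r2", dest="r1", srcs=frozenset({"r1", "r2"}))
move = Instr("MOVE r3, r1", dest="r3", srcs=frozenset({"r1"}))
i1 = Instr("ADD R4, R3, 1", dest="R4", srcs=frozenset({"R3"}))
i2 = Instr("ADD R3, R5, 1", dest="R3", srcs=frozenset({"R5"}))

true_dep = classify(add, move)   # MOVE reads r1 after ADD writes it
anti_dep = classify(i1, i2)      # second ADD writes R3 after the first reads it
print(true_dep)
print(anti_dep)
```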
(Recall) Procedural Dependency
• Can’t execute instructions after a branch in parallel
with instructions before a branch, because?
Note: Also, if instruction length is not fixed,
instructions have to be decoded to find out how many
fetches are needed
(Recall) Resource Conflict
• Two or more instructions requiring access to the
same resource at the same time
— e.g. two arithmetic instructions need the ALU
• Solution - Can possibly duplicate resources
— e.g. have two arithmetic units
Effect of Dependencies on Superscalar Operation
Notes:
1) Superscalar operation is doubly impacted by a stall.
2) CISC machines typically have variable-length instructions, which must be at least
partially decoded before the next can be fetched. This is not good for superscalar operation.
Degree of Instruction-Level Parallelism
• Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.
• Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot be handled in parallel.
The “degree” of instruction-level parallelism is determined by the
number of instructions that can be executed in parallel without
stalling for dependencies
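One rough way to estimate the degree of parallelism in the two examples above is to assign each instruction the earliest step allowed by its true dependencies. This is only a sketch under stated assumptions: the two-operand instructions are read as "first operand is both destination and source", STO is read as a store with no register destination, and `schedule_levels` is a hypothetical helper that ignores resource conflicts and antidependencies.

```python
def schedule_levels(instrs):
    """Assign each instruction the earliest step allowed by true dependencies.

    instrs: list of (dest, srcs) tuples in program order; dest may be None.
    Only true (read-after-write) register dependencies are modelled.
    """
    last_writer_level = {}   # register -> step of the latest instruction writing it
    levels = []
    for dest, srcs in instrs:
        level = 0
        for reg in srcs:
            if reg in last_writer_level:
                level = max(level, last_writer_level[reg] + 1)
        levels.append(level)
        if dest is not None:
            last_writer_level[dest] = level
    return levels

# LOAD R1, R2 / ADD R3, 1 / ADD R4, R2  (assumed operand meaning, see lead-in)
independent = [("R1", ["R2"]), ("R3", ["R3"]), ("R4", ["R4", "R2"])]
# ADD R3, 1 / ADD R4, R3 / STO (R4), R0
dependent = [("R3", ["R3"]), ("R4", ["R4", "R3"]), (None, ["R4", "R0"])]

independent_levels = schedule_levels(independent)   # all three share step 0
dependent_levels = schedule_levels(dependent)       # a chain: one instruction per step
print(independent_levels, dependent_levels)
```

All three instructions of the first group land in the same step, while the second group forms a chain of three steps.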
Instruction Issue Policies
• Order in which instructions are fetched
• Order in which instructions are executed
• Order in which instructions update registers and
memory values (order of completion)
Standard Categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
In-Order Issue -- In-Order Completion
Issue instructions in the order they occur:
• Not very efficient
• Instructions must stall if necessary (and stalling in
superpipelining is expensive)
In-Order Issue -- In-Order Completion
(Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
In-Order Issue -- Out-of-Order Completion
(Example)
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
How does this affect interrupts?
Out-of-Order Issue -- Out-of-Order Completion
• Decouple decode pipeline from execution pipeline
• Can continue to fetch and decode until the “window”
is full
• When a functional unit becomes available, an
instruction can be executed (usually staying as close to
program order as possible)
• Since instructions have been decoded, processor can
look ahead
Out-of-Order Issue -- Out-of-Order Completion
(Example)
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
Note: I5 depends upon I4, but I6 does not
Register Renaming
to avoid hazards
• Output and antidependencies occur because register
contents may not reflect the correct ordering from the
program
• Can require a pipeline stall
• One solution: Allocate Registers dynamically
(renaming registers)
Register Renaming example
Add R3, R3, R5 R3b:=R3a + R5a (I1)
Add R4, R3, 1 R4b:=R3b + 1 (I2)
Add R3, R5, 1 R3c:=R5a + 1 (I3)
Add R7, R3, R4 R7b:=R3c + R4b (I4)
• A name without a subscript refers to the logical register named in the
instruction
• A name with a subscript refers to the hardware register allocated to hold a new value:
R3a R3b R3c
Note: Allocating R3c for I3 avoids the antidependency on I2
and the output dependency on I1
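The renaming in this example can be reproduced with a small sketch. It is illustrative only: the `rename` function is a hypothetical helper, and the letter suffixes simply mimic the slide's notation. Real hardware would draw from a pool of physical registers rather than count versions.

```python
import string

def rename(program, registers):
    """Rename destinations so that every write gets a fresh hardware register.

    program: list of (dest, src1, src2); a source is a register name or an int immediate.
    Returns strings in the slide's notation, e.g. 'R3b:=R3a + R5a'.
    """
    version = {r: 0 for r in registers}   # logical register -> current version index

    def current(operand):
        if isinstance(operand, int):
            return str(operand)
        return operand + string.ascii_lowercase[version[operand]]

    renamed = []
    for dest, src1, src2 in program:
        a, b = current(src1), current(src2)   # read sources before renaming the destination
        version[dest] += 1                    # allocate a new hardware register for the write
        renamed.append(f"{current(dest)}:={a} + {b}")
    return renamed

program = [
    ("R3", "R3", "R5"),   # I1
    ("R4", "R3", 1),      # I2
    ("R3", "R5", 1),      # I3
    ("R7", "R3", "R4"),   # I4
]
result = rename(program, ["R3", "R4", "R5", "R7"])
for line in result:
    print(line)
```

Because I3 writes R3c instead of overwriting R3b, it no longer has to wait for I2 to read R3b or for I1 to finish writing it.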
Recapping: Machine Parallelism Support
• Duplication of Resources
• Out of order issue hardware
• Windowing to decouple execution from decode
• Register Renaming capability
Speedups of Machine Organizations
(Without Procedural Dependencies)
• Not worth duplication of functional units without register renaming
• Need instruction window large enough (more than 8, probably not more than 32)
Branch Prediction in Superscalar Machines
• Delayed branch not used much. Why?
Multiple instructions need to execute in the delay slot.
This leads to much complexity in recovery.
• Branch prediction should be used - Branch history is
very useful
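As an illustration of why branch history helps, here is a sketch of a 2-bit saturating counter predictor. This is a common textbook scheme chosen for the example, not one the slides prescribe. The class name, the starting counter value, and the loop pattern are all assumptions.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (0-1: not taken, 2-3: taken)."""

    def __init__(self):
        self.counters = {}   # branch address -> counter value 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # assumed start: weakly not taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

# A loop-closing branch: taken 9 times, then falls through; the loop runs 3 times.
predictor = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3
correct = 0
for taken in outcomes:
    if predictor.predict(0x400) == taken:
        correct += 1
    predictor.update(0x400, taken)
print(correct, "of", len(outcomes), "predicted correctly")
```

After warming up, the predictor misses only the single loop exit in each pass, because one not-taken outcome is not enough to flip the prediction.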
View of Superscalar Execution
Committing or Retiring Instructions
Results need to be put into order (commit or retire)
• Results sometimes must be held in temporary storage
until it is certain they can be placed in “permanent”
storage.
(either committed or retired/flushed)
• Temporary storage requires regular clean up –
overhead – done in hardware.
Superscalar Hardware Support
• Facilities to simultaneously fetch multiple
instructions
• Logic to determine true dependencies involving
register values and Mechanisms to communicate
these values
• Mechanisms to initiate multiple instructions in
parallel
• Resources for parallel execution of multiple
instructions
• Mechanisms for committing process state in correct
order
Example: Pentium 4
A Superscalar CISC Machine
Pentium 4 alternate view
Pentium 4 pipeline
20 stages!
a) Generation of Micro-ops (stages 1 &2)
• Using the Branch Target Buffer and Instruction Translation
Lookaside Buffer, the x86 instructions are fetched 64 bytes at a
time from the L2 cache
• The instruction boundaries are determined and instructions decoded
into 1-4 118-bit RISC micro-ops
• Micro-ops are stored in the trace cache
b) Trace cache next instruction pointer (stage 3)
• The Trace Cache Branch Target Buffer contains dynamically
gathered history information (4-bit tag)
• If target is not in BTB
- Branch not PC relative: predict branch taken if it is a return, predict
not taken otherwise
- For PC relative backward conditional branches, predict taken;
otherwise predict not taken
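The fallback rules for a branch that misses in the BTB can be written as a small decision function. This sketch follows the rules exactly as the slide states them. The function name and parameters are hypothetical, and it says nothing about how the real hardware encodes these cases.

```python
def static_predict(pc_relative, is_return=False, is_conditional=False, backward=False):
    """Static prediction used when the branch has no BTB entry.

    Returns True for 'predict taken', False for 'predict not taken'.
    """
    if not pc_relative:
        # Not PC relative: taken only if it is a return.
        return is_return
    # PC relative: taken only for backward conditional branches (typical loop ends).
    return is_conditional and backward

return_branch = static_predict(pc_relative=False, is_return=True)
indirect_jump = static_predict(pc_relative=False)
loop_branch = static_predict(pc_relative=True, is_conditional=True, backward=True)
forward_branch = static_predict(pc_relative=True, is_conditional=True, backward=False)
print(return_branch, indirect_jump, loop_branch, forward_branch)
```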
c) Trace Cache fetch (stage 4)
• Orders micro-ops in program-ordered sequences called traces
• These are fetched in order, subject to branch prediction
• Some instructions require many micro-ops (complex CISC instructions).
These sequences are coded into the ROM and fetched from the ROM
d) Drive (stage 5)
• Delivers instructions from the Trace Cache to the
Rename/Allocator module for reordering
e) Allocate: register renaming (stages 6, 7, & 8)
• Allocates resources for execution (3 micro-ops arrive per clock cycle):
- Each micro-op is allocated a slot in the 126-position circular Reorder Buffer (ROB), which
tracks the progress of the micro-ops.
Buffer entries include:
- State: scheduled, dispatched, completed, or ready for retirement
- Address of the instruction that generated the micro-op
- Operation
- Each reference to one of the 16 architectural registers is assigned an alias register
(128 alias registers are available) to remove false data dependencies
• The micro-ops are dispatched out of order as resources are available
• Allocates an entry to one of the 2 scheduler queues - memory access or not
• The micro-ops are retired in order from the ROB
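The combination of out-of-order completion and in-order retirement can be sketched with a simplified reorder buffer. This is a toy model under stated assumptions: `ReorderBuffer` is a hypothetical class, a `deque` stands in for the circular buffer, only two of the states are modelled, and the retire width of 3 per cycle is an assumption chosen to match the 3 micro-ops arriving per clock.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: micro-ops enter in program order and leave in program order."""

    def __init__(self, capacity=126):
        self.capacity = capacity
        self.entries = deque()   # oldest micro-op on the left

    def allocate(self, name):
        if len(self.entries) >= self.capacity:
            return False         # buffer full: the allocator would have to stall
        self.entries.append({"name": name, "state": "scheduled"})
        return True

    def complete(self, name):
        for entry in self.entries:
            if entry["name"] == name:
                entry["state"] = "completed"
                return
        raise KeyError(name)

    def retire(self, max_per_cycle=3):
        """Retire completed micro-ops from the head only, so order is preserved."""
        retired = []
        while (self.entries and len(retired) < max_per_cycle
               and self.entries[0]["state"] == "completed"):
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer()
for uop in ("u1", "u2", "u3"):
    rob.allocate(uop)
rob.complete("u3")            # u3 and u2 finish before u1 (out of order)
rob.complete("u2")
blocked = rob.retire()        # nothing retires: the oldest micro-op is not done
rob.complete("u1")
in_order = rob.retire()       # now all three retire, oldest first
print(blocked, in_order)
```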
f) Micro-op queuing (stage 9)
• Micro-ops are loaded into one of 2 queues:
- one for memory operations
- one for non memory operations
• Each queue operates on a FIFO policy
g) Micro-op scheduling (stages 10, 11, & 12) and h) Dispatch (stages 13 & 14)
• The 2 schedulers retrieve micro-ops based upon having all
the operands ready and dispatch them to an available unit (up
to 6 per clock cycle)
• If two micro-ops need the same unit, they are dispatched in
sequence.
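The scheduling rule above (dispatch a micro-op once all its operands are ready and a unit is free) can be sketched for one clock cycle. The sketch is heavily simplified: it uses a single pool instead of two scheduler queues, and `dispatch_cycle`, the unit names, and the register names are hypothetical.

```python
def dispatch_cycle(queue, ready_regs, free_units, max_dispatch=6):
    """Dispatch, oldest first, micro-ops with ready operands and a free unit.

    queue is modified in place; returns the names dispatched this cycle.
    """
    dispatched = []
    units = set(free_units)
    for uop in list(queue):                      # iterate over a copy; queue changes below
        if len(dispatched) == max_dispatch:
            break
        operands_ready = all(src in ready_regs for src in uop["srcs"])
        if operands_ready and uop["unit"] in units:
            units.remove(uop["unit"])            # each unit takes one micro-op per cycle
            queue.remove(uop)
            dispatched.append(uop["name"])
    return dispatched

queue = [
    {"name": "u1", "unit": "alu", "srcs": ["r1"]},
    {"name": "u2", "unit": "alu", "srcs": ["r2"]},   # needs the same unit as u1
    {"name": "u3", "unit": "fp", "srcs": ["r9"]},    # operand r9 is not ready
    {"name": "u4", "unit": "mem", "srcs": ["r1"]},
]
ready = {"r1", "r2"}
first_cycle = dispatch_cycle(queue, ready, ["alu", "fp", "mem"])
second_cycle = dispatch_cycle(queue, ready, ["alu", "fp", "mem"])
print(first_cycle, second_cycle)
```

Here u1 and u2 compete for the same unit and therefore go in sequence, while u4 overtakes the stalled u3.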
i) Register file (stages 15 & 16) and j) Execute: flags (stages 17 & 18)
• The register files are the sources for pending fixed-point and floating-point (FP)
operations
• A separate stage is used to compute the flags
k) Branch check (stage 19) and l) Branch check results (stage 20)
• Checks flags and compares results with predictions
• If the branch prediction was wrong:
- all incorrect micro-ops must be flushed (don’t want to be wrong!)
- the correct branch destination is provided to the Branch Predictor
- the pipeline is restarted from the new target address
Definitions of Threads and Processes
• Process:
- An instance of a program running on a computer
• Thread: a dispatchable unit of work within a process
- Includes a processor context (which includes the program
counter and stack pointer) and a data area for the stack
- Threads execute sequentially, but are interruptible
– the processor can turn to another thread
• Thread switch
- Switching the processor between threads within the same process
– Typically less costly than a process switch
Implicit and Explicit Multithreading
• Explicit multithreading is the concurrent execution of
instructions from different explicit threads
- Instructions are interleaved from different threads on
shared pipelines or executed in parallel on separate pipelines
• Implicit multithreading is the concurrent execution of
multiple threads extracted from a single sequential
program
- Implicit threads are defined statically by the compiler or
dynamically by hardware
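Interleaving instructions from explicit threads on one shared pipeline can be pictured with a round-robin sketch. This is an illustration only: `interleave` is a hypothetical helper, one instruction is issued per cycle, and a finished thread is simply skipped.

```python
def interleave(threads):
    """Issue one instruction per cycle, rotating round-robin over the threads."""
    queues = [list(t) for t in threads]   # copy so the callers' lists are untouched
    issued = []
    while any(queues):
        for q in queues:
            if q:                         # skip threads that have run out of instructions
                issued.append(q.pop(0))
    return issued

thread_a = ["A1", "A2", "A3"]
thread_b = ["B1", "B2"]
issue_order = interleave([thread_a, thread_b])
print(issue_order)
```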
Scalar Threading
Multiple Instruction Issue Threading
Parallel Diagram
Multicore Organization Alternatives
Intel x86 Multicore Organization
Core i7
• Released November 2008
• Speculative pre-fetch for caches
• Simultaneous multi-threading (SMT)
- 4 SMT cores, each supporting 2 threads, so the chip appears as 8 logical cores
• On-chip DDR3 memory controller
- Three 8-byte channels (192 bits) giving 32 GB/s
• QuickPath Interconnect
- Cache-coherent point-to-point link
- High-speed communication between processor chips
– 6.4 G transfers per second, 16 bits per transfer in each direction
– Total bandwidth 25.6 GB/s (12.8 GB/s in each direction)
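The bandwidth figures above can be checked with a few lines of arithmetic. The QuickPath numbers come from the slide; the 25.6 GB/s total counts both directions of the link. The DDR3 transfer rate of 1333 MT/s is an assumption (the slide does not state it), chosen because it gives roughly the quoted 32 GB/s.

```python
GB = 1e9  # decimal gigabyte, as used in bandwidth figures

# QuickPath link: figures from the slide.
qpi_transfers_per_s = 6.4e9
qpi_bytes_per_transfer = 16 / 8                                         # 16 bits = 2 bytes
qpi_one_way = qpi_transfers_per_s * qpi_bytes_per_transfer / GB         # GB/s per direction
qpi_total = 2 * qpi_one_way                                             # both directions

# DDR3 controller: three 8-byte channels; 1333 MT/s is an ASSUMED transfer rate.
ddr3_transfers_per_s = 1333e6
ddr3_gb_s = 3 * 8 * ddr3_transfers_per_s / GB

print(f"QPI: {qpi_one_way:.1f} GB/s per direction, {qpi_total:.1f} GB/s total")
print(f"DDR3: {ddr3_gb_s:.1f} GB/s")
```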
Intel Core i7 Block Diagram
.3 ns/B !
Intel Core i7
approx 45x45 mm
45 nm feature size
Parallel Processor Architecture Summary
Very Tightly Coupled Tightly Coupled Moderately Coupled
MultiCore Organization
(Very Tightly Coupled or Single Processor)
Symmetric Multiprocessor (SMP) Organization
(Tightly Coupled)
Non-Uniform Memory Access (NUMA) Organization
(Moderately Coupled)
Cluster Organization
(Loosely Coupled)