S6 - Advanced Topics in Computer Architecture

Computer architecture nsu cse332

Uploaded by

hasan.bannah

Advanced Topics in Computer Architecture

• Pipelining
• Superscalar processor
• Multithreading (explicit and implicit)
• Multicore Machines
• Clusters

Superscalar Processors

• Definition of Superscalar
• Design Issues:
- Instruction Issue Policy
- Register renaming
- Machine parallelism
- Branch Prediction
- Execution
• Pentium 4 example
What is Superscalar?

A superscalar machine executes multiple independent instructions in parallel.
Its execution units are pipelined as well.

• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in RISC machines.
• The order of execution is usually assisted by the compiler.

Example of Superscalar Organization

• 2 Integer ALU pipelines,

• 2 FP ALU pipelines,
• 1 memory pipeline (?)
Superscalar v Superpipelined
Limitations of Superscalar

• Dependent upon:
- Instruction level parallelism possible
- Compiler based optimization
- Hardware support

• Limited by
— Data dependency
— Procedural dependency
— Resource conflicts
(Recall) True Data Dependency
(Must W before R)

ADD r1, r2      r1 + r2 → r1
MOVE r3, r1     r1 → r3
• Can fetch and decode the second instruction in parallel with the first

LOAD r1, X      X (memory) → r1
MOVE r3, r1     r1 → r3
• Cannot execute the second instruction until the first is finished
Second instruction is dependent on the first (R after W)
(Recall) Antidependency (Must R before W)

ADD R4, R3, 1   R3 + 1 → R4
ADD R3, R5, 1   R5 + 1 → R3

• Cannot complete the second instruction before the first has read R3
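A minimal sketch, not taken from the slides, of how the two dependency types above can be detected by comparing the registers each instruction reads and writes. The names `Instr` and `hazards` are hypothetical, and the two-operand ADD is assumed to read both operands and write the first. The write-after-write check is included for completeness.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset  # registers written by the instruction
    reads: frozenset   # registers read by the instruction


def hazards(first: Instr, second: Instr) -> List[str]:
    """List the dependencies of `second` on `first` (first is earlier in program order)."""
    found = []
    if first.writes & second.reads:
        found.append("RAW")  # true data dependency: must write before read
    if first.reads & second.writes:
        found.append("WAR")  # antidependency: must read before write
    if first.writes & second.writes:
        found.append("WAW")  # output dependency: writes must stay in order
    return found


# The two examples from the slides, encoded by hand.
add = Instr("ADD r1, r2", frozenset({"r1"}), frozenset({"r1", "r2"}))
move = Instr("MOVE r3, r1", frozenset({"r3"}), frozenset({"r1"}))
add_r4 = Instr("ADD R4, R3, 1", frozenset({"R4"}), frozenset({"R3"}))
add_r3 = Instr("ADD R3, R5, 1", frozenset({"R3"}), frozenset({"R5"}))

print(hazards(add, move))       # true data dependency
print(hazards(add_r4, add_r3))  # antidependency
```

The ADD/MOVE pair reports a true dependency, and the ADD R4/ADD R3 pair reports an antidependency.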
(Recall) Procedural Dependency

• Can’t execute instructions after a branch in parallel with instructions before a branch, because?

Note: Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
(Recall) Resource Conflict

• Two or more instructions requiring access to the same resource at the same time
  - e.g. two arithmetic instructions need the ALU

• Solution - Can possibly duplicate resources
  - e.g. have two arithmetic units
Effect of Dependencies on Superscalar Operation

Notes:
1) Superscalar operation is doubly impacted by a stall.
2) CISC machines typically have variable-length instructions, which need to be at least partially decoded before the next can be fetched – not good for superscalar operation
Degree of Instruction-Level Parallelism
• Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.

• Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot be handled in parallel.

The “degree” of instruction-level parallelism is determined by the number of instructions that can be executed in parallel without stalling for dependencies
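The two sequences above can be checked mechanically. This is a sketch under stated assumptions: the instructions are read as two-operand forms (ADD R3, 1 reads and writes R3; LOAD R1, R2 reads R2 as an address and writes R1; STO (R4), R0 reads R4 and R0 and writes only memory), and only true data dependencies are tracked. The function names are hypothetical.

```python
def independent(group):
    """True when no instruction reads a register written by an earlier one."""
    written = set()
    for dest, sources in group:
        if written & set(sources):
            return False
        if dest is not None:
            written.add(dest)
    return True


def longest_chain(group):
    """Length of the longest chain of true dependencies in the group."""
    depth = {}  # register -> chain depth of the instruction that last wrote it
    longest = 0
    for dest, sources in group:
        d = 1 + max((depth.get(s, 0) for s in sources), default=0)
        if dest is not None:
            depth[dest] = d
        longest = max(longest, d)
    return longest


# Each entry is (destination register or None, source registers).
first_sequence = [("R1", ["R2"]), ("R3", ["R3"]), ("R4", ["R4", "R2"])]
second_sequence = [("R3", ["R3"]), ("R4", ["R4", "R3"]), (None, ["R4", "R0"])]

print(independent(first_sequence), longest_chain(first_sequence))
print(independent(second_sequence), longest_chain(second_sequence))
```

A chain length of 1 means all three instructions can execute together. A chain length of 3 means the three instructions must execute one after another.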
Instruction Issue Policies

• Order in which instructions are fetched

• Order in which instructions are executed
• Order in which instructions update registers and
memory values (order of completion)

Standard Categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
In-Order Issue -- In-Order Completion

Issue instructions in the order they occur:

• Not very efficient
• Instructions must stall if necessary (and stalling in superpipelining is expensive)
In-Order Issue -- In-Order Completion
(Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
In-Order Issue -- Out-of-Order Completion
(Example)

Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit

How does this affect interrupts?

Out-of-Order Issue -- Out-of-Order Completion

• Decouple decode pipeline from execution pipeline
• Can continue to fetch and decode until the “window” is full
• When a functional unit becomes available, an instruction can be executed (usually as close to program order as possible)
• Since instructions have been decoded, processor can look ahead
Out-of-Order Issue -- Out-of-Order Completion
(Example)

Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit

Note: I5 depends upon I4, but I6 does not

• Output and antidependencies occur because register contents may not reflect the correct ordering from the program
• Can require a pipeline stall
• One solution: Allocate registers dynamically (renaming registers)
Register Renaming example

Add R3, R3, R5    R3b := R3a + R5a    (I1)
Add R4, R3, 1     R4b := R3b + 1      (I2)
Add R3, R5, 1     R3c := R5a + 1      (I3)
Add R7, R3, R4    R7b := R3c + R4b    (I4)

• A name without a subscript refers to the logical register in the instruction
• A name with a subscript is the hardware register allocated: R3a, R3b, R3c

Note: R3c avoids the antidependency on I2 and the output dependency on I1
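The renaming in this example can be reproduced with a small rename table. This is an illustrative sketch, not how any particular processor implements it: `rename` is a hypothetical function, every register is assumed to start at version "a", and a fresh version is allocated on each write.

```python
from string import ascii_lowercase


def rename(program, registers):
    """Rewrite (dest, src1, src2) additions using a fresh hardware register per write."""
    version = {r: 0 for r in registers}  # 0 -> 'a', 1 -> 'b', ...

    def current(operand):
        if operand in version:
            return operand + ascii_lowercase[version[operand]]
        return operand  # immediates such as "1" are left untouched

    renamed = []
    for dest, src1, src2 in program:
        # Sources must be looked up BEFORE the destination gets its new version.
        a, b = current(src1), current(src2)
        version[dest] += 1
        renamed.append(f"{current(dest)} := {a} + {b}")
    return renamed


program = [
    ("R3", "R3", "R5"),  # I1
    ("R4", "R3", "1"),   # I2
    ("R3", "R5", "1"),   # I3
    ("R7", "R3", "R4"),  # I4
]
result = rename(program, ["R3", "R4", "R5", "R7"])
for line in result:
    print(line)
```

Because I3 writes R3c rather than reusing R3b, it no longer has to wait for I2 to read R3b or for I1 to finish writing it.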
Recapping: Machine Parallelism Support

• Duplication of Resources

• Out of order issue hardware

• Windowing to decouple execution from decode

• Register Renaming capability

Speedups of Machine Organizations
(Without Procedural Dependencies)

• Not worth duplication of functional units without register renaming

• Need instruction window large enough (more than 8, probably not more than 32)
Branch Prediction in Superscalar Machines

• Delayed branch not used much. Why?
  Multiple instructions need to execute in the delay slot.
  This leads to much complexity in recovery.

• Branch prediction should be used - branch history is very useful
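One common way to use branch history is a 2-bit saturating counter per branch. The slides do not say which scheme is meant, so this is a sketch of that one technique only. The class name, the starting state (weakly not taken) and the branch address are all assumptions.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2  # assumed start: weakly not taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def correct_predictions(outcomes, pc=0x40):
    """Count how many outcomes of one branch the predictor gets right."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# A loop-closing branch: taken 9 times, then not taken, for 3 runs of the loop.
loop_branch = ([True] * 9 + [False]) * 3
print(correct_predictions(loop_branch), "of", len(loop_branch))
```

After warm-up, the only misprediction per run of the loop is the final exit, because one not-taken outcome is not enough to flip the prediction.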
View of Superscalar Execution
Committing or Retiring Instructions

Results need to be put into order (commit or retire)

• Results sometimes must be held in temporary storage until it is certain they can be placed in “permanent” storage (they are then either committed/retired or flushed)
• Temporary storage requires regular clean-up (overhead), done in hardware.
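A minimal sketch of in-order commit, assuming a simple queue as the temporary storage. The `ReorderBuffer` class and its method names are hypothetical and do not describe a real design: results may complete in any order, but they reach the permanent register state only from the head of the queue.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # program order, oldest entry on the left
        self.registers = {}     # committed ("permanent") state

    def issue(self, tag, dest):
        entry = {"tag": tag, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # The result is held in temporary storage until it can be committed.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished entries from the head only, preserving program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired


rob = ReorderBuffer()
i1 = rob.issue("I1", "R1")
i2 = rob.issue("I2", "R2")
rob.complete(i2, 7)
early = rob.commit()  # I2 has finished but must wait behind I1
rob.complete(i1, 3)
late = rob.commit()   # both retire now, in program order
print(early, late, rob.registers)
```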
Superscalar Hardware Support
• Facilities to simultaneously fetch multiple instructions
• Logic to determine true dependencies involving register values
• Mechanisms to initiate multiple instructions in parallel
• Resources for parallel execution of multiple instructions
• Mechanisms for committing process state in correct order
Example: Pentium 4
A Superscalar CISC Machine
Pentium 4 alternate view
Pentium 4 pipeline

20 stages !
a) Generation of Micro-ops (stages 1 & 2)

• Using the Branch Target Buffer and Instruction Translation Lookaside Buffer, the x86 instructions are fetched 64 bytes at a time from the L2 cache
• The instruction boundaries are determined and instructions are decoded into 1-4 118-bit RISC micro-ops
• Micro-ops are stored in the trace cache

b) Trace cache next instruction pointer (stage 3)

• The Trace Cache Branch Target Buffer contains dynamically gathered history information (4-bit tag)
• If the target is not in the BTB:
  - Branch not PC relative: predict taken if it is a return, predict not taken otherwise
  - For PC-relative backward conditional branches, predict taken; otherwise not taken
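The fallback rule used when the target is not in the BTB can be written as a small decision function. This sketch encodes only the two bullets above. The function name and its boolean parameters are hypothetical, not part of any Intel interface.

```python
def static_predict_taken(pc_relative, is_return=False, conditional=False, backward=False):
    """Static prediction for a branch that misses in the BTB."""
    if not pc_relative:
        # Not PC relative: predict taken only for returns.
        return is_return
    # PC relative: predict taken only for backward conditional branches
    # (typically the branch that closes a loop).
    return conditional and backward


print(static_predict_taken(False, is_return=True))                   # return
print(static_predict_taken(True, conditional=True, backward=True))   # loop-closing branch
print(static_predict_taken(True, conditional=True, backward=False))  # forward branch
```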
c) Trace Cache fetch (stage 4)

• Orders micro-ops in program-ordered sequences called traces
• These are fetched in order, subject to branch prediction
• Some instructions require many micro-ops (complex CISC instructions). These are coded into the ROM and fetched from the ROM
d) Drive (stage 5)

• Delivers instructions from the Trace Cache to the Rename/Allocator module for reordering
e) Allocate: register renaming (stages 6, 7, & 8)

• Allocates resources for execution (3 micro-ops arrive per clock cycle):
  - Each micro-op is allocated to a slot in the 126-position circular Reorder Buffer (ROB), which tracks the progress of the micro-ops.
    Buffer entries include:
    - State: scheduled, dispatched, completed, ready for retire
    - Address that generated the micro-op
    - Operation
  - Alias registers are assigned for one of the 16 architectural registers (128 alias registers) to remove data dependencies
• The micro-ops are dispatched out of order as resources are available
• Allocates an entry to one of the 2 scheduler queues (memory access or not)
• The micro-ops are retired in order from the ROB
f) Micro-op queuing (stage 9)

• Micro-ops are loaded into one of 2 queues:

- one for memory operations
- one for non memory operations
• Each queue operates on a FIFO policy
g) Micro-op scheduling (stages 10, 11, & 12)
h) Dispatch (stages 13 & 14)

• The 2 schedulers retrieve micro-ops based upon having all the operands ready and dispatch them to an available unit (up to 6 per clock cycle)
• If two micro-ops need the same unit, they are dispatched in sequence.
i) Register file (stages 15 & 16)
j) Execute: flags (stages 17 & 18)

• The register files are the sources for pending fixed-point and FP operations
• A separate stage is used to compute the flags

k) Branch check (stage 19)
l) Branch check results (stage 20)

• Checks flags and compares results with predictions
• If the branch prediction was wrong:
  - all incorrect micro-ops must be flushed (don’t want to be wrong!)
  - the correct branch destination is provided to the Branch Predictor
  - the pipeline is restarted from the new target address
Definitions of Threads and Processes

• Process:
  - An instance of a program running on a computer

• Thread: dispatchable unit of work within a process
  - Includes processor context (which includes the program counter and stack pointer) and data area for stack
  - Threads execute sequentially, but are interruptible: the processor can turn to another thread

• Thread switch
  - Switching the processor between threads within the same process
  - Typically less costly than a process switch
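A small software illustration of these definitions, assuming Python's standard `threading` module. The `worker` and `run` names are hypothetical. Several threads run inside one process and update the same object, and a lock guards each update because a thread can be interrupted between reading and writing the shared value.

```python
import threading

counter = {"value": 0}   # data owned by the process, visible to every thread
lock = threading.Lock()


def worker(iterations):
    # Each thread has its own stack and program counter but shares `counter`.
    for _ in range(iterations):
        with lock:  # protect the read-modify-write against a thread switch
            counter["value"] += 1


def run(num_threads=4, iterations=1000):
    counter["value"] = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


print(run())
```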
Implicit and Explicit Multithreading

• Explicit multithreading is the concurrent execution of instructions from different explicit threads
  - Instructions are interleaved from different threads on shared pipelines or executed in parallel on separate pipelines

• Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program
  - Implicit threads are defined statically by the compiler or dynamically by hardware
Scalar Threading
Multiple Instruction Issue Threading
Parallel Diagram
Multicore Organization Alternatives
Intel x86 Multicore Organization
Core i7

• Released November 2008

• Speculative pre-fetch for caches

• Simultaneous multi-threading (SMT)

- 4 SMT cores, each supporting 2 threads → appears as 8 logical cores

• On chip DDR3 memory controller

— Three 8 byte channels (192 bits) giving 32GB/s

• QuickPath Interconnection
— Cache coherent point-to-point link
— High speed communications between processor chips
– 6.4G transfers per second, 16 bits per transfer
– Total bandwidth 25.6GB/s
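The QuickPath figure can be checked with a line of arithmetic. The sketch below assumes the quoted total counts both directions of the link (12.8 GB/s each way). The function names are hypothetical. It also derives the per-channel transfer rate implied by the 32 GB/s memory figure rather than assuming one.

```python
def qpi_bandwidth_gb_s(transfers_per_s=6.4e9, bits_per_transfer=16, directions=2):
    """Total link bandwidth in GB/s; directions=2 is an assumption (both ways counted)."""
    bytes_per_transfer = bits_per_transfer / 8
    return transfers_per_s * bytes_per_transfer * directions / 1e9


def implied_memory_transfer_rate(total_gb_s=32, channels=3, bytes_per_channel=8):
    """Transfers per second each channel would need to reach the quoted total."""
    return total_gb_s * 1e9 / (channels * bytes_per_channel)


print(qpi_bandwidth_gb_s())              # about 25.6 GB/s
print(implied_memory_transfer_rate())    # about 1.33e9 transfers per second
```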
Intel Core i7 Block Diagram

.3 ns/B !
Intel Core i7

approx 45x45 mm
45 nm feature size
Parallel Processor Architecture Summary

Very Tightly Coupled, Tightly Coupled, Moderately Coupled

MultiCore Organization
(Very tightly Coupled or Single Processor)
Symmetric Multiprocessor (SMP) Organization
(Tightly Coupled)
Non-Uniform Memory Access (NUMA) Organization
(Moderately Coupled)
Cluster Organization
(Loosely Coupled)
