
Computer Science 146

Computer Architecture
Spring 2004
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Lecture 11: Software Pipelining and Global Scheduling

Lecture Outline
Review of Loop Unrolling
Software Pipelining
Global Scheduling
Trace Scheduling, Superblocks

Next Time
Hardware-Assisted, Software ILP
Conditional, Predicated Instructions
Compiler Speculation with Hardware Support

Hardware vs. Software comparison

Itanium Implementation

Compiler Loop Unrolling

1. Check that it is OK to move the S.D after the DSUBUI and BNEZ, and find the amount to adjust the S.D offset
2. Determine that unrolling the loop would be useful by finding that the loop iterations were independent
3. Rename registers to avoid name dependencies
4. Eliminate extra test and branch instructions and adjust the loop termination and iteration code
5. Determine that loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent
   This requires analyzing memory addresses and finding that they do not refer to the same address.
6. Schedule the code, preserving any dependences needed to yield the same result as the original code

Loop Unrolling Limitations

Decrease in amount of overhead amortized per unroll
Diminishing returns in reducing loop overheads

Growth in code size
Can hurt instruction-fetch performance

Loop Unrolling Problem

Every unrolled loop iteration requires the pipeline to fill and drain
This occurs m/n times if the loop has m iterations and is unrolled n times

[Figure: overlapped ops vs. time; labels: "Proportional to Number of Unrolls", "Overlap between Unrolled iterations"]

More Advanced Technique: Software Pipelining

Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations
Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)

[Figure: iterations 0 through 4 of the original loop overlapped in time; one software-pipelined iteration cuts across them]

Software Pipelining
Now must optimize inner loop
Want to do as much work as possible in each iteration
Keep all of the functional units busy in the processor

for(j = 0; j < MAX; j++)
    C[j] += A * B[j];

Dataflow graph: load B[j] and A feed *; the product and load C[j] feed +; + feeds store C[j]

Software Pipelining Example

for(j = 0; j < MAX; j++)
    C[j] += A * B[j];

[Figure: two schedules of the loop above. Not pipelined: load B[j], load C[j], *, +, store C[j] run back to back for each iteration. Pipelined: a Fill phase, a Steady State in which loads, multiply, add, and store from different iterations overlap, and a Drain phase.]

Software Pipelining Example

Before: Unrolled 3 times
 1  L.D     F0,0(R1)
 2  ADD.D   F4,F0,F2
 3  S.D     0(R1),F4
 4  L.D     F6,-8(R1)
 5  ADD.D   F8,F6,F2
 6  S.D     -8(R1),F8
 7  L.D     F10,-16(R1)
 8  ADD.D   F12,F10,F2
 9  S.D     -16(R1),F12
10  DSUBUI  R1,R1,#24
11  BNEZ    R1,LOOP

After: Software Pipelined
 1  S.D     0(R1),F4     ; Stores M[i]
 2  ADD.D   F4,F0,F2     ; Adds to M[i-1]
 3  L.D     F0,-16(R1)   ; Loads M[i-2]
 4  DSUBUI  R1,R1,#8
 5  BNEZ    R1,LOOP

5 cycles per iteration

Symbolic Loop Unrolling
Maximize result-use distance
Less code space than unrolling
Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling

[Figure: overlapped ops vs. time, one plot for the SW pipeline and one for the unrolled loop]

Software Pipelining vs. Loop Unrolling

Software pipelining is symbolic loop unrolling
Consumes less code space

Actually they are targeting different things
Both provide a better scheduled inner loop
Loop Unrolling
Targets loop overhead code (branch/counter update code)
Software Pipelining
Targets the time when the pipeline is filling and draining

Best performance can come from doing both

When Safe to Unroll Loop?

Example: Where are data dependencies? (A, B, C distinct & nonoverlapping)
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a loop-carried dependence: a dependence between iterations

For our prior example, each iteration was distinct

Implies that iterations can't be executed in parallel?

VLIW vs. SuperScalar

Superscalar processors decide on the fly how many instructions to issue
HW complexity of deciding the number of instructions to issue: O(n^2)

Proposal: Allow compiler to schedule instruction level parallelism explicitly
Format the instructions in a potential issue packet so that HW need not check explicitly for dependences

VLIW: Very Long Instruction Word

Each instruction has explicit coding for multiple operations
In IA-64, grouping called a packet

Tradeoff instruction space for simple decoding
Slots are available for many ops in the instruction word
By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide

Need compiling technique that schedules across several branches (discussed next time)

Recall: Unrolled Loop that Minimizes Stalls for Scalar

 1  Loop:  L.D     F0,0(R1)
 2         L.D     F6,-8(R1)
 3         L.D     F10,-16(R1)
 4         L.D     F14,-24(R1)
 5         ADD.D   F4,F0,F2
 6         ADD.D   F8,F6,F2
 7         ADD.D   F12,F10,F2
 8         ADD.D   F16,F14,F2
 9         S.D     0(R1),F4
10         S.D     -8(R1),F8
11         S.D     -16(R1),F12
12         DSUBUI  R1,R1,#32
13         BNEZ    R1,LOOP
14         S.D     8(R1),F16     ; 8-32 = -24

L.D to ADD.D: 1 Cycle
ADD.D to S.D: 2 Cycles

14 clock cycles, or 3.5 per iteration

Loop Unrolling in VLIW

Clock  Memory reference 1  Memory reference 2  FP operation 1    FP op. 2          Int. op/branch
1      L.D F0,0(R1)        L.D F6,-8(R1)
2      L.D F10,-16(R1)     L.D F14,-24(R1)
3      L.D F18,-32(R1)     L.D F22,-40(R1)     ADD.D F4,F0,F2    ADD.D F8,F6,F2
4      L.D F26,-48(R1)                         ADD.D F12,F10,F2  ADD.D F16,F14,F2
5                                              ADD.D F20,F18,F2  ADD.D F24,F22,F2
6      S.D 0(R1),F4        S.D -8(R1),F8       ADD.D F28,F26,F2
7      S.D -16(R1),F12     S.D -24(R1),F16
8      S.D -32(R1),F20     S.D -40(R1),F24                                         DSUBUI R1,R1,#48
9      S.D -0(R1),F28                                                              BNEZ R1,LOOP

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)

Software Pipelining with Loop Unrolling in VLIW

Clock  Memory reference 1  Memory reference 2  FP operation 1    FP op. 2  Int. op/branch
1      L.D F0,-48(R1)      ST 0(R1),F4         ADD.D F4,F0,F2
2      L.D F6,-56(R1)      ST -8(R1),F8        ADD.D F8,F6,F2              DSUBUI R1,R1,#24
3      L.D F10,-40(R1)     ST 8(R1),F12        ADD.D F12,F10,F2            BNEZ R1,LOOP

Software pipelined across 9 iterations of original loop

In each iteration of above loop, we:
Store to m, m-8, m-16 (iterations I-3, I-2, I-1)
Compute for m-24, m-32, m-40 (iterations I, I+1, I+2)
Load from m-48, m-56, m-64 (iterations I+3, I+4, I+5)

9 results in 9 cycles, or 1 clock per iteration
Average: 3.3 ops per clock, 66% efficiency
Note: Need fewer registers for software pipelining (only using 7 registers here, was using 15)

Global Scheduling
Previously we focused on loop-level parallelism
Unrolling, Software Pipelining + scheduling work well
These work best on single basic blocks (repeatable schedules)
Basic block: single-entry, single-exit instruction sequence

What about internal control flow?
What about if-branches instead of loop-branches?

Global Scheduling
How to move computation and assignment of B[i]?
Relative execution frequency?
How cheap to execute B[i] above the branch?
How much benefit to executing B[i] early? (critical path?)
What is the cost of compensation code for the else case?

What about moving C[i]?

Static Branch Prediction

Simplest: Predict taken
Misprediction rate = untaken branch frequency => for SPEC programs is 34%.
Range is quite large though (from not very accurate (59%) to highly accurate (9%))

Predict on the basis of branch direction? (P6 on BTB miss)
Choosing backward-going branches to be taken (loop)
Forward-going branches to be not taken (if)
In SPEC programs, however, most forward-going branches are taken => predict taken is better

Predict branches on the basis of profile information collected from earlier runs
Misprediction varies from 5% to 22%

Trace Scheduling
Parallelism across IF branches vs. LOOP branches?
Two steps:
Trace Selection
Find a likely sequence of basic blocks (trace), statically predicted or profile predicted, that forms a long sequence of straight-line code
Trace Compaction
Squeeze trace into few VLIW instructions
Need bookkeeping code in case prediction is wrong

This is a form of compiler-generated speculation
Compiler must generate fixup code to handle cases in which trace is not the taken branch
Needs extra registers: undoes bad guess by discarding

Trace Scheduling
Use loop unrolling, static branch prediction to generate long traces
Trace scheduling: bookkeeping code is needed when code is moved across trace entry and exit points

Superblocks
Fixes a major drawback of trace scheduling
Entries and exits in the middle of the trace are complicated

Superblocks
Use a similar process as trace generation, but superblocks are restricted to a single entry point with multiple exit points
Scheduling (compaction) is simpler
Only code motion across exits must be considered
Only one entrance?
Tail duplication is used to create a separate block that corresponds to the portion of the trace after entry

What if branches are not statically predictable?

Loop unrolling, trace scheduling work great when branches are fairly predictable statically
Same thing with memory reference dependencies
Compiler speculation is needed to solve this
Conditional/predicated instructions: if-conversion
Hardware support for exception/memory-dependence checks

Hardware Support for Exposing More Parallelism at Compile-Time

Conditional or Predicated Instructions
Conditional instruction execution

Full predication: every instruction has a predicate tag (IA64)
Conditional Moves (Alpha, IA32, etc.)

if (r3 == 0) r1 = r2

With a branch:
    BNEZ R3, L
    ADDU R1, R2, R0
L:

With a conditional move:
    cmoveqz r1, r2, r3

Schedule for next few lectures

Next Time (Mar. 17th) HW#3 Due Friday
Hardware support for software-ILP
Itanium (IA64) case study

Review for midterm (Mar 22nd)

Midterm March 24th
