Pipelining
&
Instruction Level Parallelism
Mr. A. B. Shinde
Assistant Professor,
Electronics Engineering,
P.V.P.I.T., Budhgaon
Contents…
2
 Fundamentals of computer design:
 Measuring and reporting performance
 Instruction level parallelism and its
exploitation
 Concepts and challenges
 Overcoming data hazards with
dynamic scheduling
 Dynamic branch prediction
 Speculation
 Multiple issue processors
Early Computers
3
First Generation (1940-1956): Vacuum Tubes
Second Generation (1956-1963): Transistors
Third Generation (1964-1971): Integrated Circuits
Fourth Generation (1971-Present): Microprocessors
Fifth Generation (Present and Beyond): Artificial Intelligence
4
Fundamentals of Computer Design
 Computer technology has made incredible progress over the last 60
years.
 This improvement has come:
 From advances in the technology used to build computers, and
 From innovation in computer design.
 During the first 25 years of electronic computers, both forces made a
major contribution, delivering performance improvement of about 25% per year.
 In the late 1970s, after the emergence of the microprocessor, the rate of
improvement rose to roughly 35% per year.
5
Fundamentals of Computer Design
 Two significant changes in the computer marketplace made it easier to
succeed commercially with a new architecture:
 The virtual elimination of assembly language programming
reduced the need for object-code compatibility.
 The creation of standardized, vendor-independent operating
systems, such as UNIX and its clone Linux, lowered the cost and risk of
bringing out a new architecture.
 These changes made it possible to develop a new set of architectures with simpler
instructions, called RISC (Reduced Instruction Set Computer)
architectures, in the early 1980s.
 The RISC-based machines focused on two performance techniques: the
exploitation of instruction-level parallelism and the use of caches.
 The RISC-based computers raised the performance bar.
6
Fundamentals of Computer Design
Growth in processor performance since the mid-1980s.
7
Classes of Computers
 The 1980s: desktop computers based on microprocessors emerged,
in the form of both personal computers and workstations.
 The 1990s: emergence of the Internet and the World Wide Web
(WWW).
 The 2000s: cell phones became widespread, with rapid improvements in
functions and sales.
 More recent applications use embedded computers.
8
Classes of Computers
 Desktop Computing:
 Desktop computing spans from low-end systems to high-end, heavily
configured workstations.
 The desktop market tends to be driven to optimize price-performance.
 Desktop computing is reasonably well characterized in terms of
applications and benchmarking.
9
Classes of Computers
 Servers:
 Servers are used to provide larger-scale and more reliable file and
computing services.
 Consider the servers running Google, taking orders for Cisco, or
running auctions on eBay. Failure of such server systems is far more
catastrophic than failure of a single desktop, since these servers must
operate 24 x 7.
10
Classes of Computers
 Servers:
 Servers are designed for efficient throughput, measured in terms of
transactions per minute or Web pages served per second.
 Supercomputers are the most expensive computers and emphasize
floating-point performance.
 Clusters of desktop computers have largely overtaken this class of
computer.
11
Classes of Computers
 Clusters/Warehouse-Scale Computers
 The growth of Software as a Service (SaaS) for applications like
search, social networking, video sharing, multiplayer games, online
shopping, and so on has led to the growth of a class of computers
called clusters.
 Clusters are collections of desktop computers or servers
connected by local area networks to act as a single larger computer.
 Each node runs its own operating system, and nodes communicate
using a networking protocol.
 The largest of the clusters are called warehouse-scale computers
(WSCs), in that they are designed so that tens of thousands of servers
can act as one.
12
Classes of Computers
 Embedded Computers
 Embedded computers are the fastest growing computer market.
 They range from everyday machines (most microwaves, washing machines, printers,
networking switches, and all cars contain simple embedded microprocessors) to
handheld digital devices, such as cell phones and smart cards, to video
games and digital set-top boxes.
 Embedded applications often need to minimize memory and minimize power.
13
Classes of Parallelism and Parallel Architectures
 Parallelism at multiple levels is now the driving force of computer
design across all four classes of computers.
 There are basically two kinds of parallelism in applications:
1. Data-Level Parallelism (DLP): Arises because there are many data
items that can be operated on at the same time.
2. Task-Level Parallelism (TLP): Arises because tasks of work are
created that can operate independently and largely in parallel.
14
Classes of Parallelism and Parallel Architectures
 Data parallelism:
 Consider a 2-processor system (CPUs A and B) in a parallel
environment, where we wish to perform a task on some data 'd'.
 It is possible to tell CPU A to do that task on one part of 'd' and CPU
B on another part simultaneously, thereby reducing the duration of the
execution.
 The data can be assigned using conditional statements.
 As a specific example, consider adding two matrices.
In a data parallel implementation, CPU A could add all elements from
the top half of the matrices, while CPU B could add all elements from
the bottom half of the matrices.
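The matrix addition example can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the names `add_rows` and `parallel_matrix_add` are hypothetical, and the two "CPUs" are modelled as two worker threads. It shows how the rows are partitioned between workers; in standard CPython, threads will not necessarily deliver a real speedup for this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def add_rows(a, b, start, stop):
    """Add rows [start, stop) of two equally sized matrices."""
    return [[x + y for x, y in zip(a[r], b[r])] for r in range(start, stop)]


def parallel_matrix_add(a, b, workers=2):
    """Give each worker one contiguous block of rows and add the blocks concurrently."""
    n = len(a)
    # Worker w handles rows [w*n/workers, (w+1)*n/workers): top half, bottom half, ...
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = pool.map(lambda se: add_rows(a, b, se[0], se[1]), bounds)
        return [row for block in blocks for row in block]


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[10, 20], [30, 40], [50, 60], [70, 80]]
result = parallel_matrix_add(a, b)
print(result)
```

Every worker runs the same operation (`add_rows`) on a different part of the data, which is the defining property of data parallelism.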
15
Classes of Parallelism and Parallel Architectures
 Task parallelism:
 Task parallelism (function parallelism or control parallelism) is a
form of parallelization of computer code across multiple processors in
parallel computing environments.
 Task parallelism focuses on distributing execution processes
(threads) across different parallel computing nodes.
 In a multiprocessor system, task parallelism is achieved when each
processor executes a different thread (or process) on the same or
different data.
16
Classes of Parallelism and Parallel Architectures
 Task parallelism
 As a simple example, suppose we are running code on a 2-processor system
(CPUs "a" and "b") in a parallel environment and we wish to do
tasks "A" and "B".
 It is possible to tell CPU "a" to do task "A" and CPU "b" to do task "B"
simultaneously, thereby reducing the runtime of the execution.
 The tasks can be assigned using conditional statements.
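A minimal sketch of the two-task example, assuming two worker threads stand in for CPUs "a" and "b". The functions `task_a`, `task_b` and `run_tasks` are hypothetical names chosen for illustration; the point is only that two different tasks are handed to two different workers.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a(data):
    """Task "A": sum the values."""
    return sum(data)


def task_b(data):
    """Task "B": find the largest value."""
    return max(data)


def run_tasks(data):
    """Submit two different tasks to two workers so they can proceed concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(task_a, data)
        future_b = pool.submit(task_b, data)
        return future_a.result(), future_b.result()


total, largest = run_tasks([3, 1, 4, 1, 5, 9, 2, 6])
print(total, largest)
```

In contrast to the data-parallel case, the workers here execute different code; whether they share the same data or not is incidental.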
17
Defining Computer Architecture
 The computer designer faces the following problems:
 Maximizing performance while staying within cost, power, and
availability constraints.
 Covering instruction set design, functional organization, logic design, and
implementation.
 The implementation may encompass integrated circuit design,
packaging, power, and cooling.
18
Defining Computer Architecture
 Instruction Set Architecture:
 Class of ISA:
Nearly all ISAs today are classified as general-purpose register architectures, where
the operands are either registers or memory locations.
All recent ISAs are load-store architectures, in which memory is accessed only by load and store instructions.
 Memory addressing:
All desktop and server computers use byte addressing to access memory
operands.
19
Defining Computer Architecture
 Instruction Set Architecture:
 Addressing modes: In addition to specifying registers and constant operands,
addressing modes specify the address of a memory object.
1. MIPS addressing modes are Register, Immediate (for constants),
and Displacement.
2. The 80x86 supports those three plus three variations of
displacement.
 Types and sizes of operands: Like most ISAs, MIPS and 80x86
support operand sizes of 8-bit, 16-bit, 32-bit, 64-bit, and IEEE 754
floating point in 32-bit and 64-bit.
(MIPS stands for Microprocessor without Interlocked Pipeline Stages; it is a RISC architecture.)
20
Defining Computer Architecture
 Instruction Set Architecture:
 Operations: The general categories of operations are data transfer,
arithmetic/logical, control, and floating point.
 Control flow instructions: Virtually all ISAs support conditional branches,
unconditional jumps, procedure calls, and returns.
 Encoding an ISA: There are two basic choices on encoding:
 fixed length and variable length.
 All MIPS instructions are 32 bits long.
 Variable-length instructions can take less space than fixed-length
instructions.
21
Trends in Technology
 Integrated circuit logic technology:
Transistor density increases by about 35% per year. Increases in die
size are less predictable and slower. The combined effect is a growth
rate in transistor count on a chip of about 40% to 55% per year.
 Semiconductor DRAM: Capacity increases by about 40% per year,
doubling roughly every two years.
 Magnetic disk technology: Prior to 1990, density increased by
about 30% per year, doubling in three years. It rose to 60% per year
thereafter, and increased to 100% per year in 1996. Since 2004, it
has dropped back to 30% per year.
 Network technology: Network performance depends both on the performance of switches
and on the performance of the transmission system.
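The link between an annual growth rate and a doubling time can be checked with a short calculation. This is a sketch that assumes steady compound growth, which the slides do not state explicitly; `doubling_time_years` is a hypothetical helper name.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a compound annual growth rate (0.40 means 40% per year)."""
    return math.log(2) / math.log(1 + annual_growth)


for label, rate in [("DRAM capacity", 0.40), ("Disk density before 1990", 0.30)]:
    years = doubling_time_years(rate)
    print(f"{label}: {rate:.0%} per year doubles in about {years:.1f} years")
```

Under this assumption, 40% per year doubles in a little over two years, and 30% per year in somewhat under three, consistent with the rounded figures quoted above.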
22
Trends in Power in Integrated
Circuits
 First, power must be brought in and distributed around the chip.
 Second, power is dissipated as heat and must be removed.
 For CMOS chips, the dominant energy consumption has traditionally been in switching
transistors, also called dynamic power.
 The power required per transistor is proportional to the product of the
load capacitance of the transistor, the square of the voltage, and the
frequency of switching:
$$\text{Power}_{\text{dynamic}} = \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$
 Mobile devices care about battery life more than power, so energy is
the proper metric, measured in joules:
$$\text{Energy}_{\text{dynamic}} = \text{Capacitive load} \times \text{Voltage}^2$$
23
Trends in Power in Integrated
Circuits
 Example: Some microprocessors today are designed to have
adjustable voltage, so that a 15% reduction in voltage may result in a
15% reduction in frequency. What would be the impact on dynamic
power?
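 Answer: Since the capacitance is unchanged, the answer is the ratio
of the voltages and frequencies:
$$\frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{old}}} = \frac{(\text{Voltage} \times 0.85)^2 \times (\text{Frequency switched} \times 0.85)}{\text{Voltage}^2 \times \text{Frequency switched}} = 0.85^3 \approx 0.61$$
thereby reducing power to about 60% of the original.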
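The same arithmetic can be written as a small function. This sketch relies only on the proportionality stated above (dynamic power scales with capacitance, voltage squared, and switching frequency); `dynamic_power_ratio` is a hypothetical name.

```python
def dynamic_power_ratio(voltage_scale, frequency_scale):
    """Ratio of new to old dynamic power when the capacitance is unchanged.

    Dynamic power is proportional to C * V^2 * f, so scaling V and f
    multiplies the power by voltage_scale^2 * frequency_scale.
    """
    return voltage_scale ** 2 * frequency_scale


ratio = dynamic_power_ratio(0.85, 0.85)
print(f"New dynamic power is {ratio:.2f} of the original")
```

Because voltage enters squared, lowering the voltage is a more effective lever on dynamic power than lowering the frequency alone.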
24
Trends in Power in Integrated
Circuits
 The increase in the number of transistors switching, and the
frequency of switching, dominates the decrease in load capacitance
and voltage, leading to an overall growth in power consumption and
energy.
 Power is now the major limitation, so most microprocessors
today turn off the clock of inactive modules to save energy and dynamic
power. For example, if no floating-point instructions are executing, the clock of the
floating-point unit is disabled.
 Although dynamic power is the primary source of power dissipation in
CMOS, static power is becoming an important issue because leakage
current flows even when a transistor is off.
 Static power is calculated by:
$$\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$$
25
Trends in Cost
 Although there are computers where cost tends to be less important,
specifically supercomputers, cost-sensitive designs are of growing significance.
 In the past 20 years:
The use of technology improvements to lower cost, as well as to increase
performance, has been a major theme in the computer industry.
 An understanding of cost and its factors is therefore essential for
designers to make intelligent decisions about whether or not a new
feature should be included in designs.
26
Trends in Cost:
 The Impact of Time, Volume, and Commodification
 Cost of a manufactured computer component decreases over time
even without major improvements in the implementation technology.
 One example is that the price per megabyte of DRAM has dropped
over the long term by 40% per year.
 Volume is a second key factor in determining cost.
 Increasing volumes affect cost in several ways.
 First, they decrease the time needed to get down the learning curve, which is
partly proportional to the number of systems manufactured.
 Second, volume decreases cost, since it increases purchasing and manufacturing
efficiency.
27
Trends in Cost:
 Cost of an Integrated Circuit:
 Although the costs of integrated circuits have dropped exponentially, the
basic process of silicon manufacture is unchanged.
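 The cost of a packaged integrated circuit is
$$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$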
28
Trends in Cost:
 Example: Find the number of dies per 300 mm (30 cm) wafer for a die
that is 1.5 cm on a side.
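 Answer: The die area is 2.25 cm². The number of dies per wafer is approximately the
wafer area divided by the die area, less a correction for the dies lost along the circular edge:
$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$
Thus
$$\text{Dies per wafer} = \frac{\pi \times (30/2)^2}{2.25} - \frac{\pi \times 30}{\sqrt{2 \times 2.25}} = \frac{706.9}{2.25} - \frac{94.2}{2.12} \approx 270$$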
29
Trends in Cost:
 However, this only gives the maximum number of dies per wafer.
The critical question is:
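 What fraction of the dies on a wafer are good, i.e., what is the die yield?
 A simple model assumes that the defects are randomly distributed over the wafer
and that the yield is inversely proportional to the complexity of the fabrication
process:
$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$
where α is a parameter that reflects the complexity of the manufacturing process.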
30
Trends in Cost:
 Example:
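Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a
side, assuming a defect density of 0.4 per cm² and α is 4.
 Answer:
The total die areas are 2.25 cm² and 1.00 cm². Assuming a wafer yield of 100%, the
yield for the larger die is
$$\text{Die yield} = \left(1 + \frac{0.4 \times 2.25}{4.0}\right)^{-4} = 0.44$$
For the smaller die, it is
$$\text{Die yield} = \left(1 + \frac{0.4 \times 1.00}{4.0}\right)^{-4} = 0.68$$
That is, less than half of all the large dies are good, but more than
two-thirds of the small dies are good.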
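The two worked examples (dies per wafer and die yield) can be reproduced with the sketch below. It assumes the standard textbook model: wafer area over die area minus an edge correction, and a yield of the form (1 + defect density × die area / α)^(−α) with a wafer yield of 100%. The function names are hypothetical.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Wafer area over die area, minus a correction for partial dies along the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2, die_area_cm2, alpha, wafer_yield=1.0):
    """Fraction of good dies under the assumed yield model (wafer yield defaults to 100%)."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


print(round(dies_per_wafer(30, 1.5 * 1.5)))        # dies on a 30 cm wafer
for side_cm in (1.5, 1.0):
    area = side_cm * side_cm
    print(f"{side_cm} cm die: yield {die_yield(0.4, area, 4):.2f}")
```

Multiplying the two results gives the expected number of good dies per wafer, which is the quantity that ultimately drives die cost.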
31
Measuring, Reporting and
Summarizing Performance
32
Measuring Performance
 When we say one computer is faster than another, what do we
mean?
 The user of a desktop computer may say a computer is faster when a
program runs in less time,
 While an amazon.com administrator may say a computer is faster
when it completes more transactions per hour.
 The computer user is interested in reducing response time — the
time between the start and the completion of an event (execution time).
 The administrator of a large data processing center may be
interested in increasing throughput — the total amount of work done
in a given time.
33
Measuring Performance
 Consider two different computers, X and Y.
 The phrase “X is faster than Y” is used to mean that the response
time or execution time is lower on X than on Y for the given task.
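 In particular, “X is n times faster than Y” will mean
$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
Since execution time is the reciprocal of performance, the following
relationship holds:
$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$
The phrase “the throughput of X is 1.3 times higher than Y” signifies that the number of
tasks completed per unit time on computer X is 1.3 times the number
completed on Y.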
34
Measuring Performance
 Execution time can be defined in different ways. The most straightforward definition is
wall-clock time, response time, or elapsed time, which is the latency to complete a task.
 The response time seen by the user is the elapsed time of the
program, not the CPU time.
 To evaluate a new system the users would simply compare the
execution time of their workloads.
35
Reporting Performance
 Reporting Performance Results
 The guiding principle of reporting performance measurements should be
reproducibility: list everything another experimenter would need to
duplicate the results.
 A SPEC (Standard Performance Evaluation Corporation,
www.spec.org) benchmark report requires an extensive description of
the computer and the compiler flags, as well as the publication of both
the baseline and optimized results.
36
Reporting Performance
 Reporting Performance Results
 In addition to hardware, software, and baseline tuning parameter
descriptions, a SPEC report contains the actual performance times,
shown both in tabular form and as a graph.
 A TPC (Transaction Processing Performance Council) benchmark report is even
more complete.
These reports are excellent sources for finding the real cost of
computing systems.
37
Summarizing Performance
 Summarizing Performance Results
 A straightforward approach to computing a summary result would be
to compare the arithmetic means of the execution times of the
programs in the suite.
 An alternative would be to add a weighting factor to each benchmark
and use the weighted arithmetic mean as the single number to
summarize performance.
 Each company might have their own set of weights.
38
Summarizing Performance
 Summarizing Performance Results
 Rather than pick weights, we could normalize execution times to a
reference computer by dividing the time on the reference computer by
the time on the computer being rated.
 SPEC uses this approach, calling the ratio the SPECRatio.
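 For example, suppose that the SPECRatio of computer A on a
benchmark was 1.25 times higher than that of computer B; then you would
know
$$1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{Execution time}_{\text{reference}} / \text{Execution time}_A}{\text{Execution time}_{\text{reference}} / \text{Execution time}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{\text{Performance}_A}{\text{Performance}_B}$$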
39
Summarizing Performance
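 Because a SPECRatio is a ratio rather than an absolute execution time, the mean
must be computed using the geometric mean.
(Since SPECRatios have no units, comparing SPECRatios
arithmetically is meaningless.)
 The formula is
$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{sample}_i}$$
where sample_i is the SPECRatio for program i.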
40
Summarizing Performance
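 Example: Show that the ratio of the geometric means is equal to the
geometric mean of the performance ratios, and that the choice of reference
computer for SPECRatio does not matter.
 Answer: Assume two computers A and B and a set of SPECRatios for
each.
$$\frac{\text{Geometric mean}_A}{\text{Geometric mean}_B} = \frac{\sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio } A_i}}{\sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio } B_i}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{SPECRatio } A_i}{\text{SPECRatio } B_i}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{Execution time}_{B_i}}{\text{Execution time}_{A_i}}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{Performance}_{A_i}}{\text{Performance}_{B_i}}}$$
That is, the ratio of the geometric means of the SPECRatios of A and B is
the geometric mean of the performance ratios of A to B. The reference execution
times cancel, so the choice of reference computer does not matter.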
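The claim that the reference computer cancels out can be checked numerically. In the sketch below, all execution times are made-up illustrative values, not measurements, and the helper names are hypothetical.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def spec_ratios(reference_times, machine_times):
    """SPECRatio per benchmark: reference time divided by the rated machine's time."""
    return [ref / t for ref, t in zip(reference_times, machine_times)]


# Made-up execution times in seconds for three benchmarks on computers A and B.
times_a = [10.0, 40.0, 5.0]
times_b = [20.0, 50.0, 4.0]


def relative_performance(reference_times):
    """Ratio of the geometric-mean SPECRatios of A and B for a given reference machine."""
    return (geometric_mean(spec_ratios(reference_times, times_a))
            / geometric_mean(spec_ratios(reference_times, times_b)))


with_ref_1 = relative_performance([100.0, 100.0, 100.0])
with_ref_2 = relative_performance([7.0, 300.0, 55.0])
direct = geometric_mean([b / a for a, b in zip(times_a, times_b)])
print(with_ref_1, with_ref_2, direct)
```

All three printed values agree up to floating-point rounding: two very different reference machines give the same A-to-B comparison, and both match the geometric mean of the per-benchmark performance ratios.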
41
Pipelining
42
Pipelining (Concept)
 Let us consider the example of washing a car:
Suppose washing, drying, and polishing a car each require 30 minutes.
To wash, dry, and polish:
1 car will take 1.5 hours
4 cars will need (1.5 hours x 4) 6 hours
Now suppose that
after the first car is washed, it is sent for drying, and at the same time the second
car is taken for washing.
Washing of the second car and drying of the first car are done simultaneously
and finish at the same time.
When washing of the second car and drying of the first car are over, the first car
is sent for polishing, the second car is sent for drying, and the third car is
taken for washing.
The total time to complete all three operations for 4 cars is ___.
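The car-wash timing can be worked out with a short calculation. This sketch assumes every stage takes the same time and that a car moves to the next stage as soon as it is free; the function names are hypothetical.

```python
def sequential_minutes(items, stages, stage_minutes):
    """Each item finishes every stage before the next item starts."""
    return items * stages * stage_minutes


def pipelined_minutes(items, stages, stage_minutes):
    """The first item passes through all stages; each later item finishes one stage time after it."""
    return (stages + items - 1) * stage_minutes


cars, stages, minutes = 4, 3, 30
print(sequential_minutes(cars, stages, minutes) / 60, "hours without overlap")
print(pipelined_minutes(cars, stages, minutes) / 60, "hours with overlap")
```

The time for any single car is unchanged at 1.5 hours; the gain comes entirely from overlapping the stages of different cars.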
43
What is Pipelining?
 In computing, a pipeline is a set of data processing elements
connected in series, so that the output of one element is the input
of the next one.
 Pipelining is an implementation technique whereby multiple
instructions are overlapped in execution.
 The elements of a pipeline are often executed in parallel or in
time-sliced fashion.
 Today, pipelining is the key implementation technique used to make
fast CPUs.
44
Pipelining Types
 Buffered, Synchronous pipelines:
Conventional microprocessors are synchronous circuits that use
buffered, synchronous pipelines.
In these pipelines, "pipeline registers" are inserted in-between
pipeline stages, and are clocked synchronously.
 Buffered, Asynchronous pipelines:
Asynchronous pipelines are used in asynchronous circuits, and have
their pipeline registers clocked asynchronously.
They use a request/acknowledge system, wherein each stage can
detect when it's finished.
45
Pipelining Types
 Unbuffered pipelines:
Unbuffered pipelines, called "wave pipelines", do not have registers in-
between pipeline stages.
Instead, the delays in the pipeline are "balanced" so that, for each
stage, the difference between the first stabilized output data and the last
is minimized.
46
Pipelining
 Because all stages proceed at the same time, the length of a processor
cycle is determined by the time required for the slowest pipe stage.
 In a computer, this processor cycle is usually 1 clock cycle (sometimes
it is 2, rarely more).
 The pipeline designer’s goal is to balance the length of each pipeline
stage.
 If the stages are perfectly balanced, then the time per instruction on the
pipelined processor is
$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
47
Implementation of a RISC Instruction Set
 How is a RISC instruction set implemented without pipelining?
 Each RISC instruction takes at most 5 clock cycles.
 This basic implementation is later extended to a pipelined version, resulting in a much
lower CPI.
 The unpipelined implementation is not the most economical or the
highest-performance implementation without pipelining; it is designed to lead
naturally to a pipelined implementation.
 Implementing the instruction set requires the introduction of several
temporary registers that are not part of the architecture.
48
Implementation of a RISC Instruction Set
 Every instruction in this RISC subset can be implemented in at most 5
clock cycles. The 5 clock cycles are as follows.
1. Instruction fetch cycle (IF):
Send the program counter (PC) to memory and fetch the current
instruction from memory. Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to the PC.
2. Instruction decode/register fetch cycle (ID):
Decode the instruction and read the registers corresponding to register
source specifiers from the register file.
Decoding is done in parallel with reading registers, which is possible
because the register specifiers are at a fixed location in a RISC
architecture. This technique is known as fixed-field decoding
49
Implementation of a RISC Instruction Set
3. Execution/effective address cycle (EX):
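The ALU operates on the operands prepared in the prior cycle,
performing one of three functions depending on the instruction type.
 Memory reference: The ALU adds the base register and the offset to form the effective address.
 Register-Register ALU instruction: The ALU performs the operation specified by the opcode on the values read from the register file.
 Register-Immediate ALU instruction: The ALU performs the operation specified by the opcode on the first value read from the register file and the sign-extended immediate.
 In a load-store architecture the effective address and execution cycles
can be combined into a single clock cycle, since no instruction needs to
calculate a data address and perform an operation on the data at the same time.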
50
Implementation of a RISC Instruction Set
4. Memory access (MEM):
 If the instruction is a load, memory does a read using the effective
address computed in the previous cycle. If it is a store, then the memory
writes the data from the second register read from the register file using
the effective address.
5. Write-back cycle (WB):
 Register-Register ALU instruction or Load instruction: Write the result
into the register file, whether it comes from the memory system (for a
load) or from the ALU (for an ALU instruction).
51
Simple RISC pipeline
IF = instruction fetch,
ID = instruction decode,
EX = execution,
MEM = memory access, and
WB = write back.
52
Simple RISC pipeline
Pipeline as a series of data paths shifted in time
IM = instruction memory,
DM = data memory, and
CC = clock cycle
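The pipeline diagram that these legends describe can be regenerated as text. The sketch below assumes the ideal case: one instruction enters the pipeline per clock cycle and nothing stalls. `pipeline_diagram` and `print_diagram` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; instruction i enters IF in clock cycle i + 1."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # each stage is one cycle later than the previous one
        rows.append(row)
    return rows


def print_diagram(rows):
    """Print the rows under a header of clock cycles CC1, CC2, ..."""
    cycles = len(rows[0])
    print("instr  " + " ".join(f"CC{c + 1:<3}" for c in range(cycles)))
    for i, row in enumerate(rows):
        print(f"i+{i:<4} " + " ".join(f"{cell:<5}" for cell in row))


print_diagram(pipeline_diagram(5))
```

Reading down any column shows which stage each instruction occupies in that clock cycle; from the fifth cycle onward, all five stages are busy at once.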
53
Performance Issues in Pipelining
 Pipelining increases the CPU instruction throughput — the number
of instructions completed per unit of time — but it does not reduce the
execution time of an individual instruction.
 In fact, it usually slightly increases the execution time of each
instruction due to overhead in the control of the pipeline.
 The increase in instruction throughput means that a program runs
faster and has lower total execution time, even though no single
instruction runs faster.
54
Performance Issues in Pipelining
 Imbalance among the pipeline stages reduces performance.
 Pipeline overhead arises from the combination of pipeline register
delay and clock skew.
 The pipeline registers add setup time (the time that a register input must
be stable before the clock signal that triggers a write occurs) plus propagation delay to the clock cycle.
 Clock skew (the maximum delay between when the clock arrives at any two registers)
also contributes to the lower limit on the clock cycle.
 Once the clock cycle is as small as the sum of the clock skew and
latch overhead, no further pipelining is useful,
since there is no time left in the cycle for useful work.
55
Performance Issues in Pipelining
Example: Consider the unpipelined processor in the previous section.
Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU
operations and branches and 5 cycles for memory operations. Assume
that the relative frequencies of these operations are 40%, 20%, and
40%, respectively. Suppose that due to clock skew and setup, pipelining
the processor adds 0.2 ns of overhead to the clock. Ignoring any latency
impact, how much speedup in the instruction execution rate will we gain
from a pipeline?
Answer: The average instruction execution time on the unpipelined
processor is
Average instruction execution time = Clock cycle × Average CPI
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 1 ns × 4.4
= 4.4 ns
56
Performance Issues in Pipelining
 In the pipelined implementation, the clock must run at the speed of
the slowest stage plus overhead, which will be 1 + 0.2 or 1.2 ns; this
is the average instruction execution time.
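 Thus, the speedup from pipelining is
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{4.4\ \text{ns}}{1.2\ \text{ns}} \approx 3.7 \text{ times}$$
The 0.2 ns overhead essentially establishes a limit on the effectiveness of
pipelining. If the overhead is not affected by changes in the clock cycle,
Amdahl’s Law tells us that the overhead limits the speedup.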
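The example's arithmetic can be reproduced directly. This sketch uses only the numbers given in the example (1 ns clock, 0.2 ns overhead, and the 40%/20%/40% instruction mix); `average_cpi` is a hypothetical helper name.

```python
def average_cpi(mix):
    """Weighted CPI for a list of (relative frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


clock_ns = 1.0
overhead_ns = 0.2
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]  # ALU operations, branches, memory operations

unpipelined_ns = clock_ns * average_cpi(mix)
pipelined_ns = clock_ns + overhead_ns  # one instruction completes per (longer) cycle
speedup = unpipelined_ns / pipelined_ns
print(f"{unpipelined_ns:.1f} ns / {pipelined_ns:.1f} ns = {speedup:.1f} times")
```

Raising `overhead_ns` in this sketch shows how quickly the fixed per-cycle overhead erodes the benefit of pipelining.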
57
Pipeline Hazards
 There are situations, called hazards, that prevent the next
instruction in the instruction stream from executing during its
designated clock cycle.
 Hazards reduce the performance gained from pipelining.
 There are three classes of hazards:
1. Structural hazards arise from resource conflicts when the hardware
cannot support all possible combinations of instructions simultaneously
in overlapped execution.
2. Data hazards arise when an instruction depends on the results of a
previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
3. Control hazards arise from the pipelining of branches and other
instructions that change the PC.
58
Pipeline Hazards
 Hazards in pipelines can make it necessary to stall (stop, halt or pause),
the pipeline.
 Avoiding a hazard often requires that some instructions in the pipeline
be allowed to proceed while others are delayed.
 When an instruction is stalled, all instructions issued later than the
stalled instruction are also stalled.
 Instructions issued earlier than the stalled instruction must continue,
otherwise the hazard will never clear.
59
Pipeline Hazards
 Structural Hazards
If some combination of instructions cannot be accommodated because
of resource conflicts, the processor is said to have a structural
hazard.
 Structural hazards arise when:
1. Some functional unit is not fully pipelined.
2. Some resource has not been duplicated enough to allow
all combinations of instructions in the pipeline to execute.
 For example, a processor may have only one register-file write port, but
the pipeline might want to perform two writes in a clock cycle.
This will generate a structural hazard.
60
Pipeline Hazards
 Structural Hazards
 When a sequence of instructions encounters this hazard, the pipeline will stall
one of the instructions until the required unit is available.
 Such stalls will increase the CPI from its usual ideal value of 1.
 To resolve this hazard, we need to stall the pipeline for 1 clock
cycle. A stall is commonly called a pipeline bubble or just bubble.
 The effect of the pipeline bubble is actually to occupy the resources
for that instruction slot as it travels through the pipeline.
61
Pipeline Hazards
 Data Hazards
 A major effect of pipelining is to change the relative timing of instructions
by overlapping their execution.
 This overlap introduces data and control hazards.
 Data hazards occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the order seen by
sequentially executing instructions on an unpipelined processor.
62
Pipeline Hazards
 Data Hazards
 Consider the pipelined execution of these instructions:
DADD R1,R2,R3
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R1,R11
 All the instructions after the DADD use the result of the DADD
instruction.
 The DADD instruction writes the value of R1 in the WB pipe stage, but
the DSUB instruction reads the value during its ID stage.
 This problem is called a data hazard. Unless precautions are taken to
prevent it, the DSUB instruction will read the wrong value and try to use
it.
63
Pipeline Hazards
 Data Hazards
 If an interrupt occurs between the DADD and DSUB instructions, then
WB stage of the DADD will complete, and the value of R1 at that point
will be the result of the DADD.
 The AND instruction is also affected by this hazard. The AND instruction
that reads the registers during clock cycle 4 will receive the wrong
results.
 The XOR instruction operates properly because its register read occurs
in clock cycle 6, after the register write.
 The OR instruction also operates without incurring a hazard, because the register
file is written in the first half of the clock cycle and read in the second half.
64
Pipeline Hazards
 Minimizing Data Hazard Stalls by Forwarding
 The problem of data hazard can be solved with a simple hardware
technique called forwarding (also called bypassing and sometimes
short-circuiting).
 The key insight is that the result is not really needed by the DSUB until after the
DADD actually produces it.
 If the result can be moved from the pipeline register where the DADD stores
it to where the DSUB needs it, then the need for a stall can be
avoided.
65
Pipeline Hazards
 Minimizing Data Hazard Stalls by Forwarding
 Using this observation, forwarding works as follows:
1. The ALU result from both the EX/MEM and MEM/WB pipeline registers
is always fed back to the ALU inputs.
2. If the forwarding hardware detects that the previous ALU operation has
written to the register corresponding to a source for the current ALU
operation, control logic selects the forwarded result.
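Which instructions in the DADD sequence actually need forwarding can be worked out from the stage timing. The sketch below makes several assumptions that should be read as such: the five-stage pipeline issues one instruction per cycle with no stalls, registers are read in ID and written in WB, the register file is written in the first half of a cycle and read in the second half, and no intermediate instruction overwrites the producer's destination. `classify_reads` is a hypothetical helper, not a model of real forwarding hardware.

```python
# Each instruction is (opcode, destination, source1, source2).
PROGRAM = [
    ("DADD", "R1", "R2", "R3"),
    ("DSUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]


def classify_reads(program, producer_index=0):
    """Classify each later read of the producer's destination register.

    Instruction k is fetched in cycle k + 1, reads its registers (ID) in
    cycle k + 2, and writes its result (WB) in cycle k + 5.
    """
    dest = program[producer_index][1]
    write_cycle = producer_index + 5
    report = {}
    for k in range(producer_index + 1, len(program)):
        opcode, _, src1, src2 = program[k]
        if dest not in (src1, src2):
            continue  # this instruction does not read the producer's result
        read_cycle = k + 2
        if read_cycle < write_cycle:
            report[opcode] = "hazard: needs forwarding or a stall"
        elif read_cycle == write_cycle:
            report[opcode] = "ok: write in first half of cycle, read in second half"
        else:
            report[opcode] = "ok: reads after the write"
    return report


for opcode, status in classify_reads(PROGRAM).items():
    print(f"{opcode:5} {status}")
```

Under these assumptions, DSUB and AND read R1 before DADD writes it back and so depend on forwarding, while OR and XOR read a correct value from the register file.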
66
Pipeline Hazards
 Branch Hazards
 Control hazards can cause a greater performance loss for the MIPS
pipeline than data hazards.
 When a branch is executed, it may or may not change the PC to
something other than its current value plus 4. If it changes the PC to its
target address, it is a taken branch; if it falls through, it is not taken (untaken).
 One simple method is to redo the fetch of the instruction following a branch once
the branch is detected. If the branch is not taken, then the repetition of the IF stage is
unnecessary since the correct instruction was fetched.
 One stall cycle for every branch will yield a performance loss of
10% to 30% depending on the branch frequency.
67
Pipeline Hazards
 Reducing Pipeline Branch Penalties
 The software can try to minimize the branch penalty using
knowledge of the hardware scheme and of branch behavior.
 The simplest scheme to handle branches is to freeze or flush the
pipeline, holding or deleting any instructions after the branch until
the branch destination is known.
 This scheme is attractive because it is simple for both hardware and software.
68
Pipeline Hazards
 Reducing Pipeline Branch Penalties
 Treat every branch as not taken.
(Allow the hardware to continue as if the branch were not
executed).
 In the simple five-stage pipeline, this predicted untaken scheme is
implemented by continuing to fetch instructions as if the branch
were a normal instruction.
 If the branch is taken, however, we need to turn the fetched
instruction into a no-op and restart the fetch at the target address.
69
Pipeline Hazards
 Reducing Pipeline Branch Penalties
 An alternative scheme is to treat every branch as taken.
 As soon as the branch is decoded and the target address is
computed, we assume the branch to be taken and begin fetching
and executing at the target address (location).
 In some processors, especially those with more powerful (and hence slower) branch conditions,
the branch target is known before the branch outcome, and a predicted-taken scheme might make
sense.
 Another scheme used in some processors is called delayed branch.
This technique was heavily used in early RISC processors.
70
Performance of Pipelines with Stalls
 A stall causes the pipeline performance to degrade from the ideal
performance.
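 The actual speedup from pipelining can be found by starting with the
earlier formula:
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$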
71
Performance of Pipelines with Stalls
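 Pipelining can be thought of as decreasing the CPI or the clock cycle time.
 The ideal CPI on a pipelined processor is almost always 1.
 Hence, we can compute the pipelined CPI:
$$\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \text{Pipeline stall clock cycles per instruction}$$
If we ignore the cycle time overhead of pipelining and assume the
stages are perfectly balanced, then the cycle time of the two processors
can be equal, leading to
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}$$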
72
Performance of Pipelines with Stalls
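If all instructions take the same number of cycles, which must also
equal the number of pipeline stages, then the unpipelined CPI is equal to
the depth of the pipeline, leading to
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
If there are no pipeline stalls, then pipelining can improve performance
by the depth of the pipeline.
If we instead think of pipelining as improving the clock cycle time, then we can assume that the CPI
of the unpipelined processor, as well as that of the pipelined processor, is 1.
This leads to
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$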
73
Performance of Pipelines with Stalls
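 In cases where the pipe stages are perfectly balanced and there is
no overhead, the clock cycle on the pipelined processor is smaller than
the clock cycle of the unpipelined processor by a factor equal to the
pipeline depth:
$$\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}} \quad\Longrightarrow\quad \text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
This leads to the following:
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$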
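The effect of stalls on speedup can be explored numerically. The sketch below assumes perfectly balanced stages and no clock overhead, so that speedup equals the pipeline depth divided by (1 + stall cycles per instruction); it uses a one-cycle branch penalty as the only source of stalls. The function names and the sample branch frequencies are illustrative choices.

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup over the unpipelined processor, assuming balanced stages and no clock overhead."""
    return pipeline_depth / (1 + stall_cycles_per_instruction)


def branch_stall_cycles(branch_frequency, penalty_cycles=1):
    """Average stall cycles per instruction contributed by branches."""
    return branch_frequency * penalty_cycles


for freq in (0.0, 0.10, 0.30):
    stalls = branch_stall_cycles(freq)
    cpi = 1 + stalls
    print(f"branch frequency {freq:.0%}: CPI {cpi:.2f}, "
          f"speedup {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction added to the denominator pulls the result below that ideal.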
74
MIPS Instructions (load and store)
75
MIPS Instructions (arithmetic/logical)
76
MIPS Instructions (control flow)
77
ILP
(Instruction-Level Parallelism)
78
Instruction-Level Parallelism
 All processors since about 1985 use pipelining to overlap the execution
of instructions and improve performance.
 This potential overlap among instructions is called instruction-level
parallelism (ILP), since the instructions can be evaluated in parallel.
79
Instruction-Level Parallelism
 Instruction-level parallelism (ILP) is a measure of how many of the
operations in a computer program can be performed simultaneously.
 For example, consider the following program:
1. e = a + b
2. f = c + d
3. g = e * f
Here, operation 3 depends on the results of operations 1 and 2, so
it cannot be calculated until both of them are completed. Operations
1 and 2 do not depend on any other operation, so they can be
calculated simultaneously.
If each operation is completed in one unit of time, then the three
instructions can be completed in two units of time, giving an ILP of 3/2.
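The 3/2 figure can be checked with a small sketch. It assumes unit-latency operations, unlimited functional units, and an acyclic dependence graph; the `ilp` helper and the dictionary layout are illustrative and are not part of the original slides.

```python
def ilp(deps):
    """Return (number of operations) / (length of the longest dependence chain).

    deps maps each operation name to the list of operations whose
    results it needs. Every operation must appear as a key.
    """
    finish = {}  # operation -> time unit in which it completes

    def finish_time(op):
        if op not in finish:
            # An operation completes one unit after its latest input.
            finish[op] = 1 + max((finish_time(d) for d in deps[op]), default=0)
        return finish[op]

    critical_path = max(finish_time(op) for op in deps)
    return len(deps) / critical_path


# e = a + b, f = c + d, g = e * f
program = {"e": [], "f": [], "g": ["e", "f"]}
print(ilp(program))  # 1.5
```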
80
Instruction-Level Parallelism
 Instruction-level parallelism (ILP):
 Ordinary programs are written and executed sequentially.
ILP allows the compiler and the processor to overlap the execution
of multiple instructions or even to change the order in which
instructions are executed.
 How much ILP exists in programs is very application specific.
In certain fields, such as graphics and scientific computing, the amount
can be very large.
However, workloads such as cryptography exhibit much less parallelism.
81
Instruction-Level Parallelism
 There are two largely separable approaches to exploiting (utilizing) ILP:
(1) an approach that relies on hardware to help discover and exploit
(utilize) the parallelism, and
(2) an approach that relies on software technology to find parallelism
at compile time.
82
Instruction-Level Parallelism
 The value of the CPI for a pipelined processor is the sum of the base
CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls
+ Control stalls
 The ideal pipeline CPI is a measure of the maximum performance
attainable by the implementation.
83
Instruction-Level Parallelism
 The simplest and most common way to increase the ILP is to exploit
parallelism among iterations of a loop, which is called loop-level
parallelism.
 Example: A loop that adds two 1000 - element arrays:
for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
 Every iteration of the loop can overlap with any other iteration.
 Such techniques work by unrolling the loop, either statically by the
compiler or dynamically by the hardware.
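To make the idea of unrolling concrete, here is a minimal sketch of the array-add loop unrolled by a factor of four. The unroll factor and the function name are assumptions chosen for illustration. Python gains no speed from this transformation; the point is only to show the shape of the code a compiler might produce, with four independent additions per iteration that a pipeline could overlap.

```python
def add_arrays_unrolled(x, y):
    """Compute x[i] = x[i] + y[i] for all i, unrolled by 4 (illustrative)."""
    n = len(x)
    i = 0
    # Main unrolled body: four independent additions per loop iteration,
    # with only one loop-counter update and one loop test for all four.
    while i + 4 <= n:
        x[i] += y[i]
        x[i + 1] += y[i + 1]
        x[i + 2] += y[i + 2]
        x[i + 3] += y[i + 3]
        i += 4
    # Cleanup loop for the leftover elements when n is not a multiple of 4.
    while i < n:
        x[i] += y[i]
        i += 1
    return x


x = list(range(1000))
y = [1] * 1000
add_arrays_unrolled(x, y)
print(x[:3], x[-1])  # [1, 2, 3] 1000
```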
84
Instruction-Level Parallelism
 An important alternative method for exploiting loop-level parallelism
is the use of SIMD in both vector processors and Graphics Processing
Units (GPUs).
 A SIMD instruction exploits data-level parallelism by operating on a
small to moderate number of data items in parallel.
85
SIMD
 SIMD machines are capable of applying the exact same instruction
stream to multiple streams of data simultaneously.
 This type of architecture is perfectly suited to achieving very high
processing rates.
86
SIMD
Figure: SIMD processable patterns versus SIMD unprocessable patterns.
Example: Brightness computation by SIMD operations.
87
ILP Challenges
 Determining how one instruction depends on another is critical to
determining how much parallelism exists in a program and how that
parallelism can be exploited.
 If two instructions are independent, they can execute
simultaneously in a pipeline, provided that pipeline has sufficient
resources (and hence no structural hazards exist).
 If two instructions are dependent, they are not parallel and must be
executed in program order.
88
ILP Challenges
 There are three different types of dependences:
 Data dependences (also called true data dependences),
 Name dependences, and
 Control dependences.
 An instruction j is data dependent on instruction i if either of the
following conditions holds:
 Instruction i produces a result that may be used by instruction j.
 Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i.
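The second condition makes data dependence transitive: a chain of direct dependences also counts. A minimal sketch of that check follows; the `direct` mapping and the function name are hypothetical, chosen only for illustration.

```python
def data_dependent(j, i, direct):
    """Return True if j is data dependent on i, directly or through a chain.

    direct maps an instruction to the instructions whose results it
    uses directly (hypothetical representation for this sketch).
    """
    stack = list(direct.get(j, ()))
    seen = set()
    while stack:
        k = stack.pop()
        if k == i:
            return True
        if k not in seen:
            seen.add(k)
            # Follow the chain: whatever k depends on, j depends on too.
            stack.extend(direct.get(k, ()))
    return False


# k uses a result of i, and j uses a result of k.
direct = {"k": ["i"], "j": ["k"]}
print(data_dependent("j", "i", direct))  # True
print(data_dependent("i", "j", direct))  # False
```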
89
ILP Challenges
 Data Dependences
 For example, consider the following MIPS code sequence that
increments a vector of values in memory (starting at 0(R1) and with the
last element at 8(R2)) by a scalar in register F2.
(For simplicity, ignore the effects of delayed branches.)
Here, each instruction depends on the previous one.
90
ILP Challenges
 Data Dependences
 A data dependence conveys three things:
(1) The possibility of a hazard,
(2) The order in which results must be calculated, and
(3) An upper bound on how much parallelism can be exploited.
 A data dependence can be overcome in two different ways:
(1) Maintaining the dependence but avoiding a hazard, and
(2) Eliminating a dependence by transforming the code.
Scheduling the code is the primary method used to avoid a hazard
without altering a dependence, and such scheduling can be done both
by the compiler and by the hardware.
91
ILP Challenges
 Name Dependences
 A name dependence occurs when two instructions use the same
register or memory location, called a name, but there is no flow of data
between the instructions associated with that name.
 There are two types of name dependences between an instruction i
that precedes instruction j in program order:
1. An antidependence: When instruction j writes a register or memory
location that instruction i reads.
2. An output dependence: When instruction i and instruction j write the
same register or memory location.
92
ILP Challenges
 Name Dependences
 Because a name dependence is not a true dependence, instructions
involved in a name dependence can execute simultaneously or be
reordered.
 If the name (register number or memory location) used in the
instructions is changed, then the instructions do not conflict.
 This renaming can be more easily done for register operands, where it
is called register renaming.
 Register renaming can be done either statically by a compiler or
dynamically by the hardware.
93
ILP Challenges
 Data Hazards
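 A hazard exists whenever there is a name or data dependence
between instructions and they are close enough that the overlap during
execution would change the order of access to the operand involved in
the dependence.
 Because of the dependence, we must preserve program order.
 The goal of both software and hardware techniques is to exploit
parallelism by preserving program order only where it affects the
outcome of the program.
 Data hazards may be classified as one of three types, depending on
the order of read and write accesses in the instructions.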
94
ILP Challenges
 Data Hazards
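 Consider two instructions i and j, with i preceding j in program order.
The possible data hazards are:
■ RAW (Read After Write): j tries to read a source before i writes it,
so j incorrectly gets the old value. This hazard is the most common type
and corresponds to a true data dependence.
■ WAW (Write After Write): j tries to write an operand before it is
written by i, so the writes end up in the wrong order. This hazard
corresponds to an output dependence. WAW hazards are present only in
pipelines that write in more than one pipe stage or that allow an
instruction to proceed even when a previous instruction is stalled.
■ WAR (Write After Read): j tries to write a destination before it is
read by i, so i incorrectly gets the new value. This hazard arises from an
antidependence. WAR hazards cannot occur in most static issue
pipelines. A WAR hazard occurs either when some instructions write
results early in the instruction pipeline and other instructions read a
source late in the pipeline, or when instructions are reordered.
Note that the RAR (Read After Read) case is not a hazard.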
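The three cases can be told apart by comparing the registers each instruction reads and writes. The sketch below is illustrative only: the `Instr` tuple and the `hazards` function are hypothetical names, and the function reports potential hazards (that is, dependences). Whether one becomes a real hazard depends on the pipeline.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written, or "" if the instruction writes none
    srcs: tuple   # registers read


def hazards(i, j):
    """Classify potential hazards between i and j, where i precedes j."""
    found = set()
    if i.dest and i.dest in j.srcs:
        found.add("RAW")   # j reads what i writes (true dependence)
    if j.dest and j.dest in i.srcs:
        found.add("WAR")   # j writes what i reads (antidependence)
    if i.dest and i.dest == j.dest:
        found.add("WAW")   # both write the same name (output dependence)
    return found


div = Instr("F0", ("F2", "F4"))     # DIV.D F0,F2,F4
add = Instr("F6", ("F0", "F8"))     # ADD.D F6,F0,F8
sub = Instr("F8", ("F10", "F14"))   # SUB.D F8,F10,F14
mul = Instr("F6", ("F10", "F8"))    # MUL.D F6,F10,F8

print(hazards(div, add))  # {'RAW'}
print(hazards(add, sub))  # {'WAR'}
print(hazards(add, mul))  # {'WAW'}
```

Two instructions that only read the same register produce an empty set, matching the note that RAR is not a hazard.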
95
ILP Challenges
 Control Dependences
 A control dependence determines the ordering of an instruction, i,
with respect to a branch instruction so that instruction i is executed in
correct program order.
 An example of a control dependence is the dependence of the statements
in the “then” part of an “if” statement on the branch.
96
ILP Challenges
 Control Dependences
 S1 is control dependent on p1, and S2 is control
dependent on p2 but not on p1.
 In general, two constraints are imposed by control
dependences:
1. An instruction that is control dependent on a branch cannot
be moved before the branch, so that its execution is no longer
controlled by the branch.
2. An instruction that is not control dependent on a branch
cannot be moved after the branch, so that its execution is
controlled by the branch.
For example, we cannot take a statement before the if statement and
move it into the then portion.
97
Overcoming Data Hazards
 A simple statically scheduled pipeline fetches an instruction and issues
it, unless there is a data dependence.
 If there is a data dependence, then the hazard detection hardware
stalls the pipeline starting with the instruction that uses the result.
 No new instructions are fetched or issued until the dependence is
cleared.
98
Dynamic Scheduling
 In dynamic scheduling, the hardware rearranges the instruction
execution to reduce the stalls while maintaining data flow.
 Dynamic scheduling offers several advantages.
 First, it allows code that was compiled with one pipeline in mind to
run efficiently on a different pipeline, eliminating the need to recompile
for a different microarchitecture.
 Second, it enables handling some cases when dependences are
unknown at compile time.
 Third, and most important, it allows the processor to tolerate
unpredictable delays, such as cache misses, by executing other code
while waiting.
99
Dynamic Scheduling: The Idea
 For example, consider this code:
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
 The SUB.D instruction cannot execute because the dependence of
ADD.D on DIV.D causes the pipeline to stall; yet, SUB.D is not data
dependent on anything in the pipeline.
 This hazard creates a performance limitation that can be eliminated by
not requiring instructions to execute in program order.
100
Dynamic Scheduling:
The Idea
 In the classic five-stage pipeline, both structural and data hazards
could be checked during instruction decode (ID).
 To allow us to begin executing the SUB.D in the above example,
we must separate the issue process into two parts:
- checking for any structural hazards and
- waiting for the absence of a data hazard.
 Thus, we still use in-order instruction issue, but we want an instruction
to begin execution as soon as its data operands are available.
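The benefit for the DIV.D / ADD.D / SUB.D example can be sketched with a toy timing model. The latencies below are hypothetical, functional units are assumed to be unlimited, one instruction issues per cycle in program order, and name dependences are ignored. This is a simplification for illustration, not a model of any real pipeline.

```python
# Hypothetical execution latencies in cycles (assumptions, not from the slides).
LATENCY = {"DIV.D": 10, "ADD.D": 2, "SUB.D": 2}


def start_cycles(program, in_order):
    """Return the cycle in which each instruction begins execution.

    program is a list of (op, dest, srcs). With in_order=True, an
    instruction cannot start before the one ahead of it has started.
    """
    ready_at = {}    # register -> cycle in which its value becomes available
    starts = []
    prev_start = 0
    for n, (op, dest, srcs) in enumerate(program, start=1):
        operands_ready = max((ready_at.get(s, 0) for s in srcs), default=0)
        start = max(n, operands_ready)          # issue slot n, then wait for operands
        if in_order:
            start = max(start, prev_start + 1)  # blocked behind a stalled instruction
        ready_at[dest] = start + LATENCY[op]
        starts.append(start)
        prev_start = start
    return starts


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F10", ("F0", "F8")),
    ("SUB.D", "F12", ("F8", "F14")),
]
print(start_cycles(program, in_order=True))   # [1, 11, 12]
print(start_cycles(program, in_order=False))  # [1, 11, 3]
```

In the in-order case SUB.D waits behind the stalled ADD.D. Once execution may begin out of order, SUB.D starts as soon as it has issued, because its operands do not depend on DIV.D.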
101
Dynamic Scheduling
 To understand how register renaming eliminates WAR and WAW
hazards, consider the following example code sequence:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
 There are two antidependences: between the ADD.D and the SUB.D
and between the S.D and the MUL.D.
 There is also an output dependence between the ADD.D and MUL.D.
 There are also three true data dependences: between the DIV.D and
the ADD.D, between the SUB.D and the MUL.D, and between the
ADD.D and the S.D.
102
Dynamic Scheduling
 These name dependences can all be eliminated by register
renaming.
 For simplicity, assume the existence of two temporary registers, S
and T. Using S and T, the sequence can be rewritten without any
name dependences as:
Original code          Code after renaming
DIV.D F0,F2,F4         DIV.D F0,F2,F4
ADD.D F6,F0,F8         ADD.D S,F0,F8
S.D F6,0(R1)           S.D S,0(R1)
SUB.D F8,F10,F14       SUB.D T,F10,F14
MUL.D F6,F10,F8        MUL.D F6,F10,T
 In addition, any subsequent uses of F8 must be replaced by the register
T. In this code segment, the renaming process can be done statically by
the compiler.
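The renaming idea can be sketched mechanically. The version below is an illustrative assumption, not the slides' method: it gives every destination a fresh name (P0, P1, ...) instead of using only S and T, so its output differs from the hand-renamed code but removes the same WAR and WAW dependences while keeping the true dependences. A real design must also map the final values back to the architectural registers, which is not modeled here.

```python
from itertools import count


def rename(program):
    """Rename every destination to a fresh register.

    program is a list of (op, dest, srcs); dest is None for
    instructions that write no register (for example, a store).
    """
    fresh = ("P%d" % n for n in count())
    current = {}   # architectural register -> latest renamed register
    renamed = []
    for op, dest, srcs in program:
        # Sources read the most recent version of each register.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
for instr in rename(program):
    print(instr)
# ('DIV.D', 'P0', ('F2', 'F4'))
# ('ADD.D', 'P1', ('P0', 'F8'))
# ('S.D', None, ('P1', 'R1'))
# ('SUB.D', 'P2', ('F10', 'F14'))
# ('MUL.D', 'P3', ('F10', 'P2'))
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier instruction reads, so only the three true dependences remain.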
103
Speculation
 Exploiting more parallelism requires overcoming the limitation of
control dependence.
 Control dependence is overcome by speculating (guessing) on the
outcome of branches and executing the program as if our guesses
were correct.
 With speculation, we fetch, issue, and execute instructions as if our
branch predictions were always correct; dynamic scheduling without
speculation only fetches and issues such instructions.
 Hardware speculation extends the ideas of dynamic scheduling.
104
Hardware-Based Speculation
 Hardware-based speculation combines three key ideas:
(1) Dynamic branch prediction to choose which instructions to execute,
(2) Speculation to allow the execution of instructions before the
control dependences are resolved and
(3) Dynamic scheduling to deal with the scheduling of different
combinations of basic blocks.
 Hardware-based speculation follows the predicted flow of data values to
choose when to execute instructions.
105
Hardware-Based Speculation
 Speculation allows instructions to execute out of order but forces
them to commit in order.
 Adding this commit phase to the instruction execution sequence
requires an additional set of hardware buffers that hold the results of
instructions that have finished execution but have not committed.
 This hardware buffer, called the reorder buffer (ROB), is also used to
pass results among instructions that may be speculated.
 The ROB supplies operands in the interval between completion of
instruction execution and instruction commit.
106
Hardware-Based Speculation
 Each entry in the ROB contains four fields:
- the instruction type,
- the destination field,
- the value field, and
- the ready field.
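 The instruction type field indicates whether the instruction is a branch
(and has no destination result), a store (which has a memory address
destination), or a register operation (an ALU operation or load, which
has a register destination).
 The destination field supplies the register number (for loads and ALU
operations) or the memory address (for stores) where the instruction
result should be written.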
 The value field is used to hold the value of the instruction result until the
instruction commits.
 The ready field indicates that the instruction has completed execution,
and the value is ready.
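The four fields can be pictured as a small record. The sketch below is only an illustration: the field names, types, and the `write_result` helper are assumptions, and a hardware ROB entry would hold tags and encoded fields rather than Python values.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ROBEntry:
    instr_type: str                      # "branch", "store", or "register"
    destination: Union[str, int, None]   # register name, memory address, or None for a branch
    value: Optional[float] = None        # result, held here until commit
    ready: bool = False                  # True once execution has completed

    def write_result(self, value):
        """Record the result and mark the entry as ready to commit."""
        self.value = value
        self.ready = True


entry = ROBEntry("register", "F6")
print(entry.ready)   # False
entry.write_result(3.5)
print(entry.ready, entry.value)  # True 3.5
```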
107
Hardware-Based Speculation
The basic structure of an FP unit using Tomasulo’s algorithm, extended
to handle speculation.
108
Hardware-Based Speculation
 The hardware structure of the processor, including the ROB, is shown in
the figure.
 The ROB includes the store buffers. The renaming function of the
reservation stations is replaced by the ROB.
 Results are tagged with the ROB entry number rather than the
reservation station number. This tagging requires that the ROB entry
assigned to an instruction be tracked in its reservation station.
109
Hardware-Based Speculation
 Here are the four steps involved in instruction execution:
1. Issue —
Get an instruction from the instruction queue.
Issue the instruction if there is an empty reservation station and an
empty slot in the ROB.
If either all reservation stations are full or the ROB is full, then
instruction issue is stalled until both have available entries.
2. Execute —
If one or more of the operands is not yet available, monitor the CDB
(Common Data Bus) while waiting for the register to be computed.
This step checks for RAW hazards. When both operands are available
at a reservation station, execute the operation.
Instructions may take multiple clock cycles in this stage.
110
Hardware-Based Speculation
 Here are the four steps involved in instruction execution:
3. Write result—
When the result is available, write it on the CDB (Common Data Bus)
and from the CDB into the ROB, as well as to any reservation stations
waiting for this result.
4. Commit—
This is the final stage of completing an instruction, after which only its
result remains. (The commit phase is also called “completion” or
“graduation”.)
111
Hardware-Based Speculation
There are three different sequences of actions at commit:
The normal commit case occurs when an instruction reaches the head
of the ROB and its result is present in the buffer.
Committing a store is similar except that memory is updated rather
than a result register.
When a branch with incorrect prediction reaches the head of the
ROB, it indicates that the speculation was wrong. The ROB is flushed
and execution is restarted at the correct successor of the branch. If the
branch was correctly predicted, the branch is finished.
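The three commit cases can be sketched as a loop over the head of the ROB. The dictionary layout of an entry, the `mispredicted` flag, and the function name are hypothetical. Restarting fetch at the correct branch successor is not modeled; the sketch only shows in-order commit and the flush.

```python
from collections import deque


def commit(rob, registers, memory):
    """Commit ready instructions from the head of the ROB, in order.

    Each entry is a dict with keys "type", "dest", "value", "ready",
    and, for branches, "mispredicted" (hypothetical layout).
    Returns the number of instructions committed.
    """
    committed = 0
    while rob and rob[0]["ready"]:
        head = rob.popleft()
        committed += 1
        if head["type"] == "branch":
            if head.get("mispredicted"):
                rob.clear()   # discard all speculative work after the branch
                break
        elif head["type"] == "store":
            memory[head["dest"]] = head["value"]      # memory is updated only at commit
        else:
            registers[head["dest"]] = head["value"]   # normal register commit
    return committed


rob = deque([
    {"type": "register", "dest": "F6", "value": 1.5, "ready": True},
    {"type": "store", "dest": 100, "value": 1.5, "ready": True},
    {"type": "branch", "dest": None, "value": None, "ready": True, "mispredicted": True},
    {"type": "register", "dest": "F8", "value": 9.0, "ready": True},
])
registers, memory = {}, {}
print(commit(rob, registers, memory))  # 3
print(registers, memory, len(rob))     # {'F6': 1.5} {100: 1.5} 0
```

The write to F8 sits after the mispredicted branch, so it is flushed and never reaches the register file, which is the point of committing in order.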
112
ILP Using Multiple Issue
 To improve performance, we would like to decrease the CPI to less than
one, but the CPI cannot be reduced below one if we issue only one
instruction every clock cycle.
 The goal of multiple-issue processors is to allow multiple
instructions to issue in a clock cycle.
 Multiple-issue processors come in three major flavors:
1. Statically scheduled superscalar processors
2. VLIW (very long instruction word) processors
3. Dynamically scheduled superscalar processors
113
Thank You …
shindesir.pvp@gmail.com
(This Presentation is Published Only for Educational Purpose)
114

Pipelining and ILP (Instruction Level Parallelism)

  • 1.
    Pipelining & Instruction Level Parallelism Mr.A. B. Shinde Assistant Professor, Electronics Engineering, P.V.P.I.T., Budhgaon
  • 2.
    Contents… 2  Fundamentals ofcomputer design:  Measuring and reports performance  Instruction level parallelism and its exploitation  Concepts and challenges  Overcoming data hazards with dynamics scheduling  Dynamic branch prediction  Speculation  Multiple issue processors
  • 3.
  • 4.
    Early Computers First Generation -1940-1956: -VacuumTubes Second Generation -1956-1963: -Transistors Third Generation – 1964-1971: Integrated Circuits Fourth Generation -1971-Present: - Microprocessors Fifth Generation – Present and Beyond: Artificial Intelligence 4
  • 5.
    Fundamentals of Computer Computer technology has made incredible progress over the last 60 years.  This improvement has come:  From advances in the technology used to build computers and  From innovation in computer design.  During the first 25 years, both forces made a major contribution, delivering performance improvement of about 25% per year.  The late 1970s (after emergence of the microprocessor): The higher rate of improvement— roughly 35% growth per year in performance. 5
  • 6.
    Fundamentals of Computer Two significant changes in the computer made new architecture:  The virtual elimination of assembly language programming reduced the need for object-code compatibility.  The creation of standardized, vendor-independent operating systems, such as UNIX, Linux, lowered the cost.  These changes developed a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s.  The RISC-based machines focused on instruction level parallelism and the use of caches.  The RISC-based computers raised the performance. 6
  • 7.
    Fundamentals of ComputerDesign Growth in processor performance since the mid-1980s. 7
  • 8.
    Classes of Computers The 1980s: desktop computer were invented (based on microprocessors)  Personal computers and workstations  The 1990s: Emergence of the Internet and the World Wide Web (www).  Cell phones has been introduced in 2000, with rapid improvements in functions and sales.  More recent applications used embedded computers. 8
  • 9.
    Classes of Computers Desktop Computing:  Desktop computing spans from low-end systems to high-end (heavily configured workstations)  The desktop market tends to be driven to optimize price performance.  Desktop computers are widely used for applications and benchmarking. 9
  • 10.
    Classes of Computers Servers:  Servers are used to provide larger-scale and more reliable file and computing services.  Consider the servers running Google, taking orders for CISCO, or running auctions on eBay. Failure of such server systems is far more catastrophic than failure of a single desktop, since these servers must operate 24 x 7. 10
  • 11.
    Classes of Computers Servers:  Servers are designed for efficient throughput i.e. in terms of transactions per minute or Web pages served per second.  Supercomputers are the most expensive computers, emphasize floating-point performance.  Clusters of desktop computers, have largely overtaken this class of computer. 11
  • 12.
    Classes of Computers Clusters/Warehouse-Scale Computers  The growth of Software as a Service (SaaS) for applications like search, social networking, video sharing, multiplayer games, online shopping, and so on has led to the growth of a class of computers called clusters.  Clusters are collections of desktop computers or servers connected by local area networks to act as a single larger computer.  Each node runs its own operating system, and nodes communicate using a networking protocol.  The largest of the clusters are called warehouse-scale computers (WSCs), in that they are designed so that tens of thousands of servers can act as one. 12
  • 13.
    Classes of Computers Embedded Computers  Embedded computers are the fastest growing computer market.  They range from microwaves, washing machines, printers, networking switches and all cars contain simple embedded microprocessors — to handheld digital devices, such as cell phones and smart cards to video games and digital set-top boxes.  Embedded applications: Minimize memory and minimize power. 13
  • 14.
    Classes of Parallelismand Parallel Architectures  Parallelism at multiple levels is now the driving force of computer design across all four classes of computers.  There are basically two kinds of parallelism in applications: 1. Data-Level Parallelism (DLP): Arises because there are many data items that can be operated on at the same time. 2. Task-Level Parallelism (TLP): Arises because tasks of work are created that can operate independently and largely in parallel. 14
  • 15.
    Classes of Parallelismand Parallel Architectures  Data parallelism:  Consider a 2-processor system (CPUs A and B) in a parallel environment, and we wish to do a task on some data „d‟.  It is possible to tell CPU A to do that task on one part of „d‟ and CPU B on another part simultaneously, thereby reducing the duration of the execution.  The data can be assigned using conditional statements  As a specific example, consider adding two matrices. In a data parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU B could add all elements from the bottom half of the matrices. 15
  • 16.
    Classes of Parallelismand Parallel Architectures  Task parallelism:  Task parallelism (function parallelism or control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments.  Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes.  In a multiprocessor system, task parallelism is achieved when each processor executes a different thread (or process) on the same or different data. 16
  • 17.
    Classes of Parallelismand Parallel Architectures  Task parallelism  As a simple example, if we are running code on a 2- processor system (CPUs "a" & "b") in a parallel environment and we wish to do tasks "A" and "B”.  It is possible to tell CPU "a" to do task "A" & CPU "b" to do task 'B" simultaneously, thereby reducing the runtime of the execution.  The tasks can be assigned using conditional statements. 17
  • 18.
    Defining Computer Architecture The Computer designer faces the problems:  To maximize performance while staying within cost, power, and availability constraints.  Instruction set design, functional organization, logic design, and implementation.  The implementation may encompass integrated circuit design, packaging, power and cooling. 18
  • 19.
    Defining Computer Architecture Instruction Set Architecture:  Class of ISA — ISAs are classified as general-purpose register architectures, where the operands are either registers or memory locations. All recent ISAs have load-store architecture.  Memory addressing — All desktops and servers uses byte addressing to access memory operands. 19
  • 20.
    Defining Computer Architecture Instruction Set Architecture:  Addressing modes — Specifies registers and constant operands, & the address of a memory object. 1. MIPS addressing modes are Register, Immediate (for constants), and Displacement, 2. The 80x86 supports those three plus three variations of displacement  Types and sizes of operands — Like most ISAs, MIPS and 80x86 support operand sizes of 8-bit, 16-bit, 32-bit, 64-bit, and IEEE 754 floating point in 32-bit and 64-bit. MIPS = Microprocessor without Interlocked Pipeline Stages, is basically RISC 20
  • 21.
    Defining Computer Architecture Instruction Set Architecture:  Operations — The categories of operations are data transfer, arithmetic logical, control and floating point.  Control flow instructions — All ISAs, supports conditional branches, unconditional jumps, procedure calls, and returns.  Encoding an ISA — There are two basic choices on encoding:  fixed length and variable length .  All MIPS instructions are 32 bits long.  Variable length instructions can take less space than fixed-length instructions. 21
  • 22.
    Trends in Technology Integrated circuit logic technology — Transistor density increases by about 35% per year. Increases in die size are less predictable and slower. The combined effect is a growth rate in transistor count on a chip of about 40% to 55% per year.  Semiconductor DRAM — Capacity increases by about 40% per year, doubling roughly every two years.  Magnetic disk technology — Prior to 1990, density increased by about 30% per year, doubling in three years. It rose to 60% per year thereafter, and increased to 100% per year in 1996. Since in 2004, it has dropped back to 30% per year.  Network technology — Network performance depends on switches and transmission system. 22
  • 23.
    Trends in Powerin Integrated Circuits  Initially power must be brought in and distributed around the chip.  Second, power is dissipated as heat and must be removed.  For CMOS chips, energy consumption is due to switching transistors, also called dynamic power.  The power required per transistor is proportional to the product of the load capacitance of the transistor, the square of the voltage, and the frequency of switching: Mobile devices care about battery life more than power, so energy is the proper metric, measured in joules: 23
  • 24.
    Trends in Powerin Integrated Circuits  Example: Some microprocessors today are designed to have adjustable voltage, so that a 15% reduction in voltage may result in a 15% reduction in frequency. What would be the impact on dynamic power?  Answer: Since the capacitance is unchanged, the answer is the ratios of the voltages and frequencies: thereby reducing power to about 60% of the original. 24
  • 25.
    Trends in Powerin Integrated Circuits  The increase in the number of transistors switching, and the frequency of switching, dominates the decrease in load capacitance and voltage, leading to an overall growth in power consumption and energy.  Power is now the major limitation, therefore most microprocessors today turn off the clock of inactive modules to save energy and dynamic power e.g. if no floating-point instructions are executing, the clock of the floating-point unit is disabled.  Although dynamic power is the primary source of power dissipation in CMOS, static power is becoming an important issue because leakage current flows even when a transistor is off:  Static Power is Calculated by: 25
  • 26.
    Trends in Cost Although there are computers where cost tends to be less important— specifically supercomputers.  In the past 20 years: The technology improvements to lower cost, increased performance, was a major theme in the computer industry.  Yet an understanding of cost and its factors is essential for designers to make intelligent decisions about whether or not a new feature should be included in designs. 26
  • 27.
    Trends in Cost: The Impact of Time, Volume, and Commodification  Cost of a manufactured computer component decreases over time even without major improvements in the implementation technology.  One example is that the price per megabyte of DRAM has dropped over the long term by 40% per year.  Volume is a second key factor in determining cost.  Increasing volumes affects cost in several ways.  First, they decrease the time needed, which is proportional to the number of systems manufactured.  Second, volume decreases cost, since it increases manufacturing efficiency. 27
  • 28.
    Trends in Cost: Cost of an Integrated Circuit:  Although the costs of integrated circuits have dropped exponentially, the basic process of silicon manufacture is unchanged.  The cost of a packaged integrated circuit is 28
  • 29.
    Trends in Cost: Example: Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side.  Answer: The die area is 2.25 cm2. Thus 29
  • 30.
    Trends in Cost: However, this only gives the maximum number of dies per wafer. The critical question is:  What is the fraction of good dies on a wafer number, or the die yield?  It is assumed that the defects are randomly distributed over the wafer and the yield is inversely proportional to the complexity of the fabrication process 30
  • 31.
    Trends in Cost: Example: Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect density of 0.4 per cm 2 and α is 4.  Answer: The total die areas are 2.25 cm 2 and 1.00 cm 2 . For the larger die, the yield is That is, less than half of all the large die are good but more than two- thirds of the small die are good. 31
  • 32.
  • 33.
    Measuring Performance  Whenwe say one computer is faster than another, what do we mean?  The user of a desktop computer may say a computer is faster when a program runs in less time,  While an amazon.com administrator may say a computer is faster when it completes more transactions per hour.  The computer user is interested in reducing response time — the time between the start and the completion of an event (execution time).  The administrator of a large data processing center may be interested in increasing throughput — the total amount of work done in a given time. 33
  • 34.
    Measuring Performance  Considertwo different computers, X and Y.  The phrase “X is faster than Y” is used to mean that the response time or execution time is lower on X than on Y for the given task.  In particular, “X is n times faster than Y” will mean Since execution time is the reciprocal of performance, the following relationship holds: The throughput of X is 1.3 times higher than Y, signifies that the number of tasks completed per unit time on computer X is 1.3 times the number completed on Y. 34
  • 35.
    Measuring Performance  Executiontime can be defined in different ways like wall-clock time, response time, or elapsed time,………which is the latency to complete a task.  The response time seen by the user is the elapsed time of the program, not the CPU time.  To evaluate a new system the users would simply compare the execution time of their workloads. 35
  • 36.
    Reporting Performance  ReportingPerformance Results  The reporting performance measurements should be for reproducibility — list everything another experimenter would need to duplicate the results.  A SPEC (Standard Performance Evaluation Corporation) (www.spec.org.) benchmark report requires an extensive description of the computer and the compiler flags, as well as the publication of both the baseline and optimized results. 36
  • 37.
    Reporting Performance  ReportingPerformance Results  In addition to hardware, software, and baseline tuning parameter descriptions, a SPEC report contains the actual performance times, shown both in tabular form and as a graph.  A TPC (Transaction Processing Council) benchmark report is even more complete. These reports are excellent sources for finding the real cost of computing systems. 37
  • 38.
    Summarizing Performance  SummarizingPerformance Results  A straightforward approach to computing a summary result would be to compare the arithmetic means of the execution times of the programs in the suite.  An alternative would be to add a weighting factor to each benchmark and use the weighted arithmetic mean as the single number to summarize performance.  Each company might have their own set of weights. 38
  • 39.
    Summarizing Performance  SummarizingPerformance Results  Rather than pick weights, we could normalize execution times to a reference computer by dividing the time on the reference computer by the time on the computer being rated.  SPEC uses this approach, calling the ratio the SPECRatio.  For example, suppose that the SPECRatio of computer A on a benchmark was 1.25 times higher than computer B; then you would know 39
  • 40.
    Summarizing Performance  SPECRatiois a ratio rather than an absolute execution time, the mean must be computed using the geometric mean. (Since, SPECRatios have no units, comparing SPECRatios arithmetically is meaningless.)  The formula is 40
  • 41.
    Summarizing Performance  Example:Show that the ratio of the geometric means is equal to the geometric mean of the performance ratios, and that the reference computer of SPECRatio matters not.  Answer: Assume two computers A and B and a set of SPECRatios for each. That is, the ratio of the geometric means of the SPECRatios of A and B is the geometric mean of the performance ratios of A to B. 41
  • 42.
  • 43.
    Pipelining (Concept)
    - Let's consider the example of washing a car. Suppose washing, drying and polishing a car require 30 minutes each.
    - To wash, dry and polish one after another: 1 car takes 1.5 hours, so 4 cars need 1.5 hours x 4 = 6 hours.
    - Now suppose that after the first car is washed it is sent for drying, and at the same time the second car is taken for washing. Washing of the second car and drying of the first car happen simultaneously and finish at the same time.
    - When they finish, the first car is sent for polishing, the second car is sent for drying, and the third car is taken for washing.
    - The total time to complete all three operations for 4 cars is ___.
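The car-wash timing can be worked out with a short sketch. It assumes, as the example does, that every stage takes the same time and that a new car enters as soon as the first stage is free; the function names are illustrative only.

```python
def sequential_minutes(num_items, num_stages, stage_minutes):
    """Each item passes through every stage before the next item starts."""
    return num_items * num_stages * stage_minutes


def pipelined_minutes(num_items, num_stages, stage_minutes):
    """The first item needs num_stages time slots; every later item
    finishes exactly one slot after the item in front of it."""
    return (num_stages + num_items - 1) * stage_minutes


sequential = sequential_minutes(num_items=4, num_stages=3, stage_minutes=30)
pipelined = pipelined_minutes(num_items=4, num_stages=3, stage_minutes=30)

print(f"one car at a time: {sequential / 60} hours")  # 6.0 hours
print(f"pipelined:         {pipelined / 60} hours")   # 3.0 hours
```

Each car still spends 1.5 hours in the wash; only the total time for the batch shrinks.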
  • 44.
    What is Pipelining?
    - In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one.
    - Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
    - The elements of a pipeline are often executed in parallel or in time-sliced fashion.
    - Today, pipelining is the key implementation technique used to make fast CPUs.
  • 45.
    Pipelining Types
    - Buffered, synchronous pipelines: Conventional microprocessors are synchronous circuits that use buffered, synchronous pipelines. In these pipelines, "pipeline registers" are inserted between pipeline stages and are clocked synchronously.
    - Buffered, asynchronous pipelines: Asynchronous pipelines are used in asynchronous circuits and have their pipeline registers clocked asynchronously. They use a request/acknowledge system, wherein each stage can detect when it has finished.
  • 46.
    Pipelining Types
    - Unbuffered pipelines: Unbuffered pipelines, called "wave pipelines", do not have registers between pipeline stages. Instead, the delays in the pipeline are "balanced" so that, for each stage, the difference between the first stabilized output data and the last is minimized.
  • 47.
    Pipelining
    - Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage.
    - In a computer, this processor cycle is usually 1 clock cycle (sometimes it is 2, rarely more).
    - The pipeline designer's goal is to balance the length of each pipeline stage.
    - If the stages are perfectly balanced, then the time per instruction on the pipelined processor is
      $$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
  • 48.
    Implementation of a RISC Instruction Set
    - How is a RISC instruction set implemented without pipelining?
    - Each RISC instruction takes at most 5 clock cycles.
    - This basic implementation is later extended to a pipelined version, resulting in a much lower CPI.
    - The unpipelined implementation is neither the most economical nor the highest-performance implementation.
    - Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture.
  • 49.
    Implementation of a RISC Instruction Set
    - Every instruction in this RISC subset can be implemented in at most 5 clock cycles. The 5 clock cycles are as follows.
    1. Instruction fetch cycle (IF): Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC.
    2. Instruction decode/register fetch cycle (ID): Decode the instruction and read the registers corresponding to register source specifiers from the register file. Decoding is done in parallel with reading registers, which is possible because the register specifiers are at a fixed location in a RISC architecture. This technique is known as fixed-field decoding.
  • 50.
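    Implementation of a RISC Instruction Set (continued)
    3. Execution/effective address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of three functions depending on the instruction type.
       - Memory reference: The ALU adds the base register and the offset to form the effective address.
       - Register-Register ALU instruction: The ALU performs the operation specified by the opcode on the values read from the register file.
       - Register-Immediate ALU instruction: The ALU performs the operation specified by the opcode on the first value read from the register file and the sign-extended immediate.
    - In a load-store architecture the effective address and execution cycles can be combined into a single clock cycle, because no instruction needs to calculate a data address and operate on the data at the same time.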
  • 51.
    Implementation of a RISC Instruction Set (continued)
    4. Memory access (MEM): If the instruction is a load, memory does a read using the effective address computed in the previous cycle. If it is a store, then the memory writes the data from the second register read from the register file using the effective address.
    5. Write-back cycle (WB): For a Register-Register ALU instruction or a load instruction, write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).
  • 52.
    Simple RISC pipeline
    [Figure: simple RISC pipeline. IF = instruction fetch, ID = instruction decode, EX = execution, MEM = memory access, and WB = write back.]
  • 53.
    Simple RISC pipeline
    [Figure: the pipeline as a series of data paths shifted in time. IM = instruction memory, DM = data memory, and CC = clock cycle.]
  • 54.
    Performance Issues in Pipelining
    - Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.
    - In fact, it usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline.
    - The increase in instruction throughput means that a program runs faster and has lower total execution time, even though no single instruction runs faster.
  • 55.
    Performance Issues in Pipelining
    - Imbalance among the pipeline stages reduces performance, since the clock can run no faster than the slowest stage allows.
    - Pipeline overhead arises from the combination of pipeline register delay and clock skew.
    - The pipeline registers add setup time (the time that a register input must be stable before the clock signal that triggers a write occurs).
    - Clock skew also contributes to the lower limit on the clock cycle.
    - Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further pipelining is useful, because there is no time left in the cycle for useful work.
  • 56.
    Performance Issues in Pipelining
    - Example: Consider the unpipelined processor in the previous section. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
    - Answer: The average instruction execution time on the unpipelined processor is
      Average instruction execution time = Clock cycle × Average CPI
                                         = 1 ns × ((40% + 20%) × 4 + 40% × 5)
                                         = 1 ns × 4.4
                                         = 4.4 ns
  • 57.
    Performance Issues in Pipelining
    - In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which will be 1 + 0.2 or 1.2 ns; this is the average instruction execution time.
    - Thus, the speedup from pipelining is
      $$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{4.4\ \text{ns}}{1.2\ \text{ns}} \approx 3.7 \text{ times}$$
    - The 0.2 ns overhead essentially establishes a limit on the effectiveness of pipelining. If the overhead is not affected by changes in the clock cycle, Amdahl's Law tells us that the overhead limits the speedup.
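The arithmetic of this example can be reproduced in a few lines. This is only a sketch of the calculation with the numbers given in the example; the variable names are illustrative.

```python
clock_ns = 1.0      # unpipelined clock cycle
overhead_ns = 0.2   # clock skew and setup overhead added by pipelining

# (relative frequency, cycles per instruction) for each operation class
instruction_mix = {
    "alu": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}

average_cpi = sum(freq * cycles for freq, cycles in instruction_mix.values())
unpipelined_ns = clock_ns * average_cpi   # average time per instruction
pipelined_ns = clock_ns + overhead_ns     # one instruction completes per cycle
speedup = unpipelined_ns / pipelined_ns

print(f"average CPI      = {average_cpi:.1f}")
print(f"unpipelined time = {unpipelined_ns:.1f} ns")
print(f"pipelined time   = {pipelined_ns:.1f} ns")
print(f"speedup          = {speedup:.2f}")
```

Raising `overhead_ns` while keeping the stage time fixed shows how quickly the overhead erodes the gain.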
  • 58.
    Pipeline Hazards
    - There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
    - Hazards reduce the performance gained from pipelining.
    - There are three classes of hazards:
    1. Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
    2. Data hazards arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
    3. Control hazards arise from the pipelining of branches and other instructions that change the PC.
  • 59.
    Pipeline Hazards
    - Hazards in pipelines can make it necessary to stall (stop, halt or pause) the pipeline.
    - Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed.
    - When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.
    - Instructions issued earlier than the stalled instruction must continue, otherwise the hazard will never clear.
  • 60.
    Pipeline Hazards: Structural Hazards
    - If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard.
    - Structural hazards arise when:
    1. Some functional unit is not fully pipelined.
    2. Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
    - For example, a processor may have only one register-file write port, but the pipeline might want to perform two writes in a clock cycle. This will generate a structural hazard.
  • 61.
    Pipeline Hazards: Structural Hazards
    - When a sequence of instructions encounters this hazard, the pipeline will stall one of the instructions until the required unit is available.
    - Such stalls will increase the CPI from its usual ideal value of 1.
    - To resolve this hazard, we stall the pipeline for 1 clock cycle. A stall is commonly called a pipeline bubble or just a bubble.
    - The effect of the pipeline bubble is to occupy the resources for that instruction slot as it travels through the pipeline, without doing useful work.
  • 62.
    Pipeline Hazards: Data Hazards
    - A major effect of pipelining is to change the relative timing of instructions by overlapping their execution.
    - This overlap introduces data and control hazards.
    - Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
  • 63.
    Pipeline Hazards: Data Hazards
    - Consider the pipelined execution of these instructions:
        DADD R1,R2,R3
        DSUB R4,R1,R5
        AND  R6,R1,R7
        OR   R8,R1,R9
        XOR  R10,R1,R11
    - All the instructions after the DADD use the result of the DADD instruction.
    - The DADD instruction writes the value of R1 in the WB pipe stage, but the DSUB instruction reads the value during its ID stage.
    - This problem is called a data hazard. Unless precautions are taken to prevent it, the DSUB instruction will read the wrong value and try to use it.
  • 64.
    Pipeline Hazards: Data Hazards
        DADD R1,R2,R3
        DSUB R4,R1,R5
        AND  R6,R1,R7
        OR   R8,R1,R9
        XOR  R10,R1,R11
    - The value DSUB reads is not even deterministic: if an interrupt occurs between the DADD and DSUB instructions, the WB stage of the DADD will complete, and the value of R1 at that point will be the result of the DADD.
    - The AND instruction is also affected by this hazard. It reads the registers during clock cycle 4, before the DADD writes R1 in clock cycle 5, so it will receive the wrong result.
    - The XOR instruction operates properly because its register read occurs in clock cycle 6, after the register write.
    - The OR instruction also operates without incurring a hazard, because the register file is written in the first half of the clock cycle and read in the second half.
  • 65.
    Pipeline Hazards: Minimizing Data Hazard Stalls by Forwarding
        DADD R1,R2,R3
        DSUB R4,R1,R5
        AND  R6,R1,R7
        OR   R8,R1,R9
        XOR  R10,R1,R11
    - The problem of data hazards can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting).
    - The key observation is that the result is not really needed by the DSUB until after the DADD actually produces it.
    - If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, then the need for a stall can be avoided.
  • 66.
    Pipeline Hazards: Minimizing Data Hazard Stalls by Forwarding
        DADD R1,R2,R3
        DSUB R4,R1,R5
        AND  R6,R1,R7
        OR   R8,R1,R9
        XOR  R10,R1,R11
    - Using this observation, forwarding works as follows:
    1. The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
    2. If the forwarding hardware detects that the previous ALU operation has written to the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
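The forwarding decision for the DADD sequence can be sketched as a lookup on how far back the producing instruction is. This is a simplified model, not a description of real control logic: it assumes the classic five-stage pipeline, that every producer is an ALU instruction (no loads), and that the register file is written in the first half of a cycle and read in the second half. The `Instr` type and `operand_source` function are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


program = [
    Instr("DADD", "R1", "R2", "R3"),
    Instr("DSUB", "R4", "R1", "R5"),
    Instr("AND", "R6", "R1", "R7"),
    Instr("OR", "R8", "R1", "R9"),
    Instr("XOR", "R10", "R1", "R11"),
]


def operand_source(program, index, reg):
    """Say where instruction `index` obtains `reg` when it reaches EX."""
    # Search backwards for the most recent earlier writer of `reg`.
    for distance in range(1, index + 1):
        producer = program[index - distance]
        if producer.dest == reg:
            if distance == 1:
                return "EX/MEM"   # producer has just left EX
            if distance == 2:
                return "MEM/WB"   # producer has just left MEM
            # Three or more instructions back: the producer has written
            # (or is writing, in the first half of this cycle) the register.
            return "register file"
    return "register file"        # no earlier writer in this sequence


for k, instr in enumerate(program[1:], start=1):
    print(f"{instr.op:5s} gets R1 from: {operand_source(program, k, 'R1')}")
```

Under these assumptions DSUB takes R1 from EX/MEM, AND takes it from MEM/WB, and OR and XOR read it from the register file, so no instruction in the sequence needs to stall.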
  • 67.
    Pipeline Hazards: Branch Hazards
    - Control hazards can cause a greater performance loss for the MIPS pipeline than data hazards.
    - When a branch is executed, it may or may not change the PC to something other than its current value plus 4.
    - If the branch is not taken, then repeating the IF stage is unnecessary, since the correct instruction was already fetched.
    - One stall cycle for every branch will yield a performance loss of 10% to 30%, depending on the branch frequency.
  • 68.
    Pipeline Hazards: Reducing Pipeline Branch Penalties
    - The software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior.
    - The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known.
    - The attraction of this scheme is that it is simple for both hardware and software.
  • 69.
    Pipeline Hazards: Reducing Pipeline Branch Penalties
    - A higher-performance scheme is to treat every branch as not taken, allowing the hardware to continue as if the branch were not executed.
    - In the simple five-stage pipeline, this predicted-untaken scheme is implemented by continuing to fetch instructions as if the branch were a normal instruction.
    - If the branch is taken, however, we need to turn the fetched instruction into a no-op and restart the fetch at the target address.
  • 70.
    Pipeline Hazards: Reducing Pipeline Branch Penalties
    - An alternative scheme is to treat every branch as taken.
    - As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target address.
    - In some processors, especially those with more powerful (and hence slower) branch conditions, the branch target is known before the branch outcome, and a predicted-taken scheme might make sense.
    - Another scheme used in some processors is called delayed branch. This technique was heavily used in early RISC processors.
  • 71.
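    Performance of Pipelines with Stalls
    - A stall causes the pipeline performance to degrade from the ideal performance.
    - The actual speedup from pipelining can be found starting with the earlier formula:
      $$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$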
  • 72.
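    Performance of Pipelines with Stalls
    - Pipelining can be thought of as decreasing either the CPI or the clock cycle time.
    - The ideal CPI on a pipelined processor is almost always 1.
    - Hence, we can compute the pipelined CPI:
      $$\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \text{Pipeline stall clock cycles per instruction}$$
    - If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two processors can be equal, leading to
      $$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}$$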
  • 73.
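    Performance of Pipelines with Stalls
    - If all instructions take the same number of cycles, which must also equal the number of pipeline stages, then the unpipelined CPI is equal to the depth of the pipeline, leading to
      $$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
    - If there are no pipeline stalls, then pipelining can improve performance by the depth of the pipeline.
    - Alternatively, if we think of pipelining as improving the clock cycle time, then we can assume that the CPI of the unpipelined processor, as well as that of the pipelined processor, is 1. This leads to
      $$\text{Speedup} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$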
  • 74.
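    Performance of Pipelines with Stalls
    - In cases where the pipe stages are perfectly balanced and there is no overhead, the clock cycle on the pipelined processor is smaller than the clock cycle of the unpipelined processor by a factor equal to the pipeline depth:
      $$\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}}, \qquad \text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
    - This leads to the following:
      $$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$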
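The relation between stalls and speedup can be explored with a small sketch. It assumes perfectly balanced stages, no pipelining overhead and an ideal CPI of 1, as in the derivation above; the branch frequencies and the one-cycle penalty are illustrative inputs, and the function names are hypothetical.

```python
def pipelined_cpi(stall_cycles_per_instruction, ideal_cpi=1.0):
    """CPI pipelined = ideal CPI + pipeline stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instruction


def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stall_cycles_per_instruction)


# Five-stage pipeline where the only stalls are a one-cycle branch penalty.
depth = 5
branch_penalty_cycles = 1
for branch_frequency in (0.10, 0.20, 0.30):
    stalls = branch_frequency * branch_penalty_cycles
    print(
        f"branch frequency {branch_frequency:.0%}: "
        f"CPI = {pipelined_cpi(stalls):.2f}, "
        f"speedup = {pipeline_speedup(depth, stalls):.2f}"
    )
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction adds directly to the denominator.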
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
    Instruction-Level Parallelism
    - All processors since about 1985 use pipelining to overlap the execution of instructions and improve performance.
    - This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.
  • 80.
    Instruction-Level Parallelism
    - Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.
    - For example, consider the following program:
        1. e = a + b
        2. f = c + d
        3. g = e * f
    - Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed.
    - Operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously.
    - If each operation is completed in one unit of time, then the three instructions can be completed in two units of time, giving an ILP of 3/2.
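The 3/2 figure can be derived mechanically from the dependences. The sketch below assumes unit latency for every operation and enough functional units to run all ready operations at once; the dictionary encoding of the program is an illustrative choice.

```python
# Each operation maps to the operations whose results it needs.
dependences = {
    "e": [],          # e = a + b
    "f": [],          # f = c + d
    "g": ["e", "f"],  # g = e * f
}


def finish_times(dependences):
    """Earliest time unit in which each operation can complete."""
    done = {}

    def finish(op):
        if op not in done:
            # An operation starts once all of its producers have finished.
            start = max((finish(d) for d in dependences[op]), default=0)
            done[op] = start + 1
        return done[op]

    for op in dependences:
        finish(op)
    return done


times = finish_times(dependences)
total_time = max(times.values())
ilp = len(dependences) / total_time

print(times)        # {'e': 1, 'f': 1, 'g': 2}
print(ilp)          # 1.5
```

Three operations finishing in two time units gives the ILP of 3/2 stated in the example.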
  • 81.
    Instruction-Level Parallelism
    - Ordinary programs are written and executed sequentially. ILP allows the compiler and the processor to overlap the execution of multiple instructions, or even to change the order in which instructions are executed.
    - How much ILP exists in programs is very application specific. In certain fields, such as graphics and scientific computing, the amount can be very large. Other workloads, such as cryptography, exhibit much less parallelism.
  • 82.
    Instruction-Level Parallelism
    - There are two largely separable approaches to exploiting (utilizing) ILP:
      (1) an approach that relies on hardware to help discover and exploit the parallelism dynamically, and
      (2) an approach that relies on software technology to find parallelism statically at compile time.
  • 83.
    Instruction-Level Parallelism
    - The value of the CPI for a pipelined processor is the sum of the base CPI and all contributions from stalls:
      Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
    - The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation.
  • 84.
    Instruction-Level Parallelism
    - The simplest and most common way to increase the ILP is to exploit parallelism among iterations of a loop, which is called loop-level parallelism.
    - Example: a loop that adds two 1000-element arrays:
        for (i=0; i<=999; i=i+1)
            x[i] = x[i] + y[i];
    - Every iteration of the loop can overlap with any other iteration, because no iteration uses a value produced by another.
    - Such techniques work by unrolling the loop, either statically by the compiler or dynamically by the hardware.
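To make the idea of unrolling concrete, here is a hand-unrolled version of the same loop. It is written in Python purely to show the transformation a compiler might apply; the unroll factor of 4 is an arbitrary assumption, and Python itself gains no speed from this rewrite.

```python
def add_arrays(x, y):
    """Rolled loop: one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]


def add_arrays_unrolled4(x, y):
    """Same computation with the loop body replicated four times.

    The four statements in the body are independent of one another, which
    is what gives a scheduler room to overlap them.
    """
    n = len(x)
    main = n - n % 4                  # largest multiple of 4 not above n
    for i in range(0, main, 4):
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
    for i in range(main, n):          # leftover iterations, if any
        x[i] = x[i] + y[i]


x_rolled = list(range(1000))
x_unrolled = list(range(1000))
y = [2] * 1000
add_arrays(x_rolled, y)
add_arrays_unrolled4(x_unrolled, y)
print(x_rolled == x_unrolled)  # True
```

The unrolled version also executes the loop-control overhead (index update and branch) only once per four elements.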
  • 85.
    Instruction-Level Parallelism
    - An important alternative method for exploiting loop-level parallelism is the use of SIMD in both vector processors and Graphics Processing Units (GPUs).
    - A SIMD instruction exploits data-level parallelism by operating on a small to moderate number of data items in parallel.
  • 86.
    SIMD
    - SIMD machines are capable of applying the exact same instruction stream to multiple streams of data simultaneously.
    - This type of architecture is well suited to achieving very high processing rates when the same operation must be applied to many data items.
  • 87.
    SIMD
    [Figures: SIMD processable patterns; SIMD unprocessable patterns. Example: brightness computation by SIMD operations.]
  • 88.
    ILP Challenges
    - Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited.
    - If two instructions are independent, they can execute simultaneously in a pipeline, provided that the pipeline has sufficient resources (and hence no structural hazards exist).
    - If two instructions are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.
  • 89.
    ILP Challenges
    - There are three different types of dependences:
      - Data dependences (also called true data dependences),
      - Name dependences, and
      - Control dependences.
    - An instruction j is data dependent on instruction i if either of the following conditions is true:
      - Instruction i produces a result that may be used by instruction j.
      - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
  • 90.
    ILP Challenges: Data Dependences
    - For example, consider the following MIPS code sequence that increments a vector of values in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2. (For simplicity, ignore the effects of delayed branches.)
    - Here, each instruction depends on the previous one.
  • 91.
    ILP Challenges: Data Dependences
    - A data dependence conveys three things:
      (1) the possibility of a hazard,
      (2) the order in which results must be calculated, and
      (3) an upper bound on how much parallelism can be exploited.
    - A data dependence can be overcome in two different ways:
      (1) maintaining the dependence but avoiding a hazard, and
      (2) eliminating a dependence by transforming the code.
    - Scheduling the code is the primary method used to avoid a hazard without altering a dependence, and such scheduling can be done both by the compiler and by the hardware.
  • 92.
    ILP Challenges: Name Dependences
    - A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
    - There are two types of name dependences between an instruction i that precedes instruction j in program order:
    1. An antidependence: when instruction j writes a register or memory location that instruction i reads.
    2. An output dependence: when instruction i and instruction j write the same register or memory location.
  • 93.
    ILP Challenges: Name Dependences
    - Because a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered, provided the name (register number or memory location) used in the instructions is changed so that the instructions do not conflict.
    - This renaming can be more easily done for register operands, where it is called register renaming.
    - Register renaming can be done either statically by a compiler or dynamically by the hardware.
  • 94.
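    ILP Challenges: Data Hazards
    - A hazard exists whenever there is a name or data dependence between instructions and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence.
    - Because of the dependence, we must preserve program order.
    - The goal of both software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program.
    - Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions.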
  • 95.
    ILP Challenges: Data Hazards
    - Consider two instructions i and j, with i preceding j in program order. The possible data hazards are:
      ■ RAW (Read After Write): j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence.
      ■ WAW (Write After Write): j tries to write an operand before it is written by i, so the writes end up in the wrong order. This hazard corresponds to an output dependence. WAW hazards are present only in pipelines that write in more than one pipe stage.
      ■ WAR (Write After Read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static issue pipelines; they occur when some instructions write results early in the instruction pipeline and other instructions read a source late in the pipeline.
    - Note that the RAR (Read After Read) case is not a hazard.
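The three hazard types follow directly from which registers two instructions read and write. The sketch below classifies the potential hazards between an earlier instruction i and a later instruction j; whether a hazard actually materialises depends on the pipeline, which this model deliberately ignores. The function name and the set-based encoding are illustrative assumptions.

```python
def potential_hazards(i_reads, i_writes, j_reads, j_writes):
    """Classify hazards between instruction i and a later instruction j.

    Each argument is a set of register names. The result names the
    dependences that could turn into hazards if i and j overlap.
    """
    hazards = set()
    if i_writes & j_reads:
        hazards.add("RAW")   # true data dependence
    if i_reads & j_writes:
        hazards.add("WAR")   # antidependence
    if i_writes & j_writes:
        hazards.add("WAW")   # output dependence
    return hazards           # read-after-read is never reported


# DADD R1,R2,R3 followed by DSUB R4,R1,R5
print(potential_hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))  # {'RAW'}
# DADD R1,R2,R3 followed by DSUB R2,R4,R5
print(potential_hazards({"R2", "R3"}, {"R1"}, {"R4", "R5"}, {"R2"}))  # {'WAR'}
# DADD R1,R2,R3 followed by DSUB R1,R4,R5
print(potential_hazards({"R2", "R3"}, {"R1"}, {"R4", "R5"}, {"R1"}))  # {'WAW'}
```

Two instructions that only share source registers produce an empty set, matching the remark that RAR is not a hazard.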
  • 96.
    ILP Challenges: Control Dependences
    - A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that instruction i is executed in correct program order and only when it should be.
    - An example of a control dependence is the dependence of the statements in the "then" part of an "if" statement on the branch.
  • 97.
    ILP Challenges: Control Dependences
    - In a code segment of the form "if p1 { S1; } if p2 { S2; }", S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
    - In general, two constraints are imposed by control dependences:
    1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
    2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
    - For example, we cannot take a statement from before the if statement and move it into the then portion.
  • 98.
    Overcoming Data Hazards
    - A simple statically scheduled pipeline fetches an instruction and issues it, unless there is a data dependence that cannot be hidden with forwarding.
    - If there is such a data dependence, then the hazard detection hardware stalls the pipeline starting with the instruction that uses the result.
    - No new instructions are fetched or issued until the dependence is cleared.
  • 99.
    Dynamic Scheduling
    - In dynamic scheduling, the hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior.
    - Dynamic scheduling offers several advantages.
    - First, it allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline, eliminating the need to recompile for a different microarchitecture.
    - Second, it enables handling of cases where dependences are unknown at compile time.
    - Third, and most important, it allows the processor to tolerate unpredictable delays, such as cache misses.
  • 100.
    Dynamic Scheduling: The Idea
    - For example, consider this code:
        DIV.D F0,F2,F4
        ADD.D F10,F0,F8
        SUB.D F12,F8,F14
    - The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet, SUB.D is not data dependent on anything in the pipeline.
    - This hazard creates a performance limitation that can be eliminated by not requiring instructions to execute in program order.
  • 101.
    Dynamic Scheduling: The Idea
        DIV.D F0,F2,F4
        ADD.D F10,F0,F8
        SUB.D F12,F8,F14
    - In the classic five-stage pipeline, both structural and data hazards could be checked during instruction decode (ID).
    - To allow us to begin executing the SUB.D in the above example, we must separate the issue process into two parts:
      - checking for any structural hazards, and
      - waiting for the absence of a data hazard.
    - Thus, we still use in-order instruction issue, but we want an instruction to begin execution as soon as its data operands are available.
  • 102.
    Dynamic Scheduling
    - To understand how register renaming eliminates WAR and WAW hazards, consider the following example code sequence:
        DIV.D F0,F2,F4
        ADD.D F6,F0,F8
        S.D   F6,0(R1)
        SUB.D F8,F10,F14
        MUL.D F6,F10,F8
    - There are two antidependences: between the ADD.D and the SUB.D (on F8) and between the S.D and the MUL.D (on F6).
    - There is also an output dependence between the ADD.D and the MUL.D (on F6).
    - There are also three true data dependences: between the DIV.D and the ADD.D, between the SUB.D and the MUL.D, and between the ADD.D and the S.D.
  • 103.
    Dynamic Scheduling
    - These name dependences can all be eliminated by register renaming.
    - For simplicity, assume the existence of two temporary registers, S and T. Using S and T, the sequence can be rewritten without any name dependences as:
        Original               Renamed
        DIV.D F0,F2,F4         DIV.D F0,F2,F4
        ADD.D F6,F0,F8         ADD.D S,F0,F8
        S.D   F6,0(R1)         S.D   S,0(R1)
        SUB.D F8,F10,F14       SUB.D T,F10,F14
        MUL.D F6,F10,F8        MUL.D F6,F10,T
    - In addition, any subsequent uses of F8 must be replaced by the register T. In this code segment, the renaming process can be done statically by the compiler.
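The renaming step can be sketched as a single pass over the code. Unlike the slide, which introduces only the two temporaries S and T, this sketch gives every written register a fresh name (P0, P1, ...), which is closer in spirit to what renaming hardware does; the tuple encoding of instructions and the `P` names are assumptions made for illustration.

```python
from itertools import count


def rename(program):
    """Rename every destination register to a fresh name.

    `program` is a list of (opcode, destination, sources) tuples; the
    destination is None for instructions such as stores that write no
    register. Sources are rewritten to the latest name of the register
    they read, which preserves the true data dependences.
    """
    fresh_names = (f"P{n}" for n in count())
    latest = {}      # architectural register -> current renamed register
    renamed = []
    for opcode, dest, sources in program:
        new_sources = tuple(latest.get(s, s) for s in sources)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh_names)
            latest[dest] = new_dest
        renamed.append((opcode, new_dest, new_sources))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

for instruction in rename(program):
    print(instruction)
```

After renaming, no two instructions write the same register and no instruction writes a register that an earlier one reads, so the WAW and WAR hazards disappear while the three true dependences remain.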
  • 104.
    Speculation
    - Exploiting more parallelism requires that we overcome the limitation of control dependence.
    - Overcoming control dependence is done by speculating (guessing) on the outcome of branches and executing the program as if our guesses were correct.
    - With speculation, we fetch, issue, and execute instructions as if our branch predictions were always correct; dynamic scheduling alone only fetches and issues such instructions.
    - Hardware speculation extends the ideas of dynamic scheduling.
  • 105.
    Hardware-Based Speculation
    - Hardware-based speculation combines three key ideas:
      (1) dynamic branch prediction to choose which instructions to execute,
      (2) speculation to allow the execution of instructions before the control dependences are resolved, and
      (3) dynamic scheduling to deal with the scheduling of different combinations of basic blocks.
    - Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions.
  • 106.
    Hardware-Based Speculation
    - Speculation allows instructions to execute out of order but forces them to commit in order.
    - Adding this commit phase to the instruction execution sequence requires an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed.
    - This hardware buffer, the reorder buffer (ROB), is also used to pass results among instructions that may be speculated.
    - The ROB supplies operands in the interval between completion of instruction execution and instruction commit.
  • 107.
    Hardware-Based Speculation
    - Each entry in the ROB contains four fields:
      - the instruction type,
      - the destination field,
      - the value field, and
      - the ready field.
    - The instruction type field indicates whether the instruction is a branch (and has no destination result), a store (which has a memory address destination), or a register operation (an ALU operation or load, which has a register destination).
    - The destination field supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written.
    - The value field is used to hold the value of the instruction result until the instruction commits.
    - The ready field indicates that the instruction has completed execution, and the value is ready.
  • 108.
    Hardware-Based Speculation
    [Figure: the basic structure of an FP unit using Tomasulo's algorithm, extended to handle speculation.]
  • 109.
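    Hardware-Based Speculation
    - The hardware structure of the processor including the ROB is shown in the preceding figure.
    - The ROB includes the store buffers. The renaming function of the reservation stations is replaced by the ROB.
    - Since every instruction has a position in the ROB until it commits, a result is tagged with the ROB entry number rather than the reservation station number. This tagging requires that the ROB entry assigned to an instruction be tracked in the reservation station.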
  • 110.
    Hardware-Based Speculation
    - Here are the four steps involved in instruction execution:
    1. Issue: Get an instruction from the instruction queue. Issue the instruction if there is an empty reservation station and an empty slot in the ROB. If either all reservation stations are full or the ROB is full, then instruction issue is stalled until both have available entries.
    2. Execute: If one or more of the operands is not yet available, monitor the CDB (Common Data Bus) while waiting for the register to be computed. This step checks for RAW hazards. When both operands are available at a reservation station, execute the operation. Instructions may take multiple clock cycles in this stage.
  • 111.
    Hardware-Based Speculation
    - The four steps involved in instruction execution (continued):
    3. Write result: When the result is available, write it on the CDB (Common Data Bus) and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
    4. Commit: This is the final stage of completing an instruction, after which only its result remains. (The commit phase is also called "completion" or "graduation".)
  • 112.
    Hardware-Based Speculation
    - There are three different sequences of actions at commit:
    1. The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer. The processor then updates the register with the result and removes the instruction from the ROB.
    2. Committing a store is similar, except that memory is updated rather than a result register.
    3. When a branch with an incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong. The ROB is flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished.
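The three commit cases can be sketched with a queue of ROB entries. This is a simplified software model under stated assumptions, not a description of any real processor: entries carry the four fields listed earlier plus a hypothetical `mispredicted` flag for branches, registers and memory are plain dictionaries, and restarting the fetch at the correct branch target is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    kind: str                   # "register", "store", or "branch"
    destination: object         # register name, memory address, or None
    value: object = None
    ready: bool = False
    mispredicted: bool = False  # assumed flag, meaningful for branches only


def commit(rob, registers, memory):
    """Commit ready entries from the head of the ROB, strictly in order.

    Returns the number of entries committed. Stops at the first entry
    that is not ready, or after flushing on a mispredicted branch.
    """
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed += 1
        if entry.kind == "register":
            registers[entry.destination] = entry.value
        elif entry.kind == "store":
            memory[entry.destination] = entry.value
        elif entry.kind == "branch" and entry.mispredicted:
            rob.clear()         # discard all speculated instructions
            break
    return committed


registers, memory = {}, {}
rob = deque([
    RobEntry("register", "F0", 1.5, ready=True),
    RobEntry("store", 100, 1.5, ready=True),
    RobEntry("branch", None, ready=True, mispredicted=True),
    RobEntry("register", "F6", 9.0, ready=True),   # wrongly speculated
])
count = commit(rob, registers, memory)
print(count, registers, memory, len(rob))
```

The write to F6 never reaches the register file, because the entry is discarded when the mispredicted branch ahead of it commits.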
  • 113.
    ILP Using Multiple Issue
    - To improve performance further, we would like to decrease the CPI to less than one, but the CPI cannot be reduced below one if we issue only one instruction every clock cycle.
    - The goal of multiple-issue processors is to allow multiple instructions to issue in a clock cycle.
    - Multiple-issue processors come in three major flavors:
    1. Statically scheduled superscalar processors
    2. VLIW (very long instruction word) processors
    3. Dynamically scheduled superscalar processors
  • 114.
    Thank You … shindesir.pvp@gmail.com (This Presentation is Published Only for Educational Purpose)