KEMBAR78
Unit 2-Basic Processing Unit | PDF | Central Processing Unit | Digital Technology
0% found this document useful (0 votes)
127 views95 pages

Unit 2-Basic Processing Unit

The CPU fetches instructions from memory one at a time and executes them by performing more basic operations. It keeps track of the next instruction address using the program counter. Executing an instruction involves operations like register transfers to move data, ALU operations, loading data from memory, and storing data to memory. The CPU contains components like the ALU, registers, program counter, and instruction register to perform these operations according to the instructions.

Uploaded by

Akhil
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
127 views95 pages

Unit 2-Basic Processing Unit

The CPU fetches instructions from memory one at a time and executes them by performing more basic operations. It keeps track of the next instruction address using the program counter. Executing an instruction involves operations like register transfers to move data, ALU operations, loading data from memory, and storing data to memory. The CPU contains components like the ALU, registers, program counter, and instruction register to perform these operations according to the instructions.

Uploaded by

Akhil
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 95

Basic Processing Unit

UNIT II
Overview
 Instruction Set Processor (ISP) or Central Processing
Unit (CPU) – the unit which executes instructions and
coordinates the activities of the other units
 A typical computing task consists of a series of steps
specified by a sequence of machine instructions that
constitute a program.
 An instruction is executed by carrying out a sequence
of more fundamental operations.
Some Fundamental
Concepts
Fundamental Concepts
 Processor fetches one instruction at a time and perform
the operation specified.
 Instructions are fetched from successive memory
locations until a branch or a jump instruction is
encountered.
 Processor keeps track of the address of the memory
location containing the next instruction to be fetched
using Program Counter (PC).
 Instruction Register (IR) – holds the instruction fetched
from memory
Executing an Instruction
 Fetch the contents of the memory location pointed to by
the PC. The contents of this location are loaded into the
IR (fetch phase).
IR ← [[PC]]
 Assuming that an instruction is 4bytes long, increment
the contents of the PC by 4 to point to the next
instruction. (fetch phase)
PC ← [PC] + 4
 Carry out the actions specified by the instruction in the IR
(execution phase).
Processor Organization
Internal organization of the
processor
 ALU – Used to perform arithmetic and logical operation.
 Registers for temporary storage
 Various digital circuits for executing different micro
operations.(gates, MUX, decoders, counters).
 Internal path for movement of data between ALU and
registers.
 Driver circuits for transmitting signals to external units.
 Receiver circuits for incoming signals from external
units.
 PC:
 Keeps track of execution of a program
 Contains the memory address of the next instruction to be fetched and
executed.

 MAR:
 Holds the address of the location to be accessed.
 I/p of MAR is connected to Internal bus and an O/p to external bus.

 MDR:
 Contains data to be written into or read out of the addressed location.
 It has 2 inputs and 2 Outputs.
 Data can be loaded into MDR either from memory bus or from internal
processor bus.

 The data and address lines are connected to the internal bus via MDR
and MAR
Registers:
 The processor registers R0 to Rn-1 vary considerably from one
processor to another.
 Registers are provided for general purpose, used by programmer.
 Special purpose registers-index & stack registers.
 Registers Y,Z &TEMP are temporary registers used by processor
during the execution of some instruction.
Multiplexer:
 Select either the output of the register Y or a constant value 4 to be
provided as input A of the ALU.
 Constant 4 is used by the processor to increment the contents of PC.
Data Path:
 The registers, ALU and interconnecting bus are collectively
referred to as the data path.
Executing an Instruction
 Execution of an instruction involves one or more of the
following operations:

 Register Transfers: Transfer a word of data from one processor


register to another or to the ALU.

 ALU operations: Perform an arithmetic or a logic operation


and store the result in a processor register.

 Load operation: Fetch the contents of a given memory location


and load them into a processor register.
 Store operation: Store a word of data from a processor register
into a given memory location.
Register Transfers Riin
Internal processor
bus

Ri

Riout

Yin

Constant 4

Select MUX

A B
ALU

Zin

Z out

Figure 7.2. Input and output gating for the registers in Figure 7.1.
 The input and output gates for register Ri are
controlled by signals Riin and Riout .

 Riin is set to1 – data available on internal bus are


loaded into Ri.

 Riout is set to1 – the contents of register are placed


on the bus.
Data transfer between two
registers:
EX:
Transfer the contents of R1 to R4.

1. Enable output of register R1 by setting R1out=1.


This places the contents of R1 on the processor bus.

2. Enable input of register R4 by setting R4in=1. This


loads the data from the processor bus into register
R4.
Performing an Arithmetic or
Logic Operation
 The ALU is a combinational circuit that has no internal
storage.
 ALU gets the two operands from MUX and bus. The
result is temporarily stored in register Z.
 What is the sequence of operations to add the contents of
register R1 to those of R2 and store the result in R3? i.e.
Add R1, R2, R3
1. R1out, Yin

2. R2out, SelectY, Add, Zin

3. Zout, R3in
Processor Organization

Add R1, R2, R3


1.R1out, Yin
2.R2out, SelectY, Add, Zin
3.Zout, R3in
 Step 1: Output of the register R1 and input of the
register Y are enabled, causing the contents of R1 to
be transferred to Y.

 Step 2: The multiplexer’s select signal is set to select


Y causing the multiplexer to gate the contents of
register Y to input A of the ALU.

 Step 3: The contents of Z are transferred to the


destination register R3.
Load Operation - Fetching a Word
from Memory
 Address into MAR; issue Read operation; data into MDR.
Memory-bus Internal processor
data lines MDRoutE MDRout bus

MDR

MDR inE MDRin

Figure 7.4.
Figure 7.4. Connection and control
Connection and controlsignals
signalsfor
forregister
registerMDR.
MDR.
Fetching a Word from Memory
 The response time of each memory access varies (cache miss)
 To accommodate this, the processor waits until it receives an
indication that the requested operation has been completed
(Memory-Function-Completed, MFC).
 Move (R1), R2
 MAR ← [R1] R1out, MARin, Read
 Start a Read operation on the memory bus
 Wait for the MFC response from the memory MDRinE,WMFC
 Load MDR from the memory bus
 R2 ← [MDR]
MDRout, R2in

Memory-bus Internal processor


data lines MDRoutE MDRout bus

MDR

MDR inE MDRin


Step 1 2 3

Timing Clock

MARin MAR ← [R1]


Assume MAR
is always available Address
on the address lines
of the memory bus. Start a Read operation on the memory bus
Read

MR

MDRinE

Data

Wait for the MFC response from the memory


MFC

MDR out Load MDR from the memory bus


R2 ← [MDR]

Figure 7.5. Timing of a memory Read operation.


Storing a word in memory
 Move R2,(R1)

1. R1out, MARin
2. R2out, MDRin, Write
3. MDRoutE, WMFC
Execution of a Complete
Instruction
 Add (R3), R1
 Fetch the instruction
 Fetch the first operand (the contents of the
memory location pointed to by R3)
 Perform the addition
 Store the result into R1
Execution of a Complete
Instruction Internal processor
bus

Add (R3), R1 Control signals

PC

Instruction
Step Action Address
decoder and
lines
MAR control logic

1 PC out , MAR in , Read, Select4, Add, Zin Memory


bus

2 Zout , PC in , Yin , WMF C MDR


Data
lines IR
3 MDR out , IR in
4 R3out , MAR in , Read Y
Constant 4 R0
5 R1out , Yin , WMF C
6 MDR out , SelectY, Add, Zin Select MUX

7 Zout , R1 in , End Add


A B
ALU Sub R n - 1 
control ALU
lines
Carry-in
XOR TEMP
Figure 7.6. Con trol sequence for execution of the instruction Add (R3),R1.
Z

Figure 7.1. Single-bus organization of the datapath inside a processor.


Execution of Branch Instructions
 A branch instruction replaces the contents of PC
with the branch target address, which is usually
obtained by adding an offset X given in the
branch instruction.
 The offset X is usually the difference between the
branch target address and the address immediately
following the branch instruction.
 Unconditional branch – Jump instructions
Execution of unconditional
Branch Instruction

Step Action

1 PCout , MAR in , Read, Select4, Add, Z in


2 Zout , PC in , Yin , WMFC
3 MDR out , IR in
4 Offset-field-of-IRout , Select Y, Add, Zin
5 Z out , PCin , End

Figure 7.7(a). Control sequence for an unconditional branch instruction.


Execution of conditional Branch
Instruction
Step Action

1 PCout , MAR in , Read, Select4, Add, Z in


2 Zout , PC in , Yin , WMFC
3 MDR out , IR in
4 Offset-field-of-IRout , Select Y, Add, Z in , If N=0 then End
5 Z out , PCin , End

Figure 7.7(b). Control sequence for an unconditional branch instruction.

If N=1 then branch is taken; execute an instruction from branch target.


N=0 branch is not taken; execute a sequentially next instruction.
Multiple-Bus Organization
Bus A Bus B Bus C

Incrementer

PC

Register
file

Constant 4

MUX
A

ALU R

Instruction
decoder

IR

MDR

MAR

Memory bus Address


data lines lines

Figure 7.8. Three-b us organization of the datapath.


 General purpose registers are combined into a single block
called register file.
 Two output ports allow the contents of two different registers
to be accessed simultaneously and have their contents placed
on buses A and B.
 Third port allows the data on bus C to be loaded into a third
register during the same clock cycle.

 Bus A & B are used to transfer the source operands to A & B


inputs of the ALU.
 The result of ALU operation is transferred to the destination
over the bus C.
 ALU may simply pass one of its 2 input operands
unmodified to bus C.
 The ALU control signals for such an operation R=A or R=B.
 Incrementer unit is used to increment the PC by 4.
 Using the incrementer eliminates the need to add the
constant value 4 to the PC using the main ALU.
 The source for the constant 4 at the ALU input multiplexer
can be used to increment other address such as loadmultiple
& storemultiple
Multiple-Bus Organization
 Add R4, R5, R6
Bus A Bus B Bus C

Control sequence for the instruction Add R4,R5,R6, Incrementer

on three-bus organization PC

Register
file

Step Action
Constant 4

MUX
1 PCout , R=B, MAR in , Read, IncPC A

ALU R

2 WMFC B

3 MDR outB , R=B, IR in


Instruction
decoder

4 R4outA , R5outB , SelectA, Add, R6 in , End IR

MDR

MAR

Memory bus Address


data lines lines

Figure 7.8. Three-b us organization of the datapath.


ADD (R3), R2, R1
i.e. R1← (R3) + [R2] Bus A Bus B Bus C

Incrementer

PC

 PCout, R=B, MARin, Read, IncPC Register


file

WMFC
Constant 4

MUX
MDRoutB, R=B, IRin
A
 ALU R

 R3outB, R=B, MARin, Read


Instruction

WMFC
decoder

IR

 MDRoutB, R2outA,Add, R1in, End MDR

MAR

Memory bus Address


data lines lines

Figure 7.8. Three-b us organization of the datapath.


Quiz
 What is the control
sequence for execution
of the instruction
Add R1, R2
including the
instruction fetch
phase? (Assume single
bus architecture)
On single bus (R2← [R1] + [R2])
1. PCout, MARin, Read, Select4, Add, Zin
2. Zout, PCin, Yin, WMFC Internal processor
bus

Control signals

MDRout, Irin
PC

3. Address
lines
Instruction
decoder and
MAR control logic

R1out, Yin
Memory

4.
bus

MDR
Data
lines IR

5. R2out, Add, Zin Constant 4


Y
R0

Select MUX

6. Zout, R2in, END ALU


control
lines
Add
Sub
A

ALU
B
R n - 1 

Carry-in
XOR TEMP

Figure 7.1. Single-bus organization of the datapath inside a processor.


Hardwired Control
Overview
 To execute instructions, the processor must have
some means of generating the control signals
needed in the proper sequence.
 Computer designers use a wide variety of
technology to solve this problem
 This approach fall in two categories: hardwired
control and microprogrammed control
 Hardwired system can operate at high speed; but with little
flexibility.
 The control unit uses a fixed logic circuit to interpret
instructions and generate sequence of control signals from
them.
 Each steps in this sequence is completed in one clock cycle.
 A counter may be used to keep the track of the control
steps (refer fig 7.10)
 Each count of this counter corresponds to one control step
The required control signals are determined by the
following information.
1.contents of the control step counter

2.contents of the instruction register

3.contents of the condition code flags

4.External input signals such as MFC and interrupt


request.
Hardwired Control Unit Organization
such as MFC,
interrupt requests
 The decoder/encoder block diagram is a
combinational circuit that generate the required
control outputs, depending on the state of all its
inputs.
 In the figure 7.11 the decoding and encoding
functions are separated.

 The step decoder provides a separate signal line


for each step, in the control sequence.
 The output of the instruction decoder consists of a
separate line for each machine instruction.
 For any instruction loaded in the IR, one of the
output lines INS1 through INSm is set to 1, and all
other lines are set to 0.
 The encoder generate appropriate control signals
yin, pcout , Add, End and so on by combining the
input signals.
Detailed Block Diagram
Example: Generation of Zin by the
encoder in a signal for the single-bus
processor
 Zin = T1 + T6 • ADD + T4 • BR + …
Branch Add

T4 T6

T1

Zin is asserted during time slot T1 for all instructions, during T6 for an Add instruction,
during T4 for an unconditional branch instruction, and son
Control signals for Add (R3), R1

Control signals for Conditional branch

Step Action

1 PCout , MAR in , Read, Select4, Add, Z in


2 Z out , PC in , Yin , WMF C
3 MDR out , IR in
4 Offset-field-of-IRout , Select Y, Add, Z in , If N=0 then End
5 Z out , PCin , End
Generating End control signal
 End = T7 • ADD + T5 • BR + ( T5 • N + T4 • Nˡ ) • BRN +…

The END signals starts a new instruction fetch cycle by


resetting the control step counter to its starting value.
 RUN, set to 1, causes the counter to be
incremented by 1 at the end of every clock cycle,
when it set to 0 the counter stop counting. This is
needed whenever the WMFC signal is issued.
A Complete Processor
Instruction Integer Floating-point
unit unit unit

Instruction Data
cache cache

Bus interface
Processor

System bus

Main Input/
memory Output

Figure 7.14. Block diagram of a complete processor.


Microprogrammed
Control
Overview
 Control signals are generated by a program similar to machine language
programs.
 Control Word (CW); microroutine; microinstruction

MDRout

WMFC
MAR in

Select
PCout

R1out

R3out
Micro -

Read
PCin

R1 in
Z out
Add

End
IRin
Yin
instruction

Zin
1 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
4 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
5 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
6 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1

Figure 7.15 An example of microinstructions for Figure 7.6.


Overview

Step Action

1 PC out , MAR in , Read, Select4, Add, Zin


2 Zout , PC in , Yin , WMF C
3 MDR out , IR in
4 R3out , MAR in , Read
5 R1out , Yin , WMF C
6 MDR out , SelectY, Add, Zin
7 Zout , R1 in , End

Figure 7.6. Con trol sequence for execution of the instruction Add (R3),R1.
 Control Word (CW) is a word whose individual
bits represent various control signals.
 Every instruction will need a sequence of CWs for
its execution.
 At every step, some control signals are asserted
(=1) and all others are 0
 Sequence of CWs for an instruction forms the
microroutine for that instruction.
 Each CW in this microroutine is referred to as a
microinstruction.
 Every instruction will have its own microroutine
which is made up of microinstructions.
 Microroutines for all instructions in the instruction set
of a computer are stored in a special memory called
Control Store.
 Control signals are generated by sequentially reading
the CWs of the corresponding microroutine from the
control store.
Basic organization of a microprogrammed
control unit
 Microprogram counter (mPC) is used to read
CWs from control store sequentially.
 When a new instruction is loaded into IR, starting
address generator generates the starting address of
the microroutine.
 This address is loaded into the mPC. mPC is
automatically incremented by the clock, so
successive microinstructions are read from the
control store.
Execution of conditional Branch
Instruction
Step Action

1 PCout , MAR in , Read, Select4, Add, Z in


2 Zout , PC in , Yin , WMF C
3 MDR out , IR in
4 Offset-field-of-IRout , Add, Select Y, Z in , If N=0 then End
5 Z out , PCin , End

Figure 7.7(b). Control sequence for an conditional branch instruction.

If N=1 then branch is taken; execute an instruction from branch target.


N=0 branch is not taken; execute a sequentially next instruction.
Overview
 The previous organization cannot handle the situation when the control unit is
required to check the status of the condition codes or external inputs to choose
between alternative courses of action.
 Use conditional branch microinstruction.
Address Microinstruction

0 PCout , MAR in , Read, Select4, Add, Z in


1 Zout , PC in , Yin , WMF C
2 MDR out , IR in
3 Branch to starting addressof appropriate microroutine
. ... .. ... ... .. ... .. ... ... .. ... ... .. ... .. ... ... .. ... .. ... ... .. ... ..
25 If N=0, then branch to microinstruction 0
26 Offset-field-of-IR out , SelectY, Add, Z in
27 Zout , PC in , End

Figure 7.17. Microroutine for the instruction Branch<0.


 Control words at location 25, 26, and 27 implent
the microroutine for branch<0
 Control word at location 25 tests the N bit of the
condition code register
 If this bit is 0 branch takes place to location 0 to
fetch new instruction
 Otherwise control word at location 26 is executed,
to fetch the instruction from branch target address.
Overview External
inputs

Starting and
branch address Condition
IR codes
generator

Clock  PC

Control
store CW

Figure 7.18. Organization of the control unit to allow


conditional branching in the microprogram.
control
Thank you
Pipelining
Overview
• Pipelining is widely used in modern
processors.

• Pipelining improves system performance in


terms of throughput.

• Pipelined organization requires sophisticated


compilation techniques.
Basic Concepts
Making the Execution of Programs
Faster
• To increase the speed of execution either
– Use faster circuit technology to build the processor
and the main memory
or
– Arrange the hardware so that more than one operation
can be performed at the same time.

• In the latter way, the number of operations


performed per second is increased even though
the elapsed time needed to perform any one
operation is not changed.
Traditional Pipeline Concept
• Laundry Example

• Ann, Brian, Cathy, Dave A B C D


each have one load of clothes
to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes


Traditional Pipeline Concept

6 PM 7 8 9 10 11 Midnight

Time

30 40 20 30 40 20 30 40 20 30 40 20
• Sequential laundry takes 6 hours
A for 4 loads
• If they learned pipelining, how
long would laundry take?
B

D
Traditional Pipeline Concept

6 PM 7 8 9 10 11 Midnight

Time
T
a 30 40 40 40 40 20
s
k A
• Pipelined laundry takes 3.5
hours for 4 loads
O B
r
d C
e
r D
Traditional Pipeline Concept
6 PM 7 8 9
• Pipelining doesn’t help latency
Time of single task, it helps
throughput of entire workload
T
30 40 40 40 40 20 • Pipeline rate limited by slowest
a
pipeline stage
s
A • Multiple tasks operating
k
simultaneously using different
resources
O B
• Potential speedup = Number
r pipe stages
d C • Unbalanced lengths of pipe
e stages reduces speedup
r • Stall for Dependences
D
Use the Idea of Pipelining in a
Computer
Fetch + Execution
T ime
I1 I2 I3
Time
Clock cycle 1 2 3 4
F E F E F E
1 1 2 2 3 3 Instruction

I1 F1 E1
(a) Sequential execution

I2 F2 E2
Interstage buffer
B1
I3 F3 E3

Instruction Execution
fetch unit (c) Pipelined execution
unit

Figure 8.1. Basic idea of instruction pipelining.


(b) Hardware organization
Use the Idea of Pipelining in a Computer
Time
Clock cycle 1 2 3 4 5 6 7

Instruction

I1 F1 D1 E1 W1
Fetch + Decode
+ Execution + Write I2 F2 D2 E2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4 W4

(a) Instruction execution divided into four steps

Interstage buffers

D : Decode
F : Fetch instruction E: Execute W : Write
instruction and fetch operation results
operands
B1 B2 B3

(b) Hardware organization

Textbook page: 457

Figure 8.2. A 4-stage pipeline.


Role of Cache Memory

• Each pipeline stage is expected to complete in one clock


cycle.
• The clock period should be long enough to let the slowest
pipeline stage to complete.
• Faster stages can only wait for the slowest one to
complete.
• Since main memory is very slow compared to the
execution, if each instruction needs to be fetched from
main memory, pipeline is almost useless.
• Fortunately, we have cache.
Pipeline Performance

• The potential increase in performance


resulting from pipelining is proportional to the
number of pipeline stages.
• However, this increase would be achieved only
if all pipeline stages require the same time to
complete, and there is no interruption
throughout program execution.
• Unfortunately, this is not true.
Pipeline Performance
Time
Clock cycle 1 2 3 4 5 6 7 8 9

Instruction

I1 F1 D1 E1 W1

I2 F2 D2 E2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4 W4

I5 F5 D5 E5

Figure 8.3. Effect of an execution operation taking more than one clock cycle.
Pipeline Performance

• The previous pipeline is said to have been stalled for two clock cycles.

• Any condition that causes a pipeline to stall is called a hazard.

• Data hazard – any condition in which either the source or the


destination operands of an instruction are not available at the time
expected in the pipeline. So some operation has to be delayed, and the
pipeline stalls.

• Instruction (control) hazard – a delay in the availability of an


instruction causes the pipeline to stall.

• Structural hazard – the situation when two instructions require the use
of a given hardware resource at the same time.
Pipeline Performance
Time
Clock cycle 1 2 3 4 5 6 7 8 9
Instruction Instruction
hazard I1 F1 D1 E1 W1

I2 F2 D2 E2 W2

I3 F3 D3 E3 W3

(a) Instruction execution steps in successive clock cycles

Time
Clock cycle 1 2 3 4 5 6 7 8 9

Stage
F: Fetch F1 F2 F2 F2 F2 F3 Idle periods –
D: Decode D1 idle idle idle D2 D3 stalls (bubbles)
E: Execute E1 idle idle idle E2 E3

W: Write W1 idle idle idle W2 W3

(b) Function performed by each processor stage in successive clock cycles

Figure 8.4. Pipeline stall caused by a cache miss in F2.


Pipeline Performance
Structural hazard
Load X(R1), R2
Time
Clock cycle 1 2 3 4 5 6 7

Instruction

I1 F1 D1 E1 W1

I2 (Load) F2 D2 E2 M2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4

I5 F5 D5

Figure 8.5. Effect of a Load instruction on pipeline timing.


Pipeline Performance

• Again, pipelining does not result in individual


instructions being executed faster; rather, it is the
throughput that increases.

• Throughput is measured by the rate at which instruction


execution is completed.

• Pipeline stall causes degradation in pipeline performance.

• We need to identify all hazards that may cause the


pipeline to stall and to find ways to minimize their
impact.
Quiz
• Four instructions, the I2 takes two clock cycles
for execution. Draw the figure for 4-stage
pipeline, and figure out the total cycles needed
for the four instructions to complete.
Solution
1 2 3 4 5 6 7 8
F D E W
F D E W
F D stall E W
F D stall E W
Data Hazards
• We must ensure that the results obtained when instructions are executed
in a pipelined processor are identical to those obtained when the same
instructions are executed sequentially.
• Hazard occurs
A←3+A
B←4×A
• No hazard
A←5×C
B ← 20 + C
• When two operations depend on each other, they must be executed
sequentially in the correct order.
• Another example:
Mul R2, R3, R4
Add R5, R4, R6
Data Hazards
Time
Clock cycle 1 2 3 4 5 6 7 8 9

Instruction

I1 (Mul) F1 D1 E1 W1

I2 (Add) F2 D2 D2A E2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4 W4

Figure 8.6. Pipeline stalled by data dependency between D2 and W 1.


Figure 8.6. Pipeline stalled by data dependency between D 2 and W1.
Operand Forwarding
• Instead of from the register file, the second
instruction can get data directly from the
output of ALU after the previous instruction is
completed.
• A special arrangement needs to be made to
“forward” the output of ALU to the input of
ALU.
Source 1
Source 2

SRC1 SRC2

Register
file

ALU

RSLT

Destination

(a) Datapath

SRC1,SRC2 RSLT

E: Execute W: Write
(ALU) (Register file)

Forwarding path

(b) P osition of the source and result registers in the processor pipeline

Figure 8.7. Operand forw arding in a pipelined processor.


Handling Data Hazards in Software
• Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
• The compiler can reorder the instructions to
perform some useful work during the NOP
slots.
Moore’s Law

• Gordon Moore, cofounder of Intel noticed a trend in IC


manufacture
• Every 2 years the number of components on an area of
silicon doubled
• He published this work in 1965 – known as Moore’s Law
• His predictions were for 10 years into the future
• His work predicted personal computers and fast
telecommunication networks

84
Graph of Moore’s Law

11/01/2008 EADS 85
Single-core computer
Limits
• Both physical and practical reasons pose significant
constraints to simply building ever faster serial computers:
– Transmission speeds - the speed of a serial computer is directly
dependent upon how fast data can move through hardware.
Absolute limits are the speed of light (30 cm/nanosecond) and
the transmission limit of copper wire (9 cm/nanosecond).
Increasing speeds necessitate increasing proximity of processing
elements.
– Limits to miniaturization - processor technology is allowing an
increasing number of transistors to be placed on a chip.
However, even with molecular or atomic-level components, a
limit will be reached on how small components can be.
– Economic limitations - it is increasingly expensive to make a
single processor faster. Using a larger number of moderately fast
commodity processors to achieve the same (or better)
performance is less expensive.
Limits
• Relation btwn frequency of operation and power
consumption is given by :
Power= ½ x Capacitive load x Voltage2 x Frequency

• Power consumption is a huge problem when


frequency is increased
• More is the power consumed, more will be the heat
generated
• Power consumption poses significant constraint for
simply building ever faster serial computers
Why multi-core ?
• Difficult to make single-core clock frequencies even
higher

• Deeply pipelined circuits:


– heat problems
– speed of light problems
– difficult design and verification
– large design teams necessary
– server farms need expensive
air-conditioning
• Many new applications are multithreaded

• General trend in computer architecture (shift towards


more parallelism)
Multi-core architectures
• Replicate multiple processor cores of moderate
speed on a single die.

Core 1 Core 2 Core 3 Core 4

Multi-core CPU chip


Unicore vs. Multicore

(a) Uni-core processor; (b) multicore processor.

Conventional processor Multicore processors


•Single core •At least two cores
•Dedicated caches •Shared caches
•One thread at a time •Many threads simultaneously
Multi-core CPU chip
• The cores fit on a single processor socket
What is a thread?
• A process is a program in execution.
• A process is also known as heavy weight process
• A process is made to contain many lightweight processes or threads.
• Each thread performs a subtask of the problem to be solved.
• Each thread contains private local data and all threads share a common
address space.
• Hence, the overhead of thread creation is lesser compared to the creation of
heavyweight processes.
• Also, since threads share resources of the process to which they belong, the
resources are utilized optimally.
• Above all, threads communicate with each other through shared memory
locations
• Hence the communication is faster and the associated overhead is lesser
compared to that of processes.
Thread level parallelism – multiple threads run on
separate cores in parallel
Within each core, threads can be time-sliced

You might also like