CSE305N – Computer
Architecture
Chapter 5 – The
Processor: Datapath
and Control
Goals in processor
implementation
Balance the rate of supply of instructions
and data and the rate at which the
execution core can consume them and can
update memory
instruction supply execution core data supply
Goals in processor
implementation
Recall from Chapter 2
CPU Time = INST x CPI x CT
INST largely a function of the ISA and
compiler
Objective: minimize CPI x CT within design
constraints (cost, power, etc.)
Trading off CPI and CT is tricky
multiplier
multiplier
multiplier
logic
logic
logic
Brief review of sequential logic
design
State elements are clocked devices
Flip flops, etc
Combinatorial elements hold no state
ALU, caches, multiplier, multiplexers, etc.
In edge triggered clocking, state elements
are only updated on the (rising) edge of
the clock pulse
Brief review of sequential logic
design
The same state element can be read at the
beginning of a clock cycle and updated at
the end
Example:
12 incrementing
clock the PC
8
Add input 8
PC
Add Add output 12
4
PC register 8 12
clock
Our processor design progression
(1) Instruction fetch, execute, and operand
reads from data memory all take place in a
single clock cycle
(2) Instruction fetch, execute, and operand
reads from data memory take place in
successive clock cycles
(3) A pipelined design (Chapter 6)
Pieces of the processor puzzle
Instruction fetch
Execution
Data memory
instruction supply execution core data supply
Instruction fetch datapath
Memory to hold instructions
Register to hold the instruction memory
address
Logic to generate the next instruction
address
PC +4
Execution datapath
Focus on only a subset of all MIPS
instructions
add, sub, and, or
lw, sw
slt
beq, j
For all instructions except j, we
Read operands from the register file
Perform an ALU operation
For all instructions except sw, beq, and j,
we write a result into the register file
Execution datapath
Register file block diagram
Read register 1,2: source operand register numbers
Read data 1,2: source operands (32 bits each)
Write register: destination operand register number
Write data: data written into register file
RegWrite: when asserted, enables the writing of Write
Data
Execution datapath
Datapath for R-type (add, sub, and, or, slt)
R-type instruction format:
31 26 25 21 20 16 15 11 10 6 5 0
op rs rt rd shamt funct
Execution datapath
Datapath for beq instruction
I-type instruction format:
31 26 25 21 20 16 15 0
Zero ALU output
op rs indicates
rt if rs=rt (branch is taken/not taken)
immediate
Branch target address is the sign extended immediate left
shifted two positions, and added to PC+4
Data memory
Used for lw, sw (I-type format)
Block diagram
Address: memory location to be read or written
Read data: data out of the memory on a load
Write data: data into the memory on a store
MemRead: indicates a read operation is to be performed
MemWrite: indicates a write operation is to be performed
Execution datapath + data
memory
Datapath for lw, sw
Address is the sign-extended immediate added to the
source operand read out of the register file
sw: data written to memory from specified register
lw: data written to register file from specified memory
address
Putting the pieces together
Single clock cycle for fetch, execute, and
operand read from data memory
3 MUXes
Registerfile operand or sign extended immediate to ALU
ALU or data memory output written to register file
PC+4 or branch target address written to PC register
Datapath for R-type instructions
Example: add $4, $18, $30
Datapath for I-type ALU
instructions
Example: slti $7, $4, 100
Datapath for not taken beq
instruction
Example: beq $28, $13, EXIT
Datapath for taken beq
instruction
Example: beq $28, $13, EXIT
Datapath for load instruction
Example: lw $8, 112($2)
Datapath for store instruction
Example: sw $10, 0($3)
Control signals we need to
generate
ALU operation control
ALU control input codes from Chapter 4
ALU control input ALU operation Used for
000 and and
001 or or
010 add add, lw, sw
110 subtract sub, beq
111 set on less than slt
Two steps to generate the ALU control input
Use the opcode to distinguish R-type, lw and sw, and beq
If R-type, use funct field to determine the ALU control
input
ALU operation control
Opcode used to generate a 2-bit signal
called ALUOp with the following encodings
00: lw or sw, perform an ALU add
01: beq, perform an ALU subtract
10: R-type, ALU operation is determined by the funct
field
Funct Instruction ALU control input
100000 add 010
100010 sub 110
100100 and 000
100101 or 001
101010 slt 111
Given Datapath: RTL →
Control
Instruction<31:0>
Inst
<21:25>
<21:25>
<16:20>
<11:15>
<0:15>
Memory
Adr
Op Fun Rt Rs Rd Imm16
Control
nPC_sel RegWr RegDst ExtOp ALUSrc ALUctr MemWr MemtoReg Zero
DATA PATH
A Summary of Control
inst Register Transfer
Signals
ADD R[rd] ← R[rs] + R[rt]; PC ← PC + 4
ALUsrc = RegB, ALUctr = “add”, RegDst = rd, RegWr, nPC_sel = “+4”
SUB R[rd] ← R[rs] – R[rt]; PC ← PC + 4
ALUsrc = RegB, ALUctr = “sub”, RegDst = rd, RegWr, nPC_sel = “+4”
ORi R[rt] ← R[rs] + zero_ext(Imm16); PC ← PC + 4
ALUsrc = Im, Extop = “Z”, ALUctr = “or”, RegDst = rt, RegWr, nPC_sel = “+4”
LOAD R[rt] ← MEM[ R[rs] + sign_ext(Imm16)]; PC ← PC + 4
ALUsrc = Im, Extop = “Sn”, ALUctr = “add”,
MemtoReg, RegDst = rt, RegWr, nPC_sel = “+4”
STORE MEM[ R[rs] + sign_ext(Imm16) ] ← R[rs]; PC ← PC + 4
ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemWr, nPC_sel = “+4”
BEQ if ( R[rs] == R[rt] ) then PC ← PC + sign_ext(Imm16)] || 00 else PC ← PC + 4
nPC_sel = “Br”, ALUctr = “sub”
A Summary of Control
See
Signals We Don’t Care :-)
func 10 0000 10 0010
Appendix A op 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
add sub ori lw sw beq jump
RegDst 1 1 0 0 x x x
ALUSrc 0 0 1 1 1 0 x
MemtoReg 0 0 0 1 x x x
RegWrite 1 1 1 1 0 0 0
MemWrite 0 0 0 0 1 0 0
nPCsel 0 0 0 0 0 1 0
Jump 0 0 0 0 0 0 1
ExtOp x x 0 1 1 x x
ALUctr<2:0> Add Subtract Or Add Add Subtract xxx
31 26 21 16 11 6 0
R-type op rs rt rd shamt funct add, sub
I-type op rs rt immediate ori, lw, sw, beq
J-type op target address jump
The Concept of Local
op Decoding
00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
R-type ori lw sw beq jump
RegDst 1 0 0 x x x
ALUSrc 0 1 1 1 0 x
MemtoReg 0 0 1 x x x
RegWrite 1 1 1 0 0 0
MemWrite 0 0 0 1 0 0
Branch 0 0 0 0 1 0
Jump 0 0 0 0 0 1
ExtOp x 0 1 1 x x
ALUop<N:0> “R-type” Or Add Add Subtract xxx
func
ALU ALUctr
op Main 6
ALUop Control 3
6 Control (Local)
N
ALU
The Encoding of
ALUop
func
ALU
op Main 6 ALUctr
ALUop Control
6 Control (Local) 3
N
In
this exercise, ALUop has to be 2 bits wide to
represent:
“R-type” instructions (1)
“I-type” instructions that require the ALU to perform:
(2) Or, (3) Add, and (4) Subtract
Toimplement the full MIPS ISA, ALUop has to be
3 bits to represent:
“R-type” instructions (1)
“I-type” instructions that require the ALU to perform:
(2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi)
R-type ori lw sw beq jump
ALUop (Symbolic) “R-type” Or Add Add Subtract xxx
ALUop<2:0> 1 00 0 10 0 00 0 00 0 01 xxx
The Decoding of the
op Main
“func”
func
6
Field
ALU ALUctr
ALUop Control
6 Control 3
(Local)
N
R-type ori lw sw beq jump
ALUop (Symbolic) “R-type” Or Add Add Subtract xxx
ALUop<2:0> 1 00 0 10 0 00 0 00 0 01 xxx
31 26 21 16 11 6 0
R-type op rs rt rd shamt funct
P. 286 text:
funct<5:0> Instruction Operation ALUctr ALUctr<2:0> ALU Operation
10 0000 add 000 And
10 0010 subtract 001 Or
10 0100 and
ALU
010 Add
10 0101 or 110 Subtract
10 1010 set-on-less-than 111 Set-on-less-than
The Truth Table for
ALUctr
funct<3:0> Instruction Op.
0000 add
ALUop R-type ori lw sw beq 0010 subtract
(Symbolic) “R-type” Or Add Add Subtract 0100 and
ALUop<2:0> 1 00 0 10 0 00 0 00 0 01 0101 or
1010 set-on-less-than
ALUop func ALU ALUctr
bit<2> bit<1> bit<0> bit<3> bit<2> bit<1> bit<0> Operation bit<2> bit<1> bit<0>
0 0 0 x x x x Add 0 1 0
0 x 1 x x x x Subtract 1 1 0
0 1 x x x x x Or 0 0 1
1 x x 0 0 0 0 Add 0 1 0
1 x x 0 0 1 0 Subtract 1 1 0
1 x x 0 1 0 0 And 0 0 0
1 x x 0 1 0 1 Or 0 0 1
1 x x 1 0 1 0 Set on < 1 1 1
The Logic Equation for
ALUop ALUctr<2>
func
bit<2> bit<1> bit<0> bit<3> bit<2> bit<1> bit<0> ALUctr<2>
0 x 1 x x x x 1
1 x x 0 0 1 0 1
1 x x 1 0 1 0 1
This makes func<3> a don’t care
ALUctr<2> = !ALUop<2> & ALUop<0> +
ALUop<2> & !func<2> & func<1> & !func<0>
The Logic Equation for
ALUop ALUctr<1>
func
bit<2> bit<1> bit<0> bit<3> bit<2> bit<1> bit<0> ALUctr<1>
0 0 0 x x x x 1
0 x 1 x x x x 1
1 x x 0 0 0 0 1
1 x x 0 0 1 0 1
1 x x 1 0 1 0 1
ALUctr<1> = !ALUop<2> & !ALUop<0> +
ALUop<2> & !func<2> & !func<0>
The Logic Equation for
ALUop ALUctr<0>
func
bit<2> bit<1> bit<0> bit<3> bit<2> bit<1> bit<0> ALUctr<0>
0 1 x x x x x 1
1 x x 0 1 0 1 1
1 x x 1 0 1 0 1
ALUctr<0> = !ALUop<2> & ALUop<0>
+ ALUop<2> & !func<3> & func<2> & !func<1>
& func<0>
+ ALUop<2> & func<3> & !func<2> & func<1>
& !func<0>
The ALU Control
func Block
6 ALU ALUctr
Control
ALUop (Local) 3
3
ALUctr<2> = !ALUop<2> & ALUop<0> +
ALUop<2> & !func<2> & func<1> & !func<0>
ALUctr<1> = !ALUop<2> & !ALUop<0> +
ALUop<2> & !func<2> & !func<0>
ALUctr<0> = !ALUop<2> & ALUop<0>
+ ALUop<2> & !func<3> & func<2>
& !func<1> & func<0>
+ ALUop<2> & func<3> & !func<2>
& func<1> & !func<0>
Step 5: Logic for Each
nPC_sel Control Signal
<= if (OP == BEQ) then “Br” else
“+4”
ALUsrc <= if (OP == “Rtype”) then “regB”
else “immed”
ALUctr <= if (OP == “Rtype”) then funct
elseif (OP == ORi) then “OR”
elseif (OP == BEQ) then “sub”
else “add”
ExtOp <= _____________
MemWr <= _____________
MemtoReg <= _____________
RegWr: <=_____________
RegDst: <= _____________
Step 5: Logic for each
nPC_sel
control signal
<= if (OP == BEQ) then “Br” else “+4”
ALUsrc <= if (OP == “Rtype”) then “regB” else
“immed”
ALUctr<= if (OP == “Rtype”) then funct elseif (OP ==
ORi) then “OR” elseif (OP == BEQ) then “sub”
else “add”
ExtOp <= if (OP == ORi) then “zero” else “sign”
MemWr <= (OP == Store)
MemtoReg <= (OP == Load)
RegWr:<= if ((OP == Store) || (OP == BEQ)) then 0
else 1
RegDst: <= if ((OP == Load) || (OP == ORi)) then 0
else 1
The “Truth Table” for the
Main Control
RegDst
ALUSrc
func
ALUctr
op 6 ALU
Main
6 Control
: Control 3
ALUop (Local)
3
op 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
R-type ori lw sw beq jump
RegDst 1 0 0 x x x
ALUSrc 0 1 1 1 0 x
MemtoReg 0 0 1 x x x
RegWrite 1 1 1 0 0 0
MemWrite 0 0 0 1 0 0
nPC_sel 0 0 0 0 1 0
Jump 0 0 0 0 0 1
ExtOp x 0 1 1 x x
ALUop (Symbolic) “R-type” Or Add Add Subtract xxx
ALUop <2> 1 0 0 0 0 x
ALUop <1> 0 1 0 0 0 x
ALUop <0> 0 0 0 0 1 x
Comparing instruction fields
31 26 25 21 20 16 15 11 10 6 5 0
R-type 0 rs rt rd shamt funct
31 26 25 21 20 16 15 0
beq 4 rs rt immediate (offset)
31 26 25 21 20 16 15 0
lw (sw) 35 (43) rs rt immediate (offset)
Opcode, source registers, function code,
and immediate fields always in same place
Destination register is
bits 15-11 (rd) for R-type
bits 20-16 (rt) for lw
MUX to select the right one
Datapath with instr fields and ALU
control
Main control unit design
Main control unit design
Truth table
(0)
(34)
(43)
(4)
Adding support for jump
instructions
J-type format
31 26 25 0
2 target
Next PC formed by shifting left the 26-bit
target two bits and combining it with the 4
high-order bits of PC+4
Now the next PC will be one of
PC+4
beq target address
j target address
We need another MUX and control bit
Adding support for jump
instructions
Evaluation of the simple
implementation
All instructions take one clock cycle (CPI = 1)
Assume the following worst case delays
Instruction memory: 4 time units
Data memory: 4 time units (read), 2 time units (write)
ALU: 4 time units
Adders: 3 time units
Register file: 2 time units (read), 1 time unit (write)
MUXes, sign extension, gates, and shifters: 1 time unit
Large disparity in worst case delays among
instruction types
R-type: 4+2+1+4+1+1 = 13 time units
beq: 4+2+1+4+1+1+1 = 14 time units
j: 4+1+1 = 6 time units
store: 4+2+4+2 = 12 time units
load: 4+2+4+4+1+1 = 16 time units
Evaluation of the simple
implementation
Disparity would be worse in a real machine
Even slower integer instructions (e.g., multiply/divide
in MIPS)
Floating point instructions
Simple instructions take as long as
complex ones
A multicycle implementation
Instruction fetch, register file access, etc
occur in separate clock cycles
Different instruction types take different
numbers of cycles to complete
Clock cycle time should be faster
High level view of datapath
New registers store results of each step
Not programmer visible!
Hardware can be shared
One ALU for PC+4, branch target calculation, EA calculation,
and arithmetic operations
One memory for instructions and data
Detailed multi-cycle datapath
Multi-cycle control
First two cycles for all instructions
Instruction fetch (1st cycle)
Load the instruction into the IR register
IR = Memory[PC]
Increment the PC
PC = PC+4
Instruction decode and register fetch (2nd
cycle)
Readregister file locations rs and rt, results into the A
and B registers
A=Reg[IR[25-21]]
B=Reg[IR[20-16]]
Calculate the branch target address and load into
ALUOut
ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Instruction fetch
IR=Mem[PC]
Instruction fetch
PC=PC+4
Instruction decode and register fetch
A=Reg[IR[25-21]], B=Reg[IR[20-16]]
Instruction decode and register fetch
ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Additional cycles for R-type
Execution
ALUOut = A op B
Completion
Reg[IR[15-11]] = ALUOut
R-type execution cycle
ALUOut = A op B
R-type completion cycle
Reg[IR[15-11]] = ALUOut
Additional cycles for store
Address computation
ALUOut = A + sign-extend (IR[15-0])
Memory access
Memory[ALUOut] =B
Store address computation cycle
ALUOut = A + sign-extend (IR[15-0])
Store memory access cycle
Memory[ALUOut] = B
Additional cycles for load
Address computation
ALUOut = A + sign-extend (IR[15-0])
Memory access
MDR = Memory[ALUOut]
Read completion
Reg[IR[20-16]] = MDR
Load memory access cycle
MDR = Memory[ALUOut]
Load read completion cycle
Reg[IR[20-16]] = MDR
Additional cycle for beq
Branch completion
if (A == B) PC = ALUOut
Branch completion cycle for beq
if (A == B) PC = ALUOut
Additional cycle for j
Jump completion
PC = PC[31-28] || (IR[25-0]<<2)
Jump completion cycle for j
PC = PC[31-28] || (IR[25-0]<<2)
Control logic design
Implemented as a Finite State Machine
Inputs: 6 opcode bits
Outputs: 16 control signals
State: 4 bits for 10 states
High-level view of FSM
Instruction fetch cycle
Instruction decode/register fetch
cycle
R-type execution cycle
R-type completion cycle
Memory address computation
cycle
Store memory access cycle
Load memory access cycle
Load read completion cycle
beq branch completion cycle
j jump completion cycle
Complete FSM
Evaluation of the multi-cycle
design
CPI calculated based on the instruction mix
For gcc (Figure 4.54)
23% loads (5 cycles each)
13% stores (4 cycles each)
19% branches (3 cycles each)
2% jumps (3 cycles each)
43% ALU (4 cycles each)
CPI = 0.23*5+0.13*4+0.19*3+0.02*3+0.43*4=4.02
Cycle time is calculated from the longest
delay path assuming the same timing
delays as before
Worst case datapath: branch target
ALUOut = PC+(sign-extend (IR[15-0]) <<2)
Delay = 7 time units (delay of simple = 16)
Evaluation of the multi-cycle
design
Time per instruction of simple and multi-
cycle
TPI(simple) = CPI(simple) x cycle time(simple) = 16
TPI(multi-cycle) = 4.02 x 7 = 28.1
Simple single-cycle implementation is
faster
Multicycle with pipelining will be
considerably faster than single-cycle
implementation
Exceptions
An exception is an event that causes a deviation from
the normal execution of instructions
Types of exceptions
Operating system call (e.g., read a file, print a file)
Input/output device request
Page fault (request for instruction/data not in memory – Ch 7)
Arithmetic error (overflow, underflow, etc.)
Undefined instruction
Misaligned memory access (e.g., word access to odd address)
Memory protection violation
Hardware error
Power failure
An exception is not usually due to an error!
We need to be able to restart the program at
the point where the exception was detected
Handling exceptions
Detect the exception
Save enough information about the exception
to handle it properly
Save enough information about the program
to resume it after the exception is handled
Handle the exception
Either terminate the program or resume
executing it depending on the exception type
Detecting exceptions
Performed by hardware
Overflow: determined from the opcode and
the overflow output of the ALU
Undefined instruction: determined from
The opcode in the main control unit
The function code and ALUop in the ALU control logic
Detecting exceptions
overflow
undefined
instruction
Saving exception information
Performed by hardware
We need the type of exception and the PC of
the instruction when the exception occurred
In MIPS, the Cause register holds the
exception type
Need an encoding for each exception type
Need a signal from the control unit to load it into the
Cause register
and the Exception Program Counter (EPC)
register holds the PC
Need to subtract 4 from the PC register to get the correct
PC (since we loaded PC+4 into the PC register during the
Instruction Fetch cycle)
Need a signal from the control unit to load it into EPC
Saving exception information
Saving program information
Needed in order to restart the program from the
point where the exception occurred
Performed by hardware and software
EPC register holds the PC of the instruction that
had the exception (where we will restart the
program)
The software routine that handles the exception
saves any registers that it will need to the stack
and restores them when it is done
Handling the exception
Performed by hardware and software
Need to transfer control to a software routine to
handle the exception (exception handler)
The exception handler runs in a privileged mode
that allows it to use special instructions and
access all of memory
Our programs run in user mode
The hardware enables the privileged mode,
loads PC with the address of the exception
handler, and transitions to the Fetch state
Handling the exception
Loading the PC with exception handler
address
Exception handler
Stores the values of the registers that it
will need to the stack
Handles the particular exception
Operating system call: calls the subroutine associated with the
call
Underflow: sets register to zero or uses denormalized numbers
I/O: handles the particular I/O request, e.g., keyboard input
Restores registers from the stack (if
program is to be restarted)
Terminates the program, or resumes
execution by loading the PC with EPC and
transitioning to the Instruction Fetch state
FSM modifications
The Intel Pentium processor
Introduced in 1993
Uses a multi-cycle datapath with the
following steps for integer instructions
Prefetch (PF): read instruction from the instruction
memory
Decode 1 (D1): first stage of instruction decode
Decode 2 (D2): second stage of instruction decode
Execute (E): perform the ALU operation
Write back (WB): write the result to the register file
Datapath usage varies by instruction type
Simple instructions make one pass through the
datapath using state machine control
Complex instructions make multiple passes, reusing
the same hardware elements under microcode control
The Intel Pentium processor
The Pentium is a 2-way superscalar design as
two instructions can simultaneously execute
D2 E WB U pipe
PF D1
D2 E WB V pipe
Ideal CPI for a 2-way superscalar is 0.5
Conditions for superscalar execution
Both must be simple instructions
The result of the first instruction cannot be needed by the
second
Both instructions cannot write the same register
The first instruction in program sequence cannot be a jump
The Intel Pentium Pro processor
Introduced in 1995 as the successor to the
Pentium
The basis for the Pentium II and Pentium III
Implements a 14-cycle, 3-way superscalar
integer datapath
Very high frequency is the goal
Uses out-of-order execution in that instructions
may execute out of their original program
order
Completely handled by hardware transparently to software
Instructions execute as soon as their source operands
become available
Complicates exception handling
Some instructions before the excepting one may not have executed,
while some after it may have executed
The Intel Pentium Pro processor
Pentium Pro designers (and AMD designers
before them) used innovative engineering
to overcome the disadvantages of CISC
ISAs
Many complex X86 instructions are internally
translated by hardware into RISC-like micro-ops with
state machine control
Achieves a very low CPI for simple integer operations
even
on programs compiled for older implementations
Combination of high frequency and low CPI
gave the Pentium Pro extremely
competitive integer performance versus
RISC microprocessors
Result
has been that RISC CPUs have failed to gain the
desktop market share that had been expected
The Intel Pentium 4 processor
20 cycle superscalar integer pipeline
Extremely high frequency (>3GHz)
Major effort to lower power dissipation
Clock gating: clock to a unit is turned off when the unit
is not in use
Trace cache: caches micro-ops of previously decoded
complex instructions to avoid power-consuming
decode operation
Questions?