KEMBAR78
Riscv Cpu | PDF | Central Processing Unit | Microprocessor
0% found this document useful (0 votes)
449 views75 pages

Riscv Cpu

Uploaded by

m1947asha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
449 views75 pages

Riscv Cpu

Uploaded by

m1947asha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 75

Microarchitecture

7
7.1 Introduction
7.2 Performance Analysis
7.3 Single-Cycle Processor
7.4 Multicycle Processor
7.5 Pipelined Processor
7.1 INTRODUCTION 7.6 HDL Representation*

In this chapter, you will learn how to piece together a microprocessor. 7.7 Advanced
Microarchitecture*
Indeed, you will puzzle out three different versions, each with different
trade-offs between performance, cost, and complexity. 7.8 Real-World Perspective:
Evolution of RISC-V
To the uninitiated, building a microprocessor may seem like black
Microarchitecture*
magic. But it is actually relatively straightforward and, by this point, you
7.9 Summary
have learned everything you need to know. Specifically, you have learned
to design combinational and sequential logic given functional and timing Exercises
specifications. You are familiar with circuits for arithmetic and memory. Interview Questions
And you have learned about the RISC-V architecture, which specifies
the programmer’s view of the RISC-V processor in terms of registers, Application >”hello
instructions, and memory. Software world!”
This chapter covers microarchitecture, which is the connection
Operating
between logic and architecture. Microarchitecture is the specific arrange- Systems
ment of registers, arithmetic logic units (ALUs), finite state machines
(FSMs), memories, and other logic building blocks needed to implement Architecture
an architecture. A particular architecture, such as RISC-V, may have
many different microarchitectures, each with different trade-offs of per- Micro-
formance, cost, and complexity. They all run the same programs, but architecture
their internal designs vary widely. We design three different microarchi- +
Logic
tectures in this chapter to illustrate the trade-offs.
Digital
Circuits
7 . 1 . 1 Architectural State and Instruction Set
Analog +
Recall that a computer architecture is defined by its instruction set and Circuits −
architectural state. The architectural state for the RISC-V processor con-
sists of the program counter and the 32 32-bit registers. Any RISC-V Devices
microarchitecture must contain all of this state. Based on the current
architectural state, the processor executes a particular instruction with a Physics
particular set of data to produce a new architectural state. Some

Digital Design and Computer Architecture, RISC-V Edition. DOI: 10.1016/B978-0-12-820064-3.00007-6 393
Copyright © 2022 Elsevier Inc. All rights reserved.
394 CHAPTER SEVEN Microarchitecture

The architectural state is


microarchitectures contain additional nonarchitectural state to either
the information necessary simplify the logic or improve performance; we point this out as it arises.
to define what a computer To keep the microarchitectures easy to understand, we focus on a
is doing. If one were to save subset of the RISC-V instruction set. Specifically, we handle the follow-
a copy of the architectural ing instructions:
state and contents of memory,
then turn off a computer, ▸ R-type instructions: add, sub, and, or, slt
then turn it back on and
restore the architectural state ▸ Memory instructions: lw, sw
and memory, the computer
would resume the program
▸ Branches: beq
it was running, unaware that
it had been powered off and
These particular instructions were chosen because they are sufficient
back on. Think of a science to write useful programs. Once you understand how to implement these
fiction novel in which the instructions, you can expand the hardware to handle others.
protagonist’s brain is frozen,
then thawed years later to
wake up in a new world. 7 . 1 . 2 Design Process
We divide our microarchitectures into two interacting parts: the data-
path and the control unit. The datapath operates on words of data. It
contains structures such as memories, registers, ALUs, and multiplexers.
We are implementing the 32-bit RISC-V (RV32I) architecture, so we use
a 32-bit datapath. The control unit receives the current instruction from
the datapath and tells the datapath how to execute that instruction.
Specifically, the control unit produces multiplexer select, register enable,
and memory write signals to control the operation of the datapath.
A good way to design a complex system is to start with hardware
containing the state elements. These elements include the memories and
the architectural state (the program counter and registers). Then, add
blocks of combinational logic between the state elements to compute the
new state based on the current state. The instruction is read from part
of memory; load and store instructions then read or write data from
another part of memory. Hence, it is often convenient to partition the
overall memory into two smaller memories, one containing instructions
and the other containing data. Figure 7.1 shows a block diagram with
the four state elements: the program counter, register file, and instruc-
At the very least, the program tion and data memories.
counter must have a reset In this chapter, heavy lines indicate 32-bit data busses. Medium lines
signal to initialize its value
when the processor turns indicate narrower busses, such as the 5-bit address busses on the register
on. Upon reset, RISC-V file. Narrow lines indicate 1-bit wires, and blue lines are used for con-
processors normally trol signals, such as the register file write enable. Registers usually have
initialize the PC to a low a reset input to put them into a known state at start-up, but reset is not
address in memory, such as shown to reduce clutter.
0x00001000, and we start
our programs there. The program counter (PC) points to the current instruction. Its
input, PCNext, indicates the address of the next instruction.
7.1 Introduction 395

CLK CLK
CLK
WE3 WE
PCNext PC A1 RD1
32 32 32
A RD 32
5 32

A RD
Instruction 32 32
A2 RD2 Data
Memory 5 32

5
A3 Memory
Register
WD3 WD
32 File 32

Figure 7.1 State elements of a RISC-V processor

The instruction memory has a single read port.1 It takes a 32-bit


instruction address input, A, and reads the 32-bit data (i.e., instruction)
from that address onto the read data output, RD.
The 32-element × 32-bit register file holds registers x0–x31. Recall
that x0 is hardwired to 0. The register file has two read ports and one
write port. The read ports take 5-bit address inputs, A1 and A2, each
specifying one of the 25 = 32 registers as source operands. The register
file places the 32-bit register values onto read data outputs RD1 and
RD2. The write port, port 3, takes a 5-bit address input, A3; a 32-bit
write data input, WD3; a write enable input, WE3; and a clock. If its
write enable (WE3) is asserted, then the register file writes the data
(WD3) into the specified register (A3) on the rising edge of the clock.
The data memory has a single read/write port. If its write enable,
WE, is asserted, then it writes data WD into address A on the rising
edge of the clock. If its write enable is 0, then it reads from address A
onto the read data bus, RD.
The instruction memory, register file, and data memory are all read
combinationally. In other words, if the address changes, then the new
data appears at RD after some propagation delay; no clock is involved.
The clock controls writing only. These memories are written only on
the rising edge of the clock. In this fashion, the state of the system is
changed only at the clock edge. The address, data, and write enable must
set up before the clock edge and must remain stable until a hold time
after the clock edge.
Because the state elements change their state only on the rising edge
of the clock, they are synchronous sequential circuits. A microprocessor

1
This is an oversimplification used to treat the instruction memory as a ROM. In most real
processors, the instruction memory must be writable so that the operating system (OS) can
load a new program into memory. The multicycle microarchitecture described in Section 7.4
is more realistic in that it uses a single memory that contains both instructions and data and
that can be both read and written.
396 CHAPTER SEVEN Microarchitecture

is built of clocked state elements and combinational logic, so it too is a


synchronous sequential circuit. Indeed, a processor can be viewed as a
giant finite state machine or as a collection of simpler interacting state
machines.

7 . 1 . 3 Microarchitectures
In this chapter, we develop three microarchitectures for the RISC-V
architecture: single-cycle, multicycle, and pipelined. They differ in how
the state elements are connected and in the amount of nonarchitectural
state needed.
The single-cycle microarchitecture executes an entire instruction in
one cycle. It is easy to explain and has a simple control unit. Because it
completes the operation in one cycle, it does not require any nonarchitectural
state. However, the cycle time is limited by the slowest instruction.
Moreover, the processor requires separate instruction and data memories,
Examples of classic multicycle
which is generally unrealistic.
processors include the 1947 The multicycle microarchitecture executes instructions in a series of
MIT Whirlwind, the IBM shorter cycles. Simpler instructions execute in fewer cycles than compli-
System/360, the Digital cated ones. Moreover, the multicycle microarchitecture reduces the
Equipment Corporation VAX, hardware cost by reusing expensive hardware blocks, such as adders
the 6502 used in the Apple II,
and the 8088 used in the IBM
and memories. For example, the adder may be used on different cycles
PC. Multicycle microarchitectures for several purposes while carrying out a single instruction. The multicycle
are still used in inexpensive microprocessor accomplishes this by introducing several nonarchitectural
microcontrollers such as the registers to hold intermediate results. The multicycle processor executes
8051, the 68HC11, and the only one instruction at a time, but each instruction takes multiple clock
PIC16-series found in appliances,
toys, and gadgets.
cycles. This processor requires only a single memory, accessing it on one
cycle to fetch the instruction and on another to read or write data.
Because they use less hardware than single-cycle processors, multicycle
Intel processors have been processors were the historical choice for inexpensive systems.
pipelined since the 80486 was
The pipelined microarchitecture applies pipelining to the single-cycle
introduced in 1989. Nearly all
RISC microprocessors are also microarchitecture. It therefore can execute several instructions simul-
pipelined, and all commercial taneously, improving the throughput significantly. Pipelining must add
RISC-V processors have been logic to handle dependencies between simultaneously executing instruc-
pipelined. Because of the tions. It also requires nonarchitectural pipeline registers. Pipelined
decreasing cost of transistors,
processors must access instructions and data in the same cycle; they generally
pipelined processors now cost
fractions of a penny, and the use separate instruction and data caches for this purpose, as discussed in
entire system, with memory and Chapter 8. The added logic and registers are worthwhile; all commercial
peripherals, costs 10’s of cents. high-performance processors use pipelining today.
Thus, pipelined processors We explore the details and trade-offs of these three microarchitec-
are replacing their slower
tures in the subsequent sections. At the end of the chapter, we briefly
multicycle siblings in even the
most cost-sensitive applications. mention additional techniques that are used to achieve even more speed
in modern high-performance microprocessors.
7.2 Performance Analysis 397

7.2 PERFORMANCE ANALYSIS


As we mentioned, a particular processor architecture can have many
microarchitectures with different cost and performance trade-offs. The
cost depends on the amount of hardware required and the implemen-
tation technology. Precise cost calculations require detailed knowledge
of the implementation technology but, in general, more gates and more
memory mean more dollars.
This section lays the foundation for analyzing performance. There
are many ways to measure the performance of a computer system, and
marketing departments are infamous for choosing the method that makes
their computer look fastest, regardless of whether the measurement has
any correlation to real-world performance. For example, microprocessor
makers often market their products based on the clock frequency and the When customers buy computers
number of cores. However, they gloss over the complications that some based on benchmarks, they
must be careful because
processors accomplish more work than others in a clock cycle and that
computer makers have
this varies from program to program. What is a buyer to do? strong incentive to bias the
The only gimmick-free way to measure performance is by measur- benchmark. For example,
ing the execution time of a program of interest to you. The computer Dhrystone involves extensive
that executes your program fastest has the highest performance. The string copying, but the strings
are of known constant length
next best choice is to measure the total execution time of a collection of
and word alignment. Thus, a
programs that are similar to those you plan to run. This may be necessary smart compiler may replace
if you have not written your program yet or if somebody else who does the usual code involving
not have your program is making the measurements. Such collections loops and byte accesses with
of programs are called benchmarks, and the execution times of these a series of word loads and
stores, improving Dhrystone
programs are commonly published to give some indication of how a
scores by more than 30% but
processor performs. not speeding up real-world
Dhrystone, CoreMark, and SPEC are three popular benchmarks. applications. The SPEC89
The first two are synthetic benchmarks composed of important common benchmark contained a Matrix
pieces of programs. Dhrystone was developed in 1984 and remains com- 300 program in which 99%
of the execution time was in
monly used for embedded processors, although the code is somewhat
one line. IBM sped up the
unrepresentative of real-life programs. CoreMark is an improvement program by a factor of 9
over Dhrystone and involves matrix multiplications that exercise the using a compiler technique
multiplier and adder, linked lists to exercise the memory system, state called blocking. Benchmarking
machines to exercise the branch logic, and cyclical redundancy checks multicore computing is even
harder because there are
that involve many parts of the processor. Both benchmarks are less than
many ways to write programs,
16 KB in size and do not stress the instruction cache. some of which speed up in
The SPECspeed 2017 Integer benchmark from the Standard proportion to the number
Performance Evaluation Corporation (SPEC) is composed of real pro- of cores available but are
grams, including x264 (video compression), deepsjeng (an artificial intel- inefficient on a single core.
Others are fast on a single core
ligence chess player), omnetpp (simulation), and GCC (a C compiler).
but scarcely benefit from extra
The benchmark is widely used for high-performance processors because cores.
it stresses the entire system in a representative way.
398 CHAPTER SEVEN Microarchitecture

Equation 7.1 gives the execution time of a program, measured in


seconds.
 cycles  seconds 
ExecutionTime = (# instructions) 
 instruction  cycle  (7.1)

The number of instructions in a program depends on the processor


architecture. Some architectures have complicated instructions that do
more work per instruction, thus reducing the number of instructions in
a program. However, these complicated instructions are often slower
to execute in hardware. The number of instructions also depends enor-
mously on the cleverness of the programmer. For the purposes of this
chapter, we assume that we are executing known programs on a RISC-V
processor, so the number of instructions for each program is constant,
independent of the microarchitecture. The cycles per instruction (CPI)
is the number of clock cycles required to execute an average instruction.
It is the reciprocal of the throughput (instructions per cycle, or IPC).
Different microarchitectures have different CPIs. In this chapter, we
assume we have an ideal memory system that does not affect the CPI. In
Chapter 8, we examine how the processor sometimes has to wait for the
memory, which increases the CPI.
The number of seconds per cycle is the clock period, Tc. The clock
period is determined by the critical path through the logic in the proces-
sor. Different microarchitectures have different clock periods. Logic and
circuit designs also significantly affect the clock period. For example, a
carry-lookahead adder is faster than a ripple-carry adder. Manufacturing
advances also improve transistor speed, so a microprocessor built today
will be faster than one from last decade, even if the microarchitecture
and logic are unchanged.
The challenge of the microarchitect is to choose the design that
minimizes the execution time while satisfying constraints on cost and/or
power consumption. Because microarchitectural decisions affect both
CPI and Tc and are influenced by logic and circuit designs, determining
the best choice requires careful analysis.
Many other factors affect overall computer performance. For example,
the hard disk, the memory, the graphics system, and the network connection
may be limiting factors that make processor performance irrelevant. The
fastest microprocessor in the world does not help surfing the Internet on
a poor connection. But these other factors are beyond the scope of this
book.

7.3 SINGLE-CYCLE PROCESSOR


We first design a microarchitecture that executes instructions in a single
cycle. We begin constructing the datapath by connecting the state
7.3 Single-Cycle Processor 399

elements from Figure 7.1 with combinational logic that can execute the
various instructions. Control signals determine which specific instruc-
tion is performed by the datapath at any given time. The control unit
contains combinational logic that generates the appropriate control
signals based on the current instruction. Finally, we analyze the perfor-
mance of the single-cycle processor.

7 . 3 . 1 Sample Program
For the sake of concreteness, we will have the single-cycle processor run
the short program from Figure 7.2 that exercises loads, stores, an R-type
instruction (or), and a branch (beq). Suppose that the program is stored
in memory starting at address 0x1000. The figure indicates the address
of each instruction, the instruction type, the instruction fields, and the
hexadecimal machine language code for the instruction.
Assume that register x5 initially contains the value 6 and x9 contains
0x2004. Memory location 0x2000 contains the value 10. The program
counter begins at 0x1000. The lw reads 10 from address (0x2004 – 4)
= 0x2000 and puts it in x6. The sw writes 10 to address (0x2004 + 8) =
0x200C. The or computes x4 = 6 | 10 = 01102 | 10102 = 11102 = 14.
Then, beq goes back to label L7, so the program repeats forever.

7 . 3 . 2 Single-Cycle Datapath
This section gradually develops the single-cycle datapath, adding one
piece at a time to the state elements from Figure 7.1. The new connections
are emphasized in black (or blue, for new control signals), whereas the
hardware that has already been studied is shown in gray. The example
instruction being executed is shown at the bottom of each figure.
The program counter contains the address of the instruction to
We italicize signal names in
execute. The first step is to read this instruction from instruction memory. the text but not the names
Figure 7.3 shows that the PC is simply connected to the address input of of hardware modules. For
the instruction memory. The instruction memory reads out, or fetches, the example, PC is the signal
32-bit instruction, labeled Instr. In our sample program from Figure 7.2, coming out of the PC register,
or simply, the PC.
PC is 0x1000. (Note that this is a 32-bit processor, so PC is really
0x00001000, but we omit leading zeros to avoid cluttering the figure.)

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303
imm11:5 rs2 rs1 f3 imm4:0 op
0x1004 sw x6, 8(x9) S 0000000 00110 01001 010 01000 0100011 0064A423
funct7 rs2 rs1 f3 rd op
0x1008 or x4, x5, x6 R 0000000 00110 00101 110 00100 0110011 0062E233
imm12,10:5 rs2 rs1 f3 imm4:1,11 op
0x100C beq x4, x4, L7 B 1111111 00100 00100 000 10101 1100011 FE420AE3

Figure 7.2 Sample program exercising different types of instructions


400 CHAPTER SEVEN Microarchitecture

CLK CLK
CLK
WE3 WE
PCNext PC Instr A1 RD1
A RD
A RD
0x1000

0xFFC4A303
Instruction
A2 RD2 Data
Memory
A3 Memory
Register
WD3 WD
File

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303

Figure 7.3 Fetch instruction from memory

CLK CLK
CLK
19:15 WE3 0x2004 WE
PCNext PC Instr A1 RD1
A RD 9
A RD
0x1000

0xFFC4A303

Instruction
A2 RD2 Data
Memory
A3 Memory
Register
WD3 WD
File

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303

Figure 7.4 Read source operand from register file

Instr is lw, 0xFFC4A303, as also shown at the bottom of Figure 7.3.


These sample values are annotated in light blue on the diagram.
The processor’s actions depend on the specific instruction that was
fetched. First, we will work out the datapath connections for the lw
instruction. Then, we will consider how to generalize the datapath to
handle other instructions.

lw
For the lw instruction, the next step is to read the source register con-
taining the base address. Recall that lw is an I-type instruction, and the
base register is specified in the rs1 field of the instruction, Instr19:15.
These bits of the instruction connect to the A1 address input of the reg-
ister file, as shown in Figure 7.4. The register file reads the register value
onto RD1. In our example, the register file reads 0x2004 from x9.
The lw instruction also requires an offset. The offset is stored in
the 12-bit immediate field of the instruction, Instr31:20. It is a signed
value, so it must be sign-extended to 32 bits. Sign extension simply
means copying the sign bit into the most significant bits: ImmExt31:12
= Instr31 , and ImmExt11:0 = Instr31:20. Sign-extension is performed by
7.3 Single-Cycle Processor 401

CLK CLK
CLK
19:15 WE3 0x2004 WE
PCNext PC Instr A1 RD1
A RD 9
A RD
0x1000

0xFFC4A303
Instruction
A2 RD2 Data
Memory
A3 Memory
Register
WD3 WD
File

ImmExt
31:20
Extend 0xFFFFFFFC
0xFFC

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303

Figure 7.5 Sign-extend the immediate

ALUControl2:0
000
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9

ALU
ALUResult
A RD
0x1000

0xFFC4A303

Instruction
A2 RD2 SrcB 0x2000 Data
Memory
A3 Memory
Register
WD3 WD
File

ImmExt
31:20
Extend 0xFFFFFFFC
0xFFC

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303

Figure 7.6 Compute memory address

an Extend unit, as shown in Figure 7.5, which receives the 12-bit signed
immediate in Instr31:20 and produces the 32-bit sign-extended immediate,
ImmExt. In our example, the two’s complement immediate −4 is
extended from its 12-bit representation 0xFFC to a 32-bit representation
0xFFFFFFFC.
The processor adds the base address to the offset to find the
address to read from memory. Figure 7.6 introduces an ALU to perform
this addition. The ALU receives two operands, SrcA and SrcB. SrcA is
the base address from the register file, and SrcB is the offset from the
sign-extended immediate, ImmExt. The ALU can perform many opera-
tions, as was described in Section 5.2.4. The 3-bit ALUControl signal
402 CHAPTER SEVEN Microarchitecture

RegWrite ALUControl2:0
1 000
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9 ReadData

ALU
ALUResult
A RD
0x1000

Instruction 0xFFC4A303
A2 RD2 SrcB 0x2000 Data 10
Memory 11:7
A3 Memory
6 Register
WD3 WD
File

ImmExt
31:20
Extend 0xFFFFFFFC
0xFFC

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303

Figure 7.7 Read memory and write result back to register file

specifies the operation (see Table 5.3 on page 250). The ALU receives
32-bit operands and generates a 32-bit ALUResult. For the lw instruc-
tion, ALUControl should be set to 000 to perform addition. ALUResult
is sent to the data memory as the address to read, as shown in Figure 7.6.
In our example, the ALU computes 0x2004 + 0xFFFFFFFC = 0x2000.
Again, this is a 32-bit value, but we omit the leading zeros to avoid clut-
tering the figure.
This memory address from the ALU is provided to the address (A)
port of the data memory. The data is read from the data memory onto the
ReadData bus and then written back to the destination register at the end
of the cycle, as shown in Figure 7.7. Port 3 of the register file is the write
port. lw’s destination register, indicated by the rd field (Instr11:7), is con-
nected to A3, port 3’s address input. The ReadData bus is connected to
WD3, port 3’s write data input. A control signal called RegWrite (register
write) is connected to WE3, port 3’s write enable input, and is asserted
during the lw instruction so that the data value is written into the register
file. The write takes place on the rising edge of the clock at the end of the
cycle. In our example, the processor reads 10 from address 0x2000 in the
data memory and puts that value (10) into x6 in the register file.
While the instruction is being executed, the processor must also
compute the address of the next instruction, PCNext. Because instruc-
tions are 32 bits (4 bytes), the next instruction is at PC+4. Figure 7.8
uses an adder to increment the PC by 4. In our example, PCNext =
0x1000 + 4 = 0x1004. The new address is written into the program
counter on the next rising edge of the clock. This completes the datapath
for the lw instruction.
7.3 Single-Cycle Processor 403

RegWrite ALUControl2:0
1 000
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9 ReadData

ALU
0x1004 ALUResult
A RD
0x1000

0xFFC4A303
Instruction
A2 RD2 SrcB 0x2000 Data 10
Memory 11:7
A3 Memory
6 Register
WD3 WD
File

ImmExt
31:7
Extend 0xFFFFFFFC
PCPlus4 0xFFC
+
4

Address Instruction Type Fields Machine Language


imm11:0 rs1 f3 rd op
0x1000 L7: lw x6, -4(x9) I 111111111100 01001 010 00110 0000011 FFC4A303

Figure 7.8 Increment program counter

sw
Next, let us extend the datapath to handle sw, which is an S-type
instruction. Like lw, sw reads a base address from port 1 of the regis-
ter file and sign-extends the immediate. The ALU adds the base address
to the immediate to find the memory address. All of these functions
are already supported in the datapath, but the 12-bit signed immediate
is stored in Instr31:25,11:7 (instead of Instr31:20, as it was for lw). Thus,
the Extend unit must be modified to also receive these additional bits,
Instr11:7. For simplicity (and for future instructions such as jal), the
Extend unit receives all the bits of Instr31:7. A control signal, ImmSrc,
decides which instruction bits to use as the immediate. When ImmSrc =
0 (for lw), the Extend unit chooses Instr31:20 as the 12-bit signed imme-
diate; when ImmSrc = 1 (for sw), it chooses Instr31:25,11:7.
The sw instruction also reads a second register from the register file
and writes its contents to the data memory. Figure 7.9 shows the new
connections for this added functionality. The register is specified in the
rs2 field, Instr24:20, which is connected to the address 2 (A2) input of the
register file. The register’s contents are read onto the read data 2 (RD2)
output, which, in turn, is connected to the write data (WD) input of the
data memory. The write enable port of the data memory, WE, is con-
trolled by MemWrite. For an sw instruction: MemWrite = 1 to write the
data to memory; ALUControl = 000 to add the base address and offset;
and RegWrite = 0, because nothing should be written to the register file.
Note that data is still read from the address given to the data memory,
but this ReadData is ignored because RegWrite = 0.
404 CHAPTER SEVEN Microarchitecture

RegWrite ImmSrc ALUControl2:0 MemWrite


0 1 000 1
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9 ReadData

ALU
0x1008 ALUResult
A RD
0x1004

Instruction 0x0064A423 24:20


A2 RD2
10
0x200C
Memory SrcB Data
11:7 6
A3 Memory
Register WriteData
WD3 WD
File

ImmExt
31:7
Extend 0x00000008
PCPlus4 0x008
+

Address Instruction Type Fields Machine Language


imm11:5 rs2 rs1 f3 imm4:0 op
0x1004 sw x6, 8(x9) S 0000000 00110 01001 010 01000 0100011 0064A423

Figure 7.9 Write data to memory for sw instruction

In our example, the PC is 0x1004. Thus, the instruction mem-


ory reads out the sw instruction, 0x0064A423. The register file reads
0x2004 (the base address) from x9 and 10 from x6 while the Extend
unit extends the immediate offset 8 from 12 to 32 bits. The ALU computes
0x2004 + 8 = 0x200C. The data memory writes 10 to address 0x200C.
Meanwhile, the PC is incremented to 0x1008.

R-Type Instructions
Next, consider extending the datapath to handle the R-type instructions,
add, sub, and, or, and slt. All of these instructions read two source reg-
isters from the register file, perform some ALU operation on them, and
write the result back to the destination register. They differ only in the
specific ALU operation. Hence, they can all be handled with the same
hardware but with different ALUControl signals. Recall from Section
5.2.4 that ALUControl is 000 for addition, 001 for subtraction, 010 for
AND, 011 for OR, and 101 for set less than.
Figure 7.10 shows the enhanced datapath handling these R-type
instructions. The datapath reads rs1 and rs2 from ports 1 and 2 of the
register file and performs an ALU operation on them. We introduce a
multiplexer and a new select signal, ALUSrc, to select between ImmExt
and RD2 as the second ALU source, SrcB. For lw and sw, ALUSrc is
1 to select ImmExt; for R-type instructions, ALUSrc is 0 to select the
register file output RD2 as SrcB.
7.3 Single-Cycle Processor 405

RegWrite ImmSrc ALUSrc ALUControl2:0 MemWrite ResultSrc


1 x 0 011 0 0
CLK CLK 14
CLK
19:15 WE3 6 SrcA WE
PCNext PC Instr A1 RD1 0
A RD 5

ALU
ALUResult ReadData
0x100C A RD 1 14
0x1008

0x0062E233
Instruction 24:20 10 14
A2 RD2 0 SrcB Data
Memory 11:7 6
A3 1 10 Memory
4 Register WriteData
WD3 WD
14 File

ImmExt
31:7
Extend 0x00000008
PCPlus4
+
+

Result
4

Address Instruc!on Type Fields Machine Language


funct7 rs2 rs1 f3 rd op
0x1008 or x4, x5, x6 R 0000000 00110 00101 110 00100 0110011 0062E233

Figure 7.10 Datapath enhancements for R-type instructions

Let us name the value to be written back to the register file Observe that our hardware
computes all the possible
Result. For lw, Result comes from the ReadData output of the answers needed by different
memory. However, for R-type instructions, Result comes from the instructions (e.g., ALUResult
ALUResult output of the ALU. We add the Result multiplexer to and ReadData) and then uses
choose the proper Result based on the type of instruction. The mul- a multiplexer to choose the
appropriate one based on the
tiplexer select signal ResultSrc is 0 for R-type instructions to choose
instruction. This is an important
ALUResult as Result; ResultSrc is 1 for lw to choose ReadData. design strategy. Throughout
We do not care about the value of ResultSrc for sw because it does the rest of this chapter, we will
not write the register file. add multiplexers to choose the
In our example, the PC is 0x1008. Thus, the instruction memory desired answer.
One of the major differences
reads out the or instruction 0x0062E233. The register file reads source
between software and hardware
operands 6 from x5 and 10 from x6. ALUControl is 011, so the ALU is that software operates
computes 6 | 10 = 01102 | 10102 = 11102 = 14. The result is written sequentially, so we can compute
back to x4. Meanwhile, the PC is incremented to 0x100C. just the answer we need.
Hardware operates in parallel;
beq therefore, we often compute all
the possible answers and then
Finally, we extend the datapath to handle the branch if equal (beq) pick the one we need. For
instruction. beq compares two registers. If they are equal, it takes the example, while executing an
branch by adding the branch offset to the program counter (PC). R-type instruction with the ALU,
The branch offset is a 13-bit signed immediate stored in the 12-bit the memory still receives an
address and reads data from this
immediate field of the B-type instruction. Thus, the Extend logic needs
address even though we don’t
yet another mode to choose the proper immediate. ImmSrc is increased care what that data might be.
to 2 bits, using the encoding from Table 7.1. ImmExt is now either the
406 CHAPTER SEVEN Microarchitecture

Table 7.1 ImmSrc encoding

ImmSrc ImmExt Type Description


00 {{20{Instr[31]}}, Instr[31:20]} I 12-bit signed immediate

01 {{20{Instr[31]}}, Instr[31:25], Instr[11:7]} S 12-bit signed immediate

10 {{20{Instr[31]}}, Instr[7], Instr[30:25], Instr[11:8], 1’b0} B 13-bit signed immediate

PCSrc RegWrite ImmSrc ALUSrc ALUControl2:0 MemWrite ResultSrc


1 0 10 0 001 0 x
CLK CLK
CLK
19:15 WE3 14 SrcA Zero WE
0 PCNext PC Instr A1 RD1 1 0
A RD 4 ReadData

ALU
1 0x1000 ALUResult
A RD 1
0x100C

0xFE420AE3

Instruction 24:20 14
A2 RD2 0 SrcB 0 Data
Memory 11:7 4
A3 1 14 Memory
Register WriteData
WD3 WD
File

PCTarget

+
ImmExt 0x1000
31:7
Extend 0xFFFFFFF4
PCPlus4 0xFFA
+

Result
4
0x1010

Address Instruction Type Fields Machine Language


imm12,10:5 rs2 rs1 f3 imm4:1,11 op
0x100C beq x4, x4, L7 B 1111111 00100 00100 000 10101 1100011 FE420AE3

Figure 7.11 Datapath enhancements for beq

sign-extended immediate (when ImmSrc = 00 or 01) or the branch offset


Logically, we can build the
Extend unit from a 32-bit 3:1
(when ImmSrc = 10).
multiplexer choosing one of Figure 7.11 shows the modifications to the datapath. We need
three possible inputs based another adder to compute the branch target address, PCTarget = PC +
on ImmSrc and the various ImmExt. The two source registers are compared by computing (SrcA –
bitfields of the instruction. SrcB) using the ALU. If ALUResult is 0, as indicated by the ALU’s Zero
In practice, the upper bits of
the sign-extended immediate
flag, the registers are equal. We add a multiplexer to choose PCNext
always come from bit 31 of from either PCPlus4 or PCTarget. PCTarget is selected if the instruction
the instruction, Instr31, so is a branch and the Zero flag is asserted. For beq, ALUControl = 001,
we can optimize the design so that the ALU performs a subtraction. ALUSrc = 0 to choose SrcB
and only use a multiplexer to from the register file. RegWrite and MemWrite are 0, because a branch
select the lower bits.
does not write to the register file or memory. We don’t care about the
value of ResultSrc, because the register file is not written.
In our example, the PC is 0x100C, so the instruction memory reads out
the beq instruction 0xFE420AE3. Both source registers are x4, so the
7.3 Single-Cycle Processor 407

PCSrc
Control
ResultSrc
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite

CLK CLK
CLK
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1 0
A RD ReadData
1 ALUResult

ALU
A RD 1
Instruction 24:20
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File

PCTarget

+
ImmExt
31:7
Extend
PCPlus4
+

Result
4

Figure 7.12 Complete single-cycle processor

register file reads 14 on both ports. The ALU computes 14 – 14 = 0, and the
Zero flag is asserted. Meanwhile, the Extend unit produces 0xFFFFFFF4
(i.e., −12), which is added to PC to obtain PCTarget = 0x1000. Note that
we show the unswizzled upper 12 bits of the 13-bit immediate on the input
of the Extend unit (0xFFA). The PCNext mux chooses PCTarget as the next We name the multiplexers
PC and branches back to the start of the code at the next clock edge. (muxes) by the signals they
This completes the design of the single-cycle processor datapath. We produce. For example, the
PCNext mux produces the
have illustrated not only the design itself but also the design process in PCNext signal, and the Result
which the state elements are identified and the combinational logic is mux produces the Result signal.
systematically added to connect the state elements. In the next section,
we consider how to compute the control signals that direct the operation
of our datapath.

7 . 3 . 3 Single-Cycle Control
The single-cycle processor’s control unit computes the control signals
based on op, funct3, and funct7. For the RV32I instruction set, only
bit 5 of funct7 is used, so we just need to consider op (Instr6:0), funct3
(Instr14:12), and funct75 (Instr30). Figure 7.12 shows the entire single-
cycle processor with the control unit attached to the datapath.
408 CHAPTER SEVEN Microarchitecture

Zero PCSrc
Branch

ResultSrc
Main MemWrite
op6:0 Decoder ALUSrc
Figure 7.13 Single-cycle 5
ImmSrc1:0
processor control unit
RegWrite

ALUOp1:0

ALU
funct32:0 ALUControl2:0
Decoder
funct75

Table 7.2 Main Decoder truth table

Instruction Op RegWrite ImmSrc ALUSrc MemWrite ResultSrc Branch ALUOp


lw 0000011 1 00 1 0 1 0 00

sw 0100011 0 01 1 1 x 0 00

R-type 0110011 1 xx 0 0 0 0 10

beq 1100011 0 10 0 0 x 1 01

Figure 7.13 hierarchically decomposes the control unit, which is also


referred to as the controller or the decoder, because it decodes what the
instruction should do. We partition it into two major parts: the Main
Decoder, which produces most of the control signals, and the ALU
Decoder, which determines what operation the ALU performs.
Table 7.2 shows the control signals that the Main Decoder pro-
duces, as we determined while designing the datapath. The Main
Decoder determines the instruction type from the opcode and then
produces the appropriate control signals for the datapath. The Main
Decoder generates most of the control signals for the datapath. It also
produces internal signals Branch and ALUOp, signals used within the
controller. The logic for the Main Decoder can be developed from the
truth table using your favorite techniques for combinational logic
design.
The ALU Decoder produces ALUControl based on ALUOp and
funct3. In the case of the sub and add instructions, the ALU Decoder also
uses funct75 and op5 to determine ALUControl, as given in in Table 7.3.
7.3 Single-Cycle Processor 409

Table 7.3 ALU Decoder truth table

ALUOp funct3 {op5, funct75} ALUControl Instruction


00 x x 000 (add) lw, sw

01 x x 001 (subtract) beq

10 000 00, 01, 10 000 (add) add

000 11 001 (subtract) sub

010 x 101 (set less than) slt

110 x 011 (or) or

111 x 010 (and) and

ALUOp of 00 indicates add (e.g., to find the address for loads or According to Table B.1 in
stores). ALUOp of 01 indicates subtract (e.g., to compare two numbers the inside covers of the book,
for branches). ALUOp of 10 indicates an R-type ALU instruction where add, sub, and addi all have
the ALU Decoder must look at the funct3 field (and sometimes also the funct3 = 000. add has funct7
op5 and funct75 bits) to determine which ALU operation to perform = 0000000 while sub has
funct7 = 0100000, so funct75
(e.g., add, sub, and, or, slt). is sufficient to distinguish these
two. But we will soon consider
supporting addi, which doesn’t
Example 7.1 SINGLE-CYCLE PROCESSOR OPERATION
have a funct7 field but has an
Determine the values of the control signals and the portions of the datapath that op of 0010011. With a bit of
thought, we can see that an
are used when executing an and instruction.
ALU instruction with funct3 =
Solution Figure 7.14 illustrates the control signals and flow of data during exe- 000 is sub if op5 and funct75
are both 1, or add or addi
cution of an and instruction. The PC points to the memory location holding
otherwise.
the instruction; the instruction memory outputs this instruction. The main flow
of data through the register file and ALU is represented with a heavy blue line.
The register file reads the two source operands specified by Instr. SrcB should
come from the second port of the register file (not ImmExt), so ALUSrc must
be 0. The ALU performs a bitwise AND operation, so ALUControl must be
010. The result comes from the ALU, so ResultSrc is 0, and the result is written
to the register file, so RegWrite is 1. The instruction does not write memory, so
MemWrite is 0.

The updating of PC with PCPlus4 is shown by a heavy gray line. PCSrc is 0 to


select the incremented PC. Note that data does flow through the nonhighlighted
paths, but the value of that data is disregarded. For example, the immediate is
extended and a value is read from memory, but these values do not influence the
next state of the system.
410 CHAPTER SEVEN Microarchitecture

PCSrc
Control
ResultSrc
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite

CLK CLK
0 CLK 1 xx 0 010 0 0
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1 0
A RD ReadData
1 ALUResult

ALU
A RD 1
Instruction 24:20
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File

PCTarget

+
ImmExt
31:7
Extend
PCPlus4
+

Result
4

Figure 7.14 Control signals and data flow while executing an and instruction

7 . 3 . 4 More Instructions
So far, we have considered only a small subset of the RISC-V instruction
set. In this section, we enhance the datapath and controller to support
the addi (add immediate) and jal (jump and link) instructions. These
examples illustrate the principle of how to handle new instructions, and
they give us a sufficiently rich instruction set to write many interesting
programs. With enough effort, you could extend the single-cycle pro-
cessor to handle every RISC-V instruction. Moreover, we will see that
supporting some instructions simply requires enhancing the decoders,
whereas supporting others also requires new hardware in the datapath.

Example 7.2 addi INSTRUCTION

Recall that addi rd,rs1,imm is an I-type instruction that adds the value in rs1
to a sign-extended immediate and writes the result to rd. The datapath already is
capable of this task. Determine the necessary changes to the controller to support
addi.

Solution All we need to do is add a new row to the Main Decoder truth table
showing the control signal values for addi, as given in Table 7.4. The result
should be written to the register file, so RegWrite = 1. The 12-bit immediate
in Instr31:20 is sign-extended as it was with lw, another I-type instruction, so
7.3 Single-Cycle Processor 411

Table 7.4 Main Decoder truth table enhanced to support addi

Instruction Opcode RegWrite ImmSrc ALUSrc MemWrite ResultSrc Branch ALUOp


lw 0000011 1 00 1 0 1 0 00
sw 0100011 0 01 1 1 x 0 00

R-type 0110011 1 xx 0 0 0 0 10
beq 1100011 0 10 0 0 x 1 01
addi 0010011 1 00 1 0 0 0 10

ImmSrc is 00 (see Table 7.1). SrcB comes from the immediate, so ALUSrc = 1.
The instruction does not write memory nor is it a branch, so MemWrite =
Branch = 0. The result comes from the ALU, not memory, so ResultSrc = 0.
Finally, the ALU should add, so ALUOp = 10; the ALU Decoder makes
ALUControl = 000 because funct3 = 000 and op5 = 0.

The astute reader may note that this change also provides the other I-type ALU
instructions: andi, ori, and slti. These other instructions share the same op
value of 0010011, need the same control signals, and only differ in the funct3
field, which the ALU Decoder already uses to determine ALUControl and, thus,
the ALU operation.

Example 7.3 jal INSTRUCTION

Show how to change the RISC-V single-cycle processor to support the jump and
link (jal) instruction. jal writes PC+4 to rd and changes PC to the jump target
address, PC + imm.

Solution The processor calculates the jump target address, the value of PCNext,
by adding PC to the 21-bit signed immediate encoded in the instruction. The least
significant bit of the immediate is always 0 and the next 20 most significant bits
come from Instr31:12. This 21-bit immediate is then sign-extended. The datapath
already has hardware for adding PC to a sign-extended immediate, selecting
this as the next PC, computing PC+4, and writing a value to the register file.
Hence, in the datapath, we must only modify the Extend unit to sign-extend the
21-bit immediate and expand the Result multiplexer to choose PC+4 (i.e., PCPlus4)
as shown in Figure 7.15. Table 7.5 shows the new encoding for ImmSrc to support
the long immediate for jal.

The control unit needs to set PCSrc = 1 for the jump. To do this, we
add an OR gate and another control signal, Jump, as shown in Figure 7.16.
When Jump asserts, PCSrc = 1 and PCTarget (the jump target address)
is selected as the next PC.
412 CHAPTER SEVEN Microarchitecture

PCSrc
Control
ResultSrc1:0
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite

CLK CLK
CLK
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1
A RD 00

ALU
1 ALUResult ReadData
A RD 01
Instruction 24:20 10
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File

PCTarget

+
ImmExt
31:7
Extend
PCPlus4
+

Result
4

Figure 7.15 Enhanced datapath for jal

Table 7.5 ImmSrc encoding.

ImmSrc ImmExt Type Description


00 {{20{Instr[31]}}, Instr[31:20]} I 12-bit signed immediate

01 {{20{Instr[31]}}, Instr[31:25], Instr[11:7]} S 12-bit signed immediate

10 {{20{Instr[31]}}, Instr[7], Instr[30:25], Instr[11:8], 1’b0} B 13-bit signed immediate

11 {{12{Instr[31]}}, Instr[19:12], Instr[20], Instr[30:21], 1’b0} J 21-bit signed immediate

Table 7.6 shows the updated Main Decoder table with a new row
for jal. RegWrite = 1 and ResultSrc = 10 to write PC+4 into rd. ImmSrc
= 11 to select the 21-bit jump offset. ALUSrc and ALUOp don’t matter
because the ALU is not used. MemWrite = 0 because the instruction isn’t
a store, and Branch = 0 because the instruction isn’t a branch. The new
Jump signal is 1 to pick the jump target address as the next PC.

7 . 3 . 5 Performance Analysis
Recall from Equation 7.1 that the execution time of a program is the
product of the number of instructions, the cycles per instruction, and
7.3 Single-Cycle Processor 413

Zero
Branch PCSrc
Jump

ResultSrc1:0
Main MemWrite
Decoder
op6:0 ALUSrc
5 Figure 7.16 Enhanced control
ImmSrc1:0
unit for jal
RegWrite

ALUOp1:0

ALU
funct32:0 ALUControl2:0
Decoder
funct75

Table 7.6 Main Decoder truth table enhanced to support jal

Instruction Opcode RegWrite ImmSrc ALUSrc MemWrite ResultSrc Branch ALUOp Jump
lw 0000011 1 00 1 0 01 0 00 0

sw 0100011 0 01 1 1 xx 0 00 0

R-type 0110011 1 xx 0 0 00 0 10 0

beq 1100011 0 10 0 0 xx 1 01 0

I-type ALU 0010011 1 00 1 0 00 0 10 0


jal 1101111 1 11 x 0 10 0 xx 1

the cycle time. Each instruction in the single-cycle processor takes one
clock cycle, so the clock cycles per instruction (CPI) is 1. The cycle time
is set by the critical path. In our processor, the lw instruction is the most
time-consuming and involves the critical path shown in Figure 7.17. As
indicated by heavy blue lines, the critical path starts with the PC loading
a new address on the rising edge of the clock. The instruction memory
then reads the new instruction, and the register file reads rs1 as SrcA.
While the register file is reading, the immediate field is sign-extended
based on ImmSrc and selected at the SrcB multiplexer (path highlighted
in gray). The ALU adds SrcA and SrcB to find the memory address. The
data memory reads from this address, and the Result multiplexer selects
ReadData as Result. Finally, Result must set up at the register file before
414 CHAPTER SEVEN Microarchitecture

PCSrc
Control
ResultSrc1:0
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite

CLK CLK
CLK
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1
A RD ReadData 00
1 ALUResult

ALU
A RD 01
Instruction 24:20 10
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File

PCTarget

+
ImmExt
31:7
Extend
PCPlus4
+

Result
4

Figure 7.17 Critical path for lw

the next rising clock edge so that it can be properly written. Hence, the
cycle time of the single-cycle processor is:

Tc _ single = t pcq _ PC + tmem + max [tRFread , tdec + text + tmux ]


+ t ALU + tmem + tmux + tRFsetup (7.2)

Remember that lw does not


In most implementation technologies, the ALU, memory, and register
use the second read port file are substantially slower than other combinational blocks. Therefore,
(A2/RD2) of the register file. the critical path is through the register file—not through the decoder
(controller), Extend unit, and multiplexer—and is the path highlighted
in blue in Figure 7.17. The cycle time simplifies to:

Tc _ single = t pcq _ PC + 2tmem + tRFread + t ALU + tmux + tRFsetup (7.3)

The numerical values of these times will depend on the specific imple-
mentation technology.
Other instructions have shorter critical paths. For example, R-type
instructions do not need to access data memory. However, we are disciplining
ourselves to synchronous sequential design, so the clock period is constant
and must be long enough to accommodate the slowest instruction.
7.4 Multicycle Processor 415

Table 7.7 Delay of circuit elements

Element Parameter Delay (ps)


Register clk-to-Q tpcq 40

Register setup tsetup 50

Multiplexer tmux 30

AND-OR gate tAND-OR 20

ALU tALU 120

Decoder (control unit) tdec 25

Extend unit text 35

Memory read tmem 200

Register file read tRFread 100

Register file setup tRFsetup 60

Example 7.4 SINGLE-CYCLE PROCESSOR PERFORMANCE

Ben Bitdiddle is contemplating building the single-cycle processor in a 7-nm


CMOS manufacturing process. He has determined that the logic elements have
the delays given in Table 7.7. Help him compute the execution time for a pro-
gram with 100 billion instructions.

Solution According to Equation 7.3, the cycle time of the single-cycle processor is
Tc_single = 40 + 2(200) + 100 + 120 + 30 + 60 = 750 ps. According to Equation 7.1,
the total execution time is Tsingle = (100 × 109 instruction) (1 cycle/instruction)
(750 × 10−12 s/cycle) = 75 seconds.

7.4 MULTICYCLE PROCESSOR


The single-cycle processor has three notable weaknesses. First, it requires
separate memories for instructions and data, whereas most processors
have only a single external memory holding both instructions and data.
Second, it requires a clock cycle long enough to support the slowest
instruction (lw) even though most instructions could be faster. Finally, it
requires three adders (one in the ALU and two for the PC logic); adders
are relatively expensive circuits, especially if they must be fast.
The multicycle processor addresses these weaknesses by breaking an
instruction into multiple shorter steps. The memory, ALU, and register
416 CHAPTER SEVEN Microarchitecture

file have the longest delays, so to keep the delay for each short step
approximately equal, the processor can use only one of those units in
each step. The processor uses a single memory because the instruction
is read in one step and data is read or written in a later step. And the
processor needs only one adder, which is reused for different purposes
on different steps. Various instructions use different numbers of steps, so
simpler instructions can complete faster than more complex ones.
We design a multicycle processor following the same procedure we
used for the single-cycle processor. First, we construct a datapath by
connecting the architectural state elements and memories with combina-
tional logic. But, this time, we also add nonarchitectural state elements
to hold intermediate results between the steps. Then, we design the
controller. During the execution of a single instruction, the controller
produces different signals on each step, so now the controller uses a finite
state machine rather than combinational logic. Finally, we analyze the per-
formance of the multicycle processor and compare it with the single-cycle
processor.

7 . 4 . 1 Multicycle Datapath
Again, we begin our design with the memory and architectural state of
the processor, as shown in Figure 7.18. In the single-cycle design, we
used separate instruction and data memories because we needed to read
the instruction memory and read or write the data memory all in one
cycle. Now, we choose to use a combined memory for both instructions
and data. This is more realistic and is feasible because we can read the
instruction in one cycle, then read or write the data in another cycle. The
PC and register file remain unchanged.
As with the single-cycle processor, we gradually build the datapath
by adding components to handle each step of each instruction. The PC
contains the address of the instruction to execute. The first step is to
read this instruction from memory. Figure 7.19 shows that the PC is

CLK CLK
CLK
WE WE3
A1 RD1
PCNext PC RD
A
EN
Instr / Data A2 RD2
Memory
A3 Register
WD
WD3 File

Figure 7.18 State elements with unified instruction/data memory


7.4 Multicycle Processor 417

simply connected to the address input of the memory. The instruction is Like in the single-cycle processor,
read and stored in a new nonarchitectural instruction register (IR) so we name the multiplexers and
that it is available for future cycles. The IR receives an enable signal, nonarchitectural registers by
called IRWrite, which is asserted when the IR should be loaded with a the signals they produce. For
new instruction. example, the instruction register
produces the instruction signal
(Instr), and the Result multiplexer
lw produces the Result signal.
As we did with the single-cycle processor, we first work out the datapath
connections for the lw instruction. After fetching lw, the second step is
to read the source register containing the base address. This register is
specified in the rs1 field, Instr19:15. These bits of the instruction are con-
nected to address input A1 of the register file, as shown in Figure 7.20.
The register file reads the register onto RD1, and this value is stored in
another nonarchitectural register, A.

IRWrite
CLK CLK CLK
CLK
WE WE3
Instr A1 RD1
PCNext PC RD
A EN
Instr / Data A2 RD2
Memory
A3 Register
WD
WD3 File

Figure 7.19 Fetch instruction from memory

IRWrite ImmSrc1:0
CLK CLK CLK CLK
CLK
WE 19:15 Rs1 WE3 A
Instr A1 RD1
PCNext PC RD
A EN
Instr / Data A2 RD2
Memory
A3 Register
WD
WD3 File

31:7 Extend ImmExt

Figure 7.20 Read one source from register file and extend second source from immediate field
418 CHAPTER SEVEN Microarchitecture

The lw instruction also requires a 12-bit offset found in the imme-


diate field of the instruction, Instr31:20, which must be sign-extended to
32 bits, as shown in Figure 7.20. As in the single-cycle processor, the
Extend unit takes a 2-bit ImmSrc control signal to specify a 12-, 13-, or
21-bit immediate to extend for various types of instructions. The 32-bit
extended immediate is called ImmExt. To be consistent, we might store
ImmExt in another nonarchitectural register. However, ImmExt is a
combinational function of Instr and will not change while the current
instruction is being processed, so there is no need to dedicate a register
to hold the constant value.
The address of the load is the sum of the base address and offset.
In the third step, we use an ALU to compute this sum, as shown in
Figure 7.21. ALUControl should be set to 000 to perform the addition.
ALUResult is stored in a nonarchitectural register called ALUOut.
The fourth step is to load the data from the calculated address in
the memory. We add a multiplexer in front of the memory to choose the
memory address, Adr, from either the PC or ALUOut based on
the AdrSrc select signal, as shown in Figure 7.22. The data read from
memory is stored in another nonarchitectural register, called Data. Note
that the address (Adr) multiplexer permits us to reuse the memory unit
during the lw instruction. On the first step, the address is taken from the
PC to fetch the instruction. On the fourth step, the address is taken from
ALUOut to load the data. Hence, AdrSrc must have different values
during different steps of a single instruction. In Section 7.4.2, we develop
the FSM controller that generates these sequences of control signals.
Finally, the data is written back to the register file, as shown in
Figure 7.23. The destination register is specified by the rd field of the
instruction, Instr11:7. The result comes from the Data register. Instead
of connecting the Data register directly to the register file’s WD3
write port, let us add a multiplexer on the Result bus to choose either
ALUOut or Data before feeding Result back to the register file’s

IRWrite ImmSrc1:0 ALUControl2:0


CLK CLK CLK CLK
CLK CLK
WE 19:15 Rs1 WE3 A SrcA
Instr A1 RD1
PCNext PC RD
ALU

A EN ALUResult ALUOut
Instr / Data A2 RD2 SrcB
Memory
A3 Register
WD
WD3 File

31:7 Extend ImmExt

Figure 7.21 Add base address to offset


7.4 Multicycle Processor 419

writedata port (WD3). This will be helpful because other instruc-


tions will need to write a result from the ALU to the register file. The
RegWrite signal is 1 to indicate that the register file should be updated.
While all this is happening, the processor must update the program
counter by adding 4 to the PC. In the single-cycle processor, a separate
adder was needed. In the multicycle processor, we can use the exist-
ing ALU during the instruction fetch step because it is not busy. To do
so, we must insert source multiplexers to choose PC and the constant
4 as ALU inputs, as shown in Figure 7.24. A multiplexer controlled by
ALUSrcA chooses either PC or A as SrcA. Another multiplexer chooses
either 4 or ImmExt as SrcB. We also show additional multiplexer inputs
that will be used when we implement more instructions. To update the
PC, the ALU adds SrcA (PC) to SrcB (4), and the result is written into
the program counter. The Result multiplexer chooses this sum from
ALUResult rather than ALUOut; this requires a third multiplexer input.
The PCWrite control signal enables the PC to be written only on certain
cycles. This completes the datapath for the lw instruction.

AdrSrc IRWrite ImmSrc1:0 ALUControl2:0


CLK CLK CLK CLK
CLK
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1
0 Adr RD
ALUResult ALUOut

ALU
EN A EN
1
Instr / Data A2 RD2 SrcB
Memory
ReadData

A3 Register
WD
WD3 File

CLK
31:7 Extend ImmExt
Data

Figure 7.22 Load data from memory

AdrSrc IRWrite RegWrite ImmSrc1:0 ALUControl2:0 ResultSrc1:0


CLK CLK CLK CLK
CLK
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1
0 Adr RD
ALUResult ALUOut
ALU

EN A EN 00
1
Instr / Data A2 RD2 SrcB 01
Memory
ReadData

11:7 Rd 10
A3 Register
WD
WD3 File

CLK
31:7 Extend ImmExt
Data

Figure 7.23 Write data back to register file


420 CHAPTER SEVEN Microarchitecture

PCWrite AdrSrc IRWrite RegWrite ImmSrc1:0 ALUSrcA1: ALUControl2:0 ResultSrc1:0

ALUSrcB1:0

CLK CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
ALUResult ALUOut

ALU
EN A EN 00
1
Instr / Data A2 RD2 00 SrcB 01
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.24 Increment PC by 4

PCWrite AdrSrc MemWrite IRWrite RegWrite ImmSrc1:0 ALUSrcA1:0 ALUControl2:0 ResultSrc1:0

ALUSrcB1:0

CLK CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD ALU
EN A EN ALUResult ALUOut 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
01
WriteData
ReadData

Memory 11:7 Rd 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.25 Enhanced datapath for sw instruction

sw
Next, let us extend the datapath to handle the sw instruction. Like lw,
sw reads a base address from port 1 of the register file and extends the
immediate on the second step. Then, the ALU adds the base address to
the immediate to find the memory address on the third step. The only
new feature of sw is that we must read a second register from the regis-
ter file and write its contents into memory, as shown in Figure 7.25. The
register is specified in the rs2 field of the instruction, Instr24:20, which
is connected to the second port of the register file (A2). After it is read
7.4 Multicycle Processor 421

PCWrite AdrSrc MemWrite IRWrite RegWrite ImmSrc1:0 ALUSrcA1:0 ALUControl2:0 ResultSrc1:0

ALUSrcB1:0 Zero
CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
ALUResult ALUOut

ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
01

WriteData
ReadData

Memory 11:7 Rd 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.26 Enhanced datapath for beq target address calculation

on the second step, the register’s contents are then stored in a nonar-
chitectural register, the WriteData register, just below the A register. It is
then sent to the write data port (WD) of the data memory to be written
on the fourth step. The memory receives the MemWrite control signal,
which is asserted when memory should be written.

R-Type Instructions
R-type instructions operate on two source registers and write the result
back to the register file. The datapath already contains all the connec-
tions necessary for these steps.

beq
beq checks whether two register contents are equal and computes
the branch target address by adding the current PC to a 13-bit signed
branch offset. The hardware to compare the registers using subtraction
is already present in the datapath.
The ALU is not being used during the second step of instruc-
tion execution, so we use it then to calculate the branch target address
PCTarget = PC + ImmExt. In this step, the instruction has been fetched
from memory and PC has already been updated to PC+4. Thus, in the
first step, the PC of the current instruction, OldPC, must be stored in
a nonarchitectural register. In the second step, as the registers are also
fetched, the ALU calculates PC + ImmExt by selecting OldPC for SrcA
and ImmExt for SrcB and making ALUControl = 000 so that it per-
forms addition. The processor stores this sum in the ALUOut register.
Figure 7.26 shows the updated datapath for beq.
In the third step, the ALU subtracts the source registers and asserts
the Zero output if they are equal. If they are, the control unit asserts
422 CHAPTER SEVEN Microarchitecture

PCWrite and the Result multiplexer selects ALUOut (that contains the
target address) to feed to the PC. No new hardware is needed.
This completes the design of the multicycle datapath. The design
process is much like that of the single-cycle processor in that hardware
is systematically connected between the state elements to handle each
instruction. The main difference is that the instruction is executed in sev-
eral steps. Nonarchitectural registers are inserted to hold the results of
each step. In this way, the memory can be shared for instructions and
data and the ALU can be reused several times, thus reducing hardware
costs. In the next section, we develop an FSM controller to deliver the
appropriate sequence of control signals to the datapath on each step of
each instruction.

7 . 4 . 2 Multicycle Control
As in the single-cycle processor, the control unit computes the control
signals based on the op, funct3, and funct75 fields of the instruction
(Instr6:0, Instr14:12, and Instr30). Figure 7.27 shows the entire multicycle
processor with the control unit attached to the datapath. The datapath is
shown in black and the control unit in blue.

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc
1:0
30
funct75 RegWrite

Zero

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
ALUResult ALUOut
ALU

EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData

Memory
ReadData

11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.27 Complete multicycle processor


7.4 Multicycle Processor 423

Zero
Branch PCWrite
PCUpdate

RegWrite
Main MemWrite
FSM IRWrite
ResultSrc1:0
op6:0 ALUSrcB1:0
5
ALUSrcA1:0 Figure 7.28 Multicycle control
AdrSrc unit

ALUOp1:0

ALU
funct32:0 ALUControl2:0
Decoder
funct75

Instr
op6:0 ImmSrc1:0
Decoder

The control unit consists of a Main FSM, ALU Decoder, and


Instruction Decoder (Instr Decoder) as shown in Figure 7.28. The ALU
Decoder is the same as in the single-cycle processor (see Table 7.3), but
the combinational Main Decoder of the single-cycle processor is replaced
with the Main FSM in the multicycle processor to produce a sequence of
control signals on the appropriate cycles. A small Instruction Decoder
combinationally produces the ImmSrc select signal based on the opcode
using the ImmSrc column of Table 7.6. We design the Main FSM as a
Moore machine so that the outputs are only a function of the current state.
The remainder of this section develops the state transition diagram for
the Main FSM.
The Main FSM produces multiplexer select, register enable, and
memory write enable signals for the datapath. To keep the following
state transition diagrams readable, only the relevant control signals are
listed. Multiplexer select signals are listed only when their value matters; Reset
otherwise, they are don’t care. Enable signals (RegWrite, MemWrite,
IRWrite, PCUpdate, and Branch) are listed only when they are asserted; S0: Fetch
otherwise, they are 0. AdrSrc = 0
IRWrite

Fetch
The first step for any instruction is to fetch the instruction from memory
at the address held in the PC. The FSM enters this Fetch state on reset.
The control signals are shown in Figure 7.29. To read the instruction Figure 7.29 Fetch
424 CHAPTER SEVEN Microarchitecture

from memory, AdrSrc = 0, so the address is taken from the PC. IRWrite
is asserted to write the instruction into the instruction register, IR. At
the same time, the current PC is written into the OldPC register. The
data flow through the datapath for this and the next two steps of the lw
instruction is shown in Figure 7.32, with the flow during the Fetch stage
highlighted in gray.

Decode
The second step is to read the register file and decode the instructions.
The control unit decodes the instruction, that is, figures out what opera-
tion should be performed based on op, funct3, and funct75. In this state,
the processor also reads the source registers, rs1 and rs2, and puts the
values read into the A and WriteData nonarchitectural registers. No con-
trol signals are necessary for these tasks. Figure 7.30 shows the Decode
state in the Main FSM and Figure 7.32 shows the flow through the datapath
during this state in medium blue lines. After this step, the processor can
differentiate its actions based on the instruction because the instruction
has been fetched and decoded. We will first show the remaining steps for
lw, then continue with the steps for the other RISC-V instructions.

MemAdr
The third step for lw is to calculate the memory address. The ALU adds
the base address and the offset, so ALUSrcA = 10 to select A (the value
read from rs1) as SrcA, and ALUSrcB = 01 to select ImmExt as SrcB.
ImmSrc, as determined by the Instruction Decoder, is 00 to sign-extend
the I-type immediate, and ALUOp is 00 to add SrcA and SrcB. At the
end of this state, the ALU result (i.e., the address calculation) is stored in
the ALUOut register. Figure 7.31 shows this MemAdr state added to the
Main FSM, and Figure 7.32 shows the datapath flow during this state
highlighted in dark-blue lines.

MemRead
The memory read (MemRead) step sends the calculated address to
memory by sending ALUOut to the address port of the memory, Adr.

Reset

S0: Fetch S1: Decode


AdrSrc = 0
Figure 7.30 Decode IRWrite
7.4 Multicycle Processor 425

Reset

S0: Fetch S1: Decode


AdrSrc = 0
IRWrite

Figure 7.31 Memory address


op = 0000011 (lw) computation

S2: MemAdr
ALUSrcA = 10
ALUSrcB = 01
ALUOp = 00

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
S0: Fetch 0 0 0 1 0 xx xx xx xxx xx
S1: Decode 0 x 0 0 Zero 0 00 xx xx xxx xx
S2: MemAdr 0 x 0 0 0 00 10 01 000 xx
Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD ALUResult ALUOut
ALU

EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.32 Data flow during Fetch, Decode, and MemAdr states
426 CHAPTER SEVEN Microarchitecture

ResultSrc = 00 and AdrSrc = 1 to route ALUOut through the Result


and Adr multiplexers to the memory address input. ReadData gets the
value read from the desired address in memory. At the end of this state,
ReadData is written into the Data register.

MemWB
In the memory writeback (MemWB) step, the data read from memory,
Data, is written to the destination register. ResultSrc is 01 to select Data
as the Result, and RegWrite is asserted to write the data to the regis-
ter file. The register file’s address and write data inputs for port 3 (A3
and WD3) are already connected to rd (Instr11:7) and Result, respectively.
Figures 7.33 and 7.34 show the MemRead and MemWB states and the
flow through the datapath for both steps. MemWB is the final step in the
lw instruction. Figure 7.33 also shows the transition from MemWB back

Reset

S0: Fetch S1: Decode


AdrSrc = 0
IRWrite

op = 0000011 (lw)

S2: MemAdr
ALUSrcA = 10
ALUSrcB = 01
ALUOp = 00

Figure 7.33 Memory read


(MemRead) and memory write op =
back (MemWB) states 0000011
(lw)

S3: MemRead
ResultSrc = 00
AdrSrc = 1

S4: MemWB
ResultSrc = 01
RegWrite
7.4 Multicycle Processor 427

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite

S3: MemRead 0 1 0 0 0 00 xx xx xxx 00


S4: MemWB 0 x 0 0 Zero 1 00 xx xx xxx 01

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
ALUResult ALUOut

ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01

WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.34 Data flow during MemRead and MemWB

to the Fetch state so that the next instruction can be fetched. However, We started this section stating
the PC has not yet been incremented. We address this issue next. that only one of the time-
Before finishing the lw instruction, the processor must increment the consuming units (the memory,
ALU, or register file) could be
PC so that it can fetch the next instruction. We could add another state for used in each step. However,
doing this but, instead, we save a cycle by noticing that the ALU is not here, we use both the register file
being used in the Fetch step, so the processor can use that state to calculate and ALU. As long as the units
PC+4 at the same time that it fetches the instruction. ALUSrcA = 00 are used at the same time—that
to get SrcA from the PC (i.e., from signal OldPC). ALUSrcB = 10 to get is, in parallel—more than one
unit can be used in a single step.
the constant 4 for SrcB. ALUOp = 00, so that the ALU adds PC to 4. To
update the PC with PC+4, ResultSrc = 10 to choose ALUResult as the
Result, and PCUpdate = 1 to force PCWrite high (see Figure 7.28). Reset
Figure 7.35 shows the modified Fetch state. The rest of the diagram S0: Fetch
remains the same as in Figure 7.33. Figure 7.36 highlights in blue the AdrSrc = 0
IRWrite
data flow for computing PC+4. The instruction fetch, which is occurring ALUSrcA = 00
ALUSrcB =10
simultaneously, is highlighted in gray. ALUOp = 00
ResultSrc = 10
PCUpdate
sw
Now, we expand the Main FSM to handle more RISC-V instructions. Figure 7.35 Incrementing PC in
All instructions pass through the first two states, Fetch and Decode. the Fetch state
428 CHAPTER SEVEN Microarchitecture

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite

S0: Fetch 1 0 0 1 Zero 0 xx 00 10 000 10

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
ALUResult ALUOut

ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01

WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.36 Data flow while incrementing PC in the Fetch state

The sw instruction uses the same MemAdr state as lw to calculate the


memory address but then proceeds to the memory write (MemWrite)
state, where WriteData, the value from rs2, is written to memory.
WriteData is hardwired to the memory’s write data port (WD). The
memory’s address port, Adr, is set to the calculated address in
The ImmSrc signal differs for ALUOut by making ResultSrc = 00 and AdrSrc = 1. MemWrite is
lw and sw in the MemAdr asserted to write the memory. This completes the sw instruction, so the
state. But recall that ImmSrc Main FSM returns to the Fetch state to begin the next instruction.
is produced combinationally
by the Instruction Decoder
Figures 7.37 and 7.38 show the expanded Main FSM and the datapath
(see Figure 7.28). flow for the MemWrite state. Note that the first two states of the FSM
(Fetch and Decode), which are not shown in Figure 7.37, are the same
as in Figure 7.33.

R-Type Instructions
After the Decode state, R-type ALU instructions proceed to the execute
(ExecuteR) state, which performs the desired ALU computation.
Namely, ALUSrcA = 10 and ALUSrcB = 00 to choose the contents of
rs1 as SrcA and rs2 as SrcB. ALUOp = 10 so that the ALU Decoder uses
the instruction’s control fields to determine what operation to perform.
7.4 Multicycle Processor 429

op = 0000011 (lw)
OR
op = 0100011 (sw)

S2: MemAdr
ALUSrcA = 10
ALUSrcB = 01
ALUOp = 00

op = op =
0000011 0100011
(lw) (sw)

S3: MemRead S5: MemWrite


ResultSrc = 00 ResultSrc = 00
Figure 7.37 Memory write
AdrSrc = 1 AdrSrc = 1
MemWrite

S4: MemWB
ResultSrc = 01
RegWrite

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc
1:0
30
funct75 RegWrite

S5: MemWrite 0 1 1 0 Zero 0 01 xx xx xxx 00

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
A ALUResult ALUOut
ALU

EN EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
Memory
WriteData
ReadData

11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.38 Data flow during the memory write (MemWrite) state
430 CHAPTER SEVEN Microarchitecture

Reset

S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite
ALUSrcA = 00
ALUSrcB = 10
ALUOp = 00
ResultSrc = 10
PCUpdate

op = 0000011 (lw) op =
OR 0110011
op = 0100011 (sw) (R-type)

S2: MemAdr S6: ExecuteR


ALUSrcA = 10 ALUSrcA = 10
ALUSrcB = 01 ALUSrcB = 00
ALUOp = 00 ALUOp = 10

Figure 7.39 Execute R-type


(ExecuteR) and ALU writeback op = op =
(ALUWB) states 0000011 0100011
(lw) (sw)

S3: MemRead S5: MemWrite S7: ALUWB


ResultSrc = 00 ResultSrc = 00 ResultSrc = 00
AdrSrc = 1 AdrSrc = 1 RegWrite
MemWrite

S4: MemWB
ResultSrc = 01
RegWrite

ALUResult is written to the ALUOut register at the end of the cycle.


R-type instructions then go to the ALU writeback (ALUWB) state where
the computation result, ALUOut, is written back to the register file.
In the ALUWB state, ResultSrc = 00 to select ALUOut as Result, and
RegWrite = 1 so that rd is written with the result. Figure 7.39 shows
the ExecuteR and ALUWB states added to the Main FSM. Figure 7.40
shows the data flow during both states, with ExecuteR data flow shown
in thick light-blue lines and ALUWB data flow in thick dark-blue lines.

beq
The final instruction, beq, compares two registers and computes the
branch target address. Thus far, the ALU is idle during the Decode state,
so we can use the ALU during that state to calculate the branch target
7.4 Multicycle Processor 431

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite

S6: ExecuteR 0 x 0 0 0 xx 10 00 varies xx


S7: ALUWB 0 x 0 0 Zero 1 xx xx xx xxx 00

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD ALUResult ALUOut

ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01

WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.40 Data flow during the ExecuteR and ALUWB states

address, OldPC + ImmExt. ALUSrcA and ALUSrcB are both 01 so that


Even though the instruction
OldPC is SrcA and the branch offset (ImmExt) is SrcB. ALUOp = 00 to is not yet decoded at the
make the ALU add. The target address is stored in the ALUOut register beginning of the Decode
at the end of the Decode state. Figure 7.41 shows the enhanced Decode state—and it may not even be
state as well as the subsequent BEQ state, which is discussed next. In a beq instruction—the branch
Figure 7.42, the data flow during the Decode state is shown in light-blue target address is calculated as
if it were a branch. If it turns
and gray lines. The branch target address calculation is highlighted in out that the instruction is not a
light blue, and the register read and immediate extension is highlighted branch or if the branch is not
with thick gray lines. taken, the resulting calculation
After the Decode state, beq proceeds to the BEQ state, where it is simply not used.
compares the source registers. ALUSrcA = 10 and ALUSrcB = 00 to
select the values read from the register file as SrcA and SrcB. ALUOp
= 01 so that the ALU performs subtraction. If the source registers are
equal, the ALU’s Zero output asserts (because rs1 − rs2 = 0). Branch =
1 in this state so that if Zero is also set, PCWrite is asserted (as shown
in the PCWrite logic of Figure 7.28) and the branch target address (in
ALUOut) becomes the next PC. ALUOut is routed to the PC register by
ResultSrc being 00. Figure 7.41 shows the BEQ state, and Figure 7.42
432 CHAPTER SEVEN Microarchitecture

Reset

S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite ALUSrcA = 01
ALUSrcA = 00 ALUSrcB = 01
ALUSrcB =10 ALUOp = 00
ALUOp = 00
ResultSrc = 10
PCUpdate

op = 0000011 (lw) op =
op =
OR 0110011
1100011
op = 0100011 (sw) (R-type)
(beq)

S2: MemAdr S6: ExecuteR S10: BEQ


ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 10
ALUSrcB = 01 ALUSrcB = 00 ALUSrcB = 00
ALUOp = 00 ALUOp = 10 ALUOp = 01
ResultSrc = 00
Branch

op = op =
0000011 0100011
(lw) (sw)

S3: MemRead S5: MemWrite S7: ALUWB


ResultSrc = 00 ResultSrc = 00 ResultSrc = 00
AdrSrc = 1 AdrSrc = 1 RegWrite
MemWrite

S4: MemWB
ResultSrc = 01
RegWrite

Figure 7.41 Enhanced Decode state, with branch target address calculation, and BEQ state

shows the data flow during the BEQ state. The path for comparing rs1
and rs2 is shown in dark blue and the path to (conditionally) set PC to
the target address is in gray through the Result register. This concludes
the design of the controller for these instructions.

7 . 4 . 3 More Instructions
As we did with the single-cycle processor, we next consider examples
of how to modify the multicycle processor datapath and controller to
handle new instructions: I-type ALU instructions (addi, andi, ori,
slti) and jal.
7.4 Multicycle Processor 433

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite

S1: Decode 0 X 0 0 0 varies 01 01 000 XX


S10: BEQ 1 if taken X 0 0 Zero 0 10 10 00 001 00

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
ALUResult ALUOut

ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 S rcB 01

WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.42 Data flow during Decode and BEQ states

Example 7.5 EXPANDING THE MULTICYCLE PROCESSOR TO


INCLUDE I-TYPE ALU INSTRUCTIONS

Expand the multicycle processor to include I-type ALU instructions addi, andi,
ori, and slti.

Solution These I-type ALU instructions are nearly the same as their R-type equivalents
(add, and, or, and slt) except that the second source comes from ImmExt rather than
the register file. We introduce the ExecuteI state to perform the desired computation
for all I-type ALU instructions. This state is like ExecuteR except that ALUSrcB = 01
to choose ImmExt as SrcB. After the ExecuteI state, I-type ALU instructions proceed
to the ALU writeback (ALUWB) state to write the result to the register file. Figure 7.43
shows the enhanced Main FSM, which also includes the JAL state for Example 7.6.

Example 7.6 EXPANDING THE MULTICYCLE PROCESSOR TO INCLUDE jal

Expand the multicycle processor to include the jump and link instruction (jal).
Solution Like the I-type ALU instructions from Example 7.5, no additional hard-
ware is needed to implement the jal instruction. Only the Main FSM needs to be
434 CHAPTER SEVEN Microarchitecture

Reset

S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite ALUSrcA = 01
ALUSrcA = 00 ALUSrcB = 01
ALUSrcB =10 ALUOp = 00
ALUOp = 00
ResultSrc = 10
PCUpdate

op = 0000011 (lw) op = op = op =
op =
OR 0110011 0010011 1101111
1100011
op = 0100011 (sw) (R-type) (I-type ALU) (jal)
(beq)

S2: MemAdr S6: ExecuteR S8: ExecuteI S9: JAL S10: BEQ
ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 01 ALUSrcA = 10
ALUSrcB = 01 ALUSrcB = 00 ALUSrcB = 01 ALUSrcB = 10 ALUSrcB = 00
ALUOp = 00 ALUOp = 10 ALUOp = 10 ALUOp = 00 ALUOp = 01
ResultSrc = 00 ResultSrc = 00
PCUpdate Branch

op = op =
0000011 0100011
(lw) (sw)

S3: MemRead S5: MemWrite S7: ALUWB


ResultSrc = 00 ResultSrc = 00 ResultSrc = 00
AdrSrc = 1 AdrSrc = 1 RegWrite
MemWrite

S4: MemWB
ResultSrc = 01
RegWrite

Figure 7.43 Enhanced Main FSM: ExecuteI and JAL states

updated. The first two steps are the same as the other instructions. During the
You may notice that by the
time the processor reaches
Decode state, the jump target address is calculated using the same flow as the
the JAL state, the PC register branch target address calculation but with ImmSrc = 11, as set by the Instruction
has already been updated to Decoder. Thus, during the Decode state, the jump offset is sign-extended and
PC+4. So we could have just added to the current PC (contained in signal OldPC) to form the jump target
used the PC register output address, which is written to the ALUOut register at the end of that state. jal then
to write to rd. But that would
have required us to extend
proceeds to the JAL state, where the processor writes the target address to PC
the Result multiplexer to and calculates the return address (PC+4) so that it can write it to rd in the next
receive PC as an input. The state. The ALU calculates PC+4 (i.e., OldPC+4) using ALUSrcA = 01 (SrcA =
solution above requires less OldPC), ALUSrcB = 10 (SrcB = 4), and ALUOp = 00 for addition. To write the
hardware because it uses the target address to the PC, ResultSrc = 00 to select the target address (held in
existing datapath.
ALUOut) as Result, and PCUpdate = 1 so that PCWrite is asserted. Figure 7.43
7.4 Multicycle Processor 435

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc
1:0
30
funct75 RegWrite

S9: JAL 1 x 0 0 Zero 0 11 01 10 000 00

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
A ALUResult ALUOut

ALU
EN EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01

WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.44 Data flow during the JAL state

shows the new JAL state, and Figure 7.44 shows the data flow during the JAL
state. The flow for updating the PC to the target address is in gray and the PC+4
calculation is in blue. After the JAL state, jal proceeds to the ALUWB state,
where the return address (ALUOut = PC+4) is written to rd. This concludes the
jal instruction, so the Main FSM then goes back to the Fetch state.

Putting these steps together, Figure 7.45 shows the complete Main FSM
state transition diagram for the multicycle processor. The function of
each state is summarized below the figure. Converting the diagram to
hardware is a straightforward but tedious task using the techniques of
Chapter 3. Better yet, the FSM can be coded in an HDL and synthesized
using the techniques of Chapter 4.

7 . 4 . 4 Performance Analysis
The execution time of an instruction depends on both the number
of cycles it uses and the cycle time. While the single-cycle processor
performed all instructions in one cycle, the multicycle processor uses
436 CHAPTER SEVEN Microarchitecture

Reset

S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite ALUSrcA = 01
ALUSrcA = 00 ALUSrcB = 01
ALUSrcB = 10 ALUOp = 00
ALUOp = 00
ResultSrc = 10
PCUpdate

op = 0000011 (lw) op = op = op =
op =
OR 0110011 0010011 1101111
1100011
op = 0100011 (sw) (R-type) (I-type ALU) (jal)
(beq)

S2: MemAdr S6: ExecuteR S8: ExecuteI S9: JAL S10: BEQ
ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 01 ALUSrcA = 10
ALUSrcB = 01 ALUSrcB = 00 ALUSrcB = 01 ALUSrcB = 10 ALUSrcB = 00
ALUOp = 00 ALUOp = 10 ALUOp = 10 ALUOp = 00 ALUOp = 01
ResultSrc = 00 ResultSrc = 00
PCUpdate Branch

op = op =
0000011 0100011
(lw) (sw)

S3: MemRead S5: MemWrite S7: ALUWB


ResultSrc = 00 ResultSrc = 00 ResultSrc = 00
AdrSrc = 1 AdrSrc = 1 RegWrite
MemWrite

S4: MemWB
ResultSrc = 01
RegWrite

State Datapath mOp


Fetch Instr ←Mem[PC]; PC ← PC+4
Decode ALUOut ← PCTarget
MemAdr ALUOut ← rs1 + imm
MemRead Data ← Mem[ALUOut]
MemWB rd ← Data
MemWrite Mem[ALUOut] ← rd
ExecuteR ALUOut ← rs1oprs2
ExecuteI ALUOut ← rs1opimm
ALUWB rd ← ALUOut
BEQ ALUResult = rs1-rs2; if Zero, PC = ALUOut
JAL PC = ALUOut; ALUOut = PC+4

Figure 7.45 Complete multicycle control FSM

varying numbers of cycles for different instructions. However, the


multicycle processor does less work in a single cycle and, thus, has a
shorter cycle time.
The multicycle processor requires three cycles for branches, four
for R-type, I-type ALU, jump, and store instructions, and five for loads.
The number of clock cycles per instruction (CPI) depends on the relative
likelihood that each instruction is used.
7.4 Multicycle Processor 437

CLK

PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite

Zero

Zero

CLK

OldPC

CLK CLK CLK 00


CLK
01
WE 19:15 Rs1 WE3 A SrcA CLK
PCNext PC Instr A1 RD1 10
0 Adr RD
A ALUResult ALUOut

ALU
EN EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01

WriteData
ReadData

Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File

CLK
31:7 Extend ImmExt
Data

Result

Figure 7.46 Multicycle processor potential critical paths

Example 7.7 MULTICYCLE PROCESSOR CPI

The SPECINT2000 benchmark consists of approximately 25% loads, 10%


stores, 11% branches, 2% jumps, and 52% R- or I-type ALU instructions.2
Determine the average CPI for this benchmark.

Solution The average CPI is the weighted sum over each instruction of the CPI
for that instruction multiplied by the fraction of time that instruction is used. For
this benchmark, average CPI = (0.11)(3) + (0.10 + 0.02 + 0.52)(4) + (0.25)(5) = 4.14.
This is better than the worst-case CPI of 5, which would be required if all
instructions took the same number of cycles.

Recall that we designed the multicycle processor so that each cycle


involved one ALU operation, memory access, or register file access. Let
us assume that the register file is faster than the memory and that writing
memory is faster than reading memory. Examining the datapath reveals two
possible critical paths that would limit the cycle time, as shown in Figure 7.46:
1. The path to calculate PC+4: From the PC register through the SrcA
multiplexer, ALU, and Result multiplexer back to the PC register
(highlighted in thick blue lines); or
2
Instruction frequencies from Patterson and Hennessy, Computer Organization and
Design, 4th Edition, Morgan Kaufmann, 2011.
438 CHAPTER SEVEN Microarchitecture

2. The path to read data from memory: From the ALUOut register
through the Result and Adr muxes to read memory into the Data
register (highlighted in thick gray lines)

Both of these paths also require a delay through the decoder after the
state updates (i.e., after a tpcq delay) to produce the control (multiplexer
select and ALUControl) signals. Thus, the clock period is given in
Equation 7.4.
Tc _ multi = t pcq + tdec + 2tmux + max [t ALU , tmem ] + t setup (7.4)
The numerical values of these times will depend on the specific imple-
mentation technology.

Example 7.8 MULTICYCLE PROCESSOR PERFORMANCE COMPARISON.


Ben Bitdiddle is wondering whether the multicycle processor would be faster
than the single-cycle processor. For both designs, he plans on using the 7-nm
CMOS manufacturing process with the delays given in Table 7.7 on page 415.
Help him compare each processor’s execution time for 100 billion instructions
from the SPECINT2000 benchmark (see Example 7.4).
Solution According to Equation 7.4, the cycle time of the multicycle processor is
Tc_multi = tpcq + tdec + 2tmux + tmem + tsetup = 40 + 25 + 2(30) + 200 + 50 = 375 ps.
Using the CPI of 4.14 from Example 7.7, the total execution time is Tmulti = (100
× 109 instructions)(4.14 cycles / instruction)(375 × 10−12 s / cycle) = 155 seconds.
According to Example 7.4, the single-cycle processor had a total execution time
of 75 seconds, so the multicycle processor is slower.

One of the original motivations for building a multicycle processor


was to avoid making all instructions take as long as the slowest one.
Unfortunately, this example shows that the multicycle processor is slower
than the single-cycle processor, given the assumptions of CPI and circuit
element delays. The fundamental problem is that even though the slow-
est instruction, lw, was broken into five steps, the multicycle processor
cycle time was not nearly improved fivefold. This is partly because not
all of the steps are exactly the same length and partly because the 90-ps
sequencing overhead of the register clock-to-Q and setup time must now
be paid on every step, not just once for the entire instruction. In general,
engineers have learned that it is difficult to exploit the fact that some
computations are faster than others unless the differences are large.
Compared with the single-cycle processor, the multicycle proces-
sor is likely to be less expensive because it shares a single memory for
instructions and data and because it eliminates two adders. It does, how-
ever, require five nonarchitectural registers and additional multiplexers.
7.5 Pipelined Processor 439

7.5 PIPELINED PROCESSOR


Pipelining, introduced in Section 3.6, is a powerful way to improve the Recall that throughput is the
throughput of a digital system. We design a pipelined processor by subdi- number of tasks (in this case,
viding the single-cycle processor into five pipeline stages. Thus, five instructions) that complete
instructions can execute simultaneously, one in each stage. Because each per second. Latency is the
stage has only one-fifth of the entire logic, the clock frequency is approxi- time it takes for a given
instruction to complete, from
mately five times faster. So, ideally, the latency of each instruction is
start to finish. (See Section 3.6)
unchanged, but the throughput is five times better. Microprocessors exe-
cute millions or billions of instructions per second, so throughput is more
important than latency. Pipelining introduces some overhead, so the
throughput will not be as high as we might ideally desire, but pipelining
nevertheless gives such great advantage for so little cost that all modern
high-performance microprocessors are pipelined.
Reading and writing the memory and register file and using the ALU
typically constitute the biggest delays in the processor. We choose five
pipeline stages so that each stage involves exactly one of these slow steps.
Specifically, we call the five stages Fetch, Decode, Execute, Memory, and
Writeback. They are similar to the five steps that the multicycle processor
used to perform lw. In the Fetch stage, the processor reads the instruc-
tion from instruction memory. In the Decode stage, the processor reads
the source operands from the register file and decodes the instruction to
produce the control signals. In the Execute stage, the processor performs
a computation with the ALU. In the Memory stage, the processor reads
or writes data memory, if applicable. Finally, in the Writeback stage, the
processor writes the result to the register file, if applicable.
Figure 7.47 shows a timing diagram comparing the single-cycle and
pipelined processors. Time is on the horizontal axis and instructions are
on the vertical axis. The diagram assumes component delays from Table 7.7
(see page 415) but ignores multiplexers and registers for simplicity. In the
single-cycle processor in Figure 7.47(a), the first instruction is read from
memory at time 0. Next, the operands are read from the register file. Then,
the ALU executes the necessary computation. Finally, the data memory
may be accessed, and the result is written back to the register file at 680 ps.
The second instruction begins when the first completes. Hence, in this
diagram, the single-cycle processor has an instruction latency of 200 + 100 + Remember that for this
120 + 200 + 60 = 680 ps (see Table 7.7 on page 415) and a throughput of abstract comparison of
1 instruction per 680 ps (1.47 billion instructions per second). single-cycle and pipelined
processor performance, we
In the pipelined processor in Figure 7.47(b), the length of a pipe-
are ignoring the overhead
line stage is set at 200 ps by the slowest stage, the memory access in the of decoder, multiplexer, and
Fetch or Memory stage. Each pipeline stage is indicated by solid or dashed register delays.
vertical blue lines. At time 0, the first instruction is fetched from memory.
At 200 ps, the first instruction enters the Decode stage, and a second
instruction is fetched. At 400 ps, the first instruction executes, the second
instruction enters the Decode stage, and a third instruction is fetched.
440 CHAPTER SEVEN Microarchitecture

0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
Instr
Time (ps)
Dec
1 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
Dec
2 Fetch Execute Memory Wr
Read
Instruction ALU Read/Write Reg
Reg

(a)

Instr
Dec
1 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
Dec
2 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
Dec
3 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg

(b)

Figure 7.47 Timing diagrams: (a) single-cycle processor and (b) pipelined processor

And so forth, until all the instructions complete. The instruction latency is
5 × 200 = 1000 ps. Because the stages are not perfectly balanced with equal
amounts of logic, the latency is longer for the pipelined processor than
for the single-cycle processor. The throughput is 1 instruction per 200 ps
(5 billion instructions per second)—that is, one instruction completes
every clock cycle. This throughput is 3.4 times as much as the single-cycle
processor—not quite 5 times but, nonetheless, a substantial speedup.
Figure 7.48 shows an abstracted view of the pipeline in operation
in which each stage is represented pictorially. Each pipeline stage is
represented with its major component—instruction memory (IM), reg-
ister file (RF) read, ALU execution, data memory (DM), and register
file writeback—to illustrate the flow of instructions through the pipe-
line. Reading across a row shows the clock cycle in which a particular
instruction is in each stage. For example, the sub instruction is fetched
in cycle 3 and executed in cycle 5. Reading down a column shows what
the various pipeline stages are doing on a particular cycle. For example,
in cycle 6, the register file is writing a sum to s3, the data memory is
idle, the ALU is computing (s11 & t0), t4 is being read from the regis-
ter file, and the or instruction is being fetched from instruction memory.
Stages are shaded to indicate when they are used. For example, the data
memory is used by lw in cycle 4 and by sw in cycle 8. The instruction
memory and ALU are used in every cycle. The register file is written by
every instruction except sw. In the pipelined processor, the register file is
7.5 Pipelined Processor 441

1 2 3 4 5 6 7 8 9 10

Time (cycles)
s0
lw DM s2
lw s2, 40(s0) IM RF 40 + RF

s9
add DM s3
add s3, s9, s10 IM RF s10 + RF

t1
sub DM s4
sub s4, t1, s8 IM RF s8 - RF

s11
and DM s5
and s5, s11, t0 IM RF t0 & RF

t4
sw DM
sw s6, 20(t4) IM RF 20 + RF

t2
or DM s7
or s7, t2, t3 IM RF t3 | RF

Figure 7.48 Abstract view of pipeline in operation

used twice in every cycle: it is written in the first part of a cycle and read
in the second part, as suggested by the shading. This way, data can be
written by one instruction and read by another within a single cycle.
A central challenge in pipelined systems is handling hazards that
occur when one instruction’s result is needed by a subsequent instruction
before the former instruction has completed. For example, if the add in
Figure 7.48 used s2 as a source instead of s10, a hazard would occur
because the s2 register has not yet been written by the lw instruction
when it is read by add in cycle 3. After designing the pipelined datapath
and control, this section explores forwarding, stalls, and flushes as meth-
ods to resolve hazards. Finally, this section revisits performance analysis
considering sequencing overhead and the impact of hazards.

7 . 5 . 1 Pipelined Datapath
The pipelined datapath is formed by chopping the single-cycle datapath
into five stages separated by pipeline registers. Figure 7.49(a) shows the
single-cycle datapath stretched out to leave room for the pipeline registers.
Figure 7.49(b) shows the pipelined datapath formed by inserting four
pipeline registers to separate the datapath into five stages. The stages
and their boundaries are indicated in blue. Signals are given a suffix
(F, D, E, M, or W) to indicate the stage in which they reside.
The register file is peculiar because it is read in the Decode stage and
written in the Writeback stage. So, although the register file is drawn in the
Decode stage, its write address and write data come from the Writeback
stage. This feedback will lead to pipeline hazards, which are discussed in
442 CHAPTER SEVEN Microarchitecture

CLK CLK
CLK
19:15 WE3 SrcAE WE
0 PC' PC Instr A1 RD1 Zero
A RD ReadData 00
1 ALUResult

ALU
A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File
PCTarget

+
+

4 ImmExt
31:7 Extend

PCPlus4

(a) Result

CLK CLK CLK

Zero
CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1
A RD 00
1 ALUResultM ReadDataW

ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory 11:7
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE +
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4E PCPlus4M


PCPlus4W
PCTargetE

ResultW

Fetch Decode Execute Memory Writeback


(b)

Figure 7.49 Datapaths: (a) single-cycle and (b) pipelined

Section 7.5.3. The register file in the pipelined processor writes on the falling
edge of CLK so that it can write a result in the first half of a cycle and read
that result in the second half of the cycle for use in a subsequent instruction.
One of the subtle but critical issues in pipelining is that all signals
associated with a particular instruction must advance through the pipeline
in unison. Figure 7.49(b) has an error related to this issue. Can you find it?
The error is in the register file write logic, which should operate in
the Writeback stage. The data value comes from ResultW, a Writeback
stage signal. But the destination register comes from RdD (InstrD11:7),
which is a Decode stage signal. In the pipeline diagram of Figure 7.48,
during cycle 5, the result of the lw instruction would be incorrectly writ-
ten to s5 rather than s2.
7.5 Pipelined Processor 443

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1
A RD ReadDataW 00
1 ALUResultM

ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4E PCPlus4M


PCPlus4W
PCTargetE

ResultW

Figure 7.50 Corrected pipelined datapath

Figure 7.50 shows a corrected datapath, with the modification


in blue. The Rd signal is now pipelined along through the Execution,
Memory, and Writeback stages, so it remains in sync with the rest of the
instruction. RdW and ResultW are fed back together to the register file
in the Writeback stage.
The astute reader may note that the logic to produce PCF’ (the next
PC) is also problematic because it could be updated with either a Fetch
or an Execute stage signal (PCPlus4F or PCTargetE). This control hazard
will be fixed in Section 7.5.3.

7 . 5 . 2 Pipelined Control
The pipelined processor uses the same control signals as the single-cycle
processor and, therefore, has the same control unit. The control unit
examines the op, funct3, and funct75 fields of the instruction in the Decode
stage to produce the control signals, as was described in Section 7.3.3
for the single-cycle processor. These control signals must be pipelined along
with the data so that they remain synchronized with the instruction.
The entire pipelined processor with control is shown in Figure 7.51.
RegWrite must be pipelined into the Writeback stage before it feeds back
to the register file, just as Rd was pipelined in Figure 7.50. In addition to
R-type ALU instructions, lw, sw, and beq, this pipelined processor also
supports jal and I-type ALU instructions.

7 . 5 . 3 Hazards
In a pipelined system, multiple instructions are handled concurrently.
When one instruction is dependent on the results of another that has
444 CHAPTER SEVEN Microarchitecture

PCSrcE
ZeroE
CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control ResultSrcD1:0 ResultSrcE1:0 ResultSrcM1:0 ResultSrcW1:0
Unit
MemWriteD MemWriteE MemWriteM
JumpD JumpE
BranchD BranchE
6:0
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
ALUSrcD ALUSrcE
30
funct75
ImmSrcD1:0

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1
A RD ReadDataW 00
1 ALUResultM

ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4E PCPlus4M


PCPlus4W
PCTargetE

ResultW

Figure 7.51 Pipelined processor with control

not yet completed, a hazard occurs. The register file is written during
the first half of the cycle and read during the second half of the cycle, so
a register can be written and read back in the same cycle without intro-
ducing a hazard.
Figure 7.52 illustrates hazards that occur when one instruction
writes a register (s8) and subsequent instructions read this register. The
blue arrows highlight when s8 is written to the register file (in cycle
5) as compared to when it is needed by subsequent instructions. This
is called a read after write (RAW) hazard. The add instruction writes a
result into s8 in the first half of cycle 5. However, the sub instruction
reads s8 on cycle 3, obtaining the wrong value. The or instruction reads
s8 on cycle 4, again obtaining the wrong value. The and instruction
reads s8 in the second half of cycle 5, obtaining the correct value, which
was written in the first half of cycle 5. Subsequent instructions also read
the correct value of s8. The diagram shows that hazards may occur in
this pipeline when an instruction writes a register and either of the two
subsequent instructions reads that register. Without special treatment,
the pipeline will compute the wrong result.
A software solution would be to require the programmer or compiler
to insert nop instructions between the add and sub instructions so that
the dependent instruction does not read the result (s8) until it is available
7.5 Pipelined Processor 445

1 2 3 4 5 6 7 8

Time (cycles)
s4
add s8
add s8, s4, s5 IM RF s5 + DM RF

s8
sub s2
sub s2, s8, s3 IM RF s3 - DM RF

t6
or s9
or s9, t6, s8 IM RF s8 | DM RF

s8
and s7
and s7, s8, t2 IM RF t2 & DM RF

Figure 7.52 Abstract pipeline diagram illustrating hazards

1 2 3 4 5 6 7 8 9 10

Time (cycles)
s4
add s8
add s8, s4, s5 IM RF s5 + DM RF

nop
nop IM RF DM RF

nop
nop IM RF DM RF

s8
sub s2
sub s2, s8, s3 IM RF s3 - DM RF

t6
or DM s9
or s9, t6, s8 IM RF s8 | RF

s8
and DM s7
and s7, s8, t2 IM RF t2 & RF

Figure 7.53 Solving data hazard with nops

in the register file, as shown in Figure 7.53. Such a software interlock


complicates programming and degrades performance, so it is not ideal.
On closer inspection, observe from Figure 7.52 that the sum from
the add instruction is computed by the ALU in cycle 3 and is not strictly
needed by the and instruction until the ALU uses it in cycle 4. In princi-
ple, we should be able to forward the result from one instruction to the
next to resolve the RAW hazard without waiting for the result to appear
in the register file and without slowing down the pipeline. In other situ-
ations explored later in this section, we may have to stall the pipeline to
give time for a result to be produced before the subsequent instruction
uses the result. In any event, something must be done to solve hazards so
that the program executes correctly despite the pipelining.
446 CHAPTER SEVEN Microarchitecture

1 2 3 4 5 6 7 8

Time (cycles)
s4
add s8 DM s8
add s8, s4, s5 IM RF s5 + RF

s8
sub s2
sub s2, s8, s3 IM RF s3 - DM RF

t6
or s9
or s9, t6, s8 IM RF s8 | DM RF

s8
and s7
and s7, s8, t2 IM RF t2 & DM RF

Figure 7.54 Abstract pipeline diagram illustrating forwarding

Hazards are classified as data hazards or control hazards. A data


hazard occurs when an instruction tries to read a register that has not
yet been written back by a previous instruction. A control hazard occurs
when the decision of what instruction to fetch next has not been made
by the time the fetch takes place. In the remainder of this section, we
enhance the pipelined processor with a Hazard Unit that detects hazards
and handles them appropriately so that the processor executes the pro-
gram correctly.
Solving Data Hazards with Forwarding
Some data hazards can be solved by forwarding (also called bypassing)
a result from the Memory or Writeback stage to a dependent instruc-
tion in the Execute stage. This requires adding multiplexers in front of
the ALU to select its operands from the register file or the Memory or
Writeback stage. Figure 7.54 illustrates this principle. This program
computes s8 with the add instruction and then uses s8 in the three sub-
sequent instructions. In cycle 4, s8 is forwarded from the Memory stage
of the add instruction to the Execute stage of the dependent sub instruc-
tion. In cycle 5, s8 is forwarded from the Writeback stage of the add
instruction to the Execute stage of the dependent or instruction. Again,
no forwarding is needed for the and instruction because s8 is written to
the register file in the first half of cycle 5 and read in the second half.
Forwarding is necessary when an instruction in the Execute stage
has a source register matching the destination register of an instruction
in the Memory or Writeback stage. Figure 7.55 modifies the pipelined
processor to support forwarding. It adds a Hazard Unit and two forwarding
multiplexers. The hazard detection unit receives the two source registers
from the instruction in the Execute stage, Rs1E and Rs2E, and the
destination registers from the instructions in the Memory and Writeback
7.5 Pipelined Processor 447

PCSrcE
ZeroE
CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control ResultSrcD1:0 ResultSrcE1:0 ResultSrcM1:0 ResultSrcW1:0
Unit
MemWriteD MemWriteE MemWriteM
JumpD JumpE

6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1 00
A RD 01 ReadDataW 00
1 10 ALUResultM

ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D l
PCPus4E PCPlus4M
PCPlus4W
PCTargetE

ResultW
ForwardAE
ForwardBE

Hazard Unit

Figure 7.55 Pipelined processor with forwarding to solve some data hazards

stages, RdM and RdW. It also receives the RegWrite signals from the
Memory and Writeback stages (RegWriteM and RegWriteW) to know
whether the destination register will actually be written (e.g., the sw and
beq instructions do not write results to the register file and, hence, do
not have their results forwarded).
The Hazard Unit computes control signals for the forwarding mul-
tiplexers to choose operands from the register file or from the results
in the Memory or Writeback stage (ALUResultM or ResultW). The
Hazard Unit should forward from a stage if that stage will write a des-
tination register and the destination register matches the source register.
However, x0 is hardwired to 0 and should never be forwarded. If both
the Memory and Writeback stages contain matching destination reg-
isters, then the Memory stage should have priority because it contains
the more recently executed instruction. In summary, the function of the
forwarding logic for SrcAE (ForwardAE) is given on the next page.
The forwarding logic for SrcBE (ForwardBE) is identical except that it
checks Rs2E instead of Rs1E.
448 CHAPTER SEVEN Microarchitecture

if ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0) then // Forward from Memory stage
ForwardAE = 10
else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0) then // Forward from Writeback stage
ForwardAE = 01
else ForwardAE = 00 // No forwarding (use RF output)

Solving Data Hazards with Stalls

Forwarding is sufficient to solve RAW data hazards when the result


is computed in the Execute stage of an instruction because its result
can then be forwarded to the Execute stage of the next instruction.
Unfortunately, the lw instruction does not finish reading data until
the end of the Memory stage, so its result cannot be forwarded to the
Execute stage of the next instruction. We say that the lw instruction
has a two-cycle latency because a dependent instruction cannot use its
result until two cycles later. Figure 7.56 shows this problem. The lw
instruction receives data from memory at the end of cycle 4, but the
and instruction needs that data (the value in s7) as a source operand
at the beginning of cycle 4. There is no way to solve this hazard with
forwarding.
A solution is to stall the pipeline, holding up operation until the data
is available. Figure 7.57 shows stalling the dependent instruction (and)
in the Decode stage. and enters the Decode stage in cycle 3 and stalls
there through cycle 4. The subsequent instruction (or) must remain in the
Fetch stage during both cycles as well because the Decode stage is full.
In cycle 5, the result can be forwarded from the Writeback stage
of lw to the Execute stage of and. Also, in cycle 5, source s7 of the
or instruction is read directly from the register file, with no need for
forwarding.
Note that the Execute stage is unused in cycle 4. Likewise, Memory
is unused in cycle 5 and Writeback is unused in cycle 6. This unused
stage propagating through the pipeline is called a bubble, which behaves
like a nop instruction. The bubble is introduced by zeroing out the
Execute stage control signals during a Decode stage stall so that the
bubble performs no action and changes no architectural state.
In summary, stalling a stage is performed by disabling its pipeline
register (i.e., the register to the left of a stage) so that the stage’s inputs
do not change. When a stage is stalled, all previous stages must also be
stalled so that no subsequent instructions are lost. The pipeline register
directly after the stalled stage must be cleared (flushed) to prevent bogus
information from propagating forward. Stalls degrade performance, so
they should be used only when necessary.
7.5 Pipelined Processor 449

1 2 3 4 5 6 7 8

Time (cycles)
s5 s7
1w
1w s7, 40(s5) IM RF 40 + DM RF

Trouble!
s7
and s8
and s8, s7, t3 IM RF t3 & DM RF

s6 t2
or
or t2, s6, s7 IM RF s7 | DM RF

s7
sub s3
sub s3, s7, s2 IM RF s2 – DM RF

Figure 7.56 Abstract pipeline diagram illustrating trouble forwarding from lw

1 2 3 4 5 6 7 8 9

Time (cycles)
s5
lw s7
lw s7, 40(s5) IM RF 40 + DM RF

s7 s7
and s8
and s8, s7, t3 IM RF t3 RF t3 & DM RF

s6
or or t2
or t2, s6, s7 IM IM RF s7 | DM RF

Stall s7
sub s3
sub s3, s7, s2 IM RF s2 - DM RF

Figure 7.57 Abstract pipeline diagram illustrating stall to solve hazards

Figure 7.58 modifies the pipelined processor to add stalls for lw


data dependencies. In order for the Hazard Unit to stall the pipeline, the
following conditions must be met:
1. A load word is in the Execute stage (indicated by ResultSrcE0 = 1)
and
2. The load’s destination register (RdE) matches Rs1D or Rs2D, the
source operands of the instruction in the Decode stage
Stalls are supported by adding enable inputs (EN) to the Fetch and
Decode pipeline registers and a synchronous reset/clear (CLR) input to
the Execute pipeline register. When a load word (lw) stall occurs, StallD
and StallF are asserted to force the Decode and Fetch stage pipeline reg-
isters to retain their existing values. FlushE is also asserted to clear the
contents of the Execute stage pipeline register, introducing a bubble. The
Hazard Unit lwStall (load word stall) signal indicates when the pipeline
450 CHAPTER SEVEN Microarchitecture

PCSrcE
ZeroE
CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control ResultSrcD1:0 ResultSrcE1:0 ResultSrcM1:0 ResultSrcW1:0
Unit 0
MemWriteD MemWriteE MemWriteM
JumpD JumpE

6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1 00
A RD 01 ReadDataW 00
1 10 ALUResultM

ALU
EN

RD2E A RD 01
Instruction 24:20 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4E PCPlus4M


CLR
EN

PCPlus4W
PCTargetE

ResultW
ResultSrcE0
ForwardAE
ForwardBE
FlushE
StallD
StallF

Hazard Unit

Figure 7.58 Pipelined processor with stalls to solve lw data hazard

should be stalled due to a load word dependency. Whenever lwStall is


The lwStall logic
TRUE, all of the stall and flush signals are asserted. Hence, the logic to
described here could cause
the processor to stall compute the stalls and flushes is:
unnecessarily when the
destination of the load is x0 lwStall = ResultSrcE0 & ((Rs1D = = RdE) | (Rs2D = = RdE))
or when a false dependency StallF = StallD = FlushE = lwStall
exists—that is, when the
instruction in the Decode
stage is a J- or I-type Solving Control Hazards
instruction that randomly The beq instruction presents a control hazard: the pipelined processor
causes a false match between does not know what instruction to fetch next because the branch deci-
bits in their immediate fields sion has not been made by the time the next instruction is fetched.
and RdE. However, these
One mechanism for dealing with this control hazard is to stall the
cases are rare (and poor
coding practice, in the case of pipeline until the branch decision is made (i.e., PCSrcE is computed).
x0 being the load destination) Because the decision is made in the Execute stage, the pipeline would
and they cause only a small have to be stalled for two cycles at every branch. This would severely
performance loss. degrade the system performance if branches occur often, which is typi-
cally the case.
7.5 Pipelined Processor 451

1 2 3 4 5 6 7 8 9 10

Time (cycles)
s1
beq DM
20 beq s1, s2, L1 IM RF s2 - RF

t1
sub DM
24 sub s8, t1, s3 IM RF s3 RF Flush
these
or instruc!ons
28 or s9, t6, s5 IM RF DM RF

2C ...
... ...
s3
add DM s7
58 L1: add s7, s3, s4 IM RF s4 + RF

Figure 7.59 Abstract pipeline diagram illustrating flushing when a branch is taken

An alternative to stalling the pipeline is to predict whether the branch


will be taken and begin executing instructions based on the predic-
tion. Once the branch decision is available, the processor can throw out
the instructions if the prediction was wrong. In the pipeline presented so
far (Figure 7.58), the processor predicts that branches are not taken and
simply continues executing the program in order until PCSrcE is asserted to
select the next PC from PCTargetE instead. If the branch should have been
taken, then the two instructions following the branch must be flushed
(discarded) by clearing the pipeline registers for those instructions. These
wasted instruction cycles are called the branch misprediction penalty.
Figure 7.59 shows such a scheme in which a branch from address
0x20 to address 0x58 is taken. The PC is not written until cycle 3, by
which point the sub and or instructions at addresses 0x24 and 0x28
have already been fetched. These instructions must be flushed, and the
add instruction is fetched from address 0x58 in cycle 4.
Finally, we must work out the stall and flush signals to handle
branches and PC writes. When a branch is taken, the subsequent two
instructions must be flushed from the pipeline registers of the Decode
and Execute stages. Thus, we add a synchronous clear input (CLR) to
the Decode pipeline register and add the FlushD output to the Hazard
Unit. (When CLR = 1, the register contents are cleared, that is, become
0.) When a branch is taken (indicated by PCSrcE being 1), FlushD and
FlushE must be asserted to flush the Decode and Execute pipeline regis-
ters. Figure 7.60 shows the enhanced pipelined processor for handling
control hazards. The flushes are now calculated as:

FlushD = PCSrcE
FlushE = lwStall | PCSrcE
452 CHAPTER SEVEN Microarchitecture

PCSrcE
ZeroE
CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control ResultSrcD1:0 ResultSrcE1:0 ResultSrcM1:0 ResultSrcW1:0
Unit 0
MemWriteD MemWriteE MemWriteM
JumpD JumpE

6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1 00
A RD 01 ReadDataW 00
1 10 ALUResultM

ALU
EN

A RD 01
Instruction 24:20 RD2E 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4M


CLR

CLR
EN

PCPlus4W
PCTargetE

ResultW
ForwardAE
ForwardBE
FlushD

FlushE
StallD
StallF

HazardUnit

Figure 7.60 Expanded Hazard Unit for handling branch control hazard

Hazard Summary
In summary, RAW data hazards occur when an instruction depends on
a result (from another instruction) that has not yet been written into the
register file. Data hazards can be resolved by forwarding if the result is
computed soon enough; otherwise, they require stalling the pipeline until
the result is available. Control hazards occur when the decision of what
instruction to fetch has not been made by the time the next instruction
must be fetched. Control hazards are solved by stalling the pipeline until
the decision is made or by predicting which instruction should be fetched
and flushing the pipeline if the prediction is later determined to be wrong.
Moving the decision as early as possible minimizes the number of instruc-
tions that are flushed on a misprediction. You may have observed by now
that one of the challenges of designing a pipelined processor is to under-
stand all possible interactions between instructions and to discover all
of the hazards that may exist. Figure 7.61 shows the complete pipelined
processor handling all of the hazards. The hazard logic is summarized on
the next page.
7.5 Pipelined Processor 453

PCSrcE
ZeroE
CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control ResultSrcD1:0 ResultSrcE1:0 ResultSrcM1:0 ResultSrcW1:0
Unit 0
MemWriteD MemWriteE MemWriteM
JumpD JumpE

6:0
BranchD BranchE
op
14:12
ALUControlD2:0 ALUControlE2:0
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1 00
A RD 01 00
1 10 ALUResultM ReadDataW

ALU
EN

A RD 01
Instruction 24:20 RD2E 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4E PCPlus4M


CLR

CLR
EN

PCPlus4W
PCTargetE

ResultW
ForwardAE
ForwardBE
FlushD

FlushE
StallD
StallF

Hazard Unit

Figure 7.61 Pipelined processor with full hazard handling

Forward to solve data hazards when possible3:


if ((Rs1E = = RdM) & RegWriteM) & (Rs1E != 0) then
ForwardAE = 10
else if ((Rs1E = = RdW) & RegWriteW) & (Rs1E != 0) then
ForwardAE = 01
else ForwardAE = 00

Stall when a load hazard occurs:


lwStall = ResultSrcE0 & ((Rs1D = = RdE) | (Rs2D = = RdE))
StallF = lwStall
StallD = lwStall

Flush when a branch is taken or a load introduces a bubble:


FlushD = PCSrcE
FlushE = lwStall | PCSrcE

3
Recall that the forwarding logic for SrcBE (ForwardBE) is identical except that it checks
Rs2E instead of Rs1E.
454 CHAPTER SEVEN Microarchitecture

7 . 5 . 4 Performance Analysis
The pipelined processor ideally would have a CPI of 1 because a new
instruction is issued—that is, fetched—every cycle. However, a stall or a
flush wastes 1 to 2 cycles, so the CPI is slightly higher and depends on
the specific program being executed.

Example 7.9 PIPELINED PROCESSOR CPI

The SPECINT2000 benchmark considered in Example 7.4 consists of approxi-


mately 25% loads, 10% stores, 11% branches, 2% jumps, and 52% R- or I-type
ALU instructions. Assume that 40% of the loads are immediately followed by an
instruction that uses the result, requiring a stall, and that 50% of the branches
are taken (mispredicted), requiring two instructions to be flushed. Ignore other
hazards. Compute the average CPI of the pipelined processor.

Solution The average CPI is the weighted sum over each instruction of the CPI
for that instruction multiplied by the fraction of time that instruction is used.
Loads take one clock cycle when there is no dependency and two cycles when
the processor must stall for a dependency, so they have a CPI of (0.6)(1) + (0.4)
(2) = 1.4. Branches take one clock cycle when they are predicted properly and
three when they are not, so they have a CPI of (0.5)(1) + (0.5)(3) = 2. Jumps
take three clock cycles (CPI = 3). All other instructions have a CPI of 1. Hence,
for this benchmark, the average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(2) + (0.02)
(3) + (0.52)(1) = 1.25.

We can determine the cycle time by considering the critical path in each
of the five pipeline stages shown in Figure 7.61. Recall that the register
file is used twice in a single cycle: it is written in the first half of the
Writeback cycle and read in the second half of the Decode cycle; so these
stages can use only half of the cycle time for their critical path. Another
way of saying it is this: twice the critical path for each of those stages
must fit in a cycle. Figure 7.62 shows the critical path for the Execute
stage. It occurs when a branch is in the Execute stage that requires for-
warding from the Writeback stage: the path goes from the Writeback
pipeline register, through the Result, ForwardBE, and SrcB multiplexers,
The critical path analysis for through the ALU and AND-OR logic to the PC multiplexer and, finally,
the Execute stage assumes to the PC register.
that the Hazard Unit delay
for calculating ForwardAE t pcq + tmem + t setup 
Fetch
and ForwardBE is less than  
or equal to the delay of the  2(tRFread + t setup ) Decode 
 
Result multiplexer. If the
Tc _ pipelined = max t pcq + 4tmux + t ALU + t AND−OR + t setup Execute 
Hazard Unit delay is longer,
t Memory 
it must be included in the  pcq + tmem + t setup
critical path instead of the  2(t Writeback
Result multiplexer delay.  pcq + tmux + tRFsetup )
(7.5)
7.5 Pipelined Processor 455

PCSrcE
ZeroE
CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control ResultSrcD1:0 ResultSrcE1:0 ResultSrcM1:0 ResultSrcW1:0
Unit 0
MemWriteD MemWriteE MemWriteM
JumpD JumpE

6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0

CLK CLK CLK


CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1 00
A RD 01 00
1 ALUResultM ReadDataW

ALU
10
EN

A RD 01
Instruction 24:20 RD2E 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE

+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+

4 ImmExtD ImmExtE
31:7 Extend

PCPlus4F PCPlus4D PCPlus4E PCPlus4M


CLR

CLR
EN

PCPlus4W
PCTargetE

ResultW
ForwardAE
ForwardBE
FlushD

FlushE
StallD
StallF

Hazard Unit

Figure 7.62 Pipelined processor critical path

Example 7.10 PIPELINED PROCESSOR PERFORMANCE COMPARISON

Ben Bitdiddle needs to compare the pipelined processor performance with that
of the single-cycle and multicycle processors considered in Examples 7.4 and
7.8. The logic delays were given in Table 7.7 (on page 415). Help Ben compare
the execution time of 100 billion instructions from the SPECINT2000 bench-
mark for each processor.

Solution According to Equation 7.5, the cycle time of the pipelined processor is
Tc_pipelined = max[40 + 200 + 50, 2(100 + 50), 40 + 4(30) + 120 + 20 + 50, 40 + Our pipelined processor is
200 + 50, 2(40 + 30 + 60)] = 350 ps. The Execute stage takes the longest. According unbalanced, with branch
to Equation 7.1, the total execution time is Tpipelined = (100 × 109 instructions) resolution in the Execute
(1.25 cycles / instruction)(350 × 10−12 s / cycle) = 44 seconds. This compares with stage taking much longer
than any other stage. The
75 seconds for the single-cycle processor and 155 seconds for the multicycle processor.
pipeline could be balanced
better by pushing the Result
multiplexer back into the
The pipelined processor is substantially faster than the others.
Memory stage, reducing the
However, its advantage over the single-cycle processor is nowhere near cycle time to 320 ps.
the fivefold speedup one might hope to get from a five-stage pipeline.
456 CHAPTER SEVEN Microarchitecture

The pipeline hazards introduce a small CPI penalty. More significantly,


the sequencing overhead (clk-to-Q and setup times) of the registers
applies to every pipeline stage, not just once to the overall datapath.
Sequencing overhead limits the benefits one can hope to achieve from
pipelining. Imbalanced delay in pipeline stages also decreases the bene-
fits of pipelining. The pipelined processor is similar in hardware require-
ments to the single-cycle processor, but it adds many 32-bit pipeline
registers, along with multiplexers, smaller pipeline registers, and control
logic to resolve hazards.

7.6 HDL REPRESENTATION*


This section presents HDL code for the single-cycle RISC-V processor that
supports the instructions discussed in this chapter. The code illustrates
good coding practices for a moderately complex system. HDL code for
the multicycle processor and pipelined processor are left to Exercises 7.25
to 7.27 and 7.42 to 7.44.
In this section, the instruction and data memories are separated from
the datapath and connected by address and data busses. In practice, most
processors pull instructions and data from separate caches. However,
to handle smaller memory maps where data may be intermixed with
instructions, a more complete processor must also be able to read data (in
addition to instructions) from the instruction memory. Chapter 8 will revisit
memory systems, including the interaction of caches with main memory.
Figure 7.63 shows a block diagram of the single-cycle RISC-V
processor interfaced to external memories. The processor is composed of

RISCVsingle

A
Controller Instr Instruction
Memory
RegSrc
PCSrc
ResultSrc
ALUControl
ALUSrc
ImmSrc
RegWrite
MemWrite

Instr
RD
Zero

Figure 7.63 Single-cycle


processor interfaced to external
CLK
memories PC
PC
WE
Instr
DataAdr
CLK Datapath ALUResult A
Data
Reset Memory
WriteData
WriteData WD
ReadData
ReadData RD

Processor External Memories


Top
7.6 HDL Representation 457

the datapath from Figure 7.15 and the controller from Figure 7.16. The
controller, in turn, is composed of the Main Decoder and the ALU Decoder.
The HDL code is partitioned into several sections. Section 7.6.1
provides HDL for the single-cycle processor datapath and controller.
Section 7.6.2 presents the generic building blocks, such as registers and
multiplexers, which are used by any microarchitecture. Section 7.6.3
introduces the test program, testbench, and external memories. The
HDL and test program are available in electronic form on this book’s
website (see the Preface).

7 . 6 . 1 Single-Cycle Processor
The main modules of the single-cycle processor module are given in the
following HDL examples.

HDL Example 7.1 SINGLE-CYCLE PROCESSOR

SystemVerilog VHDL
module riscvsingle(input logic clk, reset, library IEEE;
output logic [31:0] PC, use IEEE.STD_LOGIC_1164.all;
input logic [31:0] Instr,
output logic MemWrite, entity riscvsingle is
output logic [31:0] ALUResult, WriteData, port(clk, reset: in STD_LOGIC;
input logic [31:0] ReadData); PC: out STD_LOGIC_VECTOR(31 downto 0);
Instr: in STD_LOGIC_VECTOR(31 downto 0);
logic ALUSrc, RegWrite, Jump, Zero; MemWrite: out STD_LOGIC;
logic [1:0] ResultSrc, ImmSrc; ALUResult, WriteData: out STD_LOGIC_VECTOR(31 downto 0);
logic [2:0] ALUControl; ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end;
controller c(Instr[6:0], Instr[14:12], Instr[30], Zero,
ResultSrc, MemWrite, PCSrc, architecture struct of riscvsingle is
ALUSrc, RegWrite, Jump, component controller
ImmSrc, ALUControl); port(op: in STD_LOGIC_VECTOR(6 downto 0);
datapath dp(clk, reset, ResultSrc, PCSrc, funct3: in STD_LOGIC_VECTOR(2 downto 0);
ALUSrc, RegWrite, funct7b5, Zero: in STD_LOGIC;
ImmSrc, ALUControl, ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
Zero, PC, Instr, MemWrite: out STD_LOGIC;
ALUResult, WriteData, ReadData); PCSrc, ALUSrc: out STD_LOGIC;
endmodule RegWrite, Jump: out STD_LOGIC;
ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
end component;
component datapath
port(clk, reset: in STD_LOGIC;
ResultSrc: in STD_LOGIC_VECTOR(1 downto 0);
PCSrc, ALUSrc: in STD_LOGIC;
RegWrite: in STD_LOGIC;
ImmSrc: in STD_LOGIC_VECTOR(1 downto 0);
ALUControl: in STD_LOGIC_VECTOR(2 downto 0);
Zero: out STD_LOGIC;
PC: out STD_LOGIC_VECTOR(31 downto 0);
Instr: in STD_LOGIC_VECTOR(31 downto 0);
ALUResult, WriteData: out STD_LOGIC_VECTOR(31 downto 0);
ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end component;
458 CHAPTER SEVEN Microarchitecture

signal ALUSrc, RegWrite, Jump, Zero, PCSrc: STD_LOGIC;


signal ResultSrc, ImmSrc: STD_LOGIC_VECTOR(1 downto 0);
signal ALUControl: STD_LOGIC_VECTOR(2 downto 0);
begin
c: controller port map(Instr(6 downto 0), Instr(14 downto 12),
Instr(30), Zero, ResultSrc, MemWrite,
PCSrc, ALUSrc, RegWrite, Jump,
ImmSrc, ALUControl);
dp: datapath port map(clk, reset, ResultSrc, PCSrc, ALUSrc,
RegWrite, ImmSrc, ALUControl, Zero,
PC, Instr, ALUResult, WriteData,
ReadData);

end;

HDL Example 7.2 CONTROLLER

SystemVerilog VHDL
module controller(input logic [6:0] op, library IEEE;
input logic [2:0] funct3, use IEEE.STD_LOGIC_1164.all;
input logic funct7b5,
input logic Zero, entity controller is
output logic [1:0] ResultSrc, port(op: in STD_LOGIC_VECTOR(6 downto 0);
output logic MemWrite, funct3: in STD_LOGIC_VECTOR(2 downto 0);
output logic PCSrc, ALUSrc, funct7b5, Zero: in STD_LOGIC;
output logic RegWrite, Jump, ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
output logic [1:0] ImmSrc, MemWrite: out STD_LOGIC;
output logic [2:0] ALUControl); PCSrc, ALUSrc: out STD_LOGIC;
logic [1:0] ALUOp; RegWrite: out STD_LOGIC;
logic Branch; Jump: buffer STD_LOGIC;
ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
maindec md(op, ResultSrc, MemWrite, Branch, ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
ALUSrc, RegWrite, Jump, ImmSrc, ALUOp); end;
aludec ad(op[5], funct3, funct7b5, ALUOp, ALUControl);
architecture struct of controller is
assign PCSrc = Branch & Zero | Jump; component maindec
endmodule port(op: in STD_LOGIC_VECTOR(6 downto 0);
ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
MemWrite: out STD_LOGIC;
Branch, ALUSrc: out STD_LOGIC;
RegWrite, Jump: out STD_LOGIC;
ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
ALUOp: out STD_LOGIC_VECTOR(1 downto 0));
end component;
component aludec
port(opb5: in STD_LOGIC;
funct3: in STD_LOGIC_VECTOR(2 downto 0);
funct7b5: in STD_LOGIC;
ALUOp: in STD_LOGIC_VECTOR(1 downto 0);
ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
end component;

signal ALUOp: STD_LOGIC_VECTOR(1 downto 0);


signal Branch: STD_LOGIC;
begin
md: maindec port map(op, ResultSrc, MemWrite, Branch,
ALUSrc, RegWrite, Jump, ImmSrc, ALUOp);
ad: aludec port map(op(5), funct3, funct7b5, ALUOp, ALUControl);
PCSrc <= (Branch and Zero) or Jump;
end;
7.6 HDL Representation 459

HDL Example 7.3 MAIN DECODER


SystemVerilog VHDL
module maindec(input logic [6:0] op, library IEEE;
output logic [1:0] ResultSrc, use IEEE.STD_LOGIC_1164.all;
output logic MemWrite,
output logic Branch, ALUSrc, entity maindec is
output logic RegWrite, Jump, port(op: in STD_LOGIC_VECTOR(6 downto 0);
output logic [1:0] ImmSrc, ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
output logic [1:0] ALUOp); MemWrite: out STD_LOGIC;
logic [10:0] controls; Branch, ALUSrc: out STD_LOGIC;
RegWrite, Jump: out STD_LOGIC;
assign {RegWrite, ImmSrc, ALUSrc, MemWrite, ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
ResultSrc, Branch, ALUOp, Jump} = controls; ALUOp: out STD_LOGIC_VECTOR(1 downto 0));
end;
always_comb
case(op) architecture behave of maindec is
// RegWrite_ImmSrc_ALUSrc_MemWrite_ResultSrc_Branch_ALUOp_Jump signal controls: STD_LOGIC_VECTOR(10 downto 0);
7'b0000011: controls = 11'b1_00_1_0_01_0_00_0; // lw begin
7'b0100011: controls = 11'b0_01_1_1_00_0_00_0; // sw process(op) begin
7'b0110011: controls = 11'b1_xx_0_0_00_0_10_0; // R–type case op is
7'b1100011: controls = 11'b0_10_0_0_00_1_01_0; // beq when "0000011" => controls <= "10010010000"; –– lw
7'b0010011: controls = 11'b1_00_1_0_00_0_10_0; // I–type ALU when "0100011" => controls <= "00111000000"; –– sw
7'b1101111: controls = 11'b1_11_0_0_10_0_00_1; // jal
when "0110011" => controls <= "1––00000100"; –– R–type
default: controls = 11'bx_xx_x_x_xx_x_xx_x; // ???
endcase when "1100011" => controls <= "01000001010"; –– beq
endmodule when "0010011" => controls <= "10010000100"; –– I–type ALU
when "1101111" => controls <= "11100100001"; –– jal
when others => controls <= "–––––––––––"; –– not valid
end case;
end process;

(RegWrite, ImmSrc(1), ImmSrc(0), ALUSrc, MemWrite,


ResultSrc(1), ResultSrc(0), Branch, ALUOp(1), ALUOp(0),
Jump) <= controls;
end;

HDL Example 7.4 ALU DECODER

SystemVerilog VHDL
module aludec(input logic opb5, library IEEE;
input logic [2:0] funct3, use IEEE.STD_LOGIC_1164.all;
input logic funct7b5,
input logic [1:0] ALUOp, entity aludec is
output logic [2:0] ALUControl); port(opb5: in STD_LOGIC;
funct3: in STD_LOGIC_VECTOR(2 downto 0);
logic RtypeSub; funct7b5: in STD_LOGIC;
assign RtypeSub = funct7b5 & opb5; // TRUE for R–type subtract ALUOp: in STD_LOGIC_VECTOR(1 downto 0);
ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
always_comb end;
case(ALUOp)
2'b00: ALUControl = 3'b000; // addition architecture behave of aludec is
2'b01: ALUControl = 3'b001; // subtraction signal RtypeSub: STD_LOGIC;
default: case(funct3) // R–type or I–type ALU begin
3'b000: if (RtypeSub) RtypeSub <= funct7b5 and opb5; –– TRUE for R–type subtract
ALUControl = 3'b001; // sub process(opb5, funct3, funct7b5, ALUOp, RtypeSub) begin
else case ALUOp is
460 CHAPTER SEVEN Microarchitecture

ALUControl = 3'b000; // add, addi


3'b010: ALUControl = 3'b101; // slt, slti when "00" => ALUControl <= "000"; –– addition
3'b110: ALUControl = 3'b011; // or, ori when "01" => ALUControl <= "001"; –– subtraction
3'b111: ALUControl = 3'b010; // and, andi when others => case funct3 is –– R–type or I–type ALU
default: ALUControl = 3'bxxx; // ??? when "000" = if RtypeSub = '1' then
endcase ALUControl <= "001"; –– sub
endcase else
endmodule ALUControl <= "000"; –– add, addi
end if;
when "010" => ALUControl <= "101"; –– slt, slti
when "110" => ALUControl <= "011"; –– or, ori
when "111" => ALUControl <= "010"; –– and, andi
when others => ALUControl <= "–––"; –– unknown
end case;
end case;
end process;
end;

HDL Example 7.5 DATAPATH

SystemVerilog VHDL
module datapath(input logic clk, reset, library IEEE;
input logic [1:0] ResultSrc, use IEEE.STD_LOGIC_1164.all;
input logic PCSrc, ALUSrc, use IEEE.STD_LOGIC_ARITH.all;
input logic RegWrite,
input logic [1:0] ImmSrc, entity datapath is
input logic [2:0] ALUControl, port(clk, reset: in STD_LOGIC;
output logic Zero, ResultSrc: in STD_LOGIC_VECTOR(1 downto 0);
output logic [31:0] PC, PCSrc, ALUSrc: in STD_LOGIC;
input logic [31:0] Instr, RegWrite: in STD_LOGIC;
output logic [31:0] ALUResult, WriteData, ImmSrc: in STD_LOGIC_VECTOR(1 downto 0);
input logic [31:0] ReadData); ALUControl: in STD_LOGIC_VECTOR(2 downto 0);
Zero: out STD_LOGIC;
logic [31:0] PCNext, PCPlus4, PCTarget; PC: buffer STD_LOGIC_VECTOR(31 downto 0);
logic [31:0] ImmExt; Instr: in STD_LOGIC_VECTOR(31 downto 0);
logic [31:0] SrcA, SrcB; ALUResult, WriteData: buffer STD_LOGIC_VECTOR(31 downto 0);
logic [31:0] Result; ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end;
// next PC logic
flopr #(32) pcreg(clk, reset, PCNext, PC); architecture struct of datapath is
adder pcadd4(PC, 32'd4, PCPlus4); component flopr generic(width: integer);
adder pcaddbranch(PC, ImmExt, PCTarget); port(clk, reset: in STD_LOGIC;
mux2 #(32) pcmux(PCPlus4, PCTarget, PCSrc, PCNext); d: in STD_LOGIC_VECTOR(width−1 downto 0);
q: out STD_LOGIC_VECTOR(width−1 downto 0));
// register file logic end component;
regfile rf(clk, RegWrite, Instr[19:15], Instr[24:20], component adder
Instr[11:7], Result, SrcA, WriteData); port(a, b: in STD_LOGIC_VECTOR(31 downto 0);
extend ext(Instr[31:7], ImmSrc, ImmExt); y: out STD_LOGIC_VECTOR(31 downto 0));
end component;
// ALU logic
component mux2 generic(width: integer);
mux2 #(32) srcbmux(WriteData, ImmExt, ALUSrc, SrcB);
port(d0, d1: in STD_LOGIC_VECTOR(width−1 downto 0);
alu alu(SrcA, SrcB, ALUControl, ALUResult, Zero);
s: in STD_LOGIC;
mux3 #(32) resultmux(ALUResult, ReadData, PCPlus4,
y: out STD_LOGIC_VECTOR(width−1 downto 0));
ResultSrc, Result);
end component;
endmodule
component mux3 generic(width: integer);
port(d0, d1, d2: in STD_LOGIC_VECTOR(width−1 downto 0);
s: in STD_LOGIC_VECTOR(1 downto 0);
y: out STD_LOGIC_VECTOR(width−1 downto 0));
end component;
component regfile
port(clk: in STD_LOGIC;
we3: in STD_LOGIC;
a1, a2, a3: in STD_LOGIC_VECTOR(4 downto 0);
7.6 HDL Representation 461

wd3: in STD_LOGIC_VECTOR(31 downto 0);


rd1, rd2: out STD_LOGIC_VECTOR(31 downto 0));
end component;
component extend
port(instr: in STD_LOGIC_VECTOR(31 downto 7);
immsrc: in STD_LOGIC_VECTOR(1 downto 0);
immext: out STD_LOGIC_VECTOR(31 downto 0));
end component;
component alu
port(a, b: in STD_LOGIC_VECTOR(31 downto 0);
ALUControl: in STD_LOGIC_VECTOR(2 downto 0);
ALUResult: buffer STD_LOGIC_VECTOR(31 downto 0);
Zero: out STD_LOGIC);
end component;

signal PCNext, PCPlus4, PCTarget: STD_LOGIC_VECTOR(31 downto 0);


signal ImmExt: STD_LOGIC_VECTOR(31 downto 0);
signal SrcA, SrcB: STD_LOGIC_VECTOR(31 downto 0);
signal Result: STD_LOGIC_VECTOR(31 downto 0);
begin
–– next PC logic
pcreg: flopr generic map(32) port map(clk, reset, PCNext, PC);
pcadd4: adder port map(PC, X"00000004", PCPlus4);
pcaddbranch: adder port map(PC, ImmExt, PCTarget);
pcmux: mux2 generic map(32) port map(PCPlus4, PCTarget, PCSrc,
PCNext);
–– register file logic
rf: regfile port map(clk, RegWrite, Instr(19 downto 15),
Instr(24 downto 20), Instr(11 downto 7),
Result, SrcA, WriteData);
ext: extend port map(Instr(31 downto 7), ImmSrc, ImmExt);
–– ALU logic
srcbmux: mux2 generic map(32) port map(WriteData, ImmExt,
ALUSrc, SrcB);
mainalu: alu port map(SrcA, SrcB, ALUControl, ALUResult, Zero);
resultmux: mux3 generic map(32) port map(ALUResult, ReadData,
PCPlus4, ResultSrc,
Result);
end;

7 . 6 . 2 Generic Building Blocks


This section contains generic building blocks that may be useful in any
digital system, including an adder, flip-flops, and a 2:1 multiplexer. The
register file appeared in HDL Example 5.8. The HDL for the ALU is left
to Exercises 5.11 through 5.14.

HDL Example 7.6 ADDER

SystemVerilog VHDL
module adder(input [31:0] a, b, library IEEE;
output [31:0] y); use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD_UNSIGNED.all;
assign y = a + b;
endmodule entity adder is
port(a, b: in STD_LOGIC_VECTOR(31 downto 0);
y: out STD_LOGIC_VECTOR(31 downto 0));
end;

architecture behave of adder is


begin
y <= a + b;
end;
462 CHAPTER SEVEN Microarchitecture

HDL Example 7.7 EXTEND UNIT

SystemVerilog VHDL
module extend(input logic [31:7] instr, library IEEE;
input logic [1:0] immsrc, use IEEE.STD_LOGIC_1164.all;
output logic [31:0] immext);
entity extend is
always_comb port(instr: in STD_LOGIC_VECTOR(31 downto 7);
case(immsrc) immsrc: in STD_LOGIC_VECTOR(1 downto 0);
// I−type immext: out STD_LOGIC_VECTOR(31 downto 0));
2'b00: immext = {{20{instr[31]}}, instr[31:20]}; end;
// S−type (stores)
2'b01: immext = {{20{instr[31]}}, instr[31:25], architecture behave of extend is
instr[11:7]}; begin
// B−type (branches) process(instr, immsrc) begin
2'b10: immext = {{20{instr[31]}}, instr[7], case immsrc is
instr[30:25], instr[11:8], 1’b0}; –– I–type
// J−type (jal) when "00" =>
2'b11: immext = {{12{instr[31]}}, instr[19:12], immext <= (31 downto 12 => instr(31)) & instr(31 downto 20);
instr[20], instr[30:21], 1’b0}; –– S–types (stores)
default: immext = 32'bx; // undefined when "01" =>
endcase immext <= (31 downto 12 => instr(31)) &
endmodule instr(31 downto 25) & instr(11 downto 7);
–– B–type (branches)
when "10" =>
immext <= (31 downto 12 => instr(31)) & instr(7) & instr(30
downto 25) & instr(11 downto 8) & '0';
–– J–type (jal)
when "11" =>
immext <= (31 downto 20 => instr(31)) &
instr(19 downto 12) & instr(20) &
instr(30 downto 21) & '0';
when others =>
immext <= (31 downto 0 => '–');
end case;
end process;
end;

HDL Example 7.8 RESETTABLE FLIP-FLOP

SystemVerilog VHDL
module flopr #(parameter WIDTH = 8) library IEEE;
(input logic clk, reset, use IEEE.STD_LOGIC_1164.all;
input logic [WIDTH−1:0] d, use IEEE.STD_LOGIC_ARITH.all;
output logic [WIDTH−1:0] q);
entity flopr is
always_ff @(posedge clk, posedge reset) generic(width: integer);
if (reset) q <= 0; port(clk, reset: in STD_LOGIC;
else q <= d; d: in STD_LOGIC_VECTOR(width−1 downto 0);
endmodule q: out STD_LOGIC_VECTOR(width−1 downto 0));
end;

architecture asynchronous of flopr is


begin
process(clk, reset) begin
if reset = '1' then q <= (others => '0');
elsif rising_edge(clk) then q <= d;
end if;
end process;
end;
7.6 HDL Representation 463

HDL Example 7.9 RESETTABLE FLIP-FLOP WITH ENABLE

SystemVerilog VHDL
module flopenr #(parameter WIDTH = 8) library IEEE;
(input logic clk, reset, en, use IEEE.STD_LOGIC_1164.all;
input logic [WIDTH–1:0] d, use IEEE.STD_LOGIC_ARITH.all;
output logic [WIDTH–1:0] q);
entity flopenr is
always_ff @(posedge clk, posedge reset) generic(width: integer);
if (reset) q <= 0; port(clk, reset, en: in STD_LOGIC;
else if (en) q <= d; d: in STD_LOGIC_VECTOR(width–1 downto 0);
endmodule q: out STD_LOGIC_VECTOR(width–1 downto 0));
end;
architecture asynchronous of flopenr is
begin
process(clk, reset, en) begin
if reset = '1' then q <= (others => '0');
elsif rising_edge(clk) and en = '1' then q <= d;
end if;
end process;
end;

HDL Example 7.10 2:1 MULTIPLEXER

SystemVerilog VHDL
module mux2 #(parameter WIDTH = 8) library IEEE;
(input logic [WIDTH−1:0] d0, d1, use IEEE.STD_LOGIC_1164.all;
input logic s, entity mux2 is
output logic [WIDTH−1:0] y); generic(width: integer := 8);
port(d0, d1: in STD_LOGIC_VECTOR(width−1 downto 0);
assign y = s ? d1 : d0;
s: in STD_LOGIC;
endmodule y: out STD_LOGIC_VECTOR(width−1 downto 0));
end;
architecture behave of mux2 is
begin
y <= d1 when s = '1' else d0;
end;

HDL Example 7.11 3:1 MULTIPLEXER


SystemVerilog VHDL
module mux3 #(parameter WIDTH = 8) library IEEE;
(input logic [WIDTH−1:0] d0, d1, d2, use IEEE.STD_LOGIC_1164.all;
input logic [1:0] s, entity mux3 is
output logic [WIDTH−1:0] y); generic(width: integer := 8);
port(d0, d1, d2: in STD_LOGIC_VECTOR(width−1 downto 0);
assign y = s[1] ? d2 : (s[0] ? d1 : d0);
s: in STD_LOGIC_VECTOR(1 downto 0);
endmodule y: out STD_LOGIC_VECTOR(width−1 downto 0));
end;
architecture behave of mux3 is
begin
process(d0, d1, d2, s) begin
if (s = "00") then y <= d0;
elsif (s = "01") then y <= d1;
elsif (s = "10") then y <= d2;
end if;
end process;
end;
464 CHAPTER SEVEN Microarchitecture

# riscvtest.s
# Sarah.Harris@unlv.edu
# David_Harris@hmc.edu
# 27 Oct 2020
#
# Test the RISC-V processor:
# add, sub, and, or, slt, addi, lw, sw, beq, jal
# If successful, it should write the value 25 to address 100
# RISC-V Assembly Description Address Machine Code
main: addi x2, x0, 5 # x2 = 5 0 00500113
addi x3, x0, 12 # x3 = 12 4 00C00193
addi x7, x3, -9 # x7 = (12 - 9) = 3 8 FF718393
or x4, x7, x2 # x4 = (3 OR 5) = 7 C 0023E233
and x5, x3, x4 # x5 = (12 AND 7) = 4 10 0041F2B3
add x5, x5, x4 # x5 = 4 + 7 = 11 14 004282B3
beq x5, x7, end # shouldn't be taken 18 02728863
slt x4, x3, x4 # x4 = (12 < 7) = 0 1C 0041A233
beq x4, x0, around # should be taken 20 00020463
addi x5, x0, 0 # shouldn't execute 24 00000293
around: slt x4, x7, x2 # x4 = (3 < 5) = 1 28 0023A233
add x7, x4, x5 # x7 = (1 + 11) = 12 2C 005203B3
sub x7, x7, x2 # x7 = (12 - 5) = 7 30 402383B3
sw x7, 84(x3) # [96] = 7 34 0471AA23
lw x2, 96(x0) # x2 = [96] = 7 38 06002103
add x9, x2, x5 # x9 = (7 + 11) = 18 3C 005104B3
jal x3, end # jump to end, x3 = 0x44 40 008001EF
addi x2, x0, 1 # shouldn't execute 44 00100113
end: add x2, x2, x9 # x2 = (7 + 18) = 25 48 00910133
sw x2, 0x20(x3) # [100] = 25 4C 0221A023
done: beq x2, x2, done # infinite loop 50 00210063

Figure 7.64 riscvtest.s

7 . 6 . 3 Testbench
00500113
00C00193 The testbench loads a program into the memories. The program in
FF718393 Figure 7.64 exercises all of the instructions by performing a computa-
0023E233
0041F2B3 tion that should produce the correct result only if all of the instructions
004282B3 are functioning correctly. Specifically, the program will write the value
02728863 25 to address 100 if it runs correctly, but it is unlikely to do so if the
0041A233
00020463
hardware is buggy. This is an example of ad hoc testing.
00000293 The machine code is stored in a text file called riscvtest.txt
0023A233 (Figure 7.65) which is loaded by the testbench during simulation. The
005203B3
402383B3
file consists of the machine code for the instructions written in hexadeci-
0471AA23 mal, one instruction per line.
06002103 The testbench, top-level RISC-V module (that instantiates the
005104B3
008001EF
RISC-V processor and memories), and external memory HDL code are
00100113 given in the following examples. The testbench instantiates the top-level
00910133 module being tested and generates a periodic clock and a reset at the
0221A023
00210063
start of the simulation. It checks for memory writes and reports success
if the correct value (25) is written to address 100. The memories in this
Figure 7.65 riscvtest.txt example hold 64 32-bit words each.
7.6 HDL Representation 465

HDL Example 7.12 TESTBENCH

SystemVerilog VHDL
module testbench(); library IEEE;
use IEEE.STD_LOGIC_1164.all;
logic clk; use IEEE.NUMERIC_STD_UNSIGNED.all;
logic reset;
logic [31:0] WriteData, DataAdr; entity testbench is
logic MemWrite; end;

// instantiate device to be tested architecture test of testbench is


top dut(clk, reset, WriteData, DataAdr, MemWrite); component top
port(clk, reset: in STD_LOGIC;
// initialize test WriteData, DataAdr: out STD_LOGIC_VECTOR(31 downto 0);
initial MemWrite: out STD_LOGIC);
begin end component;
reset <= 1; # 22; reset <= 0;
end signal WriteData, DataAdr: STD_LOGIC_VECTOR(31 downto 0);
signal clk, reset, MemWrite: STD_LOGIC;
// generate clock to sequence tests begin
always –– instantiate device to be tested
begin dut: top port map(clk, reset, WriteData, DataAdr, MemWrite);
clk <= 1; # 5; clk <= 0; # 5;
end –– Generate clock with 10 ns period
process begin
// check results clk <= '1';
always @(negedge clk) wait for 5 ns;
begin clk <= '0';
if(MemWrite) begin wait for 5 ns;
if(DataAdr = = = 100 & WriteData = = = 25) begin end process;
$display("Simulation succeeded");
$stop; –– Generate reset for first two clock cycles
end else if (DataAdr != = 96) begin process begin
$display("Simulation failed"); reset <= '1';
$stop; wait for 22 ns;
end reset <= '0';
end wait;
end end process;
endmodule
–– check that 25 gets written to address 100 at end of program
process(clk) begin
if(clk'event and clk = '0' and MemWrite = '1') then
if(to_integer(DataAdr) = 100 and
to_integer(writedata) = 25) then
report "NO ERRORS: Simulation succeeded" severity
failure;
elsif (DataAdr /= 96) then
report "Simulation failed" severity failure;
end if;
end if;
end process;
end;
466 CHAPTER SEVEN Microarchitecture

HDL Example 7.13 TOP-LEVEL MODULE

SystemVerilog VHDL
module top(input logic clk, reset, library IEEE;
output logic [31:0] WriteData, DataAdr, use IEEE.STD_LOGIC_1164.all;
output logic MemWrite); use IEEE.NUMERIC_STD_UNSIGNED.all;

logic [31:0] PC, Instr, ReadData; entity top is


port(clk, reset: in STD_LOGIC;
// instantiate processor and memories WriteData, DataAdr: buffer STD_LOGIC_VECTOR(31 downto 0);
riscvsingle rvsingle(clk, reset, PC, Instr, MemWrite, MemWrite: buffer STD_LOGIC);
DataAdr, WriteData, ReadData); end;
imem imem(PC, Instr);
dmem dmem(clk, MemWrite, DataAdr, WriteData, ReadData); architecture test of top is
endmodule component riscvsingle
port(clk, reset: in STD_LOGIC;
PC: out STD_LOGIC_VECTOR(31 downto 0);
Instr: in STD_LOGIC_VECTOR(31 downto 0);
MemWrite: out STD_LOGIC;
ALUResult, WriteData: out STD_LOGIC_VECTOR(31 downto 0);
ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end component;
component imem
port(a: in STD_LOGIC_VECTOR(31 downto 0);
rd: out STD_LOGIC_VECTOR(31 downto 0));
end component;
component dmem
port(clk, we: in STD_LOGIC;
a, wd: in STD_LOGIC_VECTOR(31 downto 0);
rd: out STD_LOGIC_VECTOR(31 downto 0));
end component;

signal PC, Instr, ReadData: STD_LOGIC_VECTOR(31 downto 0);


begin
–– instantiate processor and memories
rvsingle: riscvsingle port map(clk, reset, PC, Instr,
MemWrite, DataAdr,
WriteData, ReadData);
imem1: imem port map(PC, Instr);
dmem1: dmem port map(clk, MemWrite, DataAdr, WriteData,
ReadData);
end;

HDL Example 7.14 INSTRUCTION MEMORY

SystemVerilog VHDL
module imem(input logic [31:0] a, library IEEE;
output logic [31:0] rd); use IEEE.STD_LOGIC_1164.all;
use STD.TEXTIO.all;
logic [31:0] RAM[63:0]; use IEEE.NUMERIC_STD_UNSIGNED.all;
use ieee.std_logic_textio.all;
initial
$readmemh("riscvtest.txt",RAM); entity imem is
port(a: in STD_LOGIC_VECTOR(31 downto 0);
assign rd = RAM[a[31:2]]; // word aligned
rd: out STD_LOGIC_VECTOR(31 downto 0));
endmodule
end;

architecture behave of imem is


type ramtype is array(63 downto 0) of
STD_LOGIC_VECTOR(31 downto 0);
7.6 HDL Representation 467

–– initialize memory from file


impure function init_ram_hex return ramtype is
file text_file : text open read_mode is "riscvtest.txt";
variable text_line : line;
variable ram_content : ramtype;
variable i : integer := 0;
begin
for i in 0 to 63 loop –– set all contents low
ram_content(i) := (others => '0');
end loop;
while not endfile(text_file) loop –– set contents from file
readline(text_file, text_line);
hread(text_line, ram_content(i));
i := i + 1;
end loop;

return ram_content;
end function;

signal mem : ramtype := init_ram_hex;


begin
–– read memory
process(a) begin
rd <= mem(to_integer(a(31 downto 2)));
end process;
end;

HDL Example 7.15 DATA MEMORY

SystemVerilog VHDL
module dmem(input logic clk, we, library IEEE;
input logic [31:0] a, wd, use IEEE.STD_LOGIC_1164.all;
output logic [31:0] rd); use STD.TEXTIO.all;
use IEEE.NUMERIC_STD_UNSIGNED.all;
logic [31:0] RAM[63:0];
entity dmem is
assign rd = RAM[a[31:2]]; // word aligned port(clk, we: in STD_LOGIC;
a, wd: in STD_LOGIC_VECTOR(31 downto 0);
always_ff @(posedge clk)
rd: out STD_LOGIC_VECTOR(31 downto 0));
if (we) RAM[a[31:2]] <= wd;
end;
endmodule
architecture behave of dmem is
begin
process is
type ramtype is array (63 downto 0) of
STD_LOGIC_VECTOR(31 downto 0);
variable mem: ramtype;
begin
–– read or write memory
loop
if rising_edge(clk) then
if (we = '1') then mem(to_integer(a(7 downto 2))) := wd;
end if;
end if;
rd <= mem(to_integer(a(7 downto 2)));
wait on clk, a;
end loop;
end process;
end;

You might also like