Riscv Cpu
Riscv Cpu
7
7.1 Introduction
7.2 Performance Analysis
7.3 Single-Cycle Processor
7.4 Multicycle Processor
7.5 Pipelined Processor
7.1 INTRODUCTION 7.6 HDL Representation*
In this chapter, you will learn how to piece together a microprocessor. 7.7 Advanced
Microarchitecture*
Indeed, you will puzzle out three different versions, each with different
trade-offs between performance, cost, and complexity. 7.8 Real-World Perspective:
Evolution of RISC-V
To the uninitiated, building a microprocessor may seem like black
Microarchitecture*
magic. But it is actually relatively straightforward and, by this point, you
7.9 Summary
have learned everything you need to know. Specifically, you have learned
to design combinational and sequential logic given functional and timing Exercises
specifications. You are familiar with circuits for arithmetic and memory. Interview Questions
And you have learned about the RISC-V architecture, which specifies
the programmer’s view of the RISC-V processor in terms of registers, Application >”hello
instructions, and memory. Software world!”
This chapter covers microarchitecture, which is the connection
Operating
between logic and architecture. Microarchitecture is the specific arrange- Systems
ment of registers, arithmetic logic units (ALUs), finite state machines
(FSMs), memories, and other logic building blocks needed to implement Architecture
an architecture. A particular architecture, such as RISC-V, may have
many different microarchitectures, each with different trade-offs of per- Micro-
formance, cost, and complexity. They all run the same programs, but architecture
their internal designs vary widely. We design three different microarchi- +
Logic
tectures in this chapter to illustrate the trade-offs.
Digital
Circuits
7 . 1 . 1 Architectural State and Instruction Set
Analog +
Recall that a computer architecture is defined by its instruction set and Circuits −
architectural state. The architectural state for the RISC-V processor con-
sists of the program counter and the 32 32-bit registers. Any RISC-V Devices
microarchitecture must contain all of this state. Based on the current
architectural state, the processor executes a particular instruction with a Physics
particular set of data to produce a new architectural state. Some
Digital Design and Computer Architecture, RISC-V Edition. DOI: 10.1016/B978-0-12-820064-3.00007-6 393
Copyright © 2022 Elsevier Inc. All rights reserved.
394 CHAPTER SEVEN Microarchitecture
CLK CLK
CLK
WE3 WE
PCNext PC A1 RD1
32 32 32
A RD 32
5 32
A RD
Instruction 32 32
A2 RD2 Data
Memory 5 32
5
A3 Memory
Register
WD3 WD
32 File 32
1
This is an oversimplification used to treat the instruction memory as a ROM. In most real
processors, the instruction memory must be writable so that the operating system (OS) can
load a new program into memory. The multicycle microarchitecture described in Section 7.4
is more realistic in that it uses a single memory that contains both instructions and data and
that can be both read and written.
396 CHAPTER SEVEN Microarchitecture
7 . 1 . 3 Microarchitectures
In this chapter, we develop three microarchitectures for the RISC-V
architecture: single-cycle, multicycle, and pipelined. They differ in how
the state elements are connected and in the amount of nonarchitectural
state needed.
The single-cycle microarchitecture executes an entire instruction in
one cycle. It is easy to explain and has a simple control unit. Because it
completes the operation in one cycle, it does not require any nonarchitectural
state. However, the cycle time is limited by the slowest instruction.
Moreover, the processor requires separate instruction and data memories,
Examples of classic multicycle
which is generally unrealistic.
processors include the 1947 The multicycle microarchitecture executes instructions in a series of
MIT Whirlwind, the IBM shorter cycles. Simpler instructions execute in fewer cycles than compli-
System/360, the Digital cated ones. Moreover, the multicycle microarchitecture reduces the
Equipment Corporation VAX, hardware cost by reusing expensive hardware blocks, such as adders
the 6502 used in the Apple II,
and the 8088 used in the IBM
and memories. For example, the adder may be used on different cycles
PC. Multicycle microarchitectures for several purposes while carrying out a single instruction. The multicycle
are still used in inexpensive microprocessor accomplishes this by introducing several nonarchitectural
microcontrollers such as the registers to hold intermediate results. The multicycle processor executes
8051, the 68HC11, and the only one instruction at a time, but each instruction takes multiple clock
PIC16-series found in appliances,
toys, and gadgets.
cycles. This processor requires only a single memory, accessing it on one
cycle to fetch the instruction and on another to read or write data.
Because they use less hardware than single-cycle processors, multicycle
Intel processors have been processors were the historical choice for inexpensive systems.
pipelined since the 80486 was
The pipelined microarchitecture applies pipelining to the single-cycle
introduced in 1989. Nearly all
RISC microprocessors are also microarchitecture. It therefore can execute several instructions simul-
pipelined, and all commercial taneously, improving the throughput significantly. Pipelining must add
RISC-V processors have been logic to handle dependencies between simultaneously executing instruc-
pipelined. Because of the tions. It also requires nonarchitectural pipeline registers. Pipelined
decreasing cost of transistors,
processors must access instructions and data in the same cycle; they generally
pipelined processors now cost
fractions of a penny, and the use separate instruction and data caches for this purpose, as discussed in
entire system, with memory and Chapter 8. The added logic and registers are worthwhile; all commercial
peripherals, costs 10’s of cents. high-performance processors use pipelining today.
Thus, pipelined processors We explore the details and trade-offs of these three microarchitec-
are replacing their slower
tures in the subsequent sections. At the end of the chapter, we briefly
multicycle siblings in even the
most cost-sensitive applications. mention additional techniques that are used to achieve even more speed
in modern high-performance microprocessors.
7.2 Performance Analysis 397
elements from Figure 7.1 with combinational logic that can execute the
various instructions. Control signals determine which specific instruc-
tion is performed by the datapath at any given time. The control unit
contains combinational logic that generates the appropriate control
signals based on the current instruction. Finally, we analyze the perfor-
mance of the single-cycle processor.
7 . 3 . 1 Sample Program
For the sake of concreteness, we will have the single-cycle processor run
the short program from Figure 7.2 that exercises loads, stores, an R-type
instruction (or), and a branch (beq). Suppose that the program is stored
in memory starting at address 0x1000. The figure indicates the address
of each instruction, the instruction type, the instruction fields, and the
hexadecimal machine language code for the instruction.
Assume that register x5 initially contains the value 6 and x9 contains
0x2004. Memory location 0x2000 contains the value 10. The program
counter begins at 0x1000. The lw reads 10 from address (0x2004 – 4)
= 0x2000 and puts it in x6. The sw writes 10 to address (0x2004 + 8) =
0x200C. The or computes x4 = 6 | 10 = 01102 | 10102 = 11102 = 14.
Then, beq goes back to label L7, so the program repeats forever.
7 . 3 . 2 Single-Cycle Datapath
This section gradually develops the single-cycle datapath, adding one
piece at a time to the state elements from Figure 7.1. The new connections
are emphasized in black (or blue, for new control signals), whereas the
hardware that has already been studied is shown in gray. The example
instruction being executed is shown at the bottom of each figure.
The program counter contains the address of the instruction to
We italicize signal names in
execute. The first step is to read this instruction from instruction memory. the text but not the names
Figure 7.3 shows that the PC is simply connected to the address input of of hardware modules. For
the instruction memory. The instruction memory reads out, or fetches, the example, PC is the signal
32-bit instruction, labeled Instr. In our sample program from Figure 7.2, coming out of the PC register,
or simply, the PC.
PC is 0x1000. (Note that this is a 32-bit processor, so PC is really
0x00001000, but we omit leading zeros to avoid cluttering the figure.)
CLK CLK
CLK
WE3 WE
PCNext PC Instr A1 RD1
A RD
A RD
0x1000
0xFFC4A303
Instruction
A2 RD2 Data
Memory
A3 Memory
Register
WD3 WD
File
CLK CLK
CLK
19:15 WE3 0x2004 WE
PCNext PC Instr A1 RD1
A RD 9
A RD
0x1000
0xFFC4A303
Instruction
A2 RD2 Data
Memory
A3 Memory
Register
WD3 WD
File
lw
For the lw instruction, the next step is to read the source register con-
taining the base address. Recall that lw is an I-type instruction, and the
base register is specified in the rs1 field of the instruction, Instr19:15.
These bits of the instruction connect to the A1 address input of the reg-
ister file, as shown in Figure 7.4. The register file reads the register value
onto RD1. In our example, the register file reads 0x2004 from x9.
The lw instruction also requires an offset. The offset is stored in
the 12-bit immediate field of the instruction, Instr31:20. It is a signed
value, so it must be sign-extended to 32 bits. Sign extension simply
means copying the sign bit into the most significant bits: ImmExt31:12
= Instr31 , and ImmExt11:0 = Instr31:20. Sign-extension is performed by
7.3 Single-Cycle Processor 401
CLK CLK
CLK
19:15 WE3 0x2004 WE
PCNext PC Instr A1 RD1
A RD 9
A RD
0x1000
0xFFC4A303
Instruction
A2 RD2 Data
Memory
A3 Memory
Register
WD3 WD
File
ImmExt
31:20
Extend 0xFFFFFFFC
0xFFC
ALUControl2:0
000
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9
ALU
ALUResult
A RD
0x1000
0xFFC4A303
Instruction
A2 RD2 SrcB 0x2000 Data
Memory
A3 Memory
Register
WD3 WD
File
ImmExt
31:20
Extend 0xFFFFFFFC
0xFFC
an Extend unit, as shown in Figure 7.5, which receives the 12-bit signed
immediate in Instr31:20 and produces the 32-bit sign-extended immediate,
ImmExt. In our example, the two’s complement immediate −4 is
extended from its 12-bit representation 0xFFC to a 32-bit representation
0xFFFFFFFC.
The processor adds the base address to the offset to find the
address to read from memory. Figure 7.6 introduces an ALU to perform
this addition. The ALU receives two operands, SrcA and SrcB. SrcA is
the base address from the register file, and SrcB is the offset from the
sign-extended immediate, ImmExt. The ALU can perform many opera-
tions, as was described in Section 5.2.4. The 3-bit ALUControl signal
402 CHAPTER SEVEN Microarchitecture
RegWrite ALUControl2:0
1 000
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9 ReadData
ALU
ALUResult
A RD
0x1000
Instruction 0xFFC4A303
A2 RD2 SrcB 0x2000 Data 10
Memory 11:7
A3 Memory
6 Register
WD3 WD
File
ImmExt
31:20
Extend 0xFFFFFFFC
0xFFC
Figure 7.7 Read memory and write result back to register file
specifies the operation (see Table 5.3 on page 250). The ALU receives
32-bit operands and generates a 32-bit ALUResult. For the lw instruc-
tion, ALUControl should be set to 000 to perform addition. ALUResult
is sent to the data memory as the address to read, as shown in Figure 7.6.
In our example, the ALU computes 0x2004 + 0xFFFFFFFC = 0x2000.
Again, this is a 32-bit value, but we omit the leading zeros to avoid clut-
tering the figure.
This memory address from the ALU is provided to the address (A)
port of the data memory. The data is read from the data memory onto the
ReadData bus and then written back to the destination register at the end
of the cycle, as shown in Figure 7.7. Port 3 of the register file is the write
port. lw’s destination register, indicated by the rd field (Instr11:7), is con-
nected to A3, port 3’s address input. The ReadData bus is connected to
WD3, port 3’s write data input. A control signal called RegWrite (register
write) is connected to WE3, port 3’s write enable input, and is asserted
during the lw instruction so that the data value is written into the register
file. The write takes place on the rising edge of the clock at the end of the
cycle. In our example, the processor reads 10 from address 0x2000 in the
data memory and puts that value (10) into x6 in the register file.
While the instruction is being executed, the processor must also
compute the address of the next instruction, PCNext. Because instruc-
tions are 32 bits (4 bytes), the next instruction is at PC+4. Figure 7.8
uses an adder to increment the PC by 4. In our example, PCNext =
0x1000 + 4 = 0x1004. The new address is written into the program
counter on the next rising edge of the clock. This completes the datapath
for the lw instruction.
7.3 Single-Cycle Processor 403
RegWrite ALUControl2:0
1 000
CLK CLK
CLK
19:15 WE3 0x2004 SrcA WE
PCNext PC Instr A1 RD1
A RD 9 ReadData
ALU
0x1004 ALUResult
A RD
0x1000
0xFFC4A303
Instruction
A2 RD2 SrcB 0x2000 Data 10
Memory 11:7
A3 Memory
6 Register
WD3 WD
File
ImmExt
31:7
Extend 0xFFFFFFFC
PCPlus4 0xFFC
+
4
sw
Next, let us extend the datapath to handle sw, which is an S-type
instruction. Like lw, sw reads a base address from port 1 of the regis-
ter file and sign-extends the immediate. The ALU adds the base address
to the immediate to find the memory address. All of these functions
are already supported in the datapath, but the 12-bit signed immediate
is stored in Instr31:25,11:7 (instead of Instr31:20, as it was for lw). Thus,
the Extend unit must be modified to also receive these additional bits,
Instr11:7. For simplicity (and for future instructions such as jal), the
Extend unit receives all the bits of Instr31:7. A control signal, ImmSrc,
decides which instruction bits to use as the immediate. When ImmSrc =
0 (for lw), the Extend unit chooses Instr31:20 as the 12-bit signed imme-
diate; when ImmSrc = 1 (for sw), it chooses Instr31:25,11:7.
The sw instruction also reads a second register from the register file
and writes its contents to the data memory. Figure 7.9 shows the new
connections for this added functionality. The register is specified in the
rs2 field, Instr24:20, which is connected to the address 2 (A2) input of the
register file. The register’s contents are read onto the read data 2 (RD2)
output, which, in turn, is connected to the write data (WD) input of the
data memory. The write enable port of the data memory, WE, is con-
trolled by MemWrite. For an sw instruction: MemWrite = 1 to write the
data to memory; ALUControl = 000 to add the base address and offset;
and RegWrite = 0, because nothing should be written to the register file.
Note that data is still read from the address given to the data memory,
but this ReadData is ignored because RegWrite = 0.
404 CHAPTER SEVEN Microarchitecture
ALU
0x1008 ALUResult
A RD
0x1004
ImmExt
31:7
Extend 0x00000008
PCPlus4 0x008
+
R-Type Instructions
Next, consider extending the datapath to handle the R-type instructions,
add, sub, and, or, and slt. All of these instructions read two source reg-
isters from the register file, perform some ALU operation on them, and
write the result back to the destination register. They differ only in the
specific ALU operation. Hence, they can all be handled with the same
hardware but with different ALUControl signals. Recall from Section
5.2.4 that ALUControl is 000 for addition, 001 for subtraction, 010 for
AND, 011 for OR, and 101 for set less than.
Figure 7.10 shows the enhanced datapath handling these R-type
instructions. The datapath reads rs1 and rs2 from ports 1 and 2 of the
register file and performs an ALU operation on them. We introduce a
multiplexer and a new select signal, ALUSrc, to select between ImmExt
and RD2 as the second ALU source, SrcB. For lw and sw, ALUSrc is
1 to select ImmExt; for R-type instructions, ALUSrc is 0 to select the
register file output RD2 as SrcB.
7.3 Single-Cycle Processor 405
ALU
ALUResult ReadData
0x100C A RD 1 14
0x1008
0x0062E233
Instruction 24:20 10 14
A2 RD2 0 SrcB Data
Memory 11:7 6
A3 1 10 Memory
4 Register WriteData
WD3 WD
14 File
ImmExt
31:7
Extend 0x00000008
PCPlus4
+
+
Result
4
Let us name the value to be written back to the register file Observe that our hardware
computes all the possible
Result. For lw, Result comes from the ReadData output of the answers needed by different
memory. However, for R-type instructions, Result comes from the instructions (e.g., ALUResult
ALUResult output of the ALU. We add the Result multiplexer to and ReadData) and then uses
choose the proper Result based on the type of instruction. The mul- a multiplexer to choose the
appropriate one based on the
tiplexer select signal ResultSrc is 0 for R-type instructions to choose
instruction. This is an important
ALUResult as Result; ResultSrc is 1 for lw to choose ReadData. design strategy. Throughout
We do not care about the value of ResultSrc for sw because it does the rest of this chapter, we will
not write the register file. add multiplexers to choose the
In our example, the PC is 0x1008. Thus, the instruction memory desired answer.
One of the major differences
reads out the or instruction 0x0062E233. The register file reads source
between software and hardware
operands 6 from x5 and 10 from x6. ALUControl is 011, so the ALU is that software operates
computes 6 | 10 = 01102 | 10102 = 11102 = 14. The result is written sequentially, so we can compute
back to x4. Meanwhile, the PC is incremented to 0x100C. just the answer we need.
Hardware operates in parallel;
beq therefore, we often compute all
the possible answers and then
Finally, we extend the datapath to handle the branch if equal (beq) pick the one we need. For
instruction. beq compares two registers. If they are equal, it takes the example, while executing an
branch by adding the branch offset to the program counter (PC). R-type instruction with the ALU,
The branch offset is a 13-bit signed immediate stored in the 12-bit the memory still receives an
address and reads data from this
immediate field of the B-type instruction. Thus, the Extend logic needs
address even though we don’t
yet another mode to choose the proper immediate. ImmSrc is increased care what that data might be.
to 2 bits, using the encoding from Table 7.1. ImmExt is now either the
406 CHAPTER SEVEN Microarchitecture
ALU
1 0x1000 ALUResult
A RD 1
0x100C
0xFE420AE3
Instruction 24:20 14
A2 RD2 0 SrcB 0 Data
Memory 11:7 4
A3 1 14 Memory
Register WriteData
WD3 WD
File
PCTarget
+
ImmExt 0x1000
31:7
Extend 0xFFFFFFF4
PCPlus4 0xFFA
+
Result
4
0x1010
PCSrc
Control
ResultSrc
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite
CLK CLK
CLK
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1 0
A RD ReadData
1 ALUResult
ALU
A RD 1
Instruction 24:20
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File
PCTarget
+
ImmExt
31:7
Extend
PCPlus4
+
Result
4
register file reads 14 on both ports. The ALU computes 14 – 14 = 0, and the
Zero flag is asserted. Meanwhile, the Extend unit produces 0xFFFFFFF4
(i.e., −12), which is added to PC to obtain PCTarget = 0x1000. Note that
we show the unswizzled upper 12 bits of the 13-bit immediate on the input
of the Extend unit (0xFFA). The PCNext mux chooses PCTarget as the next We name the multiplexers
PC and branches back to the start of the code at the next clock edge. (muxes) by the signals they
This completes the design of the single-cycle processor datapath. We produce. For example, the
PCNext mux produces the
have illustrated not only the design itself but also the design process in PCNext signal, and the Result
which the state elements are identified and the combinational logic is mux produces the Result signal.
systematically added to connect the state elements. In the next section,
we consider how to compute the control signals that direct the operation
of our datapath.
7 . 3 . 3 Single-Cycle Control
The single-cycle processor’s control unit computes the control signals
based on op, funct3, and funct7. For the RV32I instruction set, only
bit 5 of funct7 is used, so we just need to consider op (Instr6:0), funct3
(Instr14:12), and funct75 (Instr30). Figure 7.12 shows the entire single-
cycle processor with the control unit attached to the datapath.
408 CHAPTER SEVEN Microarchitecture
Zero PCSrc
Branch
ResultSrc
Main MemWrite
op6:0 Decoder ALUSrc
Figure 7.13 Single-cycle 5
ImmSrc1:0
processor control unit
RegWrite
ALUOp1:0
ALU
funct32:0 ALUControl2:0
Decoder
funct75
sw 0100011 0 01 1 1 x 0 00
R-type 0110011 1 xx 0 0 0 0 10
beq 1100011 0 10 0 0 x 1 01
ALUOp of 00 indicates add (e.g., to find the address for loads or According to Table B.1 in
stores). ALUOp of 01 indicates subtract (e.g., to compare two numbers the inside covers of the book,
for branches). ALUOp of 10 indicates an R-type ALU instruction where add, sub, and addi all have
the ALU Decoder must look at the funct3 field (and sometimes also the funct3 = 000. add has funct7
op5 and funct75 bits) to determine which ALU operation to perform = 0000000 while sub has
funct7 = 0100000, so funct75
(e.g., add, sub, and, or, slt). is sufficient to distinguish these
two. But we will soon consider
supporting addi, which doesn’t
Example 7.1 SINGLE-CYCLE PROCESSOR OPERATION
have a funct7 field but has an
Determine the values of the control signals and the portions of the datapath that op of 0010011. With a bit of
thought, we can see that an
are used when executing an and instruction.
ALU instruction with funct3 =
Solution Figure 7.14 illustrates the control signals and flow of data during exe- 000 is sub if op5 and funct75
are both 1, or add or addi
cution of an and instruction. The PC points to the memory location holding
otherwise.
the instruction; the instruction memory outputs this instruction. The main flow
of data through the register file and ALU is represented with a heavy blue line.
The register file reads the two source operands specified by Instr. SrcB should
come from the second port of the register file (not ImmExt), so ALUSrc must
be 0. The ALU performs a bitwise AND operation, so ALUControl must be
010. The result comes from the ALU, so ResultSrc is 0, and the result is written
to the register file, so RegWrite is 1. The instruction does not write memory, so
MemWrite is 0.
PCSrc
Control
ResultSrc
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite
CLK CLK
0 CLK 1 xx 0 010 0 0
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1 0
A RD ReadData
1 ALUResult
ALU
A RD 1
Instruction 24:20
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File
PCTarget
+
ImmExt
31:7
Extend
PCPlus4
+
Result
4
Figure 7.14 Control signals and data flow while executing an and instruction
7 . 3 . 4 More Instructions
So far, we have considered only a small subset of the RISC-V instruction
set. In this section, we enhance the datapath and controller to support
the addi (add immediate) and jal (jump and link) instructions. These
examples illustrate the principle of how to handle new instructions, and
they give us a sufficiently rich instruction set to write many interesting
programs. With enough effort, you could extend the single-cycle pro-
cessor to handle every RISC-V instruction. Moreover, we will see that
supporting some instructions simply requires enhancing the decoders,
whereas supporting others also requires new hardware in the datapath.
Recall that addi rd,rs1,imm is an I-type instruction that adds the value in rs1
to a sign-extended immediate and writes the result to rd. The datapath already is
capable of this task. Determine the necessary changes to the controller to support
addi.
Solution All we need to do is add a new row to the Main Decoder truth table
showing the control signal values for addi, as given in Table 7.4. The result
should be written to the register file, so RegWrite = 1. The 12-bit immediate
in Instr31:20 is sign-extended as it was with lw, another I-type instruction, so
7.3 Single-Cycle Processor 411
R-type 0110011 1 xx 0 0 0 0 10
beq 1100011 0 10 0 0 x 1 01
addi 0010011 1 00 1 0 0 0 10
ImmSrc is 00 (see Table 7.1). SrcB comes from the immediate, so ALUSrc = 1.
The instruction does not write memory nor is it a branch, so MemWrite =
Branch = 0. The result comes from the ALU, not memory, so ResultSrc = 0.
Finally, the ALU should add, so ALUOp = 10; the ALU Decoder makes
ALUControl = 000 because funct3 = 000 and op5 = 0.
The astute reader may note that this change also provides the other I-type ALU
instructions: andi, ori, and slti. These other instructions share the same op
value of 0010011, need the same control signals, and only differ in the funct3
field, which the ALU Decoder already uses to determine ALUControl and, thus,
the ALU operation.
Show how to change the RISC-V single-cycle processor to support the jump and
link (jal) instruction. jal writes PC+4 to rd and changes PC to the jump target
address, PC + imm.
Solution The processor calculates the jump target address, the value of PCNext,
by adding PC to the 21-bit signed immediate encoded in the instruction. The least
significant bit of the immediate is always 0 and the next 20 most significant bits
come from Instr31:12. This 21-bit immediate is then sign-extended. The datapath
already has hardware for adding PC to a sign-extended immediate, selecting
this as the next PC, computing PC+4, and writing a value to the register file.
Hence, in the datapath, we must only modify the Extend unit to sign-extend the
21-bit immediate and expand the Result multiplexer to choose PC+4 (i.e., PCPlus4)
as shown in Figure 7.15. Table 7.5 shows the new encoding for ImmSrc to support
the long immediate for jal.
The control unit needs to set PCSrc = 1 for the jump. To do this, we
add an OR gate and another control signal, Jump, as shown in Figure 7.16.
When Jump asserts, PCSrc = 1 and PCTarget (the jump target address)
is selected as the next PC.
412 CHAPTER SEVEN Microarchitecture
PCSrc
Control
ResultSrc1:0
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite
CLK CLK
CLK
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1
A RD 00
ALU
1 ALUResult ReadData
A RD 01
Instruction 24:20 10
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File
PCTarget
+
ImmExt
31:7
Extend
PCPlus4
+
Result
4
Table 7.6 shows the updated Main Decoder table with a new row
for jal. RegWrite = 1 and ResultSrc = 10 to write PC+4 into rd. ImmSrc
= 11 to select the 21-bit jump offset. ALUSrc and ALUOp don’t matter
because the ALU is not used. MemWrite = 0 because the instruction isn’t
a store, and Branch = 0 because the instruction isn’t a branch. The new
Jump signal is 1 to pick the jump target address as the next PC.
7 . 3 . 5 Performance Analysis
Recall from Equation 7.1 that the execution time of a program is the
product of the number of instructions, the cycles per instruction, and
7.3 Single-Cycle Processor 413
Zero
Branch PCSrc
Jump
ResultSrc1:0
Main MemWrite
Decoder
op6:0 ALUSrc
5 Figure 7.16 Enhanced control
ImmSrc1:0
unit for jal
RegWrite
ALUOp1:0
ALU
funct32:0 ALUControl2:0
Decoder
funct75
Instruction Opcode RegWrite ImmSrc ALUSrc MemWrite ResultSrc Branch ALUOp Jump
lw 0000011 1 00 1 0 01 0 00 0
sw 0100011 0 01 1 1 xx 0 00 0
R-type 0110011 1 xx 0 0 00 0 10 0
beq 1100011 0 10 0 0 xx 1 01 0
the cycle time. Each instruction in the single-cycle processor takes one
clock cycle, so the clock cycles per instruction (CPI) is 1. The cycle time
is set by the critical path. In our processor, the lw instruction is the most
time-consuming and involves the critical path shown in Figure 7.17. As
indicated by heavy blue lines, the critical path starts with the PC loading
a new address on the rising edge of the clock. The instruction memory
then reads the new instruction, and the register file reads rs1 as SrcA.
While the register file is reading, the immediate field is sign-extended
based on ImmSrc and selected at the SrcB multiplexer (path highlighted
in gray). The ALU adds SrcA and SrcB to find the memory address. The
data memory reads from this address, and the Result multiplexer selects
ReadData as Result. Finally, Result must set up at the register file before
414 CHAPTER SEVEN Microarchitecture
PCSrc
Control
ResultSrc1:0
Unit
MemWrite
6:0
op ALUControl2:0
14:12
funct3 ALUSrc
30
funct75 ImmSrc1:0
Zero RegWrite
CLK CLK
CLK
19:15 WE3 SrcA Zero WE
0 PCNext PC Instr A1 RD1
A RD ReadData 00
1 ALUResult
ALU
A RD 01
Instruction 24:20 10
A2 RD2 0 SrcB Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File
PCTarget
+
ImmExt
31:7
Extend
PCPlus4
+
Result
4
the next rising clock edge so that it can be properly written. Hence, the
cycle time of the single-cycle processor is:
The numerical values of these times will depend on the specific imple-
mentation technology.
Other instructions have shorter critical paths. For example, R-type
instructions do not need to access data memory. However, we are disciplining
ourselves to synchronous sequential design, so the clock period is constant
and must be long enough to accommodate the slowest instruction.
7.4 Multicycle Processor 415
Multiplexer tmux 30
Solution According to Equation 7.3, the cycle time of the single-cycle processor is
Tc_single = 40 + 2(200) + 100 + 120 + 30 + 60 = 750 ps. According to Equation 7.1,
the total execution time is Tsingle = (100 × 109 instruction) (1 cycle/instruction)
(750 × 10−12 s/cycle) = 75 seconds.
file have the longest delays, so to keep the delay for each short step
approximately equal, the processor can use only one of those units in
each step. The processor uses a single memory because the instruction
is read in one step and data is read or written in a later step. And the
processor needs only one adder, which is reused for different purposes
on different steps. Various instructions use different numbers of steps, so
simpler instructions can complete faster than more complex ones.
We design a multicycle processor following the same procedure we
used for the single-cycle processor. First, we construct a datapath by
connecting the architectural state elements and memories with combina-
tional logic. But, this time, we also add nonarchitectural state elements
to hold intermediate results between the steps. Then, we design the
controller. During the execution of a single instruction, the controller
produces different signals on each step, so now the controller uses a finite
state machine rather than combinational logic. Finally, we analyze the per-
formance of the multicycle processor and compare it with the single-cycle
processor.
7 . 4 . 1 Multicycle Datapath
Again, we begin our design with the memory and architectural state of
the processor, as shown in Figure 7.18. In the single-cycle design, we
used separate instruction and data memories because we needed to read
the instruction memory and read or write the data memory all in one
cycle. Now, we choose to use a combined memory for both instructions
and data. This is more realistic and is feasible because we can read the
instruction in one cycle, then read or write the data in another cycle. The
PC and register file remain unchanged.
As with the single-cycle processor, we gradually build the datapath
by adding components to handle each step of each instruction. The PC
contains the address of the instruction to execute. The first step is to
read this instruction from memory. Figure 7.19 shows that the PC is
CLK CLK
CLK
WE WE3
A1 RD1
PCNext PC RD
A
EN
Instr / Data A2 RD2
Memory
A3 Register
WD
WD3 File
simply connected to the address input of the memory. The instruction is Like in the single-cycle processor,
read and stored in a new nonarchitectural instruction register (IR) so we name the multiplexers and
that it is available for future cycles. The IR receives an enable signal, nonarchitectural registers by
called IRWrite, which is asserted when the IR should be loaded with a the signals they produce. For
new instruction. example, the instruction register
produces the instruction signal
(Instr), and the Result multiplexer
lw produces the Result signal.
As we did with the single-cycle processor, we first work out the datapath
connections for the lw instruction. After fetching lw, the second step is
to read the source register containing the base address. This register is
specified in the rs1 field, Instr19:15. These bits of the instruction are con-
nected to address input A1 of the register file, as shown in Figure 7.20.
The register file reads the register onto RD1, and this value is stored in
another nonarchitectural register, A.
IRWrite
CLK CLK CLK
CLK
WE WE3
Instr A1 RD1
PCNext PC RD
A EN
Instr / Data A2 RD2
Memory
A3 Register
WD
WD3 File
IRWrite ImmSrc1:0
CLK CLK CLK CLK
CLK
WE 19:15 Rs1 WE3 A
Instr A1 RD1
PCNext PC RD
A EN
Instr / Data A2 RD2
Memory
A3 Register
WD
WD3 File
Figure 7.20 Read one source from register file and extend second source from immediate field
418 CHAPTER SEVEN Microarchitecture
A EN ALUResult ALUOut
Instr / Data A2 RD2 SrcB
Memory
A3 Register
WD
WD3 File
ALU
EN A EN
1
Instr / Data A2 RD2 SrcB
Memory
ReadData
A3 Register
WD
WD3 File
CLK
31:7 Extend ImmExt
Data
EN A EN 00
1
Instr / Data A2 RD2 SrcB 01
Memory
ReadData
11:7 Rd 10
A3 Register
WD
WD3 File
CLK
31:7 Extend ImmExt
Data
ALUSrcB1:0
ALU
EN A EN 00
1
Instr / Data A2 RD2 00 SrcB 01
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
ALUSrcB1:0
Memory 11:7 Rd 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
sw
Next, let us extend the datapath to handle the sw instruction. Like lw,
sw reads a base address from port 1 of the register file and extends the
immediate on the second step. Then, the ALU adds the base address to
the immediate to find the memory address on the third step. The only
new feature of sw is that we must read a second register from the regis-
ter file and write its contents into memory, as shown in Figure 7.25. The
register is specified in the rs2 field of the instruction, Instr24:20, which
is connected to the second port of the register file (A2). After it is read
7.4 Multicycle Processor 421
ALUSrcB1:0 Zero
CLK
OldPC
ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
01
WriteData
ReadData
Memory 11:7 Rd 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
on the second step, the register’s contents are then stored in a nonar-
chitectural register, the WriteData register, just below the A register. It is
then sent to the write data port (WD) of the data memory to be written
on the fourth step. The memory receives the MemWrite control signal,
which is asserted when memory should be written.
R-Type Instructions
R-type instructions operate on two source registers and write the result
back to the register file. The datapath already contains all the connec-
tions necessary for these steps.
beq
beq checks whether two register contents are equal and computes
the branch target address by adding the current PC to a 13-bit signed
branch offset. The hardware to compare the registers using subtraction
is already present in the datapath.
The ALU is not being used during the second step of instruc-
tion execution, so we use it then to calculate the branch target address
PCTarget = PC + ImmExt. In this step, the instruction has been fetched
from memory and PC has already been updated to PC+4. Thus, in the
first step, the PC of the current instruction, OldPC, must be stored in
a nonarchitectural register. In the second step, as the registers are also
fetched, the ALU calculates PC + ImmExt by selecting OldPC for SrcA
and ImmExt for SrcB and making ALUControl = 000 so that it per-
forms addition. The processor stores this sum in the ALUOut register.
Figure 7.26 shows the updated datapath for beq.
In the third step, the ALU subtracts the source registers and asserts
the Zero output if they are equal. If they are, the control unit asserts
422 CHAPTER SEVEN Microarchitecture
PCWrite and the Result multiplexer selects ALUOut (that contains the
target address) to feed to the PC. No new hardware is needed.
This completes the design of the multicycle datapath. The design
process is much like that of the single-cycle processor in that hardware
is systematically connected between the state elements to handle each
instruction. The main difference is that the instruction is executed in sev-
eral steps. Nonarchitectural registers are inserted to hold the results of
each step. In this way, the memory can be shared for instructions and
data and the ALU can be reused several times, thus reducing hardware
costs. In the next section, we develop an FSM controller to deliver the
appropriate sequence of control signals to the datapath on each step of
each instruction.
7 . 4 . 2 Multicycle Control
As in the single-cycle processor, the control unit computes the control
signals based on the op, funct3, and funct75 fields of the instruction
(Instr6:0, Instr14:12, and Instr30). Figure 7.27 shows the entire multicycle
processor with the control unit attached to the datapath. The datapath is
shown in black and the control unit in blue.
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc
1:0
30
funct75 RegWrite
Zero
Zero
CLK
OldPC
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
Memory
ReadData
11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
Zero
Branch PCWrite
PCUpdate
RegWrite
Main MemWrite
FSM IRWrite
ResultSrc1:0
op6:0 ALUSrcB1:0
5
ALUSrcA1:0 Figure 7.28 Multicycle control
AdrSrc unit
ALUOp1:0
ALU
funct32:0 ALUControl2:0
Decoder
funct75
Instr
op6:0 ImmSrc1:0
Decoder
Fetch
The first step for any instruction is to fetch the instruction from memory
at the address held in the PC. The FSM enters this Fetch state on reset.
The control signals are shown in Figure 7.29. To read the instruction Figure 7.29 Fetch
424 CHAPTER SEVEN Microarchitecture
from memory, AdrSrc = 0, so the address is taken from the PC. IRWrite
is asserted to write the instruction into the instruction register, IR. At
the same time, the current PC is written into the OldPC register. The
data flow through the datapath for this and the next two steps of the lw
instruction is shown in Figure 7.32, with the flow during the Fetch stage
highlighted in gray.
Decode
The second step is to read the register file and decode the instructions.
The control unit decodes the instruction, that is, figures out what opera-
tion should be performed based on op, funct3, and funct75. In this state,
the processor also reads the source registers, rs1 and rs2, and puts the
values read into the A and WriteData nonarchitectural registers. No con-
trol signals are necessary for these tasks. Figure 7.30 shows the Decode
state in the Main FSM and Figure 7.32 shows the flow through the datapath
during this state in medium blue lines. After this step, the processor can
differentiate its actions based on the instruction because the instruction
has been fetched and decoded. We will first show the remaining steps for
lw, then continue with the steps for the other RISC-V instructions.
MemAdr
The third step for lw is to calculate the memory address. The ALU adds
the base address and the offset, so ALUSrcA = 10 to select A (the value
read from rs1) as SrcA, and ALUSrcB = 01 to select ImmExt as SrcB.
ImmSrc, as determined by the Instruction Decoder, is 00 to sign-extend
the I-type immediate, and ALUOp is 00 to add SrcA and SrcB. At the
end of this state, the ALU result (i.e., the address calculation) is stored in
the ALUOut register. Figure 7.31 shows this MemAdr state added to the
Main FSM, and Figure 7.32 shows the datapath flow during this state
highlighted in dark-blue lines.
MemRead
The memory read (MemRead) step sends the calculated address to
memory by sending ALUOut to the address port of the memory, Adr.
Reset
Reset
S2: MemAdr
ALUSrcA = 10
ALUSrcB = 01
ALUOp = 00
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
S0: Fetch 0 0 0 1 0 xx xx xx xxx xx
S1: Decode 0 x 0 0 Zero 0 00 xx xx xxx xx
S2: MemAdr 0 x 0 0 0 00 10 01 000 xx
Zero
CLK
OldPC
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
Figure 7.32 Data flow during Fetch, Decode, and MemAdr states
426 CHAPTER SEVEN Microarchitecture
MemWB
In the memory writeback (MemWB) step, the data read from memory,
Data, is written to the destination register. ResultSrc is 01 to select Data
as the Result, and RegWrite is asserted to write the data to the regis-
ter file. The register file’s address and write data inputs for port 3 (A3
and WD3) are already connected to rd (Instr11:7) and Result, respectively.
Figures 7.33 and 7.34 show the MemRead and MemWB states and the
flow through the datapath for both steps. MemWB is the final step in the
lw instruction. Figure 7.33 also shows the transition from MemWB back
Reset
op = 0000011 (lw)
S2: MemAdr
ALUSrcA = 10
ALUSrcB = 01
ALUOp = 00
S3: MemRead
ResultSrc = 00
AdrSrc = 1
S4: MemWB
ResultSrc = 01
RegWrite
7.4 Multicycle Processor 427
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
Zero
CLK
OldPC
ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
to the Fetch state so that the next instruction can be fetched. However, We started this section stating
the PC has not yet been incremented. We address this issue next. that only one of the time-
Before finishing the lw instruction, the processor must increment the consuming units (the memory,
ALU, or register file) could be
PC so that it can fetch the next instruction. We could add another state for used in each step. However,
doing this but, instead, we save a cycle by noticing that the ALU is not here, we use both the register file
being used in the Fetch step, so the processor can use that state to calculate and ALU. As long as the units
PC+4 at the same time that it fetches the instruction. ALUSrcA = 00 are used at the same time—that
to get SrcA from the PC (i.e., from signal OldPC). ALUSrcB = 10 to get is, in parallel—more than one
unit can be used in a single step.
the constant 4 for SrcB. ALUOp = 00, so that the ALU adds PC to 4. To
update the PC with PC+4, ResultSrc = 10 to choose ALUResult as the
Result, and PCUpdate = 1 to force PCWrite high (see Figure 7.28). Reset
Figure 7.35 shows the modified Fetch state. The rest of the diagram S0: Fetch
remains the same as in Figure 7.33. Figure 7.36 highlights in blue the AdrSrc = 0
IRWrite
data flow for computing PC+4. The instruction fetch, which is occurring ALUSrcA = 00
ALUSrcB =10
simultaneously, is highlighted in gray. ALUOp = 00
ResultSrc = 10
PCUpdate
sw
Now, we expand the Main FSM to handle more RISC-V instructions. Figure 7.35 Incrementing PC in
All instructions pass through the first two states, Fetch and Decode. the Fetch state
428 CHAPTER SEVEN Microarchitecture
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
Zero
CLK
OldPC
ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
R-Type Instructions
After the Decode state, R-type ALU instructions proceed to the execute
(ExecuteR) state, which performs the desired ALU computation.
Namely, ALUSrcA = 10 and ALUSrcB = 00 to choose the contents of
rs1 as SrcA and rs2 as SrcB. ALUOp = 10 so that the ALU Decoder uses
the instruction’s control fields to determine what operation to perform.
7.4 Multicycle Processor 429
op = 0000011 (lw)
OR
op = 0100011 (sw)
S2: MemAdr
ALUSrcA = 10
ALUSrcB = 01
ALUOp = 00
op = op =
0000011 0100011
(lw) (sw)
S4: MemWB
ResultSrc = 01
RegWrite
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc
1:0
30
funct75 RegWrite
Zero
CLK
OldPC
EN EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
Memory
WriteData
ReadData
11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
Figure 7.38 Data flow during the memory write (MemWrite) state
430 CHAPTER SEVEN Microarchitecture
Reset
S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite
ALUSrcA = 00
ALUSrcB = 10
ALUOp = 00
ResultSrc = 10
PCUpdate
op = 0000011 (lw) op =
OR 0110011
op = 0100011 (sw) (R-type)
S4: MemWB
ResultSrc = 01
RegWrite
beq
The final instruction, beq, compares two registers and computes the
branch target address. Thus far, the ALU is idle during the Decode state,
so we can use the ALU during that state to calculate the branch target
7.4 Multicycle Processor 431
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
Zero
CLK
OldPC
ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
Figure 7.40 Data flow during the ExecuteR and ALUWB states
Reset
S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite ALUSrcA = 01
ALUSrcA = 00 ALUSrcB = 01
ALUSrcB =10 ALUOp = 00
ALUOp = 00
ResultSrc = 10
PCUpdate
op = 0000011 (lw) op =
op =
OR 0110011
1100011
op = 0100011 (sw) (R-type)
(beq)
op = op =
0000011 0100011
(lw) (sw)
S4: MemWB
ResultSrc = 01
RegWrite
Figure 7.41 Enhanced Decode state, with branch target address calculation, and BEQ state
shows the data flow during the BEQ state. The path for comparing rs1
and rs2 is shown in dark blue and the path to (conditionally) set PC to
the target address is in gray through the Result register. This concludes
the design of the controller for these instructions.
7 . 4 . 3 More Instructions
As we did with the single-cycle processor, we next consider examples
of how to modify the multicycle processor datapath and controller to
handle new instructions: I-type ALU instructions (addi, andi, ori,
slti) and jal.
7.4 Multicycle Processor 433
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
Zero
CLK
OldPC
ALU
EN A EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 S rcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
Expand the multicycle processor to include I-type ALU instructions addi, andi,
ori, and slti.
Solution These I-type ALU instructions are nearly the same as their R-type equivalents
(add, and, or, and slt) except that the second source comes from ImmExt rather than
the register file. We introduce the ExecuteI state to perform the desired computation
for all I-type ALU instructions. This state is like ExecuteR except that ALUSrcB = 01
to choose ImmExt as SrcB. After the ExecuteI state, I-type ALU instructions proceed
to the ALU writeback (ALUWB) state to write the result to the register file. Figure 7.43
shows the enhanced Main FSM, which also includes the JAL state for Example 7.6.
Expand the multicycle processor to include the jump and link instruction (jal).
Solution Like the I-type ALU instructions from Example 7.5, no additional hard-
ware is needed to implement the jal instruction. Only the Main FSM needs to be
434 CHAPTER SEVEN Microarchitecture
Reset
S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite ALUSrcA = 01
ALUSrcA = 00 ALUSrcB = 01
ALUSrcB =10 ALUOp = 00
ALUOp = 00
ResultSrc = 10
PCUpdate
op = 0000011 (lw) op = op = op =
op =
OR 0110011 0010011 1101111
1100011
op = 0100011 (sw) (R-type) (I-type ALU) (jal)
(beq)
S2: MemAdr S6: ExecuteR S8: ExecuteI S9: JAL S10: BEQ
ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 01 ALUSrcA = 10
ALUSrcB = 01 ALUSrcB = 00 ALUSrcB = 01 ALUSrcB = 10 ALUSrcB = 00
ALUOp = 00 ALUOp = 10 ALUOp = 10 ALUOp = 00 ALUOp = 01
ResultSrc = 00 ResultSrc = 00
PCUpdate Branch
op = op =
0000011 0100011
(lw) (sw)
S4: MemWB
ResultSrc = 01
RegWrite
updated. The first two steps are the same as the other instructions. During the
You may notice that by the
time the processor reaches
Decode state, the jump target address is calculated using the same flow as the
the JAL state, the PC register branch target address calculation but with ImmSrc = 11, as set by the Instruction
has already been updated to Decoder. Thus, during the Decode state, the jump offset is sign-extended and
PC+4. So we could have just added to the current PC (contained in signal OldPC) to form the jump target
used the PC register output address, which is written to the ALUOut register at the end of that state. jal then
to write to rd. But that would
have required us to extend
proceeds to the JAL state, where the processor writes the target address to PC
the Result multiplexer to and calculates the return address (PC+4) so that it can write it to rd in the next
receive PC as an input. The state. The ALU calculates PC+4 (i.e., OldPC+4) using ALUSrcA = 01 (SrcA =
solution above requires less OldPC), ALUSrcB = 10 (SrcB = 4), and ALUOp = 00 for addition. To write the
hardware because it uses the target address to the PC, ResultSrc = 00 to select the target address (held in
existing datapath.
ALUOut) as Result, and PCUpdate = 1 so that PCWrite is asserted. Figure 7.43
7.4 Multicycle Processor 435
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc
1:0
30
funct75 RegWrite
Zero
CLK
OldPC
ALU
EN EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
shows the new JAL state, and Figure 7.44 shows the data flow during the JAL
state. The flow for updating the PC to the target address is in gray and the PC+4
calculation is in blue. After the JAL state, jal proceeds to the ALUWB state,
where the return address (ALUOut = PC+4) is written to rd. This concludes the
jal instruction, so the Main FSM then goes back to the Fetch state.
Putting these steps together, Figure 7.45 shows the complete Main FSM
state transition diagram for the multicycle processor. The function of
each state is summarized below the figure. Converting the diagram to
hardware is a straightforward but tedious task using the techniques of
Chapter 3. Better yet, the FSM can be coded in an HDL and synthesized
using the techniques of Chapter 4.
7 . 4 . 4 Performance Analysis
The execution time of an instruction depends on both the number
of cycles it uses and the cycle time. While the single-cycle processor
performed all instructions in one cycle, the multicycle processor uses
436 CHAPTER SEVEN Microarchitecture
Reset
S0: Fetch
AdrSrc = 0 S1: Decode
IRWrite ALUSrcA = 01
ALUSrcA = 00 ALUSrcB = 01
ALUSrcB = 10 ALUOp = 00
ALUOp = 00
ResultSrc = 10
PCUpdate
op = 0000011 (lw) op = op = op =
op =
OR 0110011 0010011 1101111
1100011
op = 0100011 (sw) (R-type) (I-type ALU) (jal)
(beq)
S2: MemAdr S6: ExecuteR S8: ExecuteI S9: JAL S10: BEQ
ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 10 ALUSrcA = 01 ALUSrcA = 10
ALUSrcB = 01 ALUSrcB = 00 ALUSrcB = 01 ALUSrcB = 10 ALUSrcB = 00
ALUOp = 00 ALUOp = 10 ALUOp = 10 ALUOp = 00 ALUOp = 01
ResultSrc = 00 ResultSrc = 00
PCUpdate Branch
op = op =
0000011 0100011
(lw) (sw)
S4: MemWB
ResultSrc = 01
RegWrite
CLK
PCWrite
AdrSrc Control
MemWrite Unit
IRWrite ResultSrc1:0
ALUControl2:0
ALUSrcB1:0
6:0
op ALUSrcA1:0
14:12
funct3 ImmSrc1:0
30
funct75 RegWrite
Zero
Zero
CLK
OldPC
ALU
EN EN 00
1 24:20 Rs2
Instr / Data A2 RD2 00 SrcB 01
WriteData
ReadData
Memory 11:7 Rd 01 10
A3 Register
WD 4 10
WD3 File
CLK
31:7 Extend ImmExt
Data
Result
Solution The average CPI is the weighted sum over each instruction of the CPI
for that instruction multiplied by the fraction of time that instruction is used. For
this benchmark, average CPI = (0.11)(3) + (0.10 + 0.02 + 0.52)(4) + (0.25)(5) = 4.14.
This is better than the worst-case CPI of 5, which would be required if all
instructions took the same number of cycles.
2. The path to read data from memory: From the ALUOut register
through the Result and Adr muxes to read memory into the Data
register (highlighted in thick gray lines)
Both of these paths also require a delay through the decoder after the
state updates (i.e., after a tpcq delay) to produce the control (multiplexer
select and ALUControl) signals. Thus, the clock period is given in
Equation 7.4.
Tc _ multi = t pcq + tdec + 2tmux + max [t ALU , tmem ] + t setup (7.4)
The numerical values of these times will depend on the specific imple-
mentation technology.
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
Instr
Time (ps)
Dec
1 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
Dec
2 Fetch Execute Memory Wr
Read
Instruction ALU Read/Write Reg
Reg
(a)
Instr
Dec
1 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
Dec
2 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
Dec
3 Fetch Execute Memory Wr
Read
Instruction ALU Read / Write Reg
Reg
(b)
Figure 7.47 Timing diagrams: (a) single-cycle processor and (b) pipelined processor
And so forth, until all the instructions complete. The instruction latency is
5 × 200 = 1000 ps. Because the stages are not perfectly balanced with equal
amounts of logic, the latency is longer for the pipelined processor than
for the single-cycle processor. The throughput is 1 instruction per 200 ps
(5 billion instructions per second)—that is, one instruction completes
every clock cycle. This throughput is 3.4 times as much as the single-cycle
processor—not quite 5 times but, nonetheless, a substantial speedup.
Figure 7.48 shows an abstracted view of the pipeline in operation
in which each stage is represented pictorially. Each pipeline stage is
represented with its major component—instruction memory (IM), reg-
ister file (RF) read, ALU execution, data memory (DM), and register
file writeback—to illustrate the flow of instructions through the pipe-
line. Reading across a row shows the clock cycle in which a particular
instruction is in each stage. For example, the sub instruction is fetched
in cycle 3 and executed in cycle 5. Reading down a column shows what
the various pipeline stages are doing on a particular cycle. For example,
in cycle 6, the register file is writing a sum to s3, the data memory is
idle, the ALU is computing (s11 & t0), t4 is being read from the regis-
ter file, and the or instruction is being fetched from instruction memory.
Stages are shaded to indicate when they are used. For example, the data
memory is used by lw in cycle 4 and by sw in cycle 8. The instruction
memory and ALU are used in every cycle. The register file is written by
every instruction except sw. In the pipelined processor, the register file is
7.5 Pipelined Processor 441
1 2 3 4 5 6 7 8 9 10
Time (cycles)
s0
lw DM s2
lw s2, 40(s0) IM RF 40 + RF
s9
add DM s3
add s3, s9, s10 IM RF s10 + RF
t1
sub DM s4
sub s4, t1, s8 IM RF s8 - RF
s11
and DM s5
and s5, s11, t0 IM RF t0 & RF
t4
sw DM
sw s6, 20(t4) IM RF 20 + RF
t2
or DM s7
or s7, t2, t3 IM RF t3 | RF
used twice in every cycle: it is written in the first part of a cycle and read
in the second part, as suggested by the shading. This way, data can be
written by one instruction and read by another within a single cycle.
A central challenge in pipelined systems is handling hazards that
occur when one instruction’s result is needed by a subsequent instruction
before the former instruction has completed. For example, if the add in
Figure 7.48 used s2 as a source instead of s10, a hazard would occur
because the s2 register has not yet been written by the lw instruction
when it is read by add in cycle 3. After designing the pipelined datapath
and control, this section explores forwarding, stalls, and flushes as meth-
ods to resolve hazards. Finally, this section revisits performance analysis
considering sequencing overhead and the impact of hazards.
7 . 5 . 1 Pipelined Datapath
The pipelined datapath is formed by chopping the single-cycle datapath
into five stages separated by pipeline registers. Figure 7.49(a) shows the
single-cycle datapath stretched out to leave room for the pipeline registers.
Figure 7.49(b) shows the pipelined datapath formed by inserting four
pipeline registers to separate the datapath into five stages. The stages
and their boundaries are indicated in blue. Signals are given a suffix
(F, D, E, M, or W) to indicate the stage in which they reside.
The register file is peculiar because it is read in the Decode stage and
written in the Writeback stage. So, although the register file is drawn in the
Decode stage, its write address and write data come from the Writeback
stage. This feedback will lead to pipeline hazards, which are discussed in
442 CHAPTER SEVEN Microarchitecture
CLK CLK
CLK
19:15 WE3 SrcAE WE
0 PC' PC Instr A1 RD1 Zero
A RD ReadData 00
1 ALUResult
ALU
A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory 11:7
A3 1 Memory
Register WriteData
WD3 WD
File
PCTarget
+
+
4 ImmExt
31:7 Extend
PCPlus4
(a) Result
Zero
CLK
19:15 WE3 RD1E SrcAE WE
0 PCF' PCF InstrD A1 RD1
A RD 00
1 ALUResultM ReadDataW
ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory 11:7
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE +
+
4 ImmExtD ImmExtE
31:7 Extend
ResultW
Section 7.5.3. The register file in the pipelined processor writes on the falling
edge of CLK so that it can write a result in the first half of a cycle and read
that result in the second half of the cycle for use in a subsequent instruction.
One of the subtle but critical issues in pipelining is that all signals
associated with a particular instruction must advance through the pipeline
in unison. Figure 7.49(b) has an error related to this issue. Can you find it?
The error is in the register file write logic, which should operate in
the Writeback stage. The data value comes from ResultW, a Writeback
stage signal. But the destination register comes from RdD (InstrD11:7),
which is a Decode stage signal. In the pipeline diagram of Figure 7.48,
during cycle 5, the result of the lw instruction would be incorrectly writ-
ten to s5 rather than s2.
7.5 Pipelined Processor 443
ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
ResultW
7 . 5 . 2 Pipelined Control
The pipelined processor uses the same control signals as the single-cycle
processor and, therefore, has the same control unit. The control unit
examines the op, funct3, and funct75 fields of the instruction in the Decode
stage to produce the control signals, as was described in Section 7.3.3
for the single-cycle processor. These control signals must be pipelined along
with the data so that they remain synchronized with the instruction.
The entire pipelined processor with control is shown in Figure 7.51.
RegWrite must be pipelined into the Writeback stage before it feeds back
to the register file, just as Rd was pipelined in Figure 7.50. In addition to
R-type ALU instructions, lw, sw, and beq, this pipelined processor also
supports jal and I-type ALU instructions.
7 . 5 . 3 Hazards
In a pipelined system, multiple instructions are handled concurrently.
When one instruction is dependent on the results of another that has
444 CHAPTER SEVEN Microarchitecture
PCSrcE
ZeroE
CLK CLK CLK
ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
ResultW
not yet completed, a hazard occurs. The register file is written during
the first half of the cycle and read during the second half of the cycle, so
a register can be written and read back in the same cycle without intro-
ducing a hazard.
Figure 7.52 illustrates hazards that occur when one instruction
writes a register (s8) and subsequent instructions read this register. The
blue arrows highlight when s8 is written to the register file (in cycle
5) as compared to when it is needed by subsequent instructions. This
is called a read after write (RAW) hazard. The add instruction writes a
result into s8 in the first half of cycle 5. However, the sub instruction
reads s8 on cycle 3, obtaining the wrong value. The or instruction reads
s8 on cycle 4, again obtaining the wrong value. The and instruction
reads s8 in the second half of cycle 5, obtaining the correct value, which
was written in the first half of cycle 5. Subsequent instructions also read
the correct value of s8. The diagram shows that hazards may occur in
this pipeline when an instruction writes a register and either of the two
subsequent instructions reads that register. Without special treatment,
the pipeline will compute the wrong result.
A software solution would be to require the programmer or compiler
to insert nop instructions between the add and sub instructions so that
the dependent instruction does not read the result (s8) until it is available
7.5 Pipelined Processor 445
1 2 3 4 5 6 7 8
Time (cycles)
s4
add s8
add s8, s4, s5 IM RF s5 + DM RF
s8
sub s2
sub s2, s8, s3 IM RF s3 - DM RF
t6
or s9
or s9, t6, s8 IM RF s8 | DM RF
s8
and s7
and s7, s8, t2 IM RF t2 & DM RF
1 2 3 4 5 6 7 8 9 10
Time (cycles)
s4
add s8
add s8, s4, s5 IM RF s5 + DM RF
nop
nop IM RF DM RF
nop
nop IM RF DM RF
s8
sub s2
sub s2, s8, s3 IM RF s3 - DM RF
t6
or DM s9
or s9, t6, s8 IM RF s8 | RF
s8
and DM s7
and s7, s8, t2 IM RF t2 & RF
1 2 3 4 5 6 7 8
Time (cycles)
s4
add s8 DM s8
add s8, s4, s5 IM RF s5 + RF
s8
sub s2
sub s2, s8, s3 IM RF s3 - DM RF
t6
or s9
or s9, t6, s8 IM RF s8 | DM RF
s8
and s7
and s7, s8, t2 IM RF t2 & DM RF
PCSrcE
ZeroE
CLK CLK CLK
6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0
ALU
RD2E A RD 01
Instruction 24:20 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
PCPlus4F PCPlus4D l
PCPus4E PCPlus4M
PCPlus4W
PCTargetE
ResultW
ForwardAE
ForwardBE
Hazard Unit
Figure 7.55 Pipelined processor with forwarding to solve some data hazards
stages, RdM and RdW. It also receives the RegWrite signals from the
Memory and Writeback stages (RegWriteM and RegWriteW) to know
whether the destination register will actually be written (e.g., the sw and
beq instructions do not write results to the register file and, hence, do
not have their results forwarded).
The Hazard Unit computes control signals for the forwarding mul-
tiplexers to choose operands from the register file or from the results
in the Memory or Writeback stage (ALUResultM or ResultW). The
Hazard Unit should forward from a stage if that stage will write a des-
tination register and the destination register matches the source register.
However, x0 is hardwired to 0 and should never be forwarded. If both
the Memory and Writeback stages contain matching destination reg-
isters, then the Memory stage should have priority because it contains
the more recently executed instruction. In summary, the function of the
forwarding logic for SrcAE (ForwardAE) is given on the next page.
The forwarding logic for SrcBE (ForwardBE) is identical except that it
checks Rs2E instead of Rs1E.
448 CHAPTER SEVEN Microarchitecture
if ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0) then // Forward from Memory stage
ForwardAE = 10
else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0) then // Forward from Writeback stage
ForwardAE = 01
else ForwardAE = 00 // No forwarding (use RF output)
1 2 3 4 5 6 7 8
Time (cycles)
s5 s7
1w
1w s7, 40(s5) IM RF 40 + DM RF
Trouble!
s7
and s8
and s8, s7, t3 IM RF t3 & DM RF
s6 t2
or
or t2, s6, s7 IM RF s7 | DM RF
s7
sub s3
sub s3, s7, s2 IM RF s2 – DM RF
1 2 3 4 5 6 7 8 9
Time (cycles)
s5
lw s7
lw s7, 40(s5) IM RF 40 + DM RF
s7 s7
and s8
and s8, s7, t3 IM RF t3 RF t3 & DM RF
s6
or or t2
or t2, s6, s7 IM IM RF s7 | DM RF
Stall s7
sub s3
sub s3, s7, s2 IM RF s2 - DM RF
PCSrcE
ZeroE
CLK CLK CLK
6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0
ALU
EN
RD2E A RD 01
Instruction 24:20 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
PCPlus4W
PCTargetE
ResultW
ResultSrcE0
ForwardAE
ForwardBE
FlushE
StallD
StallF
Hazard Unit
1 2 3 4 5 6 7 8 9 10
Time (cycles)
s1
beq DM
20 beq s1, s2, L1 IM RF s2 - RF
t1
sub DM
24 sub s8, t1, s3 IM RF s3 RF Flush
these
or instruc!ons
28 or s9, t6, s5 IM RF DM RF
2C ...
... ...
s3
add DM s7
58 L1: add s7, s3, s4 IM RF s4 + RF
Figure 7.59 Abstract pipeline diagram illustrating flushing when a branch is taken
FlushD = PCSrcE
FlushE = lwStall | PCSrcE
452 CHAPTER SEVEN Microarchitecture
PCSrcE
ZeroE
CLK CLK CLK
6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0
ALU
EN
A RD 01
Instruction 24:20 RD2E 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
CLR
EN
PCPlus4W
PCTargetE
ResultW
ForwardAE
ForwardBE
FlushD
FlushE
StallD
StallF
HazardUnit
Figure 7.60 Expanded Hazard Unit for handling branch control hazard
Hazard Summary
In summary, RAW data hazards occur when an instruction depends on
a result (from another instruction) that has not yet been written into the
register file. Data hazards can be resolved by forwarding if the result is
computed soon enough; otherwise, they require stalling the pipeline until
the result is available. Control hazards occur when the decision of what
instruction to fetch has not been made by the time the next instruction
must be fetched. Control hazards are solved by stalling the pipeline until
the decision is made or by predicting which instruction should be fetched
and flushing the pipeline if the prediction is later determined to be wrong.
Moving the decision as early as possible minimizes the number of instruc-
tions that are flushed on a misprediction. You may have observed by now
that one of the challenges of designing a pipelined processor is to under-
stand all possible interactions between instructions and to discover all
of the hazards that may exist. Figure 7.61 shows the complete pipelined
processor handling all of the hazards. The hazard logic is summarized on
the next page.
7.5 Pipelined Processor 453
PCSrcE
ZeroE
CLK CLK CLK
6:0
BranchD BranchE
op
14:12
ALUControlD2:0 ALUControlE2:0
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0
ALU
EN
A RD 01
Instruction 24:20 RD2E 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
CLR
EN
PCPlus4W
PCTargetE
ResultW
ForwardAE
ForwardBE
FlushD
FlushE
StallD
StallF
Hazard Unit
3
Recall that the forwarding logic for SrcBE (ForwardBE) is identical except that it checks
Rs2E instead of Rs1E.
454 CHAPTER SEVEN Microarchitecture
7 . 5 . 4 Performance Analysis
The pipelined processor ideally would have a CPI of 1 because a new
instruction is issued—that is, fetched—every cycle. However, a stall or a
flush wastes 1 to 2 cycles, so the CPI is slightly higher and depends on
the specific program being executed.
Solution The average CPI is the weighted sum over each instruction of the CPI
for that instruction multiplied by the fraction of time that instruction is used.
Loads take one clock cycle when there is no dependency and two cycles when
the processor must stall for a dependency, so they have a CPI of (0.6)(1) + (0.4)
(2) = 1.4. Branches take one clock cycle when they are predicted properly and
three when they are not, so they have a CPI of (0.5)(1) + (0.5)(3) = 2. Jumps
take three clock cycles (CPI = 3). All other instructions have a CPI of 1. Hence,
for this benchmark, the average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(2) + (0.02)
(3) + (0.52)(1) = 1.25.
We can determine the cycle time by considering the critical path in each
of the five pipeline stages shown in Figure 7.61. Recall that the register
file is used twice in a single cycle: it is written in the first half of the
Writeback cycle and read in the second half of the Decode cycle; so these
stages can use only half of the cycle time for their critical path. Another
way of saying it is this: twice the critical path for each of those stages
must fit in a cycle. Figure 7.62 shows the critical path for the Execute
stage. It occurs when a branch is in the Execute stage that requires for-
warding from the Writeback stage: the path goes from the Writeback
pipeline register, through the Result, ForwardBE, and SrcB multiplexers,
The critical path analysis for through the ALU and AND-OR logic to the PC multiplexer and, finally,
the Execute stage assumes to the PC register.
that the Hazard Unit delay
for calculating ForwardAE t pcq + tmem + t setup
Fetch
and ForwardBE is less than
or equal to the delay of the 2(tRFread + t setup ) Decode
Result multiplexer. If the
Tc _ pipelined = max t pcq + 4tmux + t ALU + t AND−OR + t setup Execute
Hazard Unit delay is longer,
t Memory
it must be included in the pcq + tmem + t setup
critical path instead of the 2(t Writeback
Result multiplexer delay. pcq + tmux + tRFsetup )
(7.5)
7.5 Pipelined Processor 455
PCSrcE
ZeroE
CLK CLK CLK
6:0
BranchD BranchE
op ALUControlD2:0 ALUControlE2:0
14:12
funct3
30
ALUSrcD ALUSrcE
funct75
ImmSrcD1:0
ALU
10
EN
A RD 01
Instruction 24:20 RD2E 10
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
PCD PCE
+
19:15 Rs1D Rs1E
24:20 Rs2D Rs2E
11:7 RdD RdE RdM RdW
+
4 ImmExtD ImmExtE
31:7 Extend
CLR
EN
PCPlus4W
PCTargetE
ResultW
ForwardAE
ForwardBE
FlushD
FlushE
StallD
StallF
Hazard Unit
Ben Bitdiddle needs to compare the pipelined processor performance with that
of the single-cycle and multicycle processors considered in Examples 7.4 and
7.8. The logic delays were given in Table 7.7 (on page 415). Help Ben compare
the execution time of 100 billion instructions from the SPECINT2000 bench-
mark for each processor.
Solution According to Equation 7.5, the cycle time of the pipelined processor is
Tc_pipelined = max[40 + 200 + 50, 2(100 + 50), 40 + 4(30) + 120 + 20 + 50, 40 + Our pipelined processor is
200 + 50, 2(40 + 30 + 60)] = 350 ps. The Execute stage takes the longest. According unbalanced, with branch
to Equation 7.1, the total execution time is Tpipelined = (100 × 109 instructions) resolution in the Execute
(1.25 cycles / instruction)(350 × 10−12 s / cycle) = 44 seconds. This compares with stage taking much longer
than any other stage. The
75 seconds for the single-cycle processor and 155 seconds for the multicycle processor.
pipeline could be balanced
better by pushing the Result
multiplexer back into the
The pipelined processor is substantially faster than the others.
Memory stage, reducing the
However, its advantage over the single-cycle processor is nowhere near cycle time to 320 ps.
the fivefold speedup one might hope to get from a five-stage pipeline.
456 CHAPTER SEVEN Microarchitecture
RISCVsingle
A
Controller Instr Instruction
Memory
RegSrc
PCSrc
ResultSrc
ALUControl
ALUSrc
ImmSrc
RegWrite
MemWrite
Instr
RD
Zero
the datapath from Figure 7.15 and the controller from Figure 7.16. The
controller, in turn, is composed of the Main Decoder and the ALU Decoder.
The HDL code is partitioned into several sections. Section 7.6.1
provides HDL for the single-cycle processor datapath and controller.
Section 7.6.2 presents the generic building blocks, such as registers and
multiplexers, which are used by any microarchitecture. Section 7.6.3
introduces the test program, testbench, and external memories. The
HDL and test program are available in electronic form on this book’s
website (see the Preface).
7 . 6 . 1 Single-Cycle Processor
The main modules of the single-cycle processor module are given in the
following HDL examples.
SystemVerilog VHDL
module riscvsingle(input logic clk, reset, library IEEE;
output logic [31:0] PC, use IEEE.STD_LOGIC_1164.all;
input logic [31:0] Instr,
output logic MemWrite, entity riscvsingle is
output logic [31:0] ALUResult, WriteData, port(clk, reset: in STD_LOGIC;
input logic [31:0] ReadData); PC: out STD_LOGIC_VECTOR(31 downto 0);
Instr: in STD_LOGIC_VECTOR(31 downto 0);
logic ALUSrc, RegWrite, Jump, Zero; MemWrite: out STD_LOGIC;
logic [1:0] ResultSrc, ImmSrc; ALUResult, WriteData: out STD_LOGIC_VECTOR(31 downto 0);
logic [2:0] ALUControl; ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end;
controller c(Instr[6:0], Instr[14:12], Instr[30], Zero,
ResultSrc, MemWrite, PCSrc, architecture struct of riscvsingle is
ALUSrc, RegWrite, Jump, component controller
ImmSrc, ALUControl); port(op: in STD_LOGIC_VECTOR(6 downto 0);
datapath dp(clk, reset, ResultSrc, PCSrc, funct3: in STD_LOGIC_VECTOR(2 downto 0);
ALUSrc, RegWrite, funct7b5, Zero: in STD_LOGIC;
ImmSrc, ALUControl, ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
Zero, PC, Instr, MemWrite: out STD_LOGIC;
ALUResult, WriteData, ReadData); PCSrc, ALUSrc: out STD_LOGIC;
endmodule RegWrite, Jump: out STD_LOGIC;
ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
end component;
component datapath
port(clk, reset: in STD_LOGIC;
ResultSrc: in STD_LOGIC_VECTOR(1 downto 0);
PCSrc, ALUSrc: in STD_LOGIC;
RegWrite: in STD_LOGIC;
ImmSrc: in STD_LOGIC_VECTOR(1 downto 0);
ALUControl: in STD_LOGIC_VECTOR(2 downto 0);
Zero: out STD_LOGIC;
PC: out STD_LOGIC_VECTOR(31 downto 0);
Instr: in STD_LOGIC_VECTOR(31 downto 0);
ALUResult, WriteData: out STD_LOGIC_VECTOR(31 downto 0);
ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end component;
458 CHAPTER SEVEN Microarchitecture
end;
SystemVerilog VHDL
module controller(input logic [6:0] op, library IEEE;
input logic [2:0] funct3, use IEEE.STD_LOGIC_1164.all;
input logic funct7b5,
input logic Zero, entity controller is
output logic [1:0] ResultSrc, port(op: in STD_LOGIC_VECTOR(6 downto 0);
output logic MemWrite, funct3: in STD_LOGIC_VECTOR(2 downto 0);
output logic PCSrc, ALUSrc, funct7b5, Zero: in STD_LOGIC;
output logic RegWrite, Jump, ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
output logic [1:0] ImmSrc, MemWrite: out STD_LOGIC;
output logic [2:0] ALUControl); PCSrc, ALUSrc: out STD_LOGIC;
logic [1:0] ALUOp; RegWrite: out STD_LOGIC;
logic Branch; Jump: buffer STD_LOGIC;
ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
maindec md(op, ResultSrc, MemWrite, Branch, ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
ALUSrc, RegWrite, Jump, ImmSrc, ALUOp); end;
aludec ad(op[5], funct3, funct7b5, ALUOp, ALUControl);
architecture struct of controller is
assign PCSrc = Branch & Zero | Jump; component maindec
endmodule port(op: in STD_LOGIC_VECTOR(6 downto 0);
ResultSrc: out STD_LOGIC_VECTOR(1 downto 0);
MemWrite: out STD_LOGIC;
Branch, ALUSrc: out STD_LOGIC;
RegWrite, Jump: out STD_LOGIC;
ImmSrc: out STD_LOGIC_VECTOR(1 downto 0);
ALUOp: out STD_LOGIC_VECTOR(1 downto 0));
end component;
component aludec
port(opb5: in STD_LOGIC;
funct3: in STD_LOGIC_VECTOR(2 downto 0);
funct7b5: in STD_LOGIC;
ALUOp: in STD_LOGIC_VECTOR(1 downto 0);
ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
end component;
SystemVerilog VHDL
module aludec(input logic opb5, library IEEE;
input logic [2:0] funct3, use IEEE.STD_LOGIC_1164.all;
input logic funct7b5,
input logic [1:0] ALUOp, entity aludec is
output logic [2:0] ALUControl); port(opb5: in STD_LOGIC;
funct3: in STD_LOGIC_VECTOR(2 downto 0);
logic RtypeSub; funct7b5: in STD_LOGIC;
assign RtypeSub = funct7b5 & opb5; // TRUE for R–type subtract ALUOp: in STD_LOGIC_VECTOR(1 downto 0);
ALUControl: out STD_LOGIC_VECTOR(2 downto 0));
always_comb end;
case(ALUOp)
2'b00: ALUControl = 3'b000; // addition architecture behave of aludec is
2'b01: ALUControl = 3'b001; // subtraction signal RtypeSub: STD_LOGIC;
default: case(funct3) // R–type or I–type ALU begin
3'b000: if (RtypeSub) RtypeSub <= funct7b5 and opb5; –– TRUE for R–type subtract
ALUControl = 3'b001; // sub process(opb5, funct3, funct7b5, ALUOp, RtypeSub) begin
else case ALUOp is
460 CHAPTER SEVEN Microarchitecture
SystemVerilog VHDL
module datapath(input logic clk, reset, library IEEE;
input logic [1:0] ResultSrc, use IEEE.STD_LOGIC_1164.all;
input logic PCSrc, ALUSrc, use IEEE.STD_LOGIC_ARITH.all;
input logic RegWrite,
input logic [1:0] ImmSrc, entity datapath is
input logic [2:0] ALUControl, port(clk, reset: in STD_LOGIC;
output logic Zero, ResultSrc: in STD_LOGIC_VECTOR(1 downto 0);
output logic [31:0] PC, PCSrc, ALUSrc: in STD_LOGIC;
input logic [31:0] Instr, RegWrite: in STD_LOGIC;
output logic [31:0] ALUResult, WriteData, ImmSrc: in STD_LOGIC_VECTOR(1 downto 0);
input logic [31:0] ReadData); ALUControl: in STD_LOGIC_VECTOR(2 downto 0);
Zero: out STD_LOGIC;
logic [31:0] PCNext, PCPlus4, PCTarget; PC: buffer STD_LOGIC_VECTOR(31 downto 0);
logic [31:0] ImmExt; Instr: in STD_LOGIC_VECTOR(31 downto 0);
logic [31:0] SrcA, SrcB; ALUResult, WriteData: buffer STD_LOGIC_VECTOR(31 downto 0);
logic [31:0] Result; ReadData: in STD_LOGIC_VECTOR(31 downto 0));
end;
// next PC logic
flopr #(32) pcreg(clk, reset, PCNext, PC); architecture struct of datapath is
adder pcadd4(PC, 32'd4, PCPlus4); component flopr generic(width: integer);
adder pcaddbranch(PC, ImmExt, PCTarget); port(clk, reset: in STD_LOGIC;
mux2 #(32) pcmux(PCPlus4, PCTarget, PCSrc, PCNext); d: in STD_LOGIC_VECTOR(width−1 downto 0);
q: out STD_LOGIC_VECTOR(width−1 downto 0));
// register file logic end component;
regfile rf(clk, RegWrite, Instr[19:15], Instr[24:20], component adder
Instr[11:7], Result, SrcA, WriteData); port(a, b: in STD_LOGIC_VECTOR(31 downto 0);
extend ext(Instr[31:7], ImmSrc, ImmExt); y: out STD_LOGIC_VECTOR(31 downto 0));
end component;
// ALU logic
component mux2 generic(width: integer);
mux2 #(32) srcbmux(WriteData, ImmExt, ALUSrc, SrcB);
port(d0, d1: in STD_LOGIC_VECTOR(width−1 downto 0);
alu alu(SrcA, SrcB, ALUControl, ALUResult, Zero);
s: in STD_LOGIC;
mux3 #(32) resultmux(ALUResult, ReadData, PCPlus4,
y: out STD_LOGIC_VECTOR(width−1 downto 0));
ResultSrc, Result);
end component;
endmodule
component mux3 generic(width: integer);
port(d0, d1, d2: in STD_LOGIC_VECTOR(width−1 downto 0);
s: in STD_LOGIC_VECTOR(1 downto 0);
y: out STD_LOGIC_VECTOR(width−1 downto 0));
end component;
component regfile
port(clk: in STD_LOGIC;
we3: in STD_LOGIC;
a1, a2, a3: in STD_LOGIC_VECTOR(4 downto 0);
7.6 HDL Representation 461
SystemVerilog VHDL
module adder(input [31:0] a, b, library IEEE;
output [31:0] y); use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD_UNSIGNED.all;
assign y = a + b;
endmodule entity adder is
port(a, b: in STD_LOGIC_VECTOR(31 downto 0);
y: out STD_LOGIC_VECTOR(31 downto 0));
end;
SystemVerilog VHDL
module extend(input logic [31:7] instr, library IEEE;
input logic [1:0] immsrc, use IEEE.STD_LOGIC_1164.all;
output logic [31:0] immext);
entity extend is
always_comb port(instr: in STD_LOGIC_VECTOR(31 downto 7);
case(immsrc) immsrc: in STD_LOGIC_VECTOR(1 downto 0);
// I−type immext: out STD_LOGIC_VECTOR(31 downto 0));
2'b00: immext = {{20{instr[31]}}, instr[31:20]}; end;
// S−type (stores)
2'b01: immext = {{20{instr[31]}}, instr[31:25], architecture behave of extend is
instr[11:7]}; begin
// B−type (branches) process(instr, immsrc) begin
2'b10: immext = {{20{instr[31]}}, instr[7], case immsrc is
instr[30:25], instr[11:8], 1’b0}; –– I–type
// J−type (jal) when "00" =>
2'b11: immext = {{12{instr[31]}}, instr[19:12], immext <= (31 downto 12 => instr(31)) & instr(31 downto 20);
instr[20], instr[30:21], 1’b0}; –– S–types (stores)
default: immext = 32'bx; // undefined when "01" =>
endcase immext <= (31 downto 12 => instr(31)) &
endmodule instr(31 downto 25) & instr(11 downto 7);
–– B–type (branches)
when "10" =>
immext <= (31 downto 12 => instr(31)) & instr(7) & instr(30
downto 25) & instr(11 downto 8) & '0';
–– J–type (jal)
when "11" =>
immext <= (31 downto 20 => instr(31)) &
instr(19 downto 12) & instr(20) &
instr(30 downto 21) & '0';
when others =>
immext <= (31 downto 0 => '–');
end case;
end process;
end;
SystemVerilog VHDL
module flopr #(parameter WIDTH = 8) library IEEE;
(input logic clk, reset, use IEEE.STD_LOGIC_1164.all;
input logic [WIDTH−1:0] d, use IEEE.STD_LOGIC_ARITH.all;
output logic [WIDTH−1:0] q);
entity flopr is
always_ff @(posedge clk, posedge reset) generic(width: integer);
if (reset) q <= 0; port(clk, reset: in STD_LOGIC;
else q <= d; d: in STD_LOGIC_VECTOR(width−1 downto 0);
endmodule q: out STD_LOGIC_VECTOR(width−1 downto 0));
end;
SystemVerilog VHDL
module flopenr #(parameter WIDTH = 8) library IEEE;
(input logic clk, reset, en, use IEEE.STD_LOGIC_1164.all;
input logic [WIDTH–1:0] d, use IEEE.STD_LOGIC_ARITH.all;
output logic [WIDTH–1:0] q);
entity flopenr is
always_ff @(posedge clk, posedge reset) generic(width: integer);
if (reset) q <= 0; port(clk, reset, en: in STD_LOGIC;
else if (en) q <= d; d: in STD_LOGIC_VECTOR(width–1 downto 0);
endmodule q: out STD_LOGIC_VECTOR(width–1 downto 0));
end;
architecture asynchronous of flopenr is
begin
process(clk, reset, en) begin
if reset = '1' then q <= (others => '0');
elsif rising_edge(clk) and en = '1' then q <= d;
end if;
end process;
end;
SystemVerilog VHDL
module mux2 #(parameter WIDTH = 8) library IEEE;
(input logic [WIDTH−1:0] d0, d1, use IEEE.STD_LOGIC_1164.all;
input logic s, entity mux2 is
output logic [WIDTH−1:0] y); generic(width: integer := 8);
port(d0, d1: in STD_LOGIC_VECTOR(width−1 downto 0);
assign y = s ? d1 : d0;
s: in STD_LOGIC;
endmodule y: out STD_LOGIC_VECTOR(width−1 downto 0));
end;
architecture behave of mux2 is
begin
y <= d1 when s = '1' else d0;
end;
# riscvtest.s
# Sarah.Harris@unlv.edu
# David_Harris@hmc.edu
# 27 Oct 2020
#
# Test the RISC-V processor:
# add, sub, and, or, slt, addi, lw, sw, beq, jal
# If successful, it should write the value 25 to address 100
# RISC-V Assembly Description Address Machine Code
main: addi x2, x0, 5 # x2 = 5 0 00500113
addi x3, x0, 12 # x3 = 12 4 00C00193
addi x7, x3, -9 # x7 = (12 - 9) = 3 8 FF718393
or x4, x7, x2 # x4 = (3 OR 5) = 7 C 0023E233
and x5, x3, x4 # x5 = (12 AND 7) = 4 10 0041F2B3
add x5, x5, x4 # x5 = 4 + 7 = 11 14 004282B3
beq x5, x7, end # shouldn't be taken 18 02728863
slt x4, x3, x4 # x4 = (12 < 7) = 0 1C 0041A233
beq x4, x0, around # should be taken 20 00020463
addi x5, x0, 0 # shouldn't execute 24 00000293
around: slt x4, x7, x2 # x4 = (3 < 5) = 1 28 0023A233
add x7, x4, x5 # x7 = (1 + 11) = 12 2C 005203B3
sub x7, x7, x2 # x7 = (12 - 5) = 7 30 402383B3
sw x7, 84(x3) # [96] = 7 34 0471AA23
lw x2, 96(x0) # x2 = [96] = 7 38 06002103
add x9, x2, x5 # x9 = (7 + 11) = 18 3C 005104B3
jal x3, end # jump to end, x3 = 0x44 40 008001EF
addi x2, x0, 1 # shouldn't execute 44 00100113
end: add x2, x2, x9 # x2 = (7 + 18) = 25 48 00910133
sw x2, 0x20(x3) # [100] = 25 4C 0221A023
done: beq x2, x2, done # infinite loop 50 00210063
7 . 6 . 3 Testbench
00500113
00C00193 The testbench loads a program into the memories. The program in
FF718393 Figure 7.64 exercises all of the instructions by performing a computa-
0023E233
0041F2B3 tion that should produce the correct result only if all of the instructions
004282B3 are functioning correctly. Specifically, the program will write the value
02728863 25 to address 100 if it runs correctly, but it is unlikely to do so if the
0041A233
00020463
hardware is buggy. This is an example of ad hoc testing.
00000293 The machine code is stored in a text file called riscvtest.txt
0023A233 (Figure 7.65) which is loaded by the testbench during simulation. The
005203B3
402383B3
file consists of the machine code for the instructions written in hexadeci-
0471AA23 mal, one instruction per line.
06002103 The testbench, top-level RISC-V module (that instantiates the
005104B3
008001EF
RISC-V processor and memories), and external memory HDL code are
00100113 given in the following examples. The testbench instantiates the top-level
00910133 module being tested and generates a periodic clock and a reset at the
0221A023
00210063
start of the simulation. It checks for memory writes and reports success
if the correct value (25) is written to address 100. The memories in this
Figure 7.65 riscvtest.txt example hold 64 32-bit words each.
7.6 HDL Representation 465
SystemVerilog VHDL
module testbench(); library IEEE;
use IEEE.STD_LOGIC_1164.all;
logic clk; use IEEE.NUMERIC_STD_UNSIGNED.all;
logic reset;
logic [31:0] WriteData, DataAdr; entity testbench is
logic MemWrite; end;
SystemVerilog VHDL
module top(input logic clk, reset, library IEEE;
output logic [31:0] WriteData, DataAdr, use IEEE.STD_LOGIC_1164.all;
output logic MemWrite); use IEEE.NUMERIC_STD_UNSIGNED.all;
SystemVerilog VHDL
module imem(input logic [31:0] a, library IEEE;
output logic [31:0] rd); use IEEE.STD_LOGIC_1164.all;
use STD.TEXTIO.all;
logic [31:0] RAM[63:0]; use IEEE.NUMERIC_STD_UNSIGNED.all;
use ieee.std_logic_textio.all;
initial
$readmemh("riscvtest.txt",RAM); entity imem is
port(a: in STD_LOGIC_VECTOR(31 downto 0);
assign rd = RAM[a[31:2]]; // word aligned
rd: out STD_LOGIC_VECTOR(31 downto 0));
endmodule
end;
return ram_content;
end function;
SystemVerilog VHDL
module dmem(input logic clk, we, library IEEE;
input logic [31:0] a, wd, use IEEE.STD_LOGIC_1164.all;
output logic [31:0] rd); use STD.TEXTIO.all;
use IEEE.NUMERIC_STD_UNSIGNED.all;
logic [31:0] RAM[63:0];
entity dmem is
assign rd = RAM[a[31:2]]; // word aligned port(clk, we: in STD_LOGIC;
a, wd: in STD_LOGIC_VECTOR(31 downto 0);
always_ff @(posedge clk)
rd: out STD_LOGIC_VECTOR(31 downto 0));
if (we) RAM[a[31:2]] <= wd;
end;
endmodule
architecture behave of dmem is
begin
process is
type ramtype is array (63 downto 0) of
STD_LOGIC_VECTOR(31 downto 0);
variable mem: ramtype;
begin
–– read or write memory
loop
if rising_edge(clk) then
if (we = '1') then mem(to_integer(a(7 downto 2))) := wd;
end if;
end if;
rd <= mem(to_integer(a(7 downto 2)));
wait on clk, a;
end loop;
end process;
end;