Name: USC ID#: .
I hereby affirm that all the answers below are my own. I have
neither searched online nor taken assistance from any external
entity.
---------------------------------
Student Signature Above
EE557—Spring 2024
MIDTERM 1
Open Books and Notes
No electronics but a calculator is allowed
Time limit: 1 hour and 50 minutes
Q1: / 20
Q2: / 20
Q3: / 10
Q4: / 18
Q5: / 16
Q6: / 20
TOTAL: / 104
The total number of pages is 18 (including this page). Staple and turn in all pages.
Put your name on every page where noted
Name: __________________________
Question 1 [Power] (20 points)
A) What is the difference between dynamic and static power? (2 points)
Static power is the power dissipated whenever the circuit is powered on, even when its inputs are not switching.
Dynamic power is the power dissipated when the inputs are actively switching.
Give full credit if students mention anything similar to toggling and switching.
B) Describe what sub-threshold power leakage is for CMOS chips. Will clock-gating help to address
sub-threshold power leakage? (3 points)
(2pts) Even when the transistors are OFF, there exists a small current across transistors causing energy
dissipation.
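(1pt) Clock-gating cannot mitigate sub-threshold leakage: it only stops switching activity (dynamic power), while the leakage current flows whenever the transistors are powered.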
C) Chip designers embrace hybrid CPU architectures in which E (efficient) and P (performance)
cores exist in a single package. E cores are optimized for power efficiency, and P cores are for
high-performance tasks. In this question, assume the core specifications in Table 1 when clocked at f.
Given 100 units of workload and a CPU with 2 P cores and 4 E cores, how should the workload be
distributed across the 6 cores to obtain the shortest runtime? Write down the shortest
runtime and the energy consumption in terms of P_d, P_s, and T. (5 points)
                                    P Cores    E Cores
Dynamic Power                       6 ⋅ P_d    P_d
Static Power                        2 ⋅ P_s    P_s
Time taken for 1 unit of workload   T/4        T
Table 1: Core specification when clocked at f.
1pt Give each P core 33.33 units of workload.
1pt Give each E core 8.33 units of workload.
1pt Execution finishes in 8.33T.
2pt Total energy consumption is (16 P_d + 8 P_s) ⋅ 8.33T ≈ 133.3 T P_d + 66.7 T P_s
Page 2 of 18
D) Repeat Part C for 100 units of workload when the CPU has 4 P cores and no E cores. (5 points)
2pt Give each P core 25 units of workload.
1pt Execution finishes in 6.25T.
2pt Total energy is (24 P_d + 8 P_s) ⋅ 6.25T = 150 T P_d + 50 T P_s
E) Repeat Part C if we increase the VDD on the E cores only and clock them at 1.5f. (5 points)
1pt Give each P core 28.57 units of workload.
1pt Give each E core 10.71 units of workload.
1pt Execution finishes in 7.14T.
2pt Total energy is (21 P_d + 10 P_s) ⋅ 7.14T ≈ 150 T P_d + 71.4 T P_s
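The distributions in Parts C to E follow one pattern: the runtime is shortest when every core finishes at the same moment, so each core receives work in proportion to its speed. The sketch below is one way to check the numbers with exact fractions. The function name `distribute` and the tuple layout are illustrative choices, and the E-core power at 1.5f (9/4 of the base dynamic power and 3/2 of the base static power per core) is an assumption inferred from the totals in the Part E answer, not something stated in the question.

```python
from fractions import Fraction

def distribute(cores, work=100):
    """cores: list of (count, rate, dyn, stat), one tuple per core type.
    rate = work units finished per T; dyn/stat = per-core power in units of
    the base dynamic and static power. Returns (runtime in T, units per core
    for each type, dynamic-energy coefficient, static-energy coefficient)."""
    total_rate = sum(n * Fraction(rate) for n, rate, _, _ in cores)
    runtime = Fraction(work) / total_rate          # all cores finish together
    shares = [Fraction(rate) * runtime for _, rate, _, _ in cores]
    dyn = sum(n * Fraction(d) for n, _, d, _ in cores) * runtime
    stat = sum(n * Fraction(s) for n, _, _, s in cores) * runtime
    return runtime, shares, dyn, stat

P = (4, 6, 2)                                      # P core: rate, dynamic, static
part_c = distribute([(2, *P), (4, 1, 1, 1)])
part_d = distribute([(4, *P)])
# Assumption: each E core at 1.5f draws 9/4 dynamic and 3/2 static units.
part_e = distribute([(2, *P), (4, Fraction(3, 2), Fraction(9, 4), Fraction(3, 2))])

for name, (runtime, shares, dyn, stat) in zip("CDE", (part_c, part_d, part_e)):
    print(name, float(runtime), [float(s) for s in shares], float(dyn), float(stat))
```

With exact fractions, Part C comes out as a runtime of 25/3 T with energy coefficients 400/3 and 200/3, which match the rounded values in the answer.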
Question 2 [Speedup] (20 points)
A) You are developing a new enhancement that provides a 2.5x speedup to certain kinds of instructions.
What percentage of a program, as measured by its original execution time, must consist of these
instructions if you want to gain an overall speedup of 10%? (3 points)
Solution: 1.1 = 1/(1 - f + f/2.5)
1/(1 - 3f/5) = 1.1
3f/5 = 1 - 1/1.1
f = 0.152 or 15.2%
Grading: 2 points for equation, 1 for final simplification
B) For the program identified in Part (A), what is the maximum possible speedup if the described
enhancement provides infinite speedup? (2 points)
Solution: Speedup = 1/(1-0.152) = 1.18
Grading: 2 points for the equation
C) Now let us consider a different machine where two enhancements are proposed: one that can enhance
40% of execution time with a speedup of 1.5, and another that can enhance 25% of execution time
with some greater speedup value. Only one of these two can be implemented. How much of a
speedup is necessary in the second enhancement to beat the first enhancement? (3 points)
Solution: Tnew1 > Tnew2
0.6+0.4/1.5 > 0.75+0.25/s
7/60 > 0.25/s
s > 15/7 or 2.14
Grading: 1 point per side of the inequality and 1 point for simplification (do not deduct for calculation
errors)
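Parts A to C are all rearrangements of Amdahl's law, speedup = 1/((1 - f) + f/s). The following sketch shows the three rearrangements side by side; the function names are illustrative and not part of the exam.

```python
def speedup(f, s):
    """Overall speedup when a fraction f of the original time is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)

def fraction_needed(target, s):
    """Solve target = 1/((1 - f) + f/s) for f (Part A)."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)

def speedup_limit(f):
    """Speedup as s goes to infinity (Part B)."""
    return 1.0 / (1.0 - f)

def break_even(f1, s1, f2):
    """Speedup s2 on fraction f2 that exactly matches speedup(f1, s1) (Part C).
    Only meaningful when the first enhancement leaves time above 1 - f2."""
    new_time_1 = (1.0 - f1) + f1 / s1
    return f2 / (new_time_1 - (1.0 - f2))

f = fraction_needed(1.1, 2.5)
print(round(f, 3))                           # 0.152
print(round(speedup_limit(f), 2))            # 1.18
print(round(break_even(0.4, 1.5, 0.25), 2))  # 2.14
```

Any value of s above the break-even point makes the second enhancement the better choice.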
D) Suppose an important program you run has the following characteristics.
Instruction Type     % of execution time
Load from memory     12
Store to memory      6
FP multiplication    17
Others               65
You are considering upgrading your machine to one of two possible configurations. M1 reduces the
contribution of loads to the execution time by 3✕ and that of stores by 2✕. M2 reduces the contribution
of (only) floating-point multiplications to the execution time by 10✕. What is the overall speedup of the
program for each upgrade? You may express your answer in terms of an equation with all variables
explicitly substituted. You are not required to perform numerical calculations. (3 points)
Solution:
Speedup of M1 = 1/(0.12/3 + 0.06/2 + 0.17 + 0.65) = 1/0.89 = 1.124 or 12.4%
Speedup of M2 = 1/(0.12 + 0.06 + 0.17/10 + 0.65) = 1/0.847 = 1.181 or 18.1%
Grading: 1.5 points for the speedup equation for M1. 1.5 points for the speedup equation for M2.
E) What is the maximum possible speedup if all memory accesses could be sped up infinitely? What is
the maximum speedup if all (and only) floating-point multiplications could be sped up infinitely? You
may express your answer in terms of an equation with all variables explicitly substituted. You are not
required to perform numerical calculations. (3 points)
Solution:
Speedup memory: 1/(0.17+0.65) = 1/0.82
Speedup FP: 1/(0.12+0.06+0.65) = 1/0.83
Grading: 1.5 points for the speedup equation for memory. 1.5 points for the speedup equation for FP.
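Parts D and E use the multi-component form of Amdahl's law: divide each category's share of the original time by its own speedup factor and sum the results. A minimal sketch, with the hypothetical helper `overall_speedup` and dictionary keys chosen only for this illustration:

```python
PROFILE = {"load": 0.12, "store": 0.06, "fp_mul": 0.17, "other": 0.65}

def overall_speedup(profile, factors):
    """profile: fraction of original time per category.
    factors: speedup per category (categories not listed keep factor 1).
    float('inf') models an infinitely fast unit, since x / inf == 0.0."""
    new_time = sum(share / factors.get(name, 1.0) for name, share in profile.items())
    return 1.0 / new_time

INF = float("inf")
print(round(overall_speedup(PROFILE, {"load": 3, "store": 2}), 3))      # 1.124 (M1)
print(round(overall_speedup(PROFILE, {"fp_mul": 10}), 3))               # 1.181 (M2)
print(round(overall_speedup(PROFILE, {"load": INF, "store": INF}), 2))  # 1.22
print(round(overall_speedup(PROFILE, {"fp_mul": INF}), 2))              # 1.2
```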
F) Several researchers have suggested that adding a register-memory addressing mode to a load-store
machine might be useful. The idea is to replace sequences of
LD R1, 0(R8)
ADD R2, R2, R1
with
ADD R2, 0(R8)
Assume that the new instruction will cause the clock cycle time to increase by 5% and will not affect
the CPI (cycles per instruction). Also, assume loads constitute 25.1% of all instructions. What
percentage of the loads must be eliminated for the machine with the new instruction to have at least
the same performance? (6 points)
Solution:
Let L be the fraction of loads that are eliminated. This means that 0.251 × L of all instructions are
eliminated.
CPU time old = # of instructions × CPI × cycle time
CPU time new = ((1 − 0.251 × L) × # of instructions) × CPI × ((1 + 0.05) × cycle time)
CPU time new ≤ CPU time old
(1 − 0.251 × L) × 1 × 1.05 ≤ 1
0.251 × L ≥ 1 − 1/1.05
L ≥ 0.19 or 19%
Grading: 2 points for the basic time equation. 3 points for setting up the equation correctly to solve the
problem. No deductions for calculation mistakes.
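The break-even condition above can be checked numerically. This sketch assumes, as the solution does, that CPI is unchanged and that every eliminated load removes exactly one instruction; the function names are illustrative.

```python
def min_loads_eliminated(load_share, cycle_time_increase):
    """Smallest fraction L of loads to remove so that
    (1 - load_share * L) * (1 + cycle_time_increase) <= 1."""
    return (1.0 - 1.0 / (1.0 + cycle_time_increase)) / load_share

def relative_cpu_time(load_share, eliminated, cycle_time_increase):
    """New CPU time divided by old CPU time (instruction count x cycle time)."""
    return (1.0 - load_share * eliminated) * (1.0 + cycle_time_increase)

L = min_loads_eliminated(0.251, 0.05)
print(round(L, 3))                                  # 0.19
print(round(relative_cpu_time(0.251, L, 0.05), 6))  # 1.0
```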
Question 3 [Tomasulo] (10 points)
Your processor currently has a small 2-bit saturating counter-based branch predictor which performs
moderately well. It has 8 Integer Functional Units and 4 Floating Point Units (FPUs), 256KB of on-chip
caches, 4 reservation stations for the Integer Units, and 2 reservation stations for FPUs. The Reorder
Buffer has 8 entries. The processor has a 25 stage pipeline.
The applications you care for have a small code size and work on small data sets in the range of 64 KB.
These applications spend most of their time in loops whose iterations are independent of each other, but
typically have only a limited amount of ILP within a single iteration (within the current processor
implementation).
You have some extra transistor budget that you can spend in (possibly several of) the following ways:
1. Improve the branch predictor accuracy.
2. Add more reservation stations to your Tomasulo’s Algorithm-based Dynamic Scheduler.
3. Add more FPUs and Integer Units.
4. Add more Reorder Buffer entries.
Some of these may be desirable additions, while others may not be too beneficial given the current
configuration. Which of the above four additions should you support, and which ones should you oppose
(you can support/oppose multiple of these)? You need to justify your choices below to receive credit.
1. Improve the branch predictor accuracy: Support or Oppose, and why?
1. Improve the branch predictor accuracy: This is a desirable addition. The problem states that the
branch predictor performs only moderately well and the processor has a long pipeline making branch
mispredictions expensive. Thus, improving branch prediction accuracy is quite likely to increase
performance.
2. Add more reservation stations to your Tomasulo’s Algorithm-based Dynamic Scheduler: Support or
Oppose, and why?
2. Add more reservation stations to your Tomasulo’s Algorithm-based Dynamic Scheduler: This is a
desirable addition. More reservation stations would mean a larger window within which the processor
can search for ready instructions to execute, thus it can discover more parallelism and keep execution
units busy. This would lead to better performance, especially since our application needs to discover
parallelism across loop iterations.
3. Add more FPUs and Integer Units: Support or Oppose, and why?
3. Add more FPUs and Integer Units: This doesn’t seem to be a good addition. The current machine
already has enough FUs and we should try to improve other aspects of the processor. Adding more FUs
won’t help if the processor is unable to discover enough parallelism in the instruction stream to keep
them busy.
4. Add more Reorder Buffer entries: Support or Oppose, and why?
4. Add more Reorder Buffer entries: This is a desirable addition. The current configuration has very few
ROB entries. A larger ROB helps to mask the effects of long-latency instructions and helps search for
parallelism within a larger window (this goes together with (2)).
Grading: 2.5 points for correctly analyzing each part. 10 points total. Give partial credit (1 point) if the
student gives a valid reason why an improvement can be avoided.
Question 4 [Software Optimization] (18 points)
Consider a single-issue, in-order five stage pipeline similar to those studied in class, but with the
following specification:
Functional Unit   Cycles in EX   No. of Functional Units   Pipelined
Integer           1              1                         Yes
FP Add/Sub        3              1                         Yes
FP Mult           8              1                         Yes
FP Division       24             1                         Yes
• The integer functional unit performs integer addition (including effective address calculation for
loads/stores), subtraction, and logic operations.
• There is full forwarding and bypassing, including forwarding from the end of a functional unit to
the MEM stage for stores.
• Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the
effective address calculation.
• There are as many registers, both FP and integer, as you need.
• Branches are resolved in ID and there is one branch delay slot.
• While the hardware has full forwarding and bypassing, it is the responsibility of the compiler to
schedule such that the operands of each instruction are available when needed by each
instruction.
• If multiple instructions finish their EX stages in the same cycle, then we will assume they can all
proceed to the MEM stage together. Similarly, if multiple instructions finish their MEM stages in
the same cycle, then we will assume they can all proceed to the WB stage together. In other
words, for the purpose of this problem, you are to ignore structural hazards on the MEM and WB
stages.
This problem explores the ability of the compiler to schedule code as efficiently as possible for such a
pipeline. Consider the following code (also repeated on the next pages for reference):
loop:
L.D F4, 0(R1)
MUL.D F8, F4, F0
L.D F6, 0(R2)
ADD.D F10, F6, F2
ADD.D F12, F8, F10
S.D F12, 0(R3)
DADDIU R1, R1, #8
DADDIU R2, R2, #8
DADDIU R3, R3, #8
SUB.D R5, R4, R1
BNEZ R5, loop
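Note: DADDIU is the 64-bit version of ADDIU and performs an ADD IMMEDIATE operation. SUB.D is applied to integer registers here, so treat it as an integer subtract that spends one cycle in the integer unit.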
A) Rewrite the above loop (repeated below for reference), but let every row take a cycle (each row can
be an instruction or a stall). If an instruction can’t be issued on a given cycle (because the current
instruction has a dependency that will not be resolved in time), write “stall” instead, and move on to
the next cycle (row) to see if it can be issued then. Assume that a NOP is scheduled in the branch
delay slot (effectively stalling 1 cycle after the branch). Explain the cause of all stalls, but don’t
reorder instructions. How many cycles elapse before the second iteration begins? (6 points)
L.D F4, 0(R1)
stall RAW F4
MUL.D F8, F4, F0
L.D F6, 0(R2)
stall RAW F6
ADD.D F10, F6, F2
stall RAW F8, F10
stall RAW F8, F10
stall RAW F8
stall RAW F8
ADD.D F12, F8, F10
stall RAW F12
S.D F12, 0(R3)
DADDIU R1, R1, #8
DADDIU R2, R2, #8
DADDIU R3, R3, #8
SUB.D R5, R4, R1
stall RAW R5 (branch resolved in ID)
BNEZ R5, loop
NOP stall for branch delay
20 cycles elapse before the second iteration begins.
Grading: 1 point for each sequence of stalls (1*5); 1 point for a correct cycle count; ½ point partial credit
for identifying that a stall is needed between a pair of instructions, but with an incorrect number of
cycles. Negative ½ point for each unnecessary sequence of stalls
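The stall counts above can be reproduced with a small in-order timing model. The sketch below is an illustration under stated assumptions, not the grading procedure: an instruction fetched in cycle c is in ID in c+1 and starts EX in c+2; a load's value is usable after its MEM stage; a store needs its data only in MEM; a branch needs its operand in ID. The tuple layout and the name `schedule` are hypothetical, and SUB.D on integer registers is modeled as a 1-cycle integer operation.

```python
EX_CYCLES = {"int": 1, "load": 1, "store": 1, "branch": 1, "fadd": 3, "fmul": 8}

def schedule(program):
    """program: list of (text, kind, dest, srcs, store_data).
    Returns one row per cycle: either the instruction text or "stall"."""
    ready = {}          # register -> last cycle in which its value is still being produced
    rows, cycle = [], 0
    for text, kind, dest, srcs, store_data in program:
        cycle += 1
        earliest = cycle
        for reg in srcs:
            done = ready.get(reg, 0)
            # A branch reads in ID (cycle c+1); everything else reads at EX start (c+2).
            earliest = max(earliest, done if kind == "branch" else done - 1)
        if store_data is not None:
            # Store data is forwarded straight to MEM (cycle c+3).
            earliest = max(earliest, ready.get(store_data, 0) - 2)
        rows += ["stall"] * (earliest - cycle)
        cycle = earliest
        rows.append(text)
        if dest is not None:
            ready[dest] = cycle + 3 if kind == "load" else cycle + 1 + EX_CYCLES[kind]
    return rows

PART_A = [
    ("L.D F4, 0(R1)",      "load",   "F4",  ("R1",),       None),
    ("MUL.D F8, F4, F0",   "fmul",   "F8",  ("F4", "F0"),  None),
    ("L.D F6, 0(R2)",      "load",   "F6",  ("R2",),       None),
    ("ADD.D F10, F6, F2",  "fadd",   "F10", ("F6", "F2"),  None),
    ("ADD.D F12, F8, F10", "fadd",   "F12", ("F8", "F10"), None),
    ("S.D F12, 0(R3)",     "store",  None,  ("R3",),       "F12"),
    ("DADDIU R1, R1, #8",  "int",    "R1",  ("R1",),       None),
    ("DADDIU R2, R2, #8",  "int",    "R2",  ("R2",),       None),
    ("DADDIU R3, R3, #8",  "int",    "R3",  ("R3",),       None),
    ("SUB.D R5, R4, R1",   "int",    "R5",  ("R4", "R1"),  None),  # integer subtract
    ("BNEZ R5, loop",      "branch", None,  ("R5",),       None),
    ("NOP",                "int",    None,  (),            None),  # branch delay slot
]

rows = schedule(PART_A)
print("\n".join(rows))
print(len(rows), "cycles")   # 20 cycles
```

Feeding a reordered instruction list into the same function gives a quick way to count the stalls that remain in a candidate schedule.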
B) Now reschedule the loop to compute the same results as quickly as possible. You can change
immediate values and memory offsets and reorder instructions without violating any of the
dependencies, but don’t change anything else. Show any stalls that remain. How many cycles elapse
before the second iteration begins? Show your work. (6 points)
Solution:
L.D F4, 0(R1)
L.D F6, 0(R2)
MUL.D F8, F4, F0
ADD.D F10, F6, F2
DADDIU R1, R1, #8
DADDIU R2, R2, #8
DADDIU R3, R3, #8
SUB.D R5, R4, R1
stall RAW F8
stall RAW F8
ADD.D F12, F8, F10
BNEZ R5, loop
S.D F12, -8(R3)
13 cycles elapse before the second iteration begins.
Grading: Full points for any correct sequence with minimum number of stalls. Partial credit only if the
sequence does the same computation and reduces some stalls. Deduct ½ point for each error e.g.
incorrect index. Deduct ½ point for each stall in excess of 2.
C) Now unroll the loop the minimum number of times needed to eliminate all stalls (with rescheduling).
Show the unrolled and rescheduled loop. You can, and should, remove redundant instructions. How
many original iterations of the loop are in an iteration of your new unrolled loop? How many cycles
elapse before the next iteration of the unrolled loop begins? Don’t worry about start-up or clean-up
code outside the unrolled loop. Assume a very large number of iterations for the original loop. Show
your work. (6 points)
Solution: Note that in the solution below, the registers used could be different and there is some
flexibility in scheduling the instructions.
L.D F4, 0(R1)
L.D F6, 0(R2)
MUL.D F8, F4, F0
L.D F14, 8(R1)
L.D F16, 8(R2)
MUL.D F18, F14, F0
ADD.D F10, F6, F2
ADD.D F20, F16, F2
DADDIU R1, R1, #16
DADDIU R2, R2, #16
DADDIU R3, R3, #16
ADD.D F12, F8, F10
SUB.D R5, R4, R1
ADD.D F22, F18, F20
S.D F12, -16(R3)
BNEZ R5, loop
S.D F22, -8(R3)
The loop has an unroll factor of 2: there are two iterations of the original loop in a single iteration of the
new loop. 17 cycles elapse before the next iteration of the unrolled loop begins.
Grading: 1 point for the correct iteration count. Deduct ½ point for each error or stall cycle. Give partial
credit (3 points) if three iterations are used instead of two and the solution is correct with three
iterations.
Question 5 [Tomasulo] (16 points)
In this problem, we try to understand the implications of the Reorder Buffer (ROB) size on performance.
• Consider a processor implementing the ROB.
• Each instruction goes through issue (IS), execute (EX), CDB, and commit/retire (CM).
• Assume IS includes instruction fetch, decode and issue.
• Assume IS, EX, CDB and CM each take one cycle (once all the conditions for these stages are
met), as discussed in class.
• Assume our machine can fetch, decode, issue, and commit 4 instructions each cycle.
• Assume a branch misprediction is handled when the branch instruction reaches the head of the
ROB. It involves flushing that ROB entry and all entries following that entry.
• For now, assume there are no memory accesses (this will change in part C).
A) Suppose we have a perfect branch predictor and there is no data dependency between instructions.
We have infinite execution units of each type and infinite reservation stations. All instructions take
one cycle in the EX-stage. What is the maximum achievable IPC? What is the minimum ROB size
required to guarantee that IPC? (4 points)
Solution: Each instruction holds its ROB entry for 4 cycles (Issue, Execute, CDB, and Commit). Therefore,
we must have a minimum ROB size of 16 to avoid any stalls. In that case, the throughput would be 4
instructions per cycle.
Grading: 2 points for correct IPC. 2 points for correct ROB size.
B) Suppose different FUs have different latencies in the EX-stage, as given by the following table.
Everything else is the same as in the previous part. What is the minimum size of the ROB required
now to avoid any issue stalls due to a full ROB? (4 points)
Functional Unit Cycles
Integer ALU 3
Floating Point Addition 7
Floating Point Multiplication 13
Solution: If we issue an instruction to the FP multiplier, it will occupy its ROB entry for 1 + 13 + 1 + 1 = 16 cycles, and
would also block the instructions following it from committing. During that period, we could have issued
4 × 16 = 64 instructions in all. Thus, we need a minimum ROB size of 64 to avoid stalls. In that case we would get
a throughput of 4 instructions per cycle.
Grading: 2 points for realizing that FP multiplier is the bottleneck. 2 points for correct ROB size.
C) In addition to the latencies above, now every 10th instruction is a load instruction. Assume the
address calculation and cache/memory access parts of the load both happen in the EX-stage. The hit
rate in the data cache is 95% and the misses are uniformly spaced through the instruction stream. A
hit takes 1 cycle in the EX-stage. However, upon a miss, the data has to be fetched from the memory
and this results in 120 cycles in the EX-stage. What is the ROB size required now to avoid any issue
stalls? (4 points)
Solution: A load instruction that misses in the cache would occupy its ROB entry for 1 + 120 + 1 + 1 = 123
cycles. During this time, we could have issued 492 instructions. This is the size of the ROB required to
continuously issue 4 instructions each cycle.
Grading: 2 points for calculating the correct latency of loads. 2 points for correct ROB size.
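Parts A to C apply the same sizing rule: the ROB must hold everything issued while the slowest instruction occupies its entry, that is, issue width times (IS + EX latency + CDB + CM). A short sketch of that rule, with an illustrative function name:

```python
def rob_entries(issue_width, worst_ex_cycles):
    """Entries needed so issue never stalls on a full ROB.
    Occupancy = 1 (IS) + worst-case EX cycles + 1 (CDB) + 1 (CM)."""
    occupancy = 1 + worst_ex_cycles + 1 + 1
    return issue_width * occupancy

print(rob_entries(4, 1))     # Part A: 16
print(rob_entries(4, 13))    # Part B: 64  (FP multiply)
print(rob_entries(4, 120))   # Part C: 492 (load miss)
```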
D) Now additionally assume we don’t have perfect branch prediction anymore. Instead, we have a
predictor with an accuracy of 95%. Assume every 8th instruction is a branch and the mispredictions
are uniformly spaced through the instruction stream. After a misprediction, how many instructions are
issued before the next misprediction is encountered? In light of this result, do you think we need a
ROB of the size you derived in Part C? Why/why not? If not, what do you think would be a good
ROB size to have? (4 points)
Solution: We would issue 8 × 20 = 160 instructions before a mispredict is encountered. Given this result,
it doesn’t make sense to have a ROB of size 492, because once we have a cache miss, soon after we
would run into a branch mispredict, and all the work done after that would be wasted anyway. A
reasonable ROB size would be 160.
Grading: 2 points for stating that a smaller ROB is sufficient. 2 points for correct ROB size.
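The 160-instruction figure comes from dividing the branch spacing by the misprediction rate. The sketch below uses exact fractions to avoid floating-point rounding; the function name is illustrative, and the accuracy is passed as a string so that it converts exactly.

```python
from fractions import Fraction

def instructions_between_mispredictions(branch_interval, accuracy):
    """branch_interval: one branch every N instructions.
    accuracy: predictor accuracy as a decimal string, e.g. "0.95"."""
    mispredict_rate = 1 - Fraction(accuracy)
    return Fraction(branch_interval) / mispredict_rate

gap = instructions_between_mispredictions(8, "0.95")
print(gap)              # 160
print(min(gap, 492))    # 160: the useful window is capped by the misprediction gap
```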
Question 6 [Branch Prediction] (20 Points)
Assume that the current state of global history register is: ghist[8:0]= 111010111 and the current branch
PC is PC[5:0]= 000101. You are asked to update a TAGE Branch predictor that has the following
predictor tables:
Table 0 (bimodal)
address  ctr (2 bits)
000      01
001      10
010      00
011      01
100      01
101      11
110      01
111      00

Tables 1-4 (tagged)
         Table 1        Table 2        Table 3        Table 4
address  ctr  tag  u    ctr  tag  u    ctr  tag  u    ctr  tag  u
00       001  110  00   101  101  01   110  111  01   100  011  00
01       101  010  11   000  011  00   101  100  00   101  100  00
10       100  100  01   101  000  10   010  010  01   000  101  11
11       111  100  10   111  110  11   011  000  10   111  001  10
You will have to use the global history, branch PC and the predictor tables to predict the outcome of the
branch then update the predictor tables.
You index the bimodal table (table 0) using the following bits from the branch PC: PC[5:3]
You index the predictor tables (tables 1-4) and compute the tag as follows. FHi is the folded history of i
bits that is computed from a subset of ghist bits as specified below.
Address: PC[4:3] ⊕ FH2, Tag: PC[2:0] ⊕ FH3, where ‘⊕’ is an xor operation
Note that table 1 uses history ghist[1:0], table 2 uses history ghist[3:0], table 3 uses history ghist[5:0], and
table 4 uses history ghist[7:0]. These ghist bits are used to compute FH2, FH3 which are then used for tag
and address computation. FHi is the folded global history and here is an example to show you how to
compute it:
Assume we are to fold 8 bits of global history ghist[7:0] into 4 bits or 3 bits:
To fold it into 4 bits → FH4 = ghist[7:4] ⊕ ghist[3:0]
To fold it into 3 bits → FH3 = ghist[8:6] ⊕ ghist[5:3] ⊕ ghist[2:0]
For all non-existing bits, we treat them as 0. Here ghist[8] is set to 0.
Note also that we are using a 3-bit counter in the tagged predictor components. Hence, 1xx is predicted
taken and 0xx is predicted not taken. A 2-bit counter is used in Table 0 (base predictor) so 1x is predicted
taken and 0x is predicted not taken.
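The folding rule and the address/tag formulas can be written out as a short sketch. It hard-codes the ghist and PC values given in the question; the helper names `fold` and `index_and_tag` are illustrative.

```python
GHIST = "111010111"   # ghist[8:0], most significant bit first
PC = 0b000101         # PC[5:0]

def fold(history_bits, width):
    """XOR together consecutive `width`-bit chunks of the history,
    starting from the low bits; missing high bits count as 0."""
    value = int(history_bits, 2)
    mask = (1 << width) - 1
    folded = 0
    while value:
        folded ^= value & mask
        value >>= width
    return folded

def index_and_tag(history_length):
    """Address = PC[4:3] xor FH2, Tag = PC[2:0] xor FH3,
    using only the lowest `history_length` bits of ghist."""
    history = GHIST[-history_length:]
    index = ((PC >> 3) & 0b11) ^ fold(history, 2)
    tag = (PC & 0b111) ^ fold(history, 3)
    return index, tag

for table, length in enumerate((2, 4, 6, 8), start=1):
    index, tag = index_and_tag(length)
    print(f"Table {table}: address {index:02b}, tag {tag:03b}")
# Table 1: address 11, tag 110
# Table 2: address 10, tag 010
# Table 3: address 11, tag 000
# Table 4: address 00, tag 011
```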
a) Find the outcome of the prediction from each predictor component (i.e. Tables 0-4). Show your
steps in detail. (13 points)
Table 0:
Prediction: Use PC[5:3] = 000 to index Table 0 and get 01, so predict not taken
Table 1:
Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 11 = 11
Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 011 = 110
Prediction: Stored tag is 100, so the tag doesn’t match → can’t predict
Table 2:
Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 10 = 10
Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 111 = 010
Prediction: Stored tag is 000, so the tag doesn’t match → can’t predict
Table 3:
Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 11 = 11
Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 101 = 000
Prediction: Tag matches and ctr = 011 → not taken
Table 4:
Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 00 = 00
Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 110 = 011
Prediction: Tag matches and ctr = 100 → taken
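To combine the per-table results into one prediction, a TAGE predictor normally takes the matching table with the longest history as the provider and falls back to the bimodal table when nothing matches. The question does not spell out that selection rule, so the sketch below treats it as an assumption. It uses the entries read at the computed addresses and the 1xx-is-taken rule from the question, under which ctr = 100 in Table 4 reads as taken.

```python
def tage_predict(base_ctr, lookups):
    """base_ctr: 2-bit bimodal counter at the indexed entry.
    lookups: (stored_ctr, stored_tag, computed_tag) for Tables 1..4,
    ordered from shortest to longest history.
    Returns (provider table, predicted taken); provider 0 is the bimodal table."""
    provider, taken = 0, bool(base_ctr & 0b10)
    for table, (ctr, stored_tag, computed_tag) in enumerate(lookups, start=1):
        if stored_tag == computed_tag:
            provider, taken = table, bool(ctr & 0b100)   # later (longer) match wins
    return provider, taken

LOOKUPS = [
    (0b111, 0b100, 0b110),   # Table 1, address 11: tag mismatch
    (0b101, 0b000, 0b010),   # Table 2, address 10: tag mismatch
    (0b011, 0b000, 0b000),   # Table 3, address 11: match, ctr 011
    (0b100, 0b011, 0b011),   # Table 4, address 00: match, ctr 100
]

print(tage_predict(0b01, LOOKUPS))   # (4, True)
```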
b) Assume the branch is taken. How will you update the predictor components? Mark the tables
below and show your steps. (7 points)
Table 0 (bimodal)
address  ctr (2 bits)
000      01
001      10
010      00
011      01
100      01
101      11
110      01
111      00

Tables 1-4 (tagged)
         Table 1        Table 2        Table 3        Table 4
address  ctr  tag  u    ctr  tag  u    ctr  tag  u    ctr  tag  u
00       001  110  00   101  101  01   110  111  01   101  011  01
01       101  010  11   000  011  00   101  100  00   101  100  00
10       100  100  01   101  000  10   010  010  01   000  101  11
11       111  100  10   111  110  11   011  000  10   111  001  10

Table 4 → increment the counter (100 → 101) and the u bits (00 → 01) since the prediction is correct.
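The update to the provider entry can be sketched with saturating counters. This follows only the rule stated in the answer (move the counter toward the outcome, and bump u when the provider was right). Leaving u unchanged on a wrong prediction is an assumption of this sketch, and full TAGE implementations also consult the alternate prediction and allocate new entries on a misprediction, which is not modeled here.

```python
def saturating_inc(value, bits):
    """Increment without exceeding the largest value that fits in `bits` bits."""
    return min(value + 1, (1 << bits) - 1)

def saturating_dec(value):
    """Decrement without going below zero."""
    return max(value - 1, 0)

def update_provider(ctr, u, taken, predicted_taken):
    """ctr: 3-bit prediction counter, u: 2-bit usefulness counter."""
    ctr = saturating_inc(ctr, 3) if taken else saturating_dec(ctr)
    if predicted_taken == taken:
        u = saturating_inc(u, 2)      # provider was correct
    return ctr, u

new_ctr, new_u = update_provider(0b100, 0b00, taken=True, predicted_taken=True)
print(f"{new_ctr:03b} {new_u:02b}")   # 101 01
```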