IBM Power5 Chip: A Dual-Core Multithreaded Processor

Featuring single- and multithreaded execution, the Power5 provides higher performance in the single-threaded mode than its Power4 predecessor at equivalent frequencies.
thread executes at any instance. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine's resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multithreading in the IBM eServer pSeries Model 680.2

A variant of coarse-grained multithreading is fine-grained multithreading. Machines of this class execute threads in successive cycles, in round-robin fashion.3 Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused.

Finally, in simultaneous multithreading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread.4 What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long-latency event.
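As a concrete illustration of the three policies, the toy C timeline below runs two threads over a single issue slot. The miss pattern, the one-cycle swap charge in the coarse-grained case (real swaps take several cycles, as noted above), and every other number are invented for illustration; this is a sketch, not a model of the Power5.

#include <stdio.h>

#define CYCLES 12

int main(void)
{
    /* 1 = thread can issue this cycle; 0 = stalled on a cache miss. */
    int ready[2][CYCLES] = {
        {1,1,0,0,0,1,1,1,1,1,1,1},  /* thread 0 misses around cycle 2 */
        {1,1,1,1,1,1,0,0,1,1,1,1},  /* thread 1 misses around cycle 6 */
    };
    int used, t, c;

    /* Coarse-grained: run one thread until it stalls, then swap.
       The swap is charged one lost cycle here; real swaps cost more. */
    used = 0; t = 0;
    for (c = 0; c < CYCLES; c++) {
        if (!ready[t][c]) { t ^= 1; continue; }  /* swap threads */
        used++;
    }
    printf("coarse-grained: %2d/%d issue slots used\n", used, CYCLES);

    /* Fine-grained: threads alternate cycle by cycle; a stalled
       thread's slot simply goes unused. */
    used = 0;
    for (c = 0; c < CYCLES; c++)
        used += ready[c & 1][c];
    printf("fine-grained:   %2d/%d issue slots used\n", used, CYCLES);

    /* SMT: each cycle, the slot goes to whichever thread is ready. */
    used = 0;
    for (c = 0; c < CYCLES; c++)
        used += (ready[0][c] || ready[1][c]);
    printf("SMT:            %2d/%d issue slots used\n", used, CYCLES);
    return 0;
}

Even this crude model shows SMT reclaiming the slots that the other two schemes lose to stalls and swap overhead.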
The Power5 design implements two-way SMT on each of the chip's two processor cores. Although a higher level of multithreading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multithreading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.

Power5 system structure
Figure 1 shows the high-level structures of Power4- and Power5-based systems. The Power4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric lets the Power5 more frequently satisfy level-two (L2) cache misses with hits in the 36-Mbyte off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests onto the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing Power5-based systems to scale to higher levels of symmetric multiprocessing. Initial Power5 systems support 64 physical processors.

Figure 1. Power4 (a) and Power5 (b) system structures.

The Power4 includes a 1.41-Mbyte on-chip L2 cache. Power4+ chips are similar in design to the Power4 but are fabricated in 130-nm technology rather than the Power4's 180-nm technology. The Power4+ includes a 1.5-Mbyte on-chip L2 cache, whereas the Power5 includes a 1.875-Mbyte on-chip L2 cache.
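The L2-then-L3 lookup order described above can be sketched as follows; the three functions are hypothetical stand-ins for the hardware, and the stub hit logic is arbitrary.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the cache lookups; the hit logic
   here is an arbitrary stub, not Power5 behavior. */
static bool l2_lookup(uint64_t addr) { return (addr & 0x3) == 0; }
static bool l3_lookup(uint64_t addr) { return (addr & 0x3) == 1; }
static void fabric_request(uint64_t addr)
{
    printf("fabric request for %#llx\n", (unsigned long long)addr);
}

static void reference(uint64_t addr)
{
    if (l2_lookup(addr)) return;   /* hit in the on-chip L2 */
    if (l3_lookup(addr)) return;   /* hit in the processor-side L3:
                                      no interchip fabric traffic */
    fabric_request(addr);          /* only a miss in both goes out
                                      onto the interconnection fabric */
}

int main(void)
{
    reference(0x1000);  /* L2 hit */
    reference(0x1001);  /* L3 hit */
    reference(0x1002);  /* fabric request */
    return 0;
}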
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables (BHTs) shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time.

In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For predicting the target of a subroutine return, the processor uses a return stack, one for each thread. For predicting the target of other branches, it uses a shared target cache. If there is a taken branch, the processor loads the program counter with the branch's target address. Otherwise, it loads the program counter with the address of the next sequential instruction to fetch from.
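A minimal sketch of the selector arrangement, in C: two direction predictors vote, and a third table, trained only when they disagree, picks which one to believe. Table sizes, the indexing functions, and the use of 2-bit counters are illustrative assumptions, not the Power5's actual BHT organization.

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4096   /* invented size */

static uint8_t bimodal[ENTRIES];   /* 2-bit saturating counters */
static uint8_t pathcorr[ENTRIES];  /* path-correlated predictor */
static uint8_t selector[ENTRIES];  /* which predictor to trust */
static uint32_t path;              /* global path/branch history */

static bool taken(uint8_t c) { return c >= 2; }
static uint8_t bump(uint8_t c, bool t)
{
    if (t) return c < 3 ? c + 1 : 3;
    return c > 0 ? c - 1 : 0;
}

bool predict(uint64_t pc)
{
    uint32_t i  = pc & (ENTRIES - 1);          /* bimodal index */
    uint32_t ip = (pc ^ path) & (ENTRIES - 1); /* path-based index */
    bool b = taken(bimodal[i]);
    bool p = taken(pathcorr[ip]);
    return taken(selector[i]) ? p : b;         /* selector picks */
}

void update(uint64_t pc, bool outcome)
{
    uint32_t i  = pc & (ENTRIES - 1);
    uint32_t ip = (pc ^ path) & (ENTRIES - 1);
    bool b = taken(bimodal[i]), p = taken(pathcorr[ip]);
    /* Train the selector only when the two predictors disagree. */
    if (b != p) selector[i] = bump(selector[i], p == outcome);
    bimodal[i]  = bump(bimodal[i], outcome);
    pathcorr[ip] = bump(pathcorr[ip], outcome);
    path = (path << 1) | outcome;
}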
After fetching, the Power5 places instructions in the predicted path in separate instruction fetch queues for the two threads (D0 stage). Like the Power4, the Power5 can dispatch up to five instructions each cycle. On the basis of thread priorities, the processor selects instructions from one of the instruction fetch queues and forms a group (D1, D2, and D3 stages). All instructions in a group come from the same thread and are decoded in parallel.
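The selection and group-formation step might be sketched as below; the strict highest-priority-wins rule is a simplification of the Power5's priority-based selection, and the queue structure is invented.

#define GROUP_MAX 5
#define QDEPTH    64

typedef struct {
    int insts[QDEPTH];
    int head, tail;   /* free-running; index modulo QDEPTH */
    int priority;     /* software-set thread priority */
} fetch_queue;

static int qcount(const fetch_queue *q) { return q->tail - q->head; }

/* Returns the group size; group[] receives instructions from a
   single thread only, up to five per cycle. */
int form_group(fetch_queue *q0, fetch_queue *q1, int group[GROUP_MAX])
{
    fetch_queue *q;
    int n = 0;

    if (qcount(q0) == 0)      q = q1;
    else if (qcount(q1) == 0) q = q0;
    else q = (q0->priority >= q1->priority) ? q0 : q1;

    while (n < GROUP_MAX && qcount(q) > 0)
        group[n++] = q->insts[q->head++ % QDEPTH];
    return n;
}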
Before a group can be dispatched, the processor must make several resources available for the instructions in the group. Each dispatched group needs an entry in the global completion table (GCT). Each instruction in the group needs an entry in an appropriate issue queue. Each load and store instruction needs an entry in the load reorder queue and store reorder queue, respectively, to detect out-of-order execution hazards.1 When all the resources necessary for dispatch are available for the group, the group is dispatched (GD stage). Instructions flow through the pipeline stages between instruction fetch (IF) and group dispatch (GD) in program order.

After dispatch, each instruction flows through the register-renaming (mapping) facilities (MP stage), which map the logical register numbers in the instruction to physical registers. In the Power5, there are 120 physical general-purpose registers (GPRs) and 120 physical floating-point registers (FPRs). The two threads dynamically share the register files. An out-of-order processor can exploit the high instruction-level parallelism exhibited by some applications (such as some technical applications) if a large pool of rename registers is available. To facilitate this, in ST mode, the Power5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism.
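The register-sharing point can be sketched with a free list over the 120 physical GPRs named in the text; the mapper below is a generic renaming scheme for illustration, not the Power5's mapper design.

#define PHYS_GPRS 120
#define ARCH_GPRS 32   /* architected GPRs per thread */

static int map[2][ARCH_GPRS];  /* thread x logical -> physical */
static int freelist[PHYS_GPRS], nfree;

void mapper_init(int smt_mode)
{
    /* In SMT mode both threads' architected state consumes
       mappings; in ST mode the single thread may draw on the
       whole 120-register pool for renaming. */
    int reserved = smt_mode ? 2 * ARCH_GPRS : ARCH_GPRS;
    int p, t, r;

    nfree = 0;
    for (p = reserved; p < PHYS_GPRS; p++)
        freelist[nfree++] = p;
    for (t = 0; t < (smt_mode ? 2 : 1); t++)
        for (r = 0; r < ARCH_GPRS; r++)
            map[t][r] = t * ARCH_GPRS + r;
}

/* MP stage: give the destination register a fresh physical
   register; the old mapping is freed when the group commits. */
int rename_dest(int thread, int logical)
{
    if (nfree == 0) return -1;     /* no free register: stall */
    int p = freelist[--nfree];
    map[thread][logical] = p;
    return p;
}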
After register renaming, instructions enter issue queues shared by the two threads. The Power5 microprocessor, like the Power4, has multiple issue queues: The floating-point issue queue feeds the two floating-point units, the branch issue queue feeds the branch execution unit, the condition register logical queue feeds the condition register logical operation execution unit, and a combined issue queue feeds the two fixed-point execution units and the two load-store execution units. Like the Power4, the Power5 contains eight execution units, each of which can execute an instruction each cycle.1

To simplify the logic for tracking instructions through the pipeline, the Power5 tracks instructions as a group. Each group of dispatched instructions takes an entry in the global completion table at the time of dispatch. The two threads share 20 entries in the GCT. Each GCT entry holds a group of instructions; a group can contain up to five instructions, all from the same thread. Power5 allocates GCT entries in program order for each thread at the time of dispatch. An entry is deallocated from the GCT when the group is committed. Although the entries in the GCT are in program order and from a given thread, successive entries can belong to different threads.

When all input operands for an instruction are available, it becomes eligible for issue. Among the eligible instructions in the issue queue, the issue logic selects one and issues it for execution (ISS stage). For instruction issue, there is no distinction between instructions from the two threads. When issued, the instruction reads its input physical registers (RF stage), executes on the proper execution unit (EX stage), and writes the result back to the output physical register (WB stage). Each floating-point unit has a six-cycle execution pipe (F1 through F6 stages). In each load-store unit, an adder computes the address to read or write (EA stage), and the data cache is accessed (DC stage). For load instructions, once data is returned, a formatter selects the correct bytes from the cache line (Fmt stage) and writes them to the register (WB stage).

When all the instructions in a group have executed (without generating an exception) and the group is the oldest group of a given thread, the group commits (CP stage). In the Power5, two groups can commit per cycle, one from each thread.
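The GCT bookkeeping just described can be sketched as below: 20 entries shared by both threads, allocated at dispatch, and freed in per-thread program order at commit. The data structure is illustrative only, not the hardware's implementation.

#include <stdbool.h>

#define GCT_ENTRIES 20

typedef struct { int thread; bool valid, done; } gct_entry;
static gct_entry gct[GCT_ENTRIES];
static int alloc_seq[GCT_ENTRIES];  /* dispatch order, oldest first */
static int nalloc;

/* GD stage: a group cannot dispatch without a free GCT entry. */
bool gct_dispatch(int thread)
{
    for (int i = 0; i < GCT_ENTRIES; i++)
        if (!gct[i].valid) {
            gct[i] = (gct_entry){ thread, true, false };
            alloc_seq[nalloc++] = i;
            return true;
        }
    return false;  /* table full: dispatch stalls */
}

/* CP stage: commit the oldest fully executed group of a thread;
   the Power5 can do this once per cycle for each thread. */
bool gct_commit(int thread)
{
    for (int s = 0; s < nalloc; s++) {
        int i = alloc_seq[s];
        if (gct[i].thread != thread) continue;
        if (!gct[i].done) return false;  /* oldest not finished */
        gct[i].valid = false;            /* deallocate the entry */
        for (int k = s; k < nalloc - 1; k++)
            alloc_seq[k] = alloc_seq[k + 1];
        nalloc--;
        return true;
    }
    return false;
}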
To efficiently support SMT, we tuned all resources for improved performance within area and power budget constraints. The L1 instruction and data caches are the same size as in the Power4 (64 Kbytes and 32 Kbytes), but their associativity has doubled to two- and four-way. The first-level data translation table is now fully associative, but the size remains at 128 entries.
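As a worked example of that geometry, assuming 128-byte lines (a detail the article does not give), doubling the associativity at a fixed size halves the number of sets:

#include <stdio.h>

int main(void)
{
    /* L1 sizes and associativities from the text; the 128-byte
       line size is an assumption for illustration. */
    const int line = 128;
    struct { const char *name; int bytes; int ways; } l1[] = {
        { "Power5 L1 I-cache", 64 * 1024, 2 },
        { "Power5 L1 D-cache", 32 * 1024, 4 },
    };

    for (int i = 0; i < 2; i++) {
        int sets = l1[i].bytes / (l1[i].ways * line);
        printf("%s: %d sets x %d ways\n", l1[i].name, sets, l1[i].ways);
    }
    return 0;
}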
Enhanced SMT features
To improve SMT performance for various workload mixes and provide robust quality of service, we added two features to the Power5 chip: dynamic resource balancing and adjustable thread priority.
Dynamic resource balancing. The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the GCT and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The Power5 resource-balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource-balancing logic throttles it back to prevent its blocking the other thread.
Depending on the situation, the Power5 resource-balancing logic has three thread-throttling mechanisms, sketched in code after this list:

• Reducing the thread's priority is the primary mechanism in situations where a thread uses more than a predetermined number of GCT entries.
• Inhibiting the thread's instruction decoding until the congestion clears is the primary mechanism for throttling a thread that incurs a prescribed number of L2 cache misses.
• Flushing all the thread's instructions that are waiting for dispatch and holding the thread's decoding until the congestion clears is the primary mechanism for throttling a thread executing a long-executing instruction, such as a synch instruction. (A synch instruction orders memory operations across multiple processors.)
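The decision structure of the three mechanisms might be expressed as follows; the thresholds and the enumeration are invented, since the article does not give the actual values, and the real logic is implemented in hardware.

/* Map each congestion condition to its primary throttling
   response, per the list above. Thresholds are invented. */
enum throttle { NONE, LOWER_PRIORITY, INHIBIT_DECODE, FLUSH_AND_HOLD };

struct thread_state {
    int gct_entries_used;
    int l2_misses_outstanding;
    int executing_synch;   /* long-running synch instruction */
};

enum throttle balance(const struct thread_state *t)
{
    const int GCT_LIMIT = 14, L2_MISS_LIMIT = 4;       /* invented */

    if (t->executing_synch)                 return FLUSH_AND_HOLD;
    if (t->l2_misses_outstanding > L2_MISS_LIMIT)
                                            return INHIBIT_DECODE;
    if (t->gct_entries_used > GCT_LIMIT)    return LOWER_PRIORITY;
    return NONE;
}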
Adjustable thread priority. Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. (All software layers, including operating systems, middleware, and applications, can set the thread priority. Some priority levels are reserved for setting by a privileged instruction only.) Reasons for choosing an imbalanced thread priority include the following:

• A thread is in a spin loop waiting for a lock. Software would give the thread lower priority, because it is not doing useful work while spinning; the sketch after this list shows one way to do so.
• A thread has no immediate work to do and is waiting in an idle loop. Again, software would give this thread lower priority.
• One application must run faster than another. For example, software would give higher priority to real-time tasks over concurrently running background tasks.
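For the spin-lock case above, code can drop and restore its own hardware thread priority with the PowerPC "or Rx,Rx,Rx" priority hint forms (or 1,1,1 for low priority, or 2,2,2 for medium). The lock itself is a generic C11 sketch, and the snippet assumes a powerpc64 target with a compiler that accepts GCC-style inline assembly.

#include <stdatomic.h>

/* PowerPC priority hints: special no-op forms of "or" that set
   the hardware thread's priority. */
#define HMT_LOW()    __asm__ volatile("or 1,1,1")
#define HMT_MEDIUM() __asm__ volatile("or 2,2,2")

void spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        HMT_LOW();    /* spinning is not useful work: yield decode
                         cycles to the sibling thread */
    HMT_MEDIUM();     /* lock acquired: restore normal priority */
}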
The Power5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The Power5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. Figure 5 shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.

Single-threaded operation
Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip's memory bandwidth. For this reason, the Power5 supports the ST execution mode.
most, further reducing switching power. The Power5 uses this mode only when there is no ready task to run on either thread.

The out-of-order execution Power5 design, coupled with dual 2-way simultaneous multithreaded processor cores, provides both instruction- and thread-level parallelism. Future plans call for shrinking the size of the Power5 die by using a 90-nm lithography fabrication process, which should allow even higher performance at lower power.

Balaram Sinharoy is the chief scientist for the IBM Power5 microprocessor. His research interests include advanced microprocessor design, computer architecture, and performance analysis. Sinharoy has a BS in physics, a BTech in computer science and electrical engineering from the University of Calcutta, and an MS and a PhD in computer science from Rensselaer Polytechnic Institute. He is an IBM Master Inventor and a senior member of the IEEE.