IBM Power5 Chip: A Dual-Core Multithreaded Processor

Featuring single- and multithreaded execution, the Power5 provides higher performance in the single-threaded mode than its Power4 predecessor at equivalent frequencies.
thread executes at any instance. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine's resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multithreading in the IBM eServer pSeries Model 680.2

A variant of coarse-grained multithreading is fine-grained multithreading. Machines of this class execute threads in successive cycles, in round-robin fashion.3 Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused.

Finally, in simultaneous multithreading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread.4 What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long-latency event.
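As a concrete illustration of the three policies, the toy C timeline below runs two threads over a single issue slot. The miss pattern, the one-cycle swap charge in the coarse-grained case (real swaps take several cycles, as noted above), and every other number are invented for illustration; this is a sketch, not a model of the Power5.

#include <stdio.h>

#define CYCLES 12

int main(void)
{
    /* 1 = thread can issue this cycle; 0 = stalled on a cache miss. */
    int ready[2][CYCLES] = {
        {1,1,0,0,0,1,1,1,1,1,1,1},  /* thread 0 misses around cycle 2 */
        {1,1,1,1,1,1,0,0,1,1,1,1},  /* thread 1 misses around cycle 6 */
    };
    int used, t, c;

    /* Coarse-grained: run one thread until it stalls, then swap.
       The swap is charged one lost cycle here; real swaps cost more. */
    used = 0; t = 0;
    for (c = 0; c < CYCLES; c++) {
        if (!ready[t][c]) { t ^= 1; continue; }  /* swap threads */
        used++;
    }
    printf("coarse-grained: %2d/%d issue slots used\n", used, CYCLES);

    /* Fine-grained: threads alternate cycle by cycle; a stalled
       thread's slot simply goes unused. */
    used = 0;
    for (c = 0; c < CYCLES; c++)
        used += ready[c & 1][c];
    printf("fine-grained:   %2d/%d issue slots used\n", used, CYCLES);

    /* SMT: each cycle, the slot goes to whichever thread is ready. */
    used = 0;
    for (c = 0; c < CYCLES; c++)
        used += (ready[0][c] || ready[1][c]);
    printf("SMT:            %2d/%d issue slots used\n", used, CYCLES);
    return 0;
}

Even this crude model shows SMT reclaiming the slots that the other two schemes lose to stalls and swap overhead.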
The Power5 design implements two-way SMT on each of the chip's two processor cores. Although a higher level of multithreading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multithreading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.

Power5 system structure
Figure 1 shows the high-level structures of Power4- and Power5-based systems. The Power4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric lets the Power5 more frequently satisfy level-two (L2) cache misses with hits in the 36-Mbyte off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests onto the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing Power5-based systems to scale to higher levels of symmetric multiprocessing. Initial Power5 systems support 64 physical processors.

Figure 1. Power4 (a) and Power5 (b) system structures.

The Power4 includes a 1.41-Mbyte on-chip L2 cache. Power4+ chips are similar in design to the Power4 but are fabricated in 130-nm technology rather than the Power4's 180-nm technology. The Power4+ includes a 1.5-Mbyte on-chip L2 cache, whereas the Power5 includes a 1.875-Mbyte on-chip L2 cache.
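The L2-then-L3 lookup order described above can be sketched as follows; the three functions are hypothetical stand-ins for the hardware, and the stub hit logic is arbitrary.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the cache lookups; the hit logic
   here is an arbitrary stub, not Power5 behavior. */
static bool l2_lookup(uint64_t addr) { return (addr & 0x3) == 0; }
static bool l3_lookup(uint64_t addr) { return (addr & 0x3) == 1; }
static void fabric_request(uint64_t addr)
{
    printf("fabric request for %#llx\n", (unsigned long long)addr);
}

static void reference(uint64_t addr)
{
    if (l2_lookup(addr)) return;   /* hit in the on-chip L2 */
    if (l3_lookup(addr)) return;   /* hit in the processor-side L3:
                                      no interchip fabric traffic */
    fabric_request(addr);          /* only a miss in both goes out
                                      onto the interconnection fabric */
}

int main(void)
{
    reference(0x1000);  /* L2 hit */
    reference(0x1001);  /* L3 hit */
    reference(0x1002);  /* fabric request */
    return 0;
}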
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables (BHTs) shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time.

In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For predicting the target of a subroutine return, the processor uses a return stack, one for each thread. For predicting the target of other branches, it uses a shared target cache. If there is a taken branch, the processor loads the program counter with the branch's target address. Otherwise, it loads the program counter with the address of the next sequential instruction to fetch from.
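A minimal sketch of the selector arrangement, in C: two direction predictors vote, and a third table, trained only when they disagree, picks which one to believe. Table sizes, the indexing functions, and the use of 2-bit counters are illustrative assumptions, not the Power5's actual BHT organization.

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4096   /* invented size */

static uint8_t bimodal[ENTRIES];   /* 2-bit saturating counters */
static uint8_t pathcorr[ENTRIES];  /* path-correlated predictor */
static uint8_t selector[ENTRIES];  /* which predictor to trust */
static uint32_t path;              /* global path/branch history */

static bool taken(uint8_t c) { return c >= 2; }
static uint8_t bump(uint8_t c, bool t)
{
    if (t) return c < 3 ? c + 1 : 3;
    return c > 0 ? c - 1 : 0;
}

bool predict(uint64_t pc)
{
    uint32_t i  = pc & (ENTRIES - 1);          /* bimodal index */
    uint32_t ip = (pc ^ path) & (ENTRIES - 1); /* path-based index */
    bool b = taken(bimodal[i]);
    bool p = taken(pathcorr[ip]);
    return taken(selector[i]) ? p : b;         /* selector picks */
}

void update(uint64_t pc, bool outcome)
{
    uint32_t i  = pc & (ENTRIES - 1);
    uint32_t ip = (pc ^ path) & (ENTRIES - 1);
    bool b = taken(bimodal[i]), p = taken(pathcorr[ip]);
    /* Train the selector only when the two predictors disagree. */
    if (b != p) selector[i] = bump(selector[i], p == outcome);
    bimodal[i]  = bump(bimodal[i], outcome);
    pathcorr[ip] = bump(pathcorr[ip], outcome);
    path = (path << 1) | outcome;
}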
After fetching, the Power5 places instructions in the predicted path in separate instruction fetch queues for the two threads (D0 stage). Like the Power4, the Power5 can dispatch up to five instructions each cycle. On the basis of thread priorities, the processor selects instructions from one of the instruction fetch queues and forms a group (D1, D2, and D3 stages). All instructions in a group come from the same thread and are decoded in parallel.
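The selection and group-formation step might be sketched as below; the strict highest-priority-wins rule is a simplification of the Power5's priority-based selection, and the queue structure is invented.

#define GROUP_MAX 5
#define QDEPTH    64

typedef struct {
    int insts[QDEPTH];
    int head, tail;   /* free-running; index modulo QDEPTH */
    int priority;     /* software-set thread priority */
} fetch_queue;

static int qcount(const fetch_queue *q) { return q->tail - q->head; }

/* Returns the group size; group[] receives instructions from a
   single thread only, up to five per cycle. */
int form_group(fetch_queue *q0, fetch_queue *q1, int group[GROUP_MAX])
{
    fetch_queue *q;
    int n = 0;

    if (qcount(q0) == 0)      q = q1;
    else if (qcount(q1) == 0) q = q0;
    else q = (q0->priority >= q1->priority) ? q0 : q1;

    while (n < GROUP_MAX && qcount(q) > 0)
        group[n++] = q->insts[q->head++ % QDEPTH];
    return n;
}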
Before a group can be dispatched, the processor must make several resources available for the instructions in the group. Each dispatched group needs an entry in the global completion table (GCT). Each instruction in the group needs an entry in an appropriate issue queue. Each load and store instruction needs an entry in the load reorder queue and store reorder queue, respectively, to detect out-of-order execution hazards.1 When all the resources necessary for dispatch are available for the group, the group is dispatched (GD stage). Instructions flow through the pipeline stages between instruction fetch (IF) and group dispatch (GD) in program order.

After dispatch, each instruction flows through the register-renaming (mapping) facilities (MP stage), which map the logical register numbers in the instruction to physical registers. In the Power5, there are 120 physical general-purpose registers (GPRs) and 120 physical floating-point registers (FPRs). The two threads dynamically share the register files. An out-of-order processor can exploit the high instruction-level parallelism exhibited by some applications (such as some technical applications) if a large pool of rename registers is available. To facilitate this, in ST mode, the Power5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism.
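The register-sharing point can be sketched with a free list over the 120 physical GPRs named in the text; the mapper below is a generic renaming scheme for illustration, not the Power5's mapper design.

#define PHYS_GPRS 120
#define ARCH_GPRS 32   /* architected GPRs per thread */

static int map[2][ARCH_GPRS];  /* thread x logical -> physical */
static int freelist[PHYS_GPRS], nfree;

void mapper_init(int smt_mode)
{
    /* In SMT mode both threads' architected state consumes
       mappings; in ST mode the single thread may draw on the
       whole 120-register pool for renaming. */
    int reserved = smt_mode ? 2 * ARCH_GPRS : ARCH_GPRS;
    int p, t, r;

    nfree = 0;
    for (p = reserved; p < PHYS_GPRS; p++)
        freelist[nfree++] = p;
    for (t = 0; t < (smt_mode ? 2 : 1); t++)
        for (r = 0; r < ARCH_GPRS; r++)
            map[t][r] = t * ARCH_GPRS + r;
}

/* MP stage: give the destination register a fresh physical
   register; the old mapping is freed when the group commits. */
int rename_dest(int thread, int logical)
{
    if (nfree == 0) return -1;     /* no free register: stall */
    int p = freelist[--nfree];
    map[thread][logical] = p;
    return p;
}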
After register renaming, instructions enter issue queues shared by the two threads. The Power5 microprocessor, like the Power4, has multiple issue queues: The floating-point issue queue feeds the two floating-point units, the branch issue queue feeds the branch execution unit, the condition register logical queue feeds the condition register logical operation execution unit, and a combined issue queue feeds the two fixed-point execution units and the two load-store execution units. Like the Power4, the Power5 contains eight execution units, each of which can execute an instruction each cycle.1

To simplify the logic for tracking instructions through the pipeline, the Power5 tracks instructions as a group. Each group of dispatched instructions takes an entry in the global completion table at the time of dispatch. The two threads share 20 entries in the GCT. Each GCT entry holds a group of instructions; a group can contain up to five instructions, all from the same thread. Power5 allocates GCT entries in program order for each thread at the time of dispatch. An entry is deallocated from the GCT when the group is committed. Although the entries in the GCT are in program order and from a given thread, successive entries can belong to different threads.

When all input operands for an instruction are available, it becomes eligible for issue. Among the eligible instructions in the issue queue, the issue logic selects one and issues it for execution (ISS stage). For instruction issue, there is no distinction between instructions from the two threads. When issued, the instruction reads its input physical registers (RF stage), executes on the proper execution unit (EX stage), and writes the result back to the output physical register (WB stage). Each floating-point unit has a six-cycle execution pipe (F1 through F6 stages). In each load-store unit, an adder computes the address to read or write (EA stage), and the data cache is accessed (DC stage). For load instructions, once data is returned, a formatter selects the correct bytes from the cache line (Fmt stage) and writes them to the register (WB stage).

When all the instructions in a group have executed (without generating an exception) and the group is the oldest group of a given thread, the group commits (CP stage). In the Power5, two groups can commit per cycle, one from each thread.
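The GCT bookkeeping just described can be sketched as below: 20 entries shared by both threads, allocated at dispatch, and freed in per-thread program order at commit. The data structure is illustrative only, not the hardware's implementation.

#include <stdbool.h>

#define GCT_ENTRIES 20

typedef struct { int thread; bool valid, done; } gct_entry;
static gct_entry gct[GCT_ENTRIES];
static int alloc_seq[GCT_ENTRIES];  /* dispatch order, oldest first */
static int nalloc;

/* GD stage: a group cannot dispatch without a free GCT entry. */
bool gct_dispatch(int thread)
{
    for (int i = 0; i < GCT_ENTRIES; i++)
        if (!gct[i].valid) {
            gct[i] = (gct_entry){ thread, true, false };
            alloc_seq[nalloc++] = i;
            return true;
        }
    return false;  /* table full: dispatch stalls */
}

/* CP stage: commit the oldest fully executed group of a thread;
   the Power5 can do this once per cycle for each thread. */
bool gct_commit(int thread)
{
    for (int s = 0; s < nalloc; s++) {
        int i = alloc_seq[s];
        if (gct[i].thread != thread) continue;
        if (!gct[i].done) return false;  /* oldest not finished */
        gct[i].valid = false;            /* deallocate the entry */
        for (int k = s; k < nalloc - 1; k++)
            alloc_seq[k] = alloc_seq[k + 1];
        nalloc--;
        return true;
    }
    return false;
}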
To efficiently support SMT, we tuned all resources for improved performance within area and power budget constraints. The L1 instruction and data caches are the same size as in the Power4 (64 Kbytes and 32 Kbytes), but their associativity has doubled to two- and four-way. The first-level data translation table is now fully associative, but the size remains at 128 entries.
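As a worked example of that geometry, assuming 128-byte lines (a detail the article does not give), doubling the associativity at a fixed size halves the number of sets:

#include <stdio.h>

int main(void)
{
    /* L1 sizes and associativities from the text; the 128-byte
       line size is an assumption for illustration. */
    const int line = 128;
    struct { const char *name; int bytes; int ways; } l1[] = {
        { "Power5 L1 I-cache", 64 * 1024, 2 },
        { "Power5 L1 D-cache", 32 * 1024, 4 },
    };

    for (int i = 0; i < 2; i++) {
        int sets = l1[i].bytes / (l1[i].ways * line);
        printf("%s: %d sets x %d ways\n", l1[i].name, sets, l1[i].ways);
    }
    return 0;
}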
Enhanced SMT features
To improve SMT performance for various workload mixes and provide robust quality of service, we added two features to the Power5 chip: dynamic resource balancing and adjustable thread priority.
Dynamic resource balancing. The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the GCT and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The Power5 resource-balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource-balancing logic throttles it back to prevent its blocking the other thread.
Depending on the situation, the Power5 resource-balancing logic has three thread-throttling mechanisms, sketched in code after this list:

• Reducing the thread's priority is the primary mechanism in situations where a thread uses more than a predetermined number of GCT entries.
• Inhibiting the thread's instruction decoding until the congestion clears is the primary mechanism for throttling a thread that incurs a prescribed number of L2 cache misses.
• Flushing all the thread's instructions that are waiting for dispatch and holding the thread's decoding until the congestion clears is the primary mechanism for throttling a thread executing a long-executing instruction, such as a synch instruction. (A synch instruction orders memory operations across multiple processors.)
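The decision structure of the three mechanisms might be expressed as follows; the thresholds and the enumeration are invented, since the article does not give the actual values, and the real logic is implemented in hardware.

/* Map each congestion condition to its primary throttling
   response, per the list above. Thresholds are invented. */
enum throttle { NONE, LOWER_PRIORITY, INHIBIT_DECODE, FLUSH_AND_HOLD };

struct thread_state {
    int gct_entries_used;
    int l2_misses_outstanding;
    int executing_synch;   /* long-running synch instruction */
};

enum throttle balance(const struct thread_state *t)
{
    const int GCT_LIMIT = 14, L2_MISS_LIMIT = 4;       /* invented */

    if (t->executing_synch)                 return FLUSH_AND_HOLD;
    if (t->l2_misses_outstanding > L2_MISS_LIMIT)
                                            return INHIBIT_DECODE;
    if (t->gct_entries_used > GCT_LIMIT)    return LOWER_PRIORITY;
    return NONE;
}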
Adjustable thread priority. Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. (All software layers, including operating systems, middleware, and applications, can set the thread priority. Some priority levels are reserved for setting by a privileged instruction only.) Reasons for choosing an imbalanced thread priority include the following:

• A thread is in a spin loop waiting for a lock. Software would give the thread lower priority, because it is not doing useful work while spinning; the sketch after this list shows one way to do so.
• A thread has no immediate work to do and is waiting in an idle loop. Again, software would give this thread lower priority.
• One application must run faster than another. For example, software would give higher priority to real-time tasks over concurrently running background tasks.
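For the spin-lock case above, code can drop and restore its own hardware thread priority with the PowerPC "or Rx,Rx,Rx" priority hint forms (or 1,1,1 for low priority, or 2,2,2 for medium). The lock itself is a generic C11 sketch, and the snippet assumes a powerpc64 target with a compiler that accepts GCC-style inline assembly.

#include <stdatomic.h>

/* PowerPC priority hints: special no-op forms of "or" that set
   the hardware thread's priority. */
#define HMT_LOW()    __asm__ volatile("or 1,1,1")
#define HMT_MEDIUM() __asm__ volatile("or 2,2,2")

void spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        HMT_LOW();    /* spinning is not useful work: yield decode
                         cycles to the sibling thread */
    HMT_MEDIUM();     /* lock acquired: restore normal priority */
}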
The Power5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The Power5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. Figure 5 shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.

Single-threaded operation
Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip's memory bandwidth. For this reason, the Power5 supports the ST execution mode.
most, further reducing switching power. The Power5 uses this mode only when there is no ready task to run on either thread.

The out-of-order execution Power5 design, coupled with dual 2-way simultaneous multithreaded processor cores, provides both instruction- and thread-level parallelism. Future plans call for shrinking the size of the Power5 die by using a 90-nm lithography fabrication process, which should allow even higher performance at lower power.

Balaram Sinharoy is the chief scientist for the IBM Power5 microprocessor. His research interests include advanced microprocessor design, computer architecture, and performance analysis. Sinharoy has a BS in physics, a BTech in computer science and electrical engineering from the University of Calcutta, and an MS and a PhD in computer science from Rensselaer Polytechnic Institute. He is an IBM Master Inventor and a senior member of the IEEE.