
IBM POWER5 CHIP: A DUAL-CORE MULTITHREADED PROCESSOR

Ron Kalla
Balaram Sinharoy
Joel M. Tendler
IBM

FEATURING SINGLE- AND MULTITHREADED EXECUTION, THE POWER5 PROVIDES HIGHER PERFORMANCE IN THE SINGLE-THREADED MODE THAN ITS POWER4 PREDECESSOR AT EQUIVALENT FREQUENCIES. ENHANCEMENTS INCLUDE DYNAMIC RESOURCE BALANCING TO EFFICIENTLY ALLOCATE SYSTEM RESOURCES TO EACH THREAD, SOFTWARE-CONTROLLED THREAD PRIORITIZATION, AND DYNAMIC POWER MANAGEMENT TO REDUCE POWER CONSUMPTION WITHOUT AFFECTING PERFORMANCE.

IBM introduced Power4-based systems in 2001.¹ The Power4 design integrates two processor cores on a single chip, a shared second-level cache, a directory for an off-chip third-level cache, and the necessary circuitry to connect it to other Power4 chips to form a system. The dual-processor chip provides natural thread-level parallelism at the chip level. Additionally, the Power4's out-of-order execution design lets the hardware bypass instructions whose operands are not yet available (perhaps because of an earlier cache miss during register loading) and execute other instructions whose operands are ready. Later, when the operands become available, the hardware can execute the skipped instruction. Coupled with a superscalar design, out-of-order execution results in higher instruction execution parallelism than otherwise possible.

The Power5 is the next-generation chip in this line. One of our key goals in designing the Power5 was to maintain both binary and structural compatibility with existing Power4 systems to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems. With that base requirement, we specified increased performance and other functional enhancements of server virtualization, reliability, availability, and serviceability at both chip and system levels. In this article, we describe the approach we used to improve chip-level performance.

Multithreading

Conventional processors execute instructions from a single instruction stream. Despite microarchitectural advances, execution unit utilization remains low in today's microprocessors. It is not unusual to see average execution unit utilization rates of approximately 25 percent across a broad spectrum of environments. To increase execution unit utilization, designers use thread-level parallelism, in which the physical processor core executes instructions from more than one instruction stream. To the operating system, the physical processor core appears as if it is a symmetric multiprocessor containing two logical processors. There are at least three different methods for handling multiple threads.

In coarse-grained multithreading, only one thread executes at any instance. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine's resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multithreading in the IBM eServer pSeries Model 680.²

A variant of coarse-grained multithreading is fine-grained multithreading. Machines of this class execute threads in successive cycles, in round-robin fashion.³ Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused.

Finally, in simultaneous multithreading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread.⁴ What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long-latency event.
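To see why SMT recovers issue slots that coarse-grained multithreading loses, consider a toy model of a single issue slot shared by two threads. The miss rate, miss latency, and thread-swap penalty below are made-up illustrative numbers, not Power5 parameters; the sketch only contrasts the two policies described above.

```c
/* Toy model: one issue slot, two threads that stall at random.
 * Coarse-grained multithreading swaps threads on a stall (paying a swap
 * penalty); SMT uses the slot whenever either thread is ready.
 * All parameters are illustrative, not Power5 measurements. */
#include <stdio.h>
#include <stdlib.h>

#define CYCLES   1000000L
#define MISS_PCT 5      /* chance (%) per cycle that a thread stalls */
#define MISS_LAT 100    /* cycles a stalled thread stays unavailable */
#define SWAP_LAT 5      /* coarse-grained thread-swap penalty        */

int main(void) {
    long coarse = 0, smt = 0;
    int stall[2] = {0, 0}, cur = 0, swap = 0;
    srand(1);

    for (long c = 0; c < CYCLES; c++) {
        for (int t = 0; t < 2; t++) {
            if (stall[t]) stall[t]--;                 /* miss completes */
            else if (rand() % 100 < MISS_PCT) stall[t] = MISS_LAT;
        }
        /* coarse-grained: a stall triggers a costly switch to the
           other thread */
        if (swap) swap--;
        else if (stall[cur]) { cur ^= 1; swap = SWAP_LAT; }
        else coarse++;

        /* SMT: the slot is used if either thread has ready work */
        if (!stall[0] || !stall[1]) smt++;
    }
    printf("issue-slot utilization: coarse-grained %.1f%%, SMT %.1f%%\n",
           100.0 * coarse / CYCLES, 100.0 * smt / CYCLES);
    return 0;
}
```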
The Power5 design implements two-way SMT on each of the chip's two processor cores. Although a higher level of multithreading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multithreading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.

Power5 system structure

Figure 1. Power4 (a) and Power5 (b) system structures.

Figure 1 shows the high-level structures of Power4- and Power5-based systems. The Power4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric lets the Power5 more frequently satisfy level-two (L2) cache misses with hits in the 36-Mbyte off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests onto the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing Power5-based systems to scale to higher levels of symmetric multiprocessing. Initial Power5 systems support 64 physical processors.

The Power4 includes a 1.41-Mbyte on-chip L2 cache. Power4+ chips are similar in design to the Power4 but are fabricated in 130-nm technology rather than the Power4's 180-nm technology. The Power4+ includes a 1.5-Mbyte on-chip L2 cache, whereas the Power5 supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache.


The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance.⁵ Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm².

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).

The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller.
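That geometry fixes the cache arithmetic: 512 congruence classes of 10 ways of 128 bytes is 640 Kbytes per slice, and three slices give the 1,920 Kbytes cited above. The short C sketch below checks this and decomposes a real address; the slice-selection function (line address modulo 3) is only a stand-in, since the article says no more than that the real address selects the slice.

```c
/* Sanity-check the stated L2 geometry and decompose a real address.
 * Geometry (from the article): 3 slices, each 10-way set-associative
 * with 512 congruence classes (sets) of 128-byte lines. The slice hash
 * below is a stand-in, not the Power5's actual function. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum { SLICES = 3, WAYS = 10, SETS = 512, LINE = 128 };

int main(void) {
    unsigned slice_bytes = WAYS * SETS * LINE;      /* 640 Kbytes   */
    unsigned total_bytes = SLICES * slice_bytes;    /* 1,920 Kbytes */
    assert(total_bytes == 1920 * 1024);             /* 1.875 Mbytes */

    uint64_t ra = 0xDEADBEEFULL;        /* example real address        */
    uint64_t line_addr = ra / LINE;
    unsigned offset = ra % LINE;        /* byte within a 128-byte line */
    unsigned set    = line_addr % SETS; /* congruence class in a slice */
    unsigned slice  = line_addr % SLICES; /* stand-in slice selection  */

    printf("L2: %u Kbytes total (%u Kbytes per slice)\n",
           total_bytes / 1024, slice_bytes / 1024);
    printf("ra=%#llx -> slice %u, set %u, offset %u\n",
           (unsigned long long)ra, slice, set, offset);
    return 0;
}
```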
We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram.

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).

The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables (BHTs) shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.⁶,⁷ The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.⁷ If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value.


For predicting the target of a subroutine return, the processor uses a return stack, one for each thread. For predicting the target of other branches, it uses a shared target cache. If there is a taken branch, the processor loads the program counter with the branch's target address. Otherwise, it loads the program counter with the address of the next sequential instruction to fetch from.
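For the common case of a relative branch, the target computation amounts to adding a sign-extended, word-aligned displacement to the branch's own address. The sketch below uses the PowerPC conditional-branch (B-form) field layout as ISA background; it illustrates the architecture's encoding, not Power5 internals.

```c
/* Computing a conditional branch target from the instruction's address
 * and its immediate offset, as the PowerPC architecture allows.
 * B-form layout: 14-bit displacement (BD), shifted left 2 bits and
 * sign-extended; the AA bit selects absolute addressing. */
#include <stdint.h>
#include <stdio.h>

static int64_t sext(uint64_t v, unsigned bits) {   /* sign-extend */
    uint64_t m = 1ULL << (bits - 1);
    return (int64_t)((v ^ m) - m);
}

static uint64_t branch_target(uint64_t cia, uint32_t insn) {
    uint32_t bd = (insn >> 2) & 0x3FFF;        /* 14-bit BD field      */
    uint32_t aa = (insn >> 1) & 1;             /* absolute-address bit */
    int64_t off = sext((uint64_t)bd << 2, 16); /* word-aligned, signed */
    return (aa ? 0 : cia) + (uint64_t)off;     /* relative to the
                                                  branch's own address */
}

int main(void) {
    /* bc with BD = -4 instructions (offset -16 bytes), AA = 0 */
    uint32_t insn = (16u << 26) | (0x3FFCu << 2);
    printf("target = %#llx\n",                 /* 0x10000030 */
           (unsigned long long)branch_target(0x10000040, insn));
    return 0;
}
```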
After fetching, the Power5 places instructions in the predicted path in separate instruction fetch queues for the two threads (D0 stage). Like the Power4, the Power5 can dispatch up to five instructions each cycle. On the basis of thread priorities, the processor selects instructions from one of the instruction fetch queues and forms a group (D1, D2, and D3 stages). All instructions in a group come from the same thread and are decoded in parallel.

Before a group can be dispatched, the processor must make several resources available for the instructions in the group. Each dispatched group needs an entry in the global completion table (GCT). Each instruction in the group needs an entry in an appropriate issue queue. Each load and store instruction needs an entry in the load reorder queue and store reorder queue, respectively, to detect out-of-order execution hazards.¹ When all the resources necessary for dispatch are available for the group, the group is dispatched (GD stage). Instructions flow through the pipeline stages between instruction fetch (IF) and group dispatch (GD) in program order. After dispatch, each instruction flows through the register-renaming (mapping) facilities (MP stage), which map the logical register numbers in the instruction to physical registers. In the Power5, there are 120 physical general-purpose registers (GPRs) and 120 physical floating-point registers (FPRs). The two threads dynamically share the register files. An out-of-order processor can exploit the high instruction-level parallelism exhibited by some applications (such as some technical applications) if a large pool of rename registers is available. To facilitate this, in ST mode, the Power5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism.
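A minimal free-list sketch of that mapping step, with the register counts taken from the article but data structures that are purely illustrative of renaming in general, not the Power5's mapper design:

```c
/* Register-renaming sketch: 120 physical GPRs backing the architected
 * GPRs of two threads, handed out from a free list in the MP stage. */
#include <stdio.h>

#define PHYS_REGS 120
#define ARCH_REGS 32            /* architected GPRs per thread */

static int map[2][ARCH_REGS];   /* logical -> physical, per thread */
static int free_list[PHYS_REGS], free_top;

static int rename_dest(int thread, int logical) {
    if (free_top == 0) return -1;         /* no register: stall dispatch */
    int phys = free_list[--free_top];     /* the old mapping is freed    */
    map[thread][logical] = phys;          /* later, when the group       */
    return phys;                          /* commits                     */
}

int main(void) {
    for (int p = 0; p < PHYS_REGS; p++) free_list[free_top++] = p;
    /* thread 0 and thread 1 both write r3: different physical registers */
    printf("t0 r3 -> p%d\n", rename_dest(0, 3));
    printf("t1 r3 -> p%d\n", rename_dest(1, 3));
    return 0;
}
```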
After register renaming, instructions enter issue queues shared by the two threads. The Power5 microprocessor, like the Power4, has multiple issue queues: The floating-point issue queue feeds the two floating-point units, the branch issue queue feeds the branch execution unit, the condition register logical queue feeds the condition register logical operation execution unit, and a combined issue queue feeds the two fixed-point execution units and the two load-store execution units. Like the Power4, the Power5 contains eight execution units, each of which can execute an instruction each cycle.¹

To simplify the logic for tracking instructions through the pipeline, the Power5 tracks instructions as a group. Each group of dispatched instructions takes an entry in the global completion table at the time of dispatch. The two threads share 20 entries in the GCT. Each GCT entry holds a group of instructions; a group can contain up to five instructions, all from the same thread. The Power5 allocates GCT entries in program order for each thread at the time of dispatch. An entry is deallocated from the GCT when the group is committed. Although the entries in the GCT are in program order and from a given thread, successive entries can belong to different threads.
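A compact sketch of that bookkeeping, using a circular buffer as a stand-in for the real GCT structure; the commit path below retires the globally oldest group, a simplification of the per-thread rule the article describes next.

```c
/* Sketch of the shared 20-entry global completion table: groups of up
 * to five same-thread instructions are allocated in program order at
 * dispatch and freed at commit. */
#include <stdio.h>

#define GCT_ENTRIES 20

struct group { int thread, ninsns, valid; };
static struct group gct[GCT_ENTRIES];
static int head, tail, used;            /* shared by both threads */

static int dispatch_group(int thread, int ninsns) {
    if (used == GCT_ENTRIES || ninsns > 5) return -1; /* hold dispatch */
    gct[tail] = (struct group){ thread, ninsns, 1 };
    int slot = tail;
    tail = (tail + 1) % GCT_ENTRIES;
    used++;
    return slot;
}

static void commit_oldest(void) {       /* CP stage (simplified) */
    if (used == 0) return;
    gct[head].valid = 0;
    head = (head + 1) % GCT_ENTRIES;
    used--;
}

int main(void) {
    dispatch_group(0, 5);   /* successive entries may belong */
    dispatch_group(1, 3);   /* to different threads          */
    commit_oldest();
    printf("entries in use: %d\n", used);
    return 0;
}
```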
When all input operands for an instruction are available, it becomes eligible for issue. Among the eligible instructions in the issue queue, the issue logic selects one and issues it for execution (ISS stage). For instruction issue, there is no distinction between instructions from the two threads. When issued, the instruction reads its input physical registers (RF stage), executes on the proper execution unit (EX stage), and writes the result back to the output physical register (WB stage). Each floating-point unit has a six-cycle execution pipe (F1 through F6 stages). In each load-store unit, an adder computes the address to read or write (EA stage), and the data cache is accessed (DC stage). For load instructions, once data is returned, a formatter selects the correct bytes from the cache line (Fmt stage) and writes them to the register (WB stage).

When all the instructions in a group have executed (without generating an exception) and the group is the oldest group of a given thread, the group commits (CP stage). In the Power5, two groups can commit per cycle, one from each thread.

To efficiently support SMT, we tuned all resources for improved performance within area and power budget constraints.

The L1 instruction and data caches are the same size as in the Power4—64 Kbytes and 32 Kbytes—but their associativity has doubled to two-way and four-way, respectively. The first-level data translation table is now fully associative, but the size remains at 128 entries.

Enhanced SMT features

To improve SMT performance for various workload mixes and provide robust quality of service, we added two features to the Power5 chip: dynamic resource balancing and adjustable thread priority.

Dynamic resource balancing. The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the GCT and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The Power5 resource-balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource-balancing logic throttles it back to prevent its blocking the other thread.

Depending on the situation, the Power5 resource-balancing logic has three thread-throttling mechanisms (a decision-logic sketch follows the list):

• Reducing the thread's priority is the primary mechanism in situations where a thread uses more than a predetermined number of GCT entries.
• Inhibiting the thread's instruction decoding until the congestion clears is the primary mechanism for throttling a thread that incurs a prescribed number of L2 cache misses.
• Flushing all the thread's instructions that are waiting for dispatch and holding the thread's decoding until the congestion clears is the primary mechanism for throttling a thread executing a long-executing instruction, such as a synch instruction. (A synch instruction orders memory operations across multiple processors.)
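The following sketch maps those three conditions to their actions. The thresholds and the precedence among triggers are illustrative; the article does not publish the Power5's actual values.

```c
/* Decision-logic sketch of the three thread-throttling mechanisms. */
#include <stdio.h>

enum throttle { NONE, LOWER_PRIORITY, INHIBIT_DECODE, FLUSH_AND_HOLD };

struct thread_state {
    int gct_entries;  /* GCT entries the thread currently occupies */
    int l2_misses;    /* outstanding L2 cache load misses          */
    int long_op;      /* executing a long op, e.g. a synch         */
};

static enum throttle pick_throttle(const struct thread_state *t,
                                   int gct_limit, int miss_limit) {
    if (t->long_op)                  return FLUSH_AND_HOLD; /* mechanism 3 */
    if (t->l2_misses >= miss_limit)  return INHIBIT_DECODE; /* mechanism 2 */
    if (t->gct_entries >= gct_limit) return LOWER_PRIORITY; /* mechanism 1 */
    return NONE;
}

int main(void) {
    struct thread_state t = { .gct_entries = 15, .l2_misses = 0 };
    printf("action = %d\n", pick_throttle(&t, 12, 4)); /* LOWER_PRIORITY */
    return 0;
}
```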
Adjustable thread priority. Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. (All software layers—operating systems, middleware, and applications—can set the thread priority. Some priority levels are reserved for setting by a privileged instruction only.) Reasons for choosing an imbalanced thread priority include the following:

• A thread is in a spin loop waiting for a lock. Software would give the thread lower priority, because it is not doing useful work while spinning (see the code sketch below).
• A thread has no immediate work to do and is waiting in an idle loop. Again, software would give this thread lower priority.
• One application must run faster than another. For example, software would give higher priority to real-time tasks over concurrently running background tasks.

The Power5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The Power5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. Figure 5 shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.

Figure 5. Effects of thread priority on performance.
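The spin-loop case from the first bullet looks roughly like the sketch below. On PowerPC, software adjusts thread priority with special no-op forms of the or instruction; the specific encodings used here (or 1,1,1 for low, or 2,2,2 for normal) follow common PowerPC operating-system practice and are an assumption, since the article does not list the encodings. The code builds only for a PowerPC target.

```c
/* Lowering thread priority while spinning on a lock. */
#include <stdatomic.h>

#define HMT_LOW()    __asm__ volatile("or 1,1,1")  /* assumed: low priority    */
#define HMT_MEDIUM() __asm__ volatile("or 2,2,2")  /* assumed: normal priority */

void spin_lock(atomic_int *lock) {
    for (;;) {
        if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
            return;                       /* got the lock */
        HMT_LOW();                        /* spinning is not useful work:
                                             yield decode cycles to the
                                             sibling thread               */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            ;                             /* wait at low priority */
        HMT_MEDIUM();                     /* restore priority, then retry */
    }
}

void spin_unlock(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```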


Single-threaded operation

Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip's memory bandwidth. For this reason, the Power5 supports the ST execution mode. In this mode, the Power5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a Power4 system at equivalent frequencies.

The Power5 supports two types of ST operation: An inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction, or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state.

When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.
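The two inactive-thread states can be summarized as a small state machine. The event names below are illustrative; in particular, the article does not name the special instruction that the active thread uses to wake a dormant sibling.

```c
/* Transition sketch for the dormant and null thread states. */
#include <stdio.h>

enum thread_state { STATE_ACTIVE, STATE_DORMANT, STATE_NULL };
enum wake_event   { EV_NONE, EV_SPECIAL_INSN, EV_EXTERNAL_INT, EV_DECR_INT };

static enum thread_state step(enum thread_state s, enum wake_event ev) {
    switch (s) {
    case STATE_DORMANT:   /* wakes on an interrupt or on a special
                             instruction executed by the active thread;
                             software then restores architected state   */
        return ev != EV_NONE ? STATE_ACTIVE : STATE_DORMANT;
    case STATE_NULL:      /* OS is unaware of the thread; no resources
                             allocated, no wake-up on interrupts        */
        return STATE_NULL;
    default:
        return s;
    }
}

int main(void) {
    enum thread_state s = step(STATE_DORMANT, EV_DECR_INT);
    printf("dormant + decrementer interrupt -> %s\n",
           s == STATE_ACTIVE ? "active" : "not active");
    return 0;
}
```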
Dynamic power management

In current CMOS technologies, chip power has become one of the most important design parameters. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, Power5 chips use a fine-grained, dynamic clock-gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. For example, if the GPRs are guaranteed not to be read in a given cycle, the clock-gating mechanism turns off the clocks to the GPR read ports. This allows substantial power saving with no performance impact.

In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle. The set of latches driven by a clock-gated local clock buffer can still be read but cannot be written. We used power-modeling tools to estimate the utilization of various design macros and their associated switching power across a range of workloads. We then determined the benefit of clock gating for those macros, implementing cycle-by-cycle dynamic power management in macros where such management provided a reasonable power-saving benefit. We paid special attention to ensuring that clock gating causes no performance loss and that clock-gating logic does not create a critical timing path. A minimum amount of logic implements the clock-gating function.
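A conceptual C model of that decision, not the gating circuit itself: the clock to a local buffer is driven only when lookahead logic says the latches will be used next cycle, and gated latches stay readable but cannot be written.

```c
/* Cycle-level sketch of the clock-gating decision. */
#include <stdbool.h>
#include <stdio.h>

struct local_clock_buffer {
    bool clock_on;  /* are the latch clocks driven this cycle? */
    int  latches;   /* value held; readable even when gated    */
};

/* 'used_next' comes from lookahead logic, e.g. "an issued instruction
 * will read the GPR ports next cycle" */
static void cycle(struct local_clock_buffer *b, bool used_next, int next) {
    b->clock_on = used_next;       /* gate the clocks when idle */
    if (b->clock_on)
        b->latches = next;         /* writes require the clock  */
    /* reads of b->latches remain legal either way */
}

int main(void) {
    struct local_clock_buffer buf = { true, 42 };
    cycle(&buf, false, 7);  /* gated cycle: value unchanged */
    printf("gated:  latches = %d\n", buf.latches);   /* 42 */
    cycle(&buf, true, 7);   /* clocked cycle: value updated */
    printf("active: latches = %d\n", buf.latches);   /* 7  */
    return 0;
}
```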
In addition to switching power, leakage power has become a performance limiter. To reduce leakage power, the Power5 uses transistors with low threshold voltage only in critical paths, such as the FPR read path. We implemented the Power5 SRAM arrays mainly with high-threshold-voltage devices.

The Power5 also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low-power mode, instructions dispatch once every 32 cycles at most, further reducing switching power. The Power5 uses this mode only when there is no ready task to run on either thread.

The out-of-order execution Power5 design, coupled with dual two-way simultaneous multithreaded processor cores, provides both instruction-level and thread-level parallelism. Future plans call for shrinking the Power5 die by using a 90-nm lithography fabrication process, which should allow even higher performance at lower power.

Acknowledgments

We thank Ravi Arimilli, Steve Dodson, and the entire Power5 team.

References

1. J.M. Tendler et al., "Power4 System Microarchitecture," IBM J. Research and Development, vol. 46, no. 1, Jan. 2002, pp. 5-26.
2. J. Borkenhagen et al., "A Multithreaded PowerPC Processor for Commercial Servers," IBM J. Research and Development, vol. 44, no. 6, Nov. 2000, pp. 885-898.
3. G. Alverson et al., "The Tera Computer System," Proc. 1990 ACM Int'l Conf. Supercomputing (Supercomputing 90), IEEE CS Press, 1990, pp. 1-6.
4. D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture (ISCA 95), ACM Press, 1995, pp. 392-403.
5. G.G. Shahidi et al., "Partially-Depleted SOI Technology for Digital Logic," Proc. Int'l Solid-State Circuits Conf. (ISSCC 99), IEEE Press, 1999, pp. 426-427.
6. J.E. Smith, "A Study of Branch Prediction Strategies," Proc. 8th Int'l Symp. Computer Architecture (ISCA 81), IEEE CS Press, 1981, pp. 135-148.
7. S. McFarling, Combining Branch Predictors, tech. note TN-36, Digital Equipment Corp. Western Research Laboratory, 1993.
Ron Kalla is the lead engineer for IBM Power5 systems, specializing in processor core development. His research interests include computer microarchitecture and post-silicon hardware verification. Kalla has a BSEE from the University of Minnesota.

Balaram Sinharoy is the chief scientist for the IBM Power5 microprocessor. His research interests include advanced microprocessor design, computer architecture, and performance analysis. Sinharoy has a BS in physics, a BTech in computer science and electrical engineering from the University of Calcutta, and an MS and a PhD in computer science from Rensselaer Polytechnic Institute. He is an IBM Master Inventor and a senior member of the IEEE.

Joel M. Tendler is the program director of technology assessment for the IBM Systems and Technology Group in Austin, Texas. His research interests include computer systems design, architecture, and performance. Tendler has a bachelor's degree in engineering from The Cooper Union, and a PhD in electrical engineering from Syracuse University. He is a member of the IEEE and the Computer Society.

Direct questions and comments about this article to Joel Tendler, IBM Corp., 0453B002, 11400 Burnett Road, Austin, TX 78758.
