Module 2

FLOORPLANNING
AND
PLACEMENT
The input to the floorplanning step is the output of system partitioning and design
entry: a netlist. Floorplanning precedes placement, but we shall cover them
together. The output of the placement step is a set of directions for the routing
tools.
At the start of floorplanning we have a netlist describing circuit blocks, the logic
cells within the blocks, and their connections. For example, Figure 16.1 shows
the Viterbi decoder example as a collection of standard cells with no room set
aside yet for routing. We can think of the standard cells as a hod of bricks to be
made into a wall. What we have to do now is set aside spaces (we call these
spaces the channels ) for interconnect, the mortar, and arrange the cells.
Figure 16.2 shows a finished wall, after the floorplanning and placement steps are
complete. We still have not completed any routing at this point (that comes later);
all we have done is place the logic cells in a fashion that we hope will minimize
the total interconnect length, for example.
FIGURE 16.1 The starting point for the floorplanning and placement steps for
the Viterbi decoder (containing only standard cells). This is the initial display of
the floorplanning and placement tool. The small boxes that look like bricks are
the outlines of the standard cells. The largest standard cells, at the bottom of the
display (labeled dfctnb) are 188 D flip-flops. The '+' symbols represent the
drawing origins of the standard cells; for the D flip-flops they are shifted to the
left of and below the logic cell's bottom left-hand corner. The large box surrounding
all the logic cells represents the estimated chip size. (This is a screen shot from
Cadence Cell Ensemble.)
FIGURE 16.2 The Viterbi Decoder (from Figure 16.1 ) after floorplanning and
placement. There are 18 rows of standard cells separated by 17 horizontal
channels (labeled 2 to 18). The channels are routed as numbered. In this example,
the I/O pads are omitted to show the cell placement more clearly. Figure 17.1
shows the same placement without the channel labels. (A screen shot from
Cadence Cell Ensemble.)
16.1 Floorplanning
Figure 16.3 shows that both interconnect delay and gate delay decrease as we
scale down feature sizes, but at different rates. This is because interconnect
capacitance tends to a limit of about 2 pF/cm for a minimum-width wire, while
gate delay continues to decrease (see Section 17.4, Circuit Extraction and DRC).
Floorplanning allows us to predict this interconnect delay by estimating
interconnect length.
FIGURE 16.3 Interconnect
and gate delays. As feature
sizes decrease, both average
interconnect delay and
average gate delay decrease
but at different rates. This is
because interconnect
capacitance tends to a limit
that is independent of
scaling. Interconnect delay
now dominates gate delay.

16.1.1 Floorplanning Goals and Objectives
The input to a floorplanning tool is a hierarchical netlist that describes the
interconnection of the blocks (RAM, ROM, ALU, cache controller, and so on);
the logic cells (NAND, NOR, D flip-flop, and so on) within the blocks; and the
logic cell connectors (the terms terminals , pins , or ports mean the same thing as
connectors ). The netlist is a logical description of the ASIC; the floorplan is a
physical description of an ASIC. Floorplanning is thus a mapping between the
logical description (the netlist) and the physical description (the floorplan).
The goals of floorplanning are to:
● arrange the blocks on a chip,

● decide the location of the I/O pads,

● decide the location and number of the power pads,

● decide the type of power distribution, and

● decide the location and type of clock distribution.


The objectives of floorplanning are to minimize the chip area and minimize
delay. Measuring area is straightforward, but measuring delay is more difficult
and we shall explore this next.

16.1.2 Measurement of Delay in Floorplanning
Throughout the ASIC design process we need to predict the performance of the
final layout. In floorplanning we wish to predict the interconnect delay before we
complete any routing. Imagine trying to predict how long it takes to get from
Russia to China without knowing where in Russia we are or where our
destination is in China. Actually it is worse, because in floorplanning we may
move Russia or China.
To predict delay we need to know the parasitics associated with interconnect: the
interconnect capacitance ( wiring capacitance or routing capacitance ) as well as
the interconnect resistance. At the floorplanning stage we know only the fanout (
FO ) of a net (the number of gates driven by a net) and the size of the block that
the net belongs to. We cannot predict the resistance of the various pieces of the
interconnect path since we do not yet know the shape of the interconnect for a
net. However, we can estimate the total length of the interconnect and thus
estimate the total capacitance. We estimate interconnect length by collecting
statistics from previously routed chips and analyzing the results. From these
statistics we create tables that predict the interconnect capacitance as a function
of net fanout and block size. A floorplanning tool can then use these
predicted-capacitance tables (also known as interconnect-load tables or wire-load
tables ). Figure 16.4 shows how we derive and use wire-load tables and illustrates
the following facts:
FIGURE 16.4 Predicted capacitance. (a) Interconnect lengths as a function of
fanout (FO) and circuit-block size. (b) Wire-load table. There is only one
capacitance value for each fanout (typically the average value). (c) The
wire-load table predicts the capacitance and delay of a net (with a considerable
error). Net A and net B both have a fanout of 1, both have the same predicted net
delay, but net B in fact has a much greater delay than net A in the actual layout
(of course we shall not know what the actual layout is until much later in the
design process).
● Typically between 60 and 70 percent of nets have a FO = 1.
● The distribution for a FO = 1 has a very long tail, stretching to
interconnects that run from corner to corner of the chip.
● The distribution for a FO = 1 often has two peaks, corresponding to a
distribution for close neighbors in subgroups within a block, superimposed
on a distribution corresponding to routing between subgroups.
● We often see a twin-peaked distribution at the chip level also,
corresponding to separate distributions for intrablock routing (inside
blocks) and interblock routing (between blocks).
● The distributions for FO > 1 are more symmetrical and flatter than for FO
= 1.
● The wire-load tables can only contain one number, for example the average
net capacitance, for any one distribution. Many tools take a worst-case
approach and use the 80- or 90-percentile point instead of the average.
Thus a tool may use a predicted capacitance for which we know 90 percent
of the nets will have less than the estimated capacitance.
● We need to repeat the statistical analysis for blocks with different sizes.
For example, a net with a FO = 1 in a 25 k-gate block will have a different
(larger) average length than if the net were in a 5 k-gate block.
● The statistics depend on the shape (aspect ratio) of the block (usually the
statistics are only calculated for square blocks).
● The statistics will also depend on the type of netlist. For example, the
distributions will be different for a netlist generated by setting a constraint
for minimum logic delay during synthesis (which tends to generate large
numbers of two-input NAND gates) than for netlists generated using
minimum-area constraints.
There are no standards for the wire-load tables themselves, but there are some
standards for their use and for presenting the extracted loads (see Section 16.4 ).
Wire-load tables often present loads in terms of a standard load that is usually the
input capacitance of a two-input NAND gate with a 1X (default) drive strength.
TABLE 16.1 A wire-load table showing average interconnect lengths (mm). 1

Array (available gates)   Chip size (mm)   Fanout = 1   Fanout = 2   Fanout = 4
3 k                       3.45             0.56         0.85         1.46
11 k                      5.11             0.84         1.34         2.25
105 k                     12.50            1.75         2.70         4.92

Table 16.1 shows the estimated metal interconnect lengths, as a function of die
size and fanout, for a series of three-level metal gate arrays. In this case the
interconnect capacitance is about 2 pF/cm, a typical figure.
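As a sketch of how a floorplanning tool might consult such a table, the following hard-codes the Table 16.1 values and converts a predicted length into a predicted net capacitance using the 2 pF/cm figure. The clamping rule for untabulated fanouts is an assumption; real tools interpolate according to their own conventions.

```python
# A minimal wire-load table lookup (values from Table 16.1).
# The table maps array size -> {fanout: average interconnect length in mm};
# multiplying the length by 2 pF/cm gives a predicted net capacitance.

WIRE_LOAD_MM = {
    3_000:   {1: 0.56, 2: 0.85, 4: 1.46},
    11_000:  {1: 0.84, 2: 1.34, 4: 2.25},
    105_000: {1: 1.75, 2: 2.70, 4: 4.92},
}
CAP_PER_CM_PF = 2.0   # typical interconnect capacitance, 2 pF/cm

def predicted_cap_pf(array_gates: int, fanout: int) -> float:
    """Predicted net capacitance (pF) from the wire-load table."""
    lengths = WIRE_LOAD_MM[array_gates]
    # Assumption: clamp to the nearest tabulated fanout at or below the request.
    fo = max(f for f in lengths if f <= fanout)
    length_cm = lengths[fo] / 10.0   # mm -> cm
    return length_cm * CAP_PER_CM_PF

# A FO = 1 net in the 105 k-gate array: 1.75 mm -> 0.35 pF predicted.
print(predicted_cap_pf(105_000, 1))
```

As Figure 16.4 (c) warns, this single number per fanout carries considerable error: two FO = 1 nets get identical predicted capacitance even if their actual routed lengths differ greatly.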
Figure 16.5 shows that, because we do not decrease chip size as we scale down
feature size, the worst-case interconnect delay increases. One way to measure the
worst-case delay uses an interconnect that completely crosses the chip, a
coast-to-coast interconnect . In certain cases the worst-case delay of a 0.25 μm
process may be worse than that of a 0.35 μm process, for example.
FIGURE 16.5
Worst-case
interconnect delay.
As we scale circuits,
but avoid scaling the
chip size, the
worst-case
interconnect delay
increases.

16.1.3 Floorplanning Tools


Figure 16.6 (a) shows an initial random floorplan generated by a floorplanning
tool. Two of the blocks, A and C in this example, are standard-cell areas (the chip
shown in Figure 16.1 is one large standard-cell area). These are flexible blocks
(or variable blocks ) because, although their total area is fixed, their shape (aspect
ratio) and connector locations may be adjusted during the placement step. The
dimensions and connector locations of the other fixed blocks (perhaps RAM,
ROM, compiled cells, or megacells) can only be modified when they are created.
We may force logic cells to be in selected flexible blocks by seeding . We choose
seed cells by name. For example, ram_control* would select all logic cells whose
names started with ram_control to be placed in one flexible block. The special
symbol, usually ' * ', is a wildcard symbol . Seeding may be hard or soft. A hard
seed is fixed and not allowed to move during the remaining floorplanning and
placement steps. A soft seed is an initial suggestion only and can be altered if
necessary by the floorplanner. We may also use seed connectors within flexible
blocks, forcing certain nets to appear in a specified order, or location, at the
boundary of a flexible block.
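The name-matching step of seeding can be sketched with Python's standard fnmatch wildcard matching. The cell names and the tuple format for a seed assignment are hypothetical; an actual floorplanner has its own command syntax for this.

```python
# Name-based seeding: select logic cells whose instance names match a
# wildcard pattern and assign them to a flexible block. A hard seed is
# fixed; a soft seed is a suggestion the floorplanner may override.
from fnmatch import fnmatch

cells = ["ram_control_fsm", "ram_control_mux", "alu_adder", "dec_state0"]

def seed(cells, pattern, block, hard=False):
    """Return (cell, block, hard) seed assignments for matching cells."""
    return [(c, block, hard) for c in cells if fnmatch(c, pattern)]

# 'ram_control*' seeds both RAM-control cells into flexible block A.
for cell, block, hard in seed(cells, "ram_control*", "A", hard=True):
    print(cell, "->", block, "(hard)" if hard else "(soft)")
```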
FIGURE 16.6 Floorplanning a cell-based ASIC. (a) Initial floorplan generated
by the floorplanning tool. Two of the blocks are flexible (A and C) and contain
rows of standard cells (unplaced). A pop-up window shows the status of block
A. (b) An estimated placement for flexible blocks A and C. The connector
positions are known and a rat's nest display shows the heavy congestion below
block B. (c) Moving blocks to improve the floorplan. (d) The updated display
shows the reduced congestion after the changes.

The floorplanner can complete an estimated placement to determine the positions
of connectors at the boundaries of the flexible blocks. Figure 16.6 (b) illustrates a
rat's nest display of the connections between blocks. Connections are shown as
bundles between the centers of blocks or as flight lines between connectors.
Figure 16.6 (c) and (d) show how we can move the blocks in a floorplanning tool
to minimize routing congestion .
We need to control the aspect ratio of our floorplan because we have to fit our
chip into the die cavity (a fixed-size hole, usually square) inside a package.
Figure 16.7 (a) to (c) shows how we can rearrange our chip to achieve a square
aspect ratio. Figure 16.7 (c) also shows a congestion map , another form of
routability display. There is no standard measure of routability. Generally the
interconnect channels (or wiring channels; I shall call them channels from now
on) have a certain channel capacity ; that is, they can handle only a fixed number
of interconnects. One measure of congestion is the difference between the
number of interconnects that we actually need, called the channel density , and
the channel capacity. Another measure, shown in Figure 16.7 (c), uses the ratio of
channel density to the channel capacity. With practice, we can create a good
initial placement by floorplanning and a pictorial display. This is one area where
the human ability to recognize patterns and spatial relations is currently superior
to a computer program's ability.
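The second congestion measure is simple enough to sketch directly: for each channel, divide the channel density by the channel capacity, and flag any channel whose ratio exceeds 1.0 as unroutable. The channel names and numbers below are hypothetical.

```python
# A toy congestion map: the ratio of channel density (interconnects
# actually needed) to channel capacity (interconnects the channel can
# hold). A ratio above 1.0 marks a region that cannot be routed.
channels = {
    "ch1": {"density": 12, "capacity": 20},
    "ch2": {"density": 19, "capacity": 20},
    "ch3": {"density": 24, "capacity": 20},
}

def congestion(ch):
    return ch["density"] / ch["capacity"]

for name, ch in channels.items():
    r = congestion(ch)
    status = "OVERFLOW" if r > 1.0 else "ok"
    print(f"{name}: {r:.2f} {status}")
```

A real tool shades each channel by this ratio, as in Figure 16.7 (c), with the dark (overflowing) regions telling the designer where to resize or move blocks.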

FIGURE 16.7 Congestion analysis. (a) The initial floorplan with a 2:1.5 die
aspect ratio. (b) Altering the floorplan to give a 1:1 chip aspect ratio. (c) A trial
floorplan with a congestion map. Blocks A and C have been placed so that we
know the terminal positions in the channels. Shading indicates the ratio of
channel density to the channel capacity. Dark areas show regions that cannot be
routed because the channel congestion exceeds the estimated capacity.
(d) Resizing flexible blocks A and C alleviates congestion.
FIGURE 16.8 Routing a T-junction between two channels in two-level metal.
The dots represent logic cell pins. (a) Routing channel A (the stem of the T) first
allows us to adjust the width of channel B. (b) If we route channel B first (the
top of the T), this fixes the width of channel A. We have to route the stem of a
T-junction before we route the top.

16.1.4 Channel Definition


During the floorplanning step we assign the areas between blocks that are to be
used for interconnect. This process is known as channel definition or channel
allocation . Figure 16.8 shows a T-shaped junction between two rectangular
channels and illustrates why we must route the stem (vertical) of the T before the
bar. The general problem of choosing the order of rectangular channels to route is
channel ordering .

FIGURE 16.9 Defining the channel routing order for a slicing floorplan using a
slicing tree. (a) Make a cut all the way across the chip between circuit blocks.
Continue slicing until each piece contains just one circuit block. Each cut divides
a piece into two without cutting through a circuit block. (b) A sequence of cuts:
1, 2, 3, and 4 that successively slices the chip until only circuit blocks are left.
(c) The slicing tree corresponding to the sequence of cuts gives the order in
which to route the channels: 4, 3, 2, and finally 1.

Figure 16.9 shows a floorplan of a chip containing several blocks. Suppose we
cut along the block boundaries slicing the chip into two pieces ( Figure 16.9 a).
Then suppose we can slice each of these pieces into two. If we can continue in
this fashion until all the blocks are separated, then we have a slicing floorplan (
Figure 16.9 b). Figure 16.9 (c) shows how the sequence we use to slice the chip
defines a hierarchy of the blocks. Reversing the slicing order ensures that we
route the stems of all the channel T-junctions first.
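The "reverse the slicing order" rule can be sketched as a walk over a slicing tree. The tree below is a hypothetical example (not the one in Figure 16.9): each internal node carries the cut number assigned when the chip was sliced, and routing proceeds from the highest-numbered (innermost) cut back to cut 1.

```python
# Deriving channel routing order from a slicing tree. Internal nodes are
# (cut_number, left_piece, right_piece); leaves are circuit-block names.
# Cuts are numbered in slicing order; we route them in reverse, so every
# T-junction stem is routed before its top.

tree = (1,
        (2, "A", "B"),
        (3, "C", (4, "D", "E")))

def cut_numbers(node):
    """Collect all cut numbers appearing in a slicing tree."""
    if isinstance(node, str):        # leaf: a circuit block, no cut
        return []
    cut, left, right = node
    return [cut] + cut_numbers(left) + cut_numbers(right)

routing_order = sorted(cut_numbers(tree), reverse=True)
print(routing_order)   # route the innermost channel first, channel 1 last
```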
FIGURE 16.10 Cyclic constraints. (a) A nonslicing floorplan with a cyclic
constraint that prevents channel routing. (b) In this case it is difficult to find a
slicing floorplan without increasing the chip area. (c) This floorplan may be
sliced (with initial cuts 1 or 2) and has no cyclic constraints, but it is inefficient
in area use and will be very difficult to route.

Figure 16.10 shows a floorplan that is not a slicing structure. We cannot cut the
chip all the way across with a knife without chopping a circuit block in two. This
means we cannot route any of the channels in this floorplan without routing all of
the other channels first. We say there is a cyclic constraint in this floorplan. There
are two solutions to this problem. One solution is to move the blocks until we
obtain a slicing floorplan. The other solution is to allow the use of L -shaped,
rather than rectangular, channels (or areas with fixed connectors on all sides, a
switch box ). We need an area-based router rather than a channel router to route L
-shaped regions or switch boxes (see Section 17.2.6, Area-Routing Algorithms).
Figure 16.11 (a) displays the floorplan of the ASIC shown in Figure 16.7 . We
can remove the cyclic constraint by moving the blocks again, but this increases
the chip size. Figure 16.11 (b) shows an alternative solution. We merge the
flexible standard cell areas A and C. We can do this by selective flattening of the
netlist. Sometimes flattening can reduce the routing area because routing between
blocks is usually less efficient than routing inside the row-based blocks.
Figure 16.11 (b) shows the channel definition and routing order for our chip.

FIGURE 16.11 Channel definition and ordering. (a) We can eliminate the cyclic
constraint by merging the blocks A and C. (b) A slicing structure.
16.1.5 I/O and Power Planning
Every chip communicates with the outside world. Signals flow onto and off the
chip and we need to supply power. We need to consider the I/O and power
constraints early in the floorplanning process. A silicon chip or die (plural die,
dies, or dice) is mounted on a chip carrier inside a chip package . Connections are
made by bonding the chip pads to fingers on a metal lead frame that is part of the
package. The metal lead-frame fingers connect to the package pins . A die
consists of a logic core inside a pad ring . Figure 16.12 (a) shows a pad-limited
die and Figure 16.12 (b) shows a core-limited die . On a pad-limited die we use
tall, thin pad-limited pads , which maximize the number of pads we can fit
around the outside of the chip. On a core-limited die we use short, wide
core-limited pads . Figure 16.12 (c) shows how we can use both types of pad to
change the aspect ratio of a die to be different from that of the core.

FIGURE 16.12 Pad-limited and core-limited die. (a) A pad-limited die. The
number of pads determines the die size. (b) A core-limited die: The core logic
determines the die size. (c) Using both pad-limited pads and core-limited pads
for a square die.

Special power pads are used for the positive supply, or VDD, power buses (or
power rails ) and the ground or negative supply, VSS or GND. Usually one set of
VDD/VSS pads supplies one power ring that runs around the pad ring and
supplies power to the I/O pads only. Another set of VDD/VSS pads connects to a
second power ring that supplies the logic core. We sometimes call the I/O power
dirty power since it has to supply large transient currents to the output transistors.
We keep dirty power separate to avoid injecting noise into the internal-logic
power (the clean power ). I/O pads also contain special circuits to protect against
electrostatic discharge ( ESD ). These circuits can withstand very short
high-voltage (several kilovolt) pulses that can be generated during human or
machine handling.
Depending on the type of package and how the foundry attaches the silicon die to
the chip cavity in the chip carrier, there may be an electrical connection between
the chip carrier and the die substrate. Usually the die is cemented in the chip
cavity with a conductive epoxy, making an electrical connection between
substrate and the package cavity in the chip carrier. If we make an electrical
connection between the substrate and a chip pad, or to a package pin, it must be
to VDD ( n -type substrate) or VSS ( p -type substrate). This substrate connection
(for the whole chip) employs a down bond (or drop bond) to the carrier. We have
several options:
● We can dedicate one (or more) chip pad(s) to down bond to the chip
carrier.
● We can make a connection from a chip pad to the lead frame and down
bond from the chip pad to the chip carrier.
● We can make a connection from a chip pad to the lead frame and down
bond from the lead frame.
● We can down bond from the lead frame without using a chip pad.

● We can leave the substrate and/or chip carrier unconnected.

Depending on the package design, the type and positioning of down bonds may
be fixed. This means we need to fix the position of the chip pad for down
bonding using a pad seed .
A double bond connects two pads to one chip-carrier finger and one package pin.
We can do this to save package pins or reduce the series inductance of bond
wires (typically a few nanohenries) by parallel connection of the pads. A
multiple-signal pad or pad group is a set of pads. For example, an oscillator pad
usually comprises a set of two adjacent pads that we connect to an external
crystal. The oscillator circuit and the two signal pads form a single logic cell.
Another common example is a clock pad . Some foundries allow a special form
of corner pad (normal pads are edge pads ) that squeezes two pads into the area at
the corners of a chip using a special two-pad corner cell , to help meet bond-wire
angle design rules (see also Figure 16.13 b and c).

To reduce the series resistive and inductive impedance of power supply networks,
it is normal to use multiple VDD and VSS pads. This is particularly important
with the simultaneously switching outputs ( SSOs ) that occur when driving buses
off-chip [ Wada, Eino, and Anami, 1990]. The output pads can easily consume
most of the power on a CMOS ASIC, because the load on a pad (usually tens of
picofarads) is much larger than typical on-chip capacitive loads. Depending on
the technology it may be necessary to provide dedicated VDD and VSS pads for
every few SSOs. Design rules set how many SSOs can be used per VDD/VSS
pad pair. These dedicated VDD/VSS pads must follow groups of output pads as
they are seeded or planned on the floorplan. With some chip packages this can
become difficult because design rules limit the location of package pins that may
be used for supplies (due to the differing series inductance of each pin).
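The SSO design rule amounts to a ceiling division: given how many outputs switch simultaneously and the vendor's limit per supply-pad pair, compute the number of dedicated VDD/VSS pairs to seed alongside the output pads. The "4 SSOs per pair" figure below is a hypothetical placeholder; the real limit comes from the vendor's design rules.

```python
# A sketch of the SSO supply-pad check: how many dedicated VDD/VSS pad
# pairs a group of simultaneously switching outputs (SSOs) requires.
import math

def supply_pairs_needed(n_sso: int, sso_per_pair: int = 4) -> int:
    """VDD/VSS pad pairs needed for n_sso simultaneously switching outputs.

    sso_per_pair is an assumed design-rule limit, not a standard value.
    """
    return math.ceil(n_sso / sso_per_pair)

# A 32-bit output bus switching simultaneously:
print(supply_pairs_needed(32))   # 8 VDD/VSS pad pairs
```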
Using a pad mapping we translate the logical pad in a netlist to a physical pad
from a pad library . We might control pad seeding and mapping in the
floorplanner. The handling of I/O pads can become quite complex; there are
several nonobvious factors that must be considered when generating a pad ring:
● Ideally we would only need to design library pad cells for one orientation.
For example, an edge pad for the south side of the chip, and a corner pad
for the southeast corner. We could then generate other orientations by
rotation and flipping (mirroring). Some ASIC vendors will not allow
rotation or mirroring of logic cells in the mask file. To avoid these
problems we may need to have separate horizontal, vertical, left-handed,
and right-handed pad cells in the library with appropriate logical to
physical pad mappings.
● If we mix pad-limited and core-limited edge pads in the same pad ring, this
complicates the design of corner pads. Usually the two types of edge pad
cannot abut. In this case a corner pad also becomes a pad-format changer ,
or hybrid corner pad .
● In single-supply chips we have one VDD net and one VSS net, both global
power nets . It is also possible to use mixed power supplies (for example,
3.3 V and 5 V) or multiple power supplies ( digital VDD, analog VDD).
Figure 16.13 (a) and (b) are magnified views of the southeast corner of our
example chip and show the different types of I/O cells. Figure 16.13 (c) shows a
stagger-bond arrangement using two rows of I/O pads. In this case the design
rules for bond wires (the spacing and the angle at which the bond wires leave the
pads) become very important.
FIGURE 16.13 Bonding pads. (a) This chip uses both pad-limited and
core-limited pads. (b) A hybrid corner pad. (c) A chip with stagger-bonded pads.
(d) An area-bump bonded chip (or flip-chip). The chip is turned upside down
and solder bumps connect the pads to the lead frame.

Figure 16.13 (d) shows an area-bump bonding arrangement (also known as
flip-chip, solder-bump, or C4, terms coined by IBM, who developed this
technology [ Masleid, 1991]) used, for example, with ball-grid array ( BGA )
packages. Even though the bonding pads are located in the center of the chip, the
I/O circuits are still often located at the edges of the chip because of difficulties
in power supply distribution and integrating I/O circuits together with logic in the
center of the die.
In an MGA the pad spacing and I/O-cell spacing is fixed: each pad occupies a
fixed pad slot (or pad site ). This means that the properties of the pad I/O are also
fixed but, if we need to, we can parallel adjacent output cells to increase the
drive. To increase flexibility further the I/O cells can use a separation, the
I/O-cell pitch , that is smaller than the pad pitch . For example, three 4 mA driver
cells can occupy two pad slots. Then we can use two 4 mA output cells in parallel
to drive one pad, forming an 8 mA output pad as shown in Figure 16.14 . This
arrangement also means the I/O pad cells can be changed without changing the
base array. This is useful as bonding techniques improve and the pads can be
moved closer together.

FIGURE 16.14 Gate-array I/O pads. (a) Cell-based ASICs may contain pad
cells of different sizes and widths. (b) A corner of a gate-array base. (c) A
gate-array base with different I/O cell and pad pitches.
FIGURE 16.15 Power distribution. (a) Power distributed using m1 for VSS and
m2 for VDD. This helps minimize the number of vias and layer crossings needed
but causes problems in the routing channels. (b) In this floorplan m1 is run
parallel to the longest side of all channels, the channel spine. This can make
automatic routing easier but may increase the number of vias and layer
crossings. (c) An expanded view of part of a channel (interconnect is shown as
lines). If power runs on different layers along the spine of a channel, this forces
signals to change layers. (d) A closeup of VDD and VSS buses as they cross.
Changing layers requires a large number of via contacts to reduce resistance.

Figure 16.15 shows two possible power distribution schemes. The long direction
of a rectangular channel is the channel spine . Some automatic routers may
require that metal lines parallel to a channel spine use a preferred layer (either
m1, m2, or m3). Alternatively we say that a particular metal layer runs in a
preferred direction . Since we can have both horizontal and vertical channels, we
may have the situation shown in Figure 16.15 , where we have to decide whether
to use a preferred layer or the preferred direction for some channels. This may or
may not be handled automatically by the routing software.

16.1.6 Clock Planning


Figure 16.16 (a) shows a clock spine (not to be confused with a channel spine)
routing scheme with all clock pins driven directly from the clock driver. MGAs
and FPGAs often use this fish bone type of clock distribution scheme.
Figure 16.16 (b) shows a clock spine for a cell-based ASIC. Figure 16.16 (c)
shows the clock-driver cell, often part of a special clock-pad cell. Figure 16.16
(d) illustrates clock skew and clock latency . Since all clocked elements are
driven from one net with a clock spine, skew is caused by differing interconnect
lengths and loads. If the clock-driver delay is much larger than the interconnect
delays, a clock spine achieves minimum skew but with long latency.
FIGURE 16.16 Clock distribution. (a) A clock spine for a gate array. (b) A
clock spine for a cell-based ASIC (typical chips have thousands of clock nets).
(c) A clock spine is usually driven from one or more clock-driver cells. Delay in
the driver cell is a function of the number of stages and the ratio of output to
input capacitance for each stage (taper). (d) Clock latency and clock skew. We
would like to minimize both latency and skew.

Clock skew represents a fraction of the clock period that we cannot use for
computation. A clock skew of 500 ps with a 200 MHz clock means that we waste
500 ps of every 5 ns clock cycle, or 10 percent of performance. Latency can
cause a similar loss of performance at the system level when we need to
resynchronize our output signals with a master system clock.
Figure 16.16 (c) illustrates the construction of a clock-driver cell. The delay
through a chain of CMOS gates is minimized when the ratio between the input
capacitance C1 and the output (load) capacitance C2 is about 3 (exactly e ≈ 2.7,
an exponential ratio, if we neglect the effect of parasitics). This means that the
fastest way to drive a large load is to use a chain of buffers with their input and
output loads chosen to maintain this ratio, or taper (we use this as a noun and a
verb). This is not necessarily the smallest or lowest-power method, though.
Suppose we have an ASIC with the following specifications:
● 40,000 flip-flops

● Input capacitance of the clock input to each flip-flop is 0.025 pF


● Clock frequency is 200 MHz
● V DD = 3.3 V
● Chip size is 20 mm on a side
● Clock spine consists of 200 lines across the chip
● Interconnect capacitance is 2 pF/cm
In this case the clock-spine capacitance C L = 200 × 2 cm × 2 pF/cm = 800 pF.
If we drive the clock spine with a chain of buffers with taper equal to e ≈ 2.7, and
with a first-stage input capacitance of 0.025 pF (a reasonable value for a 0.5 μm
process), we will need

ln ( 800 × 10⁻¹² / 0.025 × 10⁻¹² ) ≈ 10.4, or 11 stages. (16.1)

The power dissipated charging the input capacitance of the flip-flop clocks is
fCV² or

P1 = (4 × 10⁴) (200 MHz) (0.025 pF) (3.3 V)² = 2.178 W . (16.2)

or approximately 2 W. This is only a little larger than the power dissipated
driving the 800 pF clock-spine interconnect, which we can calculate as follows:

P2 = (200) (200 MHz) (20 mm) (2 pF/cm) (3.3 V)² = 1.7424 W . (16.3)

All of this power is dissipated in the clock-driver cell. The worst problem,
however, is the enormous peak current in the final inverter stage. If we assume
the needed rise time is 0.1 ns (with a 200 MHz clock whose period is 5 ns), the
peak current would have to approach

I = (800 pF) (3.3 V) / (0.1 ns) ≈ 26 A . (16.4)

Clearly such a current is not possible without extraordinary design techniques.
Clock spines are used to drive loads of 100 to 200 pF but, as is apparent from the
power dissipation problems of this example, it would be better to find a way to
spread the power dissipation more evenly across the chip.
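The arithmetic of Equations 16.1 to 16.4 can be checked with a few lines of Python, using the specifications listed above:

```python
# The clock-spine example of Eqs. 16.1-16.4, worked numerically.
import math

C_in   = 0.025e-12          # first-stage input capacitance (F)
C_L    = 200 * 2.0 * 2e-12  # 200 lines x 2 cm x 2 pF/cm = 800 pF
f      = 200e6              # clock frequency (Hz)
V      = 3.3                # supply voltage (V)
n_ff   = 40_000             # number of flip-flops
t_rise = 0.1e-9             # required rise time (s)

stages = math.ceil(math.log(C_L / C_in))     # taper of e per stage (16.1)
P_ff   = n_ff * f * 0.025e-12 * V**2         # flip-flop clock-pin power (16.2)
P_wire = f * C_L * V**2                      # clock-spine wiring power (16.3)
I_peak = C_L * V / t_rise                    # final-stage peak current (16.4)

print(stages)            # 11 stages
print(round(P_ff, 3))    # 2.178 W
print(round(P_wire, 4))  # 1.7424 W
print(round(I_peak, 1))  # 26.4 A
```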
We can design a tree of clock buffers so that the taper of each stage is e ≈ 2.7 by
using a fanout of three at each node, as shown in Figure 16.17 (a) and (b). The
clock tree , shown in Figure 16.17 (c), uses the same number of stages as a clock
spine, but with a lower peak current for the inverter buffers. Figure 16.17 (c)
illustrates that we now have another problem: we need to balance the delays
through the tree carefully to minimize clock skew (see Section 17.3.1, Clock
Routing).
FIGURE 16.17 A clock tree. (a) Minimum delay is achieved when the taper of
successive stages is about 3. (b) Using a fanout of three at successive nodes.
(c) A clock tree for the cell-based ASIC of Figure 16.16 b. We have to balance
the clock arrival times at all of the leaf nodes to minimize clock skew.

Designing a clock tree that balances the rise and fall times at the leaf nodes has
the beneficial side-effect of minimizing the effect of hot-electron wearout . This
problem occurs when an electron gains enough energy to become hot and jump
out of the channel into the gate oxide (the problem is worse for electrons in n
-channel devices because electrons are more mobile than holes). The trapped
electrons change the threshold voltage of the device and this alters the delay of
the buffers. As the buffer delays change with time, this introduces unpredictable
skew. The problem is worst when the n -channel device is carrying maximum
current with a high voltage across the channel; this occurs during the rise- and
fall-time transitions. Balancing the rise and fall times in each buffer means that
they all wear out at the same rate, minimizing any additional skew.
A phase-locked loop ( PLL ) is an electronic flywheel that locks in frequency to
an input clock signal. The input and output frequencies may differ in phase,
however. This means that we can, for example, drive a clock network with a PLL
in such a way that the output of the clock network is locked in phase to the
incoming clock, thus eliminating the latency of the clock network. A PLL can
also help to reduce random variation of the input clock frequency, known as jitter,
which, since it is unpredictable, must also be discounted from the time available
for computation in each clock cycle. Actel was one of the first FPGA vendors to
incorporate PLLs, and Actel's online product literature explains their use in ASIC
design.

1. Interconnect lengths are derived from interconnect capacitance data.
Interconnect capacitance is 2 pF cm⁻¹.
16.2 Placement
After completing a floorplan we can begin placement of the logic cells within the
flexible blocks. Placement is much more suited to automation than floorplanning.
Thus we shall need measurement techniques and algorithms. After we complete
floorplanning and placement, we can predict both intrablock and interblock
capacitances. This allows us to return to logic synthesis with more accurate estimates
of the capacitive loads that each logic cell must drive.

16.2.1 Placement Terms and Definitions


CBIC, MGA, and FPGA architectures all have rows of logic cells separated by the
interconnect; these are row-based ASICs . Figure 16.18 shows an example of the
interconnect structure for a CBIC. Interconnect runs in horizontal and vertical
directions in the channels and in the vertical direction by crossing through the logic
cells. Figure 16.18 (c) illustrates the fact that it is possible to use over-the-cell routing
( OTC routing) in areas that are not blocked. However, OTC routing is complicated by
the fact that the logic cells themselves may contain metal on the routing layers. We
shall return to this topic in Section 17.2.7, Multilevel Routing. Figure 16.19 shows
the interconnect structure of a two-level metal MGA.
FIGURE 16.18 Interconnect structure. (a) The two-level metal CBIC floorplan
shown in Figure 16.11 b. (b) A channel from the flexible block A. This channel has a
channel height equal to the maximum channel density of 7 (there is room for seven
interconnects to run horizontally in m1). (c) A channel that uses OTC (over-the-cell)
routing in m2.

Most ASICs currently use two or three levels of metal for signal routing. With two
layers of metal, we route within the rectangular channels using the first metal layer for
horizontal routing, parallel to the channel spine, and the second metal layer for the
vertical direction (if there is a third metal layer it will normally run in the horizontal
direction again). The maximum number of horizontal interconnects that can be placed
side by side, parallel to the channel spine, is the channel capacity .

FIGURE 16.19 Gate-array interconnect. (a) A small two-level metal gate array
(about 4.6 k-gate). (b) Routing in a block. (c) Channel routing showing channel
density and channel capacity. The channel height on a gate array may only be
increased in increments of a row. If the interconnect does not use up all of the
channel, the rest of the space is wasted. The interconnect in the channel runs in m1 in
the horizontal direction with m2 in the vertical direction.

Vertical interconnect uses feedthroughs (or feedthrus in the United States) to cross the
logic cells. Here are some commonly used terms with explanations (there are no
generally accepted definitions):
● An unused vertical track (or just track ) in a logic cell is called an uncommitted
feedthrough (also built-in feedthrough , implicit feedthrough , or jumper ).
● A vertical strip of metal that runs from the top to bottom of a cell (for
double-entry cells ), but has no connections inside the cell, is also called a
feedthrough or jumper.
● Two connectors for the same physical net are electrically equivalent connectors
(or equipotential connectors ). For double-entry cells these are usually at the top
and bottom of the logic cell.
● A dedicated feedthrough cell (or crosser cell ) is an empty cell (with no logic)
that can hold one or more vertical interconnects. These are used if there are no
other feedthroughs available.
● A feedthrough pin or feedthrough terminal is an input or output that has
connections at both the top and bottom of the standard cell.
● A spacer cell (usually the same as a feedthrough cell) is used to fill space in
rows so that the ends of all rows in a flexible block may be aligned to connect
to power buses, for example.
There is no standard terminology for connectors and the terms can be very confusing.
There is a difference between connectors that are joined inside the logic cell using a
high-resistance material such as polysilicon and connectors that are joined by
low-resistance metal. The high-resistance kind are really two separate alternative
connectors (that cannot be used as a feedthrough), whereas the low-resistance kind are
electrically equivalent connectors. There may be two or more connectors to a logic
cell, which are not joined inside the cell, and which must be joined by the router (
must-join connectors ).
There are also logically equivalent connectors (or functionally equivalent connectors,
sometimes also called just equivalent connectorswhich is very confusing). The two
inputs of a two-input NAND gate may be logically equivalent connectors. The
placement tool can swap these without altering the logic (but the two inputs may have
different delay properties, so it is not always a good idea to swap them). There can
also be logically equivalent connector groups . For example, in an OAI22
(OR-AND-INVERT) gate there are four inputs: A1, A2 are inputs to one OR gate
(gate A), and B1, B2 are inputs to the second OR gate (gate B). Then group A = (A1,
A2) is logically equivalent to group B = (B1, B2): if we swap one input (A1 or A2)
from gate A to gate B, we must swap the other input in the group (A2 or A1).
In the case of channeled gate arrays and FPGAs, the horizontal interconnect areas
(the channels, usually on m1) have a fixed capacity (sometimes they are called
fixed-resource ASICs for this reason). The channel capacity of CBICs and channelless
MGAs can be expanded to hold as many interconnects as are needed. Normally we
choose, as an objective, to minimize the number of interconnects that use each
channel. In the vertical interconnect direction, usually m2, FPGAs still have fixed
resources. In contrast the placement tool can always add vertical feedthroughs to a
channeled MGA, channelless MGA, or CBIC. These problems become less important
as we move to three and more levels of interconnect.
16.2.2 Placement Goals and Objectives
The goal of a placement tool is to arrange all the logic cells within the flexible blocks
on a chip. Ideally, the objectives of the placement step are to
● Guarantee the router can complete the routing step

● Minimize all the critical net delays

● Make the chip as dense as possible

We may also have the following additional objectives:


● Minimize power dissipation

● Minimize cross talk between signals

Objectives such as these are difficult to define in a way that can be solved with an
algorithm and even harder to actually meet. Current placement tools use more specific
and achievable criteria. The most commonly used placement objectives are one or
more of the following:
● Minimize the total estimated interconnect length

● Meet the timing requirements for critical nets

● Minimize the interconnect congestion

Each of these objectives in some way represents a compromise.

16.2.3 Measurement of Placement Goals and Objectives
In order to determine the quality of a placement, we need to be able to measure it. We
need an approximate measure of interconnect length, closely correlated with the final
interconnect length, that is easy to calculate.
The graph structures that correspond to making all the connections for a net are
known as trees on graphs (or just trees ). Special classes of trees, Steiner trees,
minimize the total length of interconnect and they are central to ASIC routing
algorithms. Figure 16.20 shows a minimum Steiner tree. This type of tree uses
diagonal connections; we want to solve a restricted version of this problem, using
interconnects on a rectangular grid. This is called rectilinear routing or Manhattan
routing (because of the east–west and north–south grid of streets in Manhattan). We
say that the Euclidean distance between two points is the straight-line distance (as the
crow flies). The Manhattan distance (or rectangular distance) between two points is
the distance we would have to walk in New York.
FIGURE 16.20 Placement using trees on graphs. (a) The floorplan from Figure 16.11
b. (b) An expanded view of the flexible block A showing four rows of standard cells
for placement (typical blocks may contain thousands or tens of thousands of logic
cells). We want to find the length of the net shown with four terminals, W through Z,
given the placement of four logic cells (labeled: A.211, A.19, A.43, A.25). (c) The
problem for net (W, X, Y, Z) drawn as a graph. The shortest connection is the
minimum Steiner tree. (d) The minimum rectilinear Steiner tree using Manhattan
routing. The rectangular (Manhattan) interconnect-length measures are shown for
each tree.

The minimum rectilinear Steiner tree ( MRST ) is the shortest interconnect using a
rectangular grid. The determination of the MRST is in general an NP-complete
problem, which means it is hard to solve. For small numbers of terminals heuristic
algorithms do exist, but they are expensive to compute. Fortunately we only need to
estimate the length of the interconnect. Two approximations to the MRST are shown
in Figure 16.21 .

The complete graph has connections from each terminal to every other terminal [
Hanan, Wolff, and Agule, 1973]. The complete-graph measure adds all the
interconnect lengths of the complete-graph connection together and then divides by
n/2, where n is the number of terminals. We can justify this since, in a graph with n
terminals, (n − 1) interconnects will emanate from each terminal to join the other
(n − 1) terminals in a complete graph connection. That makes n(n − 1) interconnects in
total. However, we have then made each connection twice. So there are one-half this
many, or n(n − 1)/2, interconnects needed for a complete graph connection. Now we
actually only need (n − 1) interconnects to join n terminals, so we have n/2 times as
many interconnects as we really need. Hence we divide the total net length of the
complete graph connection by n/2 to obtain a more reasonable estimate of minimum
interconnect length. Figure 16.21 (a) shows an example of the complete-graph
measure.
FIGURE 16.21 Interconnect-length
measures. (a) Complete-graph
measure. (b) Half-perimeter measure.
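A minimal sketch of the complete-graph measure using Manhattan distances (the function names and example coordinates are ours, not from any particular tool): sum the lengths of all n(n − 1)/2 pairwise connections and divide by n/2, as derived above.

```python
from itertools import combinations

def manhattan(p, q):
    """Manhattan (rectangular) distance between two terminals."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def complete_graph_measure(terminals):
    """Sum of all pairwise interconnect lengths, divided by n/2
    (the correction factor derived in the text)."""
    n = len(terminals)
    total = sum(manhattan(p, q) for p, q in combinations(terminals, 2))
    return total / (n / 2)

# Four terminals of one net on a routing grid (arbitrary example coordinates):
net = [(0, 0), (3, 1), (1, 4), (4, 4)]
print(complete_graph_measure(net))  # 29 total pairwise length / 2 = 14.5
```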

The bounding box is the smallest rectangle that encloses all the terminals (not to be
confused with a logic cell bounding box, which encloses all the layout in a logic cell).
The half-perimeter measure (or bounding-box measure) is one-half the perimeter of
the bounding box ( Figure 16.21 b) [ Schweikert, 1976]. For nets with two or three
terminals (corresponding to a fanout of one or two, which usually includes over 50
percent of all nets on a chip), the half-perimeter measure is the same as the minimum
Steiner tree. For nets with four or five terminals, the minimum Steiner tree is between
one and two times the half-perimeter measure [ Hanan, 1966]. For a circuit with m
nets, using the half-perimeter measure corresponds to minimizing the cost function,
f = (1/2) Σ_{i=1}^{m} h_i , (16.5)

where h i is the half-perimeter measure for net i .
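The half-perimeter measure and the cost function of Eq. 16.5 are equally short to sketch (again, the names and example coordinates are ours):

```python
def half_perimeter(terminals):
    """One-half the perimeter of the smallest rectangle enclosing the terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def placement_cost(nets):
    """The cost function of Eq. 16.5: one-half the sum of h_i over all m nets."""
    return 0.5 * sum(half_perimeter(net) for net in nets)

net = [(0, 0), (3, 1), (1, 4), (4, 4)]  # example terminal coordinates
print(half_perimeter(net))              # bounding box is 4 x 4, so h = 8
```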

It does not really matter if our approximations are inaccurate if there is a good
correlation between actual interconnect lengths (after routing) and our
approximations. Figure 16.22 shows that we can adjust the complete-graph and
half-perimeter measures using correction factors [ Goto and Matsuda, 1986]. Now our
wiring length approximations are functions, not just of the terminal positions, but also
of the number of terminals, and the size of the bounding box. One practical example
adjusts a Steiner-tree approximation using the number of terminals [ Chao, Nequist,
and Vuong, 1990]. This technique is used in the Cadence Gate Ensemble placement
tool, for example.
FIGURE 16.22 Correlation between total length
of chip interconnect and the half-perimeter and
complete-graph measures.

One problem with the measurements we have described is that the MRST may only
approximate the interconnect that will be completed by the detailed router. Some
programs have a meander factor that specifies, on average, the ratio of the
interconnect created by the routing tool to the interconnect-length estimate used by the
placement tool. Another problem is that we have concentrated on finding estimates to
the MRST, but the MRST that minimizes total net length may not minimize net delay
(see Section 16.2.8 ).

There is no point in minimizing the interconnect length if we create a placement that
is too congested to route. If we use minimum interconnect congestion as an additional
placement objective, we need some way of measuring it. What we are trying to
measure is interconnect density. Unfortunately we always use the term density to
mean channel density (which we shall discuss in Section 17.2.2, Measurement of
Channel Density). In this chapter, while we are discussing placement, we shall try to
use the term congestion , instead of density, to avoid any confusion.
One measure of interconnect congestion uses the maximum cut line . Imagine a
horizontal or vertical line drawn anywhere across a chip or block, as shown in
Figure 16.23 . The number of interconnects that must cross this line is the cut size (the
number of interconnects we cut). The maximum cut line has the highest cut size.
FIGURE 16.23 Interconnect congestion for the cell-based ASIC from Figure 16.11
(b). (a) Measurement of congestion. (b) An expanded view of flexible block A shows
a maximum cut line.
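A sketch of the cut-size measurement, assuming each net's contribution is estimated from its terminal bounding box (the function names and example nets are ours):

```python
def cut_size(nets, x_cut):
    """Number of nets whose bounding box straddles a vertical cut at x = x_cut."""
    count = 0
    for terminals in nets:
        xs = [x for x, _ in terminals]
        if min(xs) < x_cut < max(xs):
            count += 1
    return count

def max_cut(nets, width):
    """Scan cut lines between adjacent columns to find the maximum cut size."""
    return max(cut_size(nets, x + 0.5) for x in range(width))

nets = [[(0, 0), (3, 2)], [(1, 1), (2, 5)], [(4, 0), (5, 3)]]  # example nets
print(max_cut(nets, 6))  # the worst vertical cut crosses 2 interconnects
```

A real tool would measure horizontal cut lines the same way, and would use actual routes rather than bounding boxes once they are known.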

Many placement tools minimize estimated interconnect length or interconnect
congestion as objectives. The problem with this approach is that a logic cell may be
placed a long way from another logic cell to which it has just one connection. This
logic cell with one connection is less important as far as the total wire length is
concerned than other logic cells, to which there are many connections. However, the
one long connection may be critical as far as timing delay is concerned. As technology
is scaled, interconnection delays become larger relative to circuit delays and this
problem gets worse.
In timing-driven placement we must estimate delay for every net for every trial
placement, possibly for hundreds of thousands of gates. We cannot afford to use
anything other than the very simplest estimates of net delay. Unfortunately, the
minimum-length Steiner tree does not necessarily correspond to the interconnect path
that minimizes delay. To construct a minimum-delay path we may have to route with
non-Steiner trees. In the placement phase typically we take a simple
interconnect-length approximation to this minimum-delay path (typically the
half-perimeter measure). Even when we can estimate the length of the interconnect,
we do not yet have information on which layers and how many vias the interconnect
will use or how wide it will be. Some tools allow us to include estimates for these
parameters. Often we can specify metal usage , the percentage of routing on the
different layers to expect from the router. This allows the placement tool to estimate
RC values and delays, and thus minimize delay.

16.2.4 Placement Algorithms


There are two classes of placement algorithms commonly used in commercial CAD
tools: constructive placement and iterative placement improvement. A constructive
placement method uses a set of rules to arrive at a constructed placement. The most
commonly used methods are variations on the min-cut algorithm . The other
commonly used constructive placement algorithm is the eigenvalue method. As in
system partitioning, placement usually starts with a constructed solution and then
improves it using an iterative algorithm. In most tools we can specify the locations
and relative placements of certain critical logic cells as seed placements .
The min-cut placement method uses successive application of partitioning [ Breuer,
1977]. The following steps are shown in Figure 16.24 :
1. Cut the placement area into two pieces.
2. Swap the logic cells to minimize the cut cost.
3. Repeat the process from step 1, cutting smaller pieces until all the logic cells are
placed.

FIGURE 16.24 Min-cut placement. (a) Divide the chip into bins using a grid.
(b) Merge all connections to the center of each bin. (c) Make a cut and swap
logic cells between bins to minimize the cost of the cut. (d) Take the cut pieces
and throw out all the edges that are not inside the piece. (e) Repeat the process
with a new cut and continue until we reach the individual bins.

Usually we divide the placement area into bins . The size of a bin can vary, from a bin
size equal to the base cell (for a gate array) to a bin size that would hold several logic
cells. We can start with a large bin size, to get a rough placement, and then reduce the
bin size to get a final placement.
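The three steps above can be sketched as a recursion over bins. In this sketch the partitioner is only a placeholder that splits the cell list in half; a real tool would call a min-cut partitioner (of the kind discussed for system partitioning) at that point, and the region and cell names here are invented:

```python
# A skeleton of min-cut placement by recursive bisection.
def partition(cells):
    """Placeholder for a min-cut partitioner: just split the cells into halves."""
    mid = len(cells) // 2
    return cells[:mid], cells[mid:]

def min_cut_place(cells, region):
    """Recursively cut region (x0, y0, x1, y1) and assign cells to sub-regions."""
    x0, y0, x1, y1 = region
    if len(cells) <= 1:                       # reached an individual bin
        return {c: ((x0 + x1) / 2, (y0 + y1) / 2) for c in cells}
    left, right = partition(cells)
    if x1 - x0 >= y1 - y0:                    # cut the longer side of the region
        xm = (x0 + x1) / 2
        halves = ((left, (x0, y0, xm, y1)), (right, (xm, y0, x1, y1)))
    else:
        ym = (y0 + y1) / 2
        halves = ((left, (x0, y0, x1, ym)), (right, (x0, ym, x1, y1)))
    placement = {}
    for sub_cells, sub_region in halves:
        placement.update(min_cut_place(sub_cells, sub_region))
    return placement

print(min_cut_place(["A", "B", "C", "D"], (0, 0, 4, 4)))  # one cell per bin center
```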
The eigenvalue placement algorithm uses the cost matrix or weighted connectivity
matrix ( eigenvalue methods are also known as spectral methods ) [Hall, 1970]. The
measure we use is a cost function f that we shall minimize, given by
f = (1/2) Σ_{i,j=1}^{n} c_ij d_ij² , (16.6)

where C = [c_ij] is the (possibly weighted) connectivity matrix, and d_ij is the
Euclidean distance between the centers of logic cell i and logic cell j . Since we are
going to minimize a cost function that is the square of the distance between logic
cells, these methods are also known as quadratic placement methods. This type of cost
function leads to a simple mathematical solution. We can rewrite the cost function f in
matrix form:
f = (1/2) Σ_{i,j=1}^{n} c_ij [ (x_i − x_j)² + (y_i − y_j)² ]
  = x^T Bx + y^T By . (16.7)

In Eq. 16.7 , B is a symmetric matrix, the disconnection matrix (also called the
Laplacian).
We may express the Laplacian B in terms of the connectivity matrix C ; and D , a
diagonal matrix (known as the degree matrix), defined as follows:
B = D − C ; (16.8)

d_ii = Σ_{j=1}^{n} c_ij , i = 1, ... , n ; d_ij = 0, i ≠ j

We can simplify the problem by noticing that it is symmetric in the x - and y


-coordinates. Let us solve the simpler problem of minimizing the cost function for the
placement of logic cells along just the x -axis first. We can then apply this solution to
the more general two-dimensional placement problem. Before we solve this simpler
problem, we introduce a constraint that the coordinates of the logic cells must
correspond to valid positions (the cells do not overlap and they are placed on-grid).
We make another simplifying assumption that all logic cells are the same size and we
must place them in fixed positions. We can define a vector p consisting of the valid
positions:
p = [p_1, ..., p_n] . (16.9)

For a valid placement the x -coordinates of the logic cells,
x = [x_1, ..., x_n] , (16.10)

must be a permutation of the fixed positions, p . We can show that requiring the logic
cells to be in fixed positions in this way leads to a series of n equations restricting the
values of the logic cell coordinates [ Cheng and Kuh, 1984]. If we impose all of these
constraint equations the problem becomes very complex. Instead we choose just one
of the equations:
Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} p_i² . (16.11)

Simplifying the problem in this way will lead to an approximate solution to the
placement problem. We can write this single constraint on the x -coordinates in matrix
form:
x^T x = P ; P = Σ_{i=1}^{n} p_i² , (16.12)

where P is a constant. We can now summarize the formulation of the problem, with
the simplifications that we have made, for a one-dimensional solution. We must
minimize a cost function, g (analogous to the cost function f that we defined for the
two-dimensional problem in Eq. 16.7 ), where
g = x^T Bx , (16.13)

subject to the constraint:
x^T x = P . (16.14)

This is a standard problem that we can solve using a Lagrangian multiplier:
L = x^T Bx − λ [ x^T x − P ] . (16.15)

To find the value of x that minimizes g we differentiate L partially with respect to x
and set the result equal to zero. We get the following equation:
[ B − λI ] x = 0 . (16.16)

This last equation is called the characteristic equation for the disconnection matrix B
and occurs frequently in matrix algebra (this λ has nothing to do with scaling). The
solutions to this equation are the eigenvectors and eigenvalues of B . Multiplying Eq.
16.16 by x^T we get:
λ x^T x = x^T Bx . (16.17)

However, since we imposed the constraint x^T x = P and x^T Bx = g , then
λ = g / P . (16.18)

The eigenvectors of the disconnection matrix B are the solutions to our placement
problem. It turns out that (because something called the rank of matrix B is n − 1) there
is a degenerate solution with all x-coordinates equal (λ = 0); this makes some sense
because putting all the logic cells on top of one another certainly minimizes the
interconnect. The smallest, nonzero, eigenvalue and the corresponding eigenvector
provides the solution that we want. In the two-dimensional placement problem, the x -
and y -coordinates are given by the eigenvectors corresponding to the two smallest,
nonzero, eigenvalues. (In the next section a simple example illustrates this
mathematical derivation.)
16.2.5 Eigenvalue Placement Example
Consider the following connectivity matrix C and its disconnection matrix B ,
calculated from Eq. 16.8 [ Hall, 1970]:
      0 0 0 1
C =   0 0 1 1
      0 1 0 0
      1 1 0 0

      1 0 0 0     0 0 0 1      1  0  0 −1
B =   0 2 0 0  −  0 0 1 1  =   0  2 −1 −1
      0 0 1 0     0 1 0 0      0 −1  1  0
      0 0 0 2     1 1 0 0     −1 −1  0  2
(16.19)

Figure 16.25 (a) shows the corresponding network with four logic cells (1–4) and three
nets (A–C). Here is a MatLab script to find the eigenvalues and eigenvectors of B :

FIGURE 16.25 Eigenvalue placement. (a) An example network. (b) The
one-dimensional placement. The small black squares represent the centers of the logic
cells. (c) The two-dimensional placement. The eigenvalue method takes no account
of the logic cell sizes or actual location of logic cell connectors. (d) A complete
layout. We snap the logic cells to valid locations, leaving room for the routing in the
channel.
C = [0 0 0 1; 0 0 1 1; 0 1 0 0; 1 1 0 0]  % connectivity matrix
D = [1 0 0 0; 0 2 0 0; 0 0 1 0; 0 0 0 2]  % degree matrix
B = D - C                                  % disconnection matrix (Laplacian)
[X, D] = eig(B)   % eigenvectors in the columns of X, eigenvalues on the diagonal of D
Running this script, we find the eigenvalues of B are 0.5858, 0.0, 2.0, and 3.4142. The
corresponding eigenvectors of B are
 0.6533   0.5000   0.5000   0.2706
−0.2706   0.5000  −0.5000   0.6533
−0.6533   0.5000   0.5000  −0.2706
 0.2706   0.5000  −0.5000  −0.6533
(16.20)

For a one-dimensional placement ( Figure 16.25 b), we use the eigenvector (0.6533,
−0.2706, −0.6533, 0.2706) corresponding to the smallest nonzero eigenvalue (which is
0.5858) to place the logic cells along the x -axis. The two-dimensional placement (
Figure 16.25 c) uses these same values for the x -coordinates and the eigenvector (0.5,
−0.5, 0.5, −0.5) that corresponds to the next largest eigenvalue (which is 2.0) for the y
-coordinates. Notice that the placement shown in Figure 16.25 (c), which shows
logic-cell outlines (the logic-cell abutment boxes), takes no account of the cell sizes,
and cells may even overlap at this stage. This is because, in Eq. 16.11 , we discarded
all but one of the constraints necessary to ensure valid solutions. Often we use the
approximate eigenvalue solution as an initial placement for one of the iterative
improvement algorithms that we shall discuss in Section 16.2.6 .
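As a cross-check of these numbers, here is a pure-Python stand-in for the eigensolver call above, using classical Jacobi rotations on the symmetric matrix B (a teaching sketch with names of our choosing; a production flow would use a library eigensolver such as LAPACK):

```python
import math

def jacobi_eig(a, max_rot=50):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix,
    by repeatedly rotating away the largest off-diagonal element."""
    n = len(a)
    a = [row[:] for row in a]                                   # work on a copy
    v = [[float(i == j) for j in range(n)] for i in range(n)]   # accumulated rotations
    for _ in range(max_rot):
        p, q, big = 0, 1, 0.0                  # find largest off-diagonal element
        for i in range(n):
            for j in range(i + 1, n):
                if abs(a[i][j]) > big:
                    p, q, big = i, j, abs(a[i][j])
        if big < 1e-12:
            break
        # rotation angle that zeros a[p][q]
        theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):                     # rotate columns p and q
            akp, akq = a[k][p], a[k][q]
            a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
        for k in range(n):                     # rotate rows p and q
            apk, aqk = a[p][k], a[q][k]
            a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
        for k in range(n):                     # accumulate the eigenvector matrix
            vkp, vkq = v[k][p], v[k][q]
            v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v

C = [[0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
D = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 0, 2]]
B = [[D[i][j] - C[i][j] for j in range(4)] for i in range(4)]
evals, evecs = jacobi_eig(B)
print(sorted(evals))  # expect eigenvalues near 0, 0.5858, 2, 3.4142
```

Eigenvectors are only determined up to an overall sign, so a result such as (−0.6533, 0.2706, 0.6533, −0.2706) for the smallest nonzero eigenvalue is the same placement.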

16.2.6 Iterative Placement Improvement


An iterative placement improvement algorithm takes an existing placement and tries
to improve it by moving the logic cells. There are two parts to the algorithm:
● The selection criteria that decides which logic cells to try moving.

● The measurement criteria that decides whether to move the selected cells.

There are several interchange or iterative exchange methods that differ in their
selection and measurement criteria:
● pairwise interchange,

● force-directed interchange,

● force-directed relaxation, and

● force-directed pairwise relaxation.

All of these methods usually consider only pairs of logic cells to be exchanged. A
source logic cell is picked for trial exchange with a destination logic cell. We have
already discussed the use of interchange methods applied to the system partitioning
step. The most widely used methods use group migration, especially the Kernighan
Lin algorithm. The pairwise-interchange algorithm is similar to the interchange
algorithm used for iterative improvement in the system partitioning step:
1. Select the source logic cell at random.
2. Try all the other logic cells in turn as the destination logic cell.
3. Use any of the measurement methods we have discussed to decide on whether
to accept the interchange.
4. The process repeats from step 1, selecting each logic cell in turn as a source
logic cell.
Figure 16.26 (a) and (b) show how we can extend pairwise interchange to swap more
than two logic cells at a time. If we swap l logic cells at a time and find a locally
optimum solution, we say that solution is l-optimum . The neighborhood exchange
algorithm is a modification to pairwise interchange that considers only destination
logic cells in a neighborhood: cells within a certain distance, ε, of the source logic
cell. Limiting the search area for the destination logic cell to the ε-neighborhood
reduces the search time. Figure 16.26 (c) and (d) show the one- and
two-neighborhoods (based on Manhattan distance) for a logic cell.
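A minimal sketch of pairwise interchange driven by the half-perimeter wirelength of Section 16.2.3 (the greedy accept rule and all names are our choices; real tools use the more elaborate measurement criteria discussed above):

```python
import random

def half_perimeter(terminals):
    """Half-perimeter (bounding-box) measure from Section 16.2.3."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def wirelength(pos, nets):
    """Total estimated wirelength; each net is a list of logic-cell names."""
    return sum(half_perimeter([pos[c] for c in net]) for net in nets)

def pairwise_interchange(pos, nets, trials=200, seed=0):
    """Pick source and destination cells at random, swap them, and keep the
    swap only if the total estimated wirelength does not increase."""
    rng = random.Random(seed)
    cells = list(pos)
    best = wirelength(pos, nets)
    for _ in range(trials):
        a, b = rng.sample(cells, 2)
        pos[a], pos[b] = pos[b], pos[a]       # trial exchange
        cost = wirelength(pos, nets)
        if cost <= best:
            best = cost                       # accept the swap
        else:
            pos[a], pos[b] = pos[b], pos[a]   # reject: undo the swap
    return best

pos = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1)}
nets = [["A", "D"], ["B", "C"]]   # two 2-terminal nets, placed badly (cost 4)
print(pairwise_interchange(pos, nets))
```

With enough trials the wirelength should fall from the initial 4 toward the optimum of 2 for this tiny example.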

FIGURE 16.26 Interchange. (a) Swapping the source logic cell with a destination
logic cell in pairwise interchange. (b) Sometimes we have to swap more than two
logic cells at a time to reach an optimum placement, but this is expensive in
computation time. Limiting the search to neighborhoods reduces the search time.
Logic cells within a distance ε of a logic cell form an ε-neighborhood. (c) A
one-neighborhood. (d) A two-neighborhood.

Neighborhoods are also used in some of the force-directed placement methods .
Imagine identical springs connecting all the logic cells we wish to place. The number
of springs is equal to the number of connections between logic cells. The effect of the
springs is to pull connected logic cells together. The more highly connected the logic
cells, the stronger the pull of the springs. The force on a logic cell i due to logic cell j
is given by Hooke's law , which says the force of a spring is proportional to its
extension:
F_ij = c_ij x_ij . (16.21)

The vector component x ij is directed from the center of logic cell i to the center of
logic cell j . The vector magnitude is calculated as either the Euclidean or Manhattan
distance between the logic cell centers. The c ij form the connectivity or cost matrix
(the matrix element c ij is the number of connections between logic cell i and logic
cell j ). If we want, we can also weight the c ij to denote critical connections.
Figure 16.27 illustrates the force-directed placement algorithm.

FIGURE 16.27 Force-directed placement. (a) A network with nine logic cells.
(b) We make a grid (one logic cell per bin). (c) Forces are calculated as if springs
were attached to the centers of each logic cell for each connection. The two nets
connecting logic cells A and I correspond to two springs. (d) The forces are
proportional to the spring extensions.
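The spring forces of Eq. 16.21 can be computed directly from the cell positions and the connectivity matrix. In this sketch (the names and example network are ours) the force components use Euclidean displacements between cell centers:

```python
def forces(positions, connectivity):
    """Net spring force (Fx, Fy) on each logic cell from Eq. 16.21:
    F_ij = c_ij * x_ij, where x_ij points from cell i toward cell j."""
    result = {}
    for i, (xi, yi) in positions.items():
        fx = fy = 0.0
        for j, (xj, yj) in positions.items():
            c = connectivity.get((i, j), 0)   # number of connections i <-> j
            fx += c * (xj - xi)               # Hooke's law: force ~ extension
            fy += c * (yj - yi)
        result[i] = (fx, fy)
    return result

pos = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (0.0, 2.0)}
conn = {("A", "B"): 2, ("B", "A"): 2,      # two nets join A and B
        ("A", "C"): 1, ("C", "A"): 1}      # one net joins A and C
print(forces(pos, conn))  # A is pulled hardest, toward both B and C
```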

In the definition of connectivity (Section 15.7.1, Measuring Connectivity) it was
pointed out that the network graph does not accurately model connections for nets
with more than two terminals. Nets such as clock nets, power nets, and global reset
lines have a huge number of terminals. The force-directed placement algorithms
usually make special allowances for these situations to prevent the largest nets from
snapping all the logic cells together. In fact, without external forces to counteract the
pull of the springs between logic cells, the network will collapse to a single point as it
settles. An important part of force-directed placement is fixing some of the logic cells
in position. Normally ASIC designers use the I/O pads or other external connections
to act as anchor points or fixed seeds.
Figure 16.28 illustrates the different kinds of force-directed placement algorithms.
The force-directed interchange algorithm uses the force vector to select a pair of logic
cells to swap. In force-directed relaxation a chain of logic cells is moved. The
force-directed pairwise relaxation algorithm swaps one pair of logic cells at a time.

FIGURE 16.28 Force-directed iterative placement improvement. (a) Force-directed
interchange. (b) Force-directed relaxation. (c) Force-directed pairwise relaxation.

We reach a force-directed solution when we minimize the energy of the system,
corresponding to minimizing the sum of the squares of the distances separating logic
cells. Force-directed placement algorithms thus also use a quadratic cost function.

16.2.7 Placement Using Simulated Annealing
The principles of simulated annealing were explained in Section 15.7.8, Simulated
Annealing. Because simulated annealing requires so many iterations, it is critical that
the placement objectives be easy and fast to calculate. The optimum connection
pattern, the MRST, is difficult to calculate. Using the half-perimeter measure (
Section 16.2.3 ) corresponds to minimizing the total interconnect length. Applying
simulated annealing to placement, the algorithm is as follows:
1. Select logic cells for a trial interchange, usually at random.
2. Evaluate the objective function E for the new placement.
3. If ΔE is negative or zero, then exchange the logic cells. If ΔE is positive, then
exchange the logic cells with a probability of exp(−ΔE/T).
4. Go back to step 1 for a fixed number of times, and then lower the temperature T
according to a cooling schedule: T_{n+1} = 0.9 T_n , for example.
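The four steps above map directly onto a short annealing loop. This sketch (all parameter values and names are our choices) uses the half-perimeter wirelength as the objective E and the 0.9 cooling schedule from step 4:

```python
import math
import random

def anneal(pos, nets, t_start=10.0, t_min=0.1, moves_per_t=100, seed=1):
    """Simulated-annealing placement sketch following steps 1-4 above;
    the objective E is total half-perimeter wirelength (Section 16.2.3)."""
    def cost():
        total = 0
        for net in nets:
            xs = [pos[c][0] for c in net]
            ys = [pos[c][1] for c in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    rng = random.Random(seed)
    names = list(pos)
    t, e = t_start, cost()
    while t > t_min:
        for _ in range(moves_per_t):
            a, b = rng.sample(names, 2)            # step 1: random trial interchange
            pos[a], pos[b] = pos[b], pos[a]
            delta = cost() - e                     # step 2: evaluate the change in E
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                e += delta                         # step 3: accept the exchange
            else:
                pos[a], pos[b] = pos[b], pos[a]    # ...or undo it
        t *= 0.9                                   # step 4: cooling schedule
    return e

cells = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1)}
nets = [["A", "D"], ["B", "C"]]    # both nets start on the long diagonals
print(anneal(cells, nets))
```

For this toy four-cell example the loop should cool down to (or near) the optimum wirelength of 2; real placements need far slower cooling schedules, as noted below.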

Kirkpatrick, Gelatt, and Vecchi first described the use of simulated annealing applied
to VLSI problems [ 1983]. Experience since that time has shown that simulated
annealing normally requires the use of a slow cooling schedule and this means long
CPU run times [ Sechen, 1988; Wong, Leong, and Liu, 1988]. As a general rule,
experiments show that simple min-cut based constructive placement is faster than
simulated annealing but that simulated annealing is capable of giving better results at
the expense of long computer run times. The iterative improvement methods that we
described earlier are capable of giving results as good as simulated annealing, but they
use more complex algorithms.
While I am making wild generalizations, I will digress to discuss benchmarks of
placement algorithms (or any CAD algorithm that is random). It is important to
remember that the results of random methods are themselves random. Suppose the
results from two random algorithms, A and B, can each vary by ±10 percent for any
chip placement, but both algorithms have the same average performance. If we
compare single chip placements by both algorithms, they could falsely show
algorithm A to be better than B by up to 20 percent or vice versa. Put another way, if
we run enough test cases we will eventually find some for which A is better than B by
20 percent, a trick that Ph.D. students and marketing managers both know well. Even
single-run evaluations over multiple chips are hardly a fair comparison. The only way
to obtain meaningful results is to compare a statistically meaningful number of runs
for a statistically meaningful number of chips for each algorithm. This same caution
applies to any VLSI algorithm that is random. There was a Design Automation
Conference panel session whose theme was Enough of algorithms claiming
improvements of 5 %.
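To make the point concrete, here is a small simulation (purely illustrative; the numbers are invented, not from any real placement tool). Two "algorithms" with identical average quality and ±10 percent run-to-run noise are compared first by single runs and then by averages over many runs.

```python
import random
import statistics

rng = random.Random(42)

def one_run(mean_quality=100.0, noise=0.10):
    """One run of a random placement algorithm.

    Both algorithms A and B have the same mean; each run varies
    uniformly within +/-10 percent of it.
    """
    return mean_quality * (1.0 + rng.uniform(-noise, noise))

# Single-run comparison: the apparent gap can approach 20 percent.
single_a, single_b = one_run(), one_run()
gap = abs(single_a - single_b) / 100.0

# Averaged comparison: 1000 runs each, and the means nearly coincide.
mean_a = statistics.mean(one_run() for _ in range(1000))
mean_b = statistics.mean(one_run() for _ in range(1000))
```

A single pair of runs can suggest a large difference between identical algorithms; the averaged comparison does not.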
16.2.8 Timing-Driven Placement Methods
Minimizing delay is becoming more and more important as a placement objective.
There are two main approaches: net based and path based. We know that we can use
net weights in our algorithms. The problem is to calculate the weights. One method
finds the n most critical paths (using a timing-analysis engine, possibly in the
synthesis tool). The net weights might then be the number of times each net appears in
this list. The problem with this approach is that as soon as we fix (for example) the
first 100 critical nets, suddenly another 200 become critical. This is rather like trying
to put worms in a can: as soon as we open the lid to put one in, two more pop out.
Another method to find the net weights uses the zero-slack algorithm [ Hauge et al.,
1987]. Figure 16.29 shows how this works (all times are in nanoseconds).
Figure 16.29 (a) shows a circuit with primary inputs at which we know the arrival
times (this is the original definition; some people use the term actual times) of each
signal. We also know the required times for the primary outputs, the points in time at
which we want the signals to be valid. We can work forward from the primary inputs
and backward from the primary outputs to determine arrival and required times at
each input pin for each net. The difference between the required and arrival times at
each input pin is the slack time (the time we have to spare). The zero-slack algorithm
adds delay to each net until the slacks are zero, as shown in Figure 16.29 (b). The net
delays can then be converted to weights or constraints in the placement. Notice that
we have assumed that all the gates on a net switch at the same time so that the net
delay can be placed at the output of the gate driving the net. This is a rather poor
timing model, but the best we can use without any routing information.
FIGURE 16.29 The zero-slack algorithm. (a) The circuit with no net delays. (b) The
zero-slack algorithm adds net delays (at the outputs of each gate, equivalent to
increasing the gate delay) to reduce the slack times to zero.
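The arrival/required/slack bookkeeping that the zero-slack algorithm starts from can be sketched directly. The circuit below is a made-up two-gate example (not the circuit of Figure 16.29); the gate delays and input arrival times are invented for illustration.

```python
# Each gate is (input nets, output net, gate delay in ns), listed in
# topological order. Arrival times propagate forward from the primary
# inputs; required times propagate backward from the primary outputs;
# slack = required - arrival.
gates = [
    (["a", "b"], "n1", 2.0),    # gate G1: inputs a, b; output n1
    (["n1", "c"], "out", 1.0),  # gate G2: inputs n1, c; output out
]
arrival = {"a": 0.0, "b": 1.0, "c": 0.0}   # primary-input arrival times
required = {"out": 5.0}                     # primary-output required time

# Forward pass: arrival at a gate output = latest input arrival + delay.
for ins, out, d in gates:
    arrival[out] = max(arrival[i] for i in ins) + d

# Backward pass: required at a gate input = output required - delay
# (taking the minimum over all fanout branches).
for ins, out, d in reversed(gates):
    for i in ins:
        r = required[out] - d
        required[i] = min(required.get(i, r), r)

slack = {n: required[n] - arrival[n] for n in arrival if n in required}
```

Here the least-slack pins trace the critical path b -> n1 -> out (slack 1 ns everywhere along it); the zero-slack algorithm would then distribute 1 ns of net delay along that path until every slack is zero.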

An important point to remember is that adjusting the net weight, even for every net on
a chip, does not theoretically make the placement algorithms any more complex; we
have to deal with the numbers anyway. It does not matter whether the net weight is 1
or 6.6, for example. The practical problem, however, is getting the weight information
for each net (usually in the form of timing constraints) from a synthesis tool or timing
verifier. These files can easily be hundreds of megabytes in size (see Section 16.4 ).

With the zero-slack algorithm we simplify but overconstrain the problem. For
example, we might be able to do a better job by making some nets a little longer than
the slack indicates if we can tighten up other nets. What we would really like to do is
deal with paths such as the critical path shown in Figure 16.29 (a) and not just nets .
Path-based algorithms have been proposed to do this, but they are complex and not all
commercial tools have this capability (see, for example, [ Youssef, Lin, and
Shragowitz, 1992]).
There is still the question of how to predict path delays between gates with only
placement information. Usually we still do not compute a routing tree but use simple
approximations to the total net length (such as the half-perimeter measure) and then
use this to estimate a net delay (the same to each pin on a net). It is not until the
routing step that we can make accurate estimates of the actual interconnect delays.
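The half-perimeter measure is trivial to compute from pin coordinates alone, which is why it is the usual pre-route estimate. Here is a short sketch (the pin coordinates are hypothetical):

```python
def half_perimeter(pins):
    """Half the perimeter of the smallest bounding box enclosing a
    net's pins: (max x - min x) + (max y - min y)."""
    xs = [x for x, y in pins]
    ys = [y for x, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

# A four-pin net whose bounding box is 3 wide by 2 high: estimate is 5.
length_estimate = half_perimeter([(0, 0), (3, 1), (1, 2), (2, 0)])
```

The estimate can then be scaled by a capacitance or delay per unit length to give the per-net delay used before routing.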

16.2.9 A Simple Placement Example


Figure 16.30 shows an example network and placements to illustrate the measures for
interconnect length and interconnect congestion. Figure 16.30 (b) and (c) illustrate the
meaning of total routing length, the maximum cut line in the x -direction, the
maximum cut line in the y -direction, and the maximum density. In this example we
have assumed that the logic cells are all the same size, connections can be made to
terminals on any side, and the routing channels between each adjacent logic cell have
a capacity of 2. Figure 16.30 (d) shows what the completed layout might look like.

FIGURE 16.30 Placement example.


(a) An example network. (b) In this
placement, the bin size is equal to the
logic cell size and all the logic cells
are assumed equal size. (c) An
alternative placement with a lower
total routing length. (d) A layout that
might result from the placement
shown in b. The channel densities
correspond to the cut-line sizes.
Notice that the logic cells are not all
the same size (which means there are
errors in the interconnect-length
estimates we made during
placement).
16.4 Information Formats
With the increasing importance of interconnect a great deal of information needs
to flow between design tools. There are some de facto standards that we shall
look at next. Some of the companies involved are working toward releasing these
formats as IEEE standards.

16.4.1 SDF for Floorplanning and Placement
In Section 13.5.6, SDF in Simulation, we discussed the structure and use of the
standard delay format ( SDF) to describe gate delay and interconnect delay. We
may also use SDF with floorplanning and synthesis tools to back-annotate an
interconnect delay. A synthesis tool can use this information to improve the logic
structure. Here is a fragment of SDF:
(INSTANCE B) (DELAY (ABSOLUTE
( INTERCONNECT A.INV8.OUT B.DFF1.Q (:0.6:) (:0.6:))))
In this example the rising and falling delay is 60 ps (equal to 0.6 units multiplied
by the time scale of 100 ps per unit specified in a TIMESCALE construct that is
not shown). The delay is specified between the output port of an inverter with
instance name A.INV8 in block A and the Q input port of a D flip-flop (instance
name B.DFF1 ) in block B. A '.' (period or fullstop) is set to be the hierarchy
divider in another construct that is not shown.
There is another way of specifying interconnect delay using NETDELAY (a
short form of the INTERCONNECT construct) as follows:
(TIMESCALE 100ps) (INSTANCE B) (DELAY (ABSOLUTE
( NETDELAY net1 (0.6))))
In this case all delays from an output port to, possibly multiple, input ports have
the same value (we can also specify the output port name instead of the net name
to identify the net). Alternatively we can lump interconnect delay at an input port:
(TIMESCALE 100ps) (INSTANCE B.DFF1) (DELAY (ABSOLUTE
( PORT CLR (16:18:22) (17:20:25))))
This PORT construct specifies an interconnect delay placed at the input port of a
logic cell (in this case the CLR pin of a flip-flop). We do not need to specify the
start of a path (as we do for INTERCONNECT ).
We can also use SDF to forward-annotate path delays using timing constraints
(there may be hundreds or thousands of these in a file). A synthesis tool can pass
this information to the floorplanning and placement steps to allow them to create
better layout. SDF describes timing checks using a range of TIMINGCHECK
constructs. Here is an example of a single path constraint:
(TIMESCALE 100ps) (INSTANCE B) ( TIMINGCHECK
( PATHCONSTRAINT A.AOI22_1.O B.ND02_34.O (0.8) (0.8)))
This describes a constraint (keyword PATHCONSTRAINT ) for the rising and
falling delays between two ports at each end of a path (which may consist of
several nets) to be less than 80 ps. Using the SUM construct we can constrain the
sum of path delays to be less than a specific value as follows:
(TIMESCALE 100ps) (INSTANCE B) ( TIMINGCHECK
(SUM (AOI22_1.O ND02_34.I1) (ND02_34.O ND02_35.I1) (0.8)))
We can also constrain skew between two paths (in this case to be less than 10 ps)
using the DIFF construct:
(TIMESCALE 100ps) (INSTANCE B) (TIMINGCHECK
( DIFF (A.I_1.O B.ND02_1.I1) (A.I_1.O B.ND02_2.I1) (0.1)))
In addition we can constrain the skew between a reference signal (normally the
clock) and all other ports in an instance (again in this case to be less than 10 ps)
using the SKEWCONSTRAINT construct:
(TIMESCALE 100ps) (INSTANCE B) (TIMINGCHECK
( SKEWCONSTRAINT (posedge clk) (0.1)))
At present there is no easy way in SDF to constrain the skew between a reference
signal and other signals to be greater than a specified amount.
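To show the shape of these constructs, here is a toy extractor that pulls the port names and delay values out of the INTERCONNECT fragment above. This is an illustrative regular-expression sketch, not a real SDF parser; production SDF needs a proper grammar to handle all of its constructs and the full min:typ:max triples.

```python
import re

sdf = """(INSTANCE B) (DELAY (ABSOLUTE
  (INTERCONNECT A.INV8.OUT B.DFF1.Q (:0.6:) (:0.6:))))"""

# (:0.6:) is a min:typ:max triple with the min and max values omitted.
pattern = re.compile(
    r"\(INTERCONNECT\s+(\S+)\s+(\S+)\s+\(:([0-9.]+):\)\s+\(:([0-9.]+):\)")
m = pattern.search(sdf)
src, dst = m.group(1), m.group(2)
rise_units, fall_units = float(m.group(3)), float(m.group(4))
delay_ps = rise_units * 100     # TIMESCALE of 100 ps per unit, as in the text
```

Running this on the fragment recovers the source port A.INV8.OUT, the destination port B.DFF1.Q, and a 60 ps delay.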

16.4.2 PDEF
The physical design exchange format ( PDEF ) is a proprietary file format used
by Synopsys to describe placement information and the clustering of logic cells.
Here is a simple, but complete PDEF file:
(CLUSTERFILE
(PDEFVERSION "1.0")
(DESIGN "myDesign")
(DATE "THU AUG 6 12:00 1995")
(VENDOR "ASICS_R_US")
(PROGRAM "PDEF_GEN")
(VERSION "V2.2")
(DIVIDER .)
( CLUSTER (NAME "ROOT")
(WIRE_LOAD "10mm x 10mm")
(UTILIZATION 50.0)
(MAX_UTILIZATION 60.0)
(X_BOUNDS 100 1000)
(Y_BOUNDS 100 1000)
(CLUSTER (NAME "LEAF_1")
(WIRE_LOAD "50k gates")
(UTILIZATION 50.0)
(MAX_UTILIZATION 60.0)
(X_BOUNDS 100 500)
(Y_BOUNDS 100 200)
(CELL (NAME L1.RAM01))
(CELL (NAME L1.ALU01))
)
)
)
This file describes two clusters:
● ROOT , which is the top-level (the whole chip). The file describes the size
( x - and y -bounds), current and maximum area utilization (i.e., leaving
space for interconnect), and the name of the wire-load table, '
10mm x 10mm ', to use for this block, chosen because the chip is expected
to be about 10 mm on a side.
● LEAF_1 , a block below the top level in the hierarchy. This block is to use
predicted capacitances from a wire-load table named '50k gates' (chosen
because we know there are roughly 50 k-gates in this block). The LEAF_1
block contains two logic cells: L1.RAM01 and L1.ALU01 .

16.4.3 LEF and DEF


The library exchange format ( LEF ) and design exchange format ( DEF ) are
both proprietary formats originated by Tangent in the TanCell and TanGate
place-and-route tools, which were bought by Cadence and are now known as Cell3
Ensemble and Gate Ensemble, respectively. These tools, and their derivatives, are
so widely used that these formats have become a de facto standard. LEF is used
to define an IC process and a logic cell library. For example, you would use LEF
to describe a gate array: the base cells, the legal sites for base cells, the logic
macros with their size and connectivity information, the interconnect layers and
other information to set up the database that the physical design tools need. You
would use DEF to describe all the physical aspects of a particular chip design
including the netlist and physical location of cells on the chip. For example, if
you had a complete placement from a floorplanning tool and wanted to exchange
this information with Cadence Gate Ensemble or Cell3 Ensemble, you would use
DEF.
16.3 Physical Design Flow
Historically placement was included with routing as a single tool (the term P&R
is often used for place and route). Because interconnect delay now dominates
gate delay, the trend is to include placement within a floorplanning tool and use a
separate router. Figure 16.31 shows a design flow using synthesis and a
floorplanning tool that includes placement. This flow consists of the following
steps:
1. Design entry. The input is a logical description with no physical
information.
2. Synthesis. The initial synthesis contains little or no information on any
interconnect loading. The output of the synthesis tool (typically an EDIF
netlist) is the input to the floorplanner.
3. Initial floorplan. From the initial floorplan interblock capacitances are
input to the synthesis tool as load constraints and intrablock capacitances
are input as wire-load tables.
4. Synthesis with load constraints. At this point the synthesis tool is able to
resynthesize the logic based on estimates of the interconnect capacitance
each gate is driving. The synthesis tool produces a forward annotation file
to constrain path delays in the placement step.
FIGURE 16.31 Timing-driven floorplanning and placement design flow.
Compare with Figure 15.1 on p. 806.
5. Timing-driven placement. After placement using constraints from the
synthesis tool, the location of every logic cell on the chip is fixed and
accurate estimates of interconnect delay can be passed back to the
synthesis tool.
6. Synthesis with in-place optimization ( IPO ). The synthesis tool changes
the drive strength of gates based on the accurate interconnect delay
estimates from the floorplanner without altering the netlist structure.
7. Detailed placement. The placement information is ready to be input to the
routing step.
In Figure 16.31 we iterate between floorplanning and synthesis, continuously
improving our estimate for the interconnect delay as we do so.
16.5 Summary
Floorplanning follows the system partitioning step and is the first step in
arranging circuit blocks on an ASIC. There are many factors to be considered
during floorplanning: minimizing connection length and signal delay between
blocks; arranging fixed blocks and reshaping flexible blocks to occupy the
minimum die area; organizing the interconnect areas between blocks; planning
the power, clock, and I/O distribution. The handling of some of these factors may
be automated using CAD tools, but many still need to be dealt with by hand.
Placement follows the floorplanning step and is more automated. It consists of
organizing an array of logic cells within a flexible block. The criterion for
optimization may be minimum interconnect area, minimum total interconnect
length, or performance. There are two main types of placement algorithms: those
based on min-cut methods and those based on eigenvector methods. Because interconnect delay in a submicron
CMOS process dominates logic-cell delay, planning of interconnect will become
more and more important. Instead of completing synthesis before starting
floorplanning and placement, we will have to use synthesis and
floorplanning/placement tools together to achieve an accurate estimate of timing.
The key points of this chapter are:
● Interconnect delay now dominates gate delay.
● Floorplanning is a mapping between logical and physical design.
● Floorplanning is the center of ASIC design operations for all types of ASIC.
● Timing-driven floorplanning is becoming an essential ASIC design tool.
● Placement is now an automated function.
16.6 Problems
* = Difficult, ** = Very difficult, *** = Extremely difficult
16.1 (Wire loads, 30 min.) Table 16.2 shows a wire-load table. Since you might expect
the interconnect load to be a monotonic increasing function of fanout and block area, it
seems as though some of the data in Table 16.2 may be in error; these figures are
shown preceded by an asterisk, '*' (this table is from an ASIC vendor data book).
Using a spreadsheet, analyze the data in Table 16.2 .
● a. By graphing the data, indicate any figures in Table 16.2 that you think might
be in error. If you think that there is an error, predict the correct values, either by
interpolation (for values in error in the body of the table), or by fitting the linear
model parameters, the slope and the intercept (for any values in error in the last
two columns of the table).
● b. Including any corrections, how accurate is the model that predicts load as a
linear function of fanout for a given block size? (Use the maximum error of the
linear model expressed as a percentage of the table value.)
● c. Can you fit a simple function to the (possibly corrected) figures in the last
column of the table and explain its form?
● d. What did you learn about wire-load tables from this problem?
TABLE 16.2 Wire-load table. Predicted interconnect loads (measured in
standard loads) as a function of block size and fanout (Problem 16.1).

Size                              Fanout
(mm²)      1     2     3     4     5     6     7     8     16     32     64     Slope  Intercept
0.5 × 0.5  0.65  0.95  1.25  1.54  1.84  2.14  2.44  2.74   5.13   9.91  19.47  0.299   0.349
1 × 1      0.80  1.20  1.59  1.99  2.39  2.79  3.19  3.59   6.77  13.15  25.9   0.398   0.398
2 × 2      0.96  1.48  1.99  2.51  3.02  3.54  4.05  4.57   8.68  16.92  33.38  0.515   0.448
3 × 3      1.20  1.83  2.46  3.09  3.72  4.35  4.98  5.61  10.66  20.75  40.94  0.631   0.564
4 × 4      1.41  2.11  2.81  3.50  4.20  4.90  5.59  6.29  11.87  23.02  45.33  0.697   0.714
5 × 5      1.51  2.24  2.97  3.70  4.43  5.16  5.89  6.62  12.47  24.15  47.53  0.730   0.780
6 × 6      1.56  2.31  3.05  3.80  4.55  5.30  6.04  6.79  12.77  24.72  48.62  0.747   0.813
7 × 7      1.83  2.62  3.42  4.22  5.01  5.81  6.61  7.40  13.78  26.53  52.02  0.797  *1.029
8 × 8      1.88  2.74  3.6   4.47  5.33  6.19  7.06  7.92  14.82  26.64  56.26  0.863   1.013
9 × 9      2.01  2.94  3.87  4.80  5.73  6.66  7.59  8.52  15.95  30.83  60.57  0.930   1.079
10 × 10    2.01  2.98  3.94  4.90  5.86  6.83  7.79  8.75  16.45  31.86  62.67  0.963  *1.050
11 × 11    2.46  3.46  4.45  5.45  6.44  7.44  8.44  9.43  17.4   33.33  65.20  0.996   1.465
12 × 12    3.04  4.1   5.17  6.23  7.3   8.35  9.42 10.48  18.8   36.03  70.00  1.063   1.964
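One way to start part a is to fit the linear model to each row by least squares and compare the fitted slope and intercept with the table's last two columns. Here is a sketch for the 1 × 1 row (plain ordinary least squares; no claim that this is the vendor's fitting method):

```python
fanout = [1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 64]
load_1x1 = [0.80, 1.20, 1.59, 1.99, 2.39, 2.79, 3.19, 3.59,
            6.77, 13.15, 25.9]   # the 1 x 1 row of Table 16.2

def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return slope, mean_y - slope * mean_x

slope, intercept = linear_fit(fanout, load_1x1)
```

For this row the fit lands very close to the table's quoted slope of 0.398 and intercept of 0.398; rows whose fits disagree sharply with the quoted columns are candidates for the errors the problem asks about.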

16.2 (Trees, 20 min.) For the network graph shown in Figure 16.32 (f), draw the
following trees and calculate their Manhattan lengths:
● a. The minimum Steiner tree.
● b. The chain connection.
● c. The minimum rectilinear Steiner tree.
● d. The minimum rectilinear spanning tree [ Hwang, 1976].
● e. The minimum single-trunk rectilinear Steiner tree (with a horizontal or vertical trunk).
● f. The minimum rectilinear chain connection (easy to compute).
● g. The minimum source-to-sink connection.
Calculate:
● h. The complete-graph measure and the half-perimeter measure.

Figure 16.32 parts (a) through (e) illustrate the definitions of these trees. There is no known
efficient solution to the minimum Steiner-tree problem for nets with more than five terminals (the general problem is NP-complete).

FIGURE 16.32 Tree routing. (a) The minimum rectilinear Steiner tree (MRST).
(b) The minimum rectilinear spanning tree. (c) The minimum single-trunk rectilinear
Steiner tree (1-MRST). (d) The minimum rectilinear chain connection. (e) The
minimum source-to-sink connection. (f) Example net for Problem 16.2 .

16.3 (Eigenvalue placement constraints, 10 min. [ Cheng and Kuh, 1984]) Consider
the one-dimensional placement problem with a vector list of valid positions for the
logic cells p = [ p i ] and a vector list of x -coordinates for the logic cells x = [ x i ].
Show that for a valid placement x (where the vector elements x i are some permutation
of the vector elements p i ), the following equations hold:
Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} p_i

Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} p_i²

...

Σ_{i=1}^{n} x_i^n = Σ_{i=1}^{n} p_i^n      (16.22)
( Hint: Consider the polynomial Π_{i=1}^{n} (x + x_i). In our simplification to the problem, we
chose to impose only the second equation of these constraints.)
16.4 (*Eigenvalue placement, 30 min.) You will need MatLab, Mathematica, or a
similar mathematical calculus program for this problem.
● a. Find the eigenvalues and eigenvectors for the disconnection matrix
corresponding to the following connection matrix:
C=
[ 0 1 1 0 0 0 1 0 0;
1 0 0 0 0 0 0 0 0;
1 0 0 1 0 0 0 1 0;
0 0 1 0 0 1 0 0 0;
0 0 0 0 0 1 0 0 1;
0 0 0 1 1 0 1 0 0;
1 0 0 0 0 1 0 0 0;
0 0 1 0 0 0 0 0 1;
0 0 0 0 1 0 0 1 0;]
( Hint: Check your answer. The smallest, nonzero, eigenvalue should be 0.5045.)
● b. Use your results to place the logic cells. Plot the placement and show the
connections between logic cells (this is easy to do using an X-Y plot in an Excel
spreadsheet).
● c. Check that the following equation holds:
λ = γ/P.
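Part a can be sketched in NumPy. I assume here, as is usual in eigenvalue placement, that the disconnection matrix is the Laplacian B = D - C, where D is the diagonal matrix of the row sums of C:

```python
import numpy as np

# Connection matrix C from part a (9 logic cells).
C = np.array([
    [0, 1, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 1, 0],
])

# Disconnection (Laplacian) matrix: B = D - C.
B = np.diag(C.sum(axis=1)) - C

# B is symmetric, so eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(B)
# eigenvalues[0] is ~0 (the trivial all-ones eigenvector);
# eigenvalues[1] should match the hint, about 0.5045.
x_coords = eigenvectors[:, 1]   # 1-D placement from the second eigenvector
```

The components of the second eigenvector give the one-dimensional cell coordinates asked for in part b.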

16.5 (Die size, 10 min.) Suppose the minimum spacing between pad centers is W mil
(1 mil = 10⁻³ inch), there are N I/O pads on a chip, and the die area (assume a square
die) is A mil²:
● a. Derive a relationship between W , N , and A that corresponds to the point at
which the die changes from being pad-limited to core-limited.
● b. Plot this relationship with N (ranging from 50 to 500 pads) on the x -axis, A
on the y -axis (for dies ranging in size from 1 mm to 20 mm on a side), and W
as a parameter (for W = 1, 2, 3, and 4 mil).
16.6 (Power buses, 20 min.) Assume aluminum metal interconnect has a resistance of
about 30 mΩ/square (a low value). Consider a power ring for the I/O pads. Suppose
you have a high-power chip that dissipates 5 W at VDD = 5 V, and assume that half of
the supply current (0.5 A) is due to I/O. Suppose the square die is L mil on a side, and
that the I/O current is equally distributed among the N VDD pads that are on the chip.
In the worst case, you want no more than 100 mV drop between any VDD pad and the
I/O circuits drawing power (notice that there will be an equal drop on the VSS side;
just consider the VDD drop).
● a. Model the power distribution as a ring of N equally spaced pads. Each pad is
connected by a resistor equal to the aluminum VDD power-bus resistance
between two pads. Assume the I/O circuits associated with each pad can be
considered to connect to just one point on the resistors between each pad. If the
resistance between each pad is R , what is the worst-case resistance between the
I/O circuits and the supply?
● b. Plot a graph showing L (in mil) on the x -axis, W (the required power-bus
width in microns) on the y -axis, with N as a parameter (with N = 1, 2, 5, 10).
● c. Comment on your results.

● d. An upper limit on current density for aluminum metallization is about
50 kA/cm²; at current densities higher than this, failure due to electromigration
(which we shall cover in Section 17.3.2, Power Routing) is a problem. Assume
the metallization is 0.5 μm thick. Calculate the current density in the VDD
power bus for this chip in terms of the power-bus width and the number of pads.
Comment on your answer.
16.7 (Interconnect-length approximation, 10 min.) Figure 16.22 shows the correlation
between actual interconnect length and two approximations. Use this graph to derive a
correction function (together with an estimation of the error) for the complete-graph
measure and the half-perimeter measure.
16.8 (Half-perimeter measure, 10 min.) Draw a tree on a rectangular grid for which the
MRST is equal to the half-perimeter measure. Draw a tree on a rectangular grid for
which the MRST is twice the half-perimeter measure.
16.9 (***Min-cut, 120 min.) Many floorplanning and placement tools use min-cut
methods and allow you to alter the type and sequence of bisection cuts. Research and
describe the difference between: quadrature min-cut placement, bisection min-cut
placement, and slice/bisection min-cut placement.
16.10 (***Terminal propagation, 120 min.) There is a problem with the min-cut
algorithm in the way connectivity is measured. Figure 16.33 shows a situation in
which logic cells G and H are connected to other logic cells (A and F) outside the area
R1 that is currently being partitioned. The min-cut algorithm ignores connections
outside the area to be divided. Thus logic cells G and H may be placed in partition R3
rather than partition R2. Suggest solutions to this problem. Hint: See Dunlop and
Kernighan [1983]; Hartoog [1986]; or the Barnes-Hut galaxy model.

FIGURE 16.33 (For Problem 16.10 .) A problem with the min-cut algorithm is that it
ignores connections to logic cells outside the area being partitioned. (a) We perform a
vertical cut 1 producing the areas R 1 and R 2 . (b) Next we make a horizontal cut 2,
producing L 2 and L 3 , and a cut 2', producing R 2 and R 3 . (c) The min-cut algorithm
ignores the connection from L 2 and is equally likely to produce the arrangement
shown here when we make cut 2'.

16.11 (Benchmarks and statistics, 30 min.) Your boss asks you to compare two
placement programs from companies ABC and XYZ. You run five test cases for both
on a single netlist, P1. You get results (measured in arbitrary units) of 9, 8, 9, 7, 11 for
ABC; 6, 9, 10, 13, 8 for XYZ.
● a. Calculate the mean and standard deviations for these results.

● b. What confidence (in the statistical sense) do you have in these figures?

● c. What can you say about the relative performance of ABC and XYZ?

On average each test case takes about 0.5 hours (wall clock) for both ABC and XYZ.
Next you run six test cases on another netlist, P2 with the following results: 4, 6, 7, 8,
5, 7 for ABC, and 4, 5, 3, 6, 4, 3 for XYZ. These test cases take about 0.75 hours (wall
clock) each.
● d. What can you say about the P2 results?

● e. Given the P1 and P2 results together, what can you say about ABC and XYZ?

● f. How many P1 test cases should you run to get a result so that you can say
ABC is better or worse than XYZ with 90 percent confidence (i.e., you make the
right decision 9 out of 10 times)? How long would this take?
● g. Find the same figures for the P2 netlist. Comment on your answers.

● h. Suppose you had more netlists and information about the variation of results
from each netlist, together with the average time to run each netlist. How would
you use this information to get the most meaningful result in the shortest time?
16.12 (Linear and quadratic placement, 20 min.) [ Sigl, Doll, and Johannes, 1991]
Figure 16.34 (a) shows a simple network that we will place. Figure 16.34 (b) shows
the problem. The logic cells are all the same size: 1 grid unit wide by 1 grid unit high.
Logic cells 1 and 3 are fixed at the locations shown. Logic cell 2 is movable and
placed at coordinates (for the lower-left corner) of ( x 2 , y 2 ). The lower-left corners
of logic cells should be placed at grid locations and should not overlap.
● a. What is the connection matrix c ij for this network?

● b. Calculate and draw (showing the logic-cell coordinates) the placement that
minimizes the linear cost function (or objective function) f L ,
f_L = (1/2) Σ_{j=1}^{n} c_ij d_ij      (16.23)

where d_ij is the distance between logic cells i and j.

FIGURE 16.34 Problem 16.12 illustrates placement objectives. (a) An example
network for placement. (b) The placement restrictions. Logic cells 1 and 3 are fixed in
position; the placement problem is to optimize the position of logic cell 2 under
different placement objectives.
● c. Calculate and draw (showing coordinates) the placement that minimizes the
quadratic cost function f Q ,
f_Q = (1/2) Σ_{j=1}^{n} c_ij d_ij²      (16.24)
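The difference between the two objectives can be previewed in one dimension: a linear cost is minimized at a (weighted) median of the connected cells' positions, while a quadratic cost is minimized at their (weighted) mean. The sketch below uses invented fixed positions, not those of Figure 16.34:

```python
# A movable cell connected to fixed cells at these x-coordinates (all
# connection weights c_ij = 1): two nets to a cell at x = 0, one net to
# a cell at x = 10.
fixed = [0, 0, 10]

def f_linear(x):
    """Linear objective: sum of distances to connected cells."""
    return sum(abs(x - p) for p in fixed)

def f_quadratic(x):
    """Quadratic objective: sum of squared distances."""
    return sum((x - p) ** 2 for p in fixed)

grid = range(0, 11)
best_linear = min(grid, key=f_linear)       # median of [0, 0, 10] -> 0
best_quadratic = min(grid, key=f_quadratic) # nearest grid point to mean 10/3 -> 3
```

The linear objective snaps the movable cell onto the heavier side; the quadratic objective pulls it toward the mean, which is why the two cost functions in the problem produce different placements for cell 2.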

16.13 (Placement interconnect lengths, 45 min.) Figure 16.30 (d) shows the actual
routing corresponding to a placement with an estimated routing length of 8 units (
Figure 16.30 b).
● a. Draw the layout (with routing) corresponding to the placement of
Figure 16.30 (c), which has a lower estimated total routing length of 7 units.
● b. Compare the actual total routing length for both layouts and explain why they
are different from the estimated lengths and describe the sources of the errors.
● c. Consider flipping both logic cells A and B about the y -axis in the layout
shown in Figure 16.30 (d). How much does this shorten the total interconnect
length? Some placement algorithms consider such moves.
16.14 (Zero-slack algorithm, 60 min.) For the circuit of Figure 16.35 :
● a. Find all of the arrival, required, and slack times (all delays are in
nanoseconds).

FIGURE 16.35 A circuit to illustrate the zero-slack algorithm (Problem 16.14


).
● b. What is the critical path?
● c. If the gate delay of A2 is increased to 5 ns, what is the new critical path?
● d. ** Using your answer to part a, find the upper bounds on net delays by means
of the zero-slack algorithm as follows:
i. Find arrival, required, and slack times on all nets.
ii. Find an input pin p with the least nonzero slack S_p on a net that has not
already been selected. If there are none, go to step vi.
iii. Find the path through p (may include several gates) on which all pins have
slack S_p.
iv. Distribute a delay equal to the slack S_p along the path, assigning a fraction to
each net at the output pins of the gates on the path.
v. Work backward from p, updating all the required times as necessary, and
forward from p, updating all the arrival times.
vi. Convert net delays to net lengths.
Hint: You can consult the original description of the zero-slack algorithm if this is not
clear [ Hauge et al., 1987].
16.15 (World planning, 60 min.) The seven continents are (with areas in millions of
square miles): Europe, strictly a peninsula of Asia (4.1), Asia (17.2), North America
(9.4), South America (6.9), Australia (3.0), Africa (11.7), and Antarctica (5.1).
Assume the continents are flexible blocks whose aspect ratio may be adjusted.
● a. Create a slicing floorplan of the world with a square aspect ratio.

● b. Draw a world connectivity graph with seven nodes and whose edges are
labeled with the distances between Moscow, Beijing, Chicago, Rio de Janeiro,
Sydney, Nairobi, and the South Pole.
● c. Suppose you want to floorplan the world so that the difference in distances
between the centers of the continental blocks and the corresponding edges in the
world connectivity graph is minimized. How would you measure the differences
in distance? Suggest a method to minimize your measure.
● d. Use an eigenvalue method to floorplan the world. Draw the result with
coordinates for each block and explain your approach.
16.7 Bibliography
There are no recent monographs or review articles on floorplanning modern
ASICs with interconnect delay dominating gate delay. Placement is a much more
developed topic. Perhaps the simplest place to dig deeper is the book by Preas
and Lorenzetti that contains a chapter titled Placement, assignment, and
floorplanning [Preas and Karger, 1988]. The collection edited by Ohtsuki [ 1986]
contains a review paper by Yoshida titled Partitioning, assignment, and
placement. Sangiovanni-Vincentelli's review article [ 1986] complements
Ohtsuki's edited book, but both are now dated. Sechen's book [1988] describes
simulated annealing and its application to placement and chip-planning for
standard cell and gate array ASICs. Part III of the IEEE Press book edited by Hu
and Kuh [ 1983] is a collection of papers on wireability, partitioning, and
placement covering some of the earlier and fundamental work in this area. For a
more recent and detailed look at the inner workings of floorplanning and
placement tools, Lengauer's [ 1990] book on algorithms contains a chapter on
graph algorithms and a chapter on placement, assignment, and floorplanning.
Most of these earlier book references deal with placement before the use of
timing as an additional objective. The tutorial paper by Benkoski and Strojwas [
1991] contains a number of references on performance-driven placement. Luk's
book [ 1991] describes methods for estimating net delay during placement.
Papers and tutorials on all aspects of floorplanning and placement (with an
emphasis on algorithms) are published in IEEE Transactions on Computer-Aided
Design. The newest developments in floorplanning and placement appear every
year in the Proceedings of the ACM/IEEE Design Automation Conference
(DAC) and Proceedings of the IEEE International Conference on
Computer-Aided Design (ICCAD).
16.8 References
Page numbers in brackets after a reference indicate its location in the chapter
body.
Benkoski, J., and A. J. Strojwas. 1991. The role of timing verification in layout
synthesis. In Proceedings of the 28th ACM/IEEE Design Automation
Conference, San Francisco, pp. 612-619. Tutorial paper with 60 references. This
was an introduction to a session on Placement for Performance Optimization
containing five other papers on this topic. [ reference location ]

Breuer, M. A. 1977. Min-cut placement. Journal of Design Automation and


Fault Tolerant Computing, Vol. 1, no. 4, pp. 343-362. [ reference location ]

Chao, A. H., E. M. Nequist, and T. D. Vuong. 1990. Direct solution of


performance constraints during placement. In Proceedings of the IEEE Custom
Integrated Circuits Conference. Describes algorithms used in Cadence Gate
Ensemble for performance-driven placement. Wiring estimate is based on single
trunk Steiner tree with corrections for bounding rectangle aspect ratio and pin
count. [ reference location ]

Cheng, C.-K., and E. S. Kuh. 1984. Module placement based on resistive


network optimization. IEEE Transactions on Computer-Aided Design for
Integrated-Circuits and Systems, Vol. CAD-3, pp. 218-225. [ reference location ,
reference location ]

Dunlop, A. E., and B. W. Kernighan. 1983. A placement procedure for polycell


VLSI circuits. In Proceedings of the IEEE International Conference on
Computer Aided Design, Santa Clara, CA, September 13-15. Describes the
terminal propagation algorithm. [ reference location ]

Goto, S. and T. Matsuda. 1986. Partitioning, assignment and placement. In


Layout Design and Verification, T. Ohtsuki (Ed.), Vol. 4, pp. 5597. New York:
Elsevier. ISBN 0444878947. TK 7874. L318. [ reference location ]

Hall, K. M. 1970. An r-dimensional quadratic placement algorithm.


Management Science, Vol. 17, no. 3, pp. 219229. [ reference location ]

Hanan, M. 1966. On Steiner's problem with rectilinear distance. SIAM Journal
on Applied Mathematics, Vol. 14, no. 2, pp. 255–265.

Hanan, M., P. K. Wolff Sr., and B. J. Agule. 1973. Some experimental results on
placement techniques. In Proceedings of the 13th Design Automation
Conference. Reference to the complete-graph wire measure.

Hartoog, M. R. 1986. Analysis of placement procedures for VLSI standard cell
layout. In Proceedings of the 23rd Design Automation Conference.

Hauge, P. S., et al. 1987. Circuit placement for predictable performance. In
Proceedings of the IEEE International Conference on Computer-Aided Design,
pp. 88–91. Describes the zero-slack algorithm. See also: Nair, R., C. L. Berman,
P. S. Hauge, and E. J. Yoffa, Generation of performance constraints for layout,
IEEE Transactions on Computer-Aided Design, Vol. 8, no. 8, pp. 860–874,
August 1989; and Burstein, M., and M. N. Youssef, Timing influenced layout
design, in Proceedings of the 22nd Design Automation Conference, 1985.
Defines required, actual, and slack times. Describes application of timing-driven
restrictions to placement using the FM algorithm and hierarchical global routing.

Hu, T. C., and E. S. Kuh (Eds.). 1983. VLSI Circuit Layout: Theory and Design.
New York: IEEE Press. Contains 26 papers divided into six parts: Part I:
Overview; Part II: General; Part III: Wireability, Partitioning and Placement; Part
IV: Routing; Part V: Layout Systems; Part VI: Module Generation. ISBN
0879421932. TK7874.V5573.

Hwang, F. K. 1976. On Steiner minimal trees with rectilinear distance. SIAM
Journal on Applied Mathematics, Vol. 30, pp. 104–114. See also: Hwang, F. K.,
An O(n log n) algorithm for suboptimal rectilinear Steiner trees, IEEE
Transactions on Circuits and Systems, Vol. CAS-26, no. 1, pp. 75–77, January
1979. Describes an algorithm to improve the rectilinear minimum spanning tree
(RMST) approximation to the minimal rectilinear Steiner tree (minimal RST);
the RMST is at worst 1.5 times the length of the minimal RST.

Kirkpatrick, S., C. D. Gelatt Jr., and M. P. Vecchi. 1983. Optimization by
simulated annealing. Science, Vol. 220, no. 4598, pp. 671–680.

Lengauer, T. 1990. Combinatorial Algorithms for Integrated Circuit Layout.
Chichester, England: Wiley. ISBN 0-471-92838-0. TK7874.L36. Contains
chapters on circuit layout; optimization problems; graph algorithms; operations
research and statistics; combinatorial layout problems; circuit partitioning;
placement; assignment; floorplanning; global routing and area routing; detailed
routing; and compaction. 484 references.

Luk, W. K. 1991. A fast physical constraint generator for timing driven layout.
In Proceedings of the 28th ACM/IEEE Design Automation Conference.
Introduction to timing-driven placement and net- and path-based approaches.
Describes some different methods to estimate interconnect delay during
placement. ISBN 0-89791-395-7.

Masleid, R. P. 1991. High-density central I/O circuits for CMOS. IEEE Journal
of Solid-State Circuits, Vol. 26, no. 3, pp. 431–435. An I/O circuit design that
reduces the percentage of chip area occupied by I/O circuits from roughly 22
percent to under 3 percent for a 256 I/O chip. Uses IBM C4 technology that
allows package connections to be located over chip circuitry. 10 references.

Ohtsuki, T. (Ed.). 1986. Layout Design and Verification. New York: Elsevier.
Includes nine papers on CAD tools and algorithms: "Layout strategy,
standardisation, and CAD tools," Ueda, Kasai and Sudo; "Layout compaction,"
Mylynski and Sung; "Layout verification," Yoshida; "Partitioning, assignment
and placement," Goto and Matsuda; "Computational complexity of layout
problems," Shing and Hu; "Computational geometry algorithms," Asano,
Sato and Ohtsuki; an excellent survey and tutorial paper by M. Burstein:
"Channel routing"; "Maze-running and line-search algorithms," an easily
readable paper on detailed routing by Ohtsuki; and a mathematical paper,
"Global routing," by Kuh and Marek-Sadowska. ISBN 0444878947.
TK7874.L318.

Preas, B. T., and P. G. Karger. 1988. Placement, assignment and floorplanning.
In Physical Design Automation of VLSI Systems, B. T. Preas and M. J.
Lorenzetti (Eds.), pp. 87–155. Menlo Park, CA: Benjamin-Cummings. ISBN
0-8053-0412-9. TK7874.P47.

Sangiovanni-Vincentelli, A. 1986. Automatic layout of integrated circuits. In
NATO Advanced Study on Logic Synthesis and Silicon Compilers for VLSI
Design, G. De Micheli, A. Sangiovanni-Vincentelli, and P. Antognetti (Eds.).
Norwell, MA: Kluwer. ISBN 90-247-2689-1, 90-247-3561-0. TK7874.N338.

Schweikert, D. G. 1976. A 2-dimensional placement algorithm for the layout of
electrical circuits. In Proceedings of the 9th Design Automation Conference.
Description of the half-perimeter wire measure.

Sechen, C. 1988. VLSI Placement and Global Routing Using Simulated
Annealing. Norwell, MA: Kluwer. Contains chapters on the simulated annealing
algorithm; placement and global routing; floorplanning; average interconnection
length estimation; interconnect-area estimation; a channel definition algorithm;
and a global router algorithm. ISBN 0898382815. TK7874.S38.

Sigl, G., K. Doll, and F. M. Johannes. 1991. Analytical placement: a linear or
quadratic objective function? In Proceedings of the 28th ACM/IEEE Design
Automation Conference. Compares quadratic and linear cost functions for
placement algorithms. Explains the Gordian place-and-route system from the
Technical University of Munich. ISBN 0-89791-395-7.

Wada, T., M. Eino, and K. Anami. 1990. Simple noise model and low-noise
data-output buffer for ultrahigh-speed memories. IEEE Journal of Solid-State
Circuits, Vol. 25, no. 6, pp. 1586–1588. An analytic noise model for voltage
bounce on internal VDD/VSS lines.

Wong, D. F., H. W. Leong, and C. L. Liu. 1988. Simulated Annealing for VLSI
Design. Norwell, MA: Kluwer. Introduction; Placement; Floorplan Design;
Channel Routing; Permutation Channel Routing; PLA Folding; Gate Matrix
Layout; Array Optimization. ISBN 0898382564. TK7874.W65.

Youssef, H., R.-B. Lin, and E. Shragowitz. 1992. Bounds on net delays for VLSI
circuits. IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Processing, Vol. 39, no. 11, pp. 315–324. An alternative to the
weight-based approach is development of delay bounds on all nets. 21 references.

ROUTING
Once the designer has floorplanned a chip and the logic cells within the flexible
blocks have been placed, it is time to make the connections by routing the chip.
This is still a hard problem that is made easier by dividing it into smaller
problems. Routing is usually split into global routing followed by detailed
routing.
Suppose the ASIC is North America and some travelers in California need advice
on how to drive from Stanford (near San Francisco) to Caltech (near Los
Angeles). The floorplanner has decided that California is on the left (west) side of
the ASIC and the placement tool has put Stanford in Northern California and
Caltech in Southern California. Floorplanning and placement have defined the
roads and freeways. There are two ways to go: the coastal route (using Highway
101) or the inland route (using Interstate 5, which is usually faster). The global
router specifies the coastal route because the travelers are not in a hurry and I-5
is congested (the global router knows this because it has already routed many
other travelers that are in a hurry today onto I-5). Next, the detailed router looks
at a map and gives directions from Stanford onto Highway 101 south through
San Jose, Monterey, and Santa Barbara to Los Angeles and then off the freeway
to Caltech in Pasadena.
Figure 17.1 shows the core of the Viterbi decoder after the placement step. This
implementation consists entirely of standard cells (18 rows). The I/O pads are not
included in this example; we can route the I/O pads after we route the core
(though this is not always a good idea). Figure 17.2 shows the Viterbi decoder
chip after global and detailed routing. The routing runs in the channels between
the rows of logic cells, but the individual interconnections are too small to see.
FIGURE 17.1 The core of the Viterbi decoder chip after placement (a screen
shot from Cadence Cell Ensemble). This is the same placement as shown in
Figure 16.2, but without the channel labels. You can see the rows of standard
cells; the widest cells are the D flip-flops.
FIGURE 17.2 The core of the Viterbi decoder chip after the completion of
global and detailed routing (a screen shot from Cadence Cell Ensemble). This
chip uses two-level metal. Although you cannot see the difference, m1 runs in
the horizontal direction and m2 in the vertical direction.
17.1 Global Routing
The details of global routing differ slightly between cell-based ASICs, gate
arrays, and FPGAs, but the principles are the same in each case. A global router
does not make any connections, it just plans them. We typically global route the
whole chip (or large pieces if it is a large chip) before detail routing the whole
chip (or the pieces). There are two types of areas to global route: inside the
flexible blocks and between blocks (the Viterbi decoder, although a cell-based
ASIC, only involved the global routing of one large flexible block).

17.1.1 Goals and Objectives


The input to the global router is a floorplan that includes the locations of all the
fixed and flexible blocks; the placement information for flexible blocks; and the
locations of all the logic cells. The goal of global routing is to provide complete
instructions to the detailed router on where to route every net. The objectives of
global routing are one or more of the following:
● Minimize the total interconnect length.

● Maximize the probability that the detailed router can complete the routing.

● Minimize the critical path delay.

In both floorplanning and placement, with minimum interconnect length as an


objective, it is necessary to find the shortest total path length connecting a set of
terminals . This path is the MRST, which is hard to find. The alternative, for both
floorplanning and placement, is to use simple approximations to the length of the
MRST (usually the half-perimeter measure). Floorplanning and placement both
assume that interconnect may be put anywhere on a rectangular grid, since at this
point nets have not been assigned to the channels, but the global router must use
the wiring channels and find the actual path. Often the global router needs to find
a path that minimizes the delay between two terminals; this is not necessarily the
same as finding the shortest total path length for a set of terminals.
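The half-perimeter measure mentioned above is simple enough to show directly. The sketch below is a minimal illustration (the terminal coordinates are invented for the example, not taken from any figure in the text):

```python
# Half-perimeter wirelength (HPWL): half the perimeter of the smallest
# bounding box enclosing all terminals of a net. This is the fast
# estimate used during floorplanning and placement, before nets have
# been assigned to channels.

def half_perimeter(terminals):
    """Return (width + height) of the bounding box of the terminals."""
    xs = [x for x, y in terminals]
    ys = [y for x, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

net = [(1, 1), (4, 5), (2, 7)]   # three terminals of one net (invented)
print(half_perimeter(net))       # (4-1) + (7-1) = 9
```

For nets with two or three terminals the half-perimeter measure equals the length of the minimal rectilinear Steiner tree; for higher fanouts it is only a lower bound, which is why it is an approximation to the MRST.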

17.1.2 Measurement of Interconnect


Delay
Floorplanning and placement need a fast and easy way to estimate the
interconnect delay in order to evaluate each trial placement; often this is a
predefined look-up table. After placement, the logic cell positions are fixed and
the global router can afford to use better estimates of the interconnect delay. To
illustrate one method, we shall use the Elmore constant to estimate the
interconnect delay for the circuit shown in Figure 17.3 .

FIGURE 17.3 Measuring the delay of a net. (a) A simple circuit with an inverter
A driving a net with a fanout of two. Voltages V 1 , V 2 , V 3 , and V 4 are the
voltages at intermediate points along the net. (b) The layout showing the net
segments (pieces of interconnect). (c) The RC model with each segment replaced
by a capacitance and resistance. The ideal switch and pull-down resistance R pd
model the inverter A.

The problem is to find the voltages at the inputs to logic cells B and C taking into
account the parasitic resistance and capacitance of the metal interconnect.
Figure 17.3 (c) models logic cell A as an ideal switch with a pull-down resistance
equal to R pd and models the metal interconnect using resistors and capacitors for
each segment of the interconnect.
The Elmore constant for node 4 (labeled V 4 ) in the network shown in
Figure 17.3 (c) is
t D 4 = Σ (k = 1 to 4) R k4 C k (17.1)

= R 14 C 1 + R 24 C 2 + R 34 C 3 + R 44 C 4 ,

where,
R 14 = R pd + R 1 (17.2)
R 24 = R pd + R 1
R 34 = R pd + R 1 + R 3
R 44 = R pd + R 1 + R 3 + R 4

In Eq. 17.2 notice that R 24 = R pd + R 1 (and not R pd + R 1 + R 2 ) because R 1


is the resistance to V 0 (ground) shared by node 2 and node 4.

Suppose we have the following parameters (from the generic 0.5 μm CMOS
process, G5) for the layout shown in Figure 17.3 (b):
● m2 resistance is 50 mΩ/square.
● m2 capacitance (for a minimum-width line) is 0.2 pF/mm.
● 4X inverter delay is 0.02 ns + 0.5 C L ns ( C L is in picofarads).
● Delay is measured using 0.35/0.65 output trip points.
● m2 minimum width is 3 λ = 0.9 μm.
● 1X inverter input capacitance is 0.02 pF (a standard load).
First we need to find the pull-down resistance, R pd , of the 4X inverter. If we
model the gate with a linear pull-down resistor, R pd , driving a load C L , the
output waveform is exp ( - t / ( C L R pd )) (normalized to 1 V). The output
completes 63 percent of its transition when t = C L R pd , because
1 - exp ( -1 ) = 0.63. Then, because the delay is measured with a 0.65 trip point,
the constant 0.5 ns/pF = 0.5 kΩ is very close to the equivalent pull-down
resistance. Thus, R pd ≈ 500 Ω.

From the given data, we can calculate the R 's and C 's:

R 1 = R 2 = (0.1 mm) (50 × 10^-3 Ω/square) / (0.9 μm) = 6 Ω
R 3 = (1 mm) (50 × 10^-3 Ω/square) / (0.9 μm) = 56 Ω
R 4 = (2 mm) (50 × 10^-3 Ω/square) / (0.9 μm) = 112 Ω
(17.3)

C 1 = (0.1 mm) (0.2 pF/mm) = 0.02 pF
C 2 = (0.1 mm) (0.2 pF/mm) + 0.02 pF = 0.04 pF
C 3 = (1 mm) (0.2 pF/mm) = 0.2 pF
C 4 = (2 mm) (0.2 pF/mm) + 0.02 pF = 0.42 pF
(17.4)

Now we can calculate the path resistance, R ki , values (notice that R ki = R ik ):

R 14 = 500 Ω + 6 Ω = 506 Ω
R 24 = 500 Ω + 6 Ω = 506 Ω
R 34 = 500 Ω + 6 Ω + 56 Ω = 562 Ω
R 44 = 500 Ω + 6 Ω + 56 Ω + 112 Ω = 674 Ω
(17.5)

Finally, we can calculate Elmore's constants for node 4 and node 2 as follows:

t D 4 = R 14 C 1 + R 24 C 2 + R 34 C 3 + R 44 C 4 (17.6)
= (506)(0.02) + (506)(0.04)
+ (562)(0.2) + (674)(0.42)
= 425 ps .

t D 2 = R 12 C 1 + R 22 C 2 + R 32 C 3 + R 42 C 4 (17.7)
= ( R pd + R 1 + R 2 ) C 2
+ ( R pd + R 1 )( C 1 + C 3 + C 4 )
= (500 + 6 + 6)(0.04)
+ (500 + 6)(0.02 + 0.2 + 0.42)
= 344 ps ,

and t D 4 - t D 2 = (425 - 344) ps = 81 ps.
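The arithmetic of Eqs. 17.1 through 17.7 is easy to mechanize for an arbitrary RC tree. The sketch below is a minimal illustration, not production code: it hard-codes the tree of Figure 17.3 (c) using the rounded component values of Eqs. 17.3 through 17.5, with resistances in ohms and capacitances in picofarads so that delays come out in picoseconds.

```python
# Elmore delay for the RC tree of Figure 17.3(c).
# Values are the rounded results of Eqs. 17.3-17.5 (ohms and pF).

parent = {1: 0, 2: 1, 3: 1, 4: 3}          # node -> upstream node (0 = driver)
R = {1: 506.0, 2: 6.0, 3: 56.0, 4: 112.0}  # branch resistance into each node;
                                           # R[1] = Rpd (500) + R1 (6)
C = {1: 0.02, 2: 0.04, 3: 0.2, 4: 0.42}    # capacitance at each node

def path(node):
    """Set of branches on the path from the driver (node 0) to `node`."""
    branches = set()
    while node != 0:
        branches.add(node)
        node = parent[node]
    return branches

def elmore(node):
    """Elmore constant: sum over all nodes k of R(k,node) * C(k), where
    R(k,node) is the resistance shared by the paths to k and to `node`."""
    p = path(node)
    return sum(sum(R[b] for b in path(k) & p) * C[k] for k in C)

print(round(elmore(4), 1))  # 425.8 -- the 425 ps of Eq. 17.6
print(round(elmore(2), 1))  # 344.3 -- the 344 ps of Eq. 17.7
```

The shared-resistance rule in `elmore` is exactly the observation made after Eq. 17.2: only the resistance common to both paths from the driver contributes.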

A lumped-delay model neglects the effects of interconnect resistance and simply


sums all the node capacitances (the lumped capacitance ) as follows:
t D = R pd ( C 1 + C 2 + C 3 + C 4 ) (17.8)
= (500) (0.02 + 0.04 + 0.2 + 0.42)
= 340 ps .

Comparing Eqs. 17.6–17.8, we can see that the delay of the inverter can be
assigned as follows: 20 ps (the intrinsic delay, 0.02 ns, due to the cell output
capacitance), 340 ps (due to the pull-down resistance and the output capacitance),
4 ps (due to the interconnect from A to B), and 65 ps (due to the interconnect
from A to C). We can see that the error from neglecting interconnect resistance
can be important.
Even using the Elmore constant we still made the following assumptions in
estimating the path delays:
● A step-function waveform drives the net.

● The delay is measured from when the gate input changes.


● The delay is equal to the time constant of an exponential waveform that
approximates the actual output waveform.
● The interconnect is modeled by discrete resistance and capacitance
elements.
The global router could use more sophisticated estimates that remove some of
these assumptions, but there is a limit to the accuracy with which delay can be
estimated during global routing. For example, the global router does not know
how much of the routing is on which of the layers, or how many vias will be used
and of which type, or how wide the metal lines will be. It may be possible to
estimate how much interconnect will be horizontal and how much is vertical.
Unfortunately, this knowledge does not help much if horizontal interconnect may
be completed in either m1 or m3 and there is a large difference in parasitic
capacitance between m1 and m3, for example.
When the global router attempts to minimize interconnect delay, there is an
important difference between a path and a net. The path that minimizes the delay
between two terminals on a net is not necessarily the same as the path that
minimizes the total path length of the net. For example, to minimize the path
delay (using the Elmore constant as a measure) from the output of inverter A in
Figure 17.3 (a) to the input of inverter B requires a rather complicated algorithm
to construct the best path. We shall return to this problem in Section 17.1.6 .

17.1.3 Global Routing Methods


Global routing cannot use the interconnect-length approximations, such as the
half-perimeter measure, that were used in placement. What is needed now is the
actual path and not an approximation to the path length. However, many of the
methods used in global routing are still based on the solutions to the tree on a
graph problem.
One approach to global routing takes each net in turn and calculates the shortest
path using tree on graph algorithmswith the added restriction of using the
available channels. This process is known as sequential routing . As a sequential
routing algorithm proceeds, some channels will become more congested since
they hold more interconnects than others. In the case of FPGAs and channeled
gate arrays, the channels have a fixed channel capacity and can only hold a
certain number of interconnects. There are two different ways that a global router
normally handles this problem. Using order-independent routing , a global router
proceeds by routing each net, ignoring how crowded the channels are. Whether a
particular net is processed first or last does not matter, the channel assignment
will be the same. In order-independent routing, after all the interconnects are
assigned to channels, the global router returns to those channels that are the most
crowded and reassigns some interconnects to other, less crowded, channels.
Alternatively, a global router can consider the number of interconnects already
placed in various channels as it proceeds. In this case the global routing is order
dependent; the routing is still sequential, but now the order of processing the nets
will affect the results. Iterative improvement or simulated annealing may be
applied to the solutions found from both order-dependent and order-independent
algorithms. This is implemented in the same way as for system partitioning and
placement: A constructed solution is successively changed, one interconnect path
at a time, in a series of random moves.
In contrast to sequential global-routing methods, which handle nets one at a time,
hierarchical routing handles all nets at a particular level at once. Rather than
handling all of the nets on the chip at the same time, the global-routing problem
is made more tractable by dividing the chip area into levels of hierarchy. By
considering only one level of hierarchy at a time the size of the problem is
reduced at each level. There are two ways to traverse the levels of hierarchy.
Starting at the whole chip, or highest level, and proceeding down to the logic
cells is the top-down approach. The bottom-up approach starts at the lowest level
of hierarchy and globally routes the smallest areas first.

17.1.4 Global Routing Between Blocks


Figure 17.4 illustrates the global-routing problem for a cell-based ASIC. Each
edge in the channel-intersection graph in Figure 17.4 (c) represents a channel.
The global router is restricted to using these channels. The weight of each edge in
the graph corresponds to the length of the channel. The global router plans a path
for each interconnect using this graph.

FIGURE 17.4 Global routing for a cell-based ASIC formulated as a graph


problem. (a) A cell-based ASIC with numbered channels. (b) The channels form
the edges of a graph. (c) The channel-intersection graph. Each channel
corresponds to an edge on a graph whose weight corresponds to the channel
length.
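Planning a path on the channel-intersection graph is a shortest-path problem with channel lengths as edge weights. A minimal sketch using Dijkstra's algorithm follows; the toy graph is invented for illustration and is not the graph of Figure 17.4:

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra on a channel-intersection graph: `graph` maps each
    intersection to {neighbor: channel_length}. Returns (path, length)."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# Toy graph: nodes are channel intersections, weights are channel lengths.
g = {
    "a": {"b": 2, "c": 5},
    "b": {"a": 2, "c": 1, "d": 4},
    "c": {"a": 5, "b": 1, "d": 1},
    "d": {"b": 4, "c": 1},
}
print(shortest_path(g, "a", "d"))  # (['a', 'b', 'c', 'd'], 4)
```

A real global router layers channel-capacity checks and congestion costs on top of this basic search, but the graph formulation is the same.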

Figure 17.5 shows an example of global routing for a net with five terminals,
labeled A1 through F1, for the cell-based ASIC shown in Figure 17.4 . If a
designer wishes to use minimum total interconnect path length as an objective,
the global router finds the minimum-length tree shown in Figure 17.5 (b). This
tree determines the channels the interconnects will use. For example, the shortest
connection from A1 to B1 uses channels 2, 1, and 5 (in that order). This is the
information the global router passes to the detailed router. Figure 17.5 (c) shows
that minimizing the total path length may not correspond to minimizing the path
delay between two points.

FIGURE 17.5 Finding paths in global routing. (a) A cell-based ASIC (from
Figure 17.4 ) showing a single net with a fanout of four (five terminals). We
have to order the numbered channels to complete the interconnect path for
terminals A1 through F1. (b) The terminals are projected to the center of the
nearest channel, forming a graph. A minimum-length tree for the net that uses
the channels and takes into account the channel capacities. (c) The
minimum-length tree does not necessarily correspond to minimum delay. If we
wish to minimize the delay from terminal A1 to D1, a different tree might be
better.

Global routing is very similar for cell-based ASICs and gate arrays, but there is a
very important difference between the types of channels in these ASICs. The size
of the channels in sea-of-gates arrays, channelless gate arrays, and cell-based
ASICs can be varied to make sure there is enough space to complete the wiring.
In channeled gate-arrays and FPGAs the size, number, and location of channels
are fixed. The good news is that the global router can allocate as many
interconnects to each channel as it likes, since that space is committed anyway.
The bad news is that there is a maximum number of interconnects that each
channel can hold. If the global router needs more room, even in just one channel
on the whole chip, the designer has to repeat the placement-and-routing steps and
try again (or use a bigger chip).
17.1.5 Global Routing Inside Flexible
Blocks
We shall illustrate global routing using a gate array. Figure 17.6 (a) shows the
routing resources on a sea-of-gates or channelless gate array. The gate array base
cells are arranged in 36 blocks, each block containing an array of 8-by-16
gate-array base cells, making a total of 4608 base cells.
The horizontal interconnect resources are the routing channels that are formed
from unused rows of the gate-array base cells, as shown in Figure 17.6 (b) and
(c). The vertical resources are feedthroughs. For example, the logic cell shown in
Figure 17.6 (d) is an inverter that contains two types of feedthrough. The inverter
logic cell uses a single gate-array base cell with terminals (or connectors ) located
at the top and bottom of the logic cell. The inverter input pin has two electrically
equivalent terminals that the global router can use as a feedthrough. The output of
the inverter is connected to only one terminal. The remaining vertical track is
unused by the inverter logic cell, so this track forms an uncommitted
feedthrough.
You may see any of the terms landing pad (because we say that we drop a via to
a landing pad), pick-up point , connector , terminal , pin , or port used for the
connection to a logic cell. The term pick-up point refers to the physical pieces of
metal (or sometimes polysilicon) in the logic cell to which the router connects. In
a three-level metal process, the global router may be able to connect to anywhere
in an area: an area pick-up point . In this book we use the term connector to refer
to the physical pick-up point. The term pin more often refers to the connection on
a logic schematic icon (a dot, square box, or whatever symbol is used), rather
than layout. Thus the difference between a pin and a connector is that we can
have multiple connectors for one pin. Terminal is often used when we talk about
routing. The term port is used when we are using text (EDIF netlists or HDLs, for
example) to describe circuits.
In a gate array the channel capacity must be a multiple of the number of
horizontal tracks in the gate-array base cell. Figure 17.6 (e) shows a gate-array
base cell with seven horizontal tracks (see Section 17.2 for the factors that
determine the track width and track spacing). Thus, in this gate array, we can
have a channel with a capacity of 7, 14, 21, ... horizontal tracks, but not values
in between.
FIGURE 17.6 Gate-array global routing. (a) A small gate array. (b) An enlarged
view of the routing. The top channel uses three rows of gate-array base cells; the
other channels use only one. (c) A further enlarged view showing how the
routing in the channels connects to the logic cells. (d) One of the logic cells, an
inverter. (e) There are seven horizontal wiring tracks available in one row of
gate-array base cells; the channel capacity is thus 7.

Figure 17.7 shows the inverter macro for the sea-of-gates array shown in
Figure 17.6 . Figure 17.7 (a) shows the base cell. Figure 17.7 (b) shows how the
internal inverter wiring on m1 leaves one vertical track free as a feedthrough in a
two-level metal process (connectors placed at the top and bottom of the cell). In a
three-level metal process the connectors may be placed inside the cell abutment
box ( Figure 17.7 c). Figure 17.8 shows the global routing for the sea-of-gates
array. We divide the array into nonoverlapping routing bins (or just bins , also
called global routing cells or GRCs ), each containing a number of gate-array
base cells.
FIGURE 17.7 The gate-array inverter from Figure 17.6 d. (a) An oxide-isolated
gate-array base cell, showing the diffusion and polysilicon layers. (b) The metal
and contact layers for the inverter in a 2LM (two-level metal) process. (c) The
router's view of the cell in a 3LM process.

We need an aside to discuss our use of the term cell . Be careful not to confuse
the global routing cells with gate-array base cells (the smallest element of a gate
array, consisting of a small number of n -type and p -type transistors), or with
logic cells (which are NAND gates, NOR gates, and so on).
A large routing bin reduces the size of the routing problem, and a small routing
bin allows the router to calculate the wiring capacities more accurately. Some
tools permit routing bins of different size in different areas of the chip (with
smaller routing bins helping in areas of dense routing). Figure 17.8 (a) shows a
routing bin that is 2-by-4 gate-array base cells. The logic cells occupy the lower
half of the routing bin. The upper half of the routing bin is the channel area,
reserved for wiring. The global router calculates the edge capacities for this
routing bin, including the vertical feedthroughs. The global router then
determines the shortest path for each net considering these edge capacities. An
example of a global-routing calculation is shown in Figure 17.8 (b). The path,
described by a series of adjacent routing bins, is passed to the detailed router.
FIGURE 17.8 Global routing a gate array. (a) A single global-routing cell (GRC
or routing bin) containing 2-by-4 gate-array base cells. For this choice of routing
bin the maximum horizontal track capacity is 14, the maximum vertical track
capacity is 12. The routing bin labeled C3 contains three logic cells, two of
which have feedthroughs marked 'f'. This results in the edge capacities shown.
(b) A view of the top left-hand corner of the gate array showing 28 routing bins.
The global router uses the edge capacities to find a sequence of routing bins to
connect the nets.
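The edge-capacity bookkeeping described above can be sketched as a breadth-first search over the grid of routing bins that refuses to cross any bin-to-bin edge whose capacity is exhausted. The grid size, the default capacity of one track per edge, and the congested edge below are all invented for illustration:

```python
from collections import deque

def route(grid_w, grid_h, src, dst, used, capacity, default_cap=1):
    """BFS over routing bins. The edge between two adjacent bins is
    usable only while used[edge] < its capacity. Returns a bin path
    (list of (x, y) bins) or None if the net cannot be globally routed."""
    def edge(a, b):
        return (min(a, b), max(a, b))  # canonical (undirected) edge key
    prev = {src: None}
    q = deque([src])
    while q:
        x, y = q.popleft()
        if (x, y) == dst:
            break
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < grid_w and 0 <= ny < grid_h):
                continue  # outside the chip
            e = edge((x, y), nxt)
            if nxt in prev or used.get(e, 0) >= capacity.get(e, default_cap):
                continue  # already visited, or this channel edge is full
            prev[nxt] = (x, y)
            q.append(nxt)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

# A 3x2 grid of bins; every edge holds one track; the direct edge
# between bins (1,0) and (2,0) is already full, forcing a detour.
used = {((1, 0), (2, 0)): 1}
print(route(3, 2, (0, 0), (2, 0), used, {}))
# [(0, 0), (1, 0), (1, 1), (2, 1), (2, 0)]
```

After each net is routed, a real tool would increment `used` along the chosen path; an order-dependent sequential router is essentially this loop run once per net.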

17.1.6 Timing-Driven Methods


Minimizing the total path length using a Steiner tree does not necessarily
minimize the interconnect delay of a path. Alternative tree algorithms apply in
this situation, most using the Elmore constant as a method to estimate the delay
of a path ( Section 17.1.2 ). As in timing-driven placement, there are two main
approaches to timing-driven routing: net-based and path-based. Path-based
methods are more sophisticated. For example, if there is a critical path from logic
cell A to B to C, the global router may increase the delay due to the interconnect
between logic cells A and B if it can reduce the delay between logic cells B and
C. Placement and global routing tools may or may not use the same algorithm to
estimate net delay. If these tools are from different companies, the algorithms are
probably different. The algorithms must be compatible, however. There is no use
performing placement to minimize predicted delay if the global router uses
completely different measurement methods. Companies that produce
floorplanning and placement tools make sure that the output is compatible with
different routing tools, often to the extent of using different algorithms to target
different routers.
17.1.7 Back-annotation
After global routing is complete it is possible to accurately predict what the
length of each interconnect in every net will be after detailed routing, probably to
within 5 percent. The global router can give us not just an estimate of the total net
length (which was all we knew at the placement stage), but the resistance and
capacitance of each path in each net. This RC information is used to calculate net
delays. We can back-annotate this net delay information to the synthesis tool for
in-place optimization or to a timing verifier to make sure there are no timing
surprises. Differences in timing predictions at this point arise due to the different
ways in which the placement algorithms estimate the paths and the way the
global router actually builds the paths.
17.2 Detailed Routing
The global routing step determines the channels to be used for each interconnect.
Using this information the detailed router decides the exact location and layers
for each interconnect. Figure 17.9 (a) shows typical metal rules. These rules
determine the m1 routing pitch ( track pitch , track spacing , or just pitch ). We
can set the m1 pitch to one of three values:
1. via-to-via ( VTV ) pitch (or spacing),
2. via-to-line ( VTL or line-to-via ) pitch, or
3. line-to-line ( LTL ) pitch.
The same choices apply to the m2 and other metal layers if they are present.
Via-to-via spacing allows the router to place vias adjacent to each other.
Via-to-line spacing is hard to use in practice because it restricts the router to
nonadjacent vias. Using line-to-line spacing prevents the router from placing a
via at all without using jogs and is rarely used. Via-to-via spacing is the easiest
for a router to use and the most common. Using either via-to-line or via-to-via
spacing means that the routing pitch is larger than the minimum metal pitch.
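To see why the pitch depends on this choice, here is the arithmetic for an invented set of λ-based rules (metal width 3λ, metal spacing 3λ, a 2λ via cut with 1λ metal overlap on each side). These numbers are hypothetical and are not the G5 rules of Figure 17.9:

```python
# Routing-pitch arithmetic for hypothetical lambda-based rules.
# All dimensions are in units of lambda.
metal_width = 3.0   # minimum metal line width
metal_space = 3.0   # minimum metal-to-metal spacing
via_cut     = 2.0   # via cut size
via_overlap = 1.0   # metal overlap of the via cut, each side

via_pad = via_cut + 2 * via_overlap    # metal landing pad over a via (4)

vtv = via_pad + metal_space                        # via-to-via pitch
vtl = via_pad / 2 + metal_space + metal_width / 2  # via-to-line pitch
ltl = metal_width + metal_space                    # line-to-line pitch

print(vtv, vtl, ltl)  # 7.0 6.5 6.0
```

With these rules the via-to-via pitch (7λ) is one full λ wider than the minimum metal pitch (6λ), which is the trade-off the text describes: the easiest spacing for the router to use also wastes the most routing area.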
Sometimes people draw a distinction between a cut and a via when they talk
about large connections such as shown in Figure 17.10 (a). We split or stitch a
large via into identically sized cuts (sometimes called a waffle via ). Because of
the profile of the metal in a contact and the way current flows into a contact,
often the total resistance of several small cuts is less than that of one large cut.
Using identically sized cuts also means the processing conditions during contact
etching, which may vary with the area and perimeter of a contact, are the same
for every cut on the chip.
In a stacked via the contact cuts all overlap in a layout plot and it is impossible to
tell just how many vias on which layers are present. Figure 17.10 (b-f) show an
alternative way to draw contacts and vias. Though this is not a standard, using the
diagonal box convention makes it possible to recognize stacked vias and contacts
on a layout (in any orientation). I shall use these conventions when it is
necessary.
FIGURE 17.9 The metal routing pitch. (a) An example of λ-based metal design
rules for m1 and via1 (m1/m2 via). (b) Via-to-via pitch for adjacent vias.
(c) Via-to-line (or line-to-via) pitch for nonadjacent vias. (d) Line-to-line
pitch with no vias.

FIGURE 17.10 (a) A large m1 to m2 via. The black squares represent the holes
(or cuts) that are etched in the insulating material between the m1 and m2
layers. (b) A m1 to m2 via (a via1). (c) A contact from m1 to diffusion or
polysilicon (a contact). (d) A via1 placed over (or stacked over) a contact.
(e) A m2 to m3 via (a via2). (f) A via2 stacked over a via1 stacked over a
contact. Notice that the black squares in parts b-f do not represent the
actual locations of the cuts; the squares are offset so you can recognize
stacked vias and contacts.
In a two-level metal CMOS ASIC technology we complete the wiring using the
two different metal layers for the horizontal and vertical directions, one layer for
each direction. This is Manhattan routing , because the results look similar to the
rectangular north-south and east-west layout of streets in New York City. Thus,
for example, if terminals are on the m2 layer, then we route the horizontal
branches in a channel using m2 and the vertical trunks using m1. Figure 17.11
shows that, although we may choose a preferred direction for each metal layer
(for example, m1 for horizontal routing and m2 for vertical routing), this may
lead to problems in cases that have both horizontal and vertical channels. In these
cases we define a preferred metal layer in the direction of the channel spine. In
Figure 17.11 , because the logic cell connectors are on m2, any vertical channel
has to use vias at every logic cell location. By changing the orientation of the
metal directions in vertical channels, we can avoid this, and instead we only need
to place vias at the intersection of horizontal and vertical channels.

FIGURE 17.11 An expanded view of part of a cell-based ASIC. (a) Both
channel 4 and channel 5 use m1 in the horizontal direction and m2 in the
vertical direction. If the logic cell connectors are on m2 this requires vias
to be placed at every logic cell connector in channel 4. (b) Channels 4 and 5
are routed with m1 along the direction of the channel spine (the long
direction of the channel). Now vias are required only for nets 1 and 2, at the
intersection of the channels.

Figure 17.12 shows an imaginary logic cell with connectors. Double-entry logic
cells intended for two-level metal routing have connectors at the top and bottom
of the logic cell, usually in m2. Logic cells intended for processes with three or
more levels of metal have connectors in the center of the cell, again usually on
m2. Logic cells may use both m1 and m2 internally, but the use of m2 is usually
minimized. The router normally uses a simplified view of the logic cell called a
phantom . The phantom contains only the logic cell information that the router
needs: the connector locations, types, and names; the abutment and bounding
boxes; enough layer information to be able to place cells without violating design
rules; and a blockage map: the locations of any metal inside the cell that blocks
routing.
FIGURE 17.12 The different types of connections that can be made to a cell.
This cell has connectors at the top and bottom of the cell (normal for cells
intended for use with a two-level metal process) and internal connectors (normal
for logic cells intended for use with a three-level metal process). The
interconnect and connections are drawn to scale.

Figure 17.13 illustrates some terms used in the detailed routing of a channel. The
channel spine in Figure 17.13 is horizontal with terminals at the top and the
bottom, but a channel can also be vertical. In either case terminals are spaced
along the longest edges of the channel at given, fixed locations. Terminals are
usually located on a grid defined by the routing pitch on that layer (we say
terminals are either on-grid or off-grid ). We make connections between
terminals using interconnects that consist of one or more trunks running parallel
to the length of the channel and branches that connect the trunk to the terminals.
If more than one trunk is used, the trunks are connected by doglegs . Connections
exit the channel at pseudoterminals .
FIGURE 17.13 Terms used in channel routing. (a) A channel with four
horizontal tracks. (b) An expanded view of the left-hand portion of the channel
showing (approximately to scale) how the m1 and m2 layers connect to the logic
cells on either side of the channel. (c) The construction of a via1 (m1/m2 via).

The trunk and branch connections run in tracks (equispaced, like railway tracks).
If the trunk connections use m1, the horizontal track spacing (usually just called
the track spacing for channel routing) is equal to the m1 routing pitch. The
maximum number of interconnects we need in a channel multiplied by the
horizontal track spacing gives the minimum height of a channel (see
Section 17.2.2 on how to determine the maximum number of interconnects
needed). Each terminal occupies a column . If the branches use m2, the column
spacing (or vertical track spacing ) is equal to the m2 routing pitch.

17.2.1 Goals and Objectives


The goal of detailed routing is to complete all the connections between logic
cells. The most common objective is to minimize one or more of the following:
● The total interconnect length and area

● The number of layer changes that the connections have to make

● The delay of critical paths

Minimizing the number of layer changes corresponds to minimizing the number


of vias that add parasitic resistance and capacitance to a connection.
In some cases the detailed router may not be able to complete the routing in the
area provided. In the case of a cell-based ASIC or sea-of-gates array, it is
possible to increase the channel size and try the routing steps again. A channeled
gate array or FPGA has fixed routing resources and in these cases we must start
all over again with floorplanning and placement, or use a larger chip.

17.2.2 Measurement of Channel Density


We can describe a channel-routing problem by specifying two lists of nets: one
for the top edge of the channel and one for the bottom edge. The position of the
net number in the list gives the column position. The net number zero represents
a vacant or unused terminal. Figure 17.14 shows a channel with the numbered
terminals to be connected along the top and the bottom of the channel.
We call the number of nets that cross a line drawn vertically anywhere in a
channel the local density . We call the maximum local density of the channel the
global density or sometimes just channel density . Figure 17.14 has a channel
density of 4. Channel density is an important measure in routing: it tells a
router the absolute minimum number of horizontal interconnects that it needs at
the point
where the local density is highest. In two-level routing (all the horizontal
interconnects run on one routing layer) the channel density determines the
minimum height of the channel. The channel capacity is the maximum number of
interconnects that a channel can hold. If the channel density is greater than the
channel capacity, that channel definitely cannot be routed (to learn how channel
density is calculated, see Section 17.2.5 ).
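The channel-density definition above can be computed directly from the two terminal lists. The following sketch (the net numbers are arbitrary) also converts the global density into the minimum channel height for an assumed 8 λ track pitch:

```python
# Sketch: local and global channel density from the top and bottom terminal
# lists of a channel (0 = vacant terminal). Both lists must have the same
# length. Local density at a column counts the nets whose trunks span it.

def channel_density(top, bottom):
    """Return (global density, list of local densities per column)."""
    span = {}  # net -> (leftmost column, rightmost column)
    for col, net in enumerate(list(top) + list(bottom)):
        c = col % len(top)
        if net == 0:
            continue
        lo, hi = span.get(net, (c, c))
        span[net] = (min(lo, c), max(hi, c))
    local = [sum(1 for lo, hi in span.values() if lo <= c <= hi)
             for c in range(len(top))]
    return max(local), local

top    = [1, 0, 2, 2, 3, 0]
bottom = [0, 1, 3, 0, 2, 3]
g, local = channel_density(top, bottom)
print("global density:", g)
print("minimum height at an 8-lambda track pitch:", g * 8, "lambda")
```

For two-level routing the second printed value is the minimum channel height described in Section 17.2.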

FIGURE 17.14 The definitions of local channel density and global channel
density. Lines represent the m1 and m2 interconnect in the channel to simplify
the drawing.

17.2.3 Algorithms
We start discussion of routing methods by simplifying the general
channel-routing problem. The restricted channel-routing problem limits each net
in a channel to use only one horizontal segment. In other words the channel
router uses only one trunk for each net. This restriction has the effect of
minimizing the number of connections between the routing layers. This is
equivalent to minimizing the number of vias used by the channel router in a
two-layer metal technology. Minimizing the number of vias is an important
objective in routing a channel, but it is not always practical. Sometimes
constraints will force a channel router to use jogs or other methods to complete
the routing (see Section 17.2.5 ). Next, though, we shall study an algorithm that
solves the restricted channel-routing problem.

17.2.4 Left-Edge Algorithm


The left-edge algorithm ( LEA ) is the basis for several routing algorithms [
Hashimoto and Stevens, 1971]. The LEA applies to two-layer channel routing,
using one layer for the trunks and the other layer for the branches. For example,
m1 may be used in the horizontal direction and m2 in the vertical direction. The
LEA proceeds as follows:
1. Sort the nets according to the leftmost edges of the nets' horizontal
segments.
2. Assign the first net on the list to the first free track.
3. Assign the next net on the list that will fit to the track.
4. Repeat this process from step 3 until no more nets will fit in the current
track.
5. Repeat steps 2-4 until all nets have been assigned to tracks.
6. Connect the net segments to the top and bottom of the channel.
FIGURE 17.15 Left-edge algorithm. (a) Sorted list of segments.
(b) Assignment to tracks. (c) Completed channel route (with m1 and m2
interconnect represented by lines).

Figure 17.15 illustrates the LEA. The algorithm works as long as none of the
branches touch, which may occur if there are terminals in the same column
belonging to different nets. In this situation we have to make sure that the trunk
that connects to the top of the channel is placed above the lower trunk. Otherwise
two branches will overlap and short the nets together. In the next section we shall
examine this situation more closely.
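The LEA can be sketched directly in code. This is a first-fit variant of steps 1-5 above that, as noted, ignores vertical constraints; each net is given as the (left, right) column extent of its trunk, and the net names are invented:

```python
# Sketch of the left-edge algorithm for the restricted channel-routing
# problem: one trunk per net. A trunk fits on a track if its left edge lies
# strictly to the right of the rightmost trunk already on that track.

def left_edge(nets):
    """Assign each net's trunk to a track.

    nets: {name: (left, right)} column extents.
    Returns {name: track_index}, tracks numbered from 0.
    """
    order = sorted(nets, key=lambda n: nets[n][0])  # sort by left edge
    tracks = []        # rightmost occupied column of each track
    assignment = {}
    for net in order:
        left, right = nets[net]
        for t, occupied_until in enumerate(tracks):
            if left > occupied_until:   # fits after the last trunk
                tracks[t] = right
                assignment[net] = t
                break
        else:                           # no existing track fits: open one
            tracks.append(right)
            assignment[net] = len(tracks) - 1
    return assignment

nets = {"A": (0, 2), "B": (3, 6), "C": (1, 4), "D": (5, 7)}
print(left_edge(nets))  # A and B share track 0; C and D share track 1
```

With no vertical constraints this greedy assignment uses the minimum number of tracks, matching the guarantee stated in Section 17.2.5.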

17.2.5 Constraints and Routing Graphs


Two terminals that are in the same column in a channel create a vertical
constraint . We say that the terminal at the top of the column imposes a vertical
constraint on the lower terminal. We can draw a graph showing the vertical
constraints imposed by terminals. The nodes in a vertical-constraint graph
represent terminals. A vertical constraint between two terminals is shown by an
edge of the graph connecting the two terminals. A graph that contains
information in the direction of an edge is a directed graph . The arrow on the
graph edge shows the direction of the constraint, pointing to the lower terminal,
which is constrained. Figure 17.16 (a) shows an example of a channel, and
Figure 17.16 (b) shows its vertical constraint graph.
FIGURE 17.16 Routing graphs. (a) Channel with a global density of 4. (b) The
vertical constraint graph. If two nets occupy the same column, the net at the top
of the channel imposes a vertical constraint on the net at the bottom. For
example, net 2 imposes a vertical constraint on net 4. Thus the interconnect for
net 4 must use a track above net 2. (c) Horizontal-constraint graph. If the
segments of two nets overlap, they are connected in the horizontal-constraint
graph. This graph determines the global channel density.

We can also define a horizontal constraint and a corresponding


horizontal-constraint graph . If the trunk for net 1 overlaps the trunk of net 2, then
we say there is a horizontal constraint between net 1 and net 2. Unlike a vertical
constraint, a horizontal constraint has no direction. Figure 17.16 (c) shows an
example of a horizontal-constraint graph with a group of four nets
(numbered 3, 5, 6, and 7) whose trunks must all overlap. Since this is the
largest such group, the global channel density is 4.
If there are no vertical constraints at all in a channel, we can guarantee that the
LEA will find the minimum number of routing tracks. The addition of vertical
constraints transforms the restricted routing problem into an NP-complete
problem. There is also an arrangement of vertical constraints that none of the
algorithms based on the LEA can cope with. In Figure 17.17 (a) net 1 is above
net 2 in the first column of the channel. Thus net 1 imposes a vertical constraint
on net 2. Net 2 is above net 1 in the last column of the channel. Then net 2 also
imposes a vertical constraint on net 1. It is impossible to route this arrangement
using two routing layers with the restriction of using only one trunk for each net.
If we construct the vertical-constraint graph for this situation, shown in
Figure 17.17 (b), there is a loop or cycle between nets 1 and 2. If there is any
such vertical-constraint cycle (or cyclic constraint ) between two or more nets,
the LEA will fail. A dogleg router removes the restriction that each net can use
only one track or trunk. Figure 17.17 (c) shows how adding a dogleg permits a
channel with a cyclic constraint to be routed.
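Detecting a cyclic constraint before routing is a simple graph exercise. The sketch below builds the vertical-constraint graph from the two terminal lists and runs a depth-first search for a cycle; the two-column example mirrors Figure 17.17 (a):

```python
# Sketch: build the vertical-constraint graph and detect a cyclic
# constraint, which defeats the basic left-edge algorithm.

def vertical_constraints(top, bottom):
    """Edges (u, v): net u at the top of a column constrains net v below it."""
    return {(t, b) for t, b in zip(top, bottom) if t and b and t != b}

def has_cycle(edges):
    """Depth-first search for a cycle in the directed constraint graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GRAY
        for v in adj.get(u, []):
            if color.get(v, WHITE) == GRAY:
                return True             # back edge: cycle found
            if color.get(v, WHITE) == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and dfs(u) for u in adj)

# Net 1 over net 2 in the first column, net 2 over net 1 in the last column.
print(has_cycle(vertical_constraints(top=[1, 2], bottom=[2, 1])))  # True
```

A router that finds such a cycle knows it must introduce a dogleg for at least one of the nets involved.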

FIGURE 17.17 The


addition of a dogleg, an
extra trunk, in the wiring
of a net can resolve cyclic
vertical constraints.

The channel-routing algorithms we have described so far do not allow


interconnects on one layer to run on top of other interconnects on a different
layer. These algorithms allow interconnects to cross at right angles to each other
on different layers, but not to overlap . When we remove the restriction that
horizontal and vertical routing must use different layers, the density of a channel
is no longer the lower bound for the number of tracks required. For two routing
layers the ultimate lower bound becomes half of the channel density. The
practical reasoning for restricting overlap is the parasitic overlap capacitance
between signal interconnects. As the dimensions of the metal interconnect are
reduced, the capacitance between adjacent interconnects on the same layer (
coupling capacitance ) is comparable to the capacitance of interconnects that
overlap on different layers ( overlap capacitance ). Thus, allowing a short overlap
between interconnects on different layers may not be as bad as allowing two
interconnects to run adjacent to each other for a long distance on the same layer.
Some routers allow you to specify that two interconnects must not run adjacent to
each other for more than a specified length.
The channel height is fixed for channeled gate arrays; it is variable in discrete
steps for channelless gate arrays; it is continuously variable for cell-based ASICs.
However, for all these types of ASICs, the channel wiring is fully customized and
so may be compacted or compressed after a channel router has completed the
interconnect. The use of channel-routing compaction for a two-layer channel can
reduce the channel height by 15 percent to 20 percent [ Cheng et al., 1992].
Modern channel routers are capable of routing a channel at or near the theoretical
minimum density. We can thus consider channel routing a solved problem. Most
of the difficulty in detailed routing now comes from the need to route more than
two layers and to route arbitrary shaped regions. These problems are best handled
by area routers.

17.2.6 Area-Routing Algorithms


There are many algorithms used for the detailed routing of general-shaped areas
(see the paper by Ohtsuki in [ Ohtsuki, 1986]). Many of these were originally
developed for PCB wiring. The first group we shall cover and the earliest to be
used historically are the grid-expansion or maze-running algorithms. A second
group of methods, which are more efficient, are the line-search algorithms.
FIGURE 17.18 The Lee maze-running
algorithm. The algorithm finds a path from
source (X) to target (Y) by emitting a wave
from both the source and the target at the
same time. Successive outward moves are
marked in each bin. Once the target is
reached, the path is found by backtracking
(if there is a choice of bins with equal
labeled values, we choose the bin that
avoids changing direction). (The original
form of the Lee algorithm uses a single
wave.)

Figure 17.18 illustrates the Lee maze-running algorithm . The goal is to find a
path from X to Y (i.e., from the start, or source, to the finish, or target), avoiding
any obstacles. The algorithm is often called wave propagation because it sends
out waves, which spread out like those created by dropping a stone into a pond.
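A minimal single-wave version of the Lee algorithm can be written as a breadth-first search over the grid of bins (the grid and the endpoints here are invented; 1 marks an obstacle):

```python
# Sketch of the Lee algorithm: a wave of increasing labels spreads from the
# source; once the target is labeled, the path is found by backtracking
# through bins of decreasing label.

from collections import deque

def lee_route(grid, source, target):
    """Return a shortest path of (row, col) bins, or None if unroutable."""
    rows, cols = len(grid), len(grid[0])
    label = {source: 0}
    frontier = deque([source])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == target:
            break
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in label:
                label[(nr, nc)] = label[(r, c)] + 1
                frontier.append((nr, nc))
    if target not in label:
        return None                     # wave never reached the target
    path, cur = [target], target        # backtrack by decreasing label
    while cur != source:
        r, c = cur
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if label.get(nb) == label[cur] - 1:
                cur = nb
                break
        path.append(cur)
    return path[::-1]

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
print(lee_route(grid, (2, 0), (2, 3)))
```

The memory cost of labeling every bin is exactly what makes line-search algorithms such as Hightower's attractive by comparison.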
Algorithms that use lines rather than waves to search for connections are more
efficient than algorithms based on the Lee algorithm. Figure 17.19 illustrates the
Hightower algorithm, a line-search algorithm (or line-probe algorithm ):
1. Extend lines from both the source and target toward each other.
2. When an extended line, known as an escape line , meets an obstacle,
choose a point on the escape line from which to project another escape line
at right angles to the old one. This point is the escape point .
3. Place an escape point on the line so that the next escape line just misses the
edge of the obstacle. Escape lines emanating from the source and target
intersect to form the path.
FIGURE 17.19 Hightower
area-routing algorithm. (a)
Escape lines are constructed
from source (X) and target (Y)
toward each other until they hit
obstacles. (b) An escape point
is found on the escape line so
that the next escape line
perpendicular to the original
misses the next obstacle. The
path is complete when escape
lines from source and target
meet.

The Hightower algorithm is faster and requires less memory than methods based
on the Lee algorithm.
17.2.7 Multilevel Routing
Using two-layer routing , if the logic cells do not contain any m2, it is possible to
complete some routing in m2 using over-the-cell (OTC) routing. Sometimes poly
is used for short connections in the channel in a two-level metal technology; this
is known as 2.5-layer routing . Using a third level of metal in three-layer routing ,
there is a choice of approaches. Reserved-layer routing restricts all the
interconnect on each layer to flow in one direction in a given routing area (for
example, in a channel, either parallel or perpendicular to the channel spine).
Unreserved-layer routing moves in both horizontal and vertical directions on a
given layer. Most routers use reserved routing. Reserved three-level metal routing
offers another choice: Either use m1 and m3 for horizontal routing (parallel to the
channel spine), with m2 for vertical routing ( HVH routing ) or use VHV routing
. Since the logic cell interconnect usually blocks most of the area on the m1 layer,
HVH routing is normally used. It is also important to consider the pitch of the
layers when routing in the same direction on two different layers. Using HVH
routing it is preferable for the m3 pitch to be a simple multiple of the m1 pitch
(ideally they are the same). Some processes have more than three levels of metal.
Sometimes the upper one or two metal layers have a coarser pitch than the lower
layers and are used in multilevel routing for power and clock lines rather than for
signal interconnect.
Figure 17.20 shows an example of three-layer channel routing. The logic cells
are 64 λ high, the m1 routing pitch is 8 λ, and the m2 and m3 routing pitch is
16 λ. The channel in Figure 17.20 is the same as the channel using two-layer
metal shown in Figure 17.13, but using three-level metal reduces the channel
height from 40 λ (= 5 × 8 λ) to 16 λ. Submicron processes try to use the same
metal pitch on all metal layers. This makes routing easier but processing more
difficult.
FIGURE 17.20 Three-level channel routing. In this diagram the m2 and m3
routing pitch is set to twice the m1 routing pitch. Routing density can be
increased further if all the routing pitches can be made equal, a difficult process
challenge.

With three or more levels of metal routing it is possible to reduce the channel
height in a row-based ASIC to zero. All of the interconnect is then completed
over the cell. If all of the channels are eliminated, the core area (logic cells plus
routing) is determined solely by the logic-cell area. The point at which this
happens depends on not only the number of metal layers and channel density, but
also the routing resources (the blockages and feedthroughs) in the logic cell.
This is the cell porosity. Designing porous cells that help to minimize routing
area is an
art. For example, it is quite common to be able to produce a smaller chip using
larger logic cells if the larger cells have more routing resources.

17.2.8 Timing-Driven Detailed Routing


In detailed routing the global router has already set the path the interconnect will
follow. At this point little can be done to improve timing except to reduce the
number of vias, alter the interconnect width to optimize delay, and minimize
overlap capacitance. The gains here are relatively small, but for very long
branching nets even small gains may be important. For high-frequency clock nets
it may be important to shape and chamfer (round) the interconnect to match
impedances at branches and control reflections at corners.

17.2.9 Final Routing Steps


If the congestion-estimation algorithms in the floorplanning tool perfectly
reflected the algorithms used by the global router and detailed router,
routing completion should be guaranteed. Often, however, the detailed router will
not be able to completely route all the nets. These problematical nets are known
as unroutes . Routers handle this situation in one of two ways. The first method
leaves the problematical nets unconnected. The second method completes all
interconnects anyway but with some design-rule violations (the problematical
nets may be shorted to other nets, for example). Some tools flag these problems
as a warning (in fact there can be no more serious error).
If there are many unroutes the designer needs to discover the reason and return to
the floorplanner and change channel sizes (for a cell-based ASIC) or increase the
base-array size (for a gate array). Returning to the global router and changing bin
sizes or adjusting the algorithms may also help. In drastic cases it may be
necessary to change the floorplan. If just a handful of difficult nets remain to be
routed, some tools allow the designer to perform hand edits using a rip-up and
reroute router (sometimes this is done automatically by the detailed router as a
last phase in the routing procedure anyway). This capability also permits
engineering change orders (ECOs), corresponding to the little yellow wires on a
PCB. One of the last steps in routing is via removal: the detailed router looks to
see if it can eliminate any vias (which can contribute a significant amount to the
interconnect resistance) by changing layers or making other modifications to the
completed routing. Routing compaction can then be performed as the final step.
17.3 Special Routing
The routing of nets that require special attention, clock and power nets for
example, is normally done before detailed routing of signal nets. The architecture
and structure of these nets is performed as part of floorplanning, but the sizing
and topology of these nets is finalized as part of the routing step.

17.3.1 Clock Routing


Gate arrays normally use a clock spine (a regular grid), eliminating the need for
special routing (see Section 16.1.6, Clock Planning). The clock distribution grid
is designed at the same time as the gate-array base to ensure a minimum clock
skew and minimum clock latency, given power dissipation and clock buffer area
limitations. Cell-based ASICs may use either a clock spine, a clock tree, or a
hybrid approach. Figure 17.21 shows how a clock router may minimize clock
skew in a clock spine by making the path lengths, and thus net delays, to every
leaf node equal, using jogs in the interconnect paths if necessary. More
sophisticated clock routers perform clock-tree synthesis (automatically choosing
the depth and structure of the clock tree) and clock-buffer insertion (equalizing
the delay to the leaf nodes by balancing interconnect delays and buffer delays).
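The path-equalization idea of Figure 17.21 can be sketched crudely by treating net delay as proportional to path length and padding every leaf path with jogs to match the longest one (the leaf names and lengths below are invented):

```python
# Sketch: zero-skew by path-length equalization. Interconnect delay is
# naively taken as proportional to path length, so making all source-to-leaf
# lengths equal (by adding jogs) equalizes the net delays.

def equalize(path_lengths):
    """Return the extra jog length to add to each leaf path so that
    every path matches the longest one."""
    longest = max(path_lengths.values())
    return {leaf: longest - length for leaf, length in path_lengths.items()}

paths = {"ff1": 340.0, "ff2": 410.0, "ff3": 395.0}  # lengths in lambda
print(equalize(paths))  # ff2 needs no jog; the other paths are padded
```

Real clock routers refine this by balancing RC delays and buffer delays rather than raw lengths, as the clock-tree synthesis discussion above notes.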

FIGURE 17.21 Clock routing. (a) A clock network for the cell-based ASIC
from Figure 16.11. (b) Equalizing the interconnect segments between CLK and
all destinations (by including jogs if necessary) minimizes clock skew.

The clock tree may contain multiply-driven nodes (more than one active element
driving a net). The net delay models that we have used break down in this case
and we may have to extract the clock network and perform circuit simulation,
followed by back-annotation of the clock delays to the netlist (for circuit
extraction, see Section 17.4 ) and the bus currents to the clock router. The sizes of
the clock buses depend on the current they must carry. The limits are set by
reliability issues to be discussed next.
Clock skew induced by hot-electron wearout was mentioned in Section 16.1.6,
Clock Planning. Another factor contributing to unpredictable clock skew is
changes in clock-buffer delays with variations in power-supply voltage due to
data-dependent activity. This activity-induced clock skew can easily be larger
than the skew achievable using a clock router. For example, there is little point in
using software capable of reducing clock skew to less than 100 ps if, due to
fluctuations in power-supply voltage when part of the chip becomes active, the
clock-network delays change by 200 ps.
The power buses supplying the buffers driving the clock spine carry direct
current ( unidirectional current or DC), but the clock spine itself carries
alternating current ( bidirectional current or AC). The difference between
electromigration failure rates due to AC and DC leads to different rules for sizing
clock buses. As we explained in Section 16.1.6, Clock Planning, the fastest way
to drive a large load in CMOS is to taper successive stages by approximately e ≈
3. This is not necessarily the smallest-area or lowest-power approach, however [
Veendrick, 1984].

17.3.2 Power Routing


Each of the power buses has to be sized according to the current it will carry. Too
much current in a power bus can lead to a failure through a mechanism known as
electromigration [Young and Christou, 1994]. The required power-bus widths
can be estimated automatically from library information, from a separate power
simulation tool, or by entering the power-bus widths to the routing software by
hand. Many routers use a default power-bus width so that it is quite easy to
complete routing of an ASIC without even knowing about this problem.
For a direct current (DC) the mean time to failure (MTTF) due to
electromigration is experimentally found to obey the following equation:

MTTF = A J^-2 exp( E / kT ) ,  (17.9)

where J is the current density; E is an activation energy, approximately
0.5 eV; k, Boltzmann's constant, is 8.62 × 10^-5 eV K^-1; and T is absolute
temperature in kelvins.
There are a number of different approaches to model the effect of an AC
component. A typical expression is

MTTF = A exp( E / kT ) / ( J_av |J|_av + k_AC/DC |J|_av^2 ) ,  (17.10)

where J_av is the time average of J(t) and |J|_av is the time average of
|J(t)|. The constant k_AC/DC relates the relative effects of AC and DC and is
typically between 0.01 and 0.0001. For a purely unidirectional current
J_av = |J|_av = J, and Eq. 17.10 reduces to the form of Eq. 17.9.
Electromigration problems become serious, with a MTTF of less than 10^5 hours
(approximately 10 years), for DC current densities greater than 0.5 GA m^-2 at
temperatures above 150 °C.
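Because the prefactor A in Eq. 17.9 is process dependent and not given here, only MTTF ratios between operating points are meaningful. A sketch using the constants quoted above:

```python
# Sketch: relative MTTF from Eq. 17.9 (DC electromigration). The prefactor
# A is unknown, so we compare operating points rather than absolute lifetimes.

import math

E_A = 0.5       # activation energy, eV (from the text)
K_B = 8.62e-5   # Boltzmann's constant, eV/K

def mttf(j, t_kelvin, a=1.0):
    """Eq. 17.9: MTTF = A * J**-2 * exp(E / kT); units of A are arbitrary."""
    return a * j**-2 * math.exp(E_A / (K_B * t_kelvin))

# Doubling the current density quarters the MTTF at a fixed temperature.
ratio = mttf(1.0, 398.0) / mttf(2.0, 398.0)
print(ratio)  # 4.0
```

The exponential temperature term explains why the current limits in Table 17.1 relax at lower operating temperatures.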

Table 17.1 lists example metallization reliability rules (limits for the
current you can pass through a metal layer, contact, or via) for the typical
0.5 µm three-level metal CMOS process, G5. The limit of 1 mA of current per
square micron of metal cross section is a good rule of thumb to follow for
current density in aluminum-based interconnect.
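A sketch of the corresponding power-bus sizing arithmetic, using per-micron-of-width limits of the kind Table 17.1 quotes for a fixed metal thickness (the 250 mA bus current is an invented figure):

```python
# Sketch: minimum power-bus width from a current-per-unit-width limit.
# Table 17.1 gives limits of this form (e.g., 1 mA per micron for m1/m2,
# 2 mA per micron for the thicker m3), valid for one metal thickness.

def bus_width_um(bus_current_ma, limit_ma_per_um=1.0):
    """Minimum metal width in microns for a given bus current."""
    return bus_current_ma / limit_ma_per_um

# A hypothetical core drawing 250 mA through its VDD bus:
print(bus_width_um(250.0))        # 250.0 um on m1 (1 mA/um limit)
print(bus_width_um(250.0, 2.0))   # 125.0 um on m3 (2 mA/um limit)
```

Routing such a bus on the thicker m3 layer halves the required width, which is one reason upper metal layers are favored for power distribution.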
Some CMOS processes also have maximum metal-width rules (or fat-metal rules).
This is because stress, especially at the corners of the die, which arises
during die attach (mounting the die on the chip carrier), can cause large
metal areas to lift. A solution to this problem is to place slots in the wide
metal lines. These rules are dependent on the ASIC vendor's level of
experience.
To determine the power-bus widths we need to determine the bus currents. The
largest problem is emulating the system's operating conditions. Input vectors to
test the system are not necessarily representative of actual system operation.
Clock-bus sizing depends strongly on the parameter k AC/DC in Eq. 17.10 , since
the clock spine carries alternating current. (For the sources of power dissipation
in CMOS, see Section 15.5, Power Dissipation.)
Gate arrays normally use a regular power grid as part of the gate-array base. The
gate-array logic cells contain two fixed-width power buses inside the cell,
running horizontally on m1. The horizontal m1 power buses are then strapped in
a vertical direction by m2 buses, which run vertically across the chip. The
resistance of the power grid is extracted and simulated with SPICE during the
base-array design to model the effects of IR drops under worst-case conditions.
TABLE 17.1 Metallization reliability rules for a typical 0.5 micron
(λ = 0.25 µm) CMOS process.

Layer/contact/via                      Current limit 1  Metal thickness 2  Resistance 3
m1                                     1 mA µm^-1       7000 Å             95 mΩ/square
m2                                     1 mA µm^-1       7000 Å             95 mΩ/square
m3                                     2 mA µm^-1       12,000 Å           48 mΩ/square
0.8 µm square m1 contact to diffusion  0.7 mA                              11 Ω
0.8 µm square m1 contact to poly       0.7 mA                              16 Ω
0.8 µm square m1/m2 via (via1)         0.7 mA                              3.6 Ω
0.8 µm square m2/m3 via (via2)         0.7 mA                              3.6 Ω

Standard cells are constructed in a similar fashion to gate-array cells, with power
buses running horizontally in m1 at the top and bottom of each cell. A row of
standard cells uses end-cap cells that connect to the VDD and VSS power buses
placed by the power router. Power routing of cell-based ASICs may include the
option to include vertical m2 straps at specified intervals. Alternatively the
number of standard cells that can be placed in a row may be limited during
placement. The power router forms an interdigitated comb structure, minimizing
the number of times a VDD or VSS power bus needs to change layers. This is
achieved by routing with a routing bias on preferred layers. For example, VDD
may be routed with a left-and-down bias on m1, with VSS routed using
right-and-up bias on m2.
Three-level metal processes either use a m3 with a thickness and pitch that is
comparable to m1 and m2 (which usually have approximately the same thickness
and pitch) or they use metal that is much thicker (up to twice as thick as m1 and
m2) with a coarser pitch (up to twice as wide as m1 and m2). The factor that
determines the m3/4/5 properties is normally the sophistication of the fabrication
process.
In a three-level metal process, power routing is similar to two-level metal ASICs.
Power buses inside the logic cells are still normally run on m1. Using HVH
routing it would be possible to run the power buses on m3 and drop vias all the
way down to m1 when power is required in the cells. The problem with this
approach is that it creates pillars of blockage across all three layers.
Using three or more layers of metal for routing, it is possible to eliminate some of
the channels completely. In these cases we complete all the routing in m2 and m3
on top of the logic cells using connectors placed in the center of the cells on m1.
If we can eliminate the channels between cell rows, we can flip rows about a
horizontal axis and abut adjacent rows together (a technique known as flip and
abut ). If the power buses are at the top (VDD) and bottom (VSS) of the cells in
m1 we can abut or overlap the power buses (joining VDD to VDD and VSS to
VSS in alternate rows).
Power distribution schemes are also a function of process and packaging
technology. Recall that flip-chip technology allows pads to be placed anywhere
on a chip (see Section 16.1.5, I/O and Power Planning, especially
Figure 16.13d). Four-level metal and aggressive stacked-via rules allow I/O pad
circuits to be placed in the core. The problems with this approach include placing
the ESD and latch-up protection circuits required in the I/O pads (normally kept
widely separated from core logic) adjacent to the logic cells in the core.
1. At 125 °C for unidirectional current. Limits for 110 °C are ×1.5 higher. Limits
for 85 °C are ×3 higher. Current limits for bidirectional current are ×1.5 higher
than the unidirectional limits.
2. 10,000 Å (ten thousand angstroms) = 1 µm.
3. Worst case at 110 °C.
17.4 Circuit Extraction and DRC
After detailed routing is complete, the exact length and position of each
interconnect for every net is known. Now the parasitic capacitance and resistance
associated with each interconnect, via, and contact can be calculated. This data is
generated by a circuit-extraction tool in one of the formats described next. It is
important to extract the parasitic values that will be on the silicon wafer. The
mask data or CIF widths and dimensions that are drawn in the logic cells are not
necessarily the same as the final silicon dimensions. Normally mask dimensions
are altered from drawn values to allow for process bias or other effects that occur
during the transfer of the pattern from mask to silicon. Since this is a problem
that is dealt with by the ASIC vendor and not the design software vendor, ASIC
designers normally have to ask very carefully about the details of this problem.
Table 17.2 shows values for the parasitic capacitances for a typical 1 µm CMOS
process. Notice that the fringing capacitance is greater than the parallel-plate
(area) capacitance for all layers except poly. Next, we shall describe how the
parasitic information is passed between tools.
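To see how the two components in Table 17.2 combine, consider a straight, isolated wire of width W and length L: the area term scales with W × L and the fringing term with the two long edges, 2L. A minimal sketch (the helper function is ours, not from any extraction tool; the values used are the m1-to-substrate entries from Table 17.2):

```python
def wire_capacitance_fF(width_um, length_um, area_fF_per_um2, fringe_fF_per_um):
    """Total capacitance of an isolated straight wire:
    parallel-plate (area) term plus fringing along both long edges."""
    area_term = area_fF_per_um2 * width_um * length_um
    fringe_term = fringe_fF_per_um * 2 * length_um  # two long edges
    return area_term + fringe_term

# m1 over substrate, from Table 17.2: 0.031 fF/um^2 area, 0.044 fF/um fringing.
c = wire_capacitance_fF(width_um=1.0, length_um=100.0,
                        area_fF_per_um2=0.031, fringe_fF_per_um=0.044)
# For this 1 um x 100 um line the fringing term (8.8 fF) dominates
# the area term (3.1 fF), illustrating the point made in the text.
```

For minimum-width wires the fringing term usually dominates, which is why extraction tools cannot simply use parallel-plate capacitance.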

17.4.1 SPF, RSPF, and DSPF


The standard parasitic format ( SPF ) (developed by Cadence [ 1990], now in the
hands of OVI) describes interconnect delay and loading due to parasitic
resistance and capacitance. There are three different forms of SPF: two of them (
regular SPF and reduced SPF ) contain the same information, but in different
formats, and model the behavior of interconnect; the third form of SPF ( detailed
SPF ) describes the actual parasitic resistance and capacitance components of a
net. Figure 17.22 shows the different types of simplified models that regular and
reduced SPF support. The load at the output of gate A is represented by one of
three models: lumped-C, lumped-RC, or PI segment. The pin-to-pin delays are
modeled by RC delays. You can represent the pin-to-pin interconnect delay by an
ideal voltage source, V(A_1) in this case, driving an RC network attached to each
input pin. The actual pin-to-pin delays may not be calculated this way, however.
TABLE 17.2 Parasitic capacitances for a typical 1 µm ( λ = 0.5 µm)
three-level metal CMOS process. 1

Element                                Area (fF/µm²)   Fringing (fF/µm)
poly (over gate oxide) to substrate    1.73            NA 2
poly (over field oxide) to substrate   0.058           0.043
m1 to diffusion or poly                0.055           0.049
m1 to substrate                        0.031           0.044
m2 to diffusion                        0.019           0.038
m2 to substrate                        0.015           0.035
m2 to poly                             0.022           0.040
m2 to m1                               0.035           0.046
m3 to diffusion                        0.011           0.034
m3 to substrate                        0.010           0.033
m3 to poly                             0.012           0.034
m3 to m1                               0.016           0.039
m3 to m2                               0.035           0.049
n+ junction (at 0 V bias)              0.36            NA
p+ junction (at 0 V bias)              0.46            NA
FIGURE 17.22 The regular and reduced standard parasitic format (SPF) models
for interconnect. (a) An example of an interconnect network with fanout. The
driving-point admittance of the interconnect network is Y ( s ). (b) The SPF
model of the interconnect. (c) The lumped-capacitance interconnect model.
(d) The lumped-RC interconnect model. (e) The PI segment interconnect model
(notice the capacitor nearest the output node is labeled C 2 rather than C 1 ). The
values of C , R , C 1 , and C 2 are calculated so that Y 1 ( s ), Y 2 ( s ), and Y 3 ( s
) are the first-, second-, and third-order Taylor-series approximations to Y ( s ).

The key features of regular and reduced SPF are as follows:


● The loading effect of a net as seen by the driving gate is represented by
choosing one of three different RC networks: lumped-C, lumped-RC, or PI
segment (selected when generating the SPF) [ O'Brien and Savarino,
1989].
● The pin-to-pin delays of each path in the net are modeled by a simple RC
delay (one for each path). This can be the Elmore constant for each path
(see Section 17.1.2 ), but it need not be.
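As a reminder of how such an Elmore constant is formed: each resistance on the path from driver to pin is multiplied by all the capacitance downstream of it, and the products are summed. A sketch for a simple RC ladder (the topology and element values are illustrative, not taken from any SPF file):

```python
def elmore_delay(segments):
    """Elmore delay of an RC ladder driven at one end.
    segments: list of (R, C) pairs ordered from driver to load;
    each resistor charges the capacitance of every segment at or
    beyond its own position."""
    delay = 0.0
    for i, (r, _) in enumerate(segments):
        downstream_c = sum(c for _, c in segments[i:])
        delay += r * downstream_c
    return delay

# Two segments of 100 ohm / 0.05 pF each (illustrative values):
# the first resistor sees 0.1 pF, the second sees 0.05 pF,
# giving 10 ps + 5 ps = 15 ps.
d = elmore_delay([(100.0, 0.05e-12), (100.0, 0.05e-12)])  # seconds
```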

Here is an example regular SPF file for just one net that uses the PI segment
model shown in Figure 17.22 (e):

#Design Name : EXAMPLE1


#Date : 6 August 1995
#Time : 12:00:00
#Resistance Units : 1 ohms
#Capacitance Units : 1 pico farads
#Syntax :
#N <netName>
#C <capVal>
# F <from CompName> <fromPinName>
# GC <conductance>
#|
# REQ <res>
# GRC <conductance>
# T <toCompName> <toPinName> RC <rcConstant> A <value>
#|
# RPI <res>
# C1 <cap>
# C2 <cap>
# GPI <conductance>
# T <toCompName> <toPinName> RC <rcConstant> A <value>
# TIMING.ADMITTANCE.MODEL = PI
# TIMING.CAPACITANCE.MODEL = PP
N CLOCK
C 3.66
F ROOT Z
RPI 8.85
C1 2.49
C2 1.17
GPI = 0.0
T DF1 G RC 22.20
T DF2 G RC 13.05
This file describes the following:
● The preamble contains the file format.

● This representation uses the PI segment model ( Figure 17.22 e).

● This net uses pin-to-pin timing.


● The driving gate of this net is ROOT and the output pin name is Z .
● The PI segment elements have values: C1 = 2.49 pF, C2 = 1.17 pF, RPI =
8.85 Ω. Notice the order of C1 and C2 in Figure 17.22 (e). The element
GPI is not normally used in SPF files.
● The delay from output pin Z of ROOT to input pin G of DF1 is 22.20 ns.
● The delay from pin Z of ROOT to pin G of DF2 is 13.05 ns.
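One useful consistency property of the listing above is that the total net capacitance C equals C1 + C2 (2.49 pF + 1.17 pF = 3.66 pF). A hypothetical checker, not part of any SPF tool, that extracts and verifies this:

```python
def parse_pi_net(spf_lines):
    """Pull the C, C1, C2, and RPI values out of a regular-SPF
    net record (values in the file's declared units, here pF and ohms)."""
    values = {}
    for line in spf_lines:
        fields = line.split()
        if fields and fields[0] in ("C", "C1", "C2", "RPI"):
            values[fields[0]] = float(fields[1])
    return values

net = parse_pi_net([
    "N CLOCK", "C 3.66", "F ROOT Z",
    "RPI 8.85", "C1 2.49", "C2 1.17",
])
# The total load matches the PI capacitors: 2.49 + 1.17 = 3.66 pF.
assert abs(net["C"] - (net["C1"] + net["C2"])) < 1e-9
```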
The reduced SPF ( RSPF) contains the same information as regular SPF, but uses
the SPICE format. Here is an example RSPF file that corresponds to the previous
regular SPF example:
* Design Name : EXAMPLE1
* Date : 6 August 1995
* Time : 12:00:00
* Resistance Units : 1 ohms
* Capacitance Units : 1 pico farads
*| RSPF 1.0
*| DELIMITER "_"
.SUBCKT EXAMPLE1 OUT IN
*| GROUND_NET VSS
* TIMING.CAPACITANCE.MODEL = PP
*|NET CLOCK 3.66PF
*|DRIVER ROOT_Z ROOT Z
*|S (ROOT_Z_OUTP1 0.0 0.0)
R2 ROOT_Z ROOT_Z_OUTP1 8.85
C1 ROOT_Z_OUTP1 VSS 2.49PF
C2 ROOT_Z VSS 1.17PF
*|LOAD DF1_G DF1 G
*|S (DF1_G_INP1 0.0 0.0)
E1 DF1_G_INP1 VSS ROOT_Z VSS 1.0
R3 DF1_G_INP1 DF1_G 22.20
C3 DF1_G VSS 1.0PF
*|LOAD DF2_G DF2 G
*|S (DF2_G_INP1 0.0 0.0)
E2 DF2_G_INP1 VSS ROOT_Z VSS 1.0
R4 DF2_G_INP1 DF2_G 13.05
C4 DF2_G VSS 1.0PF
*Instance Section
XDF1 DF1_Q DF1_QN DF1_D DF1_G DF1_CD DF1_VDD DF1_VSS DFF3
XDF2 DF2_Q DF2_QN DF2_D DF2_G DF2_CD DF2_VDD DF2_VSS DFF3
XROOT ROOT_Z ROOT_A ROOT_VDD ROOT_VSS BUF
.ENDS
.END
This file has the following features:
● The PI segment elements ( C1 , C2 , and R2 ) have the same values as the
previous example.
● The pin-to-pin delays are modeled at each of the gate inputs with a
capacitor of value 1 pF ( C3 and C4 here) and a resistor ( R3 and R4 )
adjusted to give the correct RC delay. Since the load on the output gate is
modeled by the PI segment it does not matter what value of capacitance is
chosen here.
● The RC elements at the gate inputs are driven by ideal voltage sources ( E1
and E2 ) that are equal to the voltage at the output of the driving gate.
The detailed SPF ( DSPF) shows the resistance and capacitance of each segment
in a net, again in a SPICE format. There are no models or assumptions on
calculating the net delays in this format. Here is an example DSPF file that
describes the interconnect shown in Figure 17.23 (a):

.SUBCKT BUFFER OUT IN


* Net Section
*|GROUND_NET VSS
*|NET IN 3.8E-01PF
*|P (IN I 0.0 0.0 5.0)
*|I (INV1:A INV1 A I 0.0 10.0 5.0)
C1 IN VSS 1.1E-01PF
C2 INV1:A VSS 2.7E-01PF
R1 IN INV1:A 1.7E00
*|NET OUT 1.54E-01PF
*|S (OUT:1 30.0 10.0)
*|P (OUT O 0.0 30.0 0.0)
*|I (INV1:OUT INV1 OUT O 0.0 20.0 10.0)
C3 INV1:OUT VSS 1.4E-01PF
C4 OUT:1 VSS 6.3E-03PF
C5 OUT VSS 7.7E-03PF
R2 INV1:OUT OUT:1 3.11E00
R3 OUT:1 OUT 3.03E00
*Instance Section
XINV1 INV1:A INV1:OUT INV
.ENDS
The nonstandard SPICE statements in DSPF are comments that start with '*|' and
have the following formats:
*|I(InstancePinName InstanceName PinName PinType PinCap X Y)
*|P(PinName PinType PinCap X Y)
*|NET NetName NetCap
*|S(SubNodeName X Y)
*|GROUND_NET NetName
Figure 17.23 (b) illustrates the meanings of the DSPF terms: InstancePinName ,
InstanceName , PinName , NetName , and SubNodeName . The PinType is I (for
IN) or O (the letter 'O', not zero, for OUT). The NetCap is the total capacitance
on each net. Thus for net IN, the net capacitance is
0.38 pF = C1 + C2 = 0.11 pF + 0.27 pF.
This particular file does not use the pin capacitances, PinCap . Since the DSPF
represents every interconnect segment, DSPF files can be very large
(hundreds of megabytes).
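The NetCap bookkeeping above can be checked mechanically by summing the ground capacitors that belong to a net. A sketch using values hand-copied from the DSPF listing for net IN (a real parser would map each capacitor's nodes back to its owning net):

```python
# Ground capacitors attached to net IN in the DSPF listing (pF):
# C1 IN VSS 1.1E-01PF and C2 INV1:A VSS 2.7E-01PF.
ground_caps_pF = {"C1": 0.11, "C2": 0.27}

# Declared total from the *|NET IN 3.8E-01PF line.
declared_net_cap_pF = 0.38

# The declared NetCap must equal the sum of the per-segment capacitors.
total_pF = sum(ground_caps_pF.values())
assert abs(total_pF - declared_net_cap_pF) < 1e-9
```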

FIGURE 17.23 The detailed standard parasitic format (DSPF) for interconnect
representation. (a) An example network with two m2 paths connected to a logic
cell, INV1. The grid shows the coordinates. (b) The equivalent DSPF circuit
corresponding to the DSPF file in the text.

17.4.2 Design Checks


ASIC designers perform two major checks before fabrication. The first check is a
design-rule check ( DRC ) to ensure that nothing has gone wrong in the process
of assembling the logic cells and routing. The DRC may be performed at two
levels. Since the detailed router normally works with logic-cell phantoms, the
first level of DRC is a phantom-level DRC , which checks for shorts, spacing
violations, or other design-rule problems between logic cells. This is principally a
check of the detailed router. If we have access to the real library-cell layouts
(sometimes called hard layout ), we can instantiate the phantom cells and perform
a second-level DRC at the transistor level. This is principally a check of the
correctness of the library cells. Normally the ASIC vendor will perform this
check using its own software as a type of incoming inspection. The Cadence
Dracula software is one de facto standard in this area, and you will often hear
reference to a Dracula deck that consists of the Dracula code describing an ASIC
vendor's design rules. Sometimes ASIC vendors will give their Dracula decks to
customers so that the customers can perform the DRCs themselves.
The other check is a layout versus schematic ( LVS ) check to ensure that what is
about to be committed to silicon is what is really wanted. An electrical schematic
is extracted from the physical layout and compared to the netlist. This closes a
loop between the logical and physical design processes and ensures that both are
the same. The LVS check is not as straightforward as it may sound, however.
The first problem with an LVS check is that the transistor-level netlist for a large
ASIC forms an enormous graph. LVS software essentially has to match this
graph against a reference graph that describes the design. Ensuring that every
node corresponds exactly to a corresponding element in the schematic (or HDL
code) is a very difficult task. The first step is normally to match certain key nodes
(such as the power supplies, inputs, and outputs), but the process can very
quickly become bogged down in the thousands of mismatch errors that are
inevitably generated initially.
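The key-node seeding just described can be illustrated with a toy signature check: seeded pairs of nodes (supplies, inputs, outputs) must at least agree on local graph structure before a match can be extended. This is only a sketch of the idea; production LVS tools also handle device parameters, swappable pins, and incremental error reporting:

```python
def node_signature(graph, node):
    """Local signature: this node's degree plus the sorted degrees
    of its neighbors. Two matched nodes must agree on this."""
    return (len(graph[node]),
            tuple(sorted(len(graph[n]) for n in graph[node])))

def seeds_consistent(g1, g2, seed_pairs):
    """Check that each seeded node pair has identical local
    signatures in the two netlist graphs."""
    return all(node_signature(g1, a) == node_signature(g2, b)
               for a, b in seed_pairs)

# Two tiny netlist graphs (node -> neighbor list); a single device X/M
# connected to power, ground, one input, and one output.
layout = {"VDD": ["X"], "VSS": ["X"], "IN": ["X"], "OUT": ["X"],
          "X": ["VDD", "VSS", "IN", "OUT"]}
schematic = {"VDD": ["M"], "VSS": ["M"], "IN": ["M"], "OUT": ["M"],
             "M": ["VDD", "VSS", "IN", "OUT"]}
ok = seeds_consistent(layout, schematic,
                      [("VDD", "VDD"), ("VSS", "VSS"),
                       ("IN", "IN"), ("OUT", "OUT")])
```

A signature mismatch at a seed is reported immediately; matches at seeds are then propagated outward to their neighbors, which is where the avalanche of mismatch errors mentioned above tends to start.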
The second problem with an LVS check is creating a true reference. The starting
point may be HDL code or a schematic. However, logic synthesis, test insertion,
clock-tree synthesis, logical-to-physical pad mapping, and several other design
steps each modify the netlist. The reference netlist may not be what we wish to
fabricate. In this case designers increasingly resort to formal verification that
extracts a Boolean description of the function of the layout and compares it to a
known-good HDL description.

17.4.3 Mask Preparation


Final preparation for the ASIC artwork includes the addition of a maskwork
symbol (M inside a circle), copyright symbol (C inside a circle), and company
logos on each mask layer. A bonding editor creates a bonding diagram that will
show the connection of pads to the lead carrier as well as checking that there are
no design-rule violations (bond wires that are too close to each other or that leave
the chip at extreme angles). We also add the kerf (which contains alignment
marks, mask identification, and other artifacts required in fabrication), the scribe
lines (the area where the die will be separated from each other by a diamond
saw), and any special hermetic edge-seal structures (usually metal).
The final output of the design process is normally a magnetic tape written in
Caltech Intermediate Format ( CIF , a public domain text format) or GDSII
Stream (formerly also called Calma Stream, now Cadence Stream), which is a
proprietary binary format. The tape is processed by the ASIC vendor or foundry
(the fab ) before being transferred to the mask shop .
If the layout contains drawn n -diffusion and p -diffusion regions, then the fab
generates the active (thin-oxide), p -type implant, and n -type implant layers. The
fab then runs another polygon-level DRC to check polygon spacing and overlap
for all mask levels. A grace value (typically 0.01 µm) is included to prevent
false errors stemming from rounding problems and so on. The fab will then adjust
the mask dimensions for fabrication by bloating (expanding), shrinking, or
merging shapes in a procedure called sizing or mask tooling . The exact
procedures are described in a tooling specification . A mask bias is an amount
added to a drawn polygon to allow for a difference between the mask size and the
feature as it will eventually appear in silicon. The most common adjustment is to
the active mask to allow for the bird's beak effect , which causes an active area to
be several tenths of a micron smaller on silicon than on the mask.
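For a rectangle, sizing reduces to moving every edge outward (bloat) or inward (shrink) by the bias. A sketch with illustrative numbers (real tooling operates on arbitrary polygons and merges overlapping results):

```python
def size_rect(x1, y1, x2, y2, bias_um):
    """Bloat (bias > 0) or shrink (bias < 0) a rectangle given by its
    lower-left (x1, y1) and upper-right (x2, y2) corners, moving every
    edge by |bias_um|."""
    return (x1 - bias_um, y1 - bias_um, x2 + bias_um, y2 + bias_um)

# Bloat a 1.0 um x 0.5 um drawn active rectangle by a 0.1 um mask bias
# (illustrative numbers) to pre-compensate for the shrinkage on silicon
# caused by effects such as the bird's beak.
grown = size_rect(0.0, 0.0, 1.0, 0.5, 0.1)
```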
The mask shop will use e-beam mask equipment to generate metal (usually
chromium) on glass masks or reticles . The e-beam spot size determines the
resolution of the mask-making equipment and is usually 0.05 µm or 0.025 µm
(the smaller the spot size, the more expensive the mask). The spot size is
significant when we break the integer-lambda scaling rules in a deep-submicron
process. For example, for a 0.35 µm process ( λ = 0.175 µm), a 3 λ separation
is 0.525 µm, which requires more expensive mask-making equipment with a
0.025 µm spot size. For critical layers (usually the polysilicon mask) the mask
shop may use optical proximity correction ( OPC ), which adjusts the position of
the mask edges to allow for light diffraction and reflection (the deep-UV light
used for printing mask images on the wafer has a wavelength comparable to the
minimum feature sizes).
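The spot-size argument is a divisibility check: every mask dimension must be an exact multiple of the e-beam grid. Working in integer nanometers sidesteps floating-point rounding (a sketch, using the 0.525 µm separation from the text):

```python
def fits_grid(dimension_nm, spot_nm):
    """True if a mask dimension (integer nanometers) lands exactly
    on the e-beam grid defined by the spot size."""
    return dimension_nm % spot_nm == 0

# 0.525 um = 525 nm:
coarse = fits_grid(525, 50)  # 0.05 um spot: 525/50 is not an integer
fine = fits_grid(525, 25)    # 0.025 um spot: 525/25 = 21, so it fits
```

Only the finer (and more expensive) 0.025 µm spot size can render this dimension exactly, which is the cost of breaking integer-lambda scaling.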

1. Fringing capacitances are per isolated line. Closely spaced lines will have
reduced fringing capacitance and increased interline capacitance, with increased
total capacitance.
2. NA = not applicable.
