Module 2
FLOOR PLANNING,
PLACEMENT & ROUTING:
    Introduction:
• The input to the floorplanning step is output of
  system partitioning and design entry—a netlist.
• Netlist-is a hierarchical description of circuit blocks ,
  the logic cells within the blocks , and their
  connections.
• Interconnect and gate delays decrease as feature
  sizes decrease .
• Both average interconnect delay and average gate
  delay decrease—but at different rates . This is
  because interconnect capacitance tends to a limit(
  that is independent of scaling).
• Interconnect delay now dominates gate delay.
• Floorplanning- is used To predict interconnect delay
  by estimating interconnect length.
 The starting point of
 floorplaning and placement
 steps for the viterbi decoder
• Small boxes that look like bricks-outlines of
  the standard cells.
• Largest standard cells , at the bottom of the
  display (label ddfctnb)-188 D flip flops.
• '+'symbols-drawing origins of the standard
  cells—for the D-flip-flops they are shifted to
  the left and below the logic cell bottom left-
  hand corner.
• Large box surrounding all the logic cells-
  estimated chip size.(This is a screenshot from
  Cadence Cell Ensemble.)
     The viterbi decoder after
     floorplanning and
     placement
• 8 rows of standard cells
  separated by 17 horizontal
  channels (labeled2–18).
• Channels are routed as
  numbered.
• In this example , the I/O
  pads are omitted to show
  the cell placement more
  clearly.
Floorplanning:
• The input to a floorplanning tool is a hierarchical netlist that
  describes:
   • –the interconnection of the blocks (RAM, ROM, ALU, cache controller, and so
     on)
   • –the logic cells (NAND, NOR, D flip-flop, and so on) within the blocks
   • –the logic cell connectors (terminals , pins , or ports)
• If netlist is a logical description of the ASIC;
• The floorplan is a physical description of an ASIC.
• Floorplanning is a mapping between the logical description and the
  physical description .
Floorplanning Goals and Objectives:
• The Goals of Floorplanning are to:
   •   Arrange the blocks on a chip,
   •   Decide the location of the I/O pads,
   •   Decide the location and number of the power pads,
   •   Decide the type of power distribution, and
   •   Decide the location and type of clock distribution.
• Objectives of Floorplanning–
   • -To minimize the chip area
   • -To minimize delay.
• Measuring area is straightforward, but measuring delay is more
  difficult.
Measurement of Delay in
Floor planning:
A floorplanning tool can use predicted-capacitance
tables (also known as interconnect-load tables or
wire-load tables).
•Typically between 60 and 70 percent of nets have
a FO=1.
•The distribution for a FO=1 has a very long tail ,
stretching to interconnects that run from corner to
corner of the chip.
•The distribution for a FO=1 has two peaks ,
corresponding to a distribution for close neighbors
in subgroups within a block , superimposed on a
distribution corresponding to routing between
subgroups.We often see a twin-peaked distribution      FIGURE:Predicted capacitance. (a)Interconnect lengths as a function of
at the chip level also,corresponding to separate       fanout (FO) and circuit-block size. (b)Wire-load table. There is only one
distributions for interblock routing (inside blocks)   capacitance value for each fanout (typically the average value). (c)The
and intrablock routing (between blocks) .              wire-load table predicts the capacitance and delay of a net (with a
•The distributions for FO>1 are more symmetrical       considerable error). Net A and net B both have a fanout of 1, both have
and flatter than for FO=1.                             the same predicted net delay, but net B in fact has a much greater delay
                                                       than net A in the actual layout (of course we shall not know what the
                                                       actual layout is until much later in the design process).
• The wire-load tables can only contain one number , for example the average net capacitance , for
  any one distribution.
• Many tools take a worst-case approach and use the 80-or 90-percentile point instead of the
  average (Thus a tool may use a predicted capacitance for which we know 90 percent of the nets
  will have less than the estimated capacitance.)
• Repeat the statistical analysis for blocks with different sizes.
• For example , a net with a FO=1 in a 25k-gate block will have a different (larger) average length
  than if the net were in a 5k-gate block.
• The statistics depend on the shape (aspect ratio) of the block (usually the statistics are only
  calculated for square blocks).
• The statistics will also depend on the type of netlist.
• For example , the distributions will be different for a netlist generated by setting a constraint for
  minimum logic delay during synthesis—which tends to generate large numbers of two-input
  NAND gates—than for netlists generated using minimum-area constraints.
Floorplanning Tools:
• Floorplanning in CBIC:
• •Flexible blocks (or variable blocks ):
          –Their total area is fixed,
          –Their shape (aspect ratio) and connector locations may be adjusted during the placement.
• •Fixed blocks:
     –The dimensions and connector locations of the fixed blocks (perhaps RAM, ROM, compiled cells, or
     megacells) can only be modified when they are created.
• •Seeding:
• -Force logic cells to be in selected flexible blocks by seeding .
• -We choose seed cells by name.
• -Seeding may be hard or soft.
• Hard seed-fixed and not allowed to move during the remaining floor
  planning and placement steps.
• Soft seed-an initial suggestion only and can be altered if necessary by
  the floor planner.
• Seed connectors : within flexible blocks—forcing certain nets to
  appear in a specified order, or location at the boundary of a flexible
  block.
• Rat’s nest:-display the connection between the blocks
   -Connections are shown as bundles between the centers of blocks or as flight
   lines between connectors.
Floor planning a cell-based ASIC.
(a)Initial floor plan generated by the
floor planning tool. Two of the blocks
are flexible (A and C) and contain rows
of standard cells (unplaced) .A pop-up
window shows the status of block A.
(b)An estimated placement for flexible
blocks A and C . The connector
positions are known and a rat’s nest
display shows the heavy congestion
below block B.
(c)Moving blocks to improve the floor
plan.
(d)The updated display shows the
reduced congestion after the changes.
(a)The initial floor plan with a 2:1.5 die aspect ratio.
(b) Altering the floor plan to give a 1:1 chip aspect
ratio.
(c) A trial floor plan with a congestion map . Blocks A
and C have been placed so that we know the
terminal positions in the channels . Shading indicates
the ratio of channel density to the channel capacity .
Dark areas show regions that cannot be routed
because the channel congestion exceeds the
estimated capacity.
(d) Resizing flexible blocks A and C alleviates
congestion . Congestion analysis-One measure of
congestion is the difference between the number of
interconnects that we actually need,called the
channel density, and the channel capacity
During the floor planning step , assign the areas between blocks that are to be used for interconnect.
•Route the stem of a T-junction before route the top.
FIGURE : Pad-limited and core-limited die.
(a)A pad-limited die . Tall, thin pad-limited pads, which maximize the number of pads we can fit around the
outside of the chip. The number of pads determines the die size.
(b)A core-limited die: short, wide core-limited pads . The core logic determines the die size.
(c)Using both pad-limited pads and core-limited pads for a square die.
Power Pads and I/O Pads:
•Special power pads are used for:
   • 1. positive supply, or VDD, power buses (or power rails ) and
   • 2. ground or negative supply, VSS or GND.
• One set of VDD/VSS pads supplies power to the I/O pads only.
• Another set of VDD/VSS pads connects to a second power ring that
  supplies the logic core.
• The I/O power is a dirty power since it has to supply large transient
  currents to the output transistors. We keep dirty power separate to avoid
  injecting noise into the internal-logic power(the clean power).
• I/O pads also contain special circuits to protect against electrostatic
  discharge ( ESD ). These circuits can withstand very short high-voltage
  (several kilovolt) pulses that can be generated during human or machine
  handling.
• A double bond connects two pads to one chip-carrier finger and one package pin. We can
  do this to save package pins or reduce the series inductance of bond wires (typically a
  few Nano-Henries) by parallel connection of the pads.
• •Multiple VDD and VSS pads:
• •To reduce the series resistive and inductive impedance of power supply networks, it is
  normal to use multiple VDD and VSS pads.
• •This is particularly important with the simultaneously switching outputs ( SSOs ) that
  occur when driving buses .
• •The output pads can easily consume most of the power on a CMOS ASIC, because the
  load on a pad (usually tens of Pico-Farads) is much larger than typical on-chip capacitive
  loads.
• •Depending on the technology it may be necessary to provide dedicated VDD and VSS
  pads for every few SSOs . Design rules set how many SSOs can be used per VDD/VSS pad
  pair. These dedicated VDD/VSS pads must “follow” groups of output pads as they are
  seeded or planned on the floor plan.
• Using a pad mapping we translate the logical pad in a net list to a physical pad from a pad library . We might
  control pad seeding and mapping in the floor planner.
• •Handling of I/O pads can become quite complex; there are several nonobvious factors that must be
  considered when generating a pad ring:
• •Design library pad cells for one orientation.
• –For example, an edge pad for the south side of the chip, and a corner
• pad for the southeast corner.
• •We could then generate other orientations by rotation and flipping (mirroring). Some ASIC vendors will not
  allow rotation or mirroring of logic cells in the mask file.
• •To avoid these problems we may need to have separate horizontal, vertical, left-handed, and right-handed
  pad cells in the library with appropriate logical to physical pad mappings.
• •If we mix pad-limited and core-limited edge pads in the same pad ring, this complicates the design of
  corner pads. Usually the two types of edge pad cannot abut. In this case a corner pad also becomes a pad-
  format changer , or hybrid corner pad .
• •In single-supply chips we have one VDD net and one VSS net, both global power nets . It is also possible to
  use mixed power supplies (for example, 3.3 V and 5 V) or multiple power supplies ( digital VDD, analog VDD).
Other I/O bond:
• •stagger-bond arrangement using two rows of I/O pads. In this case the design
  rules for bond wires (the spacing and the angle at which the bond wires leave the
  pads) become very important.
• •Area-bump bonding arrangement (also known as flip-chip, solder-bump) used,
  for example, with ball-grid array ( BGA )packages.Even though the bonding pads
  are located in the center of the chip, the I/O circuits are still often located at the
  edges of the chip because of difficulties in power supply distribution and
  integrating I/O circuits together with logic in the center of the die.
• •I/O pads in MGA and CBIC : In an MGA the pad spacing and I/O-cell spacing is
  fixed—each pad occupies a fixed pad slot (or pad site ). This means that the
  properties of the pad I/O are also fixed but, if we need to, we can parallel
  adjacent output cells to increase the drive. To increase flexibility further the I/O
  cells can use a separation, the I/O-cell pitch , that is smaller than the pad pitch .
FIGURE : Bonding pads. (a)This chip uses both pad-limited and core-limited pads. (b)A hybrid corner pad. (c)A chip
with stagger-bonded pads. (d)An area-bump bonded chip (or flip-chip). The chip is turned upside down and solder
bumps connect the pads to the lead frame
FIGURE : Gate-array I/O pads. (a)Cell-based ASICs may contain pad cells of different sizes and widths. (b)A
corner of a gate-array base. (c)A gate-array base with different I/O cell and pad pitches
FIGURE : Power distribution. (a)Power distributed using m1 for VSS and m2 for VDD. This helps minimize the number of vias
and layer crossings needed but causes problems in the routing channels. (b)In this floor plan m1 is run parallel to the
longest side of all channels, the channel spine. This can make automatic routing easier but may increase the number of vias
and layer crossings. (c)An expanded view of part of a channel (interconnect is shown as lines). If power runs on different
layers along the spine of a channel, this forces signals to change layers. (d)A close up of VDD and VSS buses as they cross.
Changing layers requires a large number of via contacts to reduce resistance. The long direction of a rectangular channel is
the channel spine . •Some automatic routers may require that metal lines parallel to a channel spine use a preferred
layer(either m1, m2, or m3). Alternatively we say that a particular metal layer runs in a preferred direction
      Clock planning:
Clock spine : routing scheme with all clock pins driven
directly from the clock driver . MGAs and FPGAs often use
this fish bone type of clock distribution scheme.
Suppose we have an ASIC with the following
specifications:
•40,000 flip-flops.
•Input capacitance of the clock input to each flip-flops is
0.025pF
•Clock frequency is 200MHz
•VDD=3.3V
•Chip size is 20mm on a side                                  FIGURE : Clock distribution. (a) A clock spine for a gate array.
•Clock spine consists of 200 lines across the chip             (b) A clock spine for a cell-based ASIC (typical chips have
•Interconnect capacitance is 2pFcm-1                          thousands of clock nets). (c) A clock spine is usually driven from
                                                              one or more clock-driver cells . Delay in the driver cell is a
                                                              function of the number of stages and the ratio of output to
                                                              input capacitance for each stage (taper).(d) Clock latency and
                                                              clock skew . We would like to minimize both latency and skew.
• The clock spine capacitance is CL =200*2cm*2pFcm-1 =800pF
• If we drive the clock spine with the chain of buffers with taper equal to e
  ≈2.7 and with a first stage input capacitance of 0.025pF
• We will need log (800×10−12 / 0.025×10−12).
• =10.4≈11 𝑠𝑡𝑎𝑔𝑒𝑠.
• The power dissipated charging the input capacitance of the flip flop clock is
  = fCV2 or P1 = (4×104)(200MHz)(0.025pF)(3.3V)2 =2.178 W,≈2𝑊.
• Which is only slightly larger than the power dissipated driving the 800 pF
  clock spine interconnect that we can calculate as follows:
• P2 =(200×200𝑀𝐻𝑧×20𝑚𝑚×2𝑝𝐹cm-1 )(3.3V)2 = 1.7424W.
• All of these power is dissipated in the clock driver cell.
• •The worst problem however is the dissipation of the enormous peak
  current through final stage inverter
• •Now if we assume that the needed raise time is 0.1ns the peak current
  has to reach
• •I = 800𝑝𝐹×3.3𝑉0.1𝑛𝑠 = 25A, Which is too large .
• •So an expensive design has to be in place to handle such large current or
  we have to find a way to spread the power dissipation more evenly across
  the chip.
• •We can design a tree of clock buffers so that the taper of each stage is
  e≈2.7 by using a tree at each node as shown in Fig below.
 Figure: clock tree. (a) Minimum delay is achieved when the taper of successive stages is about 3. (b) Using a
fanout of three at successive nodes. (c) A clock tree for a cell-based ASIC We have to balance the clock arrival
times at all of the leaf nodes to minimize clock skew.
 Designing the clock tree that balances the rise and fall times at the leaf nodes has the beneficial side effect of minimizing
 the Hot-electron Wearout.
 •This problem occurs when an electron gains enough energy to become “hot” and jump out of channel into gate
 oxide.(the problem is worse for electron in n-Channel devices as they are more mobile than holes).
 •As these electrons accumulate in gate oxide they start influencing the threshold voltage of transistors which in-turn
 alters the delay of the buffer.
 •As the buffer delay change with time this introduces unpredictable skew.
 •The problem is worse for n-channel device carrying maximum current with a high voltage across the channel- during
 rise and fall time transitions.
 •Balancing the rise and fall time in each buffer means that they all wear out at the same rate minimizing any additional
 skew.
Phase Locked Loop:
• •Its an electronic flywheel that locks in frequency to an input clock
  signal.
• •the input and output signal frequency may differ in phase.
• •We can drive a clock network with a PLL in such a way that the
  output of the clock network is locked in-phase to the incoming clock,
  thus eliminating the latency of the clock network
• •It also reduces the random variation of the input clock frequency
  known as Jitter which is unpredictable.
• •Actel was one of the first company to introduce the PLL on chip of
  FPGA.
Placement:
Goal of a placement–
•To arrange all the logic cells within the flexible blocks on a chip.
•Ideally, the objectives of the placement step are to
          •-Guarantee the router can complete the routing step
          •-Minimize all the critical net delays
          •-Make the chip as dense as possible
Additional objectives:
          •Minimize power dissipation
          •Minimize cross talk between signals
The most commonly used placement objectives (by current Placement tools):
          •Minimize the total estimated interconnect length
          •Minimize the interconnect congestion
          •Meet the timing requirements for critical nets
    Placement Algorithms:
There are two classes of placement algorithms used in CAD tools:
• Constructive placement: uses a set of rules to arrive at a constructed placement . Example :min-cut
  algorithm. Eigenvalue method.
• Iterative placement improvement -As in system partitioning, placement usually starts with a constructed
  solution and then improves it using an iterative algorithm.
• In most tools we can specify the location and relative placement of certain critical logic cells as seed
  placements
Min-cut placement method:-uses successive application of partitioning.
The steps are:
          –Cut the placement area into two pieces.
          –Swap the logic cells to minimize the cut cost.
          –Repeat the process from step 1, cutting smaller pieces until all the logic cells are placed
• Usually we divide the placement area into bins .
• The size of a bin can vary, from a bin size equal to the base cell (for a gate array) to a bin size that would hold
  several logic cells.
• We can start with a large bin size, to get a rough placement, and then reduce the size to get a final
  placement.
• Fig:(a)Divide the chip into bins using a grid. (b)Merge all connections
  to the center of each bin. (c)Make a cut and swap logic cells between
  bins to minimize the cost of the cut. •(d)Take the cut pieces and
  throw out all the edges that are not inside the piece. •(e)Repeat the
  process with a new cut and continue until we reach the individual
  bins.
• The pair wise-interchange algorithm
       –Select the source logic cell at random.
       –Try all the other logic cells in turn as the destination logic cell.
       –Use any of the measurement methods we have discussed to decide
        on whether to accept the interchange.
       –The process repeats from step 1, selecting each logic cell in turn as a
        source logic cell.
• Neighborhood exchange algorithm
   • Modification to pairwise interchange that considers only destination logic cells in a
     neighborhood —cells within a certain distance, 𝜀 , of the source logic cell.
   • Limiting the search area for the destination logic cell to the 𝜀-neighborhood reduces
     the search time.
FIGURE: Force-directed iterative placement improvement.
•(a)Force-directed interchange.
•(b)Force-directed relaxation.
•(c)Force-directed pairwise relaxation.
• Timing-Driven Placement Methods Minimizing delay is becoming
  more and more important as a placement objective.
   •There are two main approaches:
   –net based
   –path based.
•We know that we can use net weights in our algorithms.
•The net weights might then be the number of times each net appears
  in this list.
•The problem is to calculate the weights.
•One method finds the n most critical paths.
• Find the net weights uses the zero-slack algorithm [Haugeetal.,1987] Figure
  shows how this works (all times are in nano seconds).
• Primary inputs at which the arrival times (this is the original definition ,
  some people use the term actual times ) of each signal are known.
• Required times for the primary outputs—the points in time at which the
  signals to be valid.
• Work forward from the primary inputs and backward from the primary
  outputs to determine arrival and required times at each input pin for each
  net.
• The difference between the required and arrival times at each input pin is
  the slack time (the time we have to spare).
Timing-Driven Placement Methods:
• •The zero –slack algorithm-adds delay to each net until the slacks are
  zero , as shown in Figure
• •The net delays can then be converted to weights or constraints in
  the placement.
• •Assumed that all the gates on a net switch at the same time so that
  the net delay can be placed at the output of the gate driving the
  net—a rather poor timing model but the best we can use without any
  routing information.
Timing-Driven Placement Methods
• •Advantage
• •An important point to remember is that adjusting the net weight , even
  for every net on a chip , does not theoretically make the placement
  algorithms any more complex—we have to deal with the numbers anyway .
  It does not matter whether the net weight is 1 or 6.6 , for example.
• •Disadvantage
• •The practical problem , however ,is getting the weight information for
  each net ( usually in the form of timing constraints ) from a synthesis tool
  or timing verifier.
• •These files can easily be hundreds of mega bytes in size.
•With the zero –slack algorithm we simplify but over constrain the problem.
Solution
• •Deal with paths such as the critical path shown in Figure (a) and not just nets.
• •Disadvantage of Path-based method
• •They are complex and not all commercial tools have this capability
• •Path delays between gates cannot be predicted with only placement
  information.
• •Because we are using simple approximations to the total net length (such as the
  half – perimeter measure )and then use this to estimate an et delay (the same to
  each pin on a net).
• •It is not until the routing step that we can make accurate estimates of the actual
  interconnect delays.
Physical design Flow:
Routing:
• FIGURE:The core of the
  Viterbi decoder chip after
  placement (a screen shot
  from Cadence Cell
  Ensemble)
• The core of the Viterbi
  decoder chip after
  placement.
• You can see the rows of
  standard cells; the widest
  cells are the D flip-flops.
• FIGURE: The core of the Viterbi
  decoder chip after the
  completion of global and
  detailed routing (a screen shot
  from Cadence Cell Ensemble).
• The core of the Viterbi decoder
  chip after the completion of
  global and detailed routing.
• This chip uses two-level metal.
  Although you cannot see the
  difference m1 runs in the
  horizontal direction and m2 in
  the vertical direction.
     Global Routing:
• The details of global routing       Goals and Objectives
  differs lightly between cell-       •Input to routing:
  based ASICs, gate arrays, and       1)Floor-plan that includes the locations of all
  FPGAs, but the principles are the   the fixed and flexible blocks;
  same.                               2)Placement information for flexible blocks;
• A global router does not make       3)Locations of all the logic cells.
  any connections, it just plans      •Goal of global routing
  them.                               –To provide complete instructions to the
• Global router route the whole       detailed router on
  chip (or large pieces if it is a       where to route every net.
  large chip) before detail routing   •Objectives of global routing:
  the whole chip (or the pieces).     –Minimize the total interconnect length.
                                      –Maximize the probability that the detailed
                                        router can complete the routing.
                                      –Minimize the critical path delay.
   Global Routing Methods:
• Many of the methods used in global routing are based on the solutions to
  the tree on a graph problem.
•sequential routing :
• One approach to global routing takes each net in turn and calculates the
  shortest path using tree on graph algorithms—with the added restriction of
  using the available channels.
•Disadvantage:
• •As a sequential routing algorithm proceeds , some channels will become
  more congested since they hold more interconnects than others.
• •In the case of FPGAs and channeled gate arrays , the channels have a fixed
  channel capacity and can only hold a certain number of interconnects.
• There are two types of areas to global route:
• –between blocks
• –inside the flexible blocks
• •There are two different ways that a global router normally handles this problem.
    • 1.Order-independen tRouting
    • 2.Order-dependent Routing
• •Order-independent routing , a global router proceeds by routing each net , ignoring
  how crowded the channels are . Whether a particular net is processed first or last does
  not matter , the channel assignment will be the same , after all the interconnects are
  assigned to channels , the global router returns to those channels that are the most
  crowded and reassigns some interconnects to other , less crowded , channels.
• •order dependent : A global router can consider the number of interconnects already
  placed in various channels as it proceeds. In this case the global routing is order
  dependent—the routing is still sequential , but now the order of processing the nets will
  affect the results.
• •Iterative improvement or simulated annealing may be applied to the solutions found
  from both order-dependent and order-independent algorithms to improve the routing
  taking net at a time
• Hierarchical routing : handles all nets at a particular level at once.
• •Rather than handling all of the nets on the chip at the same time, the
  global-routing problem is made more tractable by dividing the chip area
  into levels of hierarchy.
• •By considering only one level of hierarchy at a time the size of the
  problem is reduced at each level.
• •There are two ways to traverse the levels of hierarchy.
• •top-down approach:-Starting at the whole chip, or
• highest level, and proceeding down to the logic cells .
• •The bottom-up approach starts at the lowest level of
• hierarchy and globally routes the smallest areas first.