STA Basic Concepts
(Manuscript)
Introduction
[p1] Title
Hi there, welcome to the ASIC Boot Camp! My name is Neil Jiang. As a group of engineers who have been working in the semiconductor industry for years, my team and I are here to show you what knowledge is truly important to know and which techniques are actually used in our day-to-day design work.
In particular, Static Timing Analysis is the focus of this course. STA is the foundation and the first priority of all backend design work. If you go job hunting for ASIC backend design, you will find that the STA engineer is absolutely the most visible role on a team. As a matter of fact, timing closure consumes more than 90% of the backend design cycle. Of the three major QoR metrics, we always say: “Area is important, Power is also important, but Timing is the King”.
During this basic-level course, you will learn all the common STA concepts systematically and comprehensively. Many of these concepts are not taught in school or anywhere else, even if you are a VLSI major. However, when you take your first job, the company often assumes you know these “basics”. That’s why we are here: to make sure you can understand most of the topics people talk about every day.
With this course, you can get a jump start on your career and quickly gain experience equivalent to what you would normally learn from 2 years of actual work. This is because we have sorted out all kinds of misleading and confusing concepts to make this knowledge system ready for you.
You will gain a good understanding and in-depth knowledge of STA, which allows you to get involved in real design problems and to debug constraint issues and fix timing violations more efficiently.
The target audience of this basic-level course is mainly new graduates and active job seekers. Entry- and intermediate-level engineers who want to consolidate their existing STA knowledge are also warmly welcome.
First, I will explain the most often used concepts in STA and help you establish the knowledge
system from scratch.
We will talk about background topics such as process variations, operating conditions, delay calculation of a timing arc, and how crosstalk & noise are calculated in an STA engine.
Once you understand these concepts, you will have already mastered the biggest chunk of the basics needed for the entire backend design.
Secondly, I will show you the most common SDC commands you should know, because constraint development is one of the most important aspects of timing a design. You can get valid results only after the design is correctly constrained. This helps you with constraint debugging later on. The topics will cover clock creation, tracing, debugging unconstrained endpoints, case analysis and multicycle paths.
Thirdly, a basic ability for any STA engineer is to read and understand an STA timing report really well. This means you will know the exact meaning of each part of the timing report, the meaning of each symbol, and the relationship between the numbers.
Lastly, as a cheat sheet to help you prepare for the most classic interview questions, I will show you some of the simplest and most effective timing closure approaches.
[p3] Agenda
Here is the agenda:
In this chapter we will discuss the definition of STA, why we need STA, and the limitations of STA. As the background of all analysis, I will introduce the concept of PVT corners and the different STA analysis modes used in practice.
STA issues are basically delay matching problems. In this chapter, I will show you how the backend tools calculate delays in a modern design for both standard cells and interconnects. We will learn the concepts of timing graphs, timing arcs, unateness, cell delay models, wire delay models and the RC tree delay calculation method.
In the end, the concepts of graph-based analysis and path-based analysis will also be introduced, which are used quite often in the real design world.
Next come the most famous checks: setup and hold. However, timing checks are not limited to setup/hold checks.
We will first talk about the detailed mechanism of why we need setup/hold checks. Then, I will show you the concept of the multicycle path, the multicycle version of the normal setup/hold check, which you are bound to see in real-life designs. Besides, I will extend the concept of setup/hold checks to another pair of essential checks in STA: the recovery check and the removal check.
In this chapter, we will talk about timing checks that cover more cases than the basic setup/hold checks. You will learn how max/min timing checks are performed when the datapath crosses multiple clock domains.
You will learn basic methods to deal with Clock Domain Crossing issues.
You will learn the concepts of latch timing, also known as the time borrowing technique.
Besides that, we will discuss the following high-frequency checks other than the setup/hold check, which are:
We will talk about two types of crosstalk effects, glitch and delta delay, and the techniques to keep noise problems from affecting your design.
This chapter is important because we can see how people reduce the pessimism inherent in their methodologies.
This helps you connect the real world with theoretical analysis.
I will first introduce the concepts of Statistical Static Timing Analysis (SSTA), and then the practical methods of modeling the statistical nature of the process in STA analysis.
Lastly, I will introduce other commonly adopted design methodologies: Multi-Corner Multi-Mode and also Merge Mode analysis.
We also go over 25 special topics, each right after a certain knowledge point, as complementary material to extend the concepts. These topics are designed to be easily understood and very practical.
For some difficult concepts, we introduce a set of rules to help you get to the point quickly. This aims to clarify common conceptual confusions. You can test and rely on these rules in your design work.
So basically, every timing problem is essentially checking a path delay against some threshold, or limit, to make sure the chip can operate at a certain speed. So here is the question: given a circuit consisting of logic gates and the wires between them, how do you calculate its delay? A traditional way is to flatten the design to the transistor level and run a SPICE simulation on it.
There is no question that SPICE simulation is the most accurate way to find out how much the delay is. However, with millions or even billions of transistors coexisting on the same ASIC chip nowadays, SPICE simulation is simply too slow to compute such a large amount of data. Or we can say the time cost of such a task is unacceptable for production.
Therefore, we truly need another way to find reasonably accurate delay results but with a much, much faster turnaround time. That is where STA comes into play. Let’s see how STA addresses these two issues.
The first task for STA is to maintain accuracy. We all know that a circuit consists of cells and wires. For cells, as long as STA can get accurate delays, it can maintain accuracy. This can be done by characterizing the cell delays with SPICE into a library ahead of the actual STA analysis. Once the library data is ready, STA can load in the cell delay values and get accurate results. For wire delay, STA can either estimate the delay before the actual layout is done, or extract parasitics and then calculate the delay after the layout is done.
The second task for STA is to be fast. To achieve this, we have to reduce the number of nodes involved in a timing problem. Instead of analyzing at the transistor level, STA first converts the circuit into a timing graph in which each gate is the lowest-level delay element. Also, the way to judge whether the circuit can operate successfully is not simulation-based but purely mathematical. This dramatically reduces the amount of time needed to handle a large design.
For a given timing path between a startpoint and an endpoint, the logic is never going to change once the chip is fabricated. So it is the same bunch of cells being exercised again and again every clock cycle. What STA does is find this pattern and collapse the entire timeline into one clock cycle, and just analyze that one cycle. STA does this for all the timing paths on the chip at the same time. It then uses a purely mathematical way to check path delays against timing requirements. This is why we call it Static.
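To make this concrete, here is a minimal sketch (illustrative numbers only, not any vendor's algorithm) of the purely mathematical check STA performs on a single-cycle setup path:

```python
# Toy sketch of the core STA setup check. All numbers are hypothetical,
# in nanoseconds; real tools handle many more effects (OCV, CRPR, etc.).

def setup_slack(launch_clock_arrival, data_path_delay,
                capture_clock_arrival, clock_period, setup_time):
    """Slack = required time - arrival time for a single-cycle setup check."""
    arrival = launch_clock_arrival + data_path_delay
    required = capture_clock_arrival + clock_period - setup_time
    return required - arrival

# A path meets setup timing when its slack is non-negative.
slack = setup_slack(0.2, 1.5, 0.2, 2.0, 0.1)
print(f"slack = {slack:.2f} ns, timing {'met' if slack >= 0 else 'VIOLATED'}")
```

No simulation and no input vectors are involved: the check is one subtraction per path, which is exactly why it scales to full-chip designs.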
The second feature of STA is that, unlike SPICE simulation, it does not depend on input stimulus. Being a mathematical check, it traverses all possible logic connections between any startpoint and endpoint, which in the end literally covers all possible scenarios that may occur in a design. This feature sounds good, but in some cases it impacts timing closure negatively. For example, STA will flag paths that are never used in the real application, which could potentially cause over-design if people fail to filter those paths out of the analysis. This is what we call “exhaustive”.
The third feature of STA is that it does NOT care about functionality. Yes, you heard that right: how can such a powerful tool not care about functionality? What if the design meets timing but has a functional failure? Well, STA is only responsible for making sure the path delay is within the target so that the value from the launch point can be captured without error at the capture side. It is not responsible for whether the logic is implemented correctly. So in order to really be ready to ship your design to the customer, you have to run another type of check called functional equivalence checking, where we assume the design meets timing unconditionally and then use logic operation algorithms to check whether its functionality has been correctly implemented.
The last feature I will mention about STA is that it can be very conservative and pessimistic... How can a tool be pessimistic? Well, STA assumes everything is at its worst when analyzing the chip. For example, it uses the slowest delay a cell can have for the max timing check, and it checks the most restrictive pair of launch and capture clock edges. It has to do so because we need a design that is robust under all conditions, taking manufacturing variation and application needs into account. So it has to have a lot of design margin, or redundancy, built into the design.
[p8] Rule of STA
As I said earlier, timing is the king of the backend design. 90% of the design cycle is to achieve
timing closure. This can also be explained by looking at the role of STA in the backend design.
If a certain function needs too many logic levels on the critical path, it is not possible to finish the operation within one clock cycle at a given technology node. The RTL coding must be tweaked for easier timing closure so that the functionality can be physically implemented.
Logic synthesis needs to optimize to reduce the logic levels and choose structures fast enough to operate at the target clock frequency;
Place and Route must do a lot of work to make sure the design can meet timing while being routable and manufacturable;
Even power integrity issues like IR drop need to take the clock frequency, timing windows and simultaneous switching into consideration.
This is why the STA engineer usually holds a core position in a backend team.
The most obvious thing is that since STA is our “safeguard” in this ASIC design game, it has to be very conservative. It won’t allow you to take any chance of failing timing and causing a functional failure. That’s why it considers so many delay variations and analysis modes, only to make you meet timing even in the most restrictive conditions. Some of these restrictive conditions can be purely mathematical combinations or permutations which very rarely exist in reality, so if we rely on STA, we have to spend effort closing timing on unrealistic cases.
Another point, as mentioned earlier in this chapter, is that STA may flag paths that are not functionally meaningful to an application. We call such timing paths “false paths”. Usually the architecture folks or the RTL team will be the ones to tell whether a path is truly meaningful in a design. So this is one of the items that should be communicated between the RTL team and the backend team. On the backend design side, we use SDC constraints to tell the STA tool whether the paths are real or not. We will talk more about SDC in upcoming lectures.
The last point is that STA is not suitable for asynchronous design. In STA, in order to collapse the design timeline into a minimal number of cycles, the launch clock edge and capture clock edge must have a deterministic relationship so the pattern can repeat itself.
If the timing path under analysis is between two asynchronous clocks, then the required time can be some arbitrary number, and we can’t rely on a static process to analyze it. That’s why all large-scale designs are synchronous designs. However, this doesn’t mean there are absolutely no asynchronous paths in a design; under some very predictable conditions, we have special constraints to check timing for asynchronous paths as well.
[p10] Operating Condition
So let’s learn the concept of the operating condition. Any STA analysis really doesn’t make any sense without stating under what operating condition the analysis is done. The chip works in the real physical world, where the cell delay and wire delay values are largely impacted by the surroundings. A cell can be much faster on certain wafers than on others. A wire could be faster when the temperature is high. Not everything is created equal, so we have to take all of this into consideration. Thus, the operating condition is the first thing we need to be aware of.
Normally, the operating conditions are also referred to as PVT corners. PVT stands for Process, Voltage and Temperature, which are the most dominant factors that impact cell/wire delay.
Process variation can come from many sources, such as oxide thickness variation, threshold voltage variation, and channel length variation.
The impact of process variation runs throughout the entire STA analysis. It impacts the analysis from every aspect even though it does not directly factor into the mathematical equations of the timing checks.
There are two types of process variation, namely global process variation and on-chip variation.
Global variation is also called inter-die variation, which refers to process effects that impact all devices on the same die but differ from die to die.
On-chip process variation is also called local variation or intra-die variation, which refers to process variations that can affect devices differently even on the same die, so two devices of the same type placed next to each other could behave differently.
For timing verification, which normally tends to be conservative and pessimistic, we choose the nominal case as well as extreme corner cases to guard-band our design. So for the impact of global process variation, we end up having three corners: fast, typical and slow.
Around the center line of each corner, on-chip variation kicks in, but with a much smaller standard deviation.
For transistor formation, where each gate consists of PMOS and NMOS devices, we have 4 extreme combinations in total: slowest NMOS + slowest PMOS, slowest NMOS + fastest PMOS, fastest NMOS + slowest PMOS, and fastest NMOS + fastest PMOS.
For metallization, we also have process variation due to manufacturing parameters such as the thicknesses of the metal and the dielectric, and the metal etch, which affects the width and spacing of the metal traces in the various metal layers.
The extreme process corners are defined as the +/- 3 sigma corners from the nominal of the process variation distribution, which models only the global process variation.
The cell delay can be roughly described by this equation, where μ (mu) is the electron mobility and Vth is the threshold voltage.
As temperature increases, both the mobility and Vth decrease. Generally, the mobility has the greater impact on the overall cell delay, so the cell becomes slower when temperature increases. But if the supply voltage is very low in some application, there is a range where the Vth decrease outweighs the mobility decrease, and the cell can actually get slower as the temperature goes down.
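As a toy illustration of this temperature inversion effect, we can plug assumed numbers into a simplified delay relation of the form delay ∝ Vdd / (μ(T) · (Vdd − Vth(T))^α). Every coefficient below is made up purely for illustration; real device models are far more detailed:

```python
# Toy temperature-inversion model. All coefficients are hypothetical.
# delay ~ Vdd / (mu(T) * (Vdd - Vth(T))**alpha), with alpha assumed 1.5.

def cell_delay(vdd, temp_c, alpha=1.5):
    mu  = 1.0 - 0.002 * (temp_c - 25)   # mobility drops as temperature rises
    vth = 0.40 - 0.001 * (temp_c - 25)  # threshold voltage also drops
    return vdd / (mu * (vdd - vth) ** alpha)

# At a high supply voltage the mobility term dominates: hotter => slower.
print(cell_delay(1.2, 125) > cell_delay(1.2, -40))   # True
# Near-threshold the (Vdd - Vth) term dominates: colder => slower (inversion).
print(cell_delay(0.6, -40) > cell_delay(0.6, 125))   # True
```

This is why low-voltage designs often need an extra cold-temperature corner in sign-off.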
Normally, among all PVT corners, worst-case slow and best-case fast are the two that need the most attention, since they are the 3 sigma extreme corners and guard-band the whole design.
The slow process models correspond to the +3 sigma corner condition for the inter-die variations. The fast process models correspond to the -3 sigma corner condition for the inter-die variations.
Voltage variation has a very simple relation: the higher the voltage, the faster the cell.
To simplify multi-voltage analysis, the STA tool supports voltage and temperature scaling. This scaling allows us to analyze timing at a voltage condition that was not characterized.
The STA tool uses interpolation to derive the appropriate timing, noise and power values to be used for the current voltage condition.
One-dimensional scaling only requires two sets of libraries, such that only one characterization parameter (voltage rail or temperature) varies and the remaining ones are held constant across all libraries.
Two-dimensional scaling needs at least three libraries to accommodate two parameters varying at the same time. Normally, to achieve good accuracy, with 2 libraries per dimension we need 4 libraries in total to support the two-dimensional scaling.
Three-dimensional scaling is not used very often because it involves a lot of library characterization work. It allows three independent parameters to vary at the same time at the cost of many more library sample points.
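A minimal sketch of how one-dimensional voltage scaling could work, assuming simple linear interpolation between two characterized libraries (the voltages and delays below are hypothetical; real tools may use more sophisticated fitting):

```python
# One-dimensional voltage scaling sketch: two libraries characterized at
# 0.72 V and 0.88 V (hypothetical), queried at an uncharacterized 0.80 V.

def interpolate_delay(v, v_lo, d_lo, v_hi, d_hi):
    """Linearly interpolate a delay between two characterized voltage points."""
    t = (v - v_lo) / (v_hi - v_lo)
    return d_lo + t * (d_hi - d_lo)

# Delay shrinks as voltage rises; 0.80 V lands midway between the corners,
# so the interpolated delay lands midway between the two library values.
print(interpolate_delay(0.80, 0.72, 120.0, 0.88, 80.0))  # ~100.0 ps
```

Two-dimensional scaling applies the same idea along two axes (e.g. voltage and temperature), which is why it needs at least one extra library per added dimension.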
First of all is the netlist, which represents our design. STA only works on designs that are already mapped to a gate-level netlist. In the pre-layout design phase, the Verilog comes from the logic synthesis tool or the Place and Route tool before the design is detail-routed. In the post-layout design phase, the Verilog is generated by the place and route tool after the design is detail-routed.
Then we need timing constraints to bring the STA analysis to life. The constraints are in the form of SDC, which stands for the Synopsys Design Constraints format. For multi-mode multi-corner analysis, we need design constraints for each individual scenario.
Then, after the design is routed, we need to back-annotate the wire resistance, capacitance, and inductance values into the design. This is called parasitic extraction. The data is usually generated by a layout extraction tool in a distributed RLC network format called the Standard Parasitic Exchange Format, or simply SPEF.
Once we have the parasitic values, the STA engine can begin timing the design by calculating the delay through each timing arc. It will use SPICE-simulated delay values stored in the timing model library. The timing model is in the Synopsys Liberty format, often simply called dotlib.
Lastly, in case we want STA to capture on-chip variation effects, we need to provide a table with derating factors to be used in the calculation. The most often used OCV method nowadays is the AOCV table.
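As a rough sketch of the AOCV idea (the table values below are hypothetical, and real tools also interpolate and consider distance), a depth-based derate table relaxes the derating factor as the path gets deeper, because random variation averages out along a long path:

```python
# Sketch of an advanced OCV (AOCV) depth-based derate lookup.
# Table values are made up for illustration only.
import bisect

DEPTHS  = [1, 2, 4, 8, 16]                 # path depth (number of stages)
DERATES = [1.12, 1.09, 1.06, 1.04, 1.02]   # late derate factors (assumed)

def aocv_derate(depth):
    """Return the derate for a given depth (nearest lower table entry)."""
    i = min(bisect.bisect_right(DEPTHS, depth) - 1, len(DEPTHS) - 1)
    return DERATES[max(i, 0)]

def derated_delay(nominal_delay, depth):
    return nominal_delay * aocv_derate(depth)

print(derated_delay(1.0, 1))   # shallow path: full 1.12 derate applies
print(derated_delay(1.0, 10))  # deep path: derate relaxes to 1.04
```

Compared with a flat OCV derate, this removes pessimism on long paths without giving up margin on short ones.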
A slightly more complicated case is where one cell drives multiple downstream cells. In such a case, the output of a cell fans out to multiple places. Multiple fanout is definitely OK and actually pretty common, but since it increases the load on the net, it can slow down signal propagation.
However, if two different sources drive the same input pin of a cell, it is a violation. If this happens, the two sources fight each other, so the value on the net is no longer deterministic. This situation must be eliminated from the design.
So the takeaway here is: a net can have only one driver but multiple loads.
A cell can have multiple drivers and also multiple loads.
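This one-driver-per-net rule is easy to check mechanically. Here is a toy sketch over a hypothetical mini netlist:

```python
# Sketch of checking the one-driver-per-net rule on a hypothetical netlist.
# Each connection is (net, pin, direction); direction "out" marks a driver.
from collections import defaultdict

connections = [
    ("n1", "u1/Y", "out"), ("n1", "u2/A", "in"), ("n1", "u3/A", "in"),   # legal fanout
    ("n2", "u4/Y", "out"), ("n2", "u5/Y", "out"), ("n2", "u6/A", "in"),  # contention!
]

drivers = defaultdict(list)
for net, pin, direction in connections:
    if direction == "out":
        drivers[net].append(pin)

for net, pins in sorted(drivers.items()):
    if len(pins) > 1:
        print(f"VIOLATION: net {net} has multiple drivers: {pins}")
```

Net n1 with one driver and two loads passes; net n2 with two drivers is flagged.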
Well, hierarchical objects refer to logic division boundaries; such an object could be a pin/port on a virtual design wrapper. Leaf objects refer to physical cell boundaries, usually meaning a pin/port on a library cell. The leaf is the lowest hierarchy in a design.
The immediate fanout means the leaf cells driven directly by the particular cell. They are the first cells encountered when you trace the netlist upstream or downstream.
The fan-in cone and fanout cone of a cell are not limited to the first encounter, though; they refer to all the cells that can be traced backward and forward from where the cell is located.
[p21] Topic 4: Basic query commands
get_cells / get_pins / get_nets are the most basic and frequently used commands to manipulate design objects. By default, these commands only return objects in the hierarchy of the current design scope. For example, when I just do get_cells *_reg, it returns all the cells whose names end with _reg in the current hierarchy. We can tell from the sizeof_collection command that we have 18 such cells.
The option -hierarchical can be used to broaden the search range of the command all the way from the current hierarchy down to the leaf-cell level. So with the -hier option, the same command returns 20 design objects. Compared to the previous result, icache_0/miss_outstanding_reg and if_stage_0/read_for_valid_reg now appear on the list.
We can then narrow down the search results if needed by applying the -filter switch. As we did here, once we use *outstanding* as a keyword to refine the results, the command returns only icache_0/miss_outstanding_reg, since no other cell names match our criterion.
If you know the full hierarchical name of a design object, you can directly use the get_cells/get_nets command on it.
The get_pins command can be used to set scope on a particular pin, or to find all available pins on a certain cell if -of_objects is used. Here, we have found all the pins of the cell miss_outstanding_reg, including set/reset, data input pin D, clock pin CLK, scan-related pins (scan in, scan enable), and output pins Q and QN. Since the cell doesn’t have a dedicated scan-out pin, most likely the Q pin is reused as the scan-out pin during scan mode.
The get_nets command can also be used to find all available nets hooked up to a particular pin. Here we find the net connected to the clock pin of miss_outstanding_reg is named icache_0/clock.
One useful switch along with get_nets is -segments; it returns all the hierarchical names referring to the same net. I mean the same net could have different names in different design scopes, so icache_0/clock is called net181 in the wb_stage_0 hierarchy. If the -top_net_of_hierarchical_group switch is used along with -segments, the command returns the net name in the top design; in our case, it’s simply clock.
[p22] (cont’d)
Once we get a design object, sometimes we would like to know some attributes on it. Depending on what the design object is, its attributes can be different. We can use the list_attributes command to list all the available attributes on the object. -class can be used to specify what kind of object it is, and -application means only listing the attributes defined by the tool. Customized attributes can also be defined by the user.
An alternative to show all the active attributes on an object is report_attribute. For example, miss_outstanding_reg has several attributes listed by this command. We can use get_attribute to specifically query one particular attribute, such as is_hierarchical in this case. It returning false means the cell under query is not a hierarchical cell, but a leaf cell.
[p23] (cont’d)
report_cell is another commonly used command to report attributes, especially for cell objects. Personally, I like to use the -connections and -verbose switches along with it to show the connectivity information on each pin of the cell.
If we apply all_connected -leaf on the driving net n323, it returns the driver pin u5322/Y and four sinks: U610/A1, u541/A1, u360/A and u361/A. The same holds true if we apply the command on a pin object: the command returns net n323 as the only net connected to pin u5322/Y.
[p25] (cont’d)
Finding the fanout cone is also easy. Let’s look at another cell on the timing path, icache_0/u65. We can use the all_fanout -flat -from command on the output pin Y to list all the cells that make up its fanout cone. In this example, the results returned are the pin Y itself and the input pin D of miss_outstanding_reg. The -level switch controls how many layers of fanout cells it returns. A commonly used switch is -endpoints_only, which makes the command return only the endpoints among all fanout cells. A minor tweak with the -only_cells switch returns the cell objects instead of pin objects.
[p26] (cont’d)
Very similarly, the all_fanin -flat -to command can be used to trace back the fan-in cone of a particular cell or pin. It also comes with a -level switch, but what makes it different is that since the results are fan-in, we need to use -startpoints_only to find the startpoints.
Another useful procedure for debugging script issues is proc_body. It displays the content of a procedure. If you are not sure where some procedure comes from and what it is doing, you can use this command to find the source code of the procedure.
As a side note, we have learned some basic terminology and commands in the Synopsys design tools. We got the idea of what hierarchical cells/pins are, the fan-in and fan-out logic cones, query commands like get_cells/get_pins, and tracing commands like all_connected and all_fanout.
STA converts the actual circuit into a map of nodes and calculates the delay between any two nodes. As the graph represents the entire design, all possible timing paths are contained within the graph.
The graph gets its name from its representation of the design as a node graph. The ports and pins
in the design become the nodes in the graph, and the timing arcs become the connections
between the nodes.
Different cells have different types of timing arcs. For a combinational cell, such as an inverter, buffer, AND, NAND, or OR gate, the timing arc is simply the connection between each input and its output.
For sequential cells, it is more complicated: there are clock-to-data arcs, clock-to-reset arcs, and clock-to-output arcs. The clock-to-data arc is actually where the setup and hold checks come from; the clock-to-output arc is the propagation path of the sequential cell.
Positive unate timing arcs are arcs where the input and output signals have the same transition direction. A rising transition on an input causes the output to rise or remain unchanged; a falling transition on an input causes the output to fall or remain unchanged. For example, if the B input of this AND gate is 1, the gate becomes transparent to the A input, so the output transition follows the A pin. If the B input of this AND gate is 0, the gate output is stuck at zero, so any change on the A pin won’t cause a change on the output.
Conversely, negative unate timing arcs are the ones where the output transition has the opposite direction from the input waveform. A rising transition on an input causes the output to fall or remain unchanged; a falling transition on an input causes the output to rise or remain unchanged.
Non-unate arcs are the ones whose output transition cannot be determined by a single input alone. The output transition also depends on the state of the other inputs. Take the XOR gate: if the B input is zero, then a rising transition on the A pin causes the output to rise as well; if the B input is one, then a rising transition on A causes the output to fall.
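For illustration, the unateness of a simple gate arc can be classified by brute-force enumeration of the other inputs (real tools read this from the library's timing_sense attribute; this is just a sketch):

```python
# Sketch: classify the unateness of an input->output arc of a small gate
# by enumerating all states of the other inputs.
from itertools import product

def unateness(func, arc_input=0, n_inputs=2):
    """Return 'positive', 'negative', or 'non-unate' for the given input arc."""
    causes_rise = causes_fall = False
    for others in product([0, 1], repeat=n_inputs - 1):
        def apply(bit):
            vals = list(others)
            vals.insert(arc_input, bit)   # place the arc input among the others
            return func(*vals)
        lo, hi = apply(0), apply(1)       # output before/after the input rises
        if lo == 0 and hi == 1:
            causes_rise = True
        if lo == 1 and hi == 0:
            causes_fall = True
    if causes_rise and not causes_fall:
        return "positive"
    if causes_fall and not causes_rise:
        return "negative"
    return "non-unate"

print(unateness(lambda a, b: a & b))        # positive  (AND)
print(unateness(lambda a, b: 1 - (a & b)))  # negative  (NAND)
print(unateness(lambda a, b: a ^ b))        # non-unate (XOR)
```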
What a timing path describes is that whenever data switches at the startpoint, the signal propagates through the cloud of logic gates, performing some logic operation, and then reaches the endpoint.
A valid path means the path represents a real functional transaction. The signal has to go through the cloud of logic and be captured at the endpoint in order for the design to function well. This type of path is the one we care about in STA.
A leakage path means that even though the path from the startpoint to the endpoint exists topologically (since STA does an exhaustive check on all possible path trace combinations), it is not required functionally for the chip to operate. This type of path can sometimes occur, and we should ignore it.
A don’t-care path is more of a constraint issue. For example, if a timing path is reported by the tool but the startpoint holds a static value, say some pre-programmed configuration register, then the startpoint won’t toggle during normal functional operation. There is no point in checking the timing path from this startpoint. We can set a false path from this startpoint, but dealing with this kind of path requires related design knowledge.
The actual path taken depends upon the state of the other inputs along the logic path.
Max path (longest path / late path) = the path with the largest delay between two endpoints. -> This type of path is usually used for the setup check.
Min path (shortest path / early path) = the path with the smallest delay between two endpoints. -> This type of path is usually used for the hold check.
For the same startpoint/endpoint pair, the max path and min path can be two completely different timing paths consisting of different sets of gates.
The max delay and min delay for a given cell refer to delays through the same timing arc of that cell. The delay difference is due to different stimuli applied in the analysis to model process variation and the worst slew merging scenario.
We know that due to the capacitive loading from the parasitics on the wire, which must be charged or discharged in order to raise or lower the voltage level on a net, the signal takes a finite time to transition.
That is where the rise and fall transition times, or slew rates, come from.
Transition time is traditionally measured as the time required to go from 10% to 90% of the transition (rise/fall).
Slew is the inverse of transition time: the larger the transition time, the slower the slew.
However, the slew thresholds are chosen to correspond to the linear portion of the transition waveform. So for relatively new technology nodes, the slew thresholds are usually chosen as 30% to 70%.
Propagation delay is the main component of gate delay. It stands for the time needed to propagate through the logic cell itself.
If the transition time were ideal, namely zero, the propagation delay would simply be the delay between the two transition edges.
However, since we have finite transition times, the propagation delay is defined as the delay between the 50% point of the input waveform and the 50% point of the output waveform.
The propagation delay of a cell is a function of the input transition at each input and the output load capacitance.
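Putting the two definitions together, here is a sketch of measuring slew (30%/70% thresholds, as used at newer nodes) and propagation delay (50% to 50%) from sampled waveforms. The piecewise-linear edges and all numbers are made up for illustration:

```python
# Sketch: measure slew and propagation delay from sampled waveforms.
# Waveforms are hypothetical piecewise-linear edges; times in ns, volts in V.

def crossing_time(times, volts, threshold):
    """Linearly interpolate the first time the waveform crosses threshold."""
    points = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if (v0 - threshold) * (v1 - threshold) <= 0 and v0 != v1:
            return t0 + (threshold - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("threshold never crossed")

vdd = 1.0
t_in,  v_in  = [0.0, 1.0], [0.0, vdd]            # input: linear rise over 1 ns
t_out, v_out = [0.0, 0.5, 2.5], [0.0, 0.0, vdd]  # output: starts rising at 0.5 ns

# Slew measured between the 30% and 70% crossings of the output edge.
slew = crossing_time(t_out, v_out, 0.7 * vdd) - crossing_time(t_out, v_out, 0.3 * vdd)
# Propagation delay measured between the 50% points of input and output.
delay = crossing_time(t_out, v_out, 0.5 * vdd) - crossing_time(t_in, v_in, 0.5 * vdd)
print(round(slew, 3), round(delay, 3))
```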
Given an input slew or waveform at the driver input, the goal of delay calculation is to compute
the response at the driver output and at the input of the receiver pins.
The computed responses are then used to determine the cell delay for the driver and the input
transition time at the load pins.
The purpose of cell delay model is to provide a mathematic model for the STA tool to compute
propagation delay through the driving cell.
Remember that all these models used in STA try to correlate with real physics as closely as possible, but they do not aim to simulate the circuit, which is what SPICE simulation does.
So all these delay models are a trade-off between runtime and accuracy.
[p38] NLDM
The Non-Linear Delay Model (NLDM) is a voltage-based delay calculation model which is widely used to represent the response characteristics of the cells in the libraries. It is very simple and less time-consuming for the tools to obtain the response of the cells. This model uses two-dimensional tables to represent the cell delay, output slew and other timing checks. In this modeling method, the driver cell is modeled as a voltage source with a resistance in series (Thévenin model). The receiver is modeled as a load capacitor.
Notice that the NLDM table is characterized under the condition that the output wire resistance is zero, since we have no idea what the load will be at library creation time.
• Performing table lookup and interpolation in a cell delay table provided in the library (most
common)
For a given library, if cell delay tables are provided for a timing arc, then the propagation delay
tables must NOT be provided. The converse also holds: if the propagation tables are specified, then
the cell delay tables must NOT be provided. So the tool can only choose one of the methods for
delay calculation.
Here is an example of a timing arc and timing sense inside a normal library file. The pin section
identifies the pin being characterized; here the pin name is Y and it is an output pin.
It also tells you the associated power and ground, the direction, and the logic function generated
on this pin. This section also contains design constraints such as the max capacitance and max
transition specs, which are supposed to guide the physical design tool during optimization.
The timing section shows the related pin for this particular arc, A1, which means the following
table describes the timing behavior of the arc from A1 to Y. As we can see, for an AND gate this is a
positive unate arc, because the output Y rises along with A1 if A2 is one, or stays low if A2
is zero. Y will never fall when A1 is rising.
Note that two tables show up here: cell_rise and rise_transition.
cell_rise is the direct cell delay lookup table for a rising output signal. It uses input transition
and output load as indices to interpolate the cell delay through the cell.
The rise_transition table describes the output transition time based on input transition and output
load. It can be used as the input transition of the next stage after applicable slew degradation. If
propagation tables are provided in the library, the second method of calculating cell delay can be
used as well.
Say we are given an input transition of 0.09 and an output load of 0.67. The four nearest pre-
characterized points in the cell delay lookup table are highlighted in the rectangle. In a 3-
dimensional graph, the cell delay values of these four points form a surface, which we can model
with a simple equation: Z = A + B*X + C*Y + D*X*Y.
To interpolate the corresponding cell delay from these 4 points, the first step is to solve for the
hidden coefficients A, B, C and D. This is easily done by substituting the data points into the
equation. Once the coefficients are known, the cell delay can be calculated for the given
input transition / output load combination.
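The interpolation step described above can be sketched in Python. All the corner coordinates and delay values below are made-up numbers, not taken from any real library:

```python
def bilinear_coeffs(x1, x2, y1, y2, z11, z12, z21, z22):
    """Solve Z = A + B*X + C*Y + D*X*Y through four table corners.
    zij is the table value at (xi, yj); X is the input transition
    index and Y is the output load index."""
    d = (z11 - z12 - z21 + z22) / ((x2 - x1) * (y2 - y1))
    b = (z21 - z11) / (x2 - x1) - d * y1
    c = (z12 - z11) / (y2 - y1) - d * x1
    a = z11 - b * x1 - c * y1 - d * x1 * y1
    return a, b, c, d

def nldm_lookup(x, y, x1, x2, y1, y2, z11, z12, z21, z22):
    a, b, c, d = bilinear_coeffs(x1, x2, y1, y2, z11, z12, z21, z22)
    return a + b * x + c * y + d * x * y

# Hypothetical corner points around slew = 0.09, load = 0.67
delay = nldm_lookup(0.09, 0.67,
                    0.08, 0.10,    # the two nearest slew index points
                    0.60, 0.70,    # the two nearest load index points
                    0.110, 0.120,  # delays at (0.08, 0.60) and (0.08, 0.70)
                    0.125, 0.135)  # delays at (0.10, 0.60) and (0.10, 0.70)
print(delay)  # lands between the four corner values
```

At a table corner the lookup returns exactly the characterized value, and inside the rectangle the interpolated delay always stays between the four corner delays.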
In deep submicron technologies, the impedance of the net is comparable to the driver resistance.
Thus the resistive impact of the wire can no longer simply be ignored.
The cell output waveform with an RC load is very different from the waveform with a single
capacitive load.
Let's look at the example of an inverter: as the input waveform rises, the output waveform falls.
The parasitics of the wire and load pin can be modelled as a distributed RC network. This RC
network can be further reduced to an RC pi model consisting of a total capacitance of Cnear + Cfar,
but with Rwire in between. Thus, the near-end capacitance is charged more quickly than the far-end
capacitance because of the resistive wire.
The output waveform for the actual load crosses the 50% transition bar at a much earlier
timestamp than predicted when the total capacitance of the wire is used for delay calculation.
Recall that the NLDM table is characterized under the condition that the output wire resistance
is zero, so it cannot be used directly.
We have to modify the model in order to make the table useful again.
The idea of effective capacitance is to obtain an equivalent output capacitance Ceff which produces
the same delay through the driver cell as the original design with the actual RC load.
However, the effective capacitance only ensures that the delay to the 50% transition bar matches
the actual RC load; it does not provide a matching output waveform with the actual load. That
means the accuracy of the transition time obtained from this method is not guaranteed.
We can see that, because of the interconnect resistance, the capacitance seen from the driver
output is actually smaller than the total cap on the wire. This effect is called resistive shielding.
If the interconnect resistance is negligible, the effective capacitance is nearly equal to the total
capacitance;
if the interconnect resistance is very large, the effective capacitance is almost equal to the near-
end capacitance (the extreme case behaves like an open circuit).
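To make the two limiting cases concrete, here is a deliberately crude toy model in Python. The shielding factor used here is invented for illustration only; real tools compute Ceff iteratively by matching the driver current or delay:

```python
def toy_ceff(c_near, c_far, r_wire, r_drive):
    """Toy effective-capacitance sketch: the far-end cap is 'shielded'
    by the wire resistance relative to the driver resistance. This is
    only meant to reproduce the two limiting cases above, not any real
    tool's Ceff algorithm."""
    shielding = r_drive / (r_drive + r_wire)
    return c_near + c_far * shielding

print(toy_ceff(1.0, 2.0, 0.0, 5.0))  # no wire resistance: 3.0 (total cap)
print(toy_ceff(1.0, 2.0, 1e9, 5.0))  # huge wire resistance: ~1.0 (near-end only)
```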
Resistive shielding is a very important concept in advanced technology nodes. This phenomenon
is dominant for long wires and makes the wire delay calculation pessimistic.
Because of this, when we want to improve the RC delay of a wire, it is better to reduce the
resistance at the near end of the driver output, which both reduces the inaccuracy of the delay
model and improves the overall delay.
Then the signal goes through an RC network, which widens the transition time even further; this
is called slew degradation. When it reaches the input of the next stage, in this case a NAND
gate, the degraded waveform is used as the input transition of the NAND gate. Since the NAND gate
also has a negative unate arc, we use the cell_rise table and rise_transition table for the cell
delay and transition time calculation.
The composite current source model, or simply CCS model, consists of two parts.
These characterization experiments are repeated for a table of different input transition and
output load combinations.
The current through Cout is saved at every circuit simulation time step and then reduced to a
much smaller set of current and time (i, t) points.
When using this driver model on an actual circuit, the first step is to calculate an effective
capacitance from the reduced-order RC network. Then we can apply the output current table.
When we apply these currents to their respective capacitances, we can reconstruct the voltage
waveforms. If we are presented with an output capacitance that we did not pre-characterize, we
can interpolate between the currents to predict the resulting waveform. Similarly, if we are
presented with an input slew that was not used for pre-characterization, we can also interpolate.
Due to the interconnect RC and the non-linear capacitance of the input devices of the load, the
receiver capacitance value varies at different points of the transitioning waveform.
In the CCS model, this capacitance is modeled differently for the initial (or leading) portion of
the waveform versus the trailing portion. For each input slew and output load combination, the
model provides two different values, C1 and C2, to be used in delay calculation. This two-
capacitance approach enables a dynamic calculation that closely matches circuit simulation for
load inputs that have non-linear capacitance.
So far we know the CCS model is current versus time, namely in the I-T domain. But we can also
characterize the transition process using current versus voltage, namely in the I-V domain. Rather
than using time-stamped current and voltage curves, compact CCS models the current-versus-voltage
curve. The benefit of characterizing the transition in the I-V domain is that the transition curve
is much smoother than in the I-T or V-T domain, so we can reconstruct the curve much more easily
and reduce the storage space.
Notice that the I-V switching curves are usually convex and have no inflection point in the
middle, a feature that facilitates compact modeling. A given I-V curve can be split into two
halves, and each half is matched with a pre-characterized "base curve". To exactly match the I-V
curve, only 6 parameters are needed:
Compared with the traditional CCS method, which may need 20 to 30 sample data sets to describe
the same transition process, the compact CCS method consumes much less space to store the
data. The compact model uses indirectly shared base curves to model the shape of the switching
curves. By allowing each base curve to model multiple switching curves with similar shapes, the
modeling efficiency is improved and the library size is compressed.
What comes along with the base_curves group is the compact_lut_template. The template is a
lookup table describing the current-voltage waveform, with input net transition as the first
index and total output net capacitance as the second index. The waveform attribute is the third
index of the data and consists of the six essential parameters stated above, namely: initial
current, peak current, peak voltage, peak time, left base curve ID and right base curve ID.
So there are often several case descriptions for the same timing arc, one for each input
combination plus a default case. The rules for conditional timing arcs are as follows:
1. We use a when statement to specify the condition. If the condition expression evaluates to true,
the timing values in that case are active. At the same time, the default case is disabled.
2. If a state in the condition cannot be determined, causing the entire condition to be in an
undetermined state (X), the condition is still evaluated as true.
3. The timing engine picks the worst active timing value for this particular timing arc.
4. To disable a particular conditional timing, we must force its condition to a known false state.
1. There are three cases in parallel as shown on the left: when B is true, not true, or don't care.
2. If no case analysis is set for input B, then all three cases are active and taken into account
by the timing engine. It picks the worst case to use in the calculation; usually the worst case
is the default case.
3. If we set a case analysis to hold B constant at one value, then only one of the two conditional
cases above is picked and used during calculation. The default case is also disabled.
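The rules above can be condensed into a small Python sketch (a simplified model of the rules, not any tool's exact algorithm):

```python
def pick_arc_delay(cond_cases, default_delay):
    """cond_cases: list of (cond, delay), where cond is True, False, or
    None for an undetermined (X) condition. Per the rules above: an X
    condition still counts as active; the default case is disabled only
    when some condition is definitely true; the worst (max) active delay
    wins. A simplified sketch, not a tool's exact algorithm."""
    active = [delay for cond, delay in cond_cases if cond is not False]
    if not any(cond is True for cond, _ in cond_cases):
        active.append(default_delay)  # no condition is definitely true
    return max(active)

# No case analysis on B: both conditional cases (X) and the default compete
print(pick_arc_delay([(None, 0.30), (None, 0.25)], 0.40))  # -> 0.4
# Case analysis forces B = 1: only the matching case stays active
print(pick_arc_delay([(True, 0.30), (False, 0.25)], 0.40))  # -> 0.3
```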
Thus, the interconnect parasitics can be represented by an RC network. The parasitic values can
come from pre-layout estimation or post-layout extraction.
Pre-layout phase
Estimation of interconnect delay happens before the design has actually been routed. The
implementation tool has to estimate wire delays in order to come up with solutions for logic
optimization and placement.
Post-layout phase
After the design is fully routed, we know the exact topology and length of each route. Extraction
is then performed to obtain the wire parasitic values, which form a distributed RC network for
later delay calculation.
[p52] RC Tree
First of all, an important concept: all backend tools rely on an RC tree topology.
An RC tree is a reduced-order model the tool uses to calculate the interconnect delay through a
net.
RC values per unit length of the wire are obtained from the library. Extraction data from
already-routed designs is used to build a fanout-to-length table called the wireload model.
It represents formulas which relate interconnect delay to geometrical parameters such as wire
length, width and metal thickness.
All nets with the same fanout get the same estimated interconnect delay during front-end design,
which naturally does not coincide with reality.
To further correlate with the physical implementation, the industry also tries to bring some
place-and-route features into delay estimation.
Even though in the STA tool all we need are the parasitic values, it is worth knowing that some
implementation and optimization tools can estimate the wire based on an initial placer that takes
in physical constraints like the floorplan and technology file, performs some global routing, and
estimates the wire based on that. This way, the RC network topology is based on the actual
physical topology and the RC values are derived from the provided technology file.
For any fanout number not explicitly listed in the table, the wire length is obtained by linear
extrapolation with the specified slope.
Note that nets with the same fanout will get the same wire delay estimate, no matter how they are
eventually routed. Giving all nets with the same fanout the same estimated interconnect delay is
obviously not an accurate way to calculate wire delay.
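The fanout-to-length lookup plus extrapolation can be sketched in Python; all the table entries and the slope below are hypothetical numbers:

```python
def wlm_length(fanout, table, slope):
    """Estimate wire length from a fanout-to-length wireload table.
    table maps fanout -> length; a fanout not listed in the table is
    linearly extrapolated from the closest listed fanout using the
    model's slope. All numbers here are made up for illustration."""
    if fanout in table:
        return table[fanout]
    nearest = min(table, key=lambda f: abs(f - fanout))
    return table[nearest] + slope * (fanout - nearest)

wlm = {1: 10.0, 2: 16.0, 3: 21.0, 4: 25.0}  # hypothetical fanout -> length
print(wlm_length(2, wlm, slope=4.0))  # listed entry: 16.0
print(wlm_length(6, wlm, slope=4.0))  # extrapolated from fanout 4: 33.0
# The wire R and C would then be length times the per-unit values
# from the library.
```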
[p55] WLM in library
Here is an example of the fanout-length relation defined in a library. On the left-hand side, two
tables depict the fanout-length slope ratio; the unit RC as well as the length lookup table with
respect to fanout are listed out. On the right-hand side, since different wireload models may be
applied to different cells based on their area, the wire_load_selection group tells the tool which
wireload table to pick for a given design area.
First is the best-case tree, where the destination pin is physically adjacent to the driver. None
of the wire resistance is in the path to the destination pin.
The resistance contribution in this wireload model is set to 0, so the wireload contribution is
purely capacitive. The total delay value can be obtained by directly reading the library NLDM
table.
The second one is the balanced tree, where each destination pin is on a separate branch of the
interconnect wire. Each path to a destination sees an equal portion of the total wire RC.
The last one is the worst-case tree. All destination pins are clustered together at the far end of
the wire, so each destination pin sees the total RC of the wire.
In STA, we have to set up which topology to use according to our application needs. But remember
that all of these are just early estimation methods for delay calculation.
After the design gets into the post-layout phase, where the actual routing is done, we usually use
a layout extraction tool to generate more accurate design parasitic values in the SPEF format.
SPEF contains distributed RLC interconnect values and the corresponding (x, y) coordinates of
nodes.
For example, on the LHS is a snapshot of a sample SPEF file. It describes a net topology along
with the RC values between all nodes, as the figure on the RHS shows.
Even though the three "special net" categories above are not uncommon, they do have side effects
that cause partial annotation issues in timing analysis.
Floating metal pieces in the signal routing can make it difficult for the extraction tool to
calculate the RC, since there is an extra piece of geometry belonging to the same logical net. Say
that during the ECO phase a layout engineer reroutes a particular signal by creating a new route,
but for some reason forgets to remove all of the original route and leaves behind a small piece of
metal. In this situation, even a small piece of extra metal can cause a partial parasitic
extraction issue. We need to be careful to remove all dangling, unused route shapes.
Dangling ports and pins in the RTL can confuse the tool as well. Without knowing what the driver
and load are, timing analysis cannot be done correctly on those nets, so ideally they should be
eliminated from the RTL or optimized away by a logic synthesis tool like Design Compiler.
Constant nets are not that concerning since they are quite static, but we still need to verify
that they are expected. Usually we should not have too many constant nets in the design, since
most of the time a constant means redundancy, so the logic synthesis tool works hard to optimize
constant logic out of the design.
- The tool could assume unrealistic delay values and create false violations.
- Timing results cannot be trusted on problematic nets having annotation issues.
- Other nets might be impacted as well, so the accuracy of the whole timing analysis degrades.
Thus, we should try to eliminate as many annotation issues as possible to make the STA analysis
reliable.
Since all of the RC delay calculation methods are based on an RC tree, let's first review what an
RC tree is:
Thus, both WLMs and SPEF RC networks meet the RC tree requirements.
Once the parasitic values are obtained and the RC tree topology is determined, we can calculate
the wire delay based on the interconnect models.
Mathematically, it uses the first moment of the RC tree transfer function to estimate the RC delay.
It defines a shared path resistance between any two nodes in an RC tree, which is the sum of the
resistances on the common portion of the paths from the source to both nodes.
The Elmore delay to a node is the sum, over every node in the RC tree, of that node's capacitance
multiplied by its shared path resistance with the destination node. For example, the Elmore delay
to node 2 on the LHS is calculated as:
the cap on node 1 multiplied by the shared path resistance of nodes 1 and 2, which is R1; plus
the cap on node 2 multiplied by the shared path resistance of node 2 with itself, which is
(R1 + R2); plus
the cap on node 3 multiplied by the shared path resistance of nodes 3 and 2, which is (R1 + R2);
plus
the cap on node 4 multiplied by the shared path resistance of nodes 4 and 2, which is R1; plus
the cap on node 5 multiplied by the shared path resistance of nodes 5 and 2, which is (R1 + R2).
Elmore delay calculation is the simplest and most widely used method throughout the entire design
phase.
The Elmore delay model only works for RC trees. It provides an accurate result (close to the true
delay) for nodes that are far from the driving point, but it can be inaccurate by orders of
magnitude for nodes near the driving point. This inaccuracy in the Elmore delay calculation is
primarily due to the resistive shielding mentioned previously.
The main application of the Elmore delay calculation is in pre-route databases which don't have
extracted parasitics yet. It is a reasonable method when analysis time is a concern and people
want a fast turnaround.
A more accurate but more complicated approach is a higher-order algorithm (AWE / Arnoldi).
Although the Elmore delay inaccuracy can be improved by corrective factors (e.g. effective
capacitance), there are more accurate methods that use higher-order moments of the RC circuit
transfer function. However, these methods are all significantly more expensive to compute than
the Elmore delay, and that makes them difficult to utilize within ASIC design tools.
One of the more popular and accurate methods used in current physical design and timing
analysis environments for estimating wire delay is Asymptotic Waveform Evaluation (AWE), which
will not be discussed here.
As we can see on the left side, for the fall-transition cell delay, the cell u5070 has an input
transition of 0.041 and an output capacitance of 0.800, so the reported delay is calculated from
the fall delay table in the library. The report_delay_calculation command shows the relevant
portion of the lookup table it picks from the library and how it calculates the coefficients
mentioned earlier. From there the cell fall delay is calculated. We then multiply it by the
derating factor to create some extra margin, since this run uses a wireload model and is therefore
optimistic. The calculated delay of 0.063 multiplied by the derating factor of 1.35 matches the
0.085 in the timing report.
The output slew of a driver depends on the input slew at the driver and the load capacitance seen
from the driver's output.
The slew rate at a load input pin depends on both the output slew of the driver and the slew
degradation along the path due to the resistive nature of the wire.
If this is a multiple-fanout net, meaning one driver drives more than one load, each load pin can
have a different slew rate at its input.
It is worth noting that even in the zero-interconnect mode, the driver resistance and load pin
capacitance still exist and are taken into account in the delay calculation.
[p64] Slew Merge
Delay calculation is performed as the signal edge propagates forward across the logic path.
If we were computing the timing of a chain of buffers, we would derive the input slew of each
stage from the output slew of the previous stage, performing delay calculation and storing the
results on the timing graph as we propagate along.
However, when two slews arrive at the same point on the graph, static timing analysis tools like
PrimeTime must choose one of these slews to propagate forward so they can continue delay
calculation for the downstream logic. One common case is the output pin of a multi-fanin cell
where multiple timing arcs converge.
These points where a slew must be chosen are called slew merge points.
To ensure the min/max graph values always bound the fastest and slowest possible timing, the
worst slew must be chosen and propagated forward.
We can see that there is inherent pessimism in a graph representation of a design's timing. For
the real physical device, a timing arc can have multiple timing behaviors depending on the
upstream logic that sources its transition. However, a graph representation allows each timing arc
to have only a single min/max rise/fall timing behavior: if we tried to store a value for every
possible upstream path, we could have thousands or millions of values stored per arc, resulting in
impossible memory and runtime requirements. That is why, as stated above, the worst slew must be
chosen and propagated forward.
Vice versa, when calculating the delay of a min-delay arc, the tool picks:
the minimum annotated lumped capacitive wire load from the min SPEF;
the minimum pin capacitance or receiver model from the min timing library;
minimum slew propagation at slew merge points when calculating cell delay.
The simplest method is to use a single delay condition across the entire chip: every timing arc is
evaluated as a max-delay arc, and both setup and hold paths use the computed max-delay arcs. That
is, we use the same library, characterized at one operating condition, for both max and min timing
checks. The cell delay is deterministic and has only one possibility. However, this method does
not reflect the process variation which happens in reality, so its accuracy is very poor.
That is where the best-case / worst-case (bc_wc) mode comes in. When we set STA to this mode, it
reads in two sets of extreme delay values. The two sets of delay values can represent two PVT
(process/voltage/temperature) conditions which cannot physically coexist at the same time; the two
corners in bc_wc mode represent two completely independent PVT corners. Setup paths use the
longest path through the max-delay arcs for launch, and the shortest path through the max-delay
arcs for capture. Hold paths use the shortest path through the min-delay arcs for launch, and the
longest path through the min-delay arcs for capture. In other words, the bc_wc analysis mode only
checks setup at the max corner, and hold at the min corner. It is important to remember that setup
paths are not checked at the min corner, and hold paths are not checked at the max corner. This
could miss timing violations due to differences in how the launch and capture paths track the PVT
difference between the corners.
The single and bc_wc analysis modes both have a serious accuracy limitation: either the fast
launch/capture paths are computed by using max delay arcs (both single and bc_wc), or the slow
capture path is computed by using the min delay arcs (bc_wc). These modes were suitable for
designs in older technologies where slew sensitivity and slew variation were minimal. These
modes can, however, result in optimism when used on modern small-geometry designs.
We know that multiple different paths can reach one common node in a timing graph. They could go
through different types of logic gates and different numbers of gates, and experience different
process variation and crosstalk effects, so certainly their arrival times differ. The timing
window is simply the window between the earliest possible switching time of a node and the latest
possible switching time.
The calculation of timing windows is very important in graph-based analysis, and it affects the
noise and crosstalk calculation of a design too.
For example, among the 3 paths in the LHS figure, path 1 has the largest path delay and arrives
later than the other 2 paths. Path 3 goes through the fewest gates and has the shortest path
delay. So the timing window for the signal on the input net of the D pin is bounded by the path 3
delay and the path 1 delay.
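In code, a timing window is nothing more than the min/max over the per-path arrival times (the numbers below are hypothetical):

```python
def timing_window(arrival_times):
    """The timing window at a node spans from the earliest to the latest
    possible switching time over all paths reaching that node."""
    return min(arrival_times), max(arrival_times)

# Hypothetical arrivals: path 1 (longest), path 2, path 3 (shortest)
early, late = timing_window([1.8, 1.2, 0.7])
print(early, late)  # window is [0.7, 1.8]
```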
GBA mode is short for timing graph-based analysis. In this mode, a timing arc can have only one
single, most conservative set of rise and fall transition times, and the timing window is
propagated along the path.
The principle is to guard-band the entire path using worst-case values.
Let's assume the upper path is a long path before the slew merge point, but it has a
large-drive-strength inverter driving the NAND gate input, so it has a steeper transition.
On the other hand, the lower path is a short path, but its transition is slower.
In GBA mode, we can have only one slew value propagated through this NAND gate. So according to
the GBA principle, the slower one is picked for the max timing path.
This choice makes the upper path more pessimistic after the slew merge point. The signal arrival
time falls within the range of a timing window no matter which path it comes from.
In PBA mode, by contrast, the faster transition can be propagated through the NAND gate, so the
upper path sees its real transition even after the slew merge point.
The signal arrival time is calculated specifically for this path, and it has only a single
switching edge, so we can calculate noise more accurately from it.
Normally, PBA-mode timing QoR is more accurate for the above reasons and should always look the
same or better than GBA mode.
But since the tool has to isolate each path and store a unique slew rate and arrival time for each
timing node, it requires much more computing power to process this high volume of data.
The normal strategy is to use GBA for timing closure while the design still has a lot of failing
paths or timing violations. Then, after all major timing problems have been solved and only a few
tail paths are left, we can enable PBA mode for an accurate signoff check.
More information than that is needed when we try to analyze a path or do any kind of timing fix,
so we usually append some additional switches to the original command. As in the example shown on
the right-hand side, the timing report now shows the input nets as well as the fanout of each net,
the total capacitance on each wire, and the transition time of each signal rise and fall. Based on
this information, we can decide where there is potential to improve the timing results and fix
slow timing paths.
[p72] (cont’d)
The most common switches are listed here.
-net not only shows the nets between pin nodes, but also shows the number of fanouts of each net.
-input shows the input pin through which the path goes. It is useful when tracing a report through
multi-input cells. It also splits the delay associated with a cell into net delay and cell delay.
-tran shows both the input and output transition times used for, or calculated by, the delay
calculation.
-cap shows the total capacitance appearing on each net, including both the wire capacitance and
the input pin capacitance of the next stage.
If the timing constraints have been changed on the fly during debugging, the STA engine needs to
recalculate the timing graph. Even though report_timing re-times the design implicitly and
incrementally, it is always good practice to run an explicit update_timing before report_timing.
Last but not least, a few easily confused concepts have been clarified, such as slew merging,
min/max arc delays, analysis modes and timing windows. The GBA and PBA mode concepts have also
been mentioned. Overall, this is a big chunk of design knowledge that needs to be understood for
later study.
Chapter 3 Constraint Development
The trigger can happen at the positive or negative edge or both edges of the control signal.
Such a control signal which acts as a trigger for a synchronous design is called a clock and the edge
on which the design triggers is called the active edge of the clock.
The clock diagram is the first thing the STA engineer should get from the clock architecture
designer.
The clock scheme of a design largely depends on the functionality it wants to realize.
But normally all clock structures have something in common. The picture shown here is a very
generic clock diagram.
First, let's examine what kinds of elements are in this clock diagram.
You will see mainly three parts before the clock actually reaches the clock pin of the flip-flop:
clock generation, clock selection and clock gating.
For clock generation, the on-chip system clock is usually generated by an analog block called a
phase-locked loop. The PLL has a feedback loop which can raise the frequency of a low-frequency
reference clock source up to the real operating speed of the chip. But depending on the
application, the on-chip clock may not be the only clock source used. That's why we have clock
selection logic.
System clocks can either come from outside the chip as an external source, or be generated by the
PLL. Besides functional clocks, we can have test clocks targeted at debugging the chip. The clock
selection logic selects the proper clock to propagate to downstream logic for a given
functionality.
As power saving becomes more and more of a concern in ASIC design, the clock gating technique is
widely used. Clock gating allows the designer to disable the toggling of a clock when it is not
used by downstream logic, which saves a lot of power. Clock gates can be architecturally designed
and coded in the RTL Verilog, or inferred by the physical implementation tool.
We will explain more in a few slides.
The startpoint of a timing path can be an input port or the clock pin of a synchronous flop or
memory. Similarly, the endpoint of a timing path can be an output port or the data input pin of a
synchronous device.
Generally speaking, the paths of a design can be divided into external paths and internal paths.
External paths are paths talking to logic outside the design. For example, the launch clock of a
path may come from another partition while the path ends in the current partition.
1) Paths from an input data port to an output data port, which we call feedthrough paths.
2) Paths from an input data port to the data input of a flop/memory.
3) Paths from the clock pin of a flop to an output data port.
4) Paths from the clock pin of a flop to the data input of a flop/memory.
In most mainstream STA tools such as PrimeTime, the tool creates an internal path group for each
clock domain according to the endpoint clock. That means if a path goes from the clk_A domain to
the clk_B domain, it belongs to the clk_B path group.
The default path group includes all non-clocked paths such as asynchronous set/reset paths.
The default report_timing command dumps out the worst timing path of each of these path groups.
The first group is for timing paths from an internal register to an output port. It can be created
by simply specifying the -to option with all_outputs. all_outputs is a built-in command that
returns all the output ports of the current design.
Next, we find all the clock input ports by using all_fanout -clock_tree -levels 0, which returns
all the clock source nodes, including ports and pins. The returned objects are then filtered by
the get_ports command to keep only ports. Then we exclude these clock inputs from all input ports,
which gives us all the input data ports. By specifying paths from all input data ports, we create
the input-to-register path group.
Then we create another path group from all inputs to all outputs, which covers all the feedthrough
paths.
Anything left is a timing path from an internal register to an internal register.
On top of these 4 groups, users can create their own dedicated path groups targeting specific
timing paths. You can put more weight on any path group that needs more attention from the
optimization engine.
The industry standard is to use Synopsys Design Constraints, or simply SDC.
SDC is a Tcl-based text format with commands created by Synopsys for timing constraints. By the way, this is one of the reasons why job postings require experience and knowledge of the Tcl language.
The most essential commands to constrain a design cover clock creation, input/output delay constraints, environment setup, and timing exceptions.
By using these commands, the STA engineer guides the STA tool to look at the real design issues and not false paths. A good timing constraint set is valuable for timely design closure and critical for functional success.
Remember that we have divided all the timing paths into external paths and internal paths. After clock creation, ideally all the functional flop-to-flop paths will be clocked, so the internal timing paths can already be analyzed.
The way to create a clock source in STA is by describing its waveform. The -waveform option specifies the waveform within one clock period, which then repeats itself.
The first argument specifies the time at which the rising edge occurs; the second argument specifies the time at which the falling edge occurs. All the edges must be monotonically increasing and within one period. The edge times alternate starting from the first rising edge after time zero, then the falling edge, then the rising edge again, and so on. There must be an even number of edges specified.
When the -waveform option is not specified, the clock is assumed to have a 50% duty cycle. If there is more than one clock on the same clock source, we must use the -name and -add switches to make them coexist. Otherwise, the one defined later will override the one defined earlier.
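As a sketch, the waveform and -add semantics look like this in SDC; the port and clock names are made up for illustration:

```tcl
# 10 ns clock, rising edge at 0 ns, falling edge at 5 ns
# (identical to the default 50% duty cycle)
create_clock -name clk_func -period 10 -waveform {0 5} [get_ports clk_in]

# 10 ns clock with a 30% duty cycle: rise at 0, fall at 3
create_clock -name clk_skewed -period 10 -waveform {0 3} [get_ports clk_aux]

# A second clock on the same source port; without -add this would
# override clk_func instead of coexisting with it
create_clock -name clk_test -period 40 -add [get_ports clk_in]
```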
Defining clocks correctly is extremely important to the STA analysis. If even a single clock is specified incorrectly, the impact could be felt by millions of paths within the design. It may cause the block to fail timing. Even if the block meets timing, it may give a false sense of timing closure.
A missing clock constraint would also mean that a huge number of paths in the design may not be timed.
Since clock specifications impact the maximum number of paths, even a single incorrect or missing specification could be highly detrimental to the design.
In most cases, synchronous clocks originate from the same clock source.
When a new clock is generated in a design based on a master clock, which means it has a phase relationship with the master clock, it can be defined as a generated clock. This definition is needed because STA does not know that the clock period has changed at the output of a clock divider or multiplier, or what the new period should be.
Typical scenarios for a generated clock are when the signal comes out of clock divider logic, a clock multiplier, or clock gating logic. To distinguish the divided or gated version of the clock, a new generated clock needs to be created.
A source object can have more than one clock. If the master clock source pin has more than one clock in its fan-in cone, then the generated clock must indicate the master clock from which it is derived. This is specified using the -master_clock option, which takes the name of the SDC clock that has been defined to drive the master clock source pin.
Once a generated clock has been defined, clock attributes such as the waveform or period are derived by the tool based on the characteristics of the waveform at the source.
To describe the waveform relation between the master clock and the generated clock, the most commonly used switches are the following:
1. -edges
This is a list of integers that correspond to the edges of the source clock from which the generated clock is obtained. The edges indicate alternating rising and falling edges of the generated clock. The list must contain an odd number of integers, with at least 3 to represent one full cycle of the generated clock. The edge count starts at "1", which represents the first rising edge of the source clock.
2. -divide_by
This represents a generated clock whose frequency has been divided by the specified factor, which means the period is multiplied by the same factor.
3. -multiply_by
This represents a generated clock whose frequency has been multiplied by a factor, which means the period is divided by the same factor.
It should be noted that although clocks are defined using a period attribute, -multiply_by and -divide_by are specified with frequency in mind. When a generated clock defined using the -divide_by or -multiply_by options needs to be inverted, the -invert option can be specified to make the generated clock start with a falling transition instead of a rising transition.
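The three switches above can be sketched as follows; the instance and pin names are invented for illustration:

```tcl
# Divide-by-2 clock at a divider flop output
create_generated_clock -name clk_div2 -source [get_ports clk_in] \
    -divide_by 2 [get_pins u_div/q_reg/Q]

# Same waveform expressed with -edges: the generated clock rises on
# source edge 1, falls on edge 3, and rises again on edge 5
create_generated_clock -name clk_div2_e -source [get_ports clk_in] \
    -edges {1 3 5} [get_pins u_div/q_reg/Q]

# Inverted divide-by-2: starts with a falling transition
create_generated_clock -name clk_div2_n -source [get_ports clk_in] \
    -divide_by 2 -invert [get_pins u_div/qn_reg/Q]
```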
[p81] Topic 12: generated clock blockage
Another important thing about a generated clock is that it blocks other clocks' propagation, even if one of those clocks is its own master clock. When a generated clock is created at a pin, all other clocks arriving at that pin are blocked unless they too have generated clock versions created at that pin.
The definition of a generated clock acts like a breakpoint for all other clock paths. It can be explained with this example. Let's say we defined a master clock at the input port clock with a period of 8 ns. Then the clock goes through some dividing logic and produces a div-by-2 version and a div-by-4 version. The two divided versions merge at a MUX, leaving the user to select between them according to functional needs. Let's say for some reason we only defined the div-by-2 waveform at the output of the MUX, but we also want to analyze the div-by-4 clock. What should we do?
1. The divide-by-4 clock won't show up at node 4, since the definition of the div-by-2 generated clock has blocked the clock path of the divide-by-4 clock.
2. It also blocked the way for the master clock, but since there is no consumer of the master clock downstream, we do not care much about that.
3. To fix the issue, we could define the div-by-4 generated clock also at node 4 with the -add option to the create_generated_clock command, but it is not the preferred way. Here I will let you think about why.
4. The preferred way is to define the create_generated_clock for the divide-by-4 clock at the input of the MUX; the MUX is then able to propagate both divided clocks.
5. Depending on the design content downstream, we may need to set an exclusive relation between these two clocks, since timing paths between them may not be real.
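A sketch of the preferred fix in SDC, with made-up instance names; the idea is to define each generated clock upstream of the MUX so the MUX propagates both:

```tcl
# Define each divided clock at its divider output, before the MUX
create_generated_clock -name clk_div2 -source [get_ports clock] \
    -divide_by 2 [get_pins u_div2/q_reg/Q]
create_generated_clock -name clk_div4 -source [get_ports clock] \
    -divide_by 4 [get_pins u_div4/q_reg/Q]

# Only one of them is selected by the MUX at a time, so paths between
# them are not real; declare them mutually exclusive
set_clock_groups -logically_exclusive -group {clk_div2} -group {clk_div4}
```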
The first thing is to identify the loop by using check_timing -override generated_clock -verbose. This gives you the loops seen by PT. Note that every time after you fix a loop, run this command again to check whether the issue is completely gone. The tool may see new loops coming up once old loops are fixed. It's an onion-peeling process.
Obviously, the timing loop occurs on the left side. The flop on the left is generating the clock for the flop on the right.
The output of the flop on the right feeds back to the datapath of the flop on the left. Ideally, such feedback shouldn't be happening, so the real fix needs to come from an RTL change. But as a quick workaround, we can disable a timing arc at either node 1 or node 2. At node 1, we can break the select-to-output arc of the MUX; at node 2, we can break the data-input-to-data-output arc.
Again, it's not a good idea to let the STA tool, such as PrimeTime, break the loop automatically. Manually break the loop to maintain design consistency across design phases and tools.
Clock Skew
When a clock is generated by a source, it may not arrive at all the flops at the same time. The difference in arrival time at the various flops is due to different paths through the clock network, coupling capacitance from crosstalk, or PVT variations in the design. This causes the edges of the same clock not to align when they reach the various devices. This difference between clock arrivals at different points in the design is referred to as clock skew. Clock skew can be between different points of the same clock (intra-clock) or different (usually synchronous) clocks (inter-clock).
Clock Jitter
At the clock generating device (say, a PLL) itself, a clock's edge may not be deterministic on account of crosstalk, electromagnetic interference, or PLL characteristics.
The above two phenomena mean the clock period itself can vary. Thus, in STA we use clock uncertainty to take these variations into consideration.
Another important aspect of uncertainty is that its value varies between pre-layout and post-layout. In the pre-layout stage no CTS has been performed, so the uncertainty value must take into account the possible impact of the skew that will be inserted. However, post-CTS, the actual clock network has been built, so clock arrivals can be propagated through the network; the skew portion is then known and doesn't need to be specified as uncertainty. So, the clock uncertainty in the post-layout stage is generally less than in pre-layout.
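A common way to express this in SDC; the numbers are illustrative budgets, not recommendations:

```tcl
# Pre-layout: uncertainty covers estimated skew + jitter (+ margin)
set_clock_uncertainty -setup 0.30 [get_clocks clk_func]
set_clock_uncertainty -hold  0.10 [get_clocks clk_func]

# Post-CTS: real skew is in the propagated clock network, so the
# uncertainty shrinks toward jitter (+ margin)
set_clock_uncertainty -setup 0.12 [get_clocks clk_func]
set_clock_uncertainty -hold  0.05 [get_clocks clk_func]
```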
Network latency is the delay from the clock definition point (create_clock) to the clock pin of a
flip-flop.
The network latency is an estimate of the delay through the clock tree before the clock tree synthesis stage. After the clock tree is built, network latency will be replaced by the actual clock network delay.
Source latency, also called insertion delay, is the delay from the clock source to the clock definition
point.
It is recommended to use the set_propagated_clock command to direct the tool to compute clock network latency based on the actual circuit elements, including parasitics.
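A sketch of the latency-related commands; the clock name and values are placeholders:

```tcl
# Pre-CTS: estimate the clock tree with ideal latencies
set_clock_latency -source 1.2 [get_clocks clk_func] ;# insertion delay up to the create_clock point
set_clock_latency 0.8 [get_clocks clk_func]         ;# estimated network latency

# Post-CTS: switch to real, calculated delays through the built tree
set_propagated_clock [all_clocks]
```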
Say we have to generate two versions of a clock: one divided by 2, the other divided by 4. In order to generate the divide-by-4 clock, we used two cascaded divide-by-2 dividers. Thus, the clock path for the divide-by-4 clock will be longer than for the divide-by-2 clock. When we mux them together and feed the output clock to downstream logic, CTS could see a difference when balancing the clock tree.
We want to minimize the clock source latency difference. One way to do it is by having the divider logic on the datapath rather than the clock path, and then using a flop to capture the divided enable signal. This way the divider logic only modulates the free-running clock waveform. Whether it is divide-by-2 or divide-by-4, the clock source latency will always be the same.
A virtual clock has no source specified. In reality, it might have a source, but that source could be outside the block being constrained.
In case 1, we have a pure combinational feedthrough path between one input port and another output port. Their related clocks are not used anywhere else inside the partition. Instead of using a clock declared for this block, a virtual clock can be declared just for constraining the combinational path.
In case 2, in order to constrain an output port which goes to an external flop, we could specify a delay with respect to the real clock itself, but the flop outside the partition would then also get the same clock latency in the STA calculation. We could specify the output delay with the -source_latency_included option, but that would hard-code the delay values, and we would have to change them every time the clock latency changes.
Thus, in this situation, defining a virtual clock helps specify a unique clock source latency for the flop outside the partition boundary. And if the clock insertion delay outside changes, we can just modify the virtual clock latency for all the ports associated with it.
In practice, virtual clocks are not often kept; a good practice often seen is to remove all the virtual clocks in the end and replace them with real clocks for timing signoff.
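A sketch of a virtual clock constraining an output port; the period, latency, and delay values are invented for illustration:

```tcl
# Virtual clock mirrors the external capture clock; note: no source object
create_clock -name vclk_ext -period 10

# Model the external clock tree insertion delay on the virtual clock only
set_clock_latency -source 1.5 [get_clocks vclk_ext]

# Constrain the output port against the virtual capture clock
set_output_delay -clock vclk_ext 2.0 [get_ports proc2mem_command]
```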
[p87] Topic 15: report_clocks
During debug, we may need basic information about the clocks defined in the current timing run. We can check whether the create_clock or create_generated_clock commands have been interpreted by the tool correctly. report_clocks is the command to do this. It can return all clocks, or specified clocks, with their period, waveform, and clock root information. This command is usually the first thing we check when debugging any clock-related timing issue.
There is another command called report_clock_timing which does a similar job with some other optional switches. That command can also be used, depending on your preference.
But sometimes we want to debug unconstrained paths; turning this setting on allows you to check a timing path in most cases even if it is not constrained.
The criteria for propagating the ideal property, starting at the source pins and ports, are as follows:
In addition to disabling timing updates and timing optimizations, all cells and nets in the ideal network have the dont_touch attribute set.
The size_only attribute is set on all cells that are ideal network sources. If nets are specified, size_only is set on the cells of the specified nets' global driver pins. This guarantees that ideal network sources are not optimized away by compile.
[p91] cont’d
Let's look at the situation where the set_ideal_network command is applied at the clock source of a clock path. We use set_ideal_network [get_ports clock], which means all nets, cells, and pins in the transitive fanout of the clock port become ideal. In this case, the next pin hooked up to the clock net, u3479/A2, gets this ideal attribute. We notice that the transitions at both the clock port and the u3479/A2 pin are zero, the capacitance on the clock net is zero, and the incremental wire delay of the clock net is also zero.
Compared with the original clock path where nothing is ideal, the cell delay of u3479 decreased from 0.10 ns to 0.08 ns, which could be a result of the zero input transition at the input pin.
[p92] cont’d
However, if we look carefully, we can see the ideal property is not propagated through u3479 onward. The net N24 still has non-zero capacitance, and pin C3176/A1 still has a non-zero input transition. Recalling that a combinational cell is marked as ideal only if all of its input pins are either ideal or attached to a constant net, we notice that the cell u3479 could be the issue here.
Since u3479 is a 2-input AND gate, let's check the timing path on the other input of this cell. Through report_cell with the -connection option, we can see that pin A1 is hooked up to the tenable input port. The timing path from tenable through u3479 is reported on the right-hand side. Since tenable is not constrained initially, the cell u3479 is not marked as ideal, so it blocks the ideal network propagation on this path. Set tenable to be ideal, and now u3479 also becomes ideal, getting the zero transition and capacitance.
[p93] cont’d
Another thing worth mentioning is the usage of the -no_propagate option. It is sometimes desirable to set the ideal property only on a net segment rather than the entire fanout cone of the net. At the same time, we may still want the tool to do sizing optimization on the driver pin of this net. For example, on the left-hand side there is a reset path going through net n3490. Say we want to mark only n3490 as ideal but allow topological optimization on the other net segments. We can use set_ideal_network with the -no_propagate option on this net, so the wire capacitance and transition on this net are zero, and the driving pin u42/Y gets the size_only attribute as well.
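The two flavors can be sketched as follows, using the objects from the examples above:

```tcl
# Ideal property propagates through the whole transitive fanout of the port
set_ideal_network [get_ports clock]

# Ideal property confined to one net segment; downstream segments and the
# rest of the fanout cone stay optimizable
set_ideal_network -no_propagate [get_nets n3490]
```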
Input paths are launched externally but captured inside the partition, while output paths are launched locally but captured outside.
At the partition boundary, we only see a data port on the datapath and a clock port sending/receiving clock signals.
Thus, to correctly model how long it takes for the signal to travel outside the boundary, we can use the SDC commands set_input_delay and set_output_delay.
This value can come from interface budgeting at the beginning of the project, or be back-annotated from the actual full-chip database once people start to iterate and optimize the design.
The -clock option is used to specify the reference clock with respect to which the delay value is specified. This should usually be the name of the clock which triggers/samples the signal that reaches this input port. If the clock which samples the data does not enter the block of interest, we need to specify a virtual clock with the same characteristics and use that virtual clock.
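For instance, the external launch path of an input and the external capture path of an output might be budgeted like this (port names and values follow the example discussed next):

```tcl
# External flop -> our input port takes up to 0.4 ns after the clk edge
set_input_delay  -clock clk 0.4 [get_ports reset]

# Our output port -> external capture flop needs 0.3 ns before the clk edge
set_output_delay -clock clk 0.3 [get_ports proc2mem_command]
```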
[p95] cont’d
This is an example of how the specified input/output delays are shown in the timing report. Let's set the input delay on the reset port to 0.4 ns with respect to the clock clk, which means this reset signal is probably a synchronously generated reset sent from outside. The 0.4 ns input delay is shown as "input external delay" in the timing report below. This value is treated as an incremental delay and accounted towards the total path delay of the datapath.
Similarly, we set the output delay to 0.3 ns on output port proc2mem_command with respect to clock clk. The output delay value appears in the timing report as "output external delay" and is deducted from the data required time. The output delay mimics the path delay outside the output port.
If the launch flop and the capture flop are clocked by two different clocks, the input delay is usually
with respect to the launch clock and the output delay is usually with respect to capture clock.
There are two other methods of describing port drive capability: the set_drive and set_input_transition commands. The most recent drive command has precedence. If possible, always use the set_driving_cell command instead of the set_drive command, because set_driving_cell allows accurate calculation of port delay and transition time for library cells with nonlinear dependence on capacitance.
Similarly, the set_load command works on output ports, setting the capacitance to a specified value on the specified ports and nets in the current design.
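A typical boundary environment setup might look like this; the library cell and pin names are placeholders:

```tcl
# Model the external driver of an input port with a real library cell
set_driving_cell -lib_cell BUFX4 -pin Y [get_ports reset]

# Model the external load seen by an output port
# (capacitance units follow the library, e.g. pF)
set_load 0.05 [get_ports proc2mem_command]
```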
By using case analysis, the designer can control the right value to be propagated through the logic cone. There are two types of case values in the design. One is the user-specified value, where the case value is explicitly set by the user on a certain pin or port, so that the logic value of that node is forced to the value the designer wants. The other is a constant value the tool automatically derives through logic propagation from upstream logic. This usually happens on nodes that are not set by the user directly.
To correctly constrain the design, we need to make sure the case values follow the design intention and that derived constant values are not propagated to wrong places, causing trouble. We will talk more about this in the constraint debug course.
[p98] cont’d
This is an example of how case analysis affects clock propagation through a clock mux. The clock mux operates like this: when the test enable pin tenable is zero, it lets the functional clock propagate through; when the test enable pin is one, it lets the test clock go through. The corresponding clock paths for both the functional clock and the test clock are shown on the right side.
Initially, if no case analysis is set and we query the clocks attribute on the output pin of cell C3176, it returns both clocks, which means that from the STA perspective, both clocks will propagate through. If we set case analysis on tenable to 1 or 0, only one of the clocks will be returned. This is actually the way to set up functional-mode and test-mode STA runs.
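The two modes from this example translate directly to SDC, one setting per STA run:

```tcl
# Functional-mode run: force test enable low so only the functional
# clock propagates through the clock mux
set_case_analysis 0 [get_ports tenable]

# Test-mode run (a separate session/mode) would use instead:
# set_case_analysis 1 [get_ports tenable]
```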
For data signals, we can use the set_disable_timing command to break a timing arc of a cell.
For clock signals, we can use set_clock_sense -stop_propagation to block clock propagation through a certain timing arc.
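Sketches of both commands; the instance and pin names are illustrative (the MUX refers back to the loop-breaking example):

```tcl
# Data: break the select -> output arc of a MUX to cut a timing loop
set_disable_timing -from S -to Y [get_cells u_mux]

# Clock: stop a specific clock from propagating past a pin
set_clock_sense -stop_propagation -clocks [get_clocks clk_test] \
    [get_pins u_mux/Y]
```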
The advantage of identifying false paths is that the analysis space is reduced, thereby allowing the analysis to focus only on the real paths, which also cuts down the run time.
For example, CLK A and CLK B are two inputs of the same MUX. According to the design intention, only one of them can propagate through the MUX at a time.
However, if we somehow forget to set the case analysis and also say nothing about the relation between CLK A and CLK B, STA will propagate both clocks through the MUX and start analyzing timing paths between these two clocks.
From the STA perspective, this is the most comprehensive way to handle the case, so it won't miss any possible timing paths. But from the designer's perspective, such timing paths will never happen, so this analysis does not make any sense.
Thus, we can set a false path between these two clock domains.
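In SDC, this is declared per direction; note that both directions must be covered:

```tcl
# No real paths exist between the two domains, in either direction
set_false_path -from [get_clocks CLK_A] -to [get_clocks CLK_B]
set_false_path -from [get_clocks CLK_B] -to [get_clocks CLK_A]
```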
Given a list of clocks to be put in one clock group, the tool will assume these clocks belong to the same synchronous domain, and all other clocks that have not been specified in this clock group are asynchronous to all the clocks in that group.
The first case is that they are logically exclusive, which means both clocks coexist in the design, but they don't talk to each other (no timing path between them), such as CLK_A and CLK_B in the left figure.
The other case is that they are physically exclusive. That means only one clock can exist in the design at a time, when multiple clocks are defined on the same design object, such as GCLK_A and GCLK_B.
Note that CLK_A and CLK_B are not logically exclusive anymore in the right picture, since they have a timing path in between in the other part of the design.
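The corresponding clock-group declarations might read as follows; pick the one form that matches the design intent:

```tcl
# Asynchronous: no phase relation between the groups
set_clock_groups -asynchronous -group {CLK_A} -group {CLK_B}

# Logically exclusive: both clocks exist, but never talk to each other
set_clock_groups -logically_exclusive -group {CLK_A} -group {CLK_B}

# Physically exclusive: defined on the same object, only one alive at a time
set_clock_groups -physically_exclusive -group {GCLK_A} -group {GCLK_B}
```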
To guard-band the design, we usually use different derating factors for early paths and late paths. That means different derating factors are used for setup checks and hold checks, each representing the worst scenario for that specific check.
Since the clock and data paths can be affected differently by OCV, STA can model the OCV by making the PVT conditions for the launch and capture paths slightly different.
Cell delay and wire delay can undergo different derating factors. On a cell-dominant path, increasing the cell derate helps the logic optimization tool be pessimistic and consolidate logic depth. The -cell_check option allows us to derate setup/hold/recovery/removal time requirements.
Global OCV: a flat derating factor across all paths; computes worst-case early/late bounds; pessimistic at smaller process nodes.
AOCV: LUT (look-up table) based derating factors annotated on paths according to logic depth and path distance. Due to OCV, cells/wires on longer paths can have different delays than cells/wires on shorter paths. We will go into more detail on this topic in a later chapter.
[p103] Topic 18: set_timing_derate
Here is an example of set_timing_derate. We usually use a larger derating factor for more design margin when the delay is calculated with a wire load model. Here we derate the max cell delay by 1.35. From the library, we can calculate the basic delay value of the cell u5070 to be around 63 ps, which matches the value of 85 ps in the timing report when scaled by the 1.35 derating factor.
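The derate from this example, plus typical early-side companions, could be written as follows; only the 1.35 comes from the slide, the early values are illustrative:

```tcl
# Late (setup) side: inflate cell delays by 35%
set_timing_derate -late -cell_delay 1.35

# Early (hold) side: shrink delays for pessimism on the other bound
set_timing_derate -early -cell_delay 0.95
set_timing_derate -early -net_delay  0.95

# Check: 63 ps library delay * 1.35 = ~85 ps, matching the report
```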
In order to physically implement an ASIC chip, there are two categories of goals we need to achieve regarding the timing aspects of the design.
The first category is called design rules. They are technology-dependent rules designers must follow in order to make the chip work as intended. We know each device is characterized over a certain operating range. So if we want to make the chip reliable, we should design in a way that each device falls into the characterization range of the library. Otherwise the device is working in some unknown state and its behavior cannot be captured accurately, which may cause design failure.
The second category is called optimization goals. They are goals that define the performance, power, and area targets the designer wants to achieve. Say the designer wants a chip to work at a maximum frequency of 2 GHz; then each timing path needs to propagate within the clock period corresponding to 2 GHz, which is 500 ps. The optimization goals in the timing aspect translate to timing constraints for a design.
Normally, the design rules take precedence over timing constraints because they obviously have to be met in order to realize a functional ASIC design.
The max transition rule defines the longest time allowed for a pin in the design to change its logic value. Many logic libraries contain restrictions on the maximum transition time allowed for a pin, creating an implicit transition time limit for designs using that library. Transition times on nets are computed using timing data from the logic library.
1) Make sure delay calculations fall into the library characterization range so they can be accurate.
2) Reduce input transitions to reduce short-circuit current and power consumption.
The max transition limit can be on data signals as well as clock signals. Usually, clock propagation requires a much tighter transition time than data signals. Most designs will have separate specifications for clock transitions and data transitions.
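Design-rule limits are usually set on the whole design, with a tighter override for clocks; the numbers are placeholders:

```tcl
# Data signals: global limit on the current design
set_max_transition 0.50 [current_design]

# Clock network: a tighter limit applied to the clock paths
set_max_transition 0.15 -clock_path [get_clocks clk_func]
```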
The max cap rule defines the maximum total capacitive load that an output pin can drive. That is, the pin cannot connect to a net whose total capacitance exceeds the maximum capacitance requirement defined for the pin.
The total capacitance seen by a driving pin is found by adding the wire capacitance of the net to the capacitance of all the sink pins attached to the net.
Usually, a cell with a larger drive strength has a larger max capacitance threshold.
The max fanout rule defines fanout restrictions for each output driver. Fanout load is a dimensionless number set for each input pin by the library designers in the standard library. It doesn't stand for capacitance.
To evaluate the fanout for a driving pin, the tool calculates the sum of all the fanout_load attributes of the inputs driven by the driving pin and compares that number with the max_fanout attribute stored at the driving pin.
This is a soft limit to restrict the number of fanouts a gate can drive; it is defined to avoid max cap and max transition violations. But as long as you are meeting max cap and max transition, you can ignore max fanout violations.
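Both limits can be sketched as follows; the values are placeholders:

```tcl
# Max load an output pin may drive (units follow the library, e.g. pF)
set_max_capacitance 0.20 [current_design]

# Soft limit on the summed fanout_load a driver may see
set_max_fanout 16 [current_design]
```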
In a design, the width of each signal pulse needs to satisfy a certain threshold, defined either in the .lib or by set_min_pulse_width, in order to function properly. This is especially true for the clock signals to the sequential elements. Signal pulse width shrinkage happens mainly due to the non-equal rise and fall times of the cells. The difference between rise and fall could be caused by OCV. Sometimes the pulse width reduction is caused by large transition times, so the span from the mid-point of the rise transition to the mid-point of the fall transition has shrunk.
If the pulse width keeps decreasing along the path, at some point, if the width is less than the AC noise margin of the cell, the pulse won't be able to propagate through the cell and is absorbed. This phenomenon is called pulse absorption and should be avoided.
In physical implementation, we can use double inverters to replace unbalanced buffers so that both the rise and fall edges of the original waveform experience the same number of rise and fall transitions. We will talk more about this in the place and route course.
Another common reason for min pulse width violations is a wrongly set clock uncertainty. If the clock uncertainty value is too large, it will also eat up the clock pulse width in STA analysis.
[p108] Topic 19: Minimum Pulse Width
Let's take a closer look at the minimum pulse width requirement. The min pulse width requirement usually comes from the library when the cell is characterized. In the .lib, the tool uses fall_constraint for a low pulse and rise_constraint for a high pulse.
As we mentioned in the previous slide, there can be a difference in the rise and fall delays of the gates on the path. If the rise delay of a cell is greater than its fall delay, then the output clock has a smaller pulse width than the input.
Besides the non-equal rise and fall delays of the gates along the path, clock reconvergence pessimism also impacts the calculation of the pulse width window.
Dynamic CRP means clock reconvergence pessimism introduced by dynamic effects like signal integrity delta delays or dynamic clock source latency. Depending on the different derating set for the early and late clock paths, the dynamic CRP will reduce the calculated pulse width, because the path delay could deviate at different timestamps.
Static CRP refers to a CRP value computed from clock arrivals that do NOT include any dynamic effects. The static CRP can be removed from the reduction of the pulse width, so it gives credit back.
Besides all the above, clock uncertainty will also be subtracted from the calculated pulse width to account for the uncertain nature of the clock.
As the picture on the right shows, in the library the min pulse widths for the low pulse and high pulse are defined as 0.07 and 0.06, respectively. After we time the design and report the pulse width on the clock pin, we can see the actual calculated pulse width and the slack against the requirement.
[p109] (cont’d)
And here is another explanation of how the min pulse width is calculated in the tool. As you can see on the left-hand side, this is a min pulse width calculation for a high pulse. According to the notes on the right, the leading edge is calculated with the max_rise clock arrival time, which is the late latency path. The closing edge is calculated with the min_fall clock arrival, which is the early latency path. So the initial 10 ns pulse eventually becomes 8 ns wide.
The same holds for the low pulse width calculation. You can work through the details using the notes on the right-hand side.
For area, there is only one goal: minimize the die size required. Of course, the physical limitation on area is that you have to make sure the design is routable with no shorts and no physical DRC rule violations, such as minimum spacing between two metal traces or minimum width of a metal trace.
On the right side, there is a sample delay optimization process. It shows the time cost of each optimization step along with the different design PPA metrics. Usually this can be found in the run log files. It can serve as an early indicator of the current run quality.
Pre-layout STA
2) The clock tree is assumed to be ideal before CTS, so the STA focuses on datapath issues.
3) At this stage hold timing violations are usually ignored. (No real skew information)
4) For intra-clock uncertainty, clock skew estimation and clock jitter have to be modeled for setup analysis; only clock skew estimation has to be modeled for hold analysis, since the hold check is performed on the same edge of the same clock, so the contribution from jitter cancels out.
5) For inter-clock uncertainty, both clock skew estimation and clock jitter have to be modeled for setup and hold analysis, since the timing path crosses different clocks.
Post-layout STA
1) The first step is to extract parasitics from the actual layout.
2) The clock tree has been implemented, and the clock network delay is propagated with real delay values.
4) For intra-clock uncertainty, only clock jitter has to be modeled for setup analysis; clock skew comes from the propagated clock delay calculation for both setup and hold checks.
5) For inter-clock uncertainty, clock jitter has to be modeled for setup and hold analysis; clock skew comes from the propagated clock delay calculation for both setup and hold checks.
6) Additional recipes for robust verification, such as OCV derating factors, may be plugged in.
The core of a flip-flop consists of two back-to-back latches. Each latch consists of two transmission gates and two back-to-back inverters. The two latches are triggered by opposite clock edges.
At the same time, the second transmission gate is shut off, but the second latch holds the value and sends it to the output Q pin.
Notice that node 4 is where the sampling clock edge gets data into the first latch. The data has to come all the way from node D through a transmission gate and two inverters, so it has some path delay. This means that at the moment of the clock sampling edge, the data actually comes from some time back, and this time equals the propagation delay from the D pin to node 4. We call the propagation delay of the first latch the setup time. (Tpd_latch)
In order to have a steady input value for the clock edge to sample, we must ensure the data does not change value during the clock active edge. This means the data pin cannot change value for a certain amount of time even before the clock active edge. This is where the setup time comes from.
The total propagation delay from outside the flop to Node 4 is the propagation delay through interconnections or any elements before node D, plus the original propagation delay of the first latch. Meanwhile, the clock signal also needs some time to activate the latch. Thus, the real setup time will be Tpd_data + Tpd_latch – Tpd_clk. Depending on the value of these components, the setup time can be positive or negative. A flop with negative setup time means the data can arrive at the flop later than the clock edge, which can be used as a way to fix setup violations, but it comes with a larger hold time requirement.
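The decomposition just described is simple enough to check numerically. This is a sketch of the formula above with made-up delay values (in ns); the function name is my own.

```python
# Sketch of the setup-time decomposition described above:
# Tsetup = Tpd_data + Tpd_latch - Tpd_clk. All values are hypothetical (ns).
def flop_setup_time(tpd_data, tpd_latch, tpd_clk):
    return tpd_data + tpd_latch - tpd_clk

# Typical case: the data path into the latch dominates -> positive setup time.
print(flop_setup_time(0.02, 0.05, 0.03) > 0)  # True
# A slow internal clock path -> negative setup time: data may legally
# arrive after the clock edge, at the cost of a larger hold requirement.
print(flop_setup_time(0.01, 0.02, 0.06) < 0)  # True
```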
In actual circuitry, the clock signal is generated from the CLK bar signal through an inverter. The inverter has propagation delay, so while the clock edge is transitioning, there is a certain amount of time when both CLK and CLK bar are in transition and the first transmission gate is still transparent.
Thus, it takes some time for the transmission gate to completely shut off. If any new data change passes through this gate before it is completely closed, the original data is corrupted, and hence the correct data cannot come out of the RHS latch.
The hold time is determined by the inverter propagation delay needed to make the first transmission gate completely shut off.
The time it takes for data to travel from the first latch to the Q pin is the source of the clock-to-Q delay.
[p119] Metastability
From the analysis above, we know that due to the propagation delay inside the flip-flop, there is a certain time range in which the data pin cannot change value so that the clock active edge can sample the data steadily.
The metastability window must be positive; however, either the setup time or the hold time alone can be negative. This is because the propagation delays to the D pin and the clock pin differ, as mentioned in the previous analysis, so a flop can have a negative setup or a negative hold time requirement.
In some cases, we can use negative-setup flip-flops for max timing path fixes.
Metastability is another big topic. We will not discuss it in this course, but all of the following timing checks exist to ensure a flip-flop does not go into a metastable state.
The launch path is the path going through the clock tree to the clock pin of the launch flop and
then the data path between launch flop and capture flop.
The capture path is the path going through the clock tree directly to the clock pin of the capture
flop.
As we can see, the propagation delay on the launch path is the sum of the insertion delay of the launch clock, the clk-to-q delay of the launch flop, and the propagation delay of the data path.
The propagation delay on the capture path is simply the insertion delay of the capture clock.
For a setup timing check, we expect the data to be launched at cycle N and to be captured at the
next cycle (N+1). At the same time, the value on the data pin of capture flop must be stable for
the amount of setup time when the capture clock is sampling the data.
Thus, we can have following math equation to establish the relation between all these delay
values. Basically this simply mean the data launched must be stable before it gets sampled.
To make the requirement more restrictive, the clock uncertainty is also accounted for and subtracted from the RHS of the equation.
In symbols: T_launch_clock_path + T_clk2q + T_data_path ≤ T_cycle + T_capture_clock_path − T_setup − T_clock_uncertainty.
The slack is the difference between the RHS (the required time) and the LHS (the arrival time). Once we have the slack number, it is straightforward to derive the maximum effective clock period and frequency from this equation:
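The slack arithmetic just described can be sketched in a few lines. All delay numbers below are hypothetical (in ns), and the function name is my own illustration.

```python
# Minimal sketch of the setup-check equation above. Hypothetical ns values.
def setup_slack(t_cycle, t_launch_clk, t_clk2q, t_data, t_capture_clk,
                t_setup, t_uncertainty):
    arrival  = t_launch_clk + t_clk2q + t_data                    # LHS
    required = t_cycle + t_capture_clk - t_setup - t_uncertainty  # RHS
    return required - arrival  # positive slack means the setup check is met

slack = setup_slack(t_cycle=1.0, t_launch_clk=0.30, t_clk2q=0.10,
                    t_data=0.55, t_capture_clk=0.28, t_setup=0.05,
                    t_uncertainty=0.05)
# Minimum workable period = current period minus slack; max frequency follows.
t_min = 1.0 - slack
print(round(slack, 3), round(t_min, 3), round(1.0 / t_min, 3))  # 0.23 0.77 1.299
```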
The startpoint stands for the launch flop; notice that the flop is driven by CLK_A.
By default, this path belongs to the path group of the capture clock, in this case CLK_B.
“Point” means the following lines are the timing points along the path.
“cap” is the total capacitance seen by the driving pin, which is the sum of the wire capacitance and the load pin capacitance.
“derate” is a scaling factor for modeling any systematic margin, such as the on-chip variation effect.
“incr” is the propagation delay of the wire or the cell. The value on the line of input pin stands for
the wire delay to the input pin. The value on the line of output pin stands for the cell propagation
delay of the driving cell.
“voltage” is the voltage condition for this path; it should correspond to the PVT corner the analysis is done at.
“H” (letter H): the delay value is a hybrid annotation from multiple sources, such as a wire load model and SDF.
“*” (asterisk): the wire/cell delay value is directly annotated from a Standard Delay Format (SDF) file.
“&” (ampersand): the RC parasitics are back-annotated from a Standard Parasitic Exchange Format (SPEF) file.
If there is no symbol, the delay is estimated by a wire load model.
The path launches from time stamp 0 and is captured at 0.80 ns. Since we haven't applied any multicycle constraints on this path, this tells us the clock cycle is 800 ps.
The actual arrival time (path delay) of the launch path reaches 0.97 ns at the data pin of the capture flop, while the capture path required time is only 0.77 ns.
This means the launch path is too slow, so the data pin could still be changing values when the capture clock is sampling the data. Thus this path is violating the setup check requirement.
This slide lists several categories of common timing fix techniques. I am going to briefly introduce the most commonly used techniques here. More detailed explanations and examples will be provided in another, advanced STA course.
The first category fixes the setup violation by speeding up the cell delay. This can be done through cell adjustments.
The easiest way is to swap a high-VT cell for a low-VT one, since the drain current is much higher in a low-VT device, so the charge/discharge can happen faster. But the low-VT cell also consumes a lot more power.
This is a preferred way if you don’t want to disturb routing around the cell, especially in ECO phase.
We also swap low-drive-strength cells in critical positions for high-drive-strength ones. The cell size could be bigger, so it is a disturbance to placement and routing. A bigger cell also has a larger pin capacitance, so it will increase the total capacitance seen by the previous driver. We can use slew degradation to determine whether a cell is a good candidate for this type of swap. That is, if the cell's output transition is worse than its input transition, then the cell has not reshaped the waveform as expected, so we can size it up.
Physical implementation tools, such as logic synthesis or place and route tools, tend to insert chains of buffers into the design for various reasons. In many situations, there are more buffers than we actually need. This can contribute a lot of unnecessary cell delay to the total path delay. If a cell along the path has a large fanout, that usually results in a large load for the driver, so we can add a buffer to share part of the load and reduce the total capacitance seen by the original driver. If only a few endpoints of a large fanout cone have violations while the other endpoints have plenty of positive slack, we can create a dedicated buffer for the failing endpoints so the critical path is isolated and sees much less load on the net.
Place and route tools sometimes do a bad job routing critical nets. If we know a certain net needs to travel a long distance and its timing is critical, we can set net routing layer constraints to make the tool route that net on higher metal layers. Many of these wire routing improvements need control at the place and route stage, which will be covered in a later PnR course.
Logic manipulation can be interesting but dangerous as well. For example, we know that for a logic gate, the propagation delay from the input closer to the output is usually shorter than from the input farther from the output. If we have a signal on the critical path, we can put it on the input closer to the output. Logic replication is also an often-used technique; the idea is very similar to splitting the load with a buffer tree. The driver logic gate can be cloned to take some of the load away. For each logic change we make, we need to make sure it won't create a logic equivalence issue.
Lastly, clock tree manipulation is also a widely used fixing technique. If there are a lot of paths from the same clock source violating timing, we can try tying the downstream clock tree elements closer to the upstream driver to speed up the clock tree, or add more clock buffers to increase the clock insertion delay. Understand that a reduction in clock insertion delay can speed up the launch path of the current stage, but it also reduces the capture path delay of the previous stage, so it may create hold violations while fixing setup. The designer needs to check both stages to see whether there is enough positive margin on the other side after the change is done.
To meet the hold timing check, the following delay relation must be met.
To make the requirement more restrictive, the clock uncertainty is added to the RHS of the equation.
As we can see from the above equation, the hold check is independent of the clock period. This is because the hold check is performed on the same edge of the clock waveform.
Thus, we can say the hold check is more critical than the setup check: if we violate setup, the chip will still work at a slower frequency.
But if we violate hold, the chip won't work under any condition, which guarantees a functional failure.
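The hold relation can be sketched the same way as the setup relation; note that no clock period term appears, matching the point above. All values are hypothetical (ns) and the function name is my own.

```python
# Sketch of the hold check: the launch-side arrival must exceed the
# capture clock arrival plus the hold time plus the clock uncertainty
# (uncertainty is added to the RHS to make the check more restrictive).
def hold_slack(t_launch_clk, t_clk2q, t_data, t_capture_clk,
               t_hold, t_uncertainty):
    arrival  = t_launch_clk + t_clk2q + t_data
    required = t_capture_clk + t_hold + t_uncertainty
    return arrival - required  # positive slack means the hold check is met

print(round(hold_slack(0.30, 0.10, 0.05, 0.32, 0.03, 0.05), 3))  # 0.05, met
```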
The hold relations are determined according to the setup relation. There are two hold check scenarios:
1) Data from the source clock edge that follows the setup launch edge must not be captured by the setup latch edge. This one is depicted in the left-hand-side picture.
2) Data from the setup launch edge must not be captured by the destination clock edge that precedes the setup latch edge. This one is depicted in the right-hand-side picture.
Every time there is a hold check, the STA tool will choose the worse of the two scenarios. In this example, the two scenarios are essentially equal, but we will see cases where they are not equal in later slides.
This is because the setup check has to use the longest timing path while the hold check has to use the shortest timing path.
The path is launched at 0 ns and captured at 0 ns as well, indicating this is a same-cycle check. In this example, since the data arrival time is larger than the library hold time, the hold requirement has been met.
So where is the optimal location to add the hold buffer? Generally speaking, we can follow this order:
1. Find the timing paths with the worst hold slack across all PVT corners.
2. Choose the pins with the maximum number of violating paths going through them (bottlenecks) as fixing candidates, to minimize the number of buffers inserted.
3. Exclude pins with bad setup margin or negative setup slack; choose the ones with good setup slack to avoid setup/hold conflicts.
4. If other conditions are equal, it is preferable to fix at load pins rather than driver pins, because adding a delay cell at the load pin does not disturb the driving cell and wire delay too much and is more predictable.
If the path originating from the lower launch flop is timing critical, that means there is a setup and hold conflict beyond location #2. That's when we choose location #3 as the fixing point; otherwise, location #2 can also be used as an option.
[p130] Delay Calculation for Timing Path
In order to cover the worst corner case and guard-band the design, when the STA tool calculates the launch path delay, it uses the slowest possible delay a cell could have under the current operating condition; conversely, it uses the fastest cell delay values for the capture clock path. There are multiple sources of delay variation for a cell.
First comes the process variation mentioned in an earlier chapter. The same type of cell can have different delay values due to oxide thickness variation, voltage threshold variation, and channel length variation.
Secondly, cells in different locations can see different supply rail voltages and temperatures. The IR drop effect can depend on a couple of factors, such as how fast the cell is switching, how dense the power consumption is in that region, how resistive the supply net is, and how much decap exists nearby.
Thirdly, the variation can come from coupling with nearby nets. The cell transition can be slower or faster depending on the switching direction of the neighboring net. If the neighboring net switches in the same direction as the net connected to the cell, then the transition of the net is sped up; if it switches in the opposite direction, the net's transition slows down.
Static variation means the variation does not change during the time period of the timing check. For example, process variation is a source of static variation for both the setup check and the hold check, but crosstalk noise can be a source of static variation only for the hold check.
These early/late arrivals of the clock edge are introduced mainly for two reasons:
Most of the time, the design works on only one of the clock paths, so there shouldn't be different arrival times. The re-convergence point usually has selection logic to control which clock path is active. However, without case analysis, STA will propagate both clock paths and use them separately in the timing checks.
For max timing path checks, it will choose the longer path for launch and the shorter path for capture. Vice versa for min checks: the shorter path for launch and the longer path for capture.
One thing to notice is that the launch clock path can share a portion with the capture clock path. The way clock trees are built today creates scenarios where one clock signal travels through a few clock cells upstream but fans out to many endpoints downstream. So flops starting from the same clock tree branch do have some common portion along their clock paths.
For example, during max timing checks, the common path is assigned a large delay value during launch path calculation and a smaller value during capture path calculation.
But we know that the same cell cannot have two different delay values at the same time, so the delay difference on the common path doesn't exist in the real world. It is just artificial pessimism that needs to be removed from the timing check equations.
However, not all sources of delay variation on the common clock path can be removed. CRPR mainly refers to the removal of the process variation difference. The other two main sources, namely IR drop and crosstalk, need to be handled carefully.
For a max timing check like the setup check, since it spans two different clock cycles, the check happens at two time stamps, so the switching activity could be very different even for the same cell. Thus, we cannot simply remove the supply voltage variation and crosstalk effects blindly.
For a min timing check like the hold check, it is performed on the same clock edge, which means the check happens at a single time stamp, so the switching activity is exactly the same. Then the supply voltage variation and crosstalk effects can be taken out as well.
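The pessimism-removal idea can be sketched numerically: the credit given back is the max/min delay difference accumulated on the clock path segment shared by launch and capture. This is a simplified illustration with made-up values (ns), not how any particular tool implements CRPR.

```python
# Simplified sketch of clock reconvergence pessimism removal (CRPR):
# sum the max-minus-min delay difference over the shared clock cells
# and give that credit back to the raw slack. Hypothetical values (ns).
def crpr_credit(common_cells):
    # common_cells: list of (max_delay, min_delay) per shared clock cell
    return sum(dmax - dmin for dmax, dmin in common_cells)

common = [(0.12, 0.10), (0.08, 0.07)]  # two shared clock buffers
raw_setup_slack = -0.02                # slack before pessimism removal
print(round(raw_setup_slack + crpr_credit(common), 3))  # 0.01, now passing
```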
There are several choices for placing a clock gate: generally, it can be placed close to the clock source, near the leaf cells, or somewhere in between. In the first case, if the clock gate is placed near the source, what happens? Well, the clock is gated from the trunk, so fewer clock gates are enough to shut off a large area of flops in the design, and a large portion of the clock tree is gated off. This gives great power savings when the functional feature is not in use. But from the timing perspective, the clock gates are upstream, which means the clock tree diverges early, high up in the tree. The common path to the leaf cells is shorter, so cells in different clock subtrees can see larger source latency variation, since the common path pessimism credit is minimal.
[p135] (cont’d)
On the contrary, if the clock gates are placed near the leaf cells, more clock gates are needed to gate off the same number of flops as in the first case. Also, since the clock gates are near the leaf end, only a smaller portion of the clock tree is shut off. The power savings will not be as good as in the first scenario. However, from the timing perspective, since the clock tree diverges only when it reaches the leaves, a large portion of the clock tree is common to the downstream flops. Source latency variation on the clock paths of these flops will be less than in the first case, since the common path pessimism is at its maximum and can give timing credit back.
The time required before the active clock edge is called the recovery time, and the time required after the active clock edge is called the removal time. Similar to setup and hold checks, STA will use the max cell delay in the launch path and the min delay in the capture path for the recovery check, and the min cell delay in the launch path and the max delay in the capture path for the removal check.
The SDC command set_multicycle_path -setup can be used to either move the capture clock edge forward or move the launch clock edge backward by a specified number of cycles from the default check edge.
By default, for the setup check, the STA tool will move the check using the capture clock period, unless you explicitly force the tool to use the launch clock period with the –start option. In this case, since the launch clock runs at the same frequency as the capture clock, it makes no difference; later in the course we will see different behavior when the path crosses from a slow to a fast clock domain or from a fast to a slow clock domain.
[p139] (cont’d)
After the setup capture edge has moved N cycles, there are two scenarios for the corresponding hold checks. Case number one: the data launched from the next following cycle must not overwrite the data on the current setup capture edge. This case is shown on the left-hand side.
Case number two: the data launched from the current setup launch edge must not overwrite the data on the next following capture edge. This case is shown on the right-hand side.
The STA tool will pick the worst case of the two. Where the launch clock and capture clock run at the same frequency, they are the same. But for a slow/fast or fast/slow clock crossing path, the situation will be different.
[p140] (cont’d)
Keep in mind that since this is a multicycle setup path of N, the data shouldn't change for at least N cycles. Checking hold at the next cycle is no different from checking hold at cycle N, so we don't need to check hold for the most restrictive case. Actually, to maintain a valid data crossing, the hold check can be relaxed to a zero-cycle check. So the new hold edge is pulled back by a multicycle hold of N−1 cycles to align with the zero-cycle launch edge.
In this picture, the multicycle setup constraint pushes the capture edge 3 cycles away. The corresponding hold constraint pulls the hold checking edge back by 2 cycles, after which the hold checking edge is aligned with the launch edge in the same cycle.
From the waveform, we can tell that the new capture edge has been moved to the 5th cycle, which is 3 cycles away from the launch clock edge.
By default, STA is then going to find the most restrictive edge combination for the hold check. In this case it would launch from the 2nd cycle and capture at the 4th cycle.
But let's recall that the output value of this flop will only be used every 3 clock cycles. Since it is sampled on cycle 2 and cycle 5, there is no point in checking the value on cycle 3 or cycle 4. In other words, we don't care about the values in cycles 3 and 4 even if they get overwritten. So meeting the hold check requirement at the same cycle as the launch data is always good enough to ensure correct functionality. That's why we set set_multicycle_path –hold 2.
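The edge arithmetic for the same-frequency example above can be sketched as follows. The helper names and the cycle bookkeeping are my own illustration of the rule, not tool output: the default hold capture edge sits one cycle before the setup capture edge, and -hold H pulls it back H more cycles.

```python
# Sketch of same-frequency multicycle check edges (times in ns).
def setup_capture_edge(period, mcp_setup=1):
    # Default setup check captures at the very next edge (N = 1).
    return mcp_setup * period

def hold_capture_edge(period, mcp_setup=1, mcp_hold=0):
    # Default hold capture is one cycle before the setup capture edge;
    # set_multicycle_path -hold H pulls it back H additional cycles.
    return (mcp_setup - 1 - mcp_hold) * period

P = 1.0  # hypothetical clock period
print(setup_capture_edge(P))                          # 1.0: default setup edge
print(hold_capture_edge(P))                           # 0.0: default same-edge hold
print(setup_capture_edge(P, mcp_setup=3))             # 3.0: -setup 3
print(hold_capture_edge(P, mcp_setup=3))              # 2.0: restrictive if not relaxed
print(hold_capture_edge(P, mcp_setup=3, mcp_hold=2))  # 0.0: -hold 2 realigns with launch
```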
By default, PrimeTime assumes that data launched at a path startpoint is captured at the path
endpoint by the very next occurrence of a clock edge at the endpoint. For paths that are not
intended to operate in this manner, you need to specify a timing exception. Otherwise, the timing
analysis does not match the behavior of the real circuit.
There are three types of timing exceptions: 1) false paths, 2) min/max delays, and 3) multicycle paths.
A false path is a logic path that exists but should not be analyzed for timing. Declaring a
path to be false removes all timing constraints from the path. STA does not report it as a
violation no matter how long or short the path delay is.
Min/max delay constraints override the default maximum or minimum time with your own
specific time value. By default, PrimeTime calculates the maximum and minimum path
delays by considering the clock edge times. Using set_min_delay or set_max_delay forces
the tool to ignore the clock relationship.
A multicycle path needs to be specified when more than one clock cycle is required to
propagate data from the start of a path to the end of the path. It relaxes the default
timing check behavior of the STA tool.
If a timing exception is specified on a particular set of pins, the tool needs to keep track of exceptions on registers, pins, and nets. This makes the command less efficient. To specify timing exceptions efficiently and reduce analysis/run time, follow the methods below:
Before using false paths, consider using case analysis (set_case_analysis), declaring an exclusive relationship between clocks (set_clock_groups), or disabling analysis of part of the design (set_disable_timing). These alternatives can be more efficient than using the set_false_path command.
If false paths must be used, avoid specifying a large number of paths using the -through argument, using wildcards, or by listing the paths one at a time.
[p143] (cont’d)
There are some rules for applying the timing exceptions.
In case of conflicting exceptions for a particular path, the timing exception types have the
following order of priority, from highest to lowest:
set_false_path
That means if there are set_false_path and set_multicycle_path constraints set on the same pair of startpoint and endpoint, set_false_path takes precedence. The path becomes totally excluded from timing verification.
For the same type of constraint, the more restrictive one wins and overrides the less restrictive one.
For example, if we have a multicycle 5 and a multicycle 3 constraint working on the same path, then since multicycle 3 is more restrictive than 5, the path is constrained as a three-cycle multicycle path.
If two constraints work on the same path, the constraint with the more specific condition wins. For example, if we have applied a global multicycle constraint from the clock A domain to the clock B domain, but we also have a second constraint specifying a different multicycle number through a particular pin between the two clock domains, then the one with the -through pin takes priority.
It reports a list of the top cells with the highest bottleneck cost, where the bottleneck cost is defined as the number of violating paths through the cell. This means that if these cells had better delay, the path delay on a lot of timing paths could be improved. Addressing these cells first provides the quickest way to cut down the TNS of a design.
For example, in this picture, the NAND gate in the middle is on all four paths between the two sets of startpoints and endpoints. If these four paths have timing issues, fixing this common gate can benefit all of them. The report_bottleneck command can be used to create a sorted list of the common gates that should be analyzed first.
Secondly, the STA engine will find the most restrictive launch and capture edge combination to guard-band the design.
In case one clock's period is an integer multiple of the other's, STA only needs to expand the fast clock to align with the slow clock.
Then it looks for the most restrictive setup and hold check edges.
By default, STA treats every two clocks as synchronous, even when they are not. So one thing to note is that if you miss some timing exceptions between two asynchronous clocks, STA will still expand those clocks and come up with a very large and strange minimum base period. If you see this, most of the time it means you are missing a set_false_path or set_clock_groups command to disable those false timing paths.
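The expansion idea can be sketched as a least-common-multiple computation over the two clock periods; this is my own simplification of what a tool does internally, with hypothetical picosecond values.

```python
# Sketch: the common base period over which two clocks' edges realign
# is the least common multiple of their periods (values in ps).
from math import gcd

def common_base_period(p1_ps, p2_ps):
    return p1_ps * p2_ps // gcd(p1_ps, p2_ps)

print(common_base_period(800, 200))   # 800: integer ratio, expand the fast clock only
print(common_base_period(700, 300))   # 2100: non-integer ratio needs more expansion
print(common_base_period(999, 1000))  # 999000: near-identical "async" periods blow up
```

The last case illustrates the symptom described above: two unrelated clocks with no missing exception handling produce a huge base period.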
Let's assume a timing path travels from CLK_B to CLK_A. CLK_B is a slow clock whose period is 4 times the period of CLK_A. From what we have learned in previous slides, the STA tool first expands CLK_A 4 times to align CLK_A with CLK_B. By default, the most restrictive launch and capture pair for the setup check is from the first CLK_B launch edge to the second CLK_A capture edge. The most restrictive launch and capture pair for the hold check is from the first CLK_B edge to the first CLK_A edge.
[p151] (cont’d)
In this case, since the launch clock and capture clock have different data rates, the value from CLK_B may not be needed by CLK_A in the very next cycle. Ideally, the launch flop needs to drive the signal steadily for at least N cycles before it gets sampled. Thus, it is quite possible to relax the setup requirement and give more time to the data transmission.
Say we want to sample the data every 3 cycles; we can use set_multicycle_path –setup 3 –from CLK_B –to CLK_A. Then the setup capture edge is where we want it, but the hold capture edge has also moved along with it.
[p152] (cont’d)
The first thing to do with the hold check is to determine the right launching and capturing edges. Given that the setup multicycle is 3 cycles, we have two possibilities for the hold check.
Case #1: data launched from the source clock edge that follows the current setup launch edge must not be captured by the current setup capture edge.
Case #2: data launched from the current setup launch edge must not be captured by the destination clock edge that precedes the setup capture edge. This is more restrictive than case #1 for a slow-to-fast clock crossing.
[p153] (cont’d)
But since the driver is stable for at least 3 cycles, we don't need to enforce such a stringent hold requirement. In other words, we don't care if the value captured by CLK_A in cycle #2 or cycle #3 is overwritten by data launched from CLK_B in cycle #1.
We have to use set_multicycle_path –hold 2 –end to move the hold capture edge back to the first clock edge so the hold check remains a zero-cycle check.
Note there is a -end option in the hold MCP specification. –end means to use the endpoint clock as the reference when moving the edge. In this case, since the endpoint clock is CLK_A, this tells the STA tool to move the hold capture edge backward by 2 CLK_A edges.
Note that -end is the default for a multicycle setup constraint and -start is the default for a multicycle hold constraint.
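The slow-to-fast edge bookkeeping above can be sketched with the same cycle arithmetic used earlier; with -end, the edges move in capture-clock (CLK_A) periods. This helper is my own illustration, not tool behavior.

```python
# Sketch of slow-to-fast crossing check edges: CLK_B launches, the fast
# CLK_A captures, and -end moves edges by CLK_A periods. Hypothetical ns.
def sf_edges(cap_period, mcp_setup, mcp_hold):
    setup_capture = mcp_setup * cap_period                   # -setup N -end
    hold_capture = (mcp_setup - 1 - mcp_hold) * cap_period   # -hold H -end
    return setup_capture, hold_capture

print(sf_edges(cap_period=1.0, mcp_setup=1, mcp_hold=0))  # (1.0, 0.0) default
print(sf_edges(cap_period=1.0, mcp_setup=3, mcp_hold=2))  # (3.0, 0.0) relaxed setup, zero-cycle hold
```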
[p154] (cont’d)
Here is an example timing report for a slow-to-fast clock crossing. The clocks in discussion are clock M and clock P. Clock M has a period of 20 ns and clock P has a period of 5 ns. On the left-hand side, the setup path launches from the rising edge of clock M at time 0 to the rising edge of clock P at time 5 ns, which indicates this is a path without any multicycle relaxation. The right-hand side is the corresponding hold check timing report. The data launched from clock M at time 0 is captured by clock P at time 0. You can try applying some multicycle paths to both the setup and hold sides and think about how the timing reports would look.
Since the path travels from CLK_A to CLK_B, we use –start to indicate that the reference clock for moving the edge is CLK_A. From the above two examples, we can tell that the edge movement number is usually specified using the clock period of the faster clock, because this gives finer timing granularity to model the design's function. If we set up the MCP relative to the slow clock, the path will be too relaxed to reflect what the timing requirement really needs to be.
[p157] (cont’d)
For the corresponding hold check, we can find the worse of the two possibilities shown in the two pictures below, using the same method discussed previously. The conclusion is that for a fast-to-slow clock crossing, the worse hold scenario is that data launched from the source clock edge that follows the setup launch edge must not be captured by the current setup capture edge.
[p158] (cont’d)
Now, since the data will be held stable for 2 cycles, we can relax the hold requirement by moving the hold check edge one cycle later. In other words, we don't care if the value launched from CLK_A in cycle #4 overwrites data launched in cycle #3.
We can use set_multicycle_path –hold 1 -from CLK_A -to CLK_B. Note that the hold check moves the launch clock by default, so we don't need to specify the –start switch explicitly.
The new hold launch edge is now aligned with the capture edge, ensuring that data launched from 0 ns does not override the previously captured value.
[p159] (cont’d)
Here is an example of setup and hold check timing reports when there is no multicycle path applied between clock M and clock P. On the left side, the setup check happens between 15 ns and 20 ns, which indicates it launches from the 4th launch clock edge and is captured by the 2nd capture clock edge. The corresponding hold check is still a zero-cycle check. You can work out some multicycle constraints and see how the timing reports would change.
As I mentioned in the introduction of this course, this cannot be verified by the STA engine, since the datapath behavior is no longer deterministic.
This is usually called a clock domain crossing problem. If the clocks don't share a phase relationship, the arrivals of the launch clock and capture clock edges will not be deterministic relative to each other.
This means the setup and hold timing relationship could vary in every cycle. This can easily cause metastability, which needs to be resolved using synchronizers.
There are two typical scenarios. The first is when we need to transfer a single bit or only a few bits across two clock domains. In this case, the simplest way to avoid metastability is to place two flops clocked by the destination domain in series. The data captured by the first flop may go into a metastable state, but it has a very large chance to resolve the metastability and settle into a stable value after a while, usually within one clock cycle. Then, in the next cycle, the second synchronizer flop will see a steady input. If the metastability is still not resolved by the first flop, we can add more flops in series to reduce the chance of metastability at the cost of additional latency.
The second scenario is when we need to transfer a high volume of data between two clock domains. A handshaking protocol or an asynchronous FIFO structure is usually used in this circumstance.
For example, a simple handshaking mechanism is shown on the RHS. Domain 1 first puts a high volume of data onto the data bus, then sends a request signal to domain 2. The request signal can be synchronized using the 2-flop synchronizer. Once domain 2 receives this request, it stores the value on the data bus into local flops. Then domain 2 issues an acknowledge signal back to domain 1. The acknowledge signal can also be synchronized by a 2-flop synchronizer. After domain 1 receives the acknowledge signal, it can change the value on the data bus to send the next word.
The disadvantage of handshaking is that the time to synchronize the req and ack signals for each word adds up into the total latency of the data transfer. The FIFO technique can transfer a high volume of data while maintaining low latency. However, the FIFO structure is a bit complex and out of the scope of this course, so we will not talk about it here.
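The "add more flops to reduce the chance of metastability" trade-off is often quantified with the standard MTBF estimate, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where tau and T_w are process-dependent flip-flop parameters. The sketch below uses entirely made-up parameter values; only the exponential trend is the point.

```python
# Hedged sketch of the classic synchronizer MTBF formula. tau (resolution
# time constant) and t_window (metastability window) are flop/process
# parameters; every number here is hypothetical.
from math import exp

def mtbf_seconds(t_resolve, tau, t_window, f_clk, f_data):
    return exp(t_resolve / tau) / (t_window * f_clk * f_data)

# Two flops give roughly one destination cycle (10 ns) to resolve;
# a third flop adds another cycle (20 ns total) and multiplies the
# MTBF by exp(cycle / tau).
two_flop = mtbf_seconds(10e-9, 0.2e-9, 100e-12, 100e6, 10e6)
three_flop = mtbf_seconds(20e-9, 0.2e-9, 100e-12, 100e6, 10e6)
print(three_flop > two_flop)  # True: each extra stage grows MTBF exponentially
```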
[p162] (cont’d)
The main issue with the system synchronous clocking scheme is that the clock uncertainty needs to be controlled reasonably well so the max timing requirement can be met. This is usually done in the clock tree synthesis stage during place and route. However, if we send the clock along with the data signal, it can reduce the max timing pressure, since the clock propagates in the same direction as the data. So it can potentially be used in faster clock designs. Because the clock nets and data signals are sent together and routed together, the clock path is exposed to the same on-chip variation sources as the data path.
Besides, this scheme can be used to transfer data between asynchronous clock domains, where synchronizers are still needed to move data from the source clock domain to the destination clock domain.
Note that if the data arrives after the clock open edge, the cell propagation delay is measured from the D pin to the Q pin. The latch just behaves like a buffer, and there is no clock uncertainty penalty across the latch.
If the data arrives before the clock open edge, the data needs to wait for the open edge, so the cell propagation delay is measured from the CLK pin to the Q pin.
1) When the data arrives before the clock open edge, the latch behaves just like a normal flip-flop. The path slack is positive.
2) If the data arrives during the transparency window, the data can still be captured correctly at the cost of eating into the transparency window. We say the path borrows time from the capture clock. The more it borrows, the less time is left for the next cycle to complete its operation.
3) If the data arrives after the closing edge of the latch, then the datapath is too slow; there is no way for the capturing latch to get the correct data value from the launch clock. The entire high phase has been borrowed by the current cycle.
From this explanation, the ultimate goal of using a latch is for it to work in transparent mode, but without borrowing so much that the next cycle fails.
We can see that this is in total a 1.5-cycle path, where the first segment is a full-cycle path between the first flop and the latch in the middle.
The second segment is between the latch and the last flop, and it is a half-cycle path since the capture flop is clocked by the inverted version of CLK_A.
Since the first segment contains more logic gates, let's assume it has a larger path delay, while the second segment is a very short path with minimal delay.
If the element in the middle were a hard-edged flop, the first segment would probably fail the setup check while the second segment would pass setup with plenty of positive margin.
Because it is a latch, the first segment can "borrow" into the latch and use the positive margin on the other side without creating any timing violation.
In the time borrowing information section, we can see the maximum time a latch is allowed to borrow is determined by the half phase of the clock period minus the library setup time of the latch, which is 0.33ns here.
In this case, we only need 0.11ns to make the path meet the setup requirement, so the actual time borrowed is 0.11ns. In the end, since the endpoint latch is in its transparent phase now, it behaves just like a buffer, so the STA tool gives the clock uncertainty deduction back as credit; the real time borrowed from the endpoint latch is only 60ps.
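If the methodology wants to keep part of the transparency window in reserve, the amount of borrowing can also be capped explicitly with the standard set_max_time_borrow command. A sketch (the pin name and value are placeholders, not from the report above):

```tcl
# Allow at most 0.2 ns of borrowing at this latch data pin,
# even if the transparency window would permit 0.33 ns.
set_max_time_borrow 0.2 [get_pins ULAT/D]
```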
[p166] Topic 27: De-skew / Lockup Latch
A de-skew latch, or lock-up latch, is an often-used technique for hold protection. It can be extremely useful in situations like slow-clock-domain transmission or communication between Intellectual Property blocks. Say there is a path as shown on the left-hand side. Notice there is a clock skew between the capture clock and the launch clock, and in this case the capture clock arrives much later than the launch clock. This makes the hold check harder to meet: the data transition must be slow enough to arrive later than the clock capture edge, which carries a large clock skew. To fix this hold violation, we can throw in buffers to delay the data path, but if the clock skew is too large, too many buffers would be needed, which is bad for power and area.
On the right-hand side, a de-skew latch is added onto the data path. Note that it is an active-low latch: it is transparent during the normal clock low phase, allowing data to pass through, but stays closed in the high phase. So any data transition from the launching flop has to wait for the latch to open. This essentially adds half a cycle of delay onto the datapath and greatly helps satisfy the hold requirement even if the clock skew is large.
If we analyze the logic carefully, it's not hard to see that we can move the enable from each individual datapath onto the clock path. This way all those mux structures can be eliminated and the area overhead is gone. The clock also stops toggling whenever the enable is off, which saves dynamic power.
Whether it is an architectural clock gate or an inferred clock gate, it must satisfy certain timing requirements in order to generate a good-quality clock for downstream logic. These timing requirements are arrival-time constraints between the enable signal and the free-running clock. For example, in the RHS picture, if the enable signal changes value during the low phase of the clock, the gated clock will be a clean new clock afterwards. But if the enable signal changes value during the high phase of the clock, there will be an extra pulse, a glitch. This extra pulse may cause unwanted reactions in the design and even functional failure.
If the output of the gating cell is enabled by a high-level control signal, as with an AND or NAND gate, we call it an active-high clock gating check.
The active-high clock gating setup check requires that the gating signal change before the clock goes high.
The active-high clock gating hold check requires that the gating signal change only after the falling edge of the clock.
One can see that the hold requirement is quite large; this can be resolved by using a different type of launch flip-flop, say a negative-edge-triggered flip-flop, to generate the gating signal.
The active-low clock gating setup check requires that the gating signal changes before the clock
goes low.
The active-low clock gating hold check requires that the gating signal changes only after the rising
edge of the clock.
Both active-low and active-high clock gating checks are basically saying that the enable signal should only change value when the clock level makes the gating cell inactive, so it won't cause unwanted extra pulses downstream.
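Most tools infer these checks automatically from the gate type, but they can also be declared or tightened explicitly with the standard set_clock_gating_check command. A sketch (values and the instance name are illustrative assumptions):

```tcl
# Require the enable to settle 0.1 ns before the clock's active level
# arrives and to stay stable 0.05 ns after it ends.
set_clock_gating_check -setup 0.1 -hold 0.05 [get_cells UCG0]
```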
However, if the enable signal itself has a glitch for some other reason, then traditional clock gating with a single gating cell could propagate this unwanted glitch downstream.
To prevent this, we normally use a glitch-free clock gating structure. Assuming the flops are rising-edge triggered, the enable signal now first goes into an active-low latch, and the output of this latch is used to control the gating cell.
If the glitch happens during the high phase of CLK_B, it will not be stored into the latch, since the latch is only transparent when CLK_B is low.
If the glitch happens during the low phase of CLK_B, it will pass through the latch, but can only change the gating cell input during the inactive phase of the clock signal.
So, by adding this latch, the final gated clock signal is glitch-free and only the useful clock pulses are left.
[p173] Data-to-Data Check
In cases where we need to monitor the arrival time of one signal with respect to another signal, a data-to-data check can be applied. The set_data_check command can be used between any two arbitrary data pins. Conceptually it is similar to a regular setup and hold check: one pin is the constrained pin, which acts like the data pin of a flip-flop; the other pin is the related pin, which acts like the clock pin of a flop. The main difference is that the data-to-data setup check is performed between the same edge of the launch and capture clocks.
For example, the data pin A of the AND gate is generated through a flop clocked by CLK_A followed by some combinational logic; pin B is generated through a flop clocked by CLK_B followed by another cone of combinational logic. Now if we want the data value on pin A to be held steady for some amount of time before and after the switching on pin B, a data check constraint can be applied as shown.
First, the A pin must be stable at least 0.2ns before the B pin switches. According to the definition, the B pin is like the clock pin and the A pin is like the data pin, so we can say set_data_check -setup 0.20 -from UAND/b -to UAND/a.
Second, the A pin must stay stable at least 0.1ns after the B pin changes value. So we can derive set_data_check -hold 0.10 -from UAND/b -to UAND/a.
By default, since the setup check is performed on the same edge of both launch and capture clocks, the default hold-check launch edge will be one cycle ahead of the default hold capture edge. Depending on the design intention, we can either move the hold launch edge back by specifying a -1 hold multicycle constraint, or use another data setup check to constrain pin A and pin B in the reverse direction.
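Collecting the pieces of this example in one place as they would appear in a constraints file (pin names as in the slide, the -1 hold multicycle per the text; exact hold-edge behavior can vary slightly between tools):

```tcl
# Pin A (constrained) must be stable 0.2 ns before and 0.1 ns after
# pin B (related) switches.
set_data_check -setup 0.20 -from [get_pins UAND/b] -to [get_pins UAND/a]
set_data_check -hold  0.10 -from [get_pins UAND/b] -to [get_pins UAND/a]
# Pull the default hold launch edge back so the hold check is
# performed on the same edge as the setup check.
set_multicycle_path -hold -1 -to [get_pins UAND/a]
```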
[p174] (cont’d)
Here is a timing report for the data-to-data check. Note that it is from the A2 pin to the A1 pin, so the A2 pin is treated as the related pin and the A1 pin as the constrained pin. The data check setup time is 0.1ns from the command and is deducted from the data arrival time on A2. On the other hand, the data arrival time on A1 is calculated as normal and then compared with the adjusted A2 arrival time. The slack is the delta between the two arrival times.
Specifying this check is very simple. If we want to constrain all paths starting from the first flop to be within 2ns, we can say set_max_delay 2.0 -from UDFF1/Q.
If we want to constrain the path delay from point A to point B, we can say set_max_delay 1.0 -from A -to B.
We can also specify the minimum time required between two points. For example, if we want to make all timing paths ending at the second flop at least 1ns long, we can say set_min_delay 1.0 -to UDFF2/d.
Normally, in a synchronous design, the assumption is that the design is completely constrained with respect to clocks, so set_max_delay and set_min_delay are not recommended in most situations. These two commands are mostly used on asynchronous signals.
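The three examples above, written out as they would appear in a constraints file (object accessors added for clarity; as noted, these are sketches intended for asynchronous or special paths, not general use):

```tcl
set_max_delay 2.0 -from [get_pins UDFF1/Q]   ;# cap every path leaving this flop
set_max_delay 1.0 -from A -to B              ;# cap one point-to-point path
set_min_delay 1.0 -to [get_pins UDFF2/d]     ;# floor on every path into this flop
```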
[p176] (cont’d)
This is the timing report for the point-to-point delay check. For comparison, the normal timing report is listed on the left. The original timing path is along the arc from CLK to D. With the new set_max_delay constraint we added, the tool now checks the arrival time only up to the A pin of cell u6561.
[p180] Category
SI issues result in two primary failure modes: functional failure due to glitch on a steady signal or
timing failure due to delta delay on a switching signal.
Usually when we talk about crosstalk, the one causing the unwanted switching is called aggressor
while the other is called the victim.
Note that they can swap roles: if we upsize the victim net driver too much, the victim can become a new aggressor to the original aggressor net.
[p181] Glitch
First, let’s look at the glitch issue on the steady victim net. Glitch issue could lead to functional
failure by sending wrong value to the downstream logic.
In this case, assuming a rising transition appears on the aggressor net, the node voltage on the ground capacitance of the victim net will be charged up through the coupling capacitance.
But since the victim net is driven to a steady value, eventually the injected charge will be drained and the voltage level restored.
The magnitude of the glitch is determined by the coupling capacitance, the relative drive strengths of the aggressor and victim, and the ground capacitance of the victim.
The larger the coupling cap, the more charge is transferred to the ground cap, and thus the taller the glitch becomes.
The larger the ground cap, the more charge is needed to build up the voltage, so the glitch becomes shorter.
A strong driver on the aggressor can cause a taller glitch, while a strong driver on the victim net improves its immunity to noise.
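A first-order way to see these dependencies is the classic charge-sharing estimate for a weakly held victim (this ignores the victim driver's restoring current, so it is an upper bound rather than any tool's exact model):

```latex
V_{glitch} \;\approx\; V_{DD}\,\frac{C_{c}}{C_{c} + C_{g}}
```

A larger coupling cap $C_c$ raises the glitch, a larger ground cap $C_g$ lowers it, and a stronger victim driver pulls the real glitch further below this bound.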
If the glitch magnitude is large enough to be captured as a different logic value by the downstream
cell inputs, such as the clock pin or asynchronous set/reset pin, it could result in real functional
failure.
If the glitch is wide enough to propagate through downstream cells and reach a sequential
cell input, it could also cause functional failure.
So how to determine whether or not a glitch can be tolerated by a design? In today’s STA, we use
two types of noise margin to check against the glitch.
A glitch below the DC noise margin limit will not cause a logic value change in downstream logic and also cannot propagate through the fanout, no matter how large the pulse width is.
For example, as shown on the LHS, the fanout of the victim net is an inverter. The output of the inverter will remain low as long as the input voltage is higher than VIH, and will remain high as long as the input voltage is lower than VIL.
Instead of such a clean-cut noise threshold, the AC noise margin takes the glitch width into consideration and defines a safe zone of glitches.
The base delay calculation assumes that the driving cell provides all the necessary charge for a rail-to-rail transition of the total capacitance of the net, where Ctotal = Cground + Ccoupling.
[p187] (cont’d)
An aggressor switching in the opposite direction increases the amount of charge required from the driving cell of the victim net, and increases the delays of the driving cell and the interconnect of the victim net.
The charge required to change the voltage difference across the coupling capacitance from +V to -V effectively doubles the coupling capacitance relative to the baseline delay calculation.
An aggressor switching in the same direction reduces the amount of charge required from the driving cell; the delays of the driving cell and interconnect are also reduced.
Since there is then no voltage difference across the coupling cap, no charge is needed for it, so the coupling cap is effectively cancelled out of the base delay calculation.
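This is often summarized as a Miller-style switching factor k applied to the coupling capacitance (a common textbook approximation, not any specific tool's exact model):

```latex
C_{eff} \;=\; C_{g} + k\,C_{c},\qquad
k =
\begin{cases}
0 & \text{aggressor switches in the same direction}\\
1 & \text{aggressor quiet (base delay)}\\
2 & \text{aggressor switches in the opposite direction}
\end{cases}
```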
Recall what we have talked about: in GBA mode, each signal has an arrival time window that is propagated along the timing path. The arrival time window, or simply the timing window, bounds the earliest and latest possible switching times a signal could have at that node. The worst-case crosstalk is then calculated simply by adding up the crosstalk effect from each individual aggressor.
The aggressor nets that can switch within the arrival time window of a victim net are all assumed
to switch in a direction that maximizes pessimism, as follows:
For minimum-timing analysis, the aggressors are all assumed to switch in the same direction as
the victim, making the delay of the victim net as small as possible.
For maximum-timing analysis, the aggressors are all assumed to switch in the opposite direction
as the victim net, making the delay of the victim net as large as possible.
1. Routing improvement
Keep aggressors away from the victim net to reduce the coupling capacitance. Use non-default routing (NDR) rules such as double spacing, double width, or ground shielding.
2. Gate resizing
Upsize the victim driver or downsize the aggressor driver. Since downsizing usually hurts max timing paths, in practice only upsizing the victim driver is used. But if the victim is overly upsized, it could become a new aggressor.
3. Buffer insertion / net splitting
This is the most effective way to fix noise violations. If the buffer is correctly selected, the noise problem can be resolved without creating a new aggressor. Net splitting works on nets that have more than one fanout.
4. HVT cell swapping
HVT cells have a higher threshold voltage and thus a higher noise margin, so a small glitch is simply filtered out. Replace the original receiver cells on the victim net with HVT devices if you have positive timing margin.
5. Guard ring
Usually applied at a partition boundary or in the area surrounding sensitive circuitry. The guard ring essentially serves as a shield for the portion of the circuit being protected.
Global variation is also called die-to-die variation. Die-to-die variations have a variation radius larger than the die size, including within-wafer, wafer-to-wafer, lot-to-lot, and fab-to-fab variation. These variations affect all the circuits within a die equally. The die-to-die variation of a parameter can be viewed as the deviation of the die-averaged parameter mean from its process mean target.
On-chip variation, on the other hand, refers to the variations that occur between circuit elements on the same die. They can be grouped into systematic and random variations.
So far, all the delay calculations mentioned in our timing verification are deterministic. In traditional corner-based STA, library characterization for one particular corner uses a single data set to represent the delay value under a given condition. We know the PVT corner is usually used to model the global process variation, but how can we use a single delay value to model local variation all across the chip?
If the variations come from the same source of process variation, they are additive, and the standard deviations of the delays can be directly added up. These variations are called systematic variations. They are design dependent, such as layout proximity effects, CMP-related variations, IR drop, temperature maps, etc.
Paths comprised of cells in close proximity exhibit less variation relative to one another. This phenomenon is called the spatial effect.
The random, or statistical, components are related to variations associated with the processing equipment.
As the number of path stages increases, the probability of all gates along that path being simultaneously fast or slow decreases.
Random variation averages out over logic stages. This phenomenon is called statistical cancellation.
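The cancellation can be made concrete with a simple model of N independent, identical stages, each with mean delay $\mu$ and standard deviation $\sigma$ (an idealized sketch; real stages are neither identical nor fully independent):

```latex
\mu_{path} = N\mu,\qquad
\sigma_{path} = \sqrt{N}\,\sigma,\qquad
\frac{\sigma_{path}}{\mu_{path}} = \frac{1}{\sqrt{N}}\cdot\frac{\sigma}{\mu}
```

So the relative spread of a 16-stage path is only a quarter of that of a single stage, which is exactly why a flat derate is pessimistic for deep paths.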
[p195] Statistical STA (SSTA)
As a matter of fact, all these timing parameters such as cell delay, wire delay, timing window, path
slacks are statistical in nature.
For cell delay, manufacturing parameters such as channel length and threshold voltage vary due to global and local process variation. The delay through each timing arc can be represented by a mean and a standard deviation.
For wire delay, electrical properties such as metal thickness and dielectrics can vary for each metal layer. The delay through interconnect can likewise be represented by a mean and a standard deviation.
What's more, since the metallization of each layer is done separately, the wire variation of each layer can be independent of the others; even adjacent layers can have totally different directions of delay variation. This adds another layer of complexity.
Since both cell delay and wire delay are statistical in nature, signal arrival times are also not deterministic. Timing windows can likewise be modelled statistically when calculating crosstalk.
So now the path slack is also represented as a mean and a standard deviation. The pass/fail criterion can be determined based upon the required statistical confidence.
Even though SSTA captures the statistical nature of manufacturing, it is technically complex and needs variation extraction support. The biggest knock against SSTA is that the characterization database is large and run times can be long; large full-chip SSTA really isn't feasible today. In practice, ASIC designers have enhanced deterministic STA tools to take the on-chip variation effect into account rather than using SSTA directly.
Global derating is very conservative. It is unaware of spatial correlation, treating even cells adjacent to each other as if they could have very different variation. It also ignores statistical cancellation: paths are derated exactly the same across the board no matter how deep the logic is. The global OCV method is a safe approach that applies the worst-case variation across the entire chip, but it can result in overdesign, reduced design performance, excessive design margins, and longer timing closure cycles.
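In practice the global OCV margin is applied with the standard set_timing_derate command. A minimal sketch (the percentages are illustrative assumptions, not recommended values):

```tcl
# Flat global OCV: slow down late paths and speed up early paths
# by a fixed margin, applied to every cell and net in the design.
set_timing_derate -late  1.08
set_timing_derate -early 0.92
```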
The AOCV method provides better accuracy and reduces pessimism on long paths (as well as reducing the risk of over-optimism on short paths). Of course, it comes with a cost: these tables have to be built, and if each cell in the library has to be characterized for, say, N path depths and M positions in the path, that's an N-by-M entry table for every cell, derived via SPICE.
Think of a personal computer: we can use it for regular tasks like web browsing, intensive jobs like overclocked gaming, an idle mode like sleep, or a debug mode when running a system diagnosis.
Similarly, an ASIC chip can have many different modes to run different tasks, and each mode requires its own clock configuration and timing constraints. The chip may run at different frequencies, or part of the design may be shut off while the other part remains on. A user could write the constraints for each mode individually, or write a set of constraints that are combined for multiple modes.
Meanwhile, even in the same mode, the chip could be running at different PVT corners depending on process variation, voltage supply variation, and temperature change.
We define the combination of one mode and one corner as a scenario. Since an STA tool can only analyze one scenario at a time, designers have to create a scenario for each functional mode. However, we can imagine that as designs become more and more complex with all kinds of functionality nowadays, the number of scenarios explodes. This increases the difficulty of timing closure dramatically, since after we close timing in one corner, new timing violations in another corner may pop up. So the design has to go through a long iterative process for final timing closure.
Therefore, people seek to merge some of the scenarios into one and create a super mode called the merged mode. How to merge different modes into a single mode is another big topic which will be covered in another course. While the merged-mode methodology can dramatically collapse many analysis modes together, a loss of information (and accuracy) must occur. That's because:
- A timing arc can only have a single set of min/max timing behaviors
- The only safe behavior is to keep the most pessimistic min/max timing across all modes, and then use that timing for every mode.
When each mode is created and analyzed separately, each operating mode has its own unique timing, and every analysis is accurate.
To better help you understand what merged mode means, let's see an example.
In the example shown, two different clock signals are present (CLK A and CLK B), selected by a SEL signal. This means the circuit operates in two modes: Mode 0 (when SEL equals 0 and CLK A drives the circuit) and Mode 1 (when SEL equals 1 and CLK B drives the circuit). Analyzed separately, case analysis is used to select the appropriate clock for each mode. Analyzed together, these modes can be combined. Combining the modes reduces the number of runs needed to get the same coverage. However, there are drawbacks to combining modes:
- Timing pessimism
- Increased memory/runtime
- Increased script complexity
When the modes are analyzed together, both clocks propagate into the network.
- The fast CLK A slew (red) is used for min-delay propagation.
- The slow CLK B slew (blue) is used for max-delay propagation.
Typically, CTS keeps tight control of clock slews. However, the non-constant mux select means the select's slew can also propagate into the clock network.
1) Natural to think of
2) Can be very runtime intensive, and designers may not get a meaningful result in a reasonable time
1) Would be pessimistic compared with MCMM, since it has to be conservative so as not to under-constrain any mode
2) Additional constraints (e.g. new generated clocks) may be needed to model the exclusive relation between merged clocks
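For the mux example, the two styles might look like this in SDC (clock and port names follow the slide; the generated-clock workaround mentioned above is one common option, not the only one):

```tcl
# --- Separate modes: one run per mode, clock selected by case analysis.
set_case_analysis 0 [get_ports SEL]   ;# Mode 0: only CLK_A propagates

# --- Merged mode: both clocks propagate; declare them exclusive so
#     no cross-clock paths are timed between them.
set_clock_groups -logically_exclusive \
    -group [get_clocks CLK_A] -group [get_clocks CLK_B]
```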
Chapter 8
Basically, each input port should have a driving cell specified, to let the STA tool know how much drive strength is driving the logic inside the design scope; each output port should have a load capacitance specified, so the tool has an estimate of how much load the design is going to drive.
Ideally, the driving cell and load capacitance are correlated from a top-level full-chip timing run, so we can see the real timing impact from a global perspective.
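A minimal sketch of such boundary constraints (the library cell name and load value are placeholders):

```tcl
# Model the external driver on each input and the external load on
# each output of the block.
set_driving_cell -lib_cell BUFX4 [all_inputs]
set_load 0.05 [all_outputs]   ;# in library capacitance units
```

Real scripts usually exclude the clock ports from the driving-cell list and give them their own transition constraints.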
Besides, any primary IO or block IO that is left unconstrained should have a valid reason.
In some cases, part of the clock tree from the top-level design passes through the design under analysis; we have to check that the clock network delay meets the requirement and try to minimize the clock latency if needed.
First, we need to make sure the extraction annotation quality is good, so we get accurate delay calculation results for the design. Then we need to make sure the general design rules are met: max transition, max capacitance, and clock pulse width rules. Next, the design should meet the performance targets, which are the setup, hold, recovery, and removal checks. Special timing checks like clock gating checks and data-to-data checks are the next items to look at. Signal integrity issues like glitch and crosstalk need to be addressed as well.
In the end, if some of the violations above have a valid reason not to be fixed, we need to document them and record them in a waiver file.
[p209] Recommendation