Real World FPGA Design
Ken Coffman
President, Bytech Services
Prentice Hall books are widely used by corporations and government agencies for training, marketing, and resale.
The publisher offers discounts on this book when ordered in bulk quantities. For more information, contact: Corporate
Sales Department, Prentice Hall PTR, One Lake Street, Upper Saddle River, NJ 07458 Phone: 800-382-3419;
Fax: 201-236-7141; email: corpsales@prenhall.com
Trademarks: Verilog is a trademark of Cadence Design Systems, Inc. OrCAD is a registered trademark of OrCAD
Systems Corporation. Silos III is a trademark of Simucad Inc. Altera is a trademark and service mark of the Altera
Corporation in the United States and other countries. MAX, FLEX, FLEX 10K, FLEX 8000, AHDL, MegaCore,
and Altera device part numbers are trademarks and/or service marks of Altera Corporation in the United States and
other countries. Xilinx is a registered trademark of Xilinx, Inc. Hardwire, LogiBLOX, VersaBlock, VersaRing are
trademarks of Xilinx, Inc. LeonardoSpectrum, LeonardoInsight, HDLInventor, FlowTabs, and Power Tabs are
trademarks of Exemplar Logic. All other product names mentioned herein are the trademarks of their respective
owners.
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in
writing from the publisher.
Materials based on or adapted from materials and text owned by Xilinx, Inc., courtesy of Xilinx, Inc. © Xilinx, Inc.
1995–1999. All rights reserved.
ISBN: 0-13-099851-6
Prentice Hall Modern Semiconductor Design Series
James R. Armstrong and F. Gail Gray
VHDL Design Representation and Synthesis
Jayaram Bhasker
A VHDL Primer, Third Edition
Mark D. Birnbaum
Essential Electronic Design Automation (EDA)
Eric Bogatin
Signal Integrity: Simplified
Douglas Brooks
Signal Integrity Issues and Printed Circuit Board Design
Alfred Crouch
Design-for-Test for Digital IC’s and Embedded Core Systems
Tom Granberg
Handbook of Digital Techniques for High-Speed Design
Howard Johnson and Martin Graham
High-Speed Digital Design: A Handbook of Black Magic
Howard Johnson and Martin Graham
High-Speed Signal Propagation: Advanced Black Magic
William K. Lam
Hardware Design Verification: Simulation and Formal Method-Based
Approaches
Farzad Nekoogar and Faranak Nekoogar
From ASICs to SOCs: A Practical Approach
Samir Palnitkar
Design Verification with e
David Pellerin and Scott Thibault
Practical FPGA Programming in C
Christopher T. Robertson
Printed Circuit Board Designer’s Reference: Basics
Chris Rowen
Engineering the Complex SOC
Wayne Wolf
FPGA-Based System Design
Wayne Wolf
Modern VLSI Design: System-on-Chip Design, Third Edition
Brian Young
Digital Signal Integrity: Modeling and Simulation with Interconnects and
Packages
Contents
Foreword: Notes on the Current State of the Art
Preface: Digital Design in the Real World
Resources
Bibliography
Index
The Real World of design was about to undergo a transformation for which my formal
education left me ill-prepared: the apparition of logic synthesis. Minimizing logic using
Karnaugh maps was being relegated to the electronic equivalent of the Stone Age. Selecting
JK or T flipflops to minimize decode logic was becoming just as relevant. The little green
plastic template I used to draw schematics in countless lab reports and final exams was
going to join the manual typewriter in the obsolescence paradise.
The skills that turned out to be the most useful I had not learned as part of my
engineering curriculum: typing (which my mother forced me to learn throughout high
school on our IBM Selectric typewriter) and computer programming (where I was self-
taught and still had more to learn). What my engineering background gave me was the
ability to learn new tricks and discern work patterns that could be rendered repeatable, then
later automated.
This book is all about what I learned through the hardware-design school of hard
knocks. Many mistakes could have been avoided, and many hours of mentoring eliminated,
if I had had such a textbook and heeded its advice. This book is not just about the Verilog
language, and that is its greatest contribution. There are already numerous books about the
details of the language. This book is about hardware design in the Real World, where
Verilog is simply the implementation tool. I hope that the next edition will feature VHDL as
well as Verilog: both are equally capable of (I would even say equally poor at)
implementing designs that will meet Real World constraints. This book is also unique in
describing in detail the entire FPGA design process: from HDL coding to verification to
synthesis to device selection to fitting and place-and-route. Too many books satisfy
themselves in showing only the HDL coding aspect.
The most important advice that this book gives is to understand what needs to be done
before you start coding. The biggest sin this book commits is in understating the verification
task: expect to spend 70% of your design time writing test fixtures and debugging the
function of your design before implementing it. Both of these points underscore the
importance of planning as well as investing as much effort as possible as early as possible in
the design process. In hardware design, progress is not measured by how far along in the
design process you are. Progress is measured by how close you are to producing working
hardware.
Today’s buzz is about IP and design reuse. This book can be considered to be about
design reuse: it is about excellent and safe design practices, not only for FPGAs but for
ASICs, too. Even though I have never worked with the author, I would feel confident in
reusing his designs in my own. They would be trustworthy. Design reuse is about creating
designs that are trustworthy in the Real World. This book should be mandatory reading for
every novice FPGA (and ASIC) designer.
Janick Bergeron
www.janick.bergeron.com
janick@bergeron.com
Preface: Digital Design in the Real World
The world of digital design is changing quickly. At a breathtaking rate, devices are becoming
faster, smaller, and denser.
Fifteen years ago the mainstream digital designer was manipulating a few thousand gates
using schematics with an occasional ABEL-HDL module tossed into the mix. Now we have
programmable devices with millions of real ASIC gates in tiny packages. On the horizon,
we see devices with many more millions of gates. It is not practical for the mainstream
designer to create systems on chips with schematics (how would you like to deal with a
1,000-page schematic?), so Hardware Description Languages like VHDL and Verilog have
come into their own. In spite of strong opinions on both sides of the fence (including my
own), the current designscape is bilingual—multilingual if you include the work of those
translating C code into hardware and the work of others on more advanced and hybrid
languages.
My own opinion of the fundamental reason for Verilog’s staying power is that Verilog had a very
large head start in [the] number of engineers who knew Verilog before VHDL really got out of
the blocks, and Verilog is easier to learn than VHDL. Thus, the established designers already
knew Verilog and had no reason to learn VHDL, and the new designers could pick it up easier
than they could pick up VHDL.
John Sanguinetti
C2 Design Automation
SURVIVAL SKILLS
Regardless of personal opinions, the practical designer will make sure that both VHDL and
Verilog skills are present on his or her resume. The current half-life of engineering
information is about four years and gets shorter every day. This means that half of what you
know today will be obsolete in four years. In order to survive, we weary designers have to
do two things:
1. Master the parts of our skill that are timeless. This includes physics (the analog
aspects of digital design, transmission-line theory, conservation of energy,
antenna theory, and power management) and design concepts like
synchronization, metastability, and propagation delay.
2. Keep up with the changing technology. Take advantage of free seminars, try to
read some of the tidal wave of trade magazines that pile up every month, buy as
many books as your Significant Other will tolerate, and pay close attention when
smart people are speaking.
The world of digital design is deeply divided. The elite 5%, the ASIC designers, use
hardware and software tools that cost hundreds of thousands to millions of dollars a year to
maintain. They earn their living creating specialized high-volume designs. If the FPGA
designer uses 50K gates, the ASIC designer uses 500K gates. If the FPGA designer is
accustomed to four nanoseconds of delay through a primitive, the ASIC designer is
accustomed to delays of less than a nanosecond. The ASIC designer is very careful,
methodical, and does extensive planning. Errors can cost hundreds of thousands of dollars in
silicon turns and schedule delays. The ASIC designer simulates, simulates, and then
simulates some more.
By contrast, we FPGA designers are sloppy and impatient. There is little or no cost to
experiment, so we program a part and try it. We use tools that are cheap or free on
Windows-based PCs or even embed the test equipment in FPGA logic. By comparison to
ASIC designers, we are a brutish and undisciplined mob, an unruly 95%. I have written this
book for those who would like to join me in this mob.
There’s also the human element—stress—to the reprogrammability equation. ASICs aren’t
reprogrammable; the foundry casts their functionality into silicon. Making the final decision to
commit a design to an ASIC can be extremely stressful for the entire design team. Once it
makes the final decision, the team can’t go back without incurring lots more NRE and lots more
time. Erring at this stage, thus, is definitely a Career-Limiting Move (CLM). FPGAs, on the other
hand, offer engineers a greater comfort zone midway through the project, giving them the ability
to go back and revise a design without paying the NRE and time penalties. Reprogrammability
alone may well be responsible for much of the success of the FPGA marketplace in the last
decade.
Rockland K. Awalt
“Making the ASIC/FPGA Decision”
Integrated System Design, July 1999
Reprinted by permission
This is an FPGA synthesis book. It will not make the reader into an ASIC designer,
though it does address issues associated with converting an FPGA design to an ASIC. This
book is for the newbie FPGA designer who wants a quick and dirty guide to creating FPGA
designs that actually stand a chance of surviving in the Real World.
I worked hard on this book, but it is not perfect. If you find an error or want to argue
about some of the points that are arguable (of which there are many), I look forward to
hearing from you.
Ken Coffman
Mount Vernon, Washington
kcoffman@sos.net
Acknowledgments
Many additional thanks to the folks who provided software and support: Dave Pfost,
John Bennett, Patrick Kane, Jeff Sanders, Don Matson of Xilinx, Tom Feist of Exemplar
Logic, Richard Jones of Simucad, and Dennis Reynolds and Dave Kresta of Model
Technology.
A special nod in the direction of William M. McDonald and Robert Craig (Coolbob)
Slater, RIP brothers.
Author’s Notes
flogging now, I thought. However, this email was from a bright and gentlemanly fellow
named Stephen Wasson formerly of MorphICs and HighGate Design. Stephen gave me one
of the nicest back-handed compliments I’ve ever received when he said: “Correcting the
errors in your book was, for me, a great introduction to Verilog”. For those who have found
the 1st printing errata at my website (www.bytechservices.com), you will see the long list of
errors that Stephen uncovered. I will be eternally grateful that Stephen was polite and
gracious in helping me debug my book. For a flavor of his commentary, here is the first
paragraph of his email: “Firstly, I’d like to thank you for the marvelously entertaining and
highly readable book. As a Verilog newbie, I found it an excellent introduction; as a 28-year
design veteran, I found it highly pragmatic; and as the author of a dozen articles, I’m
envious that your editors were such good sports to let you get away with such colorful
language.” I hereby elect Stephen to the Real World FPGA Design with Verilog Hall of
Fame.
Some folks have asked me if I can recommend a VHDL book similar to mine. The best
practical VHDL book I know is Essential VHDL RTL Synthesis Done Right by Sundar
Rajan. This book is not that easy to find; try http://www.vahana.com/vhdl.htm or email me
and I’ll see if I can find you a copy. I also highly recommend Writing Testbenches by Janick
Bergeron to add validation expertise to your skill set. For this book, surf over to
www.janick.bergeron.com.
I want to mention my long-suffering Office Manager (and wife) Judy who maintains
the website (corrections to errors in this printing will be found in an errata section at
www.bytechservices.com) and my infinitely-patient editor Bernard Goodwin who is still
waiting for me to finish my 2nd book. To all you folks: muchos gracious. Now let’s all get
back to work.
Ken Coffman
kcoffman@sos.net
Chapter 1: Verilog Design in the Real World
• are understandable to others who will work on the design later. We are moving
toward global 24/7 design activities.
• are logically correct. The design must actually implement the specified logic
correctly. The designer collects user specifications, device parameters, and
design entry rules, then creates a design that meets the needs of the end user.
• perform under worst-case conditions of temperature and process variation. As
devices age and are exposed to changes in temperature, the performance of the
circuit elements changes. Temperature changes can be self-generated or caused
SYNTHESIS
The translation of a high-level design description to target hardware. For the purposes of this
book, synthesis represents all the processes that convert Verilog code into a netlist that can be
implemented in hardware.
The job of the digital designer includes writing HDL code intended for synthesis. This
HDL code will be implemented in the target hardware and defines the operation of the
shippable product. The designer also writes code intended to stimulate and test the output of
the design. The designer writes code in a language that is easy for humans to understand.
This code must be translated by a compiler into a form appropriate for the final hardware
implementation.
WHY HDL?
There are other methods of creating a digital design, for example: using a schematic. A
schematic has some advantages: it’s easy to create a design more tailored to the FPGA, and a
more compact and faster design can be created. However, a schematic is not portable and
schematics become unmanageable when a design contains more than 10 or 20 sheets. For
large and portable designs, HDL is the best method.
The real limitation of schematic-based design is the lack of an industry standard. This is
tragic because there are different types of people: some (like me) think in terms of text and
some are more graphically oriented. The EDA industry has done a poor job of serving logic
designers who prefer to work with schematics.
As a contrast between a Verilog design found in other books and a Real World design,
consider the code fragments in Listings 1-1 and 1-2.
a <= b;
TRIVIAL OVERHEAT DETECTOR EXAMPLE
To illustrate the design process, let’s follow a trivial example from concept to delivery
and examine the issues that the designer confronts when implementing the design. Don’t
worry if the Verilog language elements are unfamiliar; they will be covered in detail later in
this chapter.
Sarah, the Engineering Manager, writes the following email to Sam, the digital designer:
To: sam@engineering
From: sarah@management
Subject: Hot Design Project.
First, Sam estimates the scope of the design. From experience, she determines that this
circuit is very similar to a design she did last year. She counts the gates of the previous
design, factors in the differences between the two designs, and decides the design is
approximately 20 gates. She considers the speed that the design must run at and any other
complicating factors she can think of, including the error she made in estimating complexity
of the previous design and the fact that she’s already purchased airline tickets for a week of
vacation. She knows that, overall, including design, test, integration, and documentation,
she can design 2000 gates a month without working significant overtime. She counts the
number of pins (the specification lists a pushbutton input, an overheat input, and an
overheat output, but Sam realizes that she’ll need to add at least a reset and clock input).
From the gate-count estimate and the pin estimate she can select a device. She picks a
device that has more pins than she needs because she knows the design will grow as
features are added. She picks an FPGA package from a family that has larger and faster
parts available so she is not stuck if she needs more logic or faster speed. Now she sends a
preliminary schedule and part selection to her boss and starts working on the design. Her
boss will thank her for her thorough work on the cost and schedule estimates, but will insist
that the job be done faster to be ready for an important trade show and cheaper to satisfy the
marketing department.
Keep in mind that your estimates will rarely be too high. Even when we know better,
engineers are eternally optimistic. Unless you are very smart and very lucky, your estimate
will not allow enough contingency to cover growth of the design (feature-creep), the hassles
associated with fitting a high-speed design into a part that is too small, and the other 1001
things that can go wrong. These estimating errors result in overtime hours and increased
project cost.
Now that Sam has taken care of the up-front project-related chores, she can start
working on the design. Sam recognizes that a simple flipflop circuit will perform this
function. She also recognizes, because of the problems she had with an earlier project, that a
synchronous digital design is the right approach to solving this problem. Sam creates a
Verilog design that looks like Listing 1-3.
endmodule
This seems like a lot of typing for such a simple circuit, doesn’t it? The first always element
appears to do nothing and looks like it could be deleted. In a previous design, Sam had
problems (which will be discussed in Chapter 2) with erratic logic behavior, so she always
double-synchronizes inputs from the Real World. The second always block asserts
pushbutton_out when overheat_in_sync and pushbutton_sync are asserted.
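Since the listing itself is not reproduced here, the following is only a minimal sketch of the kind of double-synchronized design the text describes. The module name, the output name overheat_out, the intermediate signal names ending in _meta, and the active-high asynchronous reset are all assumptions made for illustration; this is not Sam's actual code, and the number of flipflops may differ from hers.

```verilog
// Hypothetical sketch only; not the book's Listing 1-3.
module overheat_sketch (clock, system_reset, overheat_in, pushbutton_in, overheat_out);
  input  clock, system_reset, overheat_in, pushbutton_in;
  output overheat_out;

  reg overheat_in_meta, overheat_in_sync;  // Two-stage synchronizer for overheat_in.
  reg pushbutton_meta,  pushbutton_sync;   // Two-stage synchronizer for pushbutton_in.
  reg overheat_out;

  // First block: pass each Real World input through two flipflops.
  always @(posedge clock or posedge system_reset)
    if (system_reset) begin
      overheat_in_meta <= 1'b0;
      overheat_in_sync <= 1'b0;
      pushbutton_meta  <= 1'b0;
      pushbutton_sync  <= 1'b0;
    end
    else begin
      overheat_in_meta <= overheat_in;
      overheat_in_sync <= overheat_in_meta;
      pushbutton_meta  <= pushbutton_in;
      pushbutton_sync  <= pushbutton_meta;
    end

  // Second block: assert the registered output when both synchronized
  // inputs are asserted (assumed behavior).
  always @(posedge clock or posedge system_reset)
    if (system_reset)
      overheat_out <= 1'b0;
    else
      overheat_out <= overheat_in_sync & pushbutton_sync;
endmodule
```

The first block looks like it does nothing because it only copies signals, but the two flipflops in series give a marginal input a full clock period to settle before the rest of the logic sees it.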
LINES OF CODE
A useful method of estimating the size of a design is to count the semicolons.
Sam has done the fun part of the design: the actual designing of the code. She quickly
runs her compiler, simulator, or Lint program to make sure there are no typographical or
syntax errors. Next, because writing test vectors is almost as much fun as designing the
code, Sam does a test fixture and checks out the behavior of her design. Her test fixture
looks something like Listing 1-4.
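module oheat_tf;
    parameter clk_period = 10; // Half of the clock period in time units (assumed value).
    reg clock, system_reset, overheat_in, pushbutton_in;
    // Instantiate the design under test (Listing 1-3) here and connect it to these signals.
    always begin
        #clk_period clock = ~clock; // Generate system clock.
    end
    initial
    begin
        clock = 0;
        system_reset = 1; // Assert reset.
        overheat_in = 0;
        pushbutton_in = 0;
        #75 system_reset = 0; // Release reset.
        #200 overheat_in = 1; // Assert the overheat input.
        #100 pushbutton_in = 1; // Press the pushbutton.
        #100 pushbutton_in = 0; // Release the pushbutton.
        #200 overheat_in = 0; // Remove the overheat input.
        #100 $finish;
    end
endmodule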
Sam invokes her favorite simulation tool and examines the output waveforms to make sure
the output is logically correct. The output waveform looks like Figure 1-1 and appears okay.
Generally Sam will write and run an automated test-fixture program (as described in
Chapter 5), but the design is simple and the boss has ordered her to quit being such a
fussbudget and get on with it.
Sam assigns input/output pins and defines timing constraints for her design. She
knows that the system does not have to run fast, so she selects the lowest-frequency available crystal
oscillator to drive the clock input. This gives the lowest current consumption to maximize
the life of the battery. Sam submits the design to her FPGA compiler and gets a report back
that tells her that the design fits into the device she chose and that timing constraints are
met. From experience, she knows that a design running this slowly will not have
temperature or RFI emission problems. She checks the design into the revision control
system, sends an email to her boss to tell her the job is complete, and takes the rest of the
day off to go rollerblading.
This probably seems like a lot of work to complete a job that consists of six flipflops,
but Sam was lucky. The design fit into the device she chose, the design ran at the right
speed, the design did not have temperature/EMI/RFI problems, the specifications didn’t
change halfway through the design, the software tools and her workstation didn’t crash, and
she avoided the 1001 other hazards that exist in the Real World.
ENGINEERING SCHEDULE
Too often, a management tool for browbeating an engineer into working free overtime.
Engineers, even when they should really know better, are generally too optimistic when creating
schedules; thus, they are almost always late.
We have to be mature about this subject: without a deadline, nothing would ever get
finished. Still, most jobs should be completed with little overtime.
Some problems can be avoided by doing thorough design work up front. Sam was
careful not to start coding until she completely understood the requirements of the design.
GIGO
There is a great temptation to start coding before the product is well understood. After all,
to an engineer, coding is fun and planning is not.
I don’t care how much fun the job is, don’t start coding the design until you know what the
end result is supposed to be.
This book emphasizes design approaches that minimize problems and unpleasant
surprises.
SYNTHESIZABLE VERILOG ELEMENTS
Verilog was designed as a simulation language, and many of its elements do not translate to
hardware. Verilog is a large and complete simulation language. Only about 10% of it is
synthesizable. This chapter covers the fundamental properties of the 10% that the FPGA
designer needs.
Exactly which Verilog elements are considered synthesizable is a design problem
faced by the synthesis vendor. Generally, an “unofficial” subset of the Verilog language
elements will be supported by all vendors, but the current Verilog specification does not
contain any minimum list of synthesizable language elements. An IEEE working group
wrote a specification called IEEE Std 1364.1 RTL Synthesis Subset to define a minimum
subset of synthesizable Verilog language elements.
Verilog looks similar to the C programming language, but keep in mind that C defines
sequential processes (after all, only one line of code can be executed by a processor at a
time), whereas Verilog can define both sequential and parallel processes. Listing 1-5
presents some sample code with common synthesizable Verilog elements.
module hello (in1, in2, in3, out1, out2, clk, rst, bidir_signal,
              output_enable); // See note 1.
/* See note 2.
Comments that span multiple lines can be identified like this.
*/
input in1, in2, in3, clk, rst, output_enable; // See note 3.
output out1, out2;
inout bidir_signal;
reg out2; // See note 4
wire out1;
Note 1: The first element of a module is the module name. Modules are the building
blocks of a Verilog design. In this book, the module name will be the same as the file name
(with a .v extension added) and each file will contain a single module. This is not required
but helps keep the design structure intelligible.
The port list follows the module/file name. This list contains the signals that connect
this module to other modules and to the outside world. Signals used in the module that are
not in the port list are local to the module and will not be connected to other modules. Note
the use of a semicolon as a separator to isolate Verilog elements. One confusing aspect of
Verilog is that not all lines end with a semicolon, particularly the compiler instructions
(always statements, if statements, case statements, etc.). It takes the Verilog newbie some
time to get comfortable with Verilog syntax.
Note 3: The port direction list follows the module port list. This list defines whether
the signals are inputs, outputs, or inout (bidirectional) ports of the module. Unless declared
otherwise, port list signals are wires. A wire is simply a net similar to an interconnection on
a printed circuit card.
Note 4: Signals are either wires (interconnects similar to traces and pads on a circuit
board) or registers (a signal storage element like a latch or a flipflop). Wires can be driven
by a register or by combinational assignments. It is illegal to connect two registers together
inside a module. Verilog assumes that a signal, unless otherwise defined in the code, is a
one-bit-wide wire. This can be a problem; the synthesis tool will not test vector width. This
is one good reason for using a Verilog Lint tool.
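As a hedged illustration of the implicit one-bit wire problem, the sketch below contains a deliberate typo. The module width_demo and all of its signal names are hypothetical and exist only for this example.

```verilog
// Hypothetical example: width_demo and its signals are illustrative names only.
module width_demo (a, b, sum);
  input  [3:0] a, b;
  output [3:0] sum;

  wire [3:0] total;          // Declared four bits wide, as intended.
  assign total = a + b;

  // 'totl' is a typo for 'total'. Because it is never declared, Verilog
  // quietly creates a one-bit wire named totl instead of reporting an error.
  and g0 (totl, a[0], b[0]);

  assign sum = total;
endmodule
```

This compiles without complaint, which is exactly the hazard. A Lint tool will typically flag the undeclared net. Tools that support Verilog-2001 also accept the `default_nettype none directive, which turns an undeclared identifier like this into an error.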
Note 7: Always blocks are sequential blocks. The signal list following the @ and
inside the parentheses is called the event sensitivity list, and the synthesis tool will extract
block control signals from this list. The requirement of a sensitivity list comes from
Verilog’s simulation heritage. The simulator keeps a list of monitored signals to reduce the
complexity of the simulation model; the logic is evaluated only when signals on the
sensitivity list change. This allows simulation time to pass quickly when signals are not
changing. This list doesn’t mean much to the synthesis tool, except that, by convention,
when certain signals are extracted for control, these input signals must appear on the
sensitivity list. The compiler will issue a warning if the sensitivity list is not complete.
These warnings should be resolved to assure that the synthesis result matches simulation.
The sensitivity list can be a list of signals (in which case, any change on any listed
signal is detected and acted upon), posedge (rising-edge triggered), or negedge (falling-edge
triggered). Posedge and negedge triggers can be mixed, but if posedge or negedge is used
for one control, posedge or negedge must be used for ALL controls for this block.
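The three forms of sensitivity list described in Note 7 can be sketched as follows. This is a minimal example with hypothetical module and signal names, not a fragment of Listing 1-5.

```verilog
// Hypothetical module; all names are illustrative.
module sens_demo (clk, rst, a, b, sel, mux_out, q_rise, q_fall);
  input  clk, rst, a, b, sel;
  output mux_out, q_rise, q_fall;
  reg    mux_out, q_rise, q_fall;

  // Level-sensitive list: every signal read in the block is listed,
  // so simulation and synthesis agree on a combinational MUX.
  always @(a or b or sel)
    if (sel) mux_out = a;
    else     mux_out = b;

  // Edge-triggered list: a rising-edge flipflop with a reset control.
  // Both controls are edge qualified.
  always @(posedge clk or posedge rst)
    if (rst) q_rise <= 1'b0;
    else     q_rise <= a;

  // posedge and negedge mixed in one list: falling-edge clock, rising-edge reset.
  always @(negedge clk or posedge rst)
    if (rst) q_fall <= 1'b0;
    else     q_fall <= b;
endmodule
```

If sel were left out of the first list, the simulator would not re-evaluate the block when sel changed, while the synthesized MUX would still respond to it. That mismatch is what the incomplete-sensitivity-list warning is about.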
Note 8: The begin/end command isolates code fragments. If the code can be expressed
using a single semicolon, the begin/end pair is optional.
Note 9: We’re using nonblocking assignments (<=) in the always block. If blocking
assignments (=) are used, the order of the instructions may cause unwanted latches to be
synthesized so that a value can be held while earlier variables are updated. Generally, the
designer wants all elements in the sequential (always) block updated simultaneously, hence
the use of the nonblocking assignment, which emulates the clock-to-Q delay. The clock-to-
Q delay assures that cascaded flipflops (like a shift register) operate as expected. They are
called nonblocking because updating an earlier variable will not block the updating of a
later variable.
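One common consequence of assignment ordering can be seen in a three-stage shift register. The sketch below is an assumption-laden illustration with hypothetical names; it shows the stages collapsing under blocking assignment, which is a different failure from the unwanted latches mentioned above but has the same root cause.

```verilog
// Hypothetical module; names are illustrative.
module shift_demo (clk, d, q_nonblocking, q_blocking);
  input  clk, d;
  output q_nonblocking, q_blocking;
  reg    s1, s2, q_nonblocking;
  reg    t1, t2, q_blocking;

  // Nonblocking: all right-hand sides are sampled before any update,
  // so this is a three-stage shift register.
  always @(posedge clk) begin
    s1            <= d;
    s2            <= s1;
    q_nonblocking <= s2;
  end

  // Blocking: each assignment takes effect immediately, so d reaches
  // q_blocking after a single clock edge; the three stages collapse into one.
  always @(posedge clk) begin
    t1         = d;
    t2         = t1;
    q_blocking = t2;
  end
endmodule
```

Reversing the order of the three blocking assignments would restore the shift register, which is the point: with blocking assignments the hardware depends on statement order, and with nonblocking assignments it does not.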
The rst input, when coded in this manner (i.e., a nonsynchronous signal used in a
synchronous module), is interpreted as an asynchronous reset. This is not a Verilog requirement
per IEEE Std 1364 but is an accepted convention.
Verilog language elements are case sensitive (SIGNAL1 and signal1 are not
equivalent, for example). Like the C programming language, Verilog is tolerant of white
space. The designer uses white space to assist legibility. It’s legal to combine lines as so:
but designers who write hard-to-read code like this are subject to the loss of their free sodas.
We’re not going to cover operator precedence. If you have a required precedence,
then use parentheses to be explicit about that precedence. The reader should be able to read
the precedence in the source code, not be forced to memorize or look up the built-in
language precedence(s). Don’t create complicated structures; use the simplest and clearest
coding style possible. Listings 1-6 and 1-7 illustrate equivalent coding structures with
implicit and explicit ‘don’t-cares’.
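As a small illustration of the advice about parentheses, the two assignments below produce the same logic, but only the second shows the grouping in the source. The module and signal names are hypothetical.

```verilog
// Hypothetical module; names are illustrative.
module paren_demo (a, b, c, implicit_out, explicit_out);
  input  a, b, c;
  output implicit_out, explicit_out;

  // Relies on the reader knowing that & binds more tightly than |.
  assign implicit_out = a | b & c;

  // Same logic, with the intended grouping visible in the source.
  assign explicit_out = a | (b & c);
endmodule
```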
One feature of Verilog the designer must conquer is whether a priority-encoded (deep and
slow) structure or a MUX (wide and fast) structure is desired. Nested if-then statements tend
to create priority-encoded logic. Case statements tend to create MUX logic elements. There
will be more discussion of this topic later.
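The difference between the two structures can be sketched as below. The module, the req and sel inputs, and the data names are assumptions for illustration. What a given synthesis tool actually builds from each form depends on the tool and its settings.

```verilog
// Hypothetical module; names are illustrative.
module prio_vs_mux (req, sel, a, b, c, d, prio_out, mux_out);
  input  [2:0] req;
  input  [1:0] sel;
  input  a, b, c, d;
  output prio_out, mux_out;
  reg    prio_out, mux_out;

  // Chained if-else: req[2] wins over req[1], which wins over req[0].
  // Each later test sits behind the earlier ones (deep and slow).
  always @(req or a or b or c or d)
    if      (req[2]) prio_out = a;
    else if (req[1]) prio_out = b;
    else if (req[0]) prio_out = c;
    else             prio_out = d;

  // case on an encoded select: all branches are peers (wide and fast).
  always @(sel or a or b or c or d)
    case (sel)
      2'b00:   mux_out = a;
      2'b01:   mux_out = b;
      2'b10:   mux_out = c;
      default: mux_out = d;
    endcase
endmodule
```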
Do not assume a Verilog register is a flipflop of some type. In Verilog, a register is
simply a memory storage element. This is one of the first of the features (or quirks) the
Verilog designer grapples with. A register might synthesize to a flipflop (which is a digital
construct) or a latch (which is an analog construct), a wire, or might be absorbed during
optimization. Verilog assumes that a variable not explicitly changed should hold its value.
This is a handy feature (compared to Altera’s AHDL, which assumes that a variable not
mentioned gets cleared). Verilog, with merciless glee, will instantiate latches to hold a
variable’s state. The designer must structure the code so that the intended hardware
construct is synthesized and must be constantly alert to the possibility that latches may
be synthesized. Verilog does not include instructions that require the synthesizer to use a
certain construct. By using conventions defined by the synthesis vendor, and making sure
all input conditions are completely defined, the proper interpretation will be made by the
synthesizer.
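A minimal sketch of how an incompletely defined input condition leads to a latch, next to a version that defines every condition. The module and signal names are hypothetical.

```verilog
// Hypothetical module; names are illustrative.
module latch_demo (enable, data, latched_q, comb_q);
  input  enable, data;
  output latched_q, comb_q;
  reg    latched_q, comb_q;

  // No else branch: when enable is 0 the value must be held,
  // so the synthesizer infers a latch.
  always @(enable or data)
    if (enable) latched_q = data;

  // Every input condition assigns the output: plain combinational logic.
  always @(enable or data)
    if (enable) comb_q = data;
    else        comb_q = 1'b0;
endmodule
```

Both outputs are declared as reg, yet neither becomes a flipflop: the first becomes a latch and the second becomes a gate.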
VERILOG HIERARCHY
A Verilog design consists of a top-level module and one or many lower-level modules. The
top-level module can be instantiated by a simulation module that applies test stimulus to the
device pins. The top-level device module generally contains the list of ports that connect to
the outside world (device pins) and interconnect between lower-level modules and
multiplexing logic for control of bidirectional I/O pins or tristate device pins. The exact way
the design is structured depends on designer preference.
Module instances are defined as follows:
For example, the code in Listing 1-8 creates four instances of assorted primitive gates and
the post-synthesis schematic for this design is shown in Figure 1-2.
This example uses positional assignment. Signals are connected in the same order that they
are listed in the instantiated module port list(s). Generally, the designer will cut and paste
the port list to assure they are identical. A requirement for a primitive port listing is that the
output(s) occur first on the port list followed by the input(s).
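Positional connection of primitives can be sketched as follows. This is a hypothetical example with assumed names, not a reproduction of Listing 1-8.

```verilog
// Hypothetical module; names are illustrative.
module gate_demo (in1, in2, in3, and_out, or_out, xor_out, not_out);
  input  in1, in2, in3;
  output and_out, or_out, xor_out, not_out;

  // Primitive instances: the output comes first, followed by the inputs.
  and u1 (and_out, in1, in2);
  or  u2 (or_out,  in1, in2, in3);  // Primitives accept more than two inputs.
  xor u3 (xor_out, in2, in3);
  not u4 (not_out, in1);
endmodule
```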
The module port list can also use named assignments (exception: primitives require
positional assignment), in which case the order of the signals in the port list is arbitrary. For
named assignments, the format is .lower-level signal name (higher-level module signal
name). The module of Listing 1-9 includes examples of both named and positional
assignments.
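module and_top;
    wire test_in1, test_in2, test_in3;
    wire test_out1, test_out2;
    // Named assignment (the port names out1, in1, and in2 are assumed).
    user_and u1 (.out1(test_out1), .in1(test_in1), .in2(test_in2));
    // Positional assignment.
    user_and u2 (test_out2, test_in2, test_in3);
endmodule
// Lower-level module, written as an assumption so that the example is complete.
module user_and (out1, in1, in2);
    output out1;
    input in1, in2;
    assign out1 = in1 & in2;
endmodule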
Tables 1-1 through 1-12 describe Verilog two-input functions. The input combinations are
read down and across. Verilog primitives are not limited to two inputs, and the logic for
primitives with more inputs can be extrapolated from these tables.
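Table 1-1 and Gate Logic
and    0  1  x  z
0      0  0  0  0
1      0  1  x  x
x      0  x  x  x
z      0  x  x  x

Table 1-2 nand Gate Logic
nand   0  1  x  z
0      1  1  1  1
1      1  0  x  x
x      1  x  x  x
z      1  x  x  x

Table 1-3 or Gate Logic
or     0  1  x  z
0      0  1  x  x
1      1  1  1  1
x      x  1  x  x
z      x  1  x  x

Table 1-4 nor Gate Logic
nor    0  1  x  z
0      1  0  x  x
1      0  0  0  0
x      x  0  x  x
z      x  0  x  x

Table 1-5 xor Gate Logic
xor    0  1  x  z
0      0  1  x  x
1      1  0  x  x
x      x  x  x  x
z      x  x  x  x

Table 1-6 xnor Gate Logic
xnor   0  1  x  z
0      1  0  x  x
1      0  1  x  x
x      x  x  x  x
z      x  x  x  x

Table 1-7 buf (noninverting buffer) Gate Logic
input  output
0      0
1      1
x      x
z      x

Table 1-8 not (inverter) Gate Logic
input  output
0      1
1      0
x      x
z      x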
Table 1-11 notif0 (tristate inverting buffer, low enable) Gate Logic
Table 1-12 notif1 (tristate inverting buffer, high enable) Gate Logic
The code fragment in Listing 1-10 illustrates the use of these buffers and Figure 1-3 is
the schematic extracted from the synthesized logic.
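The tristate primitives can be sketched as follows. This is an illustrative example rather than the book's Listing 1-10, and the module and signal names are assumptions:

```verilog
// Hypothetical module showing three of the tristate primitives.
module tri_demo (data_in, enable, bus_hi, bus_lo, bus_inv);
input  data_in, enable;
output bus_hi, bus_lo, bus_inv;

// Port order for these primitives: output, data input, enable.
bufif1 u1 (bus_hi,  data_in, enable); // drives data_in when enable = 1
bufif0 u2 (bus_lo,  data_in, enable); // drives data_in when enable = 0
notif1 u3 (bus_inv, data_in, enable); // drives ~data_in when enable = 1
// When a buffer is disabled, its output is high impedance (z).
endmodule
```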
/Set  /Reset  Clock  D   Q   /Q
0     1       x      x   1   0
1     0       x      x   0   1
0     0       x      x   1   1   Note 1
1     1       r      1   1   0
1     1       r      0   0   1
1     1       0      x   n   n
1     1       1      x   n   n
Note 1: This condition is not stable and is illegal. The problem is, if the /Set and /Reset inputs
are removed simultaneously, the output state will be unknown.
x = don’t care (doesn’t matter).
r = rising edge of clock signal.
n = no change, previous state is held.
The typical FPGA logic element design allows the use of either an asynchronous Set
or Reset, but not both together, so we won’t have to worry about the illegal input condition
where both are asserted. This book is going to strongly emphasize synchronous design
techniques, so we discourage any connection to a flipflop asynchronous Set or Reset
input except for power-up initialization control. Even in this case, a synchronous
Set/Reset might be more appropriate.
A latch is more of an analog function. It’s helpful to bear in mind that all the
underlying circuits that make up our digital logic are analog! There is no magic flipflop
element. Flipflops are made with transistors and positive feedback: they are latches.
Even if you’re the kind of person whose eyes glaze over when you see transistors on a
schematic, you should still notice two things about Figure 1-4. The first thing is that this D
flipflop is made with linear devices, i.e., transistors. If you can always keep the idea in the
back of your head that all digital circuits are built from analog elements that have gain,
impedance, offsets, leakages, and other analog nasties, then you are on the road to being an
excellent digital designer. The second thing to notice is feedback (see highlighted signals)
from the Q and /Q outputs back into the circuit. Feedback is what causes the flipflop to hold
its state.
Latches and Flipflops 21
If you are more comfortable with gates, a different view of the same D flipflop is shown in
Figure 1-5. This is a higher level of abstraction; the transistors and resistors are hidden.
Those pesky transistors are still there! Again, note the highlighted feedback path. Also note that
the PRESET and CLEAR signals have active low polarity.
Listing 1-11 shows a Verilog version of a latch, and Figure 1-6 shows the schematic
extracted from this Verilog design. The underlying circuit that implements RS Latch
(LATRS) is a circuit functionally similar to Figure 1-5. It’s not a digital circuit!
The latch uses feedback to hold a state: this feedback is implied in Listing 1-11 by not
defining q for all combinations of input conditions. For undefined inputs, q will hold its
previous state. The logic that determines a latch output state may include a clock signal but
typically does not and is therefore a level-sensitive rather than an edge-triggered construct.
endmodule
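The way incomplete assignment implies a latch can be sketched as below. The names latch_demo, gate, d, and q are hypothetical and chosen only for this illustration:

```verilog
// Hypothetical level-sensitive latch inferred from an incomplete if.
module latch_demo (gate, d, q);
input  gate, d;
output q;
reg    q;

// There is no else branch: when gate is low, q is not assigned,
// so it must hold its previous value. That implied storage is
// what the synthesizer turns into a latch.
always @ (gate or d)
begin
  if (gate)
    q = d;
end
endmodule
```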
In the example of Listing 1-12, test_out1 can change only while enable_input is high; during
that time, test_out1 follows test_in1. This will synthesize to a combinational latch as
illustrated in Figure 1-7. We’ll discourage this type of coding style unless the latch is driven
A better design infers a clocked flipflop structure, as in Listing 1-13, with the
respective schematic shown in Figure 1-9.
Listing 1-13 demonstrates a flipflop with synchronous reset where the reset input is
evaluated only on clock edges. If the target hardware does not support a synchronous reset,
logic will be added to set the D input low when reset is asserted as shown in Figure 1-9.
Listing 1-14 illustrates a flipflop with asynchronous reset where the rst signal is “evaluated”
on a continuous basis. Notice that the dedicated global set/reset (GSR) resource of the
flipflops is not used. It would be much more efficient to synthesize a synchronous reset
signal and connect it to the GSR. This type of assignment is covered in Chapter 5.
endmodule
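The difference between the two reset styles comes down to the sensitivity list. The following sketch puts both side by side; the module and signal names (reset_demo, q_sync, q_async) are assumptions and not taken from Listings 1-13 or 1-14:

```verilog
// Hypothetical comparison of synchronous and asynchronous reset.
module reset_demo (clk, rst, d, q_sync, q_async);
input  clk, rst, d;
output q_sync, q_async;
reg    q_sync, q_async;

// Synchronous reset: rst is sampled only on the rising clock edge.
always @ (posedge clk)
begin
  if (rst)
    q_sync <= 1'b0;
  else
    q_sync <= d;
end

// Asynchronous reset: rst is in the sensitivity list, so the
// flipflop clears as soon as rst goes high.
always @ (posedge clk or posedge rst)
begin
  if (rst)
    q_async <= 1'b0;
  else
    q_async <= d;
end
endmodule
```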
Blocking and Nonblocking Assignments
So far, we’ve used only nonblocking assignments (<=). A blocking assignment (=), when
the variable is defined outside the always statement where it is used, holds off future
assignments until the previous assignment is complete. How can synthesized hardware hold
off an assignment? By storing an old value in a latch, that’s how. This means that blocking
assignments are order sensitive; they are executed in the begin/end sequential block in the
order in which they are encountered by the compiler (top to bottom).
reg data_out;
The synthesized logic for Listing 1-15, shown in Figure 1-10, illustrates the blocking
assignment of data_temp and data_out: a flipflop is synthesized to create the intermediate
(pipelined) data_temp variable.
In Listing 1-16, the blocking statements are reversed. Notice how the resulting logic,
as illustrated in Figure 1-11, is different from the logic of Figure 1-10.
If we replace the blocking assignments with nonblocking assignments, the order of the
sequential instructions no longer matters. All right-hand values are evaluated at the positive
edge of the clock, and all assignments are made at the same time. The synthesized logic for
Listing 1-17, shown in Figure 1-12, illustrates the nonblocking assignments of data_temp
and data_out; the resulting design is equivalent to the logic of Listing
1-15.
else
begin
data_out <= data_temp;
data_temp <= data_in;
end
endmodule
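The order sensitivity of blocking assignments can be seen in one small module. This is a sketch with made-up names (order_demo, temp_a, temp_b), not a copy of Listings 1-15 through 1-17:

```verilog
// Hypothetical module contrasting two orderings of blocking assignments.
module order_demo (clk, data_in, out_a, out_b);
input  clk, data_in;
output out_a, out_b;
reg    out_a, out_b, temp_a, temp_b;

// "Use, then update": out_a gets the OLD temp_a, so temp_a is a
// real pipeline flipflop (two flipflops in series).
always @ (posedge clk)
begin
  out_a  = temp_a;
  temp_a = data_in;
end

// "Update, then use": out_b gets the NEW temp_b, so out_b is
// just data_in registered once and temp_b adds no delay.
always @ (posedge clk)
begin
  temp_b = data_in;
  out_b  = temp_b;
end
endmodule
```

Written with nonblocking assignments (<=), both blocks would behave like the first one regardless of statement order.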
Numbers
Unless defined otherwise by the designer, a Verilog number is 32 bits wide. The format of a
Verilog number is size’base value. The ’ is the single quote (tick or closing quote), not to
be confused with ‘ (accent grave, opening quote, or back tick) which is used to identify text
substitution and compiler directives. Both tick and back tick are used in Verilog, which will
frustrate a newbie. Underscores are legal in a number to aid in readability. All numbers are
padded to the left with zeros, x’s, or z’s (if the leftmost defined value is x or z) as necessary.
If the number is unsized, the assumed size is just large enough to hold the defined value
when the value gets used for comparison or assignment. X or x is undefined, Z or z is high
impedance. Verilog allows the use of ? in place of z. Numbers without an explicit base are
assumed to be decimal. All nets are assumed to be Z unless driven.
Number examples:
Verilog is a loosely typed language. For example, it accepts what looks like an 8-bit
value like 4’hab without complaint (the number will be recognized as 1011 or b and the
upper nibble will be ignored). The use of a Lint program like Verilint will flag problems
like this. Otherwise, the Verilog designer must stay alert to guard against such errors.
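A few number forms, written as a small simulation-only sketch (the module name number_demo and the register names are assumptions):

```verilog
// Hypothetical simulation module showing several number formats.
module number_demo;
reg [7:0]  a;
reg [15:0] b;
reg [3:0]  c;

initial
begin
  a = 8'hab;             // 8-bit hex: 1010_1011
  a = 8'b1010_1011;      // same value, binary with an underscore
  a = 8'd171;            // same value, decimal
  a = 171;               // unsized decimal, same value
  b = 16'hz;             // leftmost digit is z: all 16 bits are z
  b = 16'bx;             // all 16 bits are x
  c = 4'hab;             // too wide for 4 bits: c becomes 4'hb
  $display("c = %h", c); // prints "c = b"
end
endmodule
```

Some tools issue a warning for the 4'hab line; others accept it silently, which is the hazard described above.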
Forms of Negation
! is logical negation; the result is a single bit value, true (1) or false (0). ~ (tilde) is
bitwise negation. We can use a ! (sometimes called a bang) to invert a single bit value, and
the result is the same as using a ~ (tilde), but this is a bad habit! As soon as someone comes
in and changes the single bit to a multibit vector, the two operators are no longer equivalent,
and this can be a difficult problem to track down (see Listing 1-18).
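The difference is easy to see in simulation. A minimal sketch, with hypothetical names negate_demo, bit_val, and vec:

```verilog
// Hypothetical simulation module comparing ! and ~.
module negate_demo;
reg [3:0] vec;
reg       bit_val;

initial
begin
  bit_val = 1'b0;
  vec     = 4'b1010;
  // Single bit: both forms give 1.
  $display("%b %b", !bit_val, ~bit_val);   // 1 1
  // Four-bit vector: the results differ.
  // !vec is 0 (vec is nonzero, hence true); ~vec is 0101.
  $display("%b %b", !vec, ~vec);           // 0 0101
end
endmodule
```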
Forms of AND/OR
& is the symbol for the AND operation. & is a bitwise AND, && is a logical
(true/false) AND. As illustrated in Listing 1-19, these two forms are not functionally
equivalent.
| (pipe) is the symbol for the OR operation, where | is a bitwise OR and || is a logical
OR. As illustrated in Listing 1-20, these two forms are not functionally equivalent.
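A short simulation sketch of the bitwise and logical forms, using assumed names (and_or_demo, a, b):

```verilog
// Hypothetical simulation module comparing bitwise and logical operators.
module and_or_demo;
reg [3:0] a, b;

initial
begin
  a = 4'b1100;
  b = 4'b0011;
  // Bitwise: each bit pair is combined separately.
  $display("%b %b", a & b, a | b);     // 0000 1111
  // Logical: each operand is first reduced to true or false.
  // Both a and b are nonzero (true), so both results are 1.
  $display("%b %b", a && b, a || b);   // 1 1
end
endmodule
```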
else if (and_test)
begin
end
endmodule
In Listing 1-21, the final else condition bears some comment. We did not cover all
input conditions in the logic above the final else condition. For example, what output do we
want if neither and_test nor or_test is asserted? Without the final else defined, Verilog
interprets a change from a defined condition to an undefined condition as a hold condition
(if outputs are not commanded, the last value gets held). This causes latches to be created.
Generally, this is not what the designer intends; thus, we need to make sure that all
conditions are defined.
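The role of the final else can be sketched as follows. The signals and_test and or_test come from the text; the module name else_demo and the signals a, b, and result are assumptions:

```verilog
// Hypothetical combinational block with every input condition defined.
module else_demo (and_test, or_test, a, b, result);
input  and_test, or_test, a, b;
output result;
reg    result;

always @ (and_test or or_test or a or b)
begin
  if (and_test)
    result = a & b;
  else if (or_test)
    result = a | b;
  else
    result = 1'b0;   // Final else: every input condition is now
                     // defined, so no latch is inferred.
end
endmodule
```

Deleting the final else branch would leave result unassigned when both test inputs are low, and a latch would be inferred to hold the old value.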
Miscellaneous Verilog Syntax Items 33
Equality Operators
== and === are equality operators; the result is either true or false, except that the == (called
logical equality) version will have an unknown (x) result if any of the compared bits are x
or z. The === (called case equality) version looks for an exact match of bits, including x’s and
z’s, and returns only true or false. The corresponding “is not equal” operators are != and !==. In the
equality examples of Listing 1-22, there are several if statements that will evaluate to true.
As the block is examined from top to bottom, only the first true condition will be accepted.
The later ones will not be evaluated. This is called priority encoding, and, like instantiating
latches, Verilog has a natural tendency to use this structure. It can result in many levels of
cascaded logic! Pay close attention. The alternative is a MUX style of
structure where inputs are evaluated in parallel, which may be what you intend. We’ll talk
more about this later.
else if ((b == d) == 1)
result <= 1’bx;
// still an unknown.
result <= 1’bx;
end
endmodule
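The behavior of the two equality forms with unknown bits can be checked with a small simulation sketch (the names equality_demo, a, and b are made up for this example):

```verilog
// Hypothetical simulation module comparing == and ===.
module equality_demo;
reg [1:0] a, b;

initial
begin
  a = 2'b1x;
  b = 2'b1x;
  $display("%b", a == b);    // x: an unknown bit was compared
  $display("%b", a === b);   // 1: x matches x exactly
  $display("%b", a !== b);   // 0
  // An x condition is treated as false, so the else branch runs.
  if (a == b)
    $display("equal");
  else
    $display("not known to be equal");
end
endmodule
```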
if (~resetn) ...
if (resetn == 1'b0)
Both are equivalent. Which is easier to read and understand? That’s a matter
of opinion. Note the use of an ‘n’ suffix to indicate an active-low signal (asserted when low and
low-true are other ways to describe this). There are various ways of identifying active-low
signals, for example, reset_not, resetl, reset*, or resetN. It helps to identify the
assertion sense as part of the label; the main thing is to be consistent when selecting labels.
Relational operators are also supported, including greater than (>), less than (<), greater than or
equal to (>=), and less than or equal to (<=).
Shift Operators
>> n and << n identify right-shift (divide by 2^n) and left-shift (multiply by 2^n)
operations. Bit positions vacated by the shift are filled with zeros.
Shifting a value that contains an x or a z will propagate the x or z in the direction
of the shift. Some examples of using the shift operators are presented in Listing 1-23.
a <= ’b1001;
b <= 0; // It’s bad form to async set
// values like this.
c <= 0;
d <= 0;
e <= ’bx000;
end
else if (shift_right_test)
begin
else if (shift_left_test)
begin
else
begin
d <= 0; // Assign values default to avoid
// unwanted latches.
e <= 0;
f <= 0;
end
end
endmodule
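A compact simulation sketch of the shift operators, using the same starting values as the listing fragment above (the module name shift_demo is an assumption):

```verilog
// Hypothetical simulation module showing the shift operators.
module shift_demo;
reg [3:0] a;

initial
begin
  a = 4'b1001;
  $display("%b", a >> 1);   // 0100: zero shifted in from the left
  $display("%b", a << 2);   // 0100: zeros shifted in from the right
  a = 4'bx000;
  $display("%b", a >> 1);   // 0x00: the x moves with the shift
end
endmodule
```

Note that bits shifted past either end of the 4-bit value are lost.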
Conditional Operator
This is a common way of defining a MUX. If the expression being evaluated resolves
to x or z, the output_bus is evaluated bit-by-bit, and Verilog will try to resolve the output
values. If both input bits are 1 (which means the input condition doesn’t matter), then the
output bit is a 1. Same for both input bits being 0. Any bits that can’t be resolved are
assigned an x value. If the true_assignment or the false_assignment register width is not
wide enough to fill the output_assignment, the output_assignment bits are left-filled with
zeros. See Listing 1-24.
else
end
endmodule
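A two-input MUX written with the conditional operator might look like the sketch below. The name output_bus comes from the text; mux_demo, select, bus_a, and bus_b are assumptions:

```verilog
// Hypothetical 4-bit, 2-to-1 MUX using the conditional operator.
module mux_demo (select, bus_a, bus_b, output_bus);
input        select;
input  [3:0] bus_a, bus_b;
output [3:0] output_bus;

// select = 1 passes bus_a; select = 0 passes bus_b.
assign output_bus = select ? bus_a : bus_b;
endmodule
```

In simulation, if select is x, each bit of output_bus takes the common value where bus_a and bus_b agree and becomes x where they differ.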
Math Operators
Verilog supports a small set of math operators including addition (+), subtraction (-) ,
multiplication (*), division (/), and modulus (%); however, the synthesis tool probably
limits the usage of multiplication and division to constant powers of two (in other words, a
left shifter or right shifter will be synthesized) and may not support modulus. The + and -
math operators will instantiate preoptimized adders. Verilog assumes all reg and wire
variables are unsigned.
Parameters
Parameters are a useful way of making constants more readable. Parameters are used
only in the modules where they are defined, but they can be changed by higher-level
modules. Parameters cannot be changed at run time, but they can be changed at compile
time. This is useful in cases where a parameter changes the defined number of signals or the
number of times some construct is instantiated. Not all parameters have to be assigned, but if
there is a positional assignment list, parameters can’t be skipped.
A parameter can also be defined in terms of other constants or parameters. To aid in
reading the code, some people use upper-case characters for parameters.
Listings 1-25 and 1-26 demonstrate Verilog hierarchy, where a module list descends
into the hierarchy, starting at the top, and with module names separated by periods.
module top;
reg clk, resetn;
parameter byte_width = 8;
defparam
u1.reg_width = 16; // This parameter will
// replace the first
// parameter found in
// the u1 instantiation
// of reg_width.
defparam
u2.reg_width = byte_width * 2;
parm_tst u1 (clk, resetn, output_bus1);// Create a version
// of parm_tst with
// reg_width = 16.
parm_tst u2 (clk, resetn, output_bus2); // This version of
// parm_tst also has
// reg_width of 16.
parm_tst u3 (clk, resetn, output_bus3); // This version of
// parm_tst has a
// reg_width of 8.
endmodule
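The lower-level module that the defparam statements act on is not reproduced here. As a guess at its general shape, the sketch below uses the names from the text (parm_tst, clk, resetn, output_bus, reg_width); the counter body is purely an assumption for illustration:

```verilog
// Assumed shape of a parameterized module like parm_tst.
module parm_tst (clk, resetn, output_bus);
parameter reg_width = 8;           // default value, used when not overridden
input  clk, resetn;
output [reg_width-1:0] output_bus;
reg    [reg_width-1:0] output_bus;

// Illustrative body only: a counter as wide as reg_width.
always @ (posedge clk)
begin
  if (~resetn)
    output_bus <= 0;
  else
    output_bus <= output_bus + 1;
end
endmodule
```

A parameter can also be overridden positionally at the point of instantiation, for example parm_tst #(16) u4 (clk, resetn, output_bus4); where u4 and output_bus4 are hypothetical names.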
Concatenations
Concatenations are groupings of signals or values and are enclosed in curly brackets
{} with commas separating the concatenated expressions, as shown in Listing 1-27. All
concatenated values must be sized. Note the use of [ ] to identify the bit select or register
index. It’s legal to define a register like backwards_reg, but, regardless of the numbers used,
the leftmost definition is always the most significant bit. Usually, you’ll see the largest
number occurring on the left side of the colon (:) unless a one-dimensional array of
variables (like a RAM) is being created.
module backward;
reg [0:2] backwards_reg;
reg [2:0] test;
// {1'b0, test, 8'h55} is a 12-bit value: a zero bit, then the three
// bits of test, then the eight bits of 8'h55.
always @ (backwards_reg)
begin
test = backwards_reg;
// The assignment above is equivalent to the assignments below:
test[2] = backwards_reg[0];
test[1] = backwards_reg[1];
test[0] = backwards_reg[2];
end
endmodule
CHAPTER 2
Digital Design Strategies and Techniques
There will be many views of a design. The designer must be comfortable changing
between different views of the same project as it evolves into a bitstream file formatted to
configure an FPGA. It helps to keep in mind that all digital design elements are
implemented with analog components. There is no magic device that acts like a NAND
gate. We implement digital logic with analog devices like transistors, diodes, and resistors
as shown in Figures 2-1, 2-2, and 2-3. Transistors can act as digital switches (on or off) or
as analog transfer gates (pass mode). For the transistor impaired, N FETs are ON with a
“one” on the gate, P FETs are ON with a “zero” on the gate.
Most FPGAs use a multiplexer (MUX) Look-Up Table (LUT) as a basic logic element.
There are two reasons for doing this:
The MUX control inputs are used as logic inputs, and the multiplex inputs are strapped to
logic levels to implement the desired function. Figure 2-4 illustrates an inverter
implemented using this method.
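In Verilog terms, the MUX-as-inverter idea can be sketched like this. The module and signal names are hypothetical, and a real LUT is wider than this one-input example:

```verilog
// Hypothetical 2-to-1 MUX used as a one-input look-up table.
module mux_lut_inverter (in, out);
input  in;
output out;

// The multiplex inputs are strapped to fixed logic levels.
wire d0 = 1'b1;   // value selected when in = 0
wire d1 = 1'b0;   // value selected when in = 1

// The MUX control input is the logic input: out = ~in.
assign out = in ? d1 : d0;
endmodule
```

Changing the strapped values changes the function: with d0 = 0 and d1 = 1, the same structure becomes a noninverting buffer.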
A hidden advantage to using the MUX LUT as a logic element is provided by the
capacitive loading and the “break-before-make” switching character of the MUX output.
When the inputs change, the output is held and tends to change cleanly without glitching.
Synthesis Example
Changing back to the digital world, let’s refer to the Overheat Detector design of
Chapter 1, reprinted here as Listing 2-1.
endmodule
The synthesis tool converts our simple 40-line source code into an ugly EDIF netlist
almost 300 lines long. This netlist holds all the design elements and information regarding
the compiler version, target part, and all the design constraints the synthesizer knows about.
This netlist is designed to be interpreted by other computer programs and doesn’t contribute
much usable information to the designer, so we won’t look at an example.
A graphical version (a schematic) of the netlist as shown in Figure 2-5 is more useful
to us in understanding what the synthesis tool created, particularly for the HDL impaired.
Note the correct use of global resources for clock and reset. Because Verilog does not
support direct assignment of hardware resources (the biggest problem for the Verilog FPGA
designer), it is the designer’s job to ensure that these inferences are made correctly.
The synthesis tool has some understanding of the target architecture and can provide
estimates of the design timing and resource requirements (see Listing 2-2). This estimate will
not include black-box modules that are later imported by the FPGA place-and-route tool.
*******************************************************
*******************************************************
Number of ports : 5
Number of nets : 13
Number of instances : 10
Number of references to this view : 0
***********************************************
Device Utilization for 4010xlPQ100
Discussion of Design Processing Steps 47
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 5 77 6.49%
FG Function Generators 1 800 0.13%
H Function Generators 0 400 0.00%
CLB Flip Flops 2 800 0.25%
Clock : Frequency
------------------------------------
clock : 118.8 MHz
Syntax Checking
The first step is to submit your code to a compiler, simulator, and/or Lint program that will
identify syntax, typing, and other errors. Each program evaluates the code differently. If
there is some confusion about what the syntax check says, it can be very helpful to try
another interpreter. Listings 2-3 through 2-6 illustrate four ways that an error is reported for
a simple problem inserted in the overheat.v code. A semicolon was appended on one of the
if statements as so:
ModelSim reported an error on the next line; at least in the right neighborhood of the error.
Reading “c:\verilog\verilog\overheat.v”
sim to 0
Highest level modules (that have been auto-instantiated):
(overheat overheat
The point of this exercise is to illustrate that different tools give different (and more or
less useful) error messages, and it makes good sense to have several tools available for
checking your code, particularly a Lint tool. Verilint (or a similar Verilog Lint tool) is
useful because it’s fast, easy to use, and catches many different types of errors. This type of
tool can save many hours of frustration. Regardless, this example illustrates how much
trouble a single misplaced semicolon can cause.
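To see why a stray semicolon is so troublesome, consider the sketch below. It is not the overheat.v code; the names semicolon_demo, hot, alarm_bad, and alarm_good are assumptions. This version is legal Verilog, so a compiler may accept it silently even though the first if statement does nothing:

```verilog
// Hypothetical module showing the effect of a semicolon after an if.
module semicolon_demo (clock, hot, alarm_bad, alarm_good);
input  clock, hot;
output alarm_bad, alarm_good;
reg    alarm_bad, alarm_good;

initial
begin
  alarm_bad  = 1'b0;      // simulation-only starting values
  alarm_good = 1'b0;
end

always @ (posedge clock)
begin
  if (hot);               // Stray semicolon: the if controls an empty statement.
    alarm_bad <= 1'b1;    // Runs on every clock edge, whatever hot is.

  if (hot)                // Intended form.
    alarm_good <= 1'b1;
end
endmodule
```

If an else clause followed the first if, the parser would probably object to an else with no matching if, which is one plausible reason a tool reports the error a line or so after the actual mistake.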
The end result of all of our work is a configuration of hardware. This hardware can be an
FPGA, a semicustom FPGA conversion, or some sort of ASIC (standard cell, gate array,
full custom). If the result is an FPGA, the hardware will have an underlying structure that
varies depending on the design approach taken by the FPGA vendor. The logic structure of
a Xilinx 4K family is illustrated in Figure 2-6. We’ll take a closer look at this and other
device architectures in Chapter 7.
The Xilinx 4K family Configurable Logic Block (CLB) is basically two 4-input LUTs
feeding a pair of flipflops. The Verilog code we write gets mapped into this structure by the
synthesis tool.
The synthesizer translates the design to a form suitable for the target hardware by:
• Flattening the design into large Boolean equations with one equation for each
module output, design section output, or register output. Redundant registers
may be identified and optimized out. For example, the code fragment of Listing
2-7 might be flattened into the Boolean equation of Listing 2-8.
To create the gate representation of this circuit, create a truth table (see Table 2-1) which
defines all the input and output conditions.
By inspection, we see that the c[0] (SUM) output can be represented by an XOR gate
and the c[1] output (CARRY) can be represented by an AND gate. A flattened version of
the simple adder circuit is illustrated in Figure 2-7.
The logic mapped into the top F2_LUT to create c[0] (ix72) is (~I0 * I1) + (I0 * ~I1),
equivalent to the XOR function. The logic mapped into the lower F2_LUT to create c[1]
(ix71) is (I0 * I1).
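For reference, a registered one-bit adder of the kind discussed here could be written as in the following sketch. The port names match the instance names that appear in the SDF file of Listing 2-9 (clock, reset, a, b, c), but the body, including the asynchronous reset, is an assumption:

```verilog
// Assumed reconstruction of a registered 1-bit adder.
module adder (clock, reset, a, b, c);
input        clock, reset, a, b;
output [1:0] c;
reg    [1:0] c;

always @ (posedge clock or posedge reset)
begin
  if (reset)
    c <= 2'b00;
  else
    c <= a + b;   // c[0] = a ^ b (SUM), c[1] = a & b (CARRY)
end
endmodule
```

Because the left-hand side is two bits wide, the addition is carried out at two bits and the carry is preserved in c[1].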
The synthesizer can’t recognize redundant logic that crosses register boundaries
(though it may recognize and delete redundant registers). If there is any chance for logic
minimization, this must be part of the design input. The best opportunities for logic
reduction are created and implemented by the designer.
• Timing and resource requirements are estimated. The compiler can only
estimate the design timing and resource requirements. The manufacturer may
have made changes to the timing parameters (the device manufacturer will
always be ahead of other companies, who rely on the manufacturer for data).
Another reason the timing estimate may not be accurate is that the library and
black-box elements are not yet part of the design netlist. These elements are
inserted when the design is linked and the final netlist is created and flattened.
• The design is converted to a netlist. There are various flavors of netlists, but the
most common format at present is EDIF.
• The design elements and modules are linked together and ‘black-box’ modules
are replaced with library module netlists. The netlist created by the compiler
may be flattened (all the modules merged into one netlist) or the hierarchy may
be maintained with the modules kept separate. With the hierarchy maintained,
the design is easier for the designer to understand as it appears more like it was
created.
• Floorplanning and routing attempts are made until the timing and resource
constraints are met. Floorplanning assigns elements from the device logic to the
designed circuitry. The place and route of the design is very much like the
place and route of a printed circuit board. The efficiency of routing and the
resulting speed of the routed design depend on the arrangement of the module
elements, which affects the interconnect between modules. There are limited
routing resources in an FPGA. When the routing gets dense (congested), long
routing paths may be necessary to complete a signal path. This slows the design
and causes routing problems for signals that must travel across or around the
congested area. Some FPGA vendors advertise the capability of 100% routing
of all logic, but in practice densities of 65% (Altera) and 85% (Xilinx) are more
reasonable. Manual floorplanning can increase the usable logic density.
• Timing and resource reports are extracted from the design. A timing-annotated
netlist may be created to support post-route simulation. A common format for a
timing-annotated netlist is the SDF format as illustrated in Listing 2-9. SDF
stands for Standard Delay Format. This file includes estimated gate delays based
on the FPGA design rules.
(DELAYFILE
(SDFVERSION “2.0”)
(DESIGN “adder”)
(DATE “08/31/99 09:21:34”)
(VENDOR “Exemplar Logic, Inc., Alameda”)
(PROGRAM “LeonardoSpectrum Level 3”)
(VERSION “v1999.1d”)
(DIVIDER /)
(VOLTAGE)
(PROCESS)
(TEMPERATURE)
(TIMESCALE 1 ns)
(CELL
(CELLTYPE “F2_LUT”)
(INSTANCE ix72)
(DELAY
(ABSOLUTE
(PORT I0 (::3.25) (::3.25))
(PORT I1 (::3.25) (::3.25)))))
(CELL
(CELLTYPE “F2_LUT”)
(INSTANCE ix71)
(DELAY
(ABSOLUTE
(PORT I0 (::3.25) (::3.25))
(PORT I1 (::3.25) (::3.25)))))
(CELL
(CELLTYPE “BUFG”)
(INSTANCE clock_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::0.00) (::0.00)))))
(CELL
(CELLTYPE “OFDX”)
(INSTANCE reg_c_1)
(DELAY
(ABSOLUTE
(PORT C (::3.25) (::3.25))
(PORT D (::2.77) (::2.77)))))
(CELL
(CELLTYPE “OFDX”)
(INSTANCE reg_c_0)
(DELAY
(ABSOLUTE
(PORT C (::3.25) (::3.25))
(PORT D (::2.77) (::2.77)))))
(CELL
(CELLTYPE “IBUF”)
(INSTANCE reset_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::2.77) (::2.77)))))
(CELL
(CELLTYPE “IBUF”)
(INSTANCE a_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::2.77) (::2.77)))))
(CELL
(CELLTYPE “IBUF”)
(INSTANCE b_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::2.77) (::2.77)))))
(CELL
(CELLTYPE “STARTUP”)
(INSTANCE ix56)
(DELAY
(ABSOLUTE
(PORT GSR (::2.77) (::2.77)))))
)
• The device configuration files are created. The download file can be
programmed into a serial EPROM, downloaded through a serial or parallel
cable, or stored in memory and written to the device by a microprocessor, or by
a stand-alone EPROM with address and data control generated by the FPGA
itself. The device might be ISP (In-System Programmable) or a reprogrammable
type (plugged into a programmer, programmed, then installed in the destination
design).
Many people, when asked to draw a two-input NOR Gate, will draw a circuit that looks like
Figure 2-8. In my experience this circuit seems shifty or flaky. This is not just a sign of
mental illness. The output is very likely to be glitchy when the inputs change. We’re digital
designers and we want the analog aspects of our design to be minimized.
Figure 2-9 shows a typical circuit where the simple NOR gate might be used. The
resistance and capacitance of Figure 2-9 do not have to be discrete devices on a circuit
board, they could be parasitic values associated with signal routing and loading.
The oscilloscope trace shown in Figure 2-10 demonstrates one problem with the
combinational circuit. One input is strapped low, so the output should just be the inverse of
the other input, right? Where did those nasty glitches on the output come from? The input is
a noisy signal that crosses the input threshold (where the input is between being recognized
as one or zero by the gate input) very slowly. The RC network just exaggerates the problem
and is exactly the kind of thing you see when some bonehead tries to filter out the switch
contact bounce. The right way to filter switch bounce is to use feedback (hysteresis).
Fine, you say. You’ll make sure that the input always changes quickly to minimize
glitches. So, you invent a circuit that switches infinitely fast (you can store this circuit on
the same shelf as your perpetual motion machine). Anyway, that’s still not good enough,
because there is another cause of glitches. When the inputs are changing at nearly the same
time, again the output can be indeterminate. The circuit of Figure 2-11 demonstrates this
problem. A resistor-capacitor (RC) network is added to delay the input signal. Again, the
output has nasty transients. So, your design won’t use RC networks between inputs like this.
Well, the RC time delay might be caused by mixed routing paths between inputs (signal
skew) or by signal loading where each signal destination contributes a capacitive load. The
R part of Figure 2-11 represents the sum of the source and routing impedance (proportional
to route length) and the C part represents net loading (proportional to the number of loads
on the net). The only control you have of this problem is making sure that signals have low
fanout (a measure of the signal loading represented by destination logic elements where
each gate load is counted as a fanout of 1). Most synthesis tools allow a fanout constraint to
be defined to control loading (signals are split and driven by separate buffers).
When I am asked to draw a two-input AND Gate, it looks like Figure 2-12. The
difference is the addition of a synchronizing flipflop. The output of this circuit will not be
glitchy if synchronous logic rules are followed and the setup/hold requirements for the
flipflop are met (see the next section for a discussion of setup and hold times). This is
particularly safe if the input signal is synchronous, too. If the signal at the D input of the
flipflop is stable in time to meet the setup-time requirement and maintained beyond the
hold-time requirement, then all is well.
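A gate followed by a synchronizing flipflop takes only a few lines of Verilog. In this sketch the names sync_gate, in1, in2, and out are assumptions:

```verilog
// Hypothetical two-input AND gate with a synchronizing flipflop.
module sync_gate (clk, in1, in2, out);
input  clk, in1, in2;
output out;
reg    out;

// The gate output is sampled on the clock edge. Glitches between
// edges never reach out, provided setup and hold times are met.
always @ (posedge clk)
  out <= in1 & in2;
endmodule
```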
Metastability
Figure 2-13 illustrates the metastability problem; if SIGNAL changes within the
setup/hold window of the flipflop, the output is unknown for a period. How long is this
period? It depends on the characteristics of the flipflop and its environment: how fast is the
flipflop, how much gain does it have, and how much noise is present in the system. How big
is this problem? It depends on how often the input changes and how wide the setup/hold
window is compared to the clock period.
We’ll never get to zero metastability, but hopefully the statistical probability of
metastability will be microscopic. I don’t know about you, but if I can get the mean time
between failures in my design to 100,000 years or so, that’s good enough.
The closest we will get to a solution to the metastability problem is to use
synchronous design techniques. This means a synchronizing clock is used to qualify, gate,
or trigger a circuit. The time between clock edges is used to allow signals to propagate and
settle. It’s like a game; if you can get your signal to the next flipflop before the next clock
setup time, then you win.
For the output of a flipflop to be predictable (not metastable), the inputs must meet the
setup and hold time requirement of the flipflop.
• The setup time, often represented as Tsu, is the time period, BEFORE the edge
of the synchronizing clock, when the input is required to be stable. If the setup
time is violated, the output value is indeterminate.
• The hold time, often represented as Th, is the time period, AFTER the
synchronizing clock edge, when the input is required to be stable. If the hold
time is violated, again the output value is not guaranteed.
The setup and hold requirement comes from the analog nature of the flipflop design.
The flipflop uses feedback implemented with cross-coupled gates to hold a state. It takes
time for the gates to achieve their stable state. In a perfect world, an edge-triggered flipflop
would change states exactly synchronous with the clock edge. The clock edge would be
infinitely fast, and the flipflop would change states instantaneously. Real World clocks have
rise/fall times, and flipflops require stable inputs during the setup/hold time to achieve a
stable output state.
The flipflop metastability problem will never go away as long as a signal has a
random phase relation to the flipflop clock. However, IC manufacturers have made great
progress in closing the metastability window (this window is the setup plus hold time
window). By increasing the speed of the flipflop, we make the metastability window
narrower and less of a problem. The fact is, most problems that designers blame on
metastability are related to asynchronous design technique. Each FPGA input should drive
one and exactly one flipflop. The output of this single flipflop can be used to drive another
flipflop for added security or can be used to drive the rest of your synchronous system.
When an asynchronous input drives multiple flipflops, and the input changes near the clock
edge, some flipflop outputs will change and some will not. This is not a metastability
problem; this is an asynchronous input problem!
Figure 2-14 illustrates this. The RC delays represent signal delays due to routing and
load inside the FPGA. We want all three flipflop outputs to be the same, but, depending on
the phase of the input signal, sometimes the outputs will not be the same. If we synchronize
the input with a single flipflop and do not violate its setup/hold time requirement, then all
outputs are assured to be the same. That’s what we want!
How can we absolutely assure that the inputs are not going to change during the setup
and hold period of the flipflop? The answer is an important part of the solution for the
question: “How can I create a nearly trouble-free design?”
Always synchronize your inputs! This means an asynchronous input drives exactly one flipflop.
The output of this flipflop can be safely used to drive the rest of your synchronous circuitry.
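The rule can be sketched in Verilog as follows. This is a minimal illustration, not a listing from the original design: the module name sync_input and all signal names are assumptions. The asynchronous input drives exactly one flipflop, and the three downstream flipflops all sample the synchronized copy.

```verilog
// Minimal sketch: one synchronizing flipflop per asynchronous input.
// Module and signal names are hypothetical.
module sync_input (clock, async_in, out1, out2, out3);
input  clock, async_in;
output out1, out2, out3;
reg    sync_in;             // the ONLY flipflop driven by async_in
reg    out1, out2, out3;

// Capture the asynchronous input exactly once.
always @ (posedge clock)
    sync_in <= async_in;

// Every downstream flipflop samples the same synchronized signal,
// so unequal routing delays from the pin can no longer make them disagree.
always @ (posedge clock)
begin
    out1 <= sync_in;
    out2 <= sync_in;
    out3 <= sync_in;
end
endmodule
```

The cost is one clock period of added latency on the input, which is usually a small price for outputs that always agree.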
This circuit could hardly be simpler; the inverting output is fed back to the D input,
and the output changes state on every other clock edge. It is interesting to think about a
situation where this circuit does not work. Let’s assume that the device technology has some
easy numbers to work with, so all delays are 1 nsec.
Flipflop Specification:
Flipflop Minimum Input Setup Time: 1 nsec
Flipflop Minimum Input Hold Time: 1 nsec
Clock-to-Output Delay (Maximum): 1 nsec
Maximum Propagation Delay Time (Q output to D input): 1 nsec
For repeatable results, the D input must be stable 1 nsec before the clock edge and
must remain stable after the clock edge for 1 nsec. The flipflop output is guaranteed to reach
its final value less than 1 nsec after the clock edge. It takes less than 1 nsec for the signal to
propagate from the Q output back to the D input. At what clock frequency does this circuit
begin to fail?
Successive rising edges of the clock must be separated by at least setup time + clock-to-output
delay + routing delay, or 3 nsec. This means the input clock had better not have a frequency
greater than 1/(3 nsec), or 333.333 MHz. This is a high frequency, most likely achievable only with an ASIC using
today’s technology. An FPGA will have longer (possibly much longer) delays and will have
correspondingly lower maximum clock frequencies.
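As a rough sketch, the toggle circuit and the timing arithmetic above might be written like this. The module names are hypothetical, and the asynchronous reset is an added assumption so that the flipflop starts in a known state during simulation; the second module is simulation-only and simply repeats the 1 nsec numbers from the specification list.

```verilog
// Toggle flipflop: the inverted output is fed back to the D input.
// Names and the reset input are assumptions for illustration.
module toggle_ff (clock, reset, q);
input  clock, reset;
output q;
reg    q;

always @ (posedge clock or posedge reset)
    if (reset)
        q <= 1'b0;
    else
        q <= ~q;            // output changes state on every rising edge
endmodule

// Simulation-only arithmetic for the minimum clock period.
module fmax_calc;
real t_setup, t_clock_to_q, t_route, t_min, f_max;

initial
begin
    t_setup      = 1.0;     // nsec, minimum input setup time
    t_clock_to_q = 1.0;     // nsec, maximum clock-to-output delay
    t_route      = 1.0;     // nsec, maximum Q-to-D propagation delay
    t_min = t_setup + t_clock_to_q + t_route;   // minimum period, nsec
    f_max = 1000.0 / t_min;                     // maximum frequency, MHz
    $display("Minimum clock period:    %f nsec", t_min);
    $display("Maximum clock frequency: %f MHz", f_max);
end
endmodule
```

Substituting the vendor's real numbers for the three delays gives a first estimate of the clock rate a given device can support for this loop.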
The delays for the device elements are provided by the device vendor. The number of
delays can be mind-boggling. An FPGA has a complicated mix of delays: clock to Q,
routing delays through switch elements, delays through signal multiplexers (look-up tables),
and delays proportional to signal loading, among others. For example, for a 4000XL device,
Xilinx specifies 41 timing parameters in 4 speed grades, for a total of 164 individual timing
numbers. Memorize them; there will be a test later. Fortunately, the compiler knows these
published delays and will calculate the totals for your circuit design. Let’s consider another
simple circuit, two flipflops in series as shown in Figure 2-17.
Again, this is deceptively simple. How can this circuit work reliably? What if the
minimum clock-to-output delay (a value that is rarely specified, but often estimated as 25%
of typical) for U1 is less than the hold-time requirement for U2? So, you tear up your data
book looking for the hold-time requirement, and with a sigh of relief (if you’re lucky) you
see that it is specified as zero. The suspicious engineer will say, hold on a nanosecond, how
can it be zero? All the logic circuitry we’ve ever looked at requires some hold time greater
than zero. And that’s correct. It has to be so, but the designer of the FPGA logic cell has
done some work for us and has put in delays to guarantee that the data path (the logic in
series with the D input) has a longer delay than the clock path. Essentially, the hold time is
traded for setup time: the data is delayed inside the cell by enough to cover the flipflop’s
hold requirement, which lengthens the setup time seen from outside. With reference to the
clock edge, the input signal takes longer to arrive at the D input, but it also stays around
longer. Even if the input signal changes coincident with the clock edge, the data delay inside
the logic cell will make sure it stays valid long enough to satisfy the buried hold-time
requirement of the flipflop. This simplifies the
analysis of the FPGA design and assures that a circuit like Figure 2-18 will function.
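A minimal Verilog sketch of two flipflops in series is shown here, with assumed module and signal names. The nonblocking assignments model the behavior we are counting on: on each clock edge, the second flipflop captures the value the first flipflop held before that edge.

```verilog
// Two flipflops in series (U1 followed by U2).
// Module and signal names are hypothetical.
module two_ff (clock, d, q);
input  clock, d;
output q;
reg    q1;                  // output of U1
reg    q;                   // output of U2

always @ (posedge clock)
begin
    q1 <= d;                // U1 captures the input
    q  <= q1;               // U2 captures the OLD value of U1
end
endmodule
```

In simulation this works by definition. In hardware it works only because the logic cell and clock network guarantee the hold-time margin described above.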
In summary, the FPGA chip designer has created a logic cell that assures the circuit of
Figure 2-17 will work. ASIC designers don’t have this luxury and must account for delay
and tolerance build-ups in their design. We do not have this luxury when dealing with
signals from outside the FPGA. The signal characteristics of external signals must be
examined and understood completely. If there is any sign of slow or glitchy signals, then we
will implement circuits with hysteresis (like a Schmitt trigger) and will use a two-flipflop
synchronizing circuit to minimize metastability.
Hysteresis is a circuit that adds positive feedback to the input. The idea is that when
the output switches, it adds to the input to help prevent oscillation. The amount of feedback
should be slightly greater than the noise on the input signal. Xilinx doesn’t widely advertise
this information, but all their FPGA inputs have a few hundred millivolts of hysteresis; this
makes their inputs friendly to noisy environments.
To complete our analysis, we must consider clock-skew. In a perfect world, all
flipflops in our design will receive clock edges that are exactly synchronous. The first thing
to understand is that the clock-skew problem is not related to the operating frequency of your
design! Even a slow design can have clock-skew problems.
Let’s expand the circuit of Figure 2-17 to show clock-skew (see Figure 2-19). Imagine
that the flipflops are located far apart in the design and the second flipflop clock is delayed
from the clock ‘seen’ by the first flipflop.
What is the problem? Let’s call t1 the clock-to-output delay period and t2 the
propagation delay of the signal across the device to the D input of the second flipflop. We
are hoping (and perhaps assuming) the value clocked into U2 is the old value of the Q
output of U1. If the skew of the clock is too long, then we’ll get the new value at U1-Q—or
worse, we’ll violate the setup or hold time requirement of U2 (depending on how much
delay occurs) and get an unknown output from U2. We’re digital engineers; we don’t like
unknowns. What is the solution? Fortunately, the FPGA designer provided low-skew clock
networks carefully crafted to assure that the longest skew of the clock anywhere across the
device is shorter than the shortest sum of clock-to-Q and signal routing propagation times. If
you can use a global low-skew clock network, then there’s no problem. If you create an
asynchronous design by using a routed clock (one that travels through random logic in the
design), a gated clock, a MUX’d clock, or are designing an ASIC (where the clock networks
are all custom designed), then you are responsible for assuring that this requirement is met.
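The skew requirement can be explored with a small simulation-only sketch. Everything here is an assumption for illustration: the delay values, the names, and the 20 nsec clock period. The # delays are not synthesizable; they only model t1, t2, and the clock skew. With SKEW set larger than T1 + T2, U2 should capture the new value of U1-Q on the same clock edge; with SKEW smaller than T1 + T2, U2 should capture the old value and update one clock later.

```verilog
`timescale 1ns / 1ns
// Simulation-only model of clock skew between two flipflops.
// All names and delay values are hypothetical.
module skew_demo;
parameter T1   = 1;         // clock-to-output delay of U1, nsec
parameter T2   = 1;         // routing delay from U1-Q to U2-D, nsec
parameter SKEW = 3;         // extra clock delay seen by U2, nsec

reg  clock, d;
reg  q1, q2;
wire d2, clock_u2;

assign #T2   d2       = q1;     // routed data arriving at U2
assign #SKEW clock_u2 = clock;  // skewed clock arriving at U2

always @ (posedge clock)    q1 <= #T1 d;   // U1
always @ (posedge clock_u2) q2 <= d2;      // U2

always #10 clock = ~clock;      // 20 nsec clock period

initial
begin
    clock = 0; d = 0; q1 = 0; q2 = 0;
    #15 d = 1;                  // change the input between clock edges
    #60 $finish;
end

// Report every change so the capture order can be read from the log.
always @ (q1 or q2)
    $display("%0d nsec: q1=%b q2=%b", $time, q1, q2);
endmodule
```

Note that the clock frequency never enters the comparison; only SKEW against T1 + T2 matters, which is why even a slow design can fail.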
We must also carefully analyze the situation where the FPGA designer has no control of one
or more of the signals. Consider the case where an input source, represented by the flipflop
U1 in Figure 2-17, is off-chip and is connected to a flipflop clocked by the FPGA clock. If
U1 is a fast device, it is very possible that a race condition, which means signals arrive at
synchronizing flipflops at different times, will occur. The race problem occurs when signals
are changing at the input of a gate at the same time. This results in an unknown output.
We’re digital designers; we like 1’s and 0’s. Unknown output states make us neurotic and
twitchy.
This signal-race situation is much worse if there is no input-synchronizing flipflop in
the input, because the race condition propagates across the design to all the circuits sensitive
to the inputs. Very bad. At least, if there is an input-synchronizing flipflop, the only
setup/hold time requirement is on that specific flipflop; once the timing is worked out for
that device, the signal is well conditioned for operation inside the design. In a case like this,
the easy solution is to make sure that the external device runs off the same clock as the logic
synchronizing clock inside the design and is a slow device so the output can’t change fast
enough to cause a race condition. Proving this can be a problem, because chip
manufacturers almost never provide a minimum clock-to-Q output time. This is good for the
manufacturers because it allows them to improve the IC process (make the device smaller,
faster, and cheaper to build) without changing the data sheet. It’s bad for the designer using
the parts who is diligently trying to do a worst-case timing analysis.
A solution might be to clock external devices on the clock edge opposite to the one
used inside the FPGA. Xilinx allows a flipflop to be clocked by either the rising or falling
clock edge. Careful analysis must be done to assure that the timing works out. The clock-
skew and setup time must be less than 1/2 a clock period compared to the full clock period
allowed internal to the FPGA/ASIC design. A schematic of a circuit that uses the alternate
clock edge is illustrated in Figure 2-20, with the resulting timing waveforms in Figure 2-21.
Keep in mind that clocks never have perfectly equal high and low periods and these
variations in duty cycle will subtract from the available flipflop setup time margin.
Figure 2-20 Two Flipflops Connected in Series Using Alternate Clock Edges
Figure 2-21 Two Flipflops Connected in Series Using Alternate Clock Edges, Timing
Diagram
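A minimal sketch of the alternate-edge arrangement follows, assuming hypothetical names. Here the first flipflop stands in for the external device and changes on the falling edge, while the second flipflop, inside the FPGA, samples on the rising edge.

```verilog
// Two flipflops in series using alternate clock edges.
// Module and signal names are hypothetical.
module alt_edge (clock, d, q);
input  clock, d;
output q;
reg    q1;                  // U1: stands in for the external device
reg    q;                   // U2: synchronizing flipflop in the FPGA

// U1 changes on the falling edge, half a period before U2 samples.
always @ (negedge clock)
    q1 <= d;

// U2 samples on the rising edge, when U1's output has had about
// half a clock period (less skew and duty-cycle error) to settle.
always @ (posedge clock)
    q <= q1;
endmodule
```

The half-period budget is the design choice to watch: it removes the race, but only if clock-to-output, routing, skew, and setup time all fit inside the shorter of the two clock phases.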
CLOCKING STRATEGIES
We’ve already decided that we want to create a synchronous design. This means there is at
least one clock (preferably exactly one clock). Still, decisions remain about the clocking
strategy used in the design. For the most trouble-free design, use one master clock. But what
if the design has different clock domains it must interface with? What if using a single clock
results in too much power consumption? There is no one answer to this problem; the answer
depends on what you’re trying to accomplish. Here are some suggested clock strategies:
Clock Enable
Verilog HDL does not support dedicated clock-enable signals. The hardware (FPGA or
ASIC) may have dedicated clock-enable resources, but Verilog does not give direct control
of this signal assignment. In the meantime, synthesis vendors will provide this support
through compiler directives. This means that code like Listing 2-10, depending on whether
the target hardware has dedicated clock-enable support, might synthesize in different ways.
One way a design might be interpreted by the synthesizer is illustrated in Figure 2-22 where
a clock-enable feature is available in the FPGA logic block design.
module clock_en(out,in,clock,clock_enable1,clock_enable2,reset);
output out;
input in, clock, clock_enable1, clock_enable2, reset;
reg out;
Some logic may get included in the logic that drives the clock enable, as shown in
Figure 2-23. Note that the logic is not exactly the same; the point is that the synthesizer may
insert added logic into the clock-enable path.
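One plausible shape for a clock-enabled register with this port list is sketched below. The body is an assumption: in particular, ANDing the two enables and using an asynchronous active-high reset are illustrative choices, and the module is renamed clock_en_sketch to make clear that it is not the original listing.

```verilog
// Sketch of a register with a clock enable.
// The enable equation and reset style are assumptions.
module clock_en_sketch (out, in, clock, clock_enable1, clock_enable2, reset);
output out;
input  in, clock, clock_enable1, clock_enable2, reset;
reg    out;

always @ (posedge clock or posedge reset)
begin
    if (reset)
        out <= 1'b0;
    else if (clock_enable1 & clock_enable2)  // logic in the enable path
        out <= in;
    // No final else: out holds its value when not enabled. A synthesizer
    // may map this to a dedicated clock-enable pin, or to a MUX that
    // feeds out back to the D input.
end
endmodule
```

Because the clock itself is never gated, the design stays synchronous whichever mapping the synthesizer chooses.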
LOGIC MINIMIZATION
A synthesizer can recognize and remove redundant logic. For example, the code fragments
of Listings 2-11 and 2-12 are equivalent.
sample = ((test1 & test2 & test3) | (test1 & !test2 & test3)
| (test1 & test2 & !test3));
The logic is minimized even if the designer intentionally put in the redundant logic to
provide hazard coverage. Hazard coverage is the addition of redundant logic to cover up
race conditions. This text will never suggest using hazard coverage; always use
synchronous design techniques to avoid hazards.
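The redundancy in the three-term equation above can be checked by brute force in simulation. In this sketch the minimized form test1 & (test2 | test3) is our own reduction, not one quoted from the text, and the testbench name and structure are assumptions.

```verilog
// Exhaustive comparison of the redundant and minimized equations.
// Testbench name and the minimized form are assumptions.
module minimize_check;
reg  test1, test2, test3;
wire sample_long, sample_short;
integer i, errors;

assign sample_long  = ((test1 & test2 & test3) | (test1 & !test2 & test3)
                     | (test1 & test2 & !test3));
assign sample_short = test1 & (test2 | test3);

initial
begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1)
    begin
        {test1, test2, test3} = i;      // walk all 8 input combinations
        #1;                             // let the assigns settle
        if (sample_long !== sample_short)
            errors = errors + 1;
    end
    if (errors == 0)
        $display("The two forms are equivalent.");
    else
        $display("%0d mismatches found.", errors);
    $finish;
end
endmodule
```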
The compiler can also recognize equivalent logic equations. An alternate form of an
equation might use less area or fewer levels of logic when implemented in an FPGA. The
compiler will try alternate equation forms and use the equation that best meets the design
requirements.
DeMorgan’s Theorems
There is a corollary to the AND/OR form that can be applied to the exclusive-OR
form:
a ^ b = ~a ^ ~b;
~(a ^ b) = ~a ^ b = a ^ ~b;
AND/OR functions are duals of each other (like division is the dual of multiplication).
DeMorgan’s law defines the conversion between the AND/OR equation forms.
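For reference, DeMorgan's theorems in Verilog notation are ~(a & b) = ~a | ~b and ~(a | b) = ~a & ~b. The small self-checking sketch below (the testbench name is an assumption) confirms these and the exclusive-OR identities above for every combination of a and b.

```verilog
// Exhaustive check of DeMorgan's theorems and the XOR corollaries.
// Testbench name is hypothetical.
module demorgan_check;
reg a, b;
integer i, errors;

initial
begin
    errors = 0;
    for (i = 0; i < 4; i = i + 1)
    begin
        {a, b} = i;
        // AND/OR forms
        if ((~(a & b)) !== (~a | ~b)) errors = errors + 1;
        if ((~(a | b)) !== (~a & ~b)) errors = errors + 1;
        // Exclusive-OR forms
        if ((a ^ b)    !== (~a ^ ~b)) errors = errors + 1;
        if ((~(a ^ b)) !== (~a ^ b))  errors = errors + 1;
        if ((~(a ^ b)) !== (a ^ ~b))  errors = errors + 1;
    end
    if (errors == 0)
        $display("All identities hold.");
    else
        $display("%0d identity checks failed.", errors);
    $finish;
end
endmodule
```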
The compiler can also manipulate equations using the laws of Boolean algebra. These
laws are:
Commutative Law
a | b = b | a;
Associative Law
a | (b | c) = (a | b) | c;
Distributive Law
a & (b | c) = (a & b) | (a & c);
Because the designer uses synchronous techniques and doesn’t clog up the design
with complicated structures between registers, the ability of the synthesis tool to extract
redundant logic is limited. There may be simpler logic, but the synthesizer will not be able
to extract it if the logic is spread across register boundaries. Examine Figure 2-26, which
implements the logic of Listing 2-11 with synchronous techniques. The synthesizer will not
find the redundancy! Except for some propagation delays, the two circuits shown in Figure
2-26 are equivalent.
The best logic synthesizer is the one between your ears. A poorly planned design will
always be poor regardless of how great the compilers become. When doing a design, a good
designer keeps a model of the synthesized logic in her head and doesn’t allow the logic to
grow so complex that it becomes a problem for the synthesis tool. One way of taking
advantage of the synthesis tool’s capability to minimize and pack logic effectively is to
never create purely combinational modules. None of the popular FPGA architectures have
purely combinational logic elements. There is generally a register that goes wasted if a CLB
is used only for combinational logic. Mix the combinational logic with the synchronous
logic to allow the synthesis tool to merge the logic into the resources available in the device.
The logic block architecture uses combinational logic or LUTs (Look-Up Tables) that feed
into registers. Write your logic that way!
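In Verilog, "writing your logic that way" might look like the sketch below: a function of four inputs computed and registered in the same always block. The module name and the particular logic function are arbitrary assumptions.

```verilog
// Combinational logic feeding a register, written so that it can
// pack into one LUT-plus-flipflop logic cell.
// Module name and logic function are hypothetical.
module lut_ff (clock, a, b, c, d, q);
input  clock, a, b, c, d;
output q;
reg    q;

always @ (posedge clock)
    q <= (a & b) | (c ^ d);     // 4-input function, then the flipflop
endmodule
```

Keeping the equation and the register together gives the synthesis tool an obvious candidate for a single logic cell, instead of a combinational module whose register goes unused.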
WHAT DOES THE SYNTHESIZER DO?
It’s helpful to think about what the synthesizer is doing. The synthesis tool takes
Verilog HDL and maps it into hardware. First, the synthesizer will minimize logic equations
by removing redundant logic terms. Then the design will be a huge set of Boolean
equations. The remaining problem can be thought of as a simple division:
A/B
where A is the full design and B represents the hardware elements available in the target
CPLD, FPGA, or ASIC. In general, for a CPLD the hardware structure will be multi-input
Logic Elements (LEs), for an FPGA the hardware structure will be 3- or 4-input look-up
tables (LUTs), and for an ASIC the hardware structure will be a more freeform collection of
library elements. Assuming the basic logic element is a 4-input LUT, the synthesis tool will
partition our complicated numerator into many equations, each a function of 4 inputs.
There will be many sets of equations that will implement our design, and the synthesis tool
will attempt to find ones that meet the design goals of size and speed.
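To make the partitioning idea concrete, here is a sketch of one 7-input equation and one of the many ways it could be split into functions of no more than 4 inputs each. The equation, the names, and the particular split are all assumptions; a real synthesizer may choose a different partition.

```verilog
// One 7-input equation and one possible 4-input-LUT partition of it.
// Names and the chosen split are hypothetical.
module partition_sketch (in, y_flat, y_lut);
input  [6:0] in;
output y_flat, y_lut;
wire   lut1;

// As the designer might write it: a single 7-input equation.
assign y_flat = (in[0] & in[1] & in[2] & in[3]) | (in[4] & in[5] & in[6]);

// One possible partition: two equations of 4 inputs each.
assign lut1  = in[0] & in[1] & in[2] & in[3];      // first LUT
assign y_lut = lut1 | (in[4] & in[5] & in[6]);     // second LUT

endmodule
```

Both outputs compute the same function; the second form costs two levels of logic, which is the size/speed trade the tool is weighing.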
A truth table lists all input combinations and defines an output condition for each. A
truth table is a tabular equation form and works well for software manipulation of equations.
The compiler will extract a sum-of-products (SOP) equation from your HDL code. The SOP
is developed by collecting terms that give a 1 result and ORing them together.
Let’s do a SOP representation of a 7-segment decoder. This decoder, similar to a
CMOS 4513, will convert a 4-bit binary-coded decimal (BCD) number to device pins that
drive a 7-segment display.
Input (BCD)     Segment
b3 b2 b1 b0     a  b  c  d  e  f  g
0  0  0  0      1  1  1  1  1  1  0
0  0  0  1      0  1  1  0  0  0  0
0  0  1  0      1  1  0  1  1  0  1
0  0  1  1      1  1  1  1  0  0  1
0  1  0  0      0  1  1  0  0  1  1
0  1  0  1      1  0  1  1  0  1  1
0  1  1  0      1  0  1  1  1  1  1
0  1  1  1      1  1  1  0  0  0  0
1  0  0  0      1  1  1  1  1  1  1
1  0  0  1      1  1  1  1  0  1  1
Let’s collect the input terms that cause the ‘a’ segment to be asserted.
a = ((!b3 & !b2 & !b1 & !b0) | (!b3 & !b2 &  b1 & !b0)
   | (!b3 & !b2 &  b1 &  b0) | (!b3 &  b2 & !b1 &  b0)
   | (!b3 &  b2 &  b1 & !b0) | (!b3 &  b2 &  b1 &  b0)
   | ( b3 & !b2 & !b1 & !b0) | ( b3 & !b2 & !b1 &  b0));
We can get a hint about how the reduction algorithm works by extracting and
examining two terms of the equation:
(!b3 & !b2 & !b1 & !b0)|(b3 & !b2 & !b1 & !b0)=(!b2 & !b1 & !b0);
The equation terms differ only in the b3 term, which is asserted low in the first term
and asserted high in the second. Clearly the b3 term doesn’t matter, is redundant, and can be
removed without affecting the logic.
Next we’ll convert the ‘a’ segment equations to standard decimal-sum form by
replacing all negated terms with 0 and all true terms with 1. A term like (!b3 & !b2 & !b1 &
!b0), which has all terms negated, becomes (0,0,0,0), and the whole term can be represented
by a decimal 0. The next term, (!b3 & !b2 & b1 & !b0), (0,0,1,0) becomes 2, and so on,
until we collect all the terms that lead to the ‘a’ segment being asserted:
a = (0,2,3,5,6,7,8,9)
The Quine-McCluskey algorithm arranges terms in order of the number of 0 (negated)
terms they contain. Only terms whose total numbers of negated terms differ by 1 can possibly be
combined. For example, when we combined (0,0,0,0) with (1,0,0,0), this combination was
possible because the first term has 4 zeros and the second term has 3 zeros. The Quine-
McCluskey algorithm exhaustively tests terms and combined terms against each other to
determine the minimum logic expression.
Running QM with the logic terms for segment ‘a’ gives the reduced equation:
a = ((!b3 & !b2 & !b0) | (b3 & !b2 & !b1) | (!b3 & b2 & b0) | (!b3 & b1));
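The reduced equation can be checked against the truth table in simulation. The sketch below is our own, with assumed module names: it wraps the reduced equation in a small module and compares it with the 'a' column for the BCD digits 0 through 9, where 'a' is off only for digits 1 and 4.

```verilog
// Reduced 'a' segment equation, wrapped in a module.
// Module and port names are hypothetical.
module seg_a (bcd, a);
input  [3:0] bcd;
output a;
wire   b3, b2, b1, b0;

assign {b3, b2, b1, b0} = bcd;
assign a = ((!b3 & !b2 & !b0) | (b3 & !b2 & !b1)
          | (!b3 & b2 & b0)   | (!b3 & b1));
endmodule

// Self-checking testbench for digits 0 through 9.
module seg_a_check;
reg  [3:0] bcd;
reg  expected;
wire a;
integer i, errors;

seg_a dut (bcd, a);

initial
begin
    errors = 0;
    for (i = 0; i <= 9; i = i + 1)
    begin
        bcd = i;
        #1;                                     // let the logic settle
        expected = !((i == 1) || (i == 4));     // 'a' column of the table
        if (a !== expected)
            errors = errors + 1;
    end
    if (errors == 0)
        $display("Reduced equation matches the truth table.");
    else
        $display("%0d mismatches found.", errors);
    $finish;
end
endmodule
```

The check stops at 9 because the truth table defines only the BCD digits; for inputs 10 through 15 this particular equation drives 'a' low.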
Let’s see if we can follow what the synthesizer does with this logic defined as a
Verilog design in Listing 2-13.
Listing 2-13 Verilog Design for 7-Segment Display Decoder ’a’ Term
For Xilinx 4xxx logic, which uses a 4-input LUT feeding a flipflop as a primitive, the
synthesizer arranges the logic to efficiently use the CLB resources and gives the circuit of
Figure 2-28 for the ‘a’ logic.
Figure 2-28 Synthesized Logic for 7-Segment Display Decoder ‘a’ Term Logic
AREA/DELAY OPTIMIZATION
When implementing a design, there are two fundamental properties: how big is it and how
fast will it operate? Synthesizing a logic design is much like autorouting a circuit board.
When routing a circuit-board trace, there are many options. Which path should the signal
take? What is the signal priority compared to other signals? There is no one answer. The
circuit-board trace can take a nearly unlimited set of paths to its destination. The right
answer occurs when the routing has met the requirements of the design, even if it’s possible
to get better area/delay performance. This bears special emphasis. The designer’s work
will not be judged by how perfect it is! The designer’s work will be judged by how well it
meets the system requirements for product cost, development cost, performance, reliability,
maintainability, and time to market. The quest for perfection will not be rewarded. The goal
of our quest is to achieve ‘good enough.’ This does not mean we’re going to deliver a bad
design. Our design still must meet timing requirements and use good design practices.
The concept of design cost weighs area and speed (delay) against each other. In many
cases, the fastest design is not the smallest. In many cases, the smallest design is not the
fastest. The designer has successfully accomplished the design if it fits into the technology
selected and runs fast enough to meet the needs of the system. How easy or difficult this
problem is depends on many factors: the size of the selected device, the architecture of the
device technology, the system speed requirement, and the skill and design approach of the
designer.
The experienced designer always leaves a way out of a problem by ensuring that
a faster or denser device, if at all possible, is available in the same device footprint.
This way, instead of redesigning a circuit board to accommodate a new device at great
expense and loss of time, a faster and/or denser device can be easily substituted.
CHAPTER 3
A Digital Circuit Toolbox
Verilog uses a powerful method of isolating and maintaining identifiers. A module which is
not instantiated by other modules will be considered as a top module. The top module will
generally instantiate other modules which will appear underneath it in the design hierarchy.
This top module is called the root module. The design identifiers include module instances,
tasks, functions, or named begin/end blocks.
Each design identifier creates a new branch of the hierarchy tree. Each node of the
tree is unique and contains identifiers which will not conflict with other identifiers in other
branches or elements of the hierarchy. A signal can be accessed anywhere in the design by
top.device_bus1[5:0]
top.device_bus2[3:0]
top.device_bus3[3:0]
top.s1[3:0]
top.s2[4:0]
top.s3[3:0]
top.s4[4:0]
top.s5[5:0]
top.msb1
top.msb2
top.u3.add1[4:0]
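A hierarchical name like top.u3.add1 can be sketched as follows. The adder module, its port names x and y, and the stimulus values are assumptions; only the names top, u3, add1, s1, and s3 are borrowed from the list above.

```verilog
// A lower-level module instantiated as u3 inside the root module top.
// The adder itself and its inputs are hypothetical.
module adder (x, y, add1);
input  [3:0] x, y;
output [4:0] add1;

assign add1 = x + y;        // 5-bit result keeps the carry
endmodule

module top;
reg  [3:0] s1, s3;
wire [4:0] sum;

adder u3 (s1, s3, sum);     // instance name creates the u3 branch

initial
begin
    s1 = 4'd9;
    s3 = 4'd8;
    #1;
    // Reach down through the hierarchy by full path name.
    $display("top.u3.add1 = %d", top.u3.add1);
    $finish;
end
endmodule
```

Hierarchical references like this are most useful in testbenches and waveform viewers; synthesizable code normally passes signals through ports instead.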
Tristate busses are allowed by most FPGA architectures on device output pins. Listing 3-2
provides an example. In addition, some FPGAs allow internal tristate signals. Internal
tristates can save a lot of logic when selecting between different sets of control or data
signals. In other words, different logic trees feed a tristate bus with one logic tree enabled at
a time. With this method, an entire logic construct can be switched quickly. If internal
tristates are not allowed, the synthesis tool may have a control to automatically substitute
MUXes. This replacement will consume many more gates and is likely to be much slower
than a tristate bus structure. If conversion to an ASIC is intended, check with the ASIC
vendor. Internal tristates are an ASIC conversion issue; the vendor may not offer internal
tristates and may or may not offer automatic expansion of tristate nets to logically controlled
nets. Internal tristates can also cause problems during simulation.
In the conditional part of the assign statement, both input_bus and tri_control can be
logic equations. For internal tristates, use the tri net type as illustrated in Listing 3-3.
It is the designer’s responsibility to ensure that the tristate buffer enables are mutually
exclusive so that bus conflicts are avoided. Even transient tristate bus conflicts can cause
excessive power consumption and, if allowed to occur too long or too often, can overheat and
damage the device.
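The two tristate styles might be sketched as shown below. Only the names input_bus and tri_control come from the text; the module name, bus widths, and the other signals are assumptions. The internal bus enables are derived from a single select signal, which is one simple way to keep them mutually exclusive.

```verilog
// Output-pin tristate and an internal tristate bus.
// Module name, widths, and most signal names are hypothetical.
module tristate_sketch (input_bus, tri_control, output_bus,
                        sel, a_bus, b_bus, y);
input  [7:0] input_bus;
input        tri_control;
output [7:0] output_bus;
input        sel;
input  [7:0] a_bus, b_bus;
output [7:0] y;

tri    [7:0] internal_bus;      // net with more than one driver

// Output-pin tristate: drive the bus, or release it to high impedance.
assign output_bus = tri_control ? input_bus : 8'bz;

// Internal tristate bus: exactly one driver is enabled at a time
// because both enables come from the same select signal.
assign internal_bus = sel  ? a_bus : 8'bz;
assign internal_bus = !sel ? b_bus : 8'bz;

assign y = internal_bus;
endmodule
```

If the enables came from separate logic instead, it would be up to the designer to prove that they can never be asserted together.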
The schematic shown in Figure 3-1 has three levels of buffers. The ones on the left
are input pin buffers, the ones in the middle are internal tristate buffers, and the ones on the
right are output pin buffers.
In Figure 3-3, the logic is the same (same Verilog source file) but the internal tristates
have been converted to MUXes. Note that one level of tristate buffering in Figure 3-1 has
been converted to two levels of MUX LUTs in Figure 3-3. This will result in a slower
operating speed. This type of conversion will occur if the design is converted to an ASIC
technology. This change was caused by checking the ‘Allow converting of internal tristates’
option in Exemplar Logic’s LeonardoSpectrum Optimize, Advanced Options menu, as illustrated in
Figure 3-2.
The boxes in the middle of Figure 3-3 are the MUX LUTs that replaced the internal
tristate buffers.
BIDIRECTIONAL BUSSES
Bidirectional busses, as shown in Listing 3-4 and Figure 3-4, are easy to define in Verilog.
The signal is divided into two parts: the driver part, which is tristated, and the input part.
The two parts are then wired together. The module port must be defined as inout in the port
definition section.
// Input part.
assign bidir_input = bidir_bus;
endmodule
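A complete module along these lines might look like the sketch below. The names bidir_bus and bidir_input follow the fragment above; the module name, the 8-bit width, and the bidir_output and drive_enable signals are assumptions.

```verilog
// Bidirectional bus: a tristated driver part and an input part
// wired to the same inout port.
// Module name, width, bidir_output, and drive_enable are hypothetical.
module bidir_sketch (bidir_bus, bidir_output, bidir_input, drive_enable);
inout  [7:0] bidir_bus;
input  [7:0] bidir_output;
output [7:0] bidir_input;
input        drive_enable;

// Driver part: drive the pins only when enabled.
assign bidir_bus = drive_enable ? bidir_output : 8'bz;

// Input part: always listens to the pins.
assign bidir_input = bidir_bus;
endmodule
```

When drive_enable is high, bidir_input simply reads back what the module itself is driving, which can be a handy self-check.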
PRIORITY ENCODERS
if-else statements can have an implied priority with precedence assigned to the first
instructions encountered in a begin/end block. Listing 3-5 illustrates a priority encoder with
the extracted schematic shown in Figure 3-5. If signal a is asserted, it has priority, and none
of the other signals matter. From a delay point of view, signal x passes through one level of
logic and is faster than signal z, which passes through three layers of logic.
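// Module header and declarations are assumed; names are illustrative.
module priority_encoder (a, b, c, x, y, z, d);
input  a, b, c, x, y, z;
output d;
reg    d;

always @ (a or b or c or x or y or z)
begin
    if (a)                  // highest priority
        d = x;
    else if (b)
        d = y;
    else if (c)             // lowest priority
        d = z;
    else
        d = 1'b0;
end
endmodule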
Like an if/else block, case blocks will create a priority encoder unless a full case compile
option is available and selected, or all input combinations have defined output states.
Selection of full case informs the compiler that the cases are mutually exclusive and do not
overlap. If one case is found true, then by definition no other case can be true. Since a
conflict is not allowed in a parallel case design, the cases should not be prioritized, but they
will be if conflicting cases are defined. If it is possible for more than one case to be true
(resulting in conflicting cases), then the first one encountered by the compiler will have
priority over later cases which might be true (lower-priority cases are considered don’t-care
conditions when the higher-priority case is evaluated). This means the statement order has
meaning. This may not be the behavior the designer wants. To avoid the priority encoding,
make sure all cases are covered or use the compiler directive to define the design as a full
case design, and avoid conflicting (or contradictory) cases.
When the case input condition list is not complete, a latch is created as shown in
Listing 3-6 and Figure 3-7. For cases not defined, the previous output is held. This may not
be what the designer intends. To prevent the creation of a latch, use a default case to cover
all undefined cases as shown in Listing 3-7, or check the LeonardoSpectrum full case
option box as shown in Figure 3-6. Using the full case option creates a MUX
implementation, as shown in Figure 3-8.
The parallel case check box will still create a latch but forces undefined outputs to a
known state, as shown in Figure 3-9.
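// Module header and declarations are assumed; names are illustrative.
module case_latch (a, x, y, z, d);
input  [2:0] a;
input  x, y, z;
output d;
reg    d;

always @ (a or x or y or z)
begin
    case (a)                // incomplete case list: a latch is inferred
        3'b001: d = x;
        3'b010: d = y;
        3'b100: d = z;
    endcase
end
endmodule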
The logic of Listing 3-7 is slightly different: as in Figure 3-8, all undefined cases are
set by default to zeroes. Still, notice in Figure 3-10 that a MUX was inferred without a latch.
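// Module header and declarations are assumed; names are illustrative.
module case_default (a, x, y, z, d);
input  [2:0] a;
input  x, y, z;
output d;
reg    d;

always @ (a or x or y or z)
begin
    case (a)
        3'b001: d = x;
        3'b010: d = y;
        3'b100: d = z;
        default: d = 1'b0;  // covers all undefined cases: no latch
    endcase
end
endmodule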
AREA/SPEED OPTIMIZATION IN SYNTHESIS
We’ve taken some preliminary looks at what the synthesizer does; let’s explore this a little
more. The optimization of synthesis for ASICs and FPGAs falls into two general categories,
speed/delay or area. Obviously, the design must operate at a high enough speed to meet the
design requirements. The faster the FPGA, the more expensive the device. The design must
fit into the chosen device. Larger devices are also more expensive. The FPGA designer
constantly struggles with the size (area) and speed of the synthesized logic. Size and speed
are influenced by coding style, but we’ll talk about that later.
The trade-off between speed (or delay) and area can be illustrated with an AND gate
design as shown in Listing 3-8.
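// Module header and declarations are assumed; names are illustrative.
module and_tradeoff (e, f, g, h, i, j, k, l, m, a, b, c, d, z);
input  e, f, g, h, i, j, k, l, m;
output a, b, c, d, z;
reg    a, b, c, d, z;

always @ (e or f or g or h or i or j or k or l or m)
begin
    a = e & f & g & h & i & j & k & l;
    z = m & a;
end
always @ (e or f or g or h or i or j or k or l)
begin
    b = e & f & g & h & i & j & k & l;
end
always @ (e or f or g or h or i or j or k or l)
begin
    c = e & f & g & h & i & j & k & l;
end
always @ (e or f or g or h or i or j or k or l)
begin
    d = e & f & g & h & i & j & k & l;
end
endmodule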
Figure 3-12 Illustration of Signals and Routing after Speed (Delay) Optimization
The difference between the circuits of Figures 3-12 and 3-13 is subtle; the logic is exactly
the same. The difference between the two designs is the selection of area/delay optimization
on the Quick Setup tab in LeonardoSpectrum (this selection can also be made in the
Optimize tab). However, you will notice that signal z in Figure 3-12 passes through 2 levels
of logic, while it passes through 3 levels of logic in Figure 3-13. Signal z will be created
faster in Figure 3-12 at the expense of increased FPGA real estate.
Another way to look at this design is to view the critical path. The critical path is the
longest delay in the design, and LeonardoSpectrum will extract this path so it can be
analyzed. The critical paths are illustrated in Figures 3-14 and 3-15.
Figure 3-15 shows the delay improvement: two layers of logic compared to three. For this
design, the critical paths are 22.78 nsec (delay optimization) and 24.57 nsec (area
optimization), a difference of 1.79 nsec, or roughly 7 to 8%. The battle the FPGA designer faces is trying to
fit the design into a small (cheap) and slow device and still achieve the necessary
performance. This can be more an art than a science.
From this example, it should begin to be clear how many options the synthesis tool
has for implementing a logic function. Regarding coding style, this example illustrates that
grouping together logic that shares inputs and/or outputs is helpful to the synthesis tool. The
best partitioning would keep them inside the same always block, or at least in the same
module as much as possible. The best design partitioner is the one between your ears! Don’t
make the synthesis tool work too hard to group functions of common inputs and outputs;
group them yourself in block structures and modules.
The synthesizer is best at optimizing combinational logic within a module. It may find
redundant logic in separate modules, but let’s not make the synthesizer work harder than it
has to. Try to keep combinational inputs and outputs of related signals grouped together in
the same module.
Here’s another design approach that will help. When working with a design team,
there should be an agreement about where flipflops occur at module boundaries. Typically,
the structure is per Figure 3-16. As long as all modules have this form, they will work well
together. Without this agreement, you may have to use synchronizing flipflops on inputs
because you don’t know what you’re interfacing to.
It bears mentioning that area and delay are not always a trade-off; sometimes the more
compact design is also the fastest. It has fewer gates, so it’s certainly possible that there
might be fewer levels of logic. It just depends on the design architecture and how well the
synthesis tool optimizes this design.
Another trade-off between area and speed is the one between latency and circuit operating
speed. Latency is the time period between when a signal occurs at the input of a design
element and when it finally propagates through the circuit to the output. Many times a
design can tolerate latency as long as the design throughput is fast. For faster throughput,
the designer splits up the logic so that fewer operations (layers of logic) appear between
clocks. This is illustrated in Listings 3-9 and 3-10. Functionally, the designs are the same,
but the design of Listing 3-10 will have greater latency (it will take three clock periods for
the output to appear at the output pins) and will operate at a higher clock rate. The design of
Listing 3-9 will use fewer flipflops, has less latency (the output will appear in one clock
cycle), and will operate at a lower clock rate than that of Listing 3-10.
Listing 3-9 Logic with Low Latency and Low Operating Speed
Listing 3-10 Logic with High Latency and High Operating Speed
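The latency trade-off can be sketched with a four-input adder; this is our own illustration rather than the logic of the original listings, and all names and widths are assumptions. The first module does all the arithmetic between one pair of clock edges. The second spreads the same arithmetic over three register stages, the last of which is simply an output register.

```verilog
// Low latency, low operating speed: all additions in one clock period.
// Names and widths are hypothetical.
module sum_unpipelined (clock, a, b, c, d, y);
input        clock;
input  [7:0] a, b, c, d;
output [9:0] y;
reg    [9:0] y;

always @ (posedge clock)
    y <= a + b + c + d;         // three adders between registers
endmodule

// High latency, high operating speed: one adder level per stage.
module sum_pipelined (clock, a, b, c, d, y);
input        clock;
input  [7:0] a, b, c, d;
output [9:0] y;
reg    [8:0] ab, cd;
reg    [9:0] abcd, y;

always @ (posedge clock)
begin
    ab   <= a + b;              // stage 1
    cd   <= c + d;              // stage 1
    abcd <= ab + cd;            // stage 2
    y    <= abcd;               // stage 3: output register
end
endmodule
```

A new result still emerges from the pipelined version on every clock, so throughput is set by the slowest single stage rather than by the whole chain of adders.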
Timing Constraints
There are two strategies for improving the performance of a design. Method One: assist the
synthesis tool in identifying critical logic by applying timing constraints. Timing constraints
are discussed in Chapter 6. This gives priority to logic that must run fast at the expense of
less critical areas of the design. Method Two: write the code to give the synthesis tool an
easier problem to solve. Most problems can be resolved by reworking the source code.
Pipeline the logic or use fast structural (schematic-like) elements to implement the logic.
Test Points
Experienced designers make mistakes. Knowing this, the experienced designer makes the
design easy to test and debug. A powerful technique for troubleshooting is to bring out test
points connected to easily accessible connections. For example, the layout of an HP logic
analyzer probe pod is shown in Figure 3-17. Put a double row of 0.1" header pins (either
through-hole or SMT types; SMT is better because it has less impact on PCB routing
channels and does not poke more holes in the power/ground planes) on the circuit board.
Notice that ground is connected to pin 20 and the Logic Analyzer clocks are connected to
pins 2 and 3. Pins are numbered, with odd pins 1 through 19 on one side and 2 through 20
on the other. Assign test[15:0] to device pins (which are tied to the test connector). Bring
signals to be tested up through the hierarchy to the top-level module in the manner of
Listing 3-11.
The 20-pin HP (p/n 1251-8106) logic analyzer pod has built-in termination networks
(similar to Figure 3-18) and plugs into a dual-row tenth-inch center header as shown in
Figure 3-17. To save money on pods and connect directly to the 40-pin logic analyzer pod,
put a 40-pin dual-row tenth-inch center header on the board per Figure 3-19 and termination
networks on the circuit board per Figure 3-18.
To expand the number of signals available for test, a MUX can be used to select
between sets of signals connected to the test points. Be aware: the designer can run into
Heisenberg’s Uncertainty Principle. This means that the method of taking a measurement can
affect the thing being measured. Keep in mind that the signal that passes through the MUX
is one or more logic levels removed from the internal signal, and the precise timing will be
different. For low-speed signals, this may not be important. Adding logic and routing for
test points can complicate the routing and result in slower timing. You didn’t expect to get
something for nothing, did you?
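A test-point MUX of the kind described above might look like the following sketch. Everything here is an assumption made for illustration: the module name, the test_sel select input, and the three 16-bit signal groups are hypothetical stand-ins for whatever internal signals have been brought up through the hierarchy. The output is registered, which keeps the timing at the pins clean at the price of one clock of delay on every observed signal.

```verilog
// Hypothetical test-point multiplexer; all names are illustrative.
module test_point_mux (
    input  wire        clk,
    input  wire [1:0]  test_sel,          // selects which group is observed
    input  wire [15:0] fsm_signals,       // stand-ins for internal signals
    input  wire [15:0] datapath_signals,
    input  wire [15:0] bus_signals,
    output reg  [15:0] test               // tied to the test connector pins
);
    always @(posedge clk)
        case (test_sel)
            2'b00:   test <= fsm_signals;
            2'b01:   test <= datapath_signals;
            2'b10:   test <= bus_signals;
            default: test <= 16'h0000;    // unused selection drives zeros
        endcase
endmodule
```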
Study the test equipment in your lab. Design the circuit board to allow easy interface
to that test equipment. Provide easy access to grounds to connect to your oscilloscope or
voltmeter (we never seem to be able to find a ground or power connection where we need
it). Leave room around parts to allow the use of sockets or test clips.
STATE MACHINES
It is very common to use sequential processes to solve a design problem. One event follows
another and steers a sequential machine through its states. This is a great method of dividing
and conquering a design. A Finite State Machine (FSM) uses a set of registers, called state
registers, to identify the current machine state. The current state depends on the inputs and
the history of inputs.
98 A Digital Circuit Toolbox Chapter 3
You can call me a nut, but I consider state machines (some people insist on calling
them finite state machines or FSMs, but since I’ve never seen an infinite state machine, I
don’t feel any compulsion to keep the finite label) to be one of technology’s wonders, as
beautiful as an Escher print, a current-mirror transistor pair, or a toroidal transformer. As
designers, we are constantly trying to break hard and complicated problems into pieces that
are easier to solve. The state machine well serves this quest. Once you are in a certain state,
the only inputs that matter are the few you explicitly define yourself.
There are many forms of state machines. In textbooks they are divided into the Mealy
and Moore types. In a Moore style state machine, the output depends only on the state. A
synchronous counter is one example of a Moore state machine; the output is dependent only
on the state of the machine (actually, the output IS the state of the machine). In a Mealy
state machine, the output depends on the state and some input conditions or signals. The
style that this book recommends combines output logic and state assignments in the same
always block.
It is helpful to create a state machine the longhand way. For a counter with outputs
encoded using Gray code, each sequential output differs by exactly one bit. We’ll discuss
Gray coding again later in this chapter. A Gray Code counter is a simple state machine; let’s
design one with gates and registers to see how it is done. First, create a present-state/next-
state chart as shown in Listing 3-12. After a clock edge, the present-state values are replaced
with the next-state values.
Collect all the terms that result in the next state bits being set to a 1 and OR them all
together to create a sum-of-products (SOP) representation of the next-state decoder logic as
shown in Listing 3-13.
n2 <= (~d2 & d1 & ~d0) | (d2 & d1 & ~d0) | (d2 & d1 & d0)
| (d2 & ~d1 & d0);
n1 <= (~d2 & ~d1 & d0) | (~d2 & d1 & d0) | (~d2 & d1 & ~d0)
| (d2 & d1 & ~d0);
n0 <= (~d2 & ~d1 & ~d0) | (~d2 & ~d1 & d0) | (d2 & d1 & ~d0)
| (d2 & d1 & d0);
The logic of Figure 3-20 shows the next-state decoding logic on the left and the state
register on the right. Let’s implement the present/next-state logic with a Verilog state
machine and see how it looks. I used a case structure in Listing 3-14 to create the next-state
decoders, so the code looks different than the logic above, but it’s the same, trust me.
      3'b001: begin
         next_state <= 3'b011;
         flag_output <= 1'b0;
      end
      3'b011: begin
         next_state <= 3'b010;
         flag_output <= 1'b1;
      end
      3'b010: begin
         next_state <= 3'b110;
         flag_output <= 1'b0;
      end
      3'b110: begin
         next_state <= 3'b111;
         flag_output <= 1'b0;
      end
      3'b111: begin
         next_state <= 3'b101;
         flag_output <= 1'b0;
      end
      3'b101: begin
         next_state <= 3'b100;
         flag_output <= 1'b0;
      end
      3'b100: begin
         next_state <= 3'b000;
         flag_output <= 1'b0;
      end
      default: begin
         next_state <= 3'b0;
         flag_output <= 1'b0;
      end
   endcase
end
endmodule
An output, called flag_output, was added to the Gray code design. This shows how
we add outputs to a state machine. I want the output to be asserted during the 010 state, so it
is set in the prior state (011). The flag_output signal will be asserted on entry to state 010
and cleared on exit of state 010. See Figure 3-21 for the waveforms created by this logic.
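For reference, a complete module in the same spirit might look like the sketch below. It is an assumption-laden illustration, not a reproduction of Listing 3-14: the module name, port names, and the single state register (instead of separate present-state and next-state signals) are my own choices. The flag is set while leaving state 011, so it is high during state 010.

```verilog
// Hypothetical 3-bit Gray code state machine with one decoded output.
module gray_counter_fsm (
    input  wire       clk,
    input  wire       reset,
    output reg  [2:0] state,
    output reg        flag_output
);
    always @(posedge clk or posedge reset) begin
        if (reset) begin
            state       <= 3'b000;
            flag_output <= 1'b0;
        end
        else begin
            flag_output <= 1'b0;              // default: flag cleared
            case (state)
                3'b000: state <= 3'b001;
                3'b001: state <= 3'b011;
                3'b011: begin
                    state       <= 3'b010;
                    flag_output <= 1'b1;      // asserted on entry to 010
                end
                3'b010: state <= 3'b110;
                3'b110: state <= 3'b111;
                3'b111: state <= 3'b101;
                3'b101: state <= 3'b100;
                3'b100: state <= 3'b000;
                default: state <= 3'b000;
            endcase
        end
    end
endmodule
```

The default assignment to flag_output at the top of the else branch is overridden only in state 011, because the last nonblocking assignment in a block wins.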
Using a state machine is a great way to partition a problem. When the state machine is
in a given state, neglecting the clock and reset inputs, nothing else matters except the logic
and the inputs explicitly referred to inside that state. How can we create a state machine that
will synthesize effectively? At around eight state registers and inputs, I’d consider breaking
up the state machine into smaller ones.
Let’s talk about another way to create Gray Code logic.
Listing 3-15 illustrates the algorithm for converting from binary to Gray Code.
else begin
gray_output[3] <= binary_input[3];
gray_output[2] <= binary_input[3] ^ binary_input[2];
gray_output[1] <= binary_input[2] ^ binary_input[1];
gray_output[0] <= binary_input[1] ^ binary_input[0];
end
endmodule
The reverse conversion, from Gray Code back to binary, XORs each Gray bit with all of the higher-order bits:
else begin
binary_output[3] <= gray_in[3];
binary_output[2] <= gray_in[3]^gray_in[2];
binary_output[1] <=(gray_in[3]^gray_in[2])^gray_in[1];
binary_output[0] <=((gray_in[3]^gray_in[2])^gray_in[1])^gray_in[0];
end
endmodule
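The two conversions generalize to any width. The sketch below is a combinational, parameterized version; the module name, the WIDTH parameter, and the port names are hypothetical, and unlike the fragments above it has no clock or reset.

```verilog
// Hypothetical parameterized binary/Gray converters (combinational).
module gray_convert #(parameter WIDTH = 4) (
    input  wire [WIDTH-1:0] binary_in,
    input  wire [WIDTH-1:0] gray_in,
    output wire [WIDTH-1:0] gray_out,     // Gray code of binary_in
    output reg  [WIDTH-1:0] binary_out    // binary value of gray_in
);
    // Binary to Gray: XOR each bit with its higher-order neighbor.
    assign gray_out = binary_in ^ (binary_in >> 1);

    // Gray to binary: each bit is the XOR of all Gray bits at or above it.
    integer i;
    always @(*) begin
        for (i = 0; i < WIDTH; i = i + 1)
            binary_out[i] = ^(gray_in >> i);
    end
endmodule
```

Note that the Gray-to-binary direction has a longer logic path as WIDTH grows, because the lowest binary bit depends on every Gray bit.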
State Assignments
The state assignments can make a big difference in how efficiently your logic will
synthesize. We will use parameters and `ifdef statements to select between encoding
assignments as shown in Listing 3-17. A binary count is the easiest to test and debug, but
using Gray Code for state assignments that occur in sequence will synthesize the most
efficiently. State machine state coding can also be used to directly generate output signals
by using a flipflop for both a state register and an output register. Using this method, the
clever designer can save some logic and create a faster design.
//`define binary
`ifdef binary
parameter state_zero  = 3'b000;
parameter state_one   = 3'b001;
parameter state_two   = 3'b010;
parameter state_three = 3'b011;
parameter state_four  = 3'b100;
parameter state_five  = 3'b101;
parameter state_six   = 3'b110;
parameter state_seven = 3'b111;
`else
parameter state_zero  = 3'b000;
parameter state_one   = 3'b001;
parameter state_two   = 3'b011;
parameter state_three = 3'b010;
parameter state_four  = 3'b110;
parameter state_five  = 3'b111;
parameter state_six   = 3'b101;
parameter state_seven = 3'b100;
`endif
endcase
end
endmodule
Some designs benefit from one-hot state assignments. One-hot means that each state is
assigned its own flipflop, which is active only in the assigned state. This type of
coding tends to spread out the FPGA logic and can make the logic easier to synthesize.
Most FPGA architectures have a lot of registers, so it may not be a terrible penalty to
consume some with the one-hot scheme: a one-hot state machine uses one flipflop per
state, which is more than a Gray- or binary-coded state machine needs. However, a one-hot design is not a
cure-all. In some cases a one-hot design uses more logic than binary or Gray coding.
Definitely check to make sure you’re really getting the benefit you expect when using one-
hot coding.
One aspect of the one-hot assignment to consider is the many unused or default
condition states which should be handled. By definition, the one-hot method uses one active
flipflop at a time, but what if two registers get asserted by a noise hit or metastable input
condition? Will the design recover? How will these cases be covered? This question is tough
to answer, so I generally stick to binary- or Gray-coded state assignments.
Notice that eight registers are required in Listing 3-18 to support the same number of
state assignments as Listing 3-17. Also note that a useful reset state (all state registers =
zero) is not used.
An alternate version of one-hot state coding is one-cold, in which all state registers are
set except one as shown in Listing 3-19.
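To make the recovery question concrete, here is a minimal one-hot sketch for a hypothetical four-state machine. The state names, the go input, and the module name are invented for illustration. The default branch catches both the all-zeros pattern and any pattern with more than one bit set; whether a given synthesis tool preserves that recovery logic or optimizes it away depends on the tool and its settings, so check the result.

```verilog
// Hypothetical one-hot state machine with illegal-state recovery.
module one_hot_fsm (
    input  wire       clk,
    input  wire       reset,
    input  wire       go,                 // hypothetical start input
    output reg  [3:0] state
);
    parameter IDLE  = 4'b0001,
              START = 4'b0010,
              RUN   = 4'b0100,
              DONE  = 4'b1000;

    always @(posedge clk or posedge reset) begin
        if (reset)
            state <= IDLE;                // note: reset state is not all zeros
        else
            case (state)
                IDLE:    state <= go ? START : IDLE;
                START:   state <= RUN;
                RUN:     state <= DONE;
                DONE:    state <= IDLE;
                default: state <= IDLE;   // zero or multiple bits set: recover
            endcase
    end
endmodule
```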
ADDERS
Binary adders are supported by Verilog synthesis tools. The synthesis tool will examine each
instance of the + operator and will try to implement the logic with a pre-optimized module.
The optimization can be influenced by compilation settings and may be optimized for area
or speed/delay. If the design is slow and small, then using adders may be trivial and may not
result in any problems. However, the logic required to implement one line of code (a <= b +
c;) can be huge if the input vectors are wide.
The designer should be aware that the synthesis tool is searching for adders. Don’t
make the job of identifying and extracting them too difficult. Make the adder standalone as
much as possible; i.e., don’t bury it too deeply in your logic.
Half-Adder Logic
A half-adder sums two inputs. The reason it is not a “complete” or full-adder is that it
ignores carry input signals. Logic is added to sum in a carry input to create a full adder.
Figure 3-22 shows a half-adder truth table and the associated schematic is shown in Figure
3-23.
By inspection, we can see that the carry output is a AND b (a & b) and the sum output
is a XOR b (a ^ b).
We don’t have to use the Verilog synthesizer’s version of a half adder if we think we
can do a better job than the synthesis tool; we can make our own. After we build a byte-
wide adder, we’ll compare our results with those of LeonardoSpectrum and admire how
much smarter we are than the synthesis-tool vendor.
Unfortunately, the half adder of Listing 3-20 only does half the job when adding
multibit values. As designers, we aren’t finding much job opportunity in designing single-bit
adders. To turn the half adder into a full adder, we take the sum output of one half adder and
connect it to an input of a second half adder. The carry input becomes the other input for the second
stage as shown in the truth table of Figure 3-24 and the logic of Listing 3-21.
To create the equations for the full adder, cascade two half adders as shown in Figure
3-25.
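The cascade can be written structurally as in the sketch below. The module and signal names are hypothetical, and this version is purely combinational. The carry output is the OR of the two half-adder carries; both can never be 1 at the same time, so an OR gate is enough.

```verilog
// Hypothetical half adder: sum = a XOR b, carry = a AND b.
module half_adder (input wire a, b, output wire sum, carry);
    assign sum   = a ^ b;
    assign carry = a & b;
endmodule

// Hypothetical full adder built from two half adders and an OR gate.
module full_adder_from_halves (
    input  wire a, b, carry_in,
    output wire sum, carry_out
);
    wire sum_ab, carry_ab, carry_sum;

    half_adder ha0 (.a(a),      .b(b),        .sum(sum_ab), .carry(carry_ab));
    half_adder ha1 (.a(sum_ab), .b(carry_in), .sum(sum),    .carry(carry_sum));

    assign carry_out = carry_ab | carry_sum;
endmodule
```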
You’ll notice in Listing 3-22 and Figure 3-26 that by habit, I changed the
combinational full-adder design to a synchronous version. In Figure 3-26, the modgen box
is simply an OR gate.
To create wider adders, in a structural manner, we can cascade as many of these full
adders as we wish. To create large adders, we should break the design into small modules
no larger than four bits wide, and stitch them together. This aids the synthesizer and will
generally improve area and speed performance.
The carry output of one adder feeds the carry input of the higher-order bit. Note: the
size of the output register must be one larger than that of the input registers in order to
accept the carry output of the last stage. An exception to this occurs where the range of the
input data is limited so that a carry is not possible. The carry input for the LSB adder is
fixed at 0. An example of this type of adder is presented in Listing 3-23. To make this
design work, the full_adder.v design must be included in the project. Listing 3-23 is also an
example of a simple hierarchy.
endmodule
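A structural byte-wide ripple-carry adder along these lines could be sketched as follows. The names are hypothetical (they deliberately differ from byte_adder and full_adder.v), and the generate loop is a Verilog-2001 convenience standing in for eight hand-written instances. The output register is nine bits wide to hold the final carry.

```verilog
// Hypothetical one-bit full adder used by the sketch below.
module full_adder_bit (input wire a, b, carry_in, output wire sum, carry_out);
    assign sum       = a ^ b ^ carry_in;
    assign carry_out = (a & b) | (carry_in & (a ^ b));
endmodule

// Hypothetical byte-wide ripple-carry adder with a registered output.
module byte_adder_sketch (
    input  wire       clk,
    input  wire [7:0] a, b,
    output reg  [8:0] sum            // one bit wider to hold the final carry
);
    wire [8:0] carry;
    wire [7:0] s;

    assign carry[0] = 1'b0;          // carry input of the LSB stage fixed at 0

    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : stage
            full_adder_bit fa (.a(a[i]), .b(b[i]), .carry_in(carry[i]),
                               .sum(s[i]), .carry_out(carry[i+1]));
        end
    endgenerate

    always @(posedge clk)
        sum <= {carry[8], s};
endmodule
```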
The synthesis summary for the byte_adder design is shown in Listing 3-24.
*******************************************************
*******************************************************
Number of ports : 27
Number of nets : 68
Number of instances : 51
Number of references to this view : 0
***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 27 77 35.06%
FG Function Generators 16 800 2.00%
H Function Generators 0 400 0.00%
CLB Flip Flops 7 800 0.88%
-----------------------------------------------
Clock Frequency Report
Clock : Frequency
------------------------------------
clk : 77.6 MHz
Listing 3-25 is an example of letting the synthesizer do all the work. We just use the Verilog
addition operator (+) to create the sum of two bytes. Listing 3-26 presents the statistics for
the synthesized design.
*******************************************************
*******************************************************
Number of ports : 27
Number of nets : 103
Number of instances : 48
Number of references to this view : 0
***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 27 77 35.06%
FG Function Generators 8 800 1.00%
H Function Generators 0 400 0.00%
CLB Flip Flops 0 800 0.00%
-----------------------------------------------
Clock Frequency Report
Clock : Frequency
------------------------------------
10101
11001
Suppose I ask you what is the result of adding only the highest-order bits (1 + 1)?
You say that you can’t answer my question without figuring out if a carry is generated by
the addition of all the lower-order bits. There you go! The carry must be calculated for all
the lower-order bits before the highest-order bits can be added. The output is not available
until the carry from each adder “ripples” through all stages to the summation and final carry
outputs. This adder would be faster if only we could “look ahead” and generate the carry
outputs in parallel instead of in series. We evaluate the inputs to create carry signals which
are added with partial sums. Oddly enough, others have thought of this idea and have
created an adder architecture called Carry Look Ahead (CLA Adder).
The CLA Adder is described in terms of Generate (carry terms) and Propagate (sum
terms) as shown in Listing 3-27.
Listing 3-27 Carry Generate and Carry Propagate Logic Code Fragment
Listing 3-28 Carry Generate and Carry Propagate Logic, Expandable Adder
The sum is still formed by cascading half-adders, adding in the propagate (sum) term
and the carry term that is calculated in parallel. To make the blocks generic, the s[0] stage,
even though a carry-in at this stage is not allowed, will still use a carry input wired to 0.
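A 4-bit carry look-ahead adder can be sketched as below. The module and signal names are hypothetical and the code is not a reproduction of Listings 3-27 and 3-28. Every carry is written as a flat function of the generate terms, the propagate terms, and the carry input, so no carry waits on the one before it.

```verilog
// Hypothetical 4-bit carry look-ahead adder (combinational).
module cla4 (
    input  wire [3:0] a, b,
    input  wire       carry_in,      // wire to 0 for the lowest-order block
    output wire [3:0] s,
    output wire       carry_out
);
    wire [3:0] g = a & b;            // generate: this bit creates a carry
    wire [3:0] p = a ^ b;            // propagate: this bit passes a carry along
    wire [4:0] c;

    assign c[0] = carry_in;
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);

    assign s         = p ^ c[3:0];   // second half-adder stage: propagate XOR carry
    assign carry_out = c[4];
endmodule
```

The carry equations grow quickly with width, which is why look-ahead is usually applied in small groups.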
Another strategy for speeding up an adder adds even more hardware. Redundant hardware
is added to calculate the sum assuming a carry input and assuming no carry input. The
output is selected via multiplexer based on whether carry is required or not.
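The redundant-hardware idea can be sketched for a byte-wide adder as follows. This is an illustration under assumptions: the module name and the split into two nibbles are my choices, and the + operator stands in for whatever adder structure is used inside each group.

```verilog
// Hypothetical 8-bit adder: the high nibble is computed twice and selected.
module carry_select_adder8 (
    input  wire [7:0] a, b,
    output wire [8:0] sum
);
    wire [4:0] low   = a[3:0] + b[3:0];          // low nibble and its carry out
    wire [4:0] high0 = a[7:4] + b[7:4];          // high nibble, carry in = 0
    wire [4:0] high1 = a[7:4] + b[7:4] + 4'd1;   // high nibble, carry in = 1

    // The low-nibble carry picks the correct precomputed high result.
    assign sum = {low[4] ? high1 : high0, low[3:0]};
endmodule
```

All three additions run in parallel, so the delay is roughly one nibble add plus a multiplexer instead of two nibble adds in series.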
Another strategy for speeding up the adder allows the use of an inverter for cases where the
inputs are not equal. The addition uses the XOR function and creates an output when the
inputs are different; in that case the sum is the inverse of the carry bit and the carry bit just passes
through. This type of adder is usually implemented in groups, probably groups of 4. This
version uses less real estate than a CLA and is about as fast.
There are other adder strategies which trade area for speed. The main thing: get an
appreciation for the logic that is synthesized when the Verilog addition operator is used. The
best synthesis strategy: cheat! Keep the adder input lengths short and fight the system
engineer to reduce resolution to the point of diminishing returns. Don’t implement a 16-bit
adder if a 14-bit adder will do. Also, run the adder at the lowest possible clock frequency.
SUBTRACTORS
The subtractor is similar to the adder and there are corresponding versions of subtractors for
the adders described above (Ripple Borrow Subtractor, Borrow Save Subtractor, Borrow
Select Subtractor, and so on). The logic for the Ripple Borrow Subtractor is shown to
illustrate the similarity to Adder circuits. Figure 3-27 illustrates ina - inb in a half-borrow
circuit.
Let’s expand the half-borrow logic for the full subtractor, again ina – inb as shown in
Figure 3-28.
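In equation form, a one-bit full subtractor for ina - inb might be sketched like this. The module name and the borrow port names are hypothetical. The difference logic is identical to the adder's sum logic; only the borrow term differs from the carry term.

```verilog
// Hypothetical one-bit full subtractor computing ina - inb - borrow_in.
module full_subtractor (
    input  wire ina, inb, borrow_in,
    output wire diff, borrow_out
);
    assign diff       = ina ^ inb ^ borrow_in;
    // Borrow when inb exceeds ina, or when they are equal and a borrow came in.
    assign borrow_out = (~ina & inb) | (~(ina ^ inb) & borrow_in);
endmodule
```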
MULTIPLIERS
The Verilog language supports unsigned multiplication and division by powers of two. This
is not a big challenge; these are simply shift-left (a multiply by two for each binary shift
left) and shift-right (a divide by two for each binary shift right) operations. This doesn’t
mean we give up. FPGAs are capable of performing sophisticated math functions at high
speed; we need to use library modules (which limits portability) or create the logic
ourselves. Again, the best strategy for dealing with advanced math is to cheat. Work with
the system engineer to reduce the numbers of bits to be multiplied. Don’t use eight bits
when seven will do. Do a model in C to manipulate test data and examine the results. Use
Hard-Wired Multipliers
The best way to illustrate the multiplication algorithm is by example. Let’s assume we
want to multiply an integer nibble n by a constant integer nibble with a value of D(16).
2^3 2^2 2^1 2^0   Bit weight
n3  n2  n1  n0    Variable nibble to be multiplied
1   1   0   1     Constant D(16)
The multiplication process shifts and adds, working through the bits of the constant. The
rightmost bit (weight 2^0) is 1: add n. The next bit (weight 2^1) is 0 and can be ignored;
it does not affect the result (an example of simplification due to
multiplying by a constant). The 2^2 bit is 1: multiply n by 4 (shift n left twice) and add.
The 2^3 bit is 1: multiply n by 8 (shift n left three times) and add.
Here’s what we get:
Result = (n * 8) + (n * 4) + (n * 1)
Let’s assume the variable is B(16) or 1011(2) and plug the numbers in.
This is cool: multiplication turns into a bunch of shift and adds. We already know
how to shift and we already know how to add, so we’re in good shape. To turn this into a
generic multiplier, we have to be prepared to add a shifted value (or zero) for each bit.
Let’s see what a Verilog version of this multiplier might look like (see Listing 3-30).
This is just one way to do the logic! I could have used the shift operator. I could have used
structural adders. To speed the design up, I could have pipelined the intermediate sums to
reduce the considerable amount of combinational logic between flipflops.
endmodule
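One way to write the multiply-by-D(16) logic is sketched below, using concatenation for the shifts. The module and port names are hypothetical and this is not a reproduction of Listing 3-30. The largest product is F(16) times D(16), or 195, so eight bits are enough.

```verilog
// Hypothetical hard-wired multiplier: product = n * 13 (1101 binary).
module mult_by_13 (
    input  wire       clk,
    input  wire [3:0] n,
    output reg  [7:0] product
);
    always @(posedge clk)
        product <= {1'b0, n, 3'b000}      // n * 8 (shift left three times)
                 + {2'b00, n, 2'b00}      // n * 4 (shift left twice)
                 + {4'b0000, n};          // n * 1
endmodule
```

Plugging in n = B(16): 58(16) + 2C(16) + 0B(16) = 8F(16), which is 143 decimal, the same as 11 times 13.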
Hard-wired multipliers are fully custom and not at all generic. If you’re trying to
convince the system designer that a hard-wired multiplier is the right answer, be careful
what you promise. Changing the constant you multiply by requires changing the code. If
you are having resource problems in the FPGA, there may not be enough room to squeeze
in a new set of coefficients (count the number of 1’s in the desired constant; there will be a
shift/add for each 1 present in the constant). Adding resolution to the input or the constant
can increase the logic considerably.
Generic Multipliers
To change the hard-wired multiplier to a generic 4-by-4 multiplier, we must create logic
which allows all the shift and adds to be used (whether they are used or not depends on the
data values). Again, there are many ways to do this; the example of Listing 3-31 is just one
way.
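A generic 4-by-4 version might be sketched as follows. The names are hypothetical and this is not Listing 3-31. Each bit of b either contributes a shifted copy of a or contributes zero. The largest product is 15 times 15, or 225, so an 8-bit result cannot overflow.

```verilog
// Hypothetical 4-by-4 unsigned shift-and-add multiplier.
module mult4x4 (
    input  wire       clk,
    input  wire [3:0] a, b,
    output reg  [7:0] product
);
    // Partial products: a shifted by the bit position, or zero.
    wire [7:0] pp0 = b[0] ? {4'b0000, a}      : 8'd0;
    wire [7:0] pp1 = b[1] ? {3'b000, a, 1'b0} : 8'd0;
    wire [7:0] pp2 = b[2] ? {2'b00, a, 2'b00} : 8'd0;
    wire [7:0] pp3 = b[3] ? {1'b0, a, 3'b000} : 8'd0;

    always @(posedge clk)
        product <= pp0 + pp1 + pp2 + pp3;
endmodule
```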
One thing about Listing 3-31 should be mentioned. Verilog will let you do dumb
things; there are no checks to make sure registers are wide enough to accept the data you are
calculating. Normally, you’d expect all sum registers to be one bit larger than the largest
width of number to be added. However, because I know the nature of the numbers stored in
the store registers (i.e., that the MSB is guaranteed to be zero), I can get away with the
stored width being the same as the byte_out width. We know that 8 bits are all that are
required to store the value of two nibbles multiplied together. Use caution when making
assumptions like this!
Often, the constant coefficients are fractional values. These are easily handled by scaling the
coefficients into integer values, then scaling the result back again. You will want to reduce
the resolution and values to the easiest numbers to deal with. Let’s say the system
engineer asks for a coefficient of 0.80. Won’t an easier number like 0.75 (2^-1 (or ½) + 2^-2 (or
¼)) work with acceptable results? If so, the job is easier. If not, then so be it.
To multiply a nibble n by 0.75, realize that 0.75 x 4 = 3 (a number we like a lot) and
that (n x 4) / 4 = n. So, multiply the coefficient by 4, then divide the result by 4 later and
you’re even.
If the system engineer absolutely insists on a coefficient like 0.8, ask for the required
accuracy (10%?, 5%?, 1%?), then factor 0.8 into binary (0.8 = ½ + ¼ + 1/32 …) and do
more scaling shifts as required.
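The multiply-by-0.75 scaling trick can be sketched as follows. The module and signal names are hypothetical. The code multiplies by 3 with one shift and one add, then divides by 4 by dropping the two low-order bits, which truncates rather than rounds.

```verilog
// Hypothetical scaled multiplier: result = floor(n * 0.75).
module mult_by_0p75 (
    input  wire [3:0] n,
    output wire [3:0] result
);
    wire [5:0] times3 = {1'b0, n, 1'b0} + {2'b00, n};   // n * 3 = (n * 2) + n
    assign result = times3[5:2];                        // divide by 4, truncating
endmodule
```

For n = 10, times3 is 30 and the result is 7; the exact answer of 7.5 loses its fraction. If that error matters, keep the two discarded bits as a fractional part or add a rounding term before the shift.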
CHAPTER 4
RIPPLE COUNTERS
The most common (generic) counter is a ripple counter, so described because the output
ripples from stage to stage. If we create a Verilog counter like Listing 4-1, using the binary-
counter option in Exemplar Logic LeonardoSpectrum’s Input File menu as shown in Figure
4-1, we’ll find the result is a synchronous binary counter.
124 More Digital Circuits: Counters, ROMs, and RAMs Chapter 4
endmodule
The problem with the ripple counter is that, because more than one output is changing
at once, using combinational logic to decode output states results in glitchy signals. To
avoid this problem, use counters like Gray Code, Johnson, or synchronous binary counters
like Figure 4-1.
JOHNSON COUNTERS
The Johnson counter is a type of shift counter. A shift counter uses little combinational
logic to create the count logic and therefore can operate at high speed (the operating speed
is limited only by how fast a flipflop can switch states and by the propagation delay of the
simple count logic). The Johnson counter wraps an inverted version of the highest-order bit
back to the lowest-order bit. Like the Gray Code counter, it has one output that changes at
each clock. This results in a glitch-free output when decoded with combinational logic.
Disadvantages include the requirement of more registers to store the count variable (around
n/2 registers are required, where n is the max count value) and lack of error recovery. If a
bad count pattern gets loaded, it will recirculate until the registers are reinitialized (if this
ever happens!). The schematic for a Johnson counter is shown in Figure 4-2 with
corresponding Verilog code in Listing 4-2 and count sequence in Listing 4-3.
0000
1000
1100
1110
1111
0111
0011
0001
0000
Repeat…
The alert designer notices that not all states are used in the count cycle. We have eight
count states that are not used. Wasted counter states indicate that the design does not use
registers efficiently, but this may not be important. However, if an illegal count occurs due
to noise, there is no way to recover. Illegal states without recovery will make the careful
designer nervous. Let’s add some logic, as shown in Listing 4-4, to detect and recover from
those illegal states. This logic makes the counter a lot more complex, but it may be
worthwhile to create a robust counter with glitchless output decoding.
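One simple recovery scheme is sketched below. It is an assumption on my part, not a reproduction of Listing 4-4: the module name is hypothetical, and the case statement simply lists the eight legal states from the count sequence and restarts from 0000 on anything else.

```verilog
// Hypothetical 4-bit Johnson counter with illegal-state recovery.
module johnson4 (
    input  wire       clk,
    input  wire       reset,
    output reg  [3:0] q
);
    always @(posedge clk or posedge reset) begin
        if (reset)
            q <= 4'b0000;
        else
            case (q)
                4'b0000, 4'b1000, 4'b1100, 4'b1110,
                4'b1111, 4'b0111, 4'b0011, 4'b0001:
                    q <= {~q[0], q[3:1]};   // legal: shift, inverting the wrapped bit
                default:
                    q <= 4'b0000;           // illegal: restart the sequence
            endcase
    end
endmodule
```

Starting from 0000 this steps through 1000, 1100, 1110, 1111, 0111, 0011, 0001 and back to 0000. The price of recovery is a full decode of the state, which gives up some of the speed advantage of the plain shift counter.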
LINEAR FEEDBACK SHIFT REGISTERS
A type of counter that is quite interesting is a Linear Feedback Shift Register or LFSR
counter. It is similar to the Johnson counter except that instead of an inverter from the last
stage back to the first stage, a small number of taps are recycled. The counter next-state
logic is very simple (a few XOR or XNOR gates). With maximal-length logic (taps selected
to give the maximal count), a small number of registers, n, can create counts of up to (2^n)-1
(compared to a binary-counter count length of 2^n). The one state that is missing from a
maximal-length LFSR count sequence is the no-recovery state (all zeros for an XOR version
or all ones for an XNOR version). An LFSR counter can operate at high speed compared to
a binary counter because the feedback logic is very simple. For cases where the count value
is arbitrary (the LFSR count sequence is pseudorandom) the LFSR counter can be a good
solution.
How these counters work can be illustrated by example. A maximal-length 4-bit
LFSR counter can use the taps [3,0] (maximal length might be achieved with other taps,
too). The taps are the register outputs that are fed back. Figure 4-3 has four flipflops and a
single XNOR gate.
This version is sometimes called ‘many-to-one’; notice how taps are derived from
many outputs, then XOR’d back to a single input. There is also a variation called ‘one-to-many’
where a single output is fed back and XOR’d into the inputs of several registers.
The counter of Listing 4-5 is an XNOR version. I generally use this version because
the illegal state consists of all ones; I prefer to reset all registers on power-up (rather than
preset some or all of the registers, which is necessary with the XOR version). Listing 4-6
presents a simple test fixture for testing the Verilog design of Listing 4-5.
always begin
#(clk_period / 2) clock = ~clock;
end
initial begin
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
end
endmodule
Linear Feedback Shift Registers 129
Figure 4-4 shows the count sequence for the 4-bit LFSR counter.
// binary hex
// 0000 0
// 0001 1
// 0010 2
// 0101 5
// 1010 a
// 0100 4
// 1001 9
// 0011 3
// 0110 6
// 1101 d
// 1011 b
// 0111 7
// 1110 e
// 1100 c
// 1000 8
Looks like a big mess, doesn’t it? That’s part of the LFSR counter’s charm.
Sequential values are loosely correlated, or pseudorandom. This can be useful for reducing
clock harmonic noise. For example, in a binary counter, the lowest-order bit toggles on
every clock; this results in noise that is highly correlated to the system clock and adds
energy at subharmonics of the system clock. This harmonic energy is a large source of
system noise. An LFSR counter generates more wideband noise with lower peak energy
content, because the counter bits are changing in a more random manner.
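The shift-and-XNOR update behind the count sequence listed above can be sketched in a few lines. The module and port names are hypothetical and this is not a reproduction of Listing 4-5. The register shifts left and the XNOR of taps [3,0] enters at bit 0.

```verilog
// Hypothetical 4-bit many-to-one XNOR LFSR counter, taps [3,0].
module lfsr4 (
    input  wire       clk,
    input  wire       reset,
    output reg  [3:0] q
);
    always @(posedge clk or posedge reset) begin
        if (reset)
            q <= 4'b0000;                    // all zeros is legal for XNOR
        else
            q <= {q[2:0], ~(q[3] ^ q[0])};   // shift left, feed back the XNOR
    end
endmodule
```

From reset this steps through 0001, 0010, 0101, 1010, and so on, matching the listed sequence. If the register ever reaches 1111 the feedback is 1 again, so it stays there until reset.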
Table 4-1 lists taps for maximal-length LFSR counters. Other tap selections are
possible for some of the counter lengths.
Number of Bits   Length of Loop    Taps
 9               511               [8,3]
10               1,023             [9,2]
11               2,047             [10,1]
12               4,095             [11,5,3,0]
13 *             8,191             [12,3,2,0]
14               16,383            [13,4,2,0]
15               32,767            [14,0]
16               65,535            [15,4,2,1]
17 *             131,071           [16,2]
18               262,143           [17,6]
19 *             524,287           [18,4,1,0]
20               1,048,575         [19,2]
21               2,097,151         [20,1]
22               4,194,303         [21,0]
23               8,388,607         [22,4]
24               16,777,215        [23,3,2,0]
25               33,554,431        [24,2]
26               67,108,863        [25,5,1,0]
27               134,217,727       [26,4,1,0]
28               268,435,455       [27,2]
29               536,870,911       [28,1]
30               1,073,741,823     [29,5,3,0]
31 *             2,147,483,647     [30,2]
32               4,294,967,295     [31,6,5,1]
From Table 4-1, you can see we can create a 31-bit counter with 31 registers and a single
XOR gate. Imagine the ripple-carry logic required to create a 31-bit binary counter!
An example of the use of an LFSR counter is to create simple logic for a divide-by-N
circuit. In this design, a terminal count is provided as an input to be compared to. Listing 4-
7 illustrates an 8-bit divide-by-N counter, Listing 4-8 shows a test fixture, Listing 4-9 is the
output list of the pseudorandom count sequence, and Figure 4-5 shows the waveforms at the
count rollover.
Listing 4-8 Verilog Version of an 8-bit Divide-by-N LFSR Counter Test Fixture
always
begin
#(clk_period / 2) clock = ~clock;
end
initial
begin
clock = 0;
reset = 1; // Assert the system reset.
terminal_cnt = 8'd66; // Test assignment.
#75 reset = 0;
end
endmodule
The one-to-many variation as shown in Listing 4-10 splits the XORs (or XNORs) into
2-input gates and distributes them throughout the register array. Note: The same taps are
used, simply in a different form. In words, the 4-bit counter taps [3,0] means: XOR (or
XNOR) the output of register 0 and register 3 and connect that result to the input of register
1. The last register is wrapped back to register 0. This will still result in a maximal-length
sequence, but the count sequence (and terminal count value for a given count) will be
different. The schematic extracted from Listing 4-10 is shown in Figure 4-6. The output
waveform is shown in Figure 4-7.
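Following the wording above, a 4-bit one-to-many XNOR counter might be sketched as below. The module name is hypothetical and this is not a reproduction of Listing 4-10. Register 3 wraps back to register 0, and the XNOR of registers 0 and 3 feeds register 1.

```verilog
// Hypothetical 4-bit one-to-many XNOR LFSR counter, taps [3,0].
module lfsr4_one_to_many (
    input  wire       clk,
    input  wire       reset,
    output reg  [3:0] q
);
    always @(posedge clk or posedge reset) begin
        if (reset)
            q <= 4'b0000;
        else begin
            q[0] <= q[3];                // last register wraps to the first
            q[1] <= ~(q[0] ^ q[3]);      // feedback folded in between stages
            q[2] <= q[1];
            q[3] <= q[2];
        end
    end
endmodule
```

Tracing this by hand from 0000 gives 0010, 0110, 1110, 1101, and so on through 15 states before returning to 0000. The order differs from the many-to-one version, and 1111 is again the lock-up state.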
For a more detailed explanation of LFSR counters, see Max Maxfield’s Designus
Maximus Unleashed (details on this book can be found in the Bibliography).
Logic similar to the LFSR is used to create Cyclic Redundancy Checksums, or CRCs.
Checksums are used to test a data packet to try to determine if an error has occurred. An
ordinary checksum simply adds up the data bytes or words and discards any carry beyond a
predetermined resolution. For example, an 8-bit checksum would use modulo-256 addition
and discard all carries that result in numbers greater than 255 (FF in hex).
Let’s assume a data packet consists of the following 8 bytes:
hex data
99
D0
AA
01
09
83
AF
BE
We can use our hex calculator to find that the sum of these numbers is 40D(16). We
discard all but the lower 8 bits and get a checksum of 0D. The receiving logic can do the
same addition and see if the received data gives a checksum of 0D. This gives us some
small confidence that the data was received correctly. What if we want more confidence?
We could send a 16-bit checksum instead; this would give a 10-byte packet and a checksum
of 040D. Now, for multiple random errors, the chance of a bad packet slipping through undetected is roughly 1 in 65,536 instead of 1
in 256. However, if an error causes a number greater than expected in one byte and a later error causes
a corresponding number the same amount less than expected, the checksum will match and
we’ll think a bad packet is good. What if this is not good enough? A more random sequence
of numbers would give us better error detection.
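The ordinary 8-bit checksum described above needs very little logic. In the sketch below, the module name and the data_valid strobe are hypothetical. Because the accumulator is only eight bits wide, carries out of bit 7 are discarded automatically, which is the modulo-256 addition.

```verilog
// Hypothetical running 8-bit checksum (modulo-256 sum of data bytes).
module checksum8 (
    input  wire       clk,
    input  wire       reset,
    input  wire       data_valid,    // hypothetical strobe: one byte per clock
    input  wire [7:0] data,
    output reg  [7:0] checksum
);
    always @(posedge clk or posedge reset) begin
        if (reset)
            checksum <= 8'h00;
        else if (data_valid)
            checksum <= checksum + data;   // carry beyond bit 7 is dropped
    end
endmodule
```

Feeding in the eight example bytes 99, D0, AA, 01, 09, 83, AF, BE after a reset leaves 0D in the register, since the full sum is 40D(16).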
The idea behind a CRC is to do division instead of addition. The data packet is looked
at as a huge binary number. We select a polynomial to divide this binary data with, and the
remainder becomes our checksum. The sequence of remainders is more random than a
sequence of sums. I’m going to skip a whole bunch of math and just tell you that logic to
implement CRC division with a polynomial (where borrows are discarded) looks a lot like
the logic which implements an LFSR. An input data packet is created with N bits of zeroes
appended, where N is the length of the CRC, and is shifted out serially. While the data is
transmitted, the CRC is calculated and then appended in place of the zeroes. This becomes
the transmitted data packet.
At the receiver, the same CRC calculation is performed on the incoming data packet
(including the CRC bits), and the remainder will be zero if no error is detected. Let’s
illustrate this with a simple example. Xilinx uses a 16-bit CRC to validate the serial data
used for FPGA configuration. The schematic for this logic is shown in Figure 4-8. Xilinx
uses XOR logic, one-to-many configuration, and [15,14,1,0] feedback taps.
Notice how similar this logic is to the LFSR with the addition of a data input as a
modulation source. Listing 4-11 implements CRC-16 logic.
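As a rough illustration of how LFSR-like logic performs the division, here is a minimal bit-serial CRC sketch. It is not the book's Listing 4-11 and not necessarily the circuit Xilinx uses: the polynomial (x^16 + x^15 + x^2 + 1, written as 16'h8005), the MSB-first bit order, the all-zero starting value, and every name are assumptions chosen for the example. In this form the incoming bit is XORed into the feedback term, so the N appended zeros are implicit rather than shifted in.

```verilog
// Sketch only: polynomial, bit order, reset value, and names are assumptions.
module crc16_serial (clk, reset, enable, data_in, crc);
  input         clk, reset;
  input         enable;     // high while a data bit is present
  input         data_in;    // serial data, most significant bit first
  output [15:0] crc;
  reg    [15:0] crc;

  // Feedback is the bit leaving the register XORed with the new data bit.
  wire feedback = data_in ^ crc[15];

  always @(posedge clk or posedge reset) begin
    if (reset)
      crc <= 16'h0000;
    else if (enable)
      // Shift left; when feedback is 1, "subtract" (XOR) the polynomial.
      crc <= {crc[14:0], 1'b0} ^ (feedback ? 16'h8005 : 16'h0000);
  end
endmodule
```

If the transmitter sends the message followed by the 16 CRC bits (MSB first) and the receiver clocks all of those bits through the same logic from reset, the register ends at zero when no error is detected.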
ROM
ROM stands for Read-Only Memory. This memory is initialized when the FPGA is
configured and cannot be changed after configuration (if it could be changed, then it would
be RAM). As an example, we can implement the four-bit LFSR counter with a ROM if we
want (we won’t want to if we have any sense, but we’ll do it anyway for the purpose of
illustration); see Listing 4-12 and Figure 4-9.
Because Xilinx implements combinations of four inputs very effectively, this function
is efficient (though not as efficient as the LFSR algorithm: the ROM version uses 2 CLBs, whereas
our earlier design used 1 CLB). However, since the number of ROM entries doubles with
each added input (2^n entries for n inputs), a ROM implemented in CLBs can get quite large.
Another name for a ROM design like this is a Look-Up Table (LUT).
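To make the ROM-as-look-up-table idea concrete, here is a sketch of a four-bit LFSR whose next state comes from a 16-entry table instead of XOR logic. It is not the book's Listing 4-12: the feedback choice (new bit = q[3] XOR q[2], shifting left), the reset value, and all names are assumptions, and the table entries were worked out by hand from that choice.

```verilog
// Sketch only: the tap choice (bits 3 and 2) and all names are assumptions.
module lfsr_rom (clk, reset, q);
  input        clk, reset;
  output [3:0] q;
  reg    [3:0] q;
  reg    [3:0] next_q;

  // The "ROM": a 16-entry look-up table addressed by the current state.
  // Each entry equals {q[2:0], q[3] ^ q[2]}.
  always @(q) begin
    case (q)
      4'h0: next_q = 4'h0;   // lock-up state, not reachable from reset
      4'h1: next_q = 4'h2;
      4'h2: next_q = 4'h4;
      4'h3: next_q = 4'h6;
      4'h4: next_q = 4'h9;
      4'h5: next_q = 4'hB;
      4'h6: next_q = 4'hD;
      4'h7: next_q = 4'hF;
      4'h8: next_q = 4'h1;
      4'h9: next_q = 4'h3;
      4'hA: next_q = 4'h5;
      4'hB: next_q = 4'h7;
      4'hC: next_q = 4'h8;
      4'hD: next_q = 4'hA;
      4'hE: next_q = 4'hC;
      4'hF: next_q = 4'hE;
    endcase
  end

  // State register: load the ROM output on every clock.
  always @(posedge clk or posedge reset) begin
    if (reset) q <= 4'h1;
    else       q <= next_q;
  end
endmodule
```

Starting from 1, this table steps through all 15 nonzero states before repeating. Widening the state by one bit would double the number of table entries.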
Many of the Xilinx CLBs have a RAM mode where a 16-by-1 memory element can
be used in place of a CLB. This can be a very effective way to create RAM and ROM
modules. We’ll explore the use of LogiBLOX and memory modules in Chapter 8.
Something else to keep in mind is that many ASIC technologies do not have RAM
capability. During ASIC conversion, ROM/RAM elements will be replaced with random
logic, and this can result in a quite large ASIC design.
RAM
RAM stands for Random Access Memory, but that is not too helpful. A RAM is an array of
memory (or storage) cells, addressable in groups N elements wide (data width, like x4, x8,
x16, or x32) and M elements deep (number of N-width elements). We can synthesize a
RAM out of CLBs, so let’s do a simple 16x1 block (a very tiny RAM block) and see how it
looks. This design assumes that internal three-state drivers are available.
Note: One useful thing about the CLB RAM in an FPGA is the ability to initialize the
RAM register cells on reset.
There are 16 memory cells, so we need 2^n = 16 addresses, or four address lines (n = 4), as
shown in Listing 4-13.
Figure 4-10 shows the schematic of the logic synthesized from Listing 4-13. Listing
4-14 summarizes the resources used by this design.
Listing 4-14 Design Summary for Verilog 16x1 RAM Example Using CLBs
Figure 4-10 Schematic for Verilog 16x1 RAM Example Using Random Logic
This illustrates how inefficient it is to implement RAM with FPGA CLBs. CLBs are
designed to implement random logic functions. If we could replace this logic with a RAM
cell, it would consume one CLB!
RAM elements are easy to create with Verilog. However, Verilog does not support
two-dimensional arrays, so the RAM is modeled as a one-dimensional array of vectors.
Listing 4-15 is an example of a 256-by-8 synthesizable RAM module.
The RAM of Listing 4-15 will work, but unless the FPGA supports embedded RAM
blocks, it will consume a huge amount of logic and be many times more expensive than any
SRAM device you could buy. It might be all right for a tiny amount of RAM (on the order
of 8 bytes); otherwise, another solution must be found. Using flipflops to implement RAM is
very inefficient.
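For readers without the listing at hand, a 256-by-8 RAM modeled as a one-dimensional array of 8-bit vectors might look like the sketch below. This is not the book's Listing 4-15; the port names, the synchronous write, and the asynchronous read are assumptions, and whether a given synthesis tool maps it to RAM cells or to flipflops depends on the tool and the target device.

```verilog
// Sketch only: port names and read/write style are assumptions.
module ram256x8 (clk, we, addr, din, dout);
  input        clk;
  input        we;          // write enable
  input  [7:0] addr;        // 2^8 = 256 locations
  input  [7:0] din;
  output [7:0] dout;

  reg [7:0] mem [0:255];    // one-dimensional array of 8-bit vectors

  // Synchronous write.
  always @(posedge clk)
    if (we)
      mem[addr] <= din;

  // Asynchronous read of the addressed location.
  assign dout = mem[addr];
endmodule
```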
From Figure 4-11, you can see that Exemplar Logic LeonardoSpectrum correctly
inferred a RAM from the Verilog code. In the Xilinx 4000XL family, embedded RAM is
supported; see Figure 4-12. This schematic looks complicated, but compare it to Figure 4-13.
The design in Figure 4-13 was compiled for an XC3000 device; this older device
architecture does not have distributed CLB RAM. The schematic for the XC3000
implementation has 39 sheets!
Figure 4-12 256x8 RAM Implemented in the 4000XL Device Family, Sheet 1 of 1
Figure 4-13 256x8 RAM Implemented in the XC3000 Device Family, Sheet 1 of 39
The designer often needs to implement RAM blocks to store an array of input or
output data, configuration information, tables, or parameters. Many modern FPGA/CPLD
architectures include RAM available as blocks (typical of Altera devices) or distributed
across the device design so that a CLB can be configured as a LUT or as a RAM element
(typical of Xilinx devices). You can see that using a single CLB as a 16-by-1 RAM cell is a
good deal for the designer; it’s fast and doesn’t consume much of the FPGA resources.
What do you do if you need more than a trivial amount of RAM? There are two
solutions. One is to pick an FPGA architecture that has enough built-in RAM to solve the
problem (remember to leave yourself some wiggle room—if you need 1K of RAM, pick a
device and architecture that has at least 2K available); the other is to put a real SRAM in the
design. Though RAM blocks or distributed RAM cells are available in modern FPGAs, it is
probably more expensive to use FPGA silicon for RAM than to use a real RAM IC. An
additional consideration is the issue RAM raises during conversion to an ASIC.
• Speed. Not only are the RAM cells fast (in general), but we avoid the speed
penalty of driving signals on and off the device.
• Timing. By staying on the chip, the clock/data relationship is known to the
place-and-route tool. This eases our timing analysis. Internal FPGA signals are
tweaked so the register hold time is zero. The external RAM may or may not
have a zero hold time. Regardless, the delays associated with device I/O must be
considered and will result in some sort of minimum hold time that must be
accounted for in the design.
• Initialization. A nice feature of the FPGA RAM is the ability to initialize the
RAM content on power-up. This initialization can be to write all zeros or to take
RAM values from a file and store them in the RAM array. This can avoid
requiring other means of initializing RAM (like having a microprocessor write
to every location on power-up, for example).
• Cost. The silicon expended on internal RAM cells is probably more expensive
than an external RAM device. The cost advantage is offset slightly by the cost
associated with stuffing an extra device on the board and consuming extra
FPGA pins.
Instantiating RAM
How do we instantiate RAM modules? Xilinx offers a tool called LogiBLOX for
creating RAM modules; an example of a LogiBLOX module is shown in Listing 4-16. More
detail on the procedure of creating a Xilinx LogiBLOX module is provided in Chapter 8.
//----------------------------------------------------
// LogiBLOX DP_RAM Module “r16x16dp”
// Created by LogiBLOX version M1.3.7
// on Thu Feb 12 15:27:46 1998
// Attributes
// MODTYPE = DP_RAM
// BUS_WIDTH = 16
// DEPTH = 16
//----------------------------------------------------
r16x16dp ramblk1
(.A({ram_a_addr[4],ram_a_addr[3],ram_a_addr[2],ram_a_addr[1]}),
.SPO(ram_a_data),
.DI(ram_data),
.WR_EN(wr_strobe),
.WR_CLK(clk),
.DPO(ram_b_data),
.DPRA({ram_b_addr[4],ram_b_addr[3],ram_b_addr[2],ram_b_addr[1]}));
endmodule
The r16x16dp.vei module, shown in Listing 4-17, is simply a placeholder for the
presynthesized netlist (r16x16dp.ngo) that will be inserted during the place-and-route
process. It defines the module ports, but that is all. The interface part of this automatically
generated file was cut and pasted into the module that instantiates the placeholder module.
This .vei file must be included in Exemplar Logic LeonardoSpectrum’s input file list shown
in Listing 4-16.
//----------------------------------------------------
// LogiBLOX DP_RAM Module “r16x16dp”
// Created by LogiBLOX version M1.5.19
// on Sun May 30 14:19:03 1999
// Attributes
// MODTYPE = DP_RAM
// BUS_WIDTH = 4
// DEPTH = 16
// STYLE = MAX_SPEED
// USE_RPM = FALSE
//----------------------------------------------------
It’s easy to imagine using an external RAM in place of the LogiBLOX RAM; the
difference is that module port pins must actually connect to device pins. An interesting
expansion of the RAM interface occurs when multiple modules need RAM access. In this
case, an arbitration scheme can prioritize and negotiate access to the RAM. An example of
an external RAM interface with a simple arbiter (which allows multiple sources to access the
RAM) is shown in Listing 4-18. There are probably better ways to implement this design,
but this is a Real World example that was used in a commercial design.
// System inputs.
input clk, reset; // System clock and reset.
// Control signals.
output [2:0] rd_ack; // Acknowledge: read complete.
reg [2:0] rd_ack;
output [2:0] wr_ack; // Acknowledge: write complete.
reg [2:0] wr_ack;
// RAM interface.
input [12:1] chan0_ramaddr;// Channel 0 RAM address pointer.
wire [12:1] chan0_ramaddr;
input [12:1] chan1_ramaddr;// Channel 1 RAM address pointer.
wire [12:1] chan1_ramaddr;
output [15:0] chan0_dat_from_ram;// Channel 0 RAM read data.
reg [15:0] chan0_dat_from_ram;
output [15:0] chan1_dat_from_ram;// Channel 1 RAM read data.
reg [15:0] chan1_dat_from_ram;
input [15:0] chan0_dat_to_ram; // Channel 0 RAM write data.
wire [15:0] chan0_dat_to_ram;
input [15:0] chan1_dat_to_ram; // Channel 1 RAM write data.
wire [15:0] chan1_dat_to_ram;
input [2:0] data_rd; // RAM read request.
wire [2:0] data_rd;
input [2:0] data_wr; // RAM write request.
wire [2:0] data_wr;
input sram_addr_strobe; // Preloads address counter.
input [15:0] up_data_to_ram; // Data written into RAM.
output [15:0] up_data_from_ram; // Data read from RAM.
reg [15:0] up_data_from_ram;
input [12:0] address_preset; // Microprocessor address
// counter preset input.
// Local variables.
reg [3:0] ram_state;
reg ram_rdn;
reg [11:0] ram_addr_ctr;// Register: store auto-
// incremented addresses.
// Counts words.
parameter ram_state_idle = 0;
parameter ram_state1 = 1;
parameter ram_state2 = 2;
parameter ram_state3 = 3;
parameter ram_state4 = 4;
parameter ram_state5 = 5;
parameter ram_state6 = 6;
parameter ram_state7 = 7;
parameter ram_state8 = 8;
parameter ram_state9 = 9;
parameter ram_state10 = 10;
parameter ram_state11 = 11;
parameter ram_state12 = 12;
parameter ram_state13 = 13;
parameter ram_state14 = 14;
parameter ram_state15 = 15;
assign ram_rwn = ~ram_rdn; // Active high local signal.
// Control of SRAM data pins.
assign ram_data_pins = ram_data_oe ? ram_data_out : 8'bz;
assign ram_data_in = ram_data_pins;
case (ram_state)
ram_state_idle: begin
begin
ram_rdn <= 0;
ram_addr <= 0;
ram_data_out <= 0;
ram_data_oe <= 0;
end
if (data_rd[0]) begin
ram_rdn <= 0;
ram_addr <= {chan0_ramaddr, 1'b0};
ram_state <= ram_state1;
end
else // Default.
ram_state <= ram_state_idle;
end
// Read channel 0.
ram_state1: begin
ram_rdn <= 0;
ram_addr <= {chan0_ramaddr, 1'b1};
chan0_dat_from_ram[7:0] <= ram_data_in;
rd_ack[0] <= 1; // Issue early.
ram_state <= ram_state2;
end
ram_state2: begin
ram_rdn <= 0;
ram_addr <= {chan0_ramaddr, 1'b1};
chan0_dat_from_ram[15:8] <= ram_data_in;
rd_ack[0] <= 1; // Hold ack until
// read is released.
if (data_rd[0])
// Read channel 1.
ram_state3: begin
ram_rdn <= 0;
ram_addr <= {chan1_ramaddr, 1'b1};
chan1_dat_from_ram[7:0] <= ram_data_in;
rd_ack[1] <= 1; // Issue early.
ram_state <= ram_state4;
end
ram_state4: begin
ram_rdn <= 0;
ram_addr <= {chan1_ramaddr, 1'b1};
chan1_dat_from_ram[15:8] <= ram_data_in;
rd_ack[1] <= 1; // Hold ack until
// read is released.
if (data_rd[1]) // Hold until rd released.
ram_state <= ram_state4;
else begin
rd_ack[1] <= 0; // Release ack.
ram_state <= ram_state_idle;
end
end
// Write channel 0.
ram_state5: begin
ram_rdn <= 0;
ram_addr <= {chan0_ramaddr, 1'b0};
ram_data_out <= chan0_dat_to_ram[7:0];
ram_data_oe <= 1;
ram_state <= ram_state6;
end
ram_state6: begin
ram_rdn <= 1;
ram_addr <= {chan0_ramaddr, 1'b1};
ram_data_out <= chan0_dat_to_ram[15:8];
ram_data_oe <= 1;
wr_ack[0] <= 1; // Release early.
ram_state <= ram_state7;
end
ram_state7: begin
ram_rdn <= 0;
ram_addr <= {chan0_ramaddr, 1'b1};
// Write channel 1.
ram_state8: begin
ram_rdn <= 0;
ram_addr <= {chan1_ramaddr, 1'b0};
ram_data_out <= chan1_dat_to_ram[7:0];
ram_data_oe <= 1;
ram_state <= ram_state9;
end
ram_state9: begin
ram_rdn <= 1;
ram_addr <= {chan1_ramaddr, 1'b1};
ram_data_out <= chan1_dat_to_ram[15:8];
ram_data_oe <= 1;
wr_ack[1] <= 1; // Release early.
ram_state <= ram_state10;
end
ram_state10: begin
ram_rdn <= 0;
ram_addr <= {chan1_ramaddr, 1'b1};
ram_data_out <= chan1_dat_to_ram[15:8];
ram_data_oe <= 1;
wr_ack[1] <= 1; // Hold ack until
// write is released.
if (data_wr[1]) // Hold until wr released.
ram_state <= ram_state10;
else begin
wr_ack[1] <= 0; // Release ack.
ram_state <= ram_state_idle;
end
end
ram_state14: begin
ram_rdn <= 1;
ram_addr <= {ram_addr_ctr, 1'b1};
ram_data_out <= up_data_to_ram[15:8];
ram_data_oe <= 1;
wr_ack[2] <= 1; // Release early.
ram_state <= ram_state15;
end
Modern synthesis tools can extract RAM from logic structures as long as we don’t
bury them so deep that they are hard for the compiler to find. This means a random logic
design is parsed and the compiler will try to extract modules that are more efficiently
implemented as RAM blocks.
FIFO NOTES
FIFOs (First-In First-Out memories) are used to change data rates between systems. Data is
written at one rate and read out at another; on average, the read rate must be the same as or
faster than the write rate. When you take your first
look at a FIFO, it appears like a register file that expands and contracts like an accordion.
However, it is really designed as a RAM block with an independent write address counter
and an independent read address counter. For each FIFO write, the write counter (usually a
Gray Code counter) gets incremented; for each FIFO read, the read counter gets
incremented. The minimum set of flags includes an empty flag (set when the read and write
pointers have caught up to each other and are equal) and a full flag (again set when the read
and write pointers are equal, but equal this time because the write address has wrapped
around). The major goal of a FIFO system design is to prevent an overrun, which results in
data loss either due to new data not being written or old data being written over. The factors
that influence overrun are the depth of the FIFO and the read and write frequencies.
One of the challenges of designing a FIFO is the flag design. The full flag, for
example, is set in the write clock domain, but must be read and cleared in the read clock
domain. This requires synchronization between the two domains, always a tricky task.
Like a RAM, a FIFO can be built out of registers. However, unless the FIFO is very
small, you’re not going to want to build a FIFO out of registers (use RAM instead), because
the design is inefficient.
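The pointer-and-flag scheme described in this section can be sketched for the simplest case, a FIFO with one clock for both sides. This deliberately sidesteps the hard part (synchronizing flags between two clock domains) and uses plain binary counters rather than Gray code. The module name, port names, and parameters are assumptions. Each pointer carries one extra bit so that "equal because empty" can be told apart from "equal because the write address has wrapped around".

```verilog
// Sketch only: single clock domain, binary pointers; all names are assumptions.
module sync_fifo (clk, reset, wr_en, rd_en, din, dout, empty, full);
  parameter WIDTH     = 8;
  parameter ADDR_BITS = 4;                 // depth = 2^4 = 16 words

  input              clk, reset;
  input              wr_en, rd_en;
  input  [WIDTH-1:0] din;
  output [WIDTH-1:0] dout;
  output             empty, full;

  reg [WIDTH-1:0]   mem [0:(1<<ADDR_BITS)-1];  // the RAM block
  reg [ADDR_BITS:0] wr_ptr, rd_ptr;            // extra MSB records wrap-around

  // Empty: pointers identical, wrap bits included.
  assign empty = (wr_ptr == rd_ptr);
  // Full: same address, but the write pointer has wrapped once more.
  assign full  = (wr_ptr[ADDR_BITS-1:0] == rd_ptr[ADDR_BITS-1:0]) &&
                 (wr_ptr[ADDR_BITS]     != rd_ptr[ADDR_BITS]);

  assign dout = mem[rd_ptr[ADDR_BITS-1:0]];

  always @(posedge clk) begin
    if (reset) begin
      wr_ptr <= 0;
      rd_ptr <= 0;
    end else begin
      if (wr_en && !full) begin            // refuse writes that would overrun
        mem[wr_ptr[ADDR_BITS-1:0]] <= din;
        wr_ptr <= wr_ptr + 1;
      end
      if (rd_en && !empty)                 // refuse reads of an empty FIFO
        rd_ptr <= rd_ptr + 1;
    end
  end
endmodule
```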
CHAPTER 5
Verilog Test Fixtures
reason the FPGA designer shouldn’t regularly use a simulator. There are some excellent
books that cover Verilog simulation in detail (see the bibliography at the end of this book);
we’ll just do a quick and dirty overview in this chapter.
Most simulators have a waveform viewer, and a lot of effort is put into making this
viewer attractive to the eye. The problem is that a human brain is required to analyze and
interpret the waveforms. Waveforms are great and we’ve used them throughout this text to
show input and output signals. However, Verilog supports automated testing. This is a great
way to test and validate a design and later design changes. You can make a design change
and carefully evaluate the effect on the area of interest, but how do you know you didn’t
break something in another part of the design that used to work?
This doesn’t mean the automated test fixtures are a panacea. They are often a pain in
the rear. You’ll spend a lot of time revising the test fixture to ‘fix’ tests where signals that
don’t matter were improperly or too strictly tested.
COMPILER DIRECTIVES
Many powerful compiler directives are available in Verilog. Note the use of the ` (back tick
or accent grave) as part of these compiler directives.
`define, `ifdef, `else, `endif, `undef Verilog supports conditional compilation and
execution. Code may support simulation and not be synthesizable, or may be conditionally
synthesized to support optional features. A macro variable can be defined to control
compilation and might have a form like that shown in Listing 5-1.
`ifdef test_mode
// Insert test_mode code here. Could be a test module
`else
// Insert non-test_mode code here.
// The `else portion is optional.
data_bus <= internal_data;
`endif
// Continue with unconditional code here.
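A complete, if trivial, use of conditional compilation is sketched below. The macro name test_mode follows the fragment above; the module name and the messages are made up for the example. Deleting or commenting out the `define line selects the `else branch instead.

```verilog
// Sketch only: module name and messages are hypothetical.
`define test_mode

module ifdef_demo;
  initial begin
`ifdef test_mode
    $display("test_mode is defined: simulation-only code is compiled in");
`else
    $display("test_mode is not defined: normal code is compiled in");
`endif
    $finish;
  end
endmodule
```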
#75 reset = 0;
The number 75 represents a delay in units of the timescale unit, in this case 75 nsec.
The delay tells the simulator to wait until simulation time has advanced by the delay value
before executing the next statement.
DELAYS IN SYNTHESIS
Delays have meaning only for simulation, never for synthesis.
There is no magic hardware construct that will create a delay for you. The default, if
no timescale directive is executed, is 1 nsec. The precision determines how delay values are
rounded off and sets the simulation resolution. The precision must be equal to or less
than the timescale unit. The timescale argument units are s (seconds), ms (milliseconds),
us (microseconds), ns (nanoseconds), ps (picoseconds), or fs (femtoseconds). Mostly, you’ll
see 1 ns / 1 ns, for delay units of 1 nsec with rounding of delay values to the nearest nsec.
There can only be one timescale in a design.
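The relationship between the time unit and the precision is easier to see with a precision finer than the unit. In the sketch below, the 100 ps precision, the 2.54 delay, and the module name are assumptions picked for the example; a delay of 2.54 units is rounded to the nearest 100 ps, which is 2.5 ns.

```verilog
// Sketch only: timescale values, delays, and names are assumptions.
`timescale 1ns / 100ps    // unit = 1 ns, precision = 100 ps

module delay_demo;
  reg reset;

  initial begin
    reset = 1;
    #75   reset = 0;   // 75 time units = 75 ns
    #2.54 reset = 1;   // rounded to the 100 ps precision: 2.5 ns
    $display("reset reasserted at %t", $realtime);
    $finish;
  end
endmodule
```

None of these delays would survive synthesis; they exist only to schedule events in the simulator.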
System Tasks
$finish; When encountered in the code, $finish ends the simulation. Without some
termination point, the simulation will continue forever (or until your PC runs out of memory
and crashes). $finish returns control of the computer back to the operating system.
$stop; This system task halts simulation but does not return control to the
operating system. Simulation can be continued from the stopping point, or other system
commands can be executed at the current simulation time.
$display(list element 1, list element 2); This system task is similar to C’s printf
command. Verilog runs just fine in a nonwaveform output mode. If there are no waveforms,
how can we tell what our design is doing? Stick some $display commands in your design to
view variables and other information (such as simulation time) as illustrated in Listings 5-2
and 5-3.
// Display Test.
module disp1_tf;
`timescale 1ns / 1ns
reg clock, reset;
#1000 $finish;
end
endmodule
Listing 5-4 shows the result of a Silos III simulation run. The first few zeros are the
count_val register content during the reset period. The $display defaults to a decimal
number format and includes a carriage return (newline) after each execution. If the newline
is not desired, the $write system task can be used instead.
S I L O S I I I Version 99.100
DEMO COPY LIMITED TO 100 to 200 DEVICES
Copyright (c) 1999 by SIMUCAD Inc. All rights reserved.
No part of this program may be reproduced, transmitted,
transcribed, or stored in a retrieval system, in any
form or by any means without the prior written consent of
SIMUCAD Inc., 32970 Alvarado-Niles Road, Union City,
California, 94587, U.S.A.
(510)-487-9700 Fax: (510)-487-9721
Electronic Mail Address: “silos@simucad.com”
!file .sav=“display1”
!control .sav=3
!control .savcell=0
!control .disk=1000M
Reading “c:\verilog\sourcecode\disp1_tf.v”
Reading “c:\verilog\sourcecode\display1.v”
sim to 0
Highest level modules (that have been auto-instantiated):
(disp1_tf disp1_tf
3 total devices.
Linking ...
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
313 State changes on observable nets in 0.33 seconds.
948 Events/second.
Text and numbers can be formatted with escape strings. An escape string is a sequence
that begins with a backslash (\) or a percent sign (%). Here are a few examples:
To get some experience with this formatting, take a look at Listings 5-5, 5-6, and 5-7.
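As a quick warm-up before those listings, the sketch below exercises a handful of common escape strings: %d (decimal), %h (hex), %b (binary), %t (time), \t (tab), and \n (newline). The module name and the value 42 are arbitrary choices for the example.

```verilog
// Sketch only: module name and values are hypothetical.
module format_demo;
  reg [7:0] count_val;

  initial begin
    count_val = 8'd42;
    $display("decimal: %d", count_val);
    $display("hex: %h\tbinary: %b", count_val, count_val);
    $display("current simulation time: %t", $time);
    // $write does not append a newline, so these two calls share a line.
    $write("count_val is ");
    $write("%h hex\n", count_val);
    $finish;
  end
endmodule
```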
module disp2_tf;
`timescale 1ns / 1ns
reg clock, reset;
parameter clk_period = 20;
initial
begin
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
#1000 $finish;
end
endmodule
S I L O S I I I Version 99.100
DEMO COPY LIMITED TO 100 to 200 DEVICES
Copyright (c) 1999 by SIMUCAD Inc. All rights reserved.
No part of this program may be reproduced, transmitted,
transcribed, or stored in a retrieval system, in any
form or by any means without the prior written consent of
SIMUCAD Inc., 32970 Alvarado-Niles Road, Union City,
California, 94587, U.S.A.
(510)-487-9700 Fax: (510)-487-9721
Electronic Mail Address: “silos@simucad.com”
!file .sav=“display2”
!control .sav=3
!control .savcell=0
!control .disk=1000M
Reading “c:\verilog\sourcecode\time_setup.v”
Reading “c:\verilog\sourcecode\display2.v”
sim to 0
Highest level modules (that have been auto-instantiated):
(time_setup time_setup
(disp2_tf disp2_tf
4 total devices.
Linking ...
3 nets total: 11 saved and 0 monitored.
74 registers total: 74 saved.
Done.
A list of variables can be displayed when they change by using the $monitor directive. The
syntax of the $monitor signal list and formatting controls is very similar to that used for
the $display directive. Some examples of using the $monitor directive are presented in
Listing 5-8 and Listing 5-9, with the corresponding output shown in Listing 5-10.
$monitor (signal list and formatting);
$monitoron/$monitoroff;
module time_setup2;
initial
begin
`timescale 1ns / 1ns
module disp3_tf;
always
begin
#(clk_period / 2) clock = ~clock;
end
initial
begin
$monitor ($time, " Counter value: %h", u1.count_val);
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
#1000 $finish;
end
endmodule
Verilog can save simulation results in an ASCII file. The format of this file is called
Value Change Dump or VCD. An entry in the file occurs only when a variable value
changes. The $dumpvars directive without an argument list will dump all the variables in
the design. The $dumpvars (0, module name) directive will dump variables from the listed
module and all modules instantiated by the listed module. Variables can be identified
hierarchically in the file list (module1.module2.variable_name).
The VCD file can get very large. Setting a $dumplimit will stop the dump when the
VCD file reaches the specified limit. Some examples of the $dump directives are shown in
Listing 5-11 and Listing 5-12, with a partial output listing shown in Listing 5-13.
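A bare-bones use of the dump tasks might look like the following sketch. The file name bigdump.dmp matches the dump file discussed with Listing 5-13; the module name, the clock, and the 1,000,000-byte limit are assumptions for illustration.

```verilog
// Sketch only: module name, clock, and size limit are assumptions.
module dump_demo;
  reg clock;

  initial begin
    $dumpfile("bigdump.dmp");   // name of the VCD file to create
    $dumplimit(1000000);        // stop dumping at roughly this many bytes
    $dumpvars(0, dump_demo);    // this module and everything it instantiates
    clock = 0;
    #200 $finish;
  end

  // Something to record: a free-running clock.
  always #10 clock = ~clock;
endmodule
```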
module disp4_tf;
reg clock, reset;
wire [7:0] count_val;
parameter clk_period = 20;
$upscope $end
$enddefinitions $end
#0
$dumpvars
1!
0"
0#
1$
b00000000 %
$end
Listing 5-13 is a small part of the bigdump.dmp. Note that each signal in the scope of
$dumpvars (because $dumpvars was not limited in scope, all signals in the design are
dumped) is assigned a key character (! = reset, for example) and this shorthand is used in
the dumpfile. The VCD is not human-friendly, but is a format that can be read by other
tools.
$readmemh ("filename", memory_name); Read hex values from a file.
Optional starting and ending addresses can be added to place limits on the data pulled
from the file. The address is an index into the array, the nth data element. It is acceptable to
have a start address, but no end address, in which case the file is read to the end of the
memory array.
$readmemh ("filename", memory_name, start_addr, end_address);
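Both forms of the call can be tried with a sketch like the one below. The file name pfilt1.tst is the stimulus file used later in this chapter; the array size, the 16-to-31 address range, and the module name are assumptions, and the file must exist in the simulator's working directory.

```verilog
// Sketch only: array size, address range, and module name are assumptions.
module readmem_demo;
  reg [7:0] test_pattern [0:255];
  integer   i;

  initial begin
    // Load the file into the array, starting at index 0.
    $readmemh("pfilt1.tst", test_pattern);

    // Load file data into indexes 16 through 31 only.
    $readmemh("pfilt1.tst", test_pattern, 16, 31);

    // Show the first few entries.
    for (i = 0; i < 4; i = i + 1)
      $display("test_pattern[%0d] = %h", i, test_pattern[i]);
    $finish;
  end
endmodule
```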
AUTOMATED TESTING
The only way to assure thorough testing of a design is to automate the task. A check-off list
can be created. When a new revision of code is being released, all the automated tests
should be run again. The process is maddening, because most of the effort to resolve
problems will be in test-fixture errors, not design errors. Still, there is no better way to test a
design.
As an example, let’s design and test a simple digital filter as shown in Listing 5-14.
The source code for this design is shown in Listing 5-15. The output values are shown in
Figure 5-1. This design implements a one-dimensional low-pass pyramidal filter that uses
five samples with coefficients of 0.0625, 0.125, 0.625, 0.125, and 0.0625 (note the sum of
these coefficients is 1). This filter is crude and suffers from truncation errors but will serve
as an example of automated testing.
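To show where the truncation comes from, here is one way such a filter could be written. It is a sketch, not the book's Listing 5-15: the 8-bit sample width, the register names, and the reset behavior are assumptions, while the port names tap_unfilt and tap_filt follow the test fixture. Multiplying each coefficient by 16 gives the integer weights 1, 2, 10, 2, and 1, so the filter can be built from shifts and adds followed by a divide by 16.

```verilog
// Sketch only: 8-bit samples and the internal names are assumptions.
module pyramid_filter (clk, reset, tap_unfilt, tap_filt);
  input        clk, reset;
  input  [7:0] tap_unfilt;   // incoming sample
  output [7:0] tap_filt;     // filtered result

  reg [7:0] s0, s1, s2, s3, s4;   // five-sample delay line

  // Weighted sum with weights 1, 2, 10, 2, 1 (10 = 8 + 2).
  // The largest possible sum is 255 * 16 = 4080, which fits in 12 bits.
  wire [11:0] acc = s0 + (s1 << 1) + (s2 << 3) + (s2 << 1)
                       + (s3 << 1) + s4;

  // Divide by 16 by dropping the low four bits; the fraction is truncated.
  assign tap_filt = acc[11:4];

  always @(posedge clk or posedge reset) begin
    if (reset) begin
      s0 <= 0; s1 <= 0; s2 <= 0; s3 <= 0; s4 <= 0;
    end else begin
      s0 <= tap_unfilt;   // newest sample in
      s1 <= s0;
      s2 <= s1;
      s3 <= s2;
      s4 <= s3;           // oldest sample
    end
  end
endmodule
```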
module time_setup4;
initial
begin
`timescale 1ns / 1ns
$timeformat (-9, 2, " ns", 3);
end
endmodule
module pf1_tf;
always begin
#(clk_period / 2)
clock = ~clock;
tap_unfilt = test_pattern [mem_index];
// filt_test = tap_filt [mem_index];
tap_test = verify_pattern [mem_index];
mem_index = mem_index + 1;
if (mem_index == 0) $finish;
end
always begin
#(clk_period)
if (!reset & (tap_filt != tap_test))
begin
$display ($time, " ERROR! tap_filt = %h tap_test = %h", tap_filt,
tap_test);
flag <= 1;
end
// else $display ("All is okay.");
end
initial
begin
clock = 0;
mem_index = 0;
tap_test = 0;
filt_test = 0;
tap_unfilt = 0;
flag = 0;
reset = 1; // Assert the system reset.
#(clk_period * 2) reset = 0;
end
endmodule
The pyramidal filter test fixture reads data from two external files. Data used to stimulate
the filter is extracted from pfilt1.tst (shown in Listing 5-16). Data used to check the filter
output, which matches the expected output except for an intentional error put in for test, is
extracted from verify1.tst (shown in Listing 5-17). The error is displayed as so:
Three types of entries are allowed in files read by Verilog: numbers (either hex or
binary), white space, and comments.
0
0
0
0
0
0
0
0
0
0
3
3
3 // Error added. Should be 4.
3 // Error added. Should be 4.
4
4
2
2
3
3
2
2
3
3
2
2
3
3
2
2
CHAPTER 6
Real World Design: Tools, Techniques, and Trade-offs
because, when bundled with Xilinx Design Manager, it’s the cheapest package available.
We are using Exemplar Logic’s LeonardoSpectrum for this book.
The design flow and tools we will use are as follows:
• Specify the design. It doesn’t make sense to start coding until the job is defined.
In the Real World we often have to start a job before marketing has fully
defined the requirements, but we’ll try to get the job scoped out as much as
possible first.
• Partition the design. Divide the job into sections. Reuse old designs as much as
possible. We want our modules to be 5,000 to 10,000 gates. My estimate is
approximately 20 gates per line (which can vary wildly), so this is 250 to 500
lines of code (semicolons) per module.
• Write the code. Use a color-coded editor to help avoid syntax errors (the color
coding acts as an on-the-fly syntax checker and is remarkably useful).
Implement area, timing, clock/reset resource-assignment, and pin-assignment
constraints.
• To help locate syntax problems, try compiling your design with every tool you
can find, including different simulators. You’ll find that each vendor provides
differing error messages with differing levels of helpfulness.
• If possible, use a lint program like Verilint. There are several errors that a
Verilog compiler will accept, like mismatched vectors and the creation of
unwanted latches, that Verilint will catch. Pay close attention to warnings that
may indicate problems with synthesis.
• Simulate the design. Write test fixtures and use automated testing and
waveforms to verify the design. In this book, this means use Simucad’s Silos III
to simulate the design at as high a level as possible.
• Compile the code. In this book, this means use Exemplar Logic’s
LeonardoSpectrum to create a netlist. Watch the gate counts and speed
estimates. Use the schematic viewer to assure that your code is being
implemented in the manner you expect. Examine how clocks and resets are
implemented. Make sure global signals are detected and handled in the manner
you expect.
• Place and route the netlist. In this book, this means use Xilinx Design Manager
to create a downloadable configuration file. Manipulate the place/route controls
and perform as many place/route passes as necessary to achieve the design
requirements.
• Download the design and test it in the target hardware. FPGA designers tend to
jump to this step too soon, owing either to not having the right tools or to
impatience. The designer should be very sure the design is good before testing
in circuit.
COMPILING WITH LEONARDOSPECTRUM
LeonardoSpectrum has a graphical user interface and a wizard that leads the designer
through the design requirements. Very quickly, however, the designer will find the use of
scripts to be a faster and more efficient method of creating a netlist. The design script
created by the Wizard can be captured and run. Listing 6-1 is an example script created by
the design wizard.
set register2register 50
set input2register 50
set input2output 50
set register2output 50
set output_file "C:/verilog/latch.edf"
set novendor_constraint_file FALSE
_gc_read_init
_gc_run_init
set input_file_list { "C:/verilog/latch.v" }
set part 4013xlPQ160
set process 3
set wire_table 4013xl-3_avg
set nowrite_eqn FALSE
set chip TRUE
set area TRUE
set report brief
set global_sr reset
set output_file "C:/verilog/latch.edf"
set target xi4xl
_gc_read
set register2register 50
set input2register 50
set input2output 50
set register2output 50
set output_file ""
(verilog), .vhd (vhdl), .vhdl (vhdl), .xdb (binary dump), .xnf (Xilinx netlist
format).
set part 4013xlPQ160 The device we will implement this design in is a Xilinx
4013XL (roughly 13,000 gates) in a PQ160 (160-pin surface-mount) package.
set process 3 We are using the LeonardoSpectrum Level 3 design flow. Levels
1 and 2 are subsets of level 3: level 1 is a single-vendor FPGA design flow;
level 2 is a multivendor FPGA flow; level 3 is multivendor and includes ASIC
flows.
set wire_table 4013xl-3_avg The delays will be based on average (as
compared to worst-case) loading for a -3 speed grade device.
set nowrite_eqn FALSE Here’s another double negative that means we
will write device equations into the schematic when the schematic is extracted
from the netlist.
set chip TRUE The netlist will be compiled to a device and will include I/O
pins for pins at the top level.
set area TRUE The design will be compiled for area optimization. The option
is to compile for speed. LeonardoSpectrum Level 3 allows individual modules
to be compiled for either area or speed—a great feature.
set report brief The report will be concise.
set hierarchy_preserve TRUE LeonardoSpectrum normally combines modules
across hierarchy boundaries in an attempt to reduce logic. With the hierarchy
preserved, this reduction is not allowed. Setting this TRUE during debugging is
useful because it is more likely that your signal names will be preserved.
set target xi4xl Implement the design using primitives from the Xilinx 4000XL
library.
To refresh our memory, Listing 6-2 is the design we’re working with. This design has
a problem: an inadvertent latch is created. LeonardoSpectrum is polite enough to point this
out to us in the message log of Listing 6-3 (see bold-highlighted text).
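For readers who have not run into this problem before, the sketch below shows the usual way an unwanted latch appears and the usual cure. It is an illustration only, not the book's Listing 6-2; the module and signal names are made up. In the first block, nothing is assigned when sel is low, so the output must hold its old value and the tool infers a latch. In the second block, every path assigns the output, so plain combinational logic results.

```verilog
// Sketch only: not the book's design; all names are hypothetical.
module latch_demo (sel, d, q_latch, q_comb);
  input  sel, d;
  output q_latch, q_comb;
  reg    q_latch, q_comb;

  // Incomplete if: no assignment when sel is 0, so a latch is inferred.
  always @(sel or d) begin
    if (sel)
      q_latch = d;
  end

  // Complete if/else: every path assigns the output, so no latch.
  always @(sel or d) begin
    if (sel)
      q_comb = d;
    else
      q_comb = 1'b0;
  end
endmodule
```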
*******************************************************
*******************************************************
Number of ports : 4
Number of nets : 10
Number of instances : 9
Number of references to this view : 0
***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 4 77 5.19%
FG Function Generators 0 800 0.00%
H Function Generators 0 400 0.00%
CLB Flip Flops 0 800 0.00%
-----------------------------------------------
Clock Frequency Report
Clock : Frequency
------------------------------------
We selected a 1-pass optimization; this pass resulted in a delay of 7 nsec. This design
uses no D flipflops, uses two input ports and two output ports, and took zero seconds to
compile. All right, not 0 seconds, but it compiled fast.
• IOs 4 77 5.19%
We’ve used a very small part of the 4010XL device.
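As a reminder of how an inadvertent latch like the one flagged in the message log comes about, here is a minimal sketch. It is not the design of Listing 6-2; the module name latch_demo and its signal names are made up for illustration. The first always block leaves its output unassigned when enable is low, so the synthesizer has to build storage to hold the old value. The second block assigns its output on every path and stays purely combinational.

```verilog
// Hypothetical illustration only; names are assumptions, not the book's listing.
module latch_demo (q_latched, q_comb, enable, data);
  output q_latched;
  output q_comb;
  input  enable;
  input  data;

  reg q_latched;
  reg q_comb;

  // Incomplete assignment: nothing is assigned when enable is low,
  // so q_latched must hold its value and a latch is inferred.
  always @(enable or data)
  begin
    if (enable)
      q_latched = data;
  end

  // Complete assignment: every branch assigns q_comb,
  // so this block reduces to combinational logic (an AND gate).
  always @(enable or data)
  begin
    if (enable)
      q_comb = data;
    else
      q_comb = 1'b0;
  end
endmodule
```

If a latch really is what you want, the first form is fine; the point is only that the synthesizer's warning tells you which form you actually wrote.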
Once you’re familiar with LeonardoSpectrum and want to get things done faster and in a
more repeatable and controlled manner than with the GUI, you can run it in
batch mode with the spectrum executable (this program was called elsyn in previous
versions of LeonardoSpectrum). Make sure the DOS PATH environment setting in
autoexec.bat points to the spectrum program. For example, in my environment, this path is
c:\exemplar\LeoSpec\v1999.1d\bin\win32.
For example, an elementary command mode which will compile our basic latch
design might look like:
Another way is to cut and paste from the GUI filtered command window and create a file
like basiclatch.run as shown in Listing 6-4.
restore_project_script C:/Verilog/verilog/basiclatch.scr
_gc_read_init
_gc_run_init
set input_file_list { "C:/Verilog/verilog/basiclatch.v" }
set part 4013xlPQ160
set process 3
set wire_table 4000xl-default
set pack_clbs FALSE
set timespec_generate FALSE
set nowrite_eqn FALSE
set chip TRUE
set macro FALSE
set area TRUE
set delay FALSE
set report brief
set hierarchy_preserve FALSE
set output_file "C:/Verilog/verilog/basiclatch.edf"
set novendor_constraint_file FALSE
set target xi4xl
_gc_read
_gc_run
This file was invoked with the command line: spectrum -file basiclatch.scr. Type “spectrum
-batchhelp” to list all the command-line options (similar to Listing 6-5).
-nomap_global_bufs
Don’t use global buffers for clocks and other global signals (Xilinx/Actel).
-use_qclk_bufs
Use quadrant clocks for Actel 3200dx architecture.
-insert_global_bufs
Use global buffers for clocks and other global signals (Xilinx/Actel).
-max_cap_load <float>
Override default max_cap_load if specified in the library.
-max_fanout_load <float>
Override default max_fanout_load if specified in the library.
-lut_max_fanout <integer>
Specify net fanout for LUT technologies (Xilinx, Altera Flex, and Lucent ORCA).
-noenable_dff_map
Disable clock-enable detection from HDLs.
-enable_dff_map_optimize
Enable use of flipflop clock-enable extracted from random logic.
-exclude <list>
Don’t use listed gate in mapping.
-include <list>
Map to specified synchronous DFFs and DLATCHes.
-pal_device
Disable map to complex IOs for Actel.
-wire_tree <string>
Interconnect wire tree : best|balanced|worst = default.
-wire_table <string>
Wire load model to use for interconnect delays.
-nowire_table
Ignore interconnect delays during delay analysis.
-nobreak_loops_in_delay
Don’t break combinational loops statically for timing analysis.
-crit_path_analysis_mode <string>
maximum(report setup violations) | minimum(report hold violations) | both = default.
-num_crit_paths <integer>
Report <integer> number of critical paths.
-crit_path_slack <float>
Slack threshold in nanoseconds.
-crit_path_arrival <float>
Arrival threshold in nanoseconds.
-crit_path_longest
Show longest paths rather than critical paths.
-crit_path_detail <string>
full(detailed point-to-point)(default) | short(startpoint-endpoint)
-crit_path_no_io_terminals
Don’t report paths terminating in primary outputs.
-crit_path_no_int_terminals
Don’t report paths terminating in internal endpoints.
-crit_paths_from <list>
Report only paths starting at this <list> port, port_inst or instance.
-crit_paths_to <list>
Report only paths ending at this <list> port, port_inst or instance.
-crit_paths_thru <list>
Report only critical paths through the <list> net.
-crit_paths_not_thru <list>
Report only critical paths that do not go through <list> net.
-crit_path_report_input_pins
Report input pins of gates. Default = off.
-crit_path_report_nets
Report net names. Default = off.
-nocounter_extract
Disable automatic extraction of counters.
-noram_extract
Disable automatic extraction of rams.
-nodecoder_extract
Disable automatic extraction of decoders.
-optimize_cpu_limit <integer>
Set a CPU limit for optimization.
-notimespec_generate
Don’t create TIMESPEC info from user constraints; Xilinx only.
-nopack_clbs
Don’t pack look-up tables (LUTs) into CLBs; for Xilinx 4K families only.
-write_clb_packing
Print CLB packing (HBLKNM) info, if available, in XNF/EDIF.
-crit_path_rpt <string>
Write critical path reporting in this file.
-nocrit_path_rpt
Don’t create a critical path reporting file.
-report_brief| -report_full
Generate a concise design summary or a detailed one. Default = full.
-map_area_weight <float>
A number between 0 and 1.0. The larger this number, the more mapping will try to
minimize area.
-map_delay_weight <float>
A number between 0 and 1.0. The larger this number, the more mapping will try to
minimize delay.
-simple_port_names
Create simple names for vector ports: %s%d instead of %s(%d).
-bus_name_style <string>
Naming style for vector ports and nets: default %s(%d)| simple %s%d| old_galileo %s_%d
-nobus
Write busses in expanded form. This may be required for the Xilinx EDIF reader.
-nowrite_eqn
Don’t write equations in output; use technology primitives instead.
-nopld_xor_decomp
Don’t do XOR decomposition for Altera MAX and Xilinx CPLD technologies.
-noglobal_symbol
Delete startup (GSR) block.
-notime_opt
Don’t run timing optimization.
-max_frequency <float>
Complete Design Flow, 8-Bit Equality Comparator
So far, we’ve done only half the design work: the design entry and synthesis. To finish the
job, we need to run the Xilinx place-and-route tool, the Design Manager. To illustrate how
this tool is used, we’ll take an example design all the way through the process. This design
is similar to an HC688, an 8-bit equality comparator. This design compares two bytes and
generates a signal called equal if they are equivalent. A cascade input is also provided to
expand the inputs that are compared; if cascade is not asserted, the equal output is
inhibited. Because of personal preference, I’ve made a couple of design changes; all signals
are active high, and I made the equal output synchronous. See Listing 6-6 for the Verilog
code for this design.
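For readers following along without the listing at hand, the description above can be sketched as follows. This is a reconstruction under stated assumptions, not the text of Listing 6-6: the port names are taken from the netlist header shown later in this chapter, and the asynchronous, active-high reset is an assumption.

```verilog
// Sketch of an HC688-style comparator with active-high signals and a
// registered equal output. Reset style is assumed, not confirmed.
module hc688s (equal, clock, reset, cascade, a, b);
  output       equal;
  input        clock;
  input        reset;
  input        cascade;
  input  [7:0] a;
  input  [7:0] b;

  reg equal;

  always @(posedge clock or posedge reset)
  begin
    if (reset)
      equal <= 1'b0;                  // assumed asynchronous clear
    else
      equal <= cascade & (a == b);    // equal only when cascade is high
  end
endmodule
```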
The Verilog code is simple enough; equal can go high only if cascade is high and the
a and b input bytes are equal. Let’s see what LeonardoSpectrum makes of this design by
looking at the extracted schematic of Figure 6-1.
From Figure 6-1 we can see that the equal output is created by a flipflop and that the
clock and reset were implemented as intended. LeonardoSpectrum has instantiated a library
function from its module generator (modgen) to do the equality-test logic. For greater
detail, LeonardoSpectrum has another schematic view option, the gate-level schematic,
shown in Figure 6-2.
The gate-level schematic shows the logic as it is mapped into Xilinx hardware. The
Xilinx Configurable Logic Block (CLB) will be explored in more detail in Chapter 7; for
now we can note the assignment of our logic to 2-, 3-, and 4-input look-up tables (LUTs),
the use of global buffers for clock and reset, and the flipflop that drives the equal output
signal.
A couple of things should be noted about these schematic views. First of all, they are
graphical representations of the netlist that LeonardoSpectrum synthesized. There is still
some processing to be done on the design by the Xilinx Design Manager (the place-and-
route tool). The use of this schematic is as a sanity check; if the design is not being
synthesized effectively, the designer can try different compilation options or design in a
more structural way. For example, the designer can replace the high-level equality operator
(==) with structural gates to assert more control of how the design is synthesized.
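As a rough illustration of what "more structural" could look like, the sketch below builds the byte comparison from gate primitives instead of the == operator. The module and instance names are hypothetical, and the grouping into two 4-input ANDs is simply one choice that lines up with 4-input LUTs. The synthesizer is still free to re-optimize this logic unless told otherwise.

```verilog
// Hypothetical structural equality test; not taken from the book's listings.
module eq8_struct (eq, a, b);
  output       eq;
  input  [7:0] a;
  input  [7:0] b;

  wire [7:0] m;    // m[i] is high when a[i] and b[i] match
  wire       lo;   // low nibble matches
  wire       hi;   // high nibble matches

  xnor x0 (m[0], a[0], b[0]);
  xnor x1 (m[1], a[1], b[1]);
  xnor x2 (m[2], a[2], b[2]);
  xnor x3 (m[3], a[3], b[3]);
  xnor x4 (m[4], a[4], b[4]);
  xnor x5 (m[5], a[5], b[5]);
  xnor x6 (m[6], a[6], b[6]);
  xnor x7 (m[7], a[7], b[7]);

  and a_lo (lo, m[0], m[1], m[2], m[3]);
  and a_hi (hi, m[4], m[5], m[6], m[7]);
  and a_eq (eq, lo, hi);
endmodule
```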
LeonardoSpectrum provides one last view of the schematic, the critical path as shown
in Figure 6-3.
The critical path is the longest delay path through the design. If the design needs to be
optimized for greater speed, the designer should focus on redesigning this path to remove
layers of logic. From this schematic, we can see that the longest delay path is from the b[4]
input to the equal output, and there are four layers of logic in this path. Like the adders we
studied earlier, there is probably a way to add extra logic to “look ahead” and streamline
this logic, if necessary.
For this design, compiling to optimize for delay didn’t change anything, but for most
designs there will be a change in interpreting the design, hopefully a change for the better.
There is one more view that has some value. The output of the synthesizer is a netlist,
in this case an EDIF (.edf) file, but this type of file is not intended to be read by humans.
LeonardoSpectrum can also generate a structural version of the netlist in a Verilog format.
In fact, one great feature of LeonardoSpectrum is the ability to translate between netlists of
various types. Anyway, we’re learning Verilog, so let’s look at the Verilog version of the
netlist as shown in Listing 6-7.
//
// Verilog description for cell hc688s,
// 09/06/99 11:00:40
//
output equal ;
input clock ;
input reset ;
input cascade ;
input [7:0]a ;
input [7:0]b ;
This is a bit of an ugly mess, but there are a few things we can extract from it. Note the _int
attached to the internal signals. This is very polite; some synthesizers convert a useful signal
name like clock into a signal name like ifght_2746 instead of clock_int which makes it very
difficult to search netlists. We want the synthesizer to do whatever is necessary to isolate a
signal as it gets routed, but keep some part of the signal name we assigned in there
somewhere. The equality module is modgen_2, and it gets wired up to the input buffers
(ibufs). The equal register is an OFDX (output D flipflop); note the assignments for Q
output, clock/data/clock enable. The GTS is a global tristate control and the GSR is the
global set/reset control.
The place-and-route tool works on the netlist that is extracted from the input design
and influenced by the design constraints and synthesis controls. If there is a problem with
synthesized logic, it may help to look at the netlist and make sure things are being
synthesized in a reasonable manner.
Another netlist form is the .xnf (Xilinx Netlist Format) which is very readable. Sadly
though, Xilinx is moving to standardize on the much-less-readable EDIF format.
8-Bit Equality Comparator with Hierarchy
Let’s hook up a few of our equality comparators and see what effect a hierarchical design
has on the resulting netlist. The hier688 design, shown in Listing 6-8, instantiates three of
our hc688s designs to create a 24-bit address decoder.
endmodule
The schematic of Figure 6-4 is not very legible, but you can see that our structural use
of the HC688 decoders results in cascaded logic. This design is not going to be very fast,
but is easy to put together as it reuses predesigned HC688 modules. Although we’re not
going to analyze the critical path, clearly it will be from a low-order address input to the
output_enable output signal.
Let’s carry this design into a real device. We do this by placing and routing the design
and creating a configuration file for the Xilinx device where our design will live. We will
open the Design Manager, create a new project, and browse (see Figure 6-5) until we find
the hier688.edf netlist. The Design Manager has a one-button operation (the idea is: if the
designer falls over dead, his or her head will hit the keyboard, and a place-and-route will
still take place). We’ll play dumb and just run the default Design Manager flow and see
what we get.
A convenient way to execute the Design Manager is to create a shortcut icon on your
Windows desktop. For example, in my environment the command line is:
C:\Xilinx\bin\nt\dsgnmgr.exe.
Listing 6-9 8-Bit Equality Comparator Hierarchical Example, Xilinx Translation Report
Figure 6-6 shows the Report Browser window. If we click on the Translation Report,
we will see the report of Listing 6-9, and we can see that the input design was read without
error. The EDIF netlist is converted to a Xilinx binary netlist file: a .ngo file.
Listing 6-10 8-Bit Equality Comparator Hierarchical Example, Xilinx Place and Route
Report
Listing 6-10 is a clip from the Xilinx place-and-route report. Like a printed circuit
board autorouter, the place-and-route tool tries different placements and selects the ones
with the better results. At this point an estimate of the timing can be extracted.
Listing 6-11 Equality Comparator Hierarchical Example, Xilinx Average Delay Report
The Number of signals not completely routed for this design is: 0
d <= 10   < d <= 20   < d <= 30   < d <= 40   < d <= 50   d > 50
-------   ---------   ---------   ---------   ---------   -------
     37           0           0           0           0         0
The signal delays are binned per Listing 6-11. This is a moderately fast design (looks
like it would run at 100 MHz to me) but only because very little of the device is used! As
the device gets fuller and more logic competes with routing resources, the design will get
slower.
Listing 6-12 8-Bit Equality Comparator Hierarchical Example, Xilinx Pad Report
We did not assign pin locations in the input design. The first time through it is not a
bad idea to let the place-and-route tool assign the pins (particularly with Altera devices).
The FPGA design tries to allow pins to be assigned in a universal manner (i.e., not be
sensitive to pin usage by the designer; allow any I/O pin to be used with logic anywhere on
the chip), but there is some assumption made, for example, that data flow is horizontal (with
relation to the Pin 1 location on the device) and control is vertical. On the other hand, for
the PWB design, you may want to control the pin locations and keep addresses together and
that sort of thing. Once the circuit board has been designed, we don’t want the compiler
reassigning pins, so we are going to constrain the pin locations. The pins assigned by the
Xilinx place-and-route tool can be located in the Pad Report as shown in Listing 6-12. This
file can be cut, pasted, and edited into the LeonardoSpectrum Constraint file to lock down
pin assignments as shown in Listing 6-13. This can also be done in Xilinx Design Manager,
but I prefer to lock these pins in the design capture environment.
These are not the only required pin assignments on the circuit board. We must hook
up the dedicated signals including power, ground, and configuration signals on the board-
level schematic.
The top 20 delays can be viewed in the Asynchronous Delay Report as shown in
Listing 6-14. From this, we can guess that this design would run at 168 MHz, not bad for a
slow -3 speed grade part. Again, we’re using only a tiny percentage of the device. Still, this
is not the full story. These are just the delays between individual nodes; to get the full delay we
have to run full timing analysis with this result:
This tells us that the worst-case delay from flipflop to flipflop is 9.967 nsec, so we can
really only run our clock at 100 MHz, not nearly so impressive.
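The arithmetic behind these two numbers is just the reciprocal of the delay. Using only the figures quoted above:

```latex
f_{\max} = \frac{1}{t_{\mathrm{FF \to FF}}} = \frac{1}{9.967\ \mathrm{ns}} \approx 100.3\ \mathrm{MHz}
\qquad\text{whereas}\qquad
t_{168} = \frac{1}{168\ \mathrm{MHz}} \approx 5.95\ \mathrm{ns}
```

In other words, the 168 MHz guess assumed a path of about 6 nsec, while the full flipflop-to-flipflop path is closer to 10 nsec.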
The Xilinx place-and-route tool, called the Design Manager, converts the EDIF netlist into a
configuration file that can be loaded into a target device. Some of the place-and-route tool
optimization parameters are configurable by the designer. To get into the options menu,
select Options from the Implementation menu as shown in Figure 6-8.
MAPPING OPTIONS
The synthesized netlist has some placeholders for precompiled library elements. The
mapper finds the library elements (.ngo files, a binary netlist format) and merges them in.
The mapper then converts the merged netlist into a physical netlist with specific hardware
elements assigned to all the netlist logic elements. The mapper output is an .ncd (physical
netlist format) file. The user can configure the mapping process with the following options
from the Implementation Options window shown in Figure 6-9.
If the mapper encounters logic that is not used, this logic can be deleted from the design.
This simplifies the logic and speeds up the place-and-route process. However, the designer
might want to keep the unused logic because it will be used in a later version of the design.
Leaving the logic in may give a better estimate of the resources and timing related to the
final design.
Redundant logic can be added to the design to reduce driver loading and speed up the
design (the basic area/speed trade-off).
Generally, the basic Xilinx logic element is a 4-input look-up table. However, in some
Xilinx families the CLB logic can be configured to create 5-input LUTs.
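To make the look-up table idea concrete, here is a small behavioral sketch. A 4-input LUT is modeled as a 16-bit truth table indexed by its inputs, and a 5-input function is built from two such tables plus a 2:1 selection on the fifth input, which is roughly the job the H function generator does in a 4000-series CLB. The module names lut4 and func5 and the parameter names are assumptions for illustration; these are not Xilinx library primitives.

```verilog
// Behavioral model only; names and parameters are hypothetical.
module lut4 (out, in);
  parameter INIT = 16'h0000;       // truth table contents
  output       out;
  input  [3:0] in;

  wire [15:0] truth_table = INIT;
  assign out = truth_table[in];    // inputs select one stored bit
endmodule

module func5 (out, in);
  parameter INIT_F = 16'h0000;     // function when in[4] is 0
  parameter INIT_G = 16'h0000;     // function when in[4] is 1
  output       out;
  input  [4:0] in;

  wire f;
  wire g;

  lut4 #(INIT_F) f_gen (f, in[3:0]);
  lut4 #(INIT_G) g_gen (g, in[3:0]);

  assign out = in[4] ? g : f;      // third stage picks F or G
endmodule
```

The cost is visible in the structure: a 5-input function consumes both 4-input generators, so fewer unrelated functions fit in the same CLB.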
The mapper uses a set of rules to attempt to utilize the CLBs effectively. The CLB Packing
Strategy modifies the logic partitioning to allow less signal sharing and allows the use of a
CLB flipflop without the associated LUT. Again, this is a speed/area trade-off; the CLB
Packing Strategy can use more logic but may allow the design to run at a higher operating
speed. The Fit Device option packs the CLBs with possibly unrelated logic until the design
fits into the target device or until no more packing is possible. Turning this option Off
allows only related logic (logic with shared inputs) to be packed into a CLB.
This option controls register ordering by analyzing bussed signal names. The Minimum
Area option will result in a denser design with registers mapped in a more random order.
The Structure option enables register-ordering analysis.
Normally, the synthesis tool assigns logic to I/O buffers (IOBs). However, this option
allows the mapper to assign IOBs and can result in better CLB packing. Use the Off option
to allow the synthesis tool to control IOB assignment.
Older Xilinx devices used primary (BUFGP) and secondary (BUFGS) global buffers for
global signals, so some synthesis tools may make these assignments. Newer Xilinx devices
use a pool of generic global buffers (BUFGs). Enabling this option will allow the
replacement of BUFGSs and BUFGPs with BUFGs.
Place-and-Route Options
Another trade-off is the amount of time spent optimizing a design versus the optimization
results as shown in the Place and Route menu in Figure 6-10. If the place-and-route tool
tries longer, it will have more options to select from, and the area/speed results will
probably be better. Higher effort levels will increase the run time.
The designer can select the number of routing passes. Each routing pass is a complete
attempt at placement. Once the router has met the design requirements (the design fits into
the device with all timing constraints met), the router exits.
Once a design has been placed, the timing can probably be improved. With this option the
designer can run 1 to 5 additional cleanup passes to attempt to improve the operating speed.
The timing constraints can be used to influence the place-and-route and achieve higher
operating speeds. Using timing constraints trades off processing time for design
performance. Turn this option Off to ignore timing constraints and speed up the place-and-
route process.
Logic Level Timing Report/Post Layout Timing Report
For a quick view of the timing performance of the design, a logic level timing report can be
produced by selecting the check box shown in Figure 6-11. These estimated results can be
reviewed without going through the complete (and often very time-consuming) place-and-
route process.
A top-level report of the device timing can be reviewed with this brief timing report. The
maximum clock speed is reported. For error and path reports the entries are sorted by
constraint and delay value. Negative slack-time values indicate a constraint that was not
met.
This setting, either Summary, No Limit, or a number from one to ten, limits the reported
number of worst-case paths per timing constraint.
This option provides a timing analysis when no user constraints are present. The analysis
includes all clocks, the required offset for each clock, and a listing of combinational paths
sorted by delay value.
This option generates a timing report based on timing constraints. The number of paths
reported per constraint is per the selection made in the Limit Report to n Paths per
Timing Constraint dialog box.
Listing 6-15 is an example of a timing report for a signal in the hier688.v design. All
the delay paths between rwn and output_enable are listed, along with the positive slack time
(good!). Note that 80% of the delay is in logic. This percentage will get smaller (possibly
much smaller) as the design gets more dense and the logic fights for routing resources.
=================================================================
Timing constraint: TS01 = MAXDELAY FROM TIMEGRP "PADS" TO TIMEGRP "FFS" 50nS;
30 items analyzed, 0 timing errors detected.
Maximum delay is 13.354ns.
-----------------------------------------------------------------
Slack: 36.646ns path rwn to output_enable relative to
50.000ns delay constraint
reg_output_enable
-------------------------------------------------
Total (10.700ns logic, 2.654ns route) 13.354ns (to clock_int)
(80.1% logic, 19.9% route)
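The slack and the logic/route split in this report can be checked by hand from the numbers shown:

```latex
\begin{aligned}
t_{\mathrm{path}} &= 10.700\ \mathrm{ns} + 2.654\ \mathrm{ns} = 13.354\ \mathrm{ns} \\
\mathrm{slack} &= 50.000\ \mathrm{ns} - 13.354\ \mathrm{ns} = 36.646\ \mathrm{ns} \\
\frac{t_{\mathrm{logic}}}{t_{\mathrm{path}}} &= \frac{10.700}{13.354} \approx 80.1\%,
\qquad
\frac{t_{\mathrm{route}}}{t_{\mathrm{path}}} = \frac{2.654}{13.354} \approx 19.9\%
\end{aligned}
```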
This option generates a report of signals and paths that fail the timing constraints, listed
from worst to best. The logic and routing delays are identified and the failing path delays
are broken out to show all the delays that build up to cause the problem. A close
examination of the delays will provide clues to areas that can be pipelined or simplified to
make the design run faster or identify areas where the constraint is over-specified.
The number of paths reported per constraint is per the selection made in the Limit
Report to n Paths per Timing Constraint dialog box.
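As one illustration of what "pipelined" might mean for a wide comparison, the sketch below compares three bytes in parallel, registers the three partial results, and combines them on the following clock. This is a hypothetical alternative, not the hier688 design; the module name, port names, and asynchronous reset are assumptions. The trade-off is two clocks of latency in exchange for fewer layers of logic between flipflops.

```verilog
// Hypothetical two-stage pipelined 24-bit comparator.
module cmp24_pipe (match, clock, reset, address, base);
  output        match;
  input         clock;
  input         reset;
  input  [23:0] address;
  input  [23:0] base;

  reg [2:0] byte_eq;   // stage 1: one flag per byte
  reg       match;     // stage 2: all three bytes matched

  always @(posedge clock or posedge reset)
  begin
    if (reset)
      begin
        byte_eq <= 3'b000;
        match   <= 1'b0;
      end
    else
      begin
        byte_eq[0] <= (address[7:0]   == base[7:0]);
        byte_eq[1] <= (address[15:8]  == base[15:8]);
        byte_eq[2] <= (address[23:16] == base[23:16]);
        match      <= &byte_eq;   // uses the previous clock's flags
      end
  end
endmodule
```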
Interface Options
When the netlist is merged and .ngo files are inserted, the compiler searches for the proper
file to insert. The user can add other search paths. Multiple search paths can be entered, a
semicolon being used as path separator.
Rules File
To be merged in the ncf netlist, the filetype must be an ngo. The rules file path can point to
a utility for converting other netlist file formats to an .ngo filetype.
Some design tools convert PAD symbols into module port symbols. This checkbox option
will convert top-level module ports into PADs (device pins).
Simulation Options
Xilinx can create a timing-annotated netlist in three flavors: EDIF, VHDL, and Verilog.
We’ll want to use the Verilog option to support Verilog simulation, of course. Vendors
supported for this version of the Xilinx place-and-route tool include generic EDIF, generic
Verilog, generic VHDL, ActiveVHDL, Concept NC-Verilog, Concept Verilog-XL,
Foundation EDIF, ModelSim Verilog (for the purposes of this book, this is the option we
will use), ModelSim VHDL, NC-Verilog, Quicksim, Verilog-XL, Viewsim-XL, Viewsim-
EDIF, VSS, and Default.
To use your logic gate and signal names instead of the names assigned by the place-and-
route tool in the optimized netlist, check this checkbox.
Define the filename for the simulation output file. If you want to keep multiple versions of
the simulation file, enter the filenames here, otherwise the new file will overwrite the
previous one.
For simulation purposes, it can be handy to have the internal Set/Reset node available as a
port at the top level of the design. The signal name that drives the Global Set/Reset (GSR)
resource can be entered in the dialog box to match the HDL design.
For simulation purposes, it can be handy to have the internal tristate control node available
as a port at the top level of the design. The signal name that drives the Global Tristate (GTS)
resource can be entered in the dialog box to match the HDL design. This tristate controls all
device outputs and is useful for isolating a device from a circuit board being tested
(stimulated) with external equipment.
Check this checkbox to create a Verilog test fixture (.tv) template file.
Xilinx provides a set of timing-annotated SIMPRIM (SIMulation PRIMitive) files. The path
to these files can be automatically inserted in the Verilog test-fixture file by checking this
checkbox.
The Verilog test-fixture file can maintain the input design hierarchy or flatten the netlist into
one big file. Check this checkbox to maintain the input design hierarchy.
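To give a feel for what ends up in a test fixture once the template is filled in, here is a small self-checking sketch for the equality comparator. It is not the template that the Xilinx tool generates. The stand-in hc688s module is an assumed behavioral model included only so the example is self-contained; in a real flow the timing-annotated netlist would take its place. The fixture name hc688s_tf and the stimulus values are arbitrary.

```verilog
`timescale 1ns / 1ns

// Stand-in for the device under test (assumed behavior, async reset).
module hc688s (equal, clock, reset, cascade, a, b);
  output       equal;
  input        clock, reset, cascade;
  input  [7:0] a, b;
  reg          equal;

  always @(posedge clock or posedge reset)
    if (reset) equal <= 1'b0;
    else       equal <= cascade & (a == b);
endmodule

// Hypothetical test fixture.
module hc688s_tf;
  reg        clock, reset, cascade;
  reg  [7:0] a, b;
  wire       equal;

  hc688s uut (equal, clock, reset, cascade, a, b);

  always #10 clock = ~clock;     // 20 ns period

  initial
  begin
    clock = 0; reset = 1; cascade = 0; a = 8'h00; b = 8'h00;
    #25 reset = 0;

    // Matching bytes, cascade low: equal must stay low.
    @(negedge clock) a = 8'h5A; b = 8'h5A;
    @(negedge clock) if (equal !== 1'b0) $display("FAIL: equal high without cascade");

    // Matching bytes, cascade high: equal goes high after a clock.
    cascade = 1;
    @(negedge clock) if (equal !== 1'b1) $display("FAIL: equal low on match");

    // Mismatched bytes: equal drops after a clock.
    b = 8'h5B;
    @(negedge clock) if (equal !== 1'b0) $display("FAIL: equal high on mismatch");

    $display("Test fixture finished");
    $finish;
  end
endmodule
```

Stimulus changes on the falling clock edge and results are checked one full cycle later, which leaves half a period of setup time. That margin matters once real routed delays are annotated.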
Configuration Options
Xilinx devices are SRAM based and must have their configuration loaded after each power-
on. There are many configuration modes, including serial PROM, parallel master, parallel
slave, download cable, etc.
Configuration Rate
Slow (1 MHz) or Fast (8 MHz) internal configuration clock (master modes). These are
approximate speeds.
Select Read from Design to use the TTL/CMOS input level defined in the physical
constraints (PCF) file.
Configuration Pins
Various pull-up and pull-down options are available for the TDO, Mode, and Done
configuration pins, including a tristate mode.
The internal Xilinx configuration logic can perform a four-bit partial CRC check of
configuration data frames or just do a simple check of the 0110 pattern at the end of each
frame.
The normal configuration file is a binary .bit file. An ASCII version (.rbt) of this
configuration bitstream file can also be created.
I/O pins on a low-voltage device can be configured to withstand higher drive voltages for
mixed-power-supply operation.
Start-Up Options
Start-up Clock
Configuration can be started based on an internal (CCLK) or external clock (User Clock)
source.
The status of the open-drain DONE pin can be monitored. In cases where multiple FPGA
DONE pins are wire-ORed together, enabling this feature will cause all devices to start up
when the last device has finished configuring.
Output Events
Control signals can be asserted or released with different timing. These status signals
include Done, Enable Outputs, and Release Set/Reset.
Readback
The device configuration can be read when readback is enabled (readback can be disabled
for design security reasons). This tab includes options for the readback clock source
(internal or external) and termination of the readback process.
Unused pins can be tied high or low to reduce noise and power consumption.
Advanced Options
In the master parallel configuration mode, where the FPGA generates address lines to
control a parallel memory device, the configuration address lines can be configured for 18
or 22 lines.
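The practical meaning of the two settings is the size of memory the FPGA can address. Assuming a byte-wide parallel memory:

```latex
2^{18} = 262{,}144\ \text{bytes (256K)}
\qquad
2^{22} = 4{,}194{,}304\ \text{bytes (4M)}
```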
Other Design Manager Tools
Design Manager tools include the Flow Engine (which we used to perform the place-and-
route process), Timing Analyzer, Floor Planner, PROM File Formatter, Hardware Debugger
(which includes the FPGA download utility), and the EPIC Design Editor.
Timing Analyzer
The Timing Analyzer will provide a report of selected paths in the design. For example, it is
possible to examine all clocks in the design. Specific paths can be excluded.
=================================================================
Timing constraint: Default period analysis
12 items analyzed, 0 timing errors detected.
Maximum delay is 11.647ns.
-----------------------------------------------------------------
Delay: 11.647ns device_bus2(0) to device_bus1(2)
IPAD_device_bus2(0)
ix57
CLB_R24C1.G2 net (fanout=3) 2.016R
device_bus2(0)_int
CLB_R24C1.Y Tilo 1.590R
device_bus1_dup0(3)
ix66
P44.O net (fanout=1) 2.441R
device_bus1_dup0(2)
P44.PAD Topf 4.040R device_bus1(2)
ix50
OPAD_device_bus1(2)
-------------------------------------------------
Total (7.190ns logic, 4.457ns route) 11.647ns
(61.7% logic, 38.3% route)
Listing 6-16 shows a generic timing report for the worst path (critical path) in the hier688
design. The maximum delay for this path is 11.647 nsec. Note the division of time between
logic and routes listed at the bottom. As the design gets denser, the routing will be a higher
percentage of the delay.
Floorplanning
Floorplanning is a procedure where the arrangement and location of logic inside the FPGA
is manipulated and optimized. Figure 6-12 illustrates a typical device floorplan. Some
aspects of the design are obvious to the designer and may or may not be recognized by the
automated place-and-route tools. Which parts of the design are critical and should be
located adjacent to other logic elements? Can things be switched around to get a faster
and more efficient design? Humans are better at these types of tasks than computers.
Figure 6-13 shows a zoom view of the hier688 logic, the pin assignments, the CLBs,
and a rats-nest view of the signal routing.
Figure 6-13 Xilinx Design Manager Floorplanner Tool, hier688 Design Zoom View
Xilinx supports serial and parallel configuration PROM versions. A file can also be created
and linked into a microcontroller PROM. Large devices may require multiple PROMs. The
PROM File Formatter allows the design to be split into multiple configuration devices as
shown in Figure 6-14.
Hardware Debugger
The Hardware Debugger provides communication options (shown in Figure 6-15) that
allow a device to be configured through a PC serial port, a PC parallel port, or a Xilinx
Xchecker cable (which also connects to a PC parallel port). A header (standard 0.025-inch square
posts on 0.1-inch centers) wired per Figure 6-16 must be on the circuit board to support this
download. Xilinx also supports 4-wire (TDI, TMS, TCK, TDO) JTAG serial-port
programming.
This tool provides a graphical representation of the design as if you were looking down at
the physical device itself (see Figure 6-17). Pins, pin buffers and registers, global signals,
signal routing, and CLBs are all visible. Some routing can be done at this level. For
example, it is possible to hook up test-points without resynthesizing and recompiling.
A Look at Competing Architectures
• The package. A high proportion of the cost of an integrated circuit is in the packaging
(just like the cost of potato chips).
• The silicon. This includes the “square footage” and the complexity in number of
layers and lithography. This is similar to a warehouse, where the price is correlated to
how big it is and how many floors it has. Other factors can contribute, like whether
the IC fab process is “mainstream” or not. Some programmable devices use EEPROM
technology or specialized materials, such as the tungsten plugs that Actel uses for
layer interconnect vias.
• The volume, or the economy of scale. If you want a cheap IC, you have to either buy a
lot of them or leverage mass quantities consumed by industry (in other words, if you
want a cheap device, buy an IC that many other companies are buying). To turn this
around, in order for a foundry to sell an IC cheaply, it must produce a lot of them; it
doesn’t much matter if one customer or many customers buy the product. FPGAs are
flexible and are used by many designers in many industries. ASICs are specific and
targeted toward specific customers (though these foundry customers may resell the
device to a wide market; from Intel’s point of view, the microprocessor is a type of an
ASIC, for example).
These factors hold up only when a competitive situation exists. Where sources are limited
by specialized processes, proprietary technology, and small markets, the economic situation
is different and prices go up.
The FPGA market is dynamic and exciting but is also a bewildering mess of TLAs
(Three-Letter Acronyms) and hyperbole. Each FPGA vendor has a design strategy to solve a
problem for the users of its products. The vendors have individual niches, strategies, and
technologies. The main idea to keep in mind is that no single design approach is ideal for
every design target; each FPGA architecture has strengths and weaknesses. A Xilinx FPGA may
be the best solution for one design problem (like random arrays of mixed functions); an
Altera CPLD may be the best solution for the next design (like datapath functions
including digital filters); an Actel device might be the best for designs that require
ASIC-like performance.
Other factors influence the design of an FPGA device, including the patent minefield.
It can be quite galling to use a patented architecture and make your competitor rich with
royalties. Can you design around a patent? Can you devise another way of solving a
problem? Is it better to just license the patent? What features does your target market need?
Are there gaps in a competitor’s product line?
FPGA Device Design 225
The Verilog FPGA designer must be very aware of the target FPGA architecture.
Keep in mind that your design will be implemented in look-up tables with significant
routing propagation delays. Verilog can be a portable language, but the highest-performance
designs are tailored for the target FPGA. How granular are the FPGA design elements? A
Xilinx 4000XL FPGA can be thought of as an array of four-input LUTs, any of which can
be a 16-by-1 RAM. An Altera FLEX 8K device is an array of Logic Array Blocks (LABs)
which have four-input LUTs in groups of eight with dedicated RAM blocks spread around
the die. How rich is the routing resource? Let’s say you notice that the die inside an FPGA
that Altera rates at 20,000 gates is about two-thirds the size of a similarly rated Xilinx
device. Does this mean Altera more cleverly packs LUTs on a die, or that Xilinx believes
much more routing resource is necessary to adequately support the use of the CLBs? Are
internal tristate signals supported? How many global low-skew clock networks are
available? Leaving vested interests and emotion aside, which architecture is better? Is it
better use of silicon to provide more gates or more routing? There are no universal answers
to these questions.
Just about any FPGA can do just about any job, given that the device has enough pins and
gates. Often a device is selected for nonscientific reasons, such as what devices are in the
company stockroom or which device the designer feels a need to put on her resume. The
following checklist allows a more objective comparison of FPGA technologies. We’re not
going to ask about gate count: each vendor counts gates in a different manner, so this
number is nearly useless. These questions are more important than the device overview,
because the technology is constantly changing.
• How many logic blocks and flipflops does the device contain?
• How much RAM does the device have (if any)? Dual port or single port?
Distributed or in blocks? How big are the blocks? In what organization can they
be used (X1, X2, X4, X8, X16, X32, etc.)?
Keep in mind that each device has power/ground and other dedicated pins that are not
available for use as signals by the Verilog designer.
• Are denser devices available in the same package and compatible pinout?
If the design grows, will the circuit board need to be redesigned to accommodate a denser
device?
• Does the device pinout have to be locked before the FPGA design is complete?
A requirement to finish a PWB design early can favor an FPGA device (which has more
capability of routing device pins to random logic elements) over a CPLD device.
FPGA Technology Selection Checklist 227
• Does the FPGA support in-circuit configuration for field upgrade or board
customization?
• Is a socket required?
For one-time programmable parts, until the design is solid, a socket will be necessary to
ease the changing of devices. This is not a trivial matter; fine-pitch packages can be a real
challenge to socket.
Two vendors have about 80% of the FPGA market: Xilinx and Altera. Which one is
larger is subject to debate at this writing and depends on how the numbers are counted. You
can’t be an FPGA designer without acknowledging these two companies (you need both on
your resume to be most marketable). Other companies have great products and software!
None yet comes close to achieving Altera and Xilinx’s market share, and I think none ever
will. Check the resources section at the end of this book for device-manufacturer website
addresses and visit those sites for the latest data.
This is an older SRAM-based architecture. The CLBs have five logic inputs, two flipflops, a
common clock, direct reset, and a clock enable. It’s interesting to look at the clock-enable
implementation (see Figure 7-3). A clock-enable MUX selects between the output of the
look-up table and the latch. In the clock disabled mode, the output of the latch is fed back to
the input.
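The clock-enable scheme can be sketched in Verilog. This is a minimal illustration rather than the actual XC3000 circuit; the module and signal names (ce_ff, lut_out, ce) are made up for the example, and an active-high enable and a rising-edge clock are assumed.

```verilog
// Sketch of a clock-enable flipflop built from a 2:1 MUX.
// Assumptions: active-high enable, rising-edge clock, illustrative names.
module ce_ff (clk, ce, lut_out, q);
  input  clk;      // common CLB clock
  input  ce;       // clock enable (assumed active high)
  input  lut_out;  // output of the look-up table
  output q;
  reg    q;

  // MUX: take new data when enabled, otherwise recirculate q
  wire d = ce ? lut_out : q;

  always @(posedge clk)
    q <= d;
endmodule
```

Because the clock itself is never gated, the flipflop stays on the low-skew clock network; holding a value simply means reloading the old one on every edge.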
As shown in Figure 7-2, XC3000 IOBs have input and output flipflops,
programmable tristate, and pull-up resistor output control. This architecture does not have
more modern features like distributed RAM and fast carry in/out.
The Xilinx 4K family, compared to the 3K family, has greater densities, improved speed,
and other added features. In particular, the addition of distributed RAM (the ability to
configure a CLB as a 16-by-1 RAM cell) in the 4000E and 4000X families is a great feature
for the designer.
The Xilinx CLB, as illustrated in Figure 7-4, contains two four-input LUTs, two D-type
flipflops with dedicated clock enable, set or reset, a clock with configurable polarity,
and fast carry-in and carry-out signal paths. Each CLB in a 4000E/XL device can be used as
two 16-by-1 single-port RAMs, a single 16-by-1 dual-port RAM, or a single-port 32-by-1 RAM.
The dual-port RAM configuration is synchronous; the other modes can be asynchronous
(level-sensitive).
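As a rough behavioral picture of one of these modes, here is a 16-by-1 single-port RAM with a clocked write and an unclocked read. The module and port names are assumptions for illustration, and whether a given synthesis tool maps this description onto CLB RAM (rather than flipflops) depends on the tool and its settings.

```verilog
// Behavioral sketch of a 16-by-1 single-port RAM.
// Assumptions: rising-edge write, asynchronous read, illustrative names.
module ram16x1 (clk, we, addr, din, dout);
  input        clk;
  input        we;    // write enable
  input  [3:0] addr;  // four address bits select one of 16 locations
  input        din;
  output       dout;

  reg mem [0:15];     // sixteen 1-bit storage locations

  // Write on the clock edge when enabled
  always @(posedge clk)
    if (we)
      mem[addr] <= din;

  // Read is combinational: dout follows the addressed location
  assign dout = mem[addr];
endmodule
```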
Other blocks provided by Xilinx include Input/Output blocks (IOBs) which include
I/O registers and configurable terminations (pull-up or pull-down) and pin buffers (fast or
slow). The 4000 family includes wide decoder blocks which are useful for fast decoders for
up to nine inputs. The 4000 family includes an on-chip oscillator and dedicated low-skew
networks that can be used for clocks and other fast global signals. The 4000 family supports
internal tristate signals and busses.
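Internal tristate busses are described in Verilog by driving a net to high impedance whenever a driver is disabled. The following is a minimal sketch; the module name, bus width, and signal names are assumptions for illustration, and it is up to the designer to make sure only one enable is active at a time.

```verilog
// Sketch of a two-driver internal tristate bus.
// Assumption: en_a and en_b are never asserted together.
module tbus2 (en_a, en_b, a, b, bus);
  input        en_a, en_b;  // driver enables
  input  [7:0] a, b;        // the two data sources
  output [7:0] bus;         // shared bus

  // Each driver releases the bus (high impedance) when disabled
  assign bus = en_a ? a : 8'bz;
  assign bus = en_b ? b : 8'bz;
endmodule
```

On an architecture without internal tristates, a synthesis tool would typically have to turn a structure like this into a multiplexer, which is one reason the question belongs on the selection checklist.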
HardWire Devices
Xilinx offers a hardwired version of its FPGAs, which can save some cost for
applications where the volume does not justify a conversion to a full custom ASIC. In this
technology, Xilinx uses the same CLB architecture but replaces the SRAM routing and
switching arrays with metal layers. The result is less silicon (smaller die) and timing equal to or
better than that of the FPGA. An advantage to conversion to HardWire is the
low stress on the FPGA designer; Xilinx guarantees that the timing and function of the
custom device will match the FPGA device. Conversion to a HardWire device is a test-
vectorless process; Xilinx develops automated test coverage and guarantees the device will
work in your application. This does not let the designer off the hook for doing a
synchronous design and doing thorough testing. For example, all asynchronous logic needs
to be reviewed for race conditions, because the HardWire device will most likely be faster
than the FPGA design.
The latest devices from Xilinx are built with 0.22-micron lithography (with a roadmap to
0.18 micron) and five-layer metal technology. The million-gate device has 75,000,000
transistors. Interesting new features, as shown in Figure 7-5 and Figure 7-6, include mixed-
voltage I/O (including low-voltage differential inputs to support busses like GTL),
dedicated 4096-bit dual-port SRAM blocks, distributed RAM cells, multiple DLLs (Delay-
Locked Loops) to provide controlled-delay clock networks, and vector-based routing
(allowing flexible routing up/down/left/right between CLBs). These are 2.5-volt devices,
but the I/Os are tolerant of higher interface voltages.
Xilinx advertises that its million-gate device has 27,648 logic cells, 131,072 block
RAM bits, and 660 user I/O pins. Consider that an 8031 microcontroller core is less than
600 gates (less than 100 CLBs), contains 256 bits of RAM, and has 32 user I/O pins.
Welcome to the new millennium.
Figure 7-6 Xilinx Virtex Family CLB Architecture (one of two slices per cell shown)
Virtex I/O blocks, as shown in Figure 7-7, include programmable pull-up and pull-
down resistors, a weak keeper circuit (this holds a signal value when a driver is removed),
tristate control, I/O latches, and inputs with programmable delay (for shifting an input signal
with respect to the clock edge).
Configuration Devices
Altera uses the phrase Complex Programmable Logic Devices (CPLD) to describe their
design approach. Altera uses less routing resource than Xilinx. Their LABs (Logic Array
Blocks) are more complex than Xilinx’s CLBs and fewer of them are available on an Altera
die. Is this approach better than Xilinx’s? It depends on what you’re trying to do. One thing
in Altera’s favor is that its place-and-route software has a much easier job than Xilinx’s
Design Manager because there is much less routing resource. This means the Altera
software is fast and very deterministic; the same design, compiled with the same
compilation settings, gives the same result every time. On one of the Usenet groups a
designer said, “I love Altera’s software and I love Xilinx’s silicon.” There is a lot of
wisdom in that simple statement.
Because of the limited routing, it can be more difficult to route a design. The designer
must not lock down the pins too early in the design cycle. As the design grows, it may not
be possible to “reach” all the pins that were previously assigned. This is much less of an
issue with Xilinx/Lucent/Actel architectures.
Configuration modes include JTAG, JAM (a serial configuration standard that Altera
is promoting), active/passive serial modes, active/passive parallel synchronous modes, and
asynchronous modes.
Altera upgraded the FLEX8K family to create the FLEX10K family. FLEX10K includes
embedded 2048-bit dual-port RAM blocks (Embedded Array Blocks or EABs) which are
spread around the die. The 2048-bit EABs can be used as 2048-by-1, 1024-by-2, 512-by-4,
or 256-by-8 arrays. They can also be combined with other EABs to create wider or deeper
memory elements. The EABs can also be used for logic functions where they act as large
LUTs. The EABs include input and output registers.
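A block of this kind can be pictured as a small synchronous RAM whose address and data widths trade off against each other. The sketch below uses the 256-by-8 organization; the module name, port names, and read-during-write behavior are assumptions, and whether a tool places such a description in an EAB depends on the tool.

```verilog
// Sketch of a 2048-bit synchronous RAM in the 256-by-8 organization.
// Other organizations from the text: 11/1, 10/2, 9/4 (ADDR_BITS/DATA_BITS).
module eab_ram (clk, we, addr, din, dout);
  parameter ADDR_BITS = 8;  // 8 address bits -> 256 words
  parameter DATA_BITS = 8;  // 8 data bits    -> 256 x 8 = 2048 bits

  input                  clk;
  input                  we;
  input  [ADDR_BITS-1:0] addr;
  input  [DATA_BITS-1:0] din;
  output [DATA_BITS-1:0] dout;
  reg    [DATA_BITS-1:0] dout;

  reg [DATA_BITS-1:0] mem [0:(1<<ADDR_BITS)-1];

  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;
    dout <= mem[addr];  // registered output; returns the old word during a write
  end
endmodule
```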
FLEX10K LEs are similar to the FLEX8K with the addition of an output connected to
their fast routing channel (called FastTrack). In addition, a clock enable is added to the LE
flipflop.
Building on the structure of the 8K/10K architectures, Altera has designed the APEX 20K
family with a logic structure as shown in Figure 7-10. This family has much denser devices
and advanced features like a clock PLL (for clock lock, multiplication, and phase shift), a mix
of logic structures (including RAM and wide PLD-type logic for decoders), and a variety of
input/output modes (to interface with single-ended and differential busses like Low Voltage
Differential, Stub-Series Terminated, and Gunning Transceiver Logic). The number of
Logic Elements in a LAB is expanded from 8 in the FLEX 8K/10K families to 10. The
Logic Elements are similar to the FLEX10K with added synchronous Load and Clear logic
and more clock options.
Increasing electronic design density is a trend
that has continued over the last 40 years. The consumer’s hunger for increasingly
sophisticated gadgets, whether GPS, cellular telephones, games, home automation,
networking, Internet commerce, audio/video entertainment, or computing, seems endless.
We get more free time each year, and we are filling this time by playing with our electronic
toys, all of which continue to get smaller, use less power, and grow more complicated.
While the demand grows, the ability for industry to provide transistors and gates also seems
endless. For the FPGA designer, this means designs will contain more gates.
Suppose an Engineer can design at a rate of 100 or so gates a day. Let’s call this about
10 lines of Verilog code (this includes the overhead of test and documentation). Soon, the
average FPGA design will be 200,000 gates. This means, unless the design methodology
changes, that a two-person team will take 1,000 days to complete a design, almost three
years! Each year the design task gets more complex, but the schedule remains about the
same. The company expects an average project to be complete in a year and a complex
240 Libraries, Reusable Modules, and IP Chapter 8
project to be complete in a year and a half. Clearly, something has to give. There are several
options.
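The arithmetic behind that estimate can be written out as follows. The symbols are our own shorthand, not an established notation: G is the gate count, r the gates designed per engineer per day, and n the team size.

```latex
\[
T = \frac{G}{r \cdot n}
  = \frac{200{,}000\ \text{gates}}{100\ \text{gates/day} \times 2}
  = 1{,}000\ \text{days} \approx 2.7\ \text{years}
\]
```

To hit a one-year schedule with the same team, the design rate r would have to grow by roughly a factor of three, which is the productivity gap the rest of this chapter tries to close.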
COFFMAN’S LAW
The average measure of intelligence of a room is inversely proportional to the
number of people in the room.
In the hardware design world, an electrical designer of the 40s designed with a handful of
vacuum tubes. In the 50s, the tubes were replaced with transistors. In the 60s, the transistors
were replaced with integrated circuits (100s of transistors). Today, ICs with millions of
transistors are common. So, we hardware designers became comfortable with creating
designs by mixing and matching circuit elements we didn’t design. There are two ways this
can happen.
As synthesis tools get smarter and FPGA designs get denser (so the number of gates
required to implement a design becomes less important and we can afford to waste gates in
order to produce a design more quickly), higher-level constructs become feasible. One day,
we will implement a 1024-bit adder that runs at 100 MHz by writing a line of code like:
a = b + c;
Instead of handcrafting a look-ahead carry adder, the synthesis tool will infer an
efficient adder based on your design constraints.
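Wrapped in a module, that one-line adder might look like the sketch below. The module name and the WIDTH parameter are assumptions for illustration; whether the result meets a 100 MHz target depends entirely on the synthesis tool, the constraints, and the device.

```verilog
// Sketch of a behavioral adder; the synthesis tool chooses the structure.
// Port names a, b, c follow the text; everything else is illustrative.
module wide_adder (b, c, a);
  parameter WIDTH = 1024;   // operand width in bits

  input  [WIDTH-1:0] b, c;
  output [WIDTH:0]   a;     // one extra bit holds the carry out

  assign a = b + c;
endmodule
```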
Modules will be included in your code from previous designs (the most common reuse
method) or will be purchased or licensed from someone else. A lot of energy in our industry
is focused on selling Intellectual Property (IP) designs to ASIC designers, and vendors
would love to supply IP to the FPGA market, too. Frankly, the heavy-breathers in the ASIC
and Design Automation areas think they will make large amounts of money selling designs
to companies trying to reduce their product’s time-to-market.
If only there were a tried-and-true market model for using IP! Well, there is, and it’s
related to the use of integrated circuits. This model has been used successfully for over 30
years, so it must work well. From a designer’s point of view, specifications, pricing, and
delivery of various IP offerings are evaluated and the right product is selected for the design
at hand. The financial model for the device manufacturer is interesting to consider. The
device is designed at great expense and placed on the market. The up-front cost to produce
this design (which can be millions of dollars) is paid back slowly over time as the devices
are purchased by electronics manufacturers. This strategy can be quite profitable if the
design becomes popular, but it takes deep pockets to play this game. Can IP vendors play
this way?
The IP provider must provide complete data which characterizes performance
including throughput, latency, signal I/O requirements, module size, and power
consumption. This assures that the design is appropriate to the application and allows
comparison to other products. The successful IP offering will be a stand-alone module that
performs specific functions that designers are comfortable with, like FIFOs and other types
of memory-based modules, microcontrollers, filters, compression/decompression functions,
and communication ports (UARTs, Ethernet, USB, etc.). For even wilder speculation about
the type of IP that might be feasible, see the Afterword: A Look into the Future, Millions
and Millions and Millions of Gates.
Before we get too excited about off-the-shelf IP, let’s take a look at the simplest
method of increasing design productivity, the use of built-in library elements.
LIBRARY ELEMENTS
Each FPGA vendor supplies a set of primitive library elements. The Verilog design is
mapped to the hardware using primitives similar to these. The primitives are implemented in
an efficient manner by the underlying hardware. They get more capable every year as the
FPGA vendor adds elements to, and increases the utility of, the libraries. The FPGA vendor
has a vested interest in providing design aids and shortcuts that increase the efficiency and
ease of use of their products.
The expert designer keeps in mind various levels of abstraction for a design, including
the types of library elements that will be used to implement the design. The following is an
example of what sort of elements we might see in a vendor’s primitive library:
There will be various flavors of these primitives. For example, the following versions of the
AND gate might be available:
The Verilog compiler also uses a set of primitives. We see much similarity with the FPGA
vendor primitive library.
1. FALSE
2. TRUE
3. INV
4. BUF
5. AND2
6. OR2
7. XOR2
8. NAND2
9. NOR2
10. MUX
11. DFFRS
12. DFFERS
13. LATRS
14. RSLAT
15. TRI
16. PULLUP
17. PULLDN
18. TRSTMEM
19. DON’T_CARE
The following device-specific library list is from Exemplar and is for the Xilinx 4000XL
family. There is a similar library for generic primitives.
29. IFDI
30. IFDXI
31. IFDI_NG
32. IFDXI_NG
33. INLAT Input Latch
34. ILD_1
35. ILDX_1
36. ILD_1_NG
37. ILDX_1_NG
38. ILD
39. ILDX
40. ILDI
41. ILDXI
42. ILDI_1
43. ILDXI_1
44. ILDI_1_NG
45. ILDXI_1_NG
46. INREG Input Register
47. DFF D Flipflop
48. FDPE
49. FDCE
50. FD
51. FD_GP
52. FDP
53. FD_NGP
54. FD_NG
55. FDC
56. FDC_NG
57. FDCE_NG
58. FDE
59. FDE_GP
60. FDE_NGP
61. FDE_NG
62. FDP_NG
63. FDPE_NG
If you follow the digital design newsgroups (see the resources section), you will
periodically see the schematic zealots presenting a case that efficient designs can generally
be implemented only with schematics. This may be true, so, to the Verilog purist, it may
make sense to do a schematic with text by wiring primitives together. Done properly, this
will result in very compact and fast logic. However, it can get unwieldy very fast, so we’ll
want to use this approach only where necessary.
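To give a feel for “a schematic with text,” here is a full adder wired together from gates. This sketch uses Verilog’s built-in gate primitives (and, or, xor) rather than any vendor’s library names, and the module, instance, and net names are made up for the example.

```verilog
// Structural full adder: every gate is placed and wired by hand.
// Uses Verilog built-in primitives, not a vendor primitive library.
module full_adder (a, b, cin, sum, cout);
  input  a, b, cin;
  output sum, cout;
  wire   ab_xor, ab_and, cin_and;

  xor X1 (ab_xor, a, b);            // half sum of a and b
  xor X2 (sum, ab_xor, cin);        // final sum
  and A1 (ab_and, a, b);            // carry generated by a and b
  and A2 (cin_and, ab_xor, cin);    // carry propagated from cin
  or  O1 (cout, ab_and, cin_and);   // carry out
endmodule
```

Five gates for one bit of addition shows both the appeal (nothing is left to the tool) and the problem (a 32-bit datapath written this way gets long quickly).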
One drawback of a schematic design can’t be argued: it’s not very portable. Using IP
in a design requires some portability if the IP is to be offered for sale to the design world. IP
and HDL need each other.
Listing 8-1 is an example of the structural use of a library primitive, Figure 8-1 shows
the corresponding synthesized circuit, and Figure 8-2 shows the Xilinx structural resource
assignment.
endmodule
Figure 8-2, a list of resources used in the global_buffer design, shows that BUF1 was
implemented as a BUFGS (the only type of global buffer available in the Xilinx 4000XL
family).
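For readers without the listing at hand, a structural buffer instantiation generally looks something like the sketch below. This is not the book’s Listing 8-1. The names global_buffer and BUF1 come from the text, but the ports, the flipflop, and the choice of the generic BUFG primitive are assumptions. A stand-in BUFG model is included only so the sketch compiles on its own; in a real flow the vendor library supplies the primitive and the stub would be deleted.

```verilog
// Stand-in model for illustration only; the vendor library normally
// provides the global buffer primitive.
module BUFG (O, I);
  output O;
  input  I;
  buf B1 (O, I);
endmodule

// Hypothetical design: route the clock through a global buffer by
// instantiating the primitive instead of letting the tool infer it.
module global_buffer (clk_in, d, q);
  input  clk_in, d;
  output q;
  reg    q;
  wire   clk_int;   // clock after the global buffer

  BUFG BUF1 (.I(clk_in), .O(clk_int));

  always @(posedge clk_int)
    q <= d;
endmodule
```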
How was the RAM module an example of increasing design speed? We didn’t invent
a RAM module from scratch; we used a design tool, the Xilinx LogiBLOX tool, to create
it. Other modules can be created and parameterized the same way; these
include:
Accumulators
Adders/Subtractors
Clock Dividers
Comparators
Constants
Counters
Data Registers
Decoders
Inputs/Outputs
Memories
Multiplexers
Pads
Shift Registers
Simple Gates
Tristate Buffers
Employing these types of schematic-like elements in a structural design can allow the
HDL designer to use hardware-specific configuration options. For example,
under the Tristate Buffer block definition, there are three options for pull-up resistor: none,
pull-up, and double pull-up. Options like double pull-up are not directly supported by
Verilog but may be required for some design implementations. The use of a structural
design, where HDL is employed in a structural (schematic) fashion, or where Verilog
modules are stitched together with schematics, may be required in some cases. Use
whatever works!
So, one way to increase the number of gates we design in a day is to use a tool that
automates the creation of certain types of modules.
Xilinx (with the MEMEC company) provides a core generator with a wider variety of more
complex modules compared to LogiBLOX. Examples of the functions include:
• FPGA Development Tools (DSP and FPGA development platforms for evaluation
and benchmarking of DSP and FPGA designs).
• Processor peripherals including C2910A Bit Slice Processor core, DRAM controller,
M8237 DMA Controller, M8254 Programmable Timer, M8255 Programmable
Peripheral Interface, M8259 Programmable Interrupt Controller, XF8256
Multifunction Microprocessor Support Controller, and XF8279 Programmable
Keyboard Display Interface.
• Processor products including Intellicore™ Prototyping System, RISC CPU Core
Demo, Scalable Development Platform, TX400 Series RISC CPU cores, and V8
uRISC Microprocessor.
• UARTs, including M16450, M16550A, and XF8250.
• Communication and Networking Cores, including ATM Cell Assembler, ATM Cell
Delineation, ATM CRC10 Generator and Verifier, ATM CRC32 Generator and
Verifier, ATM Utopia Slave (CC-141), Forward Error Correction Reed-Solomon
Using LogiBLOX Module Generator 253
module sincos8 (
    ctrl,
    theta,
    c,
    dout);
    input ctrl;
    input [7:0] theta;
    input c;
    output [7:0] dout;
endmodule

sincos8 YourInstanceName (
    .ctrl(ctrl),
    .theta(theta),
    .c(c),
    .dout(dout));
The CORE Generator also created a Verilog simulation file called sincos8.v. This
file is 5,287 lines of code. Assuming we could write 100 lines of debugged code a day, we
could write this module in a couple of months. It took the CORE Generator about five
seconds. Assuming the module meets the needs of the design, that’s not a bad leverage of
productivity. If it doesn’t work, we don’t have any source code, so the core doesn’t help.
The file we link into our design is the compiled EDIF netlist called sincos8.edf. Out of
curiosity, let’s implement this design in a 4000XL device and see what it looks like. There
are always some tricks to making the tools work; in this case, we need the bus delimiter to
be parentheses B() (use B<> as delimiters for the XNF netlist format) in order for the Xilinx
Design Manager to suck in the EDIF file properly. This is done by unchecking the Verilog
Instantiation Template and Verilog Behavioral Simulation Model boxes (even though we
want these files to be generated) in order to get the Netlist Bus Format we want as shown in
Figure 8-6. In addition, in the Exemplar Leonardo EDIF output tab, we want to deselect the
Allow Writing Busses checkbox, because Xilinx does not process EDIF busses properly.
Note that the Verilog compiler doesn’t “know” anything about the black box (which
is inserted during the downstream mapping process, as illustrated in Figure 8-6). Any
estimate by the synthesis tool regarding speed and design size will not include the black-box
modules.
Without really trying, this design runs at 61 MHz in the slowest 4005XL (-3) device.
Listing 8-4 is a report on the resources used by this design.
Another example is an eight-wide and 16-deep FIFO called fifo8x16. The Verilog
simulation file is 293 lines of code. Listing 8-5 shows the interface file.
module fifo8x16 (
    d,
    we,
    re,
    reset,
    c,
    full,
    empty,
    bufctr_ce,
    bufctr_updn,
    q);
    input [7:0] d;
    input we, re, reset, c;
    output full;
    output empty;
    output bufctr_ce;
    output bufctr_updn;
    output [7:0] q;
endmodule
fifo8x16 YourInstanceName (
    .d(d),
    .we(we),
    .re(re),
    .reset(reset),
    .c(c),
    .full(full),
    .empty(empty),
    .bufctr_ce(bufctr_ce),
    .bufctr_updn(bufctr_updn),
    .q(q));
We can see that using the CORE Generator is an effective way to create complex
modules and increase design efficiency. This is a hit-or-miss process, because, if a module
does not meet the needs of the design after trying all available compiler options, without
source code there is no way to make modifications, so another approach will be required
(like taking the time to design an optimized module from scratch).
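To give an idea of what the from-scratch route involves, here is a behavioral sketch of an eight-wide, 16-deep FIFO. It is not the CORE Generator model and makes no claim to match its timing. The port names d, we, re, reset, c, full, empty, and q are borrowed from the interface file, while the module name, the synchronous active-high reset, the rising-edge clock, and the rule that writes when full and reads when empty are ignored are all assumptions. The bufctr_ce and bufctr_updn outputs are left out because the text does not define them.

```verilog
// Behavioral sketch of an 8-bit-wide, 16-deep FIFO (assumed behavior).
module fifo8x16_sketch (d, we, re, reset, c, full, empty, q);
  input  [7:0] d;
  input        we, re, reset, c;
  output       full, empty;
  output [7:0] q;
  reg    [7:0] q;

  reg [7:0] mem [0:15];
  reg [3:0] wr_ptr, rd_ptr;  // 4-bit pointers wrap naturally at 16
  reg [4:0] count;           // number of entries in use, 0..16

  assign full  = (count == 5'd16);
  assign empty = (count == 5'd0);

  wire do_write = we && !full;   // ignore writes when full
  wire do_read  = re && !empty;  // ignore reads when empty

  always @(posedge c) begin
    if (reset) begin
      wr_ptr <= 4'd0;
      rd_ptr <= 4'd0;
      count  <= 5'd0;
    end else begin
      if (do_write) begin
        mem[wr_ptr] <= d;
        wr_ptr <= wr_ptr + 4'd1;
      end
      if (do_read) begin
        q <= mem[rd_ptr];
        rd_ptr <= rd_ptr + 4'd1;
      end
      // The count changes only when exactly one of read/write occurs
      if (do_write && !do_read)
        count <= count + 5'd1;
      else if (do_read && !do_write)
        count <= count - 5'd1;
    end
  end
endmodule
```

Even this simple version leaves open the questions a real design must answer (read latency, simultaneous read and write at the boundaries, reset style), which is why the generated core is attractive when it fits.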
As you get some designs under your belt, you will find that you’ll reuse some of your
design approaches and even certain modules. If you designed it, then you certainly
understand its features and limitations and can make an almost instinctive judgment whether
to reuse something or write it from scratch. What can you do to help make a design suitable
for later use in other designs?
alone can help you maintain your sanity when crunch time comes. These tools are even
more critical when working with other designers.
and they should generally be parameters so they can be changed at a top level or in an
include file.
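The advice about parameters can be sketched as follows. The timer module, its parameter names, and the values are all hypothetical; the point is only that widths and terminal counts live in parameters with sensible defaults and are overridden where the module is instantiated.

```verilog
// Reusable module: constants are parameters, not buried literals.
module timer (clk, reset, tick);
  parameter WIDTH    = 8;    // counter width in bits
  parameter TERMINAL = 199;  // tick asserts when count reaches this value

  input  clk, reset;
  output tick;
  reg [WIDTH-1:0] count;

  assign tick = (count == TERMINAL);

  always @(posedge clk)
    if (reset || tick)
      count <= 0;
    else
      count <= count + 1;
endmodule

// Top level: override the defaults without editing the timer source.
module top (clk, reset, slow_tick);
  input  clk, reset;
  output slow_tick;

  // 16-bit counter, one tick every 50,000 clocks
  timer #(16, 49999) T1 (.clk(clk), .reset(reset), .tick(slow_tick));
endmodule
```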
• Minimize Ports.
Module partitions should be selected to minimize interconnects between modules,
particularly where clock domains are crossed. Like an orange, designs have natural
boundaries for isolation and cohesion. Use these natural boundaries to partition the design.
Split up complex modules into smaller and simpler parts.
If you have a problem and you’re not sure what to do, or you’re trying to select
between competing options, talk to your peers about it. Get input from the newsgroups,
Field Applications Engineers, or your neighbors—anywhere you can find it. Even if your
cohorts have bad ideas, they may lead you to think about the problem in a different manner
and may inspire a better approach. If you really can’t find help among your coworkers, then
find a better bunch of people to work with.
• Archive Everything.
Keep scripts, make files, design notes, libraries, old versions, and all software used to
compile and implement a design.
BUYING IP DESIGNS
What does an IP block look like? From a user’s perspective, the interface must be defined,
including clock signals [polarity (uses of rising or falling edges or both), maximum and
minimum frequencies, duty cycle, and loading], reset/preset signals (polarity, synchronous
or asynchronous, required duration, and loading), and requirements for other ports.
The biggest issues with purchasing IP for an FPGA (I call this Revenue IP; some call
it Silicon IP) are not technical but are related to negotiating a license. How much should the
up-front payment be? How much for recurring payments (royalties)? What is the cost model
for unexpected usage when product volumes are higher or lower than expected? How can
usage be audited? How can the IP provider protect its investment and still provide enough
data to assure a successful implementation? These questions all must be addressed to make IP
viable for a design.
Will Revenue IP (RIP) ever be a significant part of the FPGA designer’s life? As
hardware designers, we are comfortable using hardware IP in the form of integrated circuits
out of necessity. We are not ASIC designers, so we do not have to use expensive design
tools and we don’t have direct access to the foundries. We could create a functionally
equivalent design, but it would take more board space, take longer to design, and cost more
to implement. Two of these three drawbacks are not present when our design ends up in an
FPGA. We can argue with management that the remaining drawback, the time to do the design,
will be balanced by avoiding up-front costs or royalties. So, RIP is the right acronym for
Revenue IP. We will use all the free (vendor-provided) and cheap (vendor-subsidized) IP
we can get our hands on, but we will resist paying for other types of IP. Still, we need to
take a look at the current IP strategies. For FPGAs, they come in two flavors: firm/hard and
soft.
Hard or Firm IP
In the ASIC business, Hard IP is like a Standard Cell, a core that is predesigned and
characterized for a specific foundry process. This option does not really exist for FPGAs;
the closest we get is with Hard/Firm IP—prerouted and preplaced modules that can be
linked with our other modules. Hard/Firm IP is most like using an integrated circuit in our
design. It is a black box from a user’s point of view. The user can’t change it; it can only be
plunked into a design and used as-is, with all other circuitry and routing forced around and
through it. These modules are provided with a behavioral model that allows the design to be
evaluated and tested. From a vendor’s point of view, this is the safest IP, as it is very
difficult to reverse-engineer or to modify and present as a new design. From a user’s point
of view, hard IP is not very friendly; it is a one-size-fits-all solution and allows no flexibility
other than built-in configuration options. It’s not portable to new processes or technology
without recompiling by the IP vendor.
Soft IP
From a user’s point of view, having source code that can be tweaked, hacked, and
synthesized is much more desirable. But how, then, can the IP vendor be assured of being
paid? Get all the money up front? That’s not very likely. If a design ends up being 47% IP
and 53% hacked by the user, what’s the right compensation? What keeps the designer from
creating timing problems when the design (which can be very complex) is modified? Who
is responsible in this situation?
To protect the IP vendor, Soft IP might be encrypted or obfuscated (comments
removed, informative labels replaced with truncated and useless ones, and the code
compressed to be unreadable) so that it can be synthesized and integrated into other parts of
the design but not easily reverse-engineered.
SUMMING UP
The most common form of design reuse for FPGA designers will be reusing your own
modules. We have covered some ways to design your code that will improve reusability.
The next most common reuse method will be using modules designed by other engineers at
your company; this avoids extra cost and legal issues. The most common tools for
increasing productivity will be vendor-supplied libraries and core generation tools. Third-
party IP will be a small part of the FPGA designer’s reuse strategy.
CHAPTER 9
262 Designing for ASIC Conversion Chapter 9
For ease of conversion and lower up-front costs, there are three options for converting
an FPGA to a custom device: a hard-wired FPGA, an FPGA conversion using laser-programmed
or custom-routed devices, and a full ASIC design. Using Verilog as a design
and simulation tool greatly eases the conversion to an ASIC, because all ASIC
companies use and are comfortable with Verilog.
FPGAs are not very good ASIC prototyping devices, but they get more ASIC-like
every year. Increasingly, designs will remain implemented in FPGAs because of their
increasing densities and future cost reductions. Still, many of our designs will convert to
ASICs. While FPGAs are getting cheaper and denser, ASIC technology improves, too.
• The FPGA vendor has designed-in “training wheels” which improve the
chances of success for a designer using poor design methodology. Particularly,
the clock network has delay designed in to create a zero-hold-time requirement
for flipflops. The FPGA designer concentrates on meeting the setup-time
requirement; the ASIC designer must meet both the setup- and hold-time
requirements.
• The FPGA provides low-skew global networks for clock and reset/preset; these
networks must be created in the ASIC design.
• The experimental FPGA design mindset (Unsure about something? Try it and
see what happens) is dead wrong for designing ASICs. There is a huge cost to
making an error in an ASIC in terms of foundry charges and lead time. This
requires a careful (some might say anal-retentive), cautious, and conservative
design approach with extensive testing.
• It can be difficult to cram logic into an FPGA, then make it run at high speed.
The ASIC will have only the resources demanded by the design (routing and
logic resources in the FPGA are present whether they are used or not); thus it
will be smaller, use less power, and operate at higher speed. Therefore, a lot of
wasted time may be spent optimizing a design to run in an FPGA.
In spite of these caveats, successful FPGA-to-ASIC conversions are done every day.
Using some common-sense design strategies will make the process go smoothly. First, let’s
look at the technologies into which the FPGA might be converted.
Altera HardCopy offers custom hard-wired versions where the Logic Elements are the same
as a regular FPGA (though packed closer together), but the routing is replaced with custom
metal runs. The design change is minimal (the device uses the same placement and signal
routing as the FPGA), and the conversion can take as little as a month or so.
Minimum order quantities can be as low as a few hundred pieces. Packages and pinouts,
including power and ground, can be identical to the original FPGA. Configuration signal
emulation can be used. For example, the configuration CONF_DONE pin might be used to
control a processor reset signal on the circuit board. Though the HardCopy device does not
require configuration, having the configuration pins act the same might be required. The
HardCopy devices are built on the same fab lines as the FPGA, so the process technology
(lithography), the pin drive capability, the pin voltage tolerance, and the logic-block layout are the
same.
Because the HardCopy silicon is so similar to the FPGA, the HardCopy design can be
captured just from the configuration file. Still, the conversion engineers will request source
design information, which can be informative during the conversion process.
One drawback to the HardCopy device is encountered during production testing. The
configurable devices can be programmed with a test pattern and checked out; the HardCopy
devices must have special test support designed-in (added).
Conversion Issues
Conversion to HardWire technology is the least demanding conversion for the FPGA
designer. Xilinx guarantees that the HardWire design will act the same as the FPGA device.
Still, an FPGA can mask race conditions that can create glitches, because signal routing
transistors with capacitive loading act as a low-pass (RC) filter; this effect will be much
reduced in the HardWire device. Race-condition glitches caused by asynchronous signals,
which are “filtered out” in the FPGA design, can be uncovered. Asynchronous signals will
be flagged by Xilinx during the conversion process, but it is up to the designer to
ensure that no hazards exist.
SEMICUSTOM DEVICES
Various technologies exist for arrays where logic is placed on a die, then custom routing is
created with laser programming, where routing segments are removed. Chip Express is a
company that offers fast prototypes with laser-programmed routing (LPGA, or Laser
Personalized Gate Arrays), which can be converted later to devices with one or two metal
routing layers. Clear Logic also offers these types of devices (LPLD, or Laser-Processed
Logic Device). The trade-offs and design considerations are very similar to HardWire
conversion issues.
Vendors like AMI (American Microsystems) and Orbit offer FPGA conversions to their
Gate Array designs. These processes offer short lead times (4 to 6 weeks) and low NRE
charges ($5K to $50K). These companies have a lot of experience with doing FPGA
conversions and can smooth the conversion process considerably.
In an FPGA design, what the designer can do is limited because the FPGA has a predefined
structure in which the design is implemented. An ASIC is more freeform. There are no
training wheels to keep the designer out of trouble. All the features we take for granted, like
programmable buffers, termination resistors, built-in oscillator buffers, and power-on
reset/preset, are not present unless we specify them in the ASIC. The ASIC has some
advantages because the routing is fully customized and only gates that are actually used get
placed. Also, much greater densities are offered, so designs that live in multiple FPGAs can
be combined into one.
The designer must provide information to feed the conversion process. This information
will be present on a checklist provided by the ASIC vendor and will include items like:
Conversion to an ASIC process can be stressful; there are hungry gators swimming in those
waters! Some hazards to watch for include delay networks, race conditions, combinational
feedback, pulse generators, floating internal busses, clock skew, and gated or divided
clocks.
Most vendors offer a “turn-key” conversion process. In this design flow, the ASIC
vendor takes complete responsibility for the conversion and provides all test vectors. This
takes longer and is more expensive than a “joint-design” conversion, where the FPGA
designer provides all or part of the test vectors and takes responsibility for the conversion.
AMI (American Microsystems, Inc.) offers a no-vector conversion; this is the most
painless conversion for the FPGA designer who hates simulation. However, the designer
must obey the following rules, which is nearly impossible:
The first rule is to do a synchronous design. This is not always an easy rule to follow, but
each clock added to a design should be carefully considered. Every clock domain, every
signal that crosses a clock-domain boundary, and every asynchronous signal is a hazard
unless dealt with exhaustively. If the purpose of a design is to convert from one clock
domain to another (like a FIFO does), then, obviously, you have no choice. If you need to
save power, but some of the design needs to run at high speed, then again you have no
choice. My personal preference is to run a design at the lowest possible speed, because this
reduces RFI emissions and makes the design more tolerant of the inefficiencies of a generic
HDL implementation. If a section of the design must be asynchronous, put it in quarantine.
Keep it segregated from the synchronous parts of the design and document it well so that
the design intent and hazards are clear.
It’s not hard to handle asynchronous signals, but it is easy to forget to do this
handling. The result is a design that works, but does not work reliably.
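One common way to handle a single asynchronous signal is to pass it through two flipflops clocked by the destination domain before any logic uses it. The module below is a minimal sketch of that idea; the module and signal names are hypothetical, and it only suits single-bit signals (a multi-bit value needs Gray coding, a FIFO, or a handshake).

```verilog
// Two-flipflop synchronizer sketch.
// Names are hypothetical and not taken from the book's listings.
module sync_2ff (
    input  wire clk,       // destination-domain clock
    input  wire async_in,  // signal from a pin or from another clock domain
    output wire sync_out   // version that is safe to use in the clk domain
);
    reg meta;    // first stage: may go metastable when async_in changes near the clock edge
    reg stable;  // second stage: gives the first stage a full clock period to settle

    always @(posedge clk) begin
        meta   <= async_in;
        stable <= meta;
    end

    assign sync_out = stable;
endmodule
```

The cost is two clocks of latency. Keeping every such crossing in a small, clearly named module also makes the quarantine easy to document.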
SYNCHRONOUS DESIGNS
Synchronous designs do not contain gated clocks or multiplexed clocks. The number of clocks
can be counted on one hand, preferably with four fingers left over.
John McGibbon
Memec Design Services
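A gated clock can often be replaced with a clock enable, which keeps every flipflop on the same clock network. The following is a sketch under that assumption; the module and port names are invented for illustration.

```verilog
// Clock-enable register sketch (hypothetical names).
// The form being avoided is a gated clock such as:
//     wire gated_clk = clk & enable;
//     always @(posedge gated_clk) q <= d;
module enabled_reg (
    input  wire       clk,     // the one system clock
    input  wire       enable,  // synchronous enable, generated in the clk domain
    input  wire [7:0] d,
    output reg  [7:0] q
);
    always @(posedge clk) begin
        if (enable)
            q <= d;    // q holds its value when enable is low
    end
endmodule
```

The enable travels through ordinary logic and is sampled on the clock edge, so a glitch on it cannot create an extra clock pulse.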
ASIC vendors who do FPGA conversions routinely replace RAM and other modules (like
adders and counters) with parameterized modules selected from their library. However, each
substitution contributes a new block to the design, and each change adds to the risk that
something will go wrong. LogiBLOX or cores should be evaluated for ease of conversion or
substitution before being used in the FPGA design. The netlist should be “untouched” as
much as possible during the conversion process. If your design has nothing but NAND
gates in it, it will convert painlessly, because no module substitution will occur.
In a gate array, your RAM and ROM modules will be replaced with registers. This
results in an explosion of the gate count. For a 512-by-8 RAM module, 4,096 flipflops will
be instantiated. The cell decoding logic adds to that number (decoding-logic complexity
doubles for each added address line).
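To see where the flipflops come from, consider what a register-based memory looks like in Verilog. This is only a sketch (module name, ports, and the Verilog-2001 parameter syntax are assumptions); a gate-array vendor's actual substitution will differ.

```verilog
// Register-based RAM sketch (hypothetical module, Verilog-2001 syntax).
// With the default parameters, the array holds 512 x 8 = 4,096 storage bits,
// each of which becomes a flipflop when no RAM macro is available.
module reg_ram #(
    parameter ADDR_BITS = 9,   // 2**9 = 512 words
    parameter WIDTH     = 8    // bits per word
) (
    input  wire                 clk,
    input  wire                 we,
    input  wire [ADDR_BITS-1:0] addr,
    input  wire [WIDTH-1:0]     din,
    output wire [WIDTH-1:0]     dout
);
    reg [WIDTH-1:0] mem [0:(1<<ADDR_BITS)-1];

    // Write port: the address decode selects which word is loaded.
    always @(posedge clk)
        if (we)
            mem[addr] <= din;

    // Asynchronous read: this becomes a 512-to-1 MUX for each data bit.
    assign dout = mem[addr];
endmodule
```

Adding one address line (ADDR_BITS = 10) doubles both the storage and the number of words the decode and read MUX must select among.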
One feature that is particularly troublesome during ASIC conversion is RAM
initialization (or ROM contents, the same thing). The FPGA can write data into RAM cells
during the power-up configuration process. There is no corresponding magic configuration
mode in the ASIC; all RAM cells must be written to via the RAM data bus.
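One way to replace the configuration-time load is a small sequencer that writes the initial contents through the normal write port after reset. The sketch below assumes a synchronous-write RAM and an invented data pattern (each word is loaded with its own address, which requires WIDTH to be at least ADDR_BITS); real contents would come from a case statement or a small ROM. All names are hypothetical.

```verilog
// RAM initialization sequencer sketch (hypothetical names).
// Drives the write port of a synchronous-write RAM after reset.
module ram_init #(
    parameter ADDR_BITS = 4,
    parameter WIDTH     = 8    // assumed >= ADDR_BITS for this data pattern
) (
    input  wire                 clk,
    input  wire                 reset,      // synchronous, active high
    output reg                  init_done,  // high once every word is written
    output reg                  ram_we,
    output reg  [ADDR_BITS-1:0] ram_addr,
    output reg  [WIDTH-1:0]     ram_din
);
    always @(posedge clk) begin
        if (reset) begin
            init_done <= 1'b0;
            ram_we    <= 1'b1;                // start writing at address 0
            ram_addr  <= {ADDR_BITS{1'b0}};
            ram_din   <= {WIDTH{1'b0}};
        end else if (!init_done) begin
            if (ram_addr == {ADDR_BITS{1'b1}}) begin
                // The RAM writes the last word on this same clock edge.
                init_done <= 1'b1;
                ram_we    <= 1'b0;
            end else begin
                ram_addr <= ram_addr + 1'b1;
                ram_din  <= ram_din + 1'b1;   // assumed pattern: data = address
            end
        end
    end
endmodule
```

The rest of the design must wait for init_done before using the RAM, which is a delay the FPGA version never had.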
Power-On Conditions
Part of the training wheels for an FPGA design is the power-up initialization of all I/O pins
via the use of GSR (Global Set/Reset) resources and the device configuration process. An
ASIC will not have these features unless the designer specifically puts them in. The ASIC
vendor tends not to want to use many global networks (like reset and/or preset networks)
because they must be custom routed and they consume routing channels.
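If the design depends on a known power-up state, the safest course is to write the reset into the code instead of relying on GSR. The counter below is a minimal sketch with hypothetical names; whether the reset should be asynchronous (as here) or synchronous is a choice to settle with the ASIC vendor.

```verilog
// Explicit reset sketch (hypothetical names).
// Nothing here depends on a configuration-time or GSR initial value.
module counter_with_reset (
    input  wire       clk,
    input  wire       reset_n,   // active-low reset from an external supervisor or POR cell
    output reg  [3:0] count
);
    always @(posedge clk or negedge reset_n) begin
        if (!reset_n)
            count <= 4'd0;           // defined state while reset is asserted
        else
            count <= count + 4'd1;
    end
endmodule
```

Only registers that truly need a defined start-up value should get the reset; every added load makes the reset network the vendor must route that much larger.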
Internal Busses
Xilinx and other FPGA vendors allow the use of internal tristate busses and buskeepers to
prevent problems due to floating buffer inputs. Some ASIC vendors do not have this
capability or prefer not to use the technology because it complicates testing. Exemplar
Leonardo has a feature where internal tristate busses can be automatically converted to
MUXes for technologies that don’t support internal tristate busses.
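The same conversion can be done by hand. The sketch below shows a two-source internal bus rewritten as a MUX; it illustrates the general idea and is not a claim about what any synthesis tool generates. Names are hypothetical.

```verilog
// Internal tristate bus rewritten as a MUX (hypothetical names).
// Tristate form being replaced:
//     assign bus = sel_a ? data_a : 8'bz;
//     assign bus = sel_b ? data_b : 8'bz;
module bus_mux (
    input  wire       sel_a,
    input  wire       sel_b,
    input  wire [7:0] data_a,
    input  wire [7:0] data_b,
    output reg  [7:0] bus
);
    always @(sel_a or sel_b or data_a or data_b) begin
        if (sel_a)
            bus = data_a;
        else if (sel_b)
            bus = data_b;
        else
            bus = 8'h00;   // a defined value replaces the floating bus
    end
endmodule
```

The MUX form has no floating state and no bus contention, so it needs no buskeeper. The priority between sel_a and sel_b is an assumption the tristate version left undefined.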
Configuration Pins
Often the FPGA configuration pins are used in external logic (for example, using the
configuration CONF_DONE pin in a Power-On Reset logic). Special logic will have to be
designed into the ASIC to provide configuration-pin emulation. The FPGA designer needs
to define which signals are used and how the pins are expected to act. For common FPGA
signals and architectures, the ASIC vendor will have some experience with these signals
and will be able to assist.
The input signal thresholds must be defined by the FPGA designer. Input threshold options
include TTL (where the threshold voltage is about 30% of the supply rail), CMOS (where
the voltage threshold is about 50% of the supply rail), or custom.
The output pin drive requirement must also be defined by the FPGA designer. The use
of low-impedance (high-current) buffers should be minimized to reduce power consumption
and RFI noise generated by the design. The ASIC process probably has more options for
drive capacity than the FPGA. Always use the slowest and lowest-power pin buffer that will
do the job.
OSCILLATORS
Oscillators are analog circuits, but sometimes oscillator buffers are available in FPGA
technology. These are inverting buffers with low gain to help assure that the oscillator stays
in the linear mode. The inverter provides 180 degrees of phase lag, and the RC (cheapest,
sloppiest), ceramic resonator (cheap, but not too sloppy), or crystal (best performance, but
more expensive) network provides the remaining 180 degrees of phase lag to meet the
requirement for oscillation (a closed loop with 360 degrees of phase shift and an overall
gain of one). A typical gate oscillator is shown in Figure 9-2.
These circuits will need to be identified to the ASIC vendor to assure a compatible
conversion. It’s likely that the oscillator will end up being gated or multiplexed (this is
much different than having a gated clock as part of the normal operating mode) so that the
test equipment can drive the clock output with a clock of known frequency and phase. This
circuitry will be added as part of the ASIC design process and probably will not be part of
the FPGA design.
Never strap an oscillator enable pin high or low; put a resistor in so that an external
source can enable or disable the oscillator as shown in Figure 9-3.
For best performance, clock circuits should be isolated from other noise sources by
physical distance or by guard rings, and the wiring should be kept tight to reduce loop areas.
Note: the oscillator inverter is run in the linear mode, and the output should approximate a
sinewave as much as possible to reduce EMI.
DELAY LINES
The FPGA designer sometimes uses a delay line to create time-delayed signals, particularly
when interfacing with external SRAM or DRAM components. This delay is another analog
circuit, so use caution! The delay line might be a string of buffers. This method of creating a
delay is not recommended, because it depends on typical buffer delays which are not
controlled and which change with temperature and process/technology changes.
During ASIC conversion, delay-line buffers will be replaced with buffers with
different propagation delays (usually shorter, because ASIC buffers are typically faster than
FPGA buffers) or will be completely removed because they represent redundant logic from
a digital point of view. These delays must be documented and verified to ensure they get
implemented properly.
An option might be to use an external circuit to create the delay as shown in Figure 9-4.
This circuit might be an RC delay with buffers to minimize the effect of changing the pin
driver and pin loading during ASIC conversion. This delay is not precise and depends on
the propagation delay of the pin drivers, the buffer propagation delays, the buffer threshold
voltages, the tolerances of the RC components, operating temperature, and the ether-flux of
the moon’s gravitational field.
Even better, a delay can be created from a serpentine circuit board trace with about
175 picoseconds of delay per inch as shown in Figure 9-5. Remember to include the buffer
delays, the pad delays, and the circuit-board trace delays. There are many assumptions in this
delay, and your mileage will vary. The reader is urged to read Johnson and Graham’s
High-Speed Digital Design: A Handbook of Black Magic (see bibliography) before
implementing a circuit like this.
Assumptions include the use of FR-4 circuit-board material, 20-mil traces, 0.6 inches
per segment, 50-mil segment pitch, and that you have a valid exemption from Murphy’s
Law.
Even better yet, think about spending some money and using a digital delay circuit
like those available from Dallas Semiconductor and others.
THE LANGUAGE OF TEST
We’re not going to cover test topics in depth, but we can learn a few buzzwords.
• At-speed testing. Testing performed at the actual operating speed of the design.
Most testing is performed at slower clock speeds that are comfortable for the
test equipment. These frequencies might be on the order of 1 to 5 MHz.
• Boundary scan. A test scheme where MUXes and latches are added to the
design to support shifting serial data in and out. This allows test patterns to be
applied and internal logic states to be read out.
• BIST. Built-In Self-Test, where hardware is added to the design to allow it to
test itself.
• Fault grading. A measure of how well the design hardware is tested. It is
the ratio of the number of test vectors to the fault coverage.
• Functional test. Testing a device by applying user-provided test inputs and
checking outputs. These tests are generally not very thorough. These are not
parametric tests for AC performance.
• IDDQ. Tests of power-supply current when all internal nodes are quiet. The only
inputs are terminations to prevent oscillation and to keep gates from going
linear. This is a quick test to reject devices that were manufactured improperly.
• JTAG. Joint Test Action Group, the group that created the IEEE 1149.1
boundary scan register and test access port (TAP) standard.
• Observability. The ability of test equipment to access an internal node. All
output pins are observable.
• Parametric testing. Testing for gate input thresholds and output drive
capability. These are analog tests which verify the ASIC processes.
• Partial scan. A scan test that covers only selected parts of the design.
• Test coverage. A figure of merit for a test suite; it’s the ratio of all detected
faults to the total number of possible faults.
• Stuck-at faults. A failure caused by a node staying in a zero or one state when
it should be driven to a different state.
Boundary Scan
Because SRAM-based FPGA devices can be reprogrammed, the FPGA manufacturer can
load a test configuration and do a thorough production test. Custom devices, like your
ASIC, must have test support designed in. A common method of providing test support is to
insert boundary scan logic (BST), which creates a serial chain that runs near the outside of
the device under test. This chain can include other devices. The serial chain uses four or
five signals: TDI (Test Data Input), TDO (Test Data Output), TCK (Test Clock), TMS (Test
Mode Select), and an optional TRSTn (Test Reset). Inside the ASIC, MUXes are inserted
which allow selected signals to be connected to a long shift register; this allows signals to
be shifted in and out of the device being tested.
BST adds hardware to the ASIC as shown in Figure 9-6; the added hardware increases
the size of the ASIC design by 15 to 25%. It also adds delays to signal paths on the order
of 1-2 nsec for each BST MUX. Insertion of the BST hardware and generation of scan vectors are
automated processes. Note that the device signal always flows through a MUX. This
architecture allows the serial chain to read device signals, or to shift (pass-through) other
test signals in the chain, or to pass test signals into the signal chain.
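The behavior of a single scan cell can be sketched in Verilog. The module below is an illustration only, assuming the control-signal names commonly used in descriptions of IEEE 1149.1 cells (shift_dr, clock_dr, update_dr, mode); in practice the cells and the TAP controller that drives them are inserted by the vendor's tools.

```verilog
// Boundary scan cell sketch (hypothetical module; signal names are assumptions).
module bscan_cell (
    input  wire data_in,    // normal signal from the pin or from the core
    input  wire scan_in,    // from the previous cell in the chain (or TDI)
    input  wire shift_dr,   // 1: shift the chain, 0: capture data_in
    input  wire clock_dr,   // clocks the capture/shift flipflop
    input  wire update_dr,  // clocks the update flipflop
    input  wire mode,       // 0: normal operation, 1: drive the test value
    output wire data_out,   // to the core or to the pin
    output wire scan_out    // to the next cell in the chain (or TDO)
);
    reg capture_ff;   // one bit of the long shift register
    reg update_ff;    // holds the test value steady while the chain shifts

    // Input MUX: either read the device signal or pass the chain through.
    always @(posedge clock_dr)
        capture_ff <= shift_dr ? scan_in : data_in;

    always @(posedge update_dr)
        update_ff <= capture_ff;

    assign scan_out = capture_ff;

    // Output MUX: the device signal always flows through this MUX,
    // which is the source of the added path delay.
    assign data_out = mode ? update_ff : data_in;
endmodule
```

With mode low the cell is transparent apart from the MUX delay; with mode high the value shifted in through the chain replaces the device signal.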
There are other test methods. A complete discussion of them is beyond the scope of
this book, but we can at least list them and say a few words about them. Tests can be
divided into production tests (where the design is validated and process problems are tested
for), design conversion tests (ensuring that the design was converted properly; this is
done mostly with designer-supplied functional test vectors), and static timing tests (to assure
that the ASIC’s different gate delays and clock skews don’t cause problems).
IDDQ Test. This is a quick test for production problems; if the current drain of the
device is much higher than expected, then a manufacturing defect has probably occurred
and the device can be quickly rejected.
Functional Test. This type of test uses test vectors provided by the designer which
emulate typical operating modes and look for predicted outputs. This type of test is
generally not very thorough, because the designer doesn’t think of all the various
combinations of input modes and logic sequences.
ATPG (Automatic Test Pattern Generation). These automatically generated test vectors
can include serial vectors (the ones that are clocked into the BST scan chain, if present)
and parallel vectors (the ones presented in parallel to the device inputs).
Print-on-Change Test Vectors
The ASIC vendor will request print-on-change (POC) test vectors; this is an
ASCII-formatted list of input sequences and expected output test patterns. Fortunately,
it’s not difficult to extract these vectors from the Verilog test fixture using the $display
and $monitor system tasks.
Listing 9-1
TIME  IN1  IN2  OUT
   0    0    0    0
  50    0    1    0
  53    0    1    1
 100    0    0    1
 103    0    0    0
 150    1    0    0
 153    1    0    1
From Listing 9-1, you can see that the delay through the gate is 3 nsec (the output
changes in the period between 50 and 53 nsec).
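A test fixture along the following lines would print vectors of this form. It is a sketch, not the fixture that produced Listing 9-1: the gate is assumed to be a two-input OR (the vectors shown would fit an XOR equally well), it is modeled behaviorally with a 3-nsec delay, and all names are hypothetical.

```verilog
`timescale 1ns / 1ns

// Print-on-change test fixture sketch (hypothetical names).
module poc_fixture;
    reg in1, in2;
    reg out;

    // Behavioral stand-in for the gate under test:
    // assumed to be a 2-input OR with a 3-nsec propagation delay.
    always @(in1 or in2)
        out <= #3 in1 | in2;

    initial begin
        $display("TIME IN1 IN2 OUT");
        // $monitor prints one line whenever any listed signal changes,
        // which is exactly the print-on-change format.
        $monitor("%4d  %b   %b   %b", $time, in1, in2, out);
        in1 = 0; in2 = 0; out = 0;
        #50  in2 = 1;
        #50  in2 = 0;
        #50  in1 = 1;
        #10  $finish;
    end
endmodule
```

Because $monitor fires only on a change, each output transition appears on its own line a few nanoseconds after the input change that caused it, and that is how the gate delay shows up in the listing.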
Afterword: A Look into the Future, Millions
and Millions and Millions of Gates
background debugger is included. A free Verilog design system (with Xilinx object-oriented
extensions) is available.
A wide variety of IP cores are available for inexpensive licensing, including:
Xilinx is an equal-opportunity employer, and all Xilinx devices are Y3K compliant.
Resources
For updates and errata for Real World FPGA Design with Verilog, surf over to
www.bytechservices.com
The World Wide Web is an excellent tool for research. The Usenet newsgroups are an
excellent source of unfiltered information, opinions, and gossip.
Usenet Newsgroups
comp.lang.verilog
comp.lang.vhdl
comp.arch.fpga
comp.cad.synthesis
Verilog FAQ
http://www.faqs.org/faqs/verilog-faq/
Software Suppliers
www.bluepc.com
www.cadence.com
www.exemplar.com
www.model.com
www.simucad.com
www.synopsys.com
www.synplicity.com
www.veritools-web.com
Glossary
bidirectional A port that acts as both input and output (inout). This port
will have output drivers connected to an input port. It is up to the
designer to assure that only one output driver is active at a given time.
configuration The process of loading the FPGA with the user’s design
file(s).
footprint The arrangement and style of the physical pins and package
of a device.
hold time The period of time after a clock edge that an input signal must
be stable to assure the flipflop or latch output follows the input
correctly.
IP Intellectual Property.
LAB Logic Array Block. This is Altera’s basic logic block in their
CPLDs.
management The person who provides guidance and direction for a team.
The manager of a team sets the limit for team achievement.
pad A net that connects the FPGA logic to the outside world.
blocks like flipflops, latches, MUXs, etc., all connected together with
the FPGA routing resources.
sensitivity list Also called an event list or event sensitivity list. This is an
index of signals used in a block. This list drives the simulator: the
simulator can evaluate signals that change and determine if the signal is
used in a block. If the signal is not used, the block does not have to be
processed.
setup time The period of time before a clock edge that an input signal
must be stable to assure the flipflop or latch output follows the input
correctly.
slack time The extra time available to allow logic to resolve before a
timing violation occurs. Positive slack time is good, negative slack time
is bad.
threshold The voltage level where a signal is resolved into a zero or one
value. For TTL, this voltage is approximately 1.4V; for CMOS, the
voltage is approximately ½ the supply voltage.
timescale The basic unit of time used during simulation. The default
time unit in Verilog is nsec.
X An unknown value.
XOR Exclusive OR, the output is true only when the inputs are
different.
T H E A U T H O R