Real World FPGA Design with Verilog

Ken Coffman
President, Bytech Services

Prentice Hall PTR

Upper Saddle River, NJ 07458
www.phptr.com
Library of Congress Cataloging-in-Publication Data
Coffman, Ken.
Real world FPGA design with Verilog / Ken Coffman.
p. cm.
Includes bibliographical references.
ISBN 0-13-099851-6
1. Field programmable gate arrays--Computer-aided design. 2. Verilog (Computer
hardware description language) I. Title.

TK7895.G36 C64 1999

621.39’5--dc21 99-046369

Editorial/Production Supervision: Joan L. McNamara

Acquisitions Editor: Bernard Goodwin
Marketing Manager: Lisa Konzelmann
Editorial Assistant: Diane Spina
Cover Design Director: Jerry Votta
Cover Designer: Talar Agasyan
Cover Illustration: Alamini Design
Manufacturing Manager: Alexis R. Heydt

© 2000 by Prentice Hall PTR

Prentice-Hall, Inc.
Upper Saddle River, New Jersey 07458

Prentice Hall books are widely used by corporations and government agencies for training, marketing, and resale.
The publisher offers discounts on this book when ordered in bulk quantities. For more information, contact: Corporate
Sales Department, Prentice Hall PTR, One Lake Street, Upper Saddle River, NJ 07458 Phone: 800-382-3419;
Fax: 201-236-7141; email: corpsales@prenhall.com

Trademarks: Verilog is a trademark of Cadence Design Systems, Inc. OrCAD is a registered trademark of OrCAD
Systems Corporation. Silos III is a trademark of Simucad Inc. Altera is a trademark and service mark of the Altera
Corporation in the United States and other countries. MAX, FLEX, FLEX 10K, FLEX 8000, AHDL, MegaCore,
and Altera device part numbers are trademarks and/or service marks of Altera Corporation in the United States and
other countries. Xilinx is a registered trademark of Xilinx, Inc. Hardwire, LogiBLOX, VersaBlock, VersaRing are
trademarks of Xilinx, Inc. LeonardoSpectrum, LeonardoInsight, HDLInventor, FlowTabs, and Power Tabs are
trademarks of Exemplar Logic. All other product names mentioned herein are the trademarks of their respective
owners.

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in
writing from the publisher.

Materials based on or adapted from materials and text owned by Xilinx, Inc., courtesy of Xilinx, Inc. © Xilinx, Inc.
1995–1999. All rights reserved.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

ISBN: 0-13-099851-6

Prentice-Hall International (UK) Limited, London

Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
Dedication

This is my first book to reach publication and I dedicate it to my wife, Judy Coffman. In May of 1972 we were married. Since then, she
has (mostly) patiently watched me play in rock bands (see www.owlband.com), get my
Bachelor’s degree at night school, go to a hundred concerts with my friends, do countless
moonlighting projects, write some novels (some with my partner Mark Bothum), promote
rock shows (some with my partner Craig Ranta), and now write this technical book. During
these years, of course, I was working a day job to pay the bills. All this while the weeds
were growing in the yard and the honeydew list was gathering dust.
Thanks for hanging in there, babe. I’ll finish the landscaping as soon as I finish my
next book, I promise!

Prentice Hall Modern Semiconductor Design Series
James R. Armstrong and F. Gail Gray
VHDL Design Representation and Synthesis
Jayaram Bhasker
A VHDL Primer, Third Edition
Mark D. Birnbaum
Essential Electronic Design Automation (EDA)
Eric Bogatin
Signal Integrity: Simplified
Douglas Brooks
Signal Integrity Issues and Printed Circuit Board Design
Alfred Crouch
Design-for-Test for Digital IC’s and Embedded Core Systems
Tom Granberg
Handbook of Digital Techniques for High-Speed Design
Howard Johnson and Martin Graham
High-Speed Digital Design: A Handbook of Black Magic
Howard Johnson and Martin Graham
High-Speed Signal Propagation: Advanced Black Magic
William K. Lam
Hardware Design Verification: Simulation and Formal Method-Based
Approaches
Farzad Nekoogar and Faranak Nekoogar
From ASICs to SOCs: A Practical Approach
Samir Palnitkar
Design Verification with e
David Pellerin and Scott Thibault
Practical FPGA Programming in C
Christopher T. Robertson
Printed Circuit Board Designer’s Reference: Basics
Chris Rowen
Engineering the Complex SOC
Wayne Wolf
FPGA-Based System Design
Wayne Wolf
Modern VLSI Design: System-on-Chip Design, Third Edition
Brian Young
Digital Signal Integrity: Modeling and Simulation with Interconnects and
Packages
Contents

Foreword
Notes on the Current State of the Art

Preface
Digital Design in the Real World

Acknowledgments and Notes on the Second Printing

Chapter 1 Verilog Design in the Real World

Trivial Overheat Detector Example
Synthesizable Verilog Elements
Verilog Hierarchy
Built-In Logic Primitives
Latches and Flipflops
Blocking and Nonblocking Assignments
Miscellaneous Verilog Syntax Items

Chapter 2 Digital Design Strategies and Techniques

Design Processing Steps
Analog Building Blocks for Digital Primitives
Using a LUT to Implement Logic Functions
Discussion of Design Processing Steps
Synchronous Logic Rules
Clocking Strategies
Logic Minimization
What Does the Synthesizer Do?
Area/Delay Optimization

Chapter 3 A Digital Circuit Toolbox

Verilog Hierarchy Revisited
Tristate Signals and Busses
Bidirectional Busses
Priority Encoders
Area/Speed Optimization in Synthesis
Trade-off Between Operating Speed and Latency
Delays in FPGA Logic Elements
State Machines
Adders
Subtractors
Multipliers

Chapter 4 More Digital Circuits: Counters, RAMs, and FIFOs

Ripple Counters
Johnson Counters
Linear Feedback Shift Registers
Cyclic Redundancy Checksums
ROM
RAM
FIFO Notes

Chapter 5 Verilog Test Fixtures

Compiler Directives
Automated Testing

Chapter 6 Real World Design: Tools, Techniques, and Trade-offs

Compiling with LeonardoSpectrum
Complete Design Flow, 8-Bit Equality Comparator
8-Bit Equality Comparator with Hierarchy
Optimization Options In the Xilinx Environment
Mapping Options
Logic Level Timing Report/Post Layout Timing Report
VHDL/Verilog Simulation Options
Other Design Manager Tools

Chapter 7 A Look at Competing Architectures

Factors that Determine Integrated Circuit Pricing
FPGA Device Design
FPGA Technology Selection Checklist
Xilinx FPGA Architectures
Altera CPLD Architectures

Chapter 8 Libraries, Reusable Modules, and IP

Keys to Increased Productivity
Library Elements
Structural Coding Style
A Small Diversion to Compare a Schematic to a Verilog Design
Using LogiBLOX Module Generator
Design Reuse, Reusing Your Own Code
Buying IP Designs
Summing Up

Chapter 9 Designing for ASIC Conversion

HardCopy Devices
Semicustom Devices
Design Rules for ASIC Conversion
Synchronous Design Rules
Oscillators
Delay Lines
The Language of Test
Print-on-Change Test Vectors

Afterword—A Look into the Future

Resources

Glossary and Acronyms

Bibliography

Index

The Author

Foreword—Notes on the Current State of the Art

When I graduated from the Université du Québec à Chicoutimi with my Engineering degree, then later from the
University of Waterloo with my Master’s degree, I thought I was ready to take on the
world’s toughest design challenges. Little did I know that the Real World of design had little
to do with the ideal laboratory conditions where we bread-boarded our academic designs.
For example, once in the Real World, I found it had little use for the ripple-through FIFO
with asynchronous control logic I’d spent hours trying to understand. The ripple binary
counters, implemented by using the Qbar from the previous bit as the clock input for the
next bit, were nowhere to be seen. I think I had heard of metastability, but I was not taught
where to pay attention to it, nor how to minimize the problems it causes. I should have
learned how to properly implement an edge detector. I thought I knew what a glitch was, but
I did not understand when glitches are a problem nor how to eliminate them. I naively
believed that designs were implemented using perfect manufacturing processes. As a result,
my designs were never functionally correct the first time!

The Real World of design was about to undergo a transformation for which my formal
education left me ill-prepared: the apparition of logic synthesis. Minimizing logic using
Karnaugh maps was being relegated to the electronic equivalent of the Stone Age. Selecting
JK or T flipflops to minimize decode logic was becoming just as relevant. The little green
plastic template I used to draw schematics in countless lab reports and final exams was
going to join the manual typewriter in the obsolescence paradise.

The skills that turned out to be the most useful I had not learned as part of my
engineering curriculum: typing (which my mother forced me to learn throughout high
school on our IBM Selectric typewriter) and computer programming (where I was self-
taught and still had more to learn). What my engineering background gave me was the
ability to learn new tricks and discern work patterns that could be rendered repeatable, then
later automated.

This book is all about what I learned through the hardware-design school of hard
knocks. Many mistakes could have been avoided, and many hours of mentoring eliminated,
if I had had such a textbook and heeded its advice. This book is not just about the Verilog
language, and that is its greatest contribution. There are already numerous books about the
details of the language. This book is about hardware design in the Real World, where
Verilog is simply the implementation tool. I hope that the next edition will feature VHDL as
well as Verilog: both are equally capable of (I would even say equally poor at)
implementing designs that will meet Real World constraints. This book is also unique in
describing in detail the entire FPGA design process: from HDL coding to verification to
synthesis to device selection to fitting and place-and-route. Too many books satisfy
themselves in showing only the HDL coding aspect.

The most important advice that this book gives is to understand what needs to be done
before you start coding. The biggest sin this book commits is in understating the verification
task: expect to spend 70% of your design time writing test fixtures and debugging the
function of your design before implementing it. Both of these points underscore the
importance of planning as well as investing as much effort as possible as early as possible in
the design process. In hardware design, progress is not measured by how far along in the
design process you are. Progress is measured by how close you are to producing working
hardware.

Today’s buzz is about IP and design reuse. This book can be considered to be about
design reuse: it is about excellent and safe design practices, not only for FPGAs but for
ASICs, too. Even though I have never worked with the author, I would feel confident in
reusing his designs in my own. They would be trustworthy. Design reuse is about creating
designs that are trustworthy in the Real World. This book should be mandatory reading for
every novice FPGA (and ASIC) designer.

Janick Bergeron
www.janick.bergeron.com
janick@bergeron.com
Preface: Digital Design in the Real World

The world of digital design is changing quickly. At a breathtaking rate, devices are becoming faster, smaller, and denser.
Fifteen years ago the mainstream digital designer was manipulating a few thousand gates
using schematics with an occasional ABEL-HDL module tossed into the mix. Now we have
programmable devices with millions of real ASIC gates in tiny packages. On the horizon,
we see devices with many more millions of gates. It is not practical for the mainstream
designer to create systems on chips with schematics (how would you like to deal with a
1,000-page schematic?), so Hardware Description Languages like VHDL and Verilog have
come into their own. In spite of strong opinions on both sides of the fence (including my
own), the current designscape is bilingual—multilingual if you include the work of those
translating C code into hardware and the work of others on more advanced and hybrid
languages.

My own opinion of the fundamental reason for Verilog’s staying power is that Verilog had a very
large head start in [the] number of engineers who knew Verilog before VHDL really got out of
the blocks, and Verilog is easier to learn than VHDL. Thus, the established designers already
knew Verilog and had no reason to learn VHDL, and the new designers could pick it up easier
than they could pick up VHDL.
John Sanguinetti
C2 Design Automation

I always thought that VHDL was the bloated/bureaucratic/design-by-committee deal, and
Verilog was the KISS/lean-and-mean/hippy/West Coast approach, and that the usual
rules-of-engagement required us to perpetuate and widen the rift between them -
Jonathan Bromley
School of Engineering
Oxford Brookes University

SURVIVAL SKILLS

Regardless of personal opinions, the practical designer will make sure that both VHDL and
Verilog skills are present on his or her resume. The current half-life of engineering
information is about four years and gets shorter every day. This means that half of what you
know today will be obsolete in four years. In order to survive, we weary designers have to
do two things:

1. Master the parts of our skill that are timeless. This includes physics (the analog
aspects of digital design, transmission-line theory, conservation of energy,
antenna theory, and power management) and design concepts like
synchronization, metastability, and propagation delay.

2. Keep up with the changing technology. Take advantage of free seminars, try to
read some of the tidal wave of trade magazines that pile up every month, buy as
many books as your Significant Other will tolerate, and pay close attention when
smart people are speaking.

80% of all embedded systems are delivered late.

Jack Ganssle
The Ganssle Group

The world of digital design is deeply divided. The elite 5%, the ASIC designers, use
hardware and software tools that cost hundreds of thousands to millions of dollars a year to
maintain. They earn their living creating specialized high-volume designs. If the FPGA
designer uses 50K gates, the ASIC designer uses 500K gates. If the FPGA designer is
accustomed to four nanoseconds of delay through a primitive, the ASIC designer is
accustomed to delays of less than a nanosecond. The ASIC designer is very careful,
methodical, and does extensive planning. Errors can cost hundreds of thousands of dollars in
silicon turns and schedule delays. The ASIC designer simulates, simulates, and then
simulates some more.

By contrast, we FPGA designers are sloppy and impatient. There is little or no cost to
experiment, so we program a part and try it. We use tools that are cheap or free on
Windows-based PCs or even embed the test equipment in FPGA logic. By comparison to
ASIC designers, we are a brutish and undisciplined mob, an unruly 95%. I have written this
book for those who would like to join me in this mob.

There’s also the human element—stress—to the reprogrammability equation. ASICs aren’t
reprogrammable; the foundry casts their functionality into silicon. Making the final decision to
commit a design to an ASIC can be extremely stressful for the entire design team. Once it
makes the final decision, the team can’t go back without incurring lots more NRE and lots more
time. Erring at this stage, thus, is definitely a Career-Limiting Move (CLM). FPGAs, on the other
hand, offer engineers a greater comfort zone midway through the project, giving them the ability
to go back and revise a design without paying the NRE and time penalties. Reprogrammability
alone may well be responsible for much of the success of the FPGA marketplace in the last
decade.
Rockland K. Awalt
“Making the ASIC/FPGA Decision”
Integrated System Design, July 1999
Reprinted by permission

This is an FPGA synthesis book. It will not make the reader into an ASIC designer,
though it does address issues associated with converting an FPGA design to an ASIC. This
book is for the newbie FPGA designer who wants a quick and dirty guide to creating FPGA
designs that actually stand a chance of surviving in the Real World.
I worked hard on this book, but it is not perfect. If you find an error or want to argue
about some of the points that are arguable (of which there are many), I look forward to
hearing from you.

Ken Coffman
Mount Vernon, Washington
kcoffman@sos.net
Acknowledgments

I have had the honor of working with many people who were kind enough to shine some of their brilliance in my direction. People
who reviewed the manuscript and made many outstanding suggestions (including some I
actually took) include Janick Bergeron, Dr. Sajjan G. Shiva, and David Graf. David Pellerin
introduced me to our patient editor Bernard Goodwin; without Dave’s timely guidance and
inspiration, this book would not exist. Joan McNamara patiently guided me through the
production editing process and Bob Lentz did a great job copy-editing my fractured prose.
For the Second Printing, Guy Corp of GrafixCORP (www.grafixcorp.com) did an
outstanding job with prepress service and PDF conversion/formatting. A special tip of the
hat to Jake McFarland who provided much-needed help with Microsoft Word formatting
and the PDF conversion, a very painful process.
Influences in my colorful but not-all-that-illustrious career include Craig Ranta, Rick
Penn, G. Scott Bright, Jerrold Gray, Bruce Dippie, Paul Swanson, Dock Brown, Jim
Neumiller, Larry Liu, Gary Croft, Mike Kahn, Hal Bridges, Jeff Sanders, Paul Maltseff,
Tom Dickens, Michael Irvine, Steve Swedenburg, Tom Dillon, Donn Gabrielson, Geoff
Jones, Ken Lomax, Sassan Khajavi, Will Cummings, Lee Pratt and Ed Millett.
The usual suspects on the Usenet newsgroups (see the Resources section) contributed
to my thinking and advancing the state of FPGA synthesis. These folks include Paul
Menchini, Peter Alfke, Ray Andraka, Edward Arthur, Rajesh Bawankule, Stuart Sutherland,
Cliff Cummings, Tom Coonan, Ben Cohen, Steven Knapp, Austin Franklin, Utku Ozcan,
and John Cooley. Thanks: Lisa Vartanian, author of metric.c.

Many additional thanks to the folks who provided software and support: Dave Pfost,
John Bennett, Patrick Kane, Jeff Sanders, Don Matson of Xilinx, Tom Feist of Exemplar
Logic, Richard Jones of Simucad, and Dennis Reynolds and Dave Kresta of Model
Technology.

A special nod in the direction of William M. McDonald and Robert Craig (Coolbob)
Slater, RIP brothers.

Notes on the Second Printing

It is an honor to have sold out the first printing of my book and have the privilege of creating a second printing. My readers have been kind
enough to draw to my attention a number of errors which it is a great relief to finally be able
to fix. I wish to thank the sharp-eyed readers who contributed to this printing, who include
Rogelio Azuela Matuck, Glade Bacon, Richard Coronado, Joseph Curren, David Graf,
David Hawkins, Jan Hofland, Robert Jarnot, David Murray, Matt Noel, Scott Nolan,
Tanapong Nopavong, Ted Obuchowicz, Grady Sharpe, Jaime Villela, Mark Wang, Ed
Wysocki, Robert Xue and Clive Bolton. Most people who commented have been kind and
gracious. Comments, which I greatly appreciate, include “Your book is the best from the
practical point of view”, “I found the book a refreshing read”, “Your book is good and easy
to read”, “It was a breath of fresh air to read such a lively introduction”, “I enjoyed [your
book] a great deal”, “an excellent book”, and “[I] really enjoyed [the book’s] style and
contents.” If you folks would repeat these nice comments at amazon.com, then I’d truly be
in author’s heaven.
There is one person to whom I want to draw special attention. Writing a book is a
scary proposition. In spite of my years of experience and rather unbounded arrogance, I
always expected there would be a critic or two who would call an irate (and well-deserved)
foul on my errors and ignorance. On February 21, 2000 I received an email with this
subject: Verilog (A critique of your book…). After a quick look, I spooled it to the printer. I
watched the print queue and saw that the email printed out in 24 pages. I’m in for a real
flogging now, I thought. However, this email was from a bright and gentlemanly fellow
named Stephen Wasson formerly of MorphICs and HighGate Design. Stephen gave me one
of the nicest back-handed compliments I’ve ever received when he said: “Correcting the
errors in your book was, for me, a great introduction to Verilog”. For those who have found
the 1st printing errata at my website (www.bytechservices.com), you will see the long list of
errors that Stephen uncovered. I will be eternally grateful that Stephen was polite and
gracious in helping me debug my book. For a flavor of his commentary, here is the first
paragraph of his email: “Firstly, I’d like to thank you for the marvelously entertaining and
highly readable book. As a Verilog newbie, I found it an excellent introduction; as a 28-year
design veteran, I found it highly pragmatic; and as the author of a dozen articles, I’m
envious that your editors were such good sports to let you get away with such colorful
language.” I hereby elect Stephen to the Real World FPGA Design with Verilog Hall of
Fame.
Some folks have asked me if I can recommend a VHDL book similar to mine. The best
practical VHDL book I know is Essential VHDL RTL Synthesis Done Right by Sundar
Rajan. This book is not that easy to find; try http://www.vahana.com/vhdl.htm or email me
and I’ll see if I can find you a copy. I also highly recommend Writing Testbenches by Janick
Bergeron to add validation expertise to your skill-set. For this book, surf over to
www.janick.bergeron.com.
I want to mention my long-suffering Office Manager (and wife) Judy who maintains
the website (corrections to errors in this printing will be found in an errata section at
www.bytechservices.com) and my infinitely-patient editor Bernard Goodwin who is still
waiting for me to finish my 2nd book. To all you folks: muchos gracious, now let’s all get
back to work.

Ken Coffman
kcoffman@sos.net
Chapter 1

Verilog Design in the Real World

The challenges facing digital design engineers in the Real World have changed dramatically as technology has advanced.
Designs are faster, use larger numbers of gates, and are physically smaller. Packages have
many fine-pitch pins/pads. However, the underlying design concerns have not changed, nor
will they change in the future. The designer must create designs that:

• are understandable to others who will work on the design later. We are moving
toward global 24/7 design activities.
• are logically correct. The design must actually implement the specified logic
correctly. The designer collects user specifications, device parameters, and
design entry rules, then creates a design that meets the needs of the end user.
• perform under worst-case conditions of temperature and process variation. As
devices age and are exposed to changes in temperature, the performance of the
circuit elements changes. Temperature changes can be self-generated or caused
by external heat sources. No two devices are exactly equivalent, particularly
devices that were manufactured at different times, perhaps at different foundries
and perhaps with different design rules. Variations in the device timing
specifications, including clock skew, register setup and hold times, propagation
delay times, and output rise/fall times must be accounted for.
• are reliable. The end design cannot exceed the package power dissipation limits.
Each device has an operational temperature range. For example, a device rated
for commercial operation has a temperature rating of 0 to 70 degrees C (32 to
158 degrees F). The device temperature includes the ambient temperature (the
temperature of the air surrounding the product when it is in use), temperature
increases due to heat-generating sources inside the product, and heat generated
by the devices of the design itself. Internally generated temperature rises are
proportional to the number of gates and the speed at which they are changing
states.
• do not generate more EMI/RFI than necessary to accomplish the job and meet
EMI/RFI specifications.
• are testable and can be proven to meet the specifications.
• do not exceed the power consumption goals (for example, in a battery-operated
circuit).
These requirements exist regardless of the final form of the design and regardless of
the engineering tools used to create and test the design.
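
The reliability requirement in the list above can be roughed out with two standard first-order relations. This is a back-of-the-envelope sketch, not a substitute for the device data sheet. The symbols and the numbers in the example are assumptions introduced here for illustration; none of them come from the text.

```latex
\begin{aligned}
T_J &= T_A + \theta_{JA}\,P \\
P_{\text{dyn}} &\approx \alpha\,C\,V^{2}\,f \\
\text{Example (assumed values): } T_J &= 50 + 30 \times 0.5 = 65\ ^{\circ}\mathrm{C}
\end{aligned}
```

Here T_J is the die temperature, T_A is the air temperature around the device inside the product, theta_JA is the package thermal resistance in degrees C per watt, and P is the power the device dissipates. In the second relation, alpha is the fraction of nodes that toggle each clock, C is the total switched capacitance (which grows with the number of gates), V is the supply voltage, and f is the clock frequency. The example assumes 50 degrees C ambient, 30 degrees C per watt, and 0.5 W. Check whether the data sheet rates ambient, case, or junction temperature before comparing the result against the limit.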

SYNTHESIS
The translation of a high-level design description to target hardware. For the purposes of this
book, synthesis represents all the processes that convert Verilog code into a netlist that can be
implemented in hardware.

The job of the digital designer includes writing HDL code intended for synthesis. This
HDL code will be implemented in the target hardware and defines the operation of the
shippable product. The designer also writes code intended to stimulate and test the output of
the design. The designer writes code in a language that is easy for humans to understand.
This code must be translated by a compiler into a form appropriate for the final hardware
implementation.
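
As a rough illustration of that translation, the sketch below shows a small function written the way a designer would write it, followed by one hand-written gate-level equivalent built from Verilog's built-in primitives. The module and signal names are invented for this example. The second module only suggests what a netlist contains: a real FPGA flow would map the logic into the device's own elements (lookup tables, for instance) and emit a tool-specific netlist format.

```verilog
// What the designer writes: output is high when at least two of
// the three inputs are high. (Hypothetical example module.)
module majority_rtl (a, b, c, y);
input  a, b, c;
output y;

assign y = (a & b) | (a & c) | (b & c);

endmodule

// One possible gate-level equivalent, written by hand to suggest
// the kind of structure a synthesizer produces.
module majority_gates (a, b, c, y);
input  a, b, c;
output y;
wire   ab, ac, bc;

and u1 (ab, a, b);
and u2 (ac, a, c);
and u3 (bc, b, c);
or  u4 (y, ab, ac, bc);

endmodule
```

Both modules describe the same logic function. The first is easier for a human to read and change, which is the reason for writing at that level and letting the tools do the conversion.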

WHY HDL?
There are other methods of creating a digital design, for example: using a schematic. A
schematic has some advantages: it’s easy to create a design more tailored to the FPGA, and a
more compact and faster design can be created. However, a schematic is not portable and
schematics become unmanageable when a design contains more than 10 or 20 sheets. For
large and portable designs, HDL is the best method.
The real limitation of schematic-based design is the lack of an industry standard. This is
tragic because there are different types of people: some (like me) think in terms of text and
some are more graphically oriented. The EDA industry has done a poor job of serving logic
designers who prefer to work with schematics.

As a contrast between a Verilog design found in other books and a Real World design,
consider the code fragments in Listings 1-1 and 1-2.

Listing 1-1 Non Real World Example

// Transfer the content of register b to register a.

a <= b;

Listing 1-2 Real World Example

/* Signal b must transfer to signal a in less than 7.3 nsec in a -3
speed grade device as part of a much larger design that must
draw less than 80 uA while in standby and 800 uA while operating.
The whole design must cost less than $1.47, pass CE testing, and
take less than two months to be written, debugged, integrated,
documented, and shipped to the customer. Signal a must be
synchronized to the 75 MHz system clock and reset by the global
system reset. The signal b input should be located at or near pin
79 on the 208-pin package in order to help meet the setup and hold
requirement of register a.*/

a <= b;
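
Two of the requirements in the Listing 1-2 comment can be stated in the code itself: signal a is registered on the system clock and cleared by the global reset. The sketch below is a minimal illustration of that context. The module and port names (reg_a_from_b, sys_clock, sys_reset) are invented here, and the asynchronous active-high reset is an assumption that mirrors the style of Listing 1-3.

```verilog
// Hypothetical wrapper showing the context that "a <= b;" needs.
module reg_a_from_b (sys_clock, sys_reset, b, a);
input  sys_clock;   // assumed to be the 75 MHz system clock
input  sys_reset;   // assumed active-high global reset
input  b;
output a;
reg    a;

always @ (posedge sys_clock or posedge sys_reset)
begin
    if (sys_reset)
        a <= 1'b0;   // the global reset clears the register
    else
        a <= b;      // the one line from Listing 1-1
end

endmodule
```

The remaining requirements (delay, speed grade, supply current, cost, schedule, and pin location) cannot be written in Verilog at all. They are met through timing and placement constraints, device selection, and project management, which is the point of the comparison.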

To illustrate the design process, let’s follow a trivial example from concept to delivery
and examine the issues that the designer confronts when implementing the design. Don’t
worry if the Verilog language elements are unfamiliar; they will be covered in detail later in
this chapter.

TRIVIAL OVERHEAT DETECTOR EXAMPLE

Sarah, the Engineering Manager, writes the following email to Sam, the digital designer:

To: sam@engineering
From: sarah@management
Subject: Hot Design Project.

The customer wants a red light that turns on and stays on if a
button is pressed and if their machine is overheating. They want
it yesterday, it needs to be battery operated, and has to have a
final build cost of $0.02 so the company can make money when
they sell it for $9.95.

First, Sam estimates the scope of the design. From experience, she determines that this
circuit is very similar to a design she did last year. She counts the gates of the previous
design, factors in the differences between the two designs, and decides the design is
approximately 20 gates. She considers the speed that the design must run at and any other
complicating factors she can think of, including the error she made in estimating complexity
of the previous design and the fact that she’s already purchased airline tickets for a week of
vacation. She knows that, overall, including design, test, integration, and documentation,
she can design 2000 gates a month without working significant overtime. She counts the
number of pins (the specification lists a pushbutton input, an overheat input, and an
overheat output, but Sam realizes that she’ll need to add at least a reset and clock input).
From the gate-count estimate and the pin estimate she can select a device. She picks a
device that has more pins than she needs because she knows the design will grow as
features are added. She picks an FPGA package from a family that has larger and faster
parts available so she is not stuck if she needs more logic or faster speed. Now she sends a
preliminary schedule and part selection to her boss and starts working on the design. Her
boss will thank her for her thorough work on the cost and schedule estimates, but will insist
that the job be done faster to be ready for an important trade show and cheaper to satisfy the
marketing department.
Keep in mind that rarely will your estimates be too high. Even when we know better,
engineers are eternally optimistic. Unless you are very smart and very lucky, your estimate
will not allow enough contingency to cover growth of the design (feature-creep), the hassles
associated with fitting a high-speed design into a part that is too small, and the other 1001
things that can go wrong. These estimating errors result in overtime hours and increased
project cost.
Now that Sam has taken care of the up-front project-related chores, she can start
working on the design. Sam recognizes that a simple flipflop circuit will perform this
function. She also recognizes, because of the problems she had with an earlier project, that a
synchronous digital design is the right approach to solving this problem. Sam creates a
Verilog design that looks like Listing 1-3.

Listing 1-3 Overheat Detector Design Example

module overheat (clock, reset, overheat_in, pushbutton_in,
                 overheat_out);
input  clock, reset, overheat_in, pushbutton_in;
output overheat_out;
reg    overheat_out;
reg    pushbutton_sync1, pushbutton_sync2;
reg    overheat_in_sync1, overheat_in_sync2;

// Always synchronize inputs that are not phase related to
// the system clock.
// Use double-synchronizing flipflops for external signals
// to minimize metastability problems.
// Even better would be some type of filtering and latching
// for poorly behaving external signals that will bounce
// and have slow rise/fall times.

always @ (posedge clock or posedge reset)
begin
    if (reset)
    begin
        pushbutton_sync1  <= 1'b0;
        pushbutton_sync2  <= 1'b0;
        overheat_in_sync1 <= 1'b0;
        overheat_in_sync2 <= 1'b0;
    end
    else begin
        pushbutton_sync1  <= pushbutton_in;
        pushbutton_sync2  <= pushbutton_sync1;
        overheat_in_sync1 <= overheat_in;
        overheat_in_sync2 <= overheat_in_sync1;
    end
end

// Latch the overheat output signal when overheat is
// asserted and the user presses the pushbutton.
always @ (posedge clock or posedge reset)
begin
    if (reset)
        overheat_out <= 1'b0;

    // Overheat_out is held forever (or until reset).
    else if (overheat_in_sync2 && pushbutton_sync2)
        overheat_out <= 1'b1;
end

endmodule

This seems like a lot of typing for such a simple circuit, doesn’t it? The first always element
appears to do nothing and looks like it could be deleted. In a previous design, Sam had
problems (which will be discussed in Chapter 2) with erratic logic behavior, so she always
double-synchronizes inputs from the Real World. The second always block asserts
overheat_out when overheat_in_sync2 and pushbutton_sync2 are both asserted.
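The comments in Listing 1-3 suggest that filtering would be even better for external signals that bounce. The following is a minimal sketch of one way to do that, not part of the original design. The module name debounce, the parameter COUNT_MAX, and the 4-bit counter width are all assumptions chosen for illustration; a real pushbutton usually needs a much longer filter time.

```verilog
// Hypothetical debounce filter (not from the original listings).
// The output changes only after the synchronized input has differed
// from it for COUNT_MAX + 1 consecutive clocks.
module debounce (clock, reset, noisy_in, clean_out);
input  clock, reset, noisy_in;
output clean_out;
reg    clean_out;
reg    sync1, sync2;            // Double-synchronizing flipflops.
reg    [3:0] count;             // Assumed width; size it for the bounce time.

parameter COUNT_MAX = 4'd15;    // Assumed filter length in clocks.

always @ (posedge clock or posedge reset)
begin
    if (reset)
    begin
        sync1     <= 1'b0;
        sync2     <= 1'b0;
        count     <= 4'd0;
        clean_out <= 1'b0;
    end
    else begin
        sync1 <= noisy_in;
        sync2 <= sync1;
        if (sync2 == clean_out)
            count <= 4'd0;          // Input agrees with output: restart.
        else if (count == COUNT_MAX)
        begin
            clean_out <= sync2;     // Stable long enough: accept it.
            count     <= 4'd0;
        end
        else
            count <= count + 4'd1;
    end
end
endmodule
```

Any bounce back to the old level clears the counter, so only a level that stays put is passed on to the rest of the design.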

LINES OF CODE
A useful method of estimating the size of a design is to count the semicolons.
Sam has done the fun part of the design: the actual designing of the code. She quickly
runs her compiler, simulator, or Lint program to make sure there are no typographical or
syntax errors. Next, because writing test vectors is almost as much fun as designing the
code, Sam does a test fixture and checks out the behavior of her design. Her test fixture
looks something like Listing 1-4.

Listing 1-4 Overheat Detector Test Fixture

// Overheat detector test fixture.

// Created by Sam Stephens

`timescale 1ns / 1ns

module oheat_tf;

reg clock, system_reset, overheat_in, pushbutton_in;

parameter clk_period = 33.333;

overheat u1 (clock, system_reset, overheat_in, pushbutton_in,

overheat_out);

always begin
#clk_period clock = ~clock; // Generate system clock.
end

initial
begin
clock = 0;
system_reset = 1; // Assert reset.
overheat_in = 0;
pushbutton_in = 0;
#75 system_reset = 0;
end

// Toggle the input and see if overheat_out gets asserted.

always
begin
#200 overheat_in = 1;
#100 pushbutton_in = 1;
#100 pushbutton_in = 0;
#200 overheat_in = 0;
#100 $finish;
end

endmodule

Sam invokes her favorite simulation tool and examines the output waveforms to make sure
the output is logically correct. The output waveform looks like Figure 1-1 and appears okay.
Generally Sam will write and run an automated test-fixture program (as described in
Chapter 5), but the design is simple and the boss has ordered her to quit being such a
fussbudget and get on with it.
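For comparison with visual waveform inspection, here is a rough sketch of what a self-checking fixture for this design might look like. It is an illustration only and is not the automated approach developed in Chapter 5. The module name oheat_check_tf, the check task, and the 34-ns clock period are assumptions, and the fixture must be compiled together with the overheat module of Listing 1-3.

```verilog
// Hypothetical self-checking fixture; compile together with Listing 1-3.
`timescale 1ns / 1ns

module oheat_check_tf;
reg  clock, system_reset, overheat_in, pushbutton_in;
wire overheat_out;
integer errors;

overheat u1 (clock, system_reset, overheat_in, pushbutton_in,
             overheat_out);

always #17 clock = ~clock;      // Assumed 34-ns clock period.

// Compare the output with the expected value and count mismatches.
task check;
input expected;
begin
    if (overheat_out !== expected)
    begin
        errors = errors + 1;
        $display("ERROR at %0t: overheat_out = %b, expected %b",
                 $time, overheat_out, expected);
    end
end
endtask

initial
begin
    errors = 0;
    clock = 0;
    system_reset = 1;
    overheat_in = 0;
    pushbutton_in = 0;
    #75 system_reset = 0;

    overheat_in = 1;                // Overheat alone must not latch.
    repeat (4) @ (posedge clock);
    #1 check(1'b0);

    pushbutton_in = 1;              // Overheat plus pushbutton latches.
    repeat (4) @ (posedge clock);
    #1 check(1'b1);

    overheat_in = 0;                // Output holds after inputs drop.
    pushbutton_in = 0;
    repeat (4) @ (posedge clock);
    #1 check(1'b1);

    system_reset = 1;               // Only reset clears the output.
    #1 check(1'b0);

    if (errors == 0) $display("PASS");
    else $display("FAIL: %0d error(s)", errors);
    $finish;
end
endmodule
```

Waiting four clocks before each check leaves room for the two synchronizer stages plus the output flipflop.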

Figure 1-1 Overheat Detector Design Output Waveforms

Sam assigns input/output pins and defines timing constraints for her design. She
knows that the system does not have to run fast, so she selects the lowest-frequency crystal
oscillator available to drive the clock input. This gives the lowest current consumption to maximize
the life of the battery. Sam submits the design to her FPGA compiler and gets a report back
that tells her that the design fits into the device she chose and that timing constraints are
met. From experience, she knows that a design running this slowly will not have
temperature or RFI emission problems. She checks the design into the revision control
system, sends an email to her boss to tell her the job is complete, and takes the rest of the
day off to go rollerblading.
This probably seems like a lot of work to complete a job that consists of five flipflops,
but Sam was lucky. The design fit into the device she chose, the design ran at the right
speed, the design did not have temperature/EMI/RFI problems, the specifications didn’t
change halfway through the design, the software tools and her workstation didn’t crash, and
she avoided the 1001 other hazards that exist in the Real World.

ENGINEERING SCHEDULE
Too often, a management tool for browbeating an engineer into working free overtime.
Engineers, even when they should really know better, are generally too optimistic when creating
schedules, thus, they are almost always late.
We have to be mature about this subject: without a deadline, nothing would ever get
finished. Still, most jobs should be completed with little overtime.

Some problems can be avoided by doing thorough design work up front. Sam was
careful not to start coding until she completely understood the requirements of the design.

GIGO
There is a great temptation to start coding before the product is well understood. After all,
to an engineer, coding is fun and planning is not.
I don’t care how much fun the job is, don’t start coding the design until you know what the
end result is supposed to be.

This book emphasizes design approaches that minimize problems and unpleasant
surprises.

SYNTHESIZABLE VERILOG ELEMENTS

Verilog was designed as a simulation language, and many of its elements do not translate to
hardware. Verilog is a large and complete simulation language. Only about 10% of it is
synthesizable. This chapter covers the fundamental properties of the 10% that the FPGA
designer needs.
Exactly which Verilog elements are considered synthesizable is a design problem
faced by the synthesis vendor. Generally, an “unofficial” subset of the Verilog language
elements will be supported by all vendors, but the current Verilog specification does not
contain any minimum list of synthesizable language elements. An IEEE working group
wrote a specification called IEEE Std 1364.1 RTL Synthesis Subset to define a minimum
subset of synthesizable Verilog language elements.
Verilog looks similar to the C programming language, but keep in mind that C defines
sequential processes (after all, only one line of code can be executed by a processor at a
time), whereas Verilog can define both sequential and parallel processes. Listing 1-5
presents some sample code with common synthesizable Verilog elements.

Listing 1-5 Example Verilog Program

module hello (in1, in2, in3, out1, out2, clk, rst, bidir_signal,
output_enable);// See note 1.
/* See note 2.
Comments that span multiple lines can be identified like this.
*/
input in1, in2, in3, clk, rst, output_enable; // See note 3.
output out1, out2;
inout bidir_signal;
reg out2; // See note 4
wire out1;

assign out1 = in1 & in2; // See note 5.

assign bidir_signal = output_enable ? out2 : 1'bz; // See note 6.

always @ (posedge clk or posedge rst) // See note 7.
begin // See note 8.
    if (rst) out2 <= 1'b0; // See note 9.
    else out2 <= in3;
end
endmodule

Note 1: The first element of a module is the module name. Modules are the building
blocks of a Verilog design. In this book, the module name will be the same as the file name
(with a .v extension added) and each file will contain a single module. This is not required
but helps keep the design structure intelligible.
The port list follows the module/file name. This list contains the signals that connect
this module to other modules and to the outside world. Signals used in the module that are
not in the port list are local to the module and will not be connected to other modules. Note
the use of a semicolon as a separator to isolate Verilog elements. One confusing aspect of
Verilog is that not all lines end with a semicolon, particularly the compiler instructions
(always statements, if statements, case statements, etc.). It takes the Verilog newbie some
time to get comfortable with Verilog syntax.

Note 2: Comments follow double forward slashes or can be enclosed within a /*

Comment here */ pair. The latter type of comment delimiting can’t be nested. The detection
of a /* following another /* will be flagged as an error.

Note 3: The port direction list follows the module port list. This list defines whether
the signals are inputs, outputs, or inout (bidirectional) ports of the module. All port list
signals are wires. A wire is simply a net similar to an interconnection on a printed circuit
card.

Note 4: Signals are either wires (interconnects similar to traces and pads on a circuit
board) or registers (a signal storage element like a latch or a flipflop). Wires can be driven
by a register or by combinational assignments. It is illegal to connect two registers together
inside a module. Verilog assumes that a signal, unless otherwise defined in the code, is a
one-bit-wide wire. This can be a problem; the synthesis tool will not test vector width. This
is one good reason for using a Verilog Lint tool.

Note 5: The assign statement is a continuous (combinational) logic assignment.

Note 6: The bidir_signal assignment uses a conditional assignment; if output_enable

is true, bidir_signal is assigned the value of out2, otherwise it’s assigned the tristate value z.

Note 7: Always blocks are sequential blocks. The signal list following the @ and
inside the parentheses is called the event sensitivity list, and the synthesis tool will extract
block control signals from this list. The requirement of a sensitivity list comes from
Verilog’s simulation heritage. The simulator keeps a list of monitored signals to reduce the
complexity of the simulation model; the logic is evaluated only when signals on the
sensitivity list change. This allows simulation time to pass quickly when signals are not
changing. This list doesn’t mean much to the synthesis tool, except that, by convention,
when certain signals are extracted for control, these input signals must appear on the
sensitivity list. The compiler will issue a warning if the sensitivity list is not complete.
These warnings should be resolved to assure that the synthesis result matches simulation.
The sensitivity list can be a list of signals (in which case, any change on any listed
signal is detected and acted upon), posedge (rising-edge triggered), or negedge (falling-edge
triggered). Posedge and negedge triggers can be mixed, but if posedge or negedge is used
for one control, posedge or negedge must be used for ALL controls for this block.

Note 8: The begin/end command isolates code fragments. If the code can be expressed
using a single semicolon, the begin/end pair is optional.

Note 9: We’re using nonblocking assignments (<=) in the always block. If blocking
assignments (=) are used, the order of the instructions may cause unwanted latches to be
synthesized so that a value can be held while earlier variables are updated. Generally, the
designer wants all elements in the sequential (always) block updated simultaneously, hence
the use of the nonblocking assignment, which emulates the clock-to-Q delay. The clock-to-
Q delay assures that cascaded flipflops (like a shift register) operate as expected. They are
called nonblocking because updating an earlier variable will not block the updating of a
later variable.

The rst input, when coded in this manner (i.e., a nonsynchronous signal used in a
synchronous module), is interpreted as an asynchronous reset. This is not a Verilog requirement
per IEEE Std 1364 but is an accepted convention.
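Note 9 says that nonblocking assignments update all elements of a sequential block simultaneously. A small sketch can make that concrete. The module swap_demo and the parameters X_INIT and Y_INIT are hypothetical names introduced here for illustration.

```verilog
// Hypothetical example: two registers exchange values on every clock.
module swap_demo (clk, rst, x, y);
input  clk, rst;
output [3:0] x, y;
reg    [3:0] x, y;

parameter X_INIT = 4'd3;    // Arbitrary start values for illustration.
parameter Y_INIT = 4'd12;

always @ (posedge clk or posedge rst)
begin
    if (rst)
    begin
        x <= X_INIT;
        y <= Y_INIT;
    end
    else begin
        x <= y;     // Both right-hand sides are sampled first,
        y <= x;     // then both registers update together.
    end
end
endmodule
```

If the two assignments in the else branch were blocking (=), x would be overwritten before y reads it, and both registers would end up holding the old value of y.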
Verilog language elements are case sensitive (SIGNAL1 and signal1 are not
equivalent, for example). Like the C programming language, Verilog is tolerant of white
space. The designer uses white space to assist legibility. It’s legal to combine lines as so:

a = b&c; d = e&f; g = h | i; j = k^m; n = o&p;

but designers who write hard-to-read code like this are subject to the loss of their free sodas.

PORTABLE VERILOG CODE

It is desirable to write code that can be compiled by any vendor’s compiler and implemented in
any hardware technology with identical results. Unfortunately, to write high-performance (where
the design runs at high speed) and efficient (where the design uses minimum hardware
resources by targeting architecture-specific features) code, the designer often must use
architecture- and compiler-specific commands and constructs. Portability is often not a practical
or achievable design requirement. It’s a great goal even if we never reach it.

We’re not going to cover operator precedence. If you have a required precedence,
then use parentheses to be explicit about that precedence. The reader should be able to read
the precedence in the source code, not be forced to memorize or look up the built-in
language precedence(s). Don’t create complicated structures; use the simplest and clearest
coding style possible. Listings 1-6 and 1-7 illustrate equivalent coding structures with
implicit and explicit ‘don’t-cares’.

Listing 1-6 Casex (Explicit Don’t Care) Code Fragment

// Indexing example with explicit ‘don’t cares’.

reg [7:0] test_vector;
casex (test_vector)
    8'bxxxx0001:
    begin
        // Insert code here.
        // This coding style results in a parallel case structure (MUX).
    end
endcase

Listing 1-7 Implicit Don’t Care Code Fragment

// Indexing example with implicit ‘don’t cares’.

reg [7:0] test_vector;
if (test_vector[3:0] == 4'b0001)
begin
// Insert code here.
// This coding style results in priority encoded logic.
end

One feature of Verilog the designer must conquer is whether a priority-encoded (deep and
slow) structure or a MUX (wide and fast) structure is desired. Nested if-then statements tend
to create priority-encoded logic. Case statements tend to create MUX logic elements. There
will be more discussion of this topic later.
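As a rough illustration of the two styles, the sketch below writes the same 4-to-1 selection both ways. The module name sel_styles and its signal names are assumptions made for this example. Because the conditions here are mutually exclusive, a synthesis tool may well reduce both blocks to the same hardware; the difference matters more when the if conditions overlap.

```verilog
// Hypothetical 4-to-1 selector written two ways.
module sel_styles (sel, d0, d1, d2, d3, y_priority, y_mux);
input  [1:0] sel;
input  d0, d1, d2, d3;
output y_priority, y_mux;
reg    y_priority, y_mux;

// if/else chain: each test is evaluated only when earlier tests fail,
// which tends to produce priority-encoded logic.
always @ (sel or d0 or d1 or d2 or d3)
begin
    if (sel == 2'b00)      y_priority = d0;
    else if (sel == 2'b01) y_priority = d1;
    else if (sel == 2'b10) y_priority = d2;
    else                   y_priority = d3;
end

// case statement: mutually exclusive branches, which tends to
// produce a multiplexer.
always @ (sel or d0 or d1 or d2 or d3)
begin
    case (sel)
        2'b00:   y_mux = d0;
        2'b01:   y_mux = d1;
        2'b10:   y_mux = d2;
        default: y_mux = d3;
    endcase
end
endmodule
```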
Do not assume a Verilog register is a flipflop of some type. In Verilog, a register is
simply a memory storage element. This is one of the first of the features (or quirks) the
Verilog designer grapples with. A register might synthesize to a flipflop (which is a digital
construct) or a latch (which is an analog construct), a wire, or might be absorbed during
optimization. Verilog assumes that a variable not explicitly changed should hold its value.
This is a handy feature (compared to Altera’s AHDL, which assumes that a variable not
mentioned gets cleared). Verilog, with merciless glee, will instantiate latches to hold a
variable’s state. The designer must structure the code so that the intended hardware
construct is synthesized and must be constantly alert to the possibility that the latches may
be synthesized. Verilog does not include instructions that require the synthesizer to use a
certain construct. By using conventions defined by the synthesis vendor, and making sure
all input conditions are completely defined, the proper interpretation will be made by the
synthesizer.
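One common way to make sure all input conditions are completely defined is to assign a default value at the top of a combinational block. The sketch below shows the idea; the module name no_latch and its ports are hypothetical.

```verilog
// Hypothetical decoder showing one way to avoid an unintended latch.
module no_latch (sel, a, b, y);
input  [1:0] sel;
input  a, b;
output y;
reg    y;

// Every signal read in the block appears in the sensitivity list.
always @ (sel or a or b)
begin
    y = 1'b0;           // Default covers every case not listed below.
    case (sel)
        2'b01: y = a;
        2'b10: y = b;
    endcase
end
endmodule
```

Without the default assignment, y would have to hold its previous value when sel is 00 or 11, and a latch would be inferred to do the holding.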

VERILOG HIERARCHY

A Verilog design consists of a top-level module and one or many lower-level modules. The
top-level module can be instantiated by a simulation module that applies test stimulus to the
device pins. The top-level device module generally contains the list of ports that connect to
the outside world (device pins) and interconnect between lower-level modules and
multiplexing logic for control of bidirectional I/O pins or tristate device pins. The exact way
the design is structured depends on designer preference.
Module instances are defined as follows:

module_name instance_name (port list);

For example, the code in Listing 1-8 creates four instances of assorted primitive gates and
the post-synthesis schematic for this design is shown in Figure 1-2.

Listing 1-8 Structural Example

module gates (in1,in2,in3,out4);

input in1, in2, in3;
output out4;
wire in1, in2, in3, out1, out2, out3, out4;
and u1 (out1, in1, in2); // Structural (schematic-like)
or u2 (out2, out1, in3); // constructs.
xor u3 (out3, out1, out2);
not u4 (out4, out3);
endmodule

Figure 1-2 Gates Example Schematic

This example uses positional assignment. Signals are connected in the same order that they
are listed in the instantiated module port list(s). Generally, the designer will cut and paste
the port list to assure they are identical. A requirement for a primitive port listing is that the
output(s) occur first on the port list followed by the input(s).
The module port list can also use named assignments (exception: primitives require
positional assignment), in which case the order of the signals in the port list is arbitrary. For
named assignments, the format is .lower-level signal name (higher-level module signal
name). The module of Listing 1-9 includes examples of both named and positional
assignments.

Listing 1-9 Named and Positional Assignment Example

module and_top;
wire test_in1, test_in2, test_in3;
wire test_out1, test_out2;

// Named assignment where the port order doesn’t matter.

user_and u1 (.out1(test_out1), .in1(test_in1), .in2(test_in2));

// Positional assignment.
user_and u2 (test_out2, test_in2, test_in3);
endmodule

module user_and (out1, in1, in2);

input in1, in2;
output out1;

assign out1 = (in1 & in2);

endmodule

BUILT-IN LOGIC PRIMITIVES

Tables 1-1 through 1-12 describe Verilog two-input functions. The input combinations are
read down and across. Verilog primitives are not limited to two inputs, and the logic for
primitives with more inputs can be extrapolated from these tables.
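To see how the two-input rules extend to more inputs, a small simulation-only sketch such as the following can be run. The module name prim3_tf and the chosen input combinations are assumptions for illustration.

```verilog
// Hypothetical simulation-only check of three-input primitives.
module prim3_tf;
reg  a, b, c;
wire y_and, y_nand, y_or, y_xor;

and  u1 (y_and,  a, b, c);
nand u2 (y_nand, a, b, c);
or   u3 (y_or,   a, b, c);
xor  u4 (y_xor,  a, b, c);

initial
begin
    a = 1'b0; b = 1'bx; c = 1'bz;
    #1 $display("0,x,z: and=%b nand=%b or=%b xor=%b",
                y_and, y_nand, y_or, y_xor);
    a = 1'b1; b = 1'b1; c = 1'bz;
    #1 $display("1,1,z: and=%b nand=%b or=%b xor=%b",
                y_and, y_nand, y_or, y_xor);
    a = 1'b1; b = 1'b0; c = 1'bx;
    #1 $display("1,0,x: and=%b nand=%b or=%b xor=%b",
                y_and, y_nand, y_or, y_xor);
    $finish;
end
endmodule
```

A single 0 forces the and output to 0 (and nand to 1) no matter what the other inputs are, and a single 1 forces the or output to 1. Any x or z input makes the xor output x.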

and 0 1 x z
0 0 0 0 0
1 0 1 x x
x 0 x x x
z 0 x x x

Table 1-1 AND Gate Logic

nand 0 1 x z
0 1 1 1 1
1 1 0 x x
x 1 x x x
z 1 x x x

Table 1-2 NAND Gate Logic

or 0 1 x z
0 0 1 x x
1 1 1 1 1
x x 1 x x
z x 1 x x

Table 1-3 OR Gate Logic

nor 0 1 x z
0 1 0 x x
1 0 0 0 0
x x 0 x x
z x 0 x x

Table 1-4 NOR Gate Logic

xor 0 1 x z
0 0 1 x x
1 1 0 x x
x x x x x
z x x x x

Table 1-5 XOR Gate Logic

xnor 0 1 x z
0 1 0 x x
1 0 1 x x
x x x x x
z x x x x

Table 1-6 XNOR (Equivalence) Gate Logic

input output
0 0
1 1
x x
z x

Table 1-7 buf (buffer) Gate Logic

input output
0 1
1 0
x x
z x

Table 1-8 not (inverting buffer) Gate Logic

bufif0 control = 0 control = 1 control = x control = z

data = 0 0 z x x
data = 1 1 z x x
data = x x z x x
data = z x z x x

Table 1-9 bufif0 (tristate buffer, low enable) Gate Logic

bufif1 control = 0 control = 1 control = x control = z

data = 0 z 0 x x
data = 1 z 1 x x
data = x z x x x
data = z z x x x

Table 1-10 bufif1 (tristate buffer, high enable) Gate Logic

notif0 control = 0 control = 1 control = x control = z

data = 0 1 z x x
data = 1 0 z x x
data = x x z x x
data = z x z x x

Table 1-11 notif0 (tristate inverting buffer, low enable) Gate Logic

notif1 control = 0 control = 1 control = x control = z

data = 0 z 1 x x
data = 1 z 0 x x
data = x z x x x
data = z z x x x

Table 1-12 notif1 (tristate inverting buffer, high enable) Gate Logic

The code fragment in Listing 1-10 illustrates the use of these buffers and Figure 1-3 is
the schematic extracted from the synthesized logic.

Listing 1-10 Example of Instantiating Structural Gates

module struct1 (out1,out2,out3,out4,in1,in2,in3,in4,in5,

in6,buf_control);
output out1, out2, out3, out4;
input in1, in2, in3, in4, in5, in6, buf_control;
bufif0 buf1(out1, in1, buf_control);
and and1(out2, in2, in3);
nor nor1(out3, in4, in5);
not not1(out4, in6);
endmodule

Figure 1-3 Schematic of Structural Gates

LATCHES AND FLIPFLOPS

Technically, a flipflop is defined as a bistable multivibrator. Not a very helpful definition, is
it? A multivibrator is an analog circuit with two or more outputs, where, if one output is on,
the other(s) will be off. Bistable means an output is binary or digital and has two output
states: high or low. We will extend the output states to include tristate (z).
There are various flavors of flipflops, but generally we will be discussing the clocked
D flipflop in which the output follows the D input after a clock edge. Table 1-13 shows the
function table of a common edge-triggered D flipflop. Note that Table 1-13 Set and Reset
inputs are active-low.

/Set /Reset Clock Data Q /Q

0 1 x x 1 0
1 0 x x 0 1
0 0 x x 1 1 Note 1
1 1 r 1 1 0
1 1 r 0 0 1
1 1 0 x n n
1 1 1 x n n

Note 1: This condition is not stable and is illegal. The problem is, if the /Set and /Reset inputs
are removed simultaneously, the output state will be unknown.
x = don’t care (doesn’t matter).
r = rising edge of clock signal.
n = no change, previous state is held.

Table 1-13 Logic description of a 7474-style D flipflop

The typical FPGA logic element design allows the use of either an asynchronous Set
or Reset, but not both together, so we won’t have to worry about the illegal input condition
where both are asserted. This book is going to strongly emphasize synchronous design
techniques, so we discourage any connection to a flipflop asynchronous Set or Reset
input except for power-up initialization control. Even in this case, a synchronous
Set/Reset might be more appropriate.
A latch is more of an analog function. It’s helpful to bear in mind that all the
underlying circuits that make up our digital logic are analog! There is no magic flipflop
element. Flipflops are made with transistors and positive feedback: they are latches.
Figure 1-4 Schematic of a Typical TTL D Flipflop Implementation

Even if you’re the kind of person whose eyes glaze over when you see transistors on a
schematic, you should still notice two things about Figure 1-4. The first thing is that this D
flipflop is made with linear devices, i.e., transistors. If you can always keep the idea in the
back of your head that all digital circuits are built from analog elements that have gain,
impedance, offsets, leakages, and other analog nasties, then you are on the road to being an
excellent digital designer. The second thing to notice is feedback (see highlighted signals)
from the Q and /Q outputs back into the circuit. Feedback is what causes the flipflop to hold
its state.
Figure 1-5 Schematic of a Typical D Flipflop Implementation (Gates)

If you are more comfortable with gates, a different view of the same D flipflop is shown in
Figure 1-5. This is a higher level of abstraction; the transistors and resistors are hidden.
Those pesky transistors are still there! Again, note highlighted feedback path. Also note: the
PRESET and CLEAR signals have active low polarity.
Listing 1-11 shows a Verilog version of a latch, and Figure 1-6 shows the schematic
extracted from this Verilog design. The underlying circuit that implements RS Latch
(LATRS) is a circuit functionally similar to Figure 1-5. It’s not a digital circuit!

Figure 1-6 Schematic of Latch Flipflop Implementation

Listing 1-11 Latch Verilog Code

// Your Basic Latch.

// This is a bad coding style: do not create latches this way!

module latch(q, q_not, set, reset);

output q, q_not;
input set, reset;
reg q;

wire set, reset;

assign q_not = ~q;

always @ (set or reset)

begin
if (set)
q = 1;
else if (reset)
q = 0;
end
endmodule

The latch uses feedback to hold a state: this feedback is implied in Listing 1-11 by not
defining q for all combinations of input conditions. For undefined inputs, q will hold its
previous state. The logic that determines a latch output state may include a clock signal but
typically does not and is therefore a level-sensitive rather than an edge-triggered construct.

Listing 1-12 Verilog Code That Creates a Latch

module lev_lat(test_in1, enable_input, test_out1);

input test_in1, enable_input;
output test_out1;
reg test_out1;

always @ (test_in1 or enable_input)

if (enable_input) begin
test_out1 <= test_in1;
end

endmodule

In the example of Listing 1-12, test_out1 will change only while enable_input is high,
then test_out1 will follow test_in1. This will synthesize to a combinational latch as
illustrated in Figure 1-7. We’ll discourage this type of coding style unless the latch is driven
by a synchronous circuit and drives a synchronous circuit, resulting in a pseudosynchronous
design.
Is a latch a good design construct? That depends on the designer’s intent. If the
designer intended to create a latch construct, then a synthesized latch is good. If the designer
did not intend to create a latch construct (which Verilog is very inclined to create), then a
latch is bad. In general, we will scrutinize all synthesized latches suspiciously, because they
are, at best, pseudosynchronous constructs.

Figure 1-7 Latch Circuit Schematic (Reset-Set Latch)

A better design infers a clocked flipflop structure, as in Listing 1-13, with the
respective schematic shown in Figure 1-8.

Listing 1-13 Cascaded Flipflops with Synchronous Reset

module edgetrig (clk, rst, test_in1, enable_input, test_out2);

input clk, rst, test_in1, enable_input;
reg test_out1, test_out2;
output test_out2;

always @ (posedge clk)

begin
if (rst)
test_out1 <= 0;
else if (enable_input) begin
test_out2 <= test_out1;
test_out1 <= test_in1;
end
end
endmodule

Figure 1-8 Schematic for Cascaded Flipflops with Synchronous Reset

Listing 1-13 demonstrates a flipflop with synchronous reset where the reset input is
evaluated only on clock edges. If the target hardware does not support a synchronous reset,
logic will be added to set the D input low when reset is asserted as shown in Figure 1-8.
Listing 1-14 illustrates a flipflop with asynchronous reset where the rst signal is “evaluated”
on a continuous basis. Notice that the dedicated global set/reset (GSR) resource of the
flipflops is not used. It would be much more efficient to synthesize a synchronous reset
signal and connect it to the GSR. This type of assignment is covered in Chapter 5.

Listing 1-14 Verilog Flipflop with Asynchronous Reset

module edge_lat (clk, rst, test_in1, enable_input, test_out2);

input clk, rst, test_in1, enable_input;
reg test_out1, test_out2;
output test_out2;

always @ (posedge clk or posedge rst)

begin
if (rst) test_out1 <= 0;
else if (enable_input) begin
test_out2 <= test_out1;
test_out1 <= test_in1;
end
end

endmodule

Figure 1-9 Schematic for Cascaded Flipflops with Asynchronous Reset
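One widely used way to derive a reset that is released synchronously is sketched below. This is an assumption about what such a circuit could look like, not the approach the book develops in Chapter 5. The module name reset_sync and its signal names are hypothetical.

```verilog
// Hypothetical reset synchronizer (names are illustrative).
// The reset asserts immediately and is released on a clock edge.
module reset_sync (clk, async_rst, sync_rst);
input  clk, async_rst;
output sync_rst;
reg    rst_meta, sync_rst;

always @ (posedge clk or posedge async_rst)
begin
    if (async_rst)
    begin
        rst_meta <= 1'b1;
        sync_rst <= 1'b1;
    end
    else begin
        rst_meta <= 1'b0;
        sync_rst <= rst_meta;   // Released on the second clock edge
                                // after async_rst drops.
    end
end
endmodule
```

The two-stage structure mirrors the double-synchronizing flipflops used for external inputs, so every flipflop driven by sync_rst leaves reset on the same clock edge.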

BLOCKING AND NONBLOCKING ASSIGNMENTS

So far, we’ve used only nonblocking assignments (<=). A blocking assignment (=), when
the variable is defined outside the always statement where it is used, holds off future
assignments until the previous assignment is complete. How can synthesized hardware hold
off an assignment? By storing an old value in a latch, that’s how. This means that blocking
assignments are order sensitive; they are executed in the begin/end sequential block in the
order in which they are encountered by the compiler (top to bottom).

Listing 1-15 Blocking Statement Example 1

/* The blocking statement of the first blocking assignment must be

completed before any later assignments will be performed. In this
example, two sets of flipflops will be created (see Figure 1-10)
because an intermediate value is required to create data_out. */

module blocking(clock, reset, data_in, data_out);

input clock, reset;
input data_in;
reg data_temp;
output data_out;
reg data_out;

always @ (posedge clock or posedge reset)

if (reset)
begin
data_out = 0;
data_temp = 0;
end
else
begin
data_out = data_temp;
data_temp = data_in;
end
endmodule

The synthesized logic for Listing 1-15, shown in Figure 1-10, illustrates the blocking
assignment of data_temp and data_out: a flipflop is synthesized to create the intermediate
(pipelined) data_temp variable.

Figure 1-10 Blocking Statement Example 1

In Listing 1-16, the blocking statements are reversed. Notice how the resulting logic,
as illustrated in Figure 1-11, is different from the logic of Figure 1-10.
Listing 1-16 Blocking Statement Example 2

/* The blocking statements are reversed, making the data_temp

variable redundant, so data_temp gets optimized out. One flipflop
is created (see Figure 1-11) because the intermediate value is
‘blocked’ and not needed to create data_out. */

module block2(clock, reset, data_in, data_out);

input clock, reset, data_in;
reg data_temp;
output data_out;
reg data_out;

always @ (posedge clock or posedge reset)

if (reset) begin
data_out = 0;
data_temp = 0;
end
else
begin
data_temp = data_in; // Switch order.
data_out = data_temp;
end
endmodule

A NOTE ON BLOCKING AND NONBLOCKING ASSIGNMENTS

In a set of blocking assignments that appear in the same always block, the order in which the
statements are evaluated is significant. The use of nonblocking assignments avoids order
sensitivity and tends to create flipflops: this is generally what the designer intends.
Figure 1-11 Blocking Assignment Example 2

If we replace the blocking assignments with nonblocking assignments, the order of the
sequential instructions no longer matters. All right-hand values are evaluated at the positive
edge of the clock, and all assignments are made at the same time. The synthesized logic for
Listing 1-17, shown in Figure 1-12, illustrates the nonblocking assignments of data_temp
and data_out and the resulting synthesized design which is equivalent to the logic of Listing
1-15.

Listing 1-17 Nonblocking Assignment Example

// Nonblocking Logic Example

// The order of the nonblocking assignments is not significant.

module nonblock(clock, reset, data_in, data_out);

input clock, reset;
input data_in;
reg data_temp;
output data_out;
reg data_out;

always @ (posedge clock or posedge reset)

if (reset)
begin
data_out <= 0;
data_temp <= 0;
end
else
begin
data_out <= data_temp;
data_temp <= data_in;
end

endmodule

Figure 1-12 Nonblocking Assignment Example
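The order sensitivity of blocking assignments can also be observed in simulation. The following sketch is a simulation-only illustration, with the hypothetical module name order_tf and hypothetical register names. It places the statement order of Listing 1-15 and the order of Listing 1-16 side by side and prints how many clocks the input takes to reach each output.

```verilog
// Hypothetical simulation comparing the two statement orders.
`timescale 1ns / 1ns

module order_tf;
reg clock, data_in;
reg temp_a, out_a;      // Order of Listing 1-15: two-clock delay.
reg temp_b, out_b;      // Order of Listing 1-16: one-clock delay.

always #10 clock = ~clock;

always @ (posedge clock)
begin
    out_a  = temp_a;
    temp_a = data_in;
end

always @ (posedge clock)
begin
    temp_b = data_in;
    out_b  = temp_b;
end

initial
begin
    clock = 0; data_in = 0;
    temp_a = 0; out_a = 0; temp_b = 0; out_b = 0;
    #5 data_in = 1;
    @ (posedge clock);
    #1 $display("after 1 clock:  out_a=%b out_b=%b", out_a, out_b);
    @ (posedge clock);
    #1 $display("after 2 clocks: out_a=%b out_b=%b", out_a, out_b);
    $finish;
end
endmodule
```

After the first clock, out_b already shows the new input while out_a still holds the old value. After the second clock, both outputs are 1.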

MISCELLANEOUS VERILOG SYNTAX ITEMS

Numbers

Unless defined otherwise by the designer, a Verilog number is 32 bits wide. The format of a
Verilog number is size'base value. The ' is the single quote (tick or apostrophe), not to
be confused with ` (accent grave or back tick), which is used to identify text
substitution and compiler directives. Both tick and back tick are used in Verilog, which will
frustrate a newbie. Underscores are legal in a number to aid in readability. All numbers are
padded to the left with zeros, x’s, or z’s (if the leftmost defined value is x or z) as necessary.
If the number is unsized, the assumed size is just large enough to hold the defined value
when the value gets used for comparison or assignment. X or x is undefined, Z or z is high
impedance. Verilog allows the use of ? in place of z. Numbers without an explicit base are
assumed to be decimal. All nets are assumed to be Z unless driven.

Number examples:

1'b0       // A single bit, zero value.
'b0        // 32 bits, all zeros.
32'b0      // 32 bits, all zeros
           // (0000_0000_0000_0000_0000_0000_0000_0000).
4'ha       // A 4-bit hex number (1010).
5'h5       // A 5-bit hex number (00101).
4'hz       // zzzz.
4'b?ZZ?    // zzzz, ? is an alternate form of z.
4'bx       // xxxx.
9          // A 32-bit number (it's padded to the left
           // with 28 zeroes).
a          // An illegal number.

Verilog is a loosely typed language. For example, it accepts what looks like an 8-bit
value like 4’hab without complaint (the number will be recognized as 1011 or b and the
upper nibble will be ignored). The use of a Lint program like Verilint will flag problems
like this. Otherwise, the Verilog designer must stay alert to guard against such errors.
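The sizing rules above are easy to confirm with a short simulation. The sketch below is an illustration only; the module name numbers_tf and the register names are assumptions.

```verilog
// Hypothetical simulation-only check of number sizing rules.
module numbers_tf;
reg [3:0] nibble;
reg [7:0] byte_val;

initial
begin
    nibble = 4'hab;             // Upper hex digit is dropped.
    $display("4'hab        -> %b", nibble);
    byte_val = 4'ha;            // Zero-padded on the left when widened.
    $display("4'ha         -> %b", byte_val);
    byte_val = 8'bz;            // A leftmost z is extended to fill the size.
    $display("8'bz         -> %b", byte_val);
    byte_val = 8'b1010_0101;    // Underscores are ignored.
    $display("8'b1010_0101 -> %h", byte_val);
    $finish;
end
endmodule
```

Some tools print a truncation warning for 4'hab, but the value is still accepted and nibble ends up holding 1011.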

Forms of Negation

! is logical negation; the result is a single bit value, true (1) or false (0). ~ (tilde) is
bitwise negation. We can use a ! (sometimes called a bang) to invert a single bit value, and
the result is the same as using a ~ (tilde), but this is a bad habit! As soon as someone comes
in and changes the single bit to a multibit vector, the two operators are no longer equivalent,
and this can be a difficult problem to track down (see Listing 1-18).

Listing 1-18 Negation example

module negation (clk, resetn, c, d, e);

input clk, resetn;
output [3:0] c, d, e;
reg [3:0] c, d, e;

always @ (posedge clk or negedge resetn)
begin
    if (~resetn)        // Active low asynchronous reset.
    begin
        c <= 5;         // Bad form to async set a value like
                        // this. This is called a magic
                        // number and should be a parameter.
        d <= 0;
        e <= 0;
    end
    else begin
        d <= !c;        // d gets assigned value of 0.
        e <= ~c;        // e gets assigned value of 1010.
    end
end
endmodule

Forms of AND/OR

& is the symbol for the AND operation. & is a bitwise AND, && is a logical
(true/false) AND. As illustrated in Listing 1-19, these two forms are not functionally
equivalent.

Listing 1-19 Logical and Bitwise AND Examples

a = 4'b1000 & 4'b0001; // a = 4'b0000.

b = 4'b1000 && 4'b0001; // b = 1'b1.

| (pipe) is the symbol for the OR operation, where | is a bitwise OR and || is a logical
OR. As illustrated in Listing 1-20, these two forms are not functionally equivalent.

Listing 1-20 Logical and Bitwise OR Examples

a = 4’b1000 | 4’b0001; // a = 4’b1001;

b = 4’b1000 || 4’b0001; // b = 1’b1.

Listing 1-21 AND/OR Examples

module and_or (clk, resetn, and_test, or_test, a, b, c, d, e, g);

input clk, resetn, and_test, or_test;
output a, b, c, d, e, g;
reg a;
reg [3:0] b;
reg [3:0] c;
reg [3:0] d;
reg [3:0] e;
reg [3:0] g;

always @ (posedge clk or negedge resetn)

begin
if (~resetn) // Active low asynchronous reset.
begin
a <= 0;
b <= 4’d4; // Bad form to async set values
// like this, should be a
// parameter.
c <= 4’d5;
d <= 0;
e <= 0;
g <= 0;
end

else if (and_test)
begin

d <= (c && !a); // d gets assigned value of 1.

e <= (c & ~a); // e gets assigned value of
// 0101 (a is extended to 4 bits
// before it is inverted).
g <= (b & c); // g gets assigned the value
// 0100.
end
else if (or_test == 1) // Equivalent to simply (or_test).
begin
e <= (c | ~a); // e gets assigned value of all
// 1’s (1111).
g <= (b | c); // g gets assigned the value 0101.
end
else
begin
d <= 0; // Assign default values to avoid
// unwanted latches.
e <= 0;
g <= 0;
end

end
endmodule

In Listing 1-21, the final else condition bears some comment. We did not cover all
input conditions in the logic above the final else condition. For example, what output do we
want if neither and_test nor or_test is asserted? Without the final else defined, Verilog
interprets a change from a defined condition to an undefined condition as a hold condition
(if outputs are not commanded, the last value gets held). In a clocked block like this one, the
flipflops simply keep their old values; in a combinational block, the same omission causes
latches to be created. Generally, this is not what the designer intends; thus, we need to make
sure that all conditions are defined.
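
The hold behavior is easiest to see in a purely combinational block. The following is a minimal sketch, not taken from the book's listings; the module name hold_demo and its signal names are hypothetical.

```verilog
// Sketch of latch inference; hold_demo and its ports are hypothetical.
module hold_demo (sel, din, q_latch, q_comb);
  input  sel, din;
  output q_latch, q_comb;
  reg    q_latch, q_comb;

  // Incomplete: nothing is assigned when sel is 0, so q_latch must
  // hold its last value. A synthesis tool will infer a latch here.
  always @ (sel or din)
    if (sel)
      q_latch = din;

  // Complete: a default value is assigned first, so every input
  // condition is covered and only combinational logic is created.
  always @ (sel or din)
  begin
    q_comb = 1'b0;
    if (sel)
      q_comb = din;
  end
endmodule
```

Assigning a default at the top of the block is often easier to maintain than adding a final else to every branch, because new outputs only need one extra line.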

Equality Operators

== and === are equality operators; the result is either true or false, except that the ==
(called logical equality) version will have an unknown (x) result if x or z bits in the
operands make the comparison ambiguous. The === (called case equality) version looks for
an exact match of bits including x’s and z’s and returns only a true or false. Using a ! (bang)
in place of the first = (!= and !==) means “is not equal.” In the
equality examples of Listing 1-22, there are several if statements that will evaluate to true.
As the block is examined from top to bottom, only the first true condition will be accepted.
The later ones will not be evaluated. This is called priority encoding, and, like inferring
latches, Verilog has a natural tendency to use this structure. It can result in many levels of
cascaded logic! Pay close attention. The alternative is more of a MUX style of
structure where inputs are evaluated in parallel, which may be what you intend. We’ll talk
more about this later.

Listing 1-22 Equality Examples

module eq_test (clk, resetn, and_test, or_test, b, c, d, e, g,

h, i, j);
input clk, resetn, and_test, or_test;
output b, c, d, e, g, h, i, j;
reg result;
reg [3:0] b;
reg [3:0] c;
reg [3:0] d;
reg [3:0] e;
reg [4:0] g;
reg [3:0] h, i, j;

always @ (posedge clk or negedge resetn)

begin
if (~resetn) // Active low asynchronous reset.
begin

result <= 0; // We’ll use this register to

// mirror the equality result.
b <= 4’b1x00; // Bad form to async set values
// like this; should be a
// parameter.
c <= 4’b1z00;
d <= 4’b1000;
e <= 4’b1001;
g <= 5’b01001;
h <= 4’b1z00;
i <= 4’b0110;
j <= 4’b011x;
end

// The following test fails.

else if ((b == d) == 1)
result <= 1’bx;

else if (b == d) // This test is the same as previous

// line. Fails.
result <= 1’bx;

else if ((b == d) == 0) // This test fails because of the

// x value in b.
result <= 1’bx;

else if ((b != d) == 1) // This test is the same as in the

// previous line. Fails.
result <= 1’b0;

else if ((b == d) === 1'bx) // This test passes because the

// result of (b == d) is x. Note the
// use of ===; comparing against x
// with == would itself return x.
result <= 1'b1; // All following true conditions
// will be ignored.

else if (c === d) // This test fails.

result <= 1’b0;

else if (e == g) // This test passes because e is

// padded with 0’s to become equal
// in size to g.
result <= 1’b0; // Be careful when variables sizes
// don’t match.

else if (b == c) // This test fails (returns x).

result <= 1’b0;

else if (b != c) // This test also fails (returns x).

result <= 1'bx;

else if ( d == e) // This test fails (returns false).

result <= 1’b0;

else if (b !== c) // This test passes (returns true).

result <= 1’b1;

else if (c == h) // This test fails (returns x).

result <= 1’bx;

else if (c===h) // This test passes (returns true).

result <= 1’b1;

else if (e == ~i) // This test passes (returns true).

result <= 1’b1;

else if (e != ~j) // This test fails (returns x).

// An inverted x (unknown) is
// still an unknown.
result <= 1'bx;

end
endmodule
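
To make the priority-versus-MUX distinction concrete, here is a small sketch that is not from the book. The module name priority_vs_mux and all of its ports are hypothetical, and the logic a synthesis tool actually builds depends on the tool and the target.

```verilog
// Sketch contrasting priority and parallel selection;
// priority_vs_mux and its ports are hypothetical.
module priority_vs_mux (req, sel, data, y_priority, y_mux);
  input  [2:0] req;
  input  [1:0] sel;
  input  [3:0] data;
  output y_priority, y_mux;
  reg    y_priority, y_mux;

  // Priority structure: req[2] wins over req[1], which wins over
  // req[0]. Each added branch can add another level of logic.
  always @ (req or data)
  begin
    if (req[2])      y_priority = data[3];
    else if (req[1]) y_priority = data[2];
    else if (req[0]) y_priority = data[1];
    else             y_priority = data[0];
  end

  // MUX structure: the cases are mutually exclusive, so the
  // inputs can be evaluated in parallel.
  always @ (sel or data)
  begin
    case (sel)
      2'b00:   y_mux = data[0];
      2'b01:   y_mux = data[1];
      2'b10:   y_mux = data[2];
      default: y_mux = data[3];
    endcase
  end
endmodule
```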

The designer can choose between the following if statement forms:

if (~resetn) ...

if (resetn == 1'b0) ...

Both are equivalent. Which is easier to read and understand? That’s a matter
of opinion. Note the use of an ‘n’ suffix to indicate an active low signal (asserted when low
and low-true are other ways to describe this). There are various ways of identifying active
low signals, for example, reset_not, resetl, reset*, or resetN. It helps to identify the
assertion sense as part of the label; the main thing is to be consistent when selecting labels.
Relational operators are also supported, including greater than (>), less than (<), greater
than or equal to (>=), and less than or equal to (<=).

Shift Operators

>> n and << n identify right-shift (divide by 2^n) and left-shift (multiply by 2^n)
operations. Bit positions vacated by the shift are filled with zeros, and bits shifted beyond
the end of the register are lost. Shifting a value which contains an x or a z will move the x or
z in the direction of the shift. Some examples of using the shift operators are presented in
Listing 1-23.

Listing 1-23 Shift Operator Examples

module shifter (clk, resetn, shift_right_test, shift_left_test, a,

b, c, d, e, f);
input clk, resetn;
input shift_right_test;
input shift_left_test;
output a, b, c, d, e, f;
reg [3:0] a;
reg [3:0] b;
reg [3:0] c;
reg d;
reg [3:0] e, f;

always @ (posedge clk or negedge resetn)

begin
if (~resetn) // Active low asynchronous reset.
begin
a <= ’b1001;
b <= 0; // It’s bad form to async set
// values like this.
c <= 0;
d <= 0;
e <= ’bx000;
end

else if (shift_right_test)
begin

c <= a >> 2; // c gets assigned value of

// 0010.
d <= a >> 5; // Regardless of the value
// of a, d will always get
// assigned to 0.
// Verilog will not complain
// about this, use caution.
f <= e >> 1; // Result is 0x00 because
// of x in e.
end

else if (shift_left_test)
begin

c <= a << 2; // c gets assigned value of

// 0100.
d <= a << 5; // Regardless of the value
// of a, d will always get
// assigned to 0.
// Verilog will not complain
// about this, use caution.
f <= e << 1; // Result is 0000.
end

else
begin
d <= 0; // Assign default values to avoid
// unwanted latches.
e <= 0;
f <= 0;
end

end
endmodule

Conditional Operator

A shorthand method of doing a conditional uses a ternary form (which means
arranged in order by threes).

output_assignment <= expression ? true_assignment : false_assignment;

This is a common way of defining a MUX. If the expression being evaluated resolves
to x or z, both the true_assignment and the false_assignment are evaluated and compared
bit by bit, and Verilog will try to resolve the output values. If both input bits are 1 (which
means the input condition doesn’t matter), then the output bit is a 1. The same holds when
both input bits are 0. Any bits that can’t be resolved are assigned an x value. If the
true_assignment or the false_assignment is not wide enough to fill the output_assignment,
the output_assignment bits are left-filled with zeros. See Listing 1-24.

Listing 1-24 Conditional Example

module cond_tst (clk, resetn, tristate_control, input_bus,

output_bus);
input clk, resetn;
input tristate_control;
input [7:0] input_bus;
output [7:0] output_bus;
reg [7:0] output_bus;

always @ (posedge clk or negedge resetn)

begin
if (~resetn) // Active low asynchronous reset.
output_bus <= 8’bz;

else

// Assign output_bus = input_bus if tristate_control is

// true and assign output_bus = high impedance if
// tristate_control is false.
output_bus <= tristate_control ? input_bus : 8’bz;

end
endmodule
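
The bit-by-bit resolution that happens when the select expression is unknown can be checked with a few lines of simulation code. This is a sketch for illustration only; the module name cond_x_demo and its registers are hypothetical.

```verilog
// Simulation-only sketch; cond_x_demo is a hypothetical module.
module cond_x_demo;
  reg       sel;
  reg [3:0] y;

  initial begin
    sel = 1'bx;                    // The select is unknown.
    y   = sel ? 4'b1100 : 4'b1010; // Both choices are compared bit by bit.
    // Bits 3 and 0 agree (1 and 0); bits 2 and 1 differ and become x.
    $display("y = %b", y);         // Prints y = 1xx0
  end
endmodule
```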

Math Operators

Verilog supports a small set of math operators including addition (+), subtraction (-) ,
multiplication (*), division (/), and modulus (%); however, the synthesis tool probably
limits the usage of multiplication and division to constant powers of two (in other words, a
left shifter or right shifter will be synthesized) and may not support modulus. The + and -
math operators will instantiate preoptimized adders. Verilog assumes all reg and wire
variables are unsigned.
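
As a sketch of what a constant power-of-two multiply reduces to, the two outputs below should always match. The module name scale_by_8 and its ports are hypothetical, and whether a particular synthesis tool accepts the * operator is an assumption to verify against its documentation.

```verilog
// Sketch only; scale_by_8 and its ports are hypothetical.
module scale_by_8 (din, mul_out, shift_out);
  input  [7:0]  din;
  output [10:0] mul_out, shift_out;

  // Multiplying by a constant power of two...
  assign mul_out   = din * 8;
  // ...is the same as a left shift by three bits. The outputs are
  // three bits wider than the input so that no bits are lost.
  assign shift_out = din << 3;
endmodule
```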

Parameters

Parameters are a useful way of making constants more readable. Parameters are used
only in the modules where they are defined, but they can be changed by higher-level
modules. Parameters cannot be changed at run time, but they can be changed at compile
time. This is useful in cases where a parameter changes the defined number of signals or the
number of instances some construct is used. Not all parameters have to be assigned, but if
there is a positional assignment list, parameters can’t be skipped.
A parameter can also be defined in terms of other constants or parameters. To aid in
reading the code, some people use upper-case characters for parameters.
Listings 1-25 and 1-26 demonstrate Verilog hierarchical names, where a name descends
into the hierarchy, starting at the top, with instance names separated by periods (for
example, u1.reg_width).

Listing 1-25 Parameter Example, Top Module

module top;
reg clk, resetn;
wire [15:0] output_bus1, output_bus2;
wire [7:0] output_bus3;
parameter byte_width = 8;
defparam
    u1.reg_width = 16;                  // This overrides the
                                        // reg_width parameter
                                        // in the u1
                                        // instantiation of
                                        // parm_tst.
defparam
    u2.reg_width = byte_width * 2;
parm_tst u1 (clk, resetn, output_bus1); // Create a version
                                        // of parm_tst with
                                        // reg_width = 16.
parm_tst u2 (clk, resetn, output_bus2); // This version of
                                        // parm_tst also has
                                        // reg_width of 16.
parm_tst u3 (clk, resetn, output_bus3); // This version of
                                        // parm_tst has a
                                        // reg_width of 8.
endmodule

Listing 1-26 Parameter Example, Lower Module

module parm_tst (clk, resetn, output_bus);

input clk, resetn;
parameter reg_width = 8; // This constant can be
// overridden by a parameter
// value passed into the
// module.
parameter byte_signal = 8’d99;
parameter byte_signal_true = 8’hff;
parameter byte_signal_false = 8’h00;

output [reg_width - 1 : 0] output_bus;
reg [reg_width - 1 : 0] output_bus;
reg [7:0] byte_count;

always @ (posedge clk or negedge resetn)

begin
if (~resetn) // Active low asynchronous reset.
begin
output_bus <= 8’b0;
byte_count <= 8’b0;
end
else if (byte_count == byte_signal)
output_bus <= byte_signal_true;
else
begin
output_bus <= byte_signal_false;
byte_count <= byte_count + 1;
end
end
endmodule
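
Listing 1-25 overrides parameters with defparam. The positional assignment list mentioned earlier is the other way to do it, with the values given in #( ) at instantiation. The following is a minimal sketch; delay_reg, param_top, and their ports are hypothetical names invented for the illustration.

```verilog
// Sketch of positional parameter overrides; delay_reg and
// param_top are hypothetical modules.
module delay_reg (clk, d, q);
  parameter width = 8;     // First parameter in the list.
  parameter init  = 0;     // Second parameter in the list.
  input                clk;
  input  [width - 1:0] d;
  output [width - 1:0] q;
  reg    [width - 1:0] q;

  initial q = init;        // Power-up value (simulation).

  always @ (posedge clk)
    q <= d;
endmodule

module param_top (clk, d16, q16, d8, q8);
  input         clk;
  input  [15:0] d16;
  output [15:0] q16;
  input  [7:0]  d8;
  output [7:0]  q8;

  // Only the first parameter is given: width = 16, init stays 0.
  delay_reg #(16) u_wide (clk, d16, q16);

  // To change init, width must be listed too, because positions
  // in the list can't be skipped.
  delay_reg #(8, 8'hff) u_init (clk, d8, q8);
endmodule
```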

Concatenations

Concatenations are groupings of signals or values and are enclosed in curly brackets
{} with commas separating the concatenated expressions, as shown in Listing 1-27. All
concatenated values must be sized. Note the use of [ ] to identify the bit select or register
index. It’s legal to define a register like backwards_reg, but, regardless of the numbers used,
the leftmost definition is always the most significant bit. Usually, you’ll see the largest
number occurring on the left side of the colon (:) unless a one-dimensional array of
variables (like a RAM) is being created.

Listing 1-27 Concatenation Example

module backward;
reg [0:2] backwards_reg;
reg [2:0] test;
/* {1'b0, test, 8'h55} is the same as:
   {1'b0, test[2], test[1], test[0], 1'b0, 1'b1, 1'b0, 1'b1, 1'b0,
   1'b1, 1'b0, 1'b1} */

always @ (backwards_reg)
begin
    test = backwards_reg;
    // The assignment above is equivalent to the assignments below:
    test[2] = backwards_reg[0];
    test[1] = backwards_reg[1];
    test[0] = backwards_reg[2];
end

endmodule
CHAPTER 2

Digital Design Strategies and Techniques

From a design entry point of view, the digital designer describes a design in as high a
level of abstraction as possible. If it were possible to
write one line of code that resulted in 25,000 gates of usable hardware (the day will come
when a typical design is 2,500,000 gates), we’d agree that this is a very efficient way to do
design work. We use sophisticated software tools to translate the abstract top-level design
into a netlist that represents hardware and hardware interconnection.
The top-level design is processed in many steps before it is implemented in FPGA
hardware. Each of these steps will be discussed in more detail later in this chapter.

DESIGN PROCESSING STEPS

• The design is parsed for syntax errors.
• The design is minimized and optimized for the target architecture.
• Recognized structure elements are replaced with library modules.
• Timing and resource requirements are estimated.
• The design is converted to a netlist.
• The design elements and modules are linked together and ‘black box’ modules
  are replaced with library or core module netlists.
• Floorplanning and routing attempts are made until the timing and resource
  constraints are met.
• Timing and resource reports are extracted from the design. A timing-annotated
  netlist is created to support post-route simulation.
• The device configuration files are created.

ANALOG BUILDING BLOCKS FOR DIGITAL PRIMITIVES

There will be many views of a design. The designer must be comfortable changing
between different views of the same project as it evolves into a bitstream file formatted to
configure an FPGA. It helps to keep in mind that all digital design elements are
implemented with analog components. There is no magic device that acts like a NAND
gate. We implement digital logic with analog devices like transistors, diodes, and resistors
as shown in Figures 2-1, 2-2, and 2-3. Transistors can act as digital switches (on or off) or
as analog transfer gates (pass mode). For the transistor impaired, N FETs are ON with a
“one” on the gate, P FETs are ON with a “zero” on the gate.

Figure 2-1 Discrete Logic: Simplified Inverter

Figure 2-2 Discrete Logic: Simplified NAND Gate

Figure 2-3 Discrete Logic: Simplified NOR Gate

USING A LUT TO IMPLEMENT LOGIC FUNCTIONS

Most FPGAs use a multiplexer (MUX) Look-Up Table (LUT) as a basic logic element.
There are two reasons for doing this:

1. The LUT is versatile (any function of the inputs is possible)

2. The LUT is efficiently implemented in silicon.

The MUX control inputs are used as logic inputs, and the multiplex inputs are strapped to
logic levels to implement the desired function. Figure 2-4 illustrates an inverter
implemented using this method.

Figure 2-4 MUX Configured as an Inverter

A hidden advantage to using the MUX LUT as a logic element is provided by the
capacitive loading and the “break-before-make” switching character of the MUX output.
When the inputs change, the output is held and tends to change cleanly without glitching.
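
A behavioral model can make the MUX-as-LUT idea concrete. In the sketch below, which is an illustration and not any vendor's primitive, the logic inputs drive the MUX select lines and the parameter INIT plays the part of the strapped data inputs. The names lut2, lut_demo, and INIT are all hypothetical.

```verilog
// Behavioral sketch of a 2-input LUT built as a 4:1 MUX.
// lut2, lut_demo, and INIT are hypothetical names.
module lut2 (i1, i0, o);
  parameter INIT = 4'b0110;       // Bit n is the output when {i1, i0} == n.
  input  i1, i0;
  output o;

  wire [3:0] table_bits = INIT;   // The "strapped" MUX data inputs.
  assign o = table_bits[{i1, i0}]; // Logic inputs act as the MUX select.
endmodule

module lut_demo (a, b, a_inv, a_xor_b);
  input  a, b;
  output a_inv, a_xor_b;

  // Inverter: output is 1 whenever i0 is 0 (table entries 0 and 2).
  lut2 #(4'b0101) u_inv (1'b0, a, a_inv);

  // XOR: output is 1 for entries 1 and 2 (inputs differ).
  lut2 #(4'b0110) u_xor (a, b, a_xor_b);
endmodule
```

Changing only the four table bits changes the function, which is why any function of the inputs is possible.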

Synthesis Example

Changing back to the digital world, let’s refer to the Overheat Detector design of
Chapter 1, reprinted here as Listing 2-1.

Listing 2-1 Overheat Detection Source Code

module overheat (clock, reset, overheat_in, pushbutton_in,
                 overheat_out);
input clock, reset, overheat_in, pushbutton_in;
output overheat_out;
reg overheat_out;
reg pushbutton_sync1, pushbutton_sync2;
reg overheat_in_sync1, overheat_in_sync2;

// Always synchronize inputs that are not phase related to
// the system clock. (A two-stage synchronizer is assumed here.)
always @ (posedge clock or posedge reset)
begin
    if (reset)
    begin
        pushbutton_sync1  <= 1'b0;
        pushbutton_sync2  <= 1'b0;
        overheat_in_sync1 <= 1'b0;
        overheat_in_sync2 <= 1'b0;
    end
    else
    begin
        pushbutton_sync1  <= pushbutton_in;
        pushbutton_sync2  <= pushbutton_sync1;
        overheat_in_sync1 <= overheat_in;
        overheat_in_sync2 <= overheat_in_sync1;
    end
end

// Latch the overheat output signal when overheat is
// asserted and the user presses the pushbutton.
always @ (posedge clock or posedge reset)
begin
    if (reset)
        overheat_out <= 1'b0;
    // Overheat_out is held forever (or until reset).
    else if (overheat_in_sync2 && pushbutton_sync2)
        overheat_out <= 1'b1;
end

endmodule

The synthesis tool converts our simple 40-line source code into an ugly EDIF netlist
almost 300 lines long. This netlist holds all the design elements and information regarding
the compiler version, target part, and all the design constraints the synthesizer knows about.
This netlist is designed to be interpreted by other computer programs and doesn’t contribute
much usable information to the designer, so we won’t look at an example.
A graphical version (a schematic) of the netlist as shown in Figure 2-5 is more useful
to us in understanding what the synthesis tool created, particularly for the HDL impaired.
Note the correct use of global resources for clock and reset. Because Verilog does not
support direct assignment of hardware resources (the biggest problem for the Verilog FPGA
designer), it is the designer’s job to ensure that these inferences are made correctly.

Figure 2-5 Schematic Extracted from the EDIF Netlist

The synthesis tool has some understanding of the target architecture and can provide
estimates of the design timing and resource requirements (see Listing 2-2). This estimate will
not include black-box modules that are later imported by the FPGA place-and-route tool.

Listing 2-2 Synthesis Resource Estimate

*******************************************************

Cell: overheat View: INTERFACE Library: work

*******************************************************

Number of ports : 5
Number of nets : 13
Number of instances : 10
Number of references to this view : 0

Total accumulated area :

Number of BUFG : 1
Number of CLB Flip Flops : 2
Number of FG Function Generators : 1
Number of IBUF : 1
Number of IOB Input Flip Flops : 2
Number of IOB Output Flip Flops : 1
Number of Packed CLBs : 1
Number of STARTUP : 1

***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource                  Used   Avail   Utilization
-----------------------------------------------
IOs                          5      77      6.49%
FG Function Generators       1     800      0.13%
H Function Generators        0     400      0.00%
CLB Flip Flops               2     800      0.25%

Clock Frequency Report

Clock : Frequency
------------------------------------
clock : 118.8 MHz

DISCUSSION OF DESIGN PROCESSING STEPS

Syntax Checking

The first step is to submit your code to a compiler, simulator, and/or Lint program that will
identify syntax, typing, and other errors. Each program evaluates the code differently. If
there is some confusion about what the syntax check says, it can be very helpful to try
another interpreter. Listings 2-3 through 2-6 illustrate four ways that an error is reported for
a simple problem inserted in the overheat.v code. A semicolon was appended on one of the
if statements as so:

if (overheat_in_sync & pushbutton_sync);

Listing 2-3 Error Reporting by Exemplar LeonardoSpectrum Synthesis Tool

LeonardoSpectrum flagged “C:/Verilog/SourceCode/overheat.v”, line 37: Error, more

than one sequential statement (if statement) in asynchronous process not supported.

Listing 2-4 Error Reporting by Model Technology ModelSim

ModelSim reported an error on the next line; at least in the right neighborhood of the error.

Listing 2-5 Error Reporting by Silos III

Reading “c:\verilog\verilog\overheat.v”
sim to 0
Highest level modules (that have been auto-instantiated):
(overheat overheat

c:\verilog\verilog\overheat.v (38) : error 3.229 : expecting

“end”, or statement, not integer constant

error 2.188 : errors are too severe to simulate

Listing 2-6 Error Reporting by Verilint

Processing source file c:\verilog\verilog\overheat.v

(E363) c:\verilog\verilog\overheat.v, line 37: Syntax error:
1 syntax error
End of interHDL inc. Verilint (R) Version 3.14, 1 errors, 0
warnings

The point of this exercise is to illustrate that different tools give different (and more or
less useful) error messages, and it makes good sense to have several tools available for
checking your code, particularly a Lint tool. Verilint (or a similar Verilog Lint tool) is
useful because it’s fast, easy to use, and catches many different types of errors. This type of
tool can save many hours of frustration. Regardless, this example illustrates how much
trouble a single misplaced semicolon can cause.

Design Minimization and Optimization

The end result of all of our work is a configuration of hardware. This hardware can be an
FPGA, a semicustom FPGA conversion, or some sort of ASIC (standard cell, gate array,
full custom). If the result is an FPGA, the hardware will have an underlying structure that
varies depending on the design approach taken by the FPGA vendor. The logic structure of
a Xilinx 4K family is illustrated in Figure 2-6. We’ll take a closer look at this and other
device architectures in Chapter 7.
The Xilinx 4K family Configurable Logic Block (CLB) is basically two 4-input LUTs
feeding a pair of flipflops. The Verilog code we write gets mapped into this structure by the
synthesis tool.

Figure 2-6 Typical Xilinx Configurable Logic Block Structure

The synthesizer translates the design to a form suitable for the target hardware by:

• Flattening the design into large Boolean equations with one equation for each
  module output, design section output, or register output. Redundant registers
  may be identified and optimized out. For example, the code fragment of Listing
  2-7 might be flattened into the Boolean equation of Listing 2-8.

Listing 2-7 Simple Adder Code

// Simple adder (no carry input).

module adder(clock, reset, a, b, c);
input clock, reset, a, b;
output [1:0] c;
reg [1:0] c;

always @ (posedge clock or posedge reset)
begin
    if (reset)
        c <= 2'b0;
    else
        c <= a + b;    // Adder.
end
endmodule

To create the gate representation of this circuit, create a truth table (see Table 2-1) which
defines all the input and output conditions.

a    b    c[1] (CARRY)    c[0] (SUM)

0    0    0               0
0    1    0               1
1    0    0               1
1    1    1               0

Table 2-1 Simple Adder Truth Table

By inspection, we see that the c[0] (SUM) output can be represented by an XOR gate
and the c[1] output (CARRY) can be represented by an AND gate. A flattened version of
the simple adder circuit is illustrated in Figure 2-7.

Listing 2-8 Simple Adder Boolean Equations

c[0] <= a ^ b; // ^ is the Verilog Boolean XOR operator.

c[1] <= a & b; // & is the Verilog Boolean AND operator.

Figure 2-7 Flattened Schematic of Simple Adder Circuit

The logic mapped into the top F2_LUT to create c[0] (ix72) is (~I0 * I1) + (I0 * ~I1),
equivalent to the XOR function. The logic mapped into the lower F2_LUT to create c[1]
(ix71) is (I0 * I1).
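
The flattened equations can be checked against the original adder by trying all four input combinations. The short self-checking testbench below is a sketch for that purpose; the module name adder_eq_check and its variables are hypothetical.

```verilog
// Simulation-only sketch; adder_eq_check is a hypothetical module.
module adder_eq_check;
  reg       a, b;
  reg [1:0] sum;
  integer   i;

  initial begin
    for (i = 0; i < 4; i = i + 1)
    begin
      {a, b} = i;          // Walk through 00, 01, 10, 11.
      sum    = a + b;      // Two-bit result, so the carry is kept.
      // Carry should equal a AND b; sum bit should equal a XOR b.
      if (sum !== {a & b, a ^ b})
        $display("MISMATCH: a=%b b=%b sum=%b", a, b, sum);
      else
        $display("ok:       a=%b b=%b sum=%b", a, b, sum);
    end
  end
endmodule
```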

• Minimizing the Boolean equations. This is done by recognizing and removing
  redundant logic terms (even if, for controlling fanout or providing hazard
  coverage, we want them to remain in the design).

The synthesizer can’t recognize redundant logic that crosses register boundaries
(though it may recognize and delete redundant registers). If there is any chance for logic
minimization, this must be part of the design input. The best opportunities for logic
reduction are created and implemented by the designer.

• Recognized structure elements are replaced with selected modules. For
  example, the synthesizer might recognize a construct like a <= a - 1 as a down
  counter and replace the logic with a predefined circuit optimized for the target
  architecture for either area or speed.

• Timing and resource requirements are estimated. The compiler can only
  estimate the design timing and resource requirements. The manufacturer may
  have made changes to the timing parameters (the device manufacturer will
  always be ahead of other companies, who rely on the manufacturer for data).
  Another reason the timing estimate may not be accurate is that the library and
  black-box elements are not yet part of the design netlist. These elements are
  inserted when the design is linked and the final netlist is created and flattened.

• The design is converted to a netlist. There are various flavors of netlists, but the
  most common format at present is EDIF.

• The design elements and modules are linked together and ‘black-box’ modules
  are replaced with library module netlists. The netlist created by the compiler
  may be flattened (all the modules merged into one netlist) or the hierarchy may
  be maintained with the modules kept separate. With the hierarchy maintained,
  the design is easier for the designer to understand as it appears more like it was
  created.

• Floorplanning and routing attempts are made until the timing and resource
  constraints are met. Floorplanning assigns elements from the device logic to the
  designed circuitry. The place and route of the design is very much like the
  place and route of a printed circuit board. The efficiency of routing and the
  resulting speed of the routed design depend on the arrangement of the module
  elements, which affects the interconnect between modules. There are limited
  routing resources in an FPGA. When the routing gets dense (congested), long
  routing paths may be necessary to complete a signal path. This slows the design
  and causes routing problems for signals that must travel across or around the
  congested area. Some FPGA vendors advertise the capability of 100% routing
  of all logic, but densities of 65% (Altera) and 85% (Xilinx) are more
  reasonable targets. Manual floorplanning can increase the usable logic density.

• Timing and resource reports are extracted from the design. A timing-annotated
  netlist may be created to support post-route simulation. A common format for a
  timing-annotated netlist is SDF (Standard Delay Format), as illustrated in
  Listing 2-9. This file includes estimated gate delays based on the FPGA design
  rules.

Listing 2-9 Example of an SDF Netlist

(DELAYFILE
(SDFVERSION “2.0”)
(DESIGN “adder”)
(DATE “08/31/99 09:21:34”)
(VENDOR “Exemplar Logic, Inc., Alameda”)
(PROGRAM “LeonardoSpectrum Level 3”)
(VERSION “v1999.1d”)
(DIVIDER /)
(VOLTAGE)
(PROCESS)
(TEMPERATURE)
(TIMESCALE 1 ns)
(CELL
(CELLTYPE “F2_LUT”)
(INSTANCE ix72)
(DELAY
(ABSOLUTE
(PORT I0 (::3.25) (::3.25))
(PORT I1 (::3.25) (::3.25)))))
(CELL
(CELLTYPE “F2_LUT”)
(INSTANCE ix71)
(DELAY
(ABSOLUTE
(PORT I0 (::3.25) (::3.25))
(PORT I1 (::3.25) (::3.25)))))
(CELL
(CELLTYPE “BUFG”)
(INSTANCE clock_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::0.00) (::0.00)))))
(CELL
(CELLTYPE “OFDX”)
(INSTANCE reg_c_1)
(DELAY
(ABSOLUTE
(PORT C (::3.25) (::3.25))
(PORT D (::2.77) (::2.77)))))
(CELL
(CELLTYPE “OFDX”)
(INSTANCE reg_c_0)
(DELAY
(ABSOLUTE
(PORT C (::3.25) (::3.25))
(PORT D (::2.77) (::2.77)))))
(CELL
(CELLTYPE “IBUF”)
(INSTANCE reset_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::2.77) (::2.77)))))
(CELL
(CELLTYPE “IBUF”)
(INSTANCE a_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::2.77) (::2.77)))))
(CELL
(CELLTYPE “IBUF”)
(INSTANCE b_ibuf)
(DELAY
(ABSOLUTE
(PORT I (::2.77) (::2.77)))))
(CELL
(CELLTYPE “STARTUP”)
(INSTANCE ix56)
(DELAY
(ABSOLUTE
(PORT GSR (::2.77) (::2.77)))))
)

• The device configuration files are created. The download file can be
  programmed into a serial EPROM, downloaded through a serial or parallel
  cable, stored in memory and written to the device by a microprocessor, or held
  in a stand-alone EPROM with address and data control generated by the FPGA
  itself. The device might be ISP (In-System Programmable) or a reprogrammable
  type (plugged into a programmer, programmed, then installed in the destination
  design).

Shifty Logic Circuits

Many people, when asked to draw a two-input NOR Gate, will draw a circuit that looks like
Figure 2-8. In my experience this circuit seems shifty or flaky. This is not just a sign of
mental illness. The output is very likely to be glitchy when the inputs change. We’re digital
designers and we want the analog aspects of our design to be minimized.
Figure 2-9 shows a typical circuit where the simple NOR gate might be used. The
resistance and capacitance of Figure 2-9 do not have to be discrete devices on a circuit
board; they could be parasitic values associated with signal routing and loading.

Figure 2-8 Combinational Two-Input NOR Gate

Figure 2-9 Simple Combinational Circuit

The oscilloscope trace shown in Figure 2-10 demonstrates one problem with the
combinational circuit. One input is strapped low, so the output should just be the inverse of
the other input, right? Where did those nasty glitches on the output come from? The input is
a noisy signal that crosses the input threshold (where the input is between being recognized
as one or zero by the gate input) very slowly. The RC network just exaggerates the problem
and is exactly the kind of thing you see when some bonehead tries to filter out the switch
contact bounce. The right way to filter switch bounce is to use feedback (hysteresis).

Figure 2-10 Combinational Two-Input NOR Gate Output Transients

Fine, you say. You’ll make sure that the input always changes quickly to minimize
glitches. So, you invent a circuit that switches infinitely fast (you can store this circuit on
the same shelf as your perpetual motion machine). Anyway, that’s still not good enough,
because there is another cause of glitches. When the inputs are changing at nearly the same
time, again the output can be indeterminate. The circuit of Figure 2-11 demonstrates this
problem. A resistor-capacitor (RC) network is added to delay the input signal. Again, the
output has nasty transients. So, your design won’t use RC networks between inputs like this.
Well, the RC time delay might be caused by mixed routing paths between inputs (signal
skew) or by signal loading where each signal destination contributes a capacitive load. The
R part of Figure 2-11 represents the sum of the source and routing impedance (proportional
to route length) and the C part represents net loading (proportional to the number of loads
on the net). The only control you have of this problem is making sure that signals have low
fanout (a measure of the signal loading represented by destination logic elements where
each gate load is counted as a fanout of 1). Most synthesis tools allow a fanout constraint to
be defined to control loading (signals are split and driven by separate buffers).

Figure 2-11 Combinational Two-Input AND Gate with RC Network

When I am asked to draw a two-input AND Gate, it looks like Figure 2-12. The
difference is the addition of a synchronizing flipflop. The output of this circuit will not be
glitchy if synchronous logic rules are followed and the setup/hold requirements for the
flipflop are met (see the next section for a discussion of setup and hold times). This is
particularly safe if the input signal is synchronous, too. If the signal at the D input of the
flipflop is stable in time to meet the setup-time requirement and maintained beyond the
hold-time requirement, then all is well.

Figure 2-12 Synchronous AND Gate
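
In Verilog, a registered AND gate in the spirit of Figure 2-12 might be sketched as follows. The module name, the port names, and the asynchronous reset are illustrative assumptions; the figure itself does not name its signals.

```verilog
// Registered (synchronous) two-input AND gate.
// All names here are hypothetical, chosen only for illustration.
module sync_and (clk, reset, a, b, y);
input  clk, reset, a, b;
output y;
reg    y;

// The AND term is sampled only at the rising clock edge, so a glitch
// that settles before the setup window never reaches y.
always @ (posedge clk or posedge reset)
begin
    if (reset)
        y <= 1'b0;
    else
        y <= a & b;
end
endmodule
```

The combinational AND still glitches internally; the flipflop simply refuses to look at it except at the clock edge.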

SYNCHRONOUS LOGIC RULES

Metastability

Literally, metastability means “beyond settled,” or something other than steady. If a signal
is metastable, it is not stable: it is neither 1 nor 0, or it oscillates. It will eventually
resolve to a 1 or 0, but we don’t know which. As a digital designer, I hope this idea keeps
you up late at night.
Metastability occurs when a clock edge is random with respect to a change of an
asynchronous input signal. If the relation of the clock and signal is truly random, then it is
inevitable that an input change will occur so close to a clock edge that the output is
unpredictable. This problem manifests itself as a flipflop output that takes a long time to
resolve, often much longer than the typical clock-to-Q output delay listed in the flipflop
datasheet.

Figure 2-13 Metastable Output

Figure 2-13 illustrates the metastability problem; if SIGNAL changes within the
setup/hold window of the flipflop, the output is unknown for a period. How long is this
period? It depends on the characteristics of the flipflop and its environment: how fast is the
flipflop, how much gain does it have, and how much noise is present in the system. How big
is this problem? It depends on how often the input changes and how wide the setup/hold
window is compared to the clock period.
We’ll never get to zero metastability, but hopefully the statistical probability of
metastability will be microscopic. I don’t know about you, but if I can get the mean time
between failures in my design to 100,000 years or so, that’s good enough.
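
To get a feel for where a number like 100,000 years comes from, a commonly quoted first-order model of synchronizer reliability is shown below. It is not derived in this text, and the symbols are assumptions introduced for illustration: t_r is the time the flipflop is given to resolve before its output is used, tau and T_0 are constants of the flipflop technology, and f_clk and f_data are the clock rate and the rate of asynchronous input changes.

```latex
\mathrm{MTBF} \;=\; \frac{e^{\,t_r/\tau}}{T_0 \, f_{clk} \, f_{data}}
```

Because t_r sits in the exponent, giving the flipflop even a little more time to settle improves the mean time between failures enormously, while a faster clock or a busier input makes it worse only linearly.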
The closest we will get to a solution to the metastability problem is to use
synchronous design techniques. This means a synchronizing clock is used to qualify, gate,
or trigger a circuit. The time between clock edges is used to allow signals to propagate and
settle. It’s like a game; if you can get your signal to the next flipflop before the next clock
setup time, then you win.

Setup and Hold Time

For the output of a flipflop to be predictable (not metastable), the inputs must meet the
setup and hold time requirement of the flipflop.

• The setup time, often represented as Tsu, is the time period, BEFORE the edge
  of the synchronizing clock, when the input is required to be stable. If the setup
  time is violated, the output value is indeterminate.
• The hold time, often represented as Th, is the time period, AFTER the
  synchronizing clock edge, when the input is required to be stable. If the hold
  time is violated, again the output value is not guaranteed.

The setup and hold requirement comes from the analog nature of the flipflop design.
The flipflop uses feedback implemented with cross-coupled gates to hold a state. It takes
time for the gates to achieve their stable state. In a perfect world, an edge-triggered flipflop
would change states exactly synchronous with the clock edge. The clock edge would be
infinitely fast, and the flipflop would change states instantaneously. Real World clocks have
rise/fall times, and flipflops require stable inputs during the setup/hold time to achieve a
stable output state.
The flipflop metastability problem will never go away as long as a signal has a
random phase relation to the flipflop clock. However, IC manufacturers have made great
progress in closing the metastability window (this window is the setup plus hold time
window). By increasing the speed of the flipflop, we make the metastability window
narrower and less of a problem. The fact is, most problems that designers blame on
metastability are related to asynchronous design technique. Each FPGA input should drive
one and exactly one flipflop. The output of this single flipflop can be used to drive another
flipflop for added security or can be used to drive the rest of your synchronous system.
When an asynchronous input drives multiple flipflops, and the input changes near the clock
edge, some flipflop outputs will change and some will not. This is not a metastability
problem; this is an asynchronous input problem!
Figure 2-14 illustrates this. The RC delays represent signal delays due to routing and
load inside the FPGA. We want all three flipflop outputs to be the same, but, depending on
the phase of the input signal, sometimes the outputs will not be the same. If we synchronize
the input with a single flipflop and do not violate its setup/hold time requirement, then all
outputs are assured to be the same. That’s what we want!

Figure 2-14 Asynchronous Input Problem

How can we absolutely assure that the inputs are not going to change during the setup
and hold period of the flipflop? The answer is an important part of the solution for the
question: “How can I create a nearly trouble-free design?”

Always synchronize your inputs! This means an asynchronous input drives exactly one flipflop.
The output of this flipflop can be safely used to drive the rest of your synchronous circuitry.
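
The rule can be sketched in Verilog as shown below. This is a minimal illustration, not a listing from the original design; the module and signal names are hypothetical. The asynchronous input touches exactly one flipflop, and everything else listens to that flipflop's output.

```verilog
// One synchronizing flipflop per asynchronous input.
// Names are illustrative assumptions.
module sync_input (clk, async_in, q1, q2, q3);
input  clk, async_in;
output q1, q2, q3;
reg    sync_in;         // the ONLY flipflop driven by async_in
reg    q1, q2, q3;

always @ (posedge clk)
begin
    sync_in <= async_in;    // single point of synchronization

    // Downstream registers see the synchronized copy, so they
    // always agree with each other one clock later.
    q1 <= sync_in;
    q2 <= sync_in;
    q3 <= sync_in;
end
endmodule
```

Compare this with Figure 2-14, where the raw input fans out to three flipflops and each one is free to make its own decision about an edge that lands near the clock.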

Figure 2-15 Synchronous AND Gate with Synchronous Inputs

Figure 2-15 shows a synchronous AND gate with synchronous inputs.

This goes a long way toward solving our problems, but the fussy among us might be
asking why it works. It looks like the output of the synchronizing flipflops changes at a
clock transition; isn’t that a problem for the output flipflop? Before we look at that, let’s
consider a common circuit. If you were asked to design a divide-by-two circuit, you might
draw something like Figure 2-16.

Figure 2-16 Divide-by-Two Circuit
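
A Verilog sketch of a divide-by-two stage of this kind might look like the following. The names and the asynchronous reset are assumptions added so the flipflop starts in a known state during simulation.

```verilog
// Divide-by-two: the inverted output is fed back to the D input.
// Names are hypothetical.
module div_by_two (clk, reset, q);
input  clk, reset;
output q;
reg    q;

always @ (posedge clk or posedge reset)
begin
    if (reset)
        q <= 1'b0;
    else
        q <= ~q;        // toggle on every rising clock edge
end
endmodule
```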

This circuit could hardly be simpler; the inverting output is fed back to the D input,
and the output changes state on every other clock edge. It is interesting to think about a
situation where this circuit does not work. Let’s assume that the device technology has some
easy numbers to work with, so all delays are 1 nsec.

Flipflop Specification:
Flipflop Minimum Input Setup Time: 1 nsec
Flipflop Minimum Input Hold Time: 1 nsec
Clock-to-Output Delay (Maximum): 1 nsec
Maximum Propagation Delay Time (Q output to D input) 1 nsec

For repeatable results, the D input must be stable 1 nsec before the clock edge and
must remain stable after the clock edge for 1 nsec. The flipflop output is guaranteed to reach
its final value less than 1 nsec after the clock edge. It takes less than 1 nsec for the signal to
propagate from the Q output back to the D input. At what clock frequency does this circuit
begin to fail?
Successive rising edges of the clock must be spaced at least clock-to-output delay + routing
delay + setup time apart, or 3 nsec. This means the input clock had better not have a frequency
greater than 333.333 MHz. This is a high frequency, most likely achievable only with an ASIC using
today’s technology. An FPGA will have longer (possibly much longer) delays and will have
correspondingly lower maximum clock frequencies.
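
Written out as arithmetic, using the 1 nsec numbers from the specification above (the symbol names are shorthand introduced here, not vendor parameter names):

```latex
T_{min} = t_{co} + t_{pd} + t_{su} = 1\,\text{ns} + 1\,\text{ns} + 1\,\text{ns} = 3\,\text{ns}
\qquad
f_{max} = \frac{1}{T_{min}} = \frac{1}{3\,\text{ns}} \approx 333.3\,\text{MHz}
```

Notice that the hold time does not appear in the maximum-frequency calculation; it constrains the shortest path, not the longest one.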
The delays for the device elements are provided by the device vendor. The number of
delays can be mind-boggling. An FPGA has a complicated mix of delays; clock to Q,
routing delays through switch elements, delays through signal multiplexers (look-up tables),
and delays proportional to signal loading, among others. For example, for a 4000XL device,
Xilinx specifies 41 timing parameters in 4 speed grades, for a total of 164 individual timing
numbers. Memorize them; there will be a test later. Fortunately, the compiler knows these
published delays and will calculate the totals for your circuit design. Let’s consider another
simple circuit, two flipflops in series as shown in Figure 2-17.

Figure 2-17 Two Flipflops Connected in Series

Again, this is deceptively simple. How can this circuit work reliably? What if the
minimum clock-to-output delay (a value that is rarely specified, but often estimated as 25%
of typical) for U1 is less than the hold-time requirement for U2? So, you tear up your data
book looking for the hold-time requirement, and with a sigh of relief (if you’re lucky) you
see that it is specified as zero. The suspicious engineer will say, hold on a nanosecond, how
can it be zero? All the logic circuitry we’ve ever looked at requires some hold time greater
than zero. And that’s correct. It has to be so, but the designer of the FPGA logic cell has
done some work for us and has put in delays to guarantee that the logic path (the logic in
series with the D input) has a longer delay than the clock path. Essentially, this is done by
adding the hold time to the setup time, then delaying the data enough to satisfy this
extended setup time. With reference to the clock edge, the input signal takes longer to arrive
at the D input, but it also stays around longer. Even if the input signal changes coincident
with the clock edge, the data-path delay inside the logic cell will make sure the old value
stays valid long enough to satisfy the buried hold-time requirement of the flipflop. This
simplifies the analysis of the FPGA design and assures that a circuit like Figure 2-18 will function.

Figure 2-18 Two Flipflops Connected in Series with Internal Delays

In summary, the FPGA chip designer has created a logic cell that assures the circuit of
Figure 2-17 will work. ASIC designers don’t have this luxury and must account for delay
and tolerance build-ups in their design. We do not have this luxury when dealing with
signals from outside the FPGA. The signal characteristics of external signals must be
examined and understood completely. If there is any sign of slow or glitchy signals, then we
will implement circuits with hysteresis (like a Schmitt trigger) and will use a two-flipflop
synchronizing circuit to minimize metastability.
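
A two-flipflop synchronizing circuit of the kind mentioned above can be sketched in a few lines of Verilog. The names are hypothetical; the structure is the point.

```verilog
// Two-flipflop synchronizer for one asynchronous input.
// Names are illustrative assumptions.
module sync_2ff (clk, async_in, sync_out);
input  clk, async_in;
output sync_out;
reg    meta;        // first stage: may occasionally go metastable
reg    sync_out;    // second stage: gives 'meta' a full clock to settle

always @ (posedge clk)
begin
    meta     <= async_in;
    sync_out <= meta;
end
endmodule
```

The price is one extra clock of latency on the input. Only sync_out should be used by the rest of the design; nothing else should listen to meta.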
Hysteresis is created by adding positive feedback to the input. The idea is that when
the output switches, the feedback adds to the input to help prevent oscillation. The amount
of feedback should be slightly greater than the noise on the input signal. Xilinx doesn’t widely advertise
this information, but all their FPGA inputs have a few hundred millivolts of hysteresis; this
makes their inputs friendly to noisy environments.
To complete our analysis, we must consider clock-skew. In a perfect world, all
flipflops in our design will receive clock edges that are exactly synchronous. The first thing
to understand is the clock-skew problem is not related to the operating frequency of your
design! Even a slow design can have clock-skew problems.
Let’s expand the circuit of Figure 2-17 to show clock-skew, see Figure 2-19. Imagine
that the flipflops are located far apart in the design and the second flipflop clock is delayed
from the clock ‘seen’ by the first flipflop.

Figure 2-19 Two Flipflops Connected in Series with Clock Skew

What is the problem? Let’s call t1 the clock-to-output delay period and t2 the
propagation delay of the signal across the device to the D input of the second flipflop. We
are hoping (and perhaps assuming) the value clocked into U2 is the old value of the Q
output of U1. If the skew of the clock is too long, then we’ll get the new value at U1-Q—or
worse, we’ll violate the setup or hold time requirement of U2 (depending on how much
delay occurs) and get an unknown output from U2. We’re digital engineers; we don’t like
unknowns. What is the solution? Fortunately, the FPGA designer provided low-skew clock
networks carefully crafted to assure that the longest skew of the clock anywhere across the
device is shorter than the shortest sum of clock-to-Q and signal routing propagation times. If
you can use a global low-skew clock network, then there’s no problem. If you create an
asynchronous design by using a routed clock (one that travels through random logic in the
design), a gated clock, a MUX’d clock, or are designing an ASIC (where the clock networks
are all custom designed), then you are responsible for assuring that this requirement is met.
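
Using t1 and t2 as defined above, the requirement the low-skew network is built to meet can be written as an inequality. The second line adds the hold time of U2, which drops out when the logic cell presents a zero hold time as described earlier; the subscripted names are shorthand introduced here.

```latex
t_{skew(max)} \;<\; t_{1(min)} + t_{2(min)}
\\[4pt]
t_{skew(max)} + t_{h} \;\le\; t_{1(min)} + t_{2(min)}
```

Clock frequency appears nowhere in either line, which is why slowing the clock down does not cure a skew problem.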

Handling External Signals

We must also carefully analyze the situation where the FPGA designer has no control of one
or more of the signals. Consider the case where an input source, represented by the flipflop
U1 in Figure 2-17, is off-chip and is connected to a flipflop clocked by the FPGA clock. If
U1 is a fast device, it is very possible that a race condition, which means signals arrive at
synchronizing flipflops at different times, will occur. The race problem occurs when signals
are changing at the input of a gate at the same time. This results in an unknown output.
We’re digital designers; we like 1’s and 0’s. Unknown output states make us neurotic and
twitchy.
This signal-race situation is much worse if there is no input-synchronizing flipflop in
the input, because the race condition propagates across the design to all the circuits sensitive
to the inputs. Very bad. At least, if there is an input-synchronizing flipflop, the only
setup/hold time requirement is on that specific flipflop; once the timing is worked out for
that device, the signal is well conditioned for operation inside the design. In a case like this,
the easy solution is to make sure that the external device runs off the same clock as the logic
synchronizing clock inside the design and is a slow device so the output can’t change fast
enough to cause a race condition. Proving this can be a problem, because chip
manufacturers almost never provide a minimum clock-to-Q output time. This is good for the
manufacturers because it allows them to improve the IC process (make the device smaller,
faster, and cheaper to build) without changing the data sheet. It’s bad for the designer using
the parts who is diligently trying to do a worst-case timing analysis.

Using Alternate Clock Edges

A solution might be to clock external devices on the clock edge opposite to the one
used inside the FPGA. Xilinx allows a flipflop to be clocked by either the rising or falling
clock edge. Careful analysis must be done to assure that the timing works out. The clock
skew plus setup time must now fit within half a clock period, compared to the full clock period
allowed internal to the FPGA/ASIC design. A schematic of a circuit that uses the alternate
clock edge is shown in Figure 2-20, with the resulting timing waveforms in Figure 2-21.
Keep in mind that clocks never have perfectly equal high and low periods and these
variations in duty cycle will subtract from the available flipflop setup time margin.

Figure 2-20 Two Flipflops Connected in Series Using Alternate Clock Edges

Figure 2-21 Two Flipflops Connected in Series Using Alternate Clock Edges, Timing
Diagram
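
The alternate-edge idea can be sketched in Verilog as shown below. For illustration only, both flipflops are modeled in one module; in the situation described above the first one would be the external device. All names are assumptions.

```verilog
// Launch on the falling edge, capture on the rising edge.
// 'stage1' stands in for the external device U1; names are hypothetical.
module alt_edge (clk, d, q);
input  clk, d;
output q;
reg    stage1, q;

// U1 stand-in: data changes on the falling clock edge.
always @ (negedge clk)
    stage1 <= d;

// U2: samples on the rising edge, roughly half a period later,
// so the data is stable well away from the sampling edge.
always @ (posedge clk)
    q <= stage1;
endmodule
```

The half period between launch and capture is the whole timing budget, which is why duty-cycle variation matters here.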

CLOCKING STRATEGIES

THE MOST IMPORTANT DECISION THE FPGA DESIGNER MAKES!

The single most important decision the designer makes about a design is the clocking strategy.
This must be considered carefully. An error in clocking can doom a design. There is something
worse than making a bad decision on the clocking strategy: not thinking about the clocks at all.

We’ve already decided that we want to create a synchronous design. This means there is at
least one clock (preferably exactly one clock). Still, decisions remain about the clocking
strategy used in the design. For the most trouble-free design, use one master clock. But what
if the design has different clock domains it must interface with? What if using a single clock
results in too much power consumption? There is no one answer to this problem; the answer
depends on what you’re trying to accomplish. Here are some suggested clock strategies:

1. When designing an ASIC, if power consumption is NOT an issue, it’s best to
use one master clock on all flipflops and replace lower-frequency clocks with
clock enables to qualify logic wherever it makes sense to run at a lower
frequency. Otherwise, just run at the master clock rate and be happy. This is
the ideal situation. The design has one clock, which results in the simplest
timing analysis. This design will be the easiest on which to use automated
analysis and test tools. ATPG (Automatic Test Pattern Generation) works
best on this type of design.
2. When designing an ASIC and power IS an issue (for battery operation or
where the package power dissipation is a problem, for example), running
flipflops with lower-frequency clocks in selected parts of the design is okay.
The power consumed by a circuit is proportional to the clock frequency and
the number of gates switching at the clock frequency. To reduce power
consumption, make the design smaller and/or reduce the clock frequency.
Minimize the amount of circuitry running at high speed. You are forced to
deal with the problem of synchronizing signals crossing clock domains in
exchange for reduced power consumption.
3. When designing an ASIC, but using FPGAs as prototypes, the desire would
be to run with one master clock. This ties you to FPGAs that are fast and
ASIC-like (Quicklogic, Actel, or some other antifuse One-Time
Programmable type).
4. When designing an ASIC, and using FPGAs as prototypes, but using slow
(that’s slow compared to ASIC processes) SRAM-based devices (like Xilinx
or Altera) it would be desirable to run all flipflops off a single master clock,
but you will probably be forced to run modules at the lowest possible speed
to get the design to work. Drive flipflops with multiple clocks (divided or
from external lower-frequency clocks), partition the design intelligently by
creating the clocks in a central clock-generator module and minimizing the
interconnect between clock domains, and make sure signals that cross clock
domains are properly synchronized. Logic qualified with lower-frequency
clocks used as clock enables will also work.
5. When doing a fast (by FPGA terms) FPGA design (which may become an
FPGA-to-ASIC conversion), the best method is to use up the global clock
resources, partition the design to minimize signals crossing clock domains,
and synchronize signals properly. The FPGA-conversion folks routinely deal
with multiple clocks. Some attention must be paid, but it’s a well-worn path.
6. When doing a slow FPGA design, thank your lucky stars and pick the
method that works for you; either strategy is fine.
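
As a sketch of the first strategy, here is one way a divided clock might be replaced with a clock enable. Everything runs from the master clock; a counter produces a one-clock-wide pulse that qualifies the slower logic. The DIVIDE parameter (assumed to be between 1 and 256 for the 8-bit counter used here) and all signal names are hypothetical.

```verilog
// Replace a divided clock with a clock enable.
// DIVIDE and all names are illustrative assumptions.
module enable_gen (clk, reset, data_in, data_out);
parameter DIVIDE = 4;           // slow logic updates at clk / DIVIDE
input  clk, reset, data_in;
output data_out;
reg    data_out;
reg  [7:0] count;
wire   slow_enable = (count == DIVIDE - 1);   // one clk wide

// Free-running counter that wraps every DIVIDE clocks.
always @ (posedge clk or posedge reset)
begin
    if (reset)
        count <= 8'd0;
    else if (slow_enable)
        count <= 8'd0;
    else
        count <= count + 8'd1;
end

// Same master clock; the enable decides when the register updates.
always @ (posedge clk or posedge reset)
begin
    if (reset)
        data_out <= 1'b0;
    else if (slow_enable)
        data_out <= data_in;
end
endmodule
```

There is still only one clock domain, so there is nothing to synchronize between the fast and slow parts of the design.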

Clock Enable

Verilog HDL does not support dedicated clock-enable signals. The hardware (FPGA or
ASIC) may have dedicated clock-enable resources, but Verilog does not give direct control
of this signal assignment. In the meantime, synthesis vendors will provide this support
through compiler directives. This means that code like Listing 2-10, depending on whether
the target hardware has dedicated clock-enable support, might synthesize in different ways.
One way a design might be interpreted by the synthesizer is illustrated in Figure 2-22 where
a clock-enable feature is available in the FPGA logic block design.

Listing 2-10 Clock-Enable Example

module clock_en(out,in,clock,clock_enable1,clock_enable2,reset);

output out;
input in, clock, clock_enable1, clock_enable2, reset;
reg out;

always @ (posedge clock or posedge reset)
begin
    if (reset)
        out <= 0;
    else if (!clock_enable1)
        out <= out; // Hold output if not enabled.
    else
        out <= (in & clock_enable2);
end
endmodule

Figure 2-22 Synthesized Clock Enable

Some logic may get included in the logic that drives the clock enable, as shown in
Figure 2-23. Note that the logic is not exactly the same; the point is that the synthesizer may
insert added logic into the clock-enable path.

Figure 2-23 Synthesized Clock Enable (Mixed)

The next example, Figure 2-24, shows a clock-enable implemented in a technology
that does not have a clock-enable feature in the logic block. The clock-enable is created
with combinational feedback that holds the output when clock_enable is not asserted.

Figure 2-24 Synthesized Clock Enable (Routed)
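
Written out explicitly, the routed form amounts to a 2:1 multiplexer in front of the D input. The sketch below is one way to express that structure by hand; the module name and the internal signal d are assumptions, and a synthesizer is free to build it differently.

```verilog
// Clock enable built from combinational feedback.
// Names are hypothetical.
module clock_en_routed (out, in, clock, clock_enable, reset);
output out;
input  in, clock, clock_enable, reset;
reg    out;

// New data when enabled; otherwise the flipflop's own output
// is recirculated so the stored value is held.
wire   d = clock_enable ? in : out;

always @ (posedge clock or posedge reset)
begin
    if (reset)
        out <= 1'b0;
    else
        out <= d;
end
endmodule
```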

LOGIC MINIMIZATION

A synthesizer can recognize and remove redundant logic. For example, the code fragments
of Listings 2-11 and 2-12, are equivalent.

Listing 2-11 Redundant Logic Example 1 Code Fragment

input test1, test2, test3;

output sample;

assign sample = ((test1 & test2 & test3) | (test1 & !test2 & test3)
              | (test1 & test2 & !test3));

Listing 2-12 Redundant Logic Example 2 Code Fragment

input test1, test2, test3;

output sample;

assign sample = (test1 & (test2 | test3));

The logic is minimized even if the designer intentionally put in the redundant logic to
provide hazard coverage. Hazard coverage is the addition of redundant logic to cover up
race conditions. This text will never suggest using hazard coverage; always use
synchronous design techniques to avoid hazards.
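
If you would rather not take the equivalence of the two fragments on faith, a small simulation-only testbench can walk through all eight input combinations. This is a sketch; the module name and the sample_long/sample_short signals are hypothetical.

```verilog
// Exhaustive comparison of the two equation forms (simulation only).
module redundancy_check;
reg  test1, test2, test3;
wire sample_long, sample_short;
integer i, errors;

assign sample_long  = (test1 & test2 & test3) | (test1 & !test2 & test3)
                    | (test1 & test2 & !test3);
assign sample_short = test1 & (test2 | test3);

initial
begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1)
    begin
        {test1, test2, test3} = i;      // low 3 bits of the loop index
        #1;                             // let the assigns settle
        if (sample_long !== sample_short)
            errors = errors + 1;
    end
    $display("mismatches between the two forms: %0d", errors);
    $finish;
end
endmodule
```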
The compiler can also recognize equivalent logic equations. An alternate form of an
equation might use less area or fewer levels of logic when implemented in an FPGA. The
compiler will try alternate equation forms and use the equation that best meets the design
requirements.

DeMorgan’s Theorems

~(a & b) = (~a | ~b);

~(a | b) = (~a & ~b);

Schematically, DeMorgan’s law looks like Figure 2-25.

Figure 2-25 Schematic Form of DeMorgan’s Law

There is a corollary to the AND/OR form that can be applied to the exclusive-OR
form:

a ^ b = ~a ^ ~b;
~(a ^ b) = ~a ^ b = a ^ ~b;

AND/OR functions are duals of each other (like division is the dual of multiplication).
DeMorgan’s law defines the conversion between the AND/OR equation forms.
The compiler can also manipulate equations using the laws of Boolean algebra. These
laws are:

Commutative Law

a | b = b | a;

Associative Law

a | (b | c) = (a | b) | c;

Distributive Law

a & (b | c) = (a & b) | (a & c);
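
The identities above are easy to check by brute force in a simulator. The testbench below is a sketch with hypothetical names; it tries every combination of three single-bit inputs and counts any case where the two sides of an identity disagree.

```verilog
// Brute-force check of the Boolean identities (simulation only).
module identity_check;
reg a, b, c;
integer i, errors;

initial
begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1)
    begin
        {a, b, c} = i;
        // DeMorgan's theorems
        if (~(a & b) !== (~a | ~b)) errors = errors + 1;
        if (~(a | b) !== (~a & ~b)) errors = errors + 1;
        // Exclusive-OR corollaries
        if ((a ^ b)  !== (~a ^ ~b)) errors = errors + 1;
        if (~(a ^ b) !== (~a ^ b))  errors = errors + 1;
        if (~(a ^ b) !== (a ^ ~b))  errors = errors + 1;
        // Distributive law
        if ((a & (b | c)) !== ((a & b) | (a & c))) errors = errors + 1;
    end
    $display("identity mismatches: %0d", errors);
    $finish;
end
endmodule
```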

Because the designer uses synchronous techniques and doesn’t clog up the design
with complicated structures between registers, the ability of the synthesis tool to extract
redundant logic is limited. There may be simpler logic, but the synthesizer will not be able
to extract it if the logic is spread across register boundaries. Examine Figure 2-26, which
implements the logic of Listing 2-11 with synchronous techniques. The synthesizer will not
find the redundancy! Except for some propagation delays, the two circuits shown in Figure
2-26 are equivalent.

Figure 2-26 Redundant Logic Terms Spread across Register Boundaries
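
One plausible reading of this situation in Verilog is sketched below: each product term of Listing 2-11 is captured in its own register before the terms are ORed together. The figure is not reproduced here, so the structure and the names term_a, term_b, and term_c are assumptions.

```verilog
// Redundant product terms separated by registers.
// Structure and names are illustrative assumptions.
module spread_terms (clk, test1, test2, test3, sample);
input  clk, test1, test2, test3;
output sample;
reg    term_a, term_b, term_c, sample;

always @ (posedge clk)
begin
    // Each product term lands in its own flipflop...
    term_a <= test1 &  test2 &  test3;
    term_b <= test1 & !test2 &  test3;
    term_c <= test1 &  test2 & !test3;

    // ...so the OR stage sees only three register outputs and has
    // no view of the shared inputs that made the terms redundant.
    sample <= term_a | term_b | term_c;
end
endmodule
```

Written as a single registered equation, the same function would need one flipflop instead of four.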

The best logic synthesizer is the one between your ears. A poorly planned design will
always be poor regardless of how great the compilers become. When doing a design, a good
designer keeps a model of the synthesized logic in her head and doesn’t allow the logic to
grow so complex that it becomes a problem for the synthesis tool. One way of taking
advantage of the synthesis tool’s capability to minimize and pack logic effectively is to
never create purely combinational modules. None of the popular FPGA architectures have
purely combinational logic elements. There is generally a register that goes wasted if a CLB
is used only for combinational logic. Mix the combinational logic with the synchronous
logic to allow the synthesis tool to merge the logic into the resources available in the device.
The logic block architecture uses combinational logic or LUTs (Look-Up Tables) that feed
into registers. Write your logic that way!

Figure 2-27 Combinational Logic Clouds Feeding Flipflops

Notice how the logic of Figure 2-27 is partitioned into modules.

WHAT DOES THE SYNTHESIZER DO?

It’s helpful to think about what the synthesizer is doing. The synthesis tool takes
Verilog HDL and maps it into hardware. First, the synthesizer will minimize logic equations
by removing redundant logic terms. Then the design will be a huge set of Boolean
equations. The remaining problem can be thought of as a simple division:

A/B

where A is the full design and B represents the hardware elements available in the target
CPLD, FPGA, or ASIC. In general, for a CPLD the hardware structure will be multi-input
Logic Elements (LEs), for an FPGA the hardware structure will be 3- or 4-input look-up
tables (LUTs), and for an ASIC the hardware structure will be a more freeform collection of
library elements. Assuming the basic logic element is a 4-input LUT, the synthesis tool will
partition our complicated numerator into many equations, each a function of 4 inputs.
There will be many sets of equations that will implement our design, and the synthesis tool
will attempt to find ones that meet the design goals of size and speed.
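
As a small hand-worked illustration of that partitioning, consider a hypothetical 6-input function that will not fit in one 4-input LUT. Written out by hand, one possible split looks like this; the actual partition a synthesis tool chooses may well differ.

```verilog
// A 6-input function split into pieces of at most 4 inputs each.
// The function and all names are illustrative assumptions.
module lut_partition (in, y);
input  [5:0] in;
output y;

// First LUT: a function of four of the inputs.
wire   lut1 = in[0] & in[1] & in[2] & in[3];

// Second LUT: the intermediate result plus the two remaining inputs.
wire   lut2 = lut1 & in[4] & in[5];

assign y = lut2;
endmodule
```

Two LUTs in series means two levels of logic delay, which is the size-versus-speed trade the tool is juggling.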

A truth table lists all input combinations and defines an output condition for each. A
truth table is a tabular equation form and works well for software manipulation of equations.
The compiler will extract a sum-of-products (SOP) equation from your HDL code. The SOP
is developed by collecting terms that give a 1 result and ORing them together.
Let’s do a SOP representation of a 7-segment decoder. This decoder, similar to a
CMOS 4513, converts a 4-bit binary-coded decimal (BCD) number to the device pins that
drive a 7-segment display.

Here is the truth table:

 BCD Input        Segment
b3 b2 b1 b0    a  b  c  d  e  f  g
 0  0  0  0    1  1  1  1  1  1  0
 0  0  0  1    0  1  1  0  0  0  0
 0  0  1  0    1  1  0  1  1  0  1
 0  0  1  1    1  1  1  1  0  0  1
 0  1  0  0    0  1  1  0  0  1  1
 0  1  0  1    1  0  1  1  0  1  1
 0  1  1  0    1  0  1  1  1  1  1
 0  1  1  1    1  1  1  0  0  0  0
 1  0  0  0    1  1  1  1  1  1  1
 1  0  0  1    1  1  1  1  0  1  1

Let’s collect the input terms that cause the ‘a’ segment to be asserted.
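
Reading down the ‘a’ column of the truth table and keeping the rows that give a 1, the unreduced sum-of-products can be written out as below. The wrapper module and its name are assumptions added so the equation stands on its own; each product term is tagged with the BCD digit it represents.

```verilog
// Unreduced sum-of-products for the 'a' segment, one term per
// truth-table row where a = 1. Module name is hypothetical.
module a_segment_sop (b3, b2, b1, b0, a);
input  b3, b2, b1, b0;
output a;

assign a = (!b3 & !b2 & !b1 & !b0)    // 0
         | (!b3 & !b2 &  b1 & !b0)    // 2
         | (!b3 & !b2 &  b1 &  b0)    // 3
         | (!b3 &  b2 & !b1 &  b0)    // 5
         | (!b3 &  b2 &  b1 & !b0)    // 6
         | (!b3 &  b2 &  b1 &  b0)    // 7
         | ( b3 & !b2 & !b1 & !b0)    // 8
         | ( b3 & !b2 & !b1 &  b0);   // 9
endmodule
```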

We can get a hint about how the reduction algorithm works by extracting and
examining two terms of the equation:

(!b3 & !b2 & !b1 & !b0)|(b3 & !b2 & !b1 & !b0)=(!b2 & !b1 & !b0);
The equation terms differ only in the b3 term, which is asserted low in the first term
and asserted high in the second. Clearly the b3 term doesn’t matter, is redundant, and can be
removed without affecting the logic.
Next we’ll convert the ‘a’ segment equations to standard decimal-sum form by
replacing all negated terms with 0 and all true terms with 1. A term like (!b3 & !b2 & !b1 &
!b0), which has all terms negated, becomes (0,0,0,0), and the whole term can be represented
by a decimal 0. The next term, (!b3 & !b2 & b1 & !b0), (0,0,1,0) becomes 2, and so on,
until we collect all the terms that lead to the ‘a’ segment being asserted:

a = (0,2,3,5,6,7,8,9)

This form of the Boolean equation is used in the Quine-McCluskey method of
reducing logic equations.

The Quine-McCluskey algorithm groups terms by the number of (0) entries, that is,
negated inputs, in each term. Only terms whose counts of negated inputs differ by exactly 1
can possibly be combined, and then only if the two terms differ in a single bit position. For
example, when we combined (0,0,0,0) with (1,0,0,0), this combination was possible because
the first term has 4 zeros, the second term has 3 zeros, and only the b3 position differs. The
Quine-McCluskey algorithm exhaustively tests terms and combined terms against each other
to determine the minimum logic expression.

Running QM with the logic terms for segment ‘a’ gives the reduced equation:

a = ((!b3 & !b2 & !b0) | (b3 & !b2 & !b1) | (!b3 & b2 & b0) | (!b3 & b1));
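
A quick way to gain confidence in a reduced equation is to compare it against the original list of terms for every input value. The simulation-only sketch below does that for segment ‘a’; the names are hypothetical, and the 16-bit constant simply has a 1 in bit positions 0, 2, 3, 5, 6, 7, 8, and 9. Inputs 10 through 15 are treated as 0, matching the default case used in Listing 2-13.

```verilog
// Compare the reduced 'a' equation with the term list (simulation only).
module qm_check;
reg  [3:0] bcd;
wire b3 = bcd[3], b2 = bcd[2], b1 = bcd[1], b0 = bcd[0];

wire a_reduced = (!b3 & !b2 & !b0) | (b3 & !b2 & !b1)
               | (!b3 & b2 & b0)   | (!b3 & b1);

// Bit n is 1 when input value n should light segment 'a'.
reg  [15:0] a_table;
integer i, errors;

initial
begin
    a_table = 16'b0000_0011_1110_1101;
    errors  = 0;
    for (i = 0; i < 16; i = i + 1)
    begin
        bcd = i;
        #1;                             // let the equation settle
        if (a_reduced !== a_table[i])
            errors = errors + 1;
    end
    $display("segment a mismatches: %0d", errors);
    $finish;
end
endmodule
```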

Let’s see if we can follow what the synthesizer does with this logic defined as a
Verilog design in Listing 2-13.

Listing 2-13 Verilog Design for 7-Segment Display Decoder ’a’ Term

module seven_seg (clk, reset, bcd_input, a_segment);

input clk, reset;

input [3:0] bcd_input;
output a_segment;
reg a_segment;

always @ (posedge clk or posedge reset)
    if (reset)
        a_segment <= 0;
    else
    begin
        case (bcd_input)
            {1'b0, 1'b0, 1'b0, 1'b0}: a_segment <= 1'b1;
            {1'b0, 1'b0, 1'b0, 1'b1}: a_segment <= 1'b0;
            {1'b0, 1'b0, 1'b1, 1'b0}: a_segment <= 1'b1;
            {1'b0, 1'b0, 1'b1, 1'b1}: a_segment <= 1'b1;
            {1'b0, 1'b1, 1'b0, 1'b0}: a_segment <= 1'b0;
            {1'b0, 1'b1, 1'b0, 1'b1}: a_segment <= 1'b1;
            {1'b0, 1'b1, 1'b1, 1'b0}: a_segment <= 1'b1;
            {1'b0, 1'b1, 1'b1, 1'b1}: a_segment <= 1'b1;
            {1'b1, 1'b0, 1'b0, 1'b0}: a_segment <= 1'b1;
            {1'b1, 1'b0, 1'b0, 1'b1}: a_segment <= 1'b1;
            default: a_segment <= 0;
        endcase
    end
endmodule

For Xilinx 4xxx logic, which uses a 4-input LUT feeding a flipflop as a primitive, the
synthesizer arranges the logic to efficiently use the CLB resources and gives the circuit of
Figure 2-28 for the ‘a’ logic.

Figure 2-28 Synthesized Logic for 7-Segment Display Decoder ‘a’ Term Logic

AREA/DELAY OPTIMIZATION

When implementing a design, there are two fundamental properties: how big is it and how
fast will it operate? Synthesizing a logic design is much like autorouting a circuit board.
When routing a circuit-board trace, there are many options. Which path should the signal
take? What is the signal priority compared to other signals? There is no one answer. The
circuit-board trace can take a nearly unlimited set of paths to its destination. The right
answer occurs when the routing has met the requirements of the design, even if it’s possible
to get better area/delay performance. This bears special emphasis. The designer’s work
will not be judged by how perfect it is! The designer’s work will be judged by how well it
meets the system requirements for product cost, development cost, performance, reliability,
maintainability, and time to market. The quest for perfection will not be rewarded. The goal
of our quest is to achieve ‘good enough.’ This does not mean we’re going to deliver a bad
design. Our design still must meet timing requirements and use good design practices.
The concept of design cost weighs area and speed (delay) against each other. In many
cases, the fastest design is not the smallest. In many cases, the smallest design is not the
fastest. The designer has successfully accomplished the design if it fits into the technology
selected and runs fast enough to meet the needs of the system. How easy or difficult this
problem is depends on many factors: the size of the selected device, the architecture of the
device technology, the system speed requirement, and the skill and design approach of the
designer.
The experienced designer always leaves a way out of a problem by ensuring that
a faster or denser device, if at all possible, is available in the same device footprint.
This way, instead of redesigning a circuit board to accommodate a new device at great
expense and loss of time, a faster and/or denser device can be easily substituted.
C H A P T E R 3

A Digital Circuit Toolbox

This chapter presents some fundamental digital design concepts implemented in Verilog.

VERILOG HIERARCHY REVISITED

Verilog uses a powerful method of isolating and maintaining identifiers. A module that is
not instantiated by any other module is considered a top module, also called the root
module. The top module generally instantiates other modules, which appear underneath it in
the design hierarchy. The design identifiers that create levels of hierarchy include module
instances, tasks, functions, and named begin/end blocks.
Each of these identifiers creates a new branch of the hierarchy tree. Each node of the
tree is unique and contains identifiers which will not conflict with identifiers in other
branches or elements of the hierarchy. A signal can be accessed anywhere in the design by
referencing it by its hierarchical name, with periods separating the hierarchical elements.
Listing 3-1 contains some example signal names.

Listing 3-1 Examples of Hierarchical Naming

top.device_bus1[5:0]
top.device_bus2[3:0]
top.device_bus3[3:0]
top.s1[3:0]
top.s2[4:0]
top.s3[3:0]
top.s4[4:0]
top.s5[5:0]
top.msb1
top.msb2

top.u1.in1 // u1 is instantiated in the top module.

top.u2.in1 // u2 is instantiated in the top module.

top.u3.add1[4:0]
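
Hierarchical names are most often used from a testbench to observe signals buried in lower-level modules. The following is a minimal sketch, not taken from the original text; the module names tb_hier and leaf, the instance name u1, and the signal names are all hypothetical and chosen only for illustration.

```verilog
// Sketch only: every name in this example is hypothetical.
module leaf (in1, out1);
    input  in1;
    output out1;
    assign out1 = ~in1;
endmodule

module tb_hier;
    reg  stim;
    wire result;

    leaf u1 (stim, result);     // instance u1 lives underneath tb_hier

    initial begin
        stim = 1'b0;
        // Ports of u1 are reached by their full hierarchical names.
        #1 $display("in1=%b out1=%b", tb_hier.u1.in1, tb_hier.u1.out1);
        stim = 1'b1;
        #1 $display("in1=%b out1=%b", tb_hier.u1.in1, tb_hier.u1.out1);
        $finish;
    end
endmodule
```

References of this kind are a simulation convenience; synthesis tools generally expect signals to be passed through module ports instead.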

TRISTATE SIGNALS AND BUSSES

Tristate busses are allowed by most FPGA architectures on device output pins. Listing 3-2
provides an example. In addition, some FPGAs allow internal tristate signals. Internal
tristates can save a lot of logic when selecting between different sets of control or data
signals. In other words, different logic trees feed a tristate bus with one logic tree enabled at
a time. With this method, an entire logic construct can be switched quickly. If internal
tristates are not allowed, the synthesis tool may have a control to automatically substitute
MUXes. This replacement will consume many more gates and is likely to be much slower
than a tristate bus structure. If conversion to an ASIC is intended, check with the ASIC
vendor. Internal tristates are an ASIC conversion issue; the vendor may not offer internal
tristates and may or may not offer automatic expansion of tristate nets to logically controlled
nets. Internal tristates can also cause problems during simulation.

Listing 3-2 Tristate Bus Example

module tristate (input_bus, output_bus, tri_control);

input [7:0] input_bus;
input tri_control; // Tristate control signal.
output [7:0] output_bus;
// The first condition is the tri_control true condition,
// the second is the false condition.

assign output_bus = tri_control ? input_bus : 8'bz;

endmodule

In the conditional part of the assign statement, both input_bus and tri_control can be
logic equations. For internal tristates, use the tri net type as illustrated in Listing 3-3.

Listing 3-3 Internal Tristate Example

module tristat2 (input_bus1, input_bus2, input_bus3,
                 input_bus4, tri_control, output_bus, output_control);

input [7:0] input_bus1, input_bus2, input_bus3, input_bus4;
input [1:0] tri_control;
input output_control;
tri [7:0] tri_bus;
output [7:0] output_bus;

parameter zero = 2'b00;
parameter one = 2'b01;
parameter two = 2'b10;
parameter three = 2'b11;

assign tri_bus = (tri_control == zero) ? input_bus1 : 8'bz;
assign tri_bus = (tri_control == one) ? input_bus2 : 8'bz;
assign tri_bus = (tri_control == two) ? input_bus3 : 8'bz;
assign tri_bus = (tri_control == three) ? input_bus4 : 8'bz;
assign output_bus = output_control ? tri_bus : 8'b0;

endmodule

It is the designer’s responsibility to ensure that the tristate buffer enables are mutually
exclusive so that bus conflicts are avoided. Even transient tristate bus conflicts can cause
excessive power consumption and, if allowed to occur too long or too often, can overheat and
damage the device.

The schematic shown in Figure 3-1 has three levels of buffers. The ones on the left
are input pin buffers, the ones in the middle are internal tristate buffers, and the ones on the
right are output pin buffers.

Figure 3-1 Schematic for Internal Tristate Buffer Design


In Figure 3-3, the logic is the same (same Verilog source file) but the internal tristates
have been converted to MUXes. Note that one level of tristate buffering in Figure 3-1 has
been converted to two levels of MUX LUTs in Figure 3-3. This will result in a slower
operating speed. A similar conversion may be required if the design is converted to an ASIC
technology that does not offer internal tristates. This change was caused by checking the
‘Allow converting of internal tristates’ option in Exemplar Logic’s LeonardoSpectrum
Optimize, Advanced Options menu as illustrated in Figure 3-2.

Figure 3-2 Tristate Buffers Converted to MUXes

The boxes in the middle of Figure 3-3 are the MUX LUTs that replaced the internal
tristate buffers.
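
If the target device has no internal tristates, the MUX can also be written explicitly instead of leaving the conversion to the tool. The following is a sketch of the same function as Listing 3-3 in MUX form; the module name tristat2_mux and the register name mux_bus are hypothetical, and the logic a synthesis tool actually produces from it may differ from Figure 3-3.

```verilog
// Sketch only: tristat2_mux and mux_bus are hypothetical names.
module tristat2_mux (input_bus1, input_bus2, input_bus3,
                     input_bus4, tri_control, output_bus, output_control);

input  [7:0] input_bus1, input_bus2, input_bus3, input_bus4;
input  [1:0] tri_control;
input        output_control;
output [7:0] output_bus;
reg    [7:0] mux_bus;

// One source is selected at a time, so no bus conflict is possible.
always @ (tri_control or input_bus1 or input_bus2 or input_bus3 or input_bus4)
begin
    case (tri_control)
        2'b00:   mux_bus = input_bus1;
        2'b01:   mux_bus = input_bus2;
        2'b10:   mux_bus = input_bus3;
        default: mux_bus = input_bus4;
    endcase
end

assign output_bus = output_control ? mux_bus : 8'b0;

endmodule
```

Writing the MUX directly removes the designer’s burden of guaranteeing mutually exclusive enables, at the cost of the extra logic levels discussed above.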

Figure 3-3 Schematic for MUX Version of Tristate Buffer Design


BIDIRECTIONAL BUSSES

Bidirectional busses, as shown in Listing 3-4 and Figure 3-4, are easy to define in Verilog.
The signal is divided into two parts: the driver part, which is tristated, and the input part.
The two parts are then wired together. The module port must be defined as inout in the port
definition section.

Listing 3-4 Bidirectional Bus Example

module bidir (bidir_bus, direction_sig, use_bidir_sig);

inout [7:0] bidir_bus;
input direction_sig;
output [7:0] use_bidir_sig;
reg [7:0] output_bus;  // Driven by other logic in a complete design.
wire [7:0] bidir_input;

// When direction_sig is true, output_bus drives the
// bidir_bus port pins.
// The bidir_bus signals are accessible inside the design
// on the bidir_input bus.

// Output part, MUX form.
assign bidir_bus = direction_sig ? output_bus : 8'bz;

// Input part.
assign bidir_input = bidir_bus;

// Assign so the input does not get optimized in synthesis.
assign use_bidir_sig = bidir_input;

endmodule

Figure 3-4 Schematic for Bidirectional Bus Design

PRIORITY ENCODERS

If/else Priority Encoder

if-else statements can have an implied priority with precedence assigned to the first
instructions encountered in a begin/end block. Listing 3-5 illustrates a priority encoder with
the extracted schematic shown in Figure 3-5. If signal a is asserted, it has priority, and none
of the other signals matter. From a delay point of view, signal x passes through one level of
logic and is faster than signal z, which passes through three layers of logic.

Listing 3-5 Priority Encoder Example

module priority (d, a, b, c, x, y, z);

input a, b, c, x, y, z;
output d;
reg d;

always @ (a or b or c or x or y or z)
begin
    if (a)
        d = x;
    else if (b)
        d = y;
    else if (c)
        d = z;
    else
        d = 1'b0;
end
endmodule

Figure 3-5 Priority Encoder Schematic

Priority in Case Statements

Like an if/else block, a case block will create a priority encoder unless the case items are
mutually exclusive or a parallel case compile option is available and selected. Selection of
parallel case informs the compiler that the cases are mutually exclusive and do not overlap:
if one case is found true, then by definition no other case can be true. Since a conflict is
not allowed in a parallel case design, the cases should not be prioritized, but they will be
if conflicting cases are defined. If it is possible for more than one case to be true
(resulting in conflicting cases), then the first one encountered by the compiler will have
priority over later cases which might be true (lower-priority cases are considered don’t-care
conditions when the higher-priority case is evaluated). This means the statement order has
meaning. This may not be the behavior the designer wants. To avoid the priority encoding,
avoid conflicting (or contradictory) cases, or use the compiler directive to define the
design as a parallel case design. The full case option answers a separate question: it tells
the compiler that every input combination that matters has been listed, so unlisted
combinations can be treated as don’t-care conditions.
When the case input condition list is not complete, a latch is created as shown in
Listing 3-6 and Figure 3-7. For cases not defined, the previous output is held. This may not
be what the designer intends. To prevent the creation of a latch, use a default case to cover
all undefined cases as shown in Listing 3-7, or check the LeonardoSpectrum full case
option box as shown in Figure 3-6. Using the full case option creates a MUX
implementation, as shown in Figure 3-8.
The parallel case check box will still create a latch but forces undefined outputs to a
known state, as shown in Figure 3-9.

Listing 3-6 Case Example, Latch Created

module case1 (d, a, x, y, z);

input [2:0] a;
input x, y, z;
output d;
reg d;

always @ (a or x or y or z)
begin
case (a)
3'b001: d = x;
3'b010: d = y;
3'b100: d = z;
endcase
end
endmodule

Figure 3-6 LeonardoSpectrum Verilog Case Options

Figure 3-7 Case Schematic with Latch

Figure 3-8 Case Schematic, Full Case Selected


Figure 3-9 Case Schematic, Parallel Case Selected, Latch Created

The logic of Listing 3-7 is slightly different: as in Figure 3-8, all undefined cases are
set by default to zeroes. Still, notice in Figure 3-10 that a MUX was inferred without a latch.

Listing 3-7 Case Example with Default Case

module case2 (d, a, x, y, z);

input [2:0] a;
input x, y, z;
output d;
reg d;

always @ (a or x or y or z)
begin
case (a)
3'b001: d = x;
3'b010: d = y;
3'b100: d = z;
default: d = 1'b0;
endcase
end
endmodule

Figure 3-10 Case Schematic with Default Case
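
Another common way to avoid an unintended latch, offered here as a sketch and not as part of the original listings, is to assign a default value at the top of the always block before the case statement. Every path through the block then assigns d, so nothing has to be held. The module name case3 is hypothetical.

```verilog
// Sketch only: case3 is a hypothetical module name.
module case3 (d, a, x, y, z);

input [2:0] a;
input x, y, z;
output d;
reg d;

always @ (a or x or y or z)
begin
    d = 1'b0;            // Default value, overridden below when a case matches.
    case (a)
        3'b001: d = x;
        3'b010: d = y;
        3'b100: d = z;
    endcase
end
endmodule
```

This should behave the same as the default case of Listing 3-7; which style reads better is largely a matter of taste.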

AREA/SPEED OPTIMIZATION IN SYNTHESIS

We’ve taken some preliminary looks at what the synthesizer does; let’s explore this a little
more. The optimization of synthesis for ASICs and FPGAs falls into two general categories,
speed/delay or area. Obviously, the design must operate at a high enough speed to meet the
design requirements. The faster the FPGA, the more expensive the device. The design must
fit into the chosen device. Larger devices are also more expensive. The FPGA designer
constantly struggles with the size (area) and speed of the synthesized logic. Size and speed
are influenced by coding style, but we’ll talk about that later.
The trade-off between speed (or delay) and area can be illustrated with an AND gate
design as shown in Listing 3-8.

Listing 3-8 Duplicated Logic Example

module optimize (a, b, c, d, e, f, g, h, i, j, k, l, m, z);

output a, b, c, d, z;
reg a, b, c, d, z;
input e, f, g, h, i, j, k, l, m;

always @ (e or f or g or h or i or j or k or l or m)
begin
a = e & f & g & h & i & j & k & l;
z = m & a;
end

always @ (e or f or g or h or i or j or k or l)
begin
b = e & f & g & h & i & j & k & l;
end

always @ (e or f or g or h or i or j or k or l)
begin
c = e & f & g & h & i & j & k & l;
end

always @ (e or f or g or h or i or j or k or l)
begin
d = e & f & g & h & i & j & k & l;
end

endmodule

Figure 3-11 Area/Delay Optimization Selection in LeonardoSpectrum

Figure 3-12 Illustration of Signals and Routing after Speed (Delay) Optimization

Figure 3-13 Illustration of Signals and Routing After Area Optimization

The difference between the circuits of Figures 3-12 and 3-13 is subtle; the logic is exactly
the same. The difference between the two designs is the selection of area/delay optimization
on the Quick Setup tab in LeonardoSpectrum (this selection can also be made in the
Optimize tab). However, you will notice that signal z in Figure 3-12 passes through 2 levels
of logic, while it passes through 3 levels of logic in Figure 3-13. Signal z will be created
faster in Figure 3-12 at the expense of increased FPGA real estate.
Another way to look at this design is to view the critical path. The critical path is the
longest delay in the design, and LeonardoSpectrum will extract this path so it can be
analyzed. The critical paths are illustrated in Figures 3-14 and 3-15.

Figure 3-14 Design Optimized for Area, Critical Path

Figure 3-15 Design Optimized for Delay, Critical Path


Figure 3-15 shows the delay improvement: two layers of logic compared to three. For this
design, the critical path is 22.78 nsec with delay optimization and 24.57 nsec with area
optimization, a difference of about 7%. The battle the FPGA designer faces is trying to
fit the design into a small (cheap) and slow device and still achieve the necessary
performance. This can be more an art than a science.
From this example, it should begin to be clear how many options the synthesis tool
has for implementing a logic function. Regarding coding style, this example illustrates that
grouping together logic that shares inputs and/or outputs is helpful to the synthesis tool. The
best partitioning would keep them inside the same always block, or at least in the same
module as much as possible. The best design partitioner is the one between your ears! Don’t
make the synthesis tool work too hard to group functions of common inputs and outputs;
group them yourself in block structures and modules.
The synthesizer is best at optimizing combinational logic within a module. It may find
redundant logic in separate modules, but let’s not make the synthesizer work harder than it
has to. Try to keep combinational inputs and outputs of related signals grouped together in
the same module.
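
As one illustration of this grouping advice, the duplicated AND terms of Listing 3-8 can be written once and shared. This is a sketch under the assumption that all five outputs really do need the same eight-input term; the module name optimize_shared and the wire name all_set are hypothetical.

```verilog
// Sketch only: optimize_shared and all_set are hypothetical names.
module optimize_shared (a, b, c, d, e, f, g, h, i, j, k, l, m, z);

output a, b, c, d, z;
input  e, f, g, h, i, j, k, l, m;
wire   all_set;

// The eight-input AND is written once...
assign all_set = e & f & g & h & i & j & k & l;

// ...and every output that needs it shares the same term.
assign a = all_set;
assign b = all_set;
assign c = all_set;
assign d = all_set;
assign z = m & all_set;

endmodule
```

The synthesis tool is still free to duplicate the term when optimizing for delay, but the source now states the designer’s intent plainly.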
Here’s another design approach that will help. When working with a design team,
there should be an agreement about where flipflops occur at module boundaries. Typically,
the structure is per Figure 3-16. As long as all modules have this form, they will work well
together. Without this agreement, you may have to use synchronizing flipflops on inputs
because you don’t know what you’re interfacing to.

Figure 3-16 Suggested Module Boundary Selection and Register Assignment


It bears mentioning that area and delay are not always a trade-off; sometimes the more
compact design is also the fastest. It has fewer gates, so it’s certainly possible that there
might be fewer levels of logic. It just depends on the design architecture and how well the
synthesis tool optimizes this design.

TRADE-OFF BETWEEN OPERATING SPEED AND LATENCY

Another trade-off between area and speed is the one between latency and circuit operating
speed. Latency is the time period between when a signal occurs at the input of a design
element and when it finally propagates through the circuit to the output. Many times a
design can tolerate latency as long as the design throughput is fast. For faster throughput,
the designer splits up the logic so that fewer operations (layers of logic) appear between
clocks. This is illustrated in Listings 3-9 and 3-10. The designs compute the same logic
function, but the design of Listing 3-10 will have greater latency (inputs a through d take
two clock periods to reach the output pins; input e enters at the second stage and takes
one) and will operate at a higher clock rate. The design of Listing 3-9 will use fewer
flipflops, has less latency (the output will appear in one clock cycle), and will operate at
a lower clock rate than that of Listing 3-10.

Listing 3-9 Logic with Low Latency and Low Operating Speed

module latency1 (clk, reset, a, b, c, d, e, f);

input clk, reset;
input [15:0] a, b, c, d, e;
output [15:0] f;
reg [15:0] f;

always @ (posedge clk or posedge reset)
begin
    if (reset)
        f <= 0;
    else
        // The next equation must resolve in one clock period.
        f <= (a & b & c & d & e);
end
endmodule

Listing 3-10 Logic with High Latency and High Operating Speed

module latency2 (clk, reset, a, b, c, d, e, f);

input clk, reset;
input [15:0] a, b, c, d, e;
output [15:0] f;
reg [15:0] ab, cd, f;

always @ (posedge clk or posedge reset)
begin
    if (reset)
    begin
        ab <= 0;
        cd <= 0;
        f <= 0;
    end
    else
    // The remaining equations each must resolve in one clock
    // period.
    begin
        ab <= (a & b);
        cd <= (c & d);
        f <= (ab & cd & e);
    end
end
endmodule

As a comparison, targeting a Xilinx 4010XL-3 device, the circuit of Listing 3-9
consumes 16 partially used CLBs and will operate at 78.9 MHz. The circuit of Listing 3-10
uses 24 packed CLBs and will operate at 87.1 MHz. We’ve added hardware resources to
increase operating speed. Almost any circuit can be sped up by pipelining to reduce the
amount of logic that must resolve in a clock cycle.
Synthesis tools do a great job of implementing your logic, but the designer must take
responsibility for the overall design architecture. If a design must operate at high clock
rates, then break up the logic so there is less logic being resolved between clock edges. It’s
a poor designer who blames the tools for the poor performance of his or her design. There is
always a design approach that will improve timing. Yes, always!
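
One detail to watch when pipelining: in Listing 3-10, input e bypasses the first register stage, so it is combined with values of a through d that were sampled one clock earlier. If all five inputs must come from the same clock cycle, e needs a matching delay register. The following sketch shows one way to do that; the module name latency3 and the register name e_d are hypothetical.

```verilog
// Sketch only: latency3 and e_d are hypothetical names.
module latency3 (clk, reset, a, b, c, d, e, f);

input         clk, reset;
input  [15:0] a, b, c, d, e;
output [15:0] f;
reg    [15:0] ab, cd, e_d, f;

always @ (posedge clk or posedge reset)
begin
    if (reset)
    begin
        ab  <= 0;
        cd  <= 0;
        e_d <= 0;
        f   <= 0;
    end
    else
    begin
        // Stage 1: partial results, plus a matching delay for e.
        ab  <= (a & b);
        cd  <= (c & d);
        e_d <= e;
        // Stage 2: combine values that all entered on the same clock.
        f   <= (ab & cd & e_d);
    end
end
endmodule
```

The extra register costs 16 more flipflops; whether the alignment matters depends on how the surrounding system presents its inputs.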

DELAYS IN FPGA LOGIC ELEMENTS

Timing Constraints

There are two strategies for improving the performance of a design. Method One: assist the
synthesis tool in identifying critical logic by applying timing constraints. Timing constraints
are discussed in Chapter 6. This gives priority to logic that must run fast at the expense of
less critical areas of the design. Method Two: write the code to give the synthesis tool an
easier problem to solve. Most problems can be resolved by reworking the source code.
Pipeline the logic or use fast structural (schematic-like) elements to implement the logic.

Test Points

Experienced designers make mistakes. Knowing this, the experienced designer makes the
design easy to test and debug. A powerful technique for troubleshooting is to bring out test
points connected to easily accessible connections. For example, the layout of an HP logic
analyzer probe pod is shown in Figure 3-17. Put a double row of 0.1" header pins on the
circuit board. Either through-hole or SMT types will work; SMT is better because it has less
impact on PCB routing channels and does not poke more holes in the power/ground planes.
Notice that ground is connected to pin 20 and the Logic Analyzer clocks are connected to
pins 2 and 3. Pins are numbered, with odd pins 1 through 19 on one side and 2 through 20
on the other. Assign test[15:0] to device pins (which are tied to the test connector). Bring
signals to be tested up through the hierarchy to the top-level module in the manner of
Listing 3-11.

Listing 3-11 Test-Point Wiring Example

module top_lev (test, clk, reset);

output [15:0] test;
input clk, reset;
wire [15:0] test;
wire [15:0] cnt;

assign test[15:0] = cnt[15:0];

// The instantiated module name must match the definition below.
low_lev u1 (cnt, clk, reset);

endmodule

module low_lev (cnt, clk, reset);

input clk, reset;
output [15:0] cnt;
reg [15:0] cnt;

always @ (posedge clk or posedge reset)
begin
    if (reset)
        cnt <= 0;
    else begin
        cnt <= cnt + 1;
    end
end
endmodule

Figure 3-17 Logic Analyzer Header Layout, Top View of PCB

The 20-pin HP (p/n 1251-8106) logic analyzer pod has built-in termination networks
(similar to Figure 3-18) and plugs into a dual-row tenth-inch center header as shown in
Figure 3-17. To save money on pods and connect directly to the 40 pin logic analyzer pod,
put a 40-pin dual-row tenth-inch center header on the board per Figure 3-19 and termination
networks on the circuit board per Figure 3-18.

Figure 3-18 Logic Analyzer Header Termination Network Schematic


Figure 3-19 Logic Analyzer Header Layout, Top View of PCB

To expand the number of signals available for test, a MUX can be used to select
between sets of signals connected to the test points. Be aware: the designer can run into
a version of Heisenberg’s uncertainty principle, in which the method of taking a measurement
can affect the thing being measured. Keep in mind that the signal that passes through the MUX
is one or more logic levels removed from the internal signal, and the precise timing will be
different. For low-speed signals, this may not be important. Adding logic and routing for
test points can complicate the routing and result in slower timing. You didn’t expect to get
something for nothing, did you?
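
A test-point MUX of the kind described above might look like the following sketch. The module name test_mux, the select input test_sel, and the four group_ buses are hypothetical; in a real design the select lines could come from spare pins, jumpers, or a control register.

```verilog
// Sketch only: test_mux, test_sel, and the group_* buses are hypothetical names.
module test_mux (test_sel, group_a, group_b, group_c, group_d, test);

input  [1:0]  test_sel;     // Selects which signal group reaches the header.
input  [15:0] group_a, group_b, group_c, group_d;
output [15:0] test;
reg    [15:0] test;

always @ (test_sel or group_a or group_b or group_c or group_d)
begin
    case (test_sel)
        2'b00:   test = group_a;
        2'b01:   test = group_b;
        2'b10:   test = group_c;
        default: test = group_d;
    endcase
end
endmodule
```

Four groups of 16 signals share one 16-pin test connector here, at the price of the added MUX delay.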
Study the test equipment in your lab. Design the circuit board to allow easy interface
to that test equipment. Provide easy access to grounds to connect to your oscilloscope or
voltmeter (we never seem to be able to find a ground or power connection where we need
it). Leave room around parts to allow the use of sockets or test clips.

STATE MACHINES

It is very common to use sequential processes to solve a design problem. One event follows
another and steers a sequential machine through its states. This is a great method of dividing
and conquering a design. A Finite State Machine (FSM) uses a set of registers, called state
registers, to identify the current machine state. The current state depends on the inputs and
the history of inputs.

You can call me a nut, but I consider state machines (some people insist on calling
them finite state machines or FSMs, but since I’ve never seen an infinite state machine, I
don’t feel any compulsion to keep the finite label) to be one of technology’s wonders, as
beautiful as an Escher print, a current-mirror transistor pair, or a toroidal transformer. As
designers, we are constantly trying to break hard and complicated problems into pieces that
are easier to solve. The state machine well serves this quest. Once you are in a certain state,
the only inputs that matter are the few you explicitly define yourself.
There are many forms of state machines. In textbooks they are divided into the Mealy
and Moore types. In a Moore style state machine, the output depends only on the state. A
synchronous counter is one example of a Moore state machine; the output is dependent only
on the state of the machine (actually, the output IS the state of the machine). In a Mealy
state machine, the output depends on the state and some input conditions or signals. The
style that this book recommends combines output logic and state assignments in the same
always block.
It is helpful to create a state machine the longhand way. For a counter with outputs
encoded using Gray code, each sequential output differs by exactly one bit. We’ll discuss
Gray coding again later in this chapter. A Gray Code counter is a simple state machine; let’s
design one with gates and registers to see how it is done. First, create a present-state/next-
state chart as shown in Listing 3-12. After a clock edge, the present-state values are replaced
with the next-state values.

Listing 3-12 State Machine Example, a 3-bit Gray Code Counter

Present State    Next State

d2 d1 d0         n2 n1 n0
0  0  0          0  0  1
0  0  1          0  1  1
0  1  1          0  1  0
0  1  0          1  1  0
1  1  0          1  1  1
1  1  1          1  0  1
1  0  1          1  0  0
1  0  0          0  0  0

Collect all the terms that result in the next state bits being set to a 1 and OR them all
together to create a sum-of-products (SOP) representation of the next-state decoder logic as
shown in Listing 3-13.

Listing 3-13 State Machine Example, Next-State Logic

n2 <= (~d2 & d1 & ~d0) | (d2 & d1 & ~d0) | (d2 & d1 & d0)
    | (d2 & ~d1 & d0);

n1 <= (~d2 & ~d1 & d0) | (~d2 & d1 & d0) | (~d2 & d1 & ~d0)
    | (d2 & d1 & ~d0);

n0 <= (~d2 & ~d1 & ~d0) | (~d2 & ~d1 & d0) | (d2 & d1 & ~d0)
    | (d2 & d1 & d0);
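
For readers who want to see the longhand equations in a complete module, the following sketch wires the sum-of-products decoder of Listing 3-13 to a three-bit state register. The module name gray_sop and the port name d are hypothetical; the equations themselves are copied from the listing above.

```verilog
// Sketch only: gray_sop is a hypothetical module name.
module gray_sop (clk, reset, d);

input        clk, reset;
output [2:0] d;
reg    [2:0] d;             // Present-state register {d2, d1, d0}.
wire         n2, n1, n0;    // Next-state decoder outputs.

wire d2 = d[2];
wire d1 = d[1];
wire d0 = d[0];

// Next-state decoder, one sum-of-products equation per bit.
assign n2 = (~d2 & d1 & ~d0) | (d2 & d1 & ~d0) | (d2 & d1 & d0)
          | (d2 & ~d1 & d0);
assign n1 = (~d2 & ~d1 & d0) | (~d2 & d1 & d0) | (~d2 & d1 & ~d0)
          | (d2 & d1 & ~d0);
assign n0 = (~d2 & ~d1 & ~d0) | (~d2 & ~d1 & d0) | (d2 & d1 & ~d0)
          | (d2 & d1 & d0);

// State register: next state replaces present state on each clock edge.
always @ (posedge clk or posedge reset)
    if (reset)
        d <= 3'b000;
    else
        d <= {n2, n1, n0};

endmodule
```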

The logic of Figure 3-20 shows the next-state decoding logic on the left and the state
register on the right. Let’s implement the present/next-state logic with a Verilog state
machine and see how it looks. I used a case structure in Listing 3-14 to create the next-state
decoders, so the code looks different than the logic above, but it’s the same, trust me.

Listing 3-14 Gray Code State Machine Verilog Example

module gray1 (clk, reset, cnt, flag_output);

input clk, reset;
output [2:0] cnt;
output flag_output;
wire [2:0] cnt; // cnt is the present-state logic.
reg [2:0] next_state;
reg flag_output;

assign cnt = next_state;

always @ (posedge clk or posedge reset)
    if (reset) begin
        next_state <= 3'b0;
        flag_output <= 1'b0;
    end
    else begin
        case (next_state)
            3'b000: begin
                next_state <= 3'b001;
                flag_output <= 1'b0;
            end
            3'b001: begin
                next_state <= 3'b011;
                flag_output <= 1'b0;
            end
            3'b011: begin
                next_state <= 3'b010;
                flag_output <= 1'b1;
            end
            3'b010: begin
                next_state <= 3'b110;
                flag_output <= 1'b0;
            end
            3'b110: begin
                next_state <= 3'b111;
                flag_output <= 1'b0;
            end
            3'b111: begin
                next_state <= 3'b101;
                flag_output <= 1'b0;
            end
            3'b101: begin
                next_state <= 3'b100;
                flag_output <= 1'b0;
            end
            3'b100: begin
                next_state <= 3'b000;
                flag_output <= 1'b0;
            end
            default: begin
                next_state <= 3'b0;
                flag_output <= 1'b0;
            end
        endcase
    end
endmodule

Figure 3-20 Gray Code State Machine Logic


An output, called flag_output, was added to the Gray code design. This shows how
we add outputs to a state machine. I want the output to be asserted during the 010 state, so it
is set in the prior state (011). The flag_output signal will be asserted on entry to state 010
and cleared on exit of state 010. See Figure 3-21 for the waveforms created by this logic.

Figure 3-21 Gray Code State Machine Waveforms

How Many State Registers?

The number of available states = 2^n, where n is the number of state registers.

Using a state machine is a great way to partition a problem. When the state machine is
in a given state, neglecting the clock and reset inputs, nothing else matters except the logic
and the inputs explicitly referred to inside that state. How can we create a state machine that
will synthesize effectively? At around eight state registers and inputs, I’d consider breaking
up the state machine into smaller ones.
Let’s talk about another way to create Gray Code logic.

FOUR-BIT GRAY CODE SEQUENCE

0000
0001
0011
0010
0110
0111
0101
0100
1100
1101
1111
1110
1010
1011
1001
1000
then back to 0000

The algorithm for converting binary to Gray Code is:

gray MSB = binary MSB

next Gray bit = (corresponding binary bit XOR’d (added without
carry) with the next-highest-order binary bit)

Listing 3-15 illustrates the algorithm for converting from binary to Gray Code.

Listing 3-15 Verilog Code for Converting Binary to Gray Code

module bin2gry (clk, reset, binary_input, gray_output);

input clk, reset;
input [3:0] binary_input;
output [3:0] gray_output;
reg [3:0] gray_output;

always @ (posedge clk or posedge reset)
    if (reset)
    begin
        gray_output <= 4'b0;
    end
    else begin
        gray_output[3] <= binary_input[3];
        gray_output[2] <= binary_input[3] ^ binary_input[2];
        gray_output[1] <= binary_input[2] ^ binary_input[1];
        gray_output[0] <= binary_input[1] ^ binary_input[0];
    end
endmodule
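
The bit-by-bit equations of Listing 3-15 generalize to any width with a single shift and XOR. The following is a combinational sketch, not the author’s listing; the module name bin2gry_n and the parameter WIDTH are hypothetical.

```verilog
// Sketch only: bin2gry_n and WIDTH are hypothetical names.
module bin2gry_n (binary_input, gray_output);

parameter WIDTH = 4;

input  [WIDTH-1:0] binary_input;
output [WIDTH-1:0] gray_output;

// Shifting right lines each bit up with its next-highest-order neighbor.
// The MSB is XORed with the 0 shifted in, so it passes through unchanged.
assign gray_output = binary_input ^ (binary_input >> 1);

endmodule
```

With WIDTH left at 4, this should produce the same codes as Listing 3-15, without the output register.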

The algorithm for converting from Gray Code to binary:

binary MSB = gray MSB

next binary bit = (previously determined binary bit XOR’d with
corresponding Gray bit).

A synchronous version of a Gray Code to binary converter is shown in Listing 3-16.
The conversion algorithm is modified slightly, so all bits are calculated in parallel and will
converge in a single clock period.

Listing 3-16 Verilog Code for Converting Gray Code to Binary

module gry2bin (clk, reset, gray_in, binary_output);

input clk, reset;
input [3:0] gray_in;
output [3:0] binary_output;
reg [3:0] binary_output;

always @ (posedge clk or posedge reset)
    if (reset)
    begin
        binary_output <= 4'b0;
    end
    else begin
        binary_output[3] <= gray_in[3];
        binary_output[2] <= gray_in[3] ^ gray_in[2];
        binary_output[1] <= (gray_in[3] ^ gray_in[2]) ^ gray_in[1];
        binary_output[0] <= ((gray_in[3] ^ gray_in[2]) ^ gray_in[1]) ^ gray_in[0];
    end
endmodule
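
Listing 3-16 shows that each binary bit is the XOR of the corresponding Gray bit and every Gray bit above it. That observation allows a width-independent version using a loop and the reduction XOR operator. This is a combinational sketch under that assumption; the module name gry2bin_n and the parameter WIDTH are hypothetical.

```verilog
// Sketch only: gry2bin_n and WIDTH are hypothetical names.
module gry2bin_n (gray_in, binary_output);

parameter WIDTH = 4;

input  [WIDTH-1:0] gray_in;
output [WIDTH-1:0] binary_output;
reg    [WIDTH-1:0] binary_output;
integer i;

always @ (gray_in)
begin
    // Shifting right by i discards the bits below position i; the
    // reduction XOR then combines Gray bits i through WIDTH-1.
    for (i = 0; i < WIDTH; i = i + 1)
        binary_output[i] = ^(gray_in >> i);
end
endmodule
```

For wide vectors the low-order bits pass through a long XOR chain, so check the timing report before relying on this form in a fast design.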

State Assignments

The state assignments can make a big difference in how efficiently your logic will
synthesize. We will use parameters and `ifdef statements to select between encoding
assignments as shown in Listing 3-17. A binary count is the easiest to test and debug, but
using Gray Code for state assignments that occur in sequence will often synthesize more
efficiently, because only one state bit changes on each transition. State machine state
coding can also be used to directly generate output signals
by using a flipflop for both a state register and an output register. Using this method, the
clever designer can save some logic and create a faster design.

Listing 3-17 Example of Selecting Binary/Gray Code State Assignments

module ifdef_test (clk, reset, count_output);

input clk, reset;
output [2:0] count_output;
reg [2:0] count_output;

//`define binary

`ifdef binary
parameter state_zero = 3'b000;
parameter state_one = 3'b001;
parameter state_two = 3'b010;
parameter state_three = 3'b011;
parameter state_four = 3'b100;
parameter state_five = 3'b101;
parameter state_six = 3'b110;
parameter state_seven = 3'b111;
`else
parameter state_zero = 3'b000;
parameter state_one = 3'b001;
parameter state_two = 3'b011;
parameter state_three = 3'b010;
parameter state_four = 3'b110;
parameter state_five = 3'b111;
parameter state_six = 3'b101;
parameter state_seven = 3'b100;
`endif

always @ (posedge clk or posedge reset)
    if (reset)
    begin
        count_output <= state_zero;
    end
    else begin
        case (count_output)
            state_zero: count_output <= state_one;
            state_one: count_output <= state_two;
            state_two: count_output <= state_three;
            state_three: count_output <= state_four;
            state_four: count_output <= state_five;
            state_five: count_output <= state_six;
            state_six: count_output <= state_seven;
            state_seven: count_output <= state_zero;
            default: count_output <= state_zero;
        endcase
    end
endmodule

One-Hot State Assignments

Some designs benefit from one-hot state assignments. One-hot means that each state is
assigned a single-state flipflop which is active only in the assigned state. This type of
coding tends to spread out the FPGA logic and can make the logic easier to synthesize.
Most FPGA architectures have a lot of registers, so it may not be a terrible penalty to
consume some with the one-hot scheme: a one-hot state machine uses more flipflops than a
Gray/binary-coded state machine, one flipflop per state. However, a one-hot design is not a
cure-all. In some cases a one-hot design uses more logic than binary or Gray coding.
Definitely check to make sure you’re really getting the benefit you expect when using
one-hot coding.
One aspect of the one-hot assignment to consider is the many unused or default
condition states which should be handled. By definition, the one-hot method uses one active
flipflop at a time, but what if two registers get asserted by a noise hit or metastable input
condition? Will the design recover? How will these cases be covered? This question is tough
to answer, so I generally stick to binary- or Gray-coded state assignments.
Notice that eight registers are required in Listing 3-18 to support the same number of
state assignments as Listing 3-17. Also note that a useful reset state (all state registers =
zero) is not used.

Listing 3-18 One-hot State Assignment Code Fragment

parameter state_zero = 8'b00000001;
parameter state_one = 8'b00000010;
parameter state_two = 8'b00000100;
parameter state_three = 8'b00001000;
parameter state_four = 8'b00010000;
parameter state_five = 8'b00100000;
parameter state_six = 8'b01000000;
parameter state_seven = 8'b10000000;
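
One partial answer to the recovery question is a default case that steers any illegal code back to a known state. The following is a sketch on a shortened four-state sequence; the module name onehot4 and its state sequence are hypothetical, and some synthesis tools may optimize away default branches for states they consider unreachable unless told otherwise, so check the tool documentation.

```verilog
// Sketch only: onehot4 and its four-state sequence are hypothetical.
module onehot4 (clk, reset, state);

input        clk, reset;
output [3:0] state;
reg    [3:0] state;

parameter state_zero  = 4'b0001;
parameter state_one   = 4'b0010;
parameter state_two   = 4'b0100;
parameter state_three = 4'b1000;

always @ (posedge clk or posedge reset)
    if (reset)
        state <= state_zero;
    else
        case (state)
            state_zero:  state <= state_one;
            state_one:   state <= state_two;
            state_two:   state <= state_three;
            state_three: state <= state_zero;
            // Any code with zero or several bits set lands here
            // and is steered back to a legal state.
            default:     state <= state_zero;
        endcase

endmodule
```

Comparing the full state vector against each parameter gives up some of the decoding simplicity that makes one-hot attractive, which is part of why the trade-off deserves a close look.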

An alternate version of one-hot state coding is one-cold, in which all state registers are
set except one as shown in Listing 3-19.

Listing 3-19 One-cold State Assignment Code Fragment

parameter state_zero = 8'b11111110;
parameter state_one = 8'b11111101;
parameter state_two = 8'b11111011;
parameter state_three = 8'b11110111;
parameter state_four = 8'b11101111;
parameter state_five = 8'b11011111;
parameter state_six = 8'b10111111;
parameter state_seven = 8'b01111111;

ADDERS

Binary adders are supported by Verilog synthesis tools. The synthesis tool will examine each
instance of the + operator and will try to implement the logic with a pre-optimized module.
The optimization can be influenced by compilation settings and may be optimized for area
or speed/delay. If the design is slow and small, then using adders may be trivial and may not
result in any problems. However, the logic required to implement one line of code (a <= b +
c;) can be huge if the input vectors are wide.
The designer should be aware that the synthesis tool is searching for adders. Don’t
make the job of identifying and extracting them too difficult. Make the adder standalone as
much as possible; i.e., don’t bury it too deeply in your logic.
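
A standalone adder of the kind the synthesis tool can recognize easily might look like the following sketch. The module name add8 and the port carry_out are hypothetical; the operand names follow the a <= b + c example above.

```verilog
// Sketch only: add8 and carry_out are hypothetical names.
module add8 (clk, reset, b, c, a, carry_out);

input        clk, reset;
input  [7:0] b, c;
output [7:0] a;
output       carry_out;
reg    [7:0] a;
reg          carry_out;

always @ (posedge clk or posedge reset)
    if (reset)
    begin
        a         <= 8'b0;
        carry_out <= 1'b0;
    end
    else
        // The 9-bit left-hand side captures the carry out of bit 7.
        {carry_out, a} <= b + c;

endmodule
```

Keeping the + operator alone on its line, with registered outputs, leaves the tool free to drop in whatever adder structure suits the target device.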

Half-Adder Logic

A half-adder sums two inputs. The reason it is not a “complete” or full-adder is that it
ignores carry input signals. Logic is added to sum in a carry input to create a full adder.
Figure 3-22 shows a half-adder truth table and the associated schematic is shown in Figure
3-23.

ina  inb  carry output  sum output
 0    0        0            0
 0    1        0            1
 1    0        0            1
 1    1        1            0

Figure 3-22 Half-Adder Truth Table

By inspection, we can see that the carry output is a AND b (a & b) and the sum output
is a XOR b (a ^ b).
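
As a minimal sketch of those two equations, the unregistered half adder below uses continuous assignments. The module name half_adder_comb is an assumption made for illustration; it is not one of the numbered listings.

```verilog
// Combinational half adder (illustrative sketch; half_adder_comb is a
// hypothetical name, not one of the numbered listings).
module half_adder_comb (ina, inb, sum_out, carry_out);
input  ina, inb;
output sum_out, carry_out;

assign sum_out   = ina ^ inb;   // sum output is a XOR b
assign carry_out = ina & inb;   // carry output is a AND b
endmodule
```

Read together, {carry_out, sum_out} is the two-bit binary value of ina + inb, which is an easy property to check in a test fixture.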

Figure 3-23 Half-Adder Schematic

We don’t have to use the Verilog synthesizer’s version of a half adder if we think we
can do a better job than the synthesis tool; we can make our own. After we build a byte-
wide adder, we’ll compare our results with those of LeonardoSpectrum and admire how
much smarter we are than the synthesis-tool vendor.

Listing 3-20 Verilog Version of Half Adder

module half_adder (ina, inb, sum_out, carry_out, reset, clk);

input ina, inb, reset, clk;
output sum_out, carry_out;
reg sum_out, carry_out;

always @ (posedge clk or posedge reset)

begin if (reset)
begin sum_out <= 0;
carry_out <= 0;
end
else begin
sum_out <= ina ^ inb;
carry_out <= ina & inb;
end
end
endmodule

Unfortunately, the half adder of Listing 3-20 only does half the job when adding
multibit values. As designers, we aren’t finding much job opportunity in designing single-
bit adders. To turn the half adder into a full adder, we take the output of a half adder and
connect it into another half adder. The carry input becomes the other input for the second
stage as shown in the truth table of Figure 3-24 and the logic of Listing 3-21.

ina  inb  carry in  sum  carry out
 0    0      0       0       0
 0    0      1       1       0
 0    1      0       1       0
 0    1      1       0       1
 1    0      0       1       0
 1    0      1       0       1
 1    1      0       0       1
 1    1      1       1       1

Figure 3-24 Full-Adder Truth Table

To create the equations for the full adder, cascade two half adders as shown in Figure
3-25.

Listing 3-21 Verilog Code Fragment for Full Adder

// Cascaded half adder equations.

carry_out <= (ina & inb) | ((ina ^ inb) & carry_in);
sum_out <= (ina ^ inb) ^ carry_in;

Figure 3-25 Full-Adder Schematic

Listing 3-22 Verilog Version of Full Adder

module full_adder (carry_out, sum_out, ina, inb, carry_in, clk,
                   reset);
input ina, inb, carry_in;
input clk, reset;
output carry_out, sum_out;
reg carry_out, sum_out;

always @ (posedge clk or posedge reset)

if (reset) begin
carry_out <= 0;
sum_out <= 0;
end
else
// Cascaded half adder equations.
begin
carry_out <= (ina & inb) + ((ina ^ inb) & carry_in);
sum_out <= ((ina ^ inb) ^ carry_in);
end
endmodule

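You’ll notice in Listing 3-22 and Figure 3-26 that, by habit, I changed the
combinational full-adder design to a synchronous version. In Figure 3-26, the modgen box
is simply an OR gate: Listing 3-22 combines the two carry terms with the + operator, and
because those two terms are never 1 at the same time, the one-bit addition reduces to an OR.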
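
To make the "two cascaded half adders" structure explicit, here is a hedged structural sketch of the combinational full adder. The module names ha_cell and full_adder_comb and the internal wire names are assumptions for illustration only.

```verilog
// Illustrative sketch: combinational full adder built from two half adders.
// ha_cell and full_adder_comb are hypothetical names for this example.
module ha_cell (ina, inb, sum_out, carry_out);
input  ina, inb;
output sum_out, carry_out;
assign sum_out   = ina ^ inb;
assign carry_out = ina & inb;
endmodule

module full_adder_comb (carry_out, sum_out, ina, inb, carry_in);
input  ina, inb, carry_in;
output carry_out, sum_out;
wire   sum1, carry1, carry2;

ha_cell first  (ina,  inb,      sum1,    carry1);  // a + b
ha_cell second (sum1, carry_in, sum_out, carry2);  // partial sum + carry in

// carry1 and carry2 can never both be 1, so an OR is enough here.
assign carry_out = carry1 | carry2;
endmodule
```

If carry1 is 1 then both inputs are 1, sum1 is 0, and the second half adder cannot produce a carry. That is why the final stage needs only an OR gate rather than a third adder.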

Figure 3-26 Full-Adder Schematic after Synthesis

To create wider adders in a structural manner, we can cascade as many of these full
adders as we wish. To create large adders, we should break the design into small modules
no larger than four bits wide and stitch them together. This aids the synthesizer and will
generally improve area and speed performance.
The carry output of one adder feeds the carry input of the higher-order bit. Note: the
size of the output register must be one larger than that of the input registers in order to
accept the carry output of the last stage. An exception to this occurs where the range of the
data inputs is limited so that a carry is not possible. The carry input for the LSB adder is
fixed at 0. An example of this type of adder is presented in Listing 3-23. To make this
design work, the full_adder.v design must be included in the project. Listing 3-23 is also an
example of a simple hierarchy.

Listing 3-23 Homebrew Structural Version of Byte-Wide Full Adder

module byte_adder (byte_a, byte_b, sum_output, clk, reset);

input [7:0] byte_a, byte_b;
input clk, reset;
output [8:0] sum_output;
wire [7:0] carry_output;

full_adder u1 (carry_output[0], sum_output[0], byte_a[0],
byte_b[0], 1'b0, clk, reset);
full_adder u2 (carry_output[1], sum_output[1], byte_a[1],
byte_b[1], carry_output[0], clk, reset);
full_adder u3 (carry_output[2], sum_output[2], byte_a[2],
byte_b[2], carry_output[1], clk, reset);
full_adder u4 (carry_output[3], sum_output[3], byte_a[3],
byte_b[3], carry_output[2], clk, reset);
full_adder u5 (carry_output[4], sum_output[4], byte_a[4],
byte_b[4], carry_output[3], clk, reset);
full_adder u6 (carry_output[5], sum_output[5], byte_a[5],
byte_b[5], carry_output[4], clk, reset);
full_adder u7 (carry_output[6], sum_output[6], byte_a[6],
byte_b[6], carry_output[5], clk, reset);
full_adder u8 (carry_output[7], sum_output[7], byte_a[7],
byte_b[7], carry_output[6], clk, reset);
assign sum_output[8] = carry_output[7];

endmodule
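
One property of Listing 3-23 is worth checking in simulation: each full_adder registers its carry_out, so a carry needs one clock to move from one bit to the next, and the 9-bit result settles only after the inputs have been held stable for several clocks. The sketch below is an alternative under the assumption that a single-cycle result is wanted. It keeps the ripple-carry chain combinational and registers only the final sum. The module name byte_adder_rc, the internal names, and the use of the Verilog-2001 always @* construct are assumptions for illustration.

```verilog
// Illustrative sketch (hypothetical module): ripple-carry byte adder with a
// combinational carry chain and a single output register.
module byte_adder_rc (byte_a, byte_b, sum_output, clk, reset);
input  [7:0] byte_a, byte_b;
input        clk, reset;
output [8:0] sum_output;
reg    [8:0] sum_output;

reg    [8:0] carry;      // carry[i] is the carry into bit i
reg    [7:0] sum_comb;
integer      i;

// Combinational ripple-carry chain built from the full-adder equations.
always @*
begin
    carry[0] = 1'b0;     // carry input for the LSB is fixed at 0
    for (i = 0; i < 8; i = i + 1)
    begin
        sum_comb[i] = byte_a[i] ^ byte_b[i] ^ carry[i];
        carry[i+1]  = (byte_a[i] & byte_b[i])
                    | ((byte_a[i] ^ byte_b[i]) & carry[i]);
    end
end

// Only the final result is registered.
always @ (posedge clk or posedge reset)
    if (reset) sum_output <= 0;
    else       sum_output <= {carry[8], sum_comb};
endmodule
```

This version still ripples the carry through eight stages, so it is no faster than any other ripple-carry adder. It simply moves all of the carry delay into one clock period.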

The synthesis summary for the byte_adder design is shown in Listing 3-24.

Listing 3-24 Homebrew Byte-Wide Full-Adder Synthesis Summary

Info, Instances dissolved by autodissolve in View

.work.byte_adder.INTERFACE
“C:/Verilog/SourceCode/byte_adder.v”, line 7: u1 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 8: u2 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 9: u3 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 10: u4 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 11: u5 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 12: u6 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 13: u7 (full_adder)
“C:/Verilog/SourceCode/byte_adder.v”, line 14: u8 (full_adder)
Using wire table: 4013xl-3_avg
Info, Inferred net ‘reset’ as GSR net.
-- Start optimization for design .work.byte_adder.INTERFACE
Using wire table: 4013xl-3_avg

Pass Area Delay DFFs PIs POs --CPU--

(FGs) (ns) min:sec
1 16 12 16 18 9 00:00
Info, Added global buffer BUFG for port clk
Using wire table: 4013xl-3_avg
-- Start timing optimization for design .work.byte_adder.INTERFACE
No critical paths to optimize at this level

*******************************************************

Cell: byte_adder View: INTERFACE Library: work

*******************************************************

Number of ports : 27
Number of nets : 68
Number of instances : 51
Number of references to this view : 0

Total accumulated area :

Number of BUFG : 1
Number of CLB Flip Flops : 7
Number of FG Function Generators : 16
Number of IBUF : 17
Number of IOB Output Flip Flops : 9
Number of Packed CLBs : 8
Number of STARTUP : 1

***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 27 77 35.06%
FG Function Generators 16 800 2.00%
H Function Generators 0 400 0.00%
CLB Flip Flops 7 800 0.88%

-----------------------------------------------
Clock Frequency Report
Clock : Frequency
------------------------------------
clk : 77.6 MHz

Listing 3-25 is an example of letting the synthesizer do all the work. We just use the Verilog
addition operator (+) to create the sum of two bytes. Listing 3-26 presents the statistics for
the synthesized design.

Listing 3-25 Synthesis-Tool Version of Byte-Wide Full Adder

module byte_adder2 (byte_a, byte_b, sum_output, clk, reset);

input [7:0] byte_a, byte_b;
input clk, reset;
output [8:0] sum_output;
reg [8:0] sum_output;

always @ (posedge clk or posedge reset)

begin
if (reset) sum_output <= 0;
else sum_output <= byte_a + byte_b;
end
endmodule

Listing 3-26 Synthesis-Tool Version of Byte-Wide Full Adder, Synthesis Summary

Info, Inferred net ‘reset’ as GSR net.

-- Start optimization for design .work.byte_adder2.INTERFACE
Using wire table: 4013xl-3_avg

Pass Area Delay DFFs PIs POs --CPU--

(FGs) (ns) min:sec
1 8 7 9 18 9 00:00
Info, Added global buffer BUFG for port clk
Using wire table: 4013xl-3_avg
-- Start timing optimization for design
.work.byte_adder2.INTERFACE
No critical paths to optimize at this level

*******************************************************

Cell: byte_adder2 View: INTERFACE Library: work

*******************************************************

Number of ports : 27
Number of nets : 103
Number of instances : 48
Number of references to this view : 0

Total accumulated area :

Number of BUFG : 1
Number of CY4 : 5
Number of FG Function Generators : 8
Number of IBUF : 17
Number of IOB Output Flip Flops : 9
Number of STARTUP : 1

***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 27 77 35.06%
FG Function Generators 8 800 1.00%
H Function Generators 0 400 0.00%
CLB Flip Flops 0 800 0.00%

-----------------------------------------------
Clock Frequency Report

Clock : Frequency
------------------------------------

clk : 135.1 MHz

Compare our homebrew byte-wide adder with the LeonardoSpectrum version. We
didn’t do very well, did we? Our version used twice as many function generators (16
versus 8) and ran at a little over half the clock rate (77.6 MHz versus 135.1 MHz).
We can improve our adder design by looking at some of the problems with our
approach. The adder we created is a Ripple Carry Adder (RCA). It is small in area, but
slow. The problem with the RCA is easy to illustrate. Imagine I give you 2 binary numbers
to add:

10101
11001

Suppose I ask you what is the result of adding only the highest-order bits (1 + 1)?
You say that you can’t answer my question without figuring out if a carry is generated by
the addition of all the lower-order bits. There you go! The carry must be calculated for all
the lower-order bits before the highest-order bits can be added. The output is not available
until the carry from each adder “ripples” through all stages to the summation and final carry
outputs. This adder would be faster if only we could “look ahead” and generate the carry
outputs in parallel instead of in series. We evaluate the inputs to create carry signals which
are added with partial sums. Oddly enough, others have thought of this idea and have
created an adder architecture called Carry Look Ahead (CLA Adder).
The CLA Adder is described in terms of Generate (carry terms) and Propagate (sum
terms) as shown in Listing 3-27.

Listing 3-27 Carry Generate and Carry Propagate Logic Code Fragment

// Definition of propagate/generate terms (not Verilog!)

generate[i] = (a[i] & b[i]);
propagate[i] = (a[i] ^ b[i]);

To describe a cascade-able (expandable) adder in terms of generate/propagate signals,
we need to add the carry from the previous stage(s); the propagate term becomes a
single-bit OR of the two inputs, as shown in Listing 3-28.

Listing 3-28 Carry Generate and Carry Propagate Logic, Expandable Adder

// Cascade-able propagate/generate terms (not Verilog!)

carry[i] = generate[i] | (propagate[i] & carry[i-1]);
propagate[i] = (a[i] | b[i]);

The sum is still formed by cascading half-adders, adding in the propagate (sum) term
and the carry term that is calculated in parallel. To make the blocks generic, the s[0] stage,
even though a carry-in at this stage is not allowed, will still use a carry input wired to 0.
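
The generate/propagate equations above can be sketched as a 4-bit carry look-ahead block. This is a hedged illustration, not the author's code: the module name cla4 is hypothetical, c[i] here is the carry into bit i (so c[i+1] corresponds to carry[i] in Listing 3-28), and the XOR form of propagate is used so that the same term can be reused for the sum.

```verilog
// Illustrative sketch (hypothetical module): 4-bit carry look-ahead adder.
module cla4 (a, b, carry_in, sum, carry_out);
input  [3:0] a, b;
input        carry_in;
output [3:0] sum;
output       carry_out;

wire [3:0] g = a & b;   // generate terms
wire [3:0] p = a ^ b;   // propagate terms (XOR form, reused for the sum)
wire [4:0] c;           // c[i] is the carry into bit i

// Every carry is written directly in terms of g, p, and carry_in, so no
// carry has to wait for another carry to ripple through.
assign c[0] = carry_in;
assign c[1] = g[0] | (p[0] & c[0]);
assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
            | (p[2] & p[1] & p[0] & c[0]);
assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c[0]);

assign sum       = p ^ c[3:0];
assign carry_out = c[4];
endmodule
```

The price of the parallel carries is visible in the equations: each additional bit adds one more product term, and the widest term grows by one input. That growth is one reason look-ahead blocks are usually kept to about four bits.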

Carry Select Adder

Another strategy for speeding up an adder adds even more hardware. Redundant hardware
is added to calculate the sum assuming a carry input and assuming no carry input. The
output is selected via multiplexer based on whether carry is required or not.
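
A minimal sketch of the carry-select idea, assuming an 8-bit adder split into two nibbles, might look like this. The module name carry_select8 and the wire names are hypothetical, and the + operator stands in for whatever adder is used inside each nibble.

```verilog
// Illustrative sketch (hypothetical module): 8-bit carry-select adder.
module carry_select8 (a, b, carry_in, sum, carry_out);
input  [7:0] a, b;
input        carry_in;
output [7:0] sum;
output       carry_out;

// Low nibble uses the real carry input.
wire [4:0] low   = a[3:0] + b[3:0] + carry_in;
// High nibble is computed twice, once for each possible carry.
wire [4:0] high0 = a[7:4] + b[7:4];          // assumes carry = 0
wire [4:0] high1 = a[7:4] + b[7:4] + 1'b1;   // assumes carry = 1

// The low nibble's carry out selects the precomputed high result.
wire [4:0] high  = low[4] ? high1 : high0;

assign sum       = {high[3:0], low[3:0]};
assign carry_out = high[4];
endmodule
```

Both high-nibble sums are ready by the time the low nibble's carry arrives, so the carry only has to pass through a multiplexer. A synthesis tool may restructure logic written this way, so treat it as an illustration of the architecture rather than a guaranteed implementation.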

Carry Skip Adder

Another strategy for speeding up the adder allows the use of an inverter for cases where the
input bits are not equal. The addition uses the XOR function: when the two input bits
are different, the sum is the inverse of the carry input and the carry simply passes
through. This type of adder is usually implemented in groups, probably groups of 4. This
version uses less real estate than a CLA and is about as fast.
There are other adder strategies which trade area for speed. The main thing: get an
appreciation for the logic that is synthesized when the Verilog addition operator is used. The
best synthesis strategy: cheat! Keep the adder input lengths short and fight the system
engineer to reduce resolution to the point of diminishing returns. Don’t implement a 16-bit
adder if a 14-bit adder will do. Also, run the adder at the lowest possible clock frequency.
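
The carry-skip idea described above can be sketched for one 4-bit group as follows. The module name carry_skip4 is hypothetical, and the + operator is only a stand-in for the ripple chain inside the group.

```verilog
// Illustrative sketch (hypothetical module): one 4-bit carry-skip group.
module carry_skip4 (a, b, carry_in, sum, carry_out);
input  [3:0] a, b;
input        carry_in;
output [3:0] sum;
output       carry_out;

wire [3:0] p      = a ^ b;                // bit i propagates when inputs differ
wire [4:0] ripple = a + b + carry_in;     // stand-in for the group's ripple chain
wire       skip   = &p;                   // true when every bit propagates

assign sum       = ripple[3:0];
// When all four bits propagate, the carry input bypasses the group.
assign carry_out = skip ? carry_in : ripple[4];
endmodule
```

The multiplexer does not change the arithmetic result; it only gives the carry a short path around the group. Because the two paths are logically equivalent, a synthesis tool may optimize the bypass away unless it is told to preserve it.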

SUBTRACTORS

The subtractor is similar to the adder and there are corresponding versions of subtractors for
the adders described above (Ripple Borrow Subtractor, Borrow Save Subtractor, Borrow
Select Subtractor, and so on). The logic for the Ripple Borrow Subtractor is shown to
illustrate the similarity to Adder circuits. Figure 3-27 illustrates ina - inb in a half-borrow
circuit.

ina  inb  borrow output  diff output
 0    0         0             0
 0    1         1             1
 1    0         0             1
 1    1         0             0

Figure 3-27 Half-Subtractor Truth Table

Listing 3-29 Subtractor Logic Code Fragment

Bout = ~ina & inb; // Note similarity to adder: carry = ina & inb.
Diff = ina ^ inb;  // Same as the adder sum (xor):
                   // (ina & ~inb) | (~ina & inb).

Let’s expand the half-borrow logic for the full subtractor, again ina – inb as shown in
Figure 3-28.

borrow input  ina  inb  borrow output  diff
     0         0    0         0          0
     0         0    1         1          1
     0         1    0         0          1
     0         1    1         0          0
     1         0    0         1          1
     1         0    1         1          0
     1         1    0         0          0
     1         1    1         1          1

Figure 3-28 Full-Subtractor Truth Table
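
As a hedged sketch, the full-subtractor equations for ina - inb - borrow input can be written the same way as the full-adder equations. The module and port names below are assumptions for illustration.

```verilog
// Illustrative sketch (hypothetical module): combinational full subtractor
// computing ina - inb - borrow_in.
module full_subtractor (borrow_in, ina, inb, borrow_out, diff);
input  borrow_in, ina, inb;
output borrow_out, diff;

// Difference is the same three-input XOR used for the adder sum.
assign diff       = ina ^ inb ^ borrow_in;
// Borrow when inb exceeds ina, or when they are equal and a borrow comes in.
assign borrow_out = (~ina & inb) | (~(ina ^ inb) & borrow_in);
endmodule
```

The only difference from the full-adder carry equation is the inversion of ina, which mirrors the half-subtractor borrow term ~ina & inb.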

MULTIPLIERS

Verilog synthesis tools generally support unsigned multiplication and division by powers of two. This
is not a big challenge; these are simply shift-left (a multiply by two for each binary shift
left) and shift-right (a divide by two for each binary shift right) operations. This doesn’t
mean we give up. FPGAs are capable of performing sophisticated math functions at high
speed; we need to use library modules (which limits portability) or create the logic
ourselves. Again, the best strategy for dealing with advanced math is to cheat. Work with
the system engineer to reduce the number of bits to be multiplied. Don’t use eight bits
when seven will do. Do a model in C to manipulate test data and examine the results. Use
the lowest resolution possible to achieve acceptable results. Implementing a multiplier is
easier if the variable inputs are multiplied by constants. If the input range is limited, this
might save some logic. Suppose, for example, a variable is eight bits wide, but the range of
legal inputs is 0 to 160. Every bit helps simplify logic and can increase operating speed.
The result of unsigned multiplication requires n + m bits of width, where n and m are
the sizes of the input variables. For example, when multiplying two four-bit values, a
register eight bits wide is required to hold the maximum result.

Hard-Wired Multipliers

The best way to illustrate the multiplication algorithm is by example. Let’s assume we
want to multiply an integer nibble n by a constant integer nibble with a value of D(16).

2^3  2^2  2^1  2^0   Bit weight
n3   n2   n1   n0    Variable nibble to be multiplied
1    1    0    1     Constant D(16)

The multiplication process shifts and adds, working through the bits of the constant. The
rightmost bit (weight 1) is 1, which means: add n. The next bit (weight 2) is 0 and
can be ignored; it does not affect the result (an example of simplification due to
multiplying by a constant). The next bit (weight 4) is 1, which means: multiply n by 4
(shift n left twice) and add. The leftmost bit (weight 8) is 1, which means: multiply n
by 8 (shift n left three times) and add.
Here’s what we get:

Result = (n * 8) + (n * 4) + (n * 1)

Let’s assume the variable is B(16) or 1011(2) and plug the numbers in.

Result = (1011 * 8) + (1011 * 4) + (1011 * 1)

Result = 1011000 + 101100 + 1011
Result = 10001111 = 8F(16)

This is cool: multiplication turns into a bunch of shift and adds. We already know
how to shift and we already know how to add, so we’re in good shape. To turn this into a
generic multiplier, we have to be prepared to add a shifted value (or zero) for each bit.
Let’s see what a Verilog version of this multiplier might look like (see Listing 3-30).
This is just one way to do the logic! I could have used the shift operator. I could have used
structural adders. To speed the design up, I could have pipelined the intermediate sums to
reduce the considerable amount of combinational logic between flipflops.

Listing 3-30 Hard-Wired Multiplier Example

module byte_mult (nibble_in, byte_out, clk, reset);

input [3:0] nibble_in;
input clk, reset;
output [7:0] byte_out;
reg [7:0] byte_out;

always @ (posedge clk or posedge reset)
    if (reset)
        byte_out <= 0;
    else
    begin
        // Shift nibble by padding with zeroes. Each term is padded to
        // eight bits (MSB zero) so that the size of the right-hand side
        // matches the size of the left-hand side.
        byte_out <= {1'b0, nibble_in[3:0], 3'b0}
                  + {2'b0, nibble_in[3:0], 2'b0}
                  + {4'b0, nibble_in[3:0]};
    end

endmodule

Hard-wired multipliers are fully custom and not at all generic. If you’re trying to
convince the system designer that a hard-wired multiplier is the right answer, be careful
what you promise. Changing the constant you multiply by requires changing the code. If
you are having resource problems in the FPGA, there may not be enough room to squeeze
in a new set of coefficients (count the number of 1’s in the desired constant; there will be a
shift/add for each 1 present in the constant). Adding resolution to the input or the constant
can increase the logic considerably.

Generic Multipliers

To change the hard-wired multiplier to a generic 4-by-4 multiplier, we must create logic
which allows all the shift and adds to be used (whether they are used or not depends on the
data values). Again, there are many ways to do this; the example of Listing 3-31 is just one
way.

Listing 3-31 Generic 4 x 4 Multiplier Example

module byte_mult2 (nibble1, nibble2, byte_out, clk, reset);

input [3:0] nibble1, nibble2;
input clk, reset;
output [7:0] byte_out;
reg [7:0] byte_out, stored3, stored2, stored1, stored0;

always @ (posedge clk or posedge reset)
    if (reset)
    begin
        byte_out <= 0;
        stored3 <= 0;
        stored2 <= 0;
        stored1 <= 0;
        stored0 <= 0;
    end
    else
    begin
        // Shift nibble by padding with zeroes. Each term is padded to
        // eight bits (MSB zero) so that the size of the right-hand side
        // matches the size of the left-hand side.
        stored3 <= nibble1[3] ? {1'b0, nibble2[3:0], 3'b0} : 8'b0;
        stored2 <= nibble1[2] ? {2'b0, nibble2[3:0], 2'b0} : 8'b0;
        stored1 <= nibble1[1] ? {3'b0, nibble2[3:0], 1'b0} : 8'b0;
        stored0 <= nibble1[0] ? {4'b0, nibble2[3:0]} : 8'b0;
        byte_out <= stored3 + stored2 + stored1 + stored0;
    end
endmodule

One thing about Listing 3-31 should be mentioned. Verilog will let you do dumb
things; there are no checks to make sure registers are wide enough to accept the data you are
calculating. Normally, you’d expect each sum register to be one bit wider than the widest
number to be added. However, because I know the nature of the numbers held in
the stored registers (i.e., that the MSB is guaranteed to be zero), I can get away with the
stored width being the same as the byte_out width. We know that 8 bits are all that are
required to store the value of two nibbles multiplied together. Use caution when making
assumptions like this!
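
One way to gain confidence in width assumptions like these is to compare the shift-and-add logic against the * operator in simulation. The sketch below is a combinational shift-and-add multiplier written with a loop. The module name nibble_mult_comb and the use of the Verilog-2001 always @* construct are assumptions for illustration; this is not the pipelined design of Listing 3-31.

```verilog
// Illustrative sketch (hypothetical module): combinational 4 x 4
// shift-and-add multiplier.
module nibble_mult_comb (nibble1, nibble2, product);
input  [3:0] nibble1, nibble2;
output [7:0] product;
reg    [7:0] product;
integer      i;

always @*
begin
    product = 8'b0;
    // For each 1 bit in nibble1, add nibble2 shifted left by that
    // bit's position.
    for (i = 0; i < 4; i = i + 1)
        if (nibble1[i])
            product = product + ({4'b0, nibble2} << i);
end
endmodule
```

A test fixture can step through all 256 input pairs and compare product against nibble1 * nibble2. The largest result, 15 x 15 = 225, fits in eight bits, which is the n + m rule in action.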

Multiplying by Fractional Values

Often, the constant coefficients are fractional values. These are easily handled by scaling the
coefficients into integer values, then scaling the result back again. You will want to reduce
the resolution and values to the easiest numbers to deal with. Let’s say the system
engineer asks for a coefficient of 0.80. Won’t an easier number like 0.75 (2^-1 (or ½) + 2^-2 (or
¼)) work with acceptable results? If so, the job is easier. If not, then so be it.
To multiply a nibble n by 0.75, realize that 0.75 x 4 = 3 (a number we like a lot) and
that (n x 4) / 4 = n. So, multiply the coefficient by 4, then divide the result by 4 later and
you’re even.
If the system engineer absolutely insists on a coefficient like 0.8, ask for the required
accuracy (10%? 5%? 1%?), then factor 0.8 into binary (0.8 = ½ + ¼ + 1/32 …) and do
more scaling shifts as required.
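
As a hedged sketch of this scaling, the module below multiplies a nibble by 0.75 and by an approximation of 0.8. The module name scale_nibble and its port names are hypothetical, and the four-term approximation 0.8 ≈ ½ + ¼ + 1/32 + 1/64 = 51/64 (0.796875) is an assumed accuracy target, not a figure from the text.

```verilog
// Illustrative sketch (hypothetical module): fractional coefficients by
// scaling up, multiplying by an integer, and scaling back down.
module scale_nibble (n, times_0p75, times_0p8);
input  [3:0] n;
output [3:0] times_0p75, times_0p8;

// 0.75 = 3/4: multiply by 3 (n + 2n), then divide by 4 (drop two LSBs).
wire [5:0] n_times_3  = {2'b0, n} + {1'b0, n, 1'b0};
assign times_0p75 = n_times_3[5:2];

// 0.8 is approximated as 51/64: multiply by 51 (32 + 16 + 2 + 1),
// then divide by 64 (drop six LSBs).
wire [9:0] n_times_51 = {1'b0, n, 5'b0} + {2'b0, n, 4'b0}
                      + {5'b0, n, 1'b0} + {6'b0, n};
assign times_0p8 = n_times_51[9:6];
endmodule
```

Dropping the low-order bits truncates rather than rounds, so results can be one count low. Whether that is acceptable is exactly the accuracy question to settle with the system engineer.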
CHAPTER 4

More Digital Circuits: Counters, ROMs, and RAMs

This chapter presents an assortment of digital designs implemented with Verilog.

RIPPLE COUNTERS

The most common (generic) counter is a ripple counter, so described because the output
ripples from stage to stage. If we create a Verilog counter like Listing 4-1, using the binary-
counter option in Exemplar Logic LeonardoSpectrum’s Input File menu as shown in Figure
4-1, we’ll find the result is a synchronous binary counter.


Listing 4-1 Verilog Code for Simple Counter

module ripple1 (count_out, clk, reset);

input clk, reset;
output [3:0] count_out;
reg [3:0] count_out;

always @ (posedge clk or posedge reset)

if (reset)
count_out <= 0;
else
count_out <= count_out + 1;

endmodule

Figure 4-1 Counter Style Selection

The problem with the ripple counter is that, because more than one output is changing
at once, using combinational logic to decode output states results in glitchy signals. To
avoid this problem, use counters like Gray Code, Johnson, or synchronous binary counters
like Figure 4-1.
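
A hedged sketch of the Gray Code alternative mentioned above: the counter below keeps an ordinary binary count internally and registers its Gray-coded equivalent, so exactly one output bit changes per clock. The module name gray4 and the internal names are assumptions for illustration.

```verilog
// Illustrative sketch (hypothetical module): 4-bit Gray Code counter.
module gray4 (gray_out, clk, reset);
input        clk, reset;
output [3:0] gray_out;
reg    [3:0] gray_out;
reg    [3:0] bin_count;
wire   [3:0] bin_next = bin_count + 1;

always @ (posedge clk or posedge reset)
    if (reset)
    begin
        bin_count <= 0;
        gray_out  <= 0;
    end
    else
    begin
        bin_count <= bin_next;
        // Binary-to-Gray conversion: XOR each bit with its upper neighbor.
        gray_out  <= bin_next ^ (bin_next >> 1);
    end
endmodule
```

This form spends eight flipflops on a four-bit count. The extra registers buy an output that comes straight from flipflops and changes one bit at a time, including at the wrap from the last count back to zero.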

JOHNSON COUNTERS

The Johnson counter is a type of shift counter. A shift counter uses little combinational
logic to create the count logic and therefore can operate at high speed (the operating speed
is limited only by how fast a flipflop can switch states and by the propagation delay of the
simple count logic). The Johnson counter wraps an inverted version of the highest-order bit
back to the lowest-order bit. Like the Gray Code counter, it has one output that changes at
each clock. This results in a glitch-free output when decoded with combinational logic.
Disadvantages include the requirement of more registers to store the count variable (around
n/2 registers are required, where n is the max count value) and lack of error recovery. If a
bad count pattern gets loaded, it will recirculate until the registers are reinitialized (if this
ever happens!). The schematic for a Johnson counter is shown in Figure 4-2 with
corresponding Verilog code in Listing 4-2 and count sequence in Listing 4-3.

Figure 4-2 Johnson Counter Schematic

Listing 4-2 Johnson Counter Verilog Code

module johnson1(clock, reset, count_out);

input clock, reset;
output [3:0] count_out;
reg [3:0] count_out;

always @ (posedge clock or posedge reset)

if (reset)
count_out <= 0;
else begin
count_out[3:1] <= count_out[2:0];
count_out[0] <= ~count_out[3];
end
endmodule

Listing 4-3 Johnson Counter Output Sequence

0000
0001
0011
0111
1111
1110
1100
1000
0000
Repeat…

The alert designer notices that not all states are used in the count cycle. We have eight
count states that are not used. Wasted counter states indicate that the design does not use
registers efficiently, but this may not be important. However, if an illegal count occurs due
to noise, there is no way to recover. Illegal states without recovery will make the careful
designer nervous. Let’s add some logic, as shown in Listing 4-4, to detect and recover from
those illegal states. This logic makes the counter a lot more complex, but it may be
worthwhile to create a robust counter with glitchless output decoding.

Listing 4-4 Johnson Counter with Error Recovery

module johnson2(clock, reset, count_out);

input clock, reset;
output [3:0] count_out;
reg [3:0] count_out;

always @ (posedge clock or posedge reset)
    if (reset)
        count_out <= 0;
    // Add fault recovery.
    else if (count_out == 4'h2) count_out <= 0;
    else if (count_out == 4'h4) count_out <= 0;
    else if (count_out == 4'h5) count_out <= 0;
    else if (count_out == 4'h6) count_out <= 0;
    else if (count_out == 4'h9) count_out <= 0;
    else if (count_out == 4'ha) count_out <= 0;
    else if (count_out == 4'hb) count_out <= 0;
    else if (count_out == 4'hd) count_out <= 0;
    else begin
        count_out[3:1] <= count_out[2:0];
        count_out[0] <= ~count_out[3];
    end
endmodule

Another method of error recovery to consider is allowing an external device (like a
microcontroller) to reinitialize the counter if an error occurs. Generally, software designers
complain if the hardware staff creates logic that can’t be written to (or read from after a
write occurs) by the software. Ability to read and write registers adds to the testability of the
hardware, which is generally a good thing. This adds logic, which increases the design size
and reduces the operating speed: bad things.

LINEAR FEEDBACK SHIFT REGISTERS

A type of counter that is quite interesting is a Linear Feedback Shift Register or LFSR
counter. It is similar to the Johnson counter except that instead of an inverter from the last
stage back to the first stage, a small number of taps are recycled. The counter next-state
logic is very simple (a few XOR or XNOR gates). With maximal-length logic (taps selected
to give the maximal count), n registers can create counts of up to (2^n) - 1
(compared to a binary-counter count length of 2^n). The one state that is missing from a
maximal-length LFSR count sequence is the no-recovery state (all zeros for an XOR version
or all ones for an XNOR version). An LFSR counter can operate at high speed compared to
a binary counter because the feedback logic is very simple. For cases where the count value
is arbitrary (the LFSR count sequence is pseudorandom) the LFSR counter can be a good
solution.
How these counters work can be illustrated by example. A maximal-length 4-bit
LFSR counter can use the taps [3,0] (maximal length might be achieved with other taps,
too). The taps are the register outputs that are fed back. Figure 4-3 has four flipflops and a
single XNOR gate.

Figure 4-3 4-Bit LFSR Counter Schematic


This version is sometimes called ‘many-to-one’; notice how taps are derived from
many outputs, then XOR’d (or XNOR’d) back to the input. There is also a variation called
‘one-to-many’ where the output of the last stage is fed back and XOR’d into the inputs of
several stages.
The counter of Listing 4-5 is an XNOR version. I generally use this version because
the illegal state consists of all ones; I prefer to reset all registers on power-up (rather than
preset some or all of the registers, which is necessary with the XOR version). Listing 4-6
presents a simple test fixture for testing the Verilog design of Listing 4-5.

Listing 4-5 Verilog Version of a 4-bit LFSR Counter

module lfsr4 (clock, reset, lfsr_count);

input clock, reset;
output [3:0] lfsr_count;
reg [3:0] lfsr_count;

always @ (posedge clock or posedge reset)

if (reset)
lfsr_count <= 0;
else
begin
lfsr_count[3:1] <= lfsr_count[2:0];
lfsr_count[0] <= lfsr_count[3] ~^ lfsr_count[0];
end
endmodule

Listing 4-6 Simple Test Fixture for 4-bit LFSR Counter

`timescale 1ns / 1ns

module lfsr4_tf(clock, reset, lfsr_count);

output clock, reset;
reg clock, reset;
output [3:0] lfsr_count; // Driven by the counter under test.
parameter clk_period = 20;

lfsr4 u1 (clock, reset, lfsr_count);

always begin
    #(clk_period / 2) clock = ~clock;
end
initial begin
    clock = 0;
    reset = 1; // Assert the system reset.
    #75 reset = 0;
end
endmodule

Figure 4-4 shows the count sequence for the 4-bit LFSR counter.
// binary hex
// 0000 0
// 0001 1
// 0010 2
// 0101 5
// 1010 a
// 0100 4
// 1001 9
// 0011 3
// 0110 6
// 1101 d
// 1011 b
// 0111 7
// 1110 e
// 1100 c
// 1000 8

// 0000 0 Repeat sequence…

Figure 4-4 4-Bit LFSR Count Sequence

Looks like a big mess, doesn’t it? That’s part of the LFSR counter’s charm.
Sequential values are loosely correlated, or pseudorandom. This can be useful for reducing
clock harmonic noise. For example, in a binary counter, the lowest-order bit toggles on
every clock; this results in noise that is highly correlated to the system clock and adds
energy at subharmonics of the system clock. This harmonic energy is a large source of
system noise. An LFSR counter generates more wideband noise with lower peak energy
content, because the counter bits are changing in a more random manner.
Table 4-1 lists taps for maximal-length LFSR counters. Other tap selections are
possible for some of the counter lengths.

Number of Bits Length of Loop Taps

2* 3 [1,0]
3* 7 [2,0]
4 15 [3,0]
5* 31 [4,1]
6 63 [5,0]
7* 127 [6,0]
8 255 [7,3,2,1]
9 511 [8,3]
10 1,023 [9,2]
11 2,047 [10,1]
12 4,095 [11,5,3,0]
13 * 8,191 [12,3,2,0]
14 16,383 [13,4,2,0]
15 32,767 [14,0]
16 65,535 [15,4,2,1]
17 * 131,071 [16,2]
18 262,143 [17,6]
19 * 524,287 [18,4,1,0]
20 1,048,575 [19,2]
21 2,097,151 [20,1]
22 4,194,303 [21,0]
23 8,388,607 [22,4]
24 16,777,215 [23,3,2,0]
25 33,554,431 [24,2]
26 67,108,863 [25,5,1,0]
27 134,217,727 [26,4,1,0]
28 268,435,455 [27,2]
29 536,870,911 [28,1]
30 1,073,741,823 [29,5,3,0]
31 * 2,147,483,647 [30,2]
32 4,294,967,295 [31,6,5,1]

* indicates sequences whose length is a prime number

Sequences 2, 3, 5, 7, 13, 17, 19, 31 have lengths that are prime numbers.
This table is from Designus Maximus Unleashed by Clive Maxfield; it is copyrighted by Butterworth-Heinemann,
1998, and is used by permission.

Table 4-1 Maximal-Length LFSR Taps

From Table 4-1, you can see we can create a 31-bit counter with 31 registers and a single
XOR gate. Imagine the ripple-carry logic required to create a 31-bit binary counter!
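
To try other rows of Table 4-1 without rewriting the feedback expression each time, a parameterized sketch can take the taps as a bit mask. The module name lfsr_n and the WIDTH and TAPS parameters are assumptions for illustration. The reduction used here matches the chained XNOR style of the listings only when the number of taps is even, which is true of every row in Table 4-1.

```verilog
// Illustrative sketch (hypothetical module): parameterized XNOR LFSR counter.
module lfsr_n (clock, reset, lfsr_count);
parameter WIDTH = 5;
parameter TAPS  = 5'b10010;     // mask for taps [4,1] from Table 4-1

input              clock, reset;
output [WIDTH-1:0] lfsr_count;
reg    [WIDTH-1:0] lfsr_count;

// XNOR feedback: mask off the tapped bits, XOR them together, and invert.
wire feedback = ~(^(lfsr_count & TAPS));

always @ (posedge clock or posedge reset)
    if (reset)
        lfsr_count <= 0;
    else
        lfsr_count <= {lfsr_count[WIDTH-2:0], feedback};
endmodule
```

For the 31-bit counter with taps [30,2], the instance would be written as lfsr_n #(31, 31'h40000004) u1 (clock, reset, count); the instance and net names are again hypothetical. A test fixture can confirm the loop length for the small cases by counting clocks until the all-zeros state returns.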

An example of the use of an LFSR counter is to create simple logic for a divide-by-N
circuit. In this design, a terminal count is provided as an input to be compared to.
Listing 4-7 illustrates an 8-bit divide-by-N counter, Listing 4-8 shows a test fixture,
Listing 4-9 is the output list of the pseudorandom count sequence, and Figure 4-5 shows
the waveforms at the count rollover.

Listing 4-7 Verilog Version of an 8-bit Divide-by-N LFSR Counter

// 8-bit Divide-by-N LFSR Counter.

module lfsr8 (clock, reset, lfsr_count, terminal_cnt, rollover);
input clock, reset;
input [7:0] terminal_cnt;
output [7:0] lfsr_count;
reg [7:0] lfsr_count;
output rollover;
reg rollover;

always @ (posedge clock or posedge reset)

if (reset)
begin
lfsr_count <= 0;
rollover <= 0;
end
else
if (lfsr_count == terminal_cnt)// Test for terminal count.
begin
rollover <= 1;
lfsr_count <= 0;
end
else begin
rollover <= 0;
lfsr_count[7:1] <= lfsr_count[6:0];
lfsr_count[0] <= lfsr_count[7] ~^ (lfsr_count[3]
~^ (lfsr_count[2] ~^ lfsr_count[1]));
end
endmodule

Listing 4-8 Verilog Version of a 8-bit Divide-by-N LFSR Counter Test Fixture

// 8-bit Divide-by-N LFSR Counter Test Fixture.

`timescale 1ns / 1ns

module lfsr8_tf(clock, reset, lfsr_count, terminal_cnt);
output clock, reset;
reg clock, reset;
output [7:0] lfsr_count; // Driven by the counter under test.
output [7:0] terminal_cnt;
reg [7:0] terminal_cnt;
wire rollover;

parameter clk_period = 20;

lfsr8 u1 (clock, reset, lfsr_count, terminal_cnt, rollover);

always
begin
    #(clk_period / 2) clock = ~clock;
end

initial
begin
    clock = 0;
    reset = 1; // Assert the system reset.
    terminal_cnt = 8'd66; // Test assignment.
    #75 reset = 0;
end
endmodule
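
Because the count sequence is pseudorandom, the terminal count for a given divide ratio is not obvious by inspection. The simulation-only sketch below steps the same feedback equation as the 8-bit counter to find it. The module name lfsr8_terminal and the parameter N are assumptions for illustration, and the sketch assumes the counter restarts from zero as in Listing 4-7.

```verilog
// Simulation-only helper (hypothetical, not one of the numbered listings):
// prints the terminal_cnt value that makes the 8-bit LFSR divide by N.
module lfsr8_terminal;
parameter N = 10;           // desired divide ratio (assumed example value)

reg [7:0] state;
integer   i;

initial
begin
    state = 8'h00;          // the counter restarts from zero
    // The counter spends one clock in each state from 00 through the
    // terminal count, so the terminal count is the state N-1 steps after 00.
    for (i = 0; i < N - 1; i = i + 1)
        state = {state[6:0],
                 ~(state[7] ^ state[3] ^ state[2] ^ state[1])};
    $display("divide by %0d: terminal_cnt = 8'h%h", N, state);
end
endmodule
```

The same stepping could be done in a C model or a spreadsheet. Keeping it in Verilog means the feedback expression can be copied directly from the counter, which reduces the chance of the two drifting apart.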

Listing 4-9 8-bit Divide-by-N LFSR Counter Count Sequence

lfsr_count = 00, rollover = 0 lfsr_count = 00, rollover = 0

lfsr_count = 00, rollover = 0 lfsr_count = 00, rollover = 0
lfsr_count = 01, rollover = 0 lfsr_count = 03, rollover = 0
lfsr_count = 06, rollover = 0 lfsr_count = 0d, rollover = 0
lfsr_count = 1b, rollover = 0 lfsr_count = 37, rollover = 0
lfsr_count = 6f, rollover = 0 lfsr_count = de, rollover = 0
lfsr_count = bd, rollover = 0 lfsr_count = 7a, rollover = 0
lfsr_count = f5, rollover = 0 lfsr_count = eb, rollover = 0
lfsr_count = d6, rollover = 0 lfsr_count = ac, rollover = 0
lfsr_count = 58, rollover = 0 lfsr_count = b0, rollover = 0
lfsr_count = 60, rollover = 0 lfsr_count = c1, rollover = 0
lfsr_count = 82, rollover = 0 lfsr_count = 05, rollover = 0
lfsr_count = 0a, rollover = 0 lfsr_count = 15, rollover = 0
lfsr_count = 2a, rollover = 0 lfsr_count = 55, rollover = 0
lfsr_count = aa, rollover = 0 lfsr_count = 54, rollover = 0
lfsr_count = a8, rollover = 0 lfsr_count = 51, rollover = 0
lfsr_count = a3, rollover = 0 lfsr_count = 47, rollover = 0
lfsr_count = 8f, rollover = 0 lfsr_count = 1f, rollover = 0
lfsr_count = 3e, rollover = 0 lfsr_count = 7c, rollover = 0
lfsr_count = f9, rollover = 0 lfsr_count = f3, rollover = 0
lfsr_count = e7, rollover = 0 lfsr_count = ce, rollover = 0
lfsr_count = 9d, rollover = 0 lfsr_count = 3a, rollover = 0
lfsr_count = 75, rollover = 0 lfsr_count = ea, rollover = 0
lfsr_count = d4, rollover = 0 lfsr_count = a9, rollover = 0
lfsr_count = 53, rollover = 0 lfsr_count = a6, rollover = 0
lfsr_count = 4c, rollover = 0 lfsr_count = 99, rollover = 0

lfsr_count = 33, rollover = 0 lfsr_count = 66, rollover = 0
lfsr_count = cd, rollover = 0 lfsr_count = 9a, rollover = 0
lfsr_count = 34, rollover = 0 lfsr_count = 68, rollover = 0
lfsr_count = d0, rollover = 0 lfsr_count = a0, rollover = 0
lfsr_count = 40, rollover = 0 lfsr_count = 81, rollover = 0
lfsr_count = 02, rollover = 0 lfsr_count = 04, rollover = 0
lfsr_count = 08, rollover = 0 lfsr_count = 10, rollover = 0
lfsr_count = 21, rollover = 0 lfsr_count = 43, rollover = 0
lfsr_count = 86, rollover = 0 lfsr_count = 0c, rollover = 0
lfsr_count = 19, rollover = 0 lfsr_count = 32, rollover = 0
lfsr_count = 64, rollover = 0 lfsr_count = c8, rollover = 0
lfsr_count = 91, rollover = 0 lfsr_count = 22, rollover = 0
lfsr_count = 44, rollover = 0 lfsr_count = 88, rollover = 0
lfsr_count = 11, rollover = 0 lfsr_count = 23, rollover = 0
lfsr_count = 46, rollover = 0 lfsr_count = 8d, rollover = 0
lfsr_count = 1a, rollover = 0 lfsr_count = 35, rollover = 0
lfsr_count = 6a, rollover = 0 lfsr_count = d5, rollover = 0
lfsr_count = ab, rollover = 0 lfsr_count = 56, rollover = 0
lfsr_count = ad, rollover = 0 lfsr_count = 5a, rollover = 0
lfsr_count = b5, rollover = 0 lfsr_count = 6b, rollover = 0
lfsr_count = d7, rollover = 0 lfsr_count = ae, rollover = 0
lfsr_count = 5d, rollover = 0 lfsr_count = bb, rollover = 0
lfsr_count = 76, rollover = 0 lfsr_count = ed, rollover = 0
lfsr_count = da, rollover = 0 lfsr_count = b4, rollover = 0
lfsr_count = 69, rollover = 0 lfsr_count = d2, rollover = 0
lfsr_count = a5, rollover = 0 lfsr_count = 4b, rollover = 0
lfsr_count = 97, rollover = 0 lfsr_count = 2e, rollover = 0
lfsr_count = 5c, rollover = 0 lfsr_count = b9, rollover = 0
lfsr_count = 73, rollover = 0 lfsr_count = e6, rollover = 0
lfsr_count = cc, rollover = 0 lfsr_count = 98, rollover = 0
lfsr_count = 31, rollover = 0 lfsr_count = 63, rollover = 0
lfsr_count = c6, rollover = 0 lfsr_count = 8c, rollover = 0
lfsr_count = 18, rollover = 0 lfsr_count = 30, rollover = 0
lfsr_count = 61, rollover = 0 lfsr_count = c3, rollover = 0
lfsr_count = 87, rollover = 0 lfsr_count = 0e, rollover = 0
lfsr_count = 1c, rollover = 0 lfsr_count = 39, rollover = 0
lfsr_count = 72, rollover = 0 lfsr_count = e4, rollover = 0
lfsr_count = c9, rollover = 0 lfsr_count = 93, rollover = 0
lfsr_count = 27, rollover = 0 lfsr_count = 4f, rollover = 0
lfsr_count = 9e, rollover = 0 lfsr_count = 3d, rollover = 0
lfsr_count = 7b, rollover = 0 lfsr_count = f7, rollover = 0
lfsr_count = ee, rollover = 0 lfsr_count = dd, rollover = 0
lfsr_count = ba, rollover = 0 lfsr_count = 74, rollover = 0
lfsr_count = e8, rollover = 0 lfsr_count = d1, rollover = 0
lfsr_count = a2, rollover = 0 lfsr_count = 45, rollover = 0
lfsr_count = 8a, rollover = 0 lfsr_count = 14, rollover = 0
lfsr_count = 28, rollover = 0 lfsr_count = 50, rollover = 0
lfsr_count = a1, rollover = 0 lfsr_count = 42, rollover = 0
lfsr_count = 00, rollover = 1 lfsr_count = 01, rollover = 0

Figure 4-5 8-Bit Divide-by-N LFSR Count Simulation at Rollover

The one-to-many variation shown in Listing 4-10 splits the XORs (or XNORs) into
2-input gates and distributes them throughout the register array. Note: The same taps are
used, simply in a different form. In words, the 4-bit counter taps [3,0] mean: XOR (or
XNOR) the outputs of register 0 and register 3 and connect the result to the input of register
1. The last register is wrapped back to register 0. This still results in a maximal-length
sequence, but the count sequence (and the terminal count value for a given count) will be
different. The schematic extracted from Listing 4-10 is shown in Figure 4-6. The output
waveform is shown in Figure 4-7.

Figure 4-6 4-Bit LFSR One-to-Many Schematic

Listing 4-10 4-Bit LFSR One-to-Many Code

module lfsr4v2 (clock, reset, lfsr_count);

input clock, reset;
output [3:0] lfsr_count;
reg [3:0] lfsr_count;

always @ (posedge clock or posedge reset)

if (reset)
lfsr_count <= 0;
else begin
lfsr_count[0] <= lfsr_count[3];
lfsr_count[1] <= lfsr_count[3] ~^ lfsr_count[0];
lfsr_count[3:2] <= lfsr_count[2:1];
end
endmodule
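
The maximal-length claim for the one-to-many form can be checked in simulation. The following is a minimal sketch, not part of the original design: it repeats the Listing 4-10 logic (with a 4-bit output port) so that it is self-contained, and the test fixture name lfsr4v2_tf and the variable period are assumed names. It counts the clocks needed for the register to return to its reset value; for a 4-bit maximal-length sequence that should be 15, since the all-ones state is the lock-up state for XNOR feedback.

```verilog
`timescale 1ns / 1ns

// Copy of the Listing 4-10 logic so that the sketch is self-contained.
module lfsr4v2 (clock, reset, lfsr_count);
  input clock, reset;
  output [3:0] lfsr_count;
  reg [3:0] lfsr_count;

  always @ (posedge clock or posedge reset)
    if (reset)
      lfsr_count <= 0;
    else begin
      lfsr_count[0]   <= lfsr_count[3];
      lfsr_count[1]   <= lfsr_count[3] ~^ lfsr_count[0];
      lfsr_count[3:2] <= lfsr_count[2:1];
    end
endmodule

// Hypothetical self-checking test fixture.
module lfsr4v2_tf;
  reg clock, reset;
  wire [3:0] lfsr_count;
  integer period;

  lfsr4v2 u1 (clock, reset, lfsr_count);

  always #10 clock = ~clock;

  initial begin
    clock = 0;
    reset = 1;
    #25 reset = 0;            // Release reset between clock edges.

    // Sample on the falling edge, away from the active edge.
    @(negedge clock);
    period = 1;               // One rising edge has occurred since reset.
    while (lfsr_count !== 4'b0000 && period < 32) begin
      @(negedge clock);
      period = period + 1;
    end

    if (period == 15)
      $display("PASS: sequence repeats after %0d clocks", period);
    else
      $display("FAIL: sequence repeats after %0d clocks", period);
    $finish;
  end
endmodule
```

The limit of 32 in the loop is only a guard so that the simulation ends even if the register were to lock up.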

Figure 4-7 4-Bit LFSR One-to-Many Output Waveforms

For a more detailed explanation of LFSR counters, see Max Maxfield’s Designus
Maximus Unleashed (details on this book can be found in the Bibliography).

CYCLIC REDUNDANCY CHECKSUMS

Logic similar to the LFSR is used to create Cyclic Redundancy Checksums, or CRCs.
Checksums are used to test a data packet to try to determine if an error has occurred. An
ordinary checksum simply adds up the data bytes or words and discards any carry beyond a
predetermined resolution. For example, an 8-bit checksum would use modulo-256 addition
and discard all carries that result in numbers greater than 255 (FF in hex).
Let’s assume a data packet consists of the following 8 bytes:

hex data
99
D0
AA
01
09
83
AF
BE

We can use our hex calculator to find that the sum of these numbers is 40D (hex). We
discard all but the lower 8 bits and get a checksum of 0D. The receiving logic can do the
same addition and see if the received data gives a checksum of 0D. This gives us some
small confidence that the data was received correctly. What if we want more confidence?
We could send a 16-bit checksum instead; this would give a 10-byte packet and a checksum
of 040D. Now, for multiple random errors, the chance of a bad packet slipping through
undetected is roughly 1 in 65,536 instead of 1 in 256. A simple sum still has a blind spot,
however: if an error causes a number greater than expected in one byte and a later error
causes a number the same amount less than expected, the checksum will match and
we’ll think a bad packet is good. What if this is not good enough? A more random sequence
of numbers would give us better error detection.
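
The arithmetic above can be reproduced with a few lines of simulation-only Verilog. This is a minimal sketch; the module name checksum_demo and its variable names are assumptions made for illustration. It sums the eight example bytes into an 8-bit and a 16-bit register, then repeats the 8-bit sum with two compensating errors (one byte one count high, another one count low) to show the blind spot.

```verilog
module checksum_demo;
  reg [7:0]  packet [0:7];   // The eight example bytes.
  reg [7:0]  sum8;           // Modulo-256 checksum.
  reg [15:0] sum16;          // 16-bit checksum.
  integer i;

  initial begin
    packet[0] = 8'h99;  packet[1] = 8'hD0;
    packet[2] = 8'hAA;  packet[3] = 8'h01;
    packet[4] = 8'h09;  packet[5] = 8'h83;
    packet[6] = 8'hAF;  packet[7] = 8'hBE;

    sum8  = 0;
    sum16 = 0;
    for (i = 0; i < 8; i = i + 1) begin
      sum8  = sum8  + packet[i];   // Carries out of bit 7 are discarded.
      sum16 = sum16 + packet[i];
    end
    $display("8-bit checksum  = %h", sum8);    // Prints 0d.
    $display("16-bit checksum = %h", sum16);   // Prints 040d.

    // Two compensating errors: the sum does not change.
    packet[1] = 8'hD1;   // One count too high.
    packet[2] = 8'hA9;   // One count too low.
    sum8 = 0;
    for (i = 0; i < 8; i = i + 1)
      sum8 = sum8 + packet[i];
    $display("8-bit checksum of corrupted packet = %h", sum8);  // Still 0d.
  end
endmodule
```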
The idea behind a CRC is to do division instead of addition. The data packet is looked
at as a huge binary number. We select a polynomial to divide this binary data by, and the
remainder becomes our checksum. The sequence of remainders is more random than a
sequence of sums. I’m going to skip a whole bunch of math and just tell you that the logic to
implement CRC division by a polynomial (where borrows are discarded) looks a lot like
the logic which implements an LFSR. An input data packet is created with N bits of zeroes
appended, where N is the length of the CRC, and is shifted out serially. While the data is
transmitted, the CRC is calculated and then appended in place of the zeroes. This becomes
the transmitted data packet.
At the receiver, the same CRC calculation is performed on the incoming data packet
(including the CRC bits), and the remainder will be zero if no error is detected. Let’s
illustrate this with a simple example. Xilinx uses a 16-bit CRC to validate the serial data
used for FPGA configuration. The schematic for this logic is shown in Figure 4-8. Xilinx
uses XOR logic, one-to-many configuration, and [15,14,1,0] feedback taps.

Figure 4-8 CRC-16 Schematic

Notice how similar this logic is to the LFSR with the addition of a data input as a
modulation source. Listing 4-11 implements CRC-16 logic.

Listing 4-11 Verilog Version of CRC-16 Logic

module crc16 (clock, reset, serial_data_in, serial_data_out);

input clock, reset, serial_data_in;
output serial_data_out;
reg [15:0] crc_output;
assign serial_data_out = serial_data_in ^ crc_output[15];
always @ (posedge clock or posedge reset)
if (reset) crc_output <= 0;
else begin
crc_output[14:3] <= crc_output[13:2];
crc_output[1] <= crc_output[0];
crc_output[2] <= crc_output[1] ^ serial_data_out;
crc_output[15] <= crc_output[14] ^ serial_data_out;
crc_output[0] <= serial_data_out;
end
endmodule
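
The zero-remainder property described earlier can be tried out with a small test fixture. This is a sketch under stated assumptions, not a production test: the crc16 module is copied from Listing 4-11 so the example is self-contained, the fixture name crc16_tf and the variables packet and crc are hypothetical, and the register is read through a hierarchical reference (u1.crc_output), which is a simulation-only convenience. Because the data bit is XORed with bit 15 before it enters the register, this form of the circuit does the equivalent of appending the zeroes implicitly, so the fixture simply shifts in the eight example bytes, captures the remainder, shifts the remainder in behind the data, and checks that the register ends at zero.

```verilog
`timescale 1ns / 1ns

// Copy of the Listing 4-11 logic so that the sketch is self-contained.
module crc16 (clock, reset, serial_data_in, serial_data_out);
  input clock, reset, serial_data_in;
  output serial_data_out;
  reg [15:0] crc_output;

  assign serial_data_out = serial_data_in ^ crc_output[15];

  always @ (posedge clock or posedge reset)
    if (reset) crc_output <= 0;
    else begin
      crc_output[14:3] <= crc_output[13:2];
      crc_output[1]    <= crc_output[0];
      crc_output[2]    <= crc_output[1] ^ serial_data_out;
      crc_output[15]   <= crc_output[14] ^ serial_data_out;
      crc_output[0]    <= serial_data_out;
    end
endmodule

// Hypothetical test fixture.
module crc16_tf;
  reg clock, reset, serial_data_in;
  wire serial_data_out;
  reg [63:0] packet;   // The eight example bytes, first byte in the MSBs.
  reg [15:0] crc;      // Remainder captured after the data bits.
  integer i;

  crc16 u1 (clock, reset, serial_data_in, serial_data_out);

  always #10 clock = ~clock;

  initial begin
    clock = 0;
    reset = 1;
    serial_data_in = 0;
    packet = 64'h99D0AA010983AFBE;
    #25 reset = 0;

    // Shift the data in MSB first, changing it on the falling edge
    // so that it is stable at each rising edge.
    for (i = 63; i >= 0; i = i - 1)
      @(negedge clock) serial_data_in = packet[i];

    // All 64 data bits have been clocked in: capture the remainder
    // and feed it in behind the data, MSB first.
    @(negedge clock) crc = u1.crc_output;
    serial_data_in = crc[15];
    for (i = 14; i >= 0; i = i - 1)
      @(negedge clock) serial_data_in = crc[i];

    @(negedge clock);
    if (u1.crc_output === 16'h0000)
      $display("PASS: remainder is zero (CRC was %h)", crc);
    else
      $display("FAIL: remainder = %h", u1.crc_output);
    $finish;
  end
endmodule
```

Flipping any single bit of packet between the two passes would be one way to see a non-zero remainder.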

ROM

ROM stands for Read-Only Memory. This memory is initialized when the FPGA is
configured and cannot be changed after configuration (if it could be changed, then it would
be RAM). As an example, we can implement the four-bit LFSR counter with a ROM if we
want (we won’t want to if we have any sense, but we’ll do it anyway for the purpose of
illustration); see Listing 4-12 and Figure 4-9.

Listing 4-12 ROM Version of LFSR Counter

module lfsr_rom (binary_in, lfsr_out, clk, reset);

input [3:0] binary_in;
input clk, reset;
output [3:0] lfsr_out;
reg [3:0] lfsr_out;

always @ (posedge clk or posedge reset)

begin
if (reset)
lfsr_out <= 4'b0000;
else case (binary_in)
4'b0000: lfsr_out <= 4'b0000;
4'b0001: lfsr_out <= 4'b0001;
4'b0010: lfsr_out <= 4'b0010;
4'b0011: lfsr_out <= 4'b0101;
4'b0100: lfsr_out <= 4'b1010;
4'b0101: lfsr_out <= 4'b0100;
4'b0110: lfsr_out <= 4'b1001;
4'b0111: lfsr_out <= 4'b0011;
4'b1000: lfsr_out <= 4'b0110;
4'b1001: lfsr_out <= 4'b1101;
4'b1010: lfsr_out <= 4'b1011;
4'b1011: lfsr_out <= 4'b0111;
4'b1100: lfsr_out <= 4'b1110;
4'b1101: lfsr_out <= 4'b1100;
4'b1110: lfsr_out <= 4'b1000;
4'b1111: lfsr_out <= 4'b0000; // Unused combination.
//default:lfsr_out <= 4'b0; Not needed, all combinations covered.
endcase
end
endmodule

Figure 4-9 ROM Version of LFSR Counter Schematic

Because Xilinx implements combinations of four inputs very effectively, this function
is efficient (though not as efficient as the LFSR algorithm: the ROM version uses 2 CLBs,
whereas our earlier design used 1 CLB). However, since the number of ROM entries doubles
with every added input (2^n entries for n inputs), a ROM implemented in CLBs can get quite
large. Another name for a ROM design like this is a Look-Up Table (LUT).
Many of the Xilinx CLBs have a RAM mode where a 16-by-1 memory element can
be used in place of a CLB. This can be a very effective way to create RAM and ROM
modules. We’ll explore the use of LogiBLOX and memory modules in Chapter 8.
Something else to keep in mind is that many ASIC technologies do not have RAM
capability. During ASIC conversion, ROM/RAM elements will be replaced with random
logic, and this can result in a quite large ASIC design.

RAM

RAM stands for Random Access Memory, but that is not too helpful. A RAM is an array of
memory (or storage) cells, addressable in groups N elements wide (data width, like x4, x8,
x16, or x32) and M elements deep (number of N-width elements). We can synthesize a
RAM out of CLBs, so let’s do a simple 16x1 block (a very tiny RAM block) and see how it
looks. This design assumes that internal three-state drivers are available.
Note: One useful thing about the CLB RAM in an FPGA is the ability to initialize the
RAM register cells on reset.

16x1 RAM block

There are 16 memory cells, so we need 2^n = 16 addresses, or four address lines (n = 4), as
shown in Listing 4-13.

Listing 4-13 Verilog 16x1 RAM Example Using Random Logic

module ram16x1(ram_data, ram_addr, ram_rwn, clock, reset);

inout ram_data;
input [3:0] ram_addr;
input ram_rwn, clock, reset; // Active low write.
reg [15:0] ram_data_reg;
wire ram_data_in;

assign ram_data = ram_rwn ? ram_data_reg[ram_addr] : 1'bz;

assign ram_data_in = ram_data;

always @ (posedge clock or posedge reset)

if (reset) ram_data_reg <= 0;
else case ({ram_addr, ram_rwn})
{4'h0, 1'b0} : ram_data_reg[0] <= ram_data_in;
{4'h1, 1'b0} : ram_data_reg[1] <= ram_data_in;
{4'h2, 1'b0} : ram_data_reg[2] <= ram_data_in;
{4'h3, 1'b0} : ram_data_reg[3] <= ram_data_in;
{4'h4, 1'b0} : ram_data_reg[4] <= ram_data_in;
{4'h5, 1'b0} : ram_data_reg[5] <= ram_data_in;
{4'h6, 1'b0} : ram_data_reg[6] <= ram_data_in;
{4'h7, 1'b0} : ram_data_reg[7] <= ram_data_in;
{4'h8, 1'b0} : ram_data_reg[8] <= ram_data_in;
{4'h9, 1'b0} : ram_data_reg[9] <= ram_data_in;
{4'ha, 1'b0} : ram_data_reg[10] <= ram_data_in;
{4'hb, 1'b0} : ram_data_reg[11] <= ram_data_in;
{4'hc, 1'b0} : ram_data_reg[12] <= ram_data_in;
{4'hd, 1'b0} : ram_data_reg[13] <= ram_data_in;
{4'he, 1'b0} : ram_data_reg[14] <= ram_data_in;
{4'hf, 1'b0} : ram_data_reg[15] <= ram_data_in;
default: ram_data_reg <= ram_data_reg;
endcase
endmodule

Figure 4-10 shows the schematic of the logic synthesized from Listing 4-13. Listing
4-14 summarizes the resources used by this design.

Listing 4-14 Design Summary for Verilog 16x1 RAM Example Using CLBs

Total accumulated area :

Number of BUFG : 1
Number of CLB Flip Flops : 16

Number of FG Function Generators : 29
Number of H Function Generators : 3
Number of IBUF : 7
Number of OBUFT : 1
Number of Packed CLBs : 15
Number of STARTUP : 1
***********************************************
Device Utilization for 4010xlPQ100
Resource Used Avail Utilization
-----------------------------------------------
IOs 8 77 10.39%
FG Function Generators 29 800 3.62%
H Function Generators 3 400 0.75%
CLB Flip Flops 16 800 2.00%
-----------------------------------------------
Clock Frequency Report
Clock : Frequency
clock : 41.1 MHz

Figure 4-10 Schematic for Verilog 16x1 RAM Example Using Random Logic

This illustrates how inefficient it is to implement RAM with FPGA CLBs. CLBs are
designed to implement random logic functions. If we could replace this logic with a RAM
cell, it would consume one CLB!
RAM elements are easy to create with Verilog. However, Verilog does not support
two-dimensional arrays, so the RAM is modeled as a one-dimensional array of vectors.
Listing 4-15 is an example of a 256-by-8 synthesizable RAM module.

Listing 4-15 Verilog RAM Example

module ram_mod1(rwn, addr, data_port);

input rwn;
input [7:0] addr;
inout [7:0] data_port;
reg [7:0] ramdata [0:255];

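assign data_port = (rwn) ? ramdata[addr] : 8'hz;

// data_port is in the sensitivity list so that a data change
// during a write is not missed in simulation.
always @ (rwn or addr or data_port)
if (~rwn) ramdata[addr] = data_port;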
endmodule

The RAM of Listing 4-15 will work, but unless the FPGA supports embedded RAM
blocks, it will consume a huge amount of logic and be many times more expensive than any
SRAM device you could buy. It might be all right for a tiny amount of RAM (on the order
of 8 bytes); otherwise another solution must be found. Using flip-flops to implement RAM is
very inefficient.
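
Listing 4-15 writes the array with a level-sensitive strobe. A clocked write with separate data-in and data-out ports is another common way to describe the same one-dimensional array of vectors. The following is a minimal sketch with assumed names (ram16x8_sync, we, din, dout); whether a particular synthesis tool maps it onto CLB RAM rather than flip-flops depends on the tool and the target device, so treat that as something to check in the synthesis report rather than a given.

```verilog
`timescale 1ns / 1ns

// Hypothetical 16-by-8 RAM with a clocked write and an
// asynchronous read.
module ram16x8_sync (clk, we, addr, din, dout);
  input        clk, we;
  input  [3:0] addr;
  input  [7:0] din;
  output [7:0] dout;
  reg    [7:0] ramdata [0:15];   // One-dimensional array of vectors.

  assign dout = ramdata[addr];

  always @ (posedge clk)
    if (we) ramdata[addr] <= din;
endmodule

// Hypothetical test fixture: write every location, then read back.
module ram16x8_sync_tf;
  reg clk, we;
  reg [3:0] addr;
  reg [7:0] din;
  wire [7:0] dout;
  integer i, errors;

  ram16x8_sync u1 (clk, we, addr, din, dout);

  always #10 clk = ~clk;

  initial begin
    clk = 0; we = 0; addr = 0; din = 0; errors = 0;

    // Set up each write on the falling edge; the rising edge stores it.
    for (i = 0; i < 16; i = i + 1) begin
      @(negedge clk);
      we = 1; addr = i; din = 8'hA0 + i;
    end
    @(negedge clk) we = 0;

    // The read is asynchronous, so only a settling delay is needed.
    for (i = 0; i < 16; i = i + 1) begin
      addr = i;
      #1;
      if (dout !== 8'hA0 + i) errors = errors + 1;
    end

    if (errors == 0) $display("PASS: all 16 locations read back");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```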

Figure 4-11 Schematic for Verilog 256x8 RAM Example

From Figure 4-11, you can see that Exemplar Logic LeonardoSpectrum correctly
inferred a RAM from the Verilog code. In the Xilinx 4000XL family, embedded RAM is
supported; see Figure 4-12. This schematic looks complicated, but compare it to
Figure 4-13. The design in Figure 4-13 was compiled for an XC3000 device; this older device
architecture does not have distributed CLB RAM. The schematic for the XC3000
implementation has 39 sheets!

Figure 4-12 256x8 RAM Implemented in the 4000XL Device Family, Sheet 1 of 1

Figure 4-13 256x8 RAM Implemented in the XC3000 Device Family, Sheet 1 of 39

The designer often needs to implement RAM blocks to store an array of input or
output data, configuration information, tables, or parameters. Many modern FPGA/CPLD
architectures include RAM available as blocks (typical of Altera devices) or distributed
across the device design so that a CLB can be configured as a LUT or as a RAM element
(typical of Xilinx devices). You can see that using a single CLB as a 16-by-1 RAM cell is a
good deal for the designer; it’s fast and doesn’t consume much of the FPGA resources.
What do you do if you need more than a trivial amount of RAM? There are two
solutions. One is to pick an FPGA architecture that has enough built-in RAM to solve the
problem (remember to leave yourself some wiggle room—if you need 1K of RAM, pick a
device and architecture that has at least 2K available); the other is to put a real SRAM in the
design. Though RAM blocks or distributed RAM cells are available in modern FPGAs, it is
probably more expensive to use FPGA silicon for RAM than to use a real RAM IC. An
additional consideration is the issue RAM raises during conversion to an ASIC.

Trade-offs Between Internal and External RAM

Internal RAM Features

• Speed. Not only are the RAM cells fast (in general), but we avoid the speed
penalty of driving signals on and off the device.
• Timing. By staying on the chip, the clock/data relationship is known to the
place-and-route tool. This eases our timing analysis. Internal FPGA signals are
tweaked so the register hold time is zero. The external RAM may or may not
have a zero hold time. Regardless, the delays associated with device I/O must be
considered and will result in some sort of minimum hold time that must be
accounted for in the design.
• Initialization. A nice feature of the FPGA RAM is the ability to initialize the
RAM content on power-up. This initialization can be to write all zeros or to take
RAM values from a file and store them in the RAM array. This can avoid
requiring other means of initializing RAM (like having a microprocessor write
to every location on power-up, for example).

Internal RAM Problems

• Cost. The silicon expended on internal RAM cells is probably more expensive
than an external RAM device. The cost advantage of external RAM is offset
slightly by the cost associated with stuffing an extra device on the board and
consuming extra FPGA pins.

Instantiating RAM

How do we instantiate RAM modules? Xilinx offers a tool called LogiBLOX for
creating RAM modules; an example of a LogiBLOX module is shown in Listing 4-16. More
detail on the procedure of creating a Xilinx LogiBLOX module is provided in Chapter 8.

Listing 4-16 Xilinx Synchronous Dualport RAM

module x_ram1 (clk, ram_data, ram_a_addr, ram_b_addr, ram_a_data,

ram_b_data, wr_strobe);
// Dualport RAM using Xilinx LogiBLOX.

input [15:0] ram_data;

input [4:1] ram_a_addr;
input [4:1] ram_b_addr;
input wr_strobe;
input clk;
output [15:0] ram_a_data;
output [15:0] ram_b_data;

//----------------------------------------------------
// LogiBLOX DP_RAM Module “r16x16dp”
// Created by LogiBLOX version M1.3.7
// on Thu Feb 12 15:27:46 1998
// Attributes
// MODTYPE = DP_RAM
// BUS_WIDTH = 16
// DEPTH = 16
//----------------------------------------------------
r16x16dp ramblk1
(.A({ram_a_addr[4],ram_a_addr[3],ram_a_addr[2],ram_a_addr[1]}),
.SPO(ram_a_data),
.DI(ram_data),
.WR_EN(wr_strobe),
.WR_CLK(clk),
.DPO(ram_b_data),
.DPRA({ram_b_addr[4],ram_b_addr[3],ram_b_addr[2],ram_b_addr[1]}));

endmodule

The r16x16dp.vei module, shown in Listing 4-17, is simply a placeholder for the
presynthesized netlist (r16x16dp.ngo) that will be inserted during the place-and-route
process. It defines the module ports, but that is all. The interface part of this automatically
generated file was cut and pasted into the module that instantiates the placeholder module.
This .vei file must be included in Exemplar Logic LeonardoSpectrum’s input file list along
with the module shown in Listing 4-16.

Listing 4-17 RAM Placeholder Module (r16x16dp.vei)

//----------------------------------------------------
// LogiBLOX DP_RAM Module “r16x16dp”
// Created by LogiBLOX version M1.5.19
// on Sun May 30 14:19:03 1999
// Attributes
// MODTYPE = DP_RAM
// BUS_WIDTH = 4
// DEPTH = 16
// STYLE = MAX_SPEED
// USE_RPM = FALSE
//----------------------------------------------------

module r16x16dp(A, SPO, DI, WR_EN, WR_CLK, DPO, DPRA);

input [3:0] A;
output [3:0] SPO;
input [3:0] DI;
input WR_EN;
input WR_CLK;
output [3:0] DPO;
input [3:0] DPRA;
endmodule

It’s easy to imagine using an external RAM in place of the LogiBLOX RAM; the
difference is that module port pins must actually connect to device pins. An interesting
expansion of the RAM interface occurs when multiple modules need RAM access. In this
case an arbitration scheme can prioritize and negotiate access to the RAM. An example of
an external RAM interface with a simple arbiter (which allows multiple sources to access
the RAM) is shown in Listing 4-18. There are probably better ways to implement this design,
but this is a Real World example that was used in a commercial design.

Listing 4-18 RAM Access Interface and Arbitration Design

// arbit1.v © 1998 Advanced Technology Video, Inc.

// Reproduced with permission.
module arbit1 (clk, reset, chan0_ramaddr, chan0_dat_from_ram,
chan0_dat_to_ram, chan1_ramaddr, chan1_dat_from_ram,
chan1_dat_to_ram, address_preset, ram_rwn, ram_addr,
ram_data_pins, data_rd, data_wr, up_data_to_ram, up_data_from_ram,
sram_addr_strobe, rd_ack, wr_ack, ram_data_oe);

// System inputs.
input clk, reset; // System clock and reset.
// Control signals.
output [2:0] rd_ack; // Acknowledge: read complete.
reg [2:0] rd_ack;
output [2:0] wr_ack; // Acknowledge: write complete.
reg [2:0] wr_ack;
// RAM interface.
input [12:1] chan0_ramaddr;// Channel 0 RAM address pointer.
wire [12:1] chan0_ramaddr;
input [12:1] chan1_ramaddr;// Channel 1 RAM address pointer.
wire [12:1] chan1_ramaddr;
output [15:0] chan0_dat_from_ram;// Channel 0 RAM read data.
reg [15:0] chan0_dat_from_ram;
output [15:0] chan1_dat_from_ram;// Channel 1 RAM read data.
reg [15:0] chan1_dat_from_ram;
input [15:0] chan0_dat_to_ram; // Channel 0 RAM write data.
wire [15:0] chan0_dat_to_ram;
input [15:0] chan1_dat_to_ram; // Channel 1 RAM write data.
wire [15:0] chan1_dat_to_ram;
input [2:0] data_rd; // RAM read request.
wire [2:0] data_rd;
input [2:0] data_wr; // RAM write request.
wire [2:0] data_wr;
input sram_addr_strobe; // Preloads address counter.
input [15:0] up_data_to_ram; // Data written into RAM.
output [15:0] up_data_from_ram; // Data read from RAM.
reg [15:0] up_data_from_ram;
input [12:0] address_preset; // Microprocessor address
// counter preset input.

// RAM I/O ports.

output ram_rwn; // SRAM read/write, high = read.
output [12:0] ram_addr; // SRAM address pins.
reg [12:0] ram_addr;
inout [7:0] ram_data_pins;// RAM data to be written.
wire [7:0] ram_data_in;
reg [7:0] ram_data_out;
output ram_data_oe; // RAM output enable.
reg ram_data_oe;

// Local variables.
reg [3:0] ram_state;
reg ram_rdn;
reg [11:0] ram_addr_ctr;// Register: store auto-
// incremented addresses.
// Counts words.

parameter ram_state_idle = 0;
parameter ram_state1 = 1;
parameter ram_state2 = 2;
parameter ram_state3 = 3;
parameter ram_state4 = 4;
parameter ram_state5 = 5;
parameter ram_state6 = 6;
parameter ram_state7 = 7;
parameter ram_state8 = 8;
parameter ram_state9 = 9;
parameter ram_state10 = 10;
parameter ram_state11 = 11;
parameter ram_state12 = 12;
parameter ram_state13 = 13;
parameter ram_state14 = 14;
parameter ram_state15 = 15;
assign ram_rwn = ~ram_rdn; // Active high local signal.
// Control of SRAM data pins.
assign ram_data_pins = ram_data_oe ? ram_data_out : 8'bz;
assign ram_data_in = ram_data_pins;

always @ (posedge clk or posedge reset) begin

if (reset)
begin
ram_state <= ram_state_idle;
ram_rdn <= 0;
ram_addr <= 0;
ram_data_out <= 0;
rd_ack <= 0;
wr_ack <= 0;
ram_data_oe <= 0;
end else begin

case (ram_state)
ram_state_idle: begin
    // Defaults; overridden below when a request is pending.
    ram_rdn <= 0;
    ram_addr <= 0;
    ram_data_out <= 0;
    ram_data_oe <= 0;

    if (data_rd[0]) begin
        ram_rdn <= 0;
        ram_addr <= {chan0_ramaddr, 1'b0};
        ram_state <= ram_state1;
    end
    else if (data_rd[1]) begin
        ram_rdn <= 0;
        ram_addr <= {chan1_ramaddr, 1'b0};
        ram_state <= ram_state3;
    end
    else if (data_wr[0]) begin
        ram_rdn <= 1;
        ram_addr <= {chan0_ramaddr, 1'b0};
        ram_data_out <= chan0_dat_to_ram[7:0];
        ram_data_oe <= 1;
        ram_state <= ram_state5;
    end
    else if (data_wr[1]) begin
        ram_rdn <= 1;
        ram_addr <= {chan1_ramaddr, 1'b0};
        ram_data_out <= chan1_dat_to_ram[7:0];
        ram_data_oe <= 1;
        ram_state <= ram_state8;
    end
    else if (data_rd[2]) begin // Processor read request.
        ram_rdn <= 0;
        ram_addr <= {ram_addr_ctr, 1'b0};
        ram_state <= ram_state11;
    end
    else if (data_wr[2]) begin // Processor write request.
        ram_rdn <= 1;
        ram_addr <= {ram_addr_ctr, 1'b0};
        // Write data comes from the processor's data-to-RAM bus.
        ram_data_out <= up_data_to_ram[7:0];
        ram_data_oe <= 1;
        ram_state <= ram_state13;
    end
    else // Default.
        ram_state <= ram_state_idle;
end

// Read channel 0.
ram_state1: begin
    ram_rdn <= 0;
    ram_addr <= {chan0_ramaddr, 1'b1};
    chan0_dat_from_ram[7:0] <= ram_data_in;
    rd_ack[0] <= 1; // Issue early.
    ram_state <= ram_state2;
end

ram_state2: begin
    ram_rdn <= 0;
    ram_addr <= {chan0_ramaddr, 1'b1};
    chan0_dat_from_ram[15:8] <= ram_data_in;
    rd_ack[0] <= 1; // Hold ack until read is released.
    if (data_rd[0])
        ram_state <= ram_state2; // Hold until rd released.
    else begin
        rd_ack[0] <= 0; // Release ack.
        ram_state <= ram_state_idle;
    end
end

// Read channel 1.
ram_state3: begin
    ram_rdn <= 0;
    ram_addr <= {chan1_ramaddr, 1'b1};
    chan1_dat_from_ram[7:0] <= ram_data_in;
    rd_ack[1] <= 1; // Issue early.
    ram_state <= ram_state4;
end

ram_state4: begin
    ram_rdn <= 0;
    ram_addr <= {chan1_ramaddr, 1'b1};
    chan1_dat_from_ram[15:8] <= ram_data_in;
    rd_ack[1] <= 1; // Hold ack until read is released.
    if (data_rd[1]) // Hold until rd released.
        ram_state <= ram_state4;
    else begin
        rd_ack[1] <= 0; // Release ack.
        ram_state <= ram_state_idle;
    end
end

// Write channel 0. (Uses the channel 0 address and data
// throughout, mirroring the channel 1 write states.)
ram_state5: begin
    ram_rdn <= 0;
    ram_addr <= {chan0_ramaddr, 1'b0};
    ram_data_out <= chan0_dat_to_ram[7:0];
    ram_data_oe <= 1;
    ram_state <= ram_state6;
end

ram_state6: begin
    ram_rdn <= 1;
    ram_addr <= {chan0_ramaddr, 1'b1};
    ram_data_out <= chan0_dat_to_ram[15:8];
    ram_data_oe <= 1;
    wr_ack[0] <= 1; // Release early.
    ram_state <= ram_state7;
end

ram_state7: begin
    ram_rdn <= 0;
    ram_addr <= {chan0_ramaddr, 1'b1};
    ram_data_out <= chan0_dat_to_ram[15:8];
    ram_data_oe <= 1;
    wr_ack[0] <= 1; // Hold until write is released.
    if (data_wr[0]) // Hold until wr released.
        ram_state <= ram_state7;
    else begin
        wr_ack[0] <= 0; // Release ack.
        ram_state <= ram_state_idle;
    end
end

// Write channel 1.
ram_state8: begin
    ram_rdn <= 0;
    ram_addr <= {chan1_ramaddr, 1'b0};
    ram_data_out <= chan1_dat_to_ram[7:0];
    ram_data_oe <= 1;
    ram_state <= ram_state9;
end

ram_state9: begin
    ram_rdn <= 1;
    ram_addr <= {chan1_ramaddr, 1'b1};
    ram_data_out <= chan1_dat_to_ram[15:8];
    ram_data_oe <= 1;
    wr_ack[1] <= 1; // Release early.
    ram_state <= ram_state10;
end

ram_state10: begin
    ram_rdn <= 0;
    ram_addr <= {chan1_ramaddr, 1'b1};
    ram_data_out <= chan1_dat_to_ram[15:8];
    ram_data_oe <= 1;
    wr_ack[1] <= 1; // Hold ack until write is released.
    if (data_wr[1]) // Hold until wr released.
        ram_state <= ram_state10;
    else begin
        wr_ack[1] <= 0; // Release ack.
        ram_state <= ram_state_idle;
    end
end

// Microprocessor initiated read.
ram_state11: begin
    ram_rdn <= 0;
    ram_addr <= {ram_addr_ctr, 1'b1};
    up_data_from_ram[7:0] <= ram_data_in;
    ram_state <= ram_state12;
end

ram_state12: begin // Address counter incremented in this state.
    ram_rdn <= 0;
    ram_addr <= {ram_addr_ctr, 1'b1};
    rd_ack[2] <= 1;
    up_data_from_ram[15:8] <= ram_data_in;
    ram_state <= ram_state_idle;
end

// Microprocessor initiated write.
ram_state13: begin
    ram_rdn <= 0;
    ram_addr <= {ram_addr_ctr, 1'b0};
    ram_data_out <= up_data_to_ram[7:0];
    ram_data_oe <= 1;
    ram_state <= ram_state14;
end

ram_state14: begin
    ram_rdn <= 1;
    ram_addr <= {ram_addr_ctr, 1'b1};
    ram_data_out <= up_data_to_ram[15:8];
    ram_data_oe <= 1;
    wr_ack[2] <= 1; // Release early.
    ram_state <= ram_state15;
end

ram_state15: begin // Address counter incremented in this state.
    ram_rdn <= 0;
    ram_addr <= {ram_addr_ctr, 1'b1};
    ram_data_out <= up_data_to_ram[15:8];
    ram_data_oe <= 1;
    wr_ack[2] <= 0;
    ram_state <= ram_state_idle;
end
default: ram_state <= ram_state_idle;
endcase
end
end

// Increment address counter when microprocessor reads or writes.

always @ (posedge clk or posedge reset)
begin
if (reset) ram_addr_ctr <= 0;
else if (sram_addr_strobe)
ram_addr_ctr <= address_preset;
else if ((ram_state == ram_state12) |
(ram_state == ram_state15))
ram_addr_ctr <= ram_addr_ctr + 1;
end
endmodule

Modern synthesis tools can extract RAM from logic structures as long as we don’t
bury them so deep that they are hard for the compiler to find. This means a random logic
design is parsed and the compiler will try to extract modules that are more efficiently
implemented as RAM blocks.

FIFO NOTES

FIFOs (First-In First-Out memories) are used to change data rates between systems. Data is
written at one rate and read out at another (the same or faster) rate. When you take your first
look at a FIFO, it appears like a register file that expands and contracts like an accordion.
However, it is really designed as a RAM block with an independent write address counter
and an independent read address counter. For each FIFO write, the write counter (usually a
Gray Code counter) gets incremented; for each FIFO read, the read counter gets
incremented. The minimum set of flags includes an empty flag (set when the read and write
pointers have caught up to each other and are equal) and a full flag (again set when the read
and write pointers are equal, but equal this time because the write address has wrapped
around). The major goal of a FIFO system design is to prevent an overrun, which results in
data loss either due to new data not being written or old data being overwritten. The factors
that influence overrun are the depth of the FIFO and the read and write frequencies.
One of the challenges of designing a FIFO is the flag design. The full flag, for
example, is set in the write clock domain, but must be read and cleared in the read clock
domain. This requires synchronization between the two domains, always a tricky task.
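
One widely used way to bring a single-bit flag into another clock domain is to pass it through two flip-flops clocked by the receiving clock. The following is a minimal sketch of that general technique, not a description of any particular FIFO: the names sync2, async_in, meta, and sync_out are assumptions, and it is suitable only for one bit at a time (multi-bit values such as pointers need Gray coding or a handshake).

```verilog
`timescale 1ns / 1ns

// Hypothetical two-flip-flop synchronizer for one flag bit.
module sync2 (clk, reset, async_in, sync_out);
  input  clk, reset, async_in;
  output sync_out;
  reg    meta;       // First stage: may go metastable in real hardware.
  reg    sync_out;   // Second stage: gives the first stage a clock
                     // period to settle.

  always @ (posedge clk or posedge reset)
    if (reset) begin
      meta     <= 0;
      sync_out <= 0;
    end else begin
      meta     <= async_in;
      sync_out <= meta;
    end
endmodule

// Hypothetical test fixture: the flag appears two clocks later.
module sync2_tf;
  reg clk, reset, async_in;
  wire sync_out;

  sync2 u1 (clk, reset, async_in, sync_out);

  always #10 clk = ~clk;

  initial begin
    clk = 0; reset = 1; async_in = 0;
    #25 reset = 0;
    #12 async_in = 1;     // Changes with no relationship to clk.
    @(posedge clk);       // Captured in the first stage.
    @(posedge clk);       // Passed to the second stage.
    #1 $display("sync_out = %b", sync_out);   // Prints 1.
    $finish;
  end
endmodule
```

A logic simulator does not model metastability, so the fixture only shows the two-clock latency that the structure adds.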
Like a RAM, a FIFO can be built out of registers. However, unless the FIFO is very
small, you’re not going to want to build a FIFO out of registers (use RAM instead), because
the design is inefficient.
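
The pointer and flag scheme described in this section can be sketched for the simplest case, where the read and write sides share one clock. That is a deliberate simplification: with a single clock no synchronizers are needed and plain binary counters can stand in for Gray code counters. All names here (fifo16x8, wr_ptr, rd_ptr, and so on) are assumptions. Each pointer carries one extra bit beyond the address; when the address bits match, the extra bits tell a full FIFO (write pointer has wrapped once more than the read pointer) from an empty one.

```verilog
`timescale 1ns / 1ns

// Hypothetical single-clock FIFO, 16 deep by 8 wide.
module fifo16x8 (clk, reset, wr, rd, din, dout, empty, full);
  input        clk, reset, wr, rd;
  input  [7:0] din;
  output [7:0] dout;
  output       empty, full;

  reg [7:0] mem [0:15];
  reg [4:0] wr_ptr, rd_ptr;   // Bit 4 is the wrap bit.

  assign dout  = mem[rd_ptr[3:0]];
  assign empty = (wr_ptr == rd_ptr);
  assign full  = (wr_ptr[3:0] == rd_ptr[3:0]) &&
                 (wr_ptr[4]   != rd_ptr[4]);

  always @ (posedge clk or posedge reset)
    if (reset) begin
      wr_ptr <= 0;
      rd_ptr <= 0;
    end else begin
      if (wr && !full) begin      // Writes to a full FIFO are ignored.
        mem[wr_ptr[3:0]] <= din;
        wr_ptr <= wr_ptr + 1;
      end
      if (rd && !empty)           // Reads from an empty FIFO are ignored.
        rd_ptr <= rd_ptr + 1;
    end
endmodule

// Hypothetical test fixture: fill, check full, drain, check empty.
module fifo16x8_tf;
  reg clk, reset, wr, rd;
  reg [7:0] din;
  wire [7:0] dout;
  wire empty, full;
  integer i, errors;

  fifo16x8 u1 (clk, reset, wr, rd, din, dout, empty, full);

  always #10 clk = ~clk;

  initial begin
    clk = 0; reset = 1; wr = 0; rd = 0; din = 0; errors = 0;
    #25 reset = 0;

    for (i = 0; i < 16; i = i + 1) begin
      @(negedge clk);
      wr = 1; din = i;
    end
    @(negedge clk);
    wr = 0;
    if (!full) errors = errors + 1;

    for (i = 0; i < 16; i = i + 1) begin
      if (dout !== i) errors = errors + 1;   // First in, first out.
      rd = 1;
      @(negedge clk);
    end
    rd = 0;
    if (!empty) errors = errors + 1;

    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d errors", errors);
    $finish;
  end
endmodule
```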
CHAPTER 5

Verilog Test Fixtures

SRAM-based FPGA designers don’t simulate their designs enough. It is easy to burn a
part and try it, so that is our tendency. We can
argue some advantage to this method; after all, the end result is a part that works, right?
Still, any tool that improves the quality of our design must be used. ASIC designers and
designers using antifuse technology don’t have the luxury (or crutch, depending on your
point of view) of trying a part that is not virtually guaranteed to work. Instead of
downloading a configuration file, these designers either program an expensive antifuse
device (which is thrown away if it doesn’t work) or go through a full ASIC fabrication turn
(which can cost tens or hundreds of thousands of dollars). This is why ASICs (and
ASIC-like devices) have long simulation processes that drive management nuts. The
designers are fooling around with their computers all day instead of delivering product!
Verilog was designed as a simulation and test language. It has excellent features and
has been thoroughly thought out and developed. Except for the cost of the simulator
software and the danger of falling into the endless ‘paralysis of analysis’ loop, there is no
reason the FPGA designer shouldn’t regularly use a simulator. There are some excellent
books that cover Verilog simulation in detail (see the bibliography at the end of this book);
we’ll just do a quick and dirty overview in this chapter.
Most simulators have a waveform viewer, and a lot of effort is put into making this
viewer attractive to the eye. The problem is that a human brain is required to analyze and
interpret the waveforms. Waveforms are great and we’ve used them throughout this text to
show input and output signals. However, Verilog supports automated testing. This is a great
way to test and validate a design and later design changes. You can make a design change
and carefully evaluate the effect on the area of interest, but how do you know you didn’t
break something in another part of the design that used to work?
This doesn’t mean the automated test fixtures are a panacea. They are often a pain in
the rear. You’ll spend a lot of time revising the test fixture to ‘fix’ tests where signals that
don’t matter were improperly or too strictly tested.

BUILT-IN SELF-TEST (BIST)

During manufacturing, an FPGA can be programmed to perform both internal and external self-
tests, then later programmed to support the embedded (shippable) application. External
hardware like DRAM, SRAM, FIFOs, etc. can be thoroughly tested, with the test results
indicated via LED or serial port (which might exist for test only).

COMPILER DIRECTIVES

Many powerful compiler directives are available in Verilog. Note the use of the ` (back tick
or accent grave) as part of these compiler directives.
`define, `ifdef, `else, `endif, `undef Verilog supports conditional compilation and
execution. Code may support simulation and not be synthesizable, or may be conditionally
synthesized to support optional features. A macro variable can be defined to control
compilation and might have a form like that shown in Listing 5-1.

Listing 5-1 `ifdef Example

// Conditional Compilation Example.

// Comment out the next line for synthesis.
`define test_mode // Define test_mode macro (no trailing semicolon).

`ifdef test_mode
// Insert test_mode code here. Could be a test module
// definition or simulation directives, not just inline
// code.
data_bus <= test_points;

`else
// Insert non-test_mode code here.
// The `else portion is optional.
data_bus <= internal_data;

`endif
// Continue with unconditional code here.

A macro definition can be ‘undefined’ by an `undef occurring later in the code.
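
A macro can also carry replacement text, not just act as a flag. The sketch below is a minimal illustration under assumed names (BUS_WIDTH and macro_demo are hypothetical and do not appear in the book's listings). It shows a valued `define, a test with `ifdef, and a later `undef. Note that a macro definition takes no trailing semicolon; anything after the macro name becomes part of the replacement text.

```verilog
// Hypothetical macro and module names, for illustration only.
`define BUS_WIDTH 8                   // Text macro with a value.

module macro_demo;
  reg [`BUS_WIDTH-1:0] data_bus;      // Expands to reg [8-1:0] data_bus;

  initial begin
`ifdef BUS_WIDTH
    $display ("BUS_WIDTH is defined as %0d", `BUS_WIDTH);
`else
    $display ("BUS_WIDTH is not defined");
`endif
    data_bus = 0;
  end
endmodule

`undef BUS_WIDTH                      // Code after this line sees BUS_WIDTH as undefined.
```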

`include filename The directive is similar to the C language #include directive. The
file pointed to by filename (which may be a file in the current path or a full path description)
will be inserted in the Verilog code during compile time. Includes can be nested; in other
words an included file may also have an included file.
`timescale unit/precision The time unit used by the simulator is programmable.
For example, in the code of Listing 5-3, there is a line that reads:

#75 reset = 0;

The number 75 represents a delay in units of the timescale unit, in this case 75 nsec.
The delay tells the simulator to wait until simulation time has advanced by the delay value
before executing the next statement.

DELAYS IN SYNTHESIS
Delays have meaning only for simulation, never for synthesis.

There is no magic hardware construct that will create a delay for you. If no timescale
directive is given, the time unit is chosen by the simulator (1 nsec is a common default). The
precision determines how delay values are rounded off and sets the simulation resolution. The
precision must be equal to or finer than the timescale unit. Each timescale argument is 1, 10,
or 100 followed by a unit of s (seconds), ms (milliseconds), us (microseconds), ns
(nanoseconds), ps (picoseconds), or fs (femtoseconds). Mostly, you’ll see 1 ns / 1 ns, for
delay units of 1 nsec with rounding of delay values to the nearest nsec. A timescale directive
applies to every module that follows it until another timescale directive is read, so it is
simplest to use a single timescale for the whole design.
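
To make the role of the precision concrete, here is a minimal sketch, assuming a 1 ns unit with 100 ps precision (the module name timescale_demo is hypothetical). A fractional delay is rounded to the nearest multiple of the precision before the simulator uses it.

```verilog
`timescale 1ns / 100ps                // Unit = 1 ns, precision = 100 ps.

module timescale_demo;                // Hypothetical module name.
  reg flag;

  initial begin
    flag = 0;
    #1.26 flag = 1;                   // Rounded to 1.3 ns (nearest 100 ps).
    // $realtime reports time in this module's time unit as a real number.
    $display ("flag set at %f ns", $realtime);
    #75 $finish;                      // 75 time units = 75 ns here.
  end
endmodule
```

With a 1 ns / 1 ns timescale the same delay would be rounded to 1 ns, which is why the precision should be chosen to match the smallest delay that matters in the test fixture.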

System Tasks

Verilog system tasks start with a $.

$finish; When encountered in the code, $finish ends the simulation. Without some
termination point, the simulation will continue forever (or until your PC runs out of memory
and crashes). $finish returns control of the computer back to the operating system.
$stop; This system task halts simulation but does not return control to the
operating system. Simulation can be continued from the stopping point, or other system
commands can be executed at the current simulation time.
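
A common use of $finish is a simulation time limit, so that a fixture with a free-running clock always terminates. The following is a minimal sketch; the module name and the max_sim_time value are assumptions chosen for illustration.

```verilog
module watchdog_demo;                 // Hypothetical fixture name.
  parameter max_sim_time = 10000;     // Assumed limit, in time units.
  reg clock;

  always #10 clock = ~clock;          // Free-running clock never stops by itself.

  initial begin
    clock = 0;
    #max_sim_time;
    $display ($time, " Watchdog: simulation time limit reached.");
    $finish;                          // Use $stop here to stay in the simulator.
  end
endmodule
```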
$display(list element 1, list element 2); This system task is similar to C’s printf
command. Verilog runs just fine in a nonwaveform output mode. If there are no waveforms,
how can we tell what our design is doing? Stick some $display commands in your design to
view variables and other information (such as simulation time) as illustrated in Listings 5-2
and 5-3.

Listing 5-2 Simple $display Example

module display1 (clock, reset);

input clock, reset;
reg [7:0] count_val;
always @ (posedge clock or posedge reset)
if (reset)
count_val <= 0;
else begin
count_val <= count_val + 1;
$display (count_val);
end
endmodule

Listing 5-3 Simple $display Example Test Fixture

// Display Test.
`timescale 1ns / 1ns
module disp1_tf;
reg clock, reset;

parameter clk_period = 20;

display1 u1 (clock, reset);
always begin
#(clk_period / 2) clock = ~clock;
end
initial begin
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
#1000 $finish;
end
endmodule

Listing 5-4 shows the result of a Silos III simulation run. The initial zero is the
count_val register content left by the reset; because the nonblocking assignment has not yet
taken effect when $display executes, each line shows the value before the increment. The
$display task defaults to a decimal number format and appends a newline after each
execution. If the newline is not desired, the $write system task can be used instead.

Listing 5-4 Simple $display Example Output Listing

S I L O S I I I Version 99.100
DEMO COPY LIMITED TO 100 to 200 DEVICES
Copyright (c) 1999 by SIMUCAD Inc. All rights reserved.
No part of this program may be reproduced, transmitted,
transcribed, or stored in a retrieval system, in any
form or by any means without the prior written consent of
SIMUCAD Inc., 32970 Alvarado-Niles Road, Union City,
California, 94587, U.S.A.
(510)-487-9700 Fax: (510)-487-9721
Electronic Mail Address: “silos@simucad.com”

!file .sav=“display1”
!control .sav=3
!control .savcell=0
!control .disk=1000M

Reading “c:\verilog\sourcecode\disp1_tf.v”
Reading “c:\verilog\sourcecode\display1.v”
sim to 0
Highest level modules (that have been auto-instantiated):
(disp1_tf disp1_tf
3 total devices.
Linking ...

3 nets total: 11 saved and 0 monitored.

74 registers total: 74 saved.
Done.

0 State changes on observable nets.

Simulation stopped at the end of time 0.000000000s.

Ready: sim
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
313 State changes on observable nets in 0.33 seconds.
948 Events/second.

Simulation stopped at the end of time 0.000001075s.

Ready:

Text and numbers can be formatted with escape sequences. An escape sequence begins with a
backslash (\) or a percent sign (%). Here are a few examples:

%h Display numbers in hex format

%d Display numbers in decimal (the default) format
%o Display numbers in octal format
%b Display numbers in binary format
%c Display numbers as ASCII characters
%t Display in current time format
“string” Display text string
\n newline
\t tab
\\ Print a backslash (\)
\" Print a double quote
%% Print a percent sign (%)
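
The format specifiers above can be exercised in a few lines. This is a minimal sketch with a hypothetical module name (format_demo); the value 8'h4b was chosen only because it is also a printable ASCII character.

```verilog
module format_demo;                   // Hypothetical module name.
  reg [7:0] value;

  initial begin
    value = 8'h4b;                    // 75 decimal, ASCII 'K'.
    // One argument per format specifier, in order.
    $display ("hex=%h dec=%d oct=%o bin=%b char=%c",
              value, value, value, value, value);
    // Escaped tab, double quote, backslash, and percent sign.
    $display ("tab:\tquote:\" backslash:\\ percent:%%");
    $display ("time=%t", $time);
    $finish;
  end
endmodule
```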

To get some experience with this formatting, take a look at Listings 5-5, 5-6, and 5-7.

Listing 5-5 $display Formatting Example Test Fixture

// Display Test with formatting.

module time_setup;
initial
begin
// This timeformat is nsec (-9), 2 digits after decimal
// place (2), ns text, and a minimum of 3 spaces for
// time to be displayed in.
$timeformat (-9, 2, " ns", 3);
end
endmodule

`timescale 1ns / 1ns
module disp2_tf;
reg clock, reset;
parameter clk_period = 20;

display2 u1 (clock, reset);

always begin
#(clk_period / 2) clock = ~clock;
end

initial
begin
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
#1000 $finish;
end
endmodule

Listing 5-6 $display Formatting Example Module

module display2 (clock, reset);

input clock, reset;
reg [7:0] count_val;
// Comment out the next line for terse printout.
`define verbose
always @ (posedge clock or posedge reset)
begin
if (reset) count_val <= 0;
else begin
count_val <= count_val + 1;
`ifdef verbose
$write ("count_val = %h", count_val);
$display (" Current time = %t", $time);
`else
$write (":", count_val);
`endif
end
end
endmodule

Listing 5-7 Simple $display Example Output Listing with Formatting

!file .sav=“display2”
!control .sav=3
!control .savcell=0
!control .disk=1000M

Reading “c:\verilog\sourcecode\time_setup.v”
Reading “c:\verilog\sourcecode\display2.v”
sim to 0
Highest level modules (that have been auto-instantiated):
(time_setup time_setup
(disp2_tf disp2_tf
4 total devices.
Linking ...
3 nets total: 11 saved and 0 monitored.
74 registers total: 74 saved.
Done.

0 State changes on observable nets.

Simulation stopped at the end of time 0.000000000s.
Ready: sim
count_val = 00 Current time = 90.00 ns
count_val = 01 Current time = 110.00 ns
count_val = 02 Current time = 130.00 ns
count_val = 03 Current time = 150.00 ns
count_val = 04 Current time = 170.00 ns
count_val = 05 Current time = 190.00 ns
count_val = 06 Current time = 210.00 ns
count_val = 07 Current time = 230.00 ns
count_val = 08 Current time = 250.00 ns
count_val = 09 Current time = 270.00 ns
count_val = 0a Current time = 290.00 ns
count_val = 0b Current time = 310.00 ns
count_val = 0c Current time = 330.00 ns
count_val = 0d Current time = 350.00 ns
count_val = 0e Current time = 370.00 ns
count_val = 0f Current time = 390.00 ns
count_val = 10 Current time = 410.00 ns
count_val = 11 Current time = 430.00 ns
count_val = 12 Current time = 450.00 ns
count_val = 13 Current time = 470.00 ns
count_val = 14 Current time = 490.00 ns
count_val = 15 Current time = 510.00 ns
count_val = 16 Current time = 530.00 ns
count_val = 17 Current time = 550.00 ns
count_val = 18 Current time = 570.00 ns
count_val = 19 Current time = 590.00 ns
count_val = 1a Current time = 610.00 ns
count_val = 1b Current time = 630.00 ns
count_val = 1c Current time = 650.00 ns
count_val = 1d Current time = 670.00 ns
count_val = 1e Current time = 690.00 ns
count_val = 1f Current time = 710.00 ns
count_val = 20 Current time = 730.00 ns
count_val = 21 Current time = 750.00 ns
count_val = 22 Current time = 770.00 ns
count_val = 23 Current time = 790.00 ns
count_val = 24 Current time = 810.00 ns
count_val = 25 Current time = 830.00 ns
count_val = 26 Current time = 850.00 ns
count_val = 27 Current time = 870.00 ns
count_val = 28 Current time = 890.00 ns
count_val = 29 Current time = 910.00 ns
count_val = 2a Current time = 930.00 ns
count_val = 2b Current time = 950.00 ns
count_val = 2c Current time = 970.00 ns
count_val = 2d Current time = 990.00 ns
count_val = 2e Current time = 1010.00 ns
count_val = 2f Current time = 1030.00 ns
count_val = 30 Current time = 1050.00 ns
count_val = 31 Current time = 1070.00 ns

313 State changes on observable nets in 0.76 seconds.

411 Events/second.

Simulation stopped at the end of time 0.000001075s.

Ready:

A list of variables can be displayed whenever one of them changes by using the $monitor
system task. The syntax of the $monitor signal list and formatting controls is very similar to
that used for the $display task; $monitoroff and $monitoron suspend and resume the output.
Examples of using $monitor are presented in Listing 5-8 and Listing 5-9, with the
corresponding output shown in Listing 5-10.
$monitor (signal list and formatting);
$monitoron/$monitoroff;
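
As a minimal sketch of how $monitoroff and $monitoron might be used to skip an uninteresting start-up period (the module name, the counter, and the 50-unit window are assumptions for illustration):

```verilog
module monitor_gate_demo;             // Hypothetical module name.
  reg [3:0] count;

  always #10 count = count + 1;       // Something to watch.

  initial begin
    count = 0;
    $monitor ($time, " count = %d", count);
    $monitoroff;                      // Suppress output during start-up.
    #50 $monitoron;                   // Resume reporting changes.
    #50 $finish;
  end
endmodule
```

Only one $monitor list is active at a time, so a later $monitor call replaces the earlier one rather than adding to it.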

Listing 5-8 $monitor Example display3.v

module display3 (clock, reset);

input clock, reset;
reg [7:0] count_val;
always @ (posedge clock or posedge reset)
begin
if (reset) count_val <= 0;
else
begin count_val <= count_val + 1;
end
end
endmodule

Listing 5-9 $monitor Example Test Fixture disp3_tf.v

// $monitor used in a test fixture.

`timescale 1ns / 1ns
module time_setup2;
initial
begin
$timeformat (-9, 2, " ns", 3);
end
endmodule

module disp3_tf;

reg clock, reset;

wire [7:0] count_val;
parameter clk_period = 20;

display3 u1 (clock, reset);

always
begin
#(clk_period / 2) clock = ~clock;
end

initial
begin
$monitor ($time, " Counter value: %h", u1.count_val);
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
#1000 $finish;
end
endmodule

Listing 5-10 $monitor Example Output Listing

90000000000.00 ns Counter value: 01

110000000000.00 ns Counter value: 02
130000000000.00 ns Counter value: 03
150000000000.00 ns Counter value: 04
170000000000.00 ns Counter value: 05
190000000000.00 ns Counter value: 06
210000000000.00 ns Counter value: 07
230000000000.00 ns Counter value: 08
250000000000.00 ns Counter value: 09
270000000000.00 ns Counter value: 0a
290000000000.00 ns Counter value: 0b
310000000000.00 ns Counter value: 0c
330000000000.00 ns Counter value: 0d
350000000000.00 ns Counter value: 0e
370000000000.00 ns Counter value: 0f
390000000000.00 ns Counter value: 10
410000000000.00 ns Counter value: 11
430000000000.00 ns Counter value: 12
450000000000.00 ns Counter value: 13
470000000000.00 ns Counter value: 14
490000000000.00 ns Counter value: 15
510000000000.00 ns Counter value: 16
530000000000.00 ns Counter value: 17
550000000000.00 ns Counter value: 18
570000000000.00 ns Counter value: 19
590000000000.00 ns Counter value: 1a
610000000000.00 ns Counter value: 1b
630000000000.00 ns Counter value: 1c
650000000000.00 ns Counter value: 1d
670000000000.00 ns Counter value: 1e
690000000000.00 ns Counter value: 1f
710000000000.00 ns Counter value: 20
730000000000.00 ns Counter value: 21
750000000000.00 ns Counter value: 22
770000000000.00 ns Counter value: 23
790000000000.00 ns Counter value: 24
810000000000.00 ns Counter value: 25
830000000000.00 ns Counter value: 26
850000000000.00 ns Counter value: 27
870000000000.00 ns Counter value: 28
890000000000.00 ns Counter value: 29
910000000000.00 ns Counter value: 2a
930000000000.00 ns Counter value: 2b
950000000000.00 ns Counter value: 2c
970000000000.00 ns Counter value: 2d
990000000000.00 ns Counter value: 2e
1010000000000.00 ns Counter value: 2f
1030000000000.00 ns Counter value: 30
1050000000000.00 ns Counter value: 31
1070000000000.00 ns Counter value: 32

There are many file commands; we’ll touch on the highlights.

$dumpfile ("filename");
$dumpvars (levels of hierarchy, module or variable list);
$dumpvars (0, module name);
$dumpvars;
$dumpon;
$dumpoff;
$dumpall;
$dumplimit(filesize in bytes);

Verilog can save simulation results in an ASCII file. The format of this file is called
Value Change Dump or VCD. An entry in the file occurs only when a variable value
changes. The $dumpvars task without an argument list will dump all the variables in
the design. The $dumpvars (0, module name) form will dump variables from the listed
module and all modules instantiated below it. Variables can be identified
hierarchically in the argument list (module1.module2.variable_name).
The VCD file can get very large. Setting a $dumplimit will stop the dump when the
VCD file reaches the specified size. Some examples of the $dump tasks are shown in
Listing 5-11 and Listing 5-12, with a partial output listing shown in Listing 5-13.

Listing 5-11 $dump Options

#100 $dumpon; // Resume dumping after 100 time units.
#100 $dumpoff; // Suspend dumping after 100 more time units.
#100 $dumpall; // Dump a snapshot (checkpoint) of all selected variables.

Listing 5-12 $dumpvars Example Listing

// Value Change Dump Example.

`timescale 10ns / 1ns
module time_setup3;
initial begin
$timeformat (-9, 2, " ns", 3);
$dumpfile ("bigdump.dmp"); // Open file.
end
endmodule

module disp4_tf;
reg clock, reset;
wire [7:0] count_val;
parameter clk_period = 20;

display4 u1 (clock, reset);

always begin
#(clk_period / 2) clock = ~clock;
end
initial begin
$timeformat (-9, 2, " ns", 3);
$dumpvars;
clock = 0;
reset = 1; // Assert the system reset.
#75 reset = 0;
#1000 $finish;
end
endmodule

module display4 (clock, reset);

input clock, reset;

reg [7:0] count_val;

always @ (posedge clock or posedge reset)

begin
if (reset)
count_val <= 0;
else
begin
count_val <= count_val + 1;
end
end
endmodule

Listing 5-13 $dumpvars Output Listing Extract

$scope module disp4_tf $end

$var reg 1 ! reset $end
$var reg 1 " clock $end

$scope module u1 $end

$var wire 1 # clock $end
$var wire 1 $ reset $end
$var reg 8 % count_val [7:0] $end
$upscope $end

$upscope $end

$enddefinitions $end
#0
$dumpvars
1!
0"
0#
1$
b00000000 %
$end

Listing 5-13 is a small part of the bigdump.dmp. Note that each signal in the scope of
$dumpvars (because $dumpvars was not limited in scope, all signals in the design are
dumped) is assigned a key character (! = reset, for example) and this shorthand is used in
the dumpfile. The VCD is not human-friendly, but is a format that can be read by other
tools.
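
When dumping everything produces too large a file, the dump can be limited in scope and size. The following is a minimal sketch under assumed names (dump_scope_demo and scoped.vcd are hypothetical): it dumps only the variables declared in one module, caps the file size, and suspends the dump for part of the run.

```verilog
module dump_scope_demo;               // Hypothetical test fixture name.
  reg       clock, reset;
  reg [3:0] count;

  always #10 clock = ~clock;

  always @ (posedge clock or posedge reset)
    if (reset) count <= 0;
    else       count <= count + 1;

  initial begin
    $dumpfile ("scoped.vcd");         // Hypothetical file name.
    $dumplimit (1000000);             // Stop dumping at roughly 1 Mbyte.
    $dumpvars (1, dump_scope_demo);   // Level 1: this module only.
    clock = 0;
    reset = 1;
    #25 reset = 0;
    #100 $dumpoff;                    // Suspend the dump...
    #100 $dumpon;                     // ...and resume it later.
    #100 $finish;
  end
endmodule
```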
$readmemh ("filename", memory_name); Read hex values from a file.
$readmemb ("filename", memory_name); Read binary values from a file.

Optional starting and ending addresses can be added to place limits on the data pulled
from the file. The address is an index into the array, the nth data element. It is acceptable to
have a start address, but no end address, in which case loading continues to the end of the
file or the end of the memory array, whichever comes first.
$readmemh ("filename", memory_name, start_addr, end_addr);
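
The address arguments can be tried out with a small self-contained sketch. The module and file names below are hypothetical; the fixture first writes its own data file so that nothing outside the listing is required, then loads only part of the array.

```verilog
module readmem_range_demo;            // Hypothetical module name.
  reg [7:0] pattern [0:7];
  integer   fd, i;

  initial begin
    // Create a small data file so the sketch is self-contained.
    fd = $fopen ("range_demo.hex");   // Hypothetical file name.
    $fdisplay (fd, "0a 0b // two values on one line");
    $fdisplay (fd, "0c");
    $fdisplay (fd, "0d");
    $fclose (fd);

    // Load only elements 4 through 7; elements 0 through 3 are untouched.
    $readmemh ("range_demo.hex", pattern, 4, 7);

    for (i = 0; i < 8; i = i + 1)
      $display ("pattern[%0d] = %h", i, pattern[i]);
    $finish;
  end
endmodule
```

Elements that are never loaded keep their initial unknown (x) value, which makes an incomplete data file easy to spot in the printout.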

AUTOMATED TESTING

The only way to assure thorough testing of a design is to automate the task. A check-off list
can be created. When a new revision of code is being released, all the automated tests
should be run again. The process is maddening, because most of the effort to resolve
problems will be in test-fixture errors, not design errors. Still, there is no better way to test a
design.
As an example, let’s design and test a simple digital filter. The test fixture is shown in
Listing 5-14 and the source code for the design in Listing 5-15. The output values are shown in
Figure 5-1. This design implements a one-dimensional low-pass pyramidal filter that uses
five samples with coefficients of 0.0625, 0.125, 0.625, 0.125, and 0.0625 (note the sum of
these coefficients is 1). This filter is crude and suffers from truncation errors but will serve
as an example of automated testing.

Listing 5-14 Pyramidal Filter Test Fixture

// Pyramidal Filter Example.

`timescale 1ns / 1ns
module time_setup4;
initial
begin
$timeformat (-9, 2, " ns", 3);
end
endmodule

module pf1_tf;

reg clock, reset;

reg [7:0] tap_unfilt;

// Define array where test values will come from.

reg [7:0] test_pattern[0:31];
// Define array for testing filter output.

reg [7:0] verify_pattern[0:31];
reg [4:0] mem_index;
wire [7:0] tap_filt;
reg [7:0] tap_test, filt_test;
reg flag;

parameter clk_period = 20;

pfilt1 u1 (clock, reset, tap_unfilt, tap_filt);

always begin
#(clk_period / 2)
clock = ~clock;
tap_unfilt = test_pattern [mem_index];
// filt_test = tap_filt [mem_index];
tap_test = verify_pattern [mem_index];
mem_index = mem_index + 1;
if (mem_index == 0) $finish;
end

always begin
#(clk_period)
if (!reset & (tap_filt != tap_test))
begin
$display ($time, " ERROR! tap_filt = %h tap_test = %h", tap_filt,
tap_test);
flag <= 1;
end
// else $display (“All is okay.”);
end

initial
begin
clock = 0;
mem_index = 0;
tap_test = 0;
filt_test = 0;
tap_unfilt = 0;
flag = 0;
reset = 1; // Assert the system reset.

// Read test pattern data from file into

// test_pattern array.
$readmemh ("pfilt1.tst", test_pattern);
$readmemh ("verify1.tst", verify_pattern);

#(clk_period * 2) reset = 0;
end
endmodule

Listing 5-15 Pyramidal Filter Verilog Code

// Pyramidal Filter Example.

// This implements a low-pass filter with coefficients:
// .0625 .125 .625 .125 .0625
// Gain = 1.
module pfilt1 (clock, reset, tap4, tap_out[8:1]);
input clock, reset;
input [7:0] tap4;
output [8:0] tap_out;
reg [8:0] tap_out;
reg [7:0] tap0, tap1, tap2, tap3;

// Intermediate summation (pipeline) registers.

reg [4:0] sum1;
reg [5:0] sum2;
reg [7:0] sum3;

always @ (posedge clock or posedge reset)

begin
if (reset)
begin
tap0 <= 0;
tap1 <= 0;
tap2 <= 0;
tap3 <= 0;
tap_out <= 0;
sum1 <= 0;
sum2 <= 0;
sum3 <= 0;
end
else begin
tap0 <= tap1;
tap1 <= tap2;
tap2 <= tap3;
tap3 <= tap4;

// To multiply by 0.0625 (same as division by 16):
// shift right 4 places (drop the 4 LSBs).
// Result register must be 1 bit larger than the input
// operands to hold the carry.
sum1 <= tap0[7:4] + tap4[7:4];

// To multiply by 0.125 (same as division by 8):

sum2 <= tap1[7:3] + tap3[7:3];

// To multiply by 0.625 (5/8) is the same as:

// (division by 2) + (division by 8).
sum3 <= tap2[7:1] + {2'b0, tap2[7:3]};

// Final sum adds sum1 + sum2 + sum3.
// If the design needs to be faster, it can be
// further pipelined to spread out the summing logic.
// The LSB is truncated to give an 8-bit result. Logic
// can be added, if necessary, to round-off to 8 bits
// instead.
tap_out[8:0] <= {3'b0, sum1} + {2'b0, sum2} + sum3;
end
end
endmodule

Figure 5-1 Pyramidal Filter Waveforms

The pyramidal filter test fixture reads data from two external files. Data used to stimulate
the filter is read from pfilt1.tst (shown in Listing 5-16), and the expected output values
(with an intentional error put in to exercise the test) are read from the file
verify1.tst shown in Listing 5-17. The error is reported like this:

140000000000.00 ns ERROR! tap_filt = 04 tap_test = 03
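
An automated fixture is easier to use in a check-off list if it ends with a single pass/fail line. The sketch below is one possible way to do that, not the book's fixture: the module name, the 4-bit counter standing in for the design under test, and the errors counter are all assumptions made for illustration. The comparison is made on the falling clock edge, and reset is released between clock edges, to avoid races with the design's rising-edge updates.

```verilog
module selfcheck_demo;                // Hypothetical fixture name.
  reg        clock, reset;
  reg  [3:0] count;                   // Stand-in for a design under test.
  reg  [3:0] expected;                // Reference model value.
  integer    errors;

  parameter clk_period = 20;

  always #(clk_period / 2) clock = ~clock;

  // Design stand-in: a 4-bit counter.
  always @ (posedge clock or posedge reset)
    if (reset) count <= 0;
    else       count <= count + 1;

  // Reference model and comparison on the opposite clock edge.
  always @ (negedge clock)
    if (!reset) begin
      expected = expected + 1;
      if (count !== expected) begin
        $display ($time, " ERROR! count = %h expected = %h", count, expected);
        errors = errors + 1;
      end
    end

  initial begin
    clock = 0; reset = 1; expected = 0; errors = 0;
    #(clk_period * 2 + 5) reset = 0;  // Release reset between clock edges.
    #(clk_period * 20);
    if (errors == 0) $display ("PASS: no mismatches.");
    else             $display ("FAIL: %0d mismatches.", errors);
    $finish;
  end
endmodule
```

The !== operator is used so that an unknown (x) output counts as a mismatch instead of silently passing the comparison.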

Three types of entries are allowed in files read by Verilog: numbers (either hex or
binary), white space, and comments.

Listing 5-16 Pyramidal Filter Input Data List (pfilt1.tst)

0 // Each data value is entered twice because the
0 // value is loaded on each clock edge; two values
5 // are required to sustain a value through
5 // a complete clock period.
9
9
b
b
f
f
5
5
c
c
5
5
c
c
5
5
c
c
5
5
c
c
5
5
c
c
5
5

Listing 5-17 Pyramidal Filter Test Data List (verify1.tst)

0
0
0
0
0
0
0
0
0
0
3
3
3 // Error added. Should be 4.
3 // Error added. Should be 4.
4
4
2
2
3
3
2
2
3
3
2
2
3
3
2
2
C H A P T E R 6

Real World Design: Tools, Techniques, and Trade-offs

The Real World is specific, not generic. It is
theoretically possible to write portable code that will run on any vendor’s hardware
(including ASIC processes), but the required compromises in performance and efficiency
are generally not worth the trade-off. We, as over-worked FPGA designers, will find
ourselves using vendor-specific libraries and techniques to achieve tight (fast and small)
designs.
Now we have to choose an FPGA vendor. Eighty percent of the FPGA market is split
between two dominant powerhouses: Xilinx and Altera. Both are good companies with
great technology. This book focuses on Xilinx FPGAs. We’ll look at specific architecture
differences between competing companies in Chapter 7. For compilers, at this writing, the
market leaders are Exemplar Logic and Synplicity. Both offer excellent products. In addition,
Synopsys’ FPGA Express is close enough to usable that it bears watching, and not just
because, when bundled with Xilinx Design Manager, it’s the cheapest package available.
We are using Exemplar Logic’s LeonardoSpectrum for this book.
The design flow and tools we will use are as follows:

• Specify the design. It doesn’t make sense to start coding until the job is defined.
In the Real World we often have to start a job before marketing has fully
defined the requirements, but we’ll try to get the job scoped out as much as
possible first.
• Partition the design. Divide the job into sections. Reuse old designs as much as
possible. We want our modules to be 5,000 to 10,000 gates. My estimate is
approximately 20 gates per line (which can vary wildly), so this is 250 to 500
lines of code (semicolons) per module.
• Write the code. Use a color-coded editor to help avoid syntax errors (the color
coding acts as an on-the-fly syntax checker and is remarkably useful).
Implement area, timing, clock/reset resource-assignment, and pin-assignment
constraints.
• To help locate syntax problems, try compiling your design with every tool you
can find, including different simulators. You’ll find that each vendor provides
differing error messages with differing levels of helpfulness.
• If possible, use a lint program like Verilint. There are several errors that a
Verilog compiler will accept, like mismatched vectors and the creation of
unwanted latches, that Verilint will catch. Pay close attention to warnings that
may indicate problems with synthesis.
• Simulate the design. Write test fixtures and use automated testing and
waveforms to verify the design. In this book, this means use Simucad’s Silos III
to simulate the design at as high a level as possible.
• Compile the code. In this book, this means use Exemplar Logic’s
LeonardoSpectrum to create a netlist. Watch the gate counts and speed
estimates. Use the schematic viewer to assure that your code is being
implemented in the manner you expect. Examine how clocks and resets are
implemented. Make sure global signals are detected and handled in the manner
you expect.
• Place and route the netlist. In this book, this means use Xilinx Design Manager
to create a downloadable configuration file. Manipulate the place/route controls
and perform as many place/route passes as necessary to achieve the design
requirements.
• Download the design and test it in the target hardware. FPGA designers tend to
jump to this step too soon, owing either to not having the right tools or to
impatience. The designer should be very sure the design is good before testing
in circuit.

COMPILING WITH LEONARDOSPECTRUM

LeonardoSpectrum has a graphical user interface and a wizard that leads the designer
through the design requirements. Very quickly, however, the designer will find the use of
scripts to be a faster and more efficient method of creating a netlist. The design script
created by the Wizard can be captured and run. Listing 6-1 is an example script created by
the design wizard.

Listing 6-1 Example LeonardoSpectrum Script

set register2register 50
set input2register 50
set input2output 50
set register2output 50
set output_file "C:/verilog/latch.edf"
set novendor_constraint_file FALSE
_gc_read_init
_gc_run_init
set input_file_list { "C:/verilog/latch.v" }
set part 4013xlPQ160
set process 3
set wire_table 4013xl-3_avg
set nowrite_eqn FALSE
set chip TRUE
set area TRUE
set report brief
set global_sr reset
set output_file "C:/verilog/latch.edf"
set target xi4xl
_gc_read
set register2register 50
set input2register 50
set input2output 50
set register2output 50
set output_file ""

Let’s look at this script line-by-line.

set register2register 50 In the design wizard, I selected an overall constraint of
20 MHz, which gives a clock period of 50 nsec. This constraint means that all
signals between registers (from a register output to a register input) must resolve
in 50 nsec.

set input2register 50 Based on the overall design requirement of running with
a 20 MHz clock, all signals between a device input and a register must resolve
in 50 nsec. The designer must consider the problem of ensuring this requirement
is met in the logic outside the device. It may be that a much tighter constraint
must be applied to these nodes, depending on the timing of the external
circuitry. Devices that have I/O registers make this problem much easier to
solve.
set input2output 50 Based on the overall clock requirement of 20 MHz, all
combinational paths from a device input pin to a device output pin must resolve in 50
nsec. This constraint may need to be much tighter to satisfy the circuitry outside
the device.
set register2output 50 Based on the overall clock requirement of 20 MHz, all
signals between a register and a device output pin must be resolved in 50 nsec.
This constraint may need to be much tighter to satisfy the circuitry outside the
device. Devices that have I/O registers make this problem much easier to solve.
set global_sr reset Connect the global set/reset resource to the reset signal.
Xilinx supports the connection of a user-defined global reset, which can be used
by any register in the device. The signal still has to be identified and used in
every always block where the reset is desired.
lut_max_fanout 4 To control the output loading (which affects the area and
speed of the design), LeonardoSpectrum allows the designer to control the
maximum number of loads that will be connected to a CLB. In this case, a light
load of 4 is used. This will result in many buffers being used to reduce loading.
set output_file "C:/verilog/latch.edf" The netlist created by the compiler will
be in the form of an EDIF (.edf) file and will be saved in the indicated path.
Note usage of UNIX-style forward slashes in the path! Options for file output
include: .edf (edif), .edif (edif), .eds (edif), .sdf (standard delay format), .v
(verilog), .verilog (verilog), .vhd (vhdl), .vhdl (vhdl), .xdb (binary dump), .xnf
(Xilinx netlist format).
set novendor_constraint_file FALSE This double negative means that we will
create an FPGA vendor (in this case: Xilinx) constraint file and use that to guide
the place and route of our logic.
set input_file_list { "C:/verilog/latch.v" } This is the list of input files to be
linked together. In this case just one file is used to create the design. Note usage
of UNIX-style forward slashes in the path. Options for file input include: .edf
(edif), .edif (edif), .eds (edif), .sdf (standard delay format), .v (verilog), .verilog
(verilog), .vhd (vhdl), .vhdl (vhdl), .xdb (binary dump), .xnf (Xilinx netlist
format).
set part 4013xlPQ160 The device we will implement this design in is a Xilinx
4013XL (roughly 13,000 gates) in a PQ160 (160-pin surface-mount) package.
set process 3 We are using the LeonardoSpectrum Level 3 design flow. Levels
1 and 2 are subsets of Level 3: Level 1 is a single-vendor FPGA design flow;
Level 2 is a multivendor FPGA flow; Level 3 is multivendor and includes ASIC
flows.
set wire_table 4013xl-3_avg The delays will be based on average (as
compared to worst-case) loading for a -3 speed grade device.
set nowrite_eqn FALSE Here’s another double negative that means we
will write device equations into the schematic when the schematic is extracted
from the netlist.
set chip TRUE The netlist will be compiled to a device and will include I/O
pins for pins at the top level.
set area TRUE The design will be compiled for area optimization. The option
is to compile for speed. LeonardoSpectrum Level 3 allows individual modules
to be compiled for either area or speed—a great feature.
set report brief The report will be concise.
hierarchy_preserve TRUE LeonardoSpectrum will normally combine modules (flattening
the hierarchy) in an attempt to reduce logic. With hierarchy_preserve set TRUE, the
hierarchy is maintained and this reduction is not allowed. Setting this TRUE during
debugging is useful because it is more likely that your signal names will be preserved.
set target xi4xl Implement the design using primitives from the Xilinx 4000XL
library.

To refresh our memory, Listing 6-2 is the design we’re working with. This design has
a problem: an inadvertent latch is created. LeonardoSpectrum is polite enough to point this
out to us in the message log of Listing 6-3 (see bold-highlighted text).

Listing 6-2 Verilog Latch

// Your Basic Latch.

module latch2(q, q_not, set, reset);
output q, q_not;
reg q;
input set, reset;

wire set, reset;

assign q_not = ~q;

always @ (set or reset)

begin
if (set)
q = 1;
else if (reset)
q = 0;
end
endmodule
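
The latch is inferred because q is not assigned when both set and reset are low, so the synthesized logic must remember the previous value. For contrast, here is a minimal sketch of a purely combinational variant in which q is assigned on every path; the module name nolatch2 is hypothetical, and note that this changes the behavior (q no longer holds its state), so it only illustrates how the latch is avoided when no storage is wanted.

```verilog
// Hypothetical variant of latch2: q is assigned on every path,
// so no storage element (latch) is needed.
module nolatch2 (q, q_not, set, reset);
  output q, q_not;
  reg    q;
  input  set, reset;

  assign q_not = ~q;

  always @ (set or reset)
  begin
    q = 0;               // Default assignment covers the "neither" case.
    if (set)
      q = 1;
    else if (reset)
      q = 0;
  end
endmodule
```

If the design really needs to hold state, a clocked register is usually a better choice in an FPGA than a latch.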

Listing 6-3 LeonardoSpectrum Message Log for Verilog Latch

-- Reading target technology xi4xl

Reading library file
`C:\EXEMPLAR\LEOSPEC\V19991D\lib\xi4xl.syn`...
Library version = 1.8
Delays assume: Process=3
-- read -tech xi4xl { “C:/Verilog/SourceCode/latch2.v” }
-- Reading file ‘C:/Verilog/SourceCode/latch2.v’...
-- Loading module latch2
-- Compiling root module ‘latch2’
“C:/Verilog/SourceCode/latch2.v”,line 4: Warning, q is not always
assigned. latches could be needed.
-- Pre Optimizing Design .work.latch2.INTERFACE
Info: Finished reading design
->_gc_run
-- Run Started On Mon Sep 06 10:42:20 Pacific Daylight Time 1999
--
-- optimize -target xi4xl -effort quick -chip -area -hierarchy=auto
Using wire table: 4013xl-3_avg
Info, Inferred net ‘set’ as GSR net.
-- Start optimization for design .work.latch2.INTERFACE
Using wire table: 4013xl-3_avg

Pass    Area    Delay   DFFs   PIs   POs   --CPU--
        (FGs)   (ns)                       min:sec
1       0       7       0      2     2     00:00
Info, Added global buffer BUFG for port reset
Using wire table: 4013xl-3_avg
-- Start timing optimization for design .work.latch2.INTERFACE
No critical paths to optimize at this level

*******************************************************

Cell: latch2 View: INTERFACE Library: work

*******************************************************

Number of ports : 4
Number of nets : 10
Number of instances : 9
Number of references to this view : 0

Total accumulated area :

Number of BUFG : 1
Number of CLB Latches : 1
Number of IBUF : 1
Number of OBUF : 2
Number of STARTUP : 1

***********************************************
Device Utilization for 4010xlPQ100
***********************************************
Resource Used Avail Utilization
-----------------------------------------------
IOs 4 77 5.19%
FG Function Generators 0 800 0.00%
H Function Generators 0 400 0.00%
CLB Flip Flops 0 800 0.00%

-----------------------------------------------
Clock Frequency Report

Clock : Frequency
------------------------------------

reset : 3333.3 MHz

Some items in the message log bear comment.

x Reading library file `C:\EXEMPLAR\LEOSPEC\V19991D\lib\xi4xl.syn`...

The library that LeonardoSpectrum uses to implement the latch design is the xi4xl library
for the Xilinx 4xxxXL family.

x “C:/verilog/latch.v”,line 6: Warning, q is not always assigned. latches could be needed.

LeonardoSpectrum has very politely warned that a latch has been created. Generally,
this is an error in the code caused by not defining all output conditions completely.

x optimize -target xi4xl -effort quick -chip -area -flatten=TRUE

We have selected a Xilinx 4010XL as a target device. We have selected a quick
optimization as compared to an extended (multipass) compilation where multiple trials are
evaluated. We have selected the chip mode, so device pins will be assigned at the top level.
We have selected area optimization instead of optimization for speed. The netlist is flattened
into one merged netlist; the hierarchy (where each module has a different section of the
netlist) is dissolved.

x Info, Inferred net ‘set’ as GSR net.

LeonardoSpectrum has selected the set signal to be used as a global set (GSR stands
for Global Set-Reset) resource. Xilinx has a globally routed signal that can be used for a set
or reset without consuming the generic routing of the device; generally this network is used
for a global reset.

Pass    Area    Delay   DFFs   PIs   POs   --CPU--
        (FGs)   (ns)                       min:sec
1       0       7       0      2     2     00:00

We selected a 1-pass optimization; this pass resulted in a delay of 7 nsec. This design
uses no D flipflops, uses two input ports and two output ports, and took zero seconds to
compile. All right, not 0 seconds, but it compiled fast.

x Info, Added global buffer BUFG for port reset

In addition to the Global SR resource, the 4xxxXL family has eight global signals
available (BUFG). Generally they are used for clocks, but LeonardoSpectrum has
automatically extracted the reset signal and assigned it to a Global Buffer.

x Info, setting outputs in top level view ‘INTERFACE’ to fast.

The output pins assigned in this module use fast buffers. Generally, the designer
should use slow buffers where possible to reduce power consumption and noise.

x Using wire table: 4013xl-3_avg

Use average loading during analysis. The alternative is to use worst-case loading that
includes the worst-case effects of temperature and power-supply voltage. The -default mode
is for quick and dirty lab testing. The -default mode can also be used when the speed effects
are not pertinent—for example, if the FPGA is being used to emulate a design that will be
implemented in a faster technology (an ASIC).

x IOs 4 77 5.19%
We’ve used a very small part of the 4010XL device.

x Writing file C:/verilog/latch.edf

The output of the LeonardoSpectrum tool is an EDIF netlist which will be used by the
Xilinx place-and-route tool to create a device configuration file (.bit file).
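
Returning to the latch warning in the message log: if q was meant to be purely combinational, one common way to silence the warning is to give q a default value before the if chain, so that every path through the always block assigns it. The sketch below is only an illustration; the module name nolatch2 and the choice of 0 as the default are assumptions, not part of the original design.

```verilog
// Hypothetical latch-free variant of latch2 (illustrative only).
// q is assigned on every path through the always block, so no
// storage element is needed to remember an earlier value.
module nolatch2(q, q_not, set, reset);
    output q, q_not;
    reg    q;
    input  set, reset;

    assign q_not = ~q;

    always @ (set or reset)
    begin
        q = 0;              // default covers the set = 0, reset = 0 case
        if (set)
            q = 1;
        else if (reset)
            q = 0;
    end
endmodule
```

With the default in place the logic collapses to q = set, which shows that the original code was relying on the latch to remember its state. If that memory is really wanted, a clocked flipflop with an explicit reset is usually the safer way to get it.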

To get control of LeonardoSpectrum’s configuration settings, look under the Tools
toolbar. There you’ll find a tab called Variable Editor; this pulls down a list of all the
LeonardoSpectrum settings, some of which (like xlx_fast_slew, which sets the pin default
drive to fast slew rate unless otherwise constrained) are not available in the GUI.

Running LeonardoSpectrum in the Batch Mode

Once you’re familiar with LeonardoSpectrum and want to get things done faster and in a
more repeatable and controlled manner than the GUI allows, you can run it in
batch mode with the spectrum executable (this program was called elsyn in previous
versions of LeonardoSpectrum). Make sure the DOS PATH environment setting in
autoexec.bat points to the spectrum program. For example, in my environment, this path is
c:\exemplar\LeoSpec\v1999.1d\bin\win32.
An elementary command line that will compile our basic latch
design might look like:

spectrum -source basiclatch.v -edif_file basiclatch.edf -ta xi4e

Another way is to cut and paste from the GUI filtered command window and create a file
like basiclatch.run as shown in Listing 6-4.

Listing 6-4 Sample LeonardoSpectrum Executable Script File

restore_project_script C:/Verilog/verilog/basiclatch.scr
_gc_read_init
_gc_run_init
set input_file_list { "C:/Verilog/verilog/basiclatch.v" }
set part 4013xlPQ160
set process 3
set wire_table 4000xl-default
set pack_clbs FALSE
set timespec_generate FALSE
set nowrite_eqn FALSE
set chip TRUE
set macro FALSE
set area TRUE
set delay FALSE
set report brief
set hierarchy_preserve FALSE
set output_file "C:/Verilog/verilog/basiclatch.edf"
set novendor_constraint_file FALSE
set target xi4xl
_gc_read
_gc_run

This file was invoked with the command line: spectrum -file basiclatch.scr. Type “spectrum
-batchhelp” to list all the command-line options (similar to Listing 6-5).

Listing 6-5 LeonardoSpectrum Batch Mode Commands

-nomap_global_bufs
Don’t use global buffers for clocks and other global signals (Xilinx/Actel).
-use_qclk_bufs
Use quadrant clocks for Actel 3200dx architecture.
-insert_global_bufs
Use global buffers for clocks and other global signals (Xilinx/Actel).
-max_cap_load <float>
Override default max_cap_load if specified in the library.
-max_fanout_load <float>
Override default max_fanout_load if specified in the library.
-lut_max_fanout <integer>
Specify net fanout for LUT technologies (Xilinx, Altera Flex, and Lucent ORCA).
-noenable_dff_map
Disable clock-enable detection from HDLs.
-enable_dff_map_optimize
Enable use of flipflop clock-enable extracted from random logic.
-exclude <list>
Don’t use listed gate in mapping.
-include <list>
Map to specified synchronous DFFs and DLATCHes.
-pal_device
Disable map to complex IOs for Actel.
-wire_tree <string>
Interconnect wire tree : best|balanced|worst = default.
-wire_table <string>
Wire load model to use for interconnect delays.
-nowire_table
Ignore interconnect delays during delay analysis.
-nobreak_loops_in_delay
Don’t break combinational loops statically for timing analysis.
-crit_path_analysis_mode <string>
maximum(report setup violations) | minimum(report hold violations) | both = default.
-num_crit_paths <integer>
Report <integer> number of critical paths.
-crit_path_slack <float>
Slack threshold in nanoseconds.
-crit_path_arrival <float>
Arrival threshold in nanoseconds.
-crit_path_longest
Show longest paths rather than critical paths.
-crit_path_detail <string>
full(detailed point-to-point)(default) | short(startpoint-endpoint)
-crit_path_no_io_terminals
Don’t report paths terminating in primary outputs.
-crit_path_no_int_terminals
Don’t report paths terminating in internal endpoints.
-crit_paths_from <list>
Report only paths starting at this <list> port, port_inst or instance.
-crit_paths_to <list>
Report only paths ending at this <list> port, port_inst or instance.
-crit_paths_thru <list>
Report only critical paths through the <list> net.
-crit_paths_not_thru <list>
Report only critical paths that do not go through <list> net.
-crit_path_report_input_pins
Report input pins of gates. Default = off.
-crit_path_report_nets
Report net names. Default = off.
-nocounter_extract
Disable automatic extraction of counters.
-noram_extract
Disable automatic extraction of rams.
-nodecoder_extract
Disable automatic extraction of decoders.
-optimize_cpu_limit <integer>
Set a CPU limit for optimization.
-notimespec_generate
Don’t create TIMESPEC info from user constraints; Xilinx only.
-nopack_clbs
Don’t pack look-up tables (LUTs) into CLBs; for Xilinx 4K families only.
-write_clb_packing
Print CLB packing (HBLKNM) info, if available, in XNF/EDIF.
-crit_path_rpt <string>
Write critical path reporting in this file.
-nocrit_path_rpt
Don’t create a critical path reporting file.
-report_brief| -report_full
Generate a concise design summary or a detailed one. Default = full.
-map_area_weight <float>
A number between 0 and 1.0. The larger this number, the more mapping will try to
minimize area.
-map_delay_weight <float>
A number between 0 and 1.0. The larger this number, the more mapping will try to
minimize delay.
-simple_port_names
Create simple names for vector ports: %s%d instead of %s(%d).
-bus_name_style <string>
Naming style for vector ports and nets: default %s(%d)| simple %s%d| old_galileo %s_%d
-nobus
Write busses in expanded form. This may be required for the Xilinx EDIF reader.
-nowrite_eqn
Don’t write equations in output; use technology primitives instead.
-nopld_xor_decomp
Don’t do XOR decomposition for Altera MAX and Xilinx CPLD technologies.
-noglobal_symbol
Delete startup (GSR) block.
-notime_opt
Don’t run timing optimization.
-max_frequency <float>
Desired maximum operating frequency in MHz.

-edifin_ground_net_names <list>
Specify that net(s) with <list> name(s) are ground nets.
-edifin_power_net_names <list>
Specify that net(s) with <list> name(s) are power nets.
-edifin_ground_port_names <list>
Specify that port(s) with <list> name(s) are ground ports.
-edifin_power_port_names <list>
Specify that port(s) with <list> name(s) are power ports.
-edifin_ignore_port_names <list>
Specify that port(s) with <list> name(s) are ignored ports.
-edifout_power_ground_style_is_net
Write out power and ground as undriven nets with an extracted or inferred net name.
-edifout_power_net_name <string>
Use <string> name for power nets when ‘edifout_power_ground_style_is_net’ is TRUE;
default = ‘VCC’.
-edifout_ground_net_name <string>
Use <string> name for ground nets when ‘edifout_power_ground_style_is_net’ is TRUE;
default = ‘GND’.

COMPLETE DESIGN FLOW, 8-BIT EQUALITY COMPARATOR

So far, we’ve done only half the design work: the design entry and synthesis. To finish the
job, we need to run the Xilinx place-and-route tool, the Design Manager. To illustrate how
this tool is used, we’ll take an example design all the way through the process. This design
is similar to an HC688, an 8-bit equality comparator. This design compares two bytes and
generates a signal called equal if they are equivalent. A cascade input is also provided to
expand the inputs that are compared; if cascade is not asserted, the equal output is
inhibited. As a matter of personal preference, I’ve made a couple of design changes:
all signals are active high, and the equal output is synchronous. See Listing 6-6 for
the Verilog code for this design.

Listing 6-6 8-Bit Equality Comparator

// Synchronous 8-bit equality comparator.

// All signals changed to be active high.
// Output made synchronous.

module hc688s (equal, clock, reset, cascade, a, b);

    output       equal;
    input        clock, reset;
    input        cascade;
    input  [7:0] a, b;
    reg          equal;

    always @ (posedge clock or posedge reset)
    begin
        if (reset)
            equal <= 0;
        else if (~cascade)
            equal <= 0;
        else if (a == b)
            equal <= 1;
        else
            equal <= 0; // Make sure all input cases are covered.
    end
endmodule

The Verilog code is simple enough; equal can go high only if cascade is high and the
a and b input bytes are equal. Let’s see what LeonardoSpectrum makes of this design by
looking at the extracted schematic of Figure 6-1.

Figure 6-1 HC688s LeonardoSpectrum RTL Schematic

From Figure 6-1 we can see that the equal output is created by a flipflop and that the
clock and reset were implemented as intended. LeonardoSpectrum has instantiated a library
function from its module generator (modgen) to do the equality-test logic. For greater
detail, LeonardoSpectrum has another schematic view option, the gate-level schematic,
shown in Figure 6-2.

Figure 6-2 HC688s LeonardoSpectrum Gate-Level Schematic

The gate-level schematic shows the logic as it is mapped into Xilinx hardware. The
Xilinx Configurable Logic Block (CLB) will be explored in more detail in Chapter 7; for
now we can note the assignment of our logic to 2-, 3-, and 4-input look-up tables (LUTs),
the use of global buffers for clock and reset, and the flipflop that drives the equal output
signal.
A couple of things should be noted about these schematic views. First of all, they are
graphical representations of the netlist that LeonardoSpectrum synthesized. There is still
some processing to be done on the design by the Xilinx Design Manager (the place-and-
route tool). This schematic serves as a sanity check; if the design is not being
synthesized effectively, the designer can try different compilation options or design in a
more structural way. For example, the designer can replace the high-level equality operator
(==) with structural gates to exert more control over how the design is synthesized.
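
As a rough illustration of the structural alternative just described, the sketch below builds the 8-bit equality test from Verilog gate primitives: one XNOR per bit position, followed by a single AND that collects the eight match signals. The module name eq8_structural and the net name match are assumptions made for this example; they do not appear in the original design.

```verilog
// Hypothetical structural replacement for (a == b) on 8-bit operands.
module eq8_structural(equal, a, b);
    output       equal;
    input  [7:0] a, b;

    wire   [7:0] match;     // match[i] is 1 when a[i] and b[i] agree

    // One XNOR gate per bit position.
    xnor x0 (match[0], a[0], b[0]);
    xnor x1 (match[1], a[1], b[1]);
    xnor x2 (match[2], a[2], b[2]);
    xnor x3 (match[3], a[3], b[3]);
    xnor x4 (match[4], a[4], b[4]);
    xnor x5 (match[5], a[5], b[5]);
    xnor x6 (match[6], a[6], b[6]);
    xnor x7 (match[7], a[7], b[7]);

    // The operands are equal only when every bit position matches.
    and  a0 (equal, match[0], match[1], match[2], match[3],
                    match[4], match[5], match[6], match[7]);
endmodule
```

A synthesizer is still free to restructure these gates when it maps them into look-up tables, so a structural description like this is a hint, not a guarantee. Keeping an exact structure generally takes vendor-specific preserve or don't-touch controls, which are not shown here.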
LeonardoSpectrum provides one last view of the schematic, the critical path as shown
in Figure 6-3.

Figure 6-3 HC688s LeonardoSpectrum Critical-Path Schematic

The critical path is the longest delay path through the design. If the design needs to be
optimized for greater speed, the designer should focus on redesigning this path to remove
layers of logic. From this schematic, we can see that the longest delay path is from the b[4]
input to the equal output, and there are four layers of logic in this path. Like the adders we
studied earlier, there is probably a way to add extra logic to “look ahead” and streamline
this logic, if necessary.
For this design, compiling to optimize for delay didn’t change anything, but for most
designs it will change how the design is interpreted, hopefully for the better.
There is one more view that has some value. The output of the synthesizer is a netlist,
in this case an EDIF (.edf) file, but this type of file is not intended to be read by humans.
LeonardoSpectrum can also generate a structural version of the netlist in a Verilog format.
In fact, one great feature of LeonardoSpectrum is the ability to translate between netlists of
various types. Anyway, we’re learning Verilog, so let’s look at the Verilog version of the
netlist as shown in Listing 6-7.

Listing 6-7 8-Bit Equality Comparator Structural Netlist

//
// Verilog description for cell hc688s,
// 09/06/99 11:00:40
//

module hc688s ( equal, clock, reset, cascade, a, b ) ;

output equal ;
input clock ;
input reset ;
input cascade ;
input [7:0]a ;
input [7:0]b ;

wire nx12, modgen_eq_2_nx21, modgen_eq_2_nx22,

modgen_eq_2_nx23, modgen_eq_2_nx28, modgen_eq_2_nx29, clock_int,
reset_int, cascade_int, a_7__int, a_6__int, a_5__int, a_4__int,
a_3__int, a_2__int, a_1__int, a_0__int, b_7__int, b_6__int,
b_5__int, b_4__int, b_3__int, b_2__int, b_1__int, b_0__int, nx15;
wire [8:0] \$dummy ;

assign modgen_eq_2_nx22 = ( ~a_7__int && ~b_7__int &&

~a_6__int && ~b_6__int) || ( ~a_7__int && ~ b_7__int && a_6__int
&& b_6__int) || (a_7__int && b_7__int && ~ a_6__int &&
~b_6__int) || (a_7__int && b_7__int && a_6__int && b_6__int) ;

assign modgen_eq_2_nx23 = ( ~a_5__int && ~b_5__int &&

~a_4__int && ~b_4__int) || ( ~a_5__int && ~ b_5__int && a_4__int
&& b_4__int) || (a_5__int && b_5__int && ~ a_4__int &&
~b_4__int) || (a_5__int && b_5__int && a_4__int && b_4__int) ;

assign modgen_eq_2_nx21 = (modgen_eq_2_nx22 &&

modgen_eq_2_nx23) ;

assign modgen_eq_2_nx28 = ( ~a_3__int && ~b_3__int &&

~a_2__int && ~b_2__int) || ( ~a_3__int && ~ b_3__int && a_2__int
&& b_2__int) || (a_3__int && b_3__int && ~ a_2__int &&
~b_2__int) || (a_3__int && b_3__int && a_2__int && b_2__int) ;

assign modgen_eq_2_nx29 = ( ~a_1__int && ~b_1__int &&

~a_0__int && ~b_0__int) || ( ~a_1__int && ~ b_1__int && a_0__int
&& b_0__int) || (a_1__int && b_1__int && ~ a_0__int &&
~b_0__int) || (a_1__int && b_1__int && a_0__int && b_0__int) ;

assign nx12 = (modgen_eq_2_nx28 && modgen_eq_2_nx29 &&

modgen_eq_2_nx21);

STARTUP ix63 (.Q2 (\$dummy [0]), .Q3 (\$dummy [1]), .Q1Q4

(\$dummy [2]), .DONEIN (\$dummy [3]), .GSR (reset_int), .GTS
(\$dummy [4]), .CLK (
\$dummy [5])) ;

IBUF b_0__ibuf (.O (b_0__int), .I (b[0])) ;

IBUF b_1__ibuf (.O (b_1__int), .I (b[1])) ;
IBUF b_2__ibuf (.O (b_2__int), .I (b[2])) ;
IBUF b_3__ibuf (.O (b_3__int), .I (b[3])) ;
IBUF b_4__ibuf (.O (b_4__int), .I (b[4])) ;
IBUF b_5__ibuf (.O (b_5__int), .I (b[5])) ;
IBUF b_6__ibuf (.O (b_6__int), .I (b[6])) ;
IBUF b_7__ibuf (.O (b_7__int), .I (b[7])) ;
IBUF a_0__ibuf (.O (a_0__int), .I (a[0])) ;
IBUF a_1__ibuf (.O (a_1__int), .I (a[1])) ;
IBUF a_2__ibuf (.O (a_2__int), .I (a[2])) ;
IBUF a_3__ibuf (.O (a_3__int), .I (a[3])) ;
IBUF a_4__ibuf (.O (a_4__int), .I (a[4])) ;
IBUF a_5__ibuf (.O (a_5__int), .I (a[5])) ;
IBUF a_6__ibuf (.O (a_6__int), .I (a[6])) ;
IBUF a_7__ibuf (.O (a_7__int), .I (a[7])) ;
IBUF cascade_ibuf (.O (cascade_int), .I (cascade)) ;
IBUF reset_ibuf (.O (reset_int), .I (reset)) ;
OFDX reg_equal (.Q (equal), .C (clock_int), .D (nx15), .CE
(\$dummy [6]), .GSR (\$dummy [7]), .GTS (\$dummy [8])) ;

BUFG clock_ibuf (.O (clock_int), .I (clock)) ;

assign nx15 = (nx12 && cascade_int) ;

endmodule

This is a bit of an ugly mess, but there are a few things we can extract from it. Note the _int
suffix attached to the internal signals. This is very polite; some synthesizers convert a useful
signal name like clock into a signal name like ifght_2746 instead of clock_int, which makes it
very difficult to search netlists. We want the synthesizer to do whatever is necessary to isolate a
signal as it gets routed, but to keep some part of the signal name we assigned in there
somewhere. The equality logic carries the modgen_eq_2 prefix, and it gets wired up to the input
buffers (IBUFs). The equal register is an OFDX (output D flipflop); note the assignments for the
Q output, clock, data, and clock enable. GTS is the global tristate control and GSR is the
global set/reset control.
The place-and-route tool works on the netlist that is extracted from the input design
and influenced by the design constraints and synthesis controls. If there is a problem with
synthesized logic, it may help to look at the netlist and make sure things are being
synthesized in a reasonable manner.
Another netlist form is the .xnf (Xilinx Netlist Format) which is very readable. Sadly
though, Xilinx is moving to standardize on the much-less-readable EDIF format.

8-BIT EQUALITY COMPARATOR WITH HIERARCHY

Let’s hook up a few of our equality comparators and see what effect a hierarchical design
has on the resulting netlist. The hier688 design, shown in Listing 6-8, instantiates three of
our hc688s designs to create a 24-bit address decoder.

Listing 6-8 8-Bit Equality Comparator Hierarchical Example

module hier688(chip_select, output_enable, addr, rwn, clock, reset);
    output        chip_select, output_enable;
    input  [23:0] addr;
    input         rwn, clock, reset;
    wire          low, middle, high;
    reg           chip_select, output_enable;
    parameter     low_range  = 8'h80;
    parameter     mid_range  = 8'ha0;
    parameter     high_range = 8'hff;

    // Tie off cascade input for low address comparator.

    hc688s u1 (low, clock, reset, 1'b1, addr[7:0], low_range);
    hc688s u2 (middle, clock, reset, low, addr[15:8], mid_range);
    hc688s u3 (high, clock, reset, middle, addr[23:16], high_range);

    // Synchronize the module outputs.

    always @ (posedge clock or posedge reset)
    begin
        if (reset)
        begin
            chip_select   <= 0;
            output_enable <= 0;
        end
        else
        begin
            chip_select   <= high;
            output_enable <= (high & ~rwn);
        end
    end

endmodule

Figure 6-4 Hierarchical HC688s Gate-Level Schematic

The schematic of Figure 6-4 is not very legible, but you can see that our structural use
of the HC688 decoders results in cascaded logic. This design is not going to be very fast,
but is easy to put together as it reuses predesigned HC688 modules. Although we’re not
going to analyze the critical path, clearly it will be from a low-order address input to the
output_enable output signal.
Let’s carry this design into a real device. We do this by placing and routing the design
and creating a configuration file for the Xilinx device where our design will live. We will
open the Design Manager, create a new project, and browse (see Figure 6-5) until we find
the hier688.edf netlist. The Design Manager has a one-button operation (the idea is: if the
designer falls over dead, his or her head will hit the keyboard, and a place-and-route will
still take place). We’ll play dumb and just run the default Design Manager flow and see
what we get.
A convenient way to execute the Design Manager is to create a shortcut icon on your
Windows desktop. For example, in my environment the command line is:
C:\Xilinx\bin\nt\dsgnmgr.exe.

Figure 6-5 Opening a Design With Xilinx Design Manager

Listing 6-9 8-Bit Equality Comparator Hierarchical Example, Xilinx Translation Report

ngdbuild: version M1.5.19

Command Line: ngdbuild -p xc4010xl-3-pq100 -dd .. C:\Verilog\SourceCode\hier688.edf hier688.ngd

Launcher: Executing edif2ngd “C:\Verilog\SourceCode\hier688.edf” “C:\Verilog\SourceCode\xproj\ver1\hier688.ngo”
Reading NGO file “C:/Verilog/SourceCode/xproj/ver1/hier688.ngo” ...
Reading component libraries for design expansion...

Checking timing specifications ...

Checking expanded design ...

NGDBUILD Design Results Summary:

Number of errors: 0
Number of warnings: 0

Writing NGD file “hier688.ngd” ...

Writing NGDBUILD log file “hier688.bld”...

Figure 6-6 Design Manager Reports

Figure 6-6 shows the Report Browser window. If we click on the Translation Report,
we will see the report of Listing 6-9, and we can see that the input design was read without
error. The EDIF netlist is converted to a Xilinx binary netlist file: a .ngo file.

Listing 6-10 8-Bit Equality Comparator Hierarchical Example, Xilinx Place and Route
Report

Starting Constructive Placer. REAL time: 7 secs

Placer score = 13350
Placer score = 9810
Placer score = 6780
Placer score = 5730
Placer score = 5190
Placer score = 4440
Placer score = 3720
Placer score = 3570
Placer score = 3480
Placer score = 3270
Placer score = 3090
Finished Constructive Placer. REAL time: 7 secs

Listing 6-10 is a clip from the Xilinx place-and-route report. Like a printed circuit
board autorouter, the place-and-route tool tries different placements and selects the ones
with the better results. At this point an estimate of the timing can be extracted.

Listing 6-11 Equality Comparator Hierarchical Example, Xilinx Average Delay Report

The Number of signals not completely routed for this design is: 0

The Average Connection Delay for this design is: 1.929 ns

The Average Connection Delay on critical nets is: 0.000 ns
The Average Clock Skew for this design is: 0.098 ns
The Maximum Pin Delay is: 5.937 ns
The Average Connection Delay on the 10 Worst Nets is: 2.983 ns

Listing Pin Delays by value: (ns)

   d <= 10   < d <= 20   < d <= 30   < d <= 40   < d <= 50   d > 50
   -------   ---------   ---------   ---------   ---------   -------
        37           0           0           0           0         0

The signal delays are binned per Listing 6-11. This is a moderately fast design (looks
like it would run at 100 MHz to me) but only because very little of the device is used! As
the device gets fuller and more logic competes for routing resources, the design will get
slower.

Listing 6-12 8-Bit Equality Comparator Hierarchical Example, Xilinx Pad Report

# Pinout constraints listing

# These constraints are in PCF grammar format
# and may be cut and pasted into the PCF file
# after the “SCHEMATIC END ;” statement to
# preserve this pinout for future design iterations.
#
COMP “addr(0)” LOCATE = SITE “P90” ;
COMP “addr(1)” LOCATE = SITE “P89” ;
COMP “addr(10)” LOCATE = SITE “P36” ;
COMP “addr(11)” LOCATE = SITE “P35” ;
COMP “addr(12)” LOCATE = SITE “P37” ;
COMP “addr(13)” LOCATE = SITE “P39” ;
COMP “addr(14)” LOCATE = SITE “P44” ;
COMP “addr(15)” LOCATE = SITE “P42” ;
COMP “addr(16)” LOCATE = SITE “P32” ;
COMP “addr(17)” LOCATE = SITE “P22” ;
COMP “addr(18)” LOCATE = SITE “P30” ;
COMP “addr(19)” LOCATE = SITE “P31” ;
COMP “addr(2)” LOCATE = SITE “P93” ;
COMP “addr(20)” LOCATE = SITE “P23” ;
COMP “addr(21)” LOCATE = SITE “P21” ;
COMP “addr(22)” LOCATE = SITE “P24” ;
COMP “addr(23)” LOCATE = SITE “P33” ;
COMP “addr(3)” LOCATE = SITE “P95” ;
COMP “addr(4)” LOCATE = SITE “P97” ;

We did not assign pin locations in the input design. The first time through, it is not a
bad idea to let the place-and-route tool assign the pins (particularly with Altera devices).
FPGA architectures try to allow pins to be assigned in a universal manner (i.e., to be
insensitive to the designer’s pin usage, so that any I/O pin can be used with logic anywhere
on the chip), but some assumptions are made: for example, that data flow is horizontal (with
relation to the Pin 1 location on the device) and control is vertical. On the other hand, for
the printed wiring board (PWB) design, you may want to control the pin locations to keep
addresses together and that sort of thing. Once the circuit board has been designed, we don’t
want the compiler reassigning pins, so we are going to constrain the pin locations. The pins
assigned by the Xilinx place-and-route tool can be found in the Pad Report as shown in
Listing 6-12. This file can be cut, pasted, and edited into the LeonardoSpectrum constraint
file to lock down pin assignments as shown in Listing 6-13. This can also be done in the
Xilinx Design Manager, but I prefer to lock these pins in the design capture environment.

Listing 6-13 8-Bit Equality Comparator, Xilinx Pin Assignments

addr(0)         INPUT    P90
addr(1)         INPUT    P89
addr(10)        INPUT    P36
addr(11)        INPUT    P35
addr(12)        INPUT    P37
addr(13)        INPUT    P39
addr(14)        INPUT    P44
addr(15)        INPUT    P42
addr(16)        INPUT    P32
addr(17)        INPUT    P22
addr(18)        INPUT    P30
addr(19)        INPUT    P31
addr(2)         INPUT    P93
addr(20)        INPUT    P23
addr(21)        INPUT    P21
addr(22)        INPUT    P24
addr(23)        INPUT    P33
addr(3)         INPUT    P95
addr(4)         INPUT    P97
addr(5)         INPUT    P94
addr(6)         INPUT    P88
addr(7)         INPUT    P96
addr(8)         INPUT    P38
addr(9)         INPUT    P43
chip_select     OUTPUT   P20
clock           INPUT    P5
output_enable   OUTPUT   P18
reset           INPUT    P56
rwn             INPUT    P17

These pins can be assigned in the LeonardoSpectrum environment by going to the
Constraints Tab, finding the Input or Output tab, and filling in the entry box for Pin
Location. Make sure to hit the Apply button once all the pin assignments are filled in (see
Figure 6-7). They can also be assigned in the batch mode like so:

set_attribute -port {<hierarchical net name>} -name PIN_NUMBER -value PXX

Note: replace XX with the desired pin number.

Figure 6-7 LeonardoSpectrum Pin Assignment using the GUI

These are not the only required pin assignments on the circuit board. We must hook
up the dedicated signals including power, ground, and configuration signals on the board-
level schematic.

Listing 6-14 8-Bit Equality Comparator Hierarchical Example, Xilinx Asynchronous Delay Report

The 20 Worst Net Delays are:

-------------------------------
| Max Delay (ns) | Netname |
-------------------------------
5.937 low
4.154 middle
3.314 high
2.751 clock_int
2.508 reset_int
2.490 addr(23)_int
2.228 addr(17)_int
2.187 addr(21)_int
2.179 addr(4)_int
2.097 addr(1)_int
2.085 addr(7)_int
1.823 addr(14)_int
1.823 addr(10)_int
1.767 addr(9)_int
1.754 addr(22)_int
1.739 addr(0)_int
1.705 addr(15)_int
1.693 addr(11)_int
1.637 addr(6)_int
1.557 addr(16)_int

The top 20 delays can be viewed in the Asynchronous Delay Report as shown in
Listing 6-14. From this, we can guess that this design would run at 168 MHz (the reciprocal
of the 5.937-nsec worst net delay), not bad for a slow -3 speed grade part. Again, we’re using
only a tiny percentage of the device. Still, this is not the full story; these are just the
delays between individual nodes. To get the full delay we have to run a full timing analysis,
with this result:

Timing constraint: Default period analysis
34 items analyzed, 0 timing errors detected.
Minimum period is 9.967ns.

Delay: 9.967ns low to middle (8.027ns delay plus 1.940ns setup)

Path low to middle contains 2 levels of logic:
Path starting from Comp: CLB_R1C10.K (from clock_int)
To               Delay type        Delay(ns)   Physical Resource
                                               Logical Resources
-------------------------------------------------   --------
CLB_R1C10.XQ     Tcko              2.090R      low
                                               u1_reg_equal
CLB_R20C10.C2    net (fanout=1)    5.937R      low
CLB_R20C10.K     Thh1ck            1.940R      middle
                                               modgen_eq_3_ix18
                                               u2_reg_equal
-------------------------------------------------
Total (4.030ns logic, 5.937ns route)   9.967ns (to clock_int)
      (40.4% logic, 59.6% route)

This tells us that the worst-case delay from flipflop to flipflop is 9.967 nsec, so we can
really only run our clock at 100 MHz, not nearly so impressive.
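
The arithmetic behind these two frequency estimates can be written out explicitly. The symbols below are labels chosen for this worked example, not names taken from the Xilinx report: T_cko is the clock-to-output delay of the source flipflop, T_net is the routing delay of the low net, and T_setup is the entry the report lists as setup (the Thh1ck line).

```latex
\begin{align*}
T_{\min} &= T_{cko} + T_{net} + T_{setup} \\
         &= 2.090\,\text{ns} + 5.937\,\text{ns} + 1.940\,\text{ns} = 9.967\,\text{ns} \\
f_{\max} &= \frac{1}{T_{\min}} = \frac{1}{9.967\,\text{ns}} \approx 100.3\,\text{MHz} \\
f_{net\ only} &= \frac{1}{T_{net}} = \frac{1}{5.937\,\text{ns}} \approx 168.4\,\text{MHz}
\end{align*}
```

The gap between the two figures is entirely the 4.030 nsec of flipflop overhead (clock-to-output plus setup), which the net-delay report does not include.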

OPTIMIZATION OPTIONS IN THE XILINX ENVIRONMENT

The Xilinx place-and-route tool, called the Design Manager, converts the EDIF netlist into a
configuration file that can be loaded into a target device. Some of the place-and-route tool
optimization parameters are configurable by the designer. To get into the options menu,
select options from the implementation menu as shown in Figure 6-8.

Figure 6-8 Xilinx Design Manager Options Selection

Figure 6-9 Xilinx Design Manager Implementation Options

MAPPING OPTIONS

The synthesized netlist has some placeholders for precompiled library elements. The
mapper finds the library elements (.ngo files, a binary netlist format) and merges them in.
The mapper then converts the merged netlist into a physical netlist with specific hardware
elements assigned to all the netlist logic elements. The mapper output is an .ncd (physical
netlist format) file. The user can configure the mapping process with the following options
from the Implementation Options window shown in Figure 6-9.

Trim Unconnected Logic

If the mapper encounters logic that is not used, this logic can be deleted from the design.
This simplifies the logic and speeds up the place-and-route process. However, the designer
might want to keep the unused logic because it will be used in a later version of the design.
Leaving the logic in may give a better estimate of the resources and timing related to the
final design.

Replicate Logic to Allow Logic Level Reduction

Redundant logic can be added to the design to reduce driver loading and speed up the
design (the basic area/speed trade-off).
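
The same trade-off can be made by hand in the Verilog source. The following is a minimal sketch, not a description of what the mapper does internally: a heavily loaded enable register is duplicated so that each copy drives half of the load. The module and signal names are hypothetical, and a synthesis tool may merge the duplicate registers again unless it is told to preserve them.

```verilog
// Hypothetical example: duplicate a high-fanout register by hand so that
// each copy drives only half of the load.
module replicate_enable (clk, enable_in, data_a, data_b, q_a, q_b);
    input         clk;
    input         enable_in;
    input  [15:0] data_a, data_b;
    output [15:0] q_a, q_b;

    reg        enable_a, enable_b;   // two copies of the same register
    reg [15:0] q_a, q_b;

    always @(posedge clk) begin
        enable_a <= enable_in;       // copy that drives the q_a bank
        enable_b <= enable_in;       // copy that drives the q_b bank
    end

    always @(posedge clk) begin
        if (enable_a) q_a <= data_a;
        if (enable_b) q_b <= data_b;
    end
endmodule
```

The cost is one extra flipflop; the benefit is a lighter load on each enable net.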

Generate 5-Input Functions

Generally, the basic Xilinx logic element is a 4-input look-up table. However, in some
Xilinx families the CLB logic can be configured to create 5-input LUTs.

CLB Packing Strategy

The mapper uses a set of rules to attempt to utilize the CLBs effectively. The CLB Packing
Strategy modifies the logic partitioning to allow less signal sharing and allows the use of a
CLB flipflop without the associated LUT. Again, this is a speed/area trade-off; the CLB
Packing Strategy can use more logic but may allow the design to run at a higher operating
speed. The Fit Device option packs the CLBs with possibly unrelated logic until the design
fits into the target device or until no more packing is possible. Turning this option Off
allows only related logic (logic with shared inputs) to be packed into a CLB.

Pack CLB Registers for Minimum Area or Structure

This option controls register ordering by analyzing bussed signal names. The Minimum
Area option will result in a denser design with registers mapped in a more random order.
The Structure option enables register-ordering analysis.

Pack I/O Registers/Latches into IOBs for Inputs Only, Outputs Only, Inputs and Outputs, and Off

Normally, the synthesis tool assigns logic to I/O buffers (IOBs). However, this option
allows the mapper to assign IOBs and can result in better CLB packing. Use the Off option
to allow the synthesis tool to control IOB assignment.

Use Generic Clock Buffers (BUFGs) in Place of BUFGPs and BUFGSs

Older Xilinx devices used primary (BUFGP) and secondary (BUFGS) global buffers for
global signals, so some synthesis tools may make these assignments. Newer Xilinx devices
use a pool of generic global buffers (BUFGs). Enabling this option will allow the
replacement of BUFGSs and BUFGPs with BUFGs.

Place-and-Route Options

Figure 6-10 Xilinx Design Manager Place-and-Route Options


Place & Route Effort Level

Another trade-off is the amount of time spent optimizing a design versus the optimization
results as shown in the Place and Route menu in Figure 6-10. If the place-and-route tool
tries longer, it will have more options to select from, and the area/speed results will
probably be better. Higher effort levels will increase the run time.

Router Option, Run Routing Passes

The designer can select the number of routing passes. Each routing pass is a complete
attempt at placement. Once the router has met the design requirements (the design fits into
the device with all timing constraints met), the router exits.

Run Delay-Based Cleanup Passes

Once a design has been placed, the timing can probably be improved. With this option the
designer can run 1 to 5 additional cleanup passes to attempt to improve the operating speed.

Use Timing Constraints During Place-and-Route

The timing constraints can be used to influence the place-and-route and achieve higher
operating speeds. Using timing constraints trades off processing time for design
performance. Turn this option Off to ignore timing constraints and speed up the place-and-
route process.

LOGIC LEVEL TIMING REPORT/POST LAYOUT TIMING REPORT

Figure 6-11 Xilinx Design Manager Implementation Options

Produce Logic Level Timing Report

For a quick view of the timing performance of the design, a logic level timing report can be
produced by selecting the check box shown in Figure 6-11. These estimated results can be
reviewed without going through the complete (and often very time-consuming) place-and-
route process.

Produce Post Layout Timing Report

A top-level report of the device timing can be reviewed with this brief timing report. The
maximum clock speed is reported. For error and path reports the entries are sorted by
constraint and delay value. Negative slack-time values indicate a constraint that was not
met.

Limit Report to n Paths per Timing Constraint

This setting, either Summary, No Limit, or a number from one to ten, limits the reported
number of worst-case paths per timing constraint.

Report Paths Using Advanced Design Analysis (No Timing Constraints)

This option provides a timing analysis when no user constraints are present. The analysis
includes all clocks, the required offset for each clock, and a listing of combinational paths
sorted by delay value.

Report Paths in Timing Constraints

This option generates a timing report based on timing constraints. The number of paths
reported per constraint is per the selection made in the Limit Report to n Paths per
Timing Constraint dialog box.
Listing 6-15 is an example of a timing report for a signal in the hier688.v design. All
the delay paths between rwn and output_enable are listed, along with the positive slack time
(good!). Note that 80% of the delay is in logic. This percentage will get smaller (possibly
much smaller) as the design gets more dense and the logic fights for routing resources.

Listing 6-15 Example, Xilinx Timing Report

=================================================================
Timing constraint: TS01 = MAXDELAY FROM TIMEGRP "PADS" TO TIMEGRP
"FFS" 50nS;
30 items analyzed, 0 timing errors detected.
Maximum delay is 13.354ns.
-----------------------------------------------------------------
Slack: 36.646ns path rwn to output_enable relative to
       50.000ns delay constraint

Path rwn to output_enable contains 3 levels of logic:

Path starting from Comp: P102.PAD
To             Delay type      Delay(ns)  Physical Resource
                                          Logical Resource(s)
------------------------------------------------- --------
P102.I1        Tpid            3.000R     rwn
                                          IPAD_rwn
                                          ix46
CLB_R7C14.F2   net (fanout=1)  1.215R     rwn_int
CLB_R7C14.X    Tilo            2.700R     D
                                          ix79
P99.O          net (fanout=1)  1.439R     D
P99.OK         Took            5.000R     output_enable
                                          reg_output_enable
-------------------------------------------------
Total (10.700ns logic, 2.654ns route)     13.354ns (to clock_int)
      (80.1% logic, 19.9% route)

Report Paths Failing Timing Constraints

This option generates a report of signals and paths that fail the timing constraints, listed
from worst to best. The logic and routing delays are identified and the failing path delays
are broken out to show all the delays that build up to cause the problem. A close
examination of the delays will provide clues to areas that can be pipelined or simplified to
make the design run faster or identify areas where the constraint is over-specified.
The number of paths reported per constraint is per the selection made in the Limit
Report to n Paths per Timing Constraint dialog box.
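
As an illustration of the pipelining suggestion above, here is a minimal sketch that splits a wide comparison into two clocked stages. It is not taken from the hier688 design; the module name, signal names, and the 32-bit width are assumptions chosen only for the example.

```verilog
// Hypothetical example: a 32-bit equality compare split into two
// clocked stages so that each stage has fewer levels of logic.
module pipelined_equal (clk, a, b, equal);
    input         clk;
    input  [31:0] a, b;
    output        equal;

    reg eq_low, eq_high;   // stage 1: compare each 16-bit half
    reg equal;             // stage 2: combine the two partial results

    always @(posedge clk) begin
        eq_low  <= (a[15:0]  == b[15:0]);
        eq_high <= (a[31:16] == b[31:16]);
        equal   <= eq_low & eq_high;   // uses the previous cycle's partial results
    end
endmodule
```

The result now appears two clock edges after the inputs change instead of one. Each flipflop-to-flipflop path is shorter, which is the kind of change that can turn a failing path into a passing one, at the cost of one clock of latency.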

Interface Options

Macro Search Path

When the netlist is merged and .ngo files are inserted, the compiler searches for the proper
file to insert. The user can add other search paths. Multiple search paths can be entered,
separated by semicolons.

Rules File

To be merged into the ncf netlist, a file must have the .ngo filetype. The rules file path can point to
a utility for converting other netlist file formats to an .ngo filetype.

Create I/O Pads from Ports

Some design tools convert PAD symbols into module port symbols. This checkbox option
will convert top-level module ports into PADs (device pins).

Simulation Options

Simulation Data Options

Xilinx can create a timing-annotated netlist in three flavors: EDIF, VHDL, and Verilog.
We’ll want to use the Verilog option to support Verilog simulation, of course. Vendors
supported for this version of the Xilinx place-and-route tool include generic EDIF, generic
Verilog, generic VHDL, ActiveVHDL, Concept NC-Verilog, Concept Verilog-XL,
Foundation EDIF, ModelSim Verilog (for the purposes of this book, this is the option we
will use), ModelSim VHDL, NC-Verilog, Quicksim, Verilog-XL, Viewsim-XL, Viewsim-
EDIF, VSS, and Default.

Correlate Simulation Data to Input Design

To use your logic gate and signal names instead of the names assigned by the place-and-
route tool in the optimized netlist, check this checkbox.

Simulation Netlist Name

Define the filename for the simulation output file. If you want to keep multiple versions of
the simulation file, enter a different filename here for each run; otherwise the new file will
overwrite the previous one.

VHDL/VERILOG SIMULATION OPTIONS

Bring Out Global Set/Reset Net as a Port

For simulation purposes, it can be handy to have the internal Set/Reset node available as a
port at the top level of the design. The signal name that drives the Global Set/Reset (GSR)
resource can be entered in the dialog box to match the HDL design.

Bring Out Global Tristate Net as a Port

For simulation purposes, it can be handy to have the internal tristate control node available
as a port at the top level of the design. The signal name that drives the Global Tristate (GTS)
resource can be entered in the dialog box to match the HDL design. This tristate controls all
device outputs and is useful for isolating a device from a circuit board being tested
(stimulated) with external equipment.

Generate Test Fixture/Testbench File

Check this checkbox to create a Verilog test fixture (.tv) template file.

Include `uselib Directive in Verilog File

Xilinx provides a set of timing-annotated SIMPRIM (SIMulation PRIMitive) files. The path
to these files can be automatically inserted in the Verilog test-fixture file by checking this
checkbox.

Generate Pin File

Check this checkbox to create a signal-to-pin (.pin) mapping file.

Retain Hierarchy in Netlist

The Verilog test-fixture file can maintain the input design hierarchy or flatten the netlist into
one big file. Check this checkbox to maintain the input design hierarchy.

Configuration Options

Xilinx devices are SRAM based and must have their configuration loaded after each power-
on. There are many configuration modes, including serial PROM, parallel master, parallel
slave, download cable, etc.

Configuration Rate

Selects a Slow (1 MHz) or Fast (8 MHz) internal configuration clock (master modes). These are
approximate speeds.

Threshold Levels (XC4000E and XC4000EX Only)

Select between a TTL-compatible input threshold (nominally 30% of the power-supply
value) or CMOS threshold (nominally 50% of the power-supply value) and output drive.
Select Read from Design to use the TTL/CMOS input level defined in the physical
constraints (PCF) file.

Configuration Pins

Various pull-up and pull-down options are available for the TDO, Mode, and Done
configuration pins, including a tristate mode.

Perform CRC During Configuration

The internal Xilinx configuration logic can perform a four-bit partial CRC check of
configuration data frames or just do a simple check of the 0110 pattern at the end of each
frame.

Produce ASCII Configuration File

The normal configuration file is a binary .bit file. An ASCII version (.rbt) of this
configuration bitstream file can also be created.

5V Tolerant I/Os (XC4000XLA and XC4000XV Only)

I/O pins on a low-voltage device can be configured to withstand higher drive voltages for
mixed-power-supply operation.

Start-Up Options

Start-up Clock

Configuration can be started based on an internal (CCLK) or external clock (User Clock)
source.

Synchronize Start-up to DONE Input Pin

The status of the open-drain DONE pin can be monitored. In cases where multiple FPGA
DONE pins are wire-ORed together, enabling this feature will cause all devices to start up
when the last device has finished configuring.

Output Events

Control signals can be asserted or released with different timing. These status signals
include Done, Enable Outputs, and Release Set/Reset.

Readback

The device configuration can be read when readback is enabled (readback can be disabled
for design security reasons). This tab includes options for the readback clock source
(internal or external) and termination of the readback process.

Tie Unused Interconnect

Unused pins can be tied high or low to reduce noise and power consumption.

Advanced Options

In the master parallel configuration mode, where the FPGA generates address lines to
control a parallel memory device, the configuration address lines can be configured for 18
or 22 lines.

OTHER DESIGN MANAGER TOOLS

Design Manager tools include the Flow Engine (which we used to perform the place-and-
route process), Timing Analyzer, Floor Planner, PROM File Formatter, Hardware Debugger
(which includes the FPGA download utility), and the EPIC Design Editor.

Timing Analyzer

The Timing Analyzer will provide a report of selected paths in the design. For example, it is
possible to examine all clocks in the design. Specific paths can be excluded.

Listing 6-16 Example, Xilinx Timing Report

=================================================================
Timing constraint: Default period analysis
12 items analyzed, 0 timing errors detected.
Maximum delay is 11.647ns.
-----------------------------------------------------------------
Delay: 11.647ns device_bus2(0) to device_bus1(2)

Path device_bus2(0) to device_bus1(2) contains 3 levels of logic:

Path starting from Comp: P46.PAD
To             Delay type      Delay(ns)  Physical Resource
                                          Logical Resource(s)
------------------------------------------------- --------
P46.I2         Tpid            1.560R     device_bus2(0)
                                          IPAD_device_bus2(0)
                                          ix57
CLB_R24C1.G2   net (fanout=3)  2.016R     device_bus2(0)_int
CLB_R24C1.Y    Tilo            1.590R     device_bus1_dup0(3)
                                          ix66
P44.O          net (fanout=1)  2.441R     device_bus1_dup0(2)
P44.PAD        Topf            4.040R     device_bus1(2)
                                          ix50
                                          OPAD_device_bus1(2)
-------------------------------------------------
Total (7.190ns logic, 4.457ns route)      11.647ns
      (61.7% logic, 38.3% route)

Listing 6-16 shows a generic timing report for the worst path (critical path) in the hier688
design. The maximum delay for this path is 11.647 nsec. Note the division of time between
logic and routes listed at the bottom. As the design gets denser, the routing will be a higher
percentage of the delay.

Floorplanning

Floorplanning is a procedure where the arrangement and location of logic inside the FPGA
is manipulated and optimized. Figure 6-12 illustrates a typical device floorplan. Some
aspects of the design are obvious to the designer and may or may not be recognized by the
automated place-and-route tools. Which parts of the design are critical and should be
located adjacent to other logic elements? Can things be switched around to get a faster
and more efficient design? Humans are better at these types of tasks than computers.

Figure 6-12 Xilinx Design Manager Floorplanner Tool

Figure 6-13 shows a zoom view of the hier688 logic, the pin assignments, the CLBs,
and a rats-nest view of the signal routing.

Figure 6-13 Xilinx Design Manager Floorplanner Tool, hier688 Design Zoom View

PROM File Formatter

Xilinx supports serial and parallel configuration PROM versions. A file can also be created
and linked into a microcontroller PROM. Large devices may require multiple PROMs. The
PROM File Formatter allows the design to be split into multiple configuration devices as
shown in Figure 6-14.

Figure 6-14 Xilinx Design Manager PROM File Formatter Options

Hardware Debugger

The Hardware Debugger provides communication options (shown in Figure 6-15) which
allow a device to be configured with a PC serial port, parallel port, or with a Xilinx
Xchecker cable (which also connects to a PC parallel port). A header (standard 0.025-inch square
posts, 0.1-inch center pattern) wired per Figure 6-16 must be on the circuit board to support this
download. Xilinx also supports 4-wire (TDI, TMS, TCK, TDO) JTAG serial-port
programming.

Figure 6-15 Communication Setup Options

Figure 6-16 Xilinx Xchecker Cable Header Wiring


EPIC Design Editor

This tool provides a graphical representation of the design as if you were looking down at
the physical device itself (see Figure 6-17). Pins, pin buffers and registers, global signals,
signal routing, and CLBs are all visible. Some routing can be done at this level. For
example, it is possible to hook up test-points without resynthesizing and recompiling.

Figure 6-17 EPIC Representation of the hier688 Design

CHAPTER 7

A Look at Competing Architectures

I used to think of FPGA vendors as all-knowing and mysterious entities, magically creating products which sell in copious
quantities. The reality is that these companies are in business and are made up of people,
good and bad, just like you and me. The FPGA manufacturers, which include Xilinx,
Altera, Actel, Lattice/Vantis/Lucent, Quicklogic, Atmel, and others, buy their packages from
the same sources and pay nearly the same prices for them. Most (Atmel is an exception)
contract their manufacturing to foundries that all have similar business models.

FACTORS THAT DETERMINE INTEGRATED CIRCUIT PRICING

• The package. A high proportion of the cost of an integrated circuit is in the packaging
(just like the cost of potato chips).
• The silicon. This includes the “square footage” and the complexity in number of
layers and lithography. This is similar to a warehouse, where the price is correlated to
how big it is and how many floors it has. Other factors can contribute, like whether
the IC fab process is “mainstream” or not. Some programmable devices use EEPROM
technology or specialized materials, such as the tungsten plugs that Actel uses for
layer interconnect vias.
• The volume, or the economy of scale. If you want a cheap IC, you have to either buy a
lot of them or leverage mass quantities consumed by industry (in other words, if you
want a cheap device, buy an IC that many other companies are buying). To turn this
around, in order for a foundry to sell an IC cheaply, it must produce a lot of them; it
doesn’t much matter if one customer or many customers buy the product. FPGAs are
flexible and are used by many designers in many industries. ASICs are specific and
targeted toward specific customers (though these foundry customers may resell the
device to a wide market; from Intel’s point of view, the microprocessor is a type of an
ASIC, for example).

These factors hold up only when a competitive situation exists. Where sources are limited
by specialized processes, proprietary technology, and small markets, the economic situation
is different and prices go up.
The FPGA market is dynamic and exciting but is also a bewildering mess of TLAs
(Three Letter Acronyms) and hyperbole. Each FPGA vendor has a design strategy to solve a
problem for the users of its products. The vendors have individual niches, strategies, and
technologies. The main idea to keep in mind is that no ultimate design approach is ideal for
each design target; each FPGA design has strengths and weaknesses. A Xilinx FPGA may
be the best solution for one design problem (like random arrays of mixed functions); an
Altera CPLD may be the best design solution for the next design (like datapath functions
including digital filters); an Actel device might be the best for designs that require ASIC-like
performance.

FPGA DEVICE DESIGN

Other factors influence the design of an FPGA device, including the patent minefield.
It can be quite galling to use a patented architecture and make your competitor rich with
royalties. Can you design around a patent? Can you devise another way of solving a
problem? Is it better to just license the patent? What features does your target market need?
Are there gaps in a competitor’s product line?

Figure 7-1 A Day in the Life of an FPGA Silicon Designer

The Verilog FPGA designer must be very aware of the target FPGA architecture.
Keep in mind that your design will be implemented in look-up tables with significant
routing propagation delays. Verilog can be a portable language, but the highest-performance
designs are tailored for the target FPGA. How granular are the FPGA design elements? A
Xilinx 4000XL FPGA can be thought of as an array of four-input LUTs, any of which can
be a 16-by-1 RAM. An Altera Flex 8K device is an array of Logic Array Blocks (LABs)
which have four-input LUTs in groups of eight with dedicated RAM blocks spread around
the die. How rich is the routing resource? Let’s say you notice that the die inside an FPGA
that Altera rates at 20,000 gates is about two-thirds the size of a similarly rated Xilinx
device. Does this mean Altera more cleverly packs LUTs on a die, or that Xilinx believes
much more routing resource is necessary to adequately support the use of the CLBs? Are
internal tristate signals supported? How many global low-skew clock networks are
available? Leaving vested interests and emotion aside, which architecture is better? Is it
a better use of silicon to provide more gates or more routing? There are no universal answers
to these questions.

FPGA TECHNOLOGY SELECTION CHECKLIST

Just about any FPGA can serve just about any job, given that the device has enough pins and
gates. Often a device is selected for nonscientific reasons, such as what devices are in the
company stockroom or which device the designer feels a need to put on her resume. This
list will allow an objective comparison of FPGA technologies. We’re not going to ask about
gate count: each vendor counts gates in a different manner, so this number is nearly useless.
These questions are more important than the device overview, because the technology is
constantly changing.

• How many logic blocks and flipflops does the device contain?

• How much RAM does the device have (if any)? Dual port or single port?
Distributed or in blocks? How big are the blocks? In what organization can they
be used (X1, X2, X4, X8, X16, X32, etc.)?

• How many usable I/O pins does the device have?

Keep in mind that each device has power/ground and other dedicated pins that are not
available for use as signals by the Verilog designer.

• How many low-skew global clock/reset/preset networks are available?

It is possible to route a clock or other global signal through signal routing channels, but the
risk of design problems goes up due to clock skew.

• Are internal tristate busses supported?

Tristate busses can speed up a decoder considerably compared to a MUX structure.

• Are denser devices available in the same package and compatible pinout?
If the design grows, will the circuit board need to be redesigned to accommodate a denser
device?

• Does the device pinout have to be locked before the FPGA design is complete?
A requirement to finish a PWB design early can favor an FPGA device (which has more
capability of routing device pins to random logic elements) over a CPLD device.

• Can the FPGA be reloaded at power-up or must it be available instantly at power-up?
Imagine an FPGA that includes memory decoding for a microprocessor. If the
microprocessor initializes the FPGA and the FPGA does not decode memory so that the
microprocessor accesses memory properly, this might be a power-up issue. Often you’ll see
a fast PLD for memory decoding and clock generation along with the FPGA.

• Does the FPGA support in-circuit configuration for field upgrade or board
customization?

• Does the FPGA have flipflops available for input pins?

Having flipflops available on the outside of the device near input pins allows fast and
predictable latching of input signals.

• Does the FPGA have flipflops available for output pins?

Having flipflops available on the outside of the device near output pins creates fast and
predictable clock-to-output times.

• Is a conversion to ASIC required?

If so, an antifuse device, with an ASIC-like architecture and speed, should be considered.

• Is a socket required?
For one-time programmable parts, until the design is solid, a socket will be necessary to
ease the changing of devices. This is not a trivial matter; fine-pitch packages can be a real
challenge to socket.
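
The two checklist items about flipflops at the input and output pins can be pictured with a short sketch. This is a hedged illustration only: the module and signal names are hypothetical, and whether the tools actually place these flipflops in the I/O blocks depends on the device and on settings such as the Pack I/O Registers/Latches into IOBs mapper option described in Chapter 6.

```verilog
// Hypothetical example: register signals at the device boundary so the
// tools have the opportunity to place the flipflops in the I/O blocks.
module io_registers (clk, pad_in, pad_out, core_in, core_out);
    input  clk;
    input  pad_in;     // from a device input pin
    output pad_out;    // to a device output pin
    output core_in;    // registered input, to internal logic
    input  core_out;   // from internal logic

    reg core_in, pad_out;

    always @(posedge clk) begin
        core_in <= pad_in;     // input flipflop: short, predictable setup path
        pad_out <= core_out;   // output flipflop: predictable clock-to-output
    end
endmodule
```

With no logic between the pin and the flipflop, the pin timing no longer depends on how the rest of the design was placed and routed.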

Two vendors have about 80% of the FPGA market: Xilinx and Altera. Which one is
larger is subject to debate at this writing and depends on how the numbers are counted. You
can’t be an FPGA designer without acknowledging these two companies (you need both on
your resume to be most marketable). Other companies have great products and software!
None yet comes close to achieving Altera and Xilinx’s market share, and I think none ever
will. Check the resources section at the end of this book for device-manufacturer website
addresses and visit those sites for the latest data.

XILINX FPGA ARCHITECTURES

XC3000/XC3100 Series FPGAs

This is an older SRAM-based architecture. The CLBs have five logic inputs, two flipflops, a
common clock, direct reset, and a clock enable. It’s interesting to look at the clock-enable
implementation (see Figure 7-3). A clock-enable MUX selects between the output of the
look-up table and the latch. In the clock disabled mode, the output of the latch is fed back to
the input.
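
A behavioral sketch of the clock-enable arrangement just described follows. It is an illustration under assumptions (the module and signal names are hypothetical, and the reset polarity is chosen arbitrarily), not a model of the actual XC3000 circuit.

```verilog
// Sketch of a clock-enabled D flipflop. When clock_enable is low the
// flipflop reloads its own output through the feedback MUX.
module ce_flipflop (clk, reset, clock_enable, d, q);
    input  clk, reset, clock_enable, d;
    output q;
    reg    q;

    wire next_q = clock_enable ? d : q;   // clock-enable MUX

    always @(posedge clk or posedge reset) begin
        if (reset)
            q <= 1'b0;        // direct (asynchronous) reset
        else
            q <= next_q;      // new data, or the old value fed back
    end
endmodule
```

Because the clock itself is never gated, the flipflop stays on the low-skew clock network and the enable is just another synchronous input.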
As shown in Figure 7-2, XC3000 IOBs have input and output flipflops,
programmable tristate, and pull-up resistor output control. This architecture does not have
more modern features like SRAM and fast carry in/out.

Figure 7-2 Xilinx 3K Family I/O Architecture


Figure 7-3 Xilinx 3K Family CLB Architecture

Signal routing on the FPGA is performed by pass transistors called Programmable
Interconnect Points (PIPs). The pass transistors are controlled by RAM-based configuration
bits. The routing between CLBs includes nondirectional connections, drivers between
horizontal and vertical long lines, drivers between vertical and horizontal long lines, and
other specialized types. Long lines bypass the PIPs and carry signals that travel long
distances across the die. The 3K family includes a buffer which can be used to support an
oscillator circuit (using a crystal, RC network, or resonator to set the operating frequency).
The Xilinx 3K family is an SRAM-based architecture. After power-up, device
configuration is loaded into the device from a serial EPROM, over a serial download cable, by
parallel load from a microprocessor (slave mode) or from a byte-wide or word-wide EPROM, or by
other means. When the power is removed from an SRAM-based device, the configuration is lost.
On power-up, the Xilinx device automatically loads configuration data in the manner
defined by programming mode pins.

XC4000 Series FPGAs

The Xilinx 4K family, compared to the 3K family, has greater densities, improved speed,
and other added features. In particular, the addition of distributed RAM (the ability to
configure a CLB as a 16-by-1 RAM cell) in the 4000E and 4000X families is a great feature
for the designer.
The Xilinx CLB, as illustrated in Figure 7-4, contains two four-input LUTs, two D-
type flipflops with dedicated clock enable, set or reset, a clock with configurable polarity,
and fast carry-in and carry-out signal paths. Each CLB in a 4000E/XL device can be used as
two 16X1 single-port RAMs, a single 16X1 dual-port RAM, or as a single-port 32X1 RAM.
The dual-port RAM configuration is synchronous; the other modes can be non-synchronous
(level-sensitive).
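
For reference, a 16-by-1 single-port RAM can be described behaviorally as in the sketch below. The module and signal names are hypothetical. Whether a given synthesis tool maps this description onto CLB RAM, builds it from flipflops, or requires a vendor RAM primitive to be instantiated instead is tool-dependent and is not confirmed here.

```verilog
// Hypothetical example: a 16-by-1 single-port RAM with a synchronous
// write and an asynchronous (combinational) read.
module ram16x1 (clk, we, addr, din, dout);
    input        clk, we, din;
    input  [3:0] addr;
    output       dout;

    reg mem [0:15];               // sixteen one-bit locations

    always @(posedge clk)
        if (we)
            mem[addr] <= din;     // write on the clock edge

    assign dout = mem[addr];      // read follows the address
endmodule
```

The four address bits play the role of the four LUT inputs, which is why a single four-input LUT can serve as a 16-by-1 memory.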
Other blocks provided by Xilinx include Input/Output blocks (IOBs) which include
I/O registers and configurable terminations (pull-up or pull-down) and pin buffers (fast or
slow). The 4000 family includes wide decoder blocks which are useful for fast decoders for
up to nine inputs. The 4000 family includes an on-chip oscillator and dedicated low-skew
networks that can be used for clocks and other fast global signals. The 4000 family supports
internal tristate signals and busses.
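
To make the tristate-versus-MUX comparison concrete, the sketch below writes the same four-way selection both ways. The module name, signal names, and 8-bit width are assumptions for illustration; the mapping to internal tristate buffers depends on the target family and on the synthesis tool.

```verilog
// Hypothetical example: the same 4-to-1 selection written two ways.
module bus_select (sel, a, b, c, d, y_mux, y_tri);
    input  [1:0] sel;
    input  [7:0] a, b, c, d;
    output [7:0] y_mux, y_tri;

    // MUX version: the selection logic is built from look-up tables.
    assign y_mux = (sel == 2'b00) ? a :
                   (sel == 2'b01) ? b :
                   (sel == 2'b10) ? c : d;

    // Tristate version: one driver per source, exactly one enabled at a time.
    assign y_tri = (sel == 2'b00) ? a : 8'bz;
    assign y_tri = (sel == 2'b01) ? b : 8'bz;
    assign y_tri = (sel == 2'b10) ? c : 8'bz;
    assign y_tri = (sel == 2'b11) ? d : 8'bz;
endmodule
```

For any known value of sel the two outputs carry the same data. In the tristate version adding a source adds only another driver to the shared bus, while the MUX version grows in LUT levels as sources are added. The enables must be mutually exclusive to avoid bus contention.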

Figure 7-4 Xilinx 4K Family CLB Architecture


HardWire Devices

Xilinx offers a hardwired version of its FPGAs, which can save some cost for
applications where the volume does not justify a conversion to a full custom ASIC. In this
technology, Xilinx uses the same CLB architecture but replaces the SRAM routing and
switching arrays with metal layers. The result is less silicon (smaller die) and equal-to or
better-than timing compared to the FPGA. An advantage to conversion to HardWire is the
low stress on the FPGA designer; Xilinx guarantees that the timing and function of the
custom device will match the FPGA device. Conversion to a HardWire device is a test-
vectorless process; Xilinx develops automated test coverage and guarantees the device will
work in your application. This does not let the designer off the hook for doing a
synchronous design and doing thorough testing. For example, all asynchronous logic needs
to be reviewed for race conditions, because the HardWire device will most likely be faster
than the FPGA design.

Virtex Series FPGAs

The latest devices from Xilinx are built with 0.22-micron lithography (with a roadmap to
0.18 micron) and five-layer metal technology. The million-gate device has 75,000,000
transistors. Interesting new features, as shown in Figure 7-5 and Figure 7-6, include mixed-
voltage I/O (including low-voltage differential inputs to support busses like GTL),
dedicated 4096-bit dual-port SRAM blocks, distributed RAM cells, multiple DLLs (Delay-
Locked Loops) to provide controlled-delay clock networks, and vector-based routing
(allowing flexible routing up/down/left/right between CLBs). These are 2.5-volt devices,
but the I/Os are tolerant of higher interface voltages.
Xilinx advertises that its million-gate device has 27,648 logic cells, 131,072 block
RAM bits, and 660 user I/O pins. Consider that an 8031 microcontroller core is less than
600 gates (less than 100 CLBs), contains 256 bits of RAM, and has 32 user I/O pins.
Welcome to the new millennium.

Figure 7-5 Xilinx Virtex Family Architecture


Figure 7-6 Xilinx Virtex Family CLB Architecture (one of two slices per cell shown)

Figure 7-7 Xilinx Virtex Family Input/Output Block

Virtex I/O blocks, as shown in Figure 7-7, include programmable pull-up and pull-
down resistors, a weak keeper circuit (this holds a signal value when a driver is removed),
tristate control, I/O latches, and inputs with programmable delay (for shifting an input signal
with respect to the clock edge).

Configuration Devices

During development, a download cable connected to a PC is the most convenient method
for configuring an FPGA. Once the design is complete, a serial configuration device (serial
PROM) can be used. One-Time Programmable (OTP) devices (from Altera, Xilinx, and
Lucent) and reprogrammable versions (from Atmel) are available. To save the cost of the
serial device, the parallel download mode from a microprocessor can be performed. This
means the processor must be initialized and running before the FPGA can be configured.
Also, Xilinx devices can be programmed via a JTAG serial port.

ALTERA CPLD ARCHITECTURES

Altera uses the phrase Complex Programmable Logic Devices (CPLD) to describe their
design approach. Altera uses less routing resource than Xilinx. Their LABs (Logic Array
Blocks) are more complex than Xilinx’s CLBs and fewer of them are available on an Altera
die. Is this approach better than Xilinx’s? It depends on what you’re trying to do. One thing
in Altera’s favor is that its place-and-route software has a much easier job than Xilinx’s
Design Manager because there is much less routing resource. This means the Altera
software is fast and very deterministic; the same design, compiled with the same
compilation settings, gives the same result every time. On one of the Usenet groups a
designer said, “I love Altera’s software and I love Xilinx’s silicon.” There is a lot of
wisdom in that simple statement.
Because of the limited routing, it can be more difficult to route a design. The designer
must not lock down the pins too early in the design cycle. As the design grows, it may not
be possible to “reach” all the pins that were previously assigned. This is much less of an
issue with Xilinx/Lucent/Actel architectures.

Altera FLEX8K Architecture

The FLEX8K (Flexible Logic Element matriX) is an SRAM-based, coarsely grained
architecture (based on large blocks of logic elements) and contains an array of Logic Array
Blocks (LABs). Each LAB has eight LEs (Logic Elements), four control input signals (used
as clock, set/reset, carry in, and cascade in), 24 global logic inputs, eight local feedback
inputs, and eight outputs. Compared to Xilinx, there is much less interconnect between
LABs. For example, the LAB fast carry and cascade outputs are connected only to the LAB
on the right, and the fast carry and cascade inputs come only from the LAB on the left. The
FLEX8000 family does not support internal tristate busses (the compiler automatically
replaces internal tristates with MUXs). Each LE has a four-input LUT and a flipflop.
In Figure 7-8, the IOEs are Input/Output Elements (pin buffers and registers). The
structure of the IOEs is shown in Figure 7-9.
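
As a rough illustration of what the tristate-to-MUX replacement mentioned above means, consider the sketch below. The module and signal names are invented for this example and do not come from Altera's documentation. The first module describes two sources sharing an internal net through tristate drivers; the second is the kind of multiplexer a compiler could substitute for it.

```verilog
// Two sources sharing one internal net through tristate drivers.
module shared_tristate (bus_out, select_a, data_a, data_b);
  input        select_a;
  input  [7:0] data_a, data_b;
  output [7:0] bus_out;

  // Exactly one driver is enabled at a time; the other floats.
  assign bus_out =  select_a ? data_a : 8'bz;
  assign bus_out = !select_a ? data_b : 8'bz;
endmodule

// Functionally equivalent multiplexer with no internal tristates.
module shared_mux (bus_out, select_a, data_a, data_b);
  input        select_a;
  input  [7:0] data_a, data_b;
  output [7:0] bus_out;

  assign bus_out = select_a ? data_a : data_b;
endmodule
```

Writing the MUX form directly keeps the source portable between families that do and do not have internal tristate resources.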

Figure 7-8 FLEX8000 Logic Structure

Figure 7-9 FLEX8000 Logic Element

Configuration modes include JTAG, JAM (a serial configuration standard that Altera
is promoting), active/passive serial modes, active/passive parallel synchronous modes, and
asynchronous modes.

Altera FLEX10K Architecture

Altera upgraded the FLEX8K family to create the FLEX10K family. FLEX10K includes
embedded 2048-bit dual-port RAM blocks (Embedded Array Blocks or EABs) which are
spread around the die. The 2048-bit EABs can be used as 2048-by-1, 1024-by-2, 512-by-4,
or 256-by-8 arrays. They can also be combined with other EABs to create wider or deeper
memory elements. The EABs can also be used for logic functions where they act as large
LUTs. The EABs include input and output registers.
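
A generic synchronous RAM description sized to fill one 2048-bit block might look like the following sketch. The module name, parameter names, and port names are assumptions for illustration, and whether a particular synthesis tool maps this code into an EAB or builds it out of LEs depends on the tool and its settings.

```verilog
// Sketch of a 256-by-8 synchronous RAM (2048 bits). Names are hypothetical.
module ram_256x8 (data_out, data_in, address, write_enable, clock);
  parameter ADDRESS_WIDTH = 8;   // 2**8 = 256 words
  parameter DATA_WIDTH    = 8;   // 256 words x 8 bits = 2048 bits

  input                      clock, write_enable;
  input  [ADDRESS_WIDTH-1:0] address;
  input  [DATA_WIDTH-1:0]    data_in;
  output [DATA_WIDTH-1:0]    data_out;

  reg [DATA_WIDTH-1:0] data_out;
  reg [DATA_WIDTH-1:0] memory [0:(1 << ADDRESS_WIDTH) - 1];

  always @(posedge clock)
  begin
    if (write_enable)
      memory[address] <= data_in;
    // Registered read; during a write this returns the old contents.
    data_out <= memory[address];
  end
endmodule
```

Overriding the parameters gives the other organizations listed above: an address width of 11 with a data width of 1 is 2048-by-1, 10 with 2 is 1024-by-2, and 9 with 4 is 512-by-4.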
FLEX10K LEs are similar to the FLEX8K with the addition of an output connected to
their fast routing channel (called FastTrack). In addition, a clock enable is added to the LE
flipflop.

Altera APEX 20K Architecture

Building on the structure of the 8K/10K architectures, Altera has designed the APEX 20K
family with a logic structure as shown in Figure 7-10. This family has much denser devices
and advanced features like Clock PLL (for clock lock, multiplication, and phase shift), a mix
of logic structures (including RAM and wide PLD-type logic for decoders), and a variety of
input/output modes (to interface with single-ended and differential busses like Low Voltage
Differential, Stub-Series Terminated, and Gunning Transceiver Logic). The number of
Logic Elements in a LAB is expanded from 8 in the FLEX 8K/10K families to 10. The
Logic Elements are similar to the FLEX10K with added synchronous Load and Clear logic
and more clock options.

Figure 7-10 APEX 20K Logic Structure

CHAPTER 8

Libraries, Reusable Modules, and IP

Increasing electronic design density is a trend that has continued over the last 40 years.
The consumer’s hunger for increasingly
sophisticated gadgets, whether GPS, cellular telephones, games, home automation,
networking, Internet commerce, audio/video entertainment, or computing, seems endless.
We get more free time each year, and we are filling this time by playing with our electronic
toys, all of which continue to get smaller, use less power, and grow more complicated.
While the demand grows, the ability for industry to provide transistors and gates also seems
endless. For the FPGA designer, this means designs will contain more gates.

Suppose an engineer can design at a rate of 100 or so gates a day. Let’s call this about
10 lines of Verilog code (this includes the overhead of test and documentation). Soon, the
average FPGA design will be 200,000 gates. At 100 gates a day, that is 2,000 engineer-days
of work. This means, unless the design methodology changes, that a two-person team will
take 1,000 days to complete a design, almost three years! Each year the design task gets
more complex, but the schedule remains about the same. The company expects an average
project to be complete in a year and a complex project to be complete in a year and a half.
Clearly, something has to give. There are several options.

KEYS TO INCREASED PRODUCTIVITY

• The size of the design team must increase.

The most productive team is composed of one to three expert designers. If the company can
afford to wait for the product, this will be the cheapest way to get it. However, most
companies (in spite of what they say) are not interested in efficiency and productivity.
Instead, they are interested in getting the product on the market as soon as possible. So,
large teams are created upon the theory that if one woman can bear a child in nine months,
then they’ll just have to get nine women to finish this job in one month.
The problem with large teams is that the number of communication channels between
team members grows much faster than the number of team members. If there are two team
members, Jack and Jill, then Jack and Jill have to coordinate their work and there are two
communication channels (Jack to Jill and Jill to Jack). If there are three team members,
then Jack has to coordinate with Jill and Jerry, and Jill has to coordinate with Jack and
Jerry, and so on, for six channels. A team of n members has n(n - 1) channels, so a team
of ten has 90. As confusing as this silly example is, real-life communication on a design
team is worse.
It takes extraordinary effort to keep the team jelled and working in the same direction.
Eventually, the design work of each team member must work with the design work of the
other team members. This will not happen by accident. There will be more meetings (which
reduce productivity), activity reports (which reduce productivity), more specifications to
assure that design elements work together (which reduce productivity), and more chances
of team conflict (which definitely reduce productivity).
Managing large teams is an art more than a science. Though the number of gates in a
typical design is increasing exponentially, the ability of people to work together is not
increasing much, if at all.

COFFMAN’S LAW
The average measure of intelligence of a room is inversely proportional to the
number of people in the room.

All is not lost; there is an alternative to creating large design teams.

• The individual designer must produce more code.

In the hardware design world, an electrical designer of the 40s designed with a handful of
vacuum tubes. In the 50s, the tubes were replaced with transistors. In the 60s, the transistors
were replaced with integrated circuits (100s of transistors). Today, ICs with millions of
transistors are common. So, we hardware designers became comfortable with creating
designs by mixing and matching circuit elements we didn’t design. There are two ways this
can happen.

Each line of code can represent larger amounts of circuitry

As synthesis tools get smarter and FPGA devices get denser (so the number of gates
required to implement a design becomes less important and we can afford to waste gates in
order to produce a design more quickly), higher-level constructs become feasible. One day,
we will implement a 1024-bit adder that runs at 100 MHz by writing a line of code like:

a = b + c;

Instead of handcrafting a look-ahead carry adder, the synthesis tool will infer an
efficient adder based on your design constraints.
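
Wrapped in a module, that one line might look like the sketch below. The module name, the WIDTH parameter, the carry output, and the registered result are assumptions added for illustration; whether a 1024-bit version actually reaches 100 MHz depends entirely on the device and the tools.

```verilog
// Sketch of an inferred adder: a = b + c, with the width as a parameter.
module wide_adder (a, carry_out, b, c, clock);
  parameter WIDTH = 1024;

  input              clock;
  input  [WIDTH-1:0] b, c;
  output [WIDTH-1:0] a;
  output             carry_out;

  reg [WIDTH-1:0] a;
  reg             carry_out;

  // The synthesis tool chooses the adder structure (ripple, carry chain,
  // look-ahead) from the timing constraints; none is described here.
  always @(posedge clock)
    {carry_out, a} <= b + c;
endmodule
```

The point is that the source says nothing about how the carry is propagated. The same lines describe an 8-bit adder or a 1024-bit adder, and the implementation is left to the tool.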

Designs will be reused

Modules will be included in your code from previous designs (the most common reuse
method) or will be purchased or licensed from someone else. A lot of energy in our industry
is focused on selling Intellectual Property (IP) designs to ASIC designers, and vendors
would love to supply IP to the FPGA market, too. Frankly, the heavy-breathers in the ASIC
and Design Automation areas think they will make large amounts of money selling designs
to companies trying to reduce their product’s time-to-market.
If only there were a tried-and-true market model for using IP! Well, there is, and it’s
related to the use of integrated circuits. This model has been used successfully for over 30
years, so it must work well. From a designer’s point of view, specifications, pricing, and
delivery of various IP offerings are evaluated and the right product is selected for the design
at hand. The financial model for the device manufacturer is interesting to consider. The
device is designed at great expense and placed on the market. The up-front cost to produce
this design (which can be millions of dollars) is paid back slowly over time as the devices
are purchased by electronics manufacturers. This strategy can be quite profitable if the
design becomes popular, but it takes deep pockets to play this game. Can IP vendors play
this way?
The IP provider must provide complete data which characterizes performance
including throughput, latency, signal I/O requirements, module size, and power
consumption. This assures that the design is appropriate to the application and allows
comparison to other products. The successful IP offering will be a stand-alone module that
performs specific functions that designers are comfortable with, like FIFOs and other types
of memory-based modules, microcontrollers, filters, compression/decompression functions,
and communication ports (UARTs, Ethernet, USB, etc.). For even wilder speculation about
the type of IP that might be feasible, see the Afterword: A Look into the Future, Millions
and Millions and Millions of Gates.
Before we get too excited about off-the-shelf IP, let’s take a look at the simplest
method of increasing design productivity, the use of built-in library elements.

LIBRARY ELEMENTS

Each FPGA vendor supplies a set of primitive library elements. The Verilog design is
mapped to the hardware using primitives similar to these. The primitives are implemented in
an efficient manner by the underlying hardware. They get more capable every year as the
FPGA vendor adds elements to, and increases the utility of, the libraries. The FPGA vendor
has a vested interest in providing design aids and shortcuts that increase the efficiency and
ease of use of their products.
The expert designer keeps in mind various levels of abstraction for a design, including
the types of library elements that will be used to implement the design. The following is an
example of what sort of elements we might see in a vendor’s primitive library:

 1. AND      AND Gates
 2. BPAD     Bidirectional Pad
 3. CKBUF    Clock Buffer
 4. FF       D Flipflop with Asynchronous Set, Reset, and Clock Enable
 5. INV      Inverter
 6. IPAD     Input Pad
 7. KEEPER   Weak State Keeper (holds last value if driver is removed)
 8. LATCH    D Latch with Asynchronous Set and Reset
 9. LATCHE   D Latch with Asynchronous Set, Reset, and Gate Enable
10. LUT      Look-up Table
11. MUX      Multiplexer
12. ONE      Logic “1” Generator
13. OPAD     Output Pad
14. OR       OR Gates
15. PD       Pull-down Resistor
16. PU       Pull-up Resistor
17. RAM      Read/Write Memory (can be used as Read-only)
18. SFF      D Flipflop with Asynchronous/Synchronous Set/Reset
19. SRL16E   Shift Register LUT
20. TRI      Tristate Buffer
21. UPAD     Unbonded Pad
22. XOR      Exclusive-OR Gates
23. ZERO     Logic “0” Generator

There will be various flavors of these primitives. For example, the following versions of the
AND gate might be available:

1. AND2    2-Input AND with Noninverted Inputs
2. AND3    3-Input AND with Noninverted Inputs
3. AND4    4-Input AND with Noninverted Inputs
4. AND5    5-Input AND with Noninverted Inputs
5. AND6    6-Input AND with Noninverted Inputs
6. AND7    7-Input AND with Noninverted Inputs
7. AND8    8-Input AND with Noninverted Inputs
8. AND16   16-Input AND with Noninverted Inputs
9. AND32   32-Input AND with Noninverted Inputs

The Verilog compiler also uses a set of primitives. We see much similarity with the FPGA
vendor primitive library.

1. FALSE
2. TRUE
3. INV
4. BUF
5. AND2
6. OR2
7. XOR2
8. NAND2
9. NOR2
10. MUX
11. DFFRS
12. DFFERS
13. LATRS
14. RSLAT
15. TRI
16. PULLUP
17. PULLDN
18. TRSTMEM
19. DON’T_CARE

The following device-specific library list is from Exemplar and is for the Xilinx 4000XL
family. There is a similar library for generic primitives.

1. IBUF Input Buffer

2. OBUF Output Buffer
3. OBUF_NG Inverting Output Buffer
4. OBUFT Tristate Output Buffer
5. OBUFT_NG Inverting Tristate Output Buffer
6. OUTFF Output Flipflop
7. OFDX Output D Flipflop with Active High OC
8. OFDXI Output D Flipflop with Active Low OC
9. OFDTX Output Flipflop with Tristate Output
10. OFDTXI Tristate Output D Flipflop with Active Low OC
11. OFD Output D Flipflop
12. OFD_NG Output D Flipflop with Inverted Output
13. OFDX_NG Output D Flipflop with Inverted Tristate Output
14. OFDI
15. OFDI_NG
16. OFDXI_NG
17. OFDT
18. OFDT_NG
19. OFDTX_NG
20. OUTFFT
21. OFDTI
22. OFDTI_NG
23. OFDTXI_NG
24. IFD Input D Flipflop
25. IFDX
26. INFF Input Flipflop
27. IFD_NG
28. IFDX_NG
29. IFDI
30. IFDXI
31. IFDI_NG
32. IFDXI_NG
33. INLAT Input Latch
34. ILD_1
35. ILDX_1
36. ILD_1_NG
37. ILDX_1_NG
38. ILD
39. ILDX
40. ILDI
41. ILDXI
42. ILDI_1
43. ILDXI_1
44. ILDI_1_NG
45. ILDXI_1_NG
46. INREG Input Register
47. DFF D Flipflop
48. FDPE
49. FDCE
50. FD
51. FD_GP
52. FDP
53. FD_NGP
54. FD_NG
55. FDC
56. FDC_NG
57. FDCE_NG
58. FDE
59. FDE_GP
60. FDE_NGP
61. FDE_NG
62. FDP_NG
63. FDPE_NG
64. OUTFFT_IBUF Output Flipflop with Tristate and Input Buffer

65. OUTFFTX_IBUF
66. OBUFT_INFF_1 Output Tristate Buffer with Input Flipflop
67. OBUFT_INFFX_1
68. OBUFT_INLAT_1 Output Tristate Buffer with Input Latch
69. OUTFFT_INFF_1 Output Tristate Buffer with Input Flipflop
70. OUTFFTX_INFF_1
71. OUTFFT_INFFX_1
72. OUTFFTX_INFFX_1
73. OUTFFT_INLAT_1 Output Tristate Flipflop with Input Latch
74. OUTFFTX_INLAT_1
75. DLAT D Latch
76. LDCE_1
77. LDPE_1
78. LD
79. LD_NG
80. LD_1
81. LDC
82. LDC_NG
83. LDCE
84. LDPE
85. LDP
86. LDP_NG
87. LD_NGP
88. BUFGLS
89. BUFGE Global Buffer with Enable
90. BUFFCLK Clock Buffer
91. BUFGS Buffer to Global Resource (secondary)
92. BUFGP Buffer to Global Resource (primary)
93. BUFG Buffer to Global Network
94. BDBUF Bidirectional Buffer

STRUCTURAL CODING STYLE

If you follow the digital design newsgroups (see the resources section), you will
periodically see the schematic zealots presenting a case that efficient designs can generally
be implemented only with schematics. This may be true, so, to the Verilog purist, it may
make sense to do a schematic with text by wiring primitives together. Done properly, this
will result in very compact and fast logic. However, it can get unwieldy very fast, so we’ll
want to use this approach only where necessary.
One drawback of a schematic design can’t be argued: it’s not very portable. Using IP
in a design requires some portability if the IP is to be offered for sale to the design world. IP
and HDL need each other.

Listing 8-1 is an example of the structural use of a library primitive, Figure 8-1 shows
the corresponding synthesized circuit, and Figure 8-2 shows the Xilinx structural resource
assignment.

Listing 8-1 Verilog Structural Design with Library Primitive

// Structural Instantiation of Library Primitive.

module global_buffer (clock_out, data_out, chipsel, strobe, data_in);

input  chipsel, strobe, data_in;
output data_out;
output clock_out;
reg    data_out;   // Declared as a register output; not driven in this listing.
wire   clock_in;

// Combine chip select and strobe to form the clock source.
assign clock_in = chipsel & strobe;

// Route the derived clock onto a global buffer.
BUFG buf1 ( .I(clock_in), .O(clock_out) );

endmodule

// Create black box for buffer.

module BUFG (I, O);
input  I;
output O;
endmodule

Figure 8-1 BUFG Structural Resource Assignment

Figure 8-2 BUFG Structural Resource Assignment (information from Xilinx Floorplanning Tool)

Figure 8-2, a list of resources used in the global_buffer design, shows that BUF1 was
implemented as a BUFGS (the only type of global buffer available in the Xilinx 4000XL
family).

A SMALL DIVERSION TO COMPARE A SCHEMATIC TO A VERILOG DESIGN

Figure 8-3 shows a schematic for a simple RAM implementation. It is interesting to
compare this schematic with the Verilog structural version of this design as shown in
Listing 8-2. More information on the LogiBLOX tool is presented in the next section.

Figure 8-3 Schematic Using Library Primitives

Listing 8-2 Verilog Structural Schematic Example

// Structural Schematic Design Example.

module schematic(out_data, in_data, in_addr, clock, write_enable);
input [3:0] in_data, in_addr;
input clock, write_enable;
output [3:0] out_data;
// Define the interface to the black box ram_module.
// This empty box will be filled with a predefined netlist representing
// a RAM block created by the Xilinx LogiBLOX tool.
//----------------------------------------------------
// LogiBLOX SYNC_RAM Module "ram_module"
// Created by LogiBLOX version M1.5.19
// on Mon Dec 28 17:21:11 1998
// Attributes
// MODTYPE = SYNC_RAM
// BUS_WIDTH = 4
// DEPTH = 16
// STYLE = MAX_SPEED
// USE_RPM = FALSE
//----------------------------------------------------
ram_module u1 ( .A(in_addr), .DO(out_data), .DI(in_data),
                .WR_EN(write_enable), .WR_CLK(clock) );
endmodule

// Black box: ports only. The LogiBLOX netlist supplies the contents.
module ram_module(A, DO, DI, WR_EN, WR_CLK);
input [3:0] A;
output [3:0] DO;
input [3:0] DI;
input WR_EN, WR_CLK;
endmodule

Figure 8-4 shows the LogiBLOX main menu.

Figure 8-4 Creating a RAM Module with LogiBLOX

Figure 8-5 Structural Schematic using LogiBLOX RAM Module

Figure 8-5 shows the use of a LogiBLOX-generated module.

Which is better, the HDL or the schematic? Which is faster to create, more portable,
and easier to understand? Which is prettier? Notice that the
compiler inferred buffers as required to implement the design. These buffers must be
instantiated by the designer in the schematic.

USING LOGIBLOX MODULE GENERATOR

How was the RAM module an example of increasing design speed? We didn’t invent
a RAM module from scratch; we used a design tool to help create it. In this case, we used
the Xilinx LogiBLOX tool to create it. Other modules can be created and parameterized;
these include:

Accumulators
Adders/Subtractors
Clock Dividers
Comparators
Constants
Counters
Data Registers
Decoders
Inputs/Outputs
Memories
Multiplexers
Pads
Shift Registers
Simple Gates
Tristate Buffers

Employing these types of schematic-like elements in a structural design can allow the
HDL designer to use hardware-specific configuration options. For example,
under the Tristate Buffer block definition, there are three options for the pull-up resistor:
none, pull-up, and double pull-up. Options like double pull-up are not directly supported by
Verilog but may be required for some design implementations. The use of a structural
design, where HDL is employed in a structural (schematic) fashion, or where Verilog
modules are stitched together with schematics, may be required in some cases. Use
whatever works!
So, one way to increase the number of gates we design in a day is to use a tool that
automates the creation of certain types of modules.

Another Module Generator, the CORE Generator Tool

Xilinx (with the MEMEC company) provides a core generator with a wider variety of more
complex modules compared to LogiBLOX. Examples of the functions include:

• FPGA Development Tools (DSP and FPGA development platforms for evaluation
and benchmarking of DSP and FPGA designs).
• Processor peripherals, including C2910A Bit Slice Processor core, DRAM controller,
M8237 DMA Controller, M8254 Programmable Timer, M8255 Programmable
Peripheral Interface, M8259 Programmable Interrupt Controller, XF8256
Multifunction Microprocessor Support Controller, and XF8279 Programmable
Keyboard Display Interface.
• Processor products, including Intellicore™ Prototyping System, RISC CPU Core
Demo, Scalable Development Platform, TX400 Series RISC CPU cores, and V8
uRISC Microprocessor.
• UARTs, including M16450, M16550A, and XF8250.
• Communication and Networking Cores, including ATM Cell Assembler, ATM Cell
Delineation, ATM CRC10 Generator and Verifier, ATM CRC32 Generator and
Verifier, ATM Utopia Slave (CC-141), Forward Error Correction Reed-Solomon
Decoder/Encoder and Viterbi Decoder, Telecommunications HDLC Protocol Core,
and Telecommunications MT1FT1 Framer.
• XF9128 Video Terminal Logic Controller.
• Standard Bus Interface Cores, including IEEE 1394 Firewire Link Layer Core,
Firewire SuperLINK core evaluation board, 2 Wire Serial Interface, PCMCIA cores,
and USB cores.
• And others.
These cores are a mix of Xilinx-supplied modules (which are available for free) and
third-party-supplied modules (which must be licensed). An example of the use of these
cores is a design called sincos8 based on the 8X8 Sin/Cos LUT core model. The CORE
Generator tool created a file called sincos8.vei, a Verilog interface file. This file defines the
ports used by the core and provides an example of the module instantiation for use by the
designer. The sincos8.vei file looks like Listing 8-3.

Listing 8-3 sincos8.vei File

module sincos8 (
ctrl,
theta,
c,
dout);

input ctrl;
input [7:0] theta;
input c;
output [7:0] dout;
endmodule

// The following is an example of an instantiation:

sincos8 YourInstanceName (
.ctrl(ctrl),
.theta(theta),
.c(c),
.dout(dout));

The CORE Generator also created a Verilog simulation file called sincos8.v. This
file is 5,287 lines of code. Assuming we could write 100 lines of debugged code a day, we
could write this module in a couple of months. It took the CORE Generator about five
seconds. Assuming the module meets the needs of the design, that’s not a bad leverage of
productivity. If it doesn’t work, we don’t have any source code, so the core doesn’t help.
The file we link into our design is the compiled EDIF netlist called sincos8.edf. Out of
curiosity, let’s implement this design in a 4000XL device and see what it looks like. There
are always some tricks to making the tools work; in this case, we need the bus delimiter to
be parentheses B() (use B<> as delimiters for the XNF netlist format) in order for the Xilinx
Design Manager to suck in the EDIF file properly. This is done by unchecking the Verilog
Instantiation Template and Verilog Behavioral Simulation Model boxes (even though we
want these files to be generated) in order to get the Netlist Bus Format we want as shown in
Figure 8-6. In addition, in the Exemplar Leonardo EDIF output tab, we want to deselect the
Allow Writing Busses checkbox, because Xilinx does not process EDIF busses properly.
Note that the Verilog compiler doesn’t “know” anything about the black box (which
is inserted during the downstream mapping process, as illustrated in Figure 8-6). Any
estimate by the synthesis tool regarding speed and design size will not include the black-box
modules.

Figure 8-6 CORE Generator Options for Netlist Bus Format

Without really trying, this design runs at 61 MHz in the slowest 4005XL (-3) device.
Listing 8-4 is a report on the resources used by this design.

Listing 8-4 sincos8 Design Example Resource Utilization

Loading device database for application par from file “map.ncd”.

sincos8_example is NCD, device xc4005xl, package pc84, speed -3
Loading device from file ‘4005xl.nph’ in environment C:/Xilinx.
Device speed data version: x1_0.37 1.22 FINAL.

Device utilization summary:

Number of External IOBs 18 out of 61 29%

Flops: 0
Latches: 0

Number of CLBs: 25 out of 196 12%

Total Latches: 0 out of 392 0%
Total CLB Flops: 25 out of 392 6%
4 input LUTs: 43 out of 392 10%
3 input LUTs: 4 out of 196 2%

Number of TBUFs: 28 out of 448 6%

Another example is an eight-wide and 16-deep FIFO called fifo8x16. The Verilog
simulation file is 293 lines of code. Listing 8-5 shows the interface file.

Listing 8-5 fifo8x16.vei File

module fifo8x16 (
d,
we,
re,
reset,
c,
full,
empty,
bufctr_ce,
bufctr_updn,
q);

input [7:0] d;
input we, re, reset, c;
output full;
output empty;
output bufctr_ce;
output bufctr_updn;
output [7:0] q;
endmodule

// The following is an example of an instantiation:

fifo8x16 YourInstanceName (
.d(d),
.we(we),
.re(re),
.reset(reset),
.c(c),
.full(full),
.empty(empty),
.bufctr_ce(bufctr_ce),
.bufctr_updn(bufctr_updn),
.q(q));

We can see that using a CORE generator is an effective way to create complex
modules and increase design efficiency. This is a hit-or-miss process, because, if a module
does not meet the needs of the design after trying all available compiler options, without
source code there is no way to make modifications, so another approach will be required
(like taking the time to design an optimized module from scratch).

DESIGN REUSE, REUSING YOUR OWN CODE

As you get some designs under your belt, you will find that you’ll reuse some of your
design approaches and even certain modules. If you designed it, then you certainly
understand its features and limitations and can make an almost instinctive judgment whether
to reuse something or write it from scratch. What can you do to help make a design suitable
for later use in other designs?

Designing Your Code for Reuse

• Document your work.

Use a header that describes the design from a top level. The header should describe
input and output requirements and any tricks or quirks that are embedded in the design. Put
in lots of comments, not just describing each line of code, but explaining the overall intent
and strategy for problem solving.

• Use a version-control database product like SourceSafe or VCS.

Having a good database of old designs and following the discipline of using detailed
comments for revisions can enhance the reusability of Verilog code. These products allow
revisions to be “undone” and assure that a working version can be recovered. This feature
alone can help you maintain your sanity when crunch time comes. These tools are even
more critical when working with other designers.

• Partition logic into small modules.

Smaller modules, with dedicated specific functions, are more likely to be reusable
than large, complex, and specialized modules.

• Use synchronous design techniques.

A synchronous design is more reliable and portable. For logic that must be
asynchronous, put it in separate and well-documented modules; don’t mix it in with
the synchronous areas of your code.

• Take a typing class.

You should either be infinitely patient or a good typist to create long and descriptive
labels. If you’re a slow typist, you’ll never use easy-to-read and informative labels like
video_output_enable_active_low. Your code will benefit from liberal use of real English.
Try to use fewer acronyms and abbreviations.

• Don’t do odd things with the clock.

For example, try not to use gated clocks or both edges of a clock.

• Don’t use magic numbers.

Magic numbers are constants embedded in the code like so:

(test_pattern == 4'he)   // Example of magic number (4'he).

and they should generally be parameters so they can be changed at a top level or in an
include file.

• Minimize ports.

Module partitions should be selected to minimize interconnects between modules,
particularly where clock domains are crossed. Like an orange, designs have natural
boundaries for isolation and cohesion. Use these natural boundaries to partition the design.
Split up complex modules into smaller and simpler parts.

• Don’t fix something unless it’s really broken.

If you see something in a working module that you don’t like, leave it alone. The
possibility of inadvertently breaking something is so great that you must be absolutely sure
there really is a problem before changing anything. Pretty code is not necessarily better
code.

• Get some help with sticky problems.

If you have a problem and you’re not sure what to do, or you’re trying to select
between competing options, talk to your peers about it. Get input from the newsgroups,
Field Applications Engineers, or your neighbors, anywhere you can find it. Even if your
cohorts have bad ideas, they may lead you to think about the problem in a different manner
and may inspire a better approach. If you really can’t find help among your coworkers, then
find a better bunch of people to work with.

• Archive everything.

Keep scripts, makefiles, design notes, libraries, old versions, and all software used to
compile and implement a design.
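
Several of the guidelines above can be seen together in one small module. The following is a sketch only: the module pattern_detect, its ports, and its parameters are hypothetical and exist just to show a descriptive header, a parameter in place of a magic number, long labels, and a clock enable in place of a gated clock.

```verilog
// -----------------------------------------------------------------
// Module:  pattern_detect  (hypothetical example)
// Purpose: Sets pattern_found when test_pattern equals MATCH_PATTERN.
// Inputs:  clock         - single clock, rising edge only
//          reset         - synchronous, active high
//          sample_enable - clock enable, active high
//          test_pattern  - value to compare
// Outputs: pattern_found - registered compare result
// Quirks:  pattern_found holds its last value while sample_enable is low.
// -----------------------------------------------------------------
module pattern_detect (pattern_found, test_pattern, sample_enable,
                       reset, clock);
  parameter PATTERN_WIDTH = 4;
  parameter MATCH_PATTERN = 4'he;   // Named constant instead of a magic number.

  input                      clock, reset, sample_enable;
  input  [PATTERN_WIDTH-1:0] test_pattern;
  output                     pattern_found;

  reg pattern_found;

  // One clock, one edge. sample_enable acts as a clock enable on the
  // flipflop; there is no gate in the clock path.
  always @(posedge clock)
  begin
    if (reset)
      pattern_found <= 1'b0;
    else if (sample_enable)
      pattern_found <= (test_pattern == MATCH_PATTERN);
  end
endmodule
```

A later design that needs a different pattern can override the parameters at instantiation, for example pattern_detect #(4, 4'h5) detect_five ( ... ), without touching the module source.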

BUYING IP DESIGNS

What does an IP block look like? From a user’s perspective, the interface must be defined,
including clock signals [polarity (uses of rising or falling edges or both), maximum and
minimum frequencies, duty cycle, and loading], reset/preset signals (polarity, synchronous
or asynchronous, required duration, and loading), and requirements for other ports.
The biggest issues with purchasing IP for an FPGA (I call this Revenue IP; some call
it Silicon IP) are not technical but are related to negotiating a license. How much should the
up-front payment be? How much for recurring payments (royalties)? What is the cost model
for unexpected usage when product volumes are higher or lower than expected? How can
usage be audited? How can the IP provider protect its investment and still provide enough
data to assure a successful implementation? These questions all must be addressed to make IP
viable for a design.
Will Revenue IP (RIP) ever be a significant part of the FPGA designer’s life? As
hardware designers, we are comfortable using hardware IP in the form of integrated circuits
out of necessity. We are not ASIC designers, so we do not have to use expensive design
tools and we don’t have direct access to the foundries. We could create a functionally
equivalent design, but it would take more board space, take longer to design, and cost more
to implement. Two of these three drawbacks are not present when our design ends up in an
FPGA. We can argue with management that the remaining drawback, the time to do the
design, will be balanced by avoiding up-front costs or royalties. So, RIP is the right acronym for
Revenue IP. We will use all the free (vendor-provided) and cheap (vendor-subsidized) IP
we can get our hands on, but we will resist paying for other types of IP. Still, we need to
take a look at the current IP strategies. For FPGAs, they come in two flavors: firm/hard and
soft.

Hard or Firm IP

In the ASIC business, Hard IP is like a Standard Cell, a core that is predesigned and
characterized for a specific foundry process. This option does not really exist for FPGAs;
the closest we get is with Hard/Firm IP—prerouted and preplaced modules that can be
linked with our other modules. Hard/Firm IP is most like using an integrated circuit in our
design. It is a black box from a user’s point of view. The user can’t change it; it can only be
plunked into a design and used as-is, with all other circuitry and routing forced around and
through it. These modules are provided with a behavioral model that allows the design to be
evaluated and tested. From a vendor’s point of view, this is the safest IP, as it is very
difficult to reverse-engineer or to modify and present as a new design. From a user’s point
of view, hard IP is not very friendly; it is a one-size-fits-all solution and allows no flexibility
other than built-in configuration options. It’s not portable to new processes or technology
without recompiling by the IP vendor.

Soft IP

From a user’s point of view, having source code that can be tweaked, hacked, and
synthesized is much more desirable. But how then, can the IP vendor be assured of being
paid? Get all the money up front? That’s not very likely. If a design ends up being 47% IP
and 53% hacked by the user, what’s the right compensation? What keeps the designer from
creating timing problems when the design (which can be very complex) is modified? Who
is responsible in this situation?
To protect the IP vendor, Soft IP might be encrypted or obfuscated (comments
removed, informative labels replaced with truncated and useless ones, and the code
compressed to be unreadable) so that it can be synthesized and integrated into other parts of
the design but not easily reverse-engineered.

SUMMING UP

The most common form of design reuse for FPGA designers will be reusing your own
modules. We have covered some ways to design your code that will improve reusability.
The next most common reuse method will be using modules designed by other engineers at
your company; this avoids extra cost and legal issues. The most common tools for
increasing productivity will be vendor-supplied libraries and core generation tools. Third-
party IP will be a small part of the FPGA designer’s reuse strategy.
CHAPTER 9

Designing for ASIC Conversion

There are some advantages to converting an FPGA design to an ASIC, including
merging multiple FPGAs into one ASIC and creating a
device that consumes less power and operates at higher speeds. However, the main
advantage is reducing cost. The cost of an ASIC (even with non-recurring charges factored
in) can be less than a third of that of an FPGA. What drives the decision to convert your
FPGA design to an ASIC? Consider conversion if:

x The yearly usage is greater than 1,000.
x The design is unlikely to require modification.
x Additional protection from reverse engineering is desirable.
x Improved speed or reduced power consumption (compared to an FPGA) is
necessary.

For ease of conversion and lower up-front costs, there are three options for converting
an FPGA to a custom device: a hard-wired FPGA, an FPGA conversion using laser-
programmed or custom-routed devices, and a full ASIC design. Using Verilog as a design
and simulation tool greatly eases conversion to an ASIC, because all ASIC
companies use and are comfortable with Verilog.
FPGAs are not very good ASIC prototyping devices, but they get more ASIC-like
every year. Increasingly, designs will remain implemented in FPGAs because of their
increasing densities and future cost reductions. Still, many of our designs will convert to
ASICs. While FPGAs are getting cheaper and denser, ASIC technology improves, too.

Why Is an FPGA a Poor ASIC Prototype?

x The FPGA vendor has designed-in “training wheels” which improve the
chances of success for a designer using poor design methodology. Particularly,
the clock network has delay designed in to create a zero-hold-time requirement
for flipflops. The FPGA designer concentrates on meeting the setup-time
requirement; the ASIC designer must meet both the setup- and hold-time
requirements.
x The FPGA provides low-skew global networks for clock and reset/preset; these
networks must be created in the ASIC design.
x The experimental FPGA design mindset (Unsure about something? Try it and
see what happens) is dead wrong for designing ASICs. There is a huge cost to
making an error in an ASIC in terms of foundry charges and leadtime. This
requires a careful (some might say anal-retentive), cautious, and conservative
design approach with extensive testing.
x It can be difficult to cram logic into an FPGA, then make it run at high-speed.
The ASIC will have only the resources demanded by the design (routing and
logic resources in the FPGA are present whether they are used or not); thus it
will be smaller, use less power, and operate at higher speed. Therefore, a lot of
wasted time may be spent optimizing a design to run in an FPGA.
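
The setup- and hold-time window mentioned in the first item above can be made visible in simulation by adding timing checks to a flipflop model. The following is a minimal sketch, not a vendor cell: the module name dff_checked and both timing limits are assumptions chosen only for illustration.

```verilog
`timescale 1ns / 1ps

// Hypothetical flipflop model; the module name and the timing limits
// are assumptions for illustration, not values from any vendor library.
module dff_checked (q, d, clk);
  output q;
  input  d, clk;
  reg    q;

  always @(posedge clk)
    q <= d;

  specify
    specparam t_setup = 2.0;   // assumed setup limit, nsec
    specparam t_hold  = 0.5;   // assumed hold limit, nsec

    // d must be stable for t_setup before the rising clock edge ...
    $setup(d, posedge clk, t_setup);
    // ... and must stay stable for t_hold after it.
    $hold(posedge clk, d, t_hold);
  endspecify
endmodule
```

A simulator that honors specify blocks should report a violation whenever d changes inside the window. In the FPGA the hold limit is effectively zero; in the ASIC it is not, so the second check is the one to watch.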

In spite of these caveats, successful FPGA-to-ASIC conversions are done every day.
Using some common-sense design strategies will make the process go smoothly. First, let’s
look at the technologies into which the FPGA might be converted.

ALTERA HARDCOPY DEVICES

Altera HardCopy offers custom hard-wired versions where the Logic Elements are the same
as a regular FPGA (though packed closer together), but the routing is replaced with custom
metal runs. The design change is minimal (the device uses the same placement and signal
routing as the FPGA), and the time span for conversion can be as low as a month or so.
Minimum order quantities can be as low as a few hundred pieces. Packages and pinouts,
including power and ground, can be identical to the original FPGA. Configuration signal
emulation can be used. For example, the configuration CONF_DONE pin might be used to
control a processor reset signal on the circuit board. Though the HardCopy device does not
require configuration, having the configuration pins act the same might be required. The
HardCopy devices are built on the same fab lines as the FPGA, so the process technology
(lithography), the pin drive capability, the pin voltage tolerance, and the CLB layout are the
same.
Because the HardCopy silicon is so similar to the FPGA, the HardCopy design can be
captured just from the configuration file. Still, the conversion engineers will request source
design information, which can be informative during the conversion process.
One drawback to the HardCopy device is encountered during production testing. The
configurable devices can be programmed with a test pattern and checked out; the HardCopy
devices must have special test support designed-in (added).

Conversion Issues

Conversion to HardWire technology is the least demanding conversion for the FPGA
designer. Xilinx guarantees that the HardWire design will act the same as the FPGA device.
Still, an FPGA can mask race conditions that can create glitches, because signal routing
transistors with capacitive loading act as a low-pass (RC) filter; this effect will be much
reduced in the HardWire device. Race-condition glitches caused by asynchronous signals,
which are “filtered out” in the FPGA design, can be uncovered. Asynchronous signals will
be flagged by Xilinx during the conversion process, but it’s up to the designer to take
responsibility to ensure no hazards exist.

SEMICUSTOM DEVICES

Various technologies exist for arrays where logic is placed on a die, then custom routing is
created with laser programming, where routing segments are removed. Chip Express is a
company that offers fast prototypes with laser-programmed routing (LPGA, or Laser
Personalized Gate Arrays), which can be converted later to devices with one or two metal
routing layers. Clear Logic also offers these types of devices (LPLD, or Laser-Processed
Logic Device). The trade-offs and design considerations are very similar to HardWire
conversion issues.

Semicustom ASIC Conversion

Vendors like AMI (American Microsystems) and Orbit offer FPGA conversions to their
Gate Array designs. These processes offer short lead times (4 to 6 weeks) and low NRE
charges ($5K to $50K). These companies have a lot of experience with doing FPGA
conversions and can smooth the conversion process considerably.

Full Custom ASIC Conversion

In an FPGA design, what the designer can do is limited because the FPGA has a predefined
structure in which the design is implemented. An ASIC is more freeform. There are no
training wheels to keep the designer out of trouble. All the features we take for granted, like
programmable buffers, termination resistors, built-in oscillator buffers, and power-on
reset/preset, are not present unless we specify them in the ASIC. The ASIC has some
advantages because the routing is fully customized and only gates that are actually used get
placed. Also, much greater densities are offered, so designs that live in multiple FPGAs can
be combined into one.

List of Conversion Requirements

The designer must provide information to feed the conversion process. This information
will be present on a checklist provided by the ASIC vendor and will include items like:

x The design netlist.
x Test fixtures and simulation results. The vendor will be comfortable with
Verilog test fixtures; the more of these provided, the lower the risk of problems
during conversion.
x Package, number of pins, pin format, and pin pitch.
x A list of clocks and clock frequencies.
x Gate-count estimate.
x Temperature range and special environmental requirements (like military specs,
etc.).
x Pin list: pin names and pin locations. This includes power, ground,
configuration, and unused pins.
x Special features, like pull-up or pull-down resistors, critical timing paths, pin
driver requirements, RAM and ROM, FIFOs, and other special logic modules.

DESIGN RULES FOR ASIC CONVERSION

Conversion to an ASIC process can be stressful; there are hungry gators swimming in those
waters! Some hazards to watch for include delay networks, race conditions, combinational
feedback, pulse generators, floating internal busses, clock skew, and gated or divided
clocks.
Most vendors offer a “turn-key” conversion process. In this design flow, the ASIC
vendor takes complete responsibility for the conversion and provides all test vectors. This
takes longer and is more expensive than a “joint-design” conversion, where the FPGA
designer provides all or part of the test vectors and takes responsibility for the conversion.

Figure 9-1 Watch for those alligators!

AMI (American Microsystems, Inc.) offers a no-vector conversion; this is the most
painless conversion for the FPGA designer who hates simulation. However, the designer
must obey all of the following rules, which is nearly impossible to do:

x Altera, Xilinx, and Actel devices only.
x Single external master clock.
x No combinational feedback loops.
x No delay dependencies or pulse generators.
x Single external master set/reset signal.
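
One common way to stay within the single-master-clock rule is to replace a divided or gated clock with a clock enable. The sketch below assumes a divide-by-4 rate; the module name, port names, and bus width are hypothetical.

```verilog
// Hypothetical example: a divide-by-4 data rate implemented with a
// clock enable, so every flipflop still runs from the one master clock.
module enable_divider (clk, reset, data_in, data_out);
  input        clk, reset;
  input  [7:0] data_in;
  output [7:0] data_out;
  reg    [7:0] data_out;
  reg    [1:0] count;
  wire         enable;

  // One clk-wide pulse every fourth clock.
  assign enable = (count == 2'b11);

  // Free-running 2-bit counter with synchronous reset.
  always @(posedge clk)
    if (reset)
      count <= 2'b00;
    else
      count <= count + 2'b01;

  // The register updates only when enabled; its clock is never gated.
  always @(posedge clk)
    if (reset)
      data_out <= 8'h00;
    else if (enable)
      data_out <= data_in;
endmodule
```

Because no derived clock exists, there is no second clock network for the ASIC vendor to balance and no clock-domain crossing to analyze.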

SYNCHRONOUS DESIGN RULES

The first rule is to do a synchronous design. This is not always an easy rule to follow, but
each clock added to a design should be carefully considered. Every clock domain, every
signal that crosses a clock-domain boundary, and every asynchronous signal is a hazard
unless dealt with exhaustively. If the purpose of a design is to convert from one clock
domain to another (like a FIFO does), then, obviously, you have no choice. If you need to
save power, but some of the design needs to run at high speed, then again you have no
choice. My personal preference is to run a design at the lowest possible speed, because this
reduces RFI emissions and makes the design more tolerant of the inefficiencies of a generic
HDL implementation. If a section of the design must be asynchronous, put it in quarantine.
Keep it segregated from the synchronous parts of the design and document it well so that
the design intent and hazards are clear.
It’s not hard to handle asynchronous signals, but it is easy to forget to do this
handling. The result is a design that works, but does not work reliably.
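
The handling in question is usually a pair of flipflops in the receiving clock domain. This is a minimal sketch for a single-bit signal; the module and signal names are assumptions, and multi-bit busses need a different scheme (such as a FIFO or a handshake).

```verilog
// Hypothetical two-flipflop synchronizer for one asynchronous input.
module sync_2ff (clk, async_in, sync_out);
  input  clk, async_in;
  output sync_out;
  reg    meta, sync_out;

  always @(posedge clk) begin
    meta     <= async_in;  // first stage may go metastable
    sync_out <= meta;      // second stage gives it a clock period to settle
  end
endmodule
```

Only sync_out should be used by the rest of the design; tapping meta defeats the purpose.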

SYNCHRONOUS DESIGNS
Synchronous designs do not contain gated clocks or multiplexed clocks. The number of clocks
can be counted on one hand, preferably with four fingers left over.
John McGibbon
Memec Design Services

ASIC conversion vendors sometimes offer “vectorless” conversions if the design is
100% synchronous. This will reduce the span time for conversion and reduce cost. It also
reduces stress on the FPGA designer because the test burden is removed. The ASIC vendor
likes this, because the design is relatively trouble-free and the test vectors can be
automatically generated. For the FPGA designer, creating a 100% synchronous design is
almost impossible but is a very worthwhile goal.

Use Generic Logic Constructs

ASIC vendors who do FPGA conversions routinely replace RAM and other modules (like
adders and counters) with parameterized modules selected from their library. However, each
substitution contributes a new block to the design, and each change adds to the risk that
something will go wrong. LogiBLOX or cores should be evaluated for ease of conversion or
substitution before being used in the FPGA design. The netlist should be “untouched” as
much as possible during the conversion process. If your design has nothing but NAND
gates in it, it will convert painlessly, because no module substitution will occur.
In a gate array, your RAM and ROM modules will be replaced with registers. This
results in an explosion of the gate count. For a 512-by-8 RAM module, 4,096 flipflops will
be instantiated. The cell decoding logic adds to that number (decoding-logic complexity
doubles for each added address line).
One feature that is particularly troublesome during ASIC conversion is RAM
initialization (or ROM contents, the same thing). The FPGA can write data into RAM cells
during the power-up configuration process. There is no corresponding magic configuration
mode in the ASIC; all RAM cells must be written to via the RAM data bus.
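
One way to avoid depending on configuration-time initialization is to let the design write its own initial contents through the normal RAM port after reset. The sketch below assumes a small 16-by-8 RAM cleared to zero; the module name, port names, size, and initial value are all hypothetical.

```verilog
// Hypothetical 16-by-8 RAM that is cleared through its normal write
// port after reset, instead of relying on FPGA configuration.
module ram_init (clk, reset, we, addr, din, dout, init_done);
  input        clk, reset, we;
  input  [3:0] addr;
  input  [7:0] din;
  output [7:0] dout;
  output       init_done;

  reg  [7:0] mem [0:15];
  reg  [4:0] init_count;          // MSB set = initialization finished
  wire       init_done;
  wire       ram_we;
  wire [3:0] ram_addr;
  wire [7:0] ram_din;

  // Until initialization is done, the counter owns the RAM port.
  assign init_done = init_count[4];
  assign ram_we    = init_done ? we   : 1'b1;
  assign ram_addr  = init_done ? addr : init_count[3:0];
  assign ram_din   = init_done ? din  : 8'h00;  // assumed initial contents
  assign dout      = mem[ram_addr];

  // Walk through all 16 addresses once after reset, then stop.
  always @(posedge clk)
    if (reset)
      init_count <= 5'b00000;
    else if (!init_done)
      init_count <= init_count + 5'b00001;

  always @(posedge clk)
    if (!reset && ram_we)
      mem[ram_addr] <= ram_din;
endmodule
```

The user logic must wait for init_done before using the RAM. The same FPGA and ASIC netlist then behaves identically after power-up.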

Power-On Conditions

Part of the training wheels for an FPGA design is the power-up initialization of all I/O pins
via the use of GSR (Global Set/Reset) resources and the device configuration process. An
ASIC will not have these features unless the designer specifically puts them in. The ASIC
vendor tends not to want to use many global networks (like reset and/or preset networks)
because they must be custom routed and they consume routing channels.
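
If the design carries its own reset instead of leaning on GSR, the reset becomes part of the netlist and converts with it. A minimal sketch of one common approach, a reset that asserts asynchronously and releases synchronously, is shown here; the module and signal names are assumptions.

```verilog
// Hypothetical reset conditioner: reset_n asserts as soon as the pin
// goes low and is released only on a clock edge, two clocks later.
module reset_sync (clk, reset_pin_n, reset_n);
  input  clk, reset_pin_n;
  output reset_n;
  reg    r1, reset_n;

  always @(posedge clk or negedge reset_pin_n)
    if (!reset_pin_n) begin
      r1      <= 1'b0;
      reset_n <= 1'b0;
    end else begin
      r1      <= 1'b1;
      reset_n <= r1;
    end
endmodule
```

Every register that needs a known power-up state would then name reset_n explicitly in its always block, so nothing depends on the FPGA configuration process.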

Internal Busses

Xilinx and other FPGA vendors allow the use of internal tristate busses and buskeepers to
prevent problems due to floating buffer inputs. Some ASIC vendors do not have this
capability or prefer not to use the technology because it complicates testing. Exemplar
Leonardo has a feature where internal tristate busses can be automatically converted to
MUXes for technologies that don’t support internal tristate busses.
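
The conversion from a tristate bus to a MUX looks roughly like the pair of modules below. This is an illustrative sketch with hypothetical names, not the output of any particular synthesis tool.

```verilog
// Internal tristate bus: two drivers share one net.
module bus_tristate (sel_a, sel_b, a, b, bus);
  input        sel_a, sel_b;
  input  [7:0] a, b;
  output [7:0] bus;

  assign bus = sel_a ? a : 8'bz;
  assign bus = sel_b ? b : 8'bz;
endmodule

// MUX form of the same bus. The two forms agree only when exactly one
// select is active: with no select the bus is driven to an assumed
// default of zero instead of floating, and if both selects are active
// source a wins instead of the drivers fighting.
module bus_mux (sel_a, sel_b, a, b, bus);
  input        sel_a, sel_b;
  input  [7:0] a, b;
  output [7:0] bus;

  assign bus = sel_a ? a : (sel_b ? b : 8'h00);
endmodule
```

Writing the MUX form in the FPGA design to begin with removes one substitution from the conversion.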

Configuration Pins

Often the FPGA configuration pins are used in external logic (for example, using the
configuration CONF_DONE pin in a Power-On Reset logic). Special logic will have to be
designed into the ASIC to provide configuration-pin emulation. The FPGA designer needs
to define which signals are used and how the pins are expected to act. For common FPGA
signals and architectures, the ASIC vendor will have some experience with these signals
and will be able to assist.

Pin I/O Buffers

The input signal thresholds must be defined by the FPGA designer. Input threshold options
include TTL (where the threshold voltage is about 30% of the supply rail), CMOS (where
the voltage threshold is about 50% of the supply rail), or custom.

The output pin drive requirement must also be defined by the FPGA designer. The use
of low-impedance (high-current) buffers should be minimized to reduce power consumption
and RFI noise generated by the design. The ASIC process probably has more options for
drive capacity than the FPGA. Always use the slowest and lowest-power pin buffer that will
do the job.

OSCILLATORS

Oscillators are analog circuits, but sometimes oscillator buffers are available in FPGA
technology. These are inverting buffers with low gain to help assure that the oscillator stays
in the linear mode. The inverter provides 180 degrees of phase lag, and the RC (cheapest,
sloppiest), ceramic resonator (cheap, but not too sloppy), or crystal (best performance, but
more expensive) network provides the remaining 180 degrees of phase lag to meet the requirement
for oscillation (a closed loop with 360 degrees of phase shift and an overall gain of one). A
typical gate oscillator is shown in Figure 9-2.

Figure 9-2 Typical Gate Oscillator Circuit

These circuits will need to be identified to the ASIC vendor to assure a compatible
conversion. It’s likely that the oscillator will end up being gated or multiplexed (this is
much different than having a gated clock as part of the normal operating mode) so that the
test equipment can drive the clock output with a clock of known frequency and phase. This
circuitry will be added as part of the ASIC design process and probably will not be part of
the FPGA design.
Never strap an oscillator enable pin high or low; put a resistor in so that an external
source can enable or disable the oscillator as shown in Figure 9-3.

Figure 9-3 ASIC Oscillator Disable Circuit

For best performance, clock circuits should be isolated from other noise sources by
physical distance or by guard rings, and the wiring should be kept tight to reduce loop areas.
Note: the oscillator inverter is run in the linear mode, and the output should approximate a
sinewave as much as possible to reduce EMI.

DELAY LINES

The FPGA designer sometimes uses a delay line to create time-delayed signals, particularly
when interfacing with external SRAM or DRAM components. This delay is another analog
circuit, so use caution! The delay line might be a string of buffers. This method of creating a
delay is not recommended, because it depends on typical buffer delays which are not
controlled and which change with temperature and process/technology changes.
During ASIC conversion, delay-line buffers will be replaced with buffers with
different propagation delays (usually shorter, because ASIC buffers are typically faster than
FPGA buffers) or will be completely removed because they represent redundant logic from
a digital point of view. These delays must be documented and verified to ensure they get
implemented properly.
An option might be to use an external circuit to create the delay as shown in
Figure 9-4. This circuit might be an RC delay with buffers to minimize the effect of changing the pin
driver and pin loading during ASIC conversion. This delay is not precise and depends on
the propagation delay of the pin drivers, the buffer propagation delays, the buffer threshold
voltages, the tolerances of the RC components, operating temperature, and the ether-flux of
the moon’s gravitational field.

Figure 9-4 Typical External Buffer RC Delay Circuit

Even better, a delay can be created from a serpentine circuit board trace with about
175 picoseconds of delay per inch as shown in Figure 9-5. Remember to include the buffer
delays, the pad delays, and the circuit-board trace delays. There are many assumptions in this
delay, and your mileage will vary. The reader is urged to read Johnson and Graham’s
High-Speed Digital Design: A Handbook of Black Magic (see bibliography) before
implementing a circuit like this.
Assumptions include the use of FR-4 circuit-board material, 20-mil traces, 0.6 inches
per segment, 50-mil segment pitch, and that you have a valid exemption from Murphy’s
Law.

Figure 9-5 Typical External Trace Delay Line Circuit

Even better yet, think about spending some money and using a digital delay circuit
like those available from Dallas Semiconductor and others.

THE LANGUAGE OF TEST

We’re not going to cover test topics in depth, but we can learn a few buzzwords.

x At-speed testing. Testing performed at the actual operating speed of the design.
Most testing is performed at slower clock speeds that are comfortable for the
test equipment. These frequencies might be on the order of 1 to 5 MHz.
x Boundary scan. A test scheme where MUXes and latches are added to the
design to support shifting serial data in and out. This allows test patterns to be
applied and internal logic states to be read out.
x BIST. Built-In Self-Test, where hardware is added to the design to allow it to
test itself.
x Fault grading. A measure of how well the design hardware is tested. It is
the ratio of the number of test vectors and the fault coverage.
x Functional test. Testing a device by applying user-provided test inputs and
checking outputs. These tests are generally not very thorough. These are not
parametric tests for AC performance.
x IDDQ. Tests of power-supply current when all internal nodes are quiet. The only
inputs are terminations to prevent oscillation and to keep gates from going
linear. This is a quick test to reject devices that were manufactured improperly.
x JTAG. Joint Test Action Group, which created the IEEE 1149.1 boundary scan
register and test access port (TAP) standard.
x Observability. The ability for test equipment to access an internal node. All
output pins are observable.
x Parametric testing. Testing for gate input thresholds and output drive
capability. These are analog tests which verify the ASIC processes.
x Partial scan. A scan test that covers only selected parts of the design.
x Test coverage. A figure of merit for a test suite; it’s the ratio of all detected
faults to the total number of possible faults.
x Stuck-at faults. A failure caused by a node staying in a zero or one state when
it should be driven to a different state.

Boundary Scan

Because SRAM-based FPGA devices can be reprogrammed, the FPGA manufacturer can
load a test configuration and do a thorough production test. Custom devices, like your
ASIC, must have test support designed in. A common method of providing test support is to
insert boundary scan logic (BST), which creates a serial chain that runs near the outside of
the device under test. This chain can include other devices. The serial chain can be four or
five signals (TDI, Test Data Input; TDO, Test Data Output; TCK, Test Clock; TMS, Test
Mode Select; and an optional Test Reset, TRSTn). Inside the ASIC, MUXes are inserted
which allow selected signals to be connected to a long shift register; this allows signals to
be shifted in and out of the device being tested.

Figure 9-6 Boundary Scan Hardware Overhead

BST adds hardware to the ASIC as shown in Figure 9-6; the added hardware increases
the ASIC design by 15 to 25%. It also adds delays to signal paths on the order of 1-2 nsec
for each BST MUX. Insertion of the BST hardware and generation of scan vectors are
automated processes. Note that the device signal always flows through a MUX. This
architecture allows the serial chain to read device signals, or to shift (pass-through) other
test signals in the chain, or to pass test signals into the signal chain.
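
A single boundary scan cell can be sketched in a few lines of Verilog. The version below follows the usual capture, shift, and update arrangement; the module name and the control signal names (shift_dr, clock_dr, update_dr, mode) are assumptions modeled on common descriptions of IEEE 1149.1 cells, and real cells are inserted by the vendor's tools rather than written by hand.

```verilog
// Hypothetical boundary scan cell for one device signal.
module bscan_cell (data_in, scan_in, shift_dr, clock_dr, update_dr, mode,
                   data_out, scan_out);
  input  data_in;    // normal signal from the pin or core
  input  scan_in;    // serial data from the previous cell in the chain
  input  shift_dr;   // 1 = shift the chain, 0 = capture data_in
  input  clock_dr;   // clock for the capture/shift stage
  input  update_dr;  // clock for the update stage
  input  mode;       // 1 = drive test data, 0 = normal operation
  output data_out;   // signal passed on to the core or pin
  output scan_out;   // serial data to the next cell in the chain
  reg    scan_out;
  reg    update_q;

  // Input MUX plus capture/shift flipflop.
  always @(posedge clock_dr)
    scan_out <= shift_dr ? scan_in : data_in;

  // Update stage holds the test value steady while the chain shifts.
  always @(posedge update_dr)
    update_q <= scan_out;

  // Output MUX: the device signal always flows through this MUX.
  assign data_out = mode ? update_q : data_in;
endmodule
```

The output MUX is the source of the added path delay mentioned above, and the two storage elements per pin account for much of the added hardware.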
There are other test methods. A complete discussion of them is beyond the scope of
this book, but we can at least list them and say a few words about them. Tests can be
divided into production tests (where the design is validated and process problems are tested
for), design conversion tests (ensuring that the design was converted properly; this is usually
done mostly with designer-supplied functional test vectors), and static timing tests (to assure
that the ASIC’s different gate delays and clock skews don’t cause problems).
IDDQ Test. This is a quick test for production problems; if the current drain of the
device is much higher than expected, then a manufacturing defect has probably occurred
and the device can be quickly rejected.
Functional Test. This type of test uses test vectors provided by the designer which
emulate typical operating modes and look for predicted outputs. This type of test is
generally not very thorough, because the designer doesn’t think of all the various
combinations of input modes and logic sequences.
ATPG (Automatic Test Pattern Generation). These test vectors can include serial
vectors (the ones that are clocked into the BST scan chain, if present) and parallel vectors
(the ones presented in parallel to the device inputs).

PRINT-ON-CHANGE TEST VECTORS

The ASIC vendor will request print-on-change (POC) test vectors; this is an ASCII-
formatted list of input sequences and expected output test patterns. Fortunately, it’s not
difficult to extract these vectors from the Verilog test fixture using the $display and
$monitor system tasks.

Listing 9-1 Simple POC Vector Example, OR Gate

     II O
     NN U
     12 T
TIME
0    00 0
50   01 0
53   01 1
100  00 1
103  00 0
150  10 0
153  10 1

From Listing 9-1, you can see that the delay through the gate is 3 nsec (the output
changes in the period between 50 and 53 nsec).
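
A test fixture along the following lines would produce vectors in the style of Listing 9-1. It is a sketch under stated assumptions: the fixture name, the 3-nsec gate delay, the stimulus times, and the print format are chosen to resemble the listing and are not taken from the original fixture.

```verilog
`timescale 1ns / 1ns

// Hypothetical print-on-change fixture for a 2-input OR gate.
module or_gate_tf;
  reg  in1, in2;
  wire out;

  // Device under test: OR gate with an assumed 3-nsec delay.
  or #3 u1 (out, in1, in2);

  initial begin
    $display("TIME IN1IN2 OUT");
    // $monitor prints one line each time any listed signal changes.
    $monitor("%0d %b%b %b", $time, in1, in2, out);
    in1 = 0; in2 = 0;
    #50 in2 = 1;
    #50 in2 = 0;
    #50 in1 = 1;
    #10 $finish;
  end
endmodule
```

Expect small differences from the listing; for example, the gate output is unknown (x) until the first 3-nsec delay has elapsed. Redirecting the simulator output to a file gives the ASCII vector list the vendor asks for.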
Afterword: A Look into the Future, Millions
and Millions and Millions of Gates

Press Release: Xilinx Inc, July 7, 2007, San Jose, California.
Xilinx announces the latest member of the XZ-200 family, the
XZ202XXL. The XZ202 supports 64 phase- and delay-locked clocks at speeds to 40 GHz
with a 0.35 V power supply. Integrated analog features include octant power control (power
can be switched on and off the device in eight sections), 10-bit A/D and D/A conversion
with programmable top and bottom voltages, Sample/Track-and-Hold, SVGA monitor
output, Delta-Modulation Converter, Voltage-to-Frequency Converter, PWM, high-speed
data capture, switched-capacitor power supplies (to support RS-232 and other high-voltage
I/O, and CCD/CMOS imagers), and operational/instrumentation amplifiers. Self- and
on-the-fly reconfiguration is supported on an octant-by-octant basis. The XZ202 includes
programmable thresholds for single-ended and differential I/Os, and up to 3500 I/O pins.
64 MB of high-speed DRAM is available on the core in 8-MB blocks. Each of the 32,768
CLBs can be configured as single-port or dual-port RAMs, each with a synchronous FIFO
mode.
Each device is shipped with a factory-assigned 128-bit IPv6 address. Test support
includes an integrated telephone modem, so that configuration read-out and test can be
performed remotely by the Xilinx support staff. Twenty-four-hour worldwide support is
provided from Xilinx support facilities in Silicon Valley and India. A free license for a
built-in oscilloscope and logic analyzer core is provided with each device. A JTAG emulator and
background debugger is included. A free Verilog design system (with Xilinx object-oriented
extensions) is available.
A wide variety of IP cores are available for inexpensive licensing, including:

x Microprocessor, RISC, and DSP cores, 4 to 64 bit, 10 MHz to 1 GHz.
x Direct digital conversion for 2.4 GHz, GSM, PCS, and other radio frequencies
and formats.
x JPEG, MJPEG, and MPEG encoding and decoding.
x Cable-ready and antenna-ready TV and AM/FM radio tuners.
x PGP and other encryption/decryption cores.
x DCT, FFT, wavelet, and other transforms.
x Games and educational cores, including Flight Simulator, Riven IV and Quake
2005.
Price and availability: $25.00 in 10,000 quantities, samples available in Q3 2006, with
production in Q1 2007.

Xilinx is an equal-opportunity employer, and all Xilinx devices are Y3K compliant.

Resources

For updates and errata for Real World FPGA Design with Verilog, surf over to
www.bytechservices.com

To report errors or to compliment/complain about something, email to

kcoffman@sos.net

The World Wide Web is an excellent tool for research. The Usenet newsgroups are an
excellent source of unfiltered information, opinions, and gossip.

Usenet Newsgroups
comp.lang.verilog
comp.lang.vhdl
comp.arch.fpga
comp.cad.synthesis

Verilog FAQ
http://www.faqs.org/faqs/verilog-faq/

FPGA and CPLD Manufacturers

www.actel.com
www.altera.com
www.latticesemi.com
www.xilinx.com

Software Suppliers
www.bluepc.com
www.cadence.com
www.exemplar.com
www.model.com
www.simucad.com
www.synopsys.com
www.synplicity.com
www.veritools-web.com

Glossary

AHDL Altera Hardware Description Language, a proprietary HDL.

algorithm A step-by-step method of solving a problem.

antifuse A connection link that turns into a low impedance when stressed.

ASIC Application-Specific Integrated Circuit, an integrated circuit
designed to perform a specific job, though the job might be generic (a
microprocessor is an ASIC, for example).

ATPG Automatic Test Pattern Generator.

asynchronous Logic that operates without a reference clock. 90% of the
problems the logic designer will face are related to asynchronous signal
timing.

autorouting A computerized method of determining signal or element
interconnection.

behavioral A procedural coding style that describes logic without a direct
link to the synthesized hardware. This is a more abstract form of logic
definition compared with structural gates and continuous assignment
statements.

bidirectional A port that acts as both input and output (inout). This port
will have output drivers connected to an input port. It is up to the
designer to assure that only one output driver is active at a given time.

binary A system with two states, either one or zero.

BIST Built-In Self Test.

bit A contraction for binary digit.

bitstream FPGA/CPLD configuration information that is formatted for
serial communication.

bitwise Describes an operation where a bit in one vector acts on or is
acted on by the corresponding bit in another vector.

blocking A blocking assignment will complete before later statements
get executed (i.e., statements that follow the blocking assignment are
postponed until the blocking assignment is complete). In a sequential
construct, the order of blocking assignments is significant and unwanted
latches can be inferred.

Boolean A system of symbolic logic based on the manipulation of
symbols and numbers.

buskeeper A low-current driver circuit that maintains a logic state on a
node when the bus is tristated.

buffer A signal driver used to isolate signals or provide power gain
for driving low impedance loads.

capacitance The measure of how a circuit or circuit element stores or
couples charges.

case A multi-input decision statement. The test cases are
prioritized; the earliest case that matches the input will be executed. The
case decision is either true or false. The input is tested for an exact
match to 0, 1, X, and Z conditions.

casex A case decision that treats Z and X conditions as “don’t care”
(X) conditions.

casez A case decision that treats Z conditions as a “don’t care” (X)
condition.

checksum A modulo-n result of adding data values. A checksum is used

to validate a data packet.

CLB Configurable Logic Block, a basic Xilinx FPGA element

consisting of a 3-5 input LUT.

CLM Career Limiting Move.

CMOS Complementary symmetry (i.e., uses both P- and N-style

transistors) Metal Oxide Semiconductor.

combinational An asynchronous operation that makes a direct and

immediate assignment to the output.

concatenation Items linked together in a continuous and related chain. In
Verilog, items enclosed in {} are linked together and operated on as a
single entity.
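
A short sketch of the {} operator follows; the module and port names are assumptions made for illustration. It also shows the related replication form, {n{item}}.

```verilog
// Hypothetical example: concatenation and replication.
module concat_demo (hi, lo, word, swapped, sign_ext);
  input  [3:0] hi, lo;
  output [7:0] word, swapped, sign_ext;

  assign word     = {hi, lo};          // hi in bits 7:4, lo in bits 3:0
  assign swapped  = {lo, hi};          // nibbles exchanged
  assign sign_ext = {{4{lo[3]}}, lo};  // four copies of lo[3], then lo
endmodule
```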

configuration The process of loading the FPGA with the user’s design
file(s).

constraints Conditions and requirements added to a design to provide

optimized performance. Constraints include signal path timing
requirements, device pin assignments, and logic block relative
locations, etc.

core An intellectual property element, a pre-designed function

block.

CPLD Complex Programmable Logic Device. Compared to an
FPGA, a CPLD has more complex logic elements and more regimented
routing tracks that lead to more deterministic, but less flexible, circuit
performance.

CPU Central Processing Unit.

CRC Cyclic Redundancy Checksum. A pseudorandom number

correlated to a data stream.

DeMorgan's Theorems Two Boolean Logic theorems that convert

between OR and AND forms. In Verilog form, here are the two
theorems:
~(A | B) = (~A & ~B);
~(A & B) = (~A | ~B);
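
The two identities above can be checked exhaustively in simulation. The testbench below is a sketch with an assumed module name; it prints a message for any input pair where either identity fails.

```verilog
// Hypothetical testbench: verify both theorems for all values of A and B.
module demorgan_check;
  reg     A, B;
  integer i;

  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      {A, B} = i;  // low two bits of i drive A and B
      if ((~(A | B) !== (~A & ~B)) || (~(A & B) !== (~A | ~B)))
        $display("mismatch at A=%b B=%b", A, B);
    end
    $display("DeMorgan check complete");
    $finish;
  end
endmodule
```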

DFF D-Type (edge-triggered) FlipFlop.

dissipation Waste created during the performance of some useful task. In

the context of FPGAs, this is power wasted when signals switch. This
causes heating of the FPGA device. The dissipation (heating) is
proportional to the signal loading and the switching frequency.

DLL Delay-Locked Loop. A method of controlling clock skew
across a device by delaying clock paths a variable amount until all
edges are nearly simultaneous.

DRAM Dynamic Random Access Memory.

EAB Embedded Array Block. This is Altera’s basic RAM block in

their CPLDs.

edge-triggered A signal that is evaluated only at the rising and/or falling

edge of a reference clock.

EDIF Electronic Design Interchange Format. This standard is

administered by the Electronic Industries Association (EIA).

EEPROM Electrically Erasable Programmable Read-Only Memory.

EMI Electromagnetic Interference. Some of the energy that is
wasted during signal switching is radiated into space. This energy, if
not managed, can cause problems for other electronic circuits.

EPROM Erasable Programmable Read-Only Memory.

fanout A measure of the unit-loading of a driver.

feedback A signal wrapped from an output back to the input.

FET Field Effect Transistor.

FG Function Generator. A 3-, 4-, or 5-input look-up table, a
basic Xilinx logic element.

FIFO First-In First-Out register set.

flatten The process of merging modules and library parts to create a
single homogeneous netlist.

flipflop A bistable multivibrator, a circuit where the output is either

true or false. The output depends on the input and the input history
(memory).

floorplan The arrangement of logic elements in the physical structure of

the device.

footprint The arrangement and style of the physical pins and package
of a device.

FPGA Field Programmable Gate Array. Compared to a CPLD, the

FPGA has more segmented routing and less complex logic elements.
This leads to more flexible, but less deterministic, circuit performance.

FSM Finite State Machine.

GAL Generic Array Logic. Early PLDs had active-low or active-high
polarity outputs; the GAL allowed programming the polarity of the
output.

GIGO Garbage-In, Garbage-Out. A maxim that the quality of the

output is directly related to the quality of the input.

glitch A short and unwanted signal transition.

GSR Global Set/Reset. A dedicated and device-wide signal routing

and buffering resource.

GTL Gunning Transceiver Logic.

GTS Global TriState.

GUI Graphical User Interface.

hazard An overlap or dropout of input signals that causes a glitch.

HDL Hardware Description Language. A text-based method of

capturing a design.

hex Short for hexadecimal, a base-16 numbering system where each
digit is one of 16 values, written 0-9 and A-F.

hierarchy A pyramidal arrangement of modules.

hold time The period of time after a clock edge that an input signal must
be stable to assure the flipflop or latch output follows the input
correctly.

hysteresis A condition similar to friction where feedback is used to slow

an output’s response to an input signal change. Often used to help
prevent glitches.

impedance The opposition to a change of signal direction or strength.
Impedance is the vector (complex) sum of resistance and reactance.

inout A bidirectional module port.

input A module port that is driven by an external signal or signals.

instance An occurrence of a signal, library part, or module.

instantiate To create an occurrence of a signal, library part, or module.

integer A whole number (no fractional or decimal part). Verilog

defines an integer to be at least 32 bits wide.

IP Intellectual Property.

LAB Logic Array Block. This is Altera’s basic logic block in their
CPLDs.

latch A level-sensitive storage element. This circuit has feedback

which allows it to “remember” its history and maintain a condition
based on that history.

latency The time it takes to process inputs to create the output. In a

synchronous system this time can be measured in the number of clock
cycles required to complete an operation.

LE Logic Element. Built from look-up tables and flipflops.
Altera’s LABs are built of structured groups of LEs.

Lint A computer language syntax-checker.

LSB Least Significant Bit.

LUT Look-Up Table.

management The person who provides guidance and direction for a team.
The manager of a team sets the limit for team achievement.

metastability When the setup or hold time for a flipflop is violated, the
output can become indeterminate for an unpredictable length of time.
This characteristic of a flipflop is called metastability.
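
A common way to reduce the chance that a metastable value reaches downstream logic is a two-flipflop synchronizer. The sketch below uses assumed names (sync2, meta) and is one widely used mitigation, not a guarantee against metastability.

```verilog
// Hypothetical two-flipflop synchronizer for an asynchronous input.
module sync2 (clk, async_in, sync_out);
  input  clk, async_in;
  output sync_out;
  reg    meta, sync_out;

  always @(posedge clk) begin
    meta     <= async_in;  // may violate setup/hold and go metastable
    sync_out <= meta;      // meta has a full clock period to settle
  end
endmodule
```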

MSB Most Significant Bit.

MUX A multiplexer. A circuit where the output is switched or

selected by a control or set of controls.

NAND Not-AND, an AND gate with the output inverted.

net A connection point similar to a trace on a circuit board.

netlist A textual version of a design which includes all elements and

their interconnections.

newbie Someone who is new to a technology and therefore clueless.

nonblocking An assignment that can be scheduled without blocking the
procedural flow. Nonblocking assignments occur simultaneously and do
not interfere with each other; their order in a sequential block is not
significant.
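
The classic demonstration is a register swap. In this sketch (names are hypothetical), all right-hand sides are sampled before any left-hand side is updated, so the two swap statements can be written in either order.

```verilog
// Hypothetical example: swapping two registers with nonblocking assignments.
module nb_swap (clk, load, a_in, b_in, a, b);
  input  clk, load, a_in, b_in;
  output a, b;
  reg    a, b;

  always @(posedge clk) begin
    if (load) begin
      a <= a_in;
      b <= b_in;
    end else begin
      a <= b;  // uses the value of b from before the clock edge
      b <= a;  // uses the value of a from before the clock edge
    end
  end
endmodule
```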

nsec Nanosecond (10^-9 second).

oscillator A device that produces an alternating or pulsating output.
These circuits are often used to create reference clocks for synchronous
circuits. The basic requirements for an oscillator are: 360 degrees of
feedback and an overall loop gain of 1. There is an old saying: if you’re
trying to design an oscillator, you will get an amplifier; if you’re trying
to design an amplifier, you will get an oscillator.

output A module port that drives external signal or signals.

pad A net that connects the FPGA logic to the outside world.

parameter An operating value for a module. This is generally a value

that can be changed during compilation.

PCB Printed Circuit Board.

pipeline A method of reducing logic that must be resolved between

clock edges. Pipelining increases operating speed at the expense of
latency.
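
As a sketch of the tradeoff, the hypothetical module below splits a four-input addition across two clocked stages. Each stage contains only one adder delay between clock edges, but the result appears two clock edges after the inputs are applied.

```verilog
// Hypothetical two-stage pipelined adder: sum = a + b + c + d.
module add4_pipe (clk, a, b, c, d, sum);
  input        clk;
  input  [7:0] a, b, c, d;
  output [9:0] sum;
  reg    [8:0] ab, cd;  // stage 1 registers (one extra bit for carry)
  reg    [9:0] sum;     // stage 2 register

  always @(posedge clk) begin
    ab  <= a + b;       // stage 1
    cd  <= c + d;       // stage 1
    sum <= ab + cd;     // stage 2, uses last cycle's partial sums
  end
endmodule
```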

PIP Programmable Interconnect Point. Xilinx’s method of making
signal connections.

PLD Programmable Logic Device.

PLL Phase-Locked Loop. A method of synchronizing to a

reference frequency.

portability A measure of the ability to transfer a design from one target

device to another.

POST Power-On Self Test.

primitive The most basic elements of a design. Verilog gate primitives
include and, nand, or, nor, xor, xnor, buf, and not. Primitives may also
describe the elements of an FPGA/CPLD architecture (pin buffers, clock
drivers, LUTs, etc.).

propagation Signals are represented by charges. It takes time for charges
to be distributed across and through circuitry; this time is called the
propagation delay.

pull-down A termination resistor. Unless the wire is otherwise driven, the
resistor pulls the node to a logic low.

pull-up A termination resistor. Unless the wire is otherwise driven, the
resistor pulls the node to a logic high.

PWB Printed Wiring Board.

RAM Random Access Memory.

reg A data storage element which can be a latch, a flipflop, or a

memory cell or cells. The default state of a Verilog reg is X.

RFI Radio Frequency Interference.

route The physical path a signal follows to get to its destination.

RTL Register Transfer Level. RTL assumes a set of hardware

constructs are defined in FPGA hardware and library elements. HDL
code is mapped to these constructs. RTL constructs include circuit
blocks like flipflops, latches, MUXs, etc., all connected together with
the FPGA routing resources.

schematic A graphical circuit diagram.

SDF Standard Delay Format, a netlist that includes signal delay

information.

sensitivity list Also called an event list or event sensitivity list. This is an
index of signals used in a block. This list drives the simulator: the
simulator can evaluate signals that change and determine if the signal is
used in a block. If the signal is not used, the block does not have to be
processed.
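
For a combinational block, every signal read inside the block should appear in the list. A minimal sketch, with assumed names, of a 2-to-1 multiplexer:

```verilog
// Hypothetical example: complete sensitivity list for combinational logic.
module mux2_comb (a, b, sel, y);
  input  a, b, sel;
  output y;
  reg    y;

  // The block is re-evaluated whenever a, b, or sel changes.
  always @(a or b or sel) begin
    if (sel)
      y = b;
    else
      y = a;
  end
endmodule
```

If sel were left out of the list, a simulator would not re-evaluate the block when sel changes, and simulation results could differ from the synthesized hardware.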

setup time The period of time before a clock edge that an input signal
must be stable to assure the flipflop or latch output follows the input
correctly.

skew The time difference between when a signal is generated in

one part of an FPGA and when it arrives at destination(s) at other parts
of the FPGA.

slack time The extra time available to allow logic to resolve before a
timing violation occurs. Positive slack time is good, negative slack time
is bad.

SMT Surface Mount Technology.

SRAM Static Random Access Memory.

structural A form of HDL coding style where circuit elements are

connected together like a schematic.

stuck A form of logic fault. A signal can be stuck at a certain value

(like a stuck-at-1 fault) when it should change based on some input
change.

synchronous A form of circuitry that uses a clock reference.

synthesis The process of mapping HDL to the available hardware.

ternary Arranged in a group of three.

threshold The voltage level where a signal is resolved into a zero or one
value. For TTL, this voltage is approximately 1.4V, for CMOS the
voltage is approximately ½ the supply voltage.

tick The accent grave or open quote symbol ( ` ) used by Verilog
to identify compiler directives (`define for example). Not to be confused
with the apostrophe or close quote symbol ( ' ) used in defining numbers
(1'b0 for example).

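timescale The basic unit of time used during simulation, set with the
`timescale compiler directive. When no directive is given, the default
time unit is simulator specific, although nsec is a common choice.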
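
The sketch below sets the time unit explicitly and shows the tick and apostrophe symbols side by side. The macro name HALF_PERIOD and the module name are assumptions made for illustration.

```verilog
`timescale 1ns / 100ps    // time unit is 1 ns, precision is 100 ps
`define HALF_PERIOD 5     // the tick (`) introduces a compiler directive

// Hypothetical clock generator for simulation only.
module clk_gen;
  reg clk;

  initial clk = 1'b0;               // the apostrophe (') appears in a sized number
  always #`HALF_PERIOD clk = ~clk;  // toggle every 5 ns, giving a 10 ns period
  initial #100 $finish;             // stop after 100 ns
endmodule
```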

TLA Three Letter Acronym.

toggle To change state.

tri A Verilog net that can be driven by multiple sources.

tristate Three levels of output drive, 0, 1, or Z (open or no drive).
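
A tristate driver is usually written with the conditional operator. In this sketch (names are hypothetical), the driver places d on the bus when oe is high and releases the bus to Z otherwise.

```verilog
// Hypothetical 8-bit tristate bus driver.
module tri_drv (oe, d, bus);
  input        oe;
  input  [7:0] d;
  inout  [7:0] bus;

  assign bus = oe ? d : 8'bz;  // 8'bz extends to eight Z bits
endmodule
```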

uA Microamp (10^-6 amp).

UART Universal Asynchronous Receiver-Transmitter.

vector Multibit net or register variable. Verilog only supports
one-dimensional vectors. This can also be shorthand for test vectors, a
set of input and output values used for test.

vendor A supplier of goods or services.

Verilog An HDL simulation language designed by Phil Moorby et al. in
1983-1984 for Automated Integrated Design Systems (later called
Gateway Design Automation). Gateway was acquired by Cadence in
1989. In 1990, Cadence placed Verilog into the public domain, managed
by Open Verilog International (OVI). IEEE Std 1364-1995,
Standard Hardware Description Language Based on the Verilog®
Hardware Description Language, was approved in 1995.

VHDL Very high-speed integrated circuit Hardware Description

Language. This language has its roots in the Ada programming
language and is the main competitor to Verilog.

wire A Verilog net (can be driven by only one source).

X An unknown value.

XNOR Exclusive NOR, the output is an inverted version of the XOR

function.

XOR Exclusive OR, the output is true only when the inputs are
different.

Z A high impedance value (open or not driven).

Bibliography

Bergeron, Janick, Writing Testbenches: Functional Verification of HDL Models,
Second Edition, Kluwer Academic Publishers, Norwell, MA, 2003.
Bhasker, J., A Verilog HDL Primer, Star Galaxy Press, Allentown, PA, 1997.
Bhasker, J., Verilog HDL Synthesis, A Practical Primer, Star Galaxy Publishing,
Allentown, PA, 1998.
Ciletti, Michael D., Modeling, Synthesis and Rapid Prototyping with the Verilog HDL,
Prentice Hall, Upper Saddle River, NJ, 1999.
Johnson, Howard W., and Graham, Martin, High-Speed Digital Design: A Handbook
of Black Magic, Prentice Hall, Upper Saddle River, NJ, 1992.
Keating, Michael, and Bricaud, Pierre, Reuse Methodology Manual for System-on-a-
Chip Designs, Kluwer Academic Publishers, Norwell, MA, 1998.
Kurup, Pran, Abbasi, Taher, and Bedi, Ricky, It’s the Methodology, Stupid, Bytek
Designs, Palo Alto, CA, 1998.
Lee, James M., Verilog Quickstart, Kluwer Academic Publishers, Norwell, MA, 1997.
Malvino, Albert Paul, and Leach, Donald P., Digital Principles and Applications, 2nd
ed., McGraw-Hill Book Company, New York, NY, 1975.
Maxfield, Clive “Max”, Designus Maximus Unleashed!, Butterworth-Heinemann,
Woburn, MA, 1998.
Palnitkar, Samir, Verilog HDL, A Guide to Digital Design and Synthesis, Prentice
Hall, Upper Saddle River, NJ, 1996.
Rajan, Sundar, Essential VHDL RTL Synthesis Done Right, S&G Publishing, San
Jose, CA, 1997.
Sagdeo, Vivek, The Complete Verilog Book, Kluwer Academic Publishers, Norwell,
MA, 1998.
Smith, Douglas J., HDL Chip Design, Doone Publications, Madison, AL, 1997.
Smith, Michael J. S., Application-Specific Integrated Circuits, Addison-Wesley,
Reading, MA, 1997.
Sternheim, Eli, Singh, Rajvir, Madhaven, Rajeev, and Trivedi, Yatin, Digital Design
and Synthesis with Verilog HDL, Automata, San Jose, CA, 1993.
Zeidman, Bob, Verilog Designer’s Library, Prentice Hall, Upper Saddle River, NJ,
1999.

Index

&, 31
&&, 31
|, 31
||, 31
=, 10, 25
==, 33
<=, 10, 25
$display, 156
$dump, 165
$dumpall, 165
$dumpfile, 165
$dumplimit, 164
$dumpoff, 164
$dumpon, 164
$dumpvars, 164
$finish, 156
$monitor, 162
$monitoroff, 162
$monitoron, 162
$readmemb, 166
$readmemh, 166
$stop, 156
$write, 157
`define, 154
`include, 155
`timescale, 155
`undef, 154

A

adder, 50, 105
alligators, 261
Altera, 231
always, 10, 25
array, 143
ASIC, 275
assign, 9
asynchronous, 58
ATPG, 277
autorouting, 75

B

bidirectional bus, 83
bitstream, 275
bitwise, 31, 276
blocking, 10, 25
buf, 16
bufg, 180
bufif0, 16
bufif1, 17
buskeeper, 276

C

capacitance, 276
carry, 106
case, 276
casex, 276
checklist, 222
checksum, 276
CLB, 225, 276
clock, 61
CMOS, 276
Coffman's Law, 236
combinational, 276
comment, 9
commutative, 70
concatenation, 39
conditional, 36
configuration, 277
constraints, 277

core, 277
counter, 121
CPLD, 277
CRC, 133
cyclic redundancy checksum, 133

D

default, 85
delay, 56
DeMorgan's Theorems, 69
DFFs, 277
dissipation, 277
division, 35, 133
DLL, 227

E

EAB, 233, 277
edgetrig, 24
event, 10
event sensitivity list, 282

F

false, 30
fanout, 56, 278
feedback, 19-20
FGs, 278
FLEX, 231
flipflop, 12, 18
floorplan, 278
footprint, 278

G

GIGO, 8
glitch, 56, 278
Gray Code, 97-103
GSR, 279
GTS, 279

H

hazard, 279
HDL, 279
hierarchy, 12
hold time, 58, 279
hysteresis, 63, 279

I

inout, 9, 83
instances, 279
instantiate, 279
IOB, 202, 224

J

Johnson counter, 122

L

latch, 18, 280
latency, 92
LFSR, 124
Linear Feedback Shift Register, 124
Lint, 280
LogiBLOX, 247
LUT, 280

M

metastability, 57, 280
module, 9, 12
multiplexer, 44
multiplication, 116
multivibrator, 18
MUX, 44, 280

N

NAND, 14

negedge, 10
netlist, 2, 46, 280
nonblocking, 25, 280
NOR, 15

O

OR, 14
obuf, 240
oscillator, 264, 281
overtime, 7

P

parameter, 38
pins, 12, 196, 222
PLL, 281
port, 9
portability, 11
posedge, 10
priority encoder, 11-12, 33
primitive, 281
propagation, 63

Q

Quine-McCluskey, 74
QM.EXE, 74

R

race condition, 64
RAM, 136
redundant, 69
reg, 282
ripple counter, 121
ROM, 135
RTL, 282

S

schedule, 7
schematic, 282
Schmitt trigger, 63
script, 175
SDF, 53, 282
sensitivity list, 10, 282
setup time, 58, 282
skew, 63, 282
structural, 282
subtractor, 114
synchronize, 59
synthesis, 2, 283

T

ternary, 37
test-vector, 269
threshold, 55, 210, 283
tick, 29
tilde (~), 30
timeformat, 159
tristate bus, 78

U

unknown, 33

V

vector, 283
vectorless, 227, 262
Verilog, 284
Virtex, 227

W-X-Y-Z

weak keeper, 230
wire, 284
X, 33, 284
Xilinx, TBD
XNOR, 15
XOR, 15
Z, 284
THE AUTHOR

Ken Coffman has held a variety of jobs,

including strawberry picker, dishwasher,
laborer in a cat food factory, Air Force
Sergeant, rock’n’roll bass player, college
lecturer, electronics technician, injection-
molding machine operator, electrical
engineer, engineering manager, concert
promoter (with Craig Ranta), small
business owner and salesman. He reserves
the right to go back to washing dishes if
these careers don’t prove profitable.
Ken is the author of four novels:
Steel Waters, Alligator Alley (with Mark
Bothum), Twisted Shadow (with Mark
Bothum) and Glen Wilson’s Bad
Medicine. He wrote a screenplay called
Expense Report based on Alligator Alley.
Ken was task leader of the
semantics section of the IEEE 1364.1
Verilog RTL Synthesis Specification. He
holds a BSEET degree from Cogswell
College North.
He lives in the hills northeast of
Seattle with his wife Judy and a dog
named Bear.

Photograph by Dwight Freeman
