UG579 UltraScale DSP
DSP Slice
User Guide
Chapter 1: Overview
Introduction to UltraScale Architecture
UltraScale Architecture DSP Slice Overview
Differences from Previous Generations
Device Resources
Recommended Design Flow
Overview
Virtex® UltraScale+™ devices provide the highest performance and integration capabilities
in a FinFET node, including both the highest serial I/O and signal processing bandwidth, as
well as the highest on-chip memory density. As the industry's most capable FPGA family,
the Virtex UltraScale+ devices are ideal for applications including 1+Tb/s networking and
data center and fully integrated radar/early-warning systems.
Virtex UltraScale devices provide the greatest performance and integration at 20 nm,
including serial I/O bandwidth and logic capacity. As the industry's only high-end FPGA at
the 20 nm process node, this family is ideal for applications including 400G networking,
large scale ASIC prototyping, and emulation.
Zynq® UltraScale+ MPSoC devices provide 64-bit processor scalability while combining
real-time control with soft and hard engines for graphics, video, waveform, and packet
processing. Integrating an Arm®-based system for advanced analytics and on-chip
programmable logic for task acceleration creates unlimited possibilities for applications
including 5G Wireless, next generation ADAS, and Industrial Internet-of-Things.
This user guide describes the UltraScale architecture DSP Slice resources and is part of the
UltraScale architecture documentation suite available at: www.xilinx.com/documentation.
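[Figure: simplified DSP48E2 slice block diagram showing the A, B, C, and D inputs, the pre-adder, the 27 x 18 multiplier, the adder/subtracter with XOR output, the pattern detector, and the P output]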
The DSP48E2 slice supports both sequential and cascaded operations due to the dynamic
OPMODE and cascade capabilities. Applications of the DSP slice include:
° Support adding two other input operands with the multiplier’s two partial products,
instead of only one in DSP48E1
° Add a memory-cell based rounding constant while freeing the C input for the
following function: A x B + C + RND
° WMUX provides another accumulator feedback path to reduce the size of the
complex multiply-accumulate (MACC) or a semi-parallel FIR filter.
• Wide XOR of X, Y, Z multiplexers
° 48 3-bit XOR at first level feeds XOR tree to create octal 12-bit XOR, quad 24-bit
XOR, dual 48-bit XOR, single 96-bit XOR
° Cascading two DSP48E2 slices in Wide XOR mode creates octal 24-bit XOR, quad
48-bit XOR, dual 96-bit XOR, or single 192-bit XOR. Cascade depth is limited to DSP
column size
° Sequentially create wider XOR via XOR accumulation feedback with a single
DSP48E2, extending the XOR width by 96 bits every clock cycle
• Unique features in DSP48E2:
The DSP48E2 blocks use a signed arithmetic implementation. To best match the resource
capabilities and, in general, to get the most efficient mapping, write code using signed
values in the HDL source. Designs created for the 25 x 18 multiplier in the 7 series FPGAs
may need to be sign-extended for the 27 x 18 multiplier in the UltraScale architecture. For
more details on migration and design methodologies, see UltraScale Architecture Migration
Methodology Guide (UG1026) [Ref 1]. When migrating designs with many cascaded DSP
slices, the number of DSP slices per column in the new target device should be taken into
consideration.
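The sign-extension step mentioned above can be illustrated with a short Python sketch. The helper names `sign_extend` and `to_signed` and the sample value are assumptions made for illustration only; they are not part of any Xilinx tool or library.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a two's complement bit pattern from from_bits to to_bits."""
    value &= (1 << from_bits) - 1           # keep only the source bits
    if value & (1 << (from_bits - 1)):      # sign bit set: fill the new upper bits with 1s
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def to_signed(value: int, bits: int) -> int:
    """Interpret a bit pattern as a signed two's complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


# A 25-bit operand holding -3, widened for a 27-bit multiplier input.
a25 = (-3) & ((1 << 25) - 1)
a27 = sign_extend(a25, 25, 27)
print(hex(a27), to_signed(a27, 27))
```

The numeric value is unchanged by the extension; only the width of the bit pattern grows.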
Device Resources
The DSP resources are optimized and scalable across the UltraScale portfolio, providing a
common architecture that improves implementation efficiency, IP implementation, and
design migration. Migration between UltraScale families does not require any design
changes for the DSP48E2 slice.
Two DSP48E2 slices with a dedicated interconnect form each DSP tile (see Figure 1-2). The
DSP tiles stack vertically in a DSP48E2 column. The height of a DSP tile is the same as five
configurable logic blocks (CLBs) and also matches the height of one 36K block RAM. The
block RAM can be split into two 18K block RAMs. Each DSP48E2 slice aligns horizontally
with an 18K block RAM, providing optimal connectivity between resources.
[Figure 1-2: DSP tile of two DSP48E2 slices shown alongside the CLBs, the interconnect, and the 18K block RAMs]
Note: The maximum might be less in different DSP columns (if PS limits the DSP cascade height) or
SLRs (SLR with HBM interface vs. no HBM interface).
Table 1-1 shows the maximum number of DSP48E2 slices that can be directly cascaded
vertically in a column, and the total number of DSP48E2 slices, for the UltraScale FPGAs.
Notes:
1. KU085 max cascade is 96 in SLR1.
2. VU160 max cascade is 96 in SLR0.
Table 1-2 shows the same information for the UltraScale+ FPGAs.
For more information on design techniques, see Chapter 4, DSP48E2 Usage Guidelines.
Pinout Planning
DSP usage has little effect on pinouts because DSP48E2 slices are distributed throughout
the device. The best approach is to let the tools choose the DSP48E2 and I/O locations
based on the implementation requirements. Results can be adjusted if necessary for board
layout considerations. The timing constraints should be set so that the tools can choose
optimal placement to meet the specific design requirements. The only directional
consideration in the DSP structure is that DSP48E2 slices cascade vertically up a column,
allowing wide buses to drive a vertical orientation to other logic, including I/O. The I/O
columns typically provide 13 I/O in the same vertical space as every six DSP48E2 slices, with
every clock region defined in height by a bank of 52 I/O and 12 DSP tiles (24 DSP slices).
DSP48E2 Functionality
Overview
This chapter provides technical details of the DSP48E2 element. The DSP48E2 slice consists
of a 27-bit pre-adder, 27 x 18 multiplier and a flexible 48-bit ALU that serves as a
post-adder/subtracter, accumulator, or logic unit (see Figure 2-1).
[Figure 2-1: DSP48E2 slice block diagram showing the dual A, D, and pre-adder block, the dual B register, the 27 x 18 multiplier with the M register, the W, X, Y, and Z multiplexers, the ALU, the C register, and the pattern detector. The signals marked with an asterisk (ACIN, BCIN, PCIN, CARRYCASCIN, MULTSIGNIN, ACOUT, BCOUT, PCOUT, CARRYCASCOUT, and MULTSIGNOUT) are dedicated routing paths internal to the DSP48E2 column. They are not accessible via general-purpose routing resources.]
• Multiply
• Multiply accumulate (MACC)
• Multiply add
• Four-input add
• Barrel shift
• Wide-bus multiplexing
• Magnitude comparator
• Bitwise logic functions
• Wide XOR
• Pattern detect
• Wide counter
The architecture also supports cascading multiple DSP48E2 slices to form wide math
functions, DSP filters, and complex arithmetic without the use of general logic.
DSP48E2 Features
The features in the DSP48E2 slice are:
° Bitwise logic operations—two-input AND, OR, NOT, NAND, NOR, XOR, and XNOR
° Overflow/underflow support
° Terminal count detection support and auto resetting: auto resetting can give
priority to clock enable
• Cascading 48-bit P bus supports internal low-power adder cascade: 48-bit P bus allows
for 12-bit quad or 24-bit dual SIMD adder cascade support
• Optional 17-bit right shift to enable wider multiplier implementation
• Dynamic user-controlled operating modes:
° 5-bit INMODE control bus provides selects for 2-deep A and B registers, pre-adder
add-sub control as well as mask gates for pre-adder multiplexer functions
° 4-bit ALUMODE control bus selects logic unit function and accumulator add-sub
control
• Carry in for the second stage adder:
The DSP slice consists of a multiplier followed by an accumulator. At least three pipeline
registers are required for both multiply and multiply-accumulate operations to run at full
speed. The multiply operation in the first stage generates two partial products that need to
be added together in the second stage.
When only one or two registers exist in the multiplier design, the M register should always
be used to save power and improve performance.
Add/Sub and Logic Unit operations require at least two pipeline registers (input, output) to
run at full speed.
The cascade capabilities of the DSP slice are extremely efficient at implementing
high-speed pipelined filters built on the adder cascades instead of adder trees.
Multiplexers are controlled with dynamic control signals, such as OPMODE, ALUMODE, and
CARRYINSEL, enabling a great deal of flexibility. Designs using registers and dynamic
opmodes are better equipped to take advantage of the DSP slice’s capabilities than
combinatorial multiplies.
In general, the DSP slice supports both sequential and cascaded operations due to the
dynamic OPMODE and cascade capabilities. Fast Fourier Transforms (FFTs), floating-point
computation (multiply, add/sub, divide), counters, and large bus multiplexers are some
applications of the DSP slice.
Additional capabilities of the DSP slice include synchronous resets and clock enables, dual
A input pipeline registers, pattern detection, Logic Unit functionality, single
instruction/multiple data (SIMD) functionality, and MACC and Add-Acc extension to 96 bits.
The DSP slice supports convergent and symmetric rounding, terminal count detection and
auto-resetting for counters, and overflow/underflow detection for sequential accumulators.
A 96-bit wide XOR function can be implemented as eight 12-bit wide XOR, four 24-bit wide
XOR, or two 48-bit wide XOR.
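The relationship between the 12-bit, 24-bit, 48-bit, and 96-bit XOR results can be sketched as a reduction tree in Python. This is a behavioral illustration only: the function name `wide_xor`, the choice of two 48-bit input words, and the assignment of bits to groups are assumptions, not the documented hardware bit mapping.

```python
MASK48 = (1 << 48) - 1


def parity(value: int) -> int:
    """Return the XOR of all bits of a non-negative integer."""
    return bin(value).count("1") & 1


def wide_xor(x: int, z: int) -> dict:
    """XOR-reduce two 48-bit words (96 input bits) at four granularities."""
    level1 = (x ^ z) & MASK48  # first level: bitwise XOR of the two words
    # Assumed grouping: each 12-bit XOR covers 6 bits of x and the same 6 bits of z.
    xor12 = [parity((level1 >> (6 * i)) & 0x3F) for i in range(8)]
    xor24 = [xor12[2 * i] ^ xor12[2 * i + 1] for i in range(4)]
    xor48 = [xor24[0] ^ xor24[1], xor24[2] ^ xor24[3]]
    xor96 = xor48[0] ^ xor48[1]
    return {"xor12": xor12, "xor24": xor24, "xor48": xor48, "xor96": xor96}


result = wide_xor(0x123456789ABC, 0x0FEDCBA98765)
print(result["xor96"])
```

Each wider result is the XOR of two narrower neighbors, so a single tree yields all four widths.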
[Figure 2-2: Hierarchical View of the DSP48E2 Slice Input Registers and Pre-adder]
Each DSP48E2 slice has a two-input multiplier followed by multiplexers and a four-input
adder/subtracter/accumulator. The DSP48E2 multiplier has asymmetric inputs and accepts
an 18-bit two’s complement operand and a 27-bit two’s complement operand. The
multiplier stage produces a 45-bit two’s complement result in the form of two partial
products. These partial products are sign-extended to 48 bits in the X multiplexer and
Y multiplexer and fed into the four-input adder for final summation. This results in a 45-bit
multiplication output, which has been sign-extended to 48 bits. Therefore, when the
multiplier is used, the adder effectively becomes a three-input adder.
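The operand and result widths described above can be checked numerically. The following Python sketch models only the final summed product, not the two partial products; the function name `multiply_27x18` is a hypothetical helper introduced for illustration.

```python
A_BITS, B_BITS, P_BITS = 27, 18, 48
MASK48 = (1 << P_BITS) - 1


def multiply_27x18(a: int, b: int) -> int:
    """Multiply signed 27-bit and 18-bit operands; return a 48-bit two's complement pattern."""
    if not -(1 << (A_BITS - 1)) <= a < (1 << (A_BITS - 1)):
        raise ValueError("a does not fit in 27 bits")
    if not -(1 << (B_BITS - 1)) <= b < (1 << (B_BITS - 1)):
        raise ValueError("b does not fit in 18 bits")
    # Masking a Python int yields the sign-extended two's complement pattern.
    return (a * b) & MASK48


# The largest-magnitude products still fit in 45 signed bits, so
# bits [47:44] of the 48-bit pattern are copies of the sign bit.
most_positive = multiply_27x18(-(1 << 26), -(1 << 17))
most_negative = multiply_27x18(-(1 << 26), (1 << 17) - 1)
print(hex(most_positive), hex(most_negative))
```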
The second stage adder/subtracter accepts four 48-bit, two’s complement operands and
produces a 48-bit, two’s complement result when the multiplier is bypassed by setting the
USE_MULT attribute to NONE and with the appropriate OPMODE setting. In SIMD mode,
the 48-bit adder/subtracter also supports dual 24-bit or quad 12-bit SIMD arithmetic
operations with CARRYOUT bits. In this configuration, bitwise logic operations on two
48-bit binary numbers (and three 48-bit binary numbers in the special XOR3 case) are also
supported with dynamic ALUMODE control signals.
Higher level DSP functions are supported by cascading individual DSP48E2 slices in a
DSP48E2 column. Two datapaths (ACOUT and BCOUT) and the DSP48E2 slice outputs
(PCOUT, MULTSIGNOUT, and CARRYCASCOUT) provide the cascade capability. The ability to
cascade datapaths is useful in filter designs. For example, a Finite Impulse Response (FIR)
filter design can use the cascading inputs to arrange a series of input data samples and the
cascading outputs to arrange a series of partial output results. The ability to cascade
provides a high-performance and low-power implementation of DSP filter functions
because the general routing in the fabric is not used.
The C input allows the formation of many 3-input mathematical functions, such as 3-input
addition or 2-input multiplication with an addition. One subset of this function is the
valuable support of symmetrically rounding a multiplication toward zero or toward infinity.
The C input together with the pattern detector also supports convergent rounding.
For multi-precision arithmetic, the DSP48E2 slice provides a right wire shift by 17. Thus, a
partial product from one DSP48E2 slice can be right justified and added to the next partial
product computed in an adjacent DSP48E2 slice. Using this technique, the DSP48E2 slices
can be used to build bigger multipliers.
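As an illustration of the 17-bit shift technique, the following Python sketch builds a signed 27 x 35 multiply from two 27 x 18 multiplies. The function name `mult_27x35` and the operand split are assumptions chosen for the example; the arithmetic is a behavioral model and does not describe a specific OPMODE configuration.

```python
def mult_27x35(a: int, b: int) -> int:
    """Signed 27 x 35 multiply composed from two 27 x 18 multiplies."""
    b_lo = b & 0x1FFFF           # lower 17 bits, unsigned (fits an 18-bit signed input)
    b_hi = b >> 17               # upper 18 bits, signed (arithmetic shift)
    p0 = a * b_lo                # first slice: lower partial product
    p1 = a * b_hi + (p0 >> 17)   # second slice: adds the partial product shifted right by 17
    # The low 17 bits come from the first slice, the rest from the second.
    return (p1 << 17) | (p0 & 0x1FFFF)


print(mult_27x35(-12345678, 9876543210))
```

Because the lower 17 bits of the first partial product are already final, only the shifted upper part needs to travel to the adjacent slice.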
The pattern detector at the output of the DSP48E2 slice provides support for convergent
rounding, overflow/underflow, block floating point, and support for accumulator terminal
count (counter auto reset). The pattern detector can detect if the output of the DSP48E2
slice matches a pattern, as qualified by a mask.
The data and control inputs to the DSP48E2 slice feed the arithmetic and logic stages. The
A and B data inputs can optionally be registered one or two times to assist the construction
of different, highly pipelined, DSP application solutions. The D path and the AD path can
each be registered once. The other data inputs and the control inputs can be optionally
registered once. Maximum frequency operation as specified in the UltraScale and
UltraScale+ device data sheets [Ref 2] is achieved by using pipeline registers.
In its most basic form, the output of the adder/subtracter/logic unit is a function of its
inputs. The inputs are driven by the upstream multiplexers, carry select logic, and multiplier
array.
A typical use of the slice is where A and B inputs are multiplied and the result is added to or
subtracted from the C register. More detailed operations based on control and data inputs
are described in later sections. Selecting the multiplier function consumes both X and Y
multiplexer outputs to feed the adder. The two 45-bit partial products from the multiplier
are sign extended to 48 bits before being sent to the adder/subtracter.
When not using the first stage multiplier, the 48-bit, dual input, bit-wise logic function
implements AND, OR, NOT, NAND, NOR, XOR, and XNOR. The inputs to these functions are:
Since PCIN is a cascade input from a lower DSP slice, creating even wider logic operations
is feasible via this cascade path. A 48-bit, triple input, bit-wise XOR3 logic operation is
supported when the Y multiplexer selects the C input and ALUMODE[3:0] = 0100.
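The output of the adder/subtracter or logic unit feeds the pattern detector logic. The
pattern detector allows the DSP48E2 slice to support convergent rounding, counter auto
reset when a count value has been reached, and overflow, underflow, and saturation in
accumulators.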
Figure 2-3 shows the DSP48E2 slice in a very simplified form. The nine OPMODE bits control
the selects of W, X, Y, and Z multiplexers, feeding the inputs to the adder/subtracter or logic
unit. In all cases, the 45-bit partial product data from the multiplier to the X and Y
multiplexers is sign extended, forming 48-bit input datapaths to the adder/subtracter.
Based on 45-bit operands and a 48-bit accumulator output, the number of guard bits (i.e.,
bits available to guard against overflow) is 3. To extend the number of MACC operations,
the MACC_EXTEND feature should be used, which allows the MACC to extend to 96 bits with
two DSP48E2 slices. If A is limited to 18 bits (sign-extended to 27), then there are 12 guard
bits for the MACC. The CARRYOUT bits are invalid during multiply operations. Combinations
of OPMODE, ALUMODE, CARRYINSEL, and CARRYIN control the function of the
adder/subtracter or logic unit.
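The guard-bit arithmetic above can be reproduced with a few lines of Python. The function names are hypothetical, and `safe_accumulations` is a conservative bound: it counts how many worst-case products are guaranteed to accumulate without overflow.

```python
def guard_bits(a_bits: int, b_bits: int, acc_bits: int = 48) -> int:
    """Bits left in the accumulator above a full-width signed product."""
    product_bits = a_bits + b_bits
    return acc_bits - product_bits


def safe_accumulations(a_bits: int, b_bits: int, acc_bits: int = 48) -> int:
    """Number of worst-case products that can always be summed without overflow."""
    return 1 << guard_bits(a_bits, b_bits, acc_bits)


print(guard_bits(27, 18), safe_accumulations(27, 18))
print(guard_bits(18, 18), safe_accumulations(18, 18))
```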
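[Figure 2-3: Simplified DSP48E2 slice operation. The W multiplexer (RND, P, C), X multiplexer (A:B, P, multiplier output), Y multiplexer (all 1s, C, multiplier output), and Z multiplexer (all 0s, P, C, PCIN, shifters) feed the adder/subtracter that drives P. OPMODE, CARRYINSEL, and ALUMODE control the behavior.]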
Input Ports
This section describes the input ports of the DSP48E2 slice in detail. The input ports of the
DSP48E2 slice are highlighted in Figure 2-4.
[Figure 2-4: DSP48E2 slice block diagram with the input ports highlighted. The signals marked with an asterisk are dedicated routing paths internal to the DSP48E2 column. They are not accessible via general-purpose routing resources.]
A, B, C, and D Ports
The DSP48E2 slice input data ports support many common DSP and math algorithms. The
DSP48E2 slice has four direct input data ports labeled A, B, C, and D. The A data port is
30 bits wide, the B data port is 18 bits wide, the C data port is 48 bits wide, and the
pre-adder D data port is 27 bits wide.
The 27-bit A (A[26:0]) and 18-bit B ports supply input data to the 27-bit by 18-bit, two’s
complement multiplier. With independent C port, each DSP48E2 slice is capable of
Multiply-Add, Multiply-Subtract, and Multiply-Round operations.
Concatenated A and B ports (A:B) bypass the multiplier and feed the X multiplexer input.
The 30-bit A input port forms the upper 30 bits of A:B concatenated datapath, and the
18-bit B input port forms the lower 18 bits of the A:B datapath. The A:B datapath, together
with the C input port, enables each DSP48E2 slice to implement a full 48-bit
Each DSP48E2 slice also has two cascaded input datapaths (ACIN and BCIN), providing a
cascaded input stream between adjacent DSP48E2 slices. The cascaded path is 30 bits wide
for the A input and 18 bits wide for the B input. Applications benefiting from this feature
include FIR filters, complex multiplication, multi-precision multiplication and complex
MACCs.
The A and B input ports and the ACIN and BCIN cascade ports can each have 0, 1, or 2 pipeline
stages in their datapaths. The dual A, D, and pre-adder port logic is shown in Figure 2-5. The
dual B register port logic is shown in Figure 2-6. The different pipestages are set using
attributes. Attributes AREG and BREG are used to select the number of pipeline stages for A
and B direct inputs to the X multiplexer to the ALU, and INMODE[0] can dynamically change
the number of pipeline stages to the multiplier. Attributes ACASCREG and BCASCREG select
the number of pipeline stages in the ACOUT and BCOUT cascade datapaths. The allowed
attribute settings are shown in Table 3-3, page 53. Multiplexers controlled by configuration
bits select flow-through paths, optional registers, or cascaded inputs. The data port
registers typically allow users to trade off increased clock frequency (i.e., higher
performance) against data latency.
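[Figure 2-5: Dual A, D, and pre-adder port logic. The A and ACIN inputs pass through the optional A1 and A2 registers to ACOUT, the X multiplexer, and, gated by INMODE[1]A, the pre-adder.]
[Figure 2-6: Dual B register port logic. The B and BCIN inputs pass through the optional B1 and B2 registers (clock enables CEB1 and CEB2, reset RSTB) to BCOUT, the X multiplexer, and, selected by INMODE[4] and BMULTSEL, the multiplier.]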
Table 2-1 and Table 2-2 show the encoding for the INMODE[4:0] dynamic control bits and
AMULTSEL, BMULTSEL, and PREADDINSEL static control bits. Note that the DSP48E2
attribute AMULTSEL has replaced the DSP48E1 attribute USE_DPORT due to increased
functionality of the pre-adder.
These bits select the functionality of the pre-adder, the A, B, and D input registers.
AMULTSEL must be set to AD to enable the pre-adder functions described in Table 2-1 and
Table 2-2.
In summary, the INMODE dynamic control signals along with AMULTSEL, BMULTSEL, and
PREADDINSEL static attributes control the pre-adder functionality and A, B, and D register
bus multiplexers that precede the multiplier. The DSP48E2 supports two-deep A or B
sourcing the pre-adder as well as a pre-adder squaring function.
Notes:
1. Set the data on the D and the A ports so the pre-adder, which does not support saturation, does not overflow or underflow.
See Pre-Adder, page 36.
INMODE[4]  INMODE[3]  INMODE[2]  INMODE[1]  INMODE[1]A(1)  INMODE[1]B(1)  INMODE[0]  PREADDINSEL  BMULTSEL  AMULTSEL (USE_DPORT)  Multiplier A Port  Multiplier B Port(3)  Pre-Adder/Multiplier Function
0/1        0          0          0          0              0              0/1        A            B         A (FALSE)             A2/A1              B2/B1                 A*B
X          0          1          1          1              0              X          A            AD        AD (TRUE)             D                  D                     D^2
Notes:
1. INMODE[1]A and INMODE[1]B are internal signals defined by the user settings of PREADDINSEL and INMODE[1]. If
PREADDINSEL=A, INMODE[1]A (see Figure 2-5, page 23) is INMODE[1] and INMODE[1]B (see Figure 2-6, page 24) is 0. If
PREADDINSEL=B, INMODE[1]B is INMODE[1] and INMODE[1]A is 0.
2. Set the data on the D and the A or B ports so the pre-adder, which does not support saturation, does not overflow or
underflow. See Pre-Adder, page 36.
3. A or D are limited to 18 bits when provided through the B port, and are limited to 17-bit two's complement sign-extended
numbers when the pre-adder is used.
INMODE[1] may be used to gate the A or B datapath to use the pre-adder to create a 2:1
bus multiplexer along with the INMODE[2] control signal.
When INMODE[2] = 0, the D input to the pre-adder is 0. INMODE[1] and INMODE[2] enable
multiplexing between the D register and the A or B registers, without having to use resets
to force them to zero.
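The gating behavior described above can be sketched as a small behavioral model in Python. The function name `pre_adder` is hypothetical, the A path is used as the example (PREADDINSEL = A), and the use of INMODE[3] = 1 to select subtraction (D - A) is an assumption, since the surrounding text does not state the polarity. The model does not wrap results to 27 bits.

```python
def pre_adder(a: int, d: int, inmode: int) -> int:
    """Behavioral model of the pre-adder output AD for the A path."""
    a_gated = 0 if (inmode >> 1) & 1 else a   # INMODE[1] = 1 gates the A path to 0
    d_gated = d if (inmode >> 2) & 1 else 0   # INMODE[2] = 0 forces the D input to 0
    subtract = (inmode >> 3) & 1              # assumption: INMODE[3] = 1 selects D - A
    return d_gated - a_gated if subtract else d_gated + a_gated


# 2:1 bus multiplexer built from the two gating bits.
select_a = 0b00000   # D forced to 0, A passes through
select_d = 0b00110   # A gated off, D passes through
print(pre_adder(11, 22, select_a), pre_adder(11, 22, select_d))
```

With both gates open (INMODE[2] = 1 and INMODE[1] = 0), the same model returns the ordinary pre-adder sum.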
The 48-bit C port is used as a general input to the W, Y and Z multiplexers to perform add,
subtract, four-input add/subtract, and logic functions. The C input is also connected to the
pattern detector for rounding function implementations. The C port logic is shown in
Figure 2-7. The CREG attribute selects the number of pipestages for the C input datapath.
[Figure 2-7: C port logic. The 48-bit C input passes through an optional register (clock enable CEC, reset RSTC) to the W, Y, and Z multiplexers and the pattern detector.]
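[Figure: OPMODE, ALUMODE, and CARRYINSEL port logic. Each control input has an optional register: the 9-bit OPMODE (clock enable CECTRL, reset RSTCTRL) drives the W, X, Y, and Z multiplexers and the adder/subtracter; the 4-bit ALUMODE (clock enable CEALUMODE, reset RSTALUMODE) drives the adder/subtracter; the 3-bit CARRYINSEL drives the carry input select logic.]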
W, X, Y, and Z Multiplexers
The OPMODE (Operating Mode) control input contains fields for W, X, Y, and Z multiplexer
selects.
The OPMODE input provides a way for you to dynamically change DSP48E2 functionality
from clock cycle to clock cycle (e.g., when altering the internal datapath configuration of
the DSP48E2 slice relative to a given calculation sequence).
The OPMODE bits can be optionally registered using the OPMODEREG attribute (as noted
in Table 3-4).
Table 2-3, Table 2-4, Table 2-5, and Table 2-6 list the possible values of OPMODE and the
resulting function at the outputs of the four multiplexers (W, X, Y, and Z multiplexers). The
multiplexer outputs supply four operands to the following adder/subtracter. Not all
possible combinations for the multiplexer select bits are allowed. Some are marked in the
tables as “illegal selection” and give undefined results. If the multiplier output is selected,
then both the X and Y multiplexers are used to supply the multiplier partial products to the
adder/subtracter.
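Splitting the nine OPMODE bits into the four multiplexer select fields can be sketched in Python as follows. The field positions used here (X = OPMODE[1:0], Y = OPMODE[3:2], Z = OPMODE[6:4], W = OPMODE[8:7]) are stated as an assumption of this sketch; only the Y field position is confirmed elsewhere in this chapter. The names `OpmodeFields` and `decode_opmode` are hypothetical.

```python
from typing import NamedTuple


class OpmodeFields(NamedTuple):
    w: int  # W multiplexer select, assumed OPMODE[8:7]
    z: int  # Z multiplexer select, assumed OPMODE[6:4]
    y: int  # Y multiplexer select, OPMODE[3:2]
    x: int  # X multiplexer select, assumed OPMODE[1:0]


def decode_opmode(opmode: int) -> OpmodeFields:
    """Split a 9-bit OPMODE value into its multiplexer select fields."""
    if not 0 <= opmode < (1 << 9):
        raise ValueError("OPMODE is a 9-bit value")
    return OpmodeFields(
        w=(opmode >> 7) & 0b11,
        z=(opmode >> 4) & 0b111,
        y=(opmode >> 2) & 0b11,
        x=opmode & 0b11,
    )


# The MACC_EXTEND setting quoted later in this chapter.
print(decode_opmode(0b001001000))
```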
ALUMODE Inputs
The 4-bit ALUMODE controls the behavior of the second stage add/sub/logic unit.
ALUMODE = 0000 selects add operations of the form Z + (W + X + Y + CIN). CIN is the
output of the CARRYIN mux (see Figure 2-9). ALUMODE = 0011 selects subtract operations
of the form Z – (W + X + Y + CIN). ALUMODE = 0001 can implement
–Z + (W + X + Y + CIN) – 1. ALUMODE = 0010 can implement –(Z + W + X + Y + CIN) – 1,
which is equivalent to not (Z + W + X + Y + CIN). The negative of a two’s complement
number is obtained by performing a bitwise inversion and adding one, e.g.,
–k = not (k) + 1. Other subtract and logic operations can also be implemented with the
enhanced add/sub/logic unit. See Table 2-7.
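The four arithmetic ALUMODE settings described above can be modeled directly in Python. This is a behavioral sketch with a hypothetical function name `alu`; it covers only the four modes named in the text and wraps results to 48 bits.

```python
MASK48 = (1 << 48) - 1


def alu(alumode: int, w: int, x: int, y: int, z: int, cin: int = 0) -> int:
    """Model the four arithmetic ALUMODE settings; result is a 48-bit pattern."""
    s = w + x + y + cin
    if alumode == 0b0000:
        result = z + s              # Z + (W + X + Y + CIN)
    elif alumode == 0b0011:
        result = z - s              # Z - (W + X + Y + CIN)
    elif alumode == 0b0001:
        result = -z + s - 1         # -Z + (W + X + Y + CIN) - 1
    elif alumode == 0b0010:
        result = -(z + s) - 1       # equivalent to not (Z + W + X + Y + CIN)
    else:
        raise ValueError("only the four arithmetic modes are modeled")
    return result & MASK48


print(alu(0b0000, 1, 2, 3, 4, cin=1))
print(hex(alu(0b0010, 1, 2, 3, 4)))
```

The last mode shows the identity used in the text: subtracting one from the negated sum gives the same bit pattern as a bitwise inversion.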
Notes:
1. In two’s complement: –Z = not (Z) + 1.
See Table 2-10, page 38 for two-input ALUMODE operations and Table 5-3, page 70.
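[Figure 2-9: CARRYINSEL port logic. The 3-bit CARRYINSEL selects the carry input CIN from: the optionally registered CARRYIN (000); CARRYCASCIN (010), for large add/sub/acc parallel operation; CARRYCASCOUT (100), for large add/sub/acc sequential operation; A[26] XNOR B[17] (110), optionally registered with CEM, for rounding A * B; true or inverted P[47] (111 and 101), for rounding the output; and true or inverted PCIN[47] (011 and 001).]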
The fourth input (CARRYINSEL is equal to binary 110) is A[26] XNOR B[17] for symmetrically
rounding multiplier outputs. This signal can be optionally registered to match the MREG
pipeline delay. The fifth and sixth inputs (CARRYINSEL is equal to binary 111 and 101)
select the true or inverted P output MSB P[47] for symmetrically rounding the P output.
The seventh and eighth inputs (CARRYINSEL is equal to binary 011 and 001) select the true
or inverted cascaded P input MSB PCIN[47] for symmetrically rounding the cascaded P
input.
Table 2-8 lists the possible values of the three carry input select bits (CARRYINSEL) and the
resulting carry inputs or sources.
Output Ports
This section describes the output ports of the DSP48E2 slice in detail. The output ports of
the DSP48E2 slice are shown in Figure 2-10.
[Figure 2-10: DSP48E2 slice block diagram with the output ports highlighted. The signals marked with an asterisk are dedicated routing paths internal to the DSP48E2 column. They are not accessible via general-purpose routing resources.]
All the output ports except ACOUT and BCOUT are reset by RSTP and enabled by CEP (see
Figure 2-11). ACOUT and BCOUT are reset by RSTA and RSTB, respectively (shown in
Figure 2-5 and Figure 2-6).
[Figure 2-11: Output port logic. The P, PCOUT, MULTSIGNOUT, CARRYCASCOUT, CARRYOUT, PATTERNDETECT, PATTERNBDETECT, and XOROUT outputs share an optional output register enabled by CEP and reset by RSTP.]
P Port
Each DSP48E2 slice has a 48-bit output port P. This output can be connected (cascaded
connection) to the adjacent DSP48E2 slice internally through the PCOUT path. The PCOUT
connects to the input of the Z multiplexer (PCIN) in the adjacent DSP48E2 slice. This path
provides an output cascade stream between adjacent DSP48E2 slices.
The CARRYOUT signal is cascaded to the next adjacent DSP48E2 slice using the
CARRYCASCOUT port. Larger add, subtract, ACC, and MACC functions can be implemented
in the DSP48E2 slice using the CARRYCASCOUT output. The 1-bit CARRYCASCOUT signal
corresponds to CARRYOUT[3], but is not identical. The CARRYCASCOUT signal is also fed
back into the same DSP48E2 slice via the CARRYINSEL multiplexer.
The CARRYOUT[3] signal should be ignored when the multiplier or a 3-input (or 4-input)
add/subtract operation is used. Because a MACC operation includes a three-input adder in
the accumulator stage, the combination of MULTSIGNOUT and CARRYCASCOUT signals is
required to perform a 96-bit MACC, spanning two DSP48E2 slices. The second DSP48E2
slice’s OPMODE must be MACC_EXTEND (001001000) to use both CARRYCASCOUT and
MULTSIGNOUT, thereby eliminating the ternary adder carry restriction for the upper
DSP48E2 slice. The actual hardware implementation of CARRYOUT/CARRYCASCOUT and the
MULTSIGNOUT Logic
MULTSIGNOUT is a software abstraction of the hardware signal. It is modeled as the MSB of
the multiplier output and used only in MACC extension applications to build a 96-bit MACC.
The actual hardware implementation of MULTSIGNOUT is described in Chapter 5,
Cascading: CARRYOUT, CARRYCASCOUT, and MULTSIGNOUT.
The MSB of a multiplier output is cascaded to the next DSP48E2 slice using the MULTSIGNIN
signal and can be used only in MACC extension applications to build a 96-bit accumulator.
A mask field can also be used to hide certain bit locations in the pattern detector.
PATTERNDETECT computes ((P == pattern)||mask) on a bitwise basis and then ANDs the
results to a single output bit. Similarly, PATTERNBDETECT can detect if
((P == ~pattern)||mask). The pattern and the mask fields can each come from a distinct
48-bit configuration field or from the (registered) C input. When the C input is used as the
PATTERN, the OPMODE should be set to select a 0 at the input of the Z multiplexer. If all the
registers are reset, PATTERNDETECT is High for one clock cycle immediately after the RESET
is deasserted.
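The two detection expressions above translate directly into bitwise operations. The following Python sketch is a combinational model with hypothetical function names; it ignores the output register and the reset behavior.

```python
MASK48 = (1 << 48) - 1


def pattern_detect(p: int, pattern: int, mask: int) -> int:
    """1 when every unmasked bit of P equals the corresponding pattern bit."""
    per_bit = (~(p ^ pattern) | mask) & MASK48   # (P == pattern) || mask, bit by bit
    return int(per_bit == MASK48)                # AND-reduce to a single bit


def pattern_b_detect(p: int, pattern: int, mask: int) -> int:
    """1 when every unmasked bit of P equals the inverted pattern bit."""
    per_bit = ((p ^ pattern) | mask) & MASK48    # (P == ~pattern) || mask, bit by bit
    return int(per_bit == MASK48)


# Terminal-count style check: compare only the low 8 bits against 0xFF.
mask = MASK48 & ~0xFF
print(pattern_detect(0x1234FF, 0xFF, mask), pattern_detect(0x1234FE, 0xFF, mask))
```

A mask bit of 1 hides that bit position, so the example reports a match regardless of the upper 40 bits of P.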
The pattern detector allows the DSP48E2 slice to support convergent rounding and counter
auto reset when a count value has been reached as well as support overflow, underflow, and
saturation in accumulators.
Embedded Functions
The embedded functions include a pre-adder, 27 x 18 multiplier, adder/subtracter/logic
unit, and pattern detector logic (see Figure 2-12).
[Figure 2-12: DSP48E2 slice block diagram with the embedded functions highlighted. The signals marked with an asterisk are dedicated routing paths internal to the DSP48E2 column. They are not accessible via general-purpose routing resources.]
Pre-Adder
The DSP slice has a 27-bit pre-adder, which is inserted in the A or B register path (shown in
Figure 2-12 with an expanded view in Figure 2-5, page 23). With the pre-adder,
pre-additions or pre-subtractions are possible prior to feeding the multiplier. Since the
pre-adder does not contain saturation logic, designers should limit input operands to
26-bit (or 17-bit for the B path) two’s complement sign-extended data to avoid overflow or
underflow during arithmetic operations. Optionally, the pre-adder can be bypassed, making
D the new input path to the multiplier. When the D path is not used, the output of the A or
B pipeline can be negated prior to driving the multiplier. There are up to 15 operating
modes, including pre-adder squaring, making this pre-adder block very flexible.
In Equation 2-2, A (or B) and D are added initially through the pre-adder/subtracter. The
result of the pre-adder is then multiplied against B (or A), with the result of the
multiplication being added to the C input. This equation facilitates efficient symmetric
filters.
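The benefit for symmetric filters can be sketched in Python. The form (D + A) x B + C used below is taken from the prose description of the operation, not from the equation itself; the function names and sample data are hypothetical. Because a symmetric filter has equal coefficients at mirrored taps, the pre-adder sums the two mirrored samples first, so each coefficient needs only one multiply.

```python
def preadd_mult_add(a: int, d: int, b: int, c: int) -> int:
    """One slice operation as described in the text: (D + A) * B + C."""
    return (d + a) * b + c


def symmetric_fir(window: list, half_coeffs: list) -> int:
    """Even-length symmetric FIR output using one multiply per coefficient pair."""
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must hold two samples per coefficient")
    acc = 0
    for k, h in enumerate(half_coeffs):
        # Mirrored samples share the coefficient h, so they are pre-added.
        acc = preadd_mult_add(window[k], window[n - 1 - k], h, acc)
    return acc


samples = [3, -1, 4, 1, -5, 9]
half = [2, -7, 5]
print(symmetric_fir(samples, half))
```

The six-tap example above uses three multiplies instead of six.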
Figure 2-13 shows an optional pipeline register (MREG) for the output of the multiplier.
Using the register provides increased performance with an increase of one clock latency.
[Figure 2-13: Two's complement multiplier followed by the optional MREG. The inputs are A or AD and B or AD; the 90-bit output carries Partial Product 1 and Partial Product 2, each 45 bits wide.]
As with the input multiplexers, the OPMODE bits specify a portion of this function. The
symbol ± in the table means either add or subtract and is specified by the state of the
ALUMODE control signal. The symbol “:” in the table means concatenation. The outputs of
the X and Y multiplexer and CIN are always added together. Refer to ALUMODE Inputs,
page 30.
Table 2-10 lists the logic functions that can be implemented in the second stage of the four
input adder/subtracter/logic unit. The table also lists the settings of the control signals,
namely OPMODE and ALUMODE.
An XOR3 can be built by setting the OPMODE[3:2] to 11, selecting the C input at the Y
multiplexer output. The XOR3 is only valid for ALUMODE[3:0] = 0100, as shown in
Table 2-10.
Table 2-10: OPMODE and ALUMODE Control Bits Select Logic Unit Outputs

Logic Unit Mode        OPMODE[3:2]    ALUMODE[3:0]
X XOR Z                00             0100
X XNOR Z               00             0101
X XNOR Z               00             0110
X XOR Z                00             0111
X AND Z                00             1100
X AND (NOT Z)          00             1101
X NAND Z               00             1110
(NOT X) OR Z           00             1111
X XNOR Z               10             0100
X XOR Z                10             0101
X XOR Z                10             0110
X XNOR Z               10             0111
X OR Z                 10             1100
X OR (NOT Z)           10             1101
X NOR Z                10             1110
(NOT X) AND Z          10             1111
X XOR Y XOR Z (1)      11             0100

Notes:
1. Valid when Y multiplexer selects C input.
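The rows of Table 2-10 can be expressed as a lookup for quick experimentation. The following Python sketch is a behavioral illustration only; the names LOGIC_MODES and logic_unit are hypothetical, and the operands are treated as plain 48-bit integers.

```python
MASK48 = (1 << 48) - 1

# Rows of Table 2-10, keyed by (OPMODE[3:2], ALUMODE[3:0]).
# Each entry maps the X, C (Y multiplexer), and Z operands to the output.
LOGIC_MODES = {
    (0b00, 0b0100): lambda x, c, z: x ^ z,
    (0b00, 0b0101): lambda x, c, z: ~(x ^ z),
    (0b00, 0b0110): lambda x, c, z: ~(x ^ z),
    (0b00, 0b0111): lambda x, c, z: x ^ z,
    (0b00, 0b1100): lambda x, c, z: x & z,
    (0b00, 0b1101): lambda x, c, z: x & ~z,
    (0b00, 0b1110): lambda x, c, z: ~(x & z),
    (0b00, 0b1111): lambda x, c, z: ~x | z,
    (0b10, 0b0100): lambda x, c, z: ~(x ^ z),
    (0b10, 0b0101): lambda x, c, z: x ^ z,
    (0b10, 0b0110): lambda x, c, z: x ^ z,
    (0b10, 0b0111): lambda x, c, z: ~(x ^ z),
    (0b10, 0b1100): lambda x, c, z: x | z,
    (0b10, 0b1101): lambda x, c, z: x | ~z,
    (0b10, 0b1110): lambda x, c, z: ~(x | z),
    (0b10, 0b1111): lambda x, c, z: ~x & z,
    (0b11, 0b0100): lambda x, c, z: x ^ c ^ z,  # XOR3, Y multiplexer selects C
}


def logic_unit(opmode32, alumode, x, z, c=0):
    """Return the 48-bit logic unit output for one row of the table."""
    op = LOGIC_MODES[(opmode32, alumode)]
    return op(x & MASK48, c & MASK48, z & MASK48) & MASK48
```

For example, logic_unit(0b00, 0b1100, x, z) returns X AND Z, and any control combination that is not in the table raises a KeyError.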
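[Figure 2-14: The W, X, Y, and Z multiplexer outputs feed four 12-bit adder segments controlled by ALUMODE[3:0]. Each segment drives its own 12 bits of P and one CARRYOUT bit, for example P[47:36] with CARRYOUT[3], P[35:24] with CARRYOUT[2], and P[23:12] with CARRYOUT[1].]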
• Four segments of dual or ternary or quad adders with 12-bit inputs, a 12-bit output,
and a carry output for each segment
• Function controlled dynamically by ALUMODE[3:0], and operand source by
OPMODE[8:0]
• All four adder/subtracter/accumulators perform same function
• Two segments of dual or ternary or quad adders with 24-bit inputs, a 24-bit output, and
a carry output for each segment are also available (not pictured).
The SIMD feature, shown in Figure 2-14, allows the 48-bit logic unit to be split into multiple
smaller logic units. Each smaller logic unit performs the same function. This function can
also be changed dynamically through the ALUMODE[3:0] and OPMODE control inputs.
The pattern detector is best described as an equality check on the output of the
adder/subtracter/logic unit that produces its result on the same cycle as the P output. There
is no extra latency between the pattern detect output and the P output of the DSP48E2
slice. The use of the pattern detector leads to a moderate speed reduction due to the extra
logic on the pattern detect path (see Figure 2-15).
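[Figure 2-15: Pattern detector logic. The P output is compared against the pattern (PATTERN or the registered C input, chosen by SEL_PATTERN) under a mask chosen by SEL_MASK, which can also select C shifted by 1 (Mode 1) or C shifted by 2 (Mode 2). The outputs are PATTERNDETECT and PATTERNBDETECT; PATTERNDETECTPAST and PATTERNBDETECTPAST are internal signals.]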
If the pattern detector is not being employed, it can be used for other creative design
implementations. These include:
• Duplicating a pin (e.g., the sign bit) to reduce fanout and thus increase speed.
• Implementing a built-in inverter on one bit (e.g., the sign bit) without having to route
out to the CLBs.
• Checking for sticky bits in floating point, handling special cases, or monitoring the
DSP48E2 slice outputs.
A mask field can also be used to mask out certain bit locations in the pattern detector. The
pattern field and the mask field can each come from a distinct 48-bit memory cell field or
from the (registered) C input.
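As a rough illustration of the masked comparison, the following Python sketch models the two detect outputs. It assumes that a mask bit of 1 excludes that bit position from the comparison and that PATTERNBDETECT matches the bitwise complement of the pattern; both are assumptions that should be checked against the attribute descriptions. The function name is hypothetical.

```python
MASK48 = (1 << 48) - 1


def pattern_detect(p, pattern, mask):
    """Return (PATTERNDETECT, PATTERNBDETECT) for a 48-bit P value."""
    care = ~mask & MASK48                       # assumed: mask bit 1 = ignore
    detect = ((p ^ pattern) & care) == 0        # P matches the pattern
    detect_b = ((p ^ ~pattern) & care) == 0     # P matches NOT(pattern)
    return detect, detect_b
```

For instance, pattern_detect(0x12F, 0x120, 0xF) reports a match because the low four bits are masked out.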
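[Figure: Overflow/underflow logic. Overflow is derived from PATTERNDETECTPAST, PATTERNBDETECT, and PATTERNDETECT. Underflow is derived from PATTERNBDETECTPAST, PATTERNBDETECT, and PATTERNDETECT. PATTERNDETECTPAST and PATTERNBDETECTPAST are internal signals.]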
See Figure 2-17 and Figure 2-18 for overflow and underflow examples, respectively.
[Figure 2-17: Overflow example]
[Figure 2-18: Underflow example]
Overflow is caused by addition when the value at the output of the adder/subtracter/logic
unit goes over 3. Adding 1 to the final value of 0..0011 gives 0..0100 as the result, which
causes the PATTERNDETECT output to go to 0. When the PATTERNDETECT output goes from
1 to 0, an overflow is flagged.
Underflow is caused by subtraction when the value goes below –4. Subtracting 1 from
1..1100 (–4) yields 1..1011 (–5), which causes the PATTERNBDETECT output to go to 0. When
the PATTERNBDETECT output goes from 1 to 0, an underflow is flagged.
Overflow and underflow are relative to the previous value (positive or negative,
respectively). Overflow can result from subtraction when a value outside the valid range is
subtracted from a positive value. Similarly, underflow can result from addition when a value
outside the valid range is added to a negative value.
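The example above (valid range –4 to 3) can be replayed with a short Python sketch. It assumes an all-zero pattern with mask 0..0011, assumes that masked bits are ignored, and assumes that overflow and underflow are flagged when the previous detect output was 1 and both current detect outputs are 0. The function names are hypothetical, and register timing is not modeled.

```python
MASK48 = (1 << 48) - 1


def detect(p, pattern=0, mask=0b11):
    """Return (PATTERNDETECT, PATTERNBDETECT), ignoring the masked bits."""
    care = ~mask & MASK48
    pd = ((p ^ pattern) & care) == 0
    pbd = ((p ^ pattern ^ MASK48) & care) == 0
    return pd, pbd


def run_accumulator(start, steps):
    """Accumulate the steps and return an (overflow, underflow) pair per step."""
    p = start & MASK48
    pd_past, pbd_past = detect(p)
    events = []
    for step in steps:
        p = (p + step) & MASK48
        pd, pbd = detect(p)
        overflow = pd_past and not pd and not pbd
        underflow = pbd_past and not pd and not pbd
        events.append((overflow, underflow))
        pd_past, pbd_past = pd, pbd
    return events
```

Starting at 2 and adding 1 twice flags an overflow on the second step (3 to 4); starting at –3 and subtracting 1 twice flags an underflow on the second step (–4 to –5).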
Wide XOR
A new feature in the DSP48E2 slice is the ability to perform a 96-bit wide XOR function. The
XOR uses the X, Y, and Z multiplexers as inputs. The W multiplexer selects all 0s at its output.
The ALU logic is used for the first stage of the wide XOR by using the proper OPMODE and
ALUMODE signals as shown in Table 2-10, to implement either X XOR Z or X XOR Y XOR Z.
The signals then branch out to an XOR logic tree with dedicated outputs. Multiplexers allow
selection as eight 12-bit wide XOR, four 24-bit wide XOR, two 48-bit wide XOR, or one
96-bit wide XOR. See Figure 2-19. In Figure 2-19, the S[47:0] internal bus is not the P[47:0]
output; it is one of the 4:2 compressor buses.
[Figure 2-19: Wide XOR logic tree. The X, Y, and Z multiplexer outputs are combined in the ALU (controlled by ALUMODE[3:0]) to form the internal S[47:0] bus. Eight 6-bit groups, S[5:0] through S[47:42], feed XOR12A through XOR12H. Pairs of these feed XOR24A through XOR24D, then XOR48A and XOR48B, then XOR96. The results are selected onto XOROUT[7:0].]
The XORSIMD attribute is used to select the width of the XOR function as either 96 bits or
12, 24, or 48 bits, as shown in Table 2-11.
The dedicated XOR logic enables performance improvements when implementing forward
error correction and cyclic redundancy checking algorithms. There is also a USE_WIDEXOR
attribute to enable a power saving mode if the wide XOR function is not desired (see
Table 3-3, page 53).
The first level XOR can be either XOR2 or XOR3. In both cases, ALUMODE[3:0] = 0100 for
the XOR function in the ALU. When the Y multiplexer selects 0, an XOR2 is created. When
the Y multiplexer selects the C register, an XOR3 is created, supporting up to 48 XOR3 in the
ALU. The third input can come from the P output or the PCIN cascade, which provides
XOR-accumulate and cascade capability for even wider XOR functions.
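The two-level structure (bitwise XOR in the ALU, then a reduction tree) can be sketched in Python. This is an illustrative model under the assumption that each XOR12 node reduces six bits of the internal S bus; it returns the tree nodes by name and does not model how XORSIMD maps them onto XOROUT[7:0]. The function names are hypothetical.

```python
def parity(value):
    """Return the XOR of all bits of a non-negative integer."""
    return bin(value).count("1") & 1


def wide_xor(x, z, c=None):
    """Return (xor12, xor24, xor48, xor96) for 48-bit operands."""
    mask48 = (1 << 48) - 1
    # First level in the ALU: XOR2 when Y selects 0, XOR3 when Y selects C.
    s = (x ^ z) if c is None else (x ^ c ^ z)
    s &= mask48
    # Second level: reduction tree over 6-bit groups of S.
    xor12 = [parity((s >> (6 * i)) & 0x3F) for i in range(8)]
    xor24 = [xor12[2 * i] ^ xor12[2 * i + 1] for i in range(4)]
    xor48 = [xor24[0] ^ xor24[1], xor24[2] ^ xor24[3]]
    xor96 = xor48[0] ^ xor48[1]
    return xor12, xor24, xor48, xor96
```

In the XOR2 case, the xor96 result equals the parity of all 96 input bits of X and Z, which is the property a CRC or parity generator would rely on.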
Overview
Xilinx offers integrated DSP design flows tailored for the unique needs of hardware,
algorithm, and traditional processor-based DSP designers, supporting all mainstream DSP
design entry methods to ensure productivity.
Vivado™ Design Suite System Generator for DSP enables high-level model-based designs to
be created using MathWorks MATLAB and Simulink and provides automatic fixed or
floating-point hardware generation, co-simulation, and system integration into RTL or
embedded systems. See Vivado Design Suite Reference Guide: Model-Based DSP Design
Using System Generator (UG958) [Ref 4].
Vivado Design Suite includes an extensive library of device-optimized DSP IP that can be
used with RTL or with System Generator or Vivado HLS to quickly assemble DSP designs that
deliver high quality of results without requiring extensive FPGA design experience. DSP
algorithms implemented in RTL can be verified from within DSP specific simulation
environments such as MATLAB/Simulink or C/C++.
The DSP48E2 slices are inferred automatically from HDL code for most DSP functions and
many arithmetic functions when using synthesis tools (check the documentation for your
synthesis tools for details). Instantiation of the DSP48E2 primitive can be used to directly
access specific features and provide more advanced user control.
[Figure: DSP48E2 primitive ports.
Inputs: A[29:0], B[17:0], C[47:0], D[26:0], OPMODE[8:0], ALUMODE[3:0], CARRYIN, CARRYINSEL[2:0], INMODE[4:0], CLK.
Clock enables: CEA1, CEA2, CEB1, CEB2, CEC, CED, CEM, CEP, CEAD, CEALUMODE, CECTRL, CECARRYIN, CEINMODE.
Resets: RSTA, RSTB, RSTC, RSTD, RSTM, RSTP, RSTCTRL, RSTALLCARRYIN, RSTALUMODE, RSTINMODE.
Cascade inputs: ACIN[29:0], BCIN[17:0], PCIN[47:0], CARRYCASCIN, MULTSIGNIN.
Outputs: ACOUT[29:0], BCOUT[17:0], PCOUT[47:0], P[47:0], CARRYOUT[3:0], CARRYCASCOUT, MULTSIGNOUT, PATTERNDETECT, PATTERNBDETECT, OVERFLOW, UNDERFLOW, XOROUT[7:0].]
Overview
This chapter describes some design features and techniques to use to achieve higher
performance, lower power, and lower resources in a particular design.
IMPORTANT: If latency is important in the design and only one or two registers can be used within the
DSP48E2 slice, always use the M register.
[Figure 4-1: Traditional FIR filter with an adder tree. Eight 18-bit multipliers with coefficients h0(n) through h7(n) feed a tree of 48-bit adders with Z-2 delays, producing y(n-6). Figure note: The final stages of the post addition in logic are the performance bottlenecks that consume more power.]
In the traditional approach, the fabric adders are usually the performance bottleneck. The
number of adders needed and the associated routing depends on the size of the filter. The
depth of the adder tree scales as log2 of the number of taps in the filter. Using the adder
tree structure shown in Figure 4-1 could also increase the cost, logic resources, and power.
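As a quick way to estimate this scaling, the following Python sketch computes the depth and adder count of a balanced tree of two-input adders. The function name is hypothetical, and the model ignores pipelining and routing.

```python
def adder_tree_cost(taps):
    """Return (depth, adders) for a balanced tree of two-input adders."""
    depth = (taps - 1).bit_length()   # equals ceil(log2(taps)) for taps >= 1
    adders = taps - 1                 # each adder removes one operand
    return depth, adders
```

For the 8-tap filter of Figure 4-1 this gives a depth of three adder levels and seven adders; doubling the tap count adds one more level.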
The UltraScale™ architecture CLB allows the use of both the 6LUT and the carry chain in the
same slice to build an efficient ternary adder. The 6LUT in the CLB functions as a dual 5LUT.
The 5LUT is used as a 3:2 compressor to add three input values to produce two output
values. The 3:2 compressor is shown in Figure 4-2.
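The arithmetic behind the 3:2 compressor can be sketched in Python. This is a bitwise illustration for non-negative integers, not a description of the LUT equations; the function names are hypothetical, and the two compressor outputs are labeled abus and bbus after the bus names in the figures.

```python
def compress_3_2(x, y, z):
    """Reduce three operands to a sum word and a carry word, bit by bit."""
    abus = x ^ y ^ z                         # per-bit sum
    bbus = (x & y) | (x & z) | (y & z)       # per-bit carry (majority)
    return abus, bbus


def ternary_add(x, y, z):
    """Add three operands with one compressor stage and one two-input adder."""
    abus, bbus = compress_3_2(x, y, z)
    return abus + (bbus << 1)                # the carry word is shifted left by 1
```

Because the compressor has no carry propagation, only the final two-input addition needs the carry chain.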
[Figure 4-2: 3:2 compressor in CLB logic. Two LUT stages take X(i), Y(i), and Z(i) on inputs IN1 through IN6 with a SUB/ADDB control, produce ABUS(i) and BBUS(i) on the O6 and O5 outputs, and use the carry chain signals CY(i) to form the registered SUM(i) outputs.]
[Figure 4-3: 3:1 adder. The 46-bit inputs A, B, and C feed a 3:2 compressor whose ABUS output and BBUS output (left shift by 1) drive a 2-input cascade adder that produces the 48-bit SUM.]
The 3:1 adder shown in Figure 4-3 is used as a building block for larger adder trees.
Depending on the number of inputs to be added, a 5:3 or a 6:3 compressor is also built in
CLB logic using multiple 5LUTs or 6LUTs. The serial combination of a 6:3 compressor, along
with two DSP48E2 slices, adds six operands together to produce one output, as shown in
Figure 4-4. The LSBs of the first DSP48E2 slice that are left open due to the left shift of the
Y and Z buses should be tied to zero. The last DSP48E2 slice uses 2-deep A:B input registers
to align (pipeline matching) the X bus to the output of the first DSP48E2 slice. Multiple
levels of 6:3 compressors can be used to expand the number of input buses.
[Figure 4-4: Six-input adder. The 45-bit operands A through F feed a 6:3 compressor whose outputs X, Y (left shift by 1), and Z (left shift by 2) are added by two DSP48E2 slices to produce the 48-bit SUM.]
X(n) = A(n) XOR B(n) XOR C(n) XOR D(n) XOR E(n) XOR F(n)          Equation 4-1

Y(n) = A(n)B(n) XOR A(n)C(n) XOR A(n)D(n) XOR A(n)E(n) XOR A(n)F(n)
       XOR B(n)C(n) XOR B(n)D(n) XOR B(n)E(n) XOR B(n)F(n)
       XOR C(n)D(n) XOR C(n)E(n) XOR C(n)F(n)
       XOR D(n)E(n) XOR D(n)F(n) XOR E(n)F(n)                     Equation 4-2
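Equation 4-1 and Equation 4-2 can be checked with a short Python sketch for one bit position n. The X and Y terms follow the equations above. The Z term is not given in the equations above; the form used here (the XOR of all four-way products) is an assumption, chosen so that X + 2Y + 4Z equals the number of inputs that are 1. The function name is hypothetical.

```python
from itertools import combinations


def compress_6_3(bits):
    """Return (X, Y, Z) for six 0/1 inputs A..F at one bit position."""
    x = 0
    for b in bits:
        x ^= b                                   # Equation 4-1
    y = 0
    for p, q in combinations(bits, 2):
        y ^= p & q                               # Equation 4-2
    z = 0
    for quad in combinations(bits, 4):
        z ^= quad[0] & quad[1] & quad[2] & quad[3]   # assumed form of Z
    return x, y, z
```

Checking all 64 input combinations confirms that the three outputs, weighted by 1, 2, and 4, count the ones in the inputs, which is why Y and Z are shifted left by 1 and 2 before the final addition.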
The compressor elements and cascade adder can be arranged like a tree in order to build
larger adders. The last add stage should be implemented in the DSP48E2 slice. Pipeline
registers should be added as needed to meet timing requirements of the design. These
adders can have higher area and/or power than the adder cascade.
Adder Cascade
The adder cascade implementation accomplishes the post addition process with minimal
silicon resources by using the cascade path within the DSP48E2 slice. This involves
computing the additive result incrementally, utilizing a cascaded approach as illustrated in
Figure 4-5.
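The incremental accumulation can be sketched functionally in Python. This model is not cycle accurate: it ignores the pipeline registers and latency, and only shows each slice adding its product to the value arriving on the cascade path, with 48-bit wraparound. The function names are hypothetical.

```python
MASK48 = (1 << 48) - 1


def to_signed48(value):
    """Interpret a 48-bit word as a two's complement number."""
    value &= MASK48
    return value - (1 << 48) if value >> 47 else value


def cascade_fir(coeffs, samples):
    """Compute y[n] = sum(h[k] * x[n-k]) one slice at a time."""
    out = []
    for n in range(len(samples)):
        pcin = 0                                  # the first slice adds zero
        for k, h in enumerate(coeffs):
            x = samples[n - k] if n - k >= 0 else 0
            pcin = (pcin + h * x) & MASK48        # P = PCIN + h[k] * x[n-k]
        out.append(to_signed48(pcin))
    return out
```

Each pass through the inner loop corresponds to one DSP48E2 slice in the column, so no fabric adders are needed for the post addition.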
[Figure 4-5: Adder cascade of eight DSP48E2 slices (Slice 1 through Slice 8). Each slice multiplies an 18-bit coefficient, h0(n) through h7(n-7), by an 18-bit sample and adds the 48-bit cascade input with no wire shift. Slice 1 adds zero, with its product sign extended from 36 bits to 48 bits, and Slice 8 outputs Y(n–10). Figure note: The post adders are contained entirely in dedicated silicon for highest performance and lowest power.]
IMPORTANT: The height of the DSP column can differ between devices and should be considered while
porting designs.
Spanning columns is possible by taking the bus output from the top of one DSP column and
adding CLB slice pipeline registers to route this bus to the C input of the bottom DSP48E2
slice of the adjacent DSP column. Alignment of input operands is also necessary to span
multiple DSP columns.
These time-multiplexed DSP designs have optional pipelining that permits aggregate
multichannel sample rates of up to 500 million samples per second. Implementing a
time-multiplexed design using the DSP48E2 slice results in reduced resource utilization and
reduced power.
The DSP48E2 slice contains the basic elements of classic FIR filters: a multiplier followed by
an adder, delay or pipeline registers, and the ability to cascade an input stream (B bus) and
an output stream (P bus) without exiting to a general CLB slice.
Figure 4-6 illustrates how the pre-adder (shaded in gray) can be used in an 8-tap even
symmetric systolic FIR design.
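The multiplier saving from the pre-adder can be illustrated with a functional Python sketch. It is not cycle accurate and does not model the systolic delays; it only shows that adding the two samples that share a coefficient before multiplying gives the same result as the full 8-tap filter with half the multiplications. The function names are hypothetical.

```python
def fir_direct(h, x):
    """Reference FIR: one multiply per tap."""
    return [sum(hk * (x[n - k] if n - k >= 0 else 0) for k, hk in enumerate(h))
            for n in range(len(x))]


def fir_symmetric(half, x):
    """Even symmetric FIR: pre-add the paired samples, then multiply once."""
    taps = 2 * len(half)

    def sample(i):
        return x[i] if i >= 0 else 0

    out = []
    for n in range(len(x)):
        acc = 0
        for k, hk in enumerate(half):
            pre = sample(n - k) + sample(n - (taps - 1 - k))   # pre-adder
            acc += hk * pre                                    # one multiply per pair
        out.append(acc)
    return out
```

With half = [h0, h1, h2, h3], the symmetric version uses four multiplications per output sample instead of eight.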
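[Figure 4-6: 8-tap even symmetric systolic FIR. The input x(n) passes through z-2 delays into four pre-adders, each followed by a multiplier with coefficient h0 through h3 and a chain of registered (z-1) post-adders that produces y(n-8).]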
Overview
This chapter describes the dedicated cascade features and clarifies key details of cascade
signals.
CARRYOUT/CARRYCASCOUT
The DSP48E2 slice and the fabric carry chain use a different implementation style for
subtract functions. The carry chain implementation in the CLB slices requires the CLB carry
input pin to be connected to a constant 1 during subtract operations. The standard subtract
operation with ALUMODE = 0011 in the DSP48E2 slice does not require the CARRYIN pin to
be set to 1.
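The difference between the two subtract styles can be sketched in Python. The carry-chain form follows the description above. The DSP form is an assumption based on the ALUMODE controls: it models ALUMODE 0011 as inverting Z before the add and inverting the result afterward, which yields Z – (X + Y + CIN) without consuming the carry input. The function names are hypothetical.

```python
MASK48 = (1 << 48) - 1


def clb_subtract(a, b):
    """Carry-chain style: A - B = A + NOT(B) + 1, so the carry input is tied to 1."""
    return (a + (~b & MASK48) + 1) & MASK48


def dsp_subtract(z, x, y=0, cin=0):
    """Assumed ALUMODE 0011 style: NOT(NOT(Z) + X + Y + CIN) = Z - (X + Y + CIN)."""
    inner = ((~z & MASK48) + x + y + cin) & MASK48
    return ~inner & MASK48
```

Both functions return 7 for 10 minus 3, but only the DSP form leaves the carry input free, so dsp_subtract(10, 3, cin=1) can subtract an additional 1.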
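[Figure: CLB carry-chain adder computing A+B, with a carry input.]
[Figure: CLB carry-chain adder/subtracter computing A±B (Sub/Add = 1/0). Carry input must be 1 for a subtract operation, so it is not available for other uses.]
[Figure: Adder/subtracter comparison showing a fabric adder with inputs A and B and a Sub/add control, and the DSP adder with inputs Y and Z, CIN, and the ALUMODE[0] and ALUMODE[1] controls.]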
• This inversion of the P output obtained by using ALUMODE 0010 can be cascaded to
another DSP slice to implement a two’s complement subtract.
IMPORTANT: CARRYOUT[3] and CARRYCASCOUT are not valid for three-input and four-input add/sub
functions.
All DSP four-input add operations (including Multiply-Add and Multiply Accumulate)
produce two CARRYOUT bits for retaining full precision. This is shown in Figure 5-4.
MULTSIGNOUT and CARRYCASCOUT serve as the two carry bits for MACC_EXTEND
operations. If MULTSIGNOUT is the multiplier sign bit and CARRYCASCOUT is the cascaded
carryout bit, the result is the software/Unisim model abstraction, shown in Figure 5-4.
It is also necessary to set the OPMODEREG and CARRYINSELREG to 1 when building large
accumulators such as the 96-bit Multiply Accumulate. This prevents the simulation model
from propagating unknowns to the upper DSP48E2 slice when a reset occurs.
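The idea of extending precision across two slices can be sketched in Python for the simplest case, a two-input 96-bit add. This sketch does not model MACC_EXTEND, which also uses MULTSIGNOUT; it only shows a single carry bit passing from the lower 48-bit slice to the upper one. The function name is hypothetical.

```python
MASK48 = (1 << 48) - 1


def add96_two_slices(a, b):
    """Add two 96-bit values as a lower and an upper 48-bit addition."""
    lo = (a & MASK48) + (b & MASK48)
    carry = lo >> 48          # plays the role of the cascaded carry bit
    hi = ((a >> 48) & MASK48) + ((b >> 48) & MASK48) + carry
    return ((hi & MASK48) << 48) | (lo & MASK48)
```

The result matches a plain 96-bit addition with wraparound, for example add96_two_slices((1 << 48) - 1, 1) returns 1 << 48.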
Summary
Adder/Subtracter-only Operation
CARRYOUT[3]: Hardware and software match.
CARRYCASCOUT: Hardware and software match when ALUMODE = 0000, 0001, and 0010,
and inverted when ALUMODE = 0011. The mismatch happens because the DSP48E2 slice
performs the subtract operation using a different algorithm from the CLB logic; thus, the
DSP48E2 slice requires an inverted CARRYOUT from the CLB logic.
MACC Operation
CARRYOUT[3] is invalid in the MACC operation.
[Figure 5-4: Software model and hardware implementation of the MACC. In both views, A and B feed the multiplier, and the product is added to the Zmux selection (e.g., C, P, PCIN) and CARRYIN to produce P[47:0], MULTSIGNOUT, and CARRYCASCOUT. Figure note: Partial products from the multiply operation are added together in the second stage four-input adder.]
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
Support.
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual
property at all stages of the design cycle. Topics include design assistance, advisories, and
troubleshooting tips.
• From the Vivado® IDE, select Help > Documentation and Tutorials.
• On Windows, select Start > All Programs > Xilinx Design Tools > DocNav.
• At the Linux command prompt, enter docnav.
Xilinx Design Hubs provide links to documentation organized by design tasks and other
topics, which you can use to learn key concepts and address frequently asked questions. To
access the Design Hubs:
• In the Xilinx Documentation Navigator, click the Design Hubs View tab.
• On the Xilinx website, see the Design Hubs page.
Note: For more information on Documentation Navigator, see the Documentation Navigator page
on the Xilinx website.
References
1. UltraScale Architecture Migration Methodology Guide (UG1026)
2. UltraScale and UltraScale+ device data sheets: