Ug579 Ultrascale DSP
DSP Slice
User Guide
       Chapter 1: Overview
           Introduction to UltraScale Architecture
           UltraScale Architecture DSP Slice Overview
           Differences from Previous Generations
           Device Resources
           Recommended Design Flow
Overview
       Virtex® UltraScale+™ devices provide the highest performance and integration capabilities
       in a FinFET node, including both the highest serial I/O and signal processing bandwidth, as
       well as the highest on-chip memory density. As the industry's most capable FPGA family,
       the Virtex UltraScale+ devices are ideal for applications including 1+Tb/s networking and
       data center and fully integrated radar/early-warning systems.
       Virtex UltraScale devices provide the greatest performance and integration at 20 nm,
       including serial I/O bandwidth and logic capacity. As the industry's only high-end FPGA at
       the 20 nm process node, this family is ideal for applications including 400G networking,
       large scale ASIC prototyping, and emulation.
       Zynq® UltraScale+ MPSoC devices provide 64-bit processor scalability while combining
       real-time control with soft and hard engines for graphics, video, waveform, and packet
       processing. Integrating an Arm®-based system for advanced analytics and on-chip
       programmable logic for task acceleration creates unlimited possibilities for applications
       including 5G Wireless, next generation ADAS, and Industrial Internet-of-Things.
       This user guide describes the UltraScale architecture DSP Slice resources and is part of the
       UltraScale architecture documentation suite available at: www.xilinx.com/documentation.
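       [Figure: simplified DSP48E2 slice block diagram. Inputs A, B, C, and D feed a pre-adder,
       a 27 x 18 multiplier, and an ALU (add, subtract, XOR) followed by a pattern detector.
       Outputs are P and Pattern Detect.]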
       The DSP48E2 slice supports both sequential and cascaded operations due to the dynamic
       OPMODE and cascade capabilities. Applications of the DSP slice include:
           °   Support adding two other input operands with the multiplier’s two partial products,
               instead of only one in DSP48E1
           °   Add a memory-cell based rounding constant while freeing the C input for the
               following function: A x B + C + RND
           °   WMUX provides another accumulator feedback path to reduce the size of the
               complex multiply-accumulate (MACC) or a semi-parallel FIR filter.
       •   Wide XOR of X, Y, Z multiplexers
           °   48 3-bit XOR at first level feeds XOR tree to create octal 12-bit XOR, quad 24-bit
               XOR, dual 48-bit XOR, single 96-bit XOR
           °   Cascading two DSP48E2 slices in Wide XOR mode creates octal 24-bit XOR, quad
               48-bit XOR, dual 96-bit XOR, or single 192-bit XOR. Cascade depth is limited to DSP
               column size
           °   Sequentially create wider XOR via XOR accumulation feedback with a single
               DSP48E2, extending the XOR width by 96 bits every clock cycle
       •   Unique features in DSP48E2:
       The DSP48E2 blocks use a signed arithmetic implementation. To best match the resource
       capabilities and, in general, to get the most efficient mapping, write code using signed
       values in the HDL source. Designs created for the 25 x 18 multiplier in the 7 series FPGAs
       may need to be sign-extended for the 27 x 18 multiplier in the UltraScale architecture. For
       more details on migration and design methodologies, see UltraScale Architecture Migration
       Methodology Guide (UG1026) [Ref 1]. When migrating designs with many cascaded DSP
       slices, the number of DSP slices per column in the new target device should be taken into
       consideration.
       Device Resources
       The DSP resources are optimized and scalable across the UltraScale portfolio, providing a
       common architecture that improves implementation efficiency, IP implementation, and
       design migration. Migration between UltraScale families does not require any design
       changes for the DSP48E2 slice.
       Two DSP48E2 slices with a dedicated interconnect form each DSP tile (see Figure 1-2). The
       DSP tiles stack vertically in a DSP48E2 column. The height of a DSP tile is the same as five
       configurable logic blocks (CLBs) and also matches the height of one 36K block RAM. The
       block RAM can be split into two 18K block RAMs. Each DSP48E2 slice aligns horizontally
       with an 18K block RAM, providing optimal connectivity between resources.
       [Figure 1-2: DSP48E2 slices aligned with 18K block RAMs, separated by CLBs and interconnect]
       Note: The maximum might be less in different DSP columns (if PS limits the DSP cascade height) or
       SLRs (SLR with HBM interface vs. no HBM interface).
       Table 1-1 shows the maximum number of DSP48E2 slices that can be directly cascaded
       vertically in a column, and the total number of DSP48E2 slices, for the UltraScale FPGAs.
        Notes:
        1. KU085 max cascade is 96 in SLR1.
        2. VU160 max cascade is 96 in SLR0.
Table 1-2 shows the same information for the UltraScale+ FPGAs.
For more information on design techniques, see Chapter 4, DSP48E2 Usage Guidelines.
       Pinout Planning
       DSP usage has little effect on pinouts because DSP48E2 slices are distributed throughout
       the device. The best approach is to let the tools choose the DSP48E2 and I/O locations
       based on the implementation requirements. Results can be adjusted if necessary for board
       layout considerations. The timing constraints should be set so that the tools can choose
       optimal placement to meet the specific design requirements. The only directional
       consideration in the DSP structure is that DSP48E2 slices cascade vertically up a column,
       allowing wide buses to drive a vertical orientation to other logic, including I/O. The I/O
       columns typically provide 13 I/O in the same vertical space as every six DSP48E2 slices, with
       every clock region defined in height by a bank of 52 I/O and 12 DSP tiles (24 DSP slices).
DSP48E2 Functionality
                               Overview
                               This chapter provides technical details of the DSP48E2 element. The DSP48E2 slice consists
                               of a 27-bit pre-adder, 27 x 18 multiplier and a flexible 48-bit ALU that serves as a
                               post-adder/subtracter, accumulator, or logic unit (see Figure 2-1).
       [Figure 2-1: DSP48E2 slice block diagram]
       Note: The cascade signals marked with an asterisk in Figure 2-1 (ACIN, BCIN, PCIN,
       CARRYCASCIN, MULTSIGNIN, ACOUT, BCOUT, PCOUT, CARRYCASCOUT, and MULTSIGNOUT) are
       dedicated routing paths internal to the DSP48E2 column. They are not accessible via
       general-purpose routing resources.
       The DSP48E2 slice supports functions including:
       •   Multiply
       •   Multiply accumulate (MACC)
       •   Multiply add
       •   Four-input add
       •   Barrel shift
       •   Wide-bus multiplexing
       •   Magnitude comparator
       •   Bitwise logic functions
       •   Wide XOR
       •   Pattern detect
       •   Wide counter
       The architecture also supports cascading multiple DSP48E2 slices to form wide math
       functions, DSP filters, and complex arithmetic without the use of general logic.
       DSP48E2 Features
       The features in the DSP48E2 slice are:
           °   Bitwise logic operations: two-input AND, OR, NOT, NAND, NOR, XOR, and XNOR
           °   Overflow/underflow support
           °   Terminal count detection support and auto resetting: auto resetting can give
               priority to clock enable
       •   Cascading 48-bit P bus supports internal low-power adder cascade: 48-bit P bus allows
           for 12-bit quad or 24-bit dual SIMD adder cascade support
       •   Optional 17-bit right shift to enable wider multiplier implementation
       •   Dynamic user-controlled operating modes:
           °   5-bit INMODE control bus provides selects for 2-deep A and B registers, pre-adder
               add-sub control as well as mask gates for pre-adder multiplexer functions
           °   4-bit ALUMODE control bus selects logic unit function and accumulator add-sub
               control
       •   Carry in for the second stage adder:
       The DSP slice consists of a multiplier followed by an accumulator. At least three pipeline
       registers are required for both multiply and multiply-accumulate operations to run at full
       speed. The multiply operation in the first stage generates two partial products that need to
       be added together in the second stage.
       When only one or two registers exist in the multiplier design, the M register should always
       be used to save power and improve performance.
       Add/Sub and Logic Unit operations require at least two pipeline registers (input, output) to
       run at full speed.
       The cascade capabilities of the DSP slice are extremely efficient at implementing
       high-speed pipelined filters built on the adder cascades instead of adder trees.
       Multiplexers are controlled with dynamic control signals, such as OPMODE, ALUMODE, and
       CARRYINSEL, enabling a great deal of flexibility. Designs using registers and dynamic
       opmodes are better equipped to take advantage of the DSP slice’s capabilities than
       combinatorial multiplies.
       In general, the DSP slice supports both sequential and cascaded operations due to the
       dynamic OPMODE and cascade capabilities. Applications of the DSP slice include Fast
       Fourier Transforms (FFTs), floating-point computation (multiply, add/sub, divide),
       counters, and large bus multiplexers.
       Additional capabilities of the DSP slice include synchronous resets and clock enables, dual
       A input pipeline registers, pattern detection, Logic Unit functionality, single
       instruction/multiple data (SIMD) functionality, and MACC and Add-Acc extension to 96 bits.
       The DSP slice supports convergent and symmetric rounding, terminal count detection and
       auto-resetting for counters, and overflow/underflow detection for sequential accumulators.
       A 96-bit wide XOR function can be implemented as eight 12-bit wide XOR, four 24-bit wide
       XOR, or two 48-bit wide XOR.
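
The XOR tree described above can be sketched in Python as a behavioral model. This is an illustration rather than the hardware's exact bit mapping: the function name `wide_xor`, the returned dictionary keys, and the grouping of result bits into contiguous 6-bit fields are assumptions made for the example.

```python
MASK48 = (1 << 48) - 1

def parity(value: int) -> int:
    """Return the XOR of all bits of a non-negative integer."""
    return bin(value).count("1") & 1

def wide_xor(x: int, y: int, z: int = 0) -> dict:
    """Behavioral sketch of the wide XOR of the X, Y, and Z multiplexer outputs."""
    # First level: 48 three-input XORs, one per bit position.
    p = (x ^ y ^ z) & MASK48
    # Assumed grouping: each contiguous 6-bit field of p covers 12 bits of x and y.
    xor12 = [parity((p >> (6 * i)) & 0x3F) for i in range(8)]
    xor24 = [xor12[2 * i] ^ xor12[2 * i + 1] for i in range(4)]
    xor48 = [xor24[0] ^ xor24[1], xor24[2] ^ xor24[3]]
    xor96 = xor48[0] ^ xor48[1]
    return {"p": p, "xor12": xor12, "xor24": xor24, "xor48": xor48, "xor96": xor96}
```

Feeding the `p` value of one call back in as `z` on the next call models the XOR accumulation feedback, so each call folds another 96 input bits into the running result.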
       [Figure 2-2: Hierarchical View of the DSP48E2 Slice Input Registers and Pre-adder]
       Each DSP48E2 slice has a two-input multiplier followed by multiplexers and a four-input
       adder/subtracter/accumulator. The DSP48E2 multiplier has asymmetric inputs and accepts
       an 18-bit two’s complement operand and a 27-bit two’s complement operand. The
       multiplier stage produces a 45-bit two’s complement result in the form of two partial
       products. These partial products are sign-extended to 48 bits in the X multiplexer and
       Y multiplexer and fed into the four-input adder for final summation. This results in a 45-bit
       multiplication output, which has been sign-extended to 48 bits. Therefore, when the
       multiplier is used, the adder effectively becomes a three-input adder.
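
As a rough illustration of the widths involved, the following Python sketch models only the final summed product, not the two partial products. The helper names `to_signed` and `multiply_27x18` are hypothetical and chosen for this example.

```python
def to_signed(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value

def multiply_27x18(a: int, b: int) -> int:
    """Return the 48-bit two's complement bit pattern of a 27 x 18 signed product."""
    product = to_signed(a, 27) * to_signed(b, 18)
    # A 27-bit by 18-bit signed product always fits in 45 bits.
    assert -(1 << 44) <= product < (1 << 44)
    # Masking a Python integer to 48 bits yields the sign-extended pattern.
    return product & ((1 << 48) - 1)
```

For example, multiplying the most negative 27-bit value by the most negative 18-bit value gives 2^43, the largest magnitude the multiplier can produce, which still leaves the top bits of the 48-bit word as sign extension.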
       The second stage adder/subtracter accepts four 48-bit, two’s complement operands and
       produces a 48-bit, two’s complement result when the multiplier is bypassed by setting
       the USE_MULT attribute to NONE and with the appropriate OPMODE setting. In SIMD mode,
       the 48-bit adder/subtracter also supports dual 24-bit or quad 12-bit SIMD arithmetic
       operations with CARRYOUT bits. In this configuration, bitwise logic operations on two
       48-bit binary numbers (and three 48-bit binary numbers in the special XOR3 case) are also
       supported with dynamic ALUMODE control signals.
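
A minimal Python sketch of the SIMD idea follows, assuming only two operands and unsigned per-lane carries. The function name `simd_add` and the order of the returned carry list are assumptions for illustration; the mapping onto the physical CARRYOUT bits is not modeled.

```python
def simd_add(x: int, y: int, lanes: int = 4):
    """Add two 48-bit words as independent lanes (1 x 48, 2 x 24, or 4 x 12 bits).

    Returns the packed 48-bit result and a list of per-lane carry-out bits,
    lowest lane first.
    """
    if lanes not in (1, 2, 4):
        raise ValueError("lanes must be 1, 2, or 4")
    width = 48 // lanes
    lane_mask = (1 << width) - 1
    result = 0
    carries = []
    for i in range(lanes):
        total = ((x >> (i * width)) & lane_mask) + ((y >> (i * width)) & lane_mask)
        result |= (total & lane_mask) << (i * width)  # carry does not cross lanes
        carries.append(total >> width)
    return result, carries
```

The same inputs give different results depending on the lane count: `simd_add(0xFFF, 1, 4)` wraps within the lowest 12-bit lane and reports a carry, while `simd_add(0xFFF, 1, 1)` propagates into bit 12.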
       Higher level DSP functions are supported by cascading individual DSP48E2 slices in a
       DSP48E2 column. Two datapaths (ACOUT and BCOUT) and the DSP48E2 slice outputs
       (PCOUT, MULTSIGNOUT, and CARRYCASCOUT) provide the cascade capability. The ability to
       cascade datapaths is useful in filter designs. For example, a Finite Impulse Response (FIR)
       filter design can use the cascading inputs to arrange a series of input data samples and the
       cascading outputs to arrange a series of partial output results. The ability to cascade
       provides a high-performance and low-power implementation of DSP filter functions
       because the general routing in the fabric is not used.
       The C input allows the formation of many 3-input mathematical functions, such as 3-input
       addition or 2-input multiplication with an addition. One subset of this function is the
       valuable support of symmetrically rounding a multiplication toward zero or toward infinity.
       The C input together with the pattern detector also supports convergent rounding.
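
One common way to realize symmetric rounding with an added constant is sketched below in Python. It assumes the constant on C is binary 0.0111...1 (just under one half) and that a carry-in derived from the sign of the product selects the rounding direction. Both choices, and the function name `round_symmetric`, are assumptions for illustration and are not stated in this section.

```python
def round_symmetric(value: int, frac_bits: int, toward_zero: bool) -> int:
    """Round a signed fixed-point value to an integer, breaking ties symmetrically.

    value has `frac_bits` fractional bits. Ties (exactly one half) go toward
    zero or away from zero (toward infinity) depending on `toward_zero`.
    """
    c_constant = (1 << (frac_bits - 1)) - 1      # assumed C input: 0.0111...1
    negative = value < 0
    # Assumed carry-in: 1 for negative values when rounding toward zero,
    # 1 for non-negative values when rounding toward infinity.
    carry_in = int(negative) if toward_zero else int(not negative)
    # Python's >> on integers is an arithmetic shift, which truncates downward.
    return (value + c_constant + carry_in) >> frac_bits
```

With four fractional bits, 2.5 is encoded as 40 and -2.5 as -40. Rounding toward zero gives 2 and -2, and rounding toward infinity gives 3 and -3.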
       For multi-precision arithmetic, the DSP48E2 slice provides a right wire shift by 17. Thus, a
       partial product from one DSP48E2 slice can be right justified and added to the next partial
       product computed in an adjacent DSP48E2 slice. Using this technique, the DSP48E2 slices
       can be used to build bigger multipliers.
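
The shift-and-add technique can be illustrated with a 27 x 35 signed multiply split across two multiplications. This Python sketch is arithmetic only: the function name `multiply_27x35` and the split of the second operand into a 17-bit unsigned lower part and an 18-bit signed upper part are assumptions chosen for the example.

```python
def multiply_27x35(a: int, b: int) -> int:
    """Multiply signed a (27-bit range) by signed b (35-bit range) in two steps."""
    b_low = b & 0x1FFFF      # lower 17 bits, unsigned, fits an 18-bit signed input
    b_high = b >> 17         # upper 18 bits, signed (arithmetic shift)
    low = a * b_low          # first slice: lower partial product
    # Second slice: upper partial product plus the lower one shifted right by 17.
    high = a * b_high + (low >> 17)
    # The low 17 bits of the first product pass straight through to the result.
    return (high << 17) | (low & 0x1FFFF)
```

The result equals `a * b` because the bits shifted out of the lower partial product are exactly the low 17 bits of the final answer.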
       The pattern detector at the output of the DSP48E2 slice provides support for convergent
       rounding, overflow/underflow, block floating point, and support for accumulator terminal
       count (counter auto reset). The pattern detector can detect if the output of the DSP48E2
       slice matches a pattern, as qualified by a mask.
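
A masked comparison of this kind can be sketched in Python as follows. The sketch assumes that a mask bit of 1 excludes that bit position from the comparison; the function name `pattern_detect` is hypothetical.

```python
MASK48 = (1 << 48) - 1

def pattern_detect(p: int, pattern: int, mask: int) -> bool:
    """Return True when p equals pattern on every bit where mask is 0.

    Assumption: a mask bit of 1 means the corresponding bit is ignored.
    """
    mismatches = (p ^ pattern) & ~mask & MASK48
    return mismatches == 0
```

For a terminal-count check, the pattern would be the terminal value with an all-zero mask. For a rounding or overflow check, the mask would hide the bits that are not of interest.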
       The data and control inputs to the DSP48E2 slice feed the arithmetic and logic stages. The
       A and B data inputs can optionally be registered one or two times to assist the construction
       of different, highly pipelined, DSP application solutions. The D path and the AD path can
       each be registered once. The other data inputs and the control inputs can be optionally
       registered once. Maximum frequency operation as specified in the UltraScale and
       UltraScale+ device data sheets [Ref 2] is achieved by using pipeline registers.
       In its most basic form, the output of the adder/subtracter/logic unit is a function of its
       inputs. The inputs are driven by the upstream multiplexers, carry select logic, and multiplier
       array.
       A typical use of the slice is where A and B inputs are multiplied and the result is added to or
       subtracted from the C register. More detailed operations based on control and data inputs
       are described in later sections. Selecting the multiplier function consumes both X and Y
       multiplexer outputs to feed the adder. The two 45-bit partial products from the multiplier
       are sign extended to 48 bits before being sent to the adder/subtracter.
       When not using the first stage multiplier, the 48-bit, dual input, bit-wise logic function
       implements AND, OR, NOT, NAND, NOR, XOR, and XNOR. The inputs to these functions are:
       Since PCIN is a cascade input from a lower DSP slice, creating even wider logic operations
       is feasible via this cascade path. A 48-bit, triple input, bit-wise XOR3 logic operation is
       supported when the Y multiplexer selects the C input and ALUMODE[3:0] = 0100.
       The output of the adder/subtracter or logic unit feeds the pattern detector logic. The
       pattern detector allows the DSP48E2 slice to support Convergent Rounding, Counter
       Figure 2-3 shows the DSP48E2 slice in a very simplified form. The nine OPMODE bits control
       the selects of W, X, Y, and Z multiplexers, feeding the inputs to the adder/subtracter or logic
       unit. In all cases, the 45-bit partial product data from the multiplier to the X and Y
       multiplexers is sign extended, forming 48-bit input datapaths to the adder/subtracter.
       Based on 45-bit operands and a 48-bit accumulator output, the number of guard bits (i.e.,
       bits available to guard against overflow) is 3. To extend the number of MACC operations,
       the MACC_EXTEND feature should be used, which allows the MACC to extend to 96 bits with
       two DSP48E2 slices. If A is limited to 18 bits (sign-extended to 27), then there are 12 guard
       bits for the MACC. The CARRYOUT bits are invalid during multiply operations. Combinations
       of OPMODE, ALUMODE, CARRYINSEL, and CARRYIN control the function of the
       adder/subtracter or logic unit.
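
The guard-bit arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the accumulation count is a conservative bound that assumes every product could take its worst-case magnitude.

```python
def guard_bits(a_width: int, b_width: int, acc_width: int = 48) -> int:
    """Bits left in the accumulator above a full-width signed product."""
    return acc_width - (a_width + b_width)

def safe_accumulations(a_width: int, b_width: int, acc_width: int = 48) -> int:
    """Conservative number of products that can be summed without overflow."""
    return 1 << guard_bits(a_width, b_width, acc_width)

print(guard_bits(27, 18), safe_accumulations(27, 18))
print(guard_bits(18, 18), safe_accumulations(18, 18))
```

With full 27 x 18 operands the bound is small, which is why limiting the A operand to 18 bits or using the 96-bit extension matters for long accumulations.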
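       [Figure 2-3: simplified DSP48E2 slice. The W, X, Y, and Z multiplexers select among RND,
       P, A:B, the multiplier output of B and the pre-adder (A, D), C, PCIN, all 1s, all 0s, and
       the shifters. OPMODE, CARRYINSEL, and ALUMODE control the behavior that produces P.]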
                             Input Ports
                             This section describes the input ports of the DSP48E2 slice in detail. The input ports of the
                             DSP48E2 slice are highlighted in Figure 2-4.
                             [Figure 2-4: DSP48E2 slice block diagram with the input ports highlighted:
                             A (30 bits), B (18 bits), C (48 bits), D (27 bits), INMODE (5 bits),
                             OPMODE (9 bits), ALUMODE (4 bits), CARRYINSEL (3 bits), CARRYIN, and the
                             cascade inputs ACIN*, BCIN*, PCIN*, CARRYCASCIN*, and MULTSIGNIN*.]
                             *These signals are dedicated routing paths internal to the DSP48E2 column.
                             They are not accessible via general-purpose routing resources.
                             A, B, C, and D Ports
                             The DSP48E2 slice input data ports support many common DSP and math algorithms. The
                             DSP48E2 slice has four direct input data ports labeled A, B, C, and D. The A data port is
                             30 bits wide, the B data port is 18 bits wide, the C data port is 48 bits wide, and the
                             pre-adder D data port is 27 bits wide.
                             The 27-bit A (A[26:0]) and 18-bit B ports supply input data to the 27-bit by 18-bit, two’s
                             complement multiplier. With the independent C port, each DSP48E2 slice is capable of
                             Multiply-Add, Multiply-Subtract, and Multiply-Round operations.
                             Concatenated A and B ports (A:B) bypass the multiplier and feed the X multiplexer input.
                             The 30-bit A input port forms the upper 30 bits of A:B concatenated datapath, and the
                             18-bit B input port forms the lower 18 bits of the A:B datapath. The A:B datapath, together
                             with the C input port, enables each DSP48E2 slice to implement a full 48-bit
                             adder/subtracter.
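The A:B concatenation described above can be sketched as bit manipulation. This is an illustrative model only; the helper names `concat_ab` and `add48` are hypothetical, and the 48-bit wrap-around add stands in for the two-operand use of the adder with A:B and C.

```python
A_BITS, B_BITS = 30, 18
MASK48 = (1 << 48) - 1


def concat_ab(a: int, b: int) -> int:
    """Form the 48-bit A:B value: A occupies bits [47:18], B occupies bits [17:0]."""
    a &= (1 << A_BITS) - 1  # keep only the 30 bits of the A port
    b &= (1 << B_BITS) - 1  # keep only the 18 bits of the B port
    return (a << B_BITS) | b


def add48(ab: int, c: int) -> int:
    """Add the A:B datapath and the C port, wrapping at 48 bits."""
    return (ab + c) & MASK48


if __name__ == "__main__":
    ab = concat_ab(0x1, 0x2)
    print(hex(ab))
    print(hex(add48(ab, 0x10)))
```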
       Each DSP48E2 slice also has two cascaded input datapaths (ACIN and BCIN), providing a
       cascaded input stream between adjacent DSP48E2 slices. The cascaded path is 30 bits wide
       for the A input and 18 bits wide for the B input. Applications benefiting from this feature
       include FIR filters, complex multiplication, multi-precision multiplication and complex
       MACCs.
       Each of the A and B input ports and the ACIN and BCIN cascade ports can have 0, 1, or 2
       pipeline stages in its datapath. The dual A, D, and pre-adder port logic is shown in
       Figure 2-5. The dual B register port logic is shown in Figure 2-6. The number of pipeline
       stages is set using attributes. Attributes AREG and BREG select the number of pipeline
       stages for the A and B direct inputs to the X multiplexer and on to the ALU, and INMODE[0]
       can dynamically change the number of pipeline stages to the multiplier. Attributes
       ACASCREG and BCASCREG select the number of pipeline stages in the ACOUT and BCOUT cascade
       datapaths. The allowed attribute settings are shown in Table 3-3, page 53. Multiplexers
       controlled by configuration bits select flow-through paths, optional registers, or
       cascaded inputs. The data port registers typically let users trade data latency for
       increased clock frequency (i.e., higher performance).
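The latency side of that trade-off can be illustrated with a small delay-line model. This is a behavioral sketch under simplifying assumptions, not a description of the actual register structure: the class name `InputPipeline` is hypothetical, registers are assumed to reset to 0, and clock enables and resets are omitted.

```python
from collections import deque


class InputPipeline:
    """Delay line with 0, 1, or 2 register stages (hypothetical model of AREG/BREG)."""

    def __init__(self, stages: int):
        if stages not in (0, 1, 2):
            raise ValueError("stages must be 0, 1, or 2")
        # A bounded deque holds one entry per register stage; None means flow-through.
        self.regs = deque([0] * stages, maxlen=stages) if stages else None

    def clock(self, value: int) -> int:
        """Advance one clock cycle and return the value presented `stages` cycles ago."""
        if self.regs is None:
            return value         # flow-through path, no added latency
        out = self.regs[0]       # oldest stored value leaves the pipeline
        self.regs.append(value)  # new value enters; maxlen drops the oldest entry
        return out


if __name__ == "__main__":
    pipe = InputPipeline(2)
    print([pipe.clock(v) for v in [1, 2, 3, 4]])
```

Each extra stage delays the data by one clock cycle, which is the latency cost paid for the shorter register-to-register paths.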
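       [Figure 2-5: dual A, D, and pre-adder port logic. Labels recovered from the figure: A and
       ACIN inputs (30 bits), registers A1 and A2, INMODE[1]A gating, 30-bit outputs to ACOUT and
       the X multiplexer, and a 27-bit output.]
       [Figure 2-6: dual B register port logic. Labels recovered from the figure: B and BCIN
       inputs (18 bits), registers B1 and B2 with CEB1, CEB2, and RSTB, INMODE[4], INMODE[1]B
       gating, BMULTSEL selecting between B2B1 and AD_DATA for B MULT, and 18-bit outputs to
       BCOUT and the X multiplexer.]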
                            Table 2-1 and Table 2-2 show the encoding for the INMODE[4:0] dynamic control bits and
                            the AMULTSEL, BMULTSEL, and PREADDINSEL static control bits. Note that the DSP48E2
                            attribute AMULTSEL has replaced the DSP48E1 attribute USE_DPORT due to the increased
                            functionality of the pre-adder.
                            These bits select the functionality of the pre-adder and the A, B, and D input registers.
                            AMULTSEL must be set to AD to enable the pre-adder functions described in Table 2-1 and
                            Table 2-2.
                            In summary, the INMODE dynamic control signals, along with the AMULTSEL, BMULTSEL, and
                            PREADDINSEL static attributes, control the pre-adder functionality and the A, B, and D
                            register bus multiplexers that precede the multiplier. The DSP48E2 supports a two-deep
                            A or B register path sourcing the pre-adder, as well as a pre-adder squaring function.
Notes:
1. Set the data on the D and the A ports so the pre-adder, which does not support saturation, does not overflow or underflow.
   See Pre-Adder, page 36.
INMODE[4]  INMODE[3]  INMODE[2]  INMODE[1]  INMODE[1]A(1)  INMODE[1]B(1)  INMODE[0]  PREADDINSEL  BMULTSEL  AMULTSEL (USE_DPORT)  Multiplier A Port  Multiplier B Port(3)  Pre-Adder/Multiplier Function
0/1        0          0          0          0              0              0/1        A            B         A (FALSE)             A2/A1              B2/B1                 A*B
X          0          1          1          1              0              X          A            AD        AD (TRUE)             D                  D                     D^2
Notes:
1. INMODE[1]A and INMODE[1]B are internal signals defined by the user settings of PREADDINSEL and INMODE[1]. If
   PREADDINSEL=A, INMODE[1]A (see Figure 2-5, page 23) is INMODE[1] and INMODE[1]B (see Figure 2-6, page 24) is 0. If
   PREADDINSEL=B, INMODE[1]B is INMODE[1] and INMODE[1]A is 0.
2. Set the data on the D and the A or B ports so the pre-adder, which does not support saturation, does not overflow or
   underflow. See Pre-Adder, page 36.
3. A or D are limited to 18 bits when provided through the B port, and are limited to 17-bit two's complement sign-extended
   numbers when the pre-adder is used.
               Together with the INMODE[2] control signal, INMODE[1] can be used to gate the A or B
               datapath so that the pre-adder acts as a 2:1 bus multiplexer.
               When INMODE[2] = 0, the D input to the pre-adder is 0. INMODE[1] and INMODE[2] enable
               multiplexing between the D register and the A or B registers, without having to use resets
               to force them to zero.
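The multiplexing behavior described above can be sketched as follows. This is a simplified model, not the hardware implementation: the function names are hypothetical, the pre-adder is assumed to be in add mode, and the polarity of INMODE[1] (1 forces the A or B operand to zero) is an assumption consistent with the D^2 table row. The INMODE[2] = 0 behavior and the INMODE[1]A/INMODE[1]B mapping come from the text and table notes.

```python
def gate_signals(preaddinsel: str, inmode1: int) -> tuple[int, int]:
    """Return (INMODE[1]A, INMODE[1]B): only the port selected by PREADDINSEL is gated."""
    if preaddinsel == "A":
        return inmode1, 0
    if preaddinsel == "B":
        return 0, inmode1
    raise ValueError("PREADDINSEL must be 'A' or 'B'")


def preadder_mux(d: int, ab: int, inmode1: int, inmode2: int) -> int:
    """Pre-adder output D + A (or D + B) with both operands gated by INMODE bits.

    Assumed polarity: INMODE[1] = 1 forces the A/B operand to 0.
    Stated in the text: INMODE[2] = 0 forces the D operand to 0.
    """
    ab_gated = 0 if inmode1 else ab
    d_gated = d if inmode2 else 0
    return d_gated + ab_gated


if __name__ == "__main__":
    print(preadder_mux(d=5, ab=9, inmode1=1, inmode2=1))  # passes D only
    print(preadder_mux(d=5, ab=9, inmode1=0, inmode2=0))  # passes A (or B) only
    print(preadder_mux(d=5, ab=9, inmode1=0, inmode2=1))  # ordinary pre-add
```

Driving the two control bits with complementary values selects one operand or the other, which is the 2:1 bus multiplexer use case; no register reset is involved.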
       The 48-bit C port is used as a general input to the W, Y and Z multiplexers to perform add,
       subtract, four-input add/subtract, and logic functions. The C input is also connected to the
       pattern detector for rounding function implementations. The C port logic is shown in
       Figure 2-7. The CREG attribute selects the number of pipestages for the C input datapath.
       [Figure 2-7: C port logic. The 48-bit C input passes through a register (clock enable CEC,
       reset RSTC) or bypasses it, and feeds the W, Y, and Z multiplexers and the pattern
       detector.]
       [Figure: control input registers. OPMODE (9 bits, clock enable CECTRL, reset RSTCTRL)
       drives the W, X, Y, and Z multiplexers and the 3-input adder/subtracter. ALUMODE (4 bits,
       clock enable CEALUMODE, reset RSTALUMODE) drives the adder/subtracter. CARRYINSEL (3 bits)
       drives the carry input select logic.]
       W, X, Y, and Z Multiplexers
       The OPMODE (Operating Mode) control input contains fields for W, X, Y, and Z multiplexer
       selects.
       The OPMODE input provides a way for you to dynamically change DSP48E2 functionality
       from clock cycle to clock cycle (e.g., when altering the internal datapath configuration of
       the DSP48E2 slice relative to a given calculation sequence).
       The OPMODE bits can be optionally registered using the OPMODEREG attribute (as noted
       in Table 3-4).
       Table 2-3, Table 2-4, Table 2-5, and Table 2-6 list the possible values of OPMODE and the
       resulting function at the outputs of the four multiplexers (W, X, Y, and Z multiplexers). The
       multiplexer outputs supply four operands to the following adder/subtracter. Not all
       possible combinations for the multiplexer select bits are allowed. Some are marked in the
       tables as “illegal selection” and give undefined results. If the multiplier output is selected,
       then both the X and Y multiplexers are used to supply the multiplier partial products to the
       adder/subtracter.
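Splitting the nine OPMODE bits into the four multiplexer selects can be sketched as below. The field layout used here (W = OPMODE[8:7], Z = OPMODE[6:4], Y = OPMODE[3:2], X = OPMODE[1:0]) is an assumption taken from the OPMODE tables rather than from the paragraphs above, and the names `OpmodeFields` and `decode_opmode` are hypothetical.

```python
from typing import NamedTuple


class OpmodeFields(NamedTuple):
    w: int  # OPMODE[8:7], W multiplexer select
    z: int  # OPMODE[6:4], Z multiplexer select
    y: int  # OPMODE[3:2], Y multiplexer select
    x: int  # OPMODE[1:0], X multiplexer select


def decode_opmode(opmode: int) -> OpmodeFields:
    """Split a 9-bit OPMODE value into its W, Z, Y, and X select fields."""
    if not 0 <= opmode < (1 << 9):
        raise ValueError("OPMODE is a 9-bit value")
    return OpmodeFields(
        w=(opmode >> 7) & 0b11,
        z=(opmode >> 4) & 0b111,
        y=(opmode >> 2) & 0b11,
        x=opmode & 0b11,
    )


if __name__ == "__main__":
    # 001001000 is the MACC_EXTEND setting quoted elsewhere in this chapter.
    print(decode_opmode(0b001001000))
```

The decoder only separates the fields. Whether a given combination is legal still has to be checked against the OPMODE tables.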
       ALUMODE Inputs
       The 4-bit ALUMODE controls the behavior of the second stage add/sub/logic unit.
       ALUMODE = 0000 selects add operations of the form Z + (W + X + Y + CIN). CIN is the
       output of the CARRYIN mux (see Figure 2-9). ALUMODE = 0011 selects subtract operations
       of the form Z – (W + X + Y + CIN). ALUMODE = 0001 can implement
       –Z + (W + X + Y + CIN) – 1. ALUMODE = 0010 can implement –(Z + W + X + Y + CIN) – 1,
       which is equivalent to not (Z + W + X + Y + CIN). The negative of a two’s complement
       number is obtained by performing a bitwise inversion and adding one, e.g.,
       –k = not (k) + 1. Other subtract and logic operations can also be implemented with the
       enhanced add/sub/logic unit. See Table 2-7.
        Notes:
        1. In two’s complement: –Z = not (Z) + 1.
See Table 2-10, page 38 for two-input ALUMODE operations and Table 5-3, page 70.
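The four arithmetic ALUMODE settings listed above can be modeled with 48-bit wrap-around arithmetic. This is a behavioral sketch only: the function name `alu` is hypothetical, the operands are assumed to be already selected by the multiplexers, and the logic-unit modes are not covered.

```python
MASK48 = (1 << 48) - 1


def alu(alumode: int, z: int, w: int, x: int, y: int, cin: int = 0) -> int:
    """Return the 48-bit result for the four add/subtract ALUMODE settings."""
    s = w + x + y + cin
    if alumode == 0b0000:
        r = z + s            # Z + (W + X + Y + CIN)
    elif alumode == 0b0011:
        r = z - s            # Z - (W + X + Y + CIN)
    elif alumode == 0b0001:
        r = -z + s - 1       # -Z + (W + X + Y + CIN) - 1
    elif alumode == 0b0010:
        r = -(z + s) - 1     # not (Z + W + X + Y + CIN)
    else:
        raise ValueError("only the four arithmetic modes are modeled")
    return r & MASK48        # wrap to the 48-bit two's complement range


if __name__ == "__main__":
    total = alu(0b0000, z=10, w=1, x=2, y=3, cin=1)
    inverted = alu(0b0010, z=10, w=1, x=2, y=3, cin=1)
    print(total, inverted == (~total & MASK48))
```

The last line checks the identity stated in the text: the ALUMODE = 0010 result is the bitwise inversion of the ALUMODE = 0000 result, because -k - 1 = not (k) in two's complement.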
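       [Figure 2-9: carry input (CIN) multiplexer. The 3-bit CARRYINSEL selects the source:
       000 = CARRYIN (register with CECARRYIN and RSTALLCARRYIN);
       010 = CARRYCASCIN (large add/sub/acc, parallel operation);
       100 = CARRYCASCOUT (large add/sub/acc, sequential operation);
       110 = A[26] XNOR B[17] (round A * B; register with CEM and RSTALLCARRYIN);
       111 and 101 = P[47] and inverted P[47] (round output);
       011 and 001 = PCIN[47] and inverted PCIN[47].]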
       The fourth input (CARRYINSEL = 110) is A[26] XNOR B[17], used for symmetrically rounding
       multiplier outputs. This signal can be optionally registered to match the MREG pipeline
       delay. The fifth and sixth inputs (CARRYINSEL = 111 and 101) select the true or inverted
       P output MSB, P[47], for symmetrically rounding the P output. The seventh and eighth
       inputs (CARRYINSEL = 011 and 001) select the true or inverted cascaded P input MSB,
       PCIN[47], for symmetrically rounding the cascaded P input.
       Table 2-8 lists the possible values of the three carry input select bits (CARRYINSEL) and the
       resulting carry inputs or sources.
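The carry input selection can be sketched as a lookup on CARRYINSEL. This is a combinational model that ignores the optional registers. The function name `select_cin` is hypothetical, and the pairing of each select code with the true or the inverted MSB (111 and 011 true, 101 and 001 inverted) is an assumption based on the order in which the text lists them.

```python
def select_cin(carryinsel: int, *, carryin: int = 0, carrycascin: int = 0,
               carrycascout: int = 0, a: int = 0, b: int = 0,
               p: int = 0, pcin: int = 0) -> int:
    """Return the 1-bit CIN chosen by the 3-bit CARRYINSEL value."""
    a26 = (a >> 26) & 1        # sign bit of the 27-bit multiplier A operand
    b17 = (b >> 17) & 1        # sign bit of the 18-bit multiplier B operand
    p47 = (p >> 47) & 1        # MSB of the P output
    pcin47 = (pcin >> 47) & 1  # MSB of the cascaded P input
    sources = {
        0b000: carryin,
        0b010: carrycascin,
        0b100: carrycascout,
        0b110: 1 - (a26 ^ b17),  # A[26] XNOR B[17]
        0b111: p47,
        0b101: 1 - p47,
        0b011: pcin47,
        0b001: 1 - pcin47,
    }
    return sources[carryinsel]


if __name__ == "__main__":
    # Operands with equal signs give a carry-in of 1 for symmetric rounding.
    print(select_cin(0b110, a=3, b=5))
```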
                               Output Ports
                               This section describes the output ports of the DSP48E2 slice in detail. The output ports of
                               the DSP48E2 slice are shown in Figure 2-10.
                               [Figure 2-10: DSP48E2 slice block diagram with the output ports highlighted:
                               P (48 bits), CARRYOUT (4 bits), XOROUT (8 bits), PATTERNDETECT, PATTERNBDETECT,
                               and the cascade outputs ACOUT*, BCOUT*, PCOUT*, CARRYCASCOUT*, and MULTSIGNOUT*.]
                               *These signals are dedicated routing paths internal to the DSP48E2 column.
                               They are not accessible via general-purpose routing resources.
                               All the output ports except ACOUT and BCOUT are reset by RSTP and enabled by CEP (see
                               Figure 2-11). ACOUT and BCOUT are reset by RSTA and RSTB, respectively (shown in
                               Figure 2-5 and Figure 2-6).
                               [Figure 2-11: output port logic. P, PCOUT, MULTSIGNOUT, CARRYCASCOUT, CARRYOUT,
                               PATTERNDETECT, PATTERNBDETECT, and XOROUT pass through an output register
                               (clock enable CEP, reset RSTP) to the DSP48E2 slice output.]
       P Port
       Each DSP48E2 slice has a 48-bit output port P. This output can be connected (cascaded
       connection) to the adjacent DSP48E2 slice internally through the PCOUT path. The PCOUT
       connects to the input of the Z multiplexer (PCIN) in the adjacent DSP48E2 slice. This path
       provides an output cascade stream between adjacent DSP48E2 slices.
       The CARRYOUT signal is cascaded to the next adjacent DSP48E2 slice using the
       CARRYCASCOUT port. Larger add, subtract, ACC, and MACC functions can be implemented
       in the DSP48E2 slice using the CARRYCASCOUT output. The 1-bit CARRYCASCOUT signal
       corresponds to CARRYOUT[3], but is not identical. The CARRYCASCOUT signal is also fed
       back into the same DSP48E2 slice via the CARRYINSEL multiplexer.
       The CARRYOUT[3] signal should be ignored when the multiplier or a 3-input (or 4-input)
       add/subtract operation is used. Because a MACC operation includes a three-input adder in
       the accumulator stage, the combination of MULTSIGNOUT and CARRYCASCOUT signals is
       required to perform a 96-bit MACC, spanning two DSP48E2 slices. The second DSP48E2
       slice’s OPMODE must be MACC_EXTEND (001001000) to use both CARRYCASCOUT and
       MULTSIGNOUT, thereby eliminating the ternary adder carry restriction for the upper
       DSP48E2 slice. The actual hardware implementation of CARRYOUT/CARRYCASCOUT and the
       MULTSIGNOUT Logic
       MULTSIGNOUT is a software abstraction of the hardware signal. It is modeled as the MSB of
       the multiplier output and used only in MACC extension applications to build a 96-bit MACC.
       The actual hardware implementation of MULTSIGNOUT is described in Chapter 5,
       Cascading: CARRYOUT, CARRYCASCOUT, and MULTSIGNOUT.
       The MSB of a multiplier output is cascaded to the next DSP48E2 slice using the MULTSIGNIN
       signal and can be used only in MACC extension applications to build a 96-bit accumulator.
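The arithmetic idea behind a 96-bit MACC built from two 48-bit halves can be sketched as below. This shows only why a carry and a product sign must both reach the upper half. It does not model the real CARRYCASCOUT and MULTSIGNOUT hardware, and the function names are hypothetical.

```python
MASK48 = (1 << 48) - 1


def macc96_step(lo: int, hi: int, product: int) -> tuple[int, int]:
    """Add one signed product into a 96-bit accumulator held as two 48-bit words."""
    total_lo = lo + (product & MASK48)       # product sign-extended to 48 bits
    carry = total_lo >> 48                   # carry out of the lower word
    sign_ext = MASK48 if product < 0 else 0  # upper 48 bits of the sign-extended product
    new_hi = (hi + sign_ext + carry) & MASK48
    return total_lo & MASK48, new_hi


def to_signed96(lo: int, hi: int) -> int:
    """Interpret the two 48-bit words as one signed 96-bit value."""
    value = (hi << 48) | lo
    return value - (1 << 96) if value >> 95 else value


if __name__ == "__main__":
    lo = hi = 0
    for prod in [2**43] * 40 + [-3, -(2**44)]:
        lo, hi = macc96_step(lo, hi, prod)
    print(to_signed96(lo, hi))
```

The upper word needs two pieces of information from the lower one: the carry out of the 48-bit addition and the sign of the product being added. In this sketch those stand in for the roles the text assigns to CARRYCASCOUT and MULTSIGNOUT.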
       A mask field can also be used to hide certain bit locations in the pattern detector.
       PATTERNDETECT computes ((P == pattern)||mask) on a bitwise basis and then ANDs the
       results to a single output bit. Similarly, PATTERNBDETECT can detect if
       ((P == ~pattern)||mask). The pattern and the mask fields can each come from a distinct
       48-bit configuration field or from the (registered) C input. When the C input is used as the
       PATTERN, the OPMODE should be set to select a 0 at the input of the Z multiplexer. If all the
       registers are reset, PATTERNDETECT is High for one clock cycle immediately after the RESET
       is deasserted.
       The pattern detector allows the DSP48E2 slice to support convergent rounding and counter
       auto reset when a count value has been reached as well as support overflow, underflow, and
       saturation in accumulators.
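       The comparison described above can be sketched in Python as a behavioral model. This is an
       illustration only, not the hardware implementation: the function name pattern_detect and
       the constant MASK48 are hypothetical, and the model ignores registers and reset behavior.

```python
MASK48 = (1 << 48) - 1  # P, pattern, and mask are modeled as 48-bit values


def pattern_detect(p, pattern, mask):
    """Return (PATTERNDETECT, PATTERNBDETECT) for a 48-bit P value.

    A mask bit of 1 hides that bit position from the comparison.
    """
    p &= MASK48
    pattern &= MASK48
    mask &= MASK48
    # Bitwise ((P == pattern) || mask), then AND-reduce to a single bit.
    match = (~(p ^ pattern) | mask) & MASK48
    # Same check against the inverted pattern.
    match_bar = (~(p ^ (~pattern & MASK48)) | mask) & MASK48
    return match == MASK48, match_bar == MASK48
```

       For a counter with auto reset, for example, pattern_detect(count, terminal_count, 0)
       returns a true first element only when every bit of the count equals the terminal count.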
                                Embedded Functions
                                The embedded functions include a pre-adder, 27 x 18 multiplier, adder/subtracter/logic
                                unit, and pattern detector logic (see Figure 2-12).
                                 [Figure 2-12: DSP48E2 slice block diagram. The A (30-bit), B (18-bit), C (48-bit), and
                                 D (27-bit) inputs feed the dual A, D, and pre-adder block and the dual B register, followed by
                                 the 27 x 18 multiplier. The W, X, Y, and Z multiplexers drive the ALU, which produces
                                 P, CARRYOUT, XOROUT, PATTERNDETECT, and PATTERNBDETECT. Control inputs are INMODE,
                                 OPMODE, ALUMODE, CARRYIN, and CARRYINSEL. Cascade signals are ACIN, BCIN*, PCIN*,
                                 CARRYCASCIN*, MULTSIGNIN*, ACOUT*, BCOUT*, PCOUT*, CARRYCASCOUT*, and MULTSIGNOUT*.]
                                 *These signals are dedicated routing paths internal to the DSP48E2 column. They are not
                                 accessible via general-purpose routing resources.
                                Pre-Adder
                                The DSP slice has a 27-bit pre-adder, which is inserted in the A or B register path (shown in
                                Figure 2-12 with an expanded view in Figure 2-5, page 23). With the pre-adder,
                                pre-additions or pre-subtractions are possible prior to feeding the multiplier. Since the
                                pre-adder does not contain saturation logic, designers should limit input operands to
                                26-bit (or 17-bit for the B path) two’s complement sign-extended data to avoid overflow or
                                underflow during arithmetic operations. Optionally, the pre-adder can be bypassed, making
                                D the new input path to the multiplier. When the D path is not used, the output of the A or
                                B pipeline can be negated prior to driving the multiplier. There are up to 15 operating
                                modes, including pre-adder squaring, making this pre-adder block very flexible.
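                                 The overflow risk can be illustrated with a small behavioral sketch. It assumes that a
                                 pre-adder without saturation logic wraps in two's complement; the names to_signed and
                                 pre_adder are hypothetical and do not correspond to any primitive port or attribute.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pre_adder(d, a, subtract=False, bits=27):
    """Model a non-saturating pre-adder; the result wraps to `bits` bits.

    Assumption: wrap-around stands in for "no saturation logic".
    """
    raw = d - a if subtract else d + a
    return to_signed(raw, bits)
```

                                 With both operands limited to 26-bit signed values, the sum always fits in 27 bits. With
                                 full 27-bit operands, the model wraps: pre_adder(2**26 - 1, 1) returns -(2**26).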
                                In Equation 2-2, A (or B) and D are added initially through the pre-adder/subtracter. The
                                result of the pre-adder is then multiplied against B (or A), with the result of the
       multiplication being added to the C input. This equation facilitates efficient symmetric
       filters.
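       As a rough illustration of why this form suits symmetric filters, the sketch below folds an
       even-length symmetric FIR so that each term is (D + A) x B accumulated through C. It follows
       the prose description only; the function names and the sample data are hypothetical.

```python
def fir_direct(coeffs, window):
    """Reference FIR: one multiply per tap."""
    return sum(h * x for h, x in zip(coeffs, window))


def fir_symmetric(half_coeffs, window):
    """Folded FIR for an even-length symmetric filter.

    Each step has the form (D + A) * B + C, so one multiplier
    serves two taps that share the same coefficient.
    """
    n = len(window)
    acc = 0  # plays the role of the C input
    for k, b in enumerate(half_coeffs):
        d, a = window[k], window[n - 1 - k]
        acc = (d + a) * b + acc
    return acc
```

       For coefficients [1, 2, 3, 4, 4, 3, 2, 1], fir_symmetric needs four multiplies where
       fir_direct needs eight, and both return the same result.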
       Figure 2-13 shows an optional pipeline register (MREG) for the output of the multiplier.
       Using the register provides increased performance with an increase of one clock latency.
       [Figure 2-13: Multiplier with optional MREG. The inputs are A or AD and B or AD; the
       multiplier produces two 45-bit partial products (Partial Product 1 and Partial Product 2),
       which pass through the optional MREG.]
       As with the input multiplexers, the OPMODE bits specify a portion of this function. The
       symbol ± in the table means either add or subtract and is specified by the state of the
       ALUMODE control signal. The symbol “:” in the table means concatenation. The outputs of
       the X and Y multiplexer and CIN are always added together. Refer to ALUMODE Inputs,
       page 30.
       Table 2-10 lists the logic functions that can be implemented in the second stage of the four
       input adder/subtracter/logic unit. The table also lists the settings of the control signals,
       namely OPMODE and ALUMODE.
       An XOR3 can be built by setting the OPMODE[3:2] to 11, selecting the C input at the Y
       multiplexer output. The XOR3 is only valid for ALUMODE[3:0] = 0100, as shown in
       Table 2-10.
       Table 2-10:   OPMODE and ALUMODE Control Bits Select Logic Unit Outputs

        Logic Unit Mode          OPMODE[3:2]    ALUMODE[3:0]
        X XOR Z                  00             0100
        X XNOR Z                 00             0101
        X XNOR Z                 00             0110
        X XOR Z                  00             0111
        X AND Z                  00             1100
        X AND (NOT Z)            00             1101
        X NAND Z                 00             1110
        (NOT X) OR Z             00             1111
        X XNOR Z                 10             0100
        X XOR Z                  10             0101
        X XOR Z                  10             0110
        X XNOR Z                 10             0111
        X OR Z                   10             1100
        X OR (NOT Z)             10             1101
        X NOR Z                  10             1110
        (NOT X) AND Z            10             1111
        X XOR Y XOR Z (1)        11             0100

        Notes:
        1. Valid when Y multiplexer selects C input.
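       The rows of Table 2-10 can be captured as a lookup table for quick experiments. The sketch
       below is a behavioral model only; LOGIC_MODES and logic_unit are hypothetical names, and the
       XOR3 row takes the C value as the Y multiplexer output.

```python
MASK48 = (1 << 48) - 1

# (OPMODE[3:2], ALUMODE[3:0]) -> function of X and Z, one entry per table row.
LOGIC_MODES = {
    (0b00, 0b0100): lambda x, z: x ^ z,
    (0b00, 0b0101): lambda x, z: ~(x ^ z),
    (0b00, 0b0110): lambda x, z: ~(x ^ z),
    (0b00, 0b0111): lambda x, z: x ^ z,
    (0b00, 0b1100): lambda x, z: x & z,
    (0b00, 0b1101): lambda x, z: x & ~z,
    (0b00, 0b1110): lambda x, z: ~(x & z),
    (0b00, 0b1111): lambda x, z: ~x | z,
    (0b10, 0b0100): lambda x, z: ~(x ^ z),
    (0b10, 0b0101): lambda x, z: x ^ z,
    (0b10, 0b0110): lambda x, z: x ^ z,
    (0b10, 0b0111): lambda x, z: ~(x ^ z),
    (0b10, 0b1100): lambda x, z: x | z,
    (0b10, 0b1101): lambda x, z: x | ~z,
    (0b10, 0b1110): lambda x, z: ~(x | z),
    (0b10, 0b1111): lambda x, z: ~x & z,
}


def logic_unit(opmode_3_2, alumode, x, z, c=0):
    """Return the 48-bit logic unit result for one row of the table."""
    if (opmode_3_2, alumode) == (0b11, 0b0100):
        return (x ^ c ^ z) & MASK48  # XOR3: Y multiplexer selects C
    return LOGIC_MODES[(opmode_3_2, alumode)](x, z) & MASK48
```

       Combinations that do not appear in the table raise KeyError, which keeps the model from
       implying behavior that the table does not list.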
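       [Figure 2-14: SIMD mode with four 12-bit segments. The W, X, and Y multiplexer outputs
       ([47:0]) feed 12-bit adder segments controlled by ALUMODE[3:0]. The segment outputs shown
       are P[47:36] with CARRYOUT[3], P[35:24] with CARRYOUT[2], and P[23:12] with CARRYOUT[1].]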
       •   Four segments of dual or ternary or quad adders with 12-bit inputs, a 12-bit output,
           and a carry output for each segment
       •   Function controlled dynamically by ALUMODE[3:0], and operand source by
           OPMODE[8:0]
       •   All four adder/subtracter/accumulators perform same function
       •   Two segments of dual or ternary or quad adders with 24-bit inputs, a 24-bit output, and
           a carry output for each segment are also available (not pictured).
       The SIMD feature, shown in Figure 2-14, allows the 48-bit logic unit to be split into multiple
       smaller logic units. Each smaller logic unit performs the same function. This function can
       also be changed dynamically through the ALUMODE[3:0] and OPMODE control inputs.
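       A simplified behavioral sketch of the segmented adder follows. It models only a two-operand
       add, whereas the slice supports dual, ternary, and quad adders; simd_add is a hypothetical
       name, and the carry list is indexed by segment rather than by CARRYOUT pin.

```python
def simd_add(x, y, lanes=4):
    """Add two 48-bit words as independent lanes (4 x 12, 2 x 24, or 1 x 48).

    Returns (p, carries), where carries[i] is the carry out of lane i
    and lane 0 is the least significant segment.
    """
    if lanes not in (1, 2, 4):
        raise ValueError("lanes must be 1, 2, or 4")
    width = 48 // lanes
    lane_mask = (1 << width) - 1
    p, carries = 0, []
    for i in range(lanes):
        shift = i * width
        total = ((x >> shift) & lane_mask) + ((y >> shift) & lane_mask)
        p |= (total & lane_mask) << shift  # a carry never crosses into the next lane
        carries.append(total >> width)
    return p, carries
```

       Because every lane runs the same operation, a carry out of one segment is reported
       separately instead of propagating into the next segment.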
       The pattern detector is best described as an equality check on the output of the
       adder/subtracter/logic unit that produces its result on the same cycle as the P output. There
       is no extra latency between the pattern detect output and the P output of the DSP48E2
       slice. The use of the pattern detector leads to a moderate speed reduction due to the extra
       logic on the pattern detect path (see Figure 2-15).
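       [Figure 2-15: Pattern detector logic. The P value is compared against a pattern chosen by
       SEL_PATTERN (C register or PATTERN) under a mask chosen by SEL_MASK, whose sources include
       C shifted by 1 with 0 fill (Mode 1) and C shifted by 2 with 00 fill (Mode 2). The outputs
       are PATTERNDETECT and PATTERNBDETECT, with PATTERNDETECTPAST (1) and PATTERNBDETECTPAST (1).]
       Notes:
       1. Denotes an internal signal.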
       When the pattern detector is not needed for pattern detection, its logic can be used for
       other purposes. These include:
       •   Duplicating a pin (e.g., the sign bit) to reduce fanout and thus increase speed.
       •   Implementing a built-in inverter on one bit (e.g., the sign bit) without having to route
           out to the CLBs.
       •   Checking for sticky bits in floating point, handling special cases, or monitoring the
           DSP48E2 slice outputs.
       A mask field can also be used to mask out certain bit locations in the pattern detector. The
       pattern field and the mask field can each come from a distinct 48-bit memory cell field or
       from the (registered) C input.
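       [Figure: Overflow and underflow logic. Overflow is derived from PATTERNDETECTPAST (1),
       PATTERNBDETECT, and PATTERNDETECT. Underflow is derived from PATTERNBDETECTPAST (1),
       PATTERNBDETECT, and PATTERNDETECT. (1) denotes an internal signal.]
       See Figure 2-17 and Figure 2-18 for overflow and underflow examples, respectively.
       [Figure 2-17: Overflow example.]
       [Figure 2-18: Underflow example.]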
       Overflow is caused by addition when the value at the output of the adder/subtracter/logic
       unit goes over 3. Adding 1 to the final value of 0..0011 gives 0..0100 as the result, which
       causes the PATTERNDETECT output to go to 0. When the PATTERNDETECT output goes from
       1 to 0, an overflow is flagged.
       Underflow is caused by subtraction when the value goes below –4. Subtracting 1 from
       1..1100 (–4) yields 1..1011 (–5), which causes the PATTERNBDETECT output to go to 0. When
       the PATTERNBDETECT output goes from 1 to 0, an underflow is flagged.
       Overflow and underflow are relative to the previous value (positive or negative,
       respectively). Overflow can result from subtraction when a value outside the valid range is
       subtracted from a positive value. Similarly, underflow can result from addition when a value
       outside the valid range is added to a negative value.
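       The example can be reproduced with a small behavioral sketch. Several points are
       assumptions rather than statements from this section: the pattern is all 0s and the mask is
       0..0011, which matches the range of –4 to 3 used above, and each flag is taken to be the
       past detect output ANDed with the inverse of both current detect outputs. The names detect
       and flags are hypothetical.

```python
MASK48 = (1 << 48) - 1


def detect(p, pattern=0, mask=0b11):
    """Return (PATTERNDETECT, PATTERNBDETECT) for a 48-bit P value."""
    p &= MASK48
    hit = ((~(p ^ pattern) | mask) & MASK48) == MASK48
    hit_bar = ((~(p ^ ~pattern) | mask) & MASK48) == MASK48
    return hit, hit_bar


def flags(prev_p, p, pattern=0, mask=0b11):
    """Return (overflow, underflow) for a step from prev_p to p.

    Assumed logic: the past detect output was 1 and neither
    detect output is 1 for the current value.
    """
    past, past_bar = detect(prev_p, pattern, mask)
    now, now_bar = detect(p, pattern, mask)
    overflow = past and not now and not now_bar
    underflow = past_bar and not now and not now_bar
    return overflow, underflow
```

       Under these assumptions, flags(3, 4) reports an overflow and flags(-4, -5) reports an
       underflow, while a step from –1 to 0 reports neither because PATTERNDETECT is 1 for 0.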
       Wide XOR
       A new feature in the DSP48E2 slice is the ability to perform a 96-bit wide XOR function. The
       XOR uses the X, Y, and Z multiplexers as inputs. The W multiplexer selects all 0s at its output.
       The ALU logic is used for the first stage of the wide XOR by using the proper OPMODE and
       ALUMODE signals as shown in Table 2-10, to implement either X XOR Z or X XOR Y XOR Z.
       The signals then branch out to an XOR logic tree with dedicated outputs. Multiplexers allow
       selection as eight 12-bit wide XOR, four 24-bit wide XOR, two 48-bit wide XOR, or one
       96-bit wide XOR. See Figure 2-19. In Figure 2-19, the S[47:0] internal bus is not the P[47:0]
       output; it is one of the 4:2 compressor buses.
       [Figure 2-19: Wide XOR logic. The X (0, A:B, P), Y (0, 1, C), and Z (0, PCIN, P, C)
       multiplexer outputs feed the ALU, controlled by ALUMODE[3:0], to produce S[47:0]. Each 6-bit
       slice of S drives one first-level block: S[5:0] to XOR12A, S[11:6] to XOR12B, S[17:12] to
       XOR12C, S[23:18] to XOR12D, S[29:24] to XOR12E, S[35:30] to XOR12F, S[41:36] to XOR12G, and
       S[47:42] to XOR12H. These combine into XOR24A to XOR24D, XOR48A and XOR48B, and XOR96, and
       drive XOROUT[7:0].]
                             The XORSIMD attribute selects the width of the XOR function as either 96 bits or
                             12/24/48 bits, as shown in Table 2-11.
       The dedicated XOR logic enables performance improvements when implementing forward
       error correction and cyclic redundancy checking algorithms. There is also a USE_WIDEXOR
       attribute to enable a power saving mode if the wide XOR function is not desired (see
       Table 3-3, page 53).
       The first level XOR can be either XOR2 or XOR3. In both cases, ALUMODE[3:0] = 0100 for
       the XOR function in the ALU. When the Y multiplexer selects 0, an XOR2 is created. When
       the Y multiplexer selects the C register, an XOR3 is created, supporting up to 48 XOR3 in the
       ALU. The third input can come from the P output or the PCIN cascade, which provides
       XOR-accumulate and cascade capability for even wider XOR functions.
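       The two-level structure can be sketched as follows. The sketch follows the slice-to-block
       grouping shown in Figure 2-19 and does not model how XORSIMD maps the tree nodes onto
       XOROUT[7:0], because Table 2-11 is not reproduced here. The names parity and wide_xor are
       hypothetical.

```python
def parity(value):
    """XOR-reduce all bits of a non-negative integer."""
    return bin(value).count("1") & 1


def wide_xor(x, z, y=0):
    """Model the XOR tree fed by a first-level XOR2 (y = 0) or XOR3 (y = C)."""
    mask48 = (1 << 48) - 1
    s = (x ^ y ^ z) & mask48  # first level, performed in the ALU
    # Each XOR12 block reduces one 6-bit slice of S (12 bits of X and Z).
    xor12 = [parity((s >> (6 * i)) & 0x3F) for i in range(8)]
    xor24 = [xor12[2 * i] ^ xor12[2 * i + 1] for i in range(4)]
    xor48 = [xor24[0] ^ xor24[1], xor24[2] ^ xor24[3]]
    xor96 = xor48[0] ^ xor48[1]
    return {"xor12": xor12, "xor24": xor24, "xor48": xor48, "xor96": xor96}
```

       Passing the previous result or a cascaded value as one of the inputs models the
       XOR-accumulate and cascade use described above.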
       Overview
       Xilinx offers integrated DSP design flows tailored for the unique needs of hardware,
       algorithm, and traditional processor-based DSP designers, supporting all mainstream DSP
       design entry methods to ensure productivity.
       Vivado™ Design Suite System Generator for DSP enables high-level model-based designs to
       be created using MathWorks MATLAB and Simulink and provides automatic fixed or
       floating-point hardware generation, co-simulation, and system integration into RTL or
       embedded systems. See Vivado Design Suite Reference Guide: Model-Based DSP Design
       Using System Generator (UG958) [Ref 4].
       Vivado Design Suite includes an extensive library of device-optimized DSP IP that can be
       used with RTL or with System Generator or Vivado HLS to quickly assemble DSP designs that
       deliver high quality of results without requiring extensive FPGA design experience. DSP
       algorithms implemented in RTL can be verified from within DSP specific simulation
       environments such as MATLAB/Simulink or C/C++.
       The DSP48E2 slices are inferred automatically from HDL code for most DSP functions and
       many arithmetic functions when using synthesis tools (check the documentation for your
       synthesis tools for details). Instantiation of the DSP48E2 primitive can be used to directly
       access specific features and provide more advanced user control.
       [Figure: DSP48E2 primitive ports.
       Data inputs: A[29:0], B[17:0], C[47:0], D[26:0].
       Control inputs: OPMODE[8:0], ALUMODE[3:0], CARRYIN, CARRYINSEL[2:0], INMODE[4:0], CLK.
       Clock enables: CEA1, CEA2, CEB1, CEB2, CEC, CED, CEM, CEP, CEAD, CEALUMODE, CECTRL,
       CECARRYIN, CEINMODE.
       Resets: RSTA, RSTB, RSTC, RSTD, RSTM, RSTP, RSTCTRL, RSTALLCARRYIN, RSTALUMODE, RSTINMODE.
       Cascade inputs: ACIN[29:0], BCIN[17:0], PCIN[47:0], CARRYCASCIN, MULTSIGNIN.
       Outputs: ACOUT[29:0], BCOUT[17:0], PCOUT[47:0], P[47:0], CARRYOUT[3:0], CARRYCASCOUT,
       MULTSIGNOUT, PATTERNDETECT, PATTERNBDETECT, OVERFLOW, UNDERFLOW, XOROUT[7:0].]
       Overview
       This chapter describes design features and techniques that help achieve higher
       performance, lower power, and lower resource usage in a design.
       IMPORTANT: If latency is important in the design and only one or two registers can be used within the
       DSP48E2 slice, always use the M register.
                             [Figure 4-1: Traditional FIR filter with an adder tree. Eight multipliers with 18-bit
                             inputs (coefficients h0(n) to h7(n); samples X(n), X(n-2), and X(n-4), separated by
                             Z-2 delays) feed a tree of 48-bit adders that produces y(n-6). Annotation: The final
                             stages of the post addition in logic are the performance bottlenecks that consume more
                             power.]
                            In the traditional approach, the fabric adders are usually the performance bottleneck. The
                            number of adders needed and the associated routing depends on the size of the filter. The
                             depth of the adder tree scales as log2 of the number of taps in the filter. Using the adder
                             tree structure shown in Figure 4-1 could also increase the cost, logic resources, and power.
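                             As a quick illustration of that scaling, the sketch below counts the levels and the
                             number of two-input adders in a balanced tree. The function names are hypothetical, and
                             the model assumes one product per tap and two-input adders throughout.

```python
def adder_tree_depth(taps):
    """Levels of two-input adders needed to sum `taps` products."""
    if taps < 1:
        raise ValueError("taps must be at least 1")
    return (taps - 1).bit_length()  # equals ceil(log2(taps)) for taps >= 1


def adder_count(taps):
    """Two-input adders needed to reduce `taps` products to one sum."""
    if taps < 1:
        raise ValueError("taps must be at least 1")
    return taps - 1
```

                             For the eight-tap filter in Figure 4-1, this gives three adder levels and seven adders.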
                            The UltraScale™ architecture CLB allows the use of both the 6LUT and the carry chain in the
                            same slice to build an efficient ternary adder. The 6LUT in the CLB functions as a dual 5LUT.
                            The 5LUT is used as a 3:2 compressor to add three input values to produce two output
                            values. The 3:2 compressor is shown in Figure 4-2.
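                             The arithmetic behind a 3:2 compressor can be sketched at the word level. This is a
                             generic carry-save model for non-negative integers, not a description of the LUT
                             equations in Figure 4-2; compress_3_2 and ternary_add are hypothetical names.

```python
def compress_3_2(x, y, z):
    """Reduce three addends to a sum word and a carry word (carry-save)."""
    sum_bits = x ^ y ^ z                             # per-bit sum without carries
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1  # per-bit majority, weighted by 2
    return sum_bits, carry_bits


def ternary_add(x, y, z):
    """Add three values: compress to two words, then do one ordinary add."""
    sum_bits, carry_bits = compress_3_2(x, y, z)
    return sum_bits + carry_bits  # the carry chain performs this final add
```

                             The compressor removes one operand without propagating any carries, so only the final
                             two-input addition needs the carry chain.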
                             [Figure 4-2: 3:2 compressor built from LUT A (bit 0) and LUT B (bit 1) with the carry
                             chain. Inputs X, Y, Z, and SUB/ADDB drive the LUT inputs for each bit, and BBUS(0) feeds
                             the bit 1 LUT. The O6A and O6B outputs form ABUS(0) and ABUS(1); the O5A and O5B outputs
                             drive AMUX and BMUX to form BBUS(0) and BBUS(1). The carry chain produces CY(0) and CY(1),
                             and the registered outputs AQ and BQ provide SUM(0) and SUM(1).]
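       [Figure 4-3 (diagram not reproduced): three 46-bit inputs A, B, and C feed a 3:2 compressor.
       The ABUS output and the BBUS output, left-shifted by 1, feed a 2-input cascade adder that
       produces the 48-bit SUM.]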
       The 3:1 adder shown in Figure 4-3 is used as a building block for larger adder trees.
       Depending on the number of inputs to be added, a 5:3 or a 6:3 compressor can also be built in
       CLB logic using multiple 5LUTs or 6LUTs. A 6:3 compressor in series with two DSP48E2
       slices adds six operands together to produce one output, as shown in Figure 4-4. The LSBs
       of the first DSP48E2 slice that are left open by the left shift of the Y and Z buses
       should be tied to zero. The last DSP48E2 slice uses 2-deep A:B input registers
       to align (pipeline match) the X bus with the output of the first DSP48E2 slice. Multiple
       levels of 6:3 compressors can be used to expand the number of input buses.
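       [Figure 4-4 (diagram not reproduced): six 45-bit inputs A through F feed a 6:3 compressor
       that produces the X, Y, and Z buses. Y is left-shifted by 1 and Z is left-shifted by 2.
       Two DSP48E2 slices add the three buses to produce the 48-bit SUM.]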
       Equation 4-1:
           X(n) = A(n) XOR B(n) XOR C(n) XOR D(n) XOR E(n) XOR F(n)

       Equation 4-2:
           Y(n) = A(n)B(n) XOR A(n)C(n) XOR A(n)D(n) XOR A(n)E(n) XOR A(n)F(n)
                  XOR B(n)C(n) XOR B(n)D(n) XOR B(n)E(n) XOR B(n)F(n)
                  XOR C(n)D(n) XOR C(n)E(n) XOR C(n)F(n)
                  XOR D(n)E(n) XOR D(n)F(n) XOR E(n)F(n)
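The compressor equations can be checked with a short model. This is a sketch only: X and Y follow Equations 4-1 and 4-2, while the Z term used here (the XOR of all four-input products) is an assumption derived from the arithmetic, because the Z equation does not appear in this section. The function names are hypothetical, and the operands are treated as unsigned.

```python
from itertools import combinations


def xor_of_products(words, k, mask):
    """XOR together the bitwise AND of every k-element subset of words."""
    result = 0
    for subset in combinations(words, k):
        term = mask
        for word in subset:
            term &= word
        result ^= term
    return result


def compress_6_to_3(words, width=45):
    """Reduce six unsigned operands to the X, Y, and Z buses."""
    if len(words) != 6:
        raise ValueError("a 6:3 compressor takes exactly six operands")
    mask = (1 << width) - 1
    words = [w & mask for w in words]
    x = xor_of_products(words, 1, mask)  # Equation 4-1
    y = xor_of_products(words, 2, mask)  # Equation 4-2
    z = xor_of_products(words, 4, mask)  # assumed form, not given in this section
    return x, y, z


def six_input_add(words, width=45):
    """Recombine the buses with the shifts shown in Figure 4-4."""
    x, y, z = compress_6_to_3(words, width)
    return x + (y << 1) + (z << 2)


if __name__ == "__main__":
    print(six_input_add([1, 2, 3, 4, 5, 6]))  # 21
```

For each bit position, X, Y, and Z form the 3-bit count of ones among the six inputs, so shifting Y by 1 and Z by 2 before the final additions restores the correct weights.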
       The compressor elements and cascade adder can be arranged as a tree to build
       larger adders. The last add stage should be implemented in the DSP48E2 slice. Pipeline
       registers should be added as needed to meet the timing requirements of the design. These
       adders can have higher area and/or power than the adder cascade.
       Adder Cascade
       The adder cascade implementation accomplishes the post addition process with minimal
       silicon resources by using the cascade path within the DSP48E2 slice. This involves
       computing the additive result incrementally, utilizing a cascaded approach as illustrated in
       Figure 4-5.
       [Figure 4-5 (diagram not reproduced): adder cascade of eight DSP48E2 slices. Slice 1 multiplies
       X(n) by h0(n) and adds zero; slices 2 through 8 multiply by h1(n-1) through h7(n-7) and add
       the 48-bit cascade from the slice below with no wire shift. Multiplier inputs are 18 bits
       wide, and each product is sign extended from 36 bits to 48 bits. The output of slice 8 is
       Y(n-10). The post adders are contained entirely in dedicated silicon for highest
       performance and lowest power.]
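The incremental accumulation through the cascade can be sketched functionally as follows. This is a behavioral illustration only: it ignores the pipeline registers and the resulting latency, and the helper names `to_signed48` and `cascade_fir` are hypothetical.

```python
MASK48 = (1 << 48) - 1


def to_signed48(value):
    """Interpret the low 48 bits of value as a two's complement number."""
    value &= MASK48
    return value - (1 << 48) if value & (1 << 47) else value


def cascade_fir(coeffs, samples):
    """Compute FIR outputs by passing a partial sum from slice to slice.

    Each loop iteration over `coeffs` stands for one slice: it multiplies
    its coefficient by a delayed sample and adds the 48-bit cascade input.
    """
    outputs = []
    for n in range(len(samples)):
        pcin = 0  # the bottom slice adds zero
        for k, h in enumerate(coeffs):
            x = samples[n - k] if n - k >= 0 else 0
            pcin = (pcin + h * x) & MASK48  # wraps like a 48-bit adder
        outputs.append(to_signed48(pcin))
    return outputs


if __name__ == "__main__":
    # Impulse response of a 3-tap filter reproduces the coefficients.
    print(cascade_fir([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

No intermediate sum leaves the chain, which mirrors how the cascade path keeps the post addition inside the DSP column.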
       IMPORTANT: The height of the DSP column can differ between devices and should be considered while
       porting designs.
       Spanning columns is possible by taking the bus output from the top of one DSP column and
       adding CLB slice pipeline registers to route this bus to the C input of the bottom DSP48E2
       slice of the adjacent DSP column. Alignment of input operands is also necessary to span
       multiple DSP columns.
       These time-multiplexed DSP designs have optional pipelining that permits aggregate
       multichannel sample rates of up to 500 million samples per second. Implementing a
       time-multiplexed design using the DSP48E2 slice results in reduced resource utilization and
       reduced power.
       The DSP48E2 slice contains the basic elements of classic FIR filters: a multiplier followed by
       an adder, delay or pipeline registers, and the ability to cascade an input stream (B bus) and
       an output stream (P bus) without exiting to a general CLB slice.
                             Figure 4-6 illustrates how the pre-adder (shaded in gray) can be used in an 8-tap even
                             symmetric systolic FIR design.
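       [Figure 4-6 (diagram not reproduced): 8-tap even symmetric systolic FIR. The input x(n) passes
       through z-2 delay stages, four pre-adders sum the paired samples, four multipliers apply
       the coefficients h0 through h3, and the post adders with z-1 registers cascade to the
       output y(n-8).]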
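The benefit of the pre-adder can be illustrated with a short functional check. The sketch below assumes an even symmetric filter, where the coefficients satisfy h[k] = h[N-1-k], and ignores pipeline latency; the function names are hypothetical.

```python
def sample_at(x, i):
    """Return x[i], treating samples before the start of the stream as zero."""
    return x[i] if i >= 0 else 0


def direct_fir(h, x, n):
    """Direct form: one multiply per tap."""
    return sum(h[k] * sample_at(x, n - k) for k in range(len(h)))


def symmetric_fir(half, x, n):
    """Pre-adder form: one multiply per pair of taps.

    `half` holds the first half of an even symmetric coefficient set,
    for example [h0, h1, h2, h3] for an 8-tap filter.
    """
    taps = 2 * len(half)
    total = 0
    for k, hk in enumerate(half):
        pre_add = sample_at(x, n - k) + sample_at(x, n - (taps - 1 - k))
        total += hk * pre_add
    return total


if __name__ == "__main__":
    half = [1, -2, 3, 4]
    full = half + half[::-1]           # 8 symmetric coefficients
    x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
    for n in range(len(x)):
        assert direct_fir(full, x, n) == symmetric_fir(half, x, n)
    print("pre-adder form matches direct form")
```

Adding the two samples that share a coefficient before multiplying halves the number of multipliers, so an 8-tap symmetric filter needs four.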
       Overview
       This chapter describes the dedicated cascade features and clarifies key details of cascade
       signals.
       CARRYOUT/CARRYCASCOUT
       The DSP48E2 slice and the fabric carry chain use a different implementation style for
       subtract functions. The carry chain implementation in the CLB slices requires the CLB carry
       input pin to be connected to a constant 1 during subtract operations. The standard subtract
       operation with ALUMODE = 0011 in the DSP48E2 slice does not require the CARRYIN pin to
       be set to 1.
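The two subtract styles can be compared with a small arithmetic sketch. It assumes the fabric computes A + NOT(B) + 1 and that the DSP48E2 subtract corresponds to inverting Z before and the result after the addition, which gives Z - (X + Y + CIN). The function names are hypothetical, and the model covers only the arithmetic, not the hardware pins.

```python
WIDTH = 48
MASK = (1 << WIDTH) - 1


def fabric_subtract(a, b):
    """Carry-chain style: A + NOT(B) with the carry input forced to 1."""
    return (a + (~b & MASK) + 1) & MASK


def dsp_subtract(z, x, y=0, cin=0):
    """DSP style sketch: NOT(NOT(Z) + X + Y + CIN) equals Z - (X + Y + CIN).

    Assumption: this mirrors the ALUMODE = 0011 arithmetic; the carry
    input stays free for other uses because it is not needed as the +1.
    """
    inner = ((~z & MASK) + x + y + cin) & MASK
    return ~inner & MASK


if __name__ == "__main__":
    print(fabric_subtract(10, 3))      # 7
    print(dsp_subtract(10, 3))         # 7
    print(dsp_subtract(10, 3, cin=1))  # 6
```

In the second form the carry input simply contributes to the value being subtracted, so it does not have to be tied to 1.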
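       [Figure (diagram not reproduced, X16778-042617): fabric carry-chain adder computing A+B with a
       free carry input, and fabric adder/subtracter computing A±B with Sub/Add = 1/0. Carry
       input must be 1 for a subtract operation, so it is not available for other uses.]
       [Figures (diagrams not reproduced): add/subtract structures with inputs A and B controlled
       by Sub/add, and inputs Y and Z controlled by ALUMODE[0], CIN, and ALUMODE[1].]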
       •   The inversion of the P output obtained by using ALUMODE = 0010 can be cascaded to
           another DSP slice to implement a two's complement subtract.
       IMPORTANT: CARRYOUT[3] and CARRYCASCOUT are not valid for three-input and four-input add/sub
       functions.
       All DSP four-input add operations (including Multiply-Add and Multiply Accumulate)
       produce two CARRYOUT bits for retaining full precision. This is shown in Figure 5-4.
       MULTSIGNOUT and CARRYCASCOUT serve as the two carry bits for MACC_EXTEND
       operations. If MULTSIGNOUT is the multiplier sign bit and CARRYCASCOUT is the cascaded
       carryout bit, the result is the software/Unisim model abstraction, shown in Figure 5-4.
       OPMODEREG and CARRYINSELREG must also be set to 1 when building large
       accumulators such as the 96-bit Multiply Accumulate. This prevents the simulation model
       from propagating unknowns to the upper DSP48E2 slice when a reset occurs.
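The need for two carry bits in an extended accumulator can be seen in a simple arithmetic model. The sketch below holds a 96-bit accumulator as two 48-bit words and passes a product sign bit and a carry bit from the lower word to the upper word. It is an illustration of the arithmetic only; it does not model how the hardware encodes MULTSIGNOUT and CARRYCASCOUT, and the function name `macc96_step` is hypothetical.

```python
MASK48 = (1 << 48) - 1
MASK96 = (1 << 96) - 1


def macc96_step(upper, lower, product):
    """Add a signed product to a 96-bit accumulator split into two words.

    Assumption: the product fits in 48 signed bits, as in the lower slice.
    """
    low_sum = lower + (product & MASK48)
    carry = low_sum >> 48              # carry out of the lower 48 bits
    sign = 1 if product < 0 else 0     # sign of the product
    # A negative product sign-extends as all ones into the upper word.
    upper = (upper + carry + (MASK48 if sign else 0)) & MASK48
    return upper, low_sum & MASK48


if __name__ == "__main__":
    upper, lower = macc96_step(0, 0, -1)
    print(hex(upper), hex(lower))      # both words are all ones (-1)
    upper, lower = macc96_step(upper, lower, 2)
    print(upper, lower)                # 0 1
```

Either bit alone is insufficient: the carry records overflow of the lower word, and the sign records whether the product extends into the upper word as zeros or ones.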
       Summary
       Adder/Subtracter-only Operation
       CARRYOUT[3]: Hardware and software match.
       CARRYCASCOUT: Hardware and software match when ALUMODE = 0000, 0001, or 0010,
       and are inverted with respect to each other when ALUMODE = 0011. The mismatch occurs
       because the DSP48E2 slice performs the subtract operation using a different algorithm
       from the CLB logic; thus, the DSP48E2 slice requires an inverted CARRYOUT from the CLB logic.
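One way to see why the carry polarity differs is to compare the raw adder carry in the two subtract formulations. This sketch assumes the CLB style computes A + NOT(B) + 1 and the DSP style computes NOT(NOT(Z) + X); it models only the arithmetic and is not the Unisim model. The function names are hypothetical.

```python
WIDTH = 48
MASK = (1 << WIDTH) - 1


def fabric_sub_carry(a, b):
    """Carry out of A + NOT(B) + 1: it is 1 when a >= b (no borrow)."""
    total = a + (~b & MASK) + 1
    return total >> WIDTH


def dsp_sub_carry(z, x):
    """Carry out of the inner add NOT(Z) + X: it is 1 when x > z (borrow)."""
    total = (~z & MASK) + x
    return total >> WIDTH


if __name__ == "__main__":
    for a, b in [(10, 3), (5, 5), (3, 10)]:
        print(a, b, fabric_sub_carry(a, b), dsp_sub_carry(a, b))
```

For every operand pair in this model the two carries are complements of each other, which is consistent with the inversion described for ALUMODE = 0011.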
       MACC Operation
       CARRYOUT[3] is invalid in the MACC operation.
       [Figure 5-4 (diagram not reproduced): software model and hardware implementation of the MACC
       operation. In both views, A and B feed the multiplier, and the adder takes the multiplier
       result, CARRYIN, and the Zmux input (e.g., C, P, PCIN) to produce MULTSIGNOUT,
       CARRYCASCOUT, and P[47:0]. In the hardware implementation, partial products from the
       multiply operation are added together in the second stage four-input adder.]
       Xilinx Resources
       For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
       Support.
       Solution Centers
       See the Xilinx Solution Centers for support on devices, software tools, and intellectual
       property at all stages of the design cycle. Topics include design assistance, advisories, and
       troubleshooting tips.
       •   From the Vivado® IDE, select Help > Documentation and Tutorials.
       •   On Windows, select Start > All Programs > Xilinx Design Tools > DocNav.
       •   At the Linux command prompt, enter docnav.
       Xilinx Design Hubs provide links to documentation organized by design tasks and other
       topics, which you can use to learn key concepts and address frequently asked questions. To
       access the Design Hubs:
       •   In the Xilinx Documentation Navigator, click the Design Hubs View tab.
       •   On the Xilinx website, see the Design Hubs page.
       Note: For more information on Documentation Navigator, see the Documentation Navigator page
       on the Xilinx website.
       References
       1. UltraScale Architecture Migration Methodology Guide (UG1026)
       2. UltraScale and UltraScale+ device data sheets: