Programmable Logic Devices
Tutorial 8
Michal Kubíček
Department of Radio Electronics, FEEC BUT Brno
Created with the support of the OP VVV project Moderní a otevřené studium techniky, CZ.02.2.69/0.0/0.0/16_015/0002430.
Tutorial 8
FPGAs in detail
❑ Logic cells
❑ FPGA architecture
❑ Memories in FPGA
page 2 kubicek@vutbr.cz
Logic cells
LUT / MUX technology
Logic cells
FPGA versus CPLD
❑ The same basic structure: configurable logic cells connected with a
programmable interconnect structure, complemented with configurable IO cells.
FPGA CPLD
Logic cells
FPGA versus CPLD
So what is the (main) difference?
CPLD cell: coarse grain architecture
FPGA cell: fine grain architecture
[Figure: in both cases the cell consists of a function generator driven by the inputs, followed by a flip-flop]
Logic cells
Logic cell
composed of combinatorial and sequential block
CPLD
Logic cells
Logic cell
composed of combinatorial and sequential block
FPGA
Logic cells
Logic cell
composed of combinatorial and sequential block
❑ The FPGA architecture is Medium Grained: each cell
performs a relatively simple logic function.
❑ This leads to better area utilization and a
higher maximum frequency, but places higher
demands on the programmable interconnect.
❑ Granularity: a CPLD is considered Coarse
Grained, while an ASIC is considered Fine Grained.
Combinatorial function generator
Logic cells
MAP process
RTL Schematic Technology Schematic
Logic cells
MAP result: each LUT has its content (a defined logic function), for example:
O = ((I0 * I1 * !I2 * !I3) + (!I0 * !I1 * I2 * I3) + (I0 * I1 * I2 * I3) + (!I0 * !I1 * !I2 * !I3));
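The LUT content for the equation above can be written down as a 16-bit truth table. The following is only a behavioral sketch, not the tool-generated primitive: the entity name lut4_example and the constant LUT_INIT are made up for illustration, and the address bit order (I3 is the MSB) is an assumption. The four product terms correspond to the input combinations 0, 3, 12 and 15, so only those bits of the table are '1'.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical behavioral model of one 4-input LUT (illustration only)
entity lut4_example is
  port (
    I0, I1, I2, I3 : in  std_logic;
    O              : out std_logic);
end lut4_example;

architecture rtl of lut4_example is
  -- Bit k holds the output for the input combination k = (I3 I2 I1 I0).
  -- Bits 0, 3, 12 and 15 are set: 1001 0000 0000 1001 = X"9009"
  constant LUT_INIT : std_logic_vector(15 downto 0) := X"9009";
  signal addr : std_logic_vector(3 downto 0);
begin
  -- the LUT inputs act as the address of the 16 x 1b memory
  addr <= I3 & I2 & I1 & I0;
  O    <= LUT_INIT(to_integer(unsigned(addr)));
end rtl;
```

Any other function of four variables differs only in the value of LUT_INIT; the structure (and therefore the delay) of the LUT stays the same.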
Logic cells
Combinatorial portion of the logic cell
FPGA:
Inputs
Function Flip-
generator Flop
Two variants:
• based on multiplexers (MUX based)
• based on look-up table (LUT based)
Logic cells
"Y = 1 when C = 1 or when
A and B = 1"
MUX based LUT based
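The two variants can be compared on the example function Y = C + (A * B). The code below is a minimal sketch under stated assumptions: the entity mux_vs_lut and its signal names are hypothetical, the MUX variant is modeled as two chained 2:1 multiplexers, and the LUT variant as an 8 x 1b table addressed by (C B A). Both outputs implement the same function.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical comparison of a MUX based and a LUT based function generator
entity mux_vs_lut is
  port (
    A, B, C : in  std_logic;
    Y_mux   : out std_logic;
    Y_lut   : out std_logic);
end mux_vs_lut;

architecture rtl of mux_vs_lut is
  -- Bit k holds Y for the input combination k = (C B A).
  -- Y = 1 for k = 3 (A = B = 1) and for k = 4..7 (C = 1): 1111 1000 = X"F8"
  constant LUT_INIT : std_logic_vector(7 downto 0) := X"F8";
  signal ab   : std_logic;
  signal addr : std_logic_vector(2 downto 0);
begin
  -- MUX based: first multiplexer forms A and B, second one selects '1' when C = 1
  ab    <= B when A = '1' else '0';
  Y_mux <= '1' when C = '1' else ab;

  -- LUT based: the inputs address a small memory holding the truth table
  addr  <= C & B & A;
  Y_lut <= LUT_INIT(to_integer(unsigned(addr)));
end rtl;
```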
Logic cells
Combinatorial portion of the logic cell
❑ The MUX structure is better suited to implementing data control and
switching logic.
❑ The LUT structure is MUCH better for implementing arithmetic functions.
Moreover, it is easier to handle for automatic synthesis tools (Computer
Aided Design; CAD).
The LUT architecture has dominated since the early 1990s; virtually no other
FPGAs are available today (which does not mean that MUX based FPGAs will not
appear again sometime in the future).
Logic cells
Optimum LUT size (number of inputs)
The LUT size depends directly on the number of its inputs (an n-input LUT needs 2^n memory cells):
• 3 inputs => 8 memory cells
• 4 inputs => 16 memory cells (Spartan-3, Virtex-II and Virtex-4)
• 5 inputs => 32 memory cells
• 6 inputs => 64 memory cells (Spartan-6, Virtex-5,6,7)
• 10 inputs => 1024 memory cells
More inputs ➔ larger address decoder (AND stage) ➔ longer propagation delay
Large LUTs are inefficient for implementing simple logic functions (with few inputs).
Some of the first FPGAs featured a heterogeneous structure composed of both 3-input and 4-input
LUTs, with the aim of utilizing the silicon area better. However, this structure was not well suited to
the rather simple implementation tools (synthesizers, mappers) of the day.
Logic cells
Optimum LUT size (number of inputs)
So small LUTs are better because they are faster and more efficient, right?
No, not really: for more complex logic functions (with more inputs) it is necessary to
chain several simple LUTs ➔ the programmable interconnect must be used ➔ the propagation
delay rises significantly!
A compromise is needed; the choice is up to the FPGA manufacturers.
CPLD: each logic cell can implement a complex logic function ➔ a CPLD is efficient for a small
number of complex functions. When a function is so complex that it cannot fit into a single
cell, several cells must be chained. Each cell by itself already has a relatively large propagation
delay ➔ propagation through several cells severely degrades system performance (FMAX).
FPGA: even relatively simple logic functions must use several logic cells. BUT there is a huge
number of Flip-Flops available in an FPGA ➔ it is easy and "cheap" to use intensive pipelining ➔
logic cells are used efficiently while system performance (FMAX) is not degraded.
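As an illustration of the pipelining argument, the following sketch adds four 8-bit operands in two register stages, so that only one adder level lies between any two flip-flops. The entity name add4_pipelined, the operand widths and the two-stage split are assumptions chosen for the example; the result appears two clock cycles after the inputs.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical two-stage pipelined adder of four operands
entity add4_pipelined is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(7 downto 0);
    sum        : out unsigned(9 downto 0));
end add4_pipelined;

architecture rtl of add4_pipelined is
  signal ab_reg, cd_reg : unsigned(8 downto 0) := (others => '0');
  signal sum_reg        : unsigned(9 downto 0) := (others => '0');
begin
  pipe_proc : process (clk) begin
    if rising_edge(clk) then
      -- stage 1: two partial sums, each stored in its own register
      ab_reg <= resize(a, 9) + resize(b, 9);
      cd_reg <= resize(c, 9) + resize(d, 9);
      -- stage 2: final sum of the registered partial sums
      sum_reg <= resize(ab_reg, 10) + resize(cd_reg, 10);
    end if;
  end process pipe_proc;

  sum <= sum_reg;
end rtl;
```

The extra registers cost almost nothing in an FPGA, because each logic cell already contains a flip-flop next to its LUT.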
Logic cells
Optimum LUT size (number of inputs)
Spartan-3: 4-input LUT
Logic cells
Optimum LUT size (number of inputs)
Virtex-5: 6-input LUT
Logic cells
Optimum LUT size (number of inputs)
Virtex-7U: 6-input LUT / 2x5-input LUT
Logic cells
Optimum LUT size (number of inputs)
Today, several small LUTs are usually grouped into a logic cell instead of a single large LUT ➔
a more efficient structure (which places higher demands on the implementation tools).
Adaptive Logic Module (ALM)
Stratix-10 (Intel)
Logic cells
Optimum LUT size (number of inputs)
Today, several small LUTs are usually grouped into a logic cell instead of a single large LUT.
Adaptive Logic Module (ALM)
Stratix-10 (Intel)
Real implementation: one pseudo 6-input LUT is composed of four 4-input LUTs
Logic cells
Optimum LUT size (number of inputs)
Today, several small LUTs are usually grouped into a logic cell instead of a single large LUT.
Adaptive Logic Module (ALM)
Stratix-10 (Intel)
The result: very flexible (efficient) structure
Alternative use of a LUT
LUT = only a logic function generator?
LUT: other functions
Alternative LUT functions
Each 4-input LUT is composed of 16 SRAM cells = 16 registers (flip-flops).
A small modification of the LUT (internal interconnection of the registers) makes it possible to
use these registers in alternative ways (most FPGAs enable this today).
Not every LUT in the FPGA is capable of all the alternative functions (this depends on the particular FPGA).
A LUT4 can work as:
• Comb: combinatorial function of four input variables and one output variable.
• RAM: 16b RAM memory ("distributed RAM").
• Shift REG: shift register with a dynamic length of up to 16b; the input signals are used to adjust
the shift register length.
LUT = RAM/ROM
LUT: other functions
LUT as a RAM/ROM
Distributed RAM/ROM
Small and fast memories
[Figure: several 16 x 1b LUT memories with common Address inputs and separate Data lines combined into a 16 x 4b (64b) memory]
Theoretical maximum of distributed RAM in some FPGAs:
Spartan-3 xc3s500: 73 kb
Virtex-7 UltraScale+: 46 Mb
LUT: other functions
LUT as a RAM/ROM
Distributed RAM/ROM
Modes (Virtex-5)
How to use the distributed
RAM in your design?
LUT: other functions
How to use the distributed RAM
There are several options; the same holds for other primitive components.
❑ Inference: the synthesizer/mapper is able to extract the required functionality from generic HDL
code. Best portability and flexibility, but it does not work for all primitive components.
❑ Instantiation: direct use of primitive components (templates and descriptions can be found in
Language Templates and in the Architecture Libraries Guide). Not portable, hard to code, but
fully controllable.
❑ IP Core / Wizard: easy to use, optimized implementation, but not portable, less flexible, and not
available for all primitive components.
❑ Special macros: for example Xilinx Parametrized Macros (XPM); HDL code (like instantiation)
but easier to use, with assured optimization and poor portability (like IP Core / Wizard).
LUT: other functions
Design – inference
Distributed RAM/ROM – using a VHDL code
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
---------------------------------------------------------
ENTITY RAM_64x8 IS
PORT (
clk : IN STD_LOGIC;
WR : IN STD_LOGIC;
ADDR : IN STD_LOGIC_VECTOR (5 DOWNTO 0);
D_in : IN STD_LOGIC_VECTOR (7 DOWNTO 0);
D_out : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
D_out_reg : OUT STD_LOGIC_VECTOR (7 DOWNTO 0));
END RAM_64x8;
ARCHITECTURE Behavioral OF RAM_64x8 IS
  -- memory declaration and initialization
  TYPE RamType IS ARRAY (0 TO 63) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
  SIGNAL ram_1 : RamType := (X"78", X"76", X"30", OTHERS => X"00");
BEGIN
  -- data write (synchronous, enabled by WR)
  RAM_write_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      IF WR = '1' THEN
        ram_1(to_integer(unsigned(ADDR))) <= D_in;
      END IF;
    END IF;
  END PROCESS RAM_write_proc;
  -- asynchronous read
  D_out <= ram_1(to_integer(unsigned(ADDR)));
  -- synchronous read
  RAM_read_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      D_out_reg <= ram_1(to_integer(unsigned(ADDR)));
    END IF;
  END PROCESS RAM_read_proc;
END Behavioral;
LUT: other functions
Design – inference
Distributed RAM/ROM or Block RAM/ROM – using a VHDL code
-- attribute declaration (needed only once)
attribute RAM_STYLE : string;
-- distributed RAM
attribute RAM_STYLE of ram_1: signal is "DISTRIBUTED";
-- or block RAM (use only one of the two specifications for ram_1)
attribute RAM_STYLE of ram_1: signal is "BLOCK";
Complete list of attribute values (XST, 6 and 7 series FPGAs):
{auto|block|distributed|pipe_distributed|block_power1|block_power2}
LUT: other functions
Design – instantiation
Distributed RAM/ROM – VHDL instantiation (Language templates)
-- Xilinx primitives require: library UNISIM; use UNISIM.vcomponents.all;
RAM16X1D_1_inst : RAM16X1D
generic map (
  INIT => X"0000")
port map (
  DPO => DPO, -- Read-only 1-bit data output for DPRA
  SPO => SPO, -- R/W 1-bit data output for A0-A3
  A0 => A0, -- R/W address[0] input bit
  A1 => A1, -- R/W address[1] input bit
  A2 => A2, -- R/W address[2] input bit
  A3 => A3, -- R/W address[3] input bit
  D => D, -- Write 1-bit data input
  DPRA0 => DPRA0, -- Read-only address[0] input bit
  DPRA1 => DPRA1, -- Read-only address[1] input bit
  DPRA2 => DPRA2, -- Read-only address[2] input bit
  DPRA3 => DPRA3, -- Read-only address[3] input bit
  WCLK => WCLK, -- Write clock input
  WE => WE -- Write enable input
);
LUT: other functions
Design – IP Core / Wizard
Distributed RAM/ROM – IP Core Generator (Xilinx ISE)
LUT: other functions
Design – IP Core / Wizard
Distributed RAM/ROM – IP Core Generator (Xilinx Vivado)
LUT: other functions
Design – XPM (Language Templates)
LUT = shift register
LUT: other functions
LUT as a shift register
Very efficient structure for delaying data signals.
For the shift register to be extracted (inferred) correctly, no RESET, SET or LOAD function may be
described in the code. A CLOCK ENABLE function is allowed. The address input can be used ➔
shift register with dynamically adjustable length.
[Figure: chain of registers fed by D_in and clocked by clk; a MUX controlled by adr selects the stage that drives D_out]
LUT: other functions
LUT as a shift register
Artix-7 LUT:
LUT: other functions
LUT as a shift register – inference
TYPE t_shreg IS ARRAY (15 DOWNTO 0) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
SIGNAL shreg : t_shreg := (X"33", X"22", X"11", OTHERS => X"00");
...
-- the shift register functionality
-- no reset allowed (both synchronous and asynchronous)
shreg_proc : PROCESS (clk) BEGIN
  IF rising_edge(clk) THEN
    IF ce = '1' THEN
      shreg <= shreg(shreg'HIGH-1 DOWNTO 0) & data_i;
    END IF;
  END IF;
END PROCESS shreg_proc;
LUT: other functions
LUT as a shift register – inference
...
-- Shift register output tap selection (dynamic length functionality)
out_select_proc : PROCESS (shreg, addr) BEGIN
CASE addr IS
WHEN "000" => data_o <= shreg(0);
WHEN "001" => data_o <= shreg(1);
...
END CASE;
END PROCESS out_select_proc;
-- equivalent description (to the previous process)
data_o <= shreg(to_integer(unsigned(addr)));
Sequential portion of a logic cell
Register (Flip Flop)
Logic cell
Sequential portion of a logic cell
❑ Today always a D-type register
❑ Parametrized (several settings)
❑ Can be set to LATCH or FLIP-FLOP
function; modern FPGAs support only
the FLIP-FLOP function.
❑ There is no saving (in terms of HW
resources) when using the LATCH
functionality instead of the FLIP-
FLOP.
❑ The reset can be asynchronous but it
is not recommended for Xilinx FPGAs.
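A register that matches these recommendations (D-type flip-flop, clock enable, synchronous reset) could be described as in the following sketch. The entity name dff_ce_srst and the port names are hypothetical; the priority of the reset over the clock enable is an assumption of this example.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical D-type flip-flop with clock enable and synchronous reset
entity dff_ce_srst is
  port (
    clk  : in  std_logic;
    ce   : in  std_logic;   -- clock enable
    srst : in  std_logic;   -- synchronous reset, active high
    d    : in  std_logic;
    q    : out std_logic);
end dff_ce_srst;

architecture rtl of dff_ce_srst is
  signal q_reg : std_logic := '0';  -- initial value after configuration
begin
  reg_proc : process (clk) begin
    if rising_edge(clk) then
      if srst = '1' then      -- reset is evaluated only on the clock edge
        q_reg <= '0';
      elsif ce = '1' then     -- data is stored only when enabled
        q_reg <= d;
      end if;
    end if;
  end process reg_proc;

  q <= q_reg;
end rtl;
```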
FPGA architecture
Grouping of basic cells
FPGA architecture
SLICE (Xilinx FPGAs)
The logic cells are grouped into larger units; Xilinx uses the term SLICE for them. The
slice size varies and depends on the FPGA architecture.
SLICE = 2 x LUT + 2 x Flip-Flop (Spartan-3)
SLICE = 4 x LUT + 4 x Flip-Flop (Virtex-5)
SLICE = 4 x LUT + 8 x Flip-Flop (Virtex 7)
FPGA architecture
SLICE
SLICE = 2 x LUT + 2 x Flip-Flop (Spartan-3)
The CLOCK SIGNAL is common for the whole slice.
The registers feature a CE (clock enable) input, which is usually also common for the
whole slice.
The registers have a set/reset input. Only a single variant of the set/reset function is
supported by the registers (either set or reset, either synchronous or asynchronous). Any
additional set/reset function (if needed) must be emulated using general purpose logic
(LUTs).
FPGA architecture
SLICE
A very important part of the logic cells is the so-called CARRY LOGIC, a dedicated high-speed
interconnect between neighboring slices. It is often used to implement arithmetic functions,
hence the name.
It is a set of multiplexers that enable fast interconnection (chaining) of neighboring slices, so that
the general purpose interconnect, which has significantly higher latency, need not be used.
Implementation tools are able to utilize this
structure without any user intervention.
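A plain behavioral adder is the typical case in which the tools are expected to use the carry logic on their own. The sketch below assumes a 16-bit operand width and a hypothetical entity name adder16; whether the carry chain is really used depends on the tool and the target device and should be checked in the technology schematic.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical 16-bit adder described behaviorally; no primitive is instantiated
entity adder16 is
  port (
    a, b : in  unsigned(15 downto 0);
    sum  : out unsigned(16 downto 0));  -- one extra bit for the carry out
end adder16;

architecture rtl of adder16 is
begin
  -- operands are extended by one bit so that the carry is not lost
  sum <= resize(a, 17) + resize(b, 17);
end rtl;
```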
FPGA architecture
SLICE
Modern FPGA SLICEs feature additional components that improve their utilization: an XOR gate
(the most demanding logic function), signal switches (to enable independent utilization of the
LUT and the register), switches for LUT function selection, etc.
Spartan-3: ½ SLICE
FPGA architecture
Configurable Logic Block
A higher hierarchical unit in Xilinx FPGAs is called a Configurable Logic Block (CLB).
There are several SLICEs in each CLB; some slices have limited functionality.
The number of SLICEs per CLB can differ between
FPGA families (architectures):
Spartan-3:
CLB = 4 x SLICE
SLICE = 2 x (LUT + Flip-Flop)
Virtex-6:
CLB = 2 x SLICE
SLICE = 4 x (LUT + Flip-Flop)
CLB = 4 x SLICE (Spartan-3)
FPGA architecture
Configurable Logic Block
Virtex-5: each CLB contains 2 fully featured SLICEs and 2 simplified SLICEs
FPGA architecture
Xilinx FPGA: Clock regions, TILEs, Super Logic Regions
Memories in FPGA
Block RAM memories
FPGA: block RAM
Block RAM memories
BRAM
Properties:
• True Dual Port
• Works at FPGA core clock frequency
• Synchronous, optional output registers
• Native support of ECC
Usage
• Fast memories (RAMs, CACHEs)
• Core of a FIFO memory
• Frame/packet buffers
• ROM memories
Block RAM memories
BRAM 36 kb
Available modes (depth x width):
1 x 36k       2 x 18k
32K x 1       16K x 1
16K x 2       8K x 2
8K x 4        4K x 4
4K x 9        2K x 9
2K x 18       1K x 18
1K x 36       512 x 36
512 x 72      (not available)
The mode can be different for each port.
Block RAM memories
Write and read process (write first mode)
Block RAM memories
Write and read process (read first mode)
Block RAM memories
Write and read process (no change mode)
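The three modes differ only in what appears on the data output during a write. The following behavioral sketch puts them side by side for comparison; the entity bram_modes, the 1K x 18 organization and the three separate outputs are assumptions made for illustration. In a real design only one of the three output styles would be described, so that the memory maps to a single block RAM.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical single-port memory showing the three read-during-write behaviors
entity bram_modes is
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    addr  : in  std_logic_vector(9 downto 0);
    di    : in  std_logic_vector(17 downto 0);
    do_wf : out std_logic_vector(17 downto 0);   -- write first
    do_rf : out std_logic_vector(17 downto 0);   -- read first
    do_nc : out std_logic_vector(17 downto 0));  -- no change
end bram_modes;

architecture rtl of bram_modes is
  type ram_t is array (0 to 1023) of std_logic_vector(17 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  ram_proc : process (clk) begin
    if rising_edge(clk) then
      -- read first: the old content of the addressed cell is always read
      do_rf <= ram(to_integer(unsigned(addr)));
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= di;
        -- write first: the newly written data appears on the output
        do_wf <= di;
        -- no change: do_nc is not assigned here, so it keeps its last value
      else
        do_wf <= ram(to_integer(unsigned(addr)));
        do_nc <= ram(to_integer(unsigned(addr)));
      end if;
    end if;
  end process ram_proc;
end rtl;
```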
Block RAM memories
BRAM as a core of a FIFO memory
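A simple synchronous FIFO built around an inferred memory could look like the sketch below. It is not the vendor FIFO core: the entity sync_fifo, the generics and the pointer scheme (one extra pointer bit to tell full from empty) are assumptions of this example, and whether the array maps to block RAM depends on the synthesis tool.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical single-clock FIFO; default size 512 x 36 fits one 18k BRAM mode
entity sync_fifo is
  generic (
    ADDR_BITS : natural := 9;
    DATA_BITS : natural := 36);
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;   -- synchronous reset of the pointers
    wr_en : in  std_logic;
    din   : in  std_logic_vector(DATA_BITS - 1 downto 0);
    rd_en : in  std_logic;
    dout  : out std_logic_vector(DATA_BITS - 1 downto 0);
    full  : out std_logic;
    empty : out std_logic);
end sync_fifo;

architecture rtl of sync_fifo is
  type ram_t is array (0 to 2**ADDR_BITS - 1)
    of std_logic_vector(DATA_BITS - 1 downto 0);
  signal ram : ram_t;
  -- pointers are one bit wider than the address: the MSB counts wrap-arounds
  signal wr_ptr, rd_ptr : unsigned(ADDR_BITS downto 0) := (others => '0');
  signal full_i, empty_i, do_wr, do_rd : std_logic;
begin
  empty_i <= '1' when wr_ptr = rd_ptr else '0';
  full_i  <= '1' when (wr_ptr(ADDR_BITS) /= rd_ptr(ADDR_BITS)) and
                      (wr_ptr(ADDR_BITS - 1 downto 0) =
                       rd_ptr(ADDR_BITS - 1 downto 0)) else '0';
  do_wr   <= wr_en and not full_i;    -- ignore writes to a full FIFO
  do_rd   <= rd_en and not empty_i;   -- ignore reads from an empty FIFO

  -- memory access without reset, so that a RAM can be inferred
  mem_proc : process (clk) begin
    if rising_edge(clk) then
      if do_wr = '1' then
        ram(to_integer(wr_ptr(ADDR_BITS - 1 downto 0))) <= din;
      end if;
      if do_rd = '1' then
        dout <= ram(to_integer(rd_ptr(ADDR_BITS - 1 downto 0)));
      end if;
    end if;
  end process mem_proc;

  -- pointer update
  ptr_proc : process (clk) begin
    if rising_edge(clk) then
      if rst = '1' then
        wr_ptr <= (others => '0');
        rd_ptr <= (others => '0');
      else
        if do_wr = '1' then
          wr_ptr <= wr_ptr + 1;
        end if;
        if do_rd = '1' then
          rd_ptr <= rd_ptr + 1;
        end if;
      end if;
    end if;
  end process ptr_proc;

  full  <= full_i;
  empty <= empty_i;
end rtl;
```

The read data is registered, so dout is valid one clock cycle after rd_en is accepted, which matches the synchronous nature of the BRAM.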
Block RAM memories
BRAM – microsequencer (FSM)
Block RAM memories
Spartan-3, Virtex-II, Virtex-4: BRAM 18k
Spartan-6, Virtex-5, 6, 7: BRAM 36k
Spartan-3 xc3s200 (15 USD): 12 x 18 kb = 216 kb (27 kB)
Spartan-6 xc6slx4 (12 USD): 4 x 36 kb = 144 kb
Spartan-6 xc6slx150 (165 USD): 134 x 36 kb = 4.8 Mb
Virtex-7 xc7vx1140t (16 800 USD): 1880 x 36 kb = 67.7 Mb
Kintex-7 UltraScale xcku115 (5 500 USD): 75.9 Mb
Virtex-7 UltraScale xcvu440 (55 000 USD): 88.6 Mb (11 MB)
RAM memories
Xilinx 7-series UltraScale+: UltraRAM
Larger block memories 4Kx72 (288 Kb = 36 kB)
Up to 432 blocks on a single chip
Virtex-7 UltraScale+ (VU13P)
Distributed RAM 48 Mb (6 MB)
BlockRAM 95 Mb (12 MB)
UltraRAM 360 Mb (45 MB)
RAM memories
Xilinx: DRAM memory (HBM) integration into FPGA
Integration of HBM chips directly
into the FPGA package: a silicon
interposer is used to connect FPGA
to the HBM.
RAM memories
Intel: DRAM memory (HBM) integration into FPGA
Aggregated throughput up to 512 GB/s
For comparison: 10 x DDR4 DIMM has a
peak throughput of 256 GB/s
Thank you for your attention!