hls4ml FPGA Tutorial Guide
● In this session you will get hands-on experience with the hls4ml package
● We’ll learn how to:
○ Translate models into synthesizable FPGA code
○ Explore the different handles provided by the tool to optimize the inference
■ Latency, throughput, resource usage
● Make our inference more computationally efficient with pruning and quantization
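As a rough sketch of the first item, the conversion flow might look like the function below. The function name `convert_keras_model`, the output directory, and the FPGA part number are placeholders chosen for illustration; hls4ml is imported inside the function so the snippet can be read and loaded without the package installed.

```python
def convert_keras_model(model, output_dir="hls4ml_prj", part="xcu250-figd2104-2L-e"):
    """Convert a trained Keras model into an hls4ml project (sketch)."""
    import hls4ml  # imported lazily; requires the hls4ml package

    # Derive a starting configuration (precision, reuse factor) from the model
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")

    # Translate the Keras model into an HLS project on disk
    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, output_dir=output_dir, part=part
    )

    # Compile the C++ emulation so hls_model.predict() can run on the CPU
    hls_model.compile()
    return hls_model
```

Calling `hls_model.build(csim=False)` on the returned object would then run the synthesis, which needs a working Vivado installation.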
[Figure: LHC experiment data flow, from collisions (40 MHz) through the L1 Trigger and High-Level Trigger to offline computing]
L1 trigger:
∙ 40 MHz in / 100 kHz out
∙ Processes 100s of TB/s
∙ Trigger decision to be made in ≈ 10 μs
∙ Coarse local reconstruction
∙ Implemented in FPGAs / hardware
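The numbers above imply a few derived quantities that are easy to check. This is a small arithmetic sketch using only the rates and the decision time quoted on the slide; the variable names are illustrative.

```python
collision_rate_hz = 40e6    # 40 MHz input rate
l1_output_rate_hz = 100e3   # 100 kHz L1 accept rate
decision_time_s = 10e-6     # trigger decision within about 10 microseconds

# Time between consecutive inputs
input_period_ns = 1e9 / collision_rate_hz

# Fraction of events kept and the corresponding rejection factor
keep_fraction = l1_output_rate_hz / collision_rate_hz
rejection_factor = collision_rate_hz / l1_output_rate_hz

# Number of new inputs that arrive while one decision is being made
inputs_in_flight = collision_rate_hz * decision_time_s

print(f"input period: {input_period_ns:.0f} ns")
print(f"kept fraction: {keep_fraction:.2%}")
print(f"rejection factor: {rejection_factor:.0f}")
print(f"inputs arriving during one decision: {inputs_in_flight:.0f}")
```

So only 1 in 400 events survives the L1 trigger, and a new input arrives every 25 ns.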
hls4ml tutorial
hls4ml origins: triggering at (HL-)LHC
LHC Experiment Data Flow
[Figure: data flow from collisions (40 MHz) through the L1 Trigger and High-Level Trigger to offline computing]
The challenge: triggering at (HL-)LHC
The trigger discards events forever, so selection must be very precise
ML can improve sensitivity to rare physics
Needs to be fast!
Enter: hls4ml (high level synthesis for machine learning)
Muon trigger example
FPGAs were originally popular for prototyping ASICs, but are now also used for high performance computing
Logic cell: look-up table (logic) and flip-flop (registers)
DSP (multiplication): faster and more efficient than using LUTs for these types of operations
And for neural nets, DSPs are often the scarcest resource
[*] https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf
Aug 18, 2021 hls4ml tutorial
Jargon
● LUT - Look Up Table aka ‘logic’ - generic functions on small bitwidth inputs. Combine many to build the algorithm
● FF - Flip Flops - control the flow of data with the clock pulse. Used to build the pipeline and achieve high throughput
● DSP - Digital Signal Processor - performs multiplication and other arithmetic in the FPGA
● BRAM - Block RAM - hardened RAM resource. More efficient memories than using LUTs for more than a few elements
● HLS - High Level Synthesis - compiler for C, C++, SystemC into FPGA IP cores
● HDL - Hardware Description Language - low level language for describing circuits
● RTL - Register Transfer Level - the very low level description of the function and connection of logic gates
● Latency - time between starting processing and receiving the result
○ Measured in clock cycles or seconds
● II - Initiation Interval - time from accepting first input to accepting next input
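To make the difference between latency and II concrete, the sketch below counts the clock cycles needed to process a stream of inputs. The design parameters (20 cycles latency, 5 ns clock) are hypothetical values picked for illustration, not numbers from the tutorial.

```python
def total_cycles(n_inputs, latency_cycles, ii_cycles):
    """Clock cycles until the result for the last of n_inputs is available.

    The first result arrives after `latency_cycles`; each further input
    can be accepted `ii_cycles` after the previous one.
    """
    if n_inputs < 1:
        raise ValueError("n_inputs must be at least 1")
    return latency_cycles + (n_inputs - 1) * ii_cycles


def cycles_to_ns(cycles, clock_period_ns):
    """Convert a cycle count to nanoseconds for a given clock period."""
    return cycles * clock_period_ns


# Hypothetical design: latency 20 cycles, 5 ns clock
print(cycles_to_ns(total_cycles(1, 20, 1), 5.0))     # single inference
print(cycles_to_ns(total_cycles(100, 20, 1), 5.0))   # 100 inputs, II = 1
print(cycles_to_ns(total_cycles(100, 20, 20), 5.0))  # 100 inputs, II = 20
```

With II = 1 the pipeline accepts a new input every cycle, so 100 inputs cost only 99 extra cycles on top of the latency; with II = 20 the same latency gives much lower throughput.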
Catapult: coming soon
https://fastmachinelearning.org/hls4ml/
Neural network inference
∙ Part 2:
- Learn how to tune inference performance with quantization & ReuseFactor
∙ Part 3:
- Perform model compression and observe its effect on the FPGA resources/latency
∙ Part 4:
- Train using QKeras “quantization aware training” and study impact on FPGA metrics
Physics case: jet tagging
Study a multi-class classification task to be implemented on an FPGA: discrimination between highly energetic (boosted) q-, g-, W-, Z-, and t-initiated jets
AUC = area under ROC curve
(100% is perfect, 20% is random)
Hands On - Setup
● The interactive part is served with Python notebooks
● Open https://cern.ch/ssummers/hls4ml-tutorial in your web browser
● Authenticate with your GitHub account (log in if necessary)
● Open and start running through “part1_getting_started”!
● If you’re new to Jupyter notebooks, select a cell and hit “shift + enter” to execute the code
● If you have Vivado installed yourself, you might prefer to work locally; see the ‘conda’ section at:
https://github.com/fastmachinelearning/hls4ml-tutorial
Efficient NN design: quantization
∙ In the FPGA we use fixed point representation: ap_fixed<width bits, integer bits>
− Operations are integer ops, but we can represent fractional values
− Example: 0101.1011101010 (integer part 0101, fractional part 1011101010, width = all bits)
∙ But we have to make sure we’ve used the correct data types!
[Figure: two scans of model performance vs. precision]
∙ Scan integer bits (fractional bits fixed to 8): full performance at 6 integer bits
∙ Scan fractional bits (integer bits fixed to 6): full performance at 8 fractional bits
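The fixed point representation can be explored in plain Python. The sketch below decodes the bit string from the slide and rounds a value onto an ap_fixed<width, integer> grid. It assumes truncation toward minus infinity and wrap-around on overflow, which to my understanding are the ap_fixed defaults; the function names are illustrative and not part of any library.

```python
import math


def decode_fixed(bits):
    """Decode a two's-complement fixed-point string such as '0101.1011101010'."""
    int_part, frac_part = bits.split(".")
    width = len(int_part) + len(frac_part)
    raw = int(int_part + frac_part, 2)
    if int_part[0] == "1":           # sign bit set: negative value
        raw -= 1 << width
    return raw / (1 << len(frac_part))


def quantize(x, width, integer):
    """Map x onto the ap_fixed<width, integer> grid.

    Assumes truncation toward minus infinity and wrap-around on overflow.
    """
    frac_bits = width - integer
    raw = math.floor(x * (1 << frac_bits))
    raw &= (1 << width) - 1          # keep only `width` bits (wrap-around)
    if raw >= 1 << (width - 1):      # reinterpret the top bit as the sign
        raw -= 1 << width
    return raw / (1 << frac_bits)


print(decode_fixed("0101.1011101010"))  # 5.728515625
print(quantize(0.3, 16, 6))             # 0.2998046875
print(quantize(40.0, 16, 6))            # -24.0 (overflow wraps around)
```

The last line shows why the number of integer bits matters: ap_fixed<16,6> covers only [-32, 32), so a value of 40 silently wraps to a wrong result.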
Reuse factor: how much to parallelize operations in a hidden layer
Fully parallel: more resources, higher throughput, lower latency
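The idea of reusing multipliers can be mimicked in software. In this sketch a dense layer is evaluated with a limited pool of multipliers, each used `reuse_factor` times; the inner loop stands in for work that the hardware would do in parallel. It is an illustration of the concept, not the hls4ml implementation.

```python
def dense_with_reuse(x, w, reuse_factor):
    """Evaluate y = x @ w with n_in * n_out / reuse_factor multipliers.

    Returns the outputs and the number of multipliers needed.
    """
    n_in, n_out = len(w), len(w[0])
    n_mults = n_in * n_out
    if n_mults % reuse_factor != 0:
        raise ValueError("reuse_factor must divide n_in * n_out")
    n_multipliers = n_mults // reuse_factor

    # Flat list of all (input, output) pairs that need a multiplication
    ops = [(i, j) for i in range(n_in) for j in range(n_out)]
    acc = [0] * n_out
    for step in range(reuse_factor):
        # In hardware these n_multipliers products are computed in parallel
        for m in range(n_multipliers):
            i, j = ops[step * n_multipliers + m]
            acc[j] += x[i] * w[i][j]
    return acc, n_multipliers


x = [1, 2, 3]
w = [[1, 2], [3, 4], [5, 6]]
print(dense_with_reuse(x, w, 1))  # fully parallel: 6 multipliers
print(dense_with_reuse(x, w, 3))  # each multiplier used 3x: 2 multipliers
```

Both calls give the same outputs; only the number of multipliers and the number of sequential steps change.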
Parallelization: DSP usage
[Figure: DSP usage and latency (clock cycles) vs. reuse factor. Fully parallel (each mult. used 1x): more resources, latency ~ 75 ns. Higher reuse (each mult. used 3x, …): longer latency, up to ~ 175 ns]
Large MLP
● ‘Strategy: Resource’ for larger networks and higher reuse factor
● Uses a slightly different HLS implementation of the dense layer that compiles faster and works better for large layers
● Here, we use a different reuse factor (112) on the first layer for the best partitioning of arrays

IOType: io_parallel # options: io_serial/io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 128
    Strategy: Resource
  LayerName:
    dense1:
      ReuseFactor: 112
This config is for a model trained on the MNIST digits classification dataset
Architecture (fully connected): 784 → 128 → 128 → 128 → 10
Model accuracy: ~97%
We can work out how many DSPs this should use...
∙ The DSPs should be: (784 x 128) / 112 + (2 x 128 x 128 + 128 x 10) / 128 = 1162 🤞
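The same estimate can be reproduced from the configuration. The sketch below assumes one DSP per multiplier at this precision and that a per-layer ReuseFactor overrides the model-wide one; the layer names other than dense1 are hypothetical.

```python
# Layer shapes (n_in, n_out) of the 784 -> 128 -> 128 -> 128 -> 10 MLP.
# Names other than dense1 are made up for this example.
layers = {
    "dense1": (784, 128),
    "dense2": (128, 128),
    "dense3": (128, 128),
    "output": (128, 10),
}

hls_config = {
    "Model": {"Precision": "ap_fixed<16,6>", "ReuseFactor": 128, "Strategy": "Resource"},
    "LayerName": {"dense1": {"ReuseFactor": 112}},
}


def reuse_factor(name, config):
    """Per-layer ReuseFactor if present, otherwise the model-wide value."""
    layer_cfg = config.get("LayerName", {}).get(name, {})
    return layer_cfg.get("ReuseFactor", config["Model"]["ReuseFactor"])


def estimate_dsps(layers, config):
    """Sum of multipliers per layer, assuming one DSP per multiplier."""
    total = 0
    for name, (n_in, n_out) in layers.items():
        rf = reuse_factor(name, config)
        total += -(-(n_in * n_out) // rf)  # ceiling division
    return total


print(estimate_dsps(layers, hls_config))  # 1162
```

The first layer contributes 896 DSPs, the two hidden layers 128 each, and the output layer 10.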
============================
+ Timing (ns):
    * Summary:
    +--------+-------+----------+------------+
    | Clock  | Target| Estimated| Uncertainty|
    +--------+-------+----------+------------+
    |ap_clk  |   5.00|     4.375|        0.62|
    +--------+-------+----------+------------+

+ Latency (clock cycles):
    * Summary:
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |  518|  522|  128|  128| dataflow |
    +-----+-----+-----+-----+----------+

=====================================
== Utilization Estimates
=====================================
+---------------------+---------+-------+---------+--------+
|         Name        | BRAM_18K| DSP48E|    FF   |   LUT  |
+---------------------+---------+-------+---------+--------+
...
+---------------------+---------+-------+---------+--------+
|Total                |     1962|   1162|   169979|  222623|
+---------------------+---------+-------+---------+--------+
|Available SLR        |     2160|   2760|   663360|  331680|
+---------------------+---------+-------+---------+--------+
|Utilization SLR (%)  |       90|     42|       25|      67|
+---------------------+---------+-------+---------+--------+
|Available            |     4320|   5520|  1326720|  663360|
+---------------------+---------+-------+---------+--------+
|Utilization (%)      |       45|     21|       12|      33|
+---------------------+---------+-------+---------+--------+
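The report numbers can be turned into more familiar units with a few lines of arithmetic. This sketch uses the 5 ns target clock rather than the estimated 4.375 ns, which is a choice made here; the percentages in the report appear to be truncated to whole numbers, so integer division is used.

```python
clock_period_ns = 5.00        # target clock from the timing summary
latency_cycles = (518, 522)   # min, max latency
interval_cycles = 128         # initiation interval

# Latency in microseconds and throughput in millions of inferences per second
latency_us = tuple(c * clock_period_ns / 1000 for c in latency_cycles)
throughput_mhz = 1000 / (interval_cycles * clock_period_ns)

used = {"BRAM_18K": 1962, "DSP48E": 1162, "FF": 169979, "LUT": 222623}
available = {"BRAM_18K": 4320, "DSP48E": 5520, "FF": 1326720, "LUT": 663360}

# Whole-device utilization in percent, truncated like the report
utilization = {name: 100 * used[name] // available[name] for name in used}

print(latency_us)
print(throughput_mhz)
print(utilization)
```

The latency is about 2.6 μs, and a new input can be accepted every 640 ns.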
NN compression methods
● Network compression is a widespread technique to reduce the size, energy consumption, and
overtraining of deep neural networks
● Several approaches have been studied:
○ parameter pruning: selective removal of weights based on a particular ranking
[arxiv.1510.00149, arxiv.1712.01312]
○ low-rank factorization: using matrix/tensor decomposition to estimate informative parameters
[arxiv.1405.3866]
○ transferred/compact convolutional filters: special structural convolutional filters to save
parameters [arxiv.1602.07576]
○ knowledge distillation: training a compact network with distilled knowledge of a large network
[doi:10.1145/1150402.1150464]
● Today we’ll use the TensorFlow Model Optimization Toolkit pruning (sparsity) API
○ https://blog.tensorflow.org/2019/05/tf-model-optimization-toolkit-pruning-API.html
● But you can use other methods!
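Parameter pruning can be illustrated without any framework. The sketch below ranks weights by magnitude and zeroes the smallest ones in a single step. Magnitude ranking is one possible choice of ranking, and this is not the TensorFlow toolkit's API, which to my knowledge applies the sparsity gradually during training.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = round(len(weights) * sparsity)

    # Indices ordered from smallest to largest |w|
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))

    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.5]
pruned = prune_by_magnitude(weights, 0.7)
print(pruned)
print(sum(1 for w in pruned if w != 0.0), "non-zero weights remain")
```

Multiplications by zero weights do not need to be implemented in hardware, which is where the resource saving comes from.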
[Figure: number of DSPs used vs. compression, starting from fully parallelized (max DSP use), compared with the number of DSPs available. 70% compression ~ 70% fewer DSPs]
● DSPs (used for multiplication) are often the limiting resource
○ maximum use when fully parallelized
○ DSPs have a max size for input (e.g. 27x18 bits), so number of DSPs per multiplication changes with precision
Efficient NN design: quantization
● hls4ml allows you to use different data types everywhere; we saw how to tune that in part 2
● We will also try quantization-aware training with QKeras (part 4)
● With quantization-aware training we can even go down to just 1 or 2 bits
○ See our recent work: https://arxiv.org/abs/2003.06308
● See other talks on quantization at this workshop:
Amir, Thea, Benjamin
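One way to picture 1-bit and 2-bit weights is as binary and ternary values. The sketch below is an illustration only: it is not the QKeras API, the function names are made up, and the ternary threshold of 0.5 is a hypothetical choice. During quantization-aware training the network would see such quantized weights in the forward pass.

```python
def binarize(w):
    """1-bit weight: keep only the sign (+1 or -1)."""
    return 1 if w >= 0 else -1


def ternarize(w, threshold=0.5):
    """2-bit weight: -1, 0 or +1. The threshold is a hypothetical choice."""
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0


weights = [0.9, -0.2, 0.05, -0.7]
print([binarize(w) for w in weights])   # [1, -1, 1, -1]
print([ternarize(w) for w in weights])  # [1, 0, 0, -1]
```

With weights restricted to these values, a multiplication reduces to a sign flip or a zero, so it no longer needs a DSP.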