DEPARTMENT OF ELECTRONICS ENGINEERING
HBTU, KANPUR
PROJECT REPORT
SESSION-2020-21
32-bit Floating Point Arithmetic Unit
Under the supervision of:
Mrs Rajani Bisht
Associate Professor
Department of Electronics Engineering
HBTU, Kanpur
Group Members:
Dhruv Rastogi (180106010)
Anmol Srivastava (180106005)
ABSTRACT
Most algorithms implemented on FPGAs have traditionally
used fixed-point arithmetic. Floating-point operations are
useful for computations involving a large dynamic range, but
they require significantly more resources than integer
operations. With current trends in system requirements and
available FPGAs, floating-point implementations are
becoming more common, and designers are increasingly
using FPGAs as a platform for them. The rapid advance in
Field-Programmable Gate Array (FPGA) technology makes
such devices increasingly attractive for implementing
floating-point arithmetic. Compared to Application Specific
Integrated Circuits (ASICs), FPGAs offer reduced
development time and cost. Moreover, their flexibility
enables field upgrades and adaptation of the hardware to
run-time conditions. A 32-bit floating point arithmetic unit
conforming to the IEEE 754 standard has been designed in
VHDL, and the operations of addition, subtraction,
multiplication and division have been tested on Xilinx.
Thereafter, a Simulink model has been created in MATLAB
to verify the VHDL code of the floating point arithmetic unit
in ModelSim.
ACKNOWLEDGEMENT
We wish to extend our sincere gratitude to our guide,
Mrs Rajani Bisht, Associate Professor, ET Department, for
her valuable guidance and encouragement. We are also
grateful to Dr. Krishna Raj, Professor & Head of the
Electronics Department, HBTU Kanpur, for providing all the
resources required for the successful completion of our
project despite the many difficulties caused by the pandemic.
We express our thanks to the assistant professors, all staff
members and our friends for the help and co-ordination
extended in bringing out this project successfully. We are also
grateful to the authors of the references and other literature
referred to in this project. We are grateful to our parents for
their constant support, both moral and financial. Last but not
least, we thank almighty God for his blessings, without which
the completion of this project would not have been possible.
INDEX
1. Abstract
2. Introduction
3. Floating point Architecture
4. Algorithms
5. Block Diagram
6. VHDL Code
7. Result of VHDL Code
8. Applications and Advantages
9. Future Scope
10. References
INTRODUCTION
Floating point operations are used extensively in fields that
require high-precision computation, owing to their large
dynamic range, high precision and simple operation rules.
Considerable attention has therefore been paid to the design
and study of floating point processing units. With the growing
demand for floating point operations in high-speed signal
processing and scientific computation, the need for
high-speed hardware floating point arithmetic units has
become increasingly pressing. Floating point arithmetic is easy
and convenient to use in high level languages that support it,
but implementing the arithmetic in hardware is difficult. With
the development of very large scale integration (VLSI)
technology, devices such as Field Programmable Gate Arrays
(FPGAs) have become one of the best options for
implementing hardware floating point arithmetic units
because of their high integration density, low price, high
performance and flexibility. Floating-point implementation on
FPGAs has been of interest to many researchers. The use of
custom floating-point formats in FPGAs has been investigated
in a long series of work [1, 2, 3, 4, 5]. In most cases, these
formats are shown to be adequate for some applications, to
require significantly less area to implement than IEEE formats
[6], and to run significantly faster than IEEE formats.
Moreover, these efforts demonstrate that such customized
formats enable significant speedups for certain chosen
applications. The earliest work on IEEE floating-point [7]
focused on single precision; it was found to be feasible but
extremely slow. Eventually, it was demonstrated [8] that
while FPGAs were uncompetitive with CPUs in terms of peak
FLOPS, they could provide competitive sustained
floating-point performance. Since then, a variety of work [2,
5, 9, 10] has demonstrated the growing feasibility of
IEEE-compliant single precision floating point arithmetic and
of other floating-point formats of approximately the same
complexity. In [2, 5], the details of the floating-point format
are varied to optimize performance. The specific issues of
implementing floating-point division in FPGAs have been
studied in [10]. Early implementations either used multiple
FPGAs to implement IEEE 754 single precision floating-point
arithmetic, or adopted custom data formats to enable a
single-FPGA solution. To overcome device size restrictions,
subsequent single-FPGA implementations of the IEEE 754
standard employed serial arithmetic or avoided features that
are expensive to implement, such as support for gradual
underflow. In this work, a high-speed IEEE 754-compliant
32-bit floating point arithmetic unit designed in VHDL is
presented; the operations of addition, subtraction,
multiplication and division were tested on Xilinx and verified
successfully. Thereafter, the creation of a Simulink model in
MATLAB for verifying the VHDL code of the 32-bit floating
point arithmetic unit in ModelSim is explained. The simulation
results of addition, subtraction, multiplication and division in
the ModelSim wave window are demonstrated.
FLOATING POINT ARCHITECTURE
Floating point numbers are one possible way of representing
real numbers in binary format. The IEEE 754 standard [11]
presents two different floating point formats: the binary
interchange format and the decimal interchange format. This
work focuses only on the single precision normalized binary
interchange format. Figure 1 shows the IEEE 754 single
precision binary format representation; it consists of a
one-bit sign (S), an eight-bit exponent (E), and a
twenty-three-bit fraction (M), or mantissa.
IEEE standard 32-bit single precision floating point numbers
are stored as:
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
S: Sign – 1 bit
E: Exponent – 8 bits
M: Mantissa – 23 bits (fraction)
An extra bit is added to the mantissa to form what is called
the significand. If the exponent is greater than 0 and smaller
than 255, and there is 1 in the MSB of the significand then
the number is said to be a normalized number; in this case
the real number is represented by
$$V = (-1)^{S} \times 2^{(E - \text{Bias})} \times (1.M)$$
where $M = m_{22}2^{-1} + m_{21}2^{-2} + m_{20}2^{-3} + \dots + m_{1}2^{-22} + m_{0}2^{-23}$
and $\text{Bias} = 127$.
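As a quick software check of the representation described above, the following Python sketch splits a 32-bit pattern into S, E and M and rebuilds the value of a normalized number. The function names are illustrative and are not part of the VHDL design; the standard `struct` module is used only to obtain the host's own single-precision bit pattern for comparison.

```python
import struct

BIAS = 127

def decode_fields(bits):
    """Split a 32-bit pattern into sign, biased exponent and 23-bit fraction."""
    s = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    return s, e, m

def normalized_value(bits):
    """Evaluate V = (-1)^S * 2^(E - Bias) * (1.M) for a normalized number."""
    s, e, m = decode_fields(bits)
    if not 0 < e < 255:
        raise ValueError("only normalized numbers (0 < E < 255) are handled")
    significand = 1 + m / 2**23          # implicit leading 1 plus the fraction
    return (-1) ** s * 2.0 ** (e - BIAS) * significand

def float_to_bits(x):
    """Bit pattern of x after conversion to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

if __name__ == "__main__":
    bits = float_to_bits(4.75)
    print(hex(bits), decode_fields(bits), normalized_value(bits))
```

Zero, denormalized numbers, infinities and NaNs (E = 0 or E = 255) are deliberately rejected, since the formula applies to normalized numbers only.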
ALGORITHMS FOR FLOATING POINT
ARITHMETIC UNIT
This section describes, with the help of flow charts, the
algorithms for floating point addition and subtraction that
form the basis of the VHDL code for the 32-bit floating point
arithmetic unit.
FLOATING POINT ADDITION / SUBTRACTION
The algorithm for floating point addition is explained through
the flow chart in Figure 2. When adding two floating point
numbers, two cases may arise. Case I: both numbers have
the same sign, i.e. both are positive or both are negative. In
this case the MSBs (sign bits) of the two numbers are both 0
or both 1. Case II: the numbers have different signs, i.e. one
is positive and the other is negative. In this case the MSB of
one number is 1 and that of the other is 0.
A description of the proposed implementation algorithm is as
follows:
1. The two operands, N1 and N2, are read in and checked for
denormalization and infinity. If a number is denormalized, its
implicit bit is set to 0; otherwise it is set to 1. At this point,
the fraction part is extended to 24 bits.
2. The two exponents, e1 and e2, are compared using 8-bit
subtraction. If e1 is less than e2, N1 and N2 are swapped, i.e.
the previous f2 will now be referred to as f1 and vice versa.
3. The smaller fraction, f2, is shifted right by the absolute
difference of the two exponents. Now both numbers have the
same exponent.
4. The two signs are used to determine whether the
operation is a subtraction or an addition.
5. If the operation is a subtraction, the bits of f2 are
inverted.
6. The two fractions are added using a 2's complement
adder.
7. If the resulting sum is negative, it has to be inverted and a
1 has to be added to the result.
8. The result is then passed through a leading one detector or
leading zero counter. This is the first step of normalization.
9. Using the result of the leading one detector, the sum is
shifted left to be normalized. In some cases, a 1-bit right
shift is needed instead.
10. The result is then rounded to nearest even, the default
rounding mode.
11. If the carry out from the rounding adder is 1, the result is
shifted right by one.
12. Using the result of the leading one detector, the
exponent is adjusted. The sign is computed and, after the
overflow and underflow checks, the result is registered.
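The steps above can be sketched as a bit-level software reference model. This is a minimal Python sketch under stated assumptions, not the VHDL design of this report: it handles normalized, non-zero operands only (no denormals, infinities or NaNs), it performs no overflow or underflow check, and it orders the operands by magnitude so that the difference is never negative, which stands in for the invert-and-add-one correction of step 7. The alignment keeps three extra bits (guard, round, sticky). All function names are illustrative.

```python
import struct

def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def unpack(bits):
    """Return sign, biased exponent and 24-bit significand (implicit 1 set)."""
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | 0x800000

def fp32_add(a, b):
    """Add two normalized single-precision bit patterns, round to nearest even."""
    s1, e1, f1 = unpack(a)                       # step 1: 24-bit significands
    s2, e2, f2 = unpack(b)
    if (e1, f1) < (e2, f2):                      # step 2: larger magnitude first
        s1, e1, f1, s2, e2, f2 = s2, e2, f2, s1, e1, f1
    d = e1 - e2                                  # step 3: align the smaller one
    m1 = f1 << 3                                 # room for guard, round, sticky
    m2 = f2 << 3
    sticky = 1 if m2 & ((1 << d) - 1) else 0     # OR of all bits shifted out
    m2 = (m2 >> d) | sticky
    m = m1 + m2 if s1 == s2 else m1 - m2         # steps 4 to 7
    if m == 0:
        return 0                                 # exact cancellation gives +0
    e = e1
    if m >> 27:                                  # carry out: 1-bit right shift
        m = (m >> 1) | (m & 1)
        e += 1
    while not (m >> 26) & 1:                     # steps 8 and 9: left-normalize
        m <<= 1
        e -= 1
    grs = m & 0x7                                # step 10: round to nearest even
    m >>= 3
    if grs > 4 or (grs == 4 and m & 1):
        m += 1
        if m >> 24:                              # step 11: rounding carried out
            m >>= 1
            e += 1
    return (s1 << 31) | (e << 23) | (m & 0x7FFFFF)   # step 12: assemble

if __name__ == "__main__":
    total = fp32_add(float_to_bits(2.5), float_to_bits(4.75))
    print(hex(total), bits_to_float(total))
```

A model of this kind can serve as a source of expected values when checking simulation waveforms, provided the hardware under test uses the same rounding mode.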
BLOCK DIAGRAM OF STANDARD FLOATING
POINT ADDER
The main hardware modules for a single-precision
floating-point adder are:
1) The exponent difference module: It has the following two
functions:
• To compute the absolute difference of two 8-bit numbers.
• To identify whether e1 is smaller than e2.
2) Right shifter: The right shifter shifts the significand of the
smaller operand right by the absolute exponent difference,
so that the two numbers have the same exponent and normal
integer addition can be used. The right shifter is one of the
most important modules when designing for latency.
3) 2's complement adder: The 2's complement adder is a
simple integer adder that adds or subtracts the
pre-normalized significands.
4) Leading one detector: After the addition, the next step is
to normalize the result. The first step is to identify the
leading (first) one in the result. Its position determines how
far the adder result is shifted left, namely by the number of
zeros in front of the leading one. This operation requires
special hardware, called a Leading One Detector (LOD) or
Leading Zero Counter (LZC).
5) Left shifter: Using the result from the LOD, the adder
result is shifted left so that it is normalized, which means that
its first bit is now 1. This shifter can be implemented using
the "shl" function in VHDL or by describing it behaviorally
using 'case' statements.
6) The rounding module: Rounding is done using the guard,
round and sticky bits of the result. Round-to-nearest-even
(REN) mode is accomplished by rounding up if the guard bit is
set, and then pulling down the lowest bit of the output if the
round (r) and sticky (s) bits are 0. A 1 is added to the result if
the r and s bits are 1, or if r and either of the last two bits of
the normalized result are 1. This step is important to
preserve precision and limit loss of accuracy.
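The leading zero counter, the left shifter and the rounding module can be illustrated in isolation. The Python sketch below is an assumption-laden illustration rather than a description of the hardware: it assumes a 27-bit adder result made of a 24-bit significand followed by guard, round and sticky bits, and the function names are hypothetical.

```python
def leading_zero_count(value, width):
    """Number of zeros in front of the leading one of a width-bit value."""
    for i in range(width):
        if (value >> (width - 1 - i)) & 1:
            return i
    return width                                  # value is zero

def normalize(value, exponent, width=27):
    """Shift left until the leading one is the MSB and adjust the exponent."""
    lzc = leading_zero_count(value, width)
    return (value << lzc) & ((1 << width) - 1), exponent - lzc

def round_nearest_even(value27):
    """Round a 24-bit significand that is followed by guard, round, sticky.

    Returns the rounded 24-bit significand and a carry flag; the carry flag
    means the significand overflowed and the exponent must be incremented.
    """
    sig = value27 >> 3
    g = (value27 >> 2) & 1
    r = (value27 >> 1) & 1
    s = value27 & 1
    # Above the halfway point: always round up. Exactly halfway: round up
    # only if the kept LSB is 1, which makes the result even.
    round_up = g and (r or s or (sig & 1))
    sig += round_up
    carry = sig >> 24
    if carry:
        sig >>= 1
    return sig, carry

if __name__ == "__main__":
    print(normalize(0x666668, 127))
    print(round_nearest_even((0x800000 << 3) | 0b101))
```

Here the guard bit is the first bit below the kept significand, so "guard set with round and sticky clear" is the exact tie case that round-to-nearest-even resolves towards an even LSB.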
VHDL CODE
ARITHMETIC UNIT STRUCTURE
library ieee;
use ieee.std_logic_1164.all;

entity fp_alu is
  port(
    in1, in2 : in  std_logic_vector(31 downto 0);
    clk      : in  std_logic;
    sel      : in  std_logic_vector(1 downto 0);
    output1  : out std_logic_vector(31 downto 0)
  );
end fp_alu;

architecture fp_alu_struct of fp_alu is

  component divider is
    port(
      clk      : in  std_logic;
      res      : in  std_logic;
      GO       : in  std_logic;
      x        : in  std_logic_vector(31 downto 0);
      y        : in  std_logic_vector(31 downto 0);
      z        : out std_logic_vector(31 downto 0);
      done     : out std_logic;
      overflow : out std_logic
    );
  end component;

  component fpa_seq is
    port(
      n1, n2 : in  std_logic_vector(32 downto 0);
      clk    : in  std_logic;
      sum    : out std_logic_vector(32 downto 0)
    );
  end component;

  component fpm is
    port(
      in1, in2 : in  std_logic_vector(31 downto 0);
      out1     : out std_logic_vector(31 downto 0)
    );
  end component;

  signal out_fpa          : std_logic_vector(32 downto 0);
  signal out_fpm, out_div : std_logic_vector(31 downto 0);
  signal in1_fpa, in2_fpa : std_logic_vector(32 downto 0);

begin
  -- The adder works on 33-bit operands: a '0' is appended below the LSB.
  in1_fpa <= in1 & '0';
  in2_fpa <= in2 & '0';

  fpa1 : fpa_seq port map(in1_fpa, in2_fpa, clk, out_fpa);
  fpm1 : fpm     port map(in1, in2, out_fpm);
  -- res is tied low and GO is tied high; done and overflow are left open.
  fpd1 : divider port map(clk, '0', '1', in1, in2, out_div);

  -- Result selection: "01" adder, "10" multiplier, "11" divider.
  process(sel, clk)
  begin
    if (sel = "01") then
      output1 <= out_fpa(32 downto 1);
    elsif (sel = "10") then
      output1 <= out_fpm;
    elsif (sel = "11") then
      output1 <= out_div;
    end if;
  end process;
end fp_alu_struct;
FPA BEHAVIOUR
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity fpa_seq is
  port(
    n1, n2 : in  std_logic_vector(32 downto 0);
    clk    : in  std_logic;
    sum    : out std_logic_vector(32 downto 0)
  );
end fpa_seq;

architecture Behavioral of fpa_seq is
  --signal f1,f2:std_logic_vector(23 downto 0):="000000000000000000000000";
  signal sub_e        : std_logic_vector(7 downto 0) := "00000000";
  --signal addi:std_logic_vector(34 downto 0);
  signal c_temp       : std_logic := '0';  --_vector(34 downto 0)
  signal shift_count1 : integer := 0;
  signal num2_temp2   : std_logic_vector(32 downto 0) := (others => '0');
  signal s33          : std_logic_vector(23 downto 0) := (others => '0');
  signal s2_temp      : std_logic_vector(23 downto 0) := (others => '0');
  signal diff         : std_logic_vector(7 downto 0) := "00000000";
  -- ...

      ---------- sub calling ----------
      sub(e1, e2, d);
      if (d >= "00011100") then
        sum <= num1;
      elsif (d < "00011100") then
        shift_count := conv_integer(d);
        shift_count1 <= shift_count;
        num2_temp2 <= num2;
        --s2_temp<=s2;
        ---------- shifter calling ----------
        shift(s2, shift_count, s3);
        --s33<=s3;
        ---------- sign bit checking ----------
        if (num1(32) /= num2(32)) then
          s3 := (not(s3) + '1');            -- 2's complement
          adder23(s1, s3, s4, c_out);
          if (c_out = '1') then
            shift_left(s4, d_shl, ss4);
            sub(e1, d_shl, ee4);
            sum <= n1(32) & ee4 & ss4;
          else
            if (s4(23) = '1') then
              s4 := (not(s4) + '1');        -- 2's complement
              sum <= n1(32) & e1 & ss4;
            end if;
          end if;
        else
          s3 := s3;
          -- end if;
          ---------- same sign start ----------
          ---------- adder 8 calling ----------
          adder8(e2, d, e3);
          sub_e <= e3;
          num1_temp := n1(32) & e1 & s1;
          num2_temp := n2(32) & e3 & s3;
          ---------- adder 23 calling ----------
          adder23(s1, s3, s4, c_out);
          --s2_temp<=s4;
          c_temp <= c_out;
          if (c_out = '1') then
            --shift1(s4,s_1,s5);
            --s2_temp<=s5;
            s33 <= s4;
            s5 := '1' & s4(23 downto 1);    -- 1-bit right shift on carry out
            s2_temp <= s5;
            adder8(e3, "00000001", e4);     -- increment the exponent
            e3 := e4;
            --sub_e<=e4;
            sum <= n1(32) & e3 & s5;
          else
            sum <= n1(32) & e3 & s4;
          end if;
        end if;
      end if;
    end if;
    ---------- same sign end ----------
  end if;
  ---------- final result assembling ----------
  --sum_temp<=n1(32)& e1 & s4;
  --sum<=n1(32)& e3 & s4;
  end process;
end Behavioral;
FPM BEHAVIOUR
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity fpm is
  port(
    in1, in2 : in  std_logic_vector(31 downto 0);
    out1     : out std_logic_vector(31 downto 0)
  );
end fpm;

architecture Behavioral of fpm is

  -- 8-bit adder with carry out: sout = a + b (9 bits)
  procedure adder(a, b : in  std_logic_vector(7 downto 0);
                  sout : out std_logic_vector(8 downto 0)) is
    variable g, p  : std_logic_vector(7 downto 0);
    variable c     : std_logic_vector(8 downto 0);
    variable sout1 : std_logic_vector(7 downto 0);
  begin
    c(0) := '0';
    for i in 0 to 7 loop
      g(i) := a(i) and b(i);                -- carry generate
      p(i) := a(i) xor b(i);                -- carry propagate
    end loop;
    for i in 0 to 7 loop
      c(i+1) := (g(i) or (c(i) and p(i)));
    end loop;
    for i in 0 to 7 loop
      sout1(i) := c(i) xor a(i) xor b(i);
    end loop;
    sout := c(8) & sout1;
  end adder;

  ---------- multiplier ----------
  -- 24 x 24 bit shift-and-add multiplier with a 48-bit product
  procedure multiplier(a, b : in  std_logic_vector(23 downto 0);
                       y    : out std_logic_vector(47 downto 0)) is
    variable temp, prod : std_logic_vector(47 downto 0);
  begin
    temp := "000000000000000000000000" & a;
    prod := (others => '0');
    for i in 0 to 23 loop
      if b(i) = '1' then
        prod := prod + temp;
      end if;
      temp := temp(46 downto 0) & '0';
    end loop;
    y := prod;
  end multiplier;
  ---------- end multiplier ----------

begin
  process(in1, in2)
    variable sign_f, sign_in1, sign_in2 : std_logic := '0';
    variable e1, e2        : std_logic_vector(7 downto 0)  := "00000000";
    variable add_expo      : std_logic_vector(8 downto 0)  := "000000000";
    variable m1, m2        : std_logic_vector(23 downto 0) := (others => '0');
    variable mantisa_round : std_logic_vector(22 downto 0) := (others => '0');
    variable prod          : std_logic_vector(47 downto 0) := (others => '0');
    variable mul_mantisa   : std_logic_vector(47 downto 0) := (others => '0');
    variable bias          : std_logic_vector(8 downto 0)  := "001111111";
    variable bias_sub      : std_logic_vector(7 downto 0)  := "00000000";
    variable inc_bias      : std_logic_vector(8 downto 0)  := "000000000";
    variable bias_round    : std_logic_vector(8 downto 0)  := "000000000";
  begin
    -- sign calculation
    sign_in1 := in1(31);
    sign_in2 := in2(31);
    sign_f   := sign_in1 xor sign_in2;
    -- ...
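The multiplier data path can be illustrated with a short Python reference model. It is a sketch under stated assumptions rather than a translation of the listing: the sign is the XOR of the operand signs, the exponents are added and the bias of 127 is subtracted, and the 24-bit significands are multiplied by shift-and-add as in the multiplier procedure. The model assumes normalized, non-zero operands, truncates the product instead of rounding it, and does not check for overflow or underflow. The function names are illustrative.

```python
import struct

BIAS = 127

def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def shift_add_multiply(a, b, width=24):
    """Unsigned multiply by shift-and-add, one multiplier bit per step."""
    prod, temp = 0, a
    for i in range(width):
        if (b >> i) & 1:
            prod += temp          # add the shifted multiplicand
        temp <<= 1                # next bit has twice the weight
    return prod

def fp32_mul(x, y):
    """Multiply two normalized single-precision bit patterns (truncating)."""
    sign = (x >> 31) ^ (y >> 31)
    e = ((x >> 23) & 0xFF) + ((y >> 23) & 0xFF) - BIAS
    m1 = (x & 0x7FFFFF) | 0x800000        # restore the implicit leading 1
    m2 = (y & 0x7FFFFF) | 0x800000
    prod = shift_add_multiply(m1, m2)     # 48-bit product, value in [1, 4)
    if prod >> 47:                        # product in [2, 4): renormalize
        prod >>= 1
        e += 1
    frac = (prod >> 23) & 0x7FFFFF        # keep 23 fraction bits, drop the rest
    return (sign << 31) | (e << 23) | frac

if __name__ == "__main__":
    p = fp32_mul(float_to_bits(2.5), float_to_bits(4.75))
    print(hex(p), bits_to_float(p))
```

Because the low product bits are simply dropped, results that are not exactly representable may differ by one unit in the last place from a round-to-nearest implementation.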
FPD BEHAVIOUR
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity divider is
  port(
    clk      : in  std_logic;
    res      : in  std_logic;
    GO       : in  std_logic;
    x        : in  std_logic_vector(31 downto 0);
    y        : in  std_logic_vector(31 downto 0);
    z        : out std_logic_vector(31 downto 0);
    done     : out std_logic;
    overflow : out std_logic
  );
end divider;

architecture design of divider is
  signal x_reg      : std_logic_vector(31 downto 0);
  signal y_reg      : std_logic_vector(31 downto 0);
  signal x_mantissa : std_logic_vector(23 downto 0);
  signal y_mantissa : std_logic_vector(23 downto 0);
  signal z_mantissa : std_logic_vector(23 downto 0);
  signal x_exponent : std_logic_vector(7 downto 0);
  signal y_exponent : std_logic_vector(7 downto 0);
  signal z_exponent : std_logic_vector(7 downto 0);
  signal x_sign     : std_logic;
  signal y_sign     : std_logic;
  signal z_sign     : std_logic;
  signal sign       : std_logic;
  signal SC         : integer range 0 to 26;
  signal exp        : std_logic_vector(9 downto 0);
  signal EA         : std_logic_vector(24 downto 0);
  signal B          : std_logic_vector(23 downto 0);
  signal Q          : std_logic_vector(24 downto 0);
  type states is (reset, idle, s0, s1, s2, s3, s4);
  signal state : states;
begin
  -- Unpack the registered operands; the implicit leading 1 is restored.
  x_mantissa <= '1' & x_reg(22 downto 0);
  x_exponent <= x_reg(30 downto 23);
  x_sign     <= x_reg(31);
  y_mantissa <= '1' & y_reg(22 downto 0);
  y_exponent <= y_reg(30 downto 23);
  y_sign     <= y_reg(31);

  process(clk)
  begin
    if clk'event and clk = '1' then
      if res = '1' then
        state      <= reset;
        exp        <= (others => '0');
        sign       <= '0';
        x_reg      <= (others => '0');
        y_reg      <= (others => '0');
        z_sign     <= '0';
        z_mantissa <= (others => '0');
        z_exponent <= (others => '0');
        EA         <= (others => '0');
        Q          <= (others => '0');
        B          <= (others => '0');
        overflow   <= '0';
        done       <= '0';
      else
        case state is
          when reset =>
            state <= idle;
          when idle =>
            if GO = '1' then
              state <= s0;
              x_reg <= x;
              y_reg <= y;
            end if;
          when s0 =>
            state    <= s1;
            overflow <= '0';
            SC       <= 25;
            done     <= '0';
            sign     <= x_sign xor y_sign;
            EA       <= '0' & x_mantissa;
            B        <= y_mantissa;
            Q        <= (others => '0');
            -- exponent difference plus bias: x_exponent - y_exponent + 127
            exp <= ("00" & x_exponent) + not ("00" & y_exponent) + 1 + "0001111111";
          when s1 =>
            if (y_mantissa = x"800000" and y_exponent = x"00") then
              -- divisor is zero: flag overflow and return infinity
              overflow   <= '1';
              z_sign     <= sign;
              z_mantissa <= (others => '0');
              z_exponent <= (others => '1');
          -- ...
              then
                Q   <= Q(23 downto 0) & '0';
                exp <= exp - 1;
              end if;
              state <= s4;
            else
              EA    <= EA(23 downto 0) & Q(24);
              Q     <= Q(23 downto 0) & '0';
              state <= s1;
            end if;
          -- ...
end design;
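The arithmetic behind the divider can be sketched in software. The Python model below is an illustration under stated assumptions, not a transcription of the state machine above: it subtracts the exponents and adds the bias of 127, as in the `exp` assignment of state s0, and divides the 24-bit significands with a generic 25-step shift-and-subtract (restoring) loop. It assumes normalized, non-zero operands, truncates the quotient, and omits the divide-by-zero, overflow and underflow handling. The function names are hypothetical.

```python
import struct

BIAS = 127

def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def restoring_divide(dividend, divisor, steps=25):
    """Shift-and-subtract division, one quotient bit per step.

    Requires dividend < 2 * divisor. The result is the quotient
    dividend / divisor scaled by 2**(steps - 1) and truncated.
    """
    remainder, quotient = dividend, 0
    for _ in range(steps):
        quotient <<= 1
        if remainder >= divisor:      # subtraction succeeds: quotient bit is 1
            remainder -= divisor
            quotient |= 1
        remainder <<= 1
    return quotient

def fp32_div(x, y):
    """Divide two normalized single-precision bit patterns (truncating)."""
    sign = (x >> 31) ^ (y >> 31)
    e = ((x >> 23) & 0xFF) - ((y >> 23) & 0xFF) + BIAS
    mx = (x & 0x7FFFFF) | 0x800000
    my = (y & 0x7FFFFF) | 0x800000
    q = restoring_divide(mx, my)      # 25 bits, quotient lies in (0.5, 2)
    if q >> 24:                       # quotient in [1, 2): drop the lowest bit
        sig = q >> 1
    else:                             # quotient in (0.5, 1): adjust the exponent
        sig = q
        e -= 1
    return (sign << 31) | (e << 23) | (sig & 0x7FFFFF)

if __name__ == "__main__":
    q = fp32_div(float_to_bits(11.875), float_to_bits(4.75))
    print(hex(q), bits_to_float(q))
```

Twenty-five quotient bits are produced so that a full 24-bit significand remains after the one-bit normalization shift that is needed when the dividend significand is smaller than the divisor significand.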
RESULT OF VHDL CODE
Simulation result for decimal inputs 1.1 & 1.1 for 'adder' in
the ModelSim wave window
Simulation result for decimal inputs 2.5 & 4.75 for 'adder' in
the ModelSim wave window
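When reading such waveforms it helps to know the single-precision bit patterns that a round-to-nearest adder would be expected to show for these inputs. The short Python sketch below computes them with the standard `struct` module; the helper names are illustrative, and the values are host-computed references, not a record of the simulation output.

```python
import struct

def float_to_bits(x):
    """Bit pattern of x after conversion to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def expected_sum_bits(a, b):
    """Bit pattern of a + b with operands and result in single precision."""
    a32 = bits_to_float(float_to_bits(a))   # operands as the hardware sees them
    b32 = bits_to_float(float_to_bits(b))
    return float_to_bits(a32 + b32)

if __name__ == "__main__":
    for a, b in [(1.1, 1.1), (2.5, 4.75)]:
        print(f"{a} + {b}: in1=0x{float_to_bits(a):08X} "
              f"in2=0x{float_to_bits(b):08X} "
              f"sum=0x{expected_sum_bits(a, b):08X}")
```

For the two input pairs above this gives 0x3F8CCCCD + 0x3F8CCCCD = 0x400CCCCD and 0x40200000 + 0x40980000 = 0x40E80000.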
ADVANTAGES AND APPLICATIONS
ADVANTAGES:
1. This floating point adder algorithm reduces the overall
latency and improves performance.
2. It reduces the power consumption of the floating point
unit without sacrificing correctness.
3. It addresses the design challenges of the redundant
format, namely leading digit detection and rounding.
4. The floating point adder unit performs addition and
subtraction using substantially the same hardware as that
used for floating point operation. This saves core area by
minimizing the number of elements.
5. Real floating point hardware uses more sophisticated
means to round the summed result; this work takes the
simplification of truncating the extra bits when the result has
more bits than can be represented.
APPLICATIONS:
The floating point adder is generally used in arithmetic
calculations.
This adder can be used in instruments that involve
mathematical processing.
The floating point adder is used in devices such as:
1. CPUs
2. Calculators
CONCLUSIONS & FUTURE SCOPE
The VHDL code written for the complete 32-bit floating
point arithmetic unit has been implemented and tested
on Xilinx. The process described for creating a Simulink
model in MATLAB to verify VHDL code in the ModelSim
HDL simulator has been applied to the same VHDL code,
and the results were found to be in order. Once the
Simulink model has been created in MATLAB for the
VHDL code, it can be optimized in MATLAB, and the
VHDL code can be regenerated with the optimized
results and tested on Xilinx to see the improvement in
the parameters.
REFERENCES
[1] N. Shirazi, A. Walters, and P. Athanas,
“Quantitative analysis of floating point arithmetic on
FPGA based custom computing machines,” in
Proceedings of the IEEE Symposium on FPGAs for
Custom Computing Machines, pp. 155–162, 1995.
[2] P. Belanovic and M. Leeser, “A library of
parameterized floating-point modules and their
use,” in Proceedings of the International Conference
on Field Programmable Logic and Applications,
2002.
[3] J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y.
Savaria, and D. Poirier, “A flexible floating-point
format for optimizing data-paths and operators in
FPGA based DSPs,” in Proceedings of the ACM
International Symposium on Field Programmable
Gate Arrays, (Monterey, CA), February 2002.