0% found this document useful (0 votes)

18 views15 pages

Fixed Points An Introduction

This document, authored by Randy Yates, serves as a technical reference on fixed-point arithmetic, detailing binary representations and arithmetic operations relevant to digital signal processing. It outlines the definitions, rules, and guidelines for manipulating fixed-point numbers, emphasizing the importance of understanding these concepts for implementing algorithms on platforms using integer arithmetic. The document also includes sections on dimensional analysis, finite precision math, and practical examples, aiming to provide a comprehensive resource for users in the field.

Uploaded by

Mouloud Ibelaiden

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

18 views15 pages

Fixed Points An Introduction

Uploaded by

Mouloud Ibelaiden

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Technical Reference

s Fixed-Point Arithmetic: An Introduction 1 (15)

Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

Fixed-Point Arithmetic: An Introduction

Randy Yates
September 15, 2020

s
signal processi ng systems

http://www.digitalsignallabs.com

Typeset using LATEX 2ε

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 2 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

Contents
1 Introduction 3

2 Fixed-Point Binary Representations 3

2.1 Unsigned Fixed-Point Rationals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 The Operations of One’s Complement and Two’s Complement 5

4 Signed Two’s Complement Fixed-Point Rationals 5

5 Fundamental Rules of Fixed-Point Arithmetic 6

5.1 Unsigned Wordlength . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.2 Signed Wordlength . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.3 Unsigned Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.4 Signed Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.5 Addition Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.6 Addition Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.7 Unsigned Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.8 Signed Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.9 Unsigned Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.10 Signed Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.11 Wordlength Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.12 Shifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.12.1 Literal Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.12.1.1 Multiplying/Dividing By A Power of Two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.12.1.2 Modifying Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.12.2 Virtual Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

6 Dimensional Analysis in Fixed-Point Arithmetic 10

7 Concepts of Finite Precision Math 12

7.1 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7.2 Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7.3 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7.4 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7.5 Dynamic Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

8 Fixed-Point Analysis—An Example 13

9 Acknowledgments 14

10 Terms and Abbreviations 14

11 Revision History 14

12 References 15

List of Figures
List of Tables
1 Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 3 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

1 Introduction
This document presents definitions of signed and unsigned fixed-point binary number representations and devel-
ops basic rules and guidelines for the manipulation of these number representations using the common arithmetic
and logical operations found in fixed-point DSPs and hardware components.

While there is nothing particularly difficult about this subject, I found little documentation either in hardcopy
or on the web. What documentation I did find was disjointed, never putting together all of the aspects of fixed-
point arithmetic that I think are important. I therefore decided to develop this material and to place it on the web
not only for my own reference but for the benefit of others who, like myself, find themselves needing a complete
understanding of the issues in implementing fixed-point algorithms on platforms utilizing integer arithmetic.

During the writing of this paper, I was developing assembly language code for the Texas Instruments TMS320C50
Digital Signal Processor, thus my approach to the subject is undoubtedly biased towards this processor in terms
of the operation of the fundamental arithmetic operations. For example, the C50 performs adds and multiplies as
if the numbers are simple signed two’s complement integers. Contrast this against the Motorola 56k series which
performs two’s complement fractional arithmetic, with values always in the range −1 ≤ x < +1.

It is my hope that this material is clear, accurate, and helpful. If you find any errors or inconsistencies, please email
me at yates@ieee.org.

Finally, the reader may be interested in the author’s related paper [1] on the application of fixed-point arithmetic
to the implementation of FIR filters.

2 Fixed-Point Binary Representations

A collection of N (N a positive integer) binary digits (bits) has 2N possible states. This can be seen from elementary
counting theory, which tells us that there are two possibilities for the first bit, two possibilities for the next bit, and
so on until the last bit, resulting in
2 × 2 × . . . × 2 = 2N
| {z }
N times
possibilities.

In the most general sense, we can allow these states to represent anything conceivable. In the case of an N -bit
binary word, some examples are up to 2N :

1. students at a university;

2. species of plants;

3. atomic elements;

4. integers;

5. voltage levels.

Drawing from set theory and elementary abstract algebra, one could view a representation as an onto mapping
between the binary states and the elements in the representation set (in the case of unassigned binary states, we
assume there is an “unassigned” element in the representation set to which all such states are mapped).

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 4 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

The salient point is that there is no meaning inherent in a binary word, although most people are tempted to
think of them (at first glance, anyway) as positive integers (i.e., the natural binary representation, defined in the
next section). However, the meaning of an N-bit binary word depends entirely on its interpretation, i.e., on the
representation set and the mapping we choose to use.

In this section, we consider representations in which the representation set is a particular subset of the rational
numbers. Recall that the rational numbers are the set of numbers expressible as a/b, where a, b ∈ Z, b , 0. (Z is
the set of integers.) The subset to which we refer are those rationals for which b = 2n . We also further constrain
the representation sets to be those in which every element in the set has the same number of binary digits and in
which every element in the set has the binary point at the same position, i.e., the binary point is fixed. Thus these
representations are called “fixed-point.”

The following sections explain four common binary representations: unsigned integers, unsigned fixed-point ra-
tionals, signed two’s complement integers, and signed two’s complement fixed-point rationals. We view the integer
representations as special cases of the fixed-point rational representations, therefore we begin by defining the fixed-
point rational representations and then subsequently show how these can simplify to the integer representations.
We begin with the unsigned representations since they require nothing more than basic algebra. Section 2.2 de-
fines the notion of a “two’s complement” so that we may proceed well-grounded to the discussion of signed two’s
complement rationals in section 2.3.

2.1 Unsigned Fixed-Point Rationals

An N-bit binary word, when interpreted as an unsigned fixed-point rational, can take on values from a subset P of
the non-negative rationals given by
P = {p/2b | 0 ≤ p ≤ 2N − 1, p ∈ Z}.
Note that P contains 2N elements. We denote such a representation U (a, b), where a = N − b.

In the U (a, b) representation, the nth bit, counting from right to left and beginning at 0, has a weight of 2n /2b = 2n−b .
Note that when n = b the weight is exactly 1. Similar to normal everyday base-10 decimal notation, the binary point
is between this bit and the bit to the right. This is sometimes referred to as the implied binary point. A U (a, b)
representation has a integer bits and b fractional bits.

The value of a particular N-bit binary number x in a U (a, b) representation is given by the expression
N
X −1
x = (1/2b ) 2n xn
n=0

where xn represents bit n of x. The range of a U (a, b) representation is from 0 to (2N − 1)/2b = 2a − 2−b .

For example, the 8-bit unsigned fixed-point rational representation U (6, 2) has the form
b5 b4 b3 b2 b1 b0 .b−1 b−2 ,
where bit bk has a weight of 2k . Note that since b = 2 the binary point is to the left of the second bit from the right
(counting from zero), and thus the number has six integer bits and two fractional bits. This representation has a
range of from 0 to 26 − 2−2 = 64 − 1/4 = 63 3/4.

The unsigned integer representation can be viewed as a special case of the unsigned fixed-point rational represen-
tation where b = 0. Specifically, an N-bit unsigned integer is identical to a U (N , 0) unsigned fixed-point rational.
Thus the range of an N-bit unsigned integer is
0 ≤ U (N , 0) ≤ 2N − 1.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 5 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

and it has N integer bits and 0 fractional bits. The unsigned integer representation is sometimes referred to as
“natural binary.”

Examples:

1. U (6, 2). This number has 6 + 2 = 8 bits and the range is from 0 to 26 − 1/22 = 63.75. The value 8Ah (1000,1010b)
is
(1/22 )(21 + 23 + 27 ) = 34.5.

2. U (−2, 18). This number has −2 + 18 = 16 bits and the range is from 0 to 2−2 − 1/218 = 0.2499961853027. The
value 04BCh (0000,0100,1011,1100b) is

(1/218 )(22 + 23 + 24 + 25 + 27 + 210 ) = 1212/218 = 0.004623413085938.

3. U (16, 0). This number has 16 + 0 = 16 bits and the range is from 0 to 216 − 1 = 65, 535. The value 04BCh
(0000,0100,1011,1100b) is
(1/20 )(22 + 23 + 24 + 25 + 27 + 210 ) = 1212/20 = 1212.

3 The Operations of One’s Complement and Two’s Complement

Consider an N-bit binary word x interpreted as if in the N-bit natural binary representation (i.e., U (N , 0)). The one’s
complement of x is defined to be an operation that inverts every bit of the original value x. This can be performed
arithmetically in the U (N , 0) representation by subtracting x from 2N −1. That is, if we denote the one’s complement
of x as x̃, then
x̃ = 2N − 1 − x.

The two’s complement of x, denoted x̂, is determined by taking the one’s complement of x and then adding one to
it:

x̂ = x̃ + 1
= 2N − x.
(1)

Examples:

1. The one’s complement of the U(8,0) number 03h (0000,0011b) is FCh (1111,1100b).

2. The two’s complement of the U(8,0) number 03h (0000,0011b) is FDh (1111,1101b).

4 Signed Two’s Complement Fixed-Point Rationals

An N-bit binary word, when interpreted as a signed two’s complement fixed-point rational, can take on values from
a subset P of the rationals given by

P = {p/2b | − 2N −1 ≤ p ≤ 2N −1 − 1, p ∈ Z}.

Note that P contains 2N elements. We denote such a representation A(a, b), where a = N − b − 1.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 6 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

The value of a specific N-bit binary number x in an A(a,b) representation is given by the expression
 N −2

 X 
x = (1/2b ) −2N −1 xN −1 + 2n xn  ,
n=0

where xn represents bit n of x. The range of an A(a,b) representation is

−2N −1−b ≤ x ≤ +2N −1−b − 1/2b .

Note that the number of bits in the magnitude term of the sum above (the summation, that is) has one less bit than
the equivalent prior unsigned fixed-point rational representation. Further note that these bits are the N − 1 least
significant bits. It is for these reasons that the most-significant bit in a signed two’s complement number is usually
referred to as the sign bit.

Example:

A(13,2). This number has 13+2+1=16 bits and the range is from −213 = −8192 to +213 − 1/4 = 8191.75.

5 Fundamental Rules of Fixed-Point Arithmetic

The following are practical rules of fixed-point arithmetic. For these rules we note that when a scaling can be either
signed (A(a, b)) or unsigned (U (a, b)), we use the notation X(a, b).

5.1 Unsigned Wordlength

The number of bits required to represent U (a, b) is a + b.

5.2 Signed Wordlength

The number of bits required to represent A(a, b) is a + b + 1.

5.3 Unsigned Range

The range of U (a, b) is 0 ≤ x ≤ 2a − 2−b .

5.4 Signed Range

The range of A(a, b) is −2a ≤ α ≤ 2a − 2−b .

5.5 Addition Operands

Two binary numbers must be scaled the same in order to be added. That is, X(c, d) + Y (e, f ) is only valid if X = Y
(either both A or both U ) and c = e and d = f .

5.6 Addition Result

The scale of the sum of two binary numbers scaled X(e, f ) is X(e + 1, f ), i.e., the sum of two M-bit numbers requires
M + 1 bits.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 7 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

5.7 Unsigned Multiplication

U (a1 , b1 ) × U (a2 , b2 ) = U (a1 + a2 , b1 + b2 ).

5.8 Signed Multiplication

A(a1 , b1 ) × A(a2 , b2 ) = A(a1 + a2 + 1, b1 + b2 ).

5.9 Unsigned Division

U (a1 ,b1 )
Let U (a3 , b3 ) = U (a2 ,b2 )
and consider the largest possible result:

largest dividend
largest result =
smallest divisor
2a1 − 2−b1
= .
2−b2
= 2a1 +b2 − 2b2 −b1 . (2)
Thus we require
2a3 − 2−b3 ≥ 2a1 +b2 − 2b2 −b1 . (3)
It is natural to let a3 = a1 + b2 , in which case the inequalities below result:
2a3 − 2−b3 ≥ 2a3 − 2b2 −b1
−2−b3 ≥ −2b2 −b1
2−b3 ≤ 2b2 −b1
−b3 ≤ b2 − b1
b3 ≥ b1 − b2 . (4)
Thus we have a constraint on b3 due to b1 and b2 .

Now consider the smallest possible result:

smallest dividend
smallest result =
largest divisor
2−b1
= . (5)
2a2 − 2−b2
This then requires b3 to obey the following constraint:
2−b1
2−b3 ≤
− 2−b2
2a2
b3 ≥ b1 + log2 (2a2 − 2−b2 )
(6)
If we assume b2 is positive, (6) is the more stringent of the two constraints (4) and (6) on b3 . We then express (6) in
a slightly simpler form:
b3 ≥ log2 (2a2 +b1 − 2b1 −b2 ). (7)

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 8 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

The final result is then

U (a1 , b1 )/U (a2 , b2 ) = U (a1 + b2 , ⌈log2 (2a2 +b1 − 2b1 −b2 )⌉). (8)

5.10 Signed Division

Let r = n/d where n is scaled A(an , bn ) and d is scaled A(ad , bd ). What is the scaling of r (A(ar , br ))?

|nM |
|rM | = (9)
|dm |
2an
= −b (10)
2 d
= 2an +bd . (11)
Since this maximum can be positive (when the numerator and denominator are both negative), ar = an + bd + 1.

Similarly,
|nm |
|rm | = (12)
|dM |
2−bn
= (13)
2ad
= 2−(ad +bn ) . (14)
This implies br = ad + bn .

Thus
A(an , bn )
= A(an + bd + 1, ad + bn ). (15)
A(ad , bd )

5.11 Wordlength Reduction

Define the operation HI n(X(a, b)) to be the extraction of the n most-significant bits of X(a, b). Similarly, define the
operation LOn(X(a, b)) to be the extraction of the n least-significant bits of X(a, b). For signed values,
HI n(A(a, b)) = A(a, n − a − 1) and
LOn(A(a, b)) = A(n − b − 1, b). (16)
Similarly, for unsigned values,
HI n(U (a, b)) = U (a, n − a) and
LOn(U (a, b)) = U (n − b, b). (17)

5.12 Shifting

We define two types of shift operations below, literal and virtual, and describe the scaling results of each.

Note that shifts are expressed in terms of right shifts by integer n. Shifting left is accomplished when n is negative.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 9 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

5.12.1 Literal Shift

A literal shift occurs when the bit positions in a register move left or right. A literal shift can be performed for two
possible reasons, to divide or multiply by a power of two, or to change the scaling.

In both cases note that this will possibly result in a loss of precision or overflow assuming the output register width
is the same as the input register width.

5.12.1.1 Multiplying/Dividing By A Power of Two A literal shift that is done to multiply or divide by a power
of two shifts the bit positions but keeps the output scaling the same as the input scaling.

Thus we have the following scaling:

X(a, b) >> n = X(a, b). (18)

Example:
Let’s say X is a 16-bit signed two’s complement integer that is scaled A(14,1), or Q1. Let’s set that integer equal to
128: X = +128 = 0x0080, and thus its scaled value is

x = X/21 (19)
= 128/2 (20)
= 64.0 (21)

Now we want to divide that by 4, so we shift it right by 2, X = X >> 2, so that the new value of X is 32. This shift
didn’t change the scaling since we are dividing by actually shifting, so it’s still scaled Q1 after the shift. So now
X = 32 and x = X/2 = 16.0. Since the original value of x was 64.0, we see that we have indeed divided that value by
4, which was the objective.

Note that this is probably a bad way to multiply or divide a fixed-point value since, if you’re multiplying, you run
the risk of overflowing, and if you’re dividing, you run the risk of losing precision. It would be much better to
perform the multiplication or division using the "virtual shift" method described in section 5.12.2.

5.12.1.2 Modifying Scaling A literal shift that is done to modify the scaling shifts the bit positions and makes
the output scaling different than the input scaling. Thus we have the following scaling:

X(a, b) >> n = X(a + n, b − n). (22)

Example:
Again let’s say X is a 16-bit signed two’s complement integer that is scaled A(14,1), or Q1, and let’s set that integer
equal to 128: X = +128 = 0x0080, and thus its scaled value is

x = X/21 (23)
= 128/2 (24)
= 64.0 (25)

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 10 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

Now let’s say we want to change the scaling from Q1 to Q3 (or equivalently, from A(14,1) to A(12,3)). So we shift
the integer left by two bits: X = X << 2. So our new integer value is 512, but we’ve now also changed our scaling as
in equation 22 so that it’s A(12,3) (n = −2 here since we’re shifting left).

So in this case our final fixed-point value is still 512 / 8 = 64.0.

5.12.2 Virtual Shift

A virtual shift shifts the virtual binary point1 without modifying the underlying integer value. It can be used as an
alternate method of performing a multiplication or division by a power of two. However, unlike the literal shift
case, the virtual shift method loses no precision and avoids overflow. This is because the bit positions don’t actually
move—the operation is simply a reinterpretation of the scaling.

X(a, b) >> n = X(a − n, b + n). (26)

6 Dimensional Analysis in Fixed-Point Arithmetic

Consider a fixed-point variable x that is scaled A(ax , bx ). Denote the scaled value of the variable as lowercase x and
the unscaled value as uppercase X so that
x = X/2bx .
Units, such as inches, seconds, furlongs/fortnight, etc., may be associated with a fixed-point variable by assigning a
weight to the variable. Denote the scaled weight as w and the unscaled weight as W , so that the value and dimension
of a quantity αx that is to be represented by x can be expressed as

αx = x × w
= X ×W. (27)

Since x = X/2bx ,

x × w = X/2bx × w, (28)

and since x × w = X × W ,

X/2bx × w = X × W ⇒ w = 2bx W . (29)

Example 1:

An inertial sensor provides a linear acceleration signal to a 16-bit signed two’s complement A/D converter with
a reference voltage of 2V peak. The analog sensor signal is related to acceleration in meters per second squared
through the conversion m/(s2 − volt), i.e., the actual acceleration α(t) in meters/second2 can be determined from
the sensor voltage v(t) as
" #
m
α(t) = v(t) [volts] × 2 . (30)
(s )(volt)
1 The virtual binary point is called virtual since it doesn’t actually exist anywhere except in the programmer’s mind.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 11 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

If we consider the incoming A/D samples to be scaled A(16, −1), what is the corresponding scaled and unscaled
weights?

Solution:

We know that -32768 corresponds to -2 volts. Using the equation

αx = x × w
= X/2bx × w (31)

we simply plug in the values to obtain

−2 [volts] = −32768/2−1 × wv . (32)

Solving for wv provides the intermediate result

" #
volt
wv = . (33)
32768

Let’s check: An unscaled value of 32767 corresponds to a scaled value of v = 32767/2−1 = 65534, and thus the
physical quantity αv to which this corresponds is

αv = v × wv
" #
volt
= 65534 ×
32768
= 1.999939 [volts] . (34)

m
Now simply multiply wv by the original analog conversion factor (s2 )(volt) to obtain the acceleration weighting wa
directly:
" #
m
wa = wv × 2
(s )(volt)
" # " #
volt m
= × 2
32768 (s )(volt)

m
= . (35)
32768s2

The unscaled weight is then determined from the scaled weight and the scaling as

Wa = wa /2ba

m
= 2
/2−1
32768s
m
= . (36)
16384s2

Example 2:

Bias is an important error in inertial measurement systems. An average scaled value of 29 was measured from the
inertial measurement system in example 1 with the system at rest. What is the bias β?

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 12 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

Solution:

β = x×w

m
= 29 ×
32768s2
m
= 0.88501 × 10−3 2 . (37)
s

7 Concepts of Finite Precision Math

7.1 Precision

Precision is the maximum number of non-zero bits representable. For example, an A(13,2) number has a precision
of 16 bits. For fixed-point representations, precision is equal to the wordlength.

7.2 Resolution

Resolution is the smallest non-zero magnitude representable. For example, an A(13,2) has a resolution of 1/22 =
0.25.

7.3 Range

Range is the difference between the most negative number representable and the most positive number repre-
sentable,
XR = XMAX+ − XMAX− . (38)
For example, an A(13,2) number has a range from -8192 to +8191.75, i.e., 16383.75.

7.4 Accuracy

Accuracy is the magnitude of the maximum difference between a real value and it’s representation. For example,
the accuracy of an A(13,2) number is 1/8. Note that accuracy and resolution are related as follows:
A(x) = R(x)/2, (39)
where A(x) is the accuracy of x and R(x) is the resolution of x.

7.5 Dynamic Range

Dynamic range is the ratio of the maximum absolute value representable and the minimum positive (i.e., non-zero)
absolute value representable. For a signed fixed-point rational representation A(a, b), dynamic range is
×2a /2−b = 2a+b = 2N −1 . (40)
For an unsigned fixed-point rational representation U (a, b), dynamic range is
(2a − 2−b )/2−b = 2a+b − 1 = 2N − 1. (41)
For N of any significant size, the “-1” is negligible.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 13 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

8 Fixed-Point Analysis—An Example

An algorithm is usually defined and developed using an algebraically complete number system such as the real
or complex numbers. To be more precise, the operations of addition, subtraction, multiplication, and division are
performed (for example) over the field (ℜ, +, ×), where subtraction is equivalent to adding the additive inverse and
division is equivalent to multiplying by the multiplicative inverse.

As an example, consider the algorithm for calculating the average of the square of a digital signal x(n) over the
interval N (here, the signal is considered to be quantized in time but not in amplitude):
N −1
1 X 2
y(n) = x (n − k). (42)
N
k=0

In this form, the algorithm implicitly assumes x(n) ∈ ℜ, and the operations of addition and multiplication are
performed over the field (ℜ, +, ×). In this case, the numerical representations have infinite precision.

This state of affairs is perfectly acceptable when working with pencil and paper or higher-level floating-point com-
puting environments such as Matlab or MathCad. However, when the algorithm is to be implemented in fixed-point
hardware or software, it must necessarily utilize a finite number of binary digits to represent x(n), the intermediate
products and sums, and the output y(n).

Thus the basic task of converting such an algorithm into fixed-point arithmetic is that of determining the wordlength,
accuracy, and range required for each of the arithmetic operations involved in the algorithm. In the terms of the
fundamentals given in section 2, we need to determine a) whether the value should be signed (A(a, b)) or unsigned
(U (a, b)), b) the value of N (the wordlength), and c) the values for a and b (the accuracy and range). Any two
of wordlength, accuracy, and range determine the third. For example, given wordlength and accuracy, range is
determined. In other words, we cannot independently specify all of wordlength, accuracy, and range.

Continuing with our example, assume the input x(n) is scaled A(15, 0), i.e., plain old 16-bit signed two’s complement
samples. The first operation to be performed is to compute the square. According to the rules of fixed-point
arithmetic, A(15, 0) × A(15, 0) = A(31, 0). In other words, we require 32 bits for the result of the square in order to
guarantee that we will avoid overflow and maintain precision. It is at this point that design tradeoffs and other
information begin to affect how we implement our algorithm.

For example, in one possible scenario, we may know a-priori that the input data x(n) do not span the full dynamic
range of the A(15, 0) representation, thus it may be possible to reduce the 32-bit requirement for the result and still
guarantee that the square operation does not overflow.

Another possible scenario is that we do not require all of the precision in the result, and this also will reduce the
required wordlength.
In yet a third scenario, we may look ahead to the summation to be performed and realize that if we don’t scale back
the result of each square we will overflow the sum that is to subsequently be performed (assuming we have a 32-bit
accumulator). On the other hand, we may be using a fixed-point processor such as the TI TMS320C54x which has
a 40-bit accumulator, thus we have 8 “guard bits” past the 32-bit result which may be used in the accumulations to
prevent overflow for up to 256 (8=log2(256)) sums.

To complete our example, let’s further assume that a) we keep all 32 bits of the result of the squaring operation,
b) the averaging “time,” N , does not exceed 24 = 16 samples, c) we are using a fixed-point processor with an
accumulator of 32 + 4 = 36 bits or greater, and d) the output wordlength for y(n) is 16 bits (A(15, 0)). The final
decision that must be made is to determine which method we will use to form a 16-bit value from our 36-bit sum.

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 14 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

It is clear that we should take the 16 bits from bits 20 to 35 of the accumulator (where bit 0 is the LSB) in order
to avoid overflowing the output, but shall we truncate or round? Shall we utilize some type of dithering or noise-
shaping? These are all questions that relate to the process of quantization since we are quantizing a 36-bit word to
a 16-bit word. The theory of quantization and the tradeoffs to be made are outside the scope of this topic.

9 Acknowledgments
I wish to thank my colleague John Storbeck, a fellow DSP programmer and full-time employee of GEC-Marconi
Hazeltine (Wayne, NJ), for his guidance during my early encounters with the C50 and fixed-point programming,
and for his stealy critiques of my errant thoughts on assorted programming solutions for the C50. I also wish
to thank Dr. Donald Heckathorn for nurturing my knowledge of fixed-point arithmetic during the same time-
frame. Finally, I wish to thank my friend and fellow DSP engineering colleague Robert Bristow-Johnson for his
encouragement and guidance during my continuing journey through the world of fixed-point programming and
digital signal processing.

10 Terms and Abbreviations

DSP Digital Signal Processor, or Digital Signal Processing.

FIR Finite Impulse Response. A type of digital filter that does not require a recursive architecture (i.e., the use of
feedback), that is inherently stable, and generally easier to design and implement than IIR filters. However, an
FIR filter requires more computational resources than an equivalent IIR filter.

IIR Infinite Impulse Response. A type of digital filter that requires a recursive architecture, is potentially unstable,
and substantially more difficult to design and implement than FIR filters. However, an IIR filter requires less
computational resources than an equivalent FIR filter.

11 Revision History
Table 1 lists the revision history for this document.

Rev. Date/Time Person Changes

PA1 circa 1997 Randy Yates Initial Version
PA2 unknown Randy Yates 1. Converted to Digital Signal Labs document format.
2. Corrected signed dynamic range in equation 40.
PA3 29-Mar-2007 Randy Yates 1. Updated document design.
PA4 17-Apr-2007 Randy Yates 1. Removed “this is a test” from introductory paragraph (!).
2. Added justification of signed division scaling.
PA5 23-Aug-2007 Randy Yates Updated shift section.
PA6 07-Jul-2009 Randy Yates 1. Added derivation of unsigned division result.
2. Changed units notation to remove what looked like a subtractions (−).
PA7 01-Jan-2013 Randy Yates 1. Changed “right” to “left” in description of binary point position in last paragraph of page
four (section 2.1). Thanks to Rick Lyons for finding this error!
2. Added examples to sections 5.12.1.1 and 5.12.1.2.
PA8 02-Jan-2013 Randy Yates 1. Changed “will lose precision” to “run the risk of losing precision” in section 5.12.1.1.

Table 1: Revision History

Digital Signal Labs Public Information

Technical Reference
s Fixed-Point Arithmetic: An Introduction 15 (15)
Author Date Time Rev No. Reference

Randy Yates September 15, 2020

23:34 PA8 n/a fp.tex

12 References

References
[1] R. Yates, “Practical Considerations in Fixed-Point Arithmetic: FIR Filter Implementations.”

Digital Signal Labs Public Information

Fixed-Point Arithmetic Guide
No ratings yet
Fixed-Point Arithmetic Guide
13 pages
Technical Reference Fixed-Point Arithmetic: An Introduction Randy Yates 16-May-2024 19:39 PA11 N/a FP - Tex
No ratings yet
Technical Reference Fixed-Point Arithmetic: An Introduction Randy Yates 16-May-2024 19:39 PA11 N/a FP - Tex
25 pages
Digital Systems Book
No ratings yet
Digital Systems Book
345 pages
Arith
No ratings yet
Arith
245 pages
Fixed-Point Designer™ User's Guide
No ratings yet
Fixed-Point Designer™ User's Guide
1,827 pages
Math For Programmers PDF
100% (1)
Math For Programmers PDF
127 pages
Advanced C Programming Techniques
No ratings yet
Advanced C Programming Techniques
41 pages
Fix Point Implementation of Elementry Functions
No ratings yet
Fix Point Implementation of Elementry Functions
134 pages
Mca-0 1
No ratings yet
Mca-0 1
91 pages
Fixed Point Signal Processing by W Paddget
100% (1)
Fixed Point Signal Processing by W Paddget
133 pages
Fixed-Point Signal Processing
No ratings yet
Fixed-Point Signal Processing
133 pages
Fixed-Point Algorithm Basics
No ratings yet
Fixed-Point Algorithm Basics
6 pages
Numerical Analysis 1 Notes
No ratings yet
Numerical Analysis 1 Notes
111 pages
Practical Considerations in Fixed-Point FIR Filter Implem
No ratings yet
Practical Considerations in Fixed-Point FIR Filter Implem
15 pages
Flint-2 5
No ratings yet
Flint-2 5
671 pages
Unit 2
No ratings yet
Unit 2
9 pages
GNU Scientific Library Reference Manual v1.16
No ratings yet
GNU Scientific Library Reference Manual v1.16
529 pages
DSY1501 Ch1 2 Notes
No ratings yet
DSY1501 Ch1 2 Notes
45 pages
Floating Point & Fixed Point Representation - BCA II
No ratings yet
Floating Point & Fixed Point Representation - BCA II
24 pages
(DFT) Julius O. Smith III-Mathematics of The Discrete Fourier Transform-W3K Publishing (2003)
No ratings yet
(DFT) Julius O. Smith III-Mathematics of The Discrete Fourier Transform-W3K Publishing (2003)
431 pages
Computer Arithmetic - M. Vladutiu
No ratings yet
Computer Arithmetic - M. Vladutiu
269 pages
Digital Arithmetic for CSE Students
No ratings yet
Digital Arithmetic for CSE Students
97 pages
Smith J - Mathematics of The Discrete Fourier Transform PDF
No ratings yet
Smith J - Mathematics of The Discrete Fourier Transform PDF
253 pages
PNL Manual
No ratings yet
PNL Manual
69 pages
Computational Physics
No ratings yet
Computational Physics
165 pages
Systemarchitekturskript Prof. Paul
No ratings yet
Systemarchitekturskript Prof. Paul
520 pages
IEEE 754: Floating-Point Basics
No ratings yet
IEEE 754: Floating-Point Basics
3 pages
Lathi LSAS Ctoc
No ratings yet
Lathi LSAS Ctoc
11 pages
Co Micro
No ratings yet
Co Micro
32 pages
Eio0000001474 03
No ratings yet
Eio0000001474 03
250 pages
c051317 ISO IEC 10967-1 2012
No ratings yet
c051317 ISO IEC 10967-1 2012
144 pages
Contents ICD202
No ratings yet
Contents ICD202
7 pages
Week8 Slides
No ratings yet
Week8 Slides
43 pages
Yacas Algo - Book
No ratings yet
Yacas Algo - Book
71 pages
COA Module 2
No ratings yet
COA Module 2
65 pages
DSY1501 Study Guide Final
No ratings yet
DSY1501 Study Guide Final
133 pages
So Machine Basic Func Library
No ratings yet
So Machine Basic Func Library
282 pages
Libtommath User Manual V0.39: Tom ST Denis Tomstdenis@Iahu - Ca April 4, 2006
No ratings yet
Libtommath User Manual V0.39: Tom ST Denis Tomstdenis@Iahu - Ca April 4, 2006
62 pages
GSL Ref PDF
No ratings yet
GSL Ref PDF
663 pages
Math 248: Numerical Algorithms Course
No ratings yet
Math 248: Numerical Algorithms Course
162 pages
An Fpga Based 64-Bit Ieee - 754 Double Precision Floating Point Adder/Subtractor and Multiplier Using VHDL
No ratings yet
An Fpga Based 64-Bit Ieee - 754 Double Precision Floating Point Adder/Subtractor and Multiplier Using VHDL
11 pages
Computer Arithmetic: Part II: Integer Arithmetic & Floating Point
No ratings yet
Computer Arithmetic: Part II: Integer Arithmetic & Floating Point
30 pages
Julius O. Smith - Mathematics of The Discrete Fourier Transform
No ratings yet
Julius O. Smith - Mathematics of The Discrete Fourier Transform
247 pages
Essential Maths Skills For ASA Level Computer Science Nodrm
100% (1)
Essential Maths Skills For ASA Level Computer Science Nodrm
105 pages
Julia-1 4 1
No ratings yet
Julia-1 4 1
1,312 pages
Numerical Methods Chap1
No ratings yet
Numerical Methods Chap1
14 pages
BiD 09
No ratings yet
BiD 09
56 pages
C Manual
No ratings yet
C Manual
122 pages
Fir Imp DSP
No ratings yet
Fir Imp DSP
34 pages
Atmel 42747 Atsamd21e16lmotor Userguide
No ratings yet
Atmel 42747 Atsamd21e16lmotor Userguide
16 pages
Interrupt Registers
No ratings yet
Interrupt Registers
8 pages
BTN7960BDC Motor Driver
No ratings yet
BTN7960BDC Motor Driver
1 page
Pin Out 2560
100% (1)
Pin Out 2560
1 page
Hurst DMB0224C10002 BLDC Motor DataSheet
No ratings yet
Hurst DMB0224C10002 BLDC Motor DataSheet
4 pages
Module 1: Signals & System: Lecture 6: Basic Signals in Detail
No ratings yet
Module 1: Signals & System: Lecture 6: Basic Signals in Detail
37 pages
Infineon-AURIX ADC Filtering 1 KIT TC397 TFT-Training-v01 01-EN
No ratings yet
Infineon-AURIX ADC Filtering 1 KIT TC397 TFT-Training-v01 01-EN
22 pages
SSZT 421
No ratings yet
SSZT 421
5 pages
Arduino Sketch
No ratings yet
Arduino Sketch
5 pages
Elahi, Ata - Gschwender, Adam - Zigbee Wireless Sensor and Control Network-Prentice Hall (2009)
No ratings yet
Elahi, Ata - Gschwender, Adam - Zigbee Wireless Sensor and Control Network-Prentice Hall (2009)
289 pages
Test Booklet: Csat Paper-2
No ratings yet
Test Booklet: Csat Paper-2
36 pages
Unit-I TO Composite Materials
100% (1)
Unit-I TO Composite Materials
40 pages
FOP 21 September
No ratings yet
FOP 21 September
40 pages
Columns (Section 410 NSCP 2015) : Materials (410.2.1)
100% (2)
Columns (Section 410 NSCP 2015) : Materials (410.2.1)
8 pages
Annual Report SMGR
No ratings yet
Annual Report SMGR
425 pages
6.1. Electrochemistry
No ratings yet
6.1. Electrochemistry
77 pages
SN65HVD233-HT 3.3-V CAN Transceiver: 1 Features 3 Description
No ratings yet
SN65HVD233-HT 3.3-V CAN Transceiver: 1 Features 3 Description
37 pages
HUMAN PHYSIOLOGY Paper II
100% (1)
HUMAN PHYSIOLOGY Paper II
5 pages
String Theory Our Best Candidate For The Theory of Everything - JoshuaLijfering - Corrected
No ratings yet
String Theory Our Best Candidate For The Theory of Everything - JoshuaLijfering - Corrected
5 pages
MaxiTPMS ITS600 Pro User Manual US
No ratings yet
MaxiTPMS ITS600 Pro User Manual US
173 pages
Your Electronic Ticket Receipt
No ratings yet
Your Electronic Ticket Receipt
2 pages
Industrial Robotics Assignment 1
No ratings yet
Industrial Robotics Assignment 1
5 pages
Miniaturization of Rectangular Microstrip Patches Using Genetic Algorithms
No ratings yet
Miniaturization of Rectangular Microstrip Patches Using Genetic Algorithms
4 pages
Losoid Lo 20
No ratings yet
Losoid Lo 20
6 pages
Enlistment 2024 2025
No ratings yet
Enlistment 2024 2025
196 pages
3461
No ratings yet
3461
2 pages
Especificação Técnica RM7800L1087 PDF
No ratings yet
Especificação Técnica RM7800L1087 PDF
36 pages
20 Self Exploration Exercises
100% (1)
20 Self Exploration Exercises
12 pages
Spices: Quality & Processing Insights
100% (1)
Spices: Quality & Processing Insights
87 pages
1.-Va BMP Spec No 6 Rainwater Harvesting Final Draft v2!2!060613
No ratings yet
1.-Va BMP Spec No 6 Rainwater Harvesting Final Draft v2!2!060613
56 pages
STP EEDIC 020700 001 Rev3 Electrical Interconnection Diagram M007 Markup1 21.10.2024 SMU1 25.05.25
No ratings yet
STP EEDIC 020700 001 Rev3 Electrical Interconnection Diagram M007 Markup1 21.10.2024 SMU1 25.05.25
65 pages
Kanon Green Binder Rev 2
100% (1)
Kanon Green Binder Rev 2
202 pages
Quantitative Ability
No ratings yet
Quantitative Ability
9 pages
Purplsoc 2017 Pursuit of Pattern Languages For Societal Change Richard Sickinger Download
No ratings yet
Purplsoc 2017 Pursuit of Pattern Languages For Societal Change Richard Sickinger Download
28 pages
PROC 5071: Process Equipment Design I: Mixing and Agitation
No ratings yet
PROC 5071: Process Equipment Design I: Mixing and Agitation
43 pages
Nps Tables For Selected Sizes NPS To NPS 3
No ratings yet
Nps Tables For Selected Sizes NPS To NPS 3
4 pages
Syllabus Decoded UPSC
No ratings yet
Syllabus Decoded UPSC
30 pages
Paradise Weekly Model Test Cee Mds Based Model Test: (Saturday, Kartik 19, 2079)
No ratings yet
Paradise Weekly Model Test Cee Mds Based Model Test: (Saturday, Kartik 19, 2079)
35 pages
Forensic Examination of Sweat
No ratings yet
Forensic Examination of Sweat
8 pages