C language slides for c programming book by ANSI
Floating Point Numbers
Review of Numbers
• Computers are made to deal with numbers
• What can we represent in N bits?
• Unsigned integers:
$0$ to $2^N - 1$
• Signed integers (two's complement):
$-2^{N-1}$ to $2^{N-1} - 1$
• Signed integers (sign-magnitude or ones' complement):
$-(2^{N-1} - 1)$ to $2^{N-1} - 1$
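The range formulas above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and do not come from the slides.

```python
def unsigned_range(n_bits):
    """Smallest and largest values of an n_bits-wide unsigned integer."""
    return 0, 2**n_bits - 1


def twos_complement_range(n_bits):
    """Smallest and largest values of an n_bits-wide two's complement integer."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1


# Print the ranges for a few common widths.
for n in (8, 16, 32):
    print(n, unsigned_range(n), twos_complement_range(n))
```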
Other Numbers
• What about other numbers?
• Very large numbers? (seconds/century)
$3{,}155{,}760{,}000_{10}$ ($3.15576_{10} \times 10^{9}$)
• Very small numbers? (atomic diameter)
$0.00000001_{10}$ ($1.0_{10} \times 10^{-8}$)
• Rationals (repeating pattern)
• 2/3 (0.666666666...)
• Irrationals
• $2^{1/2}$ (1.414213562373...)
• Transcendentals
• e (2.718...), π (3.141...)
• All represented in scientific notation
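Fractional Binary Numbers
• Representation
• Bits $b_i\, b_{i-1} \cdots b_2\, b_1\, b_0 \,.\, b_{-1}\, b_{-2}\, b_{-3} \cdots b_{-j}$ carry the weights $2^i, 2^{i-1}, \ldots, 4, 2, 1, 1/2, 1/4, 1/8, \ldots, 2^{-j}$
• Bits to right of “binary point” represent fractional powers of 2
• Represents rational number: $\sum_{k=-j}^{i} b_k \times 2^k$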
Fractional Binary Numbers: Examples
• Value            Representation
  5 3/4 = 23/4     $101.11_2$ = 4 + 1 + 1/2 + 1/4
  2 7/8 = 23/8     $010.111_2$ = 2 + 1/2 + 1/4 + 1/8
  1 7/16 = 23/16   $001.0111_2$ = 1 + 1/4 + 1/8 + 1/16
• Observations
• Divide by 2 by shifting right (unsigned)
• Multiply by 2 by shifting left
• Numbers of form $0.111111\ldots_2$ are just below 1.0
• $1/2 + 1/4 + 1/8 + \cdots + 1/2^i + \cdots \to 1.0$
• Use notation 1.0 – ε
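As a rough illustration of the positional weights, the sketch below converts a binary string with a binary point into an exact fraction. The helper name `binary_to_fraction` is an assumption made for this example.

```python
from fractions import Fraction


def binary_to_fraction(text):
    """Convert a string such as '101.11' (base 2) into an exact Fraction."""
    whole, _, frac = text.partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    # The k-th bit to the right of the binary point has weight 1/2**k.
    for position, bit in enumerate(frac, start=1):
        if bit == "1":
            value += Fraction(1, 2**position)
    return value


for example in ("101.11", "010.111", "001.0111", "0.111111"):
    print(example, "=", binary_to_fraction(example))
```

The last example shows a value of the form 0.111…, which falls short of 1.0 by the weight of its final bit.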
Representable Numbers
• Limitation #1
• Can only exactly represent numbers of the form $x/2^k$
• Other rational numbers have repeating bit representations
• Value   Representation
  1/3     $0.0101010101[01]\ldots_2$
  1/5     $0.001100110011[0011]\ldots_2$
  1/10    $0.0001100110011[0011]\ldots_2$
• Limitation #2
• Just one setting of binary point within the w bits
• Limited range of numbers (very small values? very large?)
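The repeating patterns listed under Limitation #1 can be reproduced by repeated doubling. This is a small sketch, assuming the input lies in [0, 1); `binary_digits` is a hypothetical helper name.

```python
from fractions import Fraction


def binary_digits(value, count):
    """Return the first `count` binary digits of a Fraction in [0, 1)."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value >= 1)  # the integer part becomes the next bit
        digits.append(str(bit))
        value -= bit
    return "0." + "".join(digits)


print(binary_digits(Fraction(1, 3), 10))
print(binary_digits(Fraction(1, 5), 12))
print(binary_digits(Fraction(1, 10), 13))
```

Only values of the form x/2^k reach a remainder of zero; for every other rational number the remainder cycles, which produces the repeating bit pattern.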
Objective
• To understand the fundamentals of floating-point representation
• To know the IEEE-754 Floating Point Standard
Patriot Missile
• Gulf War I
• Failed to intercept incoming Iraqi Scud missile (Feb 25, 1991)
• 28 American soldiers killed
GAO Report: GAO/IMTEC-92-26 Patriot Missile Software
Problem
http://www.fas.org/spp/starwars/gao/im92026.htm
Patriot Design
• Intended to operate only for a few hours
• Defend Europe from Soviet aircraft and missiles
• Four 24-bit registers (1970s design!)
• Kept time with integer counter: incremented every 1/10 second
• Calculate speed of incoming missile to predict future positions:
velocity = (loc1 – loc0) / ((count1 – count0) * 0.1)
• But, cannot represent 0.1 exactly!
Floating Imprecision
• 24-bits:
$0.1 \approx 1/2^{4} + 1/2^{5} + 1/2^{8} + 1/2^{9} + 1/2^{12} + 1/2^{13} + 1/2^{16} + 1/2^{17} + 1/2^{20} + 1/2^{21} = 209715 / 2097152$
• Error per count is $0.2/2097152 = 1/10485760$ second
• One hour = 3600 seconds = 36,000 counts
$3600 \times 10 \times 1/10485760 \approx 0.0034$ s
• 20 hours: $\approx 0.0687$ s
• Miss target! (137 meters)
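The arithmetic on this slide can be replayed with exact fractions. The sketch below assumes, as the slide's sum does, that 0.1 is truncated after the bit with weight 1/2^21; the names are illustrative and this is not the Patriot code.

```python
from fractions import Fraction

FRACTION_BITS = 21  # last bit position used in the slide's sum
tenth = Fraction(1, 10)

# Chop 0.1 to FRACTION_BITS bits after the binary point.
stored = Fraction(int(tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_count = tenth - stored


def drift_seconds(hours):
    """Accumulated clock error after `hours` of counting tenths of a second."""
    counts = hours * 3600 * 10
    return float(counts * error_per_count)


print(stored, error_per_count)
print(drift_seconds(1), drift_seconds(20))
```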
Two weeks before the incident, Army officials received Israeli data
indicating some loss in accuracy after the system had been running
for 8 consecutive hours. Consequently, Army officials modified the
software to improve the system's accuracy. However, the modified
software did not reach Dhahran until February 26, 1991--the day
after the Scud incident.
GAO Report
http://fas.org/spp/starwars/gao/im92026.htm
Floating Point Representation
• Numerical Form:
$(-1)^s \times M \times 2^E$
• Sign bit s determines whether number is negative or positive
• Significand M normally a fractional value in range [1.0, 2.0)
• Exponent E weights value by power of two
• Encoding
• MSB s is sign bit s
• exp field encodes E (but is not equal to E)
• frac field encodes M (but is not equal to M)
• Layout: s | exp | frac
Example:
$15213_{10} = (-1)^0 \times 1.1101101101101_2 \times 2^{13}$
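To see the numerical form on a concrete value, the sketch below splits a nonzero finite number into s, M, and E using the standard `math.frexp` function. The wrapper name `decompose` is an assumption for this example.

```python
import math


def decompose(x):
    """Return (s, M, E) with x == (-1)**s * M * 2**E and 1.0 <= M < 2.0.

    Assumes x is finite and nonzero.
    """
    s = 1 if math.copysign(1.0, x) < 0 else 0
    # frexp gives abs(x) == m * 2**e with 0.5 <= m < 1, so rescale by one bit.
    m, e = math.frexp(abs(x))
    return s, m * 2, e - 1


s, M, E = decompose(15213.0)
print(s, M, E)
print(bin(15213))  # the significand bits, read with the point after the first 1
```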
Exponential Notation
• The following are equivalent representations of 1,234
123,400.0 × $10^{-2}$
12,340.0 × $10^{-1}$
1,234.0 × $10^{0}$
123.4 × $10^{1}$
12.34 × $10^{2}$
1.234 × $10^{3}$
0.1234 × $10^{4}$
The representations differ in that the decimal place (the “point”) “floats” to the left or right (with the appropriate adjustment in the exponent).
Parts of a Floating Point Number
$-0.9876 \times 10^{-3}$
• Sign of mantissa: –
• Location of decimal point: between 0 and 9876
• Mantissa: 0.9876
• Base: 10
• Sign of exponent: –
• Exponent: 3
IEEE 754 Standard
• Most common standard for representing floating
point numbers
• Single precision: 32 bits, consisting of...
• Sign bit (1 bit)
• Exponent (8 bits)
• Mantissa (23 bits)
• Double precision: 64 bits, consisting of…
• Sign bit (1 bit)
• Exponent (11 bits)
• Mantissa (52 bits)
Prof. William Kahan, primary architect of IEEE 754
Single Precision Format
32 bits: Sign of mantissa (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
Normalization
• The mantissa is normalized
• Has an implied decimal place on left
• Has an implied “1” on left of the decimal place
• E.g.,
• Mantissa: 10100000000000000000000
• Represents: $1.101_2 = 1.625_{10}$
• Normalized form: no leading 0s (exactly one digit to left of decimal point)
• Normalized: $1.0 \times 10^{-9}$
• Not normalized: $0.1 \times 10^{-8}$, $10.0 \times 10^{-10}$
Excess Notation
• To include +ve and –ve exponents, “excess”
notation is used
• Single precision: excess 127
• Double precision: excess 1023
• The value of the exponent stored is larger than the
actual exponent
• E.g., excess 127,
• Exponent: 10000111
• Represents: 135 – 127 = 8
Example
• Single precision
0 10000010 11000000000000000000000
• Sign: 0 = positive mantissa
• Exponent: 130 – 127 = 3
• Mantissa: $1.11_2$
• Value: $+1.11_2 \times 2^3 = 1110.0_2 = 14.0_{10}$
Hexadecimal
• It is convenient and common to represent
the original floating point number in
hexadecimal
• The preceding example…
0 10000010 11000000000000000000000
= 0100 0001 0110 0000 0000 0000 0000 0000
= 4 1 6 0 0 0 0 0 = $41600000_{16}$
Converting from Floating Point
• E.g., What decimal value is represented by
the following 32-bit floating point number?
$\mathrm{C17B0000}_{16}$
• Step 1
• Express in binary and find S, E, and M
$\mathrm{C17B0000}_{16}$ = 1 10000010 11110110000000000000000$_2$
• S = 1, E = 10000010, M = 11110110000000000000000
• For S: 1 = negative, 0 = positive
• Step 2
• Find “real” exponent, n
• n = E – 127
= $10000010_2$ – 127
= 130 – 127
= 3
• Step 3
• Put S, M, and n together to form binary result
• (Don’t forget the implied “1.” on the left of the mantissa.)
$-1.1111011_2 \times 2^n$ =
$-1.1111011_2 \times 2^3$ =
$-1111.1011_2$
• Step 4
• Express result in decimal
$-1111.1011_2$
• Integer part: $1111_2$ = 15
• Fraction part: $2^{-1}$ = 0.5, $2^{-3}$ = 0.125, $2^{-4}$ = 0.0625; sum = 0.6875
Answer: -15.6875
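Steps 1 to 4 can be sketched in Python as follows. The function `decode_single` is a hypothetical helper that handles normalized values only (exponent field 1 to 254); the `struct` call at the end is a cross-check against the platform's own interpretation of the same bytes.

```python
import struct


def decode_single(hex_word):
    """Decode a normalized 32-bit float given as 8 hex digits.

    Returns (S, E field, M field, value).
    """
    bits = int(hex_word, 16)
    sign = bits >> 31                 # Step 1: split into S, E, M
    exp_field = (bits >> 23) & 0xFF
    frac_field = bits & 0x7FFFFF
    n = exp_field - 127               # Step 2: "real" exponent
    significand = 1 + frac_field / 2**23   # Step 3: restore the implied "1."
    value = (-1) ** sign * significand * 2.0**n  # Step 4: decimal value
    return sign, exp_field, frac_field, value


print(decode_single("C17B0000"))
# Cross-check using the standard library's big-endian single-precision format.
print(struct.unpack(">f", bytes.fromhex("C17B0000"))[0])
```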
Converting from Floating Point
• E.g., What decimal value is represented by
the following 32-bit floating point number?
$42808000_{16}$
Converting to Floating Point
• E.g., Express $36.5625_{10}$ as a 32-bit floating point number (in hexadecimal)
• Step 1
• Express original value in binary
$36.5625_{10} = 100100.1001_2$
• Step 2
• Normalize
$100100.1001_2 = 1.001001001_2 \times 2^5$
• Step 3
• Determine S, E, and M
$+1.001001001_2 \times 2^5$
S = 0 (because the value is positive)
M = 001001001 (the bits after the implied “1.”)
n = 5
E = n + 127
= 5 + 127
= 132
= $10000100_2$
• Step 4
• Put S, E, and M together to form 32-bit binary result
0 10000100 00100100100000000000000$_2$ (S, E, M)
• Step 5
• Express in hexadecimal
0 10000100 00100100100000000000000$_2$ =
0100 0010 0001 0010 0100 0000 0000 0000$_2$ =
4 2 1 2 4 0 0 0$_{16}$
Answer: $42124000_{16}$
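The five steps can be sketched as below. `encode_single` is an illustrative name; the sketch assumes a nonzero value in the normalized range and truncates extra fraction bits, so it matches the hardware encoding only when the value fits exactly in 24 significant bits. The `struct` line gives the standard library's encoding for comparison.

```python
import math
import struct


def encode_single(x):
    """Encode x as 8 hex digits of an IEEE 754 single (normalized, exact values)."""
    sign = 1 if x < 0 else 0
    # frexp gives abs(x) == m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(abs(x))
    n = e - 1                      # exponent for a significand in [1, 2)
    significand = m * 2
    exp_field = n + 127            # excess-127 encoding
    frac_field = int((significand - 1) * 2**23)  # drop the implied "1."
    bits = (sign << 31) | (exp_field << 23) | frac_field
    return f"{bits:08X}"


print(encode_single(36.5625))
print(struct.pack(">f", 36.5625).hex().upper())  # library cross-check
```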
Converting to Floating Point
• E.g., Express $6.5_{10}$ as a 32-bit floating point number (in hexadecimal)
Converting to Floating Point
• E.g., Express 0.1 as a 32-bit floating point
number (in hexadecimal)
Zero, Infinity, and NaN
• Zero
– Exponent field E = 0 and fraction F = 0
– +0 and –0 are possible according to sign bit S
• Infinity
– Infinity is a special value represented with maximum E and F = 0
• For single precision with 8-bit exponent: maximum E = 255
• For double precision with 11-bit exponent: maximum E = 2047
– Infinity can result from overflow or division by zero
– +∞ and –∞ are possible according to sign bit S
• NaN (Not a Number)
– NaN is a special value represented with maximum E and F ≠ 0
– Results from exceptional situations, such as 0/0 or sqrt(negative)
– Any operation on a NaN yields NaN: Op(X, NaN) = NaN
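The encodings above can be inspected from Python. This is a minimal sketch for single precision; `float_to_bits` and `classify` are hypothetical helper names.

```python
import math
import struct


def float_to_bits(x):
    """Return the 32-bit single-precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def classify(bits):
    """Name the kind of value encoded by a 32-bit pattern."""
    exp_field = (bits >> 23) & 0xFF
    frac_field = bits & 0x7FFFFF
    if exp_field == 0:
        return "zero" if frac_field == 0 else "denormalized"
    if exp_field == 255:
        return "infinity" if frac_field == 0 else "NaN"
    return "normalized"


for x in (0.0, -0.0, math.inf, -math.inf, math.nan, 1.0):
    bits = float_to_bits(x)
    print(f"{x!r:>5} -> {bits:08X} ({classify(bits)})")
```

Note that Python itself raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning infinity, so the sketch uses `math.inf` and `math.nan` directly.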
Simple 6-bit Floating Point Example
• 6-bit floating point representation
– Sign bit is the most significant bit
– Next 3 bits are the exponent with a bias of 3
– Last 2 bits are the fraction
• Same general form as IEEE
– Normalized, denormalized
– Representation of 0, infinity and NaN
• Value of normalized numbers $(-1)^S \times (1.F)_2 \times 2^{E - 3}$
• Value of denormalized numbers $(-1)^S \times (0.F)_2 \times 2^{-2}$
• Layout: S (1 bit) | Exponent (3 bits) | Fraction (2 bits)
Values Related to Exponent
Exp.  exp  E    $2^E$
0     000  -2   1/4    (denormalized)
1     001  -2   1/4    (normalized)
2     010  -1   1/2    (normalized)
3     011  0    1      (normalized)
4     100  1    2      (normalized)
5     101  2    4      (normalized)
6     110  3    8      (normalized)
7     111  n/a         (inf or NaN)
Dynamic Range of Values
s exp frac E   value
0 000 00   -2  0
0 000 01   -2  1/4*1/4=1/16   (smallest denormalized)
0 000 10   -2  2/4*1/4=2/16
0 000 11   -2  3/4*1/4=3/16   (largest denormalized)
0 001 00   -2  4/4*1/4=4/16=1/4=0.25   (smallest normalized)
0 001 01   -2  5/4*1/4=5/16
0 001 10   -2  6/4*1/4=6/16
0 001 11   -2  7/4*1/4=7/16
0 010 00   -1  4/4*2/4=8/16=1/2=0.5
0 010 01   -1  5/4*2/4=10/16
0 010 10   -1  6/4*2/4=12/16=0.75
0 010 11   -1  7/4*2/4=14/16
Dynamic Range of Values
s exp frac E value
0 011 00 0 4/4*4/4=16/16=1
0 011 01 0 5/4*4/4=20/16=1.25
0 011 10 0 6/4*4/4=24/16=1.5
0 011 11 0 7/4*4/4=28/16=1.75
0 100 00 1 4/4*8/4=32/16=2
0 100 01 1 5/4*8/4=40/16=2.5
0 100 10 1 6/4*8/4=48/16=3
0 100 11 1 7/4*8/4=56/16=3.5
0 101 00 2 4/4*16/4=64/16=4
0 101 01 2 5/4*16/4=80/16=5
0 101 10 2 6/4*16/4=96/16=6
0 101 11 2 7/4*16/4=112/16=7
Dynamic Range of Values
s exp frac E value
0 110 00 3 4/4*32/4=128/16=8
0 110 01 3 5/4*32/4=160/16=10
0 110 10 3 6/4*32/4=192/16=12
0 110 11 3 7/4*32/4=224/16=14   (largest normalized)
0 111 00 n/a ∞
0 111 01 n/a NaN
0 111 10 n/a NaN
0 111 11 n/a NaN
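The three tables can be regenerated from the two value formulas of the 6-bit format. The sketch below is one possible decoder; `decode6` is a name invented for this example.

```python
from fractions import Fraction
import math

BIAS = 3  # bias of the 3-bit exponent field


def decode6(bits):
    """Decode a 6-bit pattern laid out as s | exp (3 bits) | frac (2 bits)."""
    sign = -1 if (bits >> 5) & 1 else 1
    exp_field = (bits >> 2) & 0b111
    frac_field = bits & 0b11
    if exp_field == 0b111:                      # infinity or NaN
        return sign * math.inf if frac_field == 0 else math.nan
    if exp_field == 0:                          # denormalized: 0.F, E = 1 - bias
        significand = Fraction(frac_field, 4)
        exponent = 1 - BIAS
    else:                                       # normalized: 1.F, E = exp - bias
        significand = 1 + Fraction(frac_field, 4)
        exponent = exp_field - BIAS
    return sign * significand * Fraction(2) ** exponent


# Print every non-negative pattern with its value.
for bits in range(32):
    print(f"{bits:06b}", decode6(bits))
```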
Floating Point Addition Example
• Consider adding: $(1.111)_2 \times 2^{-1} + (1.011)_2 \times 2^{-3}$
– For simplicity, we assume 4 bits of precision (or 3 bits of fraction)
• Cannot add significands … Why?
– Because exponents are not equal
• How to make exponents equal?
– Shift the significand of the lesser exponent right until its exponent matches the larger number
• $(1.011)_2 \times 2^{-3} = (0.1011)_2 \times 2^{-2} = (0.01011)_2 \times 2^{-1}$
– Difference between the two exponents = –1 – (–3) = 2
– So, shift right by 2 bits
• Now, add the significands (the sum produces a carry):
   1.111
+  0.01011
= 10.00111
Addition Example
• So, $(1.111)_2 \times 2^{-1} + (1.011)_2 \times 2^{-3} = (10.00111)_2 \times 2^{-1}$
• However, result $(10.00111)_2 \times 2^{-1}$ is NOT normalized
• Normalize result: $(10.00111)_2 \times 2^{-1} = (1.000111)_2 \times 2^{0}$
– In this example, we have a carry
– So, shift right by 1 bit and increment the exponent
• Round the significand to fit in appropriate number of bits
– We assumed 4 bits of precision or 3 bits of fraction
• Round to nearest: $(1.000111)_2 \approx (1.001)_2$
– The discarded bits 111 are more than half of the last kept bit, so 1 is added to it:
  1.000 | 111
+     1
= 1.001
– Renormalize if rounding generates a carry
• Detect overflow / underflow
– If exponent becomes too large (overflow) or too small (underflow)
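The align, add, normalize, and round sequence can be sketched with exact fractions. The function `fp_add` below is a simplified illustration that assumes two positive normalized operands and ignores overflow and underflow; it is not how hardware implements the operation. Python's `round` on a `Fraction` rounds ties to even, which matches the default IEEE rounding mode.

```python
from fractions import Fraction


def fp_add(sig_a, exp_a, sig_b, exp_b, frac_bits=3):
    """Add sig_a * 2**exp_a and sig_b * 2**exp_b; significands lie in [1, 2)."""
    # Align: shift the operand with the smaller exponent to the right.
    exp = max(exp_a, exp_b)
    total = sig_a / 2 ** (exp - exp_a) + sig_b / 2 ** (exp - exp_b)
    # Normalize: a carry out of the sum means shift right and bump the exponent.
    while total >= 2:
        total /= 2
        exp += 1
    # Round to the nearest multiple of 2**-frac_bits.
    scale = 2**frac_bits
    total = Fraction(round(total * scale), scale)
    # Renormalize if rounding itself generated a carry.
    if total >= 2:
        total /= 2
        exp += 1
    return total, exp


# (1.111)_2 x 2^-1 + (1.011)_2 x 2^-3
print(fp_add(Fraction(15, 8), -1, Fraction(11, 8), -3))
```

The printed significand 9/8 is $(1.001)_2$ with exponent 0, in agreement with the slide.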
Summary: IEEE Floating Point
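Single Precision (32 bits)
Bit 31: Sign (1 bit) | Bits 30-23: Exponent (8 bits) | Bits 22-0: Fraction (23 bits)
Exponent values:
0          zeroes (and denormalized values)
1-254      exp + 127
255        infinities, NaN
Value = $(1 - 2 \times \text{Sign}) \times (1 + \text{Fraction}) \times 2^{\text{Exponent} - 127}$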
Denormalized Values
• Condition
• exp = 000…0
• Value
• Exponent value E = –Bias + 1
• Significand value $M = 0.xxx\ldots x_2$
• xxx…x: bits of frac
• Cases
• exp = 000…0, frac = 000…0
• Represents value 0
• Note that there are distinct values +0 and –0
• exp = 000…0, frac ≠ 000…0
• Numbers very close to 0.0
Special Values
• Condition
• exp = 111…1
• Cases
• exp = 111…1, frac = 000…0
• Represents value ∞ (infinity)
• Operation that overflows
• Both positive and negative
• E.g., 1.0/0.0 = –1.0/–0.0 = +∞, 1.0/–0.0 = –∞
• exp = 111…1, frac ≠ 000…0
• Not-a-Number (NaN)
• Represents case when no numeric value can be determined
• E.g., sqrt(–1), ∞ – ∞, ∞ × 0
Interesting Numbers
• Description                exp     frac    Numeric Value
• Zero                       00…00   00…00   0.0
• Smallest Pos. Denorm.      00…00   00…01   $2^{-\{23,52\}} \times 2^{-\{126,1022\}}$
• Single ≈ $1.4 \times 10^{-45}$
• Double ≈ $4.9 \times 10^{-324}$
• Largest Denormalized       00…00   11…11   $(1.0 - \varepsilon) \times 2^{-\{126,1022\}}$
• Single ≈ $1.18 \times 10^{-38}$
• Double ≈ $2.2 \times 10^{-308}$
• Smallest Pos. Normalized   00…01   00…00   $1.0 \times 2^{-\{126,1022\}}$
• Just larger than largest denormalized
• One                        01…11   00…00   1.0
• Largest Normalized         11…10   11…11   $(2.0 - \varepsilon) \times 2^{\{127,1023\}}$
• Single ≈ $3.4 \times 10^{38}$
• Double ≈ $1.8 \times 10^{308}$
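The values in this table can be reproduced by building each bit pattern directly. The sketch below uses the standard `struct` module; the helper names and the hexadecimal patterns are written out for this example from the exp and frac columns above.

```python
import struct


def single_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def double_from_bits(bits):
    """Interpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


rows = [
    ("smallest pos. denorm.", 0x00000001, 0x0000000000000001),
    ("largest denormalized", 0x007FFFFF, 0x000FFFFFFFFFFFFF),
    ("smallest pos. normalized", 0x00800000, 0x0010000000000000),
    ("one", 0x3F800000, 0x3FF0000000000000),
    ("largest normalized", 0x7F7FFFFF, 0x7FEFFFFFFFFFFFFF),
]
for name, single_bits, double_bits in rows:
    single = single_from_bits(single_bits)
    double = double_from_bits(double_bits)
    print(f"{name:26} {single:.3e} {double:.3e}")
```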
Visualization: Floating Point Encodings
NaN | −∞ | −Normalized | −Denorm | −0 +0 | +Denorm | +Normalized | +∞ | NaN
Tiny Floating Point Example
• 8-bit Floating Point Representation
• the sign bit is in the most significant bit
• the next four bits are the exp, with a bias of 7
• the last three bits are the frac
• Same general form as IEEE Format
• normalized, denormalized
• representation of 0, NaN, infinity
• Layout: s (1 bit) | exp (4 bits) | frac (3 bits)
Dynamic Range (s=0 only)
$v = (-1)^s \times M \times 2^E$
norm: E = exp – Bias
denorm: E = 1 – Bias
s exp  frac  E    Value
Denormalized numbers
0 0000 000   -6   0
0 0000 001   -6   1/8*1/64 = 1/512   (closest to zero)
0 0000 010   -6   2/8*1/64 = 2/512   = $(-1)^0 \times (0 + 1/4) \times 2^{-6}$
…
0 0000 110   -6   6/8*1/64 = 6/512
0 0000 111   -6   7/8*1/64 = 7/512   (largest denorm)
Normalized numbers
0 0001 000   -6   8/8*1/64 = 8/512   (smallest norm)
0 0001 001   -6   9/8*1/64 = 9/512   = $(-1)^0 \times (1 + 1/8) \times 2^{-6}$
…
0 0110 110   -1   14/8*1/2 = 14/16
0 0110 111   -1   15/8*1/2 = 15/16   (closest to 1 below)
0 0111 000   0    8/8*1 = 1
0 0111 001   0    9/8*1 = 9/8        (closest to 1 above)
0 0111 010   0    10/8*1 = 10/8
…
0 1110 110   7    14/8*128 = 224
0 1110 111   7    15/8*128 = 240     (largest norm)
0 1111 000   n/a  inf
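The landmark rows of this table can be recomputed with a decoder that takes the field widths as parameters. This is a sketch under the IEEE-like rules stated on the slide; `decode` and its parameter names are assumptions for this example.

```python
from fractions import Fraction
import math


def decode(bits, exp_bits, frac_bits):
    """Decode an IEEE-like pattern: s | exp (exp_bits) | frac (frac_bits)."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1 if (bits >> (exp_bits + frac_bits)) & 1 else 1
    exp_field = (bits >> frac_bits) & (2**exp_bits - 1)
    frac_field = bits & (2**frac_bits - 1)
    fraction = Fraction(frac_field, 2**frac_bits)
    if exp_field == 2**exp_bits - 1:            # all ones: infinity or NaN
        return sign * math.inf if frac_field == 0 else math.nan
    if exp_field == 0:                          # denorm: M = 0.frac, E = 1 - bias
        return sign * fraction * Fraction(2) ** (1 - bias)
    return sign * (1 + fraction) * Fraction(2) ** (exp_field - bias)


landmarks = {
    "closest to zero": 0b0_0000_001,
    "largest denorm": 0b0_0000_111,
    "smallest norm": 0b0_0001_000,
    "closest to 1 below": 0b0_0110_111,
    "closest to 1 above": 0b0_0111_001,
    "largest norm": 0b0_1110_111,
}
for name, pattern in landmarks.items():
    print(f"{name:20} {pattern:08b} {decode(pattern, 4, 3)}")
```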
Distribution of Values
• 6-bit IEEE-like format
• e = 3 exponent bits
• f = 2 fraction bits
• Bias is $2^{3-1} - 1 = 3$
• Layout: s (1 bit) | exp (3 bits) | frac (2 bits)
• Notice how the distribution gets denser toward zero.
[Figure: number line from -15 to 15 marking denormalized, normalized, and infinity values; one group is labeled "8 values"]
Floats are not Reals
Need to understand details of underlying implementations
• Ints:
• E.g., 40000 * 40000 --> 1600000000
• 600000 * 600000 --> ? (largest 32-bit signed value: $2^{31} - 1$ = 2,147,483,647)
• Floats:
• E.g., is (x + y) + z = x + (y + z)?
• (1e20 + -1e20) + 3.14 --> 3.14
• 1e20 + (-1e20 + 3.14) --> ??
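Both questions can be explored in Python. Python integers do not overflow, so the sketch below simulates 32-bit two's complement wraparound with a hypothetical helper `wrap_int32`; in C, signed overflow is undefined behavior, and wraparound is only what typical hardware produces. The float expressions use Python's double-precision `float` directly.

```python
def wrap_int32(value):
    """Reduce an integer to the 32-bit two's complement range."""
    return (value + 2**31) % 2**32 - 2**31


# Integer products: the first fits in 32 bits, the second does not.
print(wrap_int32(40000 * 40000))
print(wrap_int32(600000 * 600000))

# Floating point addition is not associative.
left = (1e20 + -1e20) + 3.14
right = 1e20 + (-1e20 + 3.14)
print(left, right, left == right)
```

In the second float expression, 3.14 is far smaller than the spacing between doubles near 1e20, so it is lost before the large terms cancel.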
IEEE 754 Binary16 (F16) Format
Component            Bits
Sign bit             1
Exponent             5
Fraction             10
Total                16 bits (2 bytes)
Overview of IEEE 754 Binary32
Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             8     Encodes exponent with bias
Fraction (Mantissa)  23    Precision bits (fractional part)
IEEE 754 Binary64
Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             11    Encodes exponent with bias
Fraction (Mantissa)  52    Precision bits (fractional part)
IEEE 754 Binary128 (F128) Format
Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             15    Encodes exponent using a bias of 16383
Fraction (Mantissa)  112   Fractional part of the significand
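The field widths in these tables determine each format's total size and exponent bias. The sketch below applies the same rule the slides use for the 6-bit format, bias = $2^{e-1} - 1$ for e exponent bits; the dictionary and function names are illustrative.

```python
# (exponent bits, fraction bits) for each format in the tables above.
FORMATS = {
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}


def bias(exponent_bits):
    """Exponent bias for an exponent field of the given width."""
    return 2 ** (exponent_bits - 1) - 1


def total_bits(exponent_bits, fraction_bits):
    """Sign bit plus exponent and fraction fields."""
    return 1 + exponent_bits + fraction_bits


for name, (e, f) in FORMATS.items():
    print(f"{name:10} total={total_bits(e, f):3} bias={bias(e)}")
```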
