Lecture 10: Floating-Point
EEE 105 Computer Organization
1st Semester AY 2019-20
Floating Point Numbers
• Many values in scientific calculations cannot be
represented as integers
• Floating-point can also represent a wider range of numbers
than fixed-point numbers
• Radix point can vary in position (floating point)
• Examples:
◦ $6.0221 \times 10^{23}$ mol$^{-1}$ (Avogadro’s number)
◦ $1.6022 \times 10^{-19}$ C (magnitude of electron charge)
Floating Point Numbers
• A number is said to be normalized when the radix point is
placed immediately to the right of the first non-zero significant digit
• For a decimal system
$\pm X_1.X_2X_3X_4X_5X_6X_7 \times 10^{\pm Y_1Y_2}$
◦ Number of significant digits = 7
◦ Range of exponent = ±99
◦ Mantissa: $X_2X_3 \ldots X_6X_7$
▪ The digits (decimal or binary) that follow the radix point
IEEE-754 Standard
• Defines the functionality of floating-point representation and
arithmetic
• Specifies the basic and extended floating-point number
formats
◦ Five basic formats based on the definition of the standard:
▪ Three binary floating-point formats (encoded w/ 32, 64, and 128 bits)
▪ Two decimal floating-point formats (encoded w/ 64 and 128 bits)
◦ Two most-common formats:
▪ 32-bit single precision format (binary32)
▪ 64-bit double precision format (binary64)
IEEE-754 Standard
• 32-bit Single Precision Format
content       S      E          M
              sign   exponent   mantissa
size (bits)   1      8          23
Conversion: $N = (-1)^S \times 2^{E-127} \times 1.M$ (the leading 1 of $1.M$ is implied, not stored)
◦ Float format
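The field layout and conversion formula above can be checked with a short Python sketch. It relies on the standard `struct` module to obtain the binary32 bit pattern; the helper names `decode_binary32` and `value_from_fields` are illustrative only, and the formula is applied here to normal numbers only (E neither all 0's nor all 1's).

```python
import struct

def decode_binary32(x):
    """Round a Python float to binary32 and split it into its S, E, M fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # 1 sign bit
    e = (bits >> 23) & 0xFF    # 8 exponent bits (biased)
    m = bits & 0x7FFFFF        # 23 mantissa bits
    return s, e, m

def value_from_fields(s, e, m):
    """Apply N = (-1)^S * 2^(E-127) * 1.M (normal numbers only)."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)

s, e, m = decode_binary32(-6.5)          # -6.5 = -1.101 (binary) x 2^2
print(s, e, format(m, "023b"))
print(value_from_fields(s, e, m))
```

The same approach works for binary64 with the `">d"`/`">Q"` format codes, 11 exponent bits, 52 mantissa bits, and a bias of 1023.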
IEEE-754 Standard
• 64-bit Double Precision Format
content       S      E          M
              sign   exponent   mantissa
size (bits)   1      11         52
Conversion: $N = (-1)^S \times 2^{E-1023} \times 1.M$ (the leading 1 of $1.M$ is implied, not stored)
◦ Double format
◦ To allow for extended range and precision
◦ Reduces round-off errors during intermediate calculations
IEEE-754 Standard
• The implied one is used to maximize the number of bits available for encoding
◦ For a non-zero number, the first non-zero (binary) digit is
always one, so it does not need to be stored
• However, the implied one is used only when E is strictly between
min(E) and max(E)
◦ Max(E) = all 1’s
◦ Min(E) = all 0’s
• Exponent bias (i.e. 127 for single and 1023 for double) allows for the
representation of both very small and very large numbers
◦ Max(E) and min(E) have special meanings and are interpreted differently
◦ Effective range of the actual exponent (E − bias)
▪ -126 to 127 (single)
▪ -1022 to 1023 (double)
Special Numbers
• Zero
◦ $E = 0$ (min(E), i.e. all 0’s), $M = 0$
◦ Can have positive and negative zero
• Subnormal/Denormal
◦ $E = 0$ (min(E), i.e. all 0’s), $M \neq 0$
◦ No implied one (a zero is used)
◦ Resulting representation is not normalized
◦ Actual exponent is -126 instead of -127
◦ Value (single precision): $(-1)^S \times 0.M \times 2^{-126}$
Special Numbers
• Infinities
◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 = 0
◦ Also used as replacement for the result when overflow occurs
• Not a number (NaN)
◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 ≠ 0
◦ Result of invalid operations like
▪ 0/0
▪ Sqrt(-1)
▪ Infinity*0
▪ [+infinity] + [-infinity]
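The encodings of the special numbers can be summarized as a small classifier over the E and M fields. This is a sketch for binary32 only; `classify_binary32` and `bits_of` are hypothetical helper names, and the invalid operations are limited to ones that Python evaluates without raising an exception.

```python
import math
import struct

def classify_binary32(bits):
    """Classify a 32-bit pattern using only its E and M fields."""
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0:                       # min(E): zero or subnormal
        return "zero" if m == 0 else "subnormal"
    if e == 0xFF:                    # max(E): infinity or NaN
        return "infinity" if m == 0 else "NaN"
    return "normal"

def bits_of(x):
    """Return the binary32 bit pattern of a Python float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

samples = [("+0", 0.0), ("-0", -0.0), ("1.0", 1.0),
           ("+infinity", math.inf),
           ("infinity * 0", math.inf * 0.0),
           ("[+infinity] + [-infinity]", math.inf + (-math.inf))]
for label, x in samples:
    print(f"{label}: {classify_binary32(bits_of(x))}")
print("pattern 0x00000001:", classify_binary32(0x00000001))
```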
Subnormal Numbers
• Example (single precision, but assuming that M is only 2 bits long)
◦ $E = 1_{10}$ (exponent factor is $2^{1-127} = 2^{-126}$)
M        Significand   Value
$11_2$   $1.11_2$      $1.11_2 \times 2^{-126}$
$10_2$   $1.10_2$      $1.10_2 \times 2^{-126}$
$01_2$   $1.01_2$      $1.01_2 \times 2^{-126}$
$00_2$   $1.00_2$      $1.00_2 \times 2^{-126}$
(the sequence continues towards zero in the next table)
◦ $E = 0_{10}$ (exponent factor is still $2^{-126}$, but there is no implied 1)
M        Significand   Value
$11_2$   $0.11_2$      $0.11_2 \times 2^{-126}$
$10_2$   $0.10_2$      $0.10_2 \times 2^{-126}$
$01_2$   $0.01_2$      $0.01_2 \times 2^{-126}$
$00_2$   $0.00_2$      $0.00_2 \times 2^{-126}$
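A quick way to see why the subnormal exponent is -126 rather than -127 is to list the eight values of this toy format and check that they are evenly spaced. The sketch below uses exact fractions; `toy_value` and `M_BITS` are names invented for this illustration.

```python
from fractions import Fraction

M_BITS = 2  # toy mantissa width used in the example above

def toy_value(e, m):
    """Value of the toy format for E in {0, 1} and a 2-bit mantissa M."""
    implied = 1 if e > 0 else 0           # no implied one when E = 0
    actual_exponent = max(e, 1) - 127      # E = 0 reuses the exponent of E = 1
    return (implied + Fraction(m, 2 ** M_BITS)) * Fraction(2) ** actual_exponent

unit = Fraction(2) ** -126
values = [toy_value(e, m) / unit for e in (1, 0) for m in (3, 2, 1, 0)]
for v in values:
    print(f"{v} x 2^-126")

# Consecutive values differ by the same step, so there is no gap around zero.
steps = {a - b for a, b in zip(values, values[1:])}
print("step sizes:", steps)
```

Had the subnormal exponent been -127, the largest subnormal would be $0.11_2 \times 2^{-127}$ and a gap would open between it and the smallest normal number.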
Examples
Minimum and maximum
numbers (binary32)
• Useful to know these numbers when identifying possible cases of
overflow and underflow
◦ Maximum magnitude ($E = \max(E) - 1$, $M$ = all 1’s)
▪ $1.M \times 2^{127} = (2^{24} - 1) \times 2^{-23} \times 2^{127} \approx 2^{128}$
▪ Must be infinity when the magnitude reaches $2^{128}$ (since then $E = \max(E)$)
▪ Note: normalized
◦ Minimum non-zero magnitude ($E = 0$, $M = 0 \ldots 01$)
▪ $0.0 \ldots 01 \times 2^{-126} = 2^{-23} \times 2^{-126} = 2^{-149}$
▪ Note: denormal
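These two limits can be confirmed numerically. The sketch below builds the bit patterns directly and compares them with the closed forms above using exact fractions; `bits_to_float` is an illustrative helper name.

```python
import struct
from fractions import Fraction

def bits_to_float(bits):
    """Interpret a 32-bit integer as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

largest = bits_to_float(0x7F7FFFFF)    # E = max(E) - 1 = 254, M = all 1's
smallest = bits_to_float(0x00000001)   # E = 0, M = 0...01

print(largest, smallest)
print(Fraction(largest) == (2**24 - 1) * Fraction(1, 2**23) * 2**127)
print(Fraction(smallest) == Fraction(1, 2**149))
print(bits_to_float(0x7F800000))       # E = max(E), M = 0
```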
Floating-Point Operations
Some Required Operations in the IEEE-754 standard
• Add
• Subtract
• Multiply
• Divide
• Square Root
• Remainder
• Comparisons
• Conversion between formats
Floating-Point Exceptions
• Causes:
◦ Invalid floating-point operations (e.g. square root of a negative number)
◦ Division by zero – result must be “infinity”
◦ Overflow – resulting $E$ reaches or exceeds $\max(E)$
◦ Underflow – resulting $E$ becomes less than 0
◦ Inexact result
▪ By default, the result is rounded to fit into the format
Floating Point Conversion
• If 𝜋 is approximately 3.14159265359, how do you represent
this in single precision format?
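One way to work this out: $3.14159265359 \approx 11.0010010000111111011011_2 = 1.10010010000111111011011_2 \times 2^{1}$ after rounding to 23 mantissa bits, so $S = 0$, $E = 1 + 127 = 128$, and $M$ is the 23 bits after the radix point. The Python sketch below checks the result with the standard `struct` module, which rounds to the nearest binary32 value when packing.

```python
import struct

x = 3.14159265359
(bits,) = struct.unpack(">I", struct.pack(">f", x))

sign = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

print(f"S = {sign}")
print(f"E = {exponent} = {exponent:08b} (actual exponent {exponent - 127})")
print(f"M = {mantissa:023b}")
print(f"hex = {bits:#010x}")   # 0x40490fdb
```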
Floating-Point Addition and
Subtraction
• Has more steps (because of the format) than fixed-point
addition or subtraction
• Addition of significands (the mantissa together with the implied one, if any) is
the same as fixed-point addition
• A similar procedure is used for subtraction
Floating-Point Addition and
Subtraction
1. Compare exponents
2. Significand of the number with smaller exponent must be
shifted right by the difference in exponents
◦ To align the radix points before addition/subtraction
◦ Possible shifting out of some significant digits
◦ Example: $E_1 = 100$ and $E_2 = 95$
▪ Then, the 2nd significand must be shifted to the right 5 times
Before: Significand2 = 1.00001111000011110001011
After:  Significand2 = 0.0000100001111000011110001011
Note: simplification only (assumes NO guard bits)
Floating-Point Addition and
Subtraction
3. Sign bit is used to determine whether true addition or
subtraction must be done
4. Addition/subtraction of aligned significands is performed
5. Normalize the result (and adjust the exponent 𝐸) if needed
◦ Do not let the actual exponent go below -126 (single) or -1022 (double)
6. Round off the resulting mantissa before placing it into the final
format
◦ Extract the part of the mantissa that must be encoded in 𝑀
◦ Possibly need to remove an implied 1 in encoding
Floating Point Addition and
Subtraction
Additional Notes:
• Addition
◦ Addition of significands may require normalization (shifting to the right)
▪ Resulting sum may have more than 1 significant digit on the left of the radix point
◦ Easiest to add magnitudes (positive numbers)
• Subtraction
◦ Subtraction of significands may require normalization (shifting to the
left)
▪ Resulting difference may have its first non-zero digit to the right of the radix point
◦ Result of subtraction may be negative – need to get 2’s complement and
update the sign bit of the result
▪ Do the 2’s complement before normalizing
Floating-Point Addition
  +  1.111 0010 0000 0000 0000 0010      $\times 2^{4}$
  +  1.100 0000 0000 0001 1000 0101      $\times 2^{2}$
• Need to align radix points first
  +  1.111 0010 0000 0000 0000 0010      $\times 2^{4}$
  +  0.011 0000 0000 0000 0110 0001 01   $\times 2^{4}$
  + 10.010 0010 0000 0000 0110 0011 01   $\times 2^{4}$
• Need to normalize to identify the correct $E$ and $M$ to use
  +  1.001 0001 0000 0000 0011 0001 101  $\times 2^{5}$
$M = 001\ 0001\ 0000\ 0000\ 0011\ 0001_2$ (extra bits truncated) while $E = 5 + 127 = 132$
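The addition example can be reproduced with exact arithmetic. The sketch below parses the two significands, adds them, normalizes the sum, and truncates to 23 mantissa bits (truncation is an assumption made to match the example; it is not round to nearest). The helpers `sig` and `normalize` are illustrative names.

```python
from fractions import Fraction

def sig(bits):
    """Parse a binary significand such as '1.111 0010' into an exact Fraction."""
    whole, frac = bits.replace(" ", "").split(".")
    return Fraction(int(whole + frac, 2), 2 ** len(frac))

def normalize(value):
    """Return (significand in [1, 2), exponent) for a positive value."""
    exponent = 0
    while value >= 2:      # shift right: more than one digit left of the radix point
        value /= 2
        exponent += 1
    while value < 1:       # shift left: leading zeros after the radix point
        value *= 2
        exponent -= 1
    return value, exponent

a = sig("1.111 0010 0000 0000 0000 0010") * 2**4
b = sig("1.100 0000 0000 0001 1000 0101") * 2**2

significand, exponent = normalize(a + b)
mantissa = int((significand - 1) * 2**23)   # drop the implied 1, truncate extra bits

print("actual exponent:", exponent)
print("E =", exponent + 127)
print("M =", format(mantissa, "023b"))
```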
Floating-Point Subtraction
     1.000 0000 0101 1000 1000 1101        $\times 2^{-6}$
  -  1.000 0000 0000 0000 1001 1010        $\times 2^{-1}$
• Align radix points first
     0.000 0100 0000 0010 1100 0100 01101  $\times 2^{-1}$
  -  1.000 0000 0000 0000 1001 1010        $\times 2^{-1}$
• Perform subtraction (or perform 2’s complement then add)
    00.000 0100 0000 0010 1100 0100 01101  $\times 2^{-1}$
  + 10.111 1111 1111 1111 0110 0110        $\times 2^{-1}$
    11.000 0100 0000 0010 0010 1010 01101  $\times 2^{-1}$
Floating-Point Subtraction
• After subtraction of aligned operands:
    11.000 0100 0000 0010 0010 1010 01101  $\times 2^{-1}$
• Perform 2’s complement (since the result is negative and we encode the magnitude)
  -  0.111 1011 1111 1101 1101 0101 10011  $\times 2^{-1}$
• Normalize the result
  -  1.111 0111 1111 1011 1010 1011 0011   $\times 2^{-2}$
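The 2's complement route through this subtraction can be traced with fixed-width integers. In the sketch below, both significands are stored as integers scaled by $2^{28}$ (23 mantissa bits plus the 5 bits kept after alignment) with two extra bit positions to the left of the radix point; the constants and helper names are choices made for this illustration, and only left-shift normalization is handled.

```python
FRAC_BITS = 28            # 23 mantissa bits plus the 5 bits kept after alignment
WIDTH = FRAC_BITS + 2     # two bit positions to the left of the radix point
MASK = (1 << WIDTH) - 1

def twos_complement(value):
    """Negate a WIDTH-bit two's complement value."""
    return (-value) & MASK

def show(value):
    """Format a WIDTH-bit value with the radix point after the second bit."""
    text = format(value, f"0{WIDTH}b")
    return text[:2] + "." + text[2:]

# Both operands expressed with the larger exponent, 2^-1.
minuend = 0b1_000_0000_0101_1000_1000_1101            # was x 2^-6: shifted right 5 places
subtrahend = 0b1_000_0000_0000_0000_1001_1010 << 5    # already x 2^-1

raw = (minuend + twos_complement(subtrahend)) & MASK
negative = bool(raw >> (WIDTH - 1))                   # top bit set: result is negative
magnitude = twos_complement(raw) if negative else raw

# Normalize: shift left until the bit just left of the radix point is 1.
significand, exponent = magnitude, -1
while significand and not (significand >> FRAC_BITS) & 1:
    significand <<= 1
    exponent -= 1

print("raw sum    :", show(raw))
print("magnitude  :", show(magnitude), "(negative)" if negative else "")
print("normalized :", show(significand), "x 2^", exponent)
```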
Floating-Point Addition and
Subtraction
• Assuming that both operands are positive
◦ Addition may require normalization of the sum by shifting to
the right
▪ Magnitude of answer is larger
◦ Subtraction may require normalization of the difference by
shifting to the left
▪ Magnitude of answer is smaller
Floating-Point Multiplication
and Division
1. Determine the sign of the result
2. Determine the initial exponent of the result
◦ For multiplication: $E = E_1 + E_2 - bias$
◦ For division: $E = E_1 - E_2 + bias$
3. Perform multiplication and division of significands (similar to
integer multiplication and division)
4. Normalize result for the mantissa if needed
5. Round-off the resulting mantissa before placing into the final
format
Floating-Point Multiplication
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow
• Normal and subnormal
◦ Result: normal, subnormal, underflow
• Both subnormal
◦ Result: underflow
Floating-Point Multiplication
• Consider two normalized (positive) single precision numbers 𝑁1 and 𝑁2
$N_1 \times N_2 = (1.M_1 \times 2^{E_1-127}) \times (1.M_2 \times 2^{E_2-127})$
$N_1 \times N_2 = (1.M_1 \times 1.M_2) \times 2^{(E_1+E_2-127)-127}$
• If both operands are normal, initial $E$ of the result is
◦ $E_1 + E_2 - 127$
• If one operand is a subnormal number, initial $E$ is either
◦ $-126 + E_1$ or $-126 + E_2$
• Note: Initial exponent cannot fully identify if overflow or
underflow will occur especially with (but not limited to)
subnormal operands.
Floating-Point Multiplication
Normal only: $N_1 \times N_2 = (1.M_1 \times 1.M_2) \times 2^{(E_1+E_2-127)-127}$
• Significands are 24 bits wide each
◦ Whole product must be 48 bits
◦ Radix point is after the 2nd MSB
• When both operands are normalized, the first 1 can be either
at the
◦ MSB (bit 47): extract bits 46 to 24
▪ Happens with a carry out of the last partial-product addition
◦ 2nd MSB (bit 46): extract bits 45 to 23
▪ Happens without a carry out of the last partial-product addition
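For two normal operands, the steps so far fit in a few lines of Python. This is a minimal sketch under stated assumptions: it truncates instead of rounding, it does not detect overflow or underflow, and it ignores subnormal and special operands. `fields` and `multiply_normals` are hypothetical helper names.

```python
import struct

def fields(x):
    """Return the (S, E, M) fields of x rounded to binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def multiply_normals(x, y):
    """Multiply two normal binary32 values, returning the (S, E, M) of the product."""
    s1, e1, m1 = fields(x)
    s2, e2, m2 = fields(y)
    sign = s1 ^ s2                                   # step 1: sign of the result
    e = e1 + e2 - 127                                # step 2: initial exponent
    product = ((1 << 23) | m1) * ((1 << 23) | m2)    # step 3: 48-bit product
    if product >> 47:                                # first 1 at bit 47
        e += 1                                       # step 4: normalize
        m = (product >> 24) & 0x7FFFFF               # bits 46 to 24
    else:                                            # first 1 at bit 46
        m = (product >> 23) & 0x7FFFFF               # bits 45 to 23
    return sign, e, m                                # step 5 here is plain truncation

print(multiply_normals(1.5, 2.5), fields(3.75))
print(multiply_normals(1.5, 1.5), fields(2.25))
```

Both test products are exactly representable, so truncation and round to nearest agree for them.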
Floating-Point Multiplication
Normal only: $N_1 \times N_2 = (1.M_1 \times 1.M_2) \times 2^{(E_1+E_2-127)-127}$
With subnormal: $N_1 \times N_2 = (0.M_1 \times 1.M_2) \times 2^{(-126+E_2)-127}$
• When one operand is subnormal, the first 1 can appear almost
“anywhere” in the 48-bit product
◦ Since $0.M_1 \times 1.M_2 < 2$, the first 1 can be anywhere from bit 46 down to bit 23
◦ Possibly need to store the LSBs of the 48-bit product
▪ When the first 1 is at bit 23, bits 22 to 0 are needed for the mantissa
◦ Do not truncate partial sums! Store the whole 48-bit product.
• Is this the only option?
Floating-Point Multiplication
General Case: $N_1 \times N_2 = (b_1.M_1 \times b_2.M_2) \times 2^{E_{actual,1}+E_{actual,2}}$
where $b_{1,2}$ can be a 1 or 0
Possible Strategies:
1. Multiply as is (same as previous slide)
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can be “anywhere”
2. Move the radix points to the right of both mantissas
▪ Adjust exponent/s
▪ Operands will always look like $X_{23}X_{22} \ldots X_0.$ and $Y_{23}Y_{22} \ldots Y_0.$
▪ Radix point of 48-bit product is always after the LSB
▪ First 1 of the 48-bit product can be “anywhere”
Floating-Point Multiplication
General Case: $N_1 \times N_2 = (b_1.M_1 \times b_2.M_2) \times 2^{(E_1+E_2-127)-127}$
where $b_{1,2}$ can be a 1 or 0
Possible Strategies:
3. Normalize subnormal operands before multiplication
▪ Adjust exponent/s
▪ Will always be similar to having “1. 𝑀1 ” and “1. 𝑀2 ”
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can only be at the MSB (bit 47) or next
MSB (bit 46)
▫ Need only at most 1 shift to normalize (if result can be normal)
▫ Otherwise, shift right until exponent is -126 (for subnormal results)
▫ Allows the LSBs of the partial sum to be shifted out in the addition of
partial products (no need to store the whole 48 bits throughout)
Floating-Point Division
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow
• Dividend is normal, divisor is subnormal
◦ Result: normal, overflow
• Dividend is subnormal, divisor is normal
◦ Result: normal, subnormal, underflow
• Both subnormal
◦ Result: normal
Floating-Point Division
• Consider two normalized (positive) single precision numbers 𝑁1 and 𝑁2
$N_1/N_2 = (1.M_1 \times 2^{E_1-127}) / (1.M_2 \times 2^{E_2-127})$
$N_1/N_2 = (1.M_1/1.M_2) \times 2^{(E_1-E_2+127)-127}$
• If both operands are normal, initial $E$ of the result is
◦ $E_1 - E_2 + 127$
• Adjust initial 𝐸 when at least one operand is subnormal
• Note: Initial exponent cannot fully identify if overflow or
underflow will occur especially with (but not limited to)
subnormal operands.
Floating-Point Division
Possible Cases for 𝑁1 /𝑁2
• N/N: $(1.M_1/1.M_2) \times 2^{(E_1-E_2+127)-127}$
• N/D: $(1.M_1/0.M_2) \times 2^{(E_1+126)-127}$
• D/N: $(0.M_1/1.M_2) \times 2^{(128-E_2)-127}$
• D/D: $(0.M_1/0.M_2) \times 2^{(127)-127}$
Legend: (N) normal, (D) denormal
Floating-Point Division
General Case: $N_1/N_2 = (b_1.M_1/b_2.M_2) \times 2^{E_{init}-127}$
where $b_{1,2}$ can be a 1 or 0 and $E_{init}$ is the initial exponent of the applicable case
• Goal: divide the significands WITHOUT REMAINDER
◦ May need to extend the dividend to the right with zeros
• The quotient can be anywhere from 1 bit long to infinitely long!
• Moving the radix point of both operands by the same amount has no effect
◦ i.e. $\dfrac{b_1.M_1}{b_2.M_2} = \dfrac{b_1M_1}{b_2M_2}$
• Radix point is after the “last” bit of the mantissas
• However, in general, the first 1 of the quotient can appear
anywhere (to the right or to the left) of the radix point.
Floating-Point Division
General Case: $N_1/N_2 = (b_1M_1/b_2M_2) \times 2^{E_{init}-127}$
where $b_{1,2}$ can be a 1 or 0
Possible Strategy:
• Proceed with the division process discussed (without remainder) with the
following changes
◦ Assume that the first resulting quotient bit is a 1 and that the result will
be a normal number
◦ Given this assumption, it will take 24 cycles of the division to complete
the significand of the quotient (to get a 23-bit mantissa)
▪ Quotient becomes 24-bits (with an MSB of 1 as assumed)
◦ For this quotient, the radix point must be shifted to the left 23 times
▪ The 23 LSBs of the quotient must be used as the result mantissa ($M$)
▪ 23 must be added to the initial exponent
▫ Radix point moves from after the LSB to just after the MSB of the 24-bit quotient
Floating-Point Division
General Case: $N_1/N_2 = (b_1M_1/b_2M_2) \times 2^{E_{init}-127}$
where $b_{1,2}$ can be a 1 or 0
Possible Strategy:
• Adjust the approach to cover all cases (first quotient bit is not a 1)
• To simplify things, let 𝑋 represent the numerator 𝑏1 𝑀1 and 𝑌 represent
the denominator 𝑏2 𝑀2
• Initially we assumed:
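$\dfrac{b_1M_1}{b_2M_2} \times 2^{E_{init}-127} = \dfrac{X_{23}X_{22} \ldots X_0}{Y_{23}Y_{22} \ldots Y_0} \times 2^{E_{init}-127}$
$= 1xxxx \ldots x. \times 2^{E_{init}-127}$
$= 1.xxxx \ldots x \times 2^{E_{init}+23-127}$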
Floating-Point Division
• What if the first 1 appears at the middle of the quotient?
$\dfrac{X_{23}X_{22} \ldots X_0.0}{Y_{23}Y_{22} \ldots Y_0} \times 2^{E_{init}-127} = 01xxx \ldots x.x \times 2^{E_{init}-127}$
$= 1.xxxx \ldots x \times 2^{E_{init}+22-127}$
◦ Need to extend the dividend to the right to get 23 bits of
quotient after the first 1
◦ Subtract 1 from the exponent for every additional cycle
◦ Use the last 23 bits of the quotient as the mantissa of the
result
▪ 23 bits does not include the first/implied one
Floating-Point Division
• The described approach can also be used even when the final
quotient must be subnormal
◦ The algorithm stops only when 24 significant digits are obtained
◦ If the actual exponent becomes less than -126
▪ Shift the quotient to the right (throw the LSB and shift in a zero)
▪ Add 1 to the exponent
▪ Repeat until the exponent is -126
◦ If 23 shifts are done and the exponent is still less than -126, then
an underflow occurred (return a floating point zero as the
answer)
• Further optimization?
◦ Identify the exponent
◦ Identify along the way if the result is subnormal?
Floating-Point Division
• Considering 1 bit at a time (from the MSB side) of the dividend is
like shifting in the bits into a “current dividend” register
◦ Recall: (non-)restoring division and division hardware
◦ When extending the dividend, if needed, simply shift in a zero
• A 24-bit quotient is needed where results are placed one bit at a
time (placing new results at the LSB) and shifting old results to the
left
◦ When a 1 is shifted (left) into the MSB of this quotient, then the
last division cycle must have completed the 24-bit result
◦ Indicator that the division process can be stopped
• However, the exponent can also be used to flag the division process
to stop (when the answer is a subnormal)
◦ How to do this?
Floating-Point Division
• From the previously described process, every additional division
cycle (extension of the dividend) requires the exponent to be
decreased by 1
◦ This process can be traced back from the initial division
without waiting for the 24-bit quotient to be completed!
• This time, assume that every new result from a division cycle is
the last bit of the mantissa, even from the first division cycle
◦ Simplifies the process by always extracting just the LSBs (23
bits) of the current quotient
◦ Additional step: add 47 to the initial exponent
Floating-Point Division
• Revised process
1. Determine the sign of the result
2. Compute the initial exponent 𝐸𝑖𝑛𝑖𝑡
◦ Note the case (N/N, N/D, D/N, or D/D)
3. Add 47 to the 𝐸𝑖𝑛𝑖𝑡
4. Perform division of significands (see details next slide)
◦ Exponent, mantissa, and normalization are all handled in this step
5. Place into the final registers
Floating-Point Division
• Division of significands
1. Shift a bit of the dividend (from the MSB side) into a
current dividend (also “remainder”) register
2. Perform division as with restoring division
▪ Subtract, evaluate and shift in a new quotient bit, restore if needed
▪ Check if bit 23 (the MSB) of the 24-bit quotient is already a 1
▫ If yes, then 24-bit significand is already complete (last cycle)
· Still perform the step below
3. Subtract 1 from the exponent
▪ Check if the actual exponent is -126
▫ If yes, then result is subnormal. Extract the 23 LSB of the quotient and save
as Mantissa.
4. Repeat if not yet done
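The revised process can be sketched in Python as follows. This is one possible reading of the steps above, not a definitive implementation: it truncates instead of rounding, it raises an error instead of handling division by zero or an initial exponent that is already too small, and it does not detect overflow or special operands. All function names are hypothetical.

```python
import struct

def fields(x):
    """Return the (S, E, M) fields of x rounded to binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def significand_and_exponent(e, m):
    """Return the 24-bit integer bM and the E to use in the exponent sum."""
    if e == 0:
        return m, 1                   # subnormal: b = 0, actual exponent -126
    return (1 << 23) | m, e           # normal: b = 1

def divide(x, y):
    s1, e1, m1 = fields(x)
    s2, e2, m2 = fields(y)
    sign = s1 ^ s2                                    # 1. sign of the result
    dividend, ea = significand_and_exponent(e1, m1)
    divisor, eb = significand_and_exponent(e2, m2)
    exponent = (ea - eb + 127) + 47                   # 2-3. E_init, then add 47
    if divisor == 0 or exponent <= 1:
        raise ValueError("case not covered by this sketch")
    remainder = quotient = cycle = 0
    while True:                                       # 4. division of significands
        # Shift in the next dividend bit (MSB first), zeros once it is used up.
        next_bit = (dividend >> (23 - cycle)) & 1 if cycle < 24 else 0
        remainder = (remainder << 1) | next_bit
        if remainder >= divisor:                      # subtraction succeeds
            remainder -= divisor
            quotient = (quotient << 1) | 1
        else:                                         # restore: keep the remainder
            quotient <<= 1
        cycle += 1
        exponent -= 1
        if quotient >> 23:                            # 24-bit significand complete
            return sign, exponent, quotient & 0x7FFFFF
        if exponent == 1:                             # actual exponent is -126
            return sign, 0, quotient & 0x7FFFFF       # subnormal result

print(divide(3.75, 2.5), fields(1.5))
print(divide(1.0, 3.0))
print(divide(2.0 ** -126, 2.0), fields(2.0 ** -127))
```

For 1.0 / 3.0 the sketch returns the truncated mantissa, which differs in the last bit from the round-to-nearest result produced by hardware.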
Rounding off
• Intermediate values of significands may need to be
represented using more bits to minimize errors
• Guard bits – additional bits in the mantissa retained during
intermediate steps or operations
◦ Maintains additional accuracy in final results
• Three common-rounding off methods
◦ Truncation
◦ Von Neumann rounding
◦ Round to nearest (even)
Rounding Off (Truncation)
Truncation
• Extra bits are simply discarded
• Biased approximation since the error range is not symmetric:
the result is never larger in magnitude than the exact value
• “Rounding towards 0”
• Example: truncate from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.010110 → 0.010
Rounding Off (Von Neumann)
Von Neumann rounding
• If bits to be removed are all 0, truncate
• If any of the bits to be removed is 1, set LSB of retained bits to 1
• Unbiased approximation (symmetric error)
• Larger error range
• Example: from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.001001 → 0.001
◦ 0.010001 → 0.011
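Truncation and Von Neumann rounding are both simple bit operations. The sketch below treats the fraction bits as an unsigned integer and removes the `drop` lowest bits; the function names are illustrative.

```python
def truncate(value, drop):
    """Discard the `drop` lowest bits."""
    return value >> drop

def von_neumann(value, drop):
    """Truncate, then force the retained LSB to 1 if any removed bit was 1."""
    kept = value >> drop
    removed = value & ((1 << drop) - 1)
    return (kept | 1) if removed else kept

# Fractions from the examples above, written as 6-bit integers (0.001000 -> 0b001000).
for bits in (0b001000, 0b010110, 0b001001, 0b010001):
    print(f"0.{bits:06b}  truncate -> 0.{truncate(bits, 3):03b}"
          f"  Von Neumann -> 0.{von_neumann(bits, 3):03b}")
```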
Rounding Off (Nearest Even)
Round to nearest (even)
• Requires 3 guard bits
◦ First two guard bits are part of the mantissa to be removed
◦ Third bit is the OR of all bits beyond the first two (at least
one 1)
• Achieves the smallest error range but is also the most difficult to
implement because of the addition and possible re-normalization
Rounding Off (Nearest Even)
Round to nearest (even)
• Do the following for each case of guard bits
◦ 0xx – truncate
◦ 100 – round to nearest (even)
▪ +1 to LSB if LSB is 1 (odd)
▪ Truncate if LSB is 0 (already even)
◦ 101, 110, 111 – round up (+1 to LSB)
• Example: From 6 bits to 3 bits
◦ 0.010101 → 0.010 + 0.001 = 0.011 (+1 LSB)
◦ 0.001101 → 0.001 + 0.001 = 0.010 (+1 LSB)
◦ 0.010011 → 0.010 (truncate)
◦ 0.001100 → 0.001 + 0.001 = 0.010 (nearest even, +1 LSB)
◦ 0.010100 → 0.010 (nearest even, truncate)
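The guard-bit cases translate directly into code. In this sketch the removed bits are first compressed into the three guard bits (the first two removed bits plus a sticky OR of the rest) and then the case analysis above is applied; `round_nearest_even` is an illustrative name, and the possible carry into a new MSB (re-normalization) is left to the caller.

```python
def round_nearest_even(value, drop):
    """Remove the `drop` lowest bits (drop >= 2) using round to nearest (even)."""
    kept = value >> drop
    removed = value & ((1 << drop) - 1)
    first_two = removed >> (drop - 2)                        # first two guard bits
    sticky = 1 if removed & ((1 << (drop - 2)) - 1) else 0   # OR of everything beyond
    guard = (first_two << 1) | sticky
    if guard < 0b100:            # 0xx: truncate
        return kept
    if guard == 0b100:           # exact tie: go to the even neighbor
        return kept + (kept & 1)
    return kept + 1              # 101, 110, 111: round up

for bits in (0b010101, 0b001101, 0b010011, 0b001100, 0b010100):
    print(f"0.{bits:06b} -> 0.{round_nearest_even(bits, 3):03b}")
```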
Rounding Off (Nearest Even)
• Consider the following:
    1.000 $\times 2^{5}$      (32.00)
  - 1.111 $\times 2^{1}$    - ( 3.75)
                             28.25
What happens with infinite precision?
What happens with nearest even rounding off?
What happens with 2 guard bits?
Rounding Off (Nearest Even)
• With infinite precision for partial operands
    1.000 $\times 2^{5}$         1.000      $\times 2^{5}$         01.000 0000 $\times 2^{5}$
  - 1.111 $\times 2^{1}$   →   - 0.000 1111 $\times 2^{5}$   →   + 11.111 0001 $\times 2^{5}$
                                                                   00.111 0001 $\times 2^{5}$
normalized complete answer: 1.110 001 $\times 2^{4}$ (28.25)
normalized rounded answer: 1.110 $\times 2^{4}$ (28)
• With nearest even rounding off (3 guard bits)
    1.000      $\times 2^{5}$         1.000     $\times 2^{5}$         01.000 000 $\times 2^{5}$
  - 0.000 1111 $\times 2^{5}$   →   - 0.000 111 $\times 2^{5}$   →   + 11.111 001 $\times 2^{5}$
                                                                       00.111 001 $\times 2^{5}$
normalized answer: 1.110 01 $\times 2^{4}$ ($28.25_{10}$)
rounded answer: 1.110 $\times 2^{4}$ (28)
Rounding Off (Nearest Even)
• With 2 guard bits
◦ Last guard bit (2nd guard bit) acts as a sticky bit
    1.000      $\times 2^{5}$         1.000    $\times 2^{5}$         01.000 00 $\times 2^{5}$
  - 0.000 1111 $\times 2^{5}$   →   - 0.000 11 $\times 2^{5}$   →   + 11.111 01 $\times 2^{5}$
                                                                     00.111 01 $\times 2^{5}$
normalized answer: 1.110 1 $\times 2^{4}$
rounded answer: 1.111 $\times 2^{4}$ (30)