Lecture 10 (Temp)

Floating point numbers allow for a wider range of values to be represented compared to fixed point numbers by allowing the radix or decimal point to vary its position. The IEEE-754 standard defines common floating point formats including 32-bit single precision and 64-bit double precision that specify the layout of the sign, exponent, and significand fields. Special values like infinities, zeros, and NaN are also defined. Floating point arithmetic operations like addition, subtraction, multiplication and division follow specific steps involving aligning operands, performing operations on significands, and adjusting exponents.

Uploaded by

Anton

Lecture 10: Floating-Point

EEE 105 Computer Organization


1st Semester AY 2019-20
Floating Point Numbers
• Many values in scientific calculations cannot be represented as integers
• Floating-point can also represent a wider range of numbers than fixed-point
• Radix point can vary in position (hence "floating point")
• Examples:
◦ $6.0221 \times 10^{23}$ mol$^{-1}$ (Avogadro's number)
◦ $1.6022 \times 10^{-19}$ C (magnitude of electron charge)
Floating Point Numbers
• A number is said to be normalized when the radix point is placed immediately to the right of the first non-zero significant digit
• For a decimal system
$\pm X_1 . X_2 X_3 X_4 X_5 X_6 X_7 \times 10^{\pm Y_1 Y_2}$
◦ Number of significant digits = 7
◦ Range of exponent = ±99
◦ Mantissa: $X_2 X_3 \ldots X_6 X_7$
▪ Set of digits (bits, in a binary system) following the radix point
IEEE-754 Standard
• Defines the functionality of floating-point representation and
arithmetic
• Specifies the basic and extended floating-point number
formats
◦ Five basic formats based on the definition of the standard:
▪ Three binary floating-point formats (encoded w/ 32, 64, and 128 bits)
▪ Two decimal floating-point formats (encoded w/ 64 and 128 bits)
◦ Two most-common formats:
▪ 32-bit single precision format (binary32)
▪ 64-bit double precision format (binary64)
IEEE-754 Standard
• 32-bit Single Precision Format

content       S      E          M
              sign   exponent   mantissa
size (bits)   1      8          23

Conversion: $N = (-1)^{S} \times 2^{E-127} \times 1.M$ (the leading 1 in $1.M$ is implied, not stored)
◦ Float format
IEEE-754 Standard
• 64-bit Extended/Double Precision Format

content       S      E          M
              sign   exponent   mantissa
size (bits)   1      11         52

Conversion: $N = (-1)^{S} \times 2^{E-1023} \times 1.M$ (the leading 1 in $1.M$ is implied, not stored)
◦ Double format
◦ To allow for extended range and precision
◦ Reduces round-off errors during intermediate calculations
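The two layouts can be checked on real bit patterns. The following is a minimal Python sketch, not part of the lecture: the helper names `decode_binary32`, `decode_binary64`, and `normal_value` are made up for illustration, and `normal_value` assumes a normal number (implied 1 present).

```python
import struct

def decode_binary32(x):
    """Round x to binary32 and split the 32 bits into (S, E, M)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def decode_binary64(x):
    """Split the 64 bits of a Python float (binary64) into (S, E, M)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def normal_value(s, e, m, exp_bits, man_bits):
    """Apply N = (-1)^S * 2^(E - bias) * 1.M (valid for normal numbers only)."""
    bias = (1 << (exp_bits - 1)) - 1          # 127 for single, 1023 for double
    return (-1) ** s * 2.0 ** (e - bias) * (1 + m / (1 << man_bits))

s, e, m = decode_binary32(-6.5)               # -6.5 = -1.625 * 2^2
print(s, e, hex(m))                           # sign, biased exponent, mantissa
print(normal_value(s, e, m, 8, 23))           # reconstructs -6.5
```

The same `normal_value` formula works for both formats because only the field widths and the bias differ.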
IEEE-754 Standard
• Implied one is used to maximize the bits in encoding
◦ Since for a non-zero number, the first non-zero (binary) digit is always one
• However, implied one is used only when E is strictly between min(E) and max(E)
◦ Max(E) = all 1’s
◦ Min(E) = all 0’s

• Exponent bias (i.e. 127 for single and 1023 for double) lets an unsigned E encode both negative and positive exponents, so that very small and very large numbers can be represented
◦ Max(E) and min(E) have special meanings (different interpretation)
◦ Effective range of the actual exponent (E minus bias)
▪ -126 to 127 (single)
▪ -1022 to 1023 (double)
Special Numbers
• Zero
◦ 𝐸 = 0 (min(E) or all 0’s), 𝑀 = 0
◦ Can have positive and negative zero

• Subnormal/Denormal
◦ 𝐸 = 0 (min(E) or all 0’s), 𝑀 ≠ 0
◦ No implied one (a zero is used)
◦ Resulting representation is not normalized
◦ Actual exponent is -126 instead of -127
◦ $(-1)^{S} \times 0.M \times 2^{-126}$
Special Numbers
• Infinities
◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 = 0
◦ Also used as replacement for the result when overflow occurs

• Not a number (NaN)


◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 ≠ 0
◦ Result of invalid operations like
▪ 0/0
▪ Sqrt(-1)
▪ Infinity*0
▪ [+infinity] + [-infinity]
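The special cases above depend only on the E and M fields, so they can be told apart with a few comparisons. This is an illustrative sketch for binary32; the function names `classify_binary32` and `float_to_bits` are assumptions of this example.

```python
import struct

def classify_binary32(bits):
    """Classify a 32-bit pattern using only its E and M fields."""
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0:                        # min(E): zero or subnormal
        return "zero" if m == 0 else "subnormal"
    if e == 0xFF:                     # max(E): infinity or NaN
        return "infinity" if m == 0 else "NaN"
    return "normal"

def float_to_bits(x):
    """Round x to binary32 and return its bit pattern as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

for pattern in (0x00000000, 0x80000000, 0x00000001, 0x3F800000, 0x7F800000):
    print(f"{pattern:08X}", classify_binary32(pattern))
print(classify_binary32(float_to_bits(float("nan"))))
```

The sign bit never enters the classification, which is why both 0x00000000 and 0x80000000 come out as zero.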
Subnormal Numbers
• Example (single precision but assuming that M is only 2 bits long)
◦ $E = 1_{10}$ (exponent factor is $2^{1-127} = 2^{-126}$)

M        Significand   Value
$11_2$   $1.11_2$      $1.11_2 \times 2^{-126}$
$10_2$   $1.10_2$      $1.10_2 \times 2^{-126}$
$01_2$   $1.01_2$      $1.01_2 \times 2^{-126}$
$00_2$   $1.00_2$      $1.00_2 \times 2^{-126}$
(the values continue towards zero in the next table)

◦ $E = 0_{10}$ (exponent factor is still $2^{-126}$ but no implied 1)

M        Significand   Value
$11_2$   $0.11_2$      $0.11_2 \times 2^{-126}$
$10_2$   $0.10_2$      $0.10_2 \times 2^{-126}$
$01_2$   $0.01_2$      $0.01_2 \times 2^{-126}$
$00_2$   $0.00_2$      $0.00_2 \times 2^{-126}$
Examples
Minimum and maximum numbers (binary32)
• Useful to know these numbers when identifying possible cases of overflow and underflow

◦ Maximum magnitude ($E = \max(E) - 1$, $M$ = all 1’s)
▪ $1.M \times 2^{127} = (2^{24} - 1) \times 2^{-23} \times 2^{127} \approx 2^{128}$
▪ Must be infinity when the multiplier reaches $2^{128}$ (since $E = \max(E)$)
▪ Note: normalized

◦ Minimum non-zero magnitude ($E = 0$, $M = 0 \ldots 01$)
▪ $0.0 \ldots 01 \times 2^{-126} = 2^{-23} \times 2^{-126} = 2^{-149}$
▪ Note: denormal
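Both extremes can be confirmed against the actual binary32 bit patterns. The sketch below uses exact rational arithmetic so that no rounding hides a mismatch; `bits_to_float`, `max_finite`, and `min_subnormal` are names chosen for this example only.

```python
import struct
from fractions import Fraction

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a binary32 value (returned as a Python float)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Largest finite value: E = max(E) - 1 = 254, M = all 1's
max_finite = Fraction(2**24 - 1, 2**23) * Fraction(2) ** 127
# Smallest non-zero value: E = 0, M = 0...01 (subnormal)
min_subnormal = Fraction(1, 2**23) * Fraction(1, 2**126)

print(Fraction(bits_to_float(0x7F7FFFFF)) == max_finite)
print(Fraction(bits_to_float(0x00000001)) == min_subnormal)
print(bits_to_float(0x7F800000))      # E = max(E), M = 0
```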
Floating-Point Operations
Some Required Operations in the IEEE-754 standard
• Add
• Subtract
• Multiply
• Divide
• Square Root
• Remainder
• Comparisons
• Conversion between formats
Floating-Point Exceptions
• Causes:
◦ Invalid floating-point operations (e.g. square root of neg.)
◦ Division by zero – result must be “infinity” (for a finite non-zero dividend)
◦ Overflow – resulting 𝐸 reaches or exceeds max(𝐸)
◦ Underflow – result is too small in magnitude to represent (resulting 𝐸 would become less than 0)
◦ Inexact result
▪ By default, result is rounded-off to fit into the format
Floating Point Conversion
• If 𝜋 is approximately 3.14159265359, how do you represent
this in single precision format?
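One way to work this out by hand is to normalize the value to the form 1.M × 2^k, add the bias to k, and round the fraction to 23 bits. The following sketch does that for a positive value in the normal range; the function name `to_binary32_fields` is hypothetical, and the sketch deliberately ignores subnormals, overflow, and the corner case where rounding carries into the exponent.

```python
import math
import struct

def to_binary32_fields(x):
    """Manual conversion of a positive, normal-range value to (S, E, M)."""
    fraction, exp = math.frexp(x)          # x = fraction * 2**exp, 0.5 <= fraction < 1
    significand = fraction * 2             # rewrite as 1.M * 2**(exp - 1)
    e = (exp - 1) + 127                    # add the single-precision bias
    m = round((significand - 1) * 2**23)   # keep 23 fraction bits (round to nearest)
    return 0, e, m

s, e, m = to_binary32_fields(3.14159265359)
print(s, e, f"{m:023b}")

# Cross-check against the encoding produced by the struct module
(bits,) = struct.unpack(">I", struct.pack(">f", 3.14159265359))
print(f"{bits:08X}")
```

Here 3.14159265359 = 1.5707963… × 2^1, so the actual exponent is 1 and the stored exponent is 128.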
Floating-Point Addition and
Subtraction
• Has more steps (because of formatting) than fixed point
addition or subtraction
• Addition of significands (implied one (if any) with mantissa) is
the same as fixed point addition
• Similar procedure is used for subtraction
Floating-Point Addition and
Subtraction
1. Compare exponents
2. Significand of the number with smaller exponent must be
shifted right by the difference in exponents
◦ To align the radix points before addition/subtraction
◦ Possible shifting out of some significant digits
◦ Example: $E_1 = 100$ and $E_2 = 95$
▪ Then, the 2nd significand must be shifted to the right 5 times

Before shifting: Significand2 = 1.00001111000011110001011
After shifting:  Significand2 = 0.0000100001111000011110001011
Note: simplification only (assumes NO guard bits)
Floating-Point Addition and
Subtraction
3. Sign bit is used to determine whether true addition or
subtraction must be done
4. Addition/subtraction of aligned significands is performed
5. Normalize the result (and adjust the exponent 𝐸) if needed
◦ Do not let the actual exponent go below -126 (single) or -1022 (double); a result that would need a smaller exponent stays subnormal
6. Round-off the resulting mantissa before placing into the final
format
◦ Extract the part of the mantissa that must be encoded in 𝑀
◦ Possibly need to remove an implied 1 in encoding
Floating Point Addition and
Subtraction
Additional Notes:
• Addition
◦ Addition of significands may require normalization (shifting to the right)
▪ Resulting sum may have more than 1 significant digit on the left of the radix point
◦ Easiest to add magnitudes (positive numbers)

• Subtraction
◦ Subtraction of significands may require normalization (shifting to the left)
▪ Resulting difference may have its first non-zero digit to the right of the radix point
◦ Result of subtraction may be negative, so take the 2’s complement and update the sign bit of the result
▪ Do the 2’s complement before normalizing
Floating-Point Addition
  + 1.111 0010 0000 0000 0000 0010 × 2^4
  + 1.100 0000 0000 0001 1000 0101 × 2^2

• Need to align radix points first

  +  1.111 0010 0000 0000 0000 0010    × 2^4
  +  0.011 0000 0000 0000 0110 0001 01 × 2^4
  = 10.010 0010 0000 0000 0110 0011 01 × 2^4

• Need to normalize to identify the correct 𝐸 and 𝑀 to use

  + 1.001 0001 0000 0000 0011 0001 101 × 2^5
  𝑀 = 001 0001 0000 0000 0011 0001 (binary; the extra bits 101 are simply dropped here) while 𝐸 = 5 + 127 = 132
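The align, add, and normalize steps of this example can be replayed with integer arithmetic. The sketch below treats each 24-bit significand as an integer with 23 fraction bits; `fp_add_positive` is a hypothetical helper that handles only two positive normal operands, keeps no guard bits, and truncates instead of rounding.

```python
def fp_add_positive(sig1, exp1, sig2, exp2, frac_bits=23):
    """Add sig1 * 2^exp1 and sig2 * 2^exp2 (actual exponents, both positive values)."""
    if exp1 < exp2:                              # make operand 1 the larger exponent
        sig1, exp1, sig2, exp2 = sig2, exp2, sig1, exp1
    sig2 >>= exp1 - exp2                         # align; shifted-out bits are lost
    total = sig1 + sig2
    exp = exp1
    if total >> (frac_bits + 1):                 # sum >= 2: shift right once
        total >>= 1
        exp += 1
    return total, exp

a = int("1" + "11100100000000000000010", 2)      # 1.111 0010 0000 0000 0000 0010
b = int("1" + "10000000000000110000101", 2)      # 1.100 0000 0000 0001 1000 0101
total, exp = fp_add_positive(a, 4, b, 2)
mantissa = total & ((1 << 23) - 1)               # drop the implied 1
print(f"M = {mantissa:023b}, E = {exp + 127}")
```

Because this sketch truncates, a round-to-nearest implementation could produce a mantissa that differs in the last bit.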
Floating-Point Subtraction
    1.000 0000 0101 1000 1000 1101 × 2^-6
  − 1.000 0000 0000 0000 1001 1010 × 2^-1

• Align radix points first

    0.000 0100 0000 0010 1100 0100 01101 × 2^-1
  − 1.000 0000 0000 0000 1001 1010       × 2^-1

• Perform subtraction (or perform 2’s complement then add)

    00.000 0100 0000 0010 1100 0100 01101 × 2^-1
  + 10.111 1111 1111 1111 0110 0110       × 2^-1
  = 11.000 0100 0000 0010 0010 1010 01101 × 2^-1
Floating-Point Subtraction
• After subtraction of aligned operands:
  11.000 0100 0000 0010 0010 1010 01101 × 2^-1

• Perform 2’s complement (since the result is negative and we encode the magnitude)
  − 0.111 1011 1111 1101 1101 0101 10011 × 2^-1

• Normalize the result

  − 1.111 0111 1111 1011 1010 1011 0011 × 2^-2
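A bit-level result like this is easy to get wrong by hand, so it helps to recompute it with exact rational arithmetic. The sketch below is only a checking aid, not the hardware procedure: `parse_sig` and `to_normalized` are hypothetical helpers, and `to_normalized` assumes a non-zero value whose binary expansion terminates.

```python
from fractions import Fraction

def parse_sig(bits):
    """Parse a binary string such as '1.000 0000 0101' into an exact Fraction."""
    int_part, frac_part = bits.replace(" ", "").split(".")
    return Fraction(int(int_part + frac_part, 2), 2 ** len(frac_part))

def to_normalized(value):
    """Return (sign, '1.xxx' bit string, actual exponent) for a non-zero value."""
    sign = "-" if value < 0 else "+"
    mag, exp = abs(value), 0
    while mag >= 2:                      # shift right until 1 <= mag < 2
        mag, exp = mag / 2, exp + 1
    while mag < 1:                       # shift left until 1 <= mag < 2
        mag, exp = mag * 2, exp - 1
    frac, digits = mag - 1, ""
    while frac:                          # emit fraction bits until exact
        frac *= 2
        bit = int(frac >= 1)
        digits += str(bit)
        frac -= bit
    return sign, "1." + digits, exp

a = parse_sig("1.000 0000 0101 1000 1000 1101") * Fraction(1, 2**6)
b = parse_sig("1.000 0000 0000 0000 1001 1010") * Fraction(1, 2**1)
print(to_normalized(a - b))
```

The printed bit string holds all 27 fraction bits of the exact difference; rounding to 23 bits would be the next step.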
Floating-Point Addition and
Subtraction
• Assuming that both operands are positive
◦ Addition may require normalization of the sum by shifting to
the right
▪ Magnitude of answer is larger
◦ Subtraction may require normalization of the difference by
shifting to the left
▪ Magnitude of answer is smaller
Floating-Point Multiplication
and Division
1. Determine the sign of the result
2. Determine the initial exponent of the result
◦ For multiplication: 𝐸 = 𝐸1 + 𝐸2 − 𝑏𝑖𝑎𝑠
◦ For division: 𝐸 = 𝐸1 − 𝐸2 + 𝑏𝑖𝑎𝑠
3. Perform multiplication and division of significands (similar to
integer multiplication and division)
4. Normalize result for the mantissa if needed
5. Round-off the resulting mantissa before placing into the final
format
Floating-Point Multiplication
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow

• Normal and subnormal


◦ Result: normal, subnormal, underflow

• Both subnormal
◦ Result: underflow
Floating-Point Multiplication
• Consider two normalized (positive) single precision numbers $N_1$ and $N_2$
$N_1 \times N_2 = (1.M_1 \times 2^{E_1-127}) \times (1.M_2 \times 2^{E_2-127})$
$N_1 \times N_2 = (1.M_1 \times 1.M_2) \times 2^{(E_1+E_2-127)-127}$

• If both operands are normal, initial $E$ of the result is
◦ $E_1 + E_2 - 127$
• If one operand is a subnormal number, initial $E$ is either
◦ $-126 + E_1$ or $-126 + E_2$
• Note: Initial exponent cannot fully identify if overflow or
underflow will occur especially with (but not limited to)
subnormal operands.
Floating-Point Multiplication
Normal only: $N_1 \times N_2 = (1.M_1 \times 1.M_2) \times 2^{(E_1+E_2-127)-127}$

• Significands are 24-bits wide each


◦ Whole product must be 48-bits
◦ Radix point is after the 2nd MSB
• When both operands are normalized, the first 1 can either be
at the
◦ 1st MSB (47th bit): extract bits 46 to 24
▪ Happens with last carry out in summing partial products
◦ 2nd MSB (46th bit): extract bits 45 to 23
▪ Happens without last carry out in summing partial products
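The two bit-extraction cases can be sketched directly on the 48-bit integer product. The function below is a simplified illustration for two positive normal operands only: `multiply_normals` is a made-up name, the result is truncated rather than rounded, and overflow or underflow of the exponent is not checked.

```python
def multiply_normals(e1, m1, e2, m2):
    """Multiply two positive normal binary32 values given (E, M); returns (E, M)."""
    sig1 = (1 << 23) | m1                  # restore the implied 1: 24-bit significand
    sig2 = (1 << 23) | m2
    product = sig1 * sig2                  # up to 48 bits, radix point after bit 46
    e = e1 + e2 - 127                      # initial exponent of the result
    if product >> 47:                      # first 1 at bit 47: product is in [2, 4)
        m = (product >> 24) & 0x7FFFFF     # extract bits 46 to 24
        e += 1                             # normalizing shift right by one
    else:                                  # first 1 at bit 46: product is in [1, 2)
        m = (product >> 23) & 0x7FFFFF     # extract bits 45 to 23
    return e, m

print(multiply_normals(127, 0x400000, 127, 0x400000))   # 1.5 * 1.5  = 2.25
print(multiply_normals(127, 0x200000, 127, 0x400000))   # 1.25 * 1.5 = 1.875
```

The first call exercises the bit-47 case (2.25 = 1.125 × 2^1) and the second the bit-46 case.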
Floating-Point Multiplication
Normal only: $N_1 \times N_2 = (1.M_1 \times 1.M_2) \times 2^{(E_1+E_2-127)-127}$

With subnormal: $N_1 \times N_2 = (0.M_1 \times 1.M_2) \times 2^{(-126+E_2)-127}$

• When one operand is subnormal, the first 1 can appear almost “anywhere” in the 48-bit product
◦ First 1 can be at 47th bit (MSB) down to 23rd bit
◦ Possibly need to store the LSBs of the 48-bit product
▪ When first 1 is at 23rd bit, need bits 22 to 0 for the mantissa
◦ Do not truncate partial sums! Store whole 48-bit product.

• Is this the only option?


Floating-Point Multiplication
General Case: $N_1 \times N_2 = (b_1.M_1 \times b_2.M_2) \times 2^{E_{actual,1}+E_{actual,2}}$
where $b_{1,2}$ can be a 1 or 0

Possible Strategies:
1. Multiply as is (same as previous slide)
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can be “anywhere”
2. Move the radix points to the right of both mantissas
▪ Adjust exponent/s
▪ Operands will always look like $X_{23} X_{22} \ldots X_0.$ and $Y_{23} Y_{22} \ldots Y_0.$
▪ Radix point of 48-bit product is always after the LSB
▪ First 1 of the 48-bit product can be “anywhere”
Floating-Point Multiplication
General Case: $N_1 \times N_2 = (b_1.M_1 \times b_2.M_2) \times 2^{(E_1+E_2-127)-127}$
where $b_{1,2}$ can be a 1 or 0

Possible Strategies:
3. Normalize subnormal operands before multiplication
▪ Adjust exponent/s
▪ Will always be similar to having “1. 𝑀1 ” and “1. 𝑀2 ”
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can only be at the MSB (bit 47) or next
MSB (bit 46)
▫ Need only at most 1 shift to normalize (if result can be normal)
▫ Otherwise, shift right until exponent is -126 (for subnormal results)
▫ Allows the LSBs of the partial sum to be shifted out in the addition of
partial products (no need to store the whole 48 bits throughout)
Floating-Point Division
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow

• Dividend is normal, divisor is subnormal


◦ Result: normal, overflow

• Dividend is subnormal, divisor is normal


◦ Result: normal, subnormal, underflow

• Both subnormal
◦ Result: normal
Floating-Point Division
• Consider two normalized (positive) single precision numbers $N_1$ and $N_2$
$N_1 / N_2 = (1.M_1 \times 2^{E_1-127}) / (1.M_2 \times 2^{E_2-127})$
$N_1 / N_2 = (1.M_1 / 1.M_2) \times 2^{(E_1-E_2+127)-127}$

• If both operands are normal, initial $E$ of the result is
◦ $E_1 - E_2 + 127$
• Adjust initial 𝐸 when at least one operand is subnormal

• Note: Initial exponent cannot fully identify if overflow or underflow will occur especially with (but not limited to) subnormal operands.
Floating-Point Division
Possible Cases for $N_1 / N_2$
• N/N: $(1.M_1 / 1.M_2) \times 2^{(E_1-E_2+127)-127}$
• N/D: $(1.M_1 / 0.M_2) \times 2^{(E_1+126)-127}$
• D/N: $(0.M_1 / 1.M_2) \times 2^{(128-E_2)-127}$
• D/D: $(0.M_1 / 0.M_2) \times 2^{(127)-127}$

Legend: (N) normal, (D) denormal


Floating-Point Division
General Case: $N_1 / N_2 = (b_1.M_1 / b_2.M_2) \times 2^{E_{init}-127}$
where $b_{1,2}$ can be a 1 or 0 and $E_{init}$ is the initial exponent for the case at hand (N/N, N/D, D/N, or D/D)

• Goal: divide the significands WITHOUT REMAINDER
◦ May need to extend the dividend to the right with zeros
• Quotient can be 1 bit long only to infinitely long!
• Moving the radix point by the same amount for both operands has no effect
◦ i.e. $\dfrac{b_1.M_1}{b_2.M_2} = \dfrac{b_1 M_1}{b_2 M_2}$
• Radix point is after the “last” bit of the mantissas
• However, in general, the first 1 of the quotient can appear anywhere (to the right or to the left) of the radix point.
Floating-Point Division
General Case: $N_1 / N_2 = (b_1 M_1 / b_2 M_2) \times 2^{E_{init}-127}$
where $b_{1,2}$ can be a 1 or 0

Possible Strategy:
• Proceed with the division process discussed (without remainder) with the
following changes
◦ Assume that the first resulting quotient bit is a 1 and that the result will
be a normal number
◦ Given this assumption, it will take 24 cycles of the division to complete
the significand of the quotient (to get a 23-bit mantissa)
▪ Quotient becomes 24-bits (with an MSB of 1 as assumed)
◦ For this quotient, the radix point must be shifted to the left 23 times
▪ The 23 LSBs of the quotient must be used as the result mantissa (𝑀)
▪ 23 must be added to the initial exponent
▫ Radix point moved from the end of the LSB to after the MSB of the 24-bit quotient
Floating-Point Division
General Case: $N_1 / N_2 = (b_1 M_1 / b_2 M_2) \times 2^{E_{init}-127}$
where $b_{1,2}$ can be a 1 or 0

Possible Strategy:
• Adjust the approach to cover all cases (first quotient bit is not a 1)
• To simplify things, let $X$ represent the numerator $b_1 M_1$ and $Y$ represent the denominator $b_2 M_2$
• Initially we assumed:
$\dfrac{b_1 M_1}{b_2 M_2} = \; ? \times 2^{E_{init}-127}$
$\dfrac{X_{23} X_{22} \ldots X_0}{Y_{23} Y_{22} \ldots Y_0} = 1xxxx \ldots x. \times 2^{E_{init}-127} = 1.xxxx \ldots x \times 2^{E_{init}+23-127}$
Floating-Point Division
• What if the first 1 appears at the middle of the quotient?
$\dfrac{X_{23} X_{22} \ldots X_0 . 0}{Y_{23} Y_{22} \ldots Y_0} = 01xxx \ldots x.x \times 2^{E_{init}-127} = 1.xxxx \ldots x \times 2^{E_{init}+22-127}$

◦ Need to extend the dividend to the right to get 23 bits of quotient after the first 1
◦ Subtract 1 from the exponent for every additional cycle
◦ Use the last 23 bits of the quotient as the mantissa of the result
▪ The 23 bits do not include the first/implied one
Floating-Point Division
• The described approach can also be used even when the final
quotient must be subnormal
◦ The algorithm stops only when 24 significant digits are obtained
◦ If the actual exponent becomes less than -126
▪ Shift the quotient to the right (throw the LSB and shift in a zero)
▪ Add 1 to the exponent
▪ Repeat until the exponent is -126
◦ If 23 shifts are done and the exponent is still less than -126, then
an underflow occurred (return a floating point zero as the
answer)

• Further optimization?
◦ Identify the exponent
◦ Identify along the way if the result is subnormal?
Floating-Point Division
• Considering 1 bit at a time (from the MSB side) of the dividend is
like shifting in the bits into a “current dividend” register
◦ Recall: (non-)restoring division and division hardware
◦ When extending the dividend, if needed, simply shift in a zero
• A 24-bit quotient is needed where results are placed one bit at a
time (placing new results at the LSB) and shifting old results to the
left
◦ When a 1 is shifted (left) into the MSB of this quotient, then the
last division cycle must have completed the 24-bit result
◦ Indicator that the division process can be stopped
• However, the exponent can also be used to flag the division process
to stop (when the answer is a subnormal)
◦ How to do this?
Floating-Point Division
• From the previously described process, every additional division cycle (extension of the dividend) requires the exponent to be decreased by 1
◦ This process can be traced back from the initial division without waiting for the 24-bit quotient to be completed!

• This time, assume that every new result from a division cycle is the last bit of the mantissa, even from the first division cycle
◦ Simplifies the process by always extracting just the LSBs (23 bits) of the current quotient
◦ Additional step: add 47 to the initial exponent
Floating-Point Division
• Revised process
1. Determine the sign of the result
2. Compute the initial exponent 𝐸𝑖𝑛𝑖𝑡
◦ Note the case (N/N, N/D, D/N, or D/D)
3. Add 47 to the 𝐸𝑖𝑛𝑖𝑡
4. Perform division of significands (see details next slide)
◦ Exponent, mantissa, and normalization are all handled in this step
5. Place into the final registers
Floating-Point Division
• Division of significands
1. Shift a bit of the dividend (from the MSB side) into a
current dividend (also “remainder”) register
2. Perform division as with restoring division
▪ Subtract, evaluate and shift in a new quotient bit, restore if needed
▪ Check if 23rd bit of quotient is already a 1
▫ If yes, then 24-bit significand is already complete (last cycle)
· Still perform the step below
3. Subtract 1 from the exponent
▪ Check if the actual exponent is -126
▫ If yes, then result is subnormal. Extract the 23 LSB of the quotient and save
as Mantissa.
4. Repeat if not yet done
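The loop above can be sketched in software for the simplest case, two positive normal operands. This is an illustrative approximation of the described process, not a definitive implementation: the name `divide_normals` is hypothetical, the comparison `remainder >= y` stands in for the hardware "subtract, then restore if negative" step, the quotient is truncated, and the check for an actual exponent of -126 (subnormal result) is left out.

```python
def divide_normals(e1, m1, e2, m2):
    """Divide two positive normal binary32 values given (E, M); returns (E, M)."""
    x = (1 << 23) | m1                 # dividend significand with implied 1
    y = (1 << 23) | m2                 # divisor significand with implied 1
    e = (e1 - e2 + 127) + 47           # initial exponent (N/N case) plus 47
    remainder = 0
    quotient = 0
    cycle = 0
    while not (quotient >> 23) & 1:    # stop once a 1 reaches bit 23 of the quotient
        # Shift in the next dividend bit, MSB first; zeros once the dividend runs out
        bit = (x >> (23 - cycle)) & 1 if cycle < 24 else 0
        remainder = (remainder << 1) | bit
        if remainder >= y:             # subtraction succeeds: quotient bit is 1
            remainder -= y
            quotient = (quotient << 1) | 1
        else:                          # would go negative: quotient bit is 0
            quotient <<= 1
        e -= 1                         # one less for every division cycle
        cycle += 1
    return e, quotient & 0x7FFFFF      # the 23 LSBs are the mantissa

print(divide_normals(128, 0x400000, 128, 0))    # 3.0 / 2.0 = 1.5
print(divide_normals(127, 0, 127, 0x400000))    # 1.0 / 1.5 = 0.666...
```

For 3.0 / 2.0 the loop ends after 47 cycles and leaves the exponent at its initial value; for 1.0 / 1.5 it needs one more cycle, so the exponent ends one lower.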
Rounding off
• Intermediate values of significands may need to be
represented using more bits to minimize errors
• Guard bits – additional bits in the mantissa retained during
intermediate steps or operations
◦ Maintains additional accuracy in final results
• Three common-rounding off methods
◦ Truncation
◦ Von Neumann rounding
◦ Round to nearest (even)
Rounding Off (Truncation)
Truncation
• Extra bits are simply discarded
• Biased approximation since the error is always in the same direction (the error range is not symmetric around zero)
• “Rounding towards 0”
• Example: truncate from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.010110 → 0.010
Rounding Off (Von Neumann)
Von Neumann rounding
• If bits to be removed are all 0, truncate
• If any of the bits to be removed is 1, set LSB of retained bits to 1
• Unbiased approximation (symmetric error)
• Larger error range
• Example: from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.001001 → 0.001
◦ 0.010001 → 0.011
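Truncation and Von Neumann rounding are both one-liners on integers. In this sketch a value with 6 fraction bits is stored as an integer and `drop` is the number of bits to remove; the function names are chosen for this example.

```python
def truncate(value, drop):
    """Discard the lowest `drop` bits."""
    return value >> drop

def von_neumann(value, drop):
    """Truncate, then force the retained LSB to 1 if any removed bit was 1."""
    kept = value >> drop
    if value & ((1 << drop) - 1):      # any 1 among the removed bits?
        kept |= 1
    return kept

for v in (0b001000, 0b010110, 0b001001, 0b010001):
    print(f"0.{v:06b} -> truncate 0.{truncate(v, 3):03b}, "
          f"Von Neumann 0.{von_neumann(v, 3):03b}")
```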
Rounding Off (Nearest Even)
Round to nearest (even)
• Requires 3 guard bits
◦ First two guard bits are part of the mantissa to be removed
◦ Third bit is the OR of all bits beyond the first two (at least
one 1)
• Achieves least range of error but is also most difficult to
implement because of addition and possible re-normalization
Rounding Off (Nearest Even)
Round to nearest (even)
• Do the following for each case of guard bits
◦ 0xx – truncate
◦ 100 – round to nearest (even)
▪ +1 to LSB if LSB is 1 (odd)
▪ Truncate if LSB is 0 (already even)
◦ 101, 110, 111 – round up (+1 to LSB)

• Example: From 6 bits to 3 bits


◦ 0.010101 → 0.010 + 0.001 = 0.011 (+1 LSB)
◦ 0.001101 → 0.001 + 0.001 = 0.010 (+1 LSB)
◦ 0.010011 → 0.010 (truncate)
◦ 0.001100 → 0.001 + 0.001 = 0.010 (nearest even, +1 LSB)
◦ 0.010100 → 0.010 (nearest even, truncate)
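The guard-bit cases can be written as a small function. This is a sketch under the assumption that at least two bits are removed; `guard_bits` and `round_nearest_even` are illustrative names, and a carry out of the retained bits (which would call for re-normalization) is left to the caller.

```python
def guard_bits(value, drop):
    """Return the 3 guard bits: the top two removed bits plus a sticky OR of the rest."""
    removed = value & ((1 << drop) - 1)
    top_two = removed >> (drop - 2)
    sticky = int((removed & ((1 << (drop - 2)) - 1)) != 0)
    return (top_two << 1) | sticky

def round_nearest_even(value, drop):
    """Round away the lowest `drop` bits (drop >= 2) using round to nearest (even)."""
    kept = value >> drop
    g = guard_bits(value, drop)
    if g > 0b100 or (g == 0b100 and kept & 1):   # above half, or a tie with odd LSB
        kept += 1
    return kept

for v in (0b010101, 0b001101, 0b010011, 0b001100, 0b010100):
    print(f"0.{v:06b} -> guard {guard_bits(v, 3):03b} -> 0.{round_nearest_even(v, 3):03b}")
```

Guard patterns 0xx fall through to truncation, 100 is the tie case, and anything larger rounds up.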
Rounding Off (Nearest Even)
• Consider the following:
    1.000 × 2^5      (32.00)
  − 1.111 × 2^1    − ( 3.75)
                     (28.25)

What happens with infinite precision?
What happens with nearest even rounding off?
What happens with 2 guard bits?
Rounding Off (Nearest Even)
• With infinite precision for partial operands
    1.000 × 2^5        1.000      × 2^5        01.000 0000 × 2^5
  − 1.111 × 2^1  →   − 0.000 1111 × 2^5  →   + 11.111 0001 × 2^5
                                             = 00.111 0001 × 2^5
normalized complete answer: 1.110 001 × 2^4 (28.25)
normalized rounded answer: 1.110 × 2^4 (28)

• With nearest even rounding off (3 guard bits)
    1.000      × 2^5        1.000     × 2^5        01.000 000 × 2^5
  − 0.000 1111 × 2^5  →   − 0.000 111 × 2^5  →   + 11.111 001 × 2^5
                                                 = 00.111 001 × 2^5
normalized answer: 1.110 01 × 2^4 (reads as 28.5 because the last guard bit is a sticky bit, not an exact digit)
rounded answer: 1.110 × 2^4 (28)
Rounding Off (Nearest Even)
• With 2 guard bits
◦ Last guard bit (2nd guard bit) acts as a sticky bit
    1.000      × 2^5        1.000    × 2^5        01.000 00 × 2^5
  − 0.000 1111 × 2^5  →   − 0.000 11 × 2^5  →   + 11.111 01 × 2^5
                                                = 00.111 01 × 2^5
normalized answer: 1.110 1 × 2^4
rounded answer: 1.111 × 2^4 (30)
