Bits and Bytes, Text Codes
In this chapter we are going to study how numbers are represented in a com-
puter. We already know that at the most basic level, computers just handle se-
quences of 0s and 1s. We also know that numbers can be represented in different
numeral systems, in particular the binary (base-2) numeral system which is per-
fectly suited for computers. We first consider representation of integers which is
quite straightforward, and then representation of fractional numbers which is a
bit more challenging.
48 CHAPTER 4. COMPUTERS, NUMBERS AND TEXT
Traditional            SI               Alternative (IEC)
Symbol          Value  Symbol  Value    Name        Value
kB (kilobyte)   2^10   KB      10^3     kibibyte    2^10
MB (megabyte)   2^20   MB      10^6     mebibyte    2^20
GB (gigabyte)   2^30   GB      10^9     gibibyte    2^30
TB (terabyte)   2^40   TB      10^12    tebibyte    2^40
PB (petabyte)   2^50   PB      10^15    pebibyte    2^50
EB (exabyte)    2^60   EB      10^18    exbibyte    2^60
ZB (zettabyte)  2^70   ZB      10^21    zebibyte    2^70
YB (yottabyte)  2^80   YB      10^24    yobibyte    2^80

Table 4.1. The SI prefixes for large collections of bits and bytes.
extremely efficient. On the other hand the computer could not do much other
than report an error message and give up if the result should become larger than
999999.
The other solution would be to not impose a specific limit on the size of the
numbers, but rather attempt to handle numbers as large as possible. For any
given computer there is bound to be an upper limit, and if this is exceeded the
only response would be an error message. We will discuss both of these
approaches to the challenge of big numbers below.
Fact 4.1. A binary digit is called a bit and a group of 8 bits is called a byte.
Numbers are usually represented in terms of 4 bytes (32 bits) or 8 bytes (64
bits).
The standard SI prefixes are used when large amounts of bits and bytes are
referred to, see table 4.1. Note that traditionally the factor between each prefix
has been 1024 = 2^10 in the computer world, but use of the SI units is now
encouraged. However, memory size is always reported using the traditional binary
units and most operating systems also use these units to report hard disk sizes
and file sizes. So a file containing 3 913 880 bytes will typically be reported as
being 3.7 MB.
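The arithmetic behind this is easy to check. Here is a small sketch in Python (the file size is the one from the text; the variable names are ours):

```python
# A file of 3 913 880 bytes, measured with the binary and the SI factor.
size = 3913880

binary_mb = size / 2**20   # traditional binary megabytes (mebibytes)
si_mb = size / 10**6       # SI megabytes

print(round(binary_mb, 1))  # 3.7, the value an operating system reports
print(round(si_mb, 1))      # 3.9
```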
To illustrate the size of the numbers in table 4.1: it is believed that the world's
total storage in 2006 was 160 exabytes, and the projection is that this will grow
to nearly one zettabyte by 2010.
Based on this it may come as a surprise that the most negative number that
can be represented is −2^31 and not −2^31 + 1. The reason is that with 32 bits at
our disposal we can represent a total of 2^32 numbers. Since we need 2^31 bit
combinations for the positive numbers and 0, we have 2^32 − 2^31 = 2^31 combinations
of digits left for the negative numbers. Similar limits can be derived for 64-bit
integers.
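The wrap-around can be sketched in Python; Python's own integers are unbounded, so the 32-bit behaviour is emulated by masking (the helper name is ours):

```python
def to_int32(x):
    """Interpret the low 32 bits of x as a two's complement 32-bit integer."""
    x &= 0xffffffff                 # keep only 32 bits
    return x - 2**32 if x >= 2**31 else x

print(to_int32(2**31 - 1))          # 2147483647, the largest 32-bit integer
print(to_int32(2**31))              # -2147483648, i.e. -2**31: the wrap-around
```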
Fact 4.2. The smallest and largest numbers that can be represented by 32-bit
integers are −2^31 = −2 147 483 648 and 2^31 − 1 = 2 147 483 647.
What we have discussed so far is the typical hardware support for integer
numbers. When we program a computer we have to use a suitable program-
ming language, and different languages may provide different interfaces to the
hardware. There are a myriad of computer languages, and especially handling
of integers may differ quite a bit. We will briefly review integer handling in two
languages, Java and Python, as representatives of two different approaches.
int a;
a = 2147483647;
a = a + 1;
The starting value for a is the largest possible 32-bit integer, and when we add 1
we obtain a number that is too big for an int. This is referred to by saying that
an overflow occurs. So what happens when an integer overflows in Java? The
statements above will lead to a receiving the value -2147483648, and Java gives
no warning about this strange behaviour! If you look carefully, the result is −2^31,
i.e., the smallest possible int. Basically, Java (and similar languages) considers
the 32-bit integers to lie in a ring where the integer succeeding 2^31 − 1 is −2^31
(overflow in long integers is handled similarly). Sometimes this may be what
you want, but most of the time this kind of behaviour is probably going to give
you a headache unless you remember this paragraph!
Note that Java also has 8 bit integers (byte) and 16 bit integers (short).
These behave completely analogously to int and long variables.
It is possible to work with integers that require more than 64 bits in Java,
but then you have to resort to an auxiliary class called BigInteger. In this class
integers are only limited by the total resources available on your computer, but
the cost of resorting to BigInteger is a big penalty in terms of computing time.
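Python takes the opposite approach to Java: its ordinary integers behave like BigInteger from the start, limited only by the available resources. A quick sketch:

```python
a = 2147483647        # the largest 32-bit integer
a = a + 1
print(a)              # 2147483648: no overflow, Python just grows the number

print(2**100)         # 1267650600228229401496703205376
```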
inevitably means that there will be limitations on the class of real numbers that
can be handled efficiently by computers.
To illustrate the challenge, consider the two real numbers

π = 3.141592653589793238462643383279502884197...,

10^6 π = 3.141592653589793238462643383279502884197... × 10^6.

Both of these numbers are irrational and require infinitely many digits in any
numeral system with an integer base. With a fixed number of digits at our
disposal we can only store the most significant (the left-most) digits, which means
that we have to ignore infinitely many digits. But this is not enough to
distinguish between the two numbers π and 10^6 π; we also have to store information
about the size of the numbers.
The fact that many real numbers have infinitely many digits and we can only
store a finite number of these means that there is bound to be an error when real
numbers are represented on a computer. This is in marked contrast to integer
numbers where there is no error, just a limit on the size of numbers. The errors
are usually referred to as rounding errors or round-off errors. These errors are
also present on calculators; a simple situation where round-off error can be
observed is to compute √2, square the result and subtract 2. On one
calculator the result is approximately 4.4 × 10^−16, a clear manifestation of round-
off error.
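The same experiment is easy to repeat in Python, which uses 64-bit floating point numbers:

```python
import math

x = math.sqrt(2)
print(x * x - 2)   # about 4.4e-16 rather than 0: round-off error
```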
Usually the round-off error is small and remains small throughout a compu-
tation. In some cases however, the error grows throughout a computation and
may become significant. In fact, there are situations where the round-off error in
a result is so large that all the displayed digits are wrong! Computations which
lead to large round-off errors are said to be badly conditioned while computa-
tions with small errors are said to be well conditioned.
Since some computations may lead to large errors it is clearly important to
know in advance if a computation may be problematic. Suppose for example
you are working on the development of a new aircraft and you are responsible
for simulations of the forces acting on the wings during flight. Before the first
flight of the aircraft you had better be certain that the round-off errors (and other
errors) are under control. Such error analysis is part of the field called Numerical
Analysis.
We are going to do this by first pretending that computers work in the decimal
numeral system. Afterwards we will translate our observations to the binary
representation that is used in practice.
Any real number can be expressed in the decimal system, but infinitely many
digits may be needed. To represent such numbers with finite resources we must
limit the number of digits. Suppose for example that we use four decimal digits
to represent real numbers. Then the best representations of the numbers π,
1/700 and 100003/17 would be

π ≈ 3.142,
1/700 ≈ 0.001429,
100003/17 ≈ 5883.
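A sketch in Python, using the built-in formatting of floats to four significant digits as a stand-in for our idealised four-digit computer:

```python
import math

# Print each number rounded to four significant decimal digits.
for x in (math.pi, 1/700, 100003/17):
    print(f"{x:.4g}")   # 3.142, 0.001429, 5883
```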
Observation 4.1 (Normal form of real number). Let a be a real number different
from zero. Then a can be written uniquely as

a = b × 10^n    (4.1)

where b is bounded by

1/10 ≤ |b| < 1    (4.2)

and n is an integer. This is called the normal form of a, and the number b is
called the significand while n is called the exponent of a. The normal form of
0 is 0 = 0 × 10^0.

Note that the digits of a and b are the same; to arrive at the normal form in
(4.1) we simply multiply a by the power of 10 that brings b into the range given
by (4.2).
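The normal form can be computed with a few lines of Python (the function name is ours; note that floating point round-off can disturb the logarithm for borderline inputs):

```python
import math

def normal_form(a):
    """Return (b, n) with a = b * 10**n and 1/10 <= |b| < 1, for a != 0."""
    if a == 0:
        return 0.0, 0
    n = math.floor(math.log10(abs(a))) + 1
    return a / 10**n, n

print(normal_form(3.14159))    # (0.314159, 1)
print(normal_form(0.001429))   # roughly (0.1429, -2)
```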
π ≈ 0.3142 × 10^1,
1/700 ≈ 0.1429 × 10^−2,
100003/17 ≈ 0.5883 × 10^4,
10000000/23 ≈ 0.4348 × 10^6.
From this we see that if we reserve four digits for the significand and one digit for
the exponent, plus a sign for both, then we have a format that can accommodate
all these numbers. If we keep the significand fixed and vary the exponent, the
decimal point moves among the digits. For this reason this kind of format is
called floating point, and numbers represented in this way are called floating
point numbers.
It is always useful to be aware of the smallest and largest numbers that can
be represented in a format. With four digits for the significand and one digit for
the exponent plus signs, these numbers are ±0.1000 × 10^−9 for the smallest
(in absolute value) and ±0.9999 × 10^9 for the largest.
Observation 4.2 (Binary normal form of real number). Let a be a real number
different from zero. Then a can be written uniquely as

a = b × 2^n

where b is bounded by

1/2 ≤ |b| < 1

and n is an integer. This is called the binary normal form of a, and the number
b is called the significand while n is called the exponent of a. The normal form
of 0 is 0 = 0 × 2^0.
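Python exposes exactly this decomposition through math.frexp:

```python
import math

# math.frexp returns (b, n) with a = b * 2**n and 1/2 <= |b| < 1.
b, n = math.frexp(10.0)
print(b, n)        # 0.625 4, since 10 = 0.625 * 2**4
print(b * 2**n)    # 10.0
```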
use 32 or 64 bits to represent real numbers. The 32-bit format is useful for appli-
cations that do not demand very much accuracy, but 64 bits has become a stan-
dard for most scientific applications. Occasionally higher accuracy is required
in which case there are some formats with more bits or even a format with no
limitation other than the resources available in the computer.
To describe a floating point format, it is not sufficient to state how many bits
are used in total; we also have to know how many bits are used for the significand
and how many for the exponent. There are several possible ways to do this, but
there is an international standard for floating point computations that is used
by most computer manufacturers. This standard is referred to as the IEEE 754
standard, and the main details of the 32-bit version are given below.
Fact 4.3 (IEEE 32-bit floating point format). With 32-bit floating point num-
bers 23 bits are allocated for the significand and 9 bits for the exponent, both
including signs. This means that numbers have about 6–9 significant decimal
digits. The smallest and largest negative numbers in this format are

F⁻min,32 ≈ −3.4 × 10^38,    F⁻max,32 ≈ −1.4 × 10^−45.
typically occur during overflow. For example, if you use 32-bit floating point and
perform the multiplication 10^30 · 10^30, the result will be Infinity. The negative
infinity behaves in a similar way. NaN is short for 'Not a Number' and is the
result if you try to perform an illegal operation. A typical example is if you try to
compute √−1 without using complex numbers; this will give NaN as the result.
And once you have obtained a NaN result it will pollute anything that it touches;
NaN combined with anything else will result in NaN.
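These special values are easy to provoke in Python (Python's floats are 64-bit, so we need numbers larger than 10^30 to overflow; computing √−1 directly raises an exception in Python, so the NaN below comes from another illegal operation, ∞ − ∞):

```python
import math

big = 1e308
print(big * big)        # inf: the product overflows even 64-bit floats

nan = float('inf') - float('inf')   # an illegal operation gives NaN
print(nan + 1)          # nan: NaN pollutes everything it touches
print(nan == nan)       # False: NaN is not even equal to itself
print(math.isnan(nan))  # True
```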
With 64-bit numbers we have 32 extra bits at our disposal and the question is
how these should be used. The creators of the IEEE standard believed improved
accuracy to be more important than support for very large or very small num-
bers. They therefore increased the number of bits in the significand by 30 and
the number of bits in the exponent by 2.
Fact 4.4 (IEEE 64-bit floating point format). With 64-bit floating point numbers
53 bits are allocated for the significand and 11 bits for the exponent, both
including signs. This means that numbers have about 15–17 significant decimal
digits. The smallest and largest negative numbers in this format are

F⁻min,64 ≈ −1.8 × 10^308,    F⁻max,64 ≈ −5 × 10^−324.
Other than the extra bits available, the 64-bit format behaves just like its 32-
bit little brother, with the leading 1 not being stored, the use of denormalised
numbers, -Infinity, Infinity and NaN.
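Python's floats are exactly this 64-bit format, and the limits can be inspected via the sys module:

```python
import sys

print(sys.float_info.max)   # about 1.8e308, the largest finite number
print(sys.float_info.min)   # about 2.2e-308, smallest positive *normal* number
print(5e-324)               # smallest positive denormalised number
print(sys.float_info.dig)   # 15: guaranteed significant decimal digits
```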
At the lowest level, computers can just handle 0s and 1s, and since any number
can be expressed uniquely in the binary number system it can also be repre-
sented in a computer (except for the fact that we may have to limit both the size
of the numbers and their number of digits). We all know that computers can
also handle text and in this section we are going to see the basic principles of
how this is done.
A text is just a sequence of individual characters like 'a', 'B', '3', '.', '?', i.e.,
upper- and lowercase letters, the digits 0–9 and various other symbols used for
punctuation and other purposes. So the basic challenge in handling text is how
to represent the individual characters. With numbers at our disposal, this is a
simple challenge to overcome. Internally in the computer a character is just
represented by a number, and the correspondence between numbers and
characters is stored in a table. The letter 'a', for example, usually has code 97. So
when the computer is told to print the character with code 97, it will call a
program that draws an 'a'. Similarly, when the user presses the 'a' key on the keyboard,
it is immediately converted to code 97.
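In Python the table lookup in both directions is available through the built-ins ord and chr:

```python
print(ord('a'))   # 97: the code of the character 'a'
print(chr(97))    # a: the character with code 97
```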
Although the two concepts are slightly different, we will use the terms ’char-
acter sets’ and ’character mappings’ as synonyms.
From fact 4.5 we see that the character mapping is crucial in how text is han-
dled. Initially, the mappings were simple and computers could only handle the
most common characters used in English. Today there are extensive mappings
available that make the characters of most of the world’s languages, including
the ancient ones, accessible. Below we will briefly describe some of the most
common character sets.
will quit if you type ^D (hold down the control key while you press 'd'). Various
combinations of characters 10, 12 and 13 are used in different operating systems
for indicating a new line within a file. The meaning of character 13 ('Carriage
Return') was originally to move back to the beginning of the current line and
character 10 ('Line Feed') meant forward one line.
4.3.3 Unicode
By the early 1990s there was a critical need for character sets that could han-
dle multilingual characters, like those from English and Chinese, in the same
document. A number of computer companies therefore set up an organisation
called Unicode. Unicode has since then organised the characters of most of the
world’s languages in a large table called the Unicode table, and new characters
are still being added. There are a total of about 100 000 characters in the table,
which means that at least three bytes are needed for their representation.
The codes range from 0 to 1114111 (hexadecimal 10ffff₁₆), which means that only
about 10 % of the table is filled. The characters are grouped together according
to language family or application area, and the empty spaces make it easy to add
new characters that may come into use. The first 256 characters of Unicode are
identical to the ISO Latin 1 character set, and in particular the first 128
characters correspond to the ASCII table. You can find all the Unicode characters at
http://www.unicode.org/charts/.
One could use the same strategy with Unicode as with ASCII and ISO Latin 1
and represent the characters via their integer codes (usually referred to as code
points) in the Unicode table. This would mean that each character would re-
quire three bytes of storage. The main disadvantage of this is that a program
for reading Unicode text would give completely wrong results if by mistake it
was used for reading ’old fashioned’ eight bit text, even if it just contained ASCII
characters. Unicode has therefore developed variable length encoding schemes
for encoding the characters.
4.3.4 UTF-8
A popular encoding of Unicode is UTF-8. UTF-8 has the advantage that ASCII
characters are encoded in one byte, so there is complete backwards compatibility
with ASCII. All other characters require from two to four bytes.
To see how the code points are actually encoded in UTF-8, recall that the
ASCII characters have code points in the range 0–127 (decimal), which is 00₁₆–
7f₁₆ in hexadecimal or 00000000₂–01111111₂ in binary. These characters are just
encoded in one byte in the obvious way and are characterised by the fact that the
most significant (the left-most) bit is 0. All other characters require more than
one byte, but the encoding is done in such a way that none of these bytes start
with 0. This is achieved by adding fixed bit combinations at the beginning
of each byte. Such codes are called prefix codes. The details are given in a fact
box.
Fact 4.6 (UTF-8 encoding of Unicode). A Unicode character with code point
c is encoded in UTF-8 according to the following four rules:

1. If c = (d₆d₅d₄d₃d₂d₁d₀)₂ is in the decimal range 0–127 (hexadecimal
00₁₆–7f₁₆), it is encoded in one byte as

0d₆d₅d₄d₃d₂d₁d₀. (4.3)

2. If c = (d₁₀d₉d₈d₇d₆d₅d₄d₃d₂d₁d₀)₂ is in the decimal range 128–2047
(hexadecimal 80₁₆–7ff₁₆), it is encoded as the two-byte binary number

110d₁₀d₉d₈d₇d₆ 10d₅d₄d₃d₂d₁d₀. (4.4)

3. If c = (d₁₅d₁₄d₁₃d₁₂d₁₁d₁₀d₉d₈d₇d₆d₅d₄d₃d₂d₁d₀)₂ is in the decimal
range 2048–65535 (hexadecimal 800₁₆–ffff₁₆), it is encoded as the three-
byte binary number

1110d₁₅d₁₄d₁₃d₁₂ 10d₁₁d₁₀d₉d₈d₇d₆ 10d₅d₄d₃d₂d₁d₀. (4.5)

4. If c = (d₂₀d₁₉d₁₈d₁₇d₁₆d₁₅d₁₄d₁₃d₁₂d₁₁d₁₀d₉d₈d₇d₆d₅d₄d₃d₂d₁d₀)₂ is
in the decimal range 65536–1114111 (hexadecimal 10000₁₆–10ffff₁₆), it
is encoded as the four-byte binary number

11110d₂₀d₁₉d₁₈ 10d₁₇d₁₆d₁₅d₁₄d₁₃d₁₂ 10d₁₁d₁₀d₉d₈d₇d₆ 10d₅d₄d₃d₂d₁d₀. (4.6)
This may seem complicated at first sight, but is in fact quite simple and
elegant. Note that any given byte in a UTF-8 encoded text must start with the binary
digits 0, 10, 110, 1110 or 11110. If the first bit in a byte is 0, the remaining bits
represent a seven bit ASCII character. If the first two bits are 10, the byte is the
second, third or fourth byte of a multi-byte code point, and we can find the first
byte by going back in the byte stream until we find a byte that does not start with
10. If the byte starts with 110 we know that it is the first byte of a two-byte code
point; if it starts with 1110 it is the first byte of a three-byte code point; and if it
starts with 11110 it is the first of a four-byte code point.
Observation 4.3. It is always possible to tell if a given byte within a text en-
coded in UTF-8 is the first, second, third or fourth byte in the encoding of a
code point.
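The classification in observation 4.3 can be written directly as a small Python function (the function name is ours):

```python
def byte_role(b):
    """Classify a byte in a UTF-8 stream by its leading bits."""
    if b >> 7 == 0b0:
        return "ASCII character"
    if b >> 6 == 0b10:
        return "continuation byte (second, third or fourth)"
    if b >> 5 == 0b110:
        return "first byte of a two-byte code point"
    if b >> 4 == 0b1110:
        return "first byte of a three-byte code point"
    if b >> 3 == 0b11110:
        return "first byte of a four-byte code point"
    return "invalid in UTF-8"

# One-, two- and three-byte characters: 'K', 'Å' and the euro sign.
for b in "KÅ€".encode("utf-8"):
    print(f"{b:02x}", byte_role(b))
```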
The UTF-8 encoding is particularly popular in the Western world since all
the common characters of English can be represented by one byte, and almost
all the national European characters can be represented with two bytes.
Example 4.1. Let us consider a concrete example of how the UTF-8 code of a
code point is determined. The ASCII characters are not so interesting since for
these characters the UTF-8 code agrees with the code point. The Norwegian
character 'Å' is more challenging. If we check the Unicode charts, we find that
this character has the code point c5₁₆ = 197. This is in the range 128–2047 which
is covered by rule 2 in fact 4.6. To determine the UTF-8 encoding we must find
the binary representation of the code point. This is easy to deduce from the
hexadecimal representation. The least significant numeral (5 in our case)
determines the four least significant bits and the most significant numeral (c)
determines the four most significant bits. Since 5 = 0101₂ and c₁₆ = 1100₂, the code
point in binary is

000 1100 0101₂,

where the group 1100 comes from c, the group 0101 from 5, and where we have
added three 0s to the left to get the eleven bits referred to by rule 2. We then
distribute the eleven bits as in (4.4) and obtain the two bytes

11000011, 10000101.

In hexadecimal this corresponds to the two values c3 and 85, so the UTF-8
encoding of 'Å' is the two-byte number c385₁₆.
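The rules of fact 4.6 can be implemented in a few lines of Python and checked against Python's built-in UTF-8 encoder (the function name is ours):

```python
def utf8_encode(c):
    """Encode the code point c following the four rules of fact 4.6."""
    if c < 0x80:                            # rule 1: one byte
        return bytes([c])
    if c < 0x800:                           # rule 2: two bytes
        return bytes([0b11000000 | c >> 6,
                      0b10000000 | c & 0x3f])
    if c < 0x10000:                         # rule 3: three bytes
        return bytes([0b11100000 | c >> 12,
                      0b10000000 | (c >> 6 & 0x3f),
                      0b10000000 | c & 0x3f])
    return bytes([0b11110000 | c >> 18,     # rule 4: four bytes
                  0b10000000 | (c >> 12 & 0x3f),
                  0b10000000 | (c >> 6 & 0x3f),
                  0b10000000 | c & 0x3f])

print(utf8_encode(0xc5).hex())                    # c385, as in example 4.1
print(utf8_encode(0xc5) == 'Å'.encode('utf-8'))   # True
```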
4.3.5 UTF-16
Another common encoding is UTF-16. In this encoding most Unicode charac-
ters with two-byte code points are encoded directly by their code points. Since
the characters of major Asian languages like Chinese, Japanese and Korean are
encoded in this part of Unicode, UTF-16 is popular in this part of the world.
UTF-16 is also the native format for representation of text in the recent versions
of Microsoft Windows and Apple’s Mac OS X as well as in programming environ-
ments like Java, .Net and Qt.
UTF-16 uses a variable width encoding scheme similar to UTF-8, but the basic
unit is two bytes rather than one. This means that all code points are encoded
in two or four bytes. In order to recognise whether two consecutive bytes in a
UTF-16 encoded text correspond to a two-byte code point or a four-byte code
point, the initial bit patterns of each pair of a four-byte code have to be illegal
in a two-byte code. This is possible since there are big gaps in the Unicode
table. In fact certain Unicode code points are reserved for the specific purpose of
signifying the start of pairs of four-byte codes (so-called surrogate pairs).
Fact 4.7 (UTF-16 encoding of Unicode). A Unicode character with code point
c is encoded in UTF-16 according to two rules:

1. If the number

c = (d₁₅d₁₄d₁₃d₁₂d₁₁d₁₀d₉d₈d₇d₆d₅d₄d₃d₂d₁d₀)₂

is a code point in the range 0–65535 (hexadecimal 0000₁₆–ffff₁₆), it is
encoded as the two bytes

d₁₅d₁₄d₁₃d₁₂d₁₁d₁₀d₉d₈, d₇d₆d₅d₄d₃d₂d₁d₀.

2. If the number

c = (d₂₀d₁₉d₁₈d₁₇d₁₆d₁₅d₁₄d₁₃d₁₂d₁₁d₁₀d₉d₈d₇d₆d₅d₄d₃d₂d₁d₀)₂

is a code point in the range 65536–1114111 (hexadecimal 10000₁₆–10ffff₁₆),
we first form the 20-bit number

c′ = c − 65536 = (d′₁₉d′₁₈d′₁₇d′₁₆d′₁₅d′₁₄d′₁₃d′₁₂d′₁₁d′₁₀d′₉d′₈d′₇d′₆d′₅d′₄d′₃d′₂d′₁d′₀)₂.

The code point c is then encoded as the two pairs of bytes

110110d′₁₉d′₁₈d′₁₇d′₁₆d′₁₅d′₁₄d′₁₃d′₁₂d′₁₁d′₁₀ 110111d′₉d′₈d′₇d′₆d′₅d′₄d′₃d′₂d′₁d′₀.
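Rule 2 can be sketched in Python (the function name is ours; it returns the 16-bit code units, each of which is then written as two bytes in the chosen byte order):

```python
def utf16_units(c):
    """Return the 16-bit code units for code point c, as in fact 4.7."""
    if c < 0x10000:                  # rule 1: the code point itself
        return [c]
    cp = c - 0x10000                 # rule 2: the 20-bit number c'
    return [0xd800 | cp >> 10,       # 110110 followed by the ten high bits
            0xdc00 | cp & 0x3ff]     # 110111 followed by the ten low bits

units = utf16_units(0x1f600)         # a code point beyond 65535
print([hex(u) for u in units])       # ['0xd83d', '0xde00']
```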
Superficially it may seem like UTF-16 does not have the prefix property, i.e.,
it may seem that a pair of bytes produced by rule 2 may occur as a pair generated
by rule 1 and vice versa. However, the existence of gaps in the Unicode table
means that this problem does not occur.
Observation 4.4. None of the pairs of bytes produced by rule 2 in fact 4.7 will
ever match a pair of bytes produced by the first rule, as there are no two-byte
code points that start with the bit sequences 110110 or 110111. It is therefore
always possible to determine whether a given pair of consecutive bytes in a
UTF-16 encoded text corresponds directly to a code point (rule 1), or is the
first or second pair of a four-byte encoding.
The UTF-16 encoding has the advantage that all two-byte code points are
encoded directly by their code points. Since the characters that require more
than two-byte code points are very rare, this means that virtually all characters
are encoded directly in two bytes.
UTF-16 has one technical complication. Different computer architectures
code pairs of bytes in different ways: some will insist on sending the eight most
significant bits first, some will send the eight least significant bits first; this is
usually referred to as big endian and little endian, respectively. To account for
this there are in fact three different UTF-16 encoding schemes, UTF-16, UTF-
16BE and UTF-16LE. UTF-16BE uses strict big endian encoding while UTF-16LE
uses strict little endian encoding. UTF-16 does not use a specific endian convention. Instead
any file encoded with UTF-16 should indicate the endian by having as its first
two bytes what is called a Byte Order Mark (BOM). This should be the
hexadecimal sequence feff₁₆ for big-endian and fffe₁₆ for little-endian. This character,
which has code point feff₁₆, is chosen because it should never legitimately appear
at the beginning of a text.
4.3.6 UTF-32
UTF-32 encodes Unicode characters by encoding the code point directly in four
bytes or 32 bits. In other words it is a fixed length encoding. The disadvantage is
that this encoding is rather extravagant since many frequently occurring char-
acters in Western languages can be encoded with only one byte, and almost all
characters can be encoded with two bytes. For this reason UTF-32 is little used
in practice.
# coding=utf-8
You can then use Unicode in your string constants which in this case will be en-
coded in UTF-8. All the standard string functions also work for Unicode strings,
but note that the default encoding is ASCII.
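As a small illustration (in modern Python 3 all strings are Unicode and UTF-8 is the default source encoding, so the coding line is only needed for older versions):

```python
# coding=utf-8
s = "Mørken"
print(len(s))                    # 6: the number of characters
print(len(s.encode("utf-8")))    # 7: 'ø' needs two bytes in UTF-8
```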
4.4.1 Text
A text is simply a sequence of characters. We know that a character is
represented by an integer code, so a text file is a sequence of integer codes. If we use
the ISO Latin 1 encoding and consider the two-line text

Knut
Mørken

the file will contain the hexadecimal bytes

4b 6e 75 74 0a 4d f8 72 6b 65 6e
The first four bytes you will find in table 4.2 as the codes for 'K', 'n', 'u' and 't'
(remember that the codes of Latin characters in ISO Latin 1 are the same as in
ASCII). The fifth character has decimal code 10, which you find in table 4.3. This
is the Line Feed character which causes a new line on my computer. The
remaining codes can all be found in table 4.2 except for the seventh, which has
decimal code 248. This is located in the upper 128 ISO Latin 1 characters and
corresponds to the Norwegian letter 'ø', as can be seen in table 4.4.
If instead the text is represented in UTF-8, we obtain the bytes
4b 6e 75 74 0a 4d c3 b8 72 6b 65 6e
We see that these are the same as for ISO Latin 1 except that ’f8’ has become two
bytes ’c3 b8’ which is the two-byte code for ’ø’ in UTF-8.
In UTF-16 the text is represented by the codes
ff fe 4b 00 6e 00 75 00 74 00 0a 00 4d 00 f8 00 72 00 6b 00 65 00 6e 00
All the characters can be represented by two bytes and the leading byte is ’00’
since we only have ISO Latin 1 characters. It may seem a bit strange that the
zero byte comes after the nonzero byte, but this is because the computer uses
little endian representation. A program reading this file would detect this from
the first two bytes which is the byte-order mark referred to above.
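All three byte sequences above can be reproduced in a couple of lines of Python (the utf-16 codec writes the byte order mark first and, on the common little-endian machines, produces the same bytes as listed in the text):

```python
text = "Knut\nMørken"
print(text.encode("latin-1").hex(" "))   # 4b 6e 75 74 0a 4d f8 72 6b 65 6e
print(text.encode("utf-8").hex(" "))     # same, except 'ø' becomes c3 b8
print(text.encode("utf-16").hex(" "))    # starts with the BOM, e.g. ff fe
```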
4.4.2 Numbers
A number can be stored in a file by finding its binary representation and storing
the bits in the appropriate number of bytes. The number 13 = 1101₂ for
example could be stored as a 32-bit integer by storing the bytes 00 00 00 0d (in
hexadecimal). But here there is a possibly confusing point: why can we not
just store the number as a text? This is certainly possible and if we use UTF-8
we can store 13 as the two bytes 31 33 (in hexadecimal). This even takes up less
space than the four bytes required by the true integer format. For bigger num-
bers however the situation is the opposite: Even the largest 32-bit integer can
(For technical reasons integers are in fact usually stored in so-called two's complement form.)
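The two ways of storing 13 can be compared with Python's struct module:

```python
import struct

print(struct.pack(">i", 13).hex())     # 0000000d: 13 as a 4-byte integer
print("13".encode("utf-8").hex())      # 3133: 13 as two bytes of text
```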
This principle is absolute, but there are of course many ways to instruct a
computer how information should be interpreted. A lot of the interpretation is
programmed into the computer via the operating system; programs that are
installed on the computer contain code for encoding and decoding information
specific to each program; sometimes the user has to tell a given program how
to interpret information (for example, tell a program the format of a file); and
sometimes a program can determine the format by looking for special bit sequences
(like the endian convention used in a UTF-16 encoded file). And if you write
programs yourself you must of course make sure that your program can process
the information from a user in an adequate way.
Exercises
4.1 Determine the UTF-8 encodings of the Unicode characters with the fol-
lowing code points:
a) 5a₁₆.
b) f5₁₆.
c) 3f8₁₆.
d) 8f37₁₆.
4.3 In this exercise you may need to use the Unicode table which can be found
at www.unicode.org/charts/.
a) Suppose you save the characters ’æ’, ’ø’ and ’å’ in a file with UTF-8
encoding. How will these characters be displayed if you open the file
in an editor using the ISO Latin 1 encoding?
b) What will you see if you do the opposite?
c) Repeat (a) and (b), but use UTF-16 instead of UTF-8.
d) Repeat (a) and (b), but use UTF-16 instead of ISO Latin 1.
Table 4.2. The ASCII characters with codes 32–127. The character with decimal code 32 is white space, and
the one with code 127 is ’delete’.
Table 4.3. The first 32 characters of the ASCII table. The first two columns show the code number in decimal
and octal, the third column gives a standard abbreviation for the character and the fourth column gives a
printable representation of the character. The last column gives a more verbose description of the character.
Table 4.4. The last 64 characters of the ISO Latin 1 character set.