A. BETCHOO
O-LEVEL
1.3 DATA STORAGE AND FILE COMPRESSION
1.3.1 Measurement of data storage
A bit is the basic unit of all computing memory storage terms and is either 1 or 0. The word comes
from binary digit. The byte is the smallest addressable unit of memory in a computer. 1 byte is 8 bits.
A 4-bit number is called a nibble – half a byte.
1 byte of memory wouldn’t allow you to store very much information so memory size is measured in
the multiples shown in Table 1.4:
Name of memory size      Equivalent denary value
1 kilobyte (1 KB)        1 000 bytes
1 megabyte (1 MB)        1 000 000 bytes
1 gigabyte (1 GB)        1 000 000 000 bytes
1 terabyte (1 TB)        1 000 000 000 000 bytes
1 petabyte (1 PB)        1 000 000 000 000 000 bytes
1 exabyte (1 EB)         1 000 000 000 000 000 000 bytes
The above system of numbering is based on the SI (base 10) system of units, where 1 kilo is equal
to 1000. It is still used for some storage devices, but it is technically inaccurate for memory.
A 1 TB hard disk drive would allow the storage of 1 × 10^12 bytes according to this system.
However, since memory size is actually measured in terms of powers of 2, another system has been
adopted by the IEC (International Electrotechnical Commission) that is based on the binary system
(Table 1.5):
Name of memory size      Number of bytes    Equivalent denary value
1 kibibyte (1 KiB)       2^10               1 024 bytes
1 mebibyte (1 MiB)       2^20               1 048 576 bytes
1 gibibyte (1 GiB)       2^30               1 073 741 824 bytes
1 tebibyte (1 TiB)       2^40               1 099 511 627 776 bytes
1 pebibyte (1 PiB)       2^50               1 125 899 906 842 624 bytes
1 exbibyte (1 EiB)       2^60               1 152 921 504 606 846 976 bytes
This system is more accurate. Internal memories (such as RAM and ROM) should be measured using
the IEC system. A 64 GiB RAM could, therefore, store 64 × 2^30 bytes of data (68 719 476 736 bytes).
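The difference between the two systems can be checked with a short Python sketch. The dictionary and function names here are illustrative only and do not come from the text; the values are those in Tables 1.4 and 1.5.

```python
# Sketch: compare SI (base 10) and IEC (base 2) unit sizes.
# SI_UNITS, IEC_UNITS and to_bytes are illustrative names, not from the text.
SI_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
            "TB": 10**12, "PB": 10**15, "EB": 10**18}
IEC_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
             "TiB": 2**40, "PiB": 2**50, "EiB": 2**60}


def to_bytes(value, unit):
    """Convert a size such as (64, "GiB") into a number of bytes."""
    table = {**SI_UNITS, **IEC_UNITS}
    return value * table[unit]


print(to_bytes(1, "TB"))    # 1000000000000
print(to_bytes(64, "GiB"))  # 68719476736

# A drive sold as 1 TB holds noticeably less than 1 TiB.
print(round(to_bytes(1, "TB") / to_bytes(1, "TiB"), 3))  # 0.909
```

The last line shows why the distinction matters: the gap between the SI and IEC values grows with each larger unit.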
1.3.2 Calculation of file size
In this section we will look at the calculation of the file size required to hold a bitmap image and a
sound sample.
The file size of an image is calculated as:
image resolution (in pixels) × colour depth (in bits)
The size of a mono sound file is calculated as:
sample rate (in Hz) × sample resolution (in bits) × length of sample (in seconds)
For a stereo sound file, you would then multiply the result by two.
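The two formulas above can be written as small Python functions. This is only a sketch: the function and parameter names are assumptions made for illustration, and the input values in the usage lines are made up.

```python
def image_file_size_bits(width_px, height_px, colour_depth_bits):
    """Image size = image resolution (in pixels) x colour depth (in bits)."""
    return width_px * height_px * colour_depth_bits


def sound_file_size_bits(sample_rate_hz, sample_resolution_bits,
                         length_s, channels=1):
    """Sound size = sample rate x sample resolution x length.

    channels=1 is mono; channels=2 doubles the result for stereo.
    """
    return sample_rate_hz * sample_resolution_bits * length_s * channels


# Made-up values, chosen only to keep the arithmetic easy to follow.
print(image_file_size_bits(10, 10, 8))               # 800
print(sound_file_size_bits(8000, 8, 2))              # 128000
print(sound_file_size_bits(8000, 8, 2, channels=2))  # 256000
```

Both functions return a size in bits; dividing by 8 gives the size in bytes.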
Example 1
A photograph is 1024 × 1080 pixels and uses a colour depth of 32 bits. How many photographs of
this size would fit onto a memory stick of 64 GiB?
1. Multiply number of pixels in vertical and horizontal directions to find total number of pixels =
(1024 × 1080) = 1 105 920 pixels
2. Now multiply number of pixels by colour depth, then divide by 8 to give the number of bytes =
(1 105 920 × 32)/8 = 35 389 440/8 = 4 423 680 bytes
3. 64 GiB = 64 × 1024 × 1024 × 1024 = 68 719 476 736 bytes
4. Finally divide the memory stick size by the file size = 68 719 476 736/4 423 680 ≈ 15 534.46,
which rounds down to 15 534 photos.
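The steps of Example 1 can be checked with a few lines of Python. The variable names are illustrative; integer division (//) is used in the last step because only whole photographs can be stored.

```python
# Example 1 as a calculation (variable names are illustrative).
width, height, colour_depth = 1024, 1080, 32

photo_bytes = width * height * colour_depth // 8  # bits -> bytes
stick_bytes = 64 * 1024**3                        # 64 GiB in bytes
photos = stick_bytes // photo_bytes               # whole photos only

print(photo_bytes)  # 4423680
print(stick_bytes)  # 68719476736
print(photos)       # 15534
```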
Example 2
A camera detector has an array of 2048 by 2048 pixels and uses a colour depth of 16 bits. Find the
size of an image taken by this camera in MiB.
1. Multiply number of pixels in vertical and horizontal directions to find total number of pixels =
(2 048 × 2 048) = 4 194 304 pixels
2. Now multiply number of pixels by colour depth = 4 194 304 × 16 = 67 108 864 bits
3. Now divide number of bits by 8 to find the number of bytes in the file = (67 108 864)/8 =
8 388 608 bytes
4. Now divide by 1024 × 1024 to convert to MiB = (8 388 608)/(1 048 576) = 8 MiB.
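The same working for Example 2, as a short Python check (variable names are illustrative):

```python
# Example 2 as a calculation (variable names are illustrative).
pixels = 2048 * 2048             # total number of pixels
size_bits = pixels * 16          # 16-bit colour depth
size_bytes = size_bits // 8      # bits -> bytes
size_mib = size_bytes / 1024**2  # bytes -> MiB

print(pixels)      # 4194304
print(size_bytes)  # 8388608
print(size_mib)    # 8.0
```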
Example 3
An audio CD has a sample rate of 44 100 Hz and a sample resolution of 16 bits. The music being
sampled uses two channels to allow for stereo recording. Calculate the file size for a 60-minute
recording.
1. Size of file = sample rate (in Hz) × sample resolution (in bits) × length of sample (in seconds)
2. Size of sample = (44 100 × 16 × (60 × 60)) = 2 540 160 000 bits
3. Multiply by 2 since there are two channels being used = 5 080 320 000 bits
4. Divide by 8 to find number of bytes = (5 080 320 000)/8 = 635 040 000 bytes
5. Divide by 1024 × 1024 to convert to MiB = 635 040 000 / 1 048 576 ≈ 605.6 MiB.
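Example 3 can be verified in the same way. The variable names are illustrative; the result is rounded to one decimal place.

```python
# Example 3 as a calculation (variable names are illustrative).
sample_rate_hz = 44_100
sample_resolution_bits = 16
length_s = 60 * 60  # 60 minutes in seconds
channels = 2        # stereo

size_bits = sample_rate_hz * sample_resolution_bits * length_s * channels
size_bytes = size_bits // 8
size_mib = size_bytes / 1024**2

print(size_bits)           # 5080320000
print(size_bytes)          # 635040000
print(round(size_mib, 1))  # 605.6
```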
1.3.3 Data compression
The calculations in Section 1.3.2 show that sound and image files can be very large. It is therefore
necessary to reduce (or compress) the size of a file for the following reasons:
• to save storage space on devices such as the hard disk drive/solid state drive
• to reduce the time taken to stream a music or video file.
• to reduce the time taken to upload, download or transfer a file across a network.
• the download/upload process uses up network bandwidth – this is the maximum rate of
transfer of data across a network, measured in bits per second. This occurs whenever a file is
downloaded, for example, from a server. Compressed files contain fewer bits of data than
uncompressed files and therefore occupy the available bandwidth for less time, which results
in a faster transfer.
• reduced file size also reduces costs. For example, when using cloud storage, the cost is based
on the size of the files stored. Also, an internet service provider (ISP) may charge a user based
on the amount of data downloaded.
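The effect of file size on transfer time can be estimated by dividing the number of bits by the bandwidth. The sketch below is an idealised calculation: the 10 megabit per second link and the 50% compression figure are hypothetical values chosen for illustration, and real transfers carry extra overhead.

```python
def transfer_time_s(file_size_bytes, bandwidth_bits_per_s):
    """Idealised transfer time: number of bits / bits per second."""
    return file_size_bytes * 8 / bandwidth_bits_per_s


# Hypothetical values: an 8 MiB file over a 10 megabit per second link.
original_bytes = 8 * 1024**2
compressed_bytes = original_bytes // 2  # assume compression halves the file
bandwidth = 10_000_000                  # bits per second

print(round(transfer_time_s(original_bytes, bandwidth), 2))    # 6.71
print(round(transfer_time_s(compressed_bytes, bandwidth), 2))  # 3.36
```

Halving the number of bits halves the idealised transfer time, since the bandwidth itself does not change.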
1.3.4 Lossy and lossless file compression
File compression can either be lossless or lossy.
Lossy file compression
With this technique, the file compression algorithm eliminates unnecessary data from the file. This
means the original file cannot be reconstructed once it has been compressed.
Lossy file compression results in some loss of detail when compared to the original file. The
algorithms used in the lossy technique have to decide which parts of the file need to be retained and
which parts can be discarded.
For example, when applying a lossy file compression algorithm to:
• an image, it may reduce the resolution and/or the bit/colour depth.
• a sound file, it may reduce the sampling rate and/or the resolution.
Files compressed with a lossy algorithm are usually smaller than files compressed with a lossless
algorithm, which is of great benefit when considering storage and data transfer rate requirements.
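Why lossy compression cannot be reversed can be seen with a toy example that reduces the bit depth of some values from 8 bits to 4 bits. This is only an illustration of discarded detail, not how MP3 or JPEG actually work; the function names and sample values are made up.

```python
def reduce_depth(value, from_bits=8, to_bits=4):
    """Keep only the most significant bits of a value (lossy step)."""
    return value >> (from_bits - to_bits)


def restore_depth(value, from_bits=4, to_bits=8):
    """Scale a reduced value back up; the dropped bits cannot be recovered."""
    return value << (to_bits - from_bits)


original = [200, 201, 202, 203]  # four slightly different 8-bit values
reduced = [reduce_depth(v) for v in original]
restored = [restore_depth(v) for v in reduced]

print(reduced)   # [12, 12, 12, 12]
print(restored)  # [192, 192, 192, 192]
```

Each value now needs half as many bits, but the four different originals have become identical, so the original data cannot be rebuilt from the reduced version.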
Common lossy file compression algorithms are:
• MP3 (MPEG-1 Audio Layer III) and MP4 (MPEG-4 Part 14)
• JPEG.
MP3 and MP4
MP3 files are used for playing music on computers or mobile phones. This compression technology
will reduce the size of a normal music file by about 90%. While MP3 music files can never match the
sound quality found on a DVD or CD, the quality is satisfactory for most general purposes.
But how can the original music file be reduced by 90% while still retaining most of the music quality?
Essentially the algorithm removes sounds that the human ear can’t hear properly. For example:
• removal of sounds outside the human ear range
• if two sounds are played at the same time, only the louder one can be heard by the ear, so the
softer sound is eliminated. This is called perceptual music shaping.
MP4 files are slightly different to MP3 files. This format allows the storage of multimedia files rather
than just sound – music, videos, photos and animation can all be stored in the MP4 format. As with
MP3, this is a lossy file compression format, but it still retains an acceptable quality of sound and
video. Movies, for example, could be streamed over the internet using the MP4 format without
losing any real discernible quality.
JPEG
When a camera takes a photograph, it produces a raw bitmap file which can be very large in size.
These files are temporary in nature. JPEG is a lossy file compression algorithm used for bitmap
images. As with MP3, once the image is subjected to the JPEG compression algorithm, a new file is
formed, and the original file can no longer be reconstructed.
The JPEG file reduction process is based on two key concepts:
• human eyes don’t detect differences in colour shades quite as well as they detect differences
in image brightness (the eye is less sensitive to colour variations than it is to variations in
brightness)
• by separating pixel colour from brightness, images can be split into 8 × 8-pixel blocks, for
example, which then allows certain ‘information’ to be discarded from the image without
causing any real noticeable deterioration in quality.
Lossless file compression
With this technique, all the data from the original uncompressed file can be reconstructed. This is
particularly important for files where any loss of data would be disastrous (e.g. when transferring a
large and complex spreadsheet or when downloading a large computer application).
Lossless file compression is designed so that none of the original detail from the file is lost.
Run-length encoding (RLE) can be used for lossless compression of a number of different file
formats:
• it is a form of lossless/reversible file compression.
• it reduces the size of a string of adjacent, identical data (e.g., repeated colours in an image)
• a repeating string is encoded into two values:
– the first value represents the number of identical data items (e.g., characters) in the run
– the second value represents the code of the data item (such as ASCII code if it is a
keyboard character)
• RLE is only effective where there is a long run of repeated units/bits.
Using RLE on text data
Consider the following text string: ‘aaaaabbbbccddddd’. Assuming each character requires 1 byte
then this string needs 16 bytes. If we assume ASCII code is being used, then the string can be coded
as follows:
a a a a a    b b b b    c c      d d d d d
05 97        04 98      02 99    05 100
This means we have five characters with ASCII code 97, four characters with ASCII code 98, two
characters with ASCII code 99 and five characters with ASCII code 100. Assuming each number in the
second row requires 1 byte of memory, the RLE code will need 8 bytes. This is half the original file
size.
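A minimal Python sketch of this scheme is shown below. The function names are illustrative, and the cap of 255 on a run is an assumption made so that each count fits in 1 byte, as the text assumes.

```python
def rle_encode(text):
    """Return a flat list [count, code, count, code, ...]."""
    encoded = []
    i = 0
    while i < len(text):
        run = 1
        # Extend the run while the next character matches (max 255 per byte).
        while i + run < len(text) and text[i + run] == text[i] and run < 255:
            run += 1
        encoded += [run, ord(text[i])]
        i += run
    return encoded


def rle_decode(values):
    """Rebuild the original string from [count, code, ...] pairs."""
    pairs = zip(values[0::2], values[1::2])
    return "".join(chr(code) * count for count, code in pairs)


codes = rle_encode("aaaaabbbbccddddd")
print(codes)              # [5, 97, 4, 98, 2, 99, 5, 100]
print(len(codes))         # 8
print(rle_decode(codes))  # aaaaabbbbccddddd
```

Decoding returns exactly the original string, which is what makes the method lossless.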
One issue occurs with a string such as ‘cdcdcdcdcd’ where RLE compression isn’t very effective:
every run has a length of 1, so the coded version is twice the size of the original. To cope with
this, we use a flag. A flag preceding data indicates that the next two values are the number of
repeating units and the code of the repeated item (for example, 255 05 97 where 255 is the flag and
the other two numbers indicate that there are five items with ASCII code 97). When a flag is not
used, the next byte is taken at face value as a single item (for example, 99 on its own means one
character with ASCII code 99, which would otherwise have been coded as 01 99).
Consider this example:
String aaaaaaaa bbbbbbbbbb c d c d c d eeeeeeee
Code 08 97 10 98 01 99 01 100 01 99 01 100 01 99 01 100 08 101
The original string contains 32 characters and would occupy 32 bytes of storage.
The coded version contains 18 values and would require 18 bytes of storage.
Introducing a flag (255 in this case) produces:
255 08 97 || 255 10 98 || 99 100 99 100 99 100 || 255 08 101
This has 15 values and would, therefore, require 15 bytes of storage. This is a reduction in file size of
about 53% when compared to the original string.
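The flag method can be sketched in Python as follows. Several details are assumptions of this sketch rather than part of the text: the function names, the rule that a run is only flagged when it has at least four items (a flagged run costs three values, so shorter runs gain nothing), and the requirement that the flag value 255 never appears as a character code (true for ASCII text).

```python
FLAG = 255  # must not occur as a character code in the data


def rle_encode_flagged(text, min_run=4):
    """Encode long runs as [FLAG, count, code]; copy short runs unchanged."""
    out = []
    i = 0
    while i < len(text):
        run = 1
        while i + run < len(text) and text[i + run] == text[i] and run < 255:
            run += 1
        code = ord(text[i])
        if run >= min_run:
            out += [FLAG, run, code]  # flagged run: three values
        else:
            out += [code] * run       # face-value items, one value each
        i += run
    return out


def rle_decode_flagged(values):
    """Reverse rle_encode_flagged."""
    chars = []
    i = 0
    while i < len(values):
        if values[i] == FLAG:
            chars.append(chr(values[i + 2]) * values[i + 1])
            i += 3
        else:
            chars.append(chr(values[i]))
            i += 1
    return "".join(chars)


text = "a" * 8 + "b" * 10 + "cdcdcd" + "e" * 8
coded = rle_encode_flagged(text)
print(coded)
# [255, 8, 97, 255, 10, 98, 99, 100, 99, 100, 99, 100, 255, 8, 101]
print(len(text), len(coded))                           # 32 15
print(round((len(text) - len(coded)) / len(text) * 100))  # 53
```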
Using RLE with images
Example 1: Black and white image
The figure below shows the letter F in a grid where each square requires 1 byte of storage. A white
square has a value 0 and a black square a value of 1.
0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 1 1 1 1 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
In compressed RLE format this becomes:
9W 6B 2W 1B 7W 1B 7W 5B 3W 1B 7W 1B 7W 1B 6W
Using W = 0 and B = 1 we get:
90 61 20 11 70 11 70 51 30 11 70 11 70 11 60
The 8 × 8 grid would need 64 bytes; the compressed RLE format has 30 values, and therefore needs
only 30 bytes to store the image.
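The runs for the letter F can be reproduced with a short Python sketch. It reads the grid row by row from the top-left square, so a run may continue from the end of one row into the start of the next (which is why the first run is 9W rather than 8W). The variable names are illustrative.

```python
# The 8 x 8 grid for the letter F (0 = white, 1 = black).
grid = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

# Flatten the grid into one sequence, row by row.
pixels = [p for row in grid for p in row]

# Build a list of [count, value] runs.
runs = []
for p in pixels:
    if runs and runs[-1][1] == p:
        runs[-1][0] += 1     # same value as the previous square: extend run
    else:
        runs.append([1, p])  # value changed: start a new run

labels = " ".join(f"{count}{'B' if value else 'W'}" for count, value in runs)
print(labels)
# 9W 6B 2W 1B 7W 1B 7W 5B 3W 1B 7W 1B 7W 1B 6W
print(len(pixels), 2 * len(runs))  # 64 30
```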
Example 2: Coloured images
Figure 1.13 shows an object in four colours. Each colour is made up of red, green and blue (RGB)
according to the code on the right.
This produces the following data: 2 0 0 0 4 0 255 0 3 0 0 0 6 255 255 255 1 0 0 0 2 0 255 0 4 255 0 0 4
0 255 0 1 255 255 255 2 255 0 0 1 255 255 255 4 0 255 0 4 255 0 0 4 0 255 0 4 255 255 255 2 0 255 0
1 0 0 0 2 255 255 255 2 255 0 0 2 255 255 255 3 0 0 0 4 0 255 0 2 0 0 0.
The original image (8 × 8 square) would need 3 bytes per square (to include all three RGB values).
Therefore, the uncompressed file for this image is 8 × 8 × 3 = 192 bytes.
The RLE code has 92 values, which means the compressed file will be 92 bytes in size. This gives a file
reduction of about 52%. It should be noted that the file reductions in reality will not be as large as
this due to other data which needs to be stored with the compressed file (e.g., a file header).
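The figures for the coloured image can be checked by decoding the data listed above. In this sketch each run is read as four values: a count followed by the red, green and blue values of the colour. The variable names are illustrative.

```python
# RLE data from the text: groups of (count, R, G, B).
data = [
    2, 0, 0, 0,        4, 0, 255, 0,    3, 0, 0, 0,        6, 255, 255, 255,
    1, 0, 0, 0,        2, 0, 255, 0,    4, 255, 0, 0,      4, 0, 255, 0,
    1, 255, 255, 255,  2, 255, 0, 0,    1, 255, 255, 255,  4, 0, 255, 0,
    4, 255, 0, 0,      4, 0, 255, 0,    4, 255, 255, 255,  2, 0, 255, 0,
    1, 0, 0, 0,        2, 255, 255, 255, 2, 255, 0, 0,     2, 255, 255, 255,
    3, 0, 0, 0,        4, 0, 255, 0,    2, 0, 0, 0,
]

# Split the stream into runs of four values and expand each run.
runs = [tuple(data[i:i + 4]) for i in range(0, len(data), 4)]
pixels = []
for count, r, g, b in runs:
    pixels += [(r, g, b)] * count

uncompressed_bytes = len(pixels) * 3  # 3 bytes (R, G, B) per square
compressed_bytes = len(data)          # 1 byte per RLE value
reduction = (uncompressed_bytes - compressed_bytes) / uncompressed_bytes * 100

print(len(pixels))          # 64
print(len(set(pixels)))     # 4
print(uncompressed_bytes)   # 192
print(compressed_bytes)     # 92
print(round(reduction, 1))  # 52.1
```

The decoded data covers all 64 squares and contains exactly four distinct colours, which agrees with the description of the image.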