File Compression and Backup Computer Principles for Programmers
Part 1: Compression (40 points)
old quote from Vangie Beal, managing editor of Webopedia…
(all lower-case letters have been used to simplify the example)
data compression is particularly useful in communications because it enables
devices to transmit or store the same amount of data in fewer bits. there
are a variety of data compression techniques, but only a few have been
standardized. the ccitt has defined a standard data compression technique
for transmitting and a compression standard for data communications through
modems. in addition, there are file compression formats, such as arc and zip.
Compression algorithms substitute a repeating string in the original text with
a unique token and keep a record of each substitution in a dictionary. The
dictionary is stored with the compressed text for later decompression.
(this is an analogy, not the algorithm used by LZW or Huffman encoding)
• For a string of a given length that occurs n times and is replaced by a single-character token:
• Saving length × n characters costs one dictionary entry (token + string = 1 + length characters)
plus n tokens replacing the string in the original text.
• "data " has a length of 5 with 5 occurrences (25 characters)
• less the overhead of 1+5 characters in the dictionary: "!data "
plus 5 ! tokens in the text replacing each "data " string
• length × n – 1 – length – n = x characters saved
• 25 saved less 6 for the dictionary entry less 5 tokens in the text = 14 characters saved.
Token  String          Length  Occurrences (n)  Saving = length × (n – 1) – n – 1
!      data            5       5                14
@      compression     12      5                42
#      communications  15      2                12
$      transmit        8       2                5
%      there are       11      2                8
&      technique       9       2                6
*      standard        8       3                12
(      in              3       3                2
)      for             3       3                2
Total original length: 449     Total saved: 103
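The savings arithmetic above can be checked with a few lines of Python. This is only a sketch of the analogy's formula, not part of any real compression library. The function name `chars_saved` is made up for illustration, and the lengths and occurrence counts are copied from the table.

```python
def chars_saved(length: int, n: int) -> int:
    """Characters saved by replacing a string that occurs n times with a 1-character token."""
    dictionary_cost = 1 + length   # token + string, stored once in the dictionary
    token_cost = n                 # one token stays in the text per occurrence
    return length * n - dictionary_cost - token_cost


# (string, length, occurrences) as listed in the table above
rows = [
    ("data", 5, 5),
    ("compression", 12, 5),
    ("communications", 15, 2),
    ("transmit", 8, 2),
    ("there are", 11, 2),
    ("technique", 9, 2),
    ("standard", 8, 3),
    ("in", 3, 3),
    ("for", 3, 3),
]

total_saved = sum(chars_saved(length, n) for _, length, n in rows)
print(chars_saved(5, 5))  # 14
print(total_saved)        # 103
```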
Fall 2022 Page 1 of 8
Dictionary
!data
@compression
#communications
$transmit
% there are
&technique
*standard
(in
)for
Compressed text
!@is particularly useful in #because
it enables devices to $ or store the
same amount of !in fewer bits. t%a
variety of !@&s, but only a few have
been *ized. the ccitt has defined a
* !@& for $ting and a @*
for !#through modems. in addition, t%
file @formats, such as arc and zip.
Including dictionary, total compressed size is 346 characters which is 77% of original, 23% saved.
The compression factor increases according to the frequency of repeating strings in the text.
Although Huffman and LZW compression routines are more sophisticated, the above illustrates the
concept with character data instead of binary.
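The substitution idea can be sketched in Python as follows. This is a minimal illustration of the analogy above, not LZW or Huffman coding. The function names, the sample sentence, and the two-entry dictionary are all assumptions made up for the example.

```python
def compress(text, dictionary):
    """Replace each dictionary string with its token, working down the dictionary."""
    for token, string in dictionary:
        if token in text:
            # A token must be unique, otherwise decompression cannot be trusted.
            raise ValueError(f"token {token!r} already appears in the text")
        text = text.replace(string, token)
    return text


def compressed_percent(original, compressed, dictionary):
    """Dictionary plus compressed text, as a percentage of the original size."""
    dictionary_size = sum(len(token) + len(string) for token, string in dictionary)
    return 100 * (dictionary_size + len(compressed)) / len(original)


# Hypothetical sample text and dictionary (note the trailing spaces in the strings)
sample = "data compression is useful. data compression saves data bits."
entries = [("!", "data "), ("@", "compression ")]

packed = compress(sample, entries)
print(packed)  # !@is useful. !@saves !bits.
print(round(compressed_percent(sample, packed, entries)), "% of original")
```

With so short a sample the dictionary overhead is a large share of the result; longer texts with more repetition do better.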
➔ How much can you compress the lyrics to a song using the ideas above?
You choose the song.
→ Copy the lyrics of a song to a new MS-Word document (Ctrl+N).
→ To reduce complexity, make all letters lower case:
Ctrl+A to select all text, Alt+H 7 L to make the selection lower case.
In the bottom left of the Word display, click “### words”.
The Word Count dialog will pop up showing the number of characters with spaces.
N.B. paragraph / new line / CRLF characters are not counted by Word – ignore for our purposes here.
(Alt+H,8 will toggle the display of whitespace characters)
The following will help with your substitution analysis: separate the words in the text so each is on its
own line, then sort the lines to see repeating patterns of individual words.
• copy the lyrics to another new document (Ctrl+N) used only for analysis
• Find and Replace (Ctrl+H) a space with a space + paragraph marker ^p
Find what: □ Replace with: □^p
• Then sort the lines to see repeating words (Alt+H S O).
Sorting by Paragraphs results in one word per line, in alphabetical order;
this makes it easy to see repeating words.
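For comparison, the same one-word-per-line analysis can be sketched in Python. This is only an illustration, assuming the text has already been made lower case; the function name `repeated_words` and the sample line are invented for the example.

```python
from collections import Counter


def repeated_words(text):
    """Return (word, count) pairs for words occurring more than once, alphabetically."""
    counts = Counter(text.lower().split())
    return sorted((word, n) for word, n in counts.items() if n > 1)


# Hypothetical sample line standing in for song lyrics
sample = "tick tock goes the clock tick tock goes the clock"
for word, n in repeated_words(sample):
    print(word, n)
```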
Anything occurring only once is not worth substituting with a token and including in the dictionary; you
would be adding two characters to the file (the token in the dictionary entry and the token in the text).
Any string with a length of 2 or 3 and occurring only twice is similarly not worth it.
• a space is a character that can be compressed together with most word strings
• a robotic replacement of recurring words will not result in the best compression.
o Consider compressing phrases before compressing individual words
o Consider whether a leading and/or trailing space should be compressed with a word
• the token/string dictionary must be included to decompress text back to its original state.
o The overhead of the dictionary must be considered with the compressed text to assess the
size of the compression versus the original.
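One way to compare words against phrases is to score each candidate with the savings formula from Part 1. The sketch below is an assumption-laden illustration: `rank_candidates`, its `max_words` parameter, and the sample line are all hypothetical, and every candidate is given a trailing space. Candidates overlap, so their savings cannot simply be added together; the ranking only suggests what to try first.

```python
def chars_saved(length, n):
    """Savings formula from Part 1: length x n, less dictionary entry, less n tokens."""
    return length * n - (1 + length) - n


def rank_candidates(text, max_words=3):
    """Score every word and phrase (with a trailing space) by characters saved."""
    words = text.split()
    candidates = set()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            candidates.add(" ".join(words[start:start + size]) + " ")
    ranked = []
    for candidate in candidates:
        # str.count counts non-overlapping occurrences, as Find & Replace would.
        saving = chars_saved(len(candidate), text.count(candidate))
        if saving > 0:
            ranked.append((candidate, saving))
    # Largest saving first; ties broken alphabetically so the order is repeatable.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical sample line standing in for song lyrics
sample = "tick tock goes the clock tick tock goes the clock "
for candidate, saving in rank_candidates(sample, max_words=5)[:3]:
    print(repr(candidate), saving)
```

In this sample the whole five-word phrase saves more than any single word, which is the point of compressing phrases first.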
➔ How much can you compress the lyrics?
• Use unique tokens – symbols that do not appear in the lyrics. E.g. the special characters and digits
on the keyboard's top row.
N.B. do not use the ^ caret symbol; it is a Microsoft Word escape character and will confuse Word's Find &
Replace process.
• Decompression reads the first character in the dictionary as the token and the next characters to
end-of-line as the original string to replace tokens with.
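The decompression rule just described can be sketched in Python. This is a minimal illustration under the assumption that each dictionary line is one token character followed by the original string. The function names and the sample dictionary are invented for the example.

```python
def parse_dictionary(lines):
    """First character of each line is the token; the rest of the line is the string."""
    return [(line[0], line[1:]) for line in lines if line]


def decompress(text, dictionary):
    """Expand tokens from the bottom of the dictionary up."""
    for token, string in reversed(dictionary):
        text = text.replace(token, string)
    return text


# Hypothetical dictionary lines (each string ends with a space)
entries = parse_dictionary(["!data ", "@compression "])
print(decompress("!@is useful. !@saves !bits.", entries))
# data compression is useful. data compression saves data bits.
```

Working from the bottom up matters when a later entry contains an earlier token, for example `@!compression ` following `!data `: the `@` must be expanded before the `!` it reveals.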
Copy and paste the following into your answer document:
➔ The lyrics of your chosen song – with attribution please.
➔ What was your dictionary of compression token to string characters, one entry per line?
e.g. (there is a space following the word because, to software, a space is a character)
!data
@compression
➔ What were the compressed lyrics with the token substitutions?
i.e. the compressed text (not the sorted analysis list of words)
!@is …
➔ What is the total of dictionary plus compressed text characters (with spaces) as a percentage of the
original’s size?
???
➔ Test your compression dictionary by decompressing. Process dictionary items from the bottom up:
find the compression character in the compressed data and replace it with the original string. Paste
the decompressed version below – even if it is not perfect. What modifications, if any, does the
compression dictionary need to return the compressed data back into its original state?
Part 2:
• Download this week’s .zip archive to your folder:
CP4P_CompressionBackup_Activity_Archive.zip
◦ Remember that compressed files must be decompressed before they can be opened.
◦ Windows does this automatically into the %temp% folder if you open a file directly from
a .zip archive. This is fine to quickly browse a file's content.
◦ However, if the file is to be kept or its content modified,
first copy/extract it from the .zip archive to your folder.
▪ That will be the case for
CP4P_CompressionBackup_Activity_Answers.docx because you will be
adding your answers to it.
First extract it from the archive to your folder, then open it.
If you open it first – into %temp% – you may never find your work again.
These next steps will open files in the archive, then save them as slightly different types. We'll do that
to examine the size differences of different file format types when we add them to the archive.
• Open the CP4P_CompressionBackup_Activity_Instructions.docx file. (It is
likely open now.)
◦ File menu > Save As > PDF (*.pdf)
◦ File menu > Save As > Word 97-2003 Document (*.doc)
If you see the Microsoft Compatibility Checker, click Continue.
• Add the CP4P_CompressionBackup_Activity_Instructions.pdf and .doc files
from your folder into the zip archive we have been using.
◦ select the files, right click, and
Send to > Compressed (zipped) folder option,
or use 7zip [on Windows], or your favourite compression utility to drag and drop.
• Open the CP4P_Compression_Activity_calculator.xlsx file.
◦ File menu > Save As > PDF (*.pdf)
◦ File menu > Save As > Excel 97-2003 Workbook (*.xls)
If you see the Microsoft Compatibility Checker, click Continue.
• Add the CP4P_Compression_Activity_calculator.pdf and .xls files from your
folder into the zip archive we have been using.
Open the .zip archive with Windows File Explorer.
On macOS, open a terminal window and cd to the folder containing the .zip file:
$ zipinfo -m archivename.zip
-m [medium] shows the percentage of file size saved by compression; higher is better.
-l [long] shows original and compressed file sizes in bytes.
Use the Snipping Tool or Snip & Sketch (press the Windows key and type “snip”) to copy only the information seen below.
The Ratio shows the proportion of space saved. Ratio = (Size – Compressed) / Size * 100
"Ratio" is a misnomer because it is not the ratio of the two sizes shown. It compares a calculated value that is not
displayed (the space saved) with one of the two values that are, which makes it a guessing game rather than a good user interface.
FYI: opening the .zip archive with 7zip will show bytes, not rounded K bytes, for original Size and Packed (compressed)
size.
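The Ratio formula can also be applied to an archive with Python's standard `zipfile` module. This is a sketch only: the helper names `space_saved_percent` and `archive_report` are invented here, and the demonstration builds a small in-memory archive rather than reading the activity's .zip file.

```python
import io
import zipfile


def space_saved_percent(size, compressed):
    """Ratio = (Size - Compressed) / Size * 100, as defined above."""
    if size == 0:
        return 0.0
    return (size - compressed) / size * 100


def archive_report(zip_file):
    """Return (name, original size, compressed size, % saved) for each archive member."""
    with zipfile.ZipFile(zip_file) as archive:
        return [
            (info.filename, info.file_size, info.compress_size,
             space_saved_percent(info.file_size, info.compress_size))
            for info in archive.infolist()
        ]


# Demonstration on an in-memory archive holding one highly repetitive text file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("repetitive.txt", "data compression " * 500)
buffer.seek(0)

for name, size, packed, saved in archive_report(buffer):
    print(f"{name}: {size} -> {packed} bytes, {saved:.0f}% saved")
```

To inspect a real archive, pass its path (for example the activity's .zip file in your folder) to `archive_report` instead of the buffer.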
See https://www.noupe.com/design/everything-you-need-to-know-about-image-compression.html
➔ Paste the image of the Windows [File] Explorer .zip archive information or equivalent from macOS.
→ knowing the properties of different file formats is essential to answering the questions below.
Which image format should you use? See this.
Reduce the Size of Microsoft Office Documents using Word as an example
➔ Files with the lowest ratios were compressed the least. Ratio indicates % of space saved.
Which file types compressed the least? Why would that be? (10 pts)
➔ Files with the highest ratios were compressed the most.
Which file types compressed the most? Why would that be? (10 pts)
Part 3: Backup
The most common cause of data loss is accidental deletion: of a file by an end user on their own PC, or
of a great many files by IT professionals on a server. To recover from these inevitable cases of shooting
yourself in the foot, make a backup just before loading your gun.
A backup is a copy in a geographically separate location on an independent platform. A good backup
location is Microsoft Office 365 OneDrive, in a folder that is not synchronized with any other system.
Another is to collect your data into a zip archive, and SFTP it to the matrix server.
a. Create a backup folder/directory on the target system.
b. Copy important files to that folder. e.g. the zip archive you created in Part 2. Because it
is already compressed into a single file, it will take a minimum amount of time to upload.
c. Congratulations. You just backed up something.
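Steps a and b can be sketched in Python with the standard library. This is only an illustration: the function name `back_up`, the file name `work.zip`, and the folders are hypothetical, and the demonstration copies into a temporary local folder. For a real backup, `backup_dir` would have to point at a separate location such as the ones described above.

```python
import shutil
import tempfile
from pathlib import Path


def back_up(files, backup_dir):
    """Create the backup folder if needed, then copy each file into it."""
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)      # step a: create the backup folder
    copies = []
    for file in files:
        destination = target / Path(file).name
        shutil.copy2(file, destination)            # step b: copy, keeping timestamps
        copies.append(destination)
    return copies


# Demonstration in a scratch folder so no real files are touched
with tempfile.TemporaryDirectory() as scratch:
    archive = Path(scratch) / "work.zip"           # hypothetical archive name
    archive.write_bytes(b"pretend zip content")
    copies = back_up([archive], Path(scratch) / "backup")
    print([copy.name for copy in copies])
```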
➔ Paste a screenshot of your backup results. (Use the Screen Snip tool.) (10 points)
Imagine your laptop just stopped working and could not be restarted
after you completed a great many hours of work today and yesterday.
You need a backup & restore strategy.
(30 points total for four answers ~100 words each, 400 in total.)
➔ What is (or what should have been) your backup routine? How do you ensure your backup is
current?
➔ How does your backup routine address the three characteristics of a real backup and fulfill the 3-2-1
backup check?
➔ Now that you have a backup but no laptop, how will you access and work with the current version
of your backed up files? What is your restore/recovery strategy?
➔ How long would this all take…and what if you had a big assignment due tomorrow?
FYI – https://www.google.com/search?q=7zip+full+and+differential+backup+script