Task 1: Navigating the File System
1. If you have not yet done so, download this very exercise sheet. Navigate to it using commands such
as ls, cd on the command line. Once you are in the directory in which the exercise sheet is located,
use ls, whose output should list the sheet.
Show your commands and their outputs.
If you cannot find the exercise sheet, you may use the command find.
Example solution:
   input          output
1. ls             Desktop Downloads Movies Pictures Documents Library Music Public
2. cd Downloads   -
3. ls             texttech-ex04.pdf
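If browsing with ls and cd does not turn up the sheet, find can search for it by name. The following is only a sketch: it assumes the file name texttech-ex04.pdf from the example above, and it uses a throwaway directory (demo_home, a made-up name) in place of the real home folder so that it runs anywhere.

```shell
# Scratch directory standing in for the home folder (hypothetical setup).
demo_home=$(mktemp -d)
mkdir -p "$demo_home/Downloads"
touch "$demo_home/Downloads/texttech-ex04.pdf"

# Search everything below demo_home for a regular file with this exact name.
found=$(find "$demo_home" -type f -name 'texttech-ex04.pdf')
echo "$found"

# Move into the directory that contains the file and list it.
listing=$(cd "$(dirname "$found")" && ls)
echo "$listing"
```

On a real system, the search would start from ~ instead of the scratch directory.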
2. Navigate to your home folder using commands like in the previous task and use pwd there. Show
your commands and their outputs.
Example solution:
The command line automatically opens new windows in the home folder.
If you have moved to a different directory, you can use the following command to return to the home
folder.
input    output
cd ~     -
pwd      /Users/username
Task 2: Creating Folders and Files
1. Create the following file hierarchy only using commands such as ls, cd, mkdir, touch on the
command line:
LanguageFamilies
    Dravidian
        South-Central Dravidian
            Telugu.txt
    Indo-European
        Germanic
            North Germanic
                Swedish.txt
    Japonic
        Japanese.txt
    Koreanic
        Korean.txt
    Sino-Tibetan
        Sinitic
            Mandarin Chinese.txt
When you are done, go to the folder LanguageFamilies and use the tree command.
Show your commands and their outputs.
Example solution:
mkdir -p "LanguageFamilies/Dravidian/South-Central Dravidian"
"LanguageFamilies/Indo-European/Germanic/North Germanic"
LanguageFamilies/{Japonic,Koreanic}
LanguageFamilies/Sino-Tibetan/Sinitic
(this is 1 command, separators should be spaces; new lines used here for visual reasons. Names that
contain a space are quoted so that the shell treats each of them as a single path.)
cd LanguageFamilies
touch "Dravidian/South-Central Dravidian/Telugu.txt"
"Indo-European/Germanic/North Germanic/Swedish.txt"
Japonic/Japanese.txt
Koreanic/Korean.txt
"Sino-Tibetan/Sinitic/Mandarin Chinese.txt"
(this is again 1 command, separators should be spaces; new lines used here for visual reasons.)
tree used inside LanguageFamilies:
2. You were given example files in Lecture 8. If you have not yet done so, download those from Moodle.
The archive contains a file called inferred-cognates.tsv. From this file, extract the entries for
pronouns (label PRN) of the languages Telugu (tel), Swedish (swe), Japanese (jpn), Korean (kor),
and Mandarin Chinese (cmn), and save them to the corresponding .txt files you created in the previous
task. Use grep and combine it with other commands if needed.
Show your commands and their outputs.
Example solution:
grep -P '.+::PRN\ttel\t' inferred-cognates.tsv >
~/path/to/LanguageFamilies/Dravidian/"South-Central Dravidian"/Telugu.txt
(this is again 1 command, separators should be spaces; new lines used here for visual reasons.)
cat-ing the file:
grep -P '.+::PRN\tswe\t' inferred-cognates.tsv >
~/path/to/LanguageFamilies/Indo-European/Germanic/"North Germanic"/Swedish.txt
and so on, for each language code + file path.
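The repetition over language codes and target files can also be written as a loop. This is a sketch under assumptions: the sample rows are invented and only imitate the column layout implied by the pattern above, the targets are flat file names in a scratch directory rather than the nested paths, and it uses grep -E with a literal tab instead of -P, since -P is not available in every grep.

```shell
demo=$(mktemp -d)
tab=$(printf '\t')

# Invented rows; the real layout of inferred-cognates.tsv may differ.
printf 'I::PRN\ttel\tform1\nI::PRN\tswe\tform2\nYOU::PRN\tswe\tform3\nHOUSE::N\tswe\tform4\n' \
    > "$demo/inferred-cognates.tsv"

# Each pair is "language code:target file"; extend the list for the other languages.
for pair in 'tel:Telugu.txt' 'swe:Swedish.txt'; do
    code=${pair%%:*}      # text before the first colon
    target=${pair#*:}     # text after the first colon
    grep -E ".+::PRN${tab}${code}${tab}" "$demo/inferred-cognates.tsv" > "$demo/$target"
done

tel_lines=$(wc -l < "$demo/Telugu.txt")
swe_lines=$(wc -l < "$demo/Swedish.txt")
echo "tel: $tel_lines, swe: $swe_lines"
```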
Task 3: Comparing files with diff
You have received files with data about birds in Tübingen. For unclear reasons, one of the files seems
distorted and some of the original data in it was lost (possibly due to copy-paste errors). Use the command
line and Unix tools to examine the file and answer the following questions. The files (file1.dat and
file2.dat) are in the archive which accompanies this exercise sheet.
1. The first line of file1.dat indicates the locations where the birds were found. Without opening the
file, list all the locations which occur in the data by extracting this header.
Solution:
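One way to do this is head with a line count of 1. The sketch below runs on a stand-in file with placeholder location names, because the real contents of file1.dat are not reproduced here.

```shell
demo=$(mktemp -d)
# Stand-in for file1.dat; the real header and data will differ.
printf 'location_A\tlocation_B\tlocation_C\nrow 1\nrow 2\n' > "$demo/file1.dat"

# Print only the first line (the header) without opening the file in an editor.
header=$(head -n 1 "$demo/file1.dat")
echo "$header"
```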
2. The final part of the file lists the birds that have been found around Tübingen. Without opening
the file, read the last four lines of the file, thereby listing the types of bird that have been found in
Tübingen on the terminal.
Solution:
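The counterpart of head for the end of a file is tail. A minimal sketch on a stand-in file with six numbered lines (not the real bird data):

```shell
demo=$(mktemp -d)
# Stand-in for file1.dat with six placeholder lines.
printf 'line %s\n' 1 2 3 4 5 6 > "$demo/file1.dat"

# Print the last four lines of the file to the terminal.
last_four=$(tail -n 4 "$demo/file1.dat")
echo "$last_four"
```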
3. file2.dat is the distorted file. By using command line tools, find out how many lines would have
to be inserted or fixed for it to match the original file1.dat.
Solution:
13 lines.
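The sheet does not say how this number was obtained. One possible approach, shown here as a sketch on two tiny stand-in files, is to count the lines that diff marks as belonging only to the original file.

```shell
demo=$(mktemp -d)
printf '%s\n' a b c d e > "$demo/file1.dat"   # stand-in for the original
printf '%s\n' a B d > "$demo/file2.dat"       # stand-in for the distorted copy

# In diff's default output, lines found only in the first file start with "<".
missing_or_changed=$(diff "$demo/file1.dat" "$demo/file2.dat" | grep -c '^<')
echo "$missing_or_changed"
```

In the stand-in data, the lines b, c and e of the original are changed or missing in the copy, so three lines are counted.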
4. How many lines differ between file1.dat and file2.dat if we ignore case?
Solution:
7 lines.
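Case can be ignored with diff's -i option. The sketch below uses made-up sample lines, not the real bird data, and compares the count with and without -i.

```shell
demo=$(mktemp -d)
printf '%s\n' Robin Sparrow Owl > "$demo/file1.dat"    # made-up sample lines
printf '%s\n' robin Sparrow Hawk > "$demo/file2.dat"

# Lines of the first file that differ when case matters.
exact=$(diff "$demo/file1.dat" "$demo/file2.dat" | grep -c '^<')

# With -i, lines that differ only in upper/lower case count as equal.
ignoring_case=$(diff -i "$demo/file1.dat" "$demo/file2.dat" | grep -c '^<')
echo "exact: $exact, ignoring case: $ignoring_case"
```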
5. After moving into the binary files folder, you will see two binary files there. Use diff to compare the two
binary files and check whether they really are different.
Solution:
The output of diff tells us that the files are different. However, if you convert the binary content
to plain text, the message is the same. This is because the two files use different binary
encodings.
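The effect can be reproduced on a small scale. The sketch below is one possible reading of "different encodings": two made-up files store the same message, once with one byte per character and once with an extra NUL byte after each character. The actual files in the archive may be encoded differently.

```shell
demo=$(mktemp -d)
# Two made-up files that store the message "hi" in different encodings.
printf 'hi' > "$demo/a.bin"            # one byte per character
printf 'h\000i\000' > "$demo/b.bin"    # each character followed by a NUL byte

# Byte for byte, diff reports the files as different.
if diff -q "$demo/a.bin" "$demo/b.bin"; then raw=same; else raw=different; fi

# After stripping the NUL bytes, the remaining text is identical.
if tr -d '\000' < "$demo/b.bin" | cmp -s - "$demo/a.bin"; then
    decoded=same
else
    decoded=different
fi
echo "raw: $raw, decoded: $decoded"
```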
6. The Unix command diff can compare two files. If we want to compare three files instead, what
command should we use instead? How did you find out?
Solution:
To compare more than two files, we can combine a for loop with diff. Alternatively, we can use diff3,
which is designed specifically for comparing three files (but, unfortunately, not more than three).
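Both options can be sketched as follows. The three files are stand-ins created on the spot, and the diff3 call is guarded because the tool may not be installed everywhere.

```shell
demo=$(mktemp -d)
printf 'a\nb\n' > "$demo/f1.txt"
printf 'a\nB\n' > "$demo/f2.txt"
printf 'a\nb\n' > "$demo/f3.txt"

# Option 1: compare the first file against each of the others in turn.
differing=0
for other in "$demo/f2.txt" "$demo/f3.txt"; do
    if ! diff -q "$demo/f1.txt" "$other" > /dev/null; then
        echo "f1.txt differs from $(basename "$other")"
        differing=$((differing + 1))
    fi
done

# Option 2: diff3 compares exactly three files in a single call.
if command -v diff3 > /dev/null; then
    diff3 "$demo/f1.txt" "$demo/f2.txt" "$demo/f3.txt" ||
        echo "diff3 reported conflicts or trouble"
fi
```

The loop scales to any number of files, but only ever compares pairs; diff3 shows all three files side by side in one report.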
Task 4: Regular Expressions I
(General Hint: When creating your regular expressions, be sure to test them with various test strings. Make
sure that the expression matches what it should and does not match what it should not.)
1. Create a regular expression that matches only the student mail addresses at our university.
It should, for example, match erik.zeiner@student.uni-tuebingen.de,
sung-jin-miriam.han@student.uni-tuebingen.de, and hyunjoo.cho@student.uni-tuebingen.de.
It should not match, for example, johannes.dellert@uni-tuebingen.de, incorrect addresses
like zeiner@student.uni-tuebingen.de, or addresses on any other domain.
Solution:
[a-z-]+\.[a-z-]+@student\.uni-tuebingen\.de
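Following the general hint, the expression can be tried out against the addresses from the task. This sketch uses grep -E with -x, which only accepts lines that match the pattern as a whole; the last test address is a made-up example of a foreign domain.

```shell
pattern='[a-z-]+\.[a-z-]+@student\.uni-tuebingen\.de'

# The first three addresses should match, the last three should not.
matches=$(printf '%s\n' \
    'erik.zeiner@student.uni-tuebingen.de' \
    'sung-jin-miriam.han@student.uni-tuebingen.de' \
    'hyunjoo.cho@student.uni-tuebingen.de' \
    'johannes.dellert@uni-tuebingen.de' \
    'zeiner@student.uni-tuebingen.de' \
    'erik.zeiner@student.example.org' |
    grep -E -x "$pattern")
echo "$matches"
```

Without -x (or explicit ^ and $ anchors), the pattern would also accept lines that merely contain a matching address somewhere.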
2. Consider the following regular expression:
[1-2]{1,2}(:|.)[0-3][7-9]␠?([ap].?m.?|[AP]M)
Describe, in your own words, strings that match this expression. Analyse the expression part by part.
For each part, provide at least two strings that match the expression and demonstrate what the given
part does.
Example Solution:
The expression matches various ways to express time in a 12-hour format. It allows the hours to be
either one or two digits, and those digits can only be 1 or 2. This can also match non-existent hours
such as 21 (e.g. 1:07 am and 21:07 am). The hours and minutes are meant to be separated either by a
dot or by a colon (e.g. 1:07 am and 1.07 am); because the dot is not escaped, it in fact matches any
single character, so 1x07 am matches as well. The minutes can be in the range from 07 to 39, where the
second digit is always between 7 and 9 (e.g. 1:07 am and 1:38 am). After the minutes, there can be,
but doesn’t have to be, a space (e.g. 1:07 am and 1:07am). At the end, we can express the period
of the day, using ‘am’, ‘pm’, ‘a.m.’, ‘p.m.’, ‘am.’, ‘pm.’, ‘a.m’, ‘p.m’, ‘AM’, or ‘PM’ (e.g. 1:07 a.m.
and 1:38 PM). Here, too, each unescaped dot stands for any single optional character, not only a
literal dot.
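The analysis can be checked with grep. In this sketch, the visible-space symbol of the expression is written as an ordinary space, and -x makes grep accept only lines that match as a whole; the test strings are the examples from the solution plus a few made-up counterexamples.

```shell
pattern='[1-2]{1,2}(:|.)[0-3][7-9] ?([ap].?m.?|[AP]M)'

# The first seven strings should match, the last three should not.
matches=$(printf '%s\n' \
    '1:07 am' '21:07 am' '1.07 am' '1:38 PM' '1:07am' '1:07 a.m.' '1x07 am' \
    '3:07 am' '1:45 pm' '1:07' |
    grep -E -x "$pattern")
echo "$matches"
```

The string 1x07 am is accepted because the unescaped dot in (:|.) matches any character; 3:07 am fails on the hour, 1:45 pm on the minutes, and 1:07 on the missing period of the day.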
Task 5: Regular Expressions II
For this exercise, create a regular expression that matches all case forms for a specific set of Czech nouns in
their singular paradigm. The two representative examples are ‘stůl’ (table) and ‘dvůr’ (courtyard). Consider
their declension tables:
Case           Courtyard Sg.   Table Sg.
Nominative     dvůr            stůl
Genitive       dvoru/dvora     stolu
Dative         dvoru           stolu
Accusative     dvůr            stůl
Vocative       dvore!          stole!
Locative       dvoru/dvoře     stole
Instrumental   dvorem          stolem
Your regular expression should match lines containing any of the 7 forms, regardless of any text prepended.
For example, the string could be ‘Nominative: dvůr’, or a style more commonly used in Czech ‘1. pád:
stůl’. Use the declension of the word ‘kůl’ (stake) to confirm that your expression is precise enough. This
word should be matched in nominative and accusative, but not in other cases, as it doesn’t follow the same
paradigm. Your expression therefore must match all 7 cases of the previous words, but only 2 cases of ‘kůl’.
Do not hardcode the two examples: we want to match all words with that declension, not only these two.
Case           Stake Sg.
Nominative     kůl
Genitive       kůlu
Dative         kůlu
Accusative     kůl
Vocative       kůle!
Locative       kůlu
Instrumental   kůlem
Example Solution:
This regular expression matches all declined forms of the words for courtyard and table, regardless of
how the grammatical case is expressed:
^.* (.*(?:r|l|o(?:r|l)(?:u|a)|o(?:r|l|ř)e!?|o(?:r|l)em))$
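The precision requirement can be checked on the forms from the tables above. This is a sketch under one assumption: the non-capturing groups (?:...) are rewritten as ordinary groups so that the pattern runs with grep -E; with a grep that supports -P, the original pattern can be used as it stands. The prefix "form: " is a made-up placeholder for the case label.

```shell
# Same alternatives as above, written with ordinary groups for grep -E.
pattern='^.* (.*(r|l|o(r|l)(u|a)|o(r|l|ř)e!?|o(r|l)em))$'

# All seven forms of "stůl" should match.
table_count=$(printf 'form: %s\n' stůl stolu stolu stůl 'stole!' stole stolem |
    grep -E -c "$pattern")

# Of the seven forms of "kůl", only nominative and accusative should match.
stake_count=$(printf 'form: %s\n' kůl kůlu kůlu kůl 'kůle!' kůlu kůlem |
    grep -E -c "$pattern")

echo "stůl: $table_count of 7, kůl: $stake_count of 7"
```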
1. Use grep with your regular expression to list all the lines which match in the file examples.txt (again
found in the archive provided alongside this sheet). Then use wc to count how many such lines there
are in the file. Explain how, just from considering that number, we could gauge how many words
(with their full declension list), which follow the paradigm, are represented in the file.
Solution:
The expression matched 16 lines in the file. We know that a full declension list of a word has
7 lines, so we can estimate that there are at least 2 words that match the paradigm (2 · 7 = 14),
plus possibly some other words for which only some of the forms matched. The declension table for the
word stake matches in the nominative and accusative, which, in this case, explains the two additional
matches. This, of course, is not a conclusive way to estimate the number, and there are situations
where it could lead you to a wrong estimate.
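The counting and the estimate can be sketched like this. The stand-in for examples.txt is built from the three declension tables on this sheet, so the real file may contain other lines; the pattern is the grep -E rewrite of the example solution with ordinary groups in place of (?:...), which is an assumption of this sketch.

```shell
demo=$(mktemp -d)
pattern='^.* (.*(r|l|o(r|l)(u|a)|o(r|l|ř)e!?|o(r|l)em))$'

# Stand-in for examples.txt: 7 forms each of dvůr, stůl and kůl.
{
    printf 'form: %s\n' dvůr dvoru/dvora dvoru dvůr 'dvore!' dvoru/dvoře dvorem
    printf 'form: %s\n' stůl stolu stolu stůl 'stole!' stole stolem
    printf 'form: %s\n' kůl kůlu kůlu kůl 'kůle!' kůlu kůlem
} > "$demo/examples.txt"

# Count the matching lines with wc.
matched=$(grep -E "$pattern" "$demo/examples.txt" | wc -l)

# Each full paradigm contributes 7 lines; the remainder comes from partial matches.
full_sets=$((matched / 7))
leftover=$((matched % 7))
echo "matched: $matched, full paradigms: $full_sets, leftover lines: $leftover"
```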
2. Add an argument to the previous command to list all lines from the file which do not match.
Solution:
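The argument in question is most likely -v, which makes grep print the lines that do not match. A sketch on the forms of kůl, again using the grep -E rewrite of the example pattern (ordinary groups instead of (?:...)) and a made-up "form: " prefix:

```shell
pattern='^.* (.*(r|l|o(r|l)(u|a)|o(r|l|ř)e!?|o(r|l)em))$'

# -v inverts the selection: only non-matching lines are printed.
non_matching=$(printf 'form: %s\n' kůl kůlu kůlu kůl 'kůle!' kůlu kůlem |
    grep -E -v "$pattern")
echo "$non_matching"
```

The five printed lines are the forms of kůl outside the nominative and accusative.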