Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 2: Handout 28.10.2024, Due 04.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #008
For the following questions, we will consider the uni db schema imported in the previous sheet
Question 1: Index tuning (1 P.)
All relevant primary indices and foreign keys are already created (e.g.: assistants.boss is a foreign key
referencing professors.PID).
As the system seems to be very unresponsive lately, the president of the university tasked you with
improving the performance of the system. An analysis of the query load showed that the following values
are queried regularly:
• Q1: A list of all participants (matrnr) of a specific exam, filtered by semester (e.g.: WHERE semester
< 4).
• Q2: A histogram, showing how many assistants are working in which field, given a specific boss.
• Q3: The same histogram as before for the whole university.
• Q4: All exams of a given professor with a specific grade.
• Q5: An overview over all grades of a given student.
• Q6: The names of all lectures without prerequisites.
Which indices will you create to improve the performance of the given queries? For each index discuss
why you created it. Hint: Maybe some indices can be used for multiple queries. Which of your created
indices could benefit from being a hash index, instead of a B+ tree?
1
Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 2: Handout 28.10.2024, Due 04.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #008
Solution
• Q1: We create an B+ index on the students.semester, as range queries are possible. The join
between exam and student does not require a further index.
• Q2: A B+ index on (assistants.boss, assistants.field). This enables very fast creation of
the histogram.
• Q3: The previous index can’t be used anymore. We build a new B+ index on assistants.field.
• Q4: We create an index on (exam.PID, exam.grade). This index can be a B+- or hash-index, as
both columns have one specific value.
• Q5: We do not have to create an index, as the primary key of exam can already be used. Alterna-
tively, creating an additional B+- or hash-index on exam.matrnr would also be possible.
• Q6: We have to create either a B+- or hash index on prerequisites.lecture, as the primary key
is ordered with the required lecture first.
Question 2: Index Costs (1 P.)
Given the following values for the lectures table: Lets assume, there are N = 350 000 lectures stored in
this relation. Each is numbered sequentially between 1 and N . Each page on the disk can store exactly
10 lecture tuples. A B+ tree was created as primary index. It has a height of h and the leaves contain 20
entries, together with a pointer to a data page.
a) How many pages have to be accessed to answer the query SELECT * FROM lecture WHERE LID
BETWEEN 30 000 AND 40 000? (Hint: Keep the classification of indices from the lecture in mind)
Solution
A clustered sparse primary index has one entry per page. The rows which are physically stored on
the disk follow the same order as the index.
First, we search the key 30 000 within the B+ tree. This requires h page accesses. From there on we
need to consider the next 10 001 tuples.
These tuples are distributed over d10 001/10e = 1 001 pages.
Since the index is clustered, the total page accesses will be h + 1 001.
2
Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 2: Handout 28.10.2024, Due 04.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #008
b) After analyzing the query load, the database administrator notices that many queries select lectures
held by the same professor. He decides to cluster the table lectures by the heldby column. An
additional index is created on the primary key LID. How would the lecture classify this index? How
many page accesses would the query from question a) require now?
Solution
The index is now a dense secondary index, as the file is clustered by another attribute. We still have
to find the key 30 000, which once again requires h page accesses. From there we, again, iterate over
the leaves to receive the next 10 001 tuples. But now, each tuple has its own index entry. Thereby,
we have to read d10 001/20e = 501 index pages which point to 10 001 data pages. In total we access
h + 501 + 10 001 pages.
Note: Not necessarily will each lecture tuple be stored in a different page. In reality, we may access
less than 10 001 data pages. But we can not guarantee that multiple result-tuples are stored in the
same page, or that this page is still contained in the DB buffer when we require it again. For the
purpose of this lecture we calculate the worst-case costs.
Question 3: Composite Index (1 P.)
The lectures table contains 1 000 lecture tuples, of which exactly 4 tuples fit into one page. Each lecture
has a (uniformly distributed) SWS value between 1 and 20. There are 50 professors, each holding the
same amount of lectures. The lectures table has two secondary composite-key indices on (sws,heldby)
and (heldby,sws). Both indices are B+ trees with a height of h and each leaf can store 5 references to
pages.
• Which index requires less page accesses for the query SELECT * FROM lectures WHERE heldby =
5 and sws <= 10? Please calculate the number of expected page accesses for both indices.
Solution
The selectivities of both predicates are f (σheldby=5 ) = 1/50, f (σsws<=10 ) = 10/20 = 1/2. The index
accesses for both composite indices are sketched in the following figure.
3
Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 2: Handout 28.10.2024, Due 04.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #008
(sws, heldby)
h
Index
(20,50)
(1,50)
(5,20)
(10,5)
(10,6)
... ... ... ... ...
(1,1)
(1,5)
(2,5)
... ... 100 visited leaves
Data Entries
Data Pages ... Max. 10 visited data pages
(heldby, sws)
h
Index
(50,20)
(5,10)
(5,11)
(10,5)
(18,6)
... ... ...
(1,1)
(5,1)
(5,5)
(5,6)
... ... 2 visited leaves
...
...
Data Entries
Data Pages ... Max. 10 visited data pages
Index (sws,heldby) requires h page accesses to navigate to the leaf node containing the first
matching entry. From there on we have to iterate all remaining entry pages until the last element, as
the result tuples are distributed over a large number of leaves. We have to iterate (1 000·10/20·1/5 ≈
100) − 1 = 99 leaves to reach the last matching tuple. As the index allows us to only fetch pages
containing relevant tuples, we require at most 1 000 · 10/20 · 1/50 ≈ 10 data page accesses, if no
result tuple shares a page. In total, this index requires at most h + 99 + 10 page accesses.
Index (heldby,sws) also requires h page accesses to navigate to the leaf node containing the first
entry smaller or equal than (5, 10). But this time the index clusters the result tuples right next to
each other. Hence, we only have to read (1 000 · 1/50 · 1/5 · 1/2 ≈ 2) − 1 = 1 leaf node. As before, the
result tuples can be distributed over up to 10 pages. In total, this index requires at most h + 1 + 10
page accesses.
Aternative solution for (sws,heldby):
If the system estimates that only a few pages have to be accessed and the overhead of random
reads is still less than sequentially reading a range of index leaves, we could find all data entries by
repeatedly traversing the index from the root to the data page as shown below. This would require
h ∗ 10 + 10 random accesses, which depending on the height of the tree and the access times could
be slower than the previous solution.
4
Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 2: Handout 28.10.2024, Due 04.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #008
(sws, heldby)
...
10 * h
Index
(20,50)
(1,50)
(5,20)
(10,5)
(10,6)
... ... ... ... ...
(1,1)
(1,5)
(2,5)
... ... 10 visited leaves
Data Entries
Data Pages ... Max. 10 visited data pages
5
Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 2: Handout 28.10.2024, Due 04.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #008
Question 4: Index Implementation (1 P.)
This question requires you to implement code in any programming language you want. The code has to
compile and return the correct result. In OLAT, we provide a Java template with most of the boilerplate
code already in place. This template checks your result for correctness and times the execution. Do not
change anything else than the specified parts of the code.
Submit the code as a separate file (or archive, if you have more than one source file). If you use a different
language than Java, provide instructions on how to compile and run your code. Also, your code should
then include the same checks as the template Java main method.
N
Given a dataset of N tuples, consisting of a unique ID and a string value with |S| = 10 distinct strings.
Initially the dataset is ordered by the ID and each tuple is assigned a random string from S like the
following:
(1, ABC), (2, BCD), (3, ABC), (4, XYZ), (5, XYX), (6, BCD), (7, XYZ)
The system should be able to answer the following queries:
• Return all tuples for which the string is equal to a given query string q.
• Return all tuples for which the string is lexicographically equal or greater to a given query string q.
Your task is to:
a) Implement a default access method, which iterates all tuples and returns the correct results. (Al-
ready given, if one uses the Java template code)
b) Implement a dense index as introduced in the lecture.
If it is not possible for your index implementation to perform one of these operations, then use the default
execution method and state why it is not possible.
Execute your code a few times with different query strings (For the java template, just execute it mul-
tiple times). Describe how the index creation time and query times for both operations differ in your
implementations. You may change the values of N and S to check the effects on your implementation.
Solution
One possible solution for the dense index is to use a hash index.
Pseudo code dense index creation:
index = {}
for t in tuples:
index[t.string] += [t.id]
Pseudo code dense index get equal string:
return index[query_string]
With the hash index, no “greater than” operation can be efficiently implemented. One could iterate all
keys, check if they are greater than the given value and add the corresponding tuple(s) to the list. But
this can be slower than iterating all tuples.