Lecture #09: Indexes II

15-445/645 Database Systems (Spring 2025)

https://15445.courses.cs.cmu.edu/spring2025/
Carnegie Mellon University
Jignesh Patel

1 Bloom filter
A filter is a data structure that answers set membership queries (does this element exist in the set?). If
we know that an element is not in a set, we avoid spending time searching for something that does not exist.
For example, within a chained hash table, we can put a filter at each bucket pointer. If the filter answers
negative, we know the key is not in the chain and can skip the costly traversal of the whole chain.
A Bloom filter is a probabilistic filter implemented with a bitmap. Probabilistic means that a Bloom filter
does not always give the correct answer to a set membership query: it may return false positives. However, a
Bloom filter guarantees that it never returns false negatives.
The false positive rate for a given configuration can be calculated with a Bloom Filter Calculator.
A Bloom filter is defined by two parameters:
• Size of the bitmap
• Number of hash functions to use
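
As a rough illustration of how these two parameters drive the false positive rate, the sketch below uses the standard textbook approximation (1 - e^(-kn/m))^k, which assumes independent, uniformly distributed hash functions. The function names are hypothetical and are not taken from the lecture.

```python
import math


def bloom_false_positive_rate(num_bits: int, num_hashes: int, num_keys: int) -> float:
    """Approximate false positive rate (1 - e^(-k*n/m))^k.

    m = bitmap size, k = number of hash functions, n = number of inserted keys.
    Assumes ideal (independent, uniform) hash functions.
    """
    m, k, n = num_bits, num_hashes, num_keys
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_num_hashes(num_bits: int, num_keys: int) -> int:
    """Number of hash functions that minimizes the approximation: (m / n) * ln 2."""
    return max(1, round(num_bits / num_keys * math.log(2)))


if __name__ == "__main__":
    # Ten bits per key with the matching number of hash functions.
    k = optimal_num_hashes(10_000, 1_000)
    print(k, bloom_false_positive_rate(10_000, k, 1_000))
```

A larger bitmap lowers the rate, while adding hash functions helps only up to the optimum and then starts to fill the bitmap too quickly.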

Insert(x)
For insertion, the pre-defined hash functions are applied to the inserted element x. Each function's output
hash value is taken modulo the bitmap size, and the bit at the resulting position in the bitmap is set to one.
See Figure 1.

Lookup(x)
For lookup, a similar operation is done on element x. Each hash function takes x as input, and its output
is taken modulo the bitmap size. If any of the corresponding positions in the bitmap is not one, false is
returned. Otherwise true is returned, and one still has to go to the set to check whether x is actually in it.
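
The insert and lookup procedures can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: deriving the hash functions by salting SHA-256 with an index is an assumption made here for simplicity, and the bitmap uses one byte per bit for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, key: str):
        # Assumption: the i-th hash function is SHA-256 salted with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def lookup(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    bf = BloomFilter(num_bits=64, num_hashes=2)
    bf.insert("RZA")
    print(bf.lookup("RZA"))  # an inserted key is never reported absent
    print(bf.lookup("GZA"))  # may be a false positive, depending on the hashes
```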

Other Variations
• Counting Bloom filter: Instead of bits, a sequence of integer counters is used. Deletion is possible.
• Cuckoo filter: The same idea as cuckoo hashing, but it stores fingerprints of elements instead. Deletion
is also possible.
• Succinct range filter: An immutable filter. While no insertion can be made, a lookup can ask whether
there is an element within a range.

2 Skip List
A skip list uses multiple levels of linked lists to skip over some nodes and thus traverse faster. See Figure 2.
Like the B+tree, it stores keys in sorted order. However, it does not require rebalancing during
insertion or deletion. It is commonly used for in-memory data structures such as a memtable.

Figure 1: Insert ’RZA’ into a Bloom filter with two hash functions

Figure 2: An overview of a skip list

Find
Start at the top-level linked list and traverse it until the next key would be greater than the target. Then go
down one level and traverse the same way, repeating until the bottom list is reached, where the target key is
found if it exists.

Insert
Coins are flipped to decide up to which level the new node is inserted. The insertions on the different
levels are done from bottom to top in order to keep the whole data structure intact. Otherwise, a reader
from another thread may come across the node on an upper level and find that it has no pointer to the lower
level because the insertion has not finished yet.
Note that if the linked lists are singly linked, each level's insertion can be done with a single atomic
in-memory pointer swap (K5's next pointer is set first, then K4's next pointer is atomically swapped to point
to K5), so no latch is needed.
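
The find and insert procedures can be sketched as follows. This is a single-threaded illustration under stated assumptions: the class and method names are hypothetical, the maximum height is fixed, and the plain pointer assignment in the last line stands in for the atomic swap that a latch-free version would use.

```python
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] is the successor on level i (0 = bottom)


class SkipList:
    def __init__(self, max_height=8, seed=None):
        self.max_height = max_height
        self.head = Node(None, max_height)  # sentinel that precedes every key
        self.rng = random.Random(seed)

    def _predecessors(self, key):
        """For each level, the last node whose key is smaller than key (top-down walk)."""
        preds = [self.head] * self.max_height
        node = self.head
        for level in range(self.max_height - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node  # drop to the next level from here
        return preds

    def find(self, key) -> bool:
        candidate = self._predecessors(key)[0].next[0]
        return candidate is not None and candidate.key == key

    def insert(self, key) -> None:
        if self.find(key):
            return  # this sketch ignores duplicate keys
        preds = self._predecessors(key)
        height = 1
        while height < self.max_height and self.rng.random() < 0.5:
            height += 1  # each successful coin flip promotes the node one level
        node = Node(key, height)
        for level in range(height):  # link bottom to top
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node  # a latch-free version would swap this atomically

    def keys(self):
        out, node = [], self.head.next[0]
        while node is not None:
            out.append(node.key)
            node = node.next[0]
        return out
```

Linking the bottom level first means a concurrent reader that reaches the node on any level can always continue downward from it.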

Figure 3: The skip list from Figure 2 after K5 is inserted, with two successful coin flips.


Figure 4: Finding a key "HELLO" in a trie

Delete
A node is first marked as deleted instead of being removed from the data structure, so that concurrent reader
threads never follow a pointer into a freed object. Any reader can ignore the marked node and keep traversing.
Later, when the node object is physically removed, it is unlinked from the top level first and from the bottom
level last to keep the data structure intact.

Conclusion
Advantages:
• Uses less memory than a B+tree if reverse pointers are not included.
• No rebalancing is needed while inserting and deleting.
Disadvantages:
• Not disk/cache friendly because it does not optimize locality of reference.
• Reverse search is non-trivial: it becomes tricky to handle both ascending and descending scans.

3 Trie
Because an inner node of a B+tree does not tell us whether a key exists below it, a lookup must go all the
way down to a leaf node to find out that a key does not exist. This can cost one buffer pool page miss per
tree level.
A trie is an order-preserving data structure that stores keys as sequences of digits. A simple choice is to
use characters (one byte each) as digits for strings, or bits for other data types. These digits form a tree
structure that represents the prefixes of every entry inserted into the trie. Note that the complexity of an
operation depends on the length of the target key.
The span of a trie level is the number of bits that a digit takes. If a digit and its prefix exist in the
corpus, a pointer to the next-level node is stored at that digit; otherwise, a null is stored.
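
A minimal sketch of a trie with an 8-bit span (one byte per digit) is shown below. The class names are hypothetical, and a Python dict stands in for the per-node digit slots, so a missing entry plays the role of the null pointer.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # digit (one byte) -> TrieNode; a missing digit acts as null
        self.is_key = False  # True if an inserted key ends at this node
        self.value = None


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str, value) -> None:
        node = self.root
        for digit in key.encode():  # 8-bit span: one level per byte
            node = node.children.setdefault(digit, TrieNode())
        node.is_key = True
        node.value = value

    def lookup(self, key: str):
        node = self.root
        for digit in key.encode():
            node = node.children.get(digit)
            if node is None:
                return None  # stop early: no stored key has this prefix
        return node.value if node.is_key else None

    def keys(self):
        """All keys in sorted byte order, which shows that the trie preserves order."""
        out = []

        def walk(node, prefix):
            if node.is_key:
                out.append(bytes(prefix).decode())
            for digit in sorted(node.children):
                walk(node.children[digit], prefix + [digit])

        walk(self.root, [])
        return out
```

A lookup touches at most one node per digit of the key, regardless of how many keys are stored, and it can stop as soon as a digit is missing.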


Figure 5: From left to right, horizontal compression and vertical compression are
done on a trie with a one-bit span node

Compression
• If the span of a level is known, we can horizontally compress the node into an array instead of a
map. See Figure 5.
• If a node has only a single child, we can vertically compress the nodes below it, as there are
no branches down there. See Figure 5. The result is also called a radix tree. False positives may happen, so
readers have to check the tuple that a node points to.
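
Vertical compression can be sketched on a trie stored as nested dicts. This is an illustration with hypothetical helper names. Note one deliberate difference from the variant described above: the sketch keeps the full merged edge labels, so lookups stay exact, whereas the false positives mentioned above presumably arise in variants that drop the skipped digits.

```python
END = ""  # marker entry meaning "a key ends at this node"


def build_trie(keys):
    """Plain trie as nested dicts: one character per edge."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Merge every chain of single-child nodes into one edge with a longer label."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = True
            continue
        # Follow the chain while the child has exactly one outgoing edge
        # and is not itself the end of a key.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


if __name__ == "__main__":
    print(compress(build_trie(["HELLO", "HAT", "HAVE"])))
```

In this example the chain E-L-L-O collapses into a single edge, since nothing branches below "HE".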

4 Inverted Index
The indexes discussed so far are only good for point or range searches. They do not support keyword
search, for example, a query to find all Wikipedia articles that contain the word "Pavlo".
An inverted index stores an immutable mapping from terms to the records that contain those terms in the
target attribute. The list of records for a term is called a posting list.
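
The term-to-posting-list mapping can be sketched as below. The tokenizer (lowercasing and splitting on non-word characters) and the function names are assumptions made for this illustration; real engines use far more elaborate text analysis.

```python
import re
from collections import defaultdict


def build_inverted_index(records):
    """records: dict of record ID -> text. Returns term -> sorted posting list."""
    postings = defaultdict(set)
    for record_id, text in records.items():
        for term in re.findall(r"\w+", text.lower()):
            postings[term].add(record_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """Keyword search: return the posting list of the term, or an empty list."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    articles = {
        1: "Andy Pavlo teaches database systems",
        2: "A skip list is an ordered data structure",
        3: "Pavlo and Patel lecture on indexes",
    }
    index = build_inverted_index(articles)
    print(search(index, "Pavlo"))
```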

Lucene Implementation
Lucene is a specialized inverted index engine. It stores an inverted index in a trie-like
data structure called a "finite state transducer" (see Figure 6). Instead of storing pointers to tuples like a
trie, it stores a weight on every edge. While traversing down to the key one is looking for, the rolling sum
of the weights eventually gives the exact position at which the entry is stored in the mapping.
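
The rolling-sum idea can be illustrated with a plain trie whose edge weights add up to a term's position in the sorted dictionary. This is a simplified sketch and not Lucene's actual data structure: a real finite state transducer also shares common suffixes between terms, and all names below are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.weights = {}    # character -> weight stored on that edge
        self.is_term = False


def build(terms):
    root = Node()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, Node())
        node.is_term = True
    _assign_weights(root)
    return root


def _assign_weights(node):
    """Set edge weights; return the number of terms stored at or below node."""
    skipped = 1 if node.is_term else 0  # a term ending here sorts before everything below
    for ch in sorted(node.children):
        node.weights[ch] = skipped      # terms that sort before this subtree
        skipped += _assign_weights(node.children[ch])
    return skipped


def position(root, term):
    """Sum the weights along the path; the total is the term's sorted position."""
    node, total = root, 0
    for ch in term:
        if ch not in node.children:
            return None
        total += node.weights[ch]
        node = node.children[ch]
    return total if node.is_term else None


if __name__ == "__main__":
    terms = ["mop", "moth", "pop", "star", "stop", "top"]
    fst = build(terms)
    print([position(fst, t) for t in terms])
```

The resulting position can then be used as an offset into a separately stored array of posting lists.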
Since the dictionary of terms is immutable and built ahead of time, compression techniques such as delta
compression and bit-packing can be applied to it. Pre-aggregation is also supported to accelerate
aggregation queries that group on terms.
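
Delta compression and bit-packing can be sketched generically on any sorted sequence of integers, such as record IDs or offsets. This is an illustration of the two techniques, not Lucene's on-disk format, and the function names are hypothetical.

```python
def delta_encode(sorted_ids):
    """Return (first value, gaps between consecutive values)."""
    first = sorted_ids[0]
    gaps = [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    return first, gaps


def delta_decode(first, gaps):
    out = [first]
    for gap in gaps:
        out.append(out[-1] + gap)
    return out


def bit_pack(values):
    """Pack values into one integer using the minimum fixed width per value."""
    width = max(1, max(v.bit_length() for v in values))
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed, width


def bit_unpack(packed, width, count):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


if __name__ == "__main__":
    ids = [1000, 1003, 1010, 1024, 1025]
    first, gaps = delta_encode(ids)       # the gaps are much smaller than the IDs
    packed, width = bit_pack(gaps)        # each gap now takes `width` bits
    print(first, gaps, width)
    print(delta_decode(first, bit_unpack(packed, width, len(gaps))))
```

Because the sequence is sorted and immutable, the gaps are small and never need to be updated in place, which is what makes the fixed-width packing worthwhile.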

PostgreSQL Implementation
PostgreSQL's Generalized Inverted Index (GIN) uses a B+tree to build the term dictionary. The value stored in
a leaf node of this B+tree depends on the size of the posting list. For a small posting list, it is a sorted
list of record IDs. For a large posting list, an additional B+tree structure is built to hold the record IDs.
A separate pending list is used to avoid many small incremental updates. It logs updates and accumulates
them into a bulk insert into the dictionary.
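
The pending-list idea can be sketched as follows. This is a toy model, not PostgreSQL's code: a Python dict stands in for the B+tree dictionary, and the class name and flush threshold are hypothetical. Note that a search has to consult the pending list as well as the dictionary, since recent updates live only there.

```python
class PendingListIndex:
    def __init__(self, flush_threshold=4):
        self.dictionary = {}  # term -> sorted list of record IDs (stands in for the B+tree)
        self.pending = []     # log of (term, record ID) updates not yet merged
        self.flush_threshold = flush_threshold

    def insert(self, term, record_id):
        self.pending.append((term, record_id))  # cheap append instead of a tree update
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Merge all pending updates into the dictionary as one bulk operation."""
        grouped = {}
        for term, record_id in self.pending:
            grouped.setdefault(term, set()).add(record_id)
        for term, record_ids in grouped.items():
            merged = set(self.dictionary.get(term, [])) | record_ids
            self.dictionary[term] = sorted(merged)
        self.pending.clear()

    def search(self, term):
        found = set(self.dictionary.get(term, []))
        found |= {rid for t, rid in self.pending if t == term}  # unmerged updates
        return sorted(found)
```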

Figure 6: Finite state transducer to find the offset in the dictionary of a term

5 Vector Index
Inverted indexes support keyword search, but they do not capture the semantic meaning of the content (i.e.,
the keyword has to be contained in the content exactly). Suppose an application wants to search for records
that are related to a certain topic, for example, "hip-hop groups with songs about slinging". In that case, we
need a method other than the inverted index.
Large language models can generate embeddings for text: arrays of floating-point numbers. Embeddings are
geometrically close to each other if they have similar semantic meanings. Therefore, a vector index
specialized for nearest-neighbor search can help with the query we were discussing.
However, unlike the traditional queries we have seen before, this kind of query has no single correct answer.
There is also a need to filter data before or after finding the nearest neighbors.

Inverted File
Partition the vectors into smaller groups with a clustering algorithm. To search for the nearest neighbors,
use the same clustering algorithm to map the query vector to a group, then scan all the vectors within that
group (it may also be worth looking at nearby groups). Example: IVFFlat.
Preprocessing and quantization can be applied to the vectors stored in the index to reduce their
dimensionality or precision and thus speed up lookups, while the original vectors remain preserved at their
original locations.
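
The inverted file approach can be sketched in pure Python. This is a minimal illustration under stated assumptions: it uses a naive k-means with the first k vectors as initial centroids, Euclidean distance, and hypothetical names (SimpleIVF, num_lists, num_probes) that are not the API of any real library.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: distance(centroids[i], v))


def kmeans(vectors, k, iterations=10):
    centroids = [list(v) for v in vectors[:k]]  # naive initialization
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a group ends up empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


class SimpleIVF:
    def __init__(self, vectors, num_lists):
        self.centroids = kmeans(vectors, num_lists)
        self.lists = [[] for _ in self.centroids]  # one list of vectors per group
        for v in vectors:
            self.lists[nearest(self.centroids, v)].append(v)

    def search(self, query, k=1, num_probes=1):
        # Rank the groups by centroid distance and scan only the closest ones.
        order = sorted(range(len(self.centroids)),
                       key=lambda i: distance(self.centroids[i], query))
        candidates = [v for i in order[:num_probes] for v in self.lists[i]]
        return sorted(candidates, key=lambda v: distance(v, query))[:k]
```

Raising num_probes scans more groups, trading speed for a better chance of finding the true nearest neighbors when the query lies near a group boundary.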

Navigable Small Worlds

Build a graph that represents the neighbor relationships between vectors, where each node is a vector and
its edges link to its n nearest neighbors. A navigable small world index uses this graph to navigate from an
entry point toward the query vector by greedily following the edge that moves closest to the query vector.
Examples: FAISS, HNSWlib.
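
The greedy navigation can be sketched as below. This is a toy illustration, not the algorithm used by FAISS or HNSWlib: the graph is built by brute force, edges are made bidirectional, and the function names are hypothetical.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_graph(vectors, n):
    """Link every vector to its n nearest neighbors (brute force, bidirectional edges)."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: distance(v, vectors[j]))
        for j in others[:n]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_search(vectors, graph, query, entry=0):
    """Walk from the entry point, always moving to the neighbor closest to the query."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda j: distance(vectors[j], query),
                   default=current)
        if distance(vectors[best], query) >= distance(vectors[current], query):
            return current  # no neighbor is closer: a local minimum
        current = best


if __name__ == "__main__":
    points = [(0,), (1,), (2,), (3,), (4,), (5,)]
    graph = build_graph(points, n=2)
    print(points[greedy_search(points, graph, (4.2,))])
```

The walk stops at the first node with no closer neighbor, which may be only a local minimum, so the result is approximate in general.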
