0% found this document useful (0 votes)

16 views7 pages

Database Assignment 3

The assignment focuses on implementing a B+ Tree and Index Scan using Apache Calcite, aiming to familiarize students with database file storage, indexing, and query evaluation. Students are required to implement a B+ Tree for indexing CSV files and an IndexScan operator to evaluate simple SQL queries. The assignment includes specific tasks, background information on Apache Calcite, and guidelines for submission, including a bonus option for implementing deletion functionality in the B+ Tree.

Uploaded by

sahil984sk

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views7 pages

Database Assignment 3

Uploaded by

sahil984sk

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

COL362/COL632 Assignment 3

B+ Tree and Index Scan

Due: 22nd March 2024 7:00pm

Modified on: 16th March 2024

Goal
The goal of this assignment is to get familiar with
• Physical storage of database files on disk
• Indexing databases files, particularly using B+Trees
• Implementing a new operator using Apache Calcite, and using it to evaluate simple queries
You have been provided with a starter code for this assignment. The code is available here (You should be
logged in with your IITD credentials to access it). In this assignment, you will further develop the provided
code.

Background
Apache Calcite. We will be using Apache Calcite, an open-source extensible data management frame-
work. It contains a SQL parser, an API for building expressions in relational algebra, and a query optimizer.
It is used in many popular big data systems, such as Apache Flink, Apache Hive, Apache Drill, Apache Kylin,
and is integral to data systems technology stack across several companies. See companies and projects that
are powered by Calcite. You are strongly encouraged to check out more information and get yourself familiar
with Apache Calcite here. We will be using Calcite in more detail in Assignments 4 and 5.

Memory management. Managing data storage is a critical part of any data management system. The
system must manage the memory efficiently, and must also provide a way to index the data for faster access.

In this assignment, you will develop an index and an efficient access control method to access data efficiently.

Tasks
Broadly, You have to solve two tasks in this assignment.

Task 1. Implement a B+ Tree

In this task, you will implement B+Trees. The B+Tree is a data structure used to store and retrieve data
in a database. It is a balanced tree, and is particularly useful for efficiently accessing records, and efficiently
performing insertions, deletions, and updates on files stored on disks.

1
You job is to implement a B+Tree that supports (point) insert and search operations. Feel free to flex your
programming and hacking skills by also implementing deletions (see bonus section below). The B+Tree you
develop, will be used to index CSV files. There are some example files provided to you in the starter code.

Task 2. Implement IndexScan in Calcite

After you have completed task 1, we’d love to see your B+Tree in action, i.e, given simple SELECT-FROM-WHERE
query, we’d like to evaluate the query and return the rows in the CSV file which satisfy the filter condition
in the WHERE clause.

The code provided to you already does most of the ground work for you. In particular, it

• Parses the query and convert it into a relational algebra expression

• Converts the relational algebra expression into a logical plan
• Converts the logical plan into a physical plan

• Executes the physical plan and return the result

The only missing piece the above steps, is the PIndexScan operator that you have to implement. This
operator will be used to evaluate queries which have a WHERE clause and the column in the WHERE clause
is indexed.

Note: For this assignment, we will have simple SFW queries. The queries will always of the type -

“SELECT * FROM $table_name WHERE $col_name OP $value”

where the OP can be EQUALS, LESS THAN, GREATER THAN, GREATER THAN OR EQUAL, LESS THAN OR EQUAL.

DB362
DB362 is a data system that we’ve developed, and need your help to further develop it. The system is being
designed to efficiently and effectively query data stored in CSV files. DB362 is designed as an in-memory
system, i.e., it first loads all files from disk to memory, and then provides the data management framework
to efficiently query CSV files–one can easily query a CSV file as “select ... from CSVfile where ...”.
The following figure shows the high level overview the “three layers” of the system.

Storage
The storage layer is already developed! DB362 system organizes CSV files in memory in a similar way a
database organizes the database files on disk. In the code provided to you, the classes DB, File, and Block
contain important parts of the code for storage layer (You should NOT modify these). The system organizes
the in-memory database (of CSV files) as a collection of File objects, each of which is a collection of Block
objects, and each Block contains records (rows of a CSV file.)

The following explains how the DB362 system organizes data in memory.

The structure of each block is as follows

2
s
Figure 1: The DB362 data system.

Figure 2: Block Structure for Records

A block is of always of fixed size of 4KB. When used to store rows of a CSV file, we follow the conven-
tion as shown in the figure. The first 2 bytes represent the number of rows, say n. The next 2n bytes store
the offset of each row in this block (2 bytes for each offset). We start filling the rows from the right hand side.

The structure of each record is as follows

Figure 3: Record Structure

Here, Oi , Li are the offset and length for the ith variable-length field. v is the number of variable length

3
fields and f is the number of fixed length fields. Each Oi is 2 bytes each, same for Li . For more details, refer
to the lecture notes.

Note that in each file that is used to store rows of a CSV file, the first block is always the metadata block.
In this case, we use the block to store the schema of the CSV file. The metadata block looks like -

Figure 4: Schema Block Structure

For each column, we store the datatype of the column, and the column name. This looks like -

Figure 5: Each column data in schema

Note: Since we store everything as bytes, take care of the little endian and big endian encoding at different
places.

Index
For this assignment, you will develop the Indexing layer, in particular a B+Tree index, and an access method
using the B+Tree (the two tasks above). A skeleton code is already been provided to you. The B+Tree is
implemented using the class Tree. The Tree class contains the InternalNode and LeafNode classes.

Note that the Tree class extends the AbstractFile Class. This is because, we also store the B+ Tree as a file on
the (simulated) in-memory disk. The InternalNode and LeafNode classes also extend the AbstractBlock class.

Remember that - since the InternalNode and LeafNode are blocks in the BPlusTreeIndexFile file - the
identifier (ID) of a node is the index of the block in the file.

Some methods are already implemented in these classes. You should NOT modify them. You should
implement the insert and search methods.

The first block in the Tree file will be the metadata block. In this block, the first 2 bytes are reserved for
the order of the block, and the next 2 bytes are reserved for the index of the root node.

The structure of InternalNode and LeafNode are different. Each InternalNode looks as follows -

4
Figure 6: Internal Node

Whereas, each LeafNode looks as follows -

Figure 7: Leaf Node

Here, you need to implement the search and insert function in BPlusTreeIndexFile.java. The search and
insert functions in LeafNode.java and InternalNode.java can be used as helper functions and will NOT be
tested separately.

The search function in BPlusTreeIndexFile should return the leftmost ID of leafnode that contains the key.

StorageManager
The StorageManager class helps in managing the database. It provides methods to load csv files an in-
memory storage, and to serialize and deserialize the rows into series of bytes.

The StorageManager class also provides methods to create indexes on columns of tables. You will need to
implement the create index method.

Calcite
As a part of the Apache Calcite framework, the starter code has 3 directories in the optimizer package -

5
• rel - This directory contains the classes for the relational algebra expressions. For this assignment, we
only have classes PRel and PIndexScan. PRel is the base class for all the relational algebra expressions.
PIndexScan is the class for the IndexScan operator.
• convention - PConvention is the physical convention for the relational algebra expressions.

• rules - This directory contains the rules for converting the logical plan into a physical plan. We have
already implemented the rule for converting a logical plan (which matches a SFW query) into a physical
plan. Once converted, we have a physical IndexScan operator in the plan.

As an index method, you will only need to implement the evaluate function in the PIn-
dexScan class. You will not need to modify any other part of the code in the Calcite package.

What to submit?
DB362 system requires Java 8. Ensure that you have java version 8 before proceeding with developing the
assignment. Further, you would also require Gradle version 4.5. Then proceed as follows:

• Clone the project from https://git.iitd.ac.in/dbsys/assignment_3.git

– create directory path/to/assignment 3/
– cd into the newly created directory by cd path/to/assignment 3/
– run git clone https://git.iitd.ac.in/dbsys/assignment_3.git . to clone the project on
your local machine

• Import the project into your favorite editor. We strongly recommend using Intellj
• Develop the system further. You should only work on the following files
– InternalNode.java
– LeafNode.java
– Tree.java BPlusTreeIndexFile.java
– PIndexScan.java
– StorageManager.java
– TreeNode.java

• Test that your code works

– You can add your own test cases in the files placed in ”in/ac/iitd/src/test/java” directory.
– To add new test cases, follow similar syntax as already included ones. (Should include a ”@Test”
annotation before the test function)
– To run the test cases, run the command ./gradlew test in the /path/to/assignment 3 direc-
tory. You can also use ./gradlew test --info to get detailed output on your console. (You can
also setup these run commands in Intellij IDE).
• Submit your contribution

– cd into path/to/assignment 3/
– create a patch git diff [COMMITID] > [ENTRYNO].patch
– Submit the .patch file on Moodle

6
Please follow the instructions strictly as given here and as comments in the code. Do not rename any
files, modify function signatures, or include any other file unless asked for.

When submitting your patch:

• replace [ENTRYNO] with your entry number.
• replace [COMMITID] with one that will be provided to you (this will be provided 1-2 days before
submission deadline)
Ensure that your patch doesn’t contain any files other than those you can make changes in. Thus, if you
create any new files for test cases, you should remove them from your patch.

Bonus Part
As a part of this assignment, you are free to also implement the delete functionality for your B+ Tree
implementations. Note that this part is optional and will carry some bonus marks. The signature for the
delete function is provided in Tree.java BPlusTreeIndexFile.java and StorageManager.java. As before, you
should NOT modify this signature. If you choose to not implement this function, please do not remove the
function signature from the Tree.java BPlusTreeIndexFile.java file - just leave the implementation empty.

Assignment 3
No ratings yet
Assignment 3
6 pages
Indexing and Hashing: (Emphasis On B+ Trees)
No ratings yet
Indexing and Hashing: (Emphasis On B+ Trees)
23 pages
Index and Hashing
No ratings yet
Index and Hashing
82 pages
DBMS Unit5
No ratings yet
DBMS Unit5
40 pages
DBMS Unit-4
No ratings yet
DBMS Unit-4
9 pages
Ch14, Veiws, Normalization - Summary
No ratings yet
Ch14, Veiws, Normalization - Summary
68 pages
CSE 544: Lecture 11 Storing Data, Indexes: Monday, 5/1/2006
No ratings yet
CSE 544: Lecture 11 Storing Data, Indexes: Monday, 5/1/2006
52 pages
Da 2 Arunkumar
No ratings yet
Da 2 Arunkumar
5 pages
Storage and Indexing
No ratings yet
Storage and Indexing
41 pages
UNIT-5: Indexing and Hashing
No ratings yet
UNIT-5: Indexing and Hashing
78 pages
Chapter 11: Indexing and Hashing
No ratings yet
Chapter 11: Indexing and Hashing
47 pages
Chapter 7 - Indexing
No ratings yet
Chapter 7 - Indexing
94 pages
CH 12 Updated
No ratings yet
CH 12 Updated
55 pages
LM6 - B+ Tree Index Files - B Tree Index Files
No ratings yet
LM6 - B+ Tree Index Files - B Tree Index Files
27 pages
DBMS Indexing
No ratings yet
DBMS Indexing
43 pages
DBMS Unit-Iv
No ratings yet
DBMS Unit-Iv
9 pages
DBMS IMPORTANT UNIT-4 QUESTIONS and Answer
No ratings yet
DBMS IMPORTANT UNIT-4 QUESTIONS and Answer
5 pages
B+-Trees: Efficient Indexing Guide
No ratings yet
B+-Trees: Efficient Indexing Guide
35 pages
Physical DBs B+ Tree
No ratings yet
Physical DBs B+ Tree
35 pages
n04-B Trees
No ratings yet
n04-B Trees
19 pages
Database Indexing Techniques
No ratings yet
Database Indexing Techniques
50 pages
Indexing and Hashing: (Emphasis On B+ Trees)
No ratings yet
Indexing and Hashing: (Emphasis On B+ Trees)
23 pages
Memoryhierarchy Indexing
No ratings yet
Memoryhierarchy Indexing
9 pages
Database Modeling - Notes-V
No ratings yet
Database Modeling - Notes-V
9 pages
Indexing: Contents
No ratings yet
Indexing: Contents
13 pages
08 Indexes1
No ratings yet
08 Indexes1
7 pages
CSE 301 Lecture-8-Indexing WT
No ratings yet
CSE 301 Lecture-8-Indexing WT
31 pages
08 Indexes1
No ratings yet
08 Indexes1
7 pages
B+ Trees for Database Students
No ratings yet
B+ Trees for Database Students
8 pages
CS245 In-Memory Data Assignment
No ratings yet
CS245 In-Memory Data Assignment
7 pages
Indexing
No ratings yet
Indexing
141 pages
Unit-5 B+Trees & Hashing
No ratings yet
Unit-5 B+Trees & Hashing
37 pages
B - TREES: (Loosely Based On The Cow Book: Ch. 10)
No ratings yet
B - TREES: (Loosely Based On The Cow Book: Ch. 10)
22 pages
Database File Organization Guide
No ratings yet
Database File Organization Guide
26 pages
Unit 5 Indexing 2024
No ratings yet
Unit 5 Indexing 2024
50 pages
Indexing and B+ Tress
No ratings yet
Indexing and B+ Tress
6 pages
File Organization
No ratings yet
File Organization
11 pages
Indexing
No ratings yet
Indexing
41 pages
Co3 Session 22
No ratings yet
Co3 Session 22
19 pages
Solution 3
No ratings yet
Solution 3
7 pages
IT3020 L06 Indexing
No ratings yet
IT3020 L06 Indexing
41 pages
Indexing
No ratings yet
Indexing
56 pages
DBMS Indexing Methods
No ratings yet
DBMS Indexing Methods
33 pages
File Organization and Indexing
No ratings yet
File Organization and Indexing
38 pages
CSC3170 2024fall A3
No ratings yet
CSC3170 2024fall A3
4 pages
Indexing Structures For Files: Database Design Database Design
No ratings yet
Indexing Structures For Files: Database Design Database Design
9 pages
CH 3 Index
No ratings yet
CH 3 Index
40 pages
Unit Iv Indexing and Hashing: Basic Concepts
No ratings yet
Unit Iv Indexing and Hashing: Basic Concepts
35 pages
Database Indexing Essentials
No ratings yet
Database Indexing Essentials
110 pages
CH 17 Sum
No ratings yet
CH 17 Sum
9 pages
CO3-Session-09 & 10
No ratings yet
CO3-Session-09 & 10
41 pages
Col334 A4
No ratings yet
Col334 A4
8 pages
A Critical Analysis of Caste System PDF
No ratings yet
A Critical Analysis of Caste System PDF
11 pages
Col333 A3 2
No ratings yet
Col333 A3 2
4 pages
Market Maker Citadel CA - Simulation Research-May-2022
No ratings yet
Market Maker Citadel CA - Simulation Research-May-2022
6 pages
5 Scheduling
No ratings yet
5 Scheduling
168 pages
Sheet 3
No ratings yet
Sheet 3
3 pages
Windows API Guide for VB Developers
No ratings yet
Windows API Guide for VB Developers
22 pages
1) / Write A Program To Print 0 01 010
No ratings yet
1) / Write A Program To Print 0 01 010
54 pages
Snowpro-Core LATEST
No ratings yet
Snowpro-Core LATEST
391 pages
Ah en Modbus TCP 8294 en 03 PDF
No ratings yet
Ah en Modbus TCP 8294 en 03 PDF
32 pages
Mid Term Exam Semester 1 PDF
No ratings yet
Mid Term Exam Semester 1 PDF
150 pages
9 Extracted Data Recovery Course
No ratings yet
9 Extracted Data Recovery Course
7 pages
Tutorial On File Organization: Comparison of File Organizations
No ratings yet
Tutorial On File Organization: Comparison of File Organizations
15 pages
FortiAnalyzer 5.4.1 Administration Guide
No ratings yet
FortiAnalyzer 5.4.1 Administration Guide
189 pages
DCN Lab Manual-JNUH Hyderabad
100% (3)
DCN Lab Manual-JNUH Hyderabad
69 pages
Transport Layer Essentials
No ratings yet
Transport Layer Essentials
12 pages
Chapter 2: Entity-Relationship Model
No ratings yet
Chapter 2: Entity-Relationship Model
71 pages
Blackboard Manage Data Sources
No ratings yet
Blackboard Manage Data Sources
52 pages
04-Hadoop Distributed File System
No ratings yet
04-Hadoop Distributed File System
56 pages
11g R1 Rac - Solaris - Installation Checklist
No ratings yet
11g R1 Rac - Solaris - Installation Checklist
9 pages
SQL Interview Questions
No ratings yet
SQL Interview Questions
44 pages
Embedded System Application: Roms, Non-Volatile and Flash Memories
No ratings yet
Embedded System Application: Roms, Non-Volatile and Flash Memories
66 pages
Data Sheet
No ratings yet
Data Sheet
23 pages
Java Programming Challenges
No ratings yet
Java Programming Challenges
86 pages
Android App Log: Module Updates
No ratings yet
Android App Log: Module Updates
55 pages
Data Analytics: Practice Exercises
50% (2)
Data Analytics: Practice Exercises
2 pages
Assignment Ch02
No ratings yet
Assignment Ch02
3 pages
Cloud Service - Merged
No ratings yet
Cloud Service - Merged
472 pages
DNS Record Types
No ratings yet
DNS Record Types
14 pages
Accessing MySQL Through Matlab
No ratings yet
Accessing MySQL Through Matlab
3 pages
TECHNOLOGY IN BANKING OPERATIONS Explain The Concept of Network Topology and Its Various Types
No ratings yet
TECHNOLOGY IN BANKING OPERATIONS Explain The Concept of Network Topology and Its Various Types
6 pages
SAP System Commands Guide
No ratings yet
SAP System Commands Guide
5 pages
Integration Manager User Guide PDF
No ratings yet
Integration Manager User Guide PDF
74 pages
Data Structures
No ratings yet
Data Structures
45 pages
3ee13 PLC Register
No ratings yet
3ee13 PLC Register
7 pages
Presentation On Hard Disks: By-Satyam Jhawar
No ratings yet
Presentation On Hard Disks: By-Satyam Jhawar
23 pages

Database Assignment 3

Uploaded by

Database Assignment 3

Uploaded by

COL362/COL632 Assignment 3

B+ Tree and Index Scan

Modified on: 16th March 2024

Task 1. Implement a B+ Tree

Task 2. Implement IndexScan in Calcite

• Parses the query and convert it into a relational algebra expression

• Executes the physical plan and return the result

“SELECT * FROM $table_name WHERE $col_name OP $value”

The structure of each block is as follows

Figure 2: Block Structure for Records

The structure of each record is as follows

Figure 3: Record Structure

Figure 4: Schema Block Structure

Figure 5: Each column data in schema

Whereas, each LeafNode looks as follows -

Figure 7: Leaf Node

• Clone the project from https://git.iitd.ac.in/dbsys/assignment_3.git

• Test that your code works

When submitting your patch:

You might also like