Database Topics Explained With Examples
Database Management Systems
This first part establishes the fundamental principles of database systems. It begins by
exploring the crucial transition from simple file-based storage to sophisticated Database
Management Systems (DBMS), highlighting the problems that drove this evolution. It then
dissects the internal architecture of a DBMS, explaining its core components, the levels of
abstraction that make it powerful, and the essential distinction between a database's
structure (schema) and its content (instance). The section concludes by examining the
languages used to interact with databases and the vital human roles responsible for their
management and use.
To understand the value of a modern database system, one must first appreciate the
limitations of its predecessor: the traditional file system. For decades, applications stored
their data in individual files, managed directly by the operating system. While simple, this
approach created significant challenges as applications grew in complexity and data became
more valuable. The emergence of the Database Management System (DBMS) was not merely
an improvement but a paradigm shift in how we manage and interact with information.
A file system is a method an operating system uses to control how data is stored and
retrieved. It organizes data in a hierarchy of files and directories (folders) on a storage device
like a hard disk.1 Think of it as a traditional office filing cabinet. Each drawer is a directory, and
each manila folder is a file. To find a specific piece of information—say, a customer's phone
number—you need to know which drawer and which folder it's in. You pull out the folder,
search through the papers manually, and hope the information is correct.
The transition from file systems to DBMS was driven by the need to solve a series of critical
problems that became bottlenecks for developing robust, multi-user applications. The DBMS
was engineered specifically to address these shortcomings.
The inherent limitations of file systems are best understood when contrasted with the features
provided by a DBMS.
The necessity for this evolution is rooted in the increasing complexity of software. Early,
single-user applications could tolerate the simplicity of file systems. However, the rise of
enterprise-level, multi-user systems where data is a shared, critical asset made the file
system's limitations untenable. A DBMS is not just a better storage system; it is an enabling
technology that made modern, data-driven applications possible by treating data as a
managed, valuable asset rather than a passive byproduct of a program.
However, it is important to recognize that the landscape is not a simple dichotomy. The
boundaries have become blurred with the advent of embedded databases like SQLite, which
is fundamentally a file-based database library used extensively in applications like web
browsers and mobile apps.3 This illustrates a critical engineering principle: the choice of data
storage is a trade-off. For a simple, single-user application, the overhead of a full-scale DBMS
is unnecessary; an embedded database provides the benefits of SQL and transactions
without the complexity of a client-server architecture. For a large-scale e-commerce
platform, anything less than a full-fledged DBMS would be inadequate. The modern data
environment is a spectrum of tools, and the architect's job is to select the right one for the
task at hand.
A DBMS is a complex piece of software with several interacting components that work
together to store, retrieve, and manage data. Understanding its internal architecture reveals
how it can provide powerful features like data independence, query optimization, and
concurrency control.
At the heart of any DBMS are two primary components that handle the core tasks of data
management and query execution.
1. Storage Manager: This is the module that interfaces with the operating system's file
system and is responsible for all low-level data operations. It translates commands from
the query processor into physical actions on the disk. Its key functions include managing
the physical storage of data, allocating and deallocating disk space, organizing data into
files and pages, managing the buffer (the area of main memory used to cache disk pages
for faster access), and maintaining data structures like indexes that speed up data
retrieval.6 The storage manager is the component that worries about the physical reality
of bits and bytes on a storage device.
2. Query Processor: This component acts as the "brain" of the DBMS. It is responsible for
understanding and executing user queries. When a user submits a query (e.g., in SQL),
the query processor first parses it to check for correct syntax, then optimizes it by
determining the most efficient way to execute it, and finally generates a sequence of
low-level instructions that it passes to the storage manager for execution.6 The
optimization step is crucial; for a complex query involving multiple tables, there can be
thousands of ways to execute it, and the optimizer's job is to find a plan that minimizes
resource usage (like disk I/O and CPU time).
Data Abstraction: Hiding the Complexity
One of the most important functions of a DBMS is to provide users with an abstract view of
the data, hiding the intricate details of how it is physically stored and maintained. This is
achieved through levels of abstraction, much like an onion with layers that shield the user
from the complex core.9
● Visual Representation: The Three Levels of Abstraction
Imagine three concentric circles. The innermost circle is the Physical Level, the middle is
the Logical Level, and the outermost is the View Level. The user interacts with the outer
level, completely shielded from the inner workings.
1. Physical Level (Internal Level): This is the lowest level of abstraction and describes
how the data is actually stored on the physical storage devices. It deals with complex,
low-level data structures like B+ trees and hashing methods, file organization, and
memory management details.9 This level is the concern of database system developers
and, to some extent, database administrators (DBAs) who tune performance.
2. Logical Level (Conceptual Level): This is the next level up, describing what data is
stored in the database and what relationships exist among that data. It presents the
entire database in terms of a small number of relatively simple structures, such as tables
(relations), columns (attributes), and constraints.9 Database administrators and
application developers work at this level. For example, a developer defines a
Students table with columns like StudentID, Name, and Major, without needing to know
how these records are physically arranged on a disk.
3. View Level (External Level): This is the highest level of abstraction and describes only a
part of the entire database, tailored to the needs of a particular user group. A view can
hide certain data for security purposes or present a simplified structure to make
interaction easier.9 For instance, a university registrar might see all student information,
while a faculty member's view might be restricted to only the students enrolled in their
courses, and it might hide sensitive information like financial aid status.
The purpose of these layers of abstraction is to achieve data independence. The logical level
hides the physical storage details, providing physical data independence. This means the
DBA can change the physical storage structures or devices (e.g., move the database to a
faster SSD, create a new index) to improve performance without requiring any changes to the
application programs that access the data.5 Similarly, the view level hides changes in the
logical structure, providing
logical data independence. This allows the DBA to change the conceptual schema (e.g.,
split a table into two, add a new column) without affecting applications that do not depend on
those changes.5 This separation is a cornerstone of modern software engineering, as it
dramatically reduces the long-term cost of system maintenance and evolution.
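As a concrete illustration, the minimal sketch below (object names are hypothetical; the Students and Enrollments tables mirror those used later in this guide) shows both ideas: a view gives faculty a restricted window onto the data, and an index changes the physical level without touching any query that uses the view.
SQL
-- View level: a restricted view for faculty, hiding sensitive student data.
CREATE VIEW CourseRoster AS
SELECT e.CourseID, s.StudentID, s.Name
FROM Students s
JOIN Enrollments e ON e.StudentID = s.StudentID;

-- Physical level: the DBA adds an index to speed up roster lookups.
-- Applications that query CourseRoster need no changes (physical data independence).
CREATE INDEX idx_enrollments_course ON Enrollments (CourseID);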
Two fundamental terms that describe the state and structure of a database are schema and
instance.
● Schema: The schema is the logical blueprint of the database. It is the overall design,
defining the tables, the columns within each table, the data types for each column, the
relationships between tables, and integrity constraints.4 The schema is designed during
the database design phase and is relatively static; it does not change frequently.12 To use
an analogy, the schema is the architect's empty blueprint for a building, detailing the
structure of rooms, doors, and windows, but containing no furniture or people.4
● Instance: An instance of a database is the actual data contained within it at a specific
point in time.4 It is a "snapshot" of the database's content. While the schema is stable,
the instance is highly dynamic, changing with every
INSERT, UPDATE, or DELETE operation.12 In our analogy, an instance is the building at a
particular moment, filled with furniture and occupied by people.4
The way applications connect to and interact with a database is defined by its architecture.
The two most common models are two-tier and three-tier.
1. Two-Tier Architecture: This is a simple client-server model where the application logic
resides on the client machine and communicates directly with the database server.15 For
example, a desktop application installed on your computer might connect directly to a
central database. While easy to develop for simple scenarios, this model is less scalable
and poses security risks, as the client has direct access to the database.
2. Three-Tier Architecture: This is the dominant architecture for modern web and
enterprise applications. It introduces a middle layer, the Application Tier, between the
client (Presentation Tier) and the database (Data Tier).15
○ Presentation Tier: This is the user interface that the end-user interacts with, such
as a web browser or a mobile app. Its job is to display information and collect input
from the user.18
○ Application Tier (Middle Tier): This layer contains the business logic of the
application. It receives requests from the presentation tier, processes them (e.g.,
validating data, performing calculations), and interacts with the data tier to store or
retrieve information. This is where the bulk of the application's work happens.16
○ Data Tier: This tier consists of the DBMS and the database itself. It is responsible for
storing and managing the data. Crucially, it is only accessible through the application
tier, never directly by the client.16
● Visual Representation: Three-Tier Architecture
A clear diagram shows the logical separation and communication flow: The User interacts
with the Presentation Tier, which communicates with the Application Tier. The Application
Tier then communicates with the Data Tier to fulfill the user's request.
This three-tier model offers significant advantages, including enhanced scalability (each tier
can be scaled independently), flexibility (a tier can be updated or replaced without affecting
the others), and improved security (the data tier is shielded from direct external access).17
To interact with a database—to define its structure, manipulate its data, and control
access—users employ a specialized set of commands. In relational databases, these
commands are part of the Structured Query Language (SQL). SQL is not a single monolithic
language but is composed of several sub-languages, each with a distinct function.
● Data Definition Language (DDL): DDL commands are used to define and manage the
database structure, or schema. They create, modify, and delete database objects like
tables, indexes, and views.19 Because these commands alter the fundamental structure of
the database, they are typically used by database administrators and designers.
○ CREATE: To build new database objects (e.g., CREATE TABLE Students (...)).
○ ALTER: To modify the structure of existing objects (e.g., ALTER TABLE Students ADD
COLUMN GPA DECIMAL(3,2);).
○ DROP: To permanently delete objects (e.g., DROP TABLE Students;).
○ TRUNCATE: To remove all data from a table quickly, without deleting the table
structure itself.20
● Data Manipulation Language (DML): DML commands are used to manage the data
within the schema objects—that is, to interact with the database instance.21 These are
the everyday commands used by applications and users to perform operations on the
data.
○ SELECT: To retrieve data from the database. This is the most frequently used SQL
command.
○ INSERT: To add new rows of data into a table (e.g., INSERT INTO Students VALUES
(...);).
○ UPDATE: To modify existing data in a table (e.g., UPDATE Students SET Major =
'Computer Science' WHERE StudentID = 123;).
○ DELETE: To remove rows of data from a table (e.g., DELETE FROM Students WHERE
StudentID = 123;).
● Data Control Language (DCL): DCL commands are concerned with rights, permissions,
and other controls of the database system. They are used by DBAs to manage user
access to data.20
○ GRANT: To give a specific user permission to perform certain tasks (e.g., GRANT
SELECT, INSERT ON Students TO 'professor_smith';).
○ REVOKE: To take away permissions from a user (e.g., REVOKE DELETE ON Students
FROM 'intern_user';).
● Transaction Control Language (TCL): TCL commands are used to manage transactions
in the database, ensuring that work is done reliably and data integrity is maintained.19
○ COMMIT: To save all the work done in the current transaction, making the changes
permanent.
○ ROLLBACK: To undo all the work done in the current transaction, restoring the
database to its state before the transaction began.
○ SAVEPOINT: To set a point within a transaction to which you can later roll back,
without rolling back the entire transaction.
Sub-language | Purpose | Example Commands
DDL (Data Definition Language) | Defines and manages the database structure (schema) | CREATE, ALTER, DROP, TRUNCATE
DML (Data Manipulation Language) | Manages the data within schema objects (the instance) | SELECT, INSERT, UPDATE, DELETE
DCL (Data Control Language) | Controls rights and permissions for accessing data | GRANT, REVOKE
TCL (Transaction Control Language) | Manages transactions to keep work reliable and consistent | COMMIT, ROLLBACK, SAVEPOINT
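The four sub-languages can also be seen working together in one short script. The following is a minimal sketch: table, column, and user names are illustrative, and the exact syntax for ALTER TABLE, GRANT, and transaction control varies slightly between database products.
SQL
-- DDL: define and adjust the schema.
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    Name      VARCHAR(100) NOT NULL,
    Major     VARCHAR(50)
);
ALTER TABLE Students ADD COLUMN GPA DECIMAL(3,2);

-- DML: work with the instance.
INSERT INTO Students (StudentID, Name, Major, GPA)
VALUES (123, 'Ada Lovelace', 'Mathematics', 3.90);
UPDATE Students SET Major = 'Computer Science' WHERE StudentID = 123;
SELECT Name, Major FROM Students WHERE GPA >= 3.5;

-- DCL: control who may do what.
GRANT SELECT, INSERT ON Students TO 'professor_smith';

-- TCL: group changes into one reliable unit of work.
BEGIN;
UPDATE Students SET GPA = 4.00 WHERE StudentID = 123;
SAVEPOINT after_gpa;
DELETE FROM Students WHERE StudentID = 999;
ROLLBACK TO SAVEPOINT after_gpa;  -- undo only the DELETE
COMMIT;                           -- make the GPA change permanent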
A profound aspect of SQL, particularly its DML component, is its declarative nature.22 When
a user writes a query like
SELECT Name FROM Students WHERE Major = 'Physics', they are declaring what data they
want, not prescribing the step-by-step procedure for how to get it. The procedural
work—deciding which index to use, the order in which to access tables, and the algorithm for
filtering the data—is handled by the DBMS's query optimizer. This separation of user intent
from execution strategy is a key reason for SQL's enduring power and longevity. It allows the
underlying database technology to evolve and improve its performance without requiring any
changes to the millions of applications that rely on it.
A database system is not just technology; it is a resource used and managed by people in
various roles. Understanding these roles is essential to appreciating how a database functions
within an organization.
Database users can be categorized based on their level of technical expertise and how they
interact with the system.23
1. Naive Users: This is the largest group of users. They are typically not aware that they are
using a database. They interact with the system through pre-written application
programs with simple graphical user interfaces.24 Examples include a bank teller using a
terminal to process a withdrawal, a customer using an ATM, or someone booking a flight
on a website.
2. Application Programmers: These are the software developers who write the application
programs that naive users interact with. They are skilled professionals who use
programming languages (like Java, Python, or C#) in conjunction with database
commands (like SQL) to create the user interfaces and business logic of the system.24
3. Sophisticated Users: These users interact with the database directly, without writing
application programs. They are typically engineers, scientists, or business analysts who
are proficient in writing complex SQL queries to perform ad-hoc data analysis.23 They use
tools like SQL clients or data analysis software to explore the data and generate reports.
4. Specialized Users: These users write specialized database applications that do not fit
into the traditional data-processing framework. Their applications often involve complex
data types, such as those for computer-aided design (CAD), geographic information
systems (GIS), or multimedia databases.23
The Database Administrator (DBA) is the person or team responsible for the overall
management of the database system. The DBA is the guardian of the data, ensuring its
integrity, security, and availability.24 The role is multifaceted and requires a blend of technical
expertise and strategic thinking.
The DBA role is not just a technical one; it is a strategic function that acts as a bridge between
the technology, the business, and the users. A successful DBA must understand the technical
intricacies of the DBMS, the data needs of the application developers, and the overarching
business requirements for data security, availability, and performance. They ensure that the
database, one of the most critical assets of any modern organization, is not only functioning
correctly but is also aligned with and actively supporting the organization's strategic goals.
For students seeking to build a strong theoretical foundation in the concepts covered in this
module, the following texts are highly recommended:
● For Comprehensive Theory: Database System Concepts by Abraham Silberschatz,
Henry F. Korth, and S. Sudarshan. Often referred to as the "Sailboat Book," this is a
cornerstone academic text that provides rigorous and in-depth coverage of fundamental
database concepts. It is ideal for those who want a deep theoretical understanding.30
● For a Balanced Approach: Fundamentals of Database Systems by Ramez Elmasri and
Shamkant B. Navathe. This is another classic textbook widely used in university courses. It
is known for its broad coverage, clear explanations of both theory and design, and its
inclusion of real-world examples, making it highly accessible for students.30
Before a single line of code is written or a table is created, a database must be designed. Data
modeling is the process of creating a conceptual representation of the data that an
organization needs to store and the relationships between different data elements. It is
arguably the most critical step in building a successful database application. A well-designed
model leads to a database that is efficient, easy to maintain, and scalable, while a poor model
can lead to a system plagued by performance issues and data anomalies. This part focuses on
the Entity-Relationship (E-R) model, the industry standard for conceptual data modeling, and
also introduces the alternative data models offered by NoSQL databases.
The Entity-Relationship (E-R) model is a high-level, conceptual data modeling approach that
provides a graphical way to view data, making it an effective tool for database design.34 It
allows designers to create a blueprint of the database based on real-world objects and their
associations, which can then be translated into a relational database schema. The E-R model
is built upon three fundamental concepts: entities, attributes, and relationships.
● Entity: An entity is a real-world object or concept that is distinguishable from other
objects and about which we want to store data. Entities are the "nouns" of our database
model.35 In a university database, examples of entities would be
Student, Course, and Instructor. An entity set is a collection of similar entities (e.g., all
the students in the university).34 In an E-R diagram, an entity is represented by a
rectangle.
○ Strong Entity: An entity that has a primary key attribute that uniquely identifies each
instance. Most entities are strong entities.35
○ Weak Entity: An entity that cannot be uniquely identified by its own attributes alone
and relies on a relationship with another (owner) entity for its identity. For example, a
Dependent entity might only be identifiable in the context of the Employee it belongs
to. A weak entity is represented by a double-lined rectangle.34
● Attribute: An attribute is a property or characteristic that describes an entity. Attributes
are the "adjectives" or "properties" of the entities.35 For the
Student entity, attributes might include StudentID, Name, and Major. In an E-R diagram,
attributes are represented by ovals connected to their entity. There are several types of
attributes:
○ Key Attribute: An attribute whose value is unique for each entity instance. This is
used as the primary key and is typically underlined in the diagram.35
○ Composite Attribute: An attribute that can be subdivided into smaller sub-parts. For
example, an Address attribute could be composed of Street, City, and ZipCode.35
○ Multivalued Attribute: An attribute that can hold multiple values for a single entity
instance. For example, a PhoneNumber attribute for a student could hold both a
home and a mobile number. This is represented by a double-lined oval.35
○ Derived Attribute: An attribute whose value can be calculated or derived from
another attribute. For example, a student's Age can be derived from their
DateOfBirth. This is represented by a dashed oval.39
● Relationship: A relationship represents an association between two or more entities.
Relationships are the "verbs" that connect the entities.35 For example, a
Student entity is associated with a Course entity through an "enrolls in" relationship. In an
E-R diagram, a relationship is represented by a diamond.
○ Cardinality Constraints: These constraints define the numerical nature of the
relationship, specifying how many instances of one entity can be related to instances
of another entity. The most common cardinalities are 34:
■ One-to-One (1:1): One instance of entity A can be associated with at most one
instance of entity B, and vice versa. (e.g., One Department has one Head).
■ One-to-Many (1:M): One instance of entity A can be associated with many
instances of entity B, but each instance of B is associated with only one instance
of A. (e.g., One Instructor teaches many Courses).
■ Many-to-Many (M:N): Many instances of entity A can be associated with many
instances of entity B. (e.g., Many Students enroll in many Courses).
The graphical representation of the E-R model is the Entity-Relationship Diagram (E-R
Diagram). Its primary purpose is to serve as a visual blueprint that helps database designers,
developers, and business stakeholders communicate and refine the database structure before
it is implemented.34 A clear E-R diagram provides a preview of how tables will connect and
what fields they will contain, allowing for the identification of potential design flaws early in
the development process.34
Let's construct a basic E-R diagram for the context provided in the syllabus, Vivekananda
Global University.
1. Identify Entities: The core objects in a university setting are Student, Course, Instructor,
and Department.41 These will be our primary entities, represented by rectangles.
2. Define Attributes: Next, we define the properties for each entity and identify their
primary keys (underlined).
○ Student: StudentID, Name, Address (composite: Street, City, State), DateOfBirth
○ Course: CourseID, Title, Credits
○ Instructor: InstructorID, Name, OfficeNumber
○ Department: DeptID, DeptName
3. Establish Relationships: We now define the associations between these entities using
diamonds.
○ A Student enrolls in a Course.
○ An Instructor teaches a Course.
○ A Course is offered by a Department.
○ An Instructor belongs to a Department.
4. Determine Cardinality: We analyze the numerical constraints of each relationship.
○ Student enrolls in Course: A student can enroll in multiple courses, and a course
can have multiple students enrolled. This is a Many-to-Many (M:N) relationship.
○ Instructor teaches Course: Let's assume for simplicity that one instructor can
teach multiple courses, but each course is taught by only one instructor. This is a
One-to-Many (1:M) relationship from Instructor to Course.
○ Department offers Course: A department can offer many courses, but each course
is offered by only one department. This is a One-to-Many (1:M) relationship.
○ Instructor belongs to Department: An instructor belongs to one department, and a
department has many instructors. This is also a One-to-Many (1:M) relationship.
5. Draw the Final Diagram: Combining these elements results in the following E-R
diagram:
This diagram serves as a clear, high-level blueprint. From this visual model, a database
designer can proceed to the next stage: translating these conceptual entities and
relationships into a logical schema of relational tables, primary keys, and foreign keys.
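A sketch of that translation is shown below, using the entities and attributes listed above. The constraint syntax is standard SQL (some systems require a separate FOREIGN KEY clause instead of the inline REFERENCES shorthand), and it illustrates the usual choices: each one-to-many relationship becomes a foreign key, the many-to-many "enrolls in" relationship becomes a junction table, the composite Address attribute is flattened into columns, a multivalued attribute such as PhoneNumber gets its own table, and the derived Age attribute is simply not stored.
SQL
CREATE TABLE Departments (
    DeptID   INT PRIMARY KEY,
    DeptName VARCHAR(100) NOT NULL
);

CREATE TABLE Instructors (
    InstructorID INT PRIMARY KEY,
    Name         VARCHAR(100) NOT NULL,
    OfficeNumber VARCHAR(20),
    DeptID       INT REFERENCES Departments(DeptID)        -- "belongs to" (1:M)
);

CREATE TABLE Courses (
    CourseID     INT PRIMARY KEY,
    Title        VARCHAR(100) NOT NULL,
    Credits      INT,
    DeptID       INT REFERENCES Departments(DeptID),       -- "offered by" (1:M)
    InstructorID INT REFERENCES Instructors(InstructorID)  -- "teaches" (1:M)
);

CREATE TABLE Students (
    StudentID   INT PRIMARY KEY,
    Name        VARCHAR(100) NOT NULL,
    Street      VARCHAR(100),  -- composite Address flattened into columns
    City        VARCHAR(50),
    State       VARCHAR(50),
    DateOfBirth DATE           -- Age is derived, so it is not stored
);

-- The M:N "enrolls in" relationship becomes a junction table.
CREATE TABLE Enrollments (
    StudentID INT REFERENCES Students(StudentID),
    CourseID  INT REFERENCES Courses(CourseID),
    PRIMARY KEY (StudentID, CourseID)
);

-- A multivalued attribute such as PhoneNumber gets its own table.
CREATE TABLE StudentPhones (
    StudentID   INT REFERENCES Students(StudentID),
    PhoneNumber VARCHAR(20),
    PRIMARY KEY (StudentID, PhoneNumber)
);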
While the E-R model and the resulting relational database have been the dominant paradigm
for decades, the rise of the internet, big data, and applications requiring massive scalability
led to the development of a new class of databases known as NoSQL ("Not Only SQL"). These
databases are designed to handle use cases where the rigid schema and strict consistency of
relational databases can be a limitation.43 They excel at managing large volumes of
unstructured or semi-structured data and are typically designed to scale out horizontally (by
adding more servers) rather than scaling up (by using a more powerful single server).
There are several major categories of NoSQL data models, each suited to different types of
problems:
1. Document Databases: These databases store data in flexible, semi-structured
documents, most commonly in a format like JSON (JavaScript Object Notation).43 Each
document can have its own unique structure. This model is very intuitive for developers
as it maps directly to objects in application code.
○ Example Use Cases: Content management systems, blogging platforms,
e-commerce product catalogs, user profiles.
○ Prominent System: MongoDB.
2. Key-Value Stores: This is the simplest NoSQL data model. Data is stored as a collection
of key-value pairs, where each key is unique and is used to retrieve its corresponding
value.43 The value can be anything from a simple string to a complex object.
○ Example Use Cases: Caching web pages or query results, storing user session
information for web applications, real-time bidding systems.
○ Prominent Systems: Redis, Amazon DynamoDB.
3. Column-Family (or Wide-Column) Stores: These databases store data in tables with
rows and columns, but unlike relational databases, the names and format of the columns
can vary from row to row in the same table.43 They are optimized for queries over large
datasets and are highly scalable for write-heavy workloads.
○ Example Use Cases: Big data analytics, recommendation engines, event logging,
systems that require heavy write throughput.
○ Prominent Systems: Apache Cassandra, Apache HBase.
4. Graph Databases: These databases are purpose-built to store and navigate
relationships. Data is modeled as nodes (entities), edges (relationships), and properties
(attributes).43 They are designed to efficiently handle complex queries that explore the
connections between data points.
○ Example Use Cases: Social networks (e.g., finding friends of friends), fraud
detection (e.g., identifying complex rings of fraudulent activity), logistics and network
management, recommendation engines (e.g., "customers who bought this also
bought...").
○ Prominent Systems: Neo4j, Amazon Neptune.
The introduction of NoSQL in a database curriculum immediately following the E-R model is
significant. It highlights a fundamental shift in the world of data management. The choice of a
data model is no longer automatically "relational." Instead, it is a critical architectural decision
that must be driven by the specific requirements of the application. A modern data architect
needs to be proficient in multiple data models—a concept sometimes called "polyglot
persistence"—to select the right tool for the right job. For a system requiring complex
transactions and strong data integrity, like a banking application, the relational model remains
the superior choice. For a social media application that needs to manage a vast,
interconnected network of users and scale to millions of requests per second, a graph
database is a more natural and efficient fit.
Once a database is designed and populated with data, its primary purpose is to provide
information through queries. A query language is the interface used to communicate with the
database, allowing users to retrieve, insert, update, and delete data. For relational databases,
the universal standard is SQL (Structured Query Language). This part delves into the structure
of SQL queries, from basic data retrieval to complex operations that combine data from
multiple tables, and concludes with an introduction to the concept of query optimization.
SQL is a powerful, declarative language used to manage and query data in a relational
database. Its syntax is designed to be readable and expressive, resembling natural English.
The cornerstone of data retrieval in SQL is the SELECT statement. A basic query has a
well-defined structure composed of several clauses that are processed in a logical order.
● Core Clauses:
○ SELECT: Specifies the columns (attributes) you want to retrieve. Using an asterisk (*)
selects all columns.49
○ FROM: Specifies the table (relation) from which to retrieve the data.49
○ WHERE: Filters the rows based on a specified condition. Only rows that satisfy the
condition are included in the result.49
● Optional Clauses for Grouping and Sorting:
○ GROUP BY: Groups rows that have the same values in specified columns into
summary rows. It is often used with aggregate functions.49
○ HAVING: Filters the results of a GROUP BY clause. While WHERE filters rows before
aggregation, HAVING filters groups after aggregation.49
○ ORDER BY: Sorts the final result set in ascending (ASC) or descending (DESC) order
based on one or more columns.49
To find the names of all students majoring in 'Computer Science' and order them
alphabetically by last name, the query would be:
SQL
SELECT FirstName, LastName
FROM Students
WHERE Major = 'Computer Science'
ORDER BY LastName ASC;
Aggregate Functions
Aggregate functions perform a calculation on a set of values and return a single value. They
are frequently used with the GROUP BY clause to generate summary reports.20
● COUNT(): Returns the number of rows. COUNT(*) counts all rows, while
COUNT(column_name) counts non-NULL values in that column.
● SUM(): Returns the total sum of a numeric column.
● AVG(): Returns the average value of a numeric column.
● MAX(): Returns the largest value in a column.
● MIN(): Returns the smallest value in a column.
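For instance, assuming the Students table with the Major and GPA columns from the DDL examples earlier, the following query (a sketch) combines aggregates with GROUP BY and HAVING to report only the larger majors:
SQL
SELECT Major, COUNT(*) AS NumStudents, AVG(GPA) AS AvgGPA
FROM Students
GROUP BY Major
HAVING COUNT(*) > 10
ORDER BY NumStudents DESC;
Note that the size filter belongs in HAVING rather than WHERE, because it applies to each group after aggregation.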
In SQL, NULL is a special marker used to indicate that a data value does not exist in the
database. It is not the same as zero or an empty string. Because NULL represents an unknown
value, it cannot be compared using standard comparison operators like = or !=. Instead, the IS
NULL and IS NOT NULL operators must be used to test for NULL values.51
Example: To find all students who have not yet declared a major:
SQL
SELECT Name
FROM Students
WHERE Major IS NULL;
The true power of a relational database lies in its ability to store related data in separate
tables and then combine that data on the fly to answer complex questions. This is
accomplished using joins. A JOIN clause combines rows from two or more tables based on a
related column between them, typically a foreign key in one table that references a primary
key in another.53
Venn diagrams are an excellent way to visualize how different join types work, showing which records are included from two tables, Table A (left) and Table B (right).56
● INNER JOIN: Returns only the rows where the join condition finds a match in both tables (the overlap of A and B).
● LEFT (OUTER) JOIN: Returns all rows from Table A together with the matching rows from Table B; where no match exists, the columns from B are filled with NULL.
● RIGHT (OUTER) JOIN: The mirror image of a LEFT JOIN, returning all rows from Table B and the matching rows from Table A.
● FULL (OUTER) JOIN: Returns all rows from both tables, with NULLs wherever a row in one table has no match in the other.
● CROSS JOIN: Returns the Cartesian product of the two tables—every row from the first
table is paired with every row from the second table. This is rarely used in practice with a
WHERE clause, but can be useful for generating combinatorial data.59
● SELF JOIN: This is not a distinct join type but a technique where a table is joined with
itself. It is useful for querying hierarchical data or comparing rows within the same table.
For example, you could use a self-join on an Employees table to find all employees who
have the same manager.59
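A minimal sketch of that self-join, assuming a hypothetical Employees table with EmployeeID, Name, and ManagerID columns:
SQL
SELECT e1.Name AS Employee, e2.Name AS Colleague
FROM Employees e1
JOIN Employees e2
  ON  e1.ManagerID  = e2.ManagerID    -- same manager
  AND e1.EmployeeID < e2.EmployeeID;  -- avoid self-pairs and duplicate pairs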
Example: Using our university database, let's say we have a Students table and an
Enrollments table. To get a list of all students and the courses they are enrolled in, we would
use an INNER JOIN:
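A minimal sketch, assuming the Students table has StudentID and Name columns and the Enrollments table has StudentID and CourseID:
SQL
SELECT s.StudentID, s.Name, e.CourseID
FROM Students s
INNER JOIN Enrollments e ON e.StudentID = s.StudentID;
Only students with at least one matching Enrollments row appear in this result.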
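If every student should appear, including those with no enrollments at all, a LEFT JOIN keeps the unmatched rows from the Students table (same assumed columns):
SQL
SELECT s.StudentID, s.Name, e.CourseID
FROM Students s
LEFT JOIN Enrollments e ON e.StudentID = s.StudentID;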
For students who are not enrolled, the CourseID column in the result would be NULL.
Beyond basic SELECT statements and joins, SQL offers capabilities for constructing highly
complex queries.
● Nested Queries and Subqueries: A subquery is a SELECT statement nested inside
another SQL statement (e.g., inside a WHERE or FROM clause).52 They allow for
sophisticated, multi-step filtering. For example, to find all students enrolled in the
'Advanced Databases' course, one could first use a subquery to find the CourseID for that course title, as sketched after this list.
● Integrity Constraints: While not query commands, integrity constraints like PRIMARY
KEY, FOREIGN KEY, and NOT NULL are defined using DDL and are crucial for ensuring the
logical consistency of the data that queries operate on.20 A
FOREIGN KEY constraint, for instance, ensures that a value in one table refers to a valid,
existing primary key in another table, making joins meaningful and reliable.
● Query Optimization: As mentioned earlier, the DBMS does not execute a query exactly
as it is written. The query optimizer, a critical component of the query processor,
analyzes the query and determines the most efficient execution plan.6 It considers
factors like available indexes, table sizes, and data distribution statistics to decide on the
best join order, access methods, and algorithms. This process is what allows a high-level,
declarative language like SQL to perform efficiently on complex databases.
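The subquery mentioned above, and the kind of FOREIGN KEY constraint that keeps such lookups reliable, might look like the following sketch (table and constraint names follow the running university example and are illustrative):
SQL
-- Subquery: find the students enrolled in 'Advanced Databases'.
SELECT s.Name
FROM Students s
JOIN Enrollments e ON e.StudentID = s.StudentID
WHERE e.CourseID IN (
    SELECT c.CourseID
    FROM Courses c
    WHERE c.Title = 'Advanced Databases'
);

-- Integrity constraint: every enrollment must reference an existing course.
ALTER TABLE Enrollments
    ADD CONSTRAINT fk_enrollments_course
    FOREIGN KEY (CourseID) REFERENCES Courses (CourseID);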
It is here that the connection between data modeling and data retrieval becomes clear. The
abstract relationships defined in the E-R model during the design phase are made concrete
and actionable through the JOIN clauses in SQL queries. A line connecting Student and
Course in an E-R diagram becomes an ON Students.StudentID = Enrollments.StudentID clause
in a query. A well-structured E-R model, which accurately captures the real-world
relationships, translates directly into a database schema where logical and efficient joins are
possible. Conversely, a poorly designed model leads to a schema that requires convoluted,
inefficient, or sometimes impossible queries to get the needed information, reinforcing the
paramount importance of thoughtful data modeling.
To master the art of writing effective SQL queries, from the basics to advanced techniques,
the following books are highly recommended:
● For a Quick Start: SQL in 10 Minutes, Sams Teach Yourself by Ben Forta. This book is
renowned for its concise, lesson-based approach that helps beginners quickly grasp the
essential syntax and commands of SQL.30
● For Deeper Understanding: SQL Queries for Mere Mortals: A Guide to Data
Manipulation in SQL by John L. Viescas. This classic text goes beyond simple syntax to
teach the logic and thought process behind constructing powerful and accurate queries.
It is an excellent resource for moving from a basic user to a proficient query writer.61
A robust and efficient database is not an accident; it is the result of a deliberate and
principled design process. The central goal of this process is to create a structure that
minimizes data redundancy and protects data integrity. The primary technique used to
achieve this is normalization, a systematic process for organizing the columns and tables in a
relational database to reduce data duplication and eliminate undesirable characteristics. This
part explores the problems that arise from poor design, explains the theory of functional
dependencies that underpins normalization, and provides a step-by-step guide through the
most common normal forms.
When a database schema is not properly designed, it often contains redundant data—the
same piece of information is stored in multiple places. This redundancy is not just inefficient; it
is dangerous because it leads to data anomalies, which are inconsistencies or errors that
occur when a user attempts to insert, update, or delete data.63 Normalization is the formal
process of decomposing tables to eliminate these anomalies.66
These anomalies are all symptoms of a single underlying problem: a poorly structured schema
where a single table is used to store facts about multiple different entities (students, courses,
instructors). The solution is normalization.
We will now walk through the process of normalizing a single, unnormalized table for student course registrations to achieve a well-structured design. The starting table holds a StudentID and StudentName for each student, plus a Courses column that lists every course the student has registered for, along with each course's CourseName and Instructor.
This table violates the basic principles of a relational model because the Courses column contains a repeating group of multiple values. Bringing it into First Normal Form (1NF) means splitting that repeating group so that each student-course combination occupies its own row and every column holds a single, atomic value.
Anomalies still exist: We have solved the atomicity problem, but now we have significant
redundancy. StudentName is repeated for John Doe, and CourseName and Instructor are
repeated for CS101. This table still suffers from update, insertion, and deletion anomalies.
To achieve Second Normal Form (2NF), which requires that every non-key attribute depend on the whole primary key rather than on just part of it, we decompose the table into smaller tables, separating the partially dependent attributes.
● Students Table:
StudentID | StudentName
● Courses Table:
CourseID | CourseName | Instructor
● Enrollments Table:
StudentID | CourseID
101 | CS101
101 | MATH203
102 | CS101
Anomalies are reduced: We can now add a new student without them being enrolled in a course, and add a new course before any students enroll. However, the Courses table still has a problem: it stores facts about instructors alongside facts about courses, so attributes that describe only the instructor depend on the key only indirectly (a transitive dependency). Reaching Third Normal Form (3NF) means moving instructor details into their own table.
● Instructors Table:
(The Students and Enrollments tables remain the same). Now, each table contains facts about
only one entity, and all non-key attributes depend only on the primary key.
The process of normalization involves breaking down (decomposing) a single large table into
multiple smaller tables. For this process to be correct, the decomposition must have two
crucial properties.
Properties of Decomposition
1. Lossless Join Decomposition: The decomposition must be "lossless," meaning that if
we join the decomposed tables back together, we must be able to perfectly reconstruct
the original table without creating any extra (spurious) rows or losing any original rows.76
A decomposition of a relation R into R1 and R2 is lossless if the intersection of their attributes is a key for at least one of them.76 A worked illustration follows this list.
2. Dependency Preserving Decomposition: The decomposition must preserve all the
original functional dependencies. This means that every FD from the original table must
be logically implied by the FDs in the individual decomposed tables.76 This is important
because it allows the database to enforce the original business rules by checking
constraints on the smaller, more efficient tables.
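To see the lossless-join property from point 1 in practice, consider the decomposition produced in the normalization walkthrough above. Because StudentID is the key of the Students table and CourseID is the key of the Courses table, joining the pieces back together reproduces the original wide table exactly, with no spurious rows (a sketch, assuming those table definitions):
SQL
SELECT s.StudentID, s.StudentName, c.CourseID, c.CourseName, c.Instructor
FROM Enrollments e
JOIN Students s ON s.StudentID = e.StudentID   -- {StudentID} is a key of Students
JOIN Courses  c ON c.CourseID  = e.CourseID;   -- {CourseID} is a key of Courses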
While it is always possible to achieve a lossless join decomposition that is in BCNF, it is not
always possible for that decomposition to also be dependency-preserving. In such cases,
designers might choose to stick with a 3NF design that preserves dependencies over a BCNF
design that does not.
Beyond BCNF, there are higher normal forms that address more complex data dependencies.
● Fourth Normal Form (4NF): Deals with multi-valued dependencies. A table is in 4NF if
it is in BCNF and has no multi-valued dependencies. This arises when a table has multiple
independent one-to-many relationships, which should be separated into their own
tables.81
● Fifth Normal Form (5NF): Deals with join dependencies. It is designed to reduce
redundancy in databases that contain multi-valued facts by isolating them semantically.
A table is in 5NF if it cannot be decomposed into any number of smaller tables without
loss of information.81
In practice, achieving 3NF or BCNF is sufficient for the vast majority of database designs.
It is critical to understand that normalization is not a goal in and of itself, but a tool to achieve
a better design. In the real world, there is a fundamental trade-off between update efficiency
and query performance. A highly normalized database (e.g., in 3NF/BCNF) minimizes
redundancy, which makes updates, insertions, and deletions very efficient and safe from
anomalies. However, it also means that retrieving data often requires joining many small tables
together, which can be computationally expensive.
For this reason, while Online Transaction Processing (OLTP) systems (like e-commerce
checkout systems or banking applications) are almost always highly normalized to ensure
data integrity, Online Analytical Processing (OLAP) systems (like data warehouses used for
business intelligence) are often intentionally denormalized. They use designs like star
schemas, which introduce controlled redundancy to reduce the number of joins required for
complex analytical queries, thereby dramatically improving query performance. The "best"
level of normalization is not always the highest possible level, but rather the level that best
serves the specific workload of the application.
To fully grasp the theory and practice of normalization and relational database design, this
book is an essential resource:
● For Theory and Practice: Database Systems: The Complete Book by Hector
Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. This text offers a rigorous and
comprehensive treatment of relational design theory, including detailed explanations of
functional dependencies, the process of normalization through all major forms, and the
algorithms for ensuring lossless and dependency-preserving decompositions.46
In a multi-user database system, many users and applications may be trying to read and write
data at the same time. The part of the DBMS that ensures these concurrent operations do not
corrupt the data and that the database remains in a consistent state, even in the face of
system failures, is the transaction manager. This final part explores the concept of a
transaction, the ACID properties that guarantee its reliability, the problems that arise from
concurrent access, and the mechanisms used to control concurrency and recover from
failures.
To ensure that transactions are processed reliably, a DBMS must guarantee four key
properties, known by the acronym ACID.85 The classic example used to illustrate these
properties is a bank transfer of $100 from Account A to Account B.
1. Atomicity: This property ensures that a transaction is an "atomic" or indivisible unit.
Either all of its operations are executed, or none are.
○ Bank Transfer Example: The transaction consists of two operations: debiting $100
from Account A and crediting $100 to Account B. Atomicity guarantees that if the
system crashes after the debit but before the credit, the entire transaction will be
rolled back, and the $100 will be returned to Account A. The database is never left in
a state where the money has vanished.85
2. Consistency: This property ensures that a transaction brings the database from one
valid state to another, preserving all predefined rules and constraints.84
○ Bank Transfer Example: A business rule might be that the total sum of money
across all accounts must remain constant. The transfer operation moves money but
does not change the total. The consistency property ensures that the database is in
a consistent state both before the transaction begins and after it commits.87
3. Isolation: This property ensures that concurrently executing transactions cannot
interfere with each other. From the perspective of any single transaction, it should
appear as if it is the only transaction running on the system.84
○ Bank Transfer Example: If another transaction is calculating the total assets of the
bank while our transfer is in progress, isolation ensures it will not read the accounts
in an intermediate state (e.g., after the debit from A but before the credit to B). It will
either see the state before the transfer or the state after it, but never an inconsistent
state in between.85
4. Durability: This property guarantees that once a transaction has been successfully
completed (committed), its changes are permanent and will survive any subsequent
system failure, such as a power outage or server crash.84
○ Bank Transfer Example: Once the transfer is complete and the user receives a
confirmation, durability ensures that the new account balances are permanently
recorded. Even if the bank's servers crash moments later, the record of the
transaction will not be lost.85
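A sketch of the transfer as a SQL transaction, assuming a hypothetical Accounts table with AccountID and Balance columns (transaction syntax varies slightly between systems):
SQL
BEGIN;  -- start the atomic unit of work

UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';  -- debit
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 'B';  -- credit

COMMIT;  -- both changes become permanent together (Atomicity, Durability)
-- On any error before COMMIT, issue ROLLBACK instead to undo the debit.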
Transaction States
During its execution, a transaction passes through several states, which describe its lifecycle
from start to finish.
● Visual Representation: Transaction State Diagram
A state diagram illustrates the lifecycle of a transaction, showing how it moves from one
state to another.
● The States:
1. Active: The initial state where the transaction is executing its operations (e.g., READ,
WRITE).89
2. Partially Committed: After the final statement of the transaction has been
executed. At this point, the changes are still in a temporary buffer in main memory
and have not yet been permanently written to the database.89
3. Committed: After the changes have been successfully and permanently stored in the
database. The transaction has completed successfully.89
4. Failed: The state entered if the transaction cannot proceed normally due to an error
(e.g., hardware failure, violation of a constraint).89
5. Aborted: The state after the transaction has failed and all its changes have been
rolled back, restoring the database to its state prior to the transaction's start.89
6. Terminated: The final state of the transaction, reached after it has been either
committed or aborted.92
Section 5.2: Managing Concurrent Access
To prevent concurrency problems such as lost updates and dirty reads, and to enforce the ACID properties, particularly Isolation, DBMSs use concurrency control techniques. To enforce Atomicity and Durability, they use recovery techniques.
● Locking Mechanisms: This is the most common approach. Before a transaction can
access a data item, it must acquire a lock on it. Locks can be shared (read-only) or
exclusive (read-write). If a transaction holds an exclusive lock on an item, no other
transaction can access it until the lock is released. This prevents dirty reads and lost
updates (see the sketch after this list).
● Timestamp Ordering: Each transaction is assigned a unique timestamp when it starts.
The DBMS then ensures that any conflicting operations are executed in timestamp order,
which prevents inconsistencies without the overhead of locking.
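The locking approach can be made explicit in SQL with SELECT ... FOR UPDATE inside a transaction. The sketch below reuses the hypothetical Accounts table; exact locking syntax and default behavior vary by DBMS.
SQL
BEGIN;

-- Acquire an exclusive lock on account A's row before reading and changing it.
SELECT Balance
FROM Accounts
WHERE AccountID = 'A'
FOR UPDATE;

-- Any other transaction that tries to update or lock this row now waits.
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';

COMMIT;  -- the lock is released when the transaction ends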
Deadlock
A serious problem that can arise with locking is deadlock. This occurs when two or more
transactions are in a circular waiting pattern, each waiting for a resource that is locked by
another transaction in the cycle.98
● Scenario: Transaction T1 locks Record A and requests a lock on Record B.
Simultaneously, Transaction T2 locks Record B and requests a lock on Record A. T1
cannot proceed until T2 releases B, and T2 cannot proceed until T1 releases A. They are
stuck in a permanent standoff (see the interleaved sketch after this list).
● Visual Representation: Wait-For Graph
A deadlock can be visualized using a "wait-for" graph, where nodes represent
transactions and a directed edge from T1 to T2 means T1 is waiting for a resource held by
T2. A cycle in this graph indicates a deadlock.98
● Deadlock Handling: DBMSs handle deadlocks through:
○ Prevention: Designing protocols that ensure a deadlock can never occur (e.g.,
requiring transactions to acquire all locks at once).
○ Avoidance: Checking resource requests in real-time to see if granting a lock could
lead to a deadlock.
○ Detection and Recovery: Periodically checking for cycles in the wait-for graph and,
if one is found, breaking it by aborting one of the transactions (the "victim") and
rolling back its changes.101
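The circular wait from the scenario above can be reproduced by interleaving two sessions. The sketch below shows that interleaving as one annotated script, with session 2's statements written as comments and a hypothetical Records table standing in for the locked data items:
SQL
-- Step 1, session 1: T1 locks Record A.
BEGIN;
UPDATE Records SET Val = 1 WHERE ID = 'A';

-- Step 2, session 2: T2 locks Record B.
--   BEGIN;
--   UPDATE Records SET Val = 2 WHERE ID = 'B';

-- Step 3, session 1: T1 requests Record B and blocks, waiting for T2.
UPDATE Records SET Val = 1 WHERE ID = 'B';

-- Step 4, session 2: T2 requests Record A and blocks, waiting for T1.
--   UPDATE Records SET Val = 2 WHERE ID = 'A';
-- The wait-for graph now contains the cycle T1 -> T2 -> T1; most systems
-- detect this and abort one transaction as the victim.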
Recovery Techniques
Recovery mechanisms are essential for ensuring the Atomicity and Durability properties of
transactions in the face of failures.
● Log-Based Recovery: The system maintains a log on stable storage (like a hard disk)
that records all database modifications.103 Each log record contains information such as
the transaction ID, the data item modified, its old value, and its new value. The
Write-Ahead Logging (WAL) protocol ensures that the log record for an operation is
written to stable storage before the actual data is modified on disk.103 After a crash, the
recovery manager uses the log to:
○ Undo the operations of transactions that had not committed.
○ Redo the operations of transactions that had committed, to ensure their changes are
on disk.
● Shadow Paging: This is an alternative recovery technique that avoids the need for a log.
It works by maintaining two page tables: a current page table and a shadow page table.
When a transaction starts, both point to the same database pages on disk. When a write
operation occurs, the modified page is written to a new location on disk, and the current
page table is updated to point to this new page, while the shadow page table remains
unchanged. If the transaction commits, the shadow page table is updated to match the
current one. If it aborts or the system crashes, the current page table is simply discarded,
and the shadow page table, which still points to the original, unmodified data, is used to
restore the database state.106
The principles of transaction management are deeply intertwined with the challenges of
distributed systems. The strong consistency guarantees of ACID, which are relatively
straightforward to implement on a single machine, become a significant performance
bottleneck in a distributed environment. This tension is famously described by the CAP
Theorem, which states that a distributed data store can only provide two of the following
three guarantees: Consistency, Availability, and Partition Tolerance (the ability to function
despite network failures). Because partition tolerance is a necessity in any real-world network,
designers are often forced to choose between strong consistency (like that provided by ACID)
and high availability. This trade-off is a primary reason why many modern NoSQL distributed
databases have chosen to relax their consistency guarantees (opting for a model known as
"eventual consistency") to achieve higher availability and better performance at a massive
scale.
For those interested in the advanced topics of transaction processing, concurrency, and
distributed systems, these books provide deep insights:
● For Advanced Theory: Readings in Database Systems (often called the "Red Book"),
edited by Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker. This is a curated
collection of the most influential research papers in the database field, offering a direct
look at the foundational ideas behind transaction management, concurrency control, and
distributed databases.45
● For Distributed Systems Internals: Database Internals: A Deep Dive into How
Distributed Data Systems Work by Alex Petrov. This book provides an excellent, modern
exploration of the internal workings of databases, with a strong focus on the challenges
and solutions related to building distributed data systems, including storage engines,
distributed algorithms, and concurrency control.45
Conclusion
This comprehensive guide has journeyed through the five core modules of a foundational
database management systems course. The journey began with the fundamental
"why"—understanding that a DBMS is not merely a data container but a sophisticated system
engineered to solve the critical problems of data redundancy, inconsistency, and insecure
access that plagued older file systems. It established the core architectural components and
the powerful concept of abstraction that provides data independence, a key economic benefit
in software engineering.
The guide then moved to the "what" and "how" of database design and interaction. The art of
data modeling was explored through the Entity-Relationship model, providing a blueprint for
structuring data logically before implementation. The power of SQL was detailed, showing
how its declarative nature allows users to retrieve complex information efficiently, with a
particular focus on the JOIN operation as the practical realization of conceptual data
relationships. The principles of effective design were covered through the process of
normalization, a crucial technique for eliminating data anomalies and ensuring the integrity of
the database schema.
Finally, the guide addressed the critical issues of reliability and performance through the
study of transaction management and concurrency control. The ACID properties were
presented as the bedrock guarantee of reliable data processing, while mechanisms for
managing concurrent access and recovering from system failures were explained as the
means to uphold these guarantees. The introduction to distributed databases highlighted the
modern challenges of scaling and availability, revealing the fundamental trade-offs that
engineers must navigate between consistency and performance in today's data-intensive
world.
Ultimately, the study of database systems is the study of organized information. It is a field
that blends rigorous theory with practical engineering trade-offs. A successful database
professional must not only understand the syntax of SQL or the rules of normalization but
must also grasp the underlying principles that govern the design of reliable, scalable, and
maintainable systems. The true goal is not just to store data, but to transform it into a
consistent, secure, and accessible asset that can power applications and drive informed
decisions.
Works cited