Lec9Notes Merged
1. What is Data?
a. Data is a collection of raw, unorganized facts and details like text, observations, figures, symbols,
and descriptions of things etc.
In other words, data does not carry any specific purpose and has no significance by itself.
Moreover, data is measured in terms of bits and bytes – which are basic units of information in the
context of computer storage and processing.
b. Data can be recorded and doesn’t have any meaning unless processed.
2. Types of Data
a. Quantitative
i. Numerical form
ii. Weight, volume, cost of an item.
b. Qualitative
i. Descriptive, but not numerical.
ii. Name, gender, hair color of a person.
3. What is Information?
a. Information is processed, organized, and structured data.
b. It provides context of the data and enables decision making.
c. It is processed data that makes sense to us.
d. Information is extracted from the data, by analyzing and interpreting pieces of data.
e. E.g., records of all the people living in your locality are data. When you analyze and interpret
those records and reach conclusions such as:
i. There are 100 senior citizens.
ii. The sex ratio is 1.1.
iii. There are 100 newborn babies.
these conclusions are information.
4. Data vs Information
a. Data is a collection of facts, while information puts those facts into context.
b. While data is raw and unorganized, information is organized.
c. Data points are individual and sometimes unrelated. Information maps out that data to provide a
big-picture view of how it all fits together.
d. Data, on its own, is meaningless. When it’s analyzed and interpreted, it becomes meaningful
information.
e. Data does not depend on information; however, information depends on data.
f. Data typically comes in the form of graphs, numbers, figures, or statistics. Information is typically
presented through words, language, thoughts, and ideas.
g. Data isn’t sufficient for decision-making, but you can make decisions based on information.
5. What is a Database?
a. A database is an electronic place/system where data is stored in a way that it can be easily accessed,
managed, and updated.
b. To make real use of data, we need a database management system (DBMS).
6. What is DBMS?
a. A database-management system (DBMS) is a collection of interrelated data and a set of
programs to access those data. The collection of data, usually referred to as the database,
contains information relevant to an enterprise. The primary goal of a DBMS is to provide a way to
store and retrieve database information that is both convenient and efficient.
b. A DBMS is the database itself, along with all the software and functionality. It is used to perform
different operations, like addition, access, updating, and deletion of the data.
7.
a. File-processing systems have major disadvantages:
i. Data Redundancy and inconsistency
ii. Difficulty in accessing data
iii. Data isolation
iv. Integrity problems
v. Atomicity problems
vi. Concurrent-access anomalies
vii. Security problems
b. A DBMS addresses the 7 problems above, so they double as the advantages of a DBMS (the answer to “Why use a DBMS?”).
LEC-2: DBMS Architecture
e. Logical level / Conceptual level:
i. The conceptual schema describes the design of a database at the conceptual level,
describes what data are stored in DB, and what relationships exist among those data.
ii. A user at the logical level does not need to be aware of physical-level structures.
iii. The DBA, who must decide what information to keep in the DB, uses the logical level of
abstraction.
iv. Goal: ease of use.
f. View level / External level:
i. The highest level of abstraction; it aims to simplify users’ interaction with the system by
providing a different view to each group of end users.
ii. Each view schema describes the part of the database that a particular user group is interested in
and hides the rest of the database from that user group.
iii. At the external level, a database has several schemas, sometimes called
subschemas. A subschema describes one particular view of the database.
iv. Views also provide a security mechanism to prevent users from accessing certain parts
of the DB.
g.
2. Instances and Schemas
a. The collection of information stored in the DB at a particular moment is called an instance of DB.
b. The overall design of the DB is called the DB schema.
c. A schema is a structural description of the data. The schema doesn’t change frequently; the data may change
frequently.
d. DB schema corresponds to the variable declarations (along with type) in a program.
e. We have 3 types of Schemas: Physical, Logical, several view schemas called subschemas.
f. Logical schema is most important in terms of its effect on application programs, as programmers
construct apps by using logical schema.
g. Physical data independence: a change to the physical schema should not affect the logical
schema or application programs.
3. Data Models:
a. Provides a way to describe the design of a DB at logical level.
b. Underlying the structure of the DB is the Data Model; a collection of conceptual tools for describing
data, data relationships, data semantics & consistency constraints.
c. E.g., ER model, Relational Model, object-oriented model, object-relational data model etc.
4. Database Languages:
a. Data definition language (DDL) to specify the database schema.
b. Data manipulation language (DML) to express database queries and updates.
c. Practically, both language features are present in a single DB language, e.g., SQL language.
d. DDL
i. We specify consistency constraints, which must be checked, every time DB is updated.
e. DML
i. Data manipulation involves
1. Retrieval of information stored in DB.
2. Insertion of new information into DB.
3. Deletion of information from the DB.
4. Updating existing information stored in DB.
ii. Query language: the part of DML used to specify statements that request the retrieval of
information.
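The split between DDL and DML can be sketched as follows. This is only an illustration, assuming Python's built-in sqlite3 module as a stand-in for a full DBMS; the account table and its columns are hypothetical names, not part of the notes.

```python
import sqlite3

# In-memory SQLite database; table and column names are made up for this sketch.
conn = sqlite3.connect(":memory:")

# DDL: define the schema, including a consistency constraint that is
# checked every time the DB is updated.
conn.execute(
    "CREATE TABLE account ("
    " account_no INTEGER PRIMARY KEY,"
    " balance REAL NOT NULL CHECK (balance >= 0))"
)

# DML: insertion, updating, deletion, and retrieval.
conn.execute("INSERT INTO account VALUES (1, 500.0)")
conn.execute("INSERT INTO account VALUES (2, 50.0)")
conn.execute("UPDATE account SET balance = balance + 100 WHERE account_no = 1")
conn.execute("DELETE FROM account WHERE account_no = 2")
rows = conn.execute("SELECT account_no, balance FROM account").fetchall()

# An update that violates the constraint declared in the DDL is rejected.
try:
    conn.execute("UPDATE account SET balance = -1 WHERE account_no = 1")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True
```

Here the host language (Python) sends both DDL and DML statements through one API, which is the same pattern ODBC and JDBC follow.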
5. How is Database accessed from Application programs?
a. Apps (written in host languages such as C/C++ or Java) interact with the DB.
b. E.g., a banking system’s payroll-generation module accesses the DB by executing DML statements from
the host language.
c. An API is provided to send DML/DDL statements to the DB and retrieve the results.
i. Open Database Connectivity (ODBC): a C-language API, originally from Microsoft.
ii. Java Database Connectivity (JDBC): the equivalent API for Java.
6. Database Administrator (DBA)
a. A person who has central control of both the data and the programs that access those data.
b. Functions of DBA
i. Schema Definition
ii. Storage structure and access methods.
iii. Schema and physical organization modifications.
iv. Authorization control.
v. Routine maintenance
1. Periodic backups.
2. Security patches.
3. Any upgrades.
7. DBMS Application Architectures: Client machines, on which remote DB users work, and server machines
on which DB system runs.
a. T1 Architecture
i. The client, server & DB all present on the same machine.
b. T2 Architecture
i. App is partitioned into 2-components.
ii. Client machine, which invokes DB system functionality at server end through query
language statements.
iii. API standards like ODBC & JDBC are used to interact between client and server.
c. T3 Architecture
i. App is partitioned into 3 logical components.
ii. Client machine is just a frontend and doesn’t contain any direct DB calls.
iii. The client machine communicates with the App server, and the App server communicates with the DB
system to access data.
iv. Business logic (what action to take under which condition) lives in the App server itself.
v. T3 architecture is best suited to WWW applications.
vi. Advantages:
1. Scalability due to distributed application servers.
2. Data integrity: the App server acts as a middle layer between client and DB, which
minimizes the chances of data corruption.
3. Security: the client can’t directly access the DB, hence it is more secure.
LEC-3: Entity-Relationship Model
1. Data Model: Collection of conceptual tools for describing data, data relationships, data semantics, and consistency
constraints.
2. ER Model
1. It is a high level data model based on a perception of a real world that consists of a collection of basic objects, called
entities and of relationships among these objects.
2. Graphical representation of ER Model is ER diagram, which acts as a blueprint of DB.
3. Entity: An Entity is a “thing” or “object” in the real world that is distinguishable from all other objects.
1. It has physical existence.
2. Each student in a college is an entity.
3. Entity can be uniquely identified. (By a primary attribute, aka Primary Key)
4. Strong Entity: Can be uniquely identified.
5. Weak Entity: Can’t be uniquely identified by its own attributes; depends on some other strong entity.
1. It doesn’t have sufficient attributes to form a uniquely identifying attribute.
2. E.g., Loan -> Strong Entity, Payment -> Weak Entity: instalment numbers are a sequential counter generated
separately for each loan, so a payment number is unique only within its loan.
3. A weak entity depends on a strong entity for its existence.
4. Entity set
1. It is a set of entities of the same type that share the same properties, or attributes.
2. E.g., Student is an entity set.
3. E.g., Customer of a bank
5. Attributes
1. An entity is represented by a set of attributes.
2. Each entity has a value for each of its attributes.
3. For each attribute, there is a set of permitted values, called the domain, or value set, of that attribute.
4. E.g., Student Entity has following attributes
A. Student_ID
B. Name
C. Standard
D. Course
E. Batch
F. Contact number
G. Address
5. Types of Attributes
1. Simple
1. Attributes which can’t be divided further.
2. Unary: only one entity participates. E.g., Employee manages Employee.
3. Binary: two entities participate. E.g., Student takes Course.
4. Ternary: three entities participate. E.g., Employee works-on Branch, Employee works-on Job.
5. Binary relationships are the most common.
7. Relationships Constraints
1. Mapping Cardinality / Cardinality Ratio
1. Number of entities to which another entity can be associated via a relationship.
2. One to one: an entity in A is associated with at most one entity in B, where A & B are entity sets, and an entity
in B is associated with at most one entity in A.
1. E.g., Citizen has Aadhar Card.
3. One to many: an entity in A is associated with N entities in B, while an entity in B is associated with at most one
entity in A.
1. E.g., Citizen has Vehicle.
4. Many to one: an entity in A is associated with at most one entity in B, while an entity in B can be associated with
N entities in A.
1. Basic ER Features studied in the LEC-3, can be used to model most DB features but when complexity increases, it is
better to use some Extended ER features to model the DB Schema.
2. Specialisation
1. In the ER model, we may need to divide an entity set into subgroups that are distinct in some way from the other
entities in the set.
2. Specialisation is splitting up the entity set into further sub entity sets on the basis of their functionalities,
specialities and features.
3. It is a Top-Down approach.
4. e.g., Person entity set can be divided into customer, student, employee. Person is superclass and other specialised
entity sets are subclasses.
1. We have “is-a” relationship between superclass and subclass.
2. Depicted by triangle component.
5. Why Specialisation?
1. Certain attributes may only be applicable to a few entities of the parent entity set.
2. The DB designer can show the distinctive features of the sub entities.
3. To group such entities we apply Specialisation, which refines the overall DB blueprint.
3. Generalisation
1. It is just a reverse of Specialisation.
2. The DB designer may find that certain properties of two entity sets overlap, and may then create a
new generalised entity set. That generalised entity set will be a superclass.
3. “is-a” relationship is present between subclass and superclass.
4. E.g., Car, Jeep and Bus all have some common attributes. To avoid repeating the common attributes, the DB
designer may generalise them into a new entity set “Vehicle”.
5. It is a Bottom-up approach.
6. Why Generalisation?
1. Makes DB more refined and simpler.
2. Common attributes are not repeated.
4. Attribute Inheritance
1. Relational Model (RM) organises the data in the form of relations (tables).
2. A relational DB consists of collection of tables, each of which is assigned a unique name.
3. A row in a table represents a relationship among a set of values, and table is collection of such relationships.
4. Tuple: A single row of the table representing a single data point / a unique record.
5. Columns: represent the attributes of the relation. For each attribute, there is a set of permitted values, called the domain of the
attribute.
6. Relation Schema: defines the design and structure of the relation, contains the name of the relation and all the
columns/attributes.
7. Common RM based DBMS systems, aka RDBMS: Oracle, IBM, MySQL, MS Access.
8. Degree of table: number of attributes/columns in a given table/relation.
9. Cardinality: Total no. of tuples in a given relation.
10. Relational Key: Set of attributes which can uniquely identify each tuple.
11. Important properties of a Table in Relational Model
1. The name of each relation is distinct from the names of all other relations.
2. The values have to be atomic. Can’t be broken down further.
3. The name of each attribute/column must be unique.
4. Each tuple must be unique in a table.
5. The sequence of rows and columns has no significance.
6. Tables must follow integrity constraints - it helps to maintain data consistency across the tables.
12. Relational Model Keys
1. Super Key (SK): Any combination of attributes present in a table which can uniquely identify each tuple.
2. Candidate Key (CK): A minimal super key, i.e., a super key from which no attribute can be removed without losing uniqueness. It contains no
redundant attribute.
1. CK value shouldn’t be NULL.
3. Primary Key (PK):
1. Selected from the set of CKs, usually the one with the fewest attributes.
4. Alternate Key (AK)
1. All CK except PK.
5. Foreign Key (FK)
primary key must contain unique as well as not null values.
6. FOREIGN KEY: Whenever there is some relationship between two entities, there must be some common
attribute between them. This common attribute must be the primary key of an entity set and will become the
foreign key of another entity set. This key will prevent every action which can result in loss of connection
between tables.
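A minimal sketch of how PK and FK constraints block actions that would break the link between tables. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (SQLite needs foreign-key enforcement switched on explicitly); the customer and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " cust_id INTEGER NOT NULL REFERENCES customer(id))"
)
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (10, 1)")


def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# An order pointing at a customer that does not exist.
orphan_insert_rejected = rejected("INSERT INTO orders VALUES (11, 99)")
# Deleting a customer that an order still refers to.
parent_delete_rejected = rejected("DELETE FROM customer WHERE id = 1")
# A second row with the same primary key value.
duplicate_pk_rejected = rejected("INSERT INTO customer VALUES (1, 'Ravi')")
```

All three statements are refused, so the reference from orders to customer can never dangle.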
LEC-8: Transform - ER Model to Relational Model
1. Both the ER Model and the Relational Model are abstract logical representations of real-world enterprises. Because the two
models follow similar design principles, we can convert an ER design into a Relational design.
2. Converting a DB representation from an ER diagram to a table format is the way we arrive at Relational DB-design from
an ER diagram.
3. ER diagram notations to relations:
1. Strong Entity
1. Becomes an individual table named after the entity; its attributes become columns of the relation.
2. Entity’s Primary Key (PK) is used as Relation’s PK.
3. FK are added to establish relationships with other relations.
2. Weak Entity
1. A table is formed with all the attributes of the entity.
2. PK of its corresponding Strong Entity will be added as FK.
3. PK of the relation will be a composite PK, {FK + Partial discriminator Key}.
3. Single-Valued Attributes
1. Represented as columns directly in the tables/relations.
4. Composite Attributes
1. Handled by creating a separate column in the original relation for each component of the composite attribute.
2. e.g., Address: {street-name, house-no} is a composite attribute in the customer relation; we add
address-street-name & address-house-no as new columns in the relation and drop “address” itself as an attribute.
5. Multivalued Attributes
1. A new table (named after the original attribute) is created for each multivalued attribute.
2. PK of the entity is used as column FK in the new table.
3. A column named after the multivalued attribute is added to hold its values, one value per row.
4. PK of the new table would be {FK + multivalued name}.
5. e.g., For Strong entity Employee, dependent-name is a multivalued attribute.
1. New table named dependent-name will be formed with columns emp-id, and dname.
2. PK: {emp-id, dname}
3. FK: {emp-id}
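The Employee / dependent-name mapping above can be sketched like this. It is an illustration only, using Python's built-in sqlite3 module instead of MySQL, with underscores in place of hyphens because hyphens are not valid in unquoted SQL identifiers; the sample names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Strong entity: one row per employee.
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")

# Multivalued attribute: one row per (employee, dependent) pair.
# FK = emp_id, PK = {emp_id, dname}.
conn.execute(
    "CREATE TABLE dependent_name ("
    " emp_id INTEGER NOT NULL REFERENCES employee(emp_id),"
    " dname TEXT NOT NULL,"
    " PRIMARY KEY (emp_id, dname))"
)

conn.execute("INSERT INTO employee VALUES (1, 'Meera')")
conn.executemany(
    "INSERT INTO dependent_name VALUES (?, ?)",
    [(1, "Anu"), (1, "Kabir")],
)

dependents = [
    row[0]
    for row in conn.execute(
        "SELECT dname FROM dependent_name WHERE emp_id = 1 ORDER BY dname"
    )
]

# The composite PK rejects the same dependent being stored twice.
try:
    conn.execute("INSERT INTO dependent_name VALUES (1, 'Anu')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```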
6. Derived Attributes: Not considered in the tables.
7. Generalisation
1. Method-1: Create a table for the higher level entity set. For each lower-level entity set, create a table that
includes a column for each of the attributes of that entity set plus a column for each attribute of the primary key
of the higher-level entity set.
For e.g., Banking System generalisation of Account - saving & current.
1. Table 1: account (account-number, balance)
LEC-9: SQL in 1-Video
1. SQL (Structured Query Language) is used to perform CRUD operations on a relational DB, while MySQL is an RDBMS used to
store, manage and administer DBs (which it hosts itself) using SQL.
SQL DATA TYPES (Ref: https://www.w3schools.com/sql/sql_datatypes.asp)
1. In SQL DB, data is stored in the form of tables.
2. Data can be of different types, like INT, CHAR etc.
DATATYPE | Description
CHAR | fixed-length string, size 0-255, e.g., CHAR(251)
VARCHAR | variable-length string(0-65535)
TINYTEXT | string(0-255)
TEXT | string(0-65535)
BLOB | binary data(0-65535 bytes)
MEDIUMTEXT | string(0-16777215)
MEDIUMBLOB | binary data(0-16777215 bytes)
LONGTEXT | string(0-4294967295)
LONGBLOB | binary data(0-4294967295 bytes)
DATE | YYYY-MM-DD
TIMESTAMP | YYYYMMDDHHMMSS
TIME | HH:MM:SS
BOOLEAN | 0/1
3. Integer sizes: TINYINT < SMALLINT < MEDIUMINT < INT < BIGINT.
4. Variable length Data types e.g., VARCHAR, are better to use as they occupy space equal to the actual data size.
5. Values can also be unsigned e.g., INT UNSIGNED.
6. Types of SQL commands:
1. DDL (data definition language): defining relation schema.
1. CREATE: create table, DB, view.
2. ALTER TABLE: modification in table structure. e.g, change column datatype or add/remove columns.
3. DROP: delete table, DB, view.
4. TRUNCATE: remove all the tuples from the table.
5. RENAME: rename DB name, table name, column name etc.
2. DRL/DQL (data retrieval language / data query language): retrieve data from the tables.
1. SELECT
3. DML (data manipulation language): used to perform modifications in the DB.
1. INSERT: insert data into a relation
2. UPDATE: update relation data.
MANAGING DB (DDL)
1. Creation of DB
1. CREATE DATABASE IF NOT EXISTS db-name;
2. USE db-name; //selects the DB on which subsequent commands (CREATE TABLE etc.) will be executed;
//this is also how you switch between DBs.
3. DROP DATABASE IF EXISTS db-name; //dropping database.
4. SHOW DATABASES; //list all the DBs in the server.
5. SHOW TABLES; //list tables in the selected DB.
DATA RETRIEVAL LANGUAGE (DRL)
1. Syntax: SELECT <set of column names> FROM <table_name>;
2. Order of execution is from RIGHT to LEFT: the FROM clause is evaluated first, then the SELECT list.
3. Q. Can we use SELECT keyword without using FROM clause?
1. Yes, using the DUAL table.
2. DUAL is a dummy table provided by MySQL that lets users evaluate simple expressions without referring to any
user-defined table.
3. e.g., SELECT 55 + 11;
SELECT now();
SELECT ucase('hello'); etc.
4. WHERE
1. Reduce rows based on given conditions.
2. E.g., SELECT * FROM customer WHERE age > 18;
5. BETWEEN
1. SELECT * FROM customer WHERE age BETWEEN 0 AND 100;
2. In the above e.g., 0 and 100 are inclusive.
6. IN
1. Replaces a chain of OR conditions on the same column.
2. e.g., SELECT * FROM officers WHERE officer_name IN ('Lakshay', 'Maharana Pratap', 'Deepika');
7. AND/OR/NOT
1. AND: WHERE cond1 AND cond2
2. OR: WHERE cond1 OR cond2
3. NOT: WHERE col_name NOT IN (1,2,3,4);
8. IS NULL
1. e.g., SELECT * FROM customer WHERE prime_status IS NULL;
9. Pattern Searching / Wildcard ('%', '_')
1. '%' matches any number of characters, from 0 to n. Similar to '.*' in regex.
2. '_' matches exactly one character.
3. SELECT * FROM customer WHERE name LIKE '%p_';
10. ORDER BY
1. Sorts the retrieved rows (after any WHERE filtering) by one or more columns.
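The filtering clauses in this section can be tried out with a small sketch. It assumes Python's built-in sqlite3 module rather than MySQL (the clauses used here behave the same way in both); the customer rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER, prime_status TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        ("Deepa", 17, None),
        ("Gopal", 30, "yes"),
        ("Ravi", 45, None),
        ("Sapna", 100, "no"),
    ],
)


def names(sql):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(sql)]


# WHERE: keep only rows that satisfy the condition.
adults = names("SELECT name FROM customer WHERE age > 18 ORDER BY name")
# BETWEEN: both bounds are inclusive, so ages 30 and 100 both match.
in_range = names("SELECT name FROM customer WHERE age BETWEEN 30 AND 100 ORDER BY age")
# IN: shorthand for name = 'Ravi' OR name = 'Deepa'.
chosen = names("SELECT name FROM customer WHERE name IN ('Ravi', 'Deepa') ORDER BY name")
# IS NULL: rows with no prime_status recorded.
no_status = names("SELECT name FROM customer WHERE prime_status IS NULL ORDER BY name")
# LIKE '%p_': any prefix, then 'p', then exactly one more character.
pattern = names("SELECT name FROM customer WHERE name LIKE '%p_' ORDER BY name")
# SELECT without a FROM clause.
constant = conn.execute("SELECT 55 + 11").fetchone()[0]
```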
CONSTRAINTS (DDL)
1. Primary Key
1. A PK is not null and unique; a table can have only one PK.
2. Foreign Key
1. FK refers to PK of other table.
2. Each relation can have any number of FKs.
3. UNIQUE
1. Values must be unique but can be NULL; a table can have multiple UNIQUE attributes.
2. CREATE TABLE customer (
…
email VARCHAR(1024) UNIQUE,
…
);
4. CHECK
1. CREATE TABLE customer (
…
CONSTRAINT age_check CHECK (age > 12),
…
);
2. The name “age_check” is optional; if it is omitted, MySQL generates a constraint name automatically.
5. DEFAULT
1. Set default value of the column.
2. CREATE TABLE account (
…
saving_rate DOUBLE NOT NULL DEFAULT 4.25,
…
);
6. An attribute can be PK and FK both in a table.
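The UNIQUE, CHECK and DEFAULT constraints above can be combined in one table, as in the following sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL; the single customer table and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " id INTEGER PRIMARY KEY,"
    " email VARCHAR(1024) UNIQUE,"
    " age INTEGER,"
    " saving_rate DOUBLE NOT NULL DEFAULT 4.25,"
    " CONSTRAINT age_check CHECK (age > 12))"
)

# saving_rate is not supplied, so the DEFAULT value is stored.
conn.execute("INSERT INTO customer (id, email, age) VALUES (1, 'a@example.com', 30)")
default_rate = conn.execute(
    "SELECT saving_rate FROM customer WHERE id = 1"
).fetchone()[0]


def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Same email as row 1: rejected by UNIQUE.
unique_violation = violates(
    "INSERT INTO customer (id, email, age) VALUES (2, 'a@example.com', 40)"
)
# age 10 fails CHECK (age > 12).
check_violation = violates(
    "INSERT INTO customer (id, email, age) VALUES (3, 'b@example.com', 10)"
)

# A UNIQUE column may hold NULL, and more than one row may do so.
conn.execute("INSERT INTO customer (id, email, age) VALUES (4, NULL, 20)")
conn.execute("INSERT INTO customer (id, email, age) VALUES (5, NULL, 21)")
null_emails = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE email IS NULL"
).fetchone()[0]
```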
7. ALTER OPERATIONS
1. Changes schema
2. ADD
1. Add new column.
2. ALTER TABLE table_name ADD new_col_name datatype, ADD new_col_name_2 datatype;
3. e.g., ALTER TABLE customer ADD age INT NOT NULL;
3. MODIFY
1. Change datatype of an attribute.
2. ALTER TABLE table-name MODIFY col-name col-datatype;
3. E.g., VARCHAR TO CHAR
ALTER TABLE customer MODIFY name CHAR(255);
4. CHANGE COLUMN
1. Rename a column (the datatype must be restated and may be changed).
2. ALTER TABLE table-name CHANGE COLUMN old-col-name new-col-name new-col-datatype;
3. e.g., ALTER TABLE customer CHANGE COLUMN name customer_name VARCHAR(1024);
5. DROP COLUMN
1. Drop a column completely.
2. ALTER TABLE table-name DROP COLUMN col-name;
3. e.g., ALTER TABLE customer DROP COLUMN middle_name;
6. RENAME
1. Rename table name itself.
2. ALTER TABLE table-name RENAME TO new-table-name;
3. e.g., ALTER TABLE customer RENAME TO customer_details;
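A small sketch of schema changes with ALTER TABLE. It assumes Python's built-in sqlite3 module, which supports ADD COLUMN and RENAME TO but not MySQL's MODIFY or CHANGE COLUMN, so only the first two are shown; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")


def columns(table):
    """Column names of a table, read from SQLite's schema metadata."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


before = columns("customer")

# ADD: existing rows receive the default value for the new column.
conn.execute("ALTER TABLE customer ADD COLUMN age INT NOT NULL DEFAULT 0")
after_add = columns("customer")

# RENAME: the table keeps its columns and rows under the new name.
conn.execute("ALTER TABLE customer RENAME TO customer_details")
after_rename = columns("customer_details")
row = conn.execute("SELECT id, name, age FROM customer_details").fetchone()
```

Only the schema changes; the row inserted before the ALTER statements is still present afterwards.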
JOINING TABLES
1. An RDBMS is relational in nature: data is spread across tables, and we refer to other tables to get meaningful outcomes.
2. FKs are used to reference other tables.
3. INNER JOIN
1. Returns a resultant table that has matching values from both the tables or all the tables.
2. SELECT column-list FROM table1 INNER JOIN table2 ON condition1
INNER JOIN table3 ON condition2
…;
3. Alias in MySQL (AS)
1. An alias in MySQL gives a temporary name to a table or a column for the duration of
a particular query. It works as a nickname for the table or column and keeps the query short
and neat.
2. SELECT col_name AS alias_name FROM table_name;
3. SELECT col_name1, col_name2,... FROM table_name AS alias_name;
4. OUTER JOIN
1. LEFT JOIN
1. Returns a resulting table that has all the data from the left table and the matched data from the right table.
2. SELECT columns FROM table1 LEFT JOIN table2 ON join_condition;
2. RIGHT JOIN
1. Returns a resulting table that has all the data from the right table and the matched data from the left table.
2. SELECT columns FROM table1 RIGHT JOIN table2 ON join_condition;
3. FULL JOIN
1. Returns a resulting table that contains all rows from both tables, matched where a match exists in the left or right table.
2. Emulated in MySQL using LEFT and RIGHT JOIN.
3. LEFT JOIN UNION RIGHT JOIN.
4. SELECT columns FROM table1 as t1 LEFT JOIN table2 as t2 ON t1.id = t2.id
UNION
SELECT columns FROM table1 as t1 RIGHT JOIN table2 as t2 ON t1.id = t2.id;
5. UNION ALL can also be used; it keeps duplicate rows, while UNION returns only distinct rows.
5. CROSS JOIN
1. Returns the cartesian product of the rows of both tables, i.e., every possible pairing of a row from one table with a row from the other
is reflected in the output.
2. Rarely used in practice.
3. Table-1 has 10 rows and table-2 has 5, then resultant would have 50 rows.
4. SELECT column-lists FROM table1 CROSS JOIN table2;
6. SELF JOIN
1. Joins a table to itself, e.g., to relate rows of the same table to one another.
2. Rarely used.
3. Emulated using INNER JOIN.
4. SELECT columns FROM table as t1 INNER JOIN table as t2 ON t1.id = t2.id;
7. Join without using join keywords.
1. SELECT * FROM table1, table2 WHERE condition;
2. e.g., SELECT artist_name, album_name, year_recorded FROM artist, album WHERE artist.id = album.artist_id;
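The join types above can be compared side by side in a short sketch. It assumes Python's built-in sqlite3 module; because older SQLite versions lack RIGHT JOIN, the right-join half of the FULL JOIN emulation is written as a LEFT JOIN with the tables swapped, which returns the same rows. The artist and album rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, artist_name TEXT)")
conn.execute("CREATE TABLE album (album_name TEXT, artist_id INTEGER)")
conn.executemany("INSERT INTO artist VALUES (?, ?)", [(1, "Asha"), (2, "Kishore")])
conn.executemany(
    "INSERT INTO album VALUES (?, ?)",
    [("Dawn", 1), ("Dusk", 1), ("Orphan", None)],
)


def run(sql):
    """Execute a query and return all rows."""
    return conn.execute(sql).fetchall()


# INNER JOIN: only artist/album pairs that match.
inner = run(
    "SELECT a.artist_name, b.album_name FROM artist AS a "
    "INNER JOIN album AS b ON a.id = b.artist_id ORDER BY b.album_name"
)

# LEFT JOIN: every artist, with NULL where no album matches.
left = run(
    "SELECT a.artist_name, b.album_name FROM artist AS a "
    "LEFT JOIN album AS b ON a.id = b.artist_id "
    "ORDER BY a.artist_name, b.album_name"
)

# FULL JOIN emulated as (left join) UNION (right join).
full = run(
    "SELECT a.artist_name, b.album_name FROM artist AS a "
    "LEFT JOIN album AS b ON a.id = b.artist_id "
    "UNION "
    "SELECT a.artist_name, b.album_name FROM album AS b "
    "LEFT JOIN artist AS a ON a.id = b.artist_id"
)

# CROSS JOIN: 2 artists x 3 albums.
cross_count = run("SELECT COUNT(*) FROM artist CROSS JOIN album")[0][0]
```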
SET OPERATIONS
1. Used to combine multiple select statements.
2. Give distinct rows by default (UNION ALL is the exception and keeps duplicates).
JOIN | SET OPERATIONS
Combines multiple tables based on a matching condition. | Combines the result sets of two or more SELECT statements.
Column-wise combination. | Row-wise combination.
Data types of the two tables can be different. | Datatypes of corresponding columns from each table should be the same.
The number of column(s) selected may or may not be the same from each table. | The number of column(s) selected must be the same from each table.
Combines results horizontally. | Combines results vertically.
3. UNION
1. Combines two or more SELECT statements.
2. SELECT * FROM table1
UNION
SELECT * FROM table2;
3. The number and order of columns must be the same for table1 and table2.
4. INTERSECT
1. Returns common values of the tables.
2. Emulated.
3. SELECT DISTINCT column-list FROM table1 INNER JOIN table2 USING(join_column);
4. e.g., SELECT DISTINCT * FROM table1 INNER JOIN table2 USING(id);
5. MINUS
1. This operator returns the distinct row from the first table that does not occur in the second table.
2. Emulated.
3. SELECT column_list FROM table1 LEFT JOIN table2 ON condition WHERE table2.column_name IS NULL;
4. e.g., SELECT id FROM table1 LEFT JOIN table2 USING(id) WHERE table2.id IS NULL;
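The set operations and their join-based emulations can be checked on two tiny tables. This is a sketch assuming Python's built-in sqlite3 module; table1 and table2 hold made-up ids, with one duplicate in table1 to show the effect of DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.executemany("INSERT INTO table2 VALUES (?)", [(2,), (3,), (4,)])


def ids(sql):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(sql)]


# UNION removes duplicates; UNION ALL keeps every row from both inputs.
union = ids("SELECT id FROM table1 UNION SELECT id FROM table2 ORDER BY id")
union_all_count = len(ids("SELECT id FROM table1 UNION ALL SELECT id FROM table2"))

# INTERSECT emulated with INNER JOIN + DISTINCT.
intersect = ids(
    "SELECT DISTINCT id FROM table1 INNER JOIN table2 USING (id) ORDER BY id"
)

# MINUS emulated with LEFT JOIN, keeping rows that found no partner.
minus = ids(
    "SELECT DISTINCT table1.id FROM table1 LEFT JOIN table2 USING (id) "
    "WHERE table2.id IS NULL ORDER BY table1.id"
)
```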
SUB QUERIES
1. Outer query depends on inner query.
2. Alternative to joins.
3. Nested queries.
4. SELECT column_list (s) FROM table_name WHERE column_name OPERATOR
(SELECT column_list (s) FROM table_name [WHERE]);
5. e.g., SELECT * FROM table1 WHERE col1 IN (SELECT col1 FROM table1);
6. Sub queries exist mainly in 3 clauses
1. Inside a WHERE clause.
2. Inside a FROM clause.
3. Inside a SELECT clause.
7. Subquery using FROM clause
1. SELECT MAX(rating) FROM (SELECT * FROM movie WHERE country = 'India') as temp;
8. Subquery using SELECT
1. SELECT (SELECT column_list(s) FROM T_name WHERE condition), columnList(s) FROM T2_name WHERE
condition;
9. Derived Subquery
1. SELECT columnLists(s) FROM (SELECT columnLists(s) FROM table_name WHERE [condition]) as new_table_name;
10. Correlated sub-queries
1. With a normal nested subquery, the inner SELECT query runs first and executes once, returning values to be used by
the main query. A correlated subquery, however, executes once for each candidate row considered by the outer query.
In other words, the inner query is driven by the outer query.
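The difference between a subquery in the FROM clause and a correlated subquery can be sketched as follows, assuming Python's built-in sqlite3 module and a hypothetical movie table with made-up titles and ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, country TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [
        ("A", "India", 8.1),
        ("B", "India", 6.4),
        ("C", "Japan", 7.9),
        ("D", "Japan", 8.6),
    ],
)

# Subquery in FROM: the inner query does not depend on the outer one,
# and its result is aliased as temp.
best_indian = conn.execute(
    "SELECT MAX(rating) FROM (SELECT * FROM movie WHERE country = 'India') AS temp"
).fetchone()[0]

# Correlated subquery: the inner query refers to m.country from the outer
# query, so its value depends on the outer row being considered.
above_country_avg = [
    row[0]
    for row in conn.execute(
        "SELECT m.title FROM movie AS m "
        "WHERE m.rating > (SELECT AVG(rating) FROM movie WHERE country = m.country) "
        "ORDER BY m.title"
    )
]
```

Movies A and D are the ones rated above the average of their own country, which a single non-correlated inner query could not express.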
JOIN VS SUB-QUERIES
JOINS | SUBQUERIES
Faster | Slower
MySQL VIEWS
1. A view is a database object that has no values. Its contents are based on the base table. It contains rows and columns
NOTE: We can also import/export table schema from files (.csv or json).
LEC-11: Normalisation
2. Augmentation
1. If B can be determined from A, then adding the same attributes to both sides of this functional dependency keeps it
valid.
2. If A→ B holds, then AX→ BX holds too, ‘X’ being a set of attributes.
3. Transitivity
1. If A determines B and B determines C, we can say that A determines C.
2. If A→ B and B→ C, then A→ C.
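The augmentation and transitivity rules can be applied mechanically by computing an attribute closure: the set of all attributes determined by a given set. The sketch below is plain Python; the function name closure and the representation of FDs as (lhs, rhs) pairs of sets are choices made for this illustration.

```python
def closure(attributes, fds):
    """Return the set of attributes functionally determined by `attributes`.

    `fds` is a list of (lhs, rhs) pairs of attribute sets, each meaning lhs -> rhs.
    """
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If everything in lhs is already determined, rhs is determined too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# A -> B and B -> C
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]

# Transitivity: C ends up in the closure of A, so A -> C.
a_closure = closure({"A"}, fds)

# Augmentation: the closure of AX contains B and X, so AX -> BX.
ax_closure = closure({"A", "X"}, fds)
```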
3. Why Normalisation?
1. To avoid redundancy in the DB, not to store redundant data.
4. What happens if we have redundant data?
1. Insertion, deletion and updation anomalies arise.
5. Anomalies
1. Anomalies means abnormalities, there are three types of anomalies introduced by data redundancy.
2. Insertion anomaly
1. When certain data (attribute) can not be inserted into the DB without the presence of other data.
3. Deletion anomaly
1. The delete anomaly refers to the situation where the deletion of data results in the unintended loss of some
other important data.
4. Updation anomaly (or modification anomaly)
1. The update anomaly is when an update of a single data value requires multiple rows of data to be updated.
2. Because the same value must be updated in many places, data inconsistency arises if one forgets to update it at all the
intended places.
5. Due to redundancy and the resulting anomalies, DB size increases and DB performance becomes very slow.
6. To rectify these anomalies and their effect on the DB, we use a database optimisation technique called
NORMALISATION.
6. What is Normalisation?
1. Normalisation is used to minimise the redundancy in relations. It is also used to eliminate undesirable
characteristics like Insertion, Update, and Deletion Anomalies.
2. Normalisation divides composite attributes into individual attributes, or a larger table into smaller tables, and links them
using relationships.
3. The normal form is used to reduce redundancy from the database table.
7. Types of Normal forms
1. 1NF
1. Every relation cell must have atomic value.
2. Relation must not have multi-valued attributes.
2. 2NF
1. Relation must be in 1NF.
2. There should not be any partial dependency.
1. All non-prime attributes must be fully dependent on PK.
2. Non prime attribute can not depend on the part of the PK.
3. 3NF
1. Relation must be in 2NF.
2. No transitive dependency exists.
1. A non-prime attribute should not determine another non-prime attribute.
4. BCNF (Boyce-Codd normal form)
1. Relation must be in 3NF.
2. For every non-trivial FD A -> B, A must be a super key.
1. We must not derive prime attribute from any prime or non-prime attribute.
8. Advantages of Normalisation
1. Normalisation helps to minimise data redundancy.
2. Greater overall database organisation.
3. Data consistency is maintained in DB.
LEC-12: Transaction
1. Transaction
1. A unit of work done against the DB in a logical sequence.
2. Sequence is very important in transaction.
3. It is a logical unit of work that contains one or more SQL statements. Either all the statements in a
transaction complete successfully (all the changes made to the database become permanent), or, if a failure happens at any
point, the transaction is rolled back (all the changes made so far are undone).
2. ACID Properties
1. To ensure integrity of the data, we require that the DB system maintain the following properties of the transaction.
2. Atomicity
1. Either all operations of transaction are reflected properly in the DB, or none are.
3. Consistency
1. Integrity constraints must be maintained before and after transaction.
2. DB must be consistent after transaction happens.
4. Isolation
1. Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of
transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution
after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
2. Multiple transactions can happen in the system in isolation, without interfering with each other.
5. Durability
1. After transaction completes successfully, the changes it has made to the database persist, even if there are
system failures.
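Atomicity and consistency can be seen in a small money-transfer sketch. It assumes Python's built-in sqlite3 module; the account table, the transfer function and the amounts are hypothetical. The credit is deliberately issued before the debit so that a failing debit leaves a half-done transaction to roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # undo the credit that was already applied
        return False


def balances():
    """Current balances as a dict {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account"))


ok = transfer(1, 2, 30)        # succeeds and commits
after_ok = balances()
failed = transfer(1, 2, 500)   # debit would go negative, so everything is undone
after_failed = balances()
```

After the failed transfer the balances are exactly what they were before it started, and the total across both accounts never changes.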
3. Transaction states
1. Active state
1. The very first state in the life cycle of a transaction, in which all its read and write operations are
performed. If they execute without any error, the T moves to the Partially committed state. If any
error occurs, it moves to the Failed state.
2. Partially committed state
1. After the transaction is executed, its changes are held in a buffer in main memory. Once the changes are made
permanent on the DB, the T moves to the Committed state; if there is any failure, the T
goes to the Failed state.
3. Committed state
1. When the updates are made permanent on the DB, the T is said to be in the Committed state. Rollback
can’t be done from the Committed state. A new consistent state is achieved at this stage.
4. Failed state
1. When T is being executed and some failure occurs. Due to this it is impossible to continue the execution of
the T.
5. Aborted state
1. When T reaches the Failed state, all the changes made in the buffer are reversed, i.e., the T is rolled back
completely. T reaches the Aborted state after rollback, and the DB’s state prior to the T is restored.
6. Terminated state
1. A transaction is said to have terminated if it has either committed or aborted.
LEC-13: How to implement Atomicity and Durability in Transactions
2. T can be aborted by just deleting the new copy of the DB.
3. Hence, either all updates are reflected or none.
9. Durability
1. Suppose the system fails at any time before the updated db-pointer is written to disk.
2. When the system restarts, it will read db-pointer & will thus, see the original content of DB and none of the effects of T will
be visible.
3. T is assumed to be successful only when db-pointer is updated.
4. If system fails after db-pointer has been updated. Before that all the pages of the new copy were written to disk. Hence,
when system restarts, it will read new DB copy.
eH
10. The implementation depends on the write to db-pointer being atomic. Luckily, disk systems provide atomic updates to an entire
block, or at least to a disk sector. So we make sure db-pointer lies entirely in a single sector by storing it at the beginning
of a block.
11. This scheme is inefficient, as the entire DB is copied for every transaction.
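The shadow-copy idea above can be illustrated with a small in-memory simulation. It is a sketch under assumptions: the class ShadowCopyDB, its disk dictionary and the crash_before_pointer_update flag are hypothetical stand-ins for real disk pages and a real crash, and deleting the old copy after the pointer switch is assumed here rather than taken from the notes.

```python
import copy


class ShadowCopyDB:
    """Toy shadow-copy store: 'disk' holds whole DB copies plus one pointer."""

    def __init__(self, initial):
        self.disk = {0: copy.deepcopy(initial)}  # copy id -> DB contents
        self.db_pointer = 0                      # id of the current copy
        self._next_id = 1

    def read(self):
        """Always read through db_pointer, as a restarted system would."""
        return self.disk[self.db_pointer]

    def run(self, updates, crash_before_pointer_update=False):
        """Apply updates on a fresh copy, then switch db_pointer to it."""
        new_id = self._next_id
        self._next_id += 1
        new_copy = copy.deepcopy(self.read())    # the entire DB is copied
        new_copy.update(updates)
        self.disk[new_id] = new_copy             # new copy fully written out
        if crash_before_pointer_update:
            del self.disk[new_id]                # abort: discard the new copy
            return False
        old_id = self.db_pointer
        self.db_pointer = new_id                 # the single atomic step
        del self.disk[old_id]                    # assumed cleanup of the old copy
        return True
```

If run is called with crash_before_pointer_update=True, read still returns the original contents, so none of the transaction's updates are visible; once db_pointer has been switched, all of them are.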
3. Log-based recovery methods
1. The log is a sequence of records. The log of each transaction is maintained in stable storage so that, if any failure occurs,
the database can be recovered from it.
2. Any operation performed on the database is recorded in the log.
3. The log record must be stored before the corresponding change is applied to the database.
4. Stable storage is a classification of computer data storage technology that guarantees atomicity for any given write operation
and allows software to be written that is robust against some hardware and power failures.
5. Deferred DB Modifications
1. Atomicity is ensured by recording all the DB modifications in the log but deferring the execution of all write operations
until the final action of T has been executed.
2. Log information is used to execute the deferred writes when T completes.
3. If the system crashes before T completes, or if T is aborted, the information in the log is ignored.
4. If T completes, the records associated with it in the log file are used to execute the deferred writes.
5. If a failure occurs while this updating is taking place, we perform a redo.
6. Immediate DB Modifications
1. DB modifications are output to the DB while T is still in the active state.
2. DB modifications written by an active T are called uncommitted modifications.
3. In the event of a crash or T failure, the system uses the old-value field of the log records to restore the modified values.
4. An update takes place only after its log record has been written to stable storage.
5. Failure handling
1. If the system fails before T completes, or if T is aborted, the old-value field is used to undo T.
2. If T completes and the system crashes, the new-value field is used to redo every T that has a commit record in the log.
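A minimal sketch of recovery under immediate modification follows. The log record format (tuples tagged "start", "write" and "commit") and the function name recover are assumptions made for illustration; a real DBMS log carries more information.

```python
def recover(db, log):
    """Bring db to a consistent state from a log of immediate modifications.

    Hypothetical log record format:
      ("start", tx), ("write", tx, key, old_value, new_value), ("commit", tx)
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Undo pass: scan backwards, restoring old values of uncommitted writes.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _, key, old_value, _ = rec
            db[key] = old_value

    # Redo pass: scan forwards, reapplying new values of committed writes.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, _, new_value = rec
            db[key] = new_value
    return db
```

Under deferred modification, no uncommitted write ever reaches the DB, so only the redo pass would be needed.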
LEC-14: Indexing in DBMS
1. Indexing is used to optimise the performance of a database by minimising the number of disk accesses required when a query is
processed.
2. The index is a type of data structure. It is used to locate and access the data in a database table quickly.
3. Speeds up read operations such as SELECT queries and WHERE-clause filtering.
4. Search Key: Contains a copy of the primary key, a candidate key,
or some other attribute of the table.
5. Data Reference: Pointer holding the address of the disk block
where the value of the corresponding key is stored.
6. Indexing is optional, but increases access speed. It is not the
primary means of accessing a tuple; it is a secondary means.
7. The index file is always sorted.
8. Indexing Methods
1. Primary Index (Clustering Index)
1. A file may have several indices, on different search keys. If the data file containing the records is sequentially ordered, a
primary index is an index whose search key also defines the sequential order of the file.
2. NOTE: The term primary index is sometimes used to mean an index on a primary key. However, such usage is
nonstandard and should be avoided.
3. All files are ordered sequentially on some search key. It could be the primary key or a non-primary key.
4. Dense And Sparse Indices
1. Dense Index
1. The dense index contains an index record for every search key value in the data file.
2. The index record contains the search-key value and a pointer to the first data record with that search-key value.
The rest of the records with the same search-key value would be stored sequentially after the first record.
3. It needs more space, because the index itself stores a record (the search key plus a pointer to the actual
record on the disk) for every search-key value.
2. Sparse Index
1. An index record appears for only some of the search-key values.
2. A sparse index addresses the space overhead of dense indexing. In this method, one index entry covers a
range of search-key values that share the same data block address; when data needs to be retrieved, the block
address is fetched and the block is scanned for the record.
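The difference between the two kinds of index can be sketched on a tiny sorted data file. Everything here is hypothetical illustration: the sample blocks, the names dense and sparse_keys, and the use of a Python dict and list in place of an on-disk index file.

```python
from bisect import bisect_right

# Data file: blocks of records sorted on the search key (hypothetical data).
blocks = [
    [(10, "Asha"), (20, "Bala")],
    [(30, "Chen"), (40, "Dina")],
    [(50, "Eli"), (60, "Fay")],
]

# Dense index: one entry per search-key value -> (block number, slot).
dense = {
    key: (b, s)
    for b, blk in enumerate(blocks)
    for s, (key, _) in enumerate(blk)
}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [blk[0][0] for blk in blocks]


def dense_lookup(key):
    """Follow the index entry straight to the record."""
    pos = dense.get(key)
    if pos is None:
        return None
    b, s = pos
    return blocks[b][s][1]


def sparse_lookup(key):
    """Find the last index entry with key <= search key, then scan that block."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, value in blocks[b]:
        if k == key:
            return value
    return None
```

The dense index holds six entries for six records, while the sparse index holds three entries for three blocks, at the cost of a short sequential scan inside the chosen block.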
5. Primary indexing can be based on a data file sorted w.r.t. the primary key attribute or a non-key attribute.
6. Based on Key attribute
1. Data file is sorted w.r.t primary key attribute.
2. PK will be used as search-key in Index.
3. A sparse index will be formed, i.e., no. of entries in the index file = no. of blocks in the data file.
7. Based on Non-Key attribute
1. Data file is sorted w.r.t non-key attribute.
2. No. of entries in the index = no. of unique non-key attribute values in the data file.
3. This is a dense index, as all the unique values have an entry in the
index file.
4. E.g., Let’s assume that a company recruited many employees in
LEC-15: NoSQL
1. NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than relational tables. NoSQL databases
come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They
provide flexible schemas and scale easily with large amounts of data and high user loads.
1. They are schema free.
2. The data structures used are not tabular; they are more flexible and can adjust dynamically.
3. Can handle huge amounts of data (big data).
4. Most NoSQL databases are open source and are capable of horizontal scaling.
5. They simply store data in some format other than relational.
2. History behind NoSQL
1. NoSQL databases emerged in the late 2000s as the cost of storage dramatically decreased. Gone were the days of needing to
create a complex, difficult-to-manage data model in order to avoid data duplication. Developers (rather than storage) were
becoming the primary cost of software development, so NoSQL databases optimised for developer productivity.
2. Data was becoming increasingly unstructured, so structuring it (defining a schema in advance) was becoming costly.
3. NoSQL databases allow developers to store huge amounts of unstructured data, giving them a lot of flexibility.
4. Developers recognised the need to rapidly adapt to changing requirements in a software system. They needed the ability to iterate
quickly and make changes throughout their software stack, all the way down to the database. NoSQL databases gave them
this flexibility.
5. Cloud computing also rose in popularity, and developers began using public clouds to host their applications and data. They
wanted the ability to distribute data across multiple servers and regions to make their applications resilient, to scale out instead
of scale up, and to intelligently geo-place their data. Some NoSQL databases like MongoDB provide these capabilities.
3. NoSQL Databases Advantages
A. Flexible Schema
1. An RDBMS has a pre-defined schema, which becomes an issue when we do not have all the data with us or we need to change
the schema. Changing the schema on the go is a huge task.
B. Horizontal Scaling
1. Horizontal scaling, also known as scale-out, refers to bringing on additional nodes to share the load. This is difficult with
relational databases due to the difficulty in spreading out related data across nodes. With non-relational databases, this is
made simpler since collections are self-contained and not coupled relationally. This allows them to be distributed across
nodes more simply, as queries do not have to “join” them together across nodes.
2. Scaling horizontally is achieved through Sharding OR Replica-sets.
C. High Availability
1. NoSQL databases are highly available due to their auto-replication feature, i.e., whenever any kind of failure happens, the data
replicates itself back to the preceding consistent state.
2. If a server fails, we can access that data from another server, as data in a NoSQL database is stored on multiple
servers.
D. Easy insert and read operations.
1. Queries in NoSQL databases can be faster than SQL databases. Why? Data in SQL databases is typically normalised, so
queries for a single object or entity require you to join data from multiple tables. As your tables grow in size, the joins can
become expensive. However, data in NoSQL databases is typically stored in a way that is optimised for queries. The rule of
thumb when you use MongoDB is data that is accessed together should be stored together. Queries typically do not require
joins, so the queries are very fast.
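The "accessed together, stored together" rule can be sketched with plain Python data. The tables, field names and functions below are hypothetical; they only contrast a join across two normalised tables with a single document lookup.

```python
# Normalised (relational-style) layout: two "tables" linked by user_id.
users = [{"user_id": 1, "name": "Asha"}]
purchases = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
]


def purchases_via_join(user_id):
    """Read one user's data by matching rows across both tables."""
    user = next(u for u in users if u["user_id"] == user_id)
    items = [p["item"] for p in purchases if p["user_id"] == user_id]
    return {"name": user["name"], "items": items}


# Document-style layout: data accessed together is stored together.
user_docs = {
    1: {"name": "Asha", "items": ["book", "pen"]},
}


def purchases_via_document(user_id):
    """Read the same data with a single lookup and no join."""
    return user_docs[user_id]
```

Both functions return the same result, but the document version touches one stored object, whereas the join version scans the purchases table, which grows more expensive as that table grows.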
2. Column-Oriented / Columnar / C-Store / Wide-Column
1. The data is stored such that each row of a column will be next to other rows from that same column.
2. While a relational database stores data in rows and reads data row by row, a column store is organised as a set of columns.
This means that when you want to run analytics on a small number of columns, you can read those columns directly
without consuming memory with the unwanted data. Columns are often of the same type and benefit from more efficient
compression, making reads even faster. Columnar databases can quickly aggregate the value of a given column (adding up
the total sales for the year, for example). Use cases include analytics.
3. e.g., Cassandra, RedShift, Snowflake.
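The row-versus-column layout can be sketched as follows. The sample records and function names are hypothetical, and real column stores add compression and on-disk formats that this sketch leaves out.

```python
# The same three sales records in two layouts (hypothetical data).
row_store = [
    {"id": 1, "region": "north", "sales": 120},
    {"id": 2, "region": "south", "sales": 80},
    {"id": 3, "region": "north", "sales": 200},
]

column_store = {
    "id": [1, 2, 3],
    "region": ["north", "south", "north"],
    "sales": [120, 80, 200],
}


def total_sales_rows(rows):
    # Every full row is visited even though only one field is needed.
    return sum(row["sales"] for row in rows)


def total_sales_columns(columns):
    # Only the "sales" column is read; the other columns are never touched.
    return sum(columns["sales"])
```

Both functions give the same total; the columnar version simply reads one contiguous list, which is why aggregations over a few columns suit this layout.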
3. Document Based Stores
1. These databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of
fields and values. The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or
objects.
2. Use cases include e-commerce platforms, trading platforms, and mobile app development across industries.
3. Supports ACID properties, hence suitable for transactions.
4. e.g., MongoDB, CouchDB.
4. Graph Based Stores
1. A graph database focuses on the relationship between data elements. Each element is stored as a node (such as a person
in a social media graph). The connections between elements are called links or relationships. In a graph database,
connections are first-class elements of the database, stored directly. In relational databases, links are implied, using data to
express the relationships.
2. A graph database is optimised to capture and search the connections between data elements, overcoming the overhead
associated with JOINing multiple tables in SQL.
3. Very few real-world business systems can survive solely on graph queries. As a result graph databases are usually run
alongside other more traditional databases.
4. Use cases include fraud detection, social networks, and knowledge graphs.
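A relationship query of the kind a graph store is built for can be sketched with an adjacency list and a breadth-first traversal. The sample graph and the function within_hops are hypothetical and stand in for a real graph database and its query language.

```python
from collections import deque

# Nodes are people; each edge is a stored "friend" relationship (hypothetical data).
graph = {
    "asha": ["bala", "chen"],
    "bala": ["asha", "dina"],
    "chen": ["asha"],
    "dina": ["bala", "eli"],
    "eli": ["dina"],
}


def within_hops(start, max_hops):
    """Return everyone reachable from start in at most max_hops edges (BFS)."""
    seen = {start: 0}  # node -> number of hops from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return {name for name, hops in seen.items() if hops > 0}
```

Because the connections are stored directly with each node, a "friends of friends" query follows edges instead of joining tables.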
7. NoSQL Databases Disadvantages
1. Data Redundancy
1. Since data models in NoSQL databases are typically optimised for queries and not for reducing data duplication, NoSQL
databases can be larger than SQL databases. Storage is currently so cheap that most consider this a minor drawback, and
some NoSQL databases also support compression to reduce the storage footprint.
2. Update & Delete operations are costly.
3. No single type of NoSQL data model fulfils all of your application needs
1. Depending on the NoSQL database type you select, you may not be able to achieve all of your use cases in a single
database. For example, graph databases are excellent for analysing relationships in your data but may not provide what
you need for everyday retrieval of the data such as range queries. When selecting a NoSQL database, consider what your
use cases will be and if a general purpose database like MongoDB would be a better option.
4. Do not support ACID properties in general.
5. Do not support data entry with consistency constraints.
8. SQL vs NoSQL
1. Data Storage Model
1. SQL: Tables with fixed rows and columns.
2. NoSQL: Document: JSON documents; Key-value: key-value pairs; Wide-column: tables with rows and dynamic columns; Graph: nodes and edges.
2. Development History
1. SQL: Developed in the 1970s with a focus on reducing data duplication.
2. NoSQL: Developed in the late 2000s with a focus on scaling and allowing for rapid application change driven by agile and DevOps practices.
... Cassandra and HBase, Graph: Neo4j and Amazon Neptune
... value: large amounts of data with simple lookup queries, Wide-column: large amounts of data with predictable query patterns, Graph: analyzing and traversing relationships between connected data
... MongoDB etc.
1. Relational Databases
1. Based on Relational Model.
2. Relational databases are quite popular, even though the relational model was designed in the 1970s. Also known as relational database
management systems (RDBMS), relational databases commonly use Structured Query Language (SQL) for operations such as
creating, reading, updating, and deleting data. Relational databases store information in discrete tables, which can be JOINed
together by fields known as foreign keys. For example, you might have a User table which contains information about all your
users, and join it to a Purchases table, which contains information about all the purchases they’ve made. MySQL, Microsoft SQL
Server, and Oracle are examples of relational database systems.
3. They are ubiquitous, having acquired a steady user base since the 1970s.
4. They are highly optimised for working with structured data.
5. They provide a stronger guarantee of data normalisation.
6. They use a well-known querying language, SQL.
7. Scalability issues (horizontal scaling).
8. As data becomes huge, the system becomes more complex.
2. Object Oriented Databases
1. The object-oriented data model is based on the object-oriented programming paradigm, which is now in wide use.
Inheritance, object-identity, and encapsulation (information hiding), with methods to provide an interface to objects, are
among the key concepts of object-oriented programming that have found applications in data modelling. The object-oriented
data model also supports a rich type system, including structured and collection types. While inheritance and, to some extent,
complex types are also present in the E-R model, encapsulation and object-identity distinguish the object-oriented data model
from the E-R model.
2. Sometimes the database can be very complex, having multiple relations. So, maintaining a relationship between them can be
tedious at times.
1. In Object-oriented databases data is treated as an object.
2. All bits of information come in one instantly available object package instead of multiple tables.
3. Advantages
1. Data storage and retrieval is easy and quick.
2. Can handle complex data relations and a wider variety of data types than standard relational databases.
3. Relatively friendly for modelling advanced real-world problems.
4. Works with the functionality of OOP and object-oriented languages.
4. Disadvantages
1. High complexity causes performance issues: read, write, update and delete operations are slowed down.
2. Not much community support, as it isn’t as widely adopted as relational databases.
3. Does not support views like relational databases do.
LEC-17: Clustering in DBMS
1. Database Clustering (making Replica-sets) is the process of combining more than one server or instance connected to a single database.
Sometimes one server may not be adequate to manage the amount of data or the number of requests; that is when a data cluster is needed.
Database clustering, SQL server clustering, and SQL clustering are closely associated terms, since SQL is the language used to manage the database
information.
2. Replicate the same dataset on different servers.
3. Advantages
1. Data Redundancy: Clustering of databases helps with data redundancy, as we store the same data on multiple servers. Don’t confuse this
data redundancy with the repetition of the same data that might lead to anomalies. The redundancy that clustering offers is required, and it is
kept reliable by synchronisation. If any of the servers fails for any reason, the data is still available on the other
servers.
2. Load balancing: Load balancing (or scalability) doesn’t come by default with the database; it has to be brought in by clustering, and it also depends on the
setup. Basically, load balancing allocates the workload among the different servers that are part of the cluster. This means that
more users can be supported, and if a huge spike in traffic appears, there is a higher assurance that the cluster will be able to
support the new traffic. One machine is not going to get all of the hits. This can provide scaling seamlessly as required, and it links directly to
high availability. Without load balancing, a particular machine could get overworked and traffic would slow down, eventually dropping
to zero.
3. High availability: When you can access a database, it is available. High availability refers to the amount of time a database is
considered available. The amount of availability you need greatly depends on the number of transactions you are running on your database
and how often you are running any kind of analytics on your data. With database clustering, we can reach extremely high levels of availability
thanks to load balancing and the extra machines. If one server shuts down, the database will still be available.
4. How does Clustering Work?
1. In a cluster architecture, requests are split among many computers, so that an individual user request is executed and served by a number of
computer systems. Clustering is useful chiefly because of its load balancing and high availability. If one node collapses, the
request is handled by another node. Consequently, there are few or no possibilities of absolute system failure.
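The behaviour described above, spreading requests over replicas and skipping a node that has collapsed, can be sketched as a toy round-robin balancer. The Cluster class and its methods are hypothetical; real clusters also handle writes, synchronisation and health checks.

```python
from itertools import cycle


class Cluster:
    """Toy cluster: round-robin load balancing over replicas of one dataset."""

    def __init__(self, node_names, data):
        self.nodes = {name: dict(data) for name in node_names}  # full replicas
        self.up = set(node_names)
        self._ring = cycle(node_names)

    def fail(self, name):
        """Mark a node as down."""
        self.up.discard(name)

    def read(self, key):
        """Send the request to the next healthy node in round-robin order."""
        for _ in range(len(self.nodes)):
            name = next(self._ring)
            if name in self.up:
                return name, self.nodes[name][key]
        raise RuntimeError("no node available")
```

With three nodes, successive reads rotate through all of them; after fail("n2") the reads keep succeeding on the remaining two nodes, which is the high-availability property in miniature.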
LEC-18: Partitioning & Sharding in DBMS (DB Optimisation)
1. A big problem can be solved easily when it is chopped into several smaller sub-problems. That is what the partitioning technique does. It divides a
big database containing data metrics and indexes into smaller and handy slices of data called partitions. The partitioned tables are directly used by
SQL queries without any alteration. Once the database is partitioned, the data definition language can easily work on the smaller partitioned slices,
instead of handling the giant database altogether. This is how partitioning cuts down the problems in managing large database tables.
2. Partitioning is the technique used to divide stored database objects across separate servers. This increases performance and the
controllability of the data, and lets us manage huge chunks of data optimally. When we scale our machines/servers horizontally, relational
databases are challenging to deal with, as it’s quite tough to maintain the relations across servers. But if we apply partitioning to a database that is
already scaled out, i.e., equipped with multiple servers, we can partition the database among those servers and handle big data easily.
3. Vertical Partitioning
1. Slicing relation vertically / column-wise.
2. Need to access different servers to get complete tuples.
4. Horizontal Partitioning
1. Slicing relation horizontally / row-wise.
2. Independent chunks of data tuples are stored in different servers.
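The two slicing directions can be sketched on a small relation. The sample data and the functions vertical_partition, horizontal_partition and rebuild are hypothetical; splitting rows by id modulo the number of parts is just one possible assignment rule.

```python
# One relation with three tuples (hypothetical data).
relation = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Bala", "city": "Delhi"},
    {"id": 3, "name": "Chen", "city": "Pune"},
]


def vertical_partition(rows, column_groups, key="id"):
    """Split column-wise; each part keeps the key so tuples can be rebuilt."""
    return [
        [{c: row[c] for c in [key] + cols} for row in rows]
        for cols in column_groups
    ]


def horizontal_partition(rows, n_parts, key="id"):
    """Split row-wise; each part holds complete tuples."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[row[key] % n_parts].append(row)
    return parts


def rebuild(vertical_parts, key="id"):
    """Rejoin vertical parts on the key to get complete tuples back."""
    merged = {}
    for part in vertical_parts:
        for row in part:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]
```

Rebuilding a complete tuple from vertical parts needs every part (every server), whereas each horizontal part already contains whole tuples.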
5. When is Partitioning Applied?
1. The dataset becomes so huge that managing and dealing with it becomes a tedious task.
2. The number of requests is so large that access to a single DB server takes a huge amount of time, and hence the system’s response time becomes
high.
6. Advantages of Partitioning
1. Parallelism
2. Availability
3. Performance
4. Manageability
5. Reduced cost, as scaling up (vertical scaling) might be costly.
7. Distributed Database
1. A single logical database that is spread across multiple locations (servers) and logically interconnected by a network.
2. This is the product of applying DB optimisation techniques like Clustering, Partitioning and Sharding.
3. Why is this needed? See point 5.
8. Sharding
1. Technique to implement Horizontal Partitioning.
2. The fundamental idea of Sharding is that, instead of having all the data sit on one DB instance, we split it up and introduce a
Routing layer so that we can forward each request to the right instance, the one that actually contains the data.
3. Pros
1. Scalability
2. Availability
4. Cons
1. Complexity: a partition mapping must be made and a Routing layer implemented in the system, and non-uniform data distribution creates the need for Re-Sharding.
2. Not well suited for analytical queries, as the data is spread across different DB instances (the Scatter-Gather problem).
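A routing layer of the kind described above can be sketched with hash-based sharding. The ShardRouter class is hypothetical, plain dictionaries stand in for DB instances, and hashing the key is only one possible partition mapping (range-based or directory-based mappings are alternatives).

```python
import hashlib


class ShardRouter:
    """Toy routing layer: maps each key to one of several DB instances."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]  # stand-ins for DB instances

    def _shard_for(self, key):
        # Stable hash so the same key always routes to the same shard.
        digest = hashlib.sha256(str(key).encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        # A point lookup touches exactly one shard.
        return self.shards[self._shard_for(key)].get(key)

    def count_all(self):
        # An analytical query must visit every shard (scatter-gather).
        return sum(len(shard) for shard in self.shards)
```

Note that changing the number of shards changes the modulo result for most keys, which is one reason re-sharding is costly with this simple mapping.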
LEC-20: CAP Theorem
5. CAP Theorem and NoSQL Databases: NoSQL databases are great for distributed networks. They allow for horizontal scaling, and they can quickly scale
across multiple nodes. When deciding which NoSQL database to use, it’s important to keep the CAP theorem in mind.
1. CA Databases: CA databases enable consistency and availability across all nodes. Unfortunately, CA databases can’t deliver fault tolerance. In
any distributed system, partitions are bound to happen, which means this type of database isn’t a very practical choice. That being said, you can still
find a CA database if you need one. Some relational databases, such as MySQL or PostgreSQL, allow for consistency and availability. You can
deploy them to nodes using replication.
2. CP Databases: CP databases enable consistency and partition tolerance, but not availability. When a partition occurs, the system has to turn
off inconsistent nodes until the partition can be fixed. MongoDB is an example of a CP database. It’s a NoSQL database management system
(DBMS) that uses documents for data storage. It’s considered schema-less, which means that it doesn’t require a defined database schema. It’s
commonly used in big data and applications running in different locations. The CP system is structured so that there’s only one primary
node that receives all of the write requests in a given replica set. Secondary nodes replicate the data in the primary node, so if the
primary node fails, a secondary node can stand in. In a banking system, availability is not as important as consistency, so we can opt for a CP
database such as MongoDB.
3. AP Databases: AP databases enable availability and partition tolerance, but not consistency. In the event of a partition, all nodes are available,
but they’re not all updated. For example, if a user tries to access data from a bad node, they won’t receive the most up-to-date version of the
data. When the partition is eventually resolved, most AP databases will sync the nodes to ensure consistency across them. Apache Cassandra is
an example of an AP database. It’s a NoSQL database with no primary node, meaning that all of the nodes remain available. Cassandra allows
for eventual consistency because users can re-sync their data right after a partition is resolved. For apps like Facebook, where we value availability
more than consistency, we’d opt for AP databases like Cassandra or Amazon DynamoDB.
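The CP and AP trade-off during a partition can be sketched with a toy replica model. The Replica class and the functions read_cp, read_ap and write are hypothetical and greatly simplified; they do not reflect how MongoDB or Cassandra are implemented.

```python
class Replica:
    def __init__(self):
        self.value = "v1"
        self.reachable = True  # False while cut off from the primary by a partition


def read_cp(replica):
    """CP behaviour: refuse to answer rather than risk returning stale data."""
    if not replica.reachable:
        raise RuntimeError("unavailable during partition")
    return replica.value


def read_ap(replica):
    """AP behaviour: always answer, even if the value may be stale."""
    return replica.value


def write(primary, replicas, value):
    """Apply a write on the primary and copy it to every reachable replica."""
    primary.value = value
    for r in replicas:
        if r.reachable:
            r.value = value
```

If one replica is unreachable when a write happens, read_ap on it still returns the old value (available but inconsistent), while read_cp raises an error (consistent but unavailable).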
LEC-21: The Master-Slave Database Concept
1. Master-Slave is a general way to optimise I/O in a system where the number of requests grows so high that a single DB server cannot handle it
efficiently.
2. It is Pattern 3 in LEC-19 (Database Scaling Pattern): Command Query Responsibility Segregation.
3. The true, latest data is kept in the Master DB, so write operations are directed there. Read operations are done only from the slaves. This architecture
serves to safeguard site reliability and availability, reduce latency, etc. If a site receives a lot of traffic and the only available
database is one master, it will be overloaded with read and write requests, making the entire system slow for everyone on the site.
4. DB replication takes care of distributing data from the Master machine to the Slave machines. This can be synchronous or asynchronous depending