1. Some important terms
1.1. Common terms
Common SQL clauses: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
SQL: Structured Query Language, also used to refer to databases that use SQL as their query language.
Components of SQL
Data Definition Language (DDL): This component is used for defining, altering, and dropping database
objects such as tables and indexes.
Data Manipulation Language (DML): This component is used for inserting, appending, modifying, and
removing the data from the table.
Data Query Language (DQL): This is used to display or retrieve the data from the table in the form of
records.
Data Control Language (DCL): This component is used for defining and providing security for tables
and other database objects, for example by granting and revoking privileges.
Transaction Control Language (TCL): This is used to commit the transaction or to roll-back the
transaction. The transaction is saved after execution or the transaction can be undone.
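The components above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a reference: the `student` table, its columns, and the sample rows are hypothetical, and SQLite has no DCL statements (GRANT/REVOKE), so that component appears only as a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define database objects (a table and an index). Names are hypothetical.
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
cur.execute("CREATE INDEX idx_student_dept ON student (dept)")

# DML: insert and modify rows.
cur.execute("INSERT INTO student (name, dept) VALUES (?, ?)", ("Ana", "Physics"))
cur.execute("INSERT INTO student (name, dept) VALUES (?, ?)", ("Ben", "History"))
cur.execute("UPDATE student SET dept = ? WHERE name = ?", ("Math", "Ben"))

# TCL: commit saves the changes made so far.
conn.commit()

# DML followed by TCL: the delete is undone by the rollback.
cur.execute("DELETE FROM student WHERE name = ?", ("Ana",))
conn.rollback()

# DCL (GRANT / REVOKE) is not available in SQLite; server databases
# such as PostgreSQL or MySQL provide it.

# DQL: retrieve records.
rows = cur.execute("SELECT name, dept FROM student ORDER BY name").fetchall()
print(rows)
```

Because the delete was rolled back, both students are still returned by the final query.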
NoSQL: used to refer to a class of databases that are non-relational and do not use SQL as their query
language. They could perhaps be better called Distributed Database Management Systems (or DDBMS),
but for now, the popular term is NoSQL.
Data: Information, especially facts or numbers, collected to be examined and considered and used to
help decision-making, or information in an electronic form that can be stored and used by a computer
(Cambridge English)
• Data refers to raw facts.
• A non-random sequence of characters, numbers, values, or words.
• A collection of non-random facts recorded through observation or research.
• When data is processed and becomes meaningful, it transforms into information.
Information: Facts about a situation, person, event, etc. (Cambridge English)
• Information is a collection of organized and processed facts that hold additional value beyond that of
individual data points.
• Data that has been processed and holds significance.
• The processing of data is purpose-driven.
• Information can be interpreted and understood by the recipient.
• Information serves to reduce uncertainty in situations or events, thereby aiding in decision-making.
Picture 1: DIKW Model
Databases: A database is a structured repository or collection of data that is stored and retrieved
electronically for use in applications. Data can be stored, updated, or deleted from a database.
Database Management System (DBMS): The term combines two parts, database and management
system. The data part is defined above; the management system is software, or a set of programs, used
to manage (save and manipulate) data easily. The overall purpose of a Database Management System is
to provide a centralized system for managing data effectively and efficiently. Centralized means that,
instead of maintaining multiple copies of the same data in multiple places, the data is consolidated in
one central place (to reduce duplication and maintenance complexity), and access is then granted to the
intended recipients. For example, instead of maintaining the same student’s records in separate
registers in different departments of a university or college, the records are consolidated in a single
place and all departments are given access.
Relational database management system: An RDBMS is a type of database management system
(DBMS) that stores data in a row-based table structure which connects related data elements. An
RDBMS includes functions that maintain the security, accuracy, integrity and consistency of the data.
Data Modeling: The process of creating data models for an information system. Data modeling often
translates directly into database modeling, as a database design is the essential end state.
Entity Relationship Diagram: An Entity Relationship Diagram (ERD) is a visual representation of different
entities within a system and how they relate to each other.
Data warehouse: A copy of transaction data specifically structured for query and analysis - Ralph Kimball;
a warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management’s decision-making process - Bill Inmon
- Subject-oriented: A data warehouse is organized around a major subject such as customer, products
and sales. That is, data are organized according to a subject instead of application. For example, an
insurance company using a data warehouse would organize its data by customer, premium and claim
instead of by different policies (auto sweep policy, joint life policy, etc.).
- Non-volatile: A data warehouse is always a physically separated store of data. Due to this separation,
data warehouse does not require transaction processing, recovery, concurrency control and so on. The
data are not overwritten or deleted once they enter the data warehouse, but are only loaded, refreshed
and accessed for queries. The data in the data warehouse are retained for future reporting.
- Time-variant: The data are stored in a data warehouse to provide a historical perspective. Thus, the
data in the data warehouse are time-variant or historical in nature. The data in the warehouse often
span 5 to 10 years or more, and are used for comparisons, trend analysis and forecasting. The changes
to the data in the data warehouse are tracked and recorded so that reports can be produced showing the
changes over time.
- Integrated: A data warehouse is usually constructed by integrating multiple, heterogeneous sources
such as relational databases and flat files. The database contains data from most or all of an
organization’s operational applications, and these data are made consistent.
- Data granularity: The level of detail or summarization of the individual units of data in a data
warehouse is termed data granularity. In general, the level of granularity is inversely related to the
level of detail: the lower the level of granularity, the more detail is available, and vice versa. A data
warehouse mostly stores data at the lowest level of granularity, which helps users answer their queries.
For example, when a user queries the data warehouse of a grocery store for analysis, he or she may
look at the daily details of the products ordered, and may also easily look up the units of a product
ordered in a particular month or a quarterly summary, because the data are summarized at different
levels. However, a large volume of data must be stored in the data warehouse to answer such queries.
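The grocery-store example can be sketched as follows. This is a minimal illustration under assumptions: the `daily_orders` rows and the `roll_up` helper are hypothetical, and the point is only that data kept at the lowest granularity (one row per product per day) can be summarized to coarser levels on demand.

```python
from collections import defaultdict
from datetime import date

# Hypothetical lowest-granularity data: one row per product per day.
daily_orders = [
    (date(2024, 1, 5), "milk", 10),
    (date(2024, 1, 20), "milk", 7),
    (date(2024, 2, 3), "milk", 4),
    (date(2024, 4, 9), "milk", 6),
]

def roll_up(rows, period_of):
    """Sum units per (product, period); period_of maps a date to a period key."""
    totals = defaultdict(int)
    for day, product, units in rows:
        totals[(product, period_of(day))] += units
    return dict(totals)

# Coarser granularity derived from the same detailed rows.
monthly = roll_up(daily_orders, lambda d: (d.year, d.month))
quarterly = roll_up(daily_orders, lambda d: (d.year, (d.month - 1) // 3 + 1))

print(monthly)
print(quarterly)
```

The reverse is not possible: daily detail cannot be recovered from a table that stores only monthly totals, which is why warehouses tend to keep the most detailed level.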
Dimensional modeling is a logical design technique for structuring data so that it’s intuitive to business
users and delivers fast query performance. Dimensional modeling is widely accepted as the preferred
approach for DW/BI presentation.
Dimensional modeling divides the world into measurements and context. Measurements are captured
by the organization’s business processes and their supporting operational source systems.
Measurements are usually numeric values; we refer to them as facts. Facts are surrounded by largely
textual context that is true at the moment the fact is recorded. This context is intuitively divided into
independent logical clumps called dimensions. Dimensions describe the “who, what, when, where, why,
and how” context of the measurement.
Dimensions are an essential and distinguishing concept in multidimensional databases. An important
goal of multidimensional modeling is to use dimensions to provide as much context as possible for facts.
Dimensions are used for selecting and aggregating data at the desired level of detail.
Facts represent the subject—the interesting pattern or event in the enterprise that must be analyzed to
understand its behavior. In most multidimensional data models, facts are implicitly defined by their
combination of dimension values; a fact exists only if there is a nonempty cell for a particular
combination of values. However, some models treat facts as first-class objects with a separate identity.
Most multidimensional models also require mapping each fact to one value at the lowest level in each
dimension, but some models relax this mapping requirement.
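One common way to lay out facts and dimensions in a relational database is a star schema. The sketch below uses Python's sqlite3 module; the table names (`fact_sales`, `dim_product`, `dim_store`), columns, and rows are all hypothetical and chosen only to show a numeric fact surrounded by textual context.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: textual context (the "what" and the "where").
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, location TEXT);

-- Fact table: one numeric measurement per sale, linked to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    sale_date  TEXT,
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'Milk', 'Dairy'), (2, 'Cheese', 'Dairy'), (3, 'Apple', 'Fruit');
INSERT INTO dim_store   VALUES (1, 'Downtown'), (2, 'Airport');
INSERT INTO fact_sales  VALUES
    (1, 1, '2024-01-05', 10.0),
    (2, 1, '2024-01-06', 25.0),
    (3, 2, '2024-01-06', 5.0),
    (1, 2, '2024-02-01', 12.0);
""")

# Use a dimension attribute (category) to select and aggregate the facts.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

Grouping by a different dimension attribute (for example `dim_store.location`) gives another view of the same facts without changing the fact table.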
Business Intelligence: Business Intelligence (BI) is the process of making “intelligent” business decisions
based on the analysis of available data. From a technology point of view, BI covers the combined areas
of data warehousing, reporting, OLAP, data mining, some data visualization, what-if analysis, and
special-purpose analytical applications.
BI’s major objective is to enable interactive access (sometimes in real time) to data, to enable
manipulation of data, and to give business managers and analysts the ability to conduct appropriate
analysis. By analyzing historical and current data, situations, and performances, decision makers get
valuable insights that enable them to make more informed and better decisions. The process of BI is
based on the transformation of data to information, then to decisions, and finally to actions.
Imagine you are the manager of a retail store chain, and you want to increase your sales. You have
access to a wealth of data, including sales figures, customer demographics, and inventory levels. Here's
how BI can help:
1. Data Gathering: You collect and store all this data in a centralized data warehouse, ensuring it's up-to-
date and organized.
2. Reporting: BI tools allow you to generate reports and dashboards that provide insights into your
business operations. For instance, you can create a report showing which products are selling well in
which locations and at what times.
3. OLAP Analysis: With Online Analytical Processing (OLAP), you can slice and dice your data in various
ways. For instance, you can analyze sales data by product category, region, or time period. This helps
identify trends and patterns.
4. Data Mining: BI tools can perform data mining to discover hidden patterns or relationships in your
data. For example, you might find that certain products are often bought together, allowing you to
create targeted promotions.
5. Data Visualization: BI tools often include data visualization features, such as charts and graphs.
Visualizing your data can make it easier to spot trends and anomalies. For instance, a sales heatmap
might show that certain days of the week are consistently busier than others.
6. What-If Analysis: BI tools enable you to perform "what-if" scenarios. You can simulate the impact of
various decisions, such as changing pricing strategies or opening new store locations, to see how they
might affect sales and profitability.
7. Special-Purpose Analytical Applications: In some cases, specialized BI applications can help with
specific tasks. For instance, a demand forecasting tool can predict future sales based on historical data
and external factors like holidays or economic trends.
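As a small illustration of the data mining step (finding products that are often bought together), the sketch below counts co-occurring product pairs. The `baskets` data is hypothetical, and plain pair counting is a deliberately simple stand-in for what a real BI tool would do.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is the content of one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # sorted so (a, b) == (b, a)
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

A pair that shows up in many baskets, like bread and butter here, is a candidate for a targeted promotion.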
1.2. When to use a relational database?
Advantages of Using a Relational Database
Flexibility for writing SQL queries: SQL is the most common database query language.
Modeling the data, not modeling queries
Ability to do JOINS
Ability to do aggregations and analytics
Secondary Indexes available: You have the advantage of being able to add another index to help with
quick searching.
Smaller data volumes: If you have a smaller data volume (and not big data) you can use a relational
database for its simplicity.
ACID Transactions: Allow you to meet a set of properties of database transactions intended to
guarantee validity even in the event of errors or power failures, and thus maintain data integrity.
Easier to adapt to changing business requirements
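Three of the advantages listed above (JOINs, aggregations, and secondary indexes) can be sketched with Python's sqlite3 module. The `customer` and `orders` tables and their rows are hypothetical, and the exact wording of the query plan varies between SQLite versions, so the code only checks whether the index name is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    total REAL
);
INSERT INTO customer VALUES (1, 'Ana', 'Lisbon'), (2, 'Ben', 'Oslo'), (3, 'Cy', 'Oslo');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 2, 35.0), (3, 3, 5.0), (4, 2, 10.0);

-- Secondary index: an extra index on a non-key column to speed up lookups by city.
CREATE INDEX idx_customer_city ON customer (city);
""")

# JOIN + aggregation: number of orders and revenue per city.
revenue_by_city = conn.execute("""
    SELECT c.city, COUNT(*), SUM(o.total)
    FROM orders AS o
    JOIN customer AS c ON c.id = o.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(revenue_by_city)

# Ask SQLite how it would run a lookup by city; the last column of each
# plan row is a human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM customer WHERE city = ?", ("Oslo",)
).fetchall()
uses_index = any("idx_customer_city" in row[-1] for row in plan)
print(uses_index)
```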
1.3. ACID Transactions
Properties of database transactions intended to guarantee validity even in the event of errors or power
failures.
Atomicity: The whole transaction is processed or nothing is processed. A commonly cited example of an
atomic transaction is a money transfer between two bank accounts. The transfer is made up of two
operations. First, you withdraw money from one account, and second, you deposit the withdrawn money
into the second account. An atomic transaction, i.e., one in which either all operations occur or none
occur, keeps the database in a consistent state. This ensures that if either of those two operations
(withdrawing money from the first account or depositing it into the second account) fails, the money is
neither lost nor created. Source: Wikipedia, which has a detailed description of this example.
Consistency: Only transactions that abide by constraints and rules are written into the database,
otherwise the database keeps the previous state. The data should be correct across all rows and tables.
Check out additional information about consistency on Wikipedia.
Isolation: Transactions are processed independently and securely; the order does not matter. A low level
of isolation enables many users to access the data simultaneously; however, this also increases the
possibility of concurrency effects (e.g., dirty reads or lost updates). On the other hand, a high level of
isolation reduces the chance of concurrency effects, but also uses more system resources and makes
transactions more likely to block each other. Source: Wikipedia
Durability: Completed transactions are saved to the database even in cases of system failure. A commonly
cited example includes tracking flight seat bookings. So once the flight booking records a confirmed seat
booking, the seat remains booked even if a system failure occurs. Source: Wikipedia.
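The bank-transfer example for atomicity, together with a consistency rule, can be sketched with Python's sqlite3 module. The `account` table, the `transfer` helper, and the balances are hypothetical. The sketch relies on the connection's context manager, which commits if the block succeeds and rolls back if it raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Consistency rule: a balance may never go negative (CHECK constraint).
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates happen or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The withdrawal violated the CHECK constraint, so the deposit
        # made just before it was rolled back as well.
        return False

ok = transfer(conn, 1, 2, 30)        # succeeds
failed = transfer(conn, 1, 2, 500)   # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

The deposit is issued before the withdrawal on purpose: when the withdrawal fails, the already-applied deposit is undone too, so no money is created.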
1.4. When Not to Use a Relational Database
Have large amounts of data: Relational Databases are not distributed databases and because of this
they can only scale vertically by adding more storage in the machine itself. You are limited by how much
you can scale and how much data you can store on one machine. You cannot add more machines like
you can in NoSQL databases.
Need to be able to store different data type formats: Relational databases are not designed to handle
unstructured data.
Need high throughput -- fast reads: While ACID transactions bring benefits, they also slow down the
process of reading and writing data. If you need very fast reads and writes, using a relational database
may not suit your needs.
Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used
by every row, saving disk space.
Need high availability: Because relational databases are not distributed (and even when they are,
they have a coordinator/worker architecture), they have a single point of failure. When that database
goes down, a failover to a backup system occurs, which takes time.
Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a
system to increase performance and space for data.
1.5. When to use a NoSQL Database
Need to be able to store different data type formats: NoSQL was also created to handle different data
configurations: structured, semi-structured, and unstructured data. JSON and XML documents can all be
handled easily with NoSQL.
Large amounts of data: Relational Databases are not distributed databases and because of this they can
only scale vertically by adding more storage in the machine itself. NoSQL databases were created to be
able to be horizontally scalable. The more servers/systems you add to the database the more data that
can be hosted with high availability and low latency (fast reads and writes).
Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a
system to increase performance and space for data.
Need high throughput: While ACID transactions bring benefits, they also slow down the process of
reading and writing data. If you need very fast reads and writes, using a relational database may not suit
your needs.
Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used
by every row, saving disk space.
Need high availability: Relational databases have a single point of failure. When that database goes
down, a failover to a backup system must happen, which takes time.
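The idea of a flexible schema can be sketched with plain Python dictionaries serialized as JSON. This is a toy in-memory stand-in, not the API of any real NoSQL product; the `store` dictionary, the `find_with_field` helper, and the documents are hypothetical.

```python
import json

# Hypothetical documents: each one carries only the fields it needs.
documents = [
    {"_id": 1, "name": "Ana", "email": "ana@example.com"},
    {"_id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"]},
    {"_id": 3, "name": "Cy", "address": {"city": "Oslo"}, "email": "cy@example.com"},
]

# Toy key-value "document store": document id -> JSON text.
store = {doc["_id"]: json.dumps(doc) for doc in documents}

def find_with_field(store, field):
    """Return the names of all documents that contain the given field."""
    return [d["name"] for d in map(json.loads, store.values()) if field in d]

with_email = find_with_field(store, "email")
print(with_email)
```

No document stores empty placeholders for fields it does not use, which is the disk-space point made above. In a fixed relational table, every row would have the same columns, with NULLs where a value is missing.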
1.6. When NOT to use a NoSQL Database?
When you have a small dataset: NoSQL databases were made for big datasets, not small ones. While they
work with small datasets, they were not created for that.
When you need ACID Transactions: If you need a consistent database with ACID transactions, then most
NoSQL databases will not be able to serve this need. Most NoSQL databases are eventually consistent
and do not provide ACID transactions. However, there are exceptions: some non-relational databases,
like MongoDB, can support ACID transactions.
When you need the ability to do JOINS across tables: Most NoSQL databases do not allow JOINS,
because they would result in full table scans.
If you want to be able to do aggregations and analytics
If you have changing business requirements: Ad-hoc queries are possible but difficult, as the data
model was designed to fit particular queries.
If your queries are not available and you need the flexibility: You need to know your queries in advance.
If they are not available, or you need flexibility in how you query your data, you might need to stick
with a relational database.
1.7. Importance of Relational Databases
Standardization of data model: Once your data is transformed into the rows and columns format, your
data is standardized and you can query it with SQL.
Flexibility in adding and altering tables: Relational databases give you the flexibility to add tables, alter
tables, add and remove data.
Data Integrity: Data Integrity is the backbone of using a relational database.
Simplicity: Data is systematically stored and modeled in tabular format.
Intuitive Organization: The spreadsheet format in relational databases is intuitive for data modeling.
1.8. ETL vs. ELT
Extract: In this step, data from a source database is read and the desired subset of data is extracted.
The purpose of this step is to retrieve all the required data from the source system using as few
resources as possible. The extraction process should be designed so that it does not negatively affect
the source system in terms of performance or response time.
Transform: In this step, the data is filtered and cleansed, and the extracted data is prepared using
lookup tables or rules, or by combining it with other data, and converted to the desired state. The
transformation step includes the validation of records, the rejection of data (if it is not acceptable),
and data integration. Sorting, filtering, removing duplicates, standardizing, translating, and looking up
or verifying the consistency of data sources are some of the commonly used transformation techniques.
Load: Loading the data into the data warehouse is one of the steps of the process. The load function
writes the resulting data, i.e., the extracted and transformed data, to a target data repository. Some
tools physically insert each record as a new row into the table of the target database using a SQL
INSERT statement, while many other tools link the extraction, transformation, and loading processes
for each record from the source.
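The three steps can be sketched as a tiny pipeline. This is a minimal illustration under assumptions: the source is a hypothetical CSV string rather than a real source database, the target is an in-memory SQLite table named `dim_customer`, and the transformation rules (trim, standardize case, reject rows without a name, drop duplicates) are examples only.

```python
import csv
import io
import sqlite3

# Hypothetical source data with typical quality problems.
SOURCE_CSV = """id,name,country
1, alice ,vn
2,Bob,US
2,Bob,US
3,,DE
"""

def extract(text):
    """Extract: read only the raw rows, leaving the source untouched."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: validate, reject, de-duplicate, and standardize records."""
    seen, clean = set(), []
    for row in rows:
        name = row["name"].strip().title()
        if not name:                      # reject records without a name
            continue
        key = int(row["id"])
        if key in seen:                   # drop duplicates
            continue
        seen.add(key)
        clean.append((key, name, row["country"].strip().upper()))
    return clean

def load(conn, records):
    """Load: insert each record as a new row using SQL INSERT statements."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM dim_customer ORDER BY id").fetchall()
print(loaded)
```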
ETL vs. ELT: What's the Difference & Which Is Better? (hubspot.com)
1.9. OLAP vs. OLTP
OLTP:
Because most people are familiar with commercial relational database systems, it is easy to understand
what a data warehouse is by comparing these two kinds of systems.
The major task of online operational database systems is to perform online transaction and query
processing. These systems are called online transaction processing (OLTP) systems. They cover most of
the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking,
payroll, registration, and accounting.
These are the systems that are used to run the day-to-day core business of the company. They are
called bread-and-butter systems. These systems typically get the data into the database. Each
transaction processes information about a single entity such as a single order, a single invoice, or a single
customer.
OLAP: OLAP stands for On-Line Analytical Processing. As opposed to the well-known OLTP (On-Line
Transaction Processing), the focus is on data analyses rather than transactions. Furthermore, the analyses
occur “On-Line,” i.e., fast, “interactive” query response is implied. OLAP systems always employ a
multidimensional view of data.
OLAP systems come in three broad categories: systems based on relational database management
technology, called ROLAP systems; systems utilizing non-relational, multidimensional array-type
technologies, called MOLAP systems; and hybrid systems that combine these technologies, called HOLAP
systems.
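The multidimensional view can be sketched as a small "cube" keyed by dimension values. The sketch is a toy, assuming three hypothetical dimensions (product, region, quarter) and made-up sales figures; `slice_cube` and `roll_up` are illustrative helpers, not functions of any OLAP product.

```python
from collections import defaultdict

DIMS = ("product", "region", "quarter")  # hypothetical dimensions

# Each cell of the cube: (product, region, quarter) -> units sold.
cube = {
    ("milk", "north", "Q1"): 100,
    ("milk", "south", "Q1"): 80,
    ("milk", "north", "Q2"): 120,
    ("bread", "north", "Q1"): 60,
    ("bread", "south", "Q2"): 90,
}

def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return {key: units for key, units in cube.items() if key[i] == value}

def roll_up(cube, keep):
    """Roll-up: aggregate away every dimension that is not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[tuple(key[i] for i in idx)] += units
    return dict(totals)

q1_only = slice_cube(cube, "quarter", "Q1")
by_product = roll_up(cube, ["product"])
by_region_quarter = roll_up(cube, ["region", "quarter"])
print(by_product)
print(by_region_quarter)
```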
OLAP vs. OLTP: What’s the Difference? - IBM Blog
OLTP vs. OLAP: What's the Difference? (Plus Examples) | Indeed.com
1.10. Normalization vs. Denormalization
JOINS on the database allow for outstanding flexibility but can be slow, especially on large tables. If
you are dealing with heavy reads on your database, you may want to think about denormalizing your
tables. You first get your data into normalized form, and then you proceed with denormalization. So,
denormalization comes after normalization.
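The order described above (normalize first, then denormalize for reads) can be sketched with Python's sqlite3 module. The `author`, `book`, and `book_denorm` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized form: each author's name is stored exactly once.
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER REFERENCES author(id)
);
INSERT INTO author VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO book VALUES (1, 'Book One', 1), (2, 'Book Two', 2), (3, 'Book Three', 2);

-- Denormalized copy for read-heavy use: the author name is repeated per book.
CREATE TABLE book_denorm AS
SELECT b.id AS book_id, b.title, a.name AS author_name
FROM book AS b
JOIN author AS a ON a.id = b.author_id;
""")

# Reading from the normalized tables needs a JOIN.
joined = conn.execute("""
    SELECT b.title, a.name
    FROM book AS b JOIN author AS a ON a.id = b.author_id
    ORDER BY b.id
""").fetchall()

# Reading from the denormalized table needs no JOIN.
flat = conn.execute(
    "SELECT title, author_name FROM book_denorm ORDER BY book_id"
).fetchall()
print(joined == flat)
```

The trade-off is redundancy: 'Author B' is now stored twice in `book_denorm`, so a name change must be applied in more than one place.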
Citation for slides: https://en.wikipedia.org/wiki/Denormalization
Difference between Normalization and Denormalization (tutorialspoint.com)
2. Common Questions
2.1. Why can't everything be stored in a giant Excel spreadsheet?
There are limitations to the amount of data that can be stored in an Excel sheet. A database helps
organize the elements into tables (rows and columns, etc.). Also, reading and writing operations at a
large scale are not practical with an Excel sheet, so it's better to use a database to handle most business
functions.
2.2. Does data modeling happen before you create a database, or is it an iterative process?
It's definitely an iterative process. Data engineers continually reorganize, restructure, and optimize data
models to fit the needs of the organization.
2.3. How is data modeling different from machine learning modeling?
Machine learning includes a lot of data wrangling to create the inputs for machine learning models, but
data modeling is more about how to structure data to be used by different people within an
organization. You can think of data modeling as the process of designing data and making it available to
machine learning engineers, data scientists, business analysts, etc., so they can make use of it easily.