What is a Data Warehouse?
A data warehouse is a centralized system used for storing and managing large volumes of data
from various sources. It is designed to help businesses analyze historical data and make
informed decisions. Data from different operational systems is collected, cleaned, and stored in a
structured way, enabling efficient querying and reporting.
The goal is to produce statistical results that support decision-making.
A data warehouse is also designed for fast data retrieval, even from very large datasets.
🔹 Simple Example of Data Warehousing:
🏪 Example Scenario: A Retail Store Chain
Suppose you own a retail chain with stores in different cities. Each store uses a different system to
manage:
Sales
Inventory
Customer information
Employee attendance
These systems are separate and store data in different formats.
💡 Problem:
You want to know:
What are the top-selling products this year?
Which city has the highest sales?
What are the buying patterns of your customers?
But you can't easily analyze the data because it's spread across different systems.
✅ Solution: Data Warehouse
You create a Data Warehouse where:
Data from all stores is collected daily or weekly.
The data is cleaned and transformed into a common format.
It is stored in organized tables for analysis.
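The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a real warehouse: the two store export layouts, the field names, and the sample figures are all invented for the example.

```python
from collections import defaultdict

# Hypothetical exports from two stores that use different formats.
delhi_sales = [
    {"item": "Laptop", "qty": 3, "price": 50000.0},
    {"item": "Phone", "qty": 5, "price": 20000.0},
]
kolkata_sales = [
    ("PHONE", "2", "20000"),   # (product, units, unit_price), all strings
    ("Tablet", "4", "15000"),
]

def to_common_format(city, product, units, unit_price):
    """Transform one source record into the warehouse's common format."""
    return {
        "city": city,
        "product": product.strip().title(),
        "units": int(units),
        "revenue": int(units) * float(unit_price),
    }

# Collect and transform the records from every store into one table.
warehouse = [
    to_common_format("Delhi", r["item"], r["qty"], r["price"]) for r in delhi_sales
] + [
    to_common_format("Kolkata", p, u, up) for (p, u, up) in kolkata_sales
]

def total_by(rows, key, measure):
    """Sum a measure grouped by one column, highest total first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

top_products = total_by(warehouse, "product", "units")
sales_by_city = total_by(warehouse, "city", "revenue")
```

Once the records share one format, questions such as "top-selling products" or "city with the highest sales" become a single grouping over the combined table.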
📊 Final Result:
Now, you can generate reports like:
"Top 10 products sold in all cities"
"Monthly sales trends"
"Customer behavior by location"
Advantages of Data Warehousing
•Combines Data from Many Sources
→ Collects data from different departments (sales, HR, inventory, etc.) into one place.
•Better Decision Making
→ Helps managers and leaders make smart decisions based on facts and reports.
•Faster Data Access
→ You can get reports and insights quickly without searching through multiple files.
•Historical Data Storage
→ Stores years of data so you can compare current performance with the past.
•Improves Data Quality
→ Cleans and organizes the data, removing errors and duplicates.
•Supports Business Intelligence (BI)
→ Makes it easy to use BI tools (like Power BI, Tableau) for dashboards and reports.
•Saves Time
→ Reduces the time needed for preparing reports manually.
•Improves Productivity
→ Employees spend less time searching for data and more time analyzing it.
•High Performance for Analysis
→ Designed to handle large amounts of data for fast queries and reports.
•Data Security
→ Centralized control over who can access what data.
Disadvantages of Data Warehousing
•High Cost
→ Building and maintaining a data warehouse can be expensive (hardware, software, experts).
•Complex Setup
→ Setting it up is technical and time-consuming, especially when connecting many systems.
•Requires Skilled Staff
→ Needs trained professionals to manage, update, and analyze data properly.
•Not Real-Time
→ Most data warehouses update periodically (daily/weekly), so data is not always real-time.
•Data Overload
→ Can become too large or complicated if not managed properly.
•Maintenance is Ongoing
→ Requires regular updates and monitoring to ensure data accuracy and performance.
•May Include Unused Data
→ Sometimes stores more data than needed, which increases storage and complexity.
•Risk of Data Breach
→ If not secured well, centralizing all business data can be a security risk.
| # | Database | Data Warehouse |
|---|----------|----------------|
| 1 | It is used for Online Transactional Processing (OLTP) but can be used for other purposes such as data warehousing. It records the data from clients for history. | It is used for Online Analytical Processing (OLAP). It reads historical information about customers to support business decisions. |
| 2 | The tables and joins are complex because they are normalized (RDBMS). This reduces redundant data and saves storage space. | The tables and joins are simple because they are denormalized. This minimizes the response time for analytical queries. |
| 3 | Data is dynamic. | Data is largely static. |
| 4 | Entity-relationship modeling techniques are used for RDBMS database design. | Data modeling techniques are used for data warehouse design. |
| 5 | Optimized for write operations. | Optimized for read operations. |
| 6 | Performance is low for analytical queries. | Performance is high for analytical queries. |
| 7 | A database is where data is stored and managed to provide fast and efficient access. | A data warehouse is where application data is managed for analysis and reporting. |
Characteristics of Data Warehouse
Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide
a concise and straightforward view around a particular subject, such as customer, product, or sales, rather than the
organization's ongoing operations as a whole. This is done by excluding data that is not useful for the subject and including all data
needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. Data
cleaning and integration must be performed during data warehousing to ensure consistency in naming conventions, attribute types,
etc., among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even
further back. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, populated with data transformed from the source operational RDBMS. Operational
updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: initial loading of data and access to data. Therefore, the DW does not require transaction
processing, recovery, and concurrency capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that
once data has entered the warehouse, it should not change.
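The time-variant and non-volatile properties can be illustrated with a small append-only table. This is a sketch under assumptions: the class name `AppendOnlyTable`, its methods, and the sample prices are hypothetical and only show the idea of "load and read, never change".

```python
from datetime import date

class AppendOnlyTable:
    """Hypothetical warehouse table: rows can be loaded and read, never changed."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot_date, records):
        """Append a dated snapshot; earlier snapshots are left untouched."""
        for record in records:
            self._rows.append({"snapshot_date": snapshot_date, **record})

    def as_of(self, snapshot_date):
        """Read the snapshot that was loaded for a given date."""
        return [r for r in self._rows if r["snapshot_date"] == snapshot_date]

    def row_count(self):
        return len(self._rows)

prices = AppendOnlyTable()
prices.load(date(2024, 1, 1), [{"product": "Phone", "price": 20000}])
prices.load(date(2024, 4, 1), [{"product": "Phone", "price": 18000}])

# Both the old and the new price remain available for comparison.
january = prices.as_of(date(2024, 1, 1))
april = prices.as_of(date(2024, 4, 1))
```

An operational system would typically overwrite the January price; here the later load adds a new row, so the history is preserved.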
Need for Data Warehouse
Data Warehouse is needed for the following reasons:
1. Business users: Business users require a data warehouse to view summarized data from the past. Since
these people are non-technical, the data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This
data is used for various purposes.
3. Make strategic decisions: Some strategies may depend on the data in the data warehouse, so the
data warehouse contributes to making strategic decisions.
4. Data consistency and quality: By bringing data from different sources to a common place, the user
can effectively bring uniformity and consistency to the data.
5. Fast response time: A data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze a large amount of historical data.
Components or Building Blocks of Data Warehouse
Architecture is the proper arrangement of the elements.
Source Data Component
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements in
the data warehouse, we choose segments of the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even
department databases. This is the internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically
take the old data and store it in archived files.
External Data: Most executives depend on external sources for a large percentage of the information they use.
They use statistics relating to their industry produced by external agencies.
Data Staging Component
After we have extracted data from various operational systems and external sources, we have to prepare it for storage in
the data warehouse. The extracted data, coming from several different sources, needs to be changed, converted, and made ready in a
format that is suitable for querying and analysis.
1) Data Extraction: This function has to deal with numerous data sources. We have to employ the appropriate technique for each
data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data
warehouse poses big challenges, data transformation presents even greater challenges. We perform several individual tasks as
part of data transformation.
First, we clean the data extracted from each source. Cleaning may involve correcting misspellings, providing
default values for missing data elements, or eliminating duplicates when we bring in the same data from various source systems.
3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction
of the data warehouse and go live for the first time, we do the initial loading of the information into the data warehouse storage.
The initial load moves high volumes of data and takes a substantial amount of time.
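The extraction, transformation, and loading steps described above can be sketched as three small functions. This is a simplified illustration: the source names (`crm_rows`, `billing_rows`), the misspelling table, and the default value are assumptions made up for the example.

```python
# Hypothetical corrections table for known misspellings.
SPELLING_FIXES = {"Kolkatta": "Kolkata", "Dehli": "Delhi"}
DEFAULT_CITY = "Unknown"

def extract(sources):
    """Extraction: pull raw rows from every source into one list."""
    return [row for source in sources for row in source]

def transform(rows):
    """Transformation: fix misspellings, fill defaults, drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        city = row.get("city") or DEFAULT_CITY   # default for missing values
        city = SPELLING_FIXES.get(city, city)    # correct known misspellings
        record = {"customer_id": row["customer_id"], "city": city}
        key = (record["customer_id"], record["city"])
        if key not in seen:                      # same record from two sources
            seen.add(key)
            cleaned.append(record)
    return cleaned

def load(warehouse, rows):
    """Loading: append the prepared rows to warehouse storage."""
    warehouse.extend(rows)
    return warehouse

crm_rows = [{"customer_id": 1, "city": "Dehli"}, {"customer_id": 2, "city": None}]
billing_rows = [{"customer_id": 1, "city": "Delhi"}, {"customer_id": 3, "city": "Kolkatta"}]

warehouse = load([], transform(extract([crm_rows, billing_rows])))
```

Customer 1 arrives from both sources with different spellings; after cleaning, both rows become identical and only one is loaded.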
Data Storage Components
Data storage for the data warehouse is a separate repository. The data repositories for the operational systems generally include only
the current data. Also, these data repositories hold data in highly normalized structures for fast and efficient processing.
Information Delivery Component
The information delivery component enables users to subscribe to data warehouse files and have them transferred to
one or more destinations according to some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data
dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the
indexes, and so on.
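As a small illustration of what a data dictionary holds, the sketch below reads the built-in catalog of an in-memory SQLite database using Python's standard `sqlite3` module. SQLite is used here only as a convenient stand-in; the `sales` table and its index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_city ON sales (city)")

# sqlite_master is SQLite's built-in catalog of tables and indexes.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sales)")]
```

The catalog describes the structures (tables, indexes, column types) rather than the business data itself, which is the role metadata plays in a warehouse.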
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to particular selected
subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data
warehouse industry have made regular and incremental data loads more achievable. Data marts are smaller than data warehouses
and usually cover a single part of the organization. The current trend in data warehousing is to develop a data warehouse with several smaller
related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control components coordinate the services and functions within the data warehouse. They control
the data transformation and the data transfer into the data warehouse storage, and they moderate the data delivery to
the clients. They work with the database management systems to ensure that data is correctly saved in the repositories, and they
monitor the movement of information into the staging area and from there into the data warehouse storage itself.
Why do we need a separate Data Warehouse?
Data warehouse queries are complex because they involve the computation of large groups of data at summarized levels.
They may require special data organization, access, and implementation methods based on multidimensional views.
Performing OLAP queries in an operational database degrades the performance of operational tasks.
A data warehouse is used for analysis and decision making, which requires an extensive database, including historical data that
operational databases do not typically maintain.
The separation of an operational database from data warehouses is based on the different structures and uses of data in these
systems.
Because the two systems provide different functionalities and require different kinds of data, it is necessary to maintain separate
databases.
| Feature | Database | Data Warehouse |
|---------|----------|----------------|
| Purpose | Operational | Analytical |
| Data | Current, detailed | Historical, summarized |
| Structure | Normalized | Denormalized (star/snowflake) |
| Access | Frequent updates, reads, writes | Primarily reads, complex queries |
| Optimization | Transaction processing (OLTP) | Analytical processing (OLAP) |
| Size | Relatively small | Typically very large |
| User Base | Typically used by operational staff such as data entry clerks, customer service representatives, and operational managers who need up-to-date data. | Typically used by analysts, data scientists, and business intelligence professionals who need to perform in-depth analysis and reporting. |
| Data Updates | Frequent, real-time or near real-time updates | Periodic batch updates (e.g., daily, weekly) |
Data Warehouse Applications
A data warehouse helps business executives organize, analyze, and
use their data for decision making. A data warehouse serves as an essential
part of a plan-execute-assess "closed-loop" feedback system for
enterprise management. Data warehouses are widely used in the
following fields −
❖ Financial services
❖ Banking services
❖ Consumer goods
❖ Retail sectors
❖ Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below −
Information Processing − A data warehouse allows the data stored in it to be processed. The data
can be processed by means of querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs.
Analytical Processing − A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including
slice-and-dice, drill down, drill up, and pivoting.
Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, and performing classification and prediction. These
mining results can be presented using visualization tools.
Data Warehouse Architecture
A Data Warehouse is a system that combines data from multiple sources, organizes it under a single architecture,
and helps organizations make better decisions. It simplifies data handling, storage, and reporting, making analysis
more efficient. Data Warehouse Architecture uses a structured framework to manage and store data effectively.
There are two common approaches to constructing a data warehouse:
● Top-Down Approach: This method starts with designing the overall data warehouse architecture first and
then creating individual data marts.
● Bottom-Up Approach: In this method, data marts are built first to meet specific business needs, and later
integrated into a central data warehouse.
Top-Down Approach
The Top-Down Approach, introduced by Bill Inmon, is a method for designing data warehouses that starts by
building a centralized, company-wide data warehouse. This central repository acts as the single source of truth for
managing and analyzing data across the organization. It ensures data consistency and provides a strong foundation
for decision-making.
Working of Top-Down Approach
● Central Data Warehouse: The process begins with creating a comprehensive data warehouse where data
from various sources is collected, integrated, and stored. This involves the ETL (Extract, Transform, Load)
process to clean and transform the data.
● Specialized Data Marts: Once the central warehouse is established, smaller, department-specific data
marts (e.g., for finance or marketing) are built. These data marts pull information from the main data
warehouse, ensuring consistency across departments.
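The relationship between the central warehouse and its data marts can be sketched with SQLite views, using Python's standard `sqlite3` module. This is a simplification: real data marts are often separate physical stores, and the table name `warehouse_facts`, the departments, and the figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Central warehouse table holding integrated data for all departments.
conn.execute("CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("finance", "revenue", 1200.0),
        ("finance", "costs", 800.0),
        ("marketing", "campaign_spend", 300.0),
    ],
)

# Each data mart is a department-specific subset drawn from the central table.
for dept in ("finance", "marketing"):
    conn.execute(
        f"CREATE VIEW {dept}_mart AS "
        f"SELECT metric, value FROM warehouse_facts WHERE department = '{dept}'"
    )

finance_rows = conn.execute(
    "SELECT metric, value FROM finance_mart ORDER BY metric"
).fetchall()
marketing_rows = conn.execute("SELECT metric, value FROM marketing_mart").fetchall()
```

Because every mart reads from the same central table, a correction made in the warehouse is reflected in all marts, which is the consistency benefit the top-down approach aims for.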
Bottom-Up Approach
The Bottom-Up Approach, popularized by Ralph Kimball, takes a more flexible and incremental path to designing
data warehouses. Instead of starting with a central data warehouse, it begins by building small, department-specific
data marts that cater to the immediate needs of individual teams, such as sales or finance. These data marts are
later integrated to form a larger, unified data warehouse.
Working of Bottom-Up Approach
● Department-Specific Data Marts: The process starts with creating data marts for individual departments
or specific business functions. These data marts are designed to meet immediate data analysis and
reporting needs, allowing departments to gain quick insights.
● Integration into a Data Warehouse: Over time, these data marts are connected and consolidated to create
a unified data warehouse. The integration ensures consistency and provides a comprehensive view of the
organization’s data.
Data warehouse Schema
A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates. Much like a database, a data warehouse also needs to maintain a
schema. A database uses the relational model, while a data warehouse uses the Star,
Snowflake, and Fact Constellation schemas.
Star Schema: The star schema is a type of multidimensional model used for
data warehouses. A star schema contains fact tables and dimension tables.
It uses fewer foreign-key joins than the snowflake schema. The fact table and
its surrounding dimension tables form a star shape.
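A star schema can be sketched with SQLite through Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_location`), columns, and sample rows are hypothetical and chosen only to show one fact table joined directly to its dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units       INTEGER
);
INSERT INTO dim_product  VALUES (1, 'Car', 'Vehicle'), (2, 'Bus', 'Vehicle');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Kolkata', 'India');
INSERT INTO fact_sales   VALUES (1, 1, 10), (2, 1, 4), (1, 2, 6);
""")

# One join per dimension: the fact table sits at the centre of the star.
units_by_city = conn.execute("""
    SELECT l.city, SUM(f.units)
    FROM fact_sales f
    JOIN dim_location l ON l.location_id = f.location_id
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

Note that `category` is stored as plain text inside `dim_product`: the dimension is denormalized, which repeats values but keeps queries to a single join per dimension.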
Snowflake Schema: The snowflake schema is also a type of multidimensional model
used for data warehouses. A snowflake schema contains fact tables, dimension tables,
and sub-dimension tables. Together, the fact tables, dimension tables, and
sub-dimension tables form a snowflake shape.
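A snowflake schema can be sketched the same way with Python's standard `sqlite3` module. In this hypothetical example the product dimension is normalized: the category is moved into its own sub-dimension table (`dim_category`), so a query by category needs one extra join. All names and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER
);
INSERT INTO dim_category VALUES (1, 'Vehicle'), (2, 'Accessory');
INSERT INTO dim_product  VALUES (1, 'Car', 1), (2, 'Bus', 1), (3, 'Helmet', 2);
INSERT INTO fact_sales   VALUES (1, 10), (2, 4), (3, 25);
""")

# Fact -> dimension -> sub-dimension: two joins to reach the category name.
units_by_category = conn.execute("""
    SELECT c.category, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
```

Each category name is stored once, which reduces redundancy, at the cost of an additional foreign key and join.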
| Star Schema | Snowflake Schema |
|-------------|------------------|
| Contains fact tables and dimension tables. | Contains fact tables, dimension tables, and sub-dimension tables. |
| Uses more space. | Uses less space. |
| Takes less time to execute queries. | Takes more time than the star schema to execute queries. |
| Normalization is not used. | Both normalization and denormalization are used. |
| Query complexity is low. | Query complexity is higher than in the star schema. |
| Simple to understand. | Harder to understand. |
| Has fewer foreign keys. | Has more foreign keys. |
| Has high data redundancy. | Has low data redundancy. |
Fact Constellation Schema: The fact constellation schema is also a type of multidimensional
model. It consists of dimension tables that are shared by several fact
tables, so it contains more than one star schema at a time. Unlike the
snowflake schema, the constellation schema is not easy to operate, as it has multiple
relationships between tables, and it requires
heavily complex queries to access data from the database.
[Figure: Star Schema]
[Figure: Snowflake Schema]
OLAP Operations
OLAP stands for Online Analytical Processing. It is a software technology that
allows users to analyze information from multiple database systems at the same time. It is
based on the multidimensional data model and allows the user to query multidimensional
data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more
cubes, which are also known as hypercubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted into
more detailed data. It can be done by:
● Moving down in the concept hierarchy
● Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is the opposite of the drill-down operation. It performs aggregation on
the OLAP cube. It can be done by:
● Climbing up in the concept hierarchy
● Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is selected by
selecting the following dimensions with criteria:
● Location = “Delhi” or “Kolkata”
● Time = “Q1” or “Q2”
● Item = “Car” or “Bus”
4. Slice: It fixes a single dimension of the OLAP cube at one value, which results in a new
sub-cube. In the cube given in the overview section, the slice is performed on the dimension
Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation, performing
the pivot operation gives a new view of it.
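The five operations can be sketched on a tiny cube held in a Python dictionary. This is an illustrative model only, not how an OLAP server is implemented: the cell values, the helper names `aggregate` and `select`, and the city "Toronto" are assumptions added for the example.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, country, quarter, month, item) -> units sold.
cube = {
    ("Delhi", "India", "Q1", "Jan", "Car"): 5,
    ("Delhi", "India", "Q1", "Feb", "Car"): 7,
    ("Delhi", "India", "Q2", "Apr", "Bus"): 2,
    ("Kolkata", "India", "Q1", "Jan", "Bus"): 3,
    ("Toronto", "Canada", "Q1", "Mar", "Car"): 4,
}
DIMS = ("city", "country", "quarter", "month", "item")

def aggregate(cells, keep):
    """Sum units over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, units in cells.items():
        out[tuple(key[i] for i in idx)] += units
    return dict(out)

def select(cells, **criteria):
    """Keep only cells whose dimension values fall in the allowed sets."""
    return {
        key: units for key, units in cells.items()
        if all(key[DIMS.index(d)] in allowed for d, allowed in criteria.items())
    }

quarterly = aggregate(cube, ("city", "quarter"))             # starting view
drill_down = aggregate(cube, ("city", "quarter", "month"))   # Quarter -> Month
roll_up = aggregate(cube, ("country", "quarter"))            # City -> Country
slice_q1 = select(cube, quarter={"Q1"})                      # Time = "Q1"
dice = select(cube, city={"Delhi", "Kolkata"},
              quarter={"Q1", "Q2"}, item={"Car", "Bus"})

# Pivot: swap the axes of a two-dimensional view of the slice.
by_city_item = aggregate(slice_q1, ("city", "item"))
pivoted = {(item, city): units for (city, item), units in by_city_item.items()}
```

Roll up and drill down change the level of detail, slice and dice restrict which cells are kept, and pivot only reorders the axes without changing any values.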
| Aspect | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|--------|--------------------------------------|-------------------------------------|
| Purpose | Supports day-to-day transaction processing and operational tasks | Supports complex queries and analytical reporting |
| Use Cases | Transactional systems like CRM, ERP, e-commerce, banking systems | Business intelligence, data mining, and decision support |
| Data Operations | Insert, update, delete, and retrieve individual records | Querying and aggregating large volumes of historical data |
| Data Structure | Highly normalized to reduce redundancy and optimize transaction speed | Denormalized (e.g., star/snowflake schemas) for fast retrieval and aggregation |
| Data Volume | Generally handles a smaller volume of data, focused on current records | Handles large volumes of historical data, often aggregated |
| Data Update Frequency | Frequent, with real-time or near real-time updates | Less frequent, typically updated in batch processes or scheduled intervals |
| Consistency Model | Strong consistency to ensure transactional integrity (ACID properties) | Consistency focused on data accuracy over time rather than real-time transactional integrity |
| User Base | Operational users such as clerks, customer service reps, and transaction processors | Analysts, data scientists, and business managers who need to analyze trends and make strategic decisions |
| Examples of Operations | Processing a purchase order, updating customer information | Analyzing sales performance across different regions, generating quarterly reports |