KEMBAR78
Module 1 Data Warehousing Fundamentals | PDF | Data Warehouse | Information Science
0% found this document useful (0 votes)
305 views17 pages

Module 1 Data Warehousing Fundamentals

Uploaded by

dhruu2503
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
305 views17 pages

Module 1 Data Warehousing Fundamentals

Uploaded by

dhruu2503
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Module 1 Data Warehousing Fundamentals

1.1.1 Introduction to Data Warehouse,

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modeling and
analysis.

1.1.2 Data warehouse architecture

Data Warehouse Architecture is complex as it's an information system that


contains historical and commutative data from multiple sources. It defines the
overall architecture of data communication processing and presentation that
exist for end-clients computing within the enterprise. Data warehouses and their
architectures vary depending on an organization's situation elements.
The following are three common architectures:
• Data Warehouse Architecture (Basic).
• Data Warehouse Architecture (with Staging Area).
• Data Warehouse Architecture (with Staging Area and Data Marts).
• Data Warehouse Architecture (Basic)

Operational System
In data warehousing, an operational system is a system that processes an
organization's day-to-day transactions.
Flat File System
A flat file system is a collection of files in which each file must have a unique
name.
Metadata
The metadata contains information about other data but not the data itself, such
as a message's text or an image's content. We use metadata to direct a query to
the most relevant data source.
Raw data
Raw data is a set of data that has not yet been processed and delivered from a
specific data entity to the data supplier and has not been processed by machine
or human. This information is gathered from various online sources to provide
detailed insight into users' online behavior.
Summary Data
Data summary is a simple term for a brief conclusion to a large theory or
paragraph. This is frequently the case in which analysts write the code and then
declare the ultimate end by summarizing data. In data mining and processing, a
data summary is critical.
Data Warehouse Architecture (with Staging Area)

Before putting our operational data into the warehouse, we need to clean and
process it. Although data warehouses use a staging area, we can do this
programmatically.
A staging area is a location where data is processed before entering the
warehouse. It makes data cleansing and consolidation easier for operational
methods that use data from multiple sources, especially for enterprise data
warehouses where all relevant data is consolidated.

Data Warehouse Architecture (with Staging Area and Data Marts)


We can customize the architecture of our warehouse for multiple groups within
our organization by adding data marts.
The data mart is a part of the storage component. It stores the information of an
organization's particular function, which is handled by a single authority.
Depending on the functions, there can be as many data marts in an organization.
We can also say that a data mart is a subset of the data in a data warehouse.
The above diagram illustrates an example in which purchasing, sales, and stocks
are all separated. In this case, a financial analyst wants to mine historical data to
predict customer behavior or analyze historical data for purchases and sales.

Advantages of Data Warehouse Architecture


• Improved Decision-Making:
Centralizes and harmonizes data from various sources.
Provides a solid foundation for analytics and reporting.
Enables historical analysis and trend forecasting.

• Data Quality and Consistency:


Ensures data consistency across the organization.
Implements data cleansing and quality checks.
Provides a single version of the truth.

• Enhanced Business Intelligence:


Supports a wide range of BI activities.
Facilitates advanced analytics, data mining, and machine learning.
Empowers users with self-service analytics and reporting capabilities.

• Performance and Scalability:


Optimizes data retrieval and analysis performance.
Scalable architecture grows with the evolving data and analytics needs.
Supports ad-hoc queries and scheduled reporting efficiently.
• Time and Cost Efficiency:
Reduces the time and effort required for data preparation and analysis.
Streamlines the ETL (Extract, Transform, Load) processes.
Lowers the total cost of ownership in the long run by maintaining a
centralized data repository.
Disadvantages of Data Warehouse Architecture
• High Initial Costs:
Significant investment is required for hardware, software, and expertise.
Costly to set up, configure, and maintain.

• Complex Implementation:
Requires thorough planning, design, and expertise.
Integration with existing systems can be challenging.

• Data Latency:
Data may not be available in real time due to batch processing.
ETL processes can introduce delays in data availability.

• Potential Overhead:
Requires additional systems and processes.
It can add to the operational overhead and require more management.

• Limited Historical Data:


Depending on storage capacity, there might be limits on how much
historical data can be stored.
Pruning of older data may limit historical analysis.
1.1.3 Data warehouse versus Data Marts
1.1.4 E-R Modeling versus Dimensional Modeling,

1.1.5 information package diagram

1.1.6 Data Warehouse Schemas; Star Schema, Snowflake Schema,

Star schema and snowflake schema are both data warehouse modeling
techniques used in relational databases to organize and structure data for
efficient querying and reporting. They are commonly associated with online
analytical processing (OLAP) systems. Let's explore each schema:
Star Schema:
Structure:
• In a star schema, there is a central fact table surrounded by dimension
tables.
• The fact table contains quantitative data (e.g., sales, revenue) and foreign
keys that link to the dimension tables.
• Dimension tables contain descriptive attributes that provide context to the
data in the fact table.
Advantages:
• Simple and easy to understand.
• Query performance is often faster because of denormalization (reducing
joins).
• Well-suited for OLAP operations and data warehousing.
Disadvantages:
• Redundancy in data storage since dimension tables might have repeated
data.
• May not be as normalized as other schemas, which can impact data
integrity.

Snowflake Schema:
Structure:
• Similar to the star schema but with normalized dimension tables.
• Dimension tables in a snowflake schema are organized into multiple
related tables, forming a shape that resembles a snowflake.
• Normalization involves breaking down dimension tables into sub-
dimensions to reduce redundancy.
Advantages:
• Saves storage space by avoiding redundancy in dimension tables.
• Improves data integrity as updates need to be made in fewer places.
Disadvantages:
• Increased complexity in the schema design and queries due to
normalization.
• Query performance might be slower than a star schema due to additional
joins.
• Not as intuitive as star schema for end-users.
1.1.7 fact less fact table

A factless fact table is a type of fact table in a data warehouse that captures the
relationship between dimensions without containing any measurable numeric
facts. Unlike traditional fact tables, which store quantitative data (such as sales,
revenue, or quantities sold), a factless fact table contains only the keys of the
related dimensions, creating a record of events or conditions.
Here are some key characteristics of factless fact tables:
• No Measurable Facts:
• Factless fact tables do not contain any quantitative or measurable data.
They focus solely on capturing events or occurrences.
• Captures Events or Relationships:
• The purpose of a factless fact table is to record the occurrences of certain
events or relationships between dimensions. These events may not have a
measurable quantity associated with them.
• Used for Tracking Events:
• Factless fact tables are commonly used to track events such as
promotions, appointments, enrollments, reservations, or relationships
between entities.
• Enables Analysis of Events:
• While there are no numeric measures in the fact table, it allows analysts
to study patterns, relationships, and trends associated with specific events
or conditions.
• Supports Constraints and Analysis:
• Factless fact tables support constraints and analysis related to the
occurrence of events. For example, analysts might use these tables to
determine which products were on promotion during a specific time
frame.

1.1.8

What is Fact Constellation Schema?


• A Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called Galaxy schema.
• Fact Constellation Schema describes a logical structure of data warehouse
or data mart. Fact Constellation Schema can design with a collection of
de-normalized FACT, Shared, and Conformed Dimension tables.
• Fact Constellation Schema is a sophisticated database design that is
difficult to summarize information. Fact Constellation Schema can
implement between aggregate Fact tables or decompose a complex Fact
table into independent simplex Fact tables.
Example: A fact constellation schema is shown in the figure below.
PlayNext
Unmute
Current Time 0:00
/
Duration 18:10
Loaded: 2.94%
Â
Fullscreen

• This schema defines two fact tables, sales, and shipping. Sales are treated
along four dimensions, namely, time, item, branch, and location. The
schema contains a fact table for sales that includes keys to each of the
four dimensions, along with two measures: Rupee_sold and units_sold.
The shipping table has five dimensions, or keys: item_key, time_key,
shipper_key, from_location, and to_location, and two measures:
Rupee_cost and units_shipped.
• The primary disadvantage of the fact constellation schema is that it is a
more challenging design because many variants for specific kinds of
aggregation must be considered and selected.

1.1.9

ETL (Extract, Transform, and Load) Process


What is ETL?
The mechanism of extracting information from source systems and bringing it
into the data warehouse is commonly called ETL, which stands for Extraction,
Transformation and Loading.
• The ETL process requires active inputs from various stakeholders,
including developers, analysts, testers, top executives and is technically
challenging.
• To maintain its value as a tool for decision-makers, Data warehouse
technique needs to change with business changes. ETL is a recurring
method (daily, weekly, monthly) of a Data warehouse system and needs
to be agile, automated, and well documented.

How ETL Works?


ETL consists of three separate phases:
Backward Skip 10sPlay VideoForward Skip 10s

Extraction
• Extraction is the operation of extracting information from a source system
for further use in a data warehouse environment. This is the first stage of
the ETL process.
• Extraction process is often one of the most time-consuming tasks in the
ETL.
• The source systems might be complicated and poorly documented, and
thus determining which data needs to be extracted can be difficult.
• The data has to be extracted several times in a periodic manner to supply
all changed data to the warehouse and keep it up-to-date.
Cleansing
• The cleansing stage is crucial in a data warehouse technique because it is
supposed to improve data quality. The primary data cleansing features
found in ETL tools are rectification and homogenization. They use
specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific
rules and defines appropriate associations between values.
• The following examples show the essential of data cleaning:
• If an enterprise wishes to contact its users or its suppliers, a complete,
accurate and up-to-date list of contact addresses, email addresses and
telephone numbers must be available.
• If a client or supplier calls, the staff responding should be quickly able to
find the person in the enterprise database, but this need that the caller's
name or his/her company name is listed in the database.
• If a user appears in the databases with two or more slightly different
names or different account numbers, it becomes difficult to update the
customer's information.
Transformation
• Transformation is the core of the reconciliation phase. It converts records
from its operational source format into a particular data warehouse
format. If we implement a three-layer architecture, this phase outputs our
reconciled data layer.
• The following points must be rectified in this phase:
• Loose texts may hide valuable information. For example, XYZ PVT Ltd
does not explicitly show that this is a Limited Partnership company.
• Different formats can be used for individual data. For example, data can
be saved as a string or as three integers.
Following are the main transformation processes aimed at populating the
reconciled data layer:
• Conversion and normalization that operate on both storage formats and
units of measure to make data uniform.
• Matching that associates equivalent fields in different sources.
• Selection that reduces the number of source fields and records.
• Cleansing and Transformation processes are often closely linked in ETL
tools.
Loading
The Load is the process of writing the data into the target database. During the
load step, it is necessary to ensure that the load is performed correctly and with
as little resources as possible.
Loading can be carried in two ways:
• Refresh: Data Warehouse data is completely rewritten. This means that
older file is replaced. Refresh is usually used in combination with static
extraction to populate a data warehouse initially.
• Update: Only those changes applied to source information are added to
the Data Warehouse. An update is typically carried out without deleting
or modifying preexisting data. This method is used in combination with
incremental extraction to update data warehouses regularly.
1.1.10

1.1.11
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will
discuss OLAP operations in multidimensional data.
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up
• Roll-up performs aggregation on a data cube in any of the following ways

• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• The following diagram illustrates how roll-up works.

• Roll-up is performed by climbing up a concept hierarchy for the


dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy
from the level of city to the level of country.
• The data is grouped into cities rather than countries.
• When roll-up is performed, one or more dimensions from the data cube
are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the
following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.
• The following diagram illustrates how drill-down works −
• Drill-down is performed by stepping down a concept hierarchy for the
dimension time.
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of
quarter to the level of month.
• When drill-down is performed, one or more dimensions from the data
cube are added.
• It navigates the data from less detailed data to highly detailed data.
• Slice
• The slice operation selects one particular dimension from a given cube
and provides a new sub-cube. Consider the following diagram that shows
how slice works.

• Here Slice is performed for the dimension "time" using the criterion time
= "Q1".
• It will form a new sub-cube by selecting one or more dimensions.
Dice
• Dice selects two or more dimensions from a given cube and provides a
new sub-cube. Consider the following diagram that shows the dice
operation.

The dice operation on the cube based on the following selection criteria
involves three dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in
order to provide an alternative presentation of data. Consider the following
diagram that shows the pivot operation.

You might also like