DMW Unit2

Data Warehouse Architecture

A data warehouse architecture is a method of defining the overall
architecture of data communication, processing, and presentation that
exists for end-client computing within the enterprise. Each data
warehouse is different, but all are characterized by standard vital
components.
Production applications such as payroll, accounts payable, product
purchasing, and inventory control are designed for online transaction
processing (OLTP). Such applications gather detailed data from
day-to-day operations.
Data warehouse applications are designed to support users' ad-hoc
data requirements, an activity recently dubbed online analytical
processing (OLAP). These include applications such as forecasting,
profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or
via OLTP applications. In contrast, a warehouse database is updated
from operational systems periodically, usually during off-hours. As
OLTP data accumulates in production databases, it is regularly
extracted, filtered, and then loaded into a dedicated warehouse server
that is accessible to users. As the warehouse is populated, it must be
restructured: tables are de-normalized, data is cleansed of errors and
redundancies, and new fields and keys are added to reflect the users'
needs for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the
elements of an organization's situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System
In data warehousing, an operational system is the term for a system
that processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is
stored, and every file in the system must have a different name.
Meta Data
Metadata is a set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can
make finding and working with particular instances of data easier. For
example, author, date created, date modified, and file size are examples
of very basic document metadata.
Metadata is also used to direct a query to the most appropriate data
source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and
highly summarized (aggregated) data generated by the warehouse
manager.
The goal of the summarized information is to speed up query
performance. The summarized records are updated continuously as new
information is loaded into the warehouse.
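For instance, a warehouse manager might maintain a lightly summarized table as a pre-aggregated rollup of a detailed fact table. The following is a minimal SQL sketch; the table and column names (dw_sales, monthly_sales_summary, and so on) are hypothetical, not taken from the text:

-- Hypothetical pre-aggregated (lightly summarized) table.
CREATE TABLE monthly_sales_summary AS
SELECT product_code,
       EXTRACT(YEAR  FROM sale_date) AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount)                   AS total_amount,  -- aggregated measure
       COUNT(*)                      AS sale_count
FROM   dw_sales                                        -- detailed fact table
GROUP  BY product_code,
          EXTRACT(YEAR FROM sale_date),
          EXTRACT(MONTH FROM sale_date);

Queries that need only monthly totals can then read this small summary table instead of scanning the detailed facts, which is exactly how summarized data speeds up query performance.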
End-User access Tools
The principal purpose of a data warehouse is to provide information
to business managers for strategic decision-making. These users
interact with the warehouse using end-client access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational data before putting it
into the warehouse.
We can do this programmatically, although most data warehouses use
a staging area (a place where data is processed before entering the
warehouse).
A staging area simplifies data cleansing and consolidation for
operational data coming from multiple source systems, especially
for enterprise data warehouses, where all relevant data of an
enterprise is consolidated.

A data warehouse staging area is a temporary location where records
from source systems are copied.
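As an illustration, a staging flow lands raw extracts in a staging table, cleanses them, and only then moves them into the warehouse. This is a minimal sketch assuming a relational staging area; all table and column names (source_orders, stg_sales, dw_sales) are hypothetical:

-- 1. Copy records from a source system into the staging area, as-is.
INSERT INTO stg_sales (order_id, product_code, sale_date, amount)
SELECT order_id, product_code, sale_date, amount
FROM   source_orders;

-- 2. Cleanse in place: drop records with missing keys or invalid amounts.
DELETE FROM stg_sales
WHERE  product_code IS NULL
   OR  amount < 0;

-- 3. Load the cleansed, consolidated records into the warehouse table.
INSERT INTO dw_sales (order_id, product_code, sale_date, amount)
SELECT order_id, product_code, sale_date, amount
FROM   stg_sales;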
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple
groups within our organization.
We can do this by adding data marts. A data mart is a segment of a
data warehouse that provides information for reporting and
analysis on a section, unit, department, or operation in the company,
e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks
are separated. In this example, a financial analyst wants to analyze
historical data for purchases and sales or mine historical information
to make predictions about customer behavior.
Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data
warehouse system:
1. Separation: Analytical and transactional processing should be kept
apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to
upgrade as the data volume that has to be managed and processed,
and the number of users' requirements that have to be met,
progressively increase.
3. Extensibility: The architecture should be able to accommodate new
operations and technologies without redesigning the whole system.
4. Security: Monitoring accesses is necessary because of the strategic
data stored in the data warehouses.
5. Administerability: Data Warehouse management should not be
complicated.
Types of Data Warehouse Architectures
Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose
is to minimize the amount of data stored; to reach this goal, it removes
data redundancies.
As the figure shows, the only layer physically available is the source layer.
In this method, data warehouses are virtual. This means that the data
warehouse is implemented as a multidimensional view of operational
data created by specific middleware, or an intermediate processing
layer.
The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional
processing. Analysis queries are submitted to operational data after the
middleware interprets them. In this way, queries affect transactional
workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the
two-tier architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight a
separation between physically available sources and data warehouses,
it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous
sources of data. That data is stored initially in corporate relational
databases or legacy databases, or it may come from
information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted,
cleansed to remove inconsistencies and fill gaps, and integrated
to merge heterogeneous sources into one standard schema. The
so-called Extraction, Transformation, and Loading (ETL) tools
can combine heterogeneous schemata and extract, transform,
cleanse, validate, filter, and load source data into a data
warehouse.
3. Data warehouse layer: Information is stored in one logically
centralized repository: the data warehouse. The data
warehouse can be accessed directly, but it can also be used as a
source for creating data marts, which partially replicate data
warehouse contents and are designed for specific enterprise
departments. Metadata repositories store information on
sources, access procedures, data staging, users, data mart
schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly
accessed to issue reports, dynamically analyze information, and
simulate hypothetical business scenarios. It should feature
aggregate information navigators, complex query optimizers, and
user-friendly GUIs.

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing
multiple source systems), the reconciled layer, and the data warehouse
layer (containing both data warehouses and data marts). The
reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a
standard reference data model for the whole enterprise. At the same
time, it separates the problems of source data extraction and
integration from those of data warehouse population. In some cases,
the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be
satisfactorily prepared using the corporate applications, or generating
data flows to feed external processes periodically so as to benefit from
cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide
systems. A disadvantage of this structure is the extra file storage space
used by the redundant reconciled layer. It also places the
analytical tools a little further away from being real-time.
Fact Constellation in Data Warehouse Modelling
A fact constellation is a schema for representing a multidimensional
model. It is a collection of multiple fact tables sharing some common
dimension tables. It can be viewed as a collection of several star
schemas and is hence also known as a galaxy schema. It is one of the
widely used schemas for data warehouse design, and it is much more
complex than the star and snowflake schemas. For complex systems, we
require fact constellations.
Figure – General structure of Fact Constellation

Here, the pink-coloured dimension tables are common to both star
schemas, while the green-coloured fact tables are the fact tables of
their respective star schemas.
Example:
In the demonstration above:
 Placement is a fact table having attributes: (Stud_roll,
Company_id, TPO_id) with facts: (Number of students eligible,
Number of students placed).
 Workshop is a fact table having attributes: (Stud_roll, Institute_id,
TPO_id) with facts: (Number of students selected, Number of
students attended the workshop).
 Company is a dimension table having attributes: (Company_id,
Name, Offer_package).
 Student is a dimension table having attributes: (Student_roll,
Name, CGPA).
 TPO is a dimension table having attributes: (TPO_id, Name, Age).
 Training Institute is a dimension table having attributes:
(Institute_id, Name, Full_course_fee).
So, there are two fact tables, namely Placement and Workshop, which
are parts of two different star schemas: the star schema with fact
table Placement has dimension tables Company, Student and TPO,
while the star schema with fact table Workshop has dimension
tables Training Institute, Student and TPO.
Both star schemas have two dimension tables in common, hence
forming a fact constellation or galaxy schema.
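A minimal DDL sketch of this constellation follows; the names match the example above, while the column types and the underscore in Training_Institute are assumptions made for the sake of a runnable sketch:

-- Shared dimension tables, common to both star schemas.
CREATE TABLE Student (Student_roll INT PRIMARY KEY, Name VARCHAR(50), CGPA DECIMAL(4,2));
CREATE TABLE TPO     (TPO_id INT PRIMARY KEY, Name VARCHAR(50), Age INT);

-- Dimensions specific to one star schema each.
CREATE TABLE Company (Company_id INT PRIMARY KEY, Name VARCHAR(50), Offer_package DECIMAL(10,2));
CREATE TABLE Training_Institute (Institute_id INT PRIMARY KEY, Name VARCHAR(50), Full_course_fee DECIMAL(10,2));

-- Two fact tables, each referencing its own star's dimensions.
CREATE TABLE Placement (
  Stud_roll         INT REFERENCES Student(Student_roll),
  Company_id        INT REFERENCES Company(Company_id),
  TPO_id            INT REFERENCES TPO(TPO_id),
  students_eligible INT,   -- fact: number of students eligible
  students_placed   INT    -- fact: number of students placed
);
CREATE TABLE Workshop (
  Stud_roll          INT REFERENCES Student(Student_roll),
  Institute_id       INT REFERENCES Training_Institute(Institute_id),
  TPO_id             INT REFERENCES TPO(TPO_id),
  students_selected  INT,  -- fact: number of students selected
  students_attended  INT   -- fact: number who attended the workshop
);

Because Placement and Workshop both reference Student and TPO, the two stars share those dimensions, which is what makes this a constellation rather than two independent star schemas.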
Advantage: Provides a flexible schema.
Disadvantage: It is much more complex and hence, hard to implement
and maintain.
What is OLAP?
OLAP stands for Online Analytical Processing, which is a technology that
enables multi-dimensional analysis of business data. It provides
interactive access to large amounts of data and supports complex
calculations and data aggregation. OLAP is used to support business
intelligence and decision-making processes.
Grouping of data in a multidimensional matrix is called a data cube. In
data warehousing, we generally deal with multidimensional
data models, as the data is represented by multiple dimensions and
multiple attributes. This multidimensional data is represented in the
data cube, as the cube represents a high-dimensional space. The data
cube pictorially shows how different attributes of data are arranged in
the data model. Below is the diagram of a general data cube.

The example above is a 3D cube with dimensions branch (A, B, C, D),
item type (home, entertainment, computer, phone, security), and
year (1997, 1998, 1999).

Data cube classification:
The data cube can be classified into two categories:
 Multidimensional data cube: It basically helps in storing large
amounts of data by making use of a multi-dimensional array. It
increases its efficiency by keeping an index on each dimension
and is thus able to retrieve data quickly.
 Relational data cube: It basically helps in storing large amounts of
data by making use of relational tables. Each relational table
displays the dimensions of the data cube. It is slower compared to
a Multidimensional Data Cube.
Data cube operations:
Data cube operations are used to manipulate data to meet the needs of
users. These operations help to select particular data for analysis
purposes. There are mainly five operations, listed below; a SQL sketch
of how they map onto queries follows the list.
 Roll-up: this operation aggregates similar data attributes along a
dimension into a coarser level. For example, if the data
cube displays the daily income of a customer, we can use a roll-up
operation to find their monthly income.

 Drill-down: this operation is the reverse of the roll-up operation. It
allows us to take particular information and then subdivide it
further for finer-granularity analysis. It zooms into more detail.
For example, if India is an attribute of a country column and we
wish to see villages in India, then the drill-down operation splits
India into states, districts, towns, cities, and villages and then displays
the required information.
 Slicing: this operation filters out the unneeded portions. Suppose in
a particular dimension, the user doesn't need everything for
analysis, but only a particular attribute. For example,
country = "Jamaica" will display data only about Jamaica and will
not display the other countries on the country list.

 Dicing: this operation performs a multidimensional cut: it does
not cut along only one dimension but can also take a certain range
of another dimension. As a result, it looks like a sub-cube carved
out of the whole cube (as depicted in the figure).
For example, the user wants to see the annual salary of Jharkhand
state employees.

 Pivot: this operation is very important from a viewing point of
view. It transforms the data cube in terms of view; it
doesn't change the data present in the data cube. For example, if
the user is comparing year versus branch, using the pivot
operation the user can change the viewpoint and instead compare
branch versus item type.
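In a relational setting, the first four operations map naturally onto SQL. The sketch below is illustrative only, against a hypothetical fact table sales(country, state, sale_date, amount):

-- Roll-up: aggregate daily amounts up to monthly totals.
SELECT country,
       EXTRACT(YEAR  FROM sale_date) AS yr,
       EXTRACT(MONTH FROM sale_date) AS mon,
       SUM(amount)                   AS monthly_amount
FROM   sales
GROUP  BY country, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);

-- Slice: keep a single member of one dimension.
SELECT * FROM sales WHERE country = 'Jamaica';

-- Dice: cut ranges across two or more dimensions (a sub-cube).
SELECT * FROM sales
WHERE  state = 'Jharkhand'
AND    sale_date BETWEEN DATE '1998-01-01' AND DATE '1998-12-31';

Pivot changes presentation rather than data, so it is usually handled by the reporting tool (or by vendor-specific PIVOT extensions) rather than by standard SQL.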
Advantages of data cubes:
 Multi-dimensional analysis: Data cubes enable multi-dimensional
analysis of business data, allowing users to view data from
different perspectives and levels of detail.
 Interactivity: Data cubes provide interactive access to large
amounts of data, allowing users to easily navigate and manipulate
the data to support their analysis.
 Speed and efficiency: Data cubes are optimized for OLAP analysis,
enabling fast and efficient querying and aggregation of data.
 Data aggregation: Data cubes support complex calculations and
data aggregation, enabling users to quickly and easily summarize
large amounts of data.
 Improved decision-making: Data cubes provide a clear and
comprehensive view of business data, enabling improved
decision-making and business intelligence.
 Accessibility: Data cubes can be accessed from a variety of devices
and platforms, making it easy for users to access and analyze
business data from anywhere.
 Helps in giving a summarised view of data.
 Data cubes store large data in a simple way.
 Data cube operations provide quick and better analysis.
 Improved query performance.
Disadvantages of data cube:
 Complexity: OLAP systems can be complex to set up and
maintain, requiring specialized technical expertise.
 Data size limitations: OLAP systems can struggle with very large
data sets and may require extensive data aggregation or
summarization.
 Performance issues: OLAP systems can be slow when dealing with
large amounts of data, especially when running complex queries
or calculations.
 Data integrity: Inconsistent data definitions and data quality
issues can affect the accuracy of OLAP analysis.
 Cost: OLAP technology can be expensive, especially for enterprise-
level solutions, due to the need for specialized hardware and
software.
 Inflexibility: OLAP systems may not easily accommodate changing
business needs and may require significant effort to modify or
extend.
Star Schema in Data Warehouse Modeling
A star schema is a type of data modeling technique used in data
warehousing to represent data in a structured and intuitive way. In a
star schema, data is organized into a central fact table that contains the
measures of interest, surrounded by dimension tables that describe the
attributes of the measures.
The fact table in a star schema contains the measures or metrics that
are of interest to the user or organization. For example, in a sales data
warehouse, the fact table might contain sales revenue, units sold, and
profit margins. Each record in the fact table represents a specific event
or transaction, such as a sale or order.
The dimension tables in a star schema contain the descriptive attributes
of the measures in the fact table. These attributes are used to slice and
dice the data in the fact table, allowing users to analyze the data from
different perspectives. For example, in a sales data warehouse, the
dimension tables might include product, customer, time, and location.
In a star schema, each dimension table is joined to the fact table
through a foreign key relationship. This allows users to query the data in
the fact table using attributes from the dimension tables. For example,
a user might want to see sales revenue by product category, or by
region and time period.
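For example, a query for sales revenue by product category joins the fact table to a dimension table through the foreign key. A minimal sketch with hypothetical table and column names (fact_sales, dim_product):

-- Revenue by product category: one join, one GROUP BY.
SELECT p.product_category,
       SUM(f.sales_revenue) AS total_revenue
FROM   fact_sales  f
JOIN   dim_product p ON p.product_id = f.product_id   -- foreign key join
GROUP  BY p.product_category;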
The star schema is a popular data modeling technique in data
warehousing because it is easy to understand and query. The simple
structure of the star schema allows for fast query response times and
efficient use of database resources. Additionally, the star schema can be
easily extended by adding new dimension tables or measures to the
fact table, making it a scalable and flexible solution for data
warehousing.
The star schema is the fundamental schema among the data mart
schemas, and it is the simplest. This schema is widely used to develop
or build data warehouses and dimensional data marts. It includes one
or more fact tables indexing any number of dimension tables. The star
schema is also the basis of the snowflake schema, and it is efficient for
handling basic queries.
It is said to be a star because its physical model resembles a star shape,
with a fact table at its center and the dimension tables at its
periphery representing the star's points. Below is an example to
demonstrate the star schema:
In the above demonstration, SALES is a fact table having the attributes
(Product ID, Order ID, Customer ID, Employee ID, Total, Quantity,
Discount), which reference the dimension tables. The Employee
dimension table contains the attributes Emp ID, Emp Name, Title,
Department, and Region. The Product dimension table contains the
attributes Product ID, Product Name, Product Category, and Unit
Price. The Customer dimension table contains the attributes Customer ID,
Customer Name, Address, City, and Zip. The Time dimension table contains
the attributes Order ID, Order Date, Year, Quarter, and Month.
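A minimal DDL sketch of this star schema follows; the names track the example above, while the column types are assumptions, and the Time dimension is named Time_Dim to avoid clashing with the SQL TIME keyword:

-- Dimension tables at the points of the star.
CREATE TABLE Employee (Emp_ID INT PRIMARY KEY, Emp_Name VARCHAR(50), Title VARCHAR(50),
                       Department VARCHAR(50), Region VARCHAR(50));
CREATE TABLE Product  (Product_ID INT PRIMARY KEY, Product_Name VARCHAR(50),
                       Product_Category VARCHAR(50), Unit_Price DECIMAL(10,2));
CREATE TABLE Customer (Customer_ID INT PRIMARY KEY, Customer_Name VARCHAR(50),
                       Address VARCHAR(100), City VARCHAR(50), Zip VARCHAR(10));
CREATE TABLE Time_Dim (Order_ID INT PRIMARY KEY, Order_Date DATE, Year INT,
                       Quarter INT, Month INT);

-- Fact table at the center, referencing every dimension.
CREATE TABLE Sales (
  Product_ID  INT REFERENCES Product(Product_ID),
  Order_ID    INT REFERENCES Time_Dim(Order_ID),
  Customer_ID INT REFERENCES Customer(Customer_ID),
  Emp_ID      INT REFERENCES Employee(Emp_ID),
  Total       DECIMAL(12,2),
  Quantity    INT,
  Discount    DECIMAL(5,2)
);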
Model of Star Schema:
In a star schema, business process data that holds the quantitative data
about a business is held in fact tables, while the descriptive
characteristics related to the fact data are held in dimensions. Sales
price, sale quantity, and distance, speed, and weight measurements are
a few examples of fact data in a star schema.
Often, a star schema having multiple dimensions is termed a
centipede schema. A star schema whose dimensions have only a few
attributes is easy to handle.
Advantages of Star Schema :
1. Simpler Queries –
The join logic of a star schema is quite simple in comparison to the
join logic needed to fetch data from a transactional schema
that is highly normalized.
2. Simplified Business Reporting Logic –
In comparison to a highly normalized transactional schema,
the star schema simplifies common business reporting logic,
such as period-over-period reporting.
3. Feeding Cubes –
Star schema is widely used by all OLAP systems to design OLAP
cubes efficiently. In fact, major OLAP systems deliver a ROLAP
mode of operation which can use a star schema as a source
without designing a cube structure.
Disadvantages of Star Schema –
1. Data integrity is not enforced well because of the highly
de-normalized schema state.
2. Not as flexible in terms of analytical needs as a normalized data
model.
3. Star schemas don't support many-to-many relationships within
business entities – at least not frequently.
Features:
Central fact table: The star schema revolves around a central fact table
that contains the numerical data being analyzed. This table contains
foreign keys to link to dimension tables.
Dimension tables: Dimension tables are tables that contain descriptive
attributes about the data being analyzed. These attributes provide
context to the numerical data in the fact table. Each dimension table is
linked to the fact table through a foreign key.
Denormalized structure: A star schema is denormalized, which means
that redundancy is allowed in the schema design to improve query
performance. This is because it is easier and faster to join a small
number of tables than a large number of tables.
Simple queries: Star schema is designed to make queries simple and
fast. Queries can be written in a straightforward manner by joining the
fact table with the appropriate dimension tables.
Aggregated data: The numerical data in the fact table is usually
aggregated at different levels of granularity, such as daily, weekly, or
monthly. This allows for analysis at different levels of detail.
Fast performance: Star schema is designed for fast query performance.
This is because the schema is denormalized and data is pre-aggregated,
making queries faster and more efficient.
Easy to understand: The star schema is easy to understand and
interpret, even for non-technical users. This is because the schema is
designed to provide context to the numerical data through the use of
dimension tables.
Snowflake Schema in Data Warehouse Model
The snowflake schema is a variant of the star schema. Here, the
centralized fact table is connected to multiple dimensions. In the
snowflake schema, dimensions are present in a normalized form in
multiple related tables. The snowflake structure materializes when the
dimensions of a star schema are detailed and highly structured, having
several levels of relationship, and the child tables have multiple parent
tables. The snowflake effect affects only the dimension tables and does
not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data
warehousing to represent data in a structured way that is optimized for
querying large amounts of data efficiently. In a snowflake schema, the
dimension tables are normalized into multiple related tables, creating a
hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the
schema, surrounded by the dimension tables. However, each dimension
table is further broken down into multiple related tables, creating
a hierarchical structure that resembles a snowflake.
For Example, in a sales data warehouse, the product dimension table
might be normalized into multiple related tables, such as product
category, product subcategory, and product details. Each of these tables
would be related to the product dimension table through a foreign
key relationship.
Example (figure: Snowflake Schema):
The Employee dimension table now contains the attributes
EmployeeID, EmployeeName, DepartmentID, Region, and Territory. The
DepartmentID attribute links the Employee dimension table with
the Department dimension table. The Department dimension is used to
provide detail about each department, such as the Name and Location
of the department. The Customer dimension table now contains the
attributes CustomerID, CustomerName, Address, and CityID. The CityID
attribute links the Customer dimension table with the City dimension
table. The City dimension table has details about each city, such as
CityName, Zipcode, State, and Country.
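A minimal DDL sketch of these snowflaked dimensions follows; the names track the example above, while the column types are assumptions:

-- Outrigger tables produced by normalizing the dimensions.
CREATE TABLE Department (DepartmentID INT PRIMARY KEY, Name VARCHAR(50), Location VARCHAR(50));
CREATE TABLE City       (CityID INT PRIMARY KEY, CityName VARCHAR(50), Zipcode VARCHAR(10),
                         State VARCHAR(50), Country VARCHAR(50));

-- Dimension tables now hold foreign keys instead of repeated descriptive columns.
CREATE TABLE Employee (EmployeeID INT PRIMARY KEY, EmployeeName VARCHAR(50),
                       DepartmentID INT REFERENCES Department(DepartmentID),
                       Region VARCHAR(50), Territory VARCHAR(50));
CREATE TABLE Customer (CustomerID INT PRIMARY KEY, CustomerName VARCHAR(50),
                       Address VARCHAR(100),
                       CityID INT REFERENCES City(CityID));

Resolving a customer's country now takes an extra join (Customer to City), which is precisely the trade-off between saved storage and extra join cost discussed below.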
What is Snowflaking?
The snowflake design is the result of further expansion and
normalization of the dimension tables. In other words, a dimension table
is said to be snowflaked if the low-cardinality attributes of the
dimension have been divided into separate normalized tables. These
tables are then joined to the original dimension table with referential
constraints (foreign key constraints).
Generally, snowflaking is not recommended in the dimension table, as it
hampers the understandability and performance of the dimension
model as more tables would be required to be joined to satisfy the
queries.
Difference Between Snowflake and Star Schema
The main difference between star schema and snowflake schema is that
the dimension table of the snowflake schema is maintained in the
normalized form to reduce redundancy. The advantage here is that such
tables (normalized) are easy to maintain and save storage space.
However, it also means that more joins will be needed to execute the
query. This will adversely impact system performance.
However, the snowflake schema can also be more complex to query
than a star schema because it requires more table joins. This can result
in slower query response times and higher resource usage in the
database. Additionally, the snowflake schema can be more difficult to
understand and maintain because of the increased complexity of the
schema design.
The decision to use a snowflake schema versus a star schema in a data
warehousing project will depend on the specific requirements of the
project and the trade-offs between query performance, schema
complexity, and data integrity.
Characteristics of Snowflake Schema
 The snowflake schema uses less disk space.
 It is easy to implement the dimension that is added to the
schema.
 There are multiple tables, so performance is reduced.
 The dimension table consists of two or more sets of attributes that
define information at different grains.
 The sets of attributes of the same dimension table are populated
by different source systems.
Features of the Snowflake Schema
 Normalization: The snowflake schema is a normalized design,
which means that data is organized into multiple related tables.
This reduces data redundancy and improves data consistency.
 Hierarchical Structure: The snowflake schema has a hierarchical
structure that is organized around a central fact table. The fact
table contains the measures or metrics of interest, and the
dimension tables contain the attributes that provide context to
the measures.
 Multiple Levels: The snowflake schema can have multiple levels of
dimension tables, each related to the central fact table. This
allows for more granular analysis of data and enables users to drill
down into specific subsets of data.
 Joins: The snowflake schema typically requires more complex SQL
queries that involve multiple table joins. This can impact
performance, especially when dealing with large data sets.
 Scalability: The snowflake schema is scalable and can handle large
volumes of data. However, the complexity of the schema can
make it difficult to manage and maintain.
Advantages of Snowflake Schema
 It provides structured data which reduces the problem of data
integrity.
 It uses less disk space because the data is highly structured.
Disadvantages of Snowflake Schema
 Snowflaking reduces space consumed by dimension tables but
compared with the entire data warehouse the saving is usually
insignificant.
 Avoid snowflaking or normalization of a dimension table unless
required and appropriate.
 Do not snowflake hierarchies of a dimension table into separate
tables. Hierarchies should belong to the dimension table only and
should never be snowflaked.
 Multiple hierarchies that can belong to the same dimension should
be designed at the lowest possible level of detail.
OLAP Operations in DBMS
OLAP stands for Online Analytical Processing. It is a software
technology that allows users to analyze information from multiple
database systems at the same time. It is based on a multidimensional
data model and allows the user to query multi-dimensional data
(e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one
or more cubes, and these cubes are known as hyper-cubes.

OLAP operations:
There are five basic analytical operations that can be performed on an
OLAP cube (a SQL rendering of the dice example appears after the list):
1. Drill down: In the drill-down operation, less detailed data is
converted into highly detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in the overview section, the drill-down operation is
performed by moving down in the concept hierarchy
of the Time dimension (Quarter -> Month).

2. Roll up: It is just the opposite of the drill-down operation. It
performs aggregation on the OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is
performed by climbing up in the concept hierarchy
of the Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two
or more dimensions. In the cube given in the overview section, a
sub-cube is selected by selecting the following dimensions with
criteria:
 Location = "Delhi" or "Kolkata"
 Time = "Q1" or "Q2"
 Item = "Car" or "Bus"

4. Slice: It selects a single dimension value from the OLAP cube, which
results in the creation of a new sub-cube. In the cube given in the
overview section, Slice is performed on the dimension Time = "Q1".

5. Pivot: It is also known as the rotation operation, as it rotates the
current view to get a new view of the representation. In the
sub-cube obtained after the slice operation, performing the pivot
operation gives a new view of it.
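The dice criteria above translate directly into predicates over the cube's dimensions. A minimal sketch, assuming a hypothetical fact table sales(location, time, item, units):

-- Dice: restrict every chosen dimension to a small set of members.
SELECT location, time, item, SUM(units) AS total_units
FROM   sales
WHERE  location IN ('Delhi', 'Kolkata')
AND    time     IN ('Q1', 'Q2')
AND    item     IN ('Car', 'Bus')
GROUP  BY location, time, item;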

Concept Hierarchy in Data Mining
In data mining, a concept hierarchy refers to the
organization of data into a tree-like structure, where each level of the
hierarchy represents a concept that is more general than the level
below it. This hierarchical organization of data allows for more
efficient and effective data analysis, as well as the ability to drill down
to more specific levels of detail when needed. Concept hierarchies are
used to organize and classify data in a way that makes it
more understandable and easier to analyze. The main idea is that the
same data can have different levels of granularity or detail, and that
by organizing the data in a hierarchical fashion, it is easier to
understand and analyze.
Example:
Explanation:
As shown in the above diagram, it consists of a concept hierarchy for
the dimension location, from which the user can easily retrieve the data.
To make it easy to evaluate, the data is represented in a tree-like
structure. The top of the tree consists of the main dimension, location,
which further splits into various sub-nodes. The root node is location,
and it splits into two country nodes, i.e., USA and India. These
countries are then split further into sub-nodes that represent the
province states, i.e., New York, Illinois, Gujarat, UP. Thus the concept
hierarchy shown in the above example organizes the data into a
tree-like structure in which each level is more general than the level
below it.
The hierarchical structure represents the abstraction level of the
dimension location, which consists of various footprints of the
dimension such as street, city, province state, and country.
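When the dimension table stores the full footprint in one row, moving up the hierarchy is just a matter of grouping by a more general column. A minimal sketch, with a hypothetical table dim_location(street, city, province_state, country):

-- Detail at the city level of the hierarchy.
SELECT country, province_state, city, COUNT(*) AS locations
FROM   dim_location
GROUP  BY country, province_state, city;

-- Rolled up to the country level, the most general concept in the hierarchy.
SELECT country, COUNT(*) AS locations
FROM   dim_location
GROUP  BY country;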
Types of Concept Hierarchies
1. Schema Hierarchy: Schema Hierarchy is a type of concept
hierarchy that is used to organize the schema of a database in a
logical and meaningful way, grouping similar objects together. A
schema hierarchy can be used to organize different types of data,
such as tables, attributes, and relationships, in a logical and
meaningful way. This can be useful in data warehousing, where
data from multiple sources needs to be integrated into a single
database.
2. Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of
concept hierarchy that is based on set theory, where each set in
the hierarchy is defined in terms of its membership in other sets.
Set-grouping hierarchy can be used for data cleaning, data pre-
processing and data integration. This type of hierarchy can be
used to identify and remove outliers, noise, or inconsistencies
from the data and to integrate data from multiple sources.
3. Operation-Derived Hierarchy: An Operation-Derived Hierarchy is
a type of concept hierarchy that is used to organize data by
applying a series of operations or transformations to the data.
The operations are applied in a top-down fashion, with each
level of the hierarchy representing a more general or abstract
view of the data than the level below it. This type of hierarchy is
typically used in data mining tasks such as clustering and
dimensionality reduction. The operations applied can be
mathematical or statistical operations such as aggregation or
normalization.
4. Rule-based Hierarchy: Rule-based Hierarchy is a type of concept
hierarchy that is used to organize data by applying a set of rules
or conditions to the data. This type of hierarchy is useful in data
mining tasks such as classification, decision-making, and data
exploration. It allows the assignment of a class label or
decision to each data point based on its characteristics, and it
identifies patterns and relationships between different attributes
of the data.
Need of Concept Hierarchy in Data Mining
There are several reasons why a concept hierarchy is useful in data
mining:
1. Improved Data Analysis: A concept hierarchy can help to
organize and simplify data, making it more manageable and
easier to analyze. By grouping similar concepts together, a
concept hierarchy can help to identify patterns and trends in the
data that would otherwise be difficult to spot. This can be
particularly useful in uncovering hidden or unexpected insights
that can inform business decisions or inform the development of
new products or services.
2. Improved Data Visualization and Exploration: A concept
hierarchy can help to improve data visualization and data
exploration by organizing data into a tree-like structure, allowing
users to easily navigate and understand large and complex data
sets. This can be particularly useful in creating interactive
dashboards and reports that allow users to easily drill down to
more specific levels of detail when needed.
3. Improved Algorithm Performance: The use of a concept
hierarchy can also help to improve the performance of data
mining algorithms. By organizing data into a hierarchical
structure, algorithms can more easily process and analyze the
data, resulting in faster and more accurate results.
4. Data Cleaning and Pre-processing: A concept hierarchy can also
be used in data cleaning and pre-processing, to identify and
remove outliers and noise from the data.
5. Domain Knowledge: A concept hierarchy can also be used to
represent the domain knowledge in a more structured way,
which can help in a better understanding of the data and the
problem domain.
Applications of Concept Hierarchy
There are several applications of concept hierarchy in data mining,
some examples are:
1. Data Warehousing: Concept hierarchy can be used in data
warehousing to organize data from multiple sources into a
single, consistent and meaningful structure. This can help to
improve the efficiency and effectiveness of data analysis and
reporting.
2. Business Intelligence: Concept hierarchy can be used in business
intelligence to organize and analyze data in a way that can
inform business decisions. For example, it can be used to analyze
customer data to identify patterns and trends that can inform
the development of new products or services.
3. Online Retail: Concept hierarchy can be used in online retail to
organize products into categories, subcategories and sub-
subcategories, it can help customers to find the products they
are looking for more quickly and easily.
4. Healthcare: Concept hierarchy can be used in healthcare to
organize patient data, for example, to group patients by
diagnosis or treatment plan, it can help to identify patterns and
trends that can inform the development of new treatments or
improve the effectiveness of existing treatments.
5. Natural Language Processing: Concept hierarchy can be used in
natural language processing to organize and analyze text data,
for example, to identify topics and themes in a text, it can help
to extract useful information from unstructured data.
6. Fraud Detection: Concept hierarchy can be used in fraud
detection to organize and analyze financial data, for example, to
identify patterns and trends that can indicate fraudulent activity.
Conclusion
A concept hierarchy is a process in data mining that can help to
organize and simplify large and complex data sets. It improves data
visualization, algorithm performance, and data cleaning and pre-
processing. The concept hierarchy can be applied in various fields,
such as data warehousing, business intelligence, online retail,
healthcare, natural language processing, and fraud detection among
others. Understanding and utilizing concept hierarchy can be crucial
for effectively performing data mining tasks and making valuable
insights from the data.
Data Warehouse Architecture
A data warehouse is a heterogeneous collection of different data
sources organised under a unified schema. There are two approaches
for constructing a data warehouse, the top-down approach and the
bottom-up approach, explained below.
1. Top-down approach:
In the "top-down" design approach, a data warehouse is described
as a subject-oriented, time-variant, non-volatile and integrated
data repository for the entire enterprise. Data from different
sources are validated, reformatted and saved in a normalized
(up to 3NF) database as the data warehouse. The data warehouse
stores "atomic" information, the data at the lowest level of
granularity, from which dimensional data marts can be built by
selecting the data required for specific business subjects or
particular departments. This is a data-driven approach, as the
information is gathered and integrated first, and the business
requirements by subjects for building data marts are formulated
afterwards. The advantage of this method is that it supports a
single integrated data source, so the data marts built from it will
be consistent where they overlap.
The essential components are discussed below:
1. External Sources –
An external source is a source from which data is collected,
irrespective of the type of data. Data can be structured,
semi-structured, or unstructured.

2. Stage Area –
Since the data extracted from the external sources does not
follow a particular format, it needs to be validated before being
loaded into the data warehouse. For this purpose, it is
recommended to use an ETL tool.
 E (Extract): data is extracted from the external data sources.
 T (Transform): data is transformed into the standard format.
 L (Load): data is loaded into the data warehouse after being
transformed into the standard format.

3. Data warehouse –
After cleansing, the data is stored in the data warehouse as the
central repository. It actually stores the metadata, while the actual
data gets stored in the data marts. Note that the data warehouse
stores the data in its purest form in this top-down approach.

4. Data Marts –
A data mart is also a part of the storage component. It stores the
information of a particular function of an organisation, handled
by a single authority. There can be as many data marts in an
organisation as there are functions. We can also say that a data
mart contains a subset of the data stored in the data warehouse.

5. Data Mining –
The practice of analysing the big data present in a data warehouse
is data mining. It is used to find hidden patterns present in the
database or in the data warehouse with the help of data mining
algorithms.
This approach was defined by Inmon: the data warehouse is a central
repository for the complete organisation, and data marts are created
from it after the complete data warehouse has been created.

Advantages of Top-Down Approach –
1. Since the data marts are created from the data warehouse, this
approach provides a consistent dimensional view of data marts.
2. This model is considered the strongest model for business
changes. That is why big organisations prefer to follow this
approach.
3. Creating a data mart from the data warehouse is easy.
4. Improved data consistency: The top-down approach promotes
data consistency by ensuring that all data marts are sourced
from a common data warehouse. This ensures that all data is
standardized, reducing the risk of errors and inconsistencies in
reporting.
5. Easier maintenance: Since all data marts are sourced from a
central data warehouse, it is easier to maintain and update the
data in a top-down approach. Changes can be made to the data
warehouse, and those changes will automatically propagate to
all the data marts that rely on it.
6. Better scalability: The top-down approach is highly scalable,
allowing organizations to add new data marts as needed without
disrupting the existing infrastructure. This is particularly
important for organizations that are experiencing rapid growth
or have evolving business needs.
7. Improved governance: The top-down approach facilitates better
governance by enabling centralized control of data access,
security, and quality. This ensures that all data is managed
consistently and that it meets the organization’s standards for
quality and compliance.
8. Reduced duplication: The top-down approach reduces data
duplication by ensuring that data is stored only once in the data
warehouse. This saves storage space and reduces the risk of data
inconsistencies.
9. Better reporting: The top-down approach enables better
reporting by providing a consistent view of data across all data
marts. This makes it easier to create accurate and timely reports,
which can improve decision-making and drive better business
outcomes.
10. Better data integration: The top-down approach enables better
data integration by ensuring that all data marts are sourced from
a common data warehouse. This makes it easier to integrate
data from different sources and provides a more complete view
of the organization’s data.

Disadvantages of Top-Down Approach –
1. The cost and time taken in designing and maintaining it are very
high.
2. Complexity: The top-down approach can be complex to
implement and maintain, particularly for large organizations
with complex data needs. The design and implementation of the
data warehouse and data marts can be time-consuming and
costly.
3. Lack of flexibility: The top-down approach may not be suitable
for organizations that require a high degree of flexibility in their
data reporting and analysis. Since the design of the data
warehouse and data marts is pre-determined, it may not be
possible to adapt to new or changing business requirements.
4. Limited user involvement: The top-down approach can be
dominated by IT departments, which may lead to limited user
involvement in the design and implementation process. This can
result in data marts that do not meet the specific needs of
business users.
5. Data latency: The top-down approach may result in data latency,
particularly when data is sourced from multiple systems. This
can impact the accuracy and timeliness of reporting and analysis.
6. Data ownership: The top-down approach can create challenges
around data ownership and control. Since data is centralized in
the data warehouse, it may not be clear who is responsible for
maintaining and updating the data.
7. Cost: The top-down approach can be expensive to implement
and maintain, particularly for smaller organizations that may not
have the resources to invest in a large-scale data warehouse and
associated data marts.
8. Integration challenges: The top-down approach may face
challenges in integrating data from different sources, particularly
when data is stored in different formats or structures. This can
lead to data inconsistencies and inaccuracies.

2. Bottom-up approach:
In the "bottom-up" approach, a data warehouse is described as
"a copy of transaction data specifically structured for query and
analysis", termed the star schema. In this approach, a data mart is
created first to provide the necessary reporting and analytical
capabilities for particular business processes (or subjects). It is
thus a business-driven approach, in contrast to Inmon's
data-driven approach.
Data marts contain the lowest-grain data and, if needed,
aggregated data too. Instead of a normalized database for the
data warehouse, a denormalized dimensional database is adopted
to meet the data delivery requirements of data warehouses. To
use the set of data marts as the enterprise data warehouse, the
data marts should be built with conformed dimensions in mind,
meaning that common objects are represented the same way in
different data marts. The conformed dimensions connect the data
marts to form a data warehouse, which is generally called a
virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has
quick ROI, as developing a data mart, a data warehouse for a
single subject, takes far less time and effort than developing an
enterprise-wide data warehouse. Also, the risk of failure is lower.
This method is inherently incremental, and it allows the project
team to learn and grow.
1. First, the data is extracted from external sources (the same as
happens in the top-down approach).
2. Then, the data goes through the staging area (as explained above)
and is loaded into data marts instead of the data warehouse. The
data marts are created first and provide reporting capability. Each
addresses a single business area.
3. These data marts are then integrated into the data warehouse.
This approach was given by Kimball: data marts are created first to
provide a thin view for analysis, and the data warehouse is created
after the complete set of data marts has been created.
Advantages of Bottom-Up Approach –
1. As the data marts are created first, reports are generated quickly.
2. We can accommodate more data marts here, and in this way the
data warehouse can be extended.
3. Also, the cost and time taken in designing this model are
comparatively low.
4. Incremental development: The bottom-up approach supports
incremental development, allowing for the creation of data
marts one at a time. This allows for quick wins and incremental
improvements in data reporting and analysis.
5. User involvement: The bottom-up approach encourages user
involvement in the design and implementation process. Business
users can provide feedback on the data marts and reports,
helping to ensure that the data marts meet their specific needs.
6. Flexibility: The bottom-up approach is more flexible than the
top-down approach, as it allows for the creation of data marts
based on specific business needs. This approach can be
particularly useful for organizations that require a high degree of
flexibility in their reporting and analysis.
7. Faster time to value: The bottom-up approach can deliver faster
time to value, as the data marts can be created more quickly
than a centralized data warehouse. This can be particularly
useful for smaller organizations with limited resources.
8. Reduced risk: The bottom-up approach reduces the risk of
failure, as data marts can be tested and refined before being
incorporated into a larger data warehouse. This approach can
also help to identify and address potential data quality issues
early in the process.
9. Scalability: The bottom-up approach can be scaled up over time,
as new data marts can be added as needed. This approach can
be particularly useful for organizations that are growing rapidly
or undergoing significant change.
10. Data ownership: The bottom-up approach can help to
clarify data ownership and control, as each data mart is typically
owned and managed by a specific business unit. This can help to
ensure that data is accurate and up-to-date, and that it is being
used in a consistent and appropriate way across the
organization.

Disadvantages of Bottom-Up Approach –
1. This model is not as strong as the top-down approach, as the
dimensional view of the data marts is not as consistent as in the
approach above.
2. Data silos: The bottom-up approach can lead to the creation of
data silos, where different business units create their own data
marts without considering the needs of other parts of the
organization. This can lead to inconsistencies and redundancies
in the data, as well as difficulties in integrating data across the
organization.
3. Integration challenges: Because the bottom-up approach relies
on the integration of multiple data marts, it can be more difficult
to integrate data from different sources and ensure consistency
across the organization. This can lead to issues with data quality
and accuracy.
4. Duplication of effort: In a bottom-up approach, different
business units may duplicate effort by creating their own data
marts with similar or overlapping data. This can lead to
inefficiencies and higher costs in data management.
5. Lack of enterprise-wide view: The bottom-up approach can
result in a lack of enterprise-wide view, as data marts are
typically designed to meet the needs of specific business units
rather than the organization as a whole. This can make it difficult
to gain a comprehensive understanding of the organization’s
data and business processes.
6. Complexity: The bottom-up approach can be more complex than
the top-down approach, as it involves the integration of multiple
data marts with varying levels of complexity and granularity. This
can make it more difficult to manage and maintain the data
warehouse over time.
7. Risk of inconsistency: Because the bottom-up approach allows
for the creation of data marts with different structures and
granularities, there is a risk of inconsistency in the data. This can
make it difficult to compare data across different parts of the
organization or to ensure that reports are accurate and reliable.

Three-Tier Data Warehouse Architecture
Data warehouses usually have a three-level (tier) architecture that
includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
A bottom tier that consists of the data warehouse server, which is
almost always an RDBMS. It may include several specialized data
marts and a metadata repository.
Data from operational databases and external sources (such as user
profile data provided by external consultants) are extracted using
application program interfaces called gateways. A gateway is
provided by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity)
and OLE-DB (Object Linking and Embedding for Databases),
by Microsoft, and JDBC (Java Database Connectivity).

A middle tier which consists of an OLAP server for fast querying of
the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational
DBMS that maps functions on multidimensional data to standard
relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose
server that directly implements multidimensional data and
operations.
A top-tier that contains front-end tools for displaying results
provided by OLAP, as well as additional tools for data mining of the
OLAP-generated data.
The overall Data Warehouse Architecture is shown in fig:

The metadata repository stores information that defines DW
objects. It includes the following parameters and information for
the middle- and top-tier applications:
1. A description of the DW structure, including the warehouse
schema, dimensions, hierarchies, data mart locations and
contents, etc.
2. Operational metadata, which usually describes the currency
level of the stored data, i.e., active, archived or purged, and
warehouse monitoring information, i.e., usage statistics, error
reports, audit, etc.
3. System performance data, which includes indices used to
improve data access and retrieval performance.
4. Information about the mapping from operational databases,
which provides source RDBMSs and their contents, cleaning and
transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and
business data, which include business terms and definitions,
ownership information, etc.
Principles of Data Warehousing
Load Performance
Data warehouses require incremental loading of new data on a
periodic basis, within narrow time windows; performance of the load
process should be measured in hundreds of millions of rows and
gigabytes per hour, and it must not artificially constrain the volume of
data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the
data warehouse, including data conversion, filtering, reformatting,
indexing, and metadata updates.
Data Quality Management
Fact-based management demands the highest data quality. The
warehouse ensures local consistency, global consistency, and
referential integrity despite "dirty" sources and massive database
size.
Query Performance
Fact-based management must not be slowed by the performance of
the data warehouse RDBMS; large, complex queries must
complete in seconds, not days.
Types of OLAP Servers
 Relational OLAP
 Multi-Dimensional OLAP
 Hybrid OLAP
 Transparent OLAP
Relational OLAP (ROLAP): Star Schema Based
The ROLAP is based on the premise that data need not be stored
multi-dimensionally to be viewed multi-dimensionally, and that it is
possible to exploit the well-proven relational database technology
to handle the multidimensionality of data. In ROLAP data is stored
in a relational database. In essence, each action of slicing and dicing
is equivalent to adding a “WHERE” clause in the SQL statement.
ROLAP can handle large amounts of data. ROLAP can leverage
functionalities inherent in the relational database.
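For instance, slicing the cube down to one quarter is nothing more than a predicate added to the relational query; the names here (sales_fact and its columns) are hypothetical:

-- ROLAP slice: the cube operation becomes a WHERE clause.
SELECT item, location, SUM(units_sold) AS total_units
FROM   sales_fact
WHERE  quarter = 'Q1'        -- the slice
GROUP  BY item, location;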
Multidimensional OLAP (MOLAP): Cube-Based
MOLAP stores data on disk in a specialized multidimensional array
structure. OLAP is performed on it relying on the random-access
capability of the arrays. Array elements are determined by
dimension instances, and the fact data or measured value
associated with each cell is usually stored in the corresponding
array element. In MOLAP, the multidimensional array is usually
stored in a linear allocation according to a nested traversal of the
axes in some predetermined order.
But unlike ROLAP, where only records with non-zero facts are
stored, all array elements are defined in MOLAP and, as a result, the
arrays generally tend to be sparse, with empty elements occupying
a greater part of them. Since both storage and retrieval costs are
important when assessing online performance efficiency, MOLAP
systems typically include provisions such as advanced indexing and
hashing to locate data while performing queries over sparse
arrays. MOLAP cubes offer fast data retrieval, are optimal for slicing
and dicing, and can perform complex calculations. All calculations are
pre-generated when the cube is created.
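As one plausible rendering of this linear allocation (the exact layout is product-specific, so this is an assumption): for a three-dimensional array with dimension sizes n1 × n2 × n3, a nested traversal of the axes stores cell (i1, i2, i3) at linear offset (i1 × n2 + i2) × n3 + i3, so locating a cell is pure arithmetic rather than a search.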

Hybrid OLAP (HOLAP)
HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow
for storing large volumes of detailed data. On the one hand,
HOLAP leverages the greater scalability of ROLAP. On the other
hand, HOLAP leverages cube technology for faster performance and
summary-type information. Cubes are smaller than MOLAP since
detailed data is kept in the relational database. The database is
used to store data in the most functional way possible.
Transparent OLAP (TOLAP)
TOLAP systems are designed to work transparently with existing
RDBMS systems, allowing users to access OLAP features without
needing to transfer data to a separate OLAP system. This allows for
more seamless integration between OLAP and
traditional RDBMS systems.
Other Types of OLAP
There are some other types of OLAP Systems that are used in
analyzing databases. Some of them are mentioned below.
 Web OLAP (WOLAP): It is a Web browser-based technology. A
traditional OLAP application is accessed through a client/server setup,
whereas a WOLAP application is accessed through the web browser. It is a
three-tier architecture that consists of a client, middleware, and
database server. The most appealing features of this style of
OLAP were (past tense intended, since few products categorize
themselves this way) the considerably lower investment
involved on the client side (“all that’s needed is a browser”) and
enhanced accessibility to connect to the data. A Web-based
application requires no deployment on the client machine. All
that is needed is a Web browser and a network connection to
the intranet or Internet.
 Desktop OLAP (DOLAP): DOLAP stands for desktop online analytical
processing. Users can download data from the source and work with
the dataset on their desktop. Functionality is limited compared to
other OLAP applications, but the cost is lower.
 Mobile OLAP (MOLAP): Mobile OLAP provides OLAP functionality on
wireless and mobile devices; users work with and access the data
through their mobile devices.
 Spatial OLAP (SOLAP): SOLAP emerged from merging the capabilities
of Geographic Information Systems (GIS) and OLAP into a single user
interface. SOLAP was created because data can come in alphanumeric,
image, and vector forms; it provides easy and quick exploration of
data that resides in a spatial database.
 Real-time OLAP (ROLAP): ROLAP technology combines the
features of both OLTP and OLAP. It allows users to view data in
real-time and perform analysis on data as it is being updated in
the system. ROLAP also provides a single, unified view of data
from different sources and supports advanced analytics like
predictive modeling and data mining.
 Cloud OLAP (COLAP): COLAP is a cloud-based OLAP solution that
allows users to access data from anywhere and anytime. It
eliminates the need for on-premise hardware and software
installations, making it a cost-effective and scalable solution for
businesses of all sizes. COLAP also offers high availability and
disaster recovery capabilities, ensuring business continuity in the
event of a disaster.
 Big Data OLAP (BOLAP): BOLAP is an OLAP solution that can
handle large amounts of data, such as data from Hadoop or
other big data sources. It provides high-performance analytics on
large datasets and supports complex queries that would be impractical
with traditional OLAP tools. BOLAP also supports real-time
analysis of big data, allowing users to make informed decisions
based on up-to-date information.
 In-memory OLAP (IOLAP): IOLAP is an OLAP solution that stores
data in memory for faster access and processing. It provides real-
time analysis on large datasets and supports complex queries,
making it an ideal solution for businesses that require fast and
accurate analytics. IOLAP also supports advanced analytics like
predictive modeling and data mining, allowing users to gain
insights into their data and make informed decisions.
Advantages of OLAP System
 Fast query response: OLAP systems are designed to provide fast
query response times, even for complex queries involving large
amounts of data.
 Multidimensional analysis: OLAP systems allow users to analyze
data from multiple dimensions, such as time, location, product,
and customer, providing a deeper understanding of the data (a
short example follows this list).
 Flexible and customizable: OLAP systems are highly
customizable, allowing users to define their dimensions,
hierarchies, and calculations.
 Improved decision-making: OLAP systems provide users with the
ability to analyze data from different angles, leading to better
insights and more informed decision-making.
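The short example promised above is a rough sketch only, using invented
data and pandas' pivot_table as a stand-in for an OLAP front end:

# One measure viewed across two dimensions at once.
import pandas as pd

df = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "product": ["A", "B", "A", "B"],
    "amount":  [10.0, 20.0, 15.0, 25.0],
})
# Rows carry the time dimension, columns the product dimension; each cell
# aggregates the measure for that dimension combination.
cube = pd.pivot_table(df, values="amount", index="year",
                      columns="product", aggfunc="sum")
print(cube)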
Disadvantages of the OLAP System
 Complexity: OLAP systems can be complex to implement and
maintain, requiring specialized skills and knowledge.
 Data storage requirements: OLAP systems require a large
amount of storage space to store multidimensional data, which
can be expensive and difficult to manage.
 Limited transactional processing: OLAP systems are optimized for
analytical processing, but they are not suitable for transactional
processing, which can lead to performance issues.
 Performance degradation with large datasets: As the size of the
dataset increases, the performance of OLAP systems may
degrade, requiring additional hardware resources to maintain
performance.
Difference between OLAP and OLTP
Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
---------|-------------------------------------|--------------------------------------
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only current operational data.
Method used | Makes use of a data warehouse. | Makes use of a standard database management system (DBMS).
Application | Subject-oriented. Used for data mining, analytics, decision making, etc. | Application-oriented. Used for business tasks.
Normalization | Tables in an OLAP database are not normalized. | Tables in an OLTP database are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | Provides a multi-dimensional view of different business tasks. | Reveals a snapshot of present business tasks.
Purpose | Serves to extract information for analysis and decision-making. | Serves to insert, update, and delete information from the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, as historical data is archived; typically MB to GB.
Queries | Relatively slow, as the amount of data involved is large; queries may take hours. | Very fast, as the queries operate on only a small fraction (around 5%) of the data.
Update | The OLAP database is not often updated, so data integrity is unaffected. | Data integrity constraints must be maintained in an OLTP database.
Backup and recovery | Needs backup only from time to time compared with OLTP. | The backup and recovery process is maintained rigorously.
Processing time | Processing of complex queries can take a long time. | Comparatively fast, because the queries are simple and straightforward.
Types of users | The data is generally used by CEOs, MDs, and GMs. | The data is used by clerks and managers.
Operations | Only read and rarely write operations. | Both read and write operations.
Updates | Data is refreshed on a regular basis by lengthy, scheduled batch operations. | Data updates are initiated by the user and are brief and quick.
Nature of audience | The process is market-oriented. | The process is customer-oriented.
Database design | Designed with a focus on the subject. | Designed with a focus on the application.
Productivity | Improves the efficiency of business analysts. | Enhances the end user's productivity.
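To make the contrast concrete, here is a small, self-contained sketch with
invented tables: the first statement is a typical OLTP write (brief, one
row), the second a typical OLAP read (an aggregation over history). SQLite
is used only for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE sales_fact (sale_date TEXT, region TEXT, amount REAL);
    INSERT INTO accounts VALUES (4242, 500.0);
    INSERT INTO sales_fact VALUES
        ('2023-05-01', 'EU', 10.0),
        ('2024-06-01', 'EU', 5.0),
        ('2024-07-01', 'US', 7.0);
""")

# OLTP: a brief, user-initiated write touching one row, committed at once.
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 4242")
conn.commit()

# OLAP: a read-only aggregation scanning historical data across dimensions.
for row in conn.execute("""
        SELECT region, substr(sale_date, 1, 4) AS year, SUM(amount)
        FROM sales_fact GROUP BY region, year"""):
    print(row)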
Difference between ROLAP, MOLAP and HOLAP
1. Relational Online Analytical Processing (ROLAP):
ROLAP servers sit between a relational back-end server and client
front-end tools. ROLAP uses a relational or extended-relational DBMS to
store and manage warehouse data. ROLAP has three main components: the
database server, the ROLAP server, and the front-end tool.
Advantages of ROLAP –
 ROLAP can handle large amounts of data.
 ROLAP tools don’t use pre-calculated data cubes.
 Data can be stored efficiently.
 ROLAP can leverage functionalities inherent in the relational
database.
Disadvantages of ROLAP –
 Performance of ROLAP can be slow.
 In ROLAP, it is difficult to maintain aggregate tables.
 Limited by SQL functionalities.
2. Multidimensional Online Analytical Processing (MOLAP):
MOLAP does not use a relational database for storage; it stores data in
an optimized multidimensional array. Storage utilization may be low with
multidimensional data stores. Many MOLAP servers handle dense and sparse
data sets by using two levels of data storage representation. MOLAP has
three components: the database server, the MOLAP server, and the
front-end tool.
Advantages of MOLAP –
 MOLAP is well suited to complex calculations.
 MOLAP is optimal for operations such as slice and dice.
 MOLAP allows the fastest indexing of the pre-computed summarized
data.
Disadvantages of MOLAP –
 MOLAP cannot handle large amounts of data.
 MOLAP requires additional investment.
 It is difficult to change dimensions without re-aggregation.
3. Hybrid Online Analytical Processing (HOLAP):
HOLAP is a combination of both ROLAP and MOLAP. It offers the
functionalities of both, such as the faster computation of MOLAP and the
higher scalability of ROLAP. The aggregations are stored separately in a
MOLAP store. A HOLAP server can store large volumes of detailed data.
Advantages of HOLAP –
 HOLAP provides the functionalities of both MOLAP and ROLAP.
 HOLAP provides fast access at all levels of aggregation.
Disadvantages of HOLAP –
 HOLAP architecture is very complex to understand because it
supports both MOLAP and ROLAP.
Difference between ROLAP, MOLAP and HOLAP:
Basis | ROLAP | MOLAP | HOLAP
------|-------|-------|------
Storage location for summary aggregation | Relational database | Multidimensional database | Multidimensional database
Processing time | Very slow | Fast | Fast
Storage space requirement | Large (compared to MOLAP and HOLAP) | Medium (compared to ROLAP and HOLAP) | Small (compared to MOLAP and ROLAP)
Storage location for detail data | Relational database | Multidimensional database | Relational database
Latency | Low (compared to MOLAP and HOLAP) | High (compared to ROLAP and HOLAP) | Medium (compared to MOLAP and ROLAP)
Query response time | Slow (compared to MOLAP and HOLAP) | Fast (compared to ROLAP and HOLAP) | Medium (compared to MOLAP and ROLAP)