KEMBAR78
3 Business Analysis in Data Mining L6 7 8-9-10 | PDF | Data Warehouse | Relational Database
0% found this document useful (0 votes)
31 views39 pages

3 Business Analysis in Data Mining L6 7 8-9-10

Uploaded by

MANOJ KUMAWAT
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views39 pages

3 Business Analysis in Data Mining L6 7 8-9-10

Uploaded by

MANOJ KUMAWAT
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 39

Business Analysis in Data Mining

Lecture-6,7,8,9,10
Dr. Sumit Dhariwal
School of Computing Information Technology
Manipal University Jaipur
India
Topic Covered
• Data Warehouse Modelling:
• Data Cube – Star Schema, Snow Flake Scheme, Dimensions, Measures &
OLAP operations
• Data Warehouse Implementation:
• Indexing OLAP Data, Efficient Processing of OLAP queries
• OLAP Server Architecture
Data Warehouse Modelling:
• The analysis and documentation of the current processes and
transactions that exist during the software design and development is
known as data modeling.
• The data modeling techniques and tools simplify the complicated
system designs into easier data flows which can be used for re-
engineering.
• It is used to create the logical and physical design of a data
warehouse.
Need for Data Modeling in a Data warehouse
• Collecting the Business Requirements
• Typically, a data warehouse is designed with the data architects and the business users
determining the entities required in the data warehouse and the facts that need to be
recorded. This initial design has many iterations before deciding the final model
• Enhancing Database Performance
• Query performance is a vital feature of a data warehouse. Enormous data volumes are
involved in a data warehouse, so using a data model product for the management of the
metadata and the data used by the BI users is very important
• Offers Source and Target System Documentation
• In the process of designing an ETL system, it is very important to verify the physical and
logical models of the source and target systems respectively
Types of Data Model
• Conceptual Data Model
• Logical Data Model
• Physical Data Model
• Relational Data Model
• Multi Dimensional Data Model
Conceptual Data Model
• A conceptual data model determines the
highest-level relationships among the different
entities.
• It is the primary step in the creation of a data
model in a top-down approach that is an
exact representation of the business
organization
• Conceives the overall structure of the
database and gives information on the
subject-areas
• Consists of entity types and relationships.
The relationship between the subject areas
is represented by symbolic notations
(IDEF1X or IE). Cardinality is a data model
that exhibits the one to one relationship or
many-to-many relationship
• No primary key is stated
• No attribute is specified
Logical Data Model
• Logical data model represents the
specific particulars of the entities,
attributes, and relationships involved in
a business. It is the basis on which a
physical model is designed.
• The development of a logical model
begins after the sign-off of the
conceptual data model by the functional
team
• A logical model should systematize the
physical design process by defining the
data structures and the relationship
between them
• The primary keys and foreign keys are
established here
• Normalization occurs here
• Represents all the entities and the
relationships between them
Physical Data Model
• Physical data model exhibits the model of the
database that is to be built. It represents the
table structures, column names, column data
types, primary keys, and foreign keys.
• The physical data model is developed after
receiving the acceptance of the logical data
model by the functional team
• Physical data model might be different from the
logical data model due to few physical constraints
• Physical data model differs for different
databases. The data types change for different
databases
• Denormalization takes place according to the user
requirements
• The logical model is changed to a physical data
model by implementing the database rules,
referential integrity, supertypes, and subtypes
Relational Data Model
Relational data modeling is used in OLTP systems that are transaction-
oriented. The major characteristics of a relational data model are:
• Relationship among the tables
• All the data is stored in tables and each relation has rows and columns
• The table should have a header and a body. The header is the list of columns in the
table and body consist of the values populated in the table. Tuple is the unique value
generated from the junction of one column and one row
• Usage of keys
• Primary key is the most important key in a table. It is used as a unique identifier. The
primary key is always a not null column
• Foreign key is used to relate to the primary key. They relate the data from one table to
another table and establish a relationship
• Data Redundancy
• The relational data model applies rules to maintain data integrity
• It eliminates data redundancy. The data is not stored repetitively. This helps in
maintaining data consistency and limited data storage
Multi Dimensional Data Model
A multidimensional data model is a logical view of an enterprise that
represents the important entities of a business and the relationship between
them. It is not restricted to a physical database and tables. It’s not
represented by E-R diagrams. The main components are:
• Attributes
• Attributes are the abstract terms devised for easier summarization of data on a report
• They can also be defined as the column headings that are not part of any calculations
on a report
• Dimensions
• A dimension is a data set comprising individual, non-overlapping data elements
• They enable end-users to define, group, and filter the data for display and browsing
purposes
• Facts
• A fact is a table consisting of columns that are used for numeric purposes to answer
the business questions
• They consist of additive, non-additive, and semi-additive measures
Data Cub
• When data is grouped or combined in
multidimensional matrices called Data Cubes.
The data cube method has a few alternative
names or a few variants, such as
"Multidimensional databases," "materialized
views," and "OLAP (On-Line Analytical
Processing).“
• For example, a relation with the schema sales
(part, supplier, customer, and sale-price) can
be materialized into a set of eight views as
shown in fig, where psc indicates a view
consisting of the aggregate function value
(such as total-sales) computed by grouping
three attributes part, supplier, and
customer, p indicates a view composed of the
corresponding aggregate function values
calculated by grouping part alone, etc.
Conti…
• A data cube is created from a subset of
attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the
attributes whose values are of interest. Other
attributes are selected as dimensions or
functional attributes. The measure attributes
are aggregated according to the dimensions.
• Data cube method is an interesting technique
with many applications. Data cubes could be
sparse in many cases because not every cell in
each dimension may have corresponding data
in the database.
• Techniques should be developed to handle
sparse cubes efficiently.
1.Example:
In the 2-D representation, we will
look at the All Electronics sales
data for items sold per quarter in
the city of Vancouver. The
measured display in dollars sold
(in thousands).
2. Example: 3-Dimensional Cuboids
• Let suppose we would like to view
the sales data with a third
dimension.
• For example, suppose we would like
to view the data according to time,
item as well as location for the cities
Chicago, New York, Toronto, and
Vancouver. The measured display in
dollars sold (in thousands). These 3-D
data are shown in the table. The 3-D
data of the table are represented as a
series of 2-D tables.
Conti…
• Conceptually, we may
represent the same data
in the form of 3-D data
cubes, as shown in fig:
• In data warehousing, the
data cubes are n-
dimensional. The cuboid
which holds the lowest
level of summarization is
called a base cuboid.
3. Example:
• The 4-D cuboid in the figure is the base cuboid for the given time, item,
location, and supplier dimensions.
Conti…
• Figure is shown a 4-D data
cube representation of
sales data, according to
the dimensions time,
item, location, and
supplier. The measure
displayed is dollars sold (in
thousands).
• The lattice of cuboids
forms a data cube. The
figure shows the lattice of
cuboids creating 4-D data
cubes for the dimension
time, item, location, and
supplier. Each cuboid
represents a different
degree of summarization.
What is Multidimensional schema?
• Multidimensional Schema is specially designed to model data
warehouse systems. The schemas are designed to address the unique
needs of very large databases designed for analytical purposes
(OLAP).
• Types of Data Warehouse Schema:
• Following are 3 chief types of multidimensional schemas each having
its unique advantages.
• Star Schema
• Snowflake Schema
What is a Star Schema?
• Star Schema in the data warehouse, in
which the center of the star can have
one fact table and a number of
associated dimension tables. It is
known as star schema as its structure
resembles a star. The Star Schema data
model is the simplest type of Data
Warehouse schema. It is also known as
Star Join Schema and is optimized for
querying large data sets.
• In the following Star Schema
example, the fact table is at the
center which contains keys to every
dimension table like Dealer_ID, Model
ID, Date_ID, Product_ID, Branch_ID &
other attributes like Units sold and
revenue.
Example of Star Schema Diagram
Conti…
Characteristics of Star Schema:

• Every dimension in a star schema is represented with only a one-


dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key
• The dimension table are not joined to each other
• Fact table would contain key and measure
• The Star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, in the above
figure, Country_ID does not have a Country lookup table as an OLTP
design would have.
• The schema is widely supported by BI Tools
What is a Snowflake Schema?
• Snowflake Schema in
data warehouse is a
logical arrangement of
tables in a
multidimensional
database such that the
ER diagram resembles a
snowflake shape. A
Snowflake Schema is an
extension of a Star
Schema, and it adds
additional dimensions.
The dimension tables
are normalized which
splits data into
additional tables.

Example of Snowflake Schema


Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema it uses smaller disk
space.
• Easier to implement a dimension is added to the Schema
• Due to multiple tables query performance is reduced
• The primary challenge that you will face while using the snowflake
Schema is that you need to perform more maintenance efforts
because of the more lookup tables.
Key Differences
Star Schema Snowflake Schema
Hierarchies for the dimensions are stored in the dimensional
Hierarchies are divided into separate tables.
table.
One fact table surrounded by dimension table which are in turn
It contains a fact table surrounded by dimension tables.
surrounded by dimension table

In a star schema, only single join creates the relationship


A snowflake schema requires many joins to fetch the data.
between the fact table and any dimension tables.
Simple DB Design. Very Complex DB Design.
Denormalized Data structure and query also run faster. Normalized Data Structure.
High level of Data redundancy Very low-level data redundancy
Single Dimension table contains aggregated data. Data Split into different Dimension Tables.

Cube processing is faster. Cube processing might be slow because of the complex join.

Offers higher-performing queries using Star Join Query


The Snowflake schema is represented by centralized fact table
Optimization. Tables may be connected with multiple
which unlikely connected with multiple dimensions.
dimensions.
OLAP(Online Analytics Processing) Operations
• Online analytical processing (OLAP) is a technology that organizes large business
databases and supports complex analysis. It can be used to perform complex analytical
queries without negatively affecting transactional systems.
• The databases that a business uses to store all its transactions and records are called
online transaction processing (OLTP) databases.
• OLAP is a widely spread technology belonging to Business Intelligence processes
developed to coordinate and analyze vast amounts of data. OLAP databases are stored in
the form of multidimensional cubes where each cube comprises the data supposed
relevant by a cube administrator. Through certain OLAP operations, a user is able to
obtain a specified view of the cube and extract requisite information from it.
• So this way it’s possible to get a necessary Pivot Table and Pivot Chart report.
• General OLAP operations involve Drill-up, Drill-down, Pivot, and Slice-and-Dice. Here
we’d like to expand the list and look through all possible OLAP operations with examples
for data mining including slicing and dicing in OLAP.
• But before defining what is OLAP operation is, let’s figure out what language is used in
this process.
• OLAP operations could be based on two OLAP languages: SQL and MDX.
Conti…
Drill Up
• This operation you can meet as a
part of pair drill up and drill down in
OLAP. Drill-up is an operation to
gather data from the cube either by
ascending a concept hierarchy for a
dimension or by dimension
reduction in order to receive
measures at a less detailed
granularity.
• So that to see a broader perspective
in compliance with the concept
hierarchy a user has to group
columns and unite the values. As
there are fewer specifics, one or
more dimensions from the data cube
will be deleted, when this OLAP
operation is run. In some sources
drill up and roll up operations in
OLAP come as synonyms, so this
variant is also possible.
Drill down
• OLAP Drill-down is an operation
opposite to Drill-up. It is carried
out either by descending a
concept hierarchy for a dimension
or by adding a new dimension. It
lets a user deploy highly detailed
data from a less detailed cube.
Consequently, when the
operation is run, one or more
dimensions from the data cube
must be appended to provide
more information elements.

• Have a look at an OLAP Drill-down


example in use:
Slice
• The next pair we are going to discuss is slice
and dice operations in OLAP. The Slice OLAP
operations take one specific dimension from
a cube given and represent a new sub-cube,
which provides information from another
point of view.
• It can create a new sub-cube by choosing
one or more dimensions. The use of Slice
implies the specified granularity level of the
dimension.

• OLAP Slice example will look the following


way:
Dice
• OLAP Dice emphasizes two or
more dimensions from a cube
given and suggests a new sub-
cube, as well as Slice
operation does. In order to
locate a single value for a
cube, it includes adding values
for each dimension.

• The diagram below shows


how Dice operation works:
Pivot
• This OLAP operation rotates the
axes of a cube to provide an
alternative view of the data
cube. Pivot clusters the data
with other dimensions which
helps analyze the performance
of a company or enterprise.

• Here’s an example of Pivot in


operation:
Conti…
Scoping
• The operation of Scoping restrains the presentation of the database objects to
a specified subset. It will let users receive and update certain data values
which they want. If there is a huge amount of data and a user needs to
constrain the access of information to a specified subset Scoping is mostly
conducive.
• Screening
• Screening is conducted to limit the set of data extracted.
• Drill across
• Drill across and Drill through in OLAP are another pair of opposite operations.
The operation Drill across reconciles cells from several data cubes which share
the same scheme.
Conti…
• Drill through
• OLAP Drill through enables to navigate from data at the lower level in a cube to data in the operational
systems whence the cube was ejected. The operation is usually exploited to identify the cause of outlier
values in a data cube.
• Sort
• Sort brings the cube back where the members of a dimension were sorted.
• Add Measure
• Thanks to this OLAP operation one is able to add new measures to a cube.
• Drop Measure
• In contrast to Add Measure, it’s also possible to get rid of a measure from a data cube if it's not necessary.
• Union
• Due to an opportunity of Union, you can unite a number of cubes that have the same scheme but separate
instances.
• Difference
• Difference eliminates the cells in a cube which are owned by another one. These two cubes must possess the
same scheme.
OLAP Servers
• Online Analytical Processing(OLAP) refers to a set of software tools used for
data analysis in order to make business decisions. OLAP provides a
platform for gaining insights from databases retrieved from multiple
database systems at the same time. It is based on a multidimensional data
model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data.
Many Business Intelligence (BI) applications rely on OLAP technology.
• Type of OLAP servers:
The three major types of OLAP servers are as follows:
• ROLAP
• MOLAP
• HOLAP
Relational OLAP (ROLAP):
• Relational On-Line Analytical Processing
(ROLAP) is primarily used for data stored in a
relational database, where both the base data
and dimension tables are stored as relational
tables. ROLAP servers are used to bridge the
gap between the relational back-end server
and the client’s front-end tools. ROLAP servers
store and manage warehouse data using
RDBMS, and OLAP middleware fills in the gaps.
• Benefits:
• It is compatible with data warehouses and OLTP
systems.
• The data size limitation of ROLAP technology is
determined by the underlying RDBMS. As a
result, ROLAP does not limit the amount of
data that can be stored.
• Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
Multidimensional OLAP (MOLAP):
• Through array-based multidimensional storage engines,
Multidimensional On-Line Analytical Processing (MOLAP)
supports multidimensional views of data. Storage utilization in
multidimensional data stores may be low if the data set is
sparse.
• MOLAP stores data on discs in the form of a specialized
multidimensional array structure. It is used for OLAP, which is
based on the arrays’ random access capability. Dimension
instances determine array elements, and the data or
measured value associated with each cell is typically stored in
the corresponding array element. The multidimensional array
is typically stored in MOLAP in a linear allocation based on
nested traversal of the axes in some predetermined order.
• Benefits:
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
• Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a
large amount of data cannot be stored in the cube itself.

Hybrid OLAP (HOLAP):
• ROLAP and MOLAP are combined in Hybrid On-Line
Analytical Processing (HOLAP). HOLAP offers greater
scalability than ROLAP and faster computation than
MOLAP.HOLAP is a hybrid of ROLAP and MOLAP.
HOLAP servers are capable of storing large amounts
of detailed data. On the one hand, HOLAP benefits
from ROLAP’s greater scalability. HOLAP, on the
other hand, makes use of cube technology for faster
performance and summary-type information.
Because detailed data is stored in a relational
database, cubes are smaller than MOLAP.
• Benefits:
• HOLAP combines the benefits of MOLAP and
ROLAP.
• Provide quick access at all aggregation levels.
• Limitations
• Because it supports both MOLAP and ROLAP
servers, HOLAP architecture is extremely complex.
• There is a greater likelihood of overlap, particularly
in their functionalities.
Other types of OLAP include:
• Web OLAP (WOLAP): WOLAP refers to an OLAP application that can be accessed
through a web browser. WOLAP, in contrast to traditional client/server OLAP
applications, is thought to have a three-tiered architecture consisting of three
components: a client, middleware, and a database server.
• Desktop OLAP (DOLAP): DOLAP is an abbreviation for desktop analytical
processing. In that case, the user can download the data from the source and
work with it on their desktop or laptop. In comparison to other OLAP
applications, functionality is limited. It is less expensive.
• Mobile OLAP (MOLAP): Wireless functionality or mobile devices are examples of
MOLAP. The user is working and accessing data via mobile devices.
• Spatial OLAP (SOLAP): SOLAP egress combines the capabilities of Geographic
Information Systems (GIS) and OLAP into a single user interface. SOLAP is created
because the data can be alphanumeric, image, or vector. This allows for the quick
and easy exploration of data stored in a spatial database.
Thank You!

You might also like