KEMBAR78
Dimensional Data Modeling - Lecture3 | PDF | Data Warehouse | Data Management
100% found this document useful (1 vote)
1K views87 pages

Dimensional Data Modeling - Lecture3

This document discusses dimensional data modeling and contrasts it with entity relationship (ER) modeling. Dimensional modeling presents data in a standard star schema with fact and dimension tables that allows for intuitive, high-performance querying. It addresses issues with ER modeling like complex queries and lack of understandability for end users. The strengths of dimensional modeling include its predictable structure, extensibility, and support for common modeling situations and aggregate management.

Uploaded by

ramanarekha7206
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
1K views87 pages

Dimensional Data Modeling - Lecture3

This document discusses dimensional data modeling and contrasts it with entity relationship (ER) modeling. Dimensional modeling presents data in a standard star schema with fact and dimension tables that allows for intuitive, high-performance querying. It addresses issues with ER modeling like complex queries and lack of understandability for end users. The strengths of dimensional modeling include its predictable structure, extensibility, and support for common modeling situations and aggregate management.

Uploaded by

ramanarekha7206
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 87

Dimensional Data Modeling

Ms. Pallavi N.Halarnkar


Assistant Professor
Computer Engineering Department
MPSTME
NMIMS University
Pallavi.halarnkar@gmail.com
2
Lecture Objectives
Define dimensional data model
Contrast with ER modeling

3
Entity Relationship Modeling: Review
Entity Relationship modeling is a technique used
to abstract users data requirements into a model
that can be analyzed and ultimately implemented.
The focus of ER modeling:
achieve processing and data storage efficiency by
reducing data redundancy (storing data elements once)
provide flexibility and ease of maintenance
protect the integrity of data by storing it once
ER modeling and normalization great for
transaction processing as it makes transactions as
simple as possible (as data stored only in one
place)


4
ER Model Example
However, normalized
databases become very
complex making queries
difficult and inefficient
a spiderweb of joins is
required for many
queries. A database
normalized for
transaction processing is
typically unusable for
non-technical users who
wish to perform queries


5
ER model issues
End users cannot understand or remember an ER model.
End users cannot navigate an ER model. There is no
graphical user interface (GUI) that takes a general ER
model and makes it usable by end users.
Software cannot usefully query a general ER model. Cost-
based optimizers that attempt to do this are notorious for
making the wrong choices, with disastrous consequences
for performance.
Use of the ER modeling technique defeats the basic allure
of data warehousing, namely intuitive and high-
performance retrieval of data.
The solution: the Dimensional Data Model

6
What is Dimensional Modeling (DM)?
DM is a logical design technique that seeks to present
the data in a standard, intuitive framework that allows
for high-performance access.
Can be implemented using a relational or a
multidimensional DBMS
Every dimensional model is composed of one table with
a multipart key, called the fact table, and a set of
smaller tables called dimension tables.
Each dimension table has a single-part primary key that
corresponds exactly to one of the components of the
multipart key in the fact table.
This characteristic "star-like" structure is often called a
star join. The term star join dates back to the earliest
days of relational databases.
7
Dimensional Model Example
8
Dimensional Model: Fact Tables
A fact table contains information about things that an
organization wants to measure.
A fact tables key is made up from the keys of two or
more parents.
A fact always resolves a many-to-many relationship
between the parent, or dimension tables.
The most useful fact tables also contain one or more
numerical measures, or facts, that occur for the
combination of keys that define each record.
Example: the facts are Dollars Sold, Units Sold, and
Dollars Cost.

9
Dimensional model: Fact Tables
The most useful facts in a fact table are numeric
and additive.
Additivity is crucial because data warehouse
applications almost never retrieve a single fact
table record; rather, they fetch back hundreds,
thousands, or even millions of these records at
a time, and often the most useful thing to do
with so many records is to add them up.

10
Dimensional Model: Dimension
Tables
Dimension tables contain information about how
an organization wants to analyze facts:
Show me sales revenue (fact) for last week (time)
for blue cups (product) in the western region
(geography)
Dimension tables most often contain descriptive
textual information Blue cups, Western Region
Dimension attributes are used as the source of
most of the interesting constraints in data
warehouse queries., and they are virtually
always the source of the row headers in the
SQL answer set.
11
Dimensional Model vs ER model
The key to understanding the relationship
between DM and ER is that a single ER diagram
breaks down into multiple DM diagrams, or
stars.
Think of a large ER diagram as representing
every possible business process within an
application. The ER diagram may have Sales
Calls, Order Entries, Shipment Invoices,
Customer Payments, and Product Returns, all
on the same diagram.
12
Dimensional Model vs ER model
Shipments
Returns
Sales Contact
Orders
Payments
13
Dimensional Model vs ER model
To create the individual stars that exist within
an application:
Look for many-to-many relationships in the ER model
containing numeric and additive facts and to
designate them as fact tables.
Alternatively, look for events or transactions
these may also be facts
Denormalize all of the remaining tables into flat
tables with single-part keys that connect directly to
the fact tables. These tables become the dimension
tables.
In cases where a dimension table connects to more
than one fact table, we represent this same
dimension table in both schemas, and we refer to the
dimension tables as "conformed" between the two
dimensional models.

14
1..1
Staff

staffNo {PK}
fName
lname
sAddress
jobTitle
salary
NIN
sex
dob

Client

clientNo {PK}
cAddress
cPostcode
cTelNo
dLicenseNo
sex
dob
Lesson

lessonNo {PK}
lessonDate
lessonTime
stage
progress
comments
mileageStart
mileageFinish

Vehicle

vehRegNo {PK}
model
make
color
capacity
Inspection

inspDate
inspTime
faultsFound
comments
Office

officeNo {PK}
oAddress
oPostcode
oTelNo
oFaxNo

Interview

iDate
iTime
iRoom
iComments
dLicense
Manages
RequiredFor
Has
Books
Undertakes
0..1
1..*
1..1
1..1
0..*
1..1
1..1 0..*
1..1
1..*
1..*
InspectedBy
Registers
0..*
1..1
Attends
Requests
Takes
1..1
1..1
0..*
1..1
1..1
DrivingTest



0..*
Sits
1..1
1..*
1..1
1..*
Runs
15
Dimensional Model - Example
16
DM Strengths
The dimensional model has a number of important data
warehouse advantages that the ER model lacks.
First, the dimensional model is a predictable, standard
framework. Report writers, query tools, and user interfaces can
all make strong assumptions about the dimensional model to
make the user interfaces more understandable and to make
processing more efficient.
Rather than using a cost-based optimizer, a database engine
can make very strong assumptions about first constraining the
dimension tables and then "attacking" the fact table all at once
with the Cartesian product of those dimension table keys
satisfying the user's constraints.
17
DM Strengths
A second strength of the dimensional model is
that the predictable framework of the star join
schema withstands unexpected changes in user
behavior. Every dimension is equivalent. All
dimensions can be thought of as symmetrically
equal entry points into the fact table. The
logical design can be done independent of
expected query patterns. The user interfaces
are symmetrical, the query strategies are
symmetrical, and the SQL generated against the
dimensional model is symmetrical.

18
DM Strengths
A third strength of the dimensional model is that it is
gracefully extensible to accommodate unexpected new
data elements and new design decisions.
Gracefully extensible:
all existing tables (both fact and dimension) can be changed in
place by simply adding new data rows in the table, or the table
can be changed in place with a SQL alter table command.
Data should not have to be reloaded.
No query tool or reporting tool needs to be reprogrammed to
accommodate the change.
Old applications continue to run without yielding different
results. Adding new unanticipated facts (that is, new additive
numeric fields in the fact table), as long as they are consistent
with the fundamental grain of the existing fact table
19
DM Strengths
A fourth strength of the dimensional model is
that there is a body of standard approaches for
handling common modeling situations in the
business world. These modeling situations
include:
Slowly changing dimensions, where a "constant"
dimension such as Product or Customer actually
evolves slowly and asynchronously.
Event-handling databases, where the fact table
usually turns out to be "factless.
Many others
20
DM Strengths
A final strength of the dimensional model is the
management of aggregates.
Aggregates are summary records that are logically
redundant with base data already in the data
warehouse, but they are used to enhance query
performance.
A comprehensive aggregate strategy is required in every
medium- and large-sized data warehouse
implementation.
All of the aggregate management software packages
and aggregate navigation utilities depend on a very
specific single structure of fact and dimension tables
that is absolutely dependent on the dimensional model.
21
ER vs DM Final Points
ER models are not appropriate for Data
Warehouses. ER modeling does not really model
a business; rather, it models the micro
relationships among data elements.
ER models are wildly variable in structure. As
such, it is extremely difficult to optimize query
performance.
Why ER is not suitable for Data Warehouses ?
End user cannot understand or remember an ER
Model. End User cannot navigate an ER Model.
There is no graphical user interface or GUI that
takes a general ER diagram and makes it usable
by end users.
ER modeling is not optimized for complex, ad-
hoc queries. They are optimized for repetitive
narrow queries.
Use of ER modeling technique defeats this basic
allure of data warehousing, namely intuitive and
high performance retrieval of data because it
leads to highly normalized relational tables.

22
Designing a Dimensional Model : Steps
Involved
Step 1 : Select the Business Process
The first step in the design is to decide what
business process (es) to model by combining an
understanding of the business requirements with an
understanding of the available data
Step 2 : Declare the Grain
Once the business process has been identified, the
data warehouse team faces a serious decision about
the granularity. What level of detail must be made
available in the dimensional model?

23
Designing a Dimensional Model : Steps
Involved
The grain of a fact table represents the level of detail
of information in a fact table. Declaring the grain
means specifying exactly what an individual fact
table record represents.
It is recommended that the most atomic information
captured by a business process. Atomic data is the
most detailed information collected. The more
detailed and atomic the fact measurements are, the
more we know and we can analyze the data better.

24
In the star schema discussed above, the most
detailed data would be transaction line item detail in
the sale receipt.
(date, time, product code, product_name,
price/unit, number_of_units, amount)
= (18-SEP-2002, 11.02, p1,
dettol soap, 15, 2, 30)
25
Designing a Dimensional Model : Steps
Involved
But in the above dimensional model we provide sales
data rolled up by product(all records corresponding
to the same product are combined) in a store on a
day. A typical fact table record would look like this:
18-SEP-2002, Product1, Store1,
150, 600
This record tells us that on 18
th
Sept. 150 units of
Product1 was sold for Rs. 600 from Store1.


26
Designing a Dimensional Model : Steps
Involved
Step 3 : Choose the Dimensions
Once the grain of the fact table has been chosen, the
date, product, and store dimensions are readily
identified.
It is often possible to add more dimensions to the
basic grain of the fact table, where these additional
dimensions naturally take on only one value under
each combination of the primary dimensions.
If the additional dimension violates the grain by
causing additional fact rows to be generated, then
the grain must be revised to accommodate this
dimension.


27
Designing a Dimensional Model : Steps
Involved
Step 4 : Identify the Facts
The first step in identifying fact tables is where we
examine the business, and identify the transaction
that may be of interest.
In our example the electronic point of sale (EPOS)
transactions give us two facts, quantity sold and sale
amount.

28
Designing a Dimensional Model : Steps
Involved
Advanced Concepts
29
Snowflake Schema & star flake schema
Snowflaking is removing low cardinality (an attribute not
having low distinct values to table cardinality ratio)
textual attributes from dimension tables and placing
them in secondary dimension tables.
For instance, a product category can be treated this way
and physically removed from the low-level product
dimension table by normalizing the dimension table.
This is particularly done on large dimension tables.
Snowflaking a dimension means normalizing it and
making it more manageable by reducing its size. But
this may have an adverse effect on performance, as
joins need to be performed.

30
31
star flake schema
star flake schema is a hybrid structure that
contains a mixture of star(de normalised) and
snowflake(normalised) schemas.
Allows dimensions to be present in both forms
to cater for different query requirements
Fact Constellation
A Dimensional Model , which contains more
than one fact table sharing one or more
conformed dimension tables , is referred to as
fact constellation
32
Differentiate between Star Schema and
Snowflake Schema
Star Schema Snowflake schema
Star schema contains the dimension
tables mapped around one or more
fact tables.
A Snowflake schema contains in-depth
joins because the tables are splitted in
to many pieces.
It is a denormalized model It is the normalised form of Star
schema
No need to use complicated joins. We have to use complicated joins,
since we have more tables.
Queries results fastly There will be some delay in
processing the Query.
Star Schemas are usually not in BCNF
form. All the primary keys of the
dimension tables are in the fact table
In Snowflake schema, dimension
tables are in 3NF , so we get more
dimension tables which are linked by
primary foreign key relation.
33
Examples on Star Schema and Snowflake Schema
All electronics company have sales department.
Sales consider four dimensions namely time,
item, branch and location. The schema contain
a central fact tables sales with two measures
dollars_sold and unit_sold
Design star schema, snowflake schema and fact
constellation for same.

34
Star Schema

35
Snowflake Schema

36
Fact Constellation

37
Example 2
The Mumbai university wants you to help design
a star schema to record grades for course
completed by students. There are four
dimensional tables namely course section,
professor, student, period with attributes as
follows :
Course_section Attributes: Course_Id, Section_number, Course_name,
Units, Room_id, Roomcapacity. During a given semester the college offers
an average of 500 course sections
Professor Attributes: Prof_id, Prof_Name, Title, Department_id,
department_name
Student Attributes: Student_id, Student_name, Major. Each Course section
has an average of 60 students

38
Period Attributes: Semester_id, Year. The
database will contain Data for 30 months
periods. The only fact that is to be recorded in
the fact table is course Grade
Answer the following Questions
(a) Design the star schema for this problem
(b) Estimate the number of rows in the fact table, using the
assumptions stated above and also estimate the total size of the
fact table (in bytes) assuming that each field has an average
of 5 bytes
(c) Can you convert this star schema to a snowflake schema?
Justify your answer and design a snowflake schema if it is possible

39
Star Schema

40
Total Courses Conducted by university =500
Each Course has average students= 60
University stores data for 30 months
Total Student in University for all courses in 30 months
=500*60 = 30000
Time Dimension = 30 months = 5 Semesters (Assume 1
semester= 6 months)
Now, Number of rows of fact table= 30000*5= 150000 (one
student has 5 grades for 5 semesters)
(c) Snowflake Schema : Yes , the above star schema can be
converted to a snowflake schema, considering the following
assumptions

41
Snowflake Schema

42
Give Information Package for recording
information requirements for Hotel Occupancy
considering dimensions like Time, Hotel etc.
Design star schema from the information
package.
43
Hotel Room Type Time
Hotel Id Room id Time id
Branch Name room type Year
Branch Code room size Quarter
Region number of beds Month
Address type of bed Date
city/stat/zip max occupants day of week
construction year Suite day of month,
renovation year holiday flag
44

45
Information Access and Delivery

Classes of Users
Computer Profeciency Users
Casual or Novice User
This type of users uses the data warehouse
occasionally, not daily.
Regular User
Daily users of the data warehouse are called as
regular users.
Power User
User who uses warehouse to create report and
queries are highly proficient with the technology
are called as the power user.

Classes of Users
Users based on their Job Functions

High-Level Executives and Managers : Users
who take high-level strategic decisions.

Technical Analysts : Users who makes use of OLAP
operations and complex analysis.

Classes of Users
Tourists
A user who visits the data warehouse for information like
a tourist and gets the overall information contents of the
data warehouse.
Operators
The data of ongoing transaction at a detailed level are
used by the operators
Explorers
These types of users try to investigate the data and dig
up useful patterns using complex queries.
Miners
Miners discovers new and unknown pattern in the data.
Information Delivery
The Need for Multidimensional
Information
Multidimensionality allows for assigning
relationships between seemingly unrelated
information fields.
It allows combining information from different
sources, and relating them to obtain useful
knowledge.

Simple 2 Dimensional Data Multi-
Dimensional Data with the
Time axis
OLAP
If a person needs to predict or at least observe
the trend of some variable when some affecting
value is varied, then it is not possible for a
conventional data warehouse to produce the
information required. ,this is where OLAP comes
into play.

OLAP or online-analytical processing is a
technology complementary to data warehousing,
in the sense that it works on the data stored in a
warehouse. Warehouses need OLAP to produce
multidimensional views of the information that it
stores, and OLAP needs data warehouses for the
data it needs to process.

OLAP Defined
OLAP may be defined in terms of just five
keywords Fast Analysis of Shared
Multidimensional Information.
Fast, such that the most complex queries
requiring not more than 5 seconds to be
processed
Analysis alludes to the process of analyzing
information of all relevant kinds in order to
process complex queries and establishing clear
criteria for the results of such queries.
OLAP Defined
The information to be used for analysis is
generally obtained from a shared source, such as
a data warehouse
The information may be related in more than one
or two dimensions. For example, a particular set
of business data may be related, variously, to
sales figures, market trends, consumer buying
patterns, supplier conditions and the liquidity of
the business. Presented in such a
multidimensional detail, such information can
be useful and vital to managerial decision-making.
Differences between OLTP and OLAP
(Application Differences)
OLTP OLAP
Transaction oriented Subject oriented
High Create/Read/Update/ Delete (CRUD)
activity
High Read activity
Many users Few users
Continuous updates many sources Batch updates single source
Real-time information Historical information
Tactical decision-making Strategic planning
Controlled, customized delivery Uncontrolled, generalized delivery
RDBMS RDBMS and/or MDBMS
Operational database Informational database
Modeling Objectives Differences
OLTP OLAP
High transaction volumes using few records
at a time
Low transaction volumes using many records
at a time
Balancing needs of online vs scheduled batch
processing
Design for on-demand online processing
Highly volatile data Non-volatile data
Data redundancy - BAD Data redundancy GOOD
Few levels of granularity Multiple levels of granularity
Complex database designs used by IT
personnel
Simpler database designs with business-
friendly constructs
Model Differences

OLTP

OLAP
Single purpose model supports Operational
System
Multiple models support Informational
Systems
Full set of Enterprise data Subset of Enterprise data
Eliminate redundancy Plan for redundancy
Natural or surrogate keys Surrogate keys
Validate Model against business Function
Analysis
Validate Model against reporting requirements
Technical metadata depends on business
requirements
Technical metadata depends on data mapping
results
This moment in time is important Many moments in time are essential elements
OLAP Techniques
Consolidation or Roll
Up

Roll - Up
Consolidation involves the aggregation of data. This can
involve simple roll-ups or complex grouping involving
inter-related data.

For example, the figure below shows the result of roll
up operation performed on the central cube by climbing
up the concept hierarchy for location. This hierarchy
was defined as the total order street
<city<province_or_state<country. The roll up operation
shown aggregates the data by ascending the location
hierarchy from the data by city to the level of country.
In other words rather than grouping the data by city,
the resulting cube groups the data by country.

OLAP : Drill Down
Slicing
Dicing
Pivot/ Rotate
Other OLAP operations
Drill across: executes query involving(i.e
across) more than one fact table
Drill through : operation makes use of
relational SQL facilities to drill through the
bottom level of the data cube down to its back
end relational table
Other operations may include ranking the
top N, or bottom N items in a list as well
as computing moving averages, growth
rates, interests, internal rates of return,
depreciation, currency conversions and
statistical functions.

OLAP Applications
1. Financial Applications
Activity-based costing (resource allocation)
Budgeting
2. Marketing/Sales Applications
Market Research Analysis
Sales Forecasting
Promotions Analysis
Customer Analyses
Market/Customer Segmentation
3. Business Modelling
Simulating business behaviour
Extensive, real-time decision support system for
managers

Benefits of Using OLAP
OLAP helps managers in decision-making
through the multidimensional data views that it
is capable of providing, thus increasing their
productivity.
OLAP applications are self-sufficient owing to
the inherent flexibility provided to the organized
databases.
It enables simulation of business models and
problems, through extensive usage of analysis-
capabilities.
In conjunction with data warehousing, OLAP
can be used to provide reduction in the
application backlog, faster information retrieval
and reduction in query drag..

Approaches to OLAP Servers
MOLAP
This is the more traditional way of OLAP analysis. In
MOLAP, data is stored in a multidimensional cube.
The storage is not in the relational database, but in
proprietary formats.

MOLAP
Advantages
Excellent performance: MOLAP cubes are built
for fast data retrieval, and are optimal for
slicing and dicing operations.
Can perform complex calculations:
Disadvantages
Limited in the amount of data it can handle
Requires additional investment:
ROLAP
This methodology relies on manipulating the
data stored in the relational database to give
the appearance of traditional OLAP's slicing and
dicing functionality.
Advantages
Can handle large amounts of data:
Can leverage functionalities inherent in the
relational database
ROLAP
Disadvantages
Performance can be slow: Because each ROLAP
report is essentially a SQL query (or multiple
SQL queries) in the relational database, the
query time can be long if the underlying data
size is large.
Limited by SQL functionalities


HOLAP
HOLAP technologies attempt to combine the
advantages of MOLAP and ROLAP.
For summary-type information, HOLAP
leverages cube technology for faster
performance.
When detail information is needed, HOLAP can
drill through from the cube into the underlying
relational data.
For example, a HOLAP server may allow large
volumes of detail data to be stored in a relational
database, while aggregations are kept in a separate
MOLAP store.
The Microsoft SQL Server 7.0 OLAP Services supports
a hybrid OLAP server.

Specialized SQL Servers
To meet the growing demand of OLAP
processing in relational databases, some
relational and data warehousing firms (e.g., Red
Brick from Informix) implement specialized SQL
servers that provide advanced query language
and query processing support for SQL queries
over star and snowflake schemas in a read-only
environment.

Introduction to Weka Tool
Outline

Introduction

Weka Tools/Functions

How to use Weka?
Weka Data File Format (Input)
Weka for Data Mining
Sample Output from Weka (Output)

Conclusion


Introduction to Weka
(Data Mining Tool)
Weka was developed at the University of
Waikato in NewZealand .
http://www.cs.waikato.ac.nz/ml/weka/

Weka is a open source data mining tool
developed in Java. It is used for research,
education, and applications. It can be run
on Windows, Linux and Mac.


19-09-2014 76
What can Weka do?
Weka is a collection of machine learning
algorithms for data mining tasks.

The algorithms can either be applied
directly to a dataset (using GUI) or called
from your own Java code (using Weka Java
library).

Weka contains tools for data pre-
processing, classification, regression,
clustering, association rules, and
visualization.
19-09-2014 77
Wekas Role in the Big Picture
Input
Raw data
Data Ming
by Weka

Pre-processing
Classification
Regression
Clustering
Association Rules
Visualization

Output
Result

19-09-2014 78
19-09-2014 79
Weka GUI
Different analysis tools/functions
Different attributes to
choose
The value set of the chosen attribute
and the # of input items with each
value
19-09-2014 80
Weka GUI
Classification Algorithms
19-09-2014 81
Sample Output from Weka
19-09-2014 82
Interpretation of Result
YES NO
YES TP(36) FN(5)
NO FP(19) TN(10)
19-09-2014 83
Accuracy Measurement Parameters


Accuracy Measurement
Parameters
Formula
1. True Positive Rate TPR= (TP)/(TP+FN)
2.False Positive Rate FPR= (FP)/ (FP+TN)
3.Precision P = ( TP)/(TP+FP)
4.Recall R= (TP)/(TP+FN)
5.F-Measure F= (2*P*R)/(P+R)
6.ROC ROC=True Positive Rate/False
Positive rate
Note: For plotting the curve.
19-09-2014 84
Analysis of Classification Result
0
2
4
6
8
10
12
Actual Data KNN C4.5 Nave Bayes
play
Not Play
play Not Play
Actual Data 9 5
KNN 11 3
C4.5 10 4
Nave Bayes 11 3
19-09-2014 85
Analysis of Clustering Result
Norma
l
Intrusio
ns
Actual
Clusters
498 2
Simple K
means
303 197
Farthest
First
491 9
Hierarchic
al
Clustering
493 7
0
100
200
300
400
500
600
Actual
Clusters
Simple K
means
Farthest
First
Hierarchical
Clustering
Normal
Intrusions
19-09-2014 86
Conclusion about Weka
The overall goal of Weka is to build a state-of-the-art
facility for developing machine learning (ML) techniques
and allow people to apply them to real-world data mining
problems.

Detailed documentation about different functions provided
by Weka can be found on Weka website.

WEKA is available at:
http://www.cs.waikato.ac.nz/ml/weka

19-09-2014 87

You might also like