Data Warehouse Introduction
Spring 2016
Zafar Heydari, Farid
Outlines About Lecturer
OLAP vs OLTP
Why Data Warehouse
Designing Data Warehouse
Designing Dimensions
Designing Fact Tables
OLAP vs. OLTP
ROLAP
OLAP
MOLAP
OLTP SSIS
OLTP- Online Transactional Processing
Designed to solve everyday work transactions (i.e. sales, customer care, manufacturing)
Points where the data is captured and recorded in the company
Very efficient in the management of specific information : ERP, CRM, HRM, SCM, Email, Others
Disadvantages Advantages
• Difficult to manage historical • Reduces paper trails
information • Simple & effective
• High cost of maintenance and • Highly accurate Advantages
gathering distributed data
• Need technical skills and
resources
OLAP – Online Analytical Processing
Designed specifically to obtain information, analyze and solve business problems
Specific analytical information is added to the data
They use a different database technology, optimized to extract information.
Analytical systems unify all of the company’s data in one system: the Data Warehouse
What’s in Cube?
OLAP Cubes
They let you analyze all of the information available in the Data
Warehouse
Each cube stores a set of specific information, and contains different
“measures” and “dimensions”
Measures
numbers (amounts, quantities, percentages)
Dimensions
contain attributes to filter and order information
Samples of Dimension & MeasuresDimension
Sales Stock / Inventory Supply Orders
• Dimensions: • Dimensions: • Dimensions: • Dimensions:
• Date / Time • Date / Time • Date / Time • Date / Time
• Customer • Store / • Supplier • Product
• Branch / Branch • Product • Customer
Store • Product • Contract • Salesperson
• Product • Contract • Terms of sale
• Measures: • Measures: terms • Measures:
• Quantity • Quantity • Type of • Quantity
• Price • Price transactions • Amount
• Discount • Cost • Measures: • Discount
• Profit • Quantity
• Amount
• Cost
Why is it multidimensional?
A dimension is one of the “edges” of your business
It is called multidimensional because you can see the
information from different “edges” at the same time
Customers
Suppliers
Employees
Products
Places (Factories, Branches,…)
Time
Transaction Types ( Invoices, Orders, Quotes,…)
OLAP vs. OLTP
OLTP DB of Sales
Highly Normalized
Lots of Joins
Archived Data from 1/1/1
Low Performance
Multiple Data Bases
Different Naming &
definitions
Why Data Warehouses?
Simplified Reporting
Archived Data
Different Data Sources
Consistent Central Hub for Data Analysis
OLTP is highly Normalized design
OLAP/DW are highly DE-Normalized design
Designing Data Warehouse (1)
FACT Tables
(Numbers)
Production De-Normalized
Environment Dimension
Dimension
(Attributes)
DIM Sub-Category
Prototype Normalized DIM Region
DIM Category
Environment Dimension DIM Country
Designing Data Warehouse (2)
Star Schema : Focuses on an area of OLTP data
Central Fact table contains Metric data for Analysis
Dimensions contain attributes to describe fact data
Snowflake Schema extends De-normalized Star Schema
Designing Dimensions (1)
Dimension tables contain:
Attributes : to filter/slice Fact
data
Labels: to get context of the
report
Keys: Primary Keys(Surrogate
Keys) & Natural Keys (OLTP
Key)
Lineage Key: Track where &
when the data comes (Master
lineage data)
Designing Dimensions (2)
SCD: Slowly Changing Dimensions
How do we deal with updates
Determine historical aspects of dimension
data
Type 1: Don’t maintain history
Type 2: Save history by adding a new
record
Type 3: Only Save Previous Value (Less
Popular)
Sample : Department name :
Marketing Sales Marketing
Designing Dimensions (3)
Natural, Durable, and Supernatural Keys
employee number (natural key)
durable key/ durable supernatural key: persist key, independent of the original business process
Flags and Indicators as Textual Attributes
Cryptic abbreviations, true/false flags, and operational indicators should be supplemented in
dimension tables with full text meaningful words
Null Attributes in Dimensions
a given dimension row has not been fully populated,
or when there are attributes that are not applicable to all the dimension’s rows.
In both cases, we recommend substituting a descriptive string, such as Unknown or Not Applicable
Designing Dimensions (4)
Calendar Date Dimensions
Dim Date has different attributes such as week
number, month name, fiscal period, and national
holiday indicator, …
To facilitate partitioning, the primary key of a date
dimension can be more meaningful, such as an
integer representing YYYYMMDD
the date dimension table needs a special row to
represent unknown or to-be-determined dates
a separate date/time stamp can be added to the
fact table. The date/time stamp is not a foreign key
to a dimension table
Designing Dimensions (5)
Role-Playing Dimensions
A single physical dimension can be referenced multiple times in a fact table
For instance: A fact table can have several dates, each of which is represented by a foreign
key to the date dimension. It is essential that each foreign key refers to a separate view of
the date dimension so that the references are independent.
These separate dimension views (with unique attribute column names) are called roles.
Designing Dimensions (6)
Hierarchies:
De-normalized Flattened Dimensions (Star)
Multiple Hierarchies in Dimensions (Snow flake)
Which Data levels (Star/ Snowflake?) : Depends on usability in all fact tables
Star Sample
Snowflake Sample
Hybrid Sample
Designing Fact Tables (1)
Fact tables contain:
Key: Primary Keys , Dimension
Keys
Measure: Calculated Numbers
Lineage: Track where & when
the data comes (Master
lineage data)
Supportive labels
Designing Fact Tables (2)
Measures Additivity:
Additive: All-dimensions
Non-Additive: No dimensions
(Unit Cost)
Semi-Additive: Some
dimensions (All except time)
(Units Balance)
Designing Fact Tables (3)
Fact-less Table:
represent Many-to-Many
relationships in Data
Warehouse.
Designing Fact Tables (4)
Nulls in Fact Tables:
The aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do
the “right thing” with null facts.
Nulls must be avoided in the fact table’s foreign keys
The associated dimension table must have a default row (and
surrogate key) representing the unknown or not applicable
condition.
Designing Fact Tables (5)
Conformed Facts
If the same measurement appears in separate fact
tables:
If the separate fact definitions are consistent, the
conformed facts should be identically named
but if they are incompatible, they should be differently
named to alert the business users and BI applications
Designing Fact Tables (6)
Periodic Snapshot Fact Tables
A row in a periodic snapshot fact table summarizes many
measurement events occurring over a standard period, such as
a day, a week, or a month.
The grain is the period, not the individual transaction.
if no activity takes place during the period, a row is typically
inserted in the fact table containing a zero or null for each fact.
Designing Fact Tables (7)
Accumulating Snapshot Fact Tables
A row in an accumulating snapshot
fact table summarizes the
measurement events occurring at
predictable steps between the
beginning and the end of a
process.
Samples: Pipeline or workflow
processes, such as order
fulfillment or claim processing
There is a date foreign key in the
fact table for each critical
milestone in the process.
As pipeline progress occurs, the
accumulating fact table row is
revisited and updated.
Designing Fact Tables (8)
Consolidated Fact Tables
to combine facts from multiple processes together into
a single consolidated fact table.
(Sales actuals can be consolidated with sales forecasts)
Designing Fact Tables (9)
Header/Line Fact Tables
parent/child schemas), all the header-level dimension foreign keys and degenerate
dimensions should be included on the line-level fact table.
Managing Fact Data
Indexing Fact Tables:
Clustered on 1 small numeric PK
Non-clustered on FK
Column Store on Large Tables
Lineage Key
Partitioning on Large table
P-Schema
Simplified maintenance
Improve performance P-Function
Lineage
Directly in Fact tables
Centralized Lineage Dimension
Meta data about columns
Kimball Dimensional Modeling Techniques
Gather Business Requirements and Data Realities
Collaborative Dimensional Modeling Workshops
Four-Step Dimensional Design Process:
1. Select the business process.
2. Declare the grain. (atomic grain)
3. Identify the dimensions.
who, what, where, when, why, and how OF CONTECT OF BUSINESS
PROCESS EVENT
4. Identify the facts.
Exercise 1
Answer 1
Any Question?
Spring 2016
Zafar Heydari, Farid