Data Warehouse and OLAP
Data Warehouse and OLAP
1
What is Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately
from the organization’s operational database.
– Support information processing by providing a solid platform
of consolidated, historical data for analysis.
– W. H. Inmon — “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process.”
• Data warehousing:
– The process of constructing and using data warehouses
2
Data Warehouse — Key Features (1)
• Subject-Oriented:
– Organized around major subjects:
• E.g., customer, product, sales.
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject
issues by excluding useless data in the decision support process.
• Integrated:
– Constructed from multiple, heterogeneous data sources:
• RDBs, flat files, on-line transaction records, …
– Applying data cleaning and data integration techniques.
• Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
3
Data Warehouse — Key Features (2)
• Time Variant:
– The time horizon for the data warehouse is significantly longer
than that of operational systems.
• Current value data ⇔ historical perspective info. (past 5-10 yr).
– Every key structure in the data warehouse:
• Contains an element of time [explicitly or implicitly],
• But the key of tuples may or may not contain “time element”.
• Non-Volatile:
– A physically separate store of data transformed (copied) from
the operational environment into one semantic data store.
• Operational data update does not occur in data warehouse.
• Require no transaction processing, recovery, and concurrency
control mechanisms. [⇐ on-line database operations]
– Requires only two operations in data accessing:
• initial loading of data and access of data. 4
Compared w. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
– Build wrappers/mediators on top of heterogeneous DB.
• E.g., IBM Data Joiner and Informix DataBlade.
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
7
Why Separate Data Warehouse?
• High performance for both systems: [asynchronous]
– DBMS ⇒ tuned for OLTP:
• access methods, indexing, concurrency control, recovery, etc.
– Warehouse ⇒ tuned for OLAP:
• complex OLAP queries, multidimensional view, consolidation.
8
From Tables and Spreadsheets to Data Cubes
• Data warehouses utilize an n-dimensional data model
to view data in the form of a data cube.
– A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions.
• Dimensions: time, item, branch, location, supplier, …
– Dimension tables, each for a dimension.
• E.g., item (item_name, brand, type).
• E.g., time (day, week, month, quarter, year).
– Fact table contains numerical measures (say, dollars_sold,
units_sold, amount_budgeted, …) and keys to each of the
related dimension tables.
• In data warehousing literature,
– An n-D base cube is called a base cuboid.
– The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid.
– The lattice of cuboids forms a data cube. [group by]
9
Cube: A Lattice of Cuboids
all
C04 0-D (apex) cuboid
time,item,location time,location,supplier
C34 3-D cuboids
time,item,supplier item,location,supplier
10
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of
dimension tables
11
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
units_sold
branch_name street
branch_type dollars_sold city
province_or_street
avg_sales
country
Measures
12
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table
item_name
month supplier_key
time_key brand
quarter supplier_type
type
year item_key supplier_key
branch_key
branch location_key
branch_key units_sold location
branch_name
dollars_sold location_key
branch_type city
street
avg_sales city_key city_key
city
Measures province_or_street
country
13
Example of Fact Constellation
time
Shipping Fact Table
time_key item
day item_key time_key
day_of_the_week Sales Fact Table item_name
month item_key
brand
quarter time_key type shipper_key
year supplier_type
item_key from_location
branch_key to_location
branch location
location_key dollars_cost
branch_key location_key
units_sold units_shipped
branch_name street
branch_type dollars_sold city
province_or_street
avg_sales
country
shipper
Measures shipper_key
shipper_name
location_key
shipper_type
14
A Data Mining Query Language, DMQL:
Language Primitives
• Cube Definition (Fact Table):
– define cube <cube_name> [<dimension_list>]:
<measure_list>
18
Measures: Three Categories
• Distributive: if the result derived by applying the
function to n aggregate values is the same as that
derived by applying the function on all the data
without partitioning.
– E.g., count(), sum(), min(), max(). ⇐ aggregate
all all
Specification of hierarchies
• Schema hierarchy
– day < {month < quarter;
week} < year
• Set_grouping hierarchy
– {1..10} < inexpensive
Year
Quarter
Month Week
Day
21
Multidimensional Data
• Sales volume as a function of product, month, and
region.
Dimensions: Product, Location, Time
n
gio
Office Day
22
A Sample Data Cube
Total annual sales
Date of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
t
uc TV
od
PC
Pr
VCR
sum
U.S.A
Country
Canada
Mexico
sum
23
Cuboids Corresponding to the Cube
all
C03 0-D(apex) cuboid
3
product date country
C1 1-D cuboids
product,date product,country date, country
3
C 2 2-D cuboids
24
Browsing a Data Cube
• Visualization
• OLAP capabilities
• Interactive manipulation
25
Typical OLAP Operations ref. Fig. 2.10 p.59
• Other operations
– drill across: involving (across) more than one fact table.
– drill through: through the bottom level of the cube to its
back-end relational tables (using SQL).
26
A Star-Net Query Model
Each circle (abstract level) is called a footprint.
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location
Promotion Organization
27
Design of a Data Warehouse:
A Business Analysis Framework
• Four views regarding the design of a data warehouse:
– Top-down view:
• allows selection of the relevant information necessary for the data
warehouse.
– Data source view:
• exposes the information being captured, stored, and managed by
operational systems.
– Data warehouse view:
• consists of fact tables and dimension tables.
– Business query view:
• sees the perspectives of data in the warehouse from the view of
end-user.
28
Data Warehouse Design Process
• Top-down, bottom-up approaches or hybrid:
– Top-down: Starts with overall design and planning (mature).
– Bottom-up: Starts with experiments and prototypes (rapid).
Monitor
& OLAP Server
Metadata
other Integrator
sources Analysis
Extract Query
Transform Data
Operational Serve Reports
Load Warehouse
DBs Refresh Data mining
Data Marts
• Data Mart:
– a subset of corporate-wide data that is of value to a specific
groups of users.
• Its scope is confined to specific, selected groups, such as
marketing data mart.
– Independent vs. dep. (directly from warehouse) data mart.
• Virtual warehouse:
– A set of views over operational databases.
– Only some summary views are materialized.
31
Data Warehouse Development:
A Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts
T = ∏ i =1 ( Li + 1) ⇒ 2n , if Li = 1.
n
35
Cube Computation: ROLAP-Based Method (1)
• Efficient cube computation methods:
– ROLAP-based cubing algorithms (Agarwal et al’96)
– Array-based cubing algorithm (Zhao et al’97)
– Bottom-up computation method (Bayer & Ramarkrishnan’99)
36
Cube Computation: ROLAP-Based Method (2)
• This is not in the textbook but in a research paper
• Hash/sort based methods (Agarwal et. al. VLDB’96)
– Smallest-parent: computing a cuboid from the smallest
cuboid previously computed.
– Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os.
– Amortize-scans: computing as many as possible cuboids at
the same time to amortize disk reads.
– Share-sorts: sharing sorting costs cross multiple cuboids
when sort-based method is used.
– Share-partitions: sharing the partitioning cost cross multiple
cuboids when hash-based algorithms are used.
37
Multi-way Array Aggregation for Cube
Computation
• Partition arrays into chunks (a small sub-cube fit in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in “multiway” by visiting cube cells in the
order which minimizes the # of times to re-visit each cell,
thereby reducing memory access and storage cost.
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0 What is the best
b3 B13 14 15 16 60
44 traversing order
28 56
b2 9
B 40 to do multi-way
24 52
b1 5 36
1 2 3 4
20 aggregation?
b0
a0 a1 a2 a3 ordering
A 38
Multi-way Array Aggregation for Cube
Computation
AC
C c3 61 62 63 64
BC c2 45 46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B b2 28 56
9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
b0c0 of BC
AB
39
Multi-way Array Aggregation for Cube
Computation
AC
a0b0c0 ⇒ a0b0, b0c0, a0c0.
C c3 61
c2 45
62 63 64
BC 46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B b2 28 56
9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3 A=40
A B=400
C=4000
⇒ AB=16000
⇒ AC=160000
⇒ BC=1600000
AB
40
Multi-Way Array Aggregation for Cube
Computation (cont’d)
• Method: the planes should be sorted and computed
according to their size in ascending order.
– See the details of Example 2.12 (pp. 75-78)
– Idea: keep the smallest plane in the main memory, fetch and
compute only one chunk at a time for the largest plane.
42
Indexing OLAP Data: Join Indices
• Join index: JI(R-id, S-id) where R (R-id, …)
S (S-id, …)
fact table
• Traditional indices map the values to a list
of record ids
– It materializes relational join in JI file and
speeds up relational join — a rather costly
operation
• In data warehouses, join index relates the
values of the dimensions of a start schema
to rows in the fact table.
– E.g. fact table: Sales and two dimensions
city and product
• A join index on city maintains for each
distinct city a list of R-IDs of the tuples
recording the Sales in the city
– Join indices can span multiple dimensions.
43
Efficient Processing OLAP Queries
• Determine which operations should be performed on
the available cuboids:
– transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
44
Metadata Repository
• Meta data is the data defining warehouse objects.
It has the following kinds:
– Description of the structure of the warehouse.
• schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents.
– Operational meta-data.
• data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (usage statistics, error reports, audit trails).
– The algorithms used for summarization.
– The mapping from operational environment to the data
warehouse.
– Data related to system performance.
• warehouse schema, view and derived data definitions.
– Business data.
• business terms & definitions, ownership of data, charging policies.
45
Data Warehouse Back-End Tools and Utilities
• Data extraction:
– get data from multiple, heterogeneous, and external sources.
• Data cleaning:
– detect errors in the data and rectify them when possible.
• Data transformation:
– convert data from legacy or host format to warehouse format.
• Load:
– sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions.
• Refresh
– propagate the updates from the data sources to the
warehouse.
46
Discovery-Driven Exploration of Data Cubes
Drill
down
high InExp
high SelfExp
high InExp
Drill down
48
Complex Aggregation at Multiple Granularities:
Multi-Feature Cubes
• Multi-feature cubes (Ross, et al. 1998): Compute
complex queries involving multiple dependent
aggregates at multiple granularities.
– Ex. Grouping by all subsets of {item, region, month}, find
the maximum price in 1997 for each group, and the total sales
among all maximum price tuples.
• select item, region, month, max(price), sum(R.sales)
from purchases Dependent!
where year = 1997
grouping variable of group gi
cube by item, region, month: R
such that R.price = max(price)
Layer3
OLAM Engine OLAP Engine
OLAP/OLAM
Data Cube API
Layer2
Meta MDDB
MDDB Data
Database API
Filtering&Integration Filtering Layer1
Data cleaning Data
Data
Databases Repository
Data integration Warehouse
52
Summary
• Data warehouse:
– A subject-oriented, integrated, time-variant, & nonvolatile collection
of data in support of management’s decision-making process.
55