Data Mining & Warehousing
Chapter 2: Online Analytical Processing
2.1 Introduction:
What is dimension?
Dimensions: essentially it is an attribute; it is an ordinate within
multidimensional structure consisting of list of ordered values (called members).
Ex: degree dimension has member BSc.
Dimension always has hierarchies that show parent/ child relationship between member and
dimension. Ex: time (year, month, day, hour, minute, sec), country (states, districts, villages…)
What is fact?
A fact is defined or determined by a combination of dimension values for which a non-
null value exists.
Ex: the dimension values(Bsc, India, Merit, 2011) for dimension (degree, country,
scholarship, year) describe a fact.
BSc MCA BE
INDIA 100 200 500
CHINA 50 60 200
Fact
What is Measure?
Measure: The non-null values of facts are the numerical values stored in each data cube
cell are called measure.
Dept of CSE, KLESCET – Shrikant Athanikar Page 1
Data Mining & Warehousing
2.2 OLAP (Online Analytical Processing):
OLAP is different from ODS and Data warehouse,
It is primarily a software technology concerned with fast analysis of enterprise information.
Often OLAP systems are data warehouse front end software tools to make aggregation data
available efficiently, for advanced analysis, to an enterprise’s measures.
Another Definition of OLAP:
An even simpler definition is that OLAP is fast analysis of shared multidimensional
information for advanced analysis.
This definition (sometimes called FASMI) implies that most OLAP queries should be answered
within seconds.
(FASMI)
Fast
Analytic
Shared
Multidimensional
Information
OLAP systems provide facilities for a manager to pose ad hoc complex queries to obtain the
information that he/she requires.
What is Business Intelligence?
It is sometimes used to mean both data warehousing and OLAP.
Or
It has been defined as the user-centered process of exploring data, data relationships and trends,
thereby helping to improve overall decision making.
Dept of CSE, KLESCET – Shrikant Athanikar Page 2
Data Mining & Warehousing
2.3 Compare OLTP and OLAP system:
Property OLTP OLAP
Nature of users Operations workers Decision makers
Functions Mission-critical Management-critical
Nature of queries Mostly simple Mostly complex
Nature of usage Mostly repetitive Mostly ad hoc
Nature of design Application oriented Subject oriented
Number of users Thousands Dozens
Nature of data Current detailed, relational Historical, summarized, multidimensional
Updates All the time Usually not allowed
2.4 CHARACTERISTICS OF OLAP SYSTEMS
FASMI Characteristics
Fast:
The performance has to be like a search engine.
If the response takes more than say 20 seconds, the user is likely to move away to something
else assuming there is a problem with the query.
needs efficient data structure and hardware should be strong.
Analytic:
An OLAP system must provide rich analytic functionality
It is expected that most OLAP queries can be answered without any programming. The
system should be able to cope with any relevant queries for the application and the user.
Often the analysis will be done using the Vendors own tools although OLAP software
capabilities differ widely between products in the market.
Shared:
An OLAP system is a shared resource by a select group of managers and dozens of users.
Being a shared system, an OLAP system should provide adequate security for confidentiality
as well as integrity.
Dept of CSE, KLESCET – Shrikant Athanikar Page 3
Data Mining & Warehousing
Multidimensional:
Basic requirement, it must provide a multidimensional conceptual view of the data.
As a dimension often has hierarchies that show parent child relationships between the
members of a dimension, the multidimensional structure should allow such hierarchies.
Information:
OLAP systems usually obtain information from a data warehouse. The system should be able to
handle a large amount of input data.
Codd's OLAP Characteristics:
1. Multidimensional conceptual view:
This is the central characteristic of an OLAP system.
By requiring a multidimensional view, it is possible to carry out operations like
slice and dice.
2. Accessibility (OLAP as a mediator):
The OLAP software should be sitting between data sources (e.g. a data warehouse)
and an OLAP front-end.
3. Batch extraction vs interpretive:
An OLAP system should provide multidimensional data staging plus partial
precalculation of aggregates in large multidimensional databases.
4. Multi-user support:
Since the OLAP system is shared, the OLAP software should provide many
normal database operations including retrieval, update, concurrency control,
integrity and security.
5. Storing OLAP results:
OLAP results data should be kept separate from source data.
Read-write OLAP applications should not be implemented directly on live
transaction data if OLTP source systems are supplying information to the OLAP
system directly.
6. Extraction of missing values:
The OLAP system should distinguish missing values from zero values.
A large data cube may have a large number of zeros as well as some missing
values.
Dept of CSE, KLESCET – Shrikant Athanikar Page 4
Data Mining & Warehousing
If a distinction is not made between zero values and missing values, the aggregates
are likely to be computed incorrectly.
7. Treatment of missing values:
An OLAP system should ignore all missing values regardless of their source.
Correct aggregate values will be computed once the missing values are ignored.
8. Uniform reporting performance:
Increasing the number of dimensions or database size should not significantly
degrade the reporting performance of the OLAP system.
This is a good objective although it may be difficult to achieve in practice.
9. Generic dimensionality:
An OLAP system should treat each dimension as equivalent in both its structure
and operational capabilities.
Additional operational capabilities may be granted to selected dimensions but such
additional functions should be grantable to any dimension.
10. Unlimited dimensions and aggregation levels:
An OLAP system should allow unlimited dimensions and aggregation levels.
In practice, the number of dimensions is rarely more than 10 and the number of
hierarchies rarely more than six.
2.5 MOTIVATIONS FOR USING OLAP:
Two examples that show how information an OLAP tools can help in discovering are:
1. Understanding and improving sales:
• OLAP can assist in finding the most popular products and the most popular
channels.
• In some cases it may be possible to find the most profitable customers.
Example: Considering the telecommunications industry and only if a company wanted to
analyze the sales of product for every hour of the day (24 hours), differentiate between
weekdays and weekends (2 values) and divide regions to which calls are made into 50
regions. One is therefore looking at a three-dimensional cube with 24 x 2 x 50 = 2400 cells.
OLAP analysis of such information can assist in improving the enterprise business.
Dept of CSE, KLESCET – Shrikant Athanikar Page 5
Data Mining & Warehousing
2. Understanding and reducing costs of doing business:
• OLAP can assist in analyzing the costs associated with sales.
• In some cases, it may also be possible to identify expenditures that produce a high return
on investment (ROI).
Example: recruiting a top salesperson may involve significant costs, but the revenue
generated by the salesperson may justify the investment
2.6 MULTIDIMENSIONAL VIEW AND DATA CUBE
We illustrate the multidimensional view of data by using an example of a simple OLTP
database consisting of three tables.
Ex: student data that has only three dimensions.
• The first relation (student) provides information about students.
• The second relation (enrolment) provides information about the students' degree
enrolment and the semester of their first enrolment.
• The third relation (degree) provides information about the degrees in the university and the
departments that offer them.
Student (Student_id, Student_name, Country, DOB, Address)
Enrolment (Student_id, Degree_id, SSemester)
Degree (Degree_id, Degree_name, Degree_length, Fee, Department)
Table-1: The relation student
Student_id StudentName Country DOB Address
8656789 Peta Williams Australia 1/1/1980 Davis Hall
8700020 John Smith Canada 2/2/1981 9 Davis Hall
8900020 Arun Krishna USA 3/3/1983 90 Second Hall
8801234 Peter Chew UK 4/4/1983 88 Long Hall
8654321 Reena Rani Australia 5/5/1984 88 Long Hall
8712374 Kathy Garcia Malaysia 6/6/1980 88 Long Hall
8612345 Chris Watanabc Singapore 7/7/1981 11 Main Street
Dept of CSE, KLESCET – Shrikant Athanikar Page 6
Data Mining & Warehousing
Table-2 presents an example of the relation enrolment. In this table, the attribute SSemester is
the semester in which the student started the current degree.
We code it by using the year followed by 01 for the first semester and 02 for the second
Table-2: The relation enrolment
Student_id Degree_id SSemester
8900020 1256 2002-01
8700074 3271 2002-01
8700074 3321 2002-02
8900020 4444 2000-01
8801234 1256 2001-01
8801234 3321 1999-02
8801234 3333 1999-02
8977665 3333 2000-02
Table-3 is an example of the relation degree. In this table, the degree length is given in terms of
the number of semesters it normally takes to finish it. The fee is assumed to be in thousands of
dollars per year.
Table-3 The relation degree
Degree_id Degree_name Degreelength Fee Department
1256 BIT () 18 Computer Science
2345 BSc 6 20 Computer Science
4325 BSc 6 20 Chemistry
3271 BSc 6 20 Physics
3321 BCom 6 16 Business
4444 MBBS 12 30 Medicine
3333 LLB 8 22 Law
• It is clear that the information given in Tables 1, 2 and 3, although suitable for a student
enrolment OLTP system, is not suitable for efficient management decision making.
• The managers do not need information about the individual students, the degree they are enrolled
in, and the semester they joined the university.
What the managers need is the trends in student numbers in different degree programs and
from different countries we may visualize the data as two-dimensional, i.e.
Country x Degree
Dept of CSE, KLESCET – Shrikant Athanikar Page 7
Data Mining & Warehousing
Table-4 A two-dimensional table of aggregates for semester
Country BSc LLB MBBS BCom BIT ALL
Australia 5 20 15 50 11 101
India 10 0 15 25 17 67
Malaysia 5 1 in 12 23 51
Singapore 2 2 10 10 31 55
Sweden 5 0 5 25 7 42
UK 5 15 20 20 13 73
USA 0 2 20 15 19 56
ALL 32 40 95 157 121 445
Using this two-dimensional view we are able to find the number of students joining any degree from
any country (only for semester 2000-01).
Other queries that we are quickly able to answer are:
• How many students started BIT in 2000-01?
• How many students joined from Singapore in 2000-01?
• How many students studying BSc degree
A similar table would be available for other semesters. Let us assume that the data for 2001-01 is
given in Table-5
Table-5 Two-dimensional Table of aggregates for semester 2001-01
Country BSc LLB MBBS B Com BIT ALL
Australia 7 10 16 53 10 96
India 9 0 17 22 13 61
Malaysia 5 1 1') 19 20 64
Singapore 2 2 10 12 23 49
Sweden S 0 5 16 7 36
UK 4 13 20 26 11 74
USA 4 2 10 10 12 38
ALL 39 28 97 158 96 418
Dept of CSE, KLESCET – Shrikant Athanikar Page 8
Data Mining & Warehousing
A Two dimensional table which is aggregate of previous two tables are
Table-6 Two-dimensional table of aggregates for both semesters
Degree
BSc LLB MBBS BCom BIT ALL
Country
Australia 12 30 31 103 21 197
India 19 0 32 47 30 128
Malaysia 10 2 29 31 43 115
Singapore 4 4 20 22 54 104
Sweden 13 0 10 4] 11 78
UK 9 28 40 46 24 147
USA 4 4 30 23 31 94
ALL 71 68 192 315 217 863
Tables 4, 5 and 6 together now form a three dimensional cube as shown in Figure
Note that a cube does not need to have an equal number of members in each dimension.
Putting the three tables together gives a cube of 8 x 6 x 3 (= 144) cells including the totals along
every dimension.
A cube such as we have described could be represented by:
Country x Degree x Semester
Possible queries done on this 3 dimensional table:
Dept of CSE, KLESCET – Shrikant Athanikar Page 9
Data Mining & Warehousing
1. NULL (e.g. how many students are there? Only 1 possible query)
2. degrees (e.g. how many students are doing BSc? 5 possible queries if we assume 5
different degrees)
3. semester (e.g. how many students entered in semester 2000-01? 2 possible queries if we
only have data about 2 semesters)
4. country (e.g. how many students are from the USA? 7 possible queries if there are 7
countries)
5. degrees, semester (e.g. how many students entered in 2000-01 to enroll in BCom? With
5 degrees and 2 different semesters we have 10 different queries of this type)
6. semester, country (e.g. how many students from the UK entered in 2000-01? With 2
degrees and 7 different countries we have 14 different queries of this type)
7. degrees, country (e.g. how many students from Singapore are enrolled in BCom? With 5
degrees and 7 different countries we have 35 different queries of this type)
8. All (e.g. how many students from Malaysia entered in 2000-01 to enroll in BCom? With
5 degrees, 2 different semesters and 7 different countries we have 70 different queries of
this type)
• Therefore eight different types of aggregations are possible for three dimensions. In
general, 2n types of aggregations are possible for n dimensions.
• Assuming that we have 20 different semesters, 50 different degrees and 10 different
countries, there are 20 x 50 x 10 or 10,000 possible different combinations of the three
attributes without including the totals.
• Aggregates for all these combinations as well as those of two attributes and single
attributes will need to be pre-computed if we want to have every possible aggregate.
The numbers given in the above list add to 144 different aggregations
Dept of CSE, KLESCET – Shrikant Athanikar Page 10
Data Mining & Warehousing
2.7 DATA CUBE IMPLEMENTATIONS
A large enterprise needs Million of aggregates when there are more than three dimensions.
• Should all these aggregates be pre-computed? Many of the aggregates will be zero.
• How should they be stored?
• How should sparsity be managed?
Some possible solutions are:
Pre-compute and store all: compute all queries and stored.
For large data query response time will be more as aggregation for large data may take more time
and the results may not be accurate. Indexing large amounts of data is also expensive.
Pre-compute (and store) none: This means that the aggregates are computed on-the-fly using the
raw data whenever a query is posed.
This approach does not require additional space for storing the cube but the query response time is
likely to be very poor for large data cubes.
Pre-compute and store some: This means that we pre-compute and store the most frequently
queried aggregates and compute others as the need arises
It can be shown that large numbers of cells do have an "ALL" value and may therefore be derived
from other aggregates. Let us reproduce the list of queries we had in the last section and define
them as (a, b, c) where a stands for a value of the degree dimension, b for country and c for
starting semester:
1. (ALL, ALL, ALL) null (e.g. how many students are there? Only 1 query)
2. (a, ALL, ALL) degrees (e.g. how many students are doing BSc, How many students are
doing LLB? 5 degrees so 5 queries)
3. (ALL, ALL, c) semester (e.g. how many students entered in semester 2000-01? 2
queries)
4. (ALL, b, ALL) country (e.g. how many students are from the USA? 7 queries)
5. (a, ALL, c) degrees, semester (e.g. how many students entered in 2000-01 to enroll in
BCom? 10 queries)
6. (ALL, b, c) semester, country (e.g. how many students from the UK entered in 2000-01?
14 queries)
Dept of CSE, KLESCET – Shrikant Athanikar Page 11
Data Mining & Warehousing
7. (a, b, ALL) degrees, country (e.g. how many students from Singapore are enrolled in
BCom? 35 queries)
8. (a, b, c) all (e.g. how many students from Malaysia entered in 2000-01 to enroll in
BCom? 70 queries)
It is therefore possible to derive the other 74 of the 144 queries from the last 70 queries of
type (a, b, c). Of course in a very large data cube, it may not be practical to even pre-compute
all the queries.
• Another related issue is where the data used by OLAP will reside. We assume that
the data is stored in a data warehouse or in one or more data marts.
• It is not appropriate to use data directly from OLTP systems because OLAP applications
are often large and use complex aggregate analysis.
• Also, OLAP often uses data from more than one OLTP system and the process of
merging these multiple systems can be complex. Data may also need to be transformed if
a company is operating in many locations
Dept of CSE, KLESCET – Shrikant Athanikar Page 12
Data Mining & Warehousing
Data cube products use different techniques for pre-computing aggregates and
storing them: (what are the different techniques used for pre-computing aggregates and storing them?)
1) First model, supported by vendors of traditional relational model databases, is called the ROLAP
model or the Relational OLAP model.
ROLAP model provides a relational view of the multidimensional data in the form of a fact
table.
2) Second model is called the MOLAP model for multidimensional OLAP.
The MOLAP model provides a direct multidimensional view of the data
ROLAP: Relational Online Analytical Processing:
1. ROLAP uses a relational DBMS to implement an OLAP environment
2. It may be considered a bottom-up approach
3. Data is not stored in normalized way as it may not give high performance, so the data will be
stored in terms of multidimensional tables which contain dimensions and facts.
4. An OLAP tool is then used to manipulate the data in D/W, the Tool is used to group the fact
tables and aggregate them.
Advantage:
The advantage of using ROLAP is that it is more easily used with existing relational DBMS and
the data can be stored efficiently using tables since no zero facts need to be stored.
Disadvantage:
• The major disadvantage of the ROLAP model is its poor query performance.
• Proponents of the MOLAP model have called the ROLAP model SLOWLAP.
Some products in this category are Oracle OLAP mode, OLAP Discoverer, Micro-Strategy and
Microsoft Analysis Services.
Dept of CSE, KLESCET – Shrikant Athanikar Page 13
Data Mining & Warehousing
MOLAP: Multidimensional Online Analytical Processing:
• MOLAP is based on using a multidimensional DBMS rather than a data warehouse to store
and access data
• It may be considered as a top-down approach to OLAP
• The multidimensional database systems do not have a standard approach to storing and
maintaining their data.
• They often use special-purpose file systems or indexes that store pre-computation of all
aggregations in the cube
• For example, in ROLAP a cell was represented as (BSc, India, 2001-01) with a value 30
stored in the last column.
• In MOLAP, the same information is stored as just 30 and the storage location implicitly gives
the values of the dimensions.
• The dimension values do not need to be stored since all the values of the cube could be stored
in an array in a predefined way.
Example: how data is represented in MOLAP:
12 30 31 103 21 197 19 0 32 47 30 128 10 …
Difference between ROLAP and MOLAP:
Property MOIAP ROLAP
Data structure Multidimensional database using sparse arrays Relational tables (each cell is a row)
Disk space Separate database for data cube; large for large May not require any space other than
data cubes that available in the data warehouse
Retrieval Fast (pre-computed) Slow (computes on-the-fly)
Scalability Limited (cubes can be very large) Excellent
Best suited for Inexperienced users, limited set of queries Experienced users, queries change
frequently
DBMS facilities Usually weak Usually very strong
Dept of CSE, KLESCET – Shrikant Athanikar Page 14
Data Mining & Warehousing
DATA CUBE OPERATIONS
2.8
A number of operations may be applied to data cubes. The common ones are:
• Roll-up
• Drill-down
• Slice and dice
• Pivot
Roll-up
Roll-up is like zooming out on the data cube. It is required when the user needs further
abstraction or less detail
Example of roll-up based on Tables 4, 5 and 6.
We first define hierarchies on two dimensions. Amongst countries, let us define:
1. Asia (India, Malaysia, Singapore)
2. Europe (Sweden, UK)
3. Rest (Australia, USA)
Roll-up Example
The result of a roll-up for both semesters together from Table 6 then is given in Table-7
Table-7 result of Rollup on table-6
Degree Science Medicine Business and Law
country
Asia 160 81 106
Europe 60 50 115
Rest 68 61 162
Drill-down
Drill-down is like zooming in on the data and is therefore the reverse of roll-up. It is an
appropriate operation when the user needs further detail.
Drill-down adds more details to the data.
Hierarchy defined on a dimension may be involved in drill-down. For example, a higher
level view of student data, for example in Table-7, gives student numbers for the two
semesters for groups of countries and groups of degree
Dept of CSE, KLESCET – Shrikant Athanikar Page 15
Data Mining & Warehousing
Slice and Dice
Slice and dice are operations for browsing the data in the cube.
The terms refer to the ability to look at information from different viewpoints
A slice is a subset of the cube corresponding to a single value for one or more members of
the dimensions
Let the degree dimension be fixed as degree = BIT. The slice will not include any
information about other degrees
Pivoted or Rotate:
The pivot operation is used when the user wishes to re-orient the view of the data cube.
It may involve swapping the rows and columns, or moving one of the row dimensions into
the column dimension.
2.9
GUIDELINES FOR OLAP IMPLEMENTATION:
1. Vision: The OLAP team must, in consultation with the users, develop a clear vision for the
OLAP system.
2. Senior management support: The OLAP project should be fully supported by the senior
managers
3. Selecting an OLAP tool: The OLAP team should familiarize themselves with the ROLAP
and MOLAP tools available in the market. Since tools are quite different, careful planning
may be required in selecting a tool that is appropriate for the enterprise. In some situations, a
combination of ROLAP and MOLAP may be most effective.
Dept of CSE, KLESCET – Shrikant Athanikar Page 16
Data Mining & Warehousing
4. Corporate strategy: The OLAP strategy should fit with the enterprise strategy and
business objectives. A good fit will result in the OLAP tools being used more widely.
5. Focus on the users: The OLAP project should be focused on the users.
Users should, in consultation with the technical professionals, decide
what tasks will be done first and what will be done later. Attempts
should be made to provide each user with a tool suitable for that person's
skill level and information needs.
A good GUI user interface should be provided to non-technical users.
The project can only be successful with the full support of the users.
6. Joint management: The OLAP project must be managed by both the IT and business
professionals. Many other people should be involved in supplying ideas. An appropriate
committee structure may be necessary to channel these ideas.
7. Review and adapt: As noted in the last chapter, organizations evolve and so must the OLAP
system. Regular reviews of the project may be required to ensure that the project is meeting
the current needs of the enterprise.
Dept of CSE, KLESCET – Shrikant Athanikar Page 17
Data Mining & Warehousing
Possible Questions from This chapter:
1. Write a short note on Dimension, facts, OLAP? Ans: Page-1 & 2
2. Write the FASMI characteristics of an OLAP? Ans: Page-3 & 4
3. Compare OLTP and OLAP? Ans: Page-3
4. Mention the motivation for using OLAP? Ans: Page-5
5. Explain in detail multidimensional view and data cubes? Give an example showing all
possible queries that can be performed on these data cubes? Ans: Page-6 to Page-10
6. Mention the possible solutions for aggregating multidimensional data cubes? Ans: Page-11
(precompute and store, precompute some and store some …)
7. Explain the different techniques used in data cubes for pre computing aggregates and storing
them? Ans: Page-13 to Page-14 (ROLAP and MOLAP)
8. Write a note on ROLAP and MOLAP? Ans: Page-13 to Page-14
9. Compare ROLAP and MOLAP? Ans: Page-14
10. Explain in details operations performed on Data Cubes? Ans: Page-15 to 16
11. What are the guidelines for OLAP implementations? Ans: Page-17
Dept of CSE, KLESCET – Shrikant Athanikar Page 18