MODULE I
1.1 Introduction.
NOTES
The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.
Data mining has numerous applications across various industries, including marketing,
healthcare, finance, and manufacturing. In marketing, data mining is used to identify
customer preferences and behaviors, and to develop targeted marketing campaigns. In
healthcare, data mining is used to analyze patient data and develop personalized treatment
plans. In finance, data mining is used to detect fraudulent transactions and assess credit
risk.
Some of the commonly used techniques in data mining include clustering, classification,
regression, and association rule mining. Data mining can be performed using a variety of
tools and software packages, including Python, R, SAS, and Tableau. Overall, data mining
plays a critical role in today's data-driven world, allowing organizations to gain insights
from large datasets and make informed decisions based on data-driven evidence.
Data storage became easier as large amounts of computing power became available at low
cost; that is, the falling cost of processing power and storage made data cheap. New
machine learning methods for knowledge representation, based on logic programming and
similar approaches, were also introduced alongside traditional statistical analysis of
data. The new methods tend to be computationally intensive, hence the demand for more
processing power.
The data storage bits/bytes are calculated as follows:
1 byte = 8 bits
1 kilobyte (K/KB) = 2 ^ 10 bytes = 1,024 bytes
1 megabyte (M/MB) = 2 ^ 20 bytes = 1,048,576 bytes
1 gigabyte (G/GB) = 2 ^ 30 bytes = 1,073,741,824 bytes
1 SRMIST DDE Self Learning Material
1 terabyte (T/TB) = 2 ^ 40 bytes = 1,099,511,627,776 bytes
1 petabyte (P/PB) = 2 ^ 50 bytes = 1,125,899,906,842,624 bytes
1 exabyte (E/EB) = 2 ^ 60 bytes = 1,152,921,504,606,846,976 bytes
1 zettabyte (Z/ZB) = 2 ^ 70 bytes = 1,180,591,620,717,411,303,424 bytes
1 yottabyte (Y/YB) = 2 ^ 80 bytes = 1,208,925,819,614,629,174,706,176 bytes
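The binary unit sizes can be reproduced with a few lines of Python. This is a minimal sketch that assumes the binary convention used in the first rows of the list (each unit is 1,024 times the previous one); under the decimal SI convention a zettabyte is instead 10^21 bytes. The names `UNITS` and `bytes_in` are illustrative only.

```python
# Binary storage units: each step up multiplies the size by 2**10 = 1,024.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one binary `unit`, i.e. 2 ** (10 * k)."""
    k = UNITS.index(unit) + 1  # kilobyte -> 1, megabyte -> 2, ...
    return 2 ** (10 * k)

for position, name in enumerate(UNITS, start=1):
    print(f"1 {name} = 2^{10 * position} bytes = {bytes_in(name):,} bytes")
```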
It was recognized that information is at the heart of business operations and that
decision-makers could make use of the data stored to gain valuable insight into the
business. Database
Management Systems gave access to the data stored but this was only a small part of what
could be gained from the data. Traditional on-line transaction processing systems, OLTPs,
are good at putting data into databases quickly, safely and efficiently but are not good at
delivering meaningful analysis in return. Analyzing data can provide further knowledge
about a business by going beyond the data explicitly stored to derive knowledge about the
business. Data Mining, also called data archaeology, data dredging, or data harvesting, is
the process of extracting hidden knowledge from large volumes of raw data and using it to
make crucial business decisions. This is where Data Mining or Knowledge Discovery in
Databases (KDD) has obvious benefits for any enterprise.
1.2 Data Mining Definitions.
▪ Data
Data are the raw facts, figures, numbers, or text that can be processed by a
computer. Today, organizations are gathering massive and growing amounts of
data in different formats and different databases. This includes operational or
transactional data (the day-to-day operation data, such as inventory data and on-line
shopping data), non-operational data, and metadata, i.e., data about data.
▪ Information
The arrangements, relations, or associations among all types of data can deliver
information. For example, in retail, analysis of sales transactions can show which
products are selling and when.
▪ Knowledge
Information can be converted into knowledge. Supermarket sales information can
be analyzed in light of marketing efforts to deliver knowledge of consumer
purchasing habits. Data collected in large data repositories can become “data tombs”.
Data tombs are converted into “golden nuggets” of knowledge with the use of data
mining tools. Golden nuggets are small but valuable facts. Extraction of
interesting information or patterns from data in large databases is known as data
mining. According to William J. Frawley, Gregory Piatetsky-Shapiro and
Christopher J. Matheus ‘Data Mining, or Knowledge Discovery in Databases
(KDD) as it is also known, is the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data. This encompasses a
number of different technical approaches, such as clustering, data summarization,
learning classification rules, finding dependency networks, analyzing changes, and
detecting anomalies’.
According to Marcel Holsheimer and Arno Siebes, “Data mining is the search for
relationships and global patterns that exist in large databases but are ‘hidden’ among the
vast amount of data, such as a relationship between patient data and their medical
diagnosis. These relationships represent valuable knowledge about the database and the
objects in the database and, if the database is a faithful mirror, of the real world
registered by the database”.
Data mining refers to “using a variety of techniques to identify nuggets of information or
decision-making knowledge in bodies of data, and extracting these in such a way that they
can be put to use in the areas such as decision support, prediction, forecasting and
estimation. The data is often voluminous, but as it stands of low value as no direct use can
be made of it; it is the hidden information in the data that is useful”.
Data mining refers to extracting or "mining" knowledge from large amounts of data. There
are many other terms related to data mining, such as knowledge mining from data,
knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many
people treat data mining as a synonym for another popularly used term, "Knowledge
Discovery in Databases", or KDD.
Data mining is the process of discovering patterns, trends, and insights from large
datasets. It is an interdisciplinary field that draws upon techniques from statistics,
machine learning, database management, and other areas to extract useful information
from data. It is an essential step in the process of knowledge discovery in databases.
Knowledge discovery as a process is depicted in the following figure and consists of an
iterative sequence of the following steps:
▪ Data cleaning: to remove noise or irrelevant data
▪ Data integration: where multiple data sources may be combined
▪ Data selection: where data relevant to the analysis task are retrieved from the
database
▪ Data transformation: where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations
▪ Data mining: an essential process where intelligent methods are applied in order to
extract data patterns
▪ Pattern evaluation: to identify the truly interesting patterns representing knowledge
based on some interestingness measures
▪ Knowledge presentation: where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
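The sequence of steps above can be sketched as a small pipeline. This is an illustrative toy, not a reference implementation: the two data sources, the item names, the choice of pair counting as the "mining" step, and the 50% support threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources; None marks a noisy (unusable) record.
store_sales = [{"basket": ["bread", "milk"]}, {"basket": None},
               {"basket": ["bread", "butter", "milk"]}]
web_sales = [{"basket": ["milk", "bread"]}, {"basket": ["butter"]}]

def clean(records):
    """Data cleaning: drop records whose basket is missing."""
    return [r for r in records if r["basket"]]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the field relevant to the analysis task."""
    return [r["basket"] for r in records]

def transform(baskets):
    """Data transformation: normalise each basket to a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Data mining: count how often each pair of items occurs in the same basket."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}

def present(patterns):
    """Knowledge presentation: print the surviving patterns in readable form."""
    for (a, b), support in sorted(patterns.items()):
        print(f"{a} and {b} appear together in {support:.0%} of baskets")

baskets = transform(select(integrate(clean(store_sales), clean(web_sales))))
patterns = evaluate(mine(baskets), len(baskets))
present(patterns)
```

In practice the process is iterative: if pattern evaluation yields nothing interesting, one would typically return to selection or transformation and try again.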
1.3 Data Mining Tools
Data mining tools have the objective of discovering patterns, trends, and groupings
among large sets of data and transforming data into more refined information. They
provide insights into corporate data that are not easily discerned with managed queries or
OLAP tools. The tools use a variety of statistical and AI algorithms to analyze the
correlation of variables in data, investigating interesting patterns and their
relationships. Data mining tools require integrated, consistent, and cleaned data, and
these preprocessing steps are very costly. Some of the tools used for data mining are
shown in Figure 1.1.
Figure 1.1 : Data mining tools.
1.3.1 RapidMiner
The idea behind the RapidMiner tool is to create one place for everything. RapidMiner
is an integrated enterprise artificial intelligence framework that offers AI solutions to
positively impact businesses. It is used as a data science software platform for data
extraction, data mining, deep learning, machine learning, and predictive analytics. It is
widely used in many businesses and commercial applications as well as in various other
fields such as research, training, education, rapid prototyping, and application
development. All major machine learning processes, such as data preparation, model
validation, result visualization, and optimization, can be carried out by using
RapidMiner.
RapidMiner Products
There are many products of RapidMiner that are used to perform multiple operations.
Some of the products are:
▪ RapidMiner Studio: With RapidMiner Studio, one can access, load, and analyze
both traditional structured data and unstructured data like text, images, and media.
It can also extract information from these types of data and transform unstructured
data into structured data.
▪ RapidMiner Auto Model: Auto Model is an extension of RapidMiner
Studio that accelerates the process of building and validating data models. You can
customize the processes and put them into production based on your needs.
Three main kinds of problems can be addressed with Auto Model, namely
prediction, clustering, and outlier detection.
▪ RapidMiner Turbo Prep: Data preparation is time-consuming, and RapidMiner
Turbo Prep is designed to make the preparation of data much easier. It provides a
user interface where your data is always visible front and center, where you can
make changes step-by-step and instantly see the results, with a wide range of
supporting functions to prepare the data for model-building or presentation.
1.3.1.1 TOOL CHARACTERISTICS
Usability: Easy to use
Tool orientation: The tool is designed for general-purpose analysis
Data mining type: This tool is made for structured data mining, text mining, image
mining, audio mining, video mining, data gathering, and social network analysis.
Manipulation type: This tool is designed for data extraction, data transformation, data
analysis, data visualization, data conversion, and data cleaning.
Features of RapidMiner
▪ Application & Interface: RapidMiner Studio is a visual data science workflow
designer accelerating the prototyping & validation of models.
▪ Data Access: With RapidMiner Studio, you can access, load, and analyze any type
of data – both traditional structured data and unstructured data like text, images,
and media. It can also extract information from these types of data and transform
unstructured data into structured.
▪ Data Exploration: Immediately understand the data and create a plan to prepare it;
statistics and key information are extracted automatically.
▪ Data Prep: The richness of the data preparation capabilities in RapidMiner Studio
can handle any real-life data transformation challenges, so you can format and
create the optimal data set for predictive analytics. RapidMiner Studio can blend
structured with unstructured data and then leverage all the data for predictive
analysis. Any data preparation process can be saved for reuse.
▪ Modeling: RapidMiner Studio comes equipped with an unparalleled set of
modeling capabilities and machine learning algorithms for supervised and
unsupervised learning. They are flexible and robust, and allow you to focus on building
the best possible models for any use case.
1.3.2 Weka
Weka is data mining software that provides a collection of machine learning algorithms.
The algorithms can either be applied directly to a dataset or called from your own Java
code. Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well-suited for developing new machine
learning schemes.
Weka is open-source software that provides tools for data preprocessing,
implementations of several machine learning algorithms, and visualization tools so that
you can develop machine learning techniques and apply them to real-world data mining
problems. What Weka offers is summarized in the diagram in Figure 1.2.
Figure 1.2: Working of Weka
1.4 Applications of Data Mining.
Here is the list of areas where data mining is widely used:
▪ Financial Data Analysis
▪ Retail Industry
▪ Telecommunication Industry
▪ Biological Data Analysis
▪ Other Scientific Applications
▪ Intrusion Detection
▪ Social Media App Mining
1.4.1 Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Some of the typical
cases are as follows:
▪ Design and construction of data warehouses for multidimensional data analysis and
data mining.
▪ Loan payment prediction and customer credit policy analysis.
▪ Classification and clustering of customers for targeted marketing.
▪ Detection of money laundering and other financial crimes.
1.4.2 Retail Industry
Data mining has great application in the retail industry because the industry collects
large amounts of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends
that lead to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry:
▪ Design and Construction of data warehouses based on the benefits of data mining.
▪ Multidimensional analysis of sales, customers, products, time and region.
▪ Analysis of effectiveness of sales campaigns.
▪ Customer Retention.
▪ Product recommendation and cross-referencing of items.
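As a rough illustration of product recommendation and cross-referencing of items, the sketch below scores "customers who bought X also bought Y" by rule confidence. The transactions, the item names, and the 0.6 confidence threshold are invented for the example; real retail systems would work on far larger purchase histories.

```python
# Hypothetical purchase histories; item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def confidence(antecedent: str, consequent: str) -> float:
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)

def recommend(item: str, min_confidence: float = 0.6) -> list:
    """Items to cross-sell with `item`, highest confidence first (ties by name)."""
    others = set().union(*transactions) - {item}
    scored = sorted((-confidence(item, other), other) for other in others)
    return [other for neg_conf, other in scored if -neg_conf >= min_confidence]

print(recommend("bread"))
print(recommend("milk"))
```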
1.4.3 Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail,
web data transmission, etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding. This is the reason why
data mining has become very important for understanding the business. Data mining in the
telecommunication industry helps in identifying telecommunication patterns, catching
fraudulent activities, making better use of resources, and improving quality of service.
Here is the list of examples for which data mining improves telecommunication services:
▪ Multidimensional Analysis of Telecommunication data.
▪ Fraudulent pattern analysis.
▪ Identification of unusual patterns.
▪ Multidimensional association and sequential patterns analysis.
▪ Mobile Telecommunication services.
▪ Use of visualization tools in telecommunication data analysis.
1.4.4 Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological data
mining is a very important part of bioinformatics. Following are the aspects in which data
mining contributes to biological data analysis:
▪ Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
▪ Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide sequences.
▪ Discovery of structural patterns and analysis of genetic networks and protein
pathways.
▪ Association and path analysis.
▪ Visualization tools in genetic data analysis.
1.4.5 Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data
sets for which the statistical techniques are appropriate. Huge amounts of data have been
collected from scientific domains such as geosciences, astronomy, etc. Large
data sets are being generated because of fast numerical simulations in various fields such
as climate and ecosystem modelling, chemical engineering, fluid dynamics, etc. Following
are the applications of data mining in the field of scientific applications:
▪ Data Warehouses and data preprocessing.
▪ Graph-based mining.
▪ Visualization and domain specific knowledge.
1.4.6 Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. Increased usage of the internet and the availability of tools and tricks for
intruding and attacking networks have prompted intrusion detection to become a critical
component of network administration.
Here is the list of areas in which data mining technology may be applied for intrusion
detection:
▪ Development of data mining algorithms for intrusion detection.
▪ Association and correlation analysis, aggregation to help select and build
discriminating attributes.
▪ Analysis of Stream data.
▪ Distributed data mining.
▪ Visualization and query tools.
1.4.7 Social Media App Mining
One of the most lucrative applications of data mining has been undertaken by social media
companies. Platforms like Facebook, TikTok, Instagram, and Twitter gather reams of data
about their users, based on their online activities.
That data can be used to make inferences about their preferences. Advertisers can target
their messages to the people who appear to be most likely to respond positively.
Data mining on social media has become a big point of contention, with several
investigative reports and exposés showing just how intrusive mining users' data can be. At
the heart of the issue, users may agree to the terms and conditions of the sites not realizing
how their personal information is being collected or to whom their information is being
sold.
1.5 Data Warehousing and its characteristics.
Data warehousing is a collection of decision support technologies, aimed at
enabling the knowledge worker (executive, manager, analyst) to make better and faster
decisions. Data mining potential can be enhanced if the appropriate data has been collected
and stored in a data warehouse. A data warehouse is a relational database designed
specifically for query and analysis rather than for transaction processing. It can be
loosely defined as any centralized data repository which can be queried for business
benefit, but this will be more clearly defined later.
Data warehousing is a powerful new technique that makes it possible to extract archived
operational data and overcome inconsistencies between different legacy data formats. As
well as integrating data throughout an enterprise, regardless of location, format, or
communication requirements, it is possible to incorporate additional or expert information.
In addition to a relational database, a data warehouse environment includes an extraction,
transportation, transformation, and loading (ETL) solution, an online analytical processing
(OLAP) engine, client analysis tools, and other applications that manage the process of
gathering data and delivering it to business users.
ETL tools are meant to extract, transform, and load data into the data warehouse
for decision making. Before the evolution of ETL tools, the ETL process
was done manually using SQL code written by programmers. This task was tedious in
many cases since it involved many resources, complex coding, and long work hours. On
top of that, maintaining the code posed a great challenge for the programmers.
These difficulties are reduced by ETL tools, which are very powerful and
offer many advantages over the old method in all stages of the ETL process: extraction,
data cleansing, data profiling, transformation, debugging, and loading into the data
warehouse.
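To make the three ETL stages concrete, here is a minimal hand-written sketch using only the Python standard library. It is not how any particular ETL product works; the two "legacy systems", their column names, and the cents-based target schema are assumptions chosen to show a naming conflict and a unit inconsistency being resolved.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two legacy systems with different column names and units.
SYSTEM_A = "cust_id,amount_usd\n1,10.50\n2,20.00\n"
SYSTEM_B = "customer,amount_cents\n3,1500\n1,250\n"

def extract(text):
    """Extract: parse one CSV export into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows_a, rows_b):
    """Transform: unify column names and convert every amount to integer cents."""
    unified = [(int(r["cust_id"]), round(float(r["amount_usd"]) * 100)) for r in rows_a]
    unified += [(int(r["customer"]), int(r["amount_cents"])) for r in rows_b]
    return unified

def load(rows):
    """Load: write the unified rows into an in-memory warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount_cents INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn

warehouse = load(transform(extract(SYSTEM_A), extract(SYSTEM_B)))
totals = warehouse.execute(
    "SELECT customer_id, SUM(amount_cents) FROM sales "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(totals)
```

Even in this toy form, the transform step carries most of the logic, which is one reason hand-maintained ETL code becomes costly as the number of source systems grows.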
There are a number of ETL tools available in the market to process the data
according to business/technical requirements. Some of them are shown in
Table 1.1.
Table 1.1 : ETL Tools
A common way of introducing data warehousing is to refer to the characteristics of a data
warehouse as set forth by William Inmon, author of Building the Data Warehouse, who is
widely considered to be the originator of the data warehousing concept. They are as
follows:
▪ Subject Oriented
▪ Integrated
▪ Nonvolatile
▪ Time Variant
Data warehouses are designed to help you analyze data. For example, to learn more about
your company’s sales data, you can build a warehouse that concentrates on sales. Using
this warehouse, you can answer questions like “Who was our best customer for this item
last year?” This ability to define a data warehouse by subject matter, sales in this case,
makes the data warehouse subject oriented. Integration is closely related to subject
orientation. Data warehouses must put data from disparate sources into a consistent format.
They must resolve such problems as naming conflicts and inconsistencies among units of
measure. When they achieve this, they are said to be integrated.
Answer the Questions
1. Define data mining.
2. Differentiate data and information.
3. What is knowledge?
4. What is KDD?
5. List the steps of KDD.
6. List out the various definitions of data mining.
7. Explain any one data mining tool in detail.
8. Write notes on the data warehouse and its characteristics.
9. Elucidate the applications of data mining.
MODULE II
2.1 What is learning?
An individual learns how to carry out a certain task by making a transition from a situation
in which the task cannot be carried out to a situation in which the same task can be carried
out under the same circumstances.
2.2 Inductive learning
Induction is the inference of information from data, and inductive learning is the model
building process in which the environment, i.e., the database, is analyzed with a view to
finding patterns. Similar objects are grouped in classes and rules are formulated whereby it is possible
to predict the class of unseen objects. This process of classification identifies classes such
that each class has a unique pattern of values which forms the class description. The nature
of the environment is dynamic hence the model must be adaptive i.e., should be able to
learn.
Generally it is only possible to use a small number of properties to characterize objects so
we make abstractions in that objects which satisfy the same subset of properties are
mapped to the same internal representation. Inductive learning where the system infers
knowledge itself from observing its environment has two main strategies:
▪ Supervised learning: This is learning from examples, where a teacher helps the
system construct a model by defining classes and supplying examples of each class.
The system has to find a description of each class, i.e., the common properties in
the examples. Once the description has been formulated, the description and the
class form a classification rule which can be used to predict the class of previously
unseen objects. This is similar to discriminant analysis in statistics.
▪ Unsupervised learning: This is learning from observation and discovery. The data
mining system is supplied with objects but no classes are defined, so it has to observe
the examples and recognize patterns (i.e., class descriptions) by itself. This process
results in a set of class descriptions, one for each class discovered in the
environment. Again, this is similar to cluster analysis in statistics.
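The two strategies can be contrasted on a tiny one-dimensional data set. The sketch below is only an illustration under simplifying assumptions: the "class description" is taken to be the class mean, the classification rule is "nearest mean", and the unsupervised part is a basic 2-means clustering. The data values and function names are made up for the example.

```python
from statistics import mean

# Supervised: a "teacher" supplies the class of each example.
labelled = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

def learn_descriptions(examples):
    """Class description (assumed here): the mean value of each class's examples."""
    classes = {label for _, label in examples}
    return {c: mean(x for x, label in examples if label == c) for c in classes}

def classify(x, descriptions):
    """Classification rule: assign the class whose description is closest to x."""
    return min(descriptions, key=lambda c: abs(x - descriptions[c]))

descriptions = learn_descriptions(labelled)
print(classify(7.0, descriptions))  # predicts the class of a previously unseen object

# Unsupervised: no classes are given; the system has to discover them itself.
def two_means(values, iterations=10):
    """Basic 2-means clustering on numbers, started from the smallest and largest value."""
    centres = [min(values), max(values)]
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return groups

unlabelled = [x for x, _ in labelled]  # same objects, class labels withheld
print(two_means(unlabelled))
```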
Induction is therefore the extraction of patterns. The quality of the model produced by
inductive learning methods is such that the model can be used to predict the outcome of
future situations; in other words, not only for states already encountered but also for
unseen states that could occur. The problem is that most environments have many different
states, i.e., changes within, and it is not always possible to verify a model by checking
it for all possible situations.
2.3 Anatomy of Data Mining
The use of computer technology in decision support is now widespread and pervasive
across a wide range of business and industry. This has resulted in the capture and
availability of data in immense volume and proportion. There are many examples that can
be cited. Point of sale data in retail, policy and claim data in insurance, medical history
data in health care, and financial data in banking and securities are some instances of the
types of data that are being collected. The data are typically a collection of records,
where each individual record may correspond to a transaction or a customer, and the fields
in the record correspond to attributes. Very often, these fields are of mixed type, with
some being numerical (continuous valued, e.g., age) and some symbolic (discrete valued,
e.g., disease). The multi-disciplinary approach in data mining is shown in Figure 2.1.
Figure 2.1 : Multi-disciplinary approach in data mining
2.3.1 Statistics
Statistics has a solid theoretical foundation, but the results from statistics can be
overwhelming and difficult to interpret, as they require user guidance as to where
and how to analyze the data. Data mining, however, allows the expert’s knowledge
of the data and the advanced analysis techniques of the computer to work together.
Analysts have used statistical analysis systems such as SAS and SPSS to detect unusual
patterns and explain patterns using statistical models such as linear models.
Statistics has a role to play, and data mining will not replace such analyses; rather,
statistical methods can be applied in more directed analyses based on the results of data
mining. An example of statistical induction is estimating the average rate of failure of
machines.
2.3.2 Machine Learning
Machine learning is the automation of a learning process and learning is tantamount
to the construction of rules based on observations of environmental states and
transitions. This is a broad field which includes not only learning from examples,
but also reinforcement learning, learning with a teacher, etc. A learning algorithm
takes the data set and its accompanying information as input and returns a statement
e.g., a concept representing the results of learning as output. Machine learning
examines previous examples and their outcomes and learns how to reproduce these
and make generalizations about new cases. Generally a machine learning system
does not use single observations of its environment but an entire finite set called
the training set at once. This set contains examples i.e., observations coded in some
machine readable form. The training set is finite hence not all concepts can be
learned exactly.
Note: "data spaces" here refers to the number of examples.
2.3.3 Database Systems
A database is a collection of information related to a particular subject or purpose, such as
tracking customer orders or maintaining a music collection.
2.3.4 Algorithms
▪ Genetic Algorithms
Optimization techniques that use processes such as genetic combination,
mutation, and natural selection in a design based on the concepts of natural
evolution.
▪ Statistical Algorithms
Statistics is the science of collecting, organizing, and applying numerical facts.
Statistical analysis systems such as SAS and SPSS have been used by analysts
to detect unusual patterns and explain patterns using statistical models such as
linear models.
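The three processes named for genetic algorithms (combination, mutation, and natural selection) can be seen in a small sketch. The objective used here (maximize the number of 1s in a bit string, often called OneMax) and all parameter values are assumptions chosen purely for illustration; they are not taken from the text.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1s in the bit string."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Run a simple genetic algorithm and return the fittest individual found."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Natural selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Genetic combination: single-point crossover of the two parents.
            cut = rng.randrange(1, n_bits)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), best)
```

Because the parents survive unchanged into the next generation, the best fitness in the population never decreases from one generation to the next.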
2.3.5 Visualization
Data visualization makes it possible for the analyst to gain a deeper, more intuitive
understanding of the data and as such can work well alongside data mining. Data mining
allows the analyst to focus on certain patterns and trends and explore in-depth using
visualization. On its own data visualization can be overwhelmed by the volume of data in
a database, but in conjunction with data mining it can help with exploration. Visualization
covers the wide range of tools for presenting reports to the end user. The presentation
ranges from simple tables to complex graphs, using various 2D and 3D rendering techniques
to distinguish the information presented.
2.3.6 Differences between Data Mining and Machine Learning
Knowledge Discovery in Databases (KDD) or Data Mining, and the part of Machine
Learning (ML) dealing with learning from examples overlap in the algorithms used and
the problems addressed. The main differences are:
▪ KDD is concerned with finding understandable knowledge, while ML is concerned
with improving performance of an agent. So training a neural network to balance a
pole is part of ML, but not of KDD. However, there are efforts to extract knowledge
from neural networks which are very relevant for KDD.
▪ KDD is concerned with very large, real-world databases, while ML typically (but
not always) looks at smaller data sets. So efficiency questions are much more
important for KDD.
▪ ML is a broader field which includes not only learning from
examples, but also reinforcement learning, learning with a teacher, etc.
KDD is that part of ML which is concerned with finding understandable knowledge
in large sets of real-world examples. When integrating machine-learning
techniques into database systems to implement KDD, some of the databases require:
▪ More efficient learning algorithms, because realistic databases are normally very
large and noisy. The database is often designed for purposes different
from data mining, so properties or attributes that would simplify the learning
task are not present, nor can they be requested from the real world. Databases are
usually contaminated by errors, so the data mining algorithm has to cope with noise,
whereas ML has laboratory-type examples, i.e., as near perfect as possible.
▪ More expressive representations for both data, e.g., tuples in relational databases,
which represent instances of a problem domain, and knowledge, e.g., rules in a
rule-based system, which can be used to solve users’ problems in the domain, and
the semantic information contained in the relational schemata.
Practical KDD systems are expected to include three interconnected phases:
▪ Translation of standard database information into a form suitable for use by learning
facilities;
▪ Using machine learning techniques to produce knowledge bases from databases; and
▪ Interpreting the knowledge produced to solve users’ problems and/or reduce data spaces.
2.4 Types of Knowledge
Knowledge is a collection of interesting and useful patterns in a database.
The key issue in Knowledge Discovery in Databases is to realize that there is more
information hidden in your data than you are able to distinguish at first sight. In data
mining we distinguish four different types of knowledge. The types of knowledge and their
techniques are listed in Table 2.1.
Table 2.1: Different types of knowledge and techniques
2.4.1 Shallow Knowledge
This is information that can be easily retrieved from database using a query tool such as
Structured Query Language (SQL).
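A short sketch of shallow knowledge, using Python's built-in sqlite3 module. The `customers` table and its contents are hypothetical; the point is only that the answer is stored explicitly and a single direct query retrieves it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Asha", "Chennai"), ("Ben", "Delhi"), ("Chitra", "Chennai")],
)

# Shallow knowledge: which customers live in Chennai? One query answers it.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Chennai",)
).fetchall()
print(rows)
```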
2.4.2 Multi-Dimensional Knowledge
This is information that can be analyzed using online analytical processing (OLAP) tools.
With OLAP tools you have the ability to rapidly explore all sorts of clusterings and
different orderings of the data, but it is important to realize that most of the things you
can do with an OLAP tool can also be done using SQL. The advantage of OLAP tools is that
they are optimized for this kind of search and analysis operation. However, OLAP is not as
powerful as data mining; it cannot search for optimal solutions.
2.4.3 Hidden Knowledge
This is information that can be found relatively easily by using pattern recognition or machine
learning algorithms, that is, through data mining techniques. Again, one could use SQL to
find these patterns, but this would probably prove extremely time-consuming. A pattern
recognition algorithm could find regularities in a database in minutes or at most a couple
of hours, whereas you would have to spend months using SQL to achieve the same result.
2.4.4 Deep Knowledge
This is information that is stored in the database but can only be located if we have a clue
that tells us where to look. Hidden knowledge is the result of a search space over a gentle,
hilly landscape; a search algorithm can easily find a reasonably optimal solution. Deep
knowledge is typically the result of a search space over only a tiny local optimum, with no
indication of any elevations in the neighbourhood. A search algorithm could roam around
this landscape forever without achieving any significant result. An example of this is
encrypted information stored in a database. It is almost impossible to decipher a message
that is encrypted if you do not have the key, which indicates that, for the present at any rate,
there is a limit to what one can learn.
2.5 Knowledge Discovery Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of
useful, previously unknown, and potentially valuable information from large datasets. The
KDD process is iterative: it usually requires multiple passes over the steps below
to extract accurate knowledge from the data. The following steps are included in the KDD
process:
▪ Data Cleaning − In this step, noise and inconsistent data are removed.
▪ Data Integration − In this step, multiple data sources are combined.
▪ Data Selection − In this step, data relevant to the analysis task are retrieved from
the database.
▪ Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
▪ Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
▪ Pattern Evaluation − In this step, data patterns are evaluated.
▪ Knowledge Presentation − In this step, knowledge is represented.
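The steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the field names, and the "average age per class" stand-in for the mining step are all hypothetical, and pattern evaluation and presentation are reduced to returning the result.

```python
# Hypothetical records from two sources; field names are illustrative only.
SOURCE_A = [
    {"id": 1, "age": 54, "disease": "yes"},
    {"id": 2, "age": None, "disease": "no"},  # missing age
]
SOURCE_B = [
    {"id": 3, "age": 61, "disease": "yes"},
    {"id": 4, "age": 38, "disease": "no"},
]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: recode the yes/no attribute as 1/0."""
    return [{**r, "disease": 1 if r["disease"] == "yes" else 0} for r in records]


def mine(records):
    """Data mining (toy stand-in): average age per disease class."""
    ages = {}
    for r in records:
        ages.setdefault(r["disease"], []).append(r["age"])
    return {label: sum(v) / len(v) for label, v in ages.items()}


def run_kdd():
    data = integrate(clean(SOURCE_A), clean(SOURCE_B))
    data = transform(select(data, ["age", "disease"]))
    return mine(data)


patterns = run_kdd()
print(patterns)
```

In a real project each function would be far more involved, and the chain would be re-run several times as the results of later steps expose problems in earlier ones.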
2.6 Evolutions of Data Mining.
The growth of data mining and the technologies that evolved to support the
process are listed below in table 2.1.
Table 2.1 : Evolution of Data Mining
2.7 Stages of Data Mining
There are various phases involved in the KDD process. The stages are depicted below
in figure 2.2.
Figure 2.2 : Stages of Data Mining
2.7.1 Data Selection
There are two parts to selecting data for data mining. The first part, locating data, tends to
be more mechanical in nature than the second part, identifying data, which requires
significant input by a domain expert for the data.
A domain expert is someone who is intimately familiar with the business purposes and
aspects, or domain, of the data to be examined.
In our example we start with a database containing records of patients with hypertension-related
diseases and their duration. It is a selection of operational data about the Public Health Center
patients of a small village and contains information about name, designation, age, address,
disease particulars, and period of disease, as shown in table 2.2. In order to facilitate the KDD
process, a copy of this operational data is drawn and stored in a separate database.
Table 2.2 : Original Data
2.7.2 Data Cleaning
Data cleaning is the process of ensuring that, for data mining purposes, the data is
uniform in terms of key and attribute usage. It involves inspecting data for physical
inconsistencies, such as orphan records or required fields set to null, and logical
inconsistencies, such as accounts with closing dates earlier than starting dates.
Data cleaning is separate from data enrichment and data transformation because data
cleaning attempts to correct misused or incorrect attributes in existing data. Data
enrichment, by contrast, adds new attributes to existing data, while data transformation
changes the form or structure of attributes in existing data to meet specific data mining
requirements.
There are several types of cleaning process, some of which can be executed in advance
while others are invoked only after pollution is detected at the coding or the discovery
stage. An important element in a cleaning operation is the de-duplication of records. In a
normal database some clients will be represented by several records; in many
cases this is the result of negligence, such as people making typing errors, or of clients
moving from one place to another without notifying a change of address. Although data
mining and data cleaning are two different disciplines, they have a lot in common, and
pattern recognition algorithms can be applied in cleaning data. In the original table we have
X and X1. They have the same number and address but different names, which is a strong
indication that they are the same person recorded twice. Removing such duplicate records is
de-duplication, the result of which is shown in table 2.3.
Table 2.3 : De-Duplication
A second type of pollution that frequently occurs is lack of domain consistency and
disambiguation. This type of pollution is particularly damaging because it is hard to trace,
yet it greatly influences the type of patterns found when we apply data mining to this table. In our
example we replace unknown data with NULL. The duration is recorded as a
negative value for patient number 103. The value must be positive, so the incorrect value
is replaced with NULL.
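The two cleaning operations described above, de-duplication and domain-consistency repair, can be sketched as follows. The records are hypothetical (they are not the contents of table 2.2), and treating two records with the same number and address as duplicates is an assumption made for this illustration.

```python
# Hypothetical patient records; values are illustrative only.
records = [
    {"number": 101, "name": "X", "address": "12 Main St", "duration": 5},
    {"number": 101, "name": "X1", "address": "12 Main St", "duration": 5},
    {"number": 102, "name": "Y", "address": "unknown", "duration": 3},
    {"number": 103, "name": "Z", "address": "7 Hill Rd", "duration": -2},
]


def deduplicate(rows):
    """Keep the first record seen for each (number, address) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["number"], row["address"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fix_domain(rows):
    """Replace unknown values and impossible (negative) durations with None."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input records are left untouched
        if row["address"] == "unknown":
            row["address"] = None
        if row["duration"] is not None and row["duration"] < 0:
            row["duration"] = None
        cleaned.append(row)
    return cleaned


cleaned = fix_domain(deduplicate(records))
for row in cleaned:
    print(row)
```

Python's None plays the role of NULL here; a real system would also log which values were replaced so the decision can be reviewed.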
2.7.3 Data Enrichment
Data enrichment is the process of adding new attributes, such as calculated fields or data
from external sources, to existing data. Most references on data mining tend to combine
this step with data transformation. Data transformation involves the manipulation of data,
but data enrichment involves adding information to existing data. This can include
combining internal data with external data, obtained from either different departments or
companies or vendors that sell standardized industry-relevant data.
Data enrichment is an important step if you are attempting to mine marginally acceptable
data. You can add information to such data from standardized external industry sources to
make the data mining process more successful and reliable, or provide additional derived
attributes for a better understanding of indirect relationships. For example, data
warehouses frequently provide pre-aggregation across business lines that share common
attributes for cross-selling analysis purposes.
As with data cleaning and data transformation, this step is best handled in a temporary
storage area. Data enrichment, in particular the combination of external data sources with
data to be mined, can require a number of updates to both data and meta data, and such
updates are generally not acceptable in an established data warehouse.
In our example, suppose we have purchased extra information about our patients
consisting of age and kidney, heart, and stroke related disease. This is more realistic than it
may initially seem, since hypertension and diabetes patients can be traced fairly easily. For
our purposes, however, it is not particularly important how the information was gathered; it
can easily be joined to the existing records, as shown in table 2.4.
Table 2.4 : Enrichment
2.7.4 Data Transformation
Data transformation, in terms of data mining, is the process of changing the form or
structure of existing attributes. Data transformation is separate from data cleansing and
data enrichment for data mining purposes because it does not correct existing attribute data
or add new attributes, but instead grooms existing attributes for data mining purposes.
2.7.5 Coding
The data in our example can undergo a number of transformations. First, the extra
information that was purchased to enrich the database is added to the records describing
the individuals.
In our example we convert the disease values yes and no into 1 and 0, as shown in table 2.5.
In the next stage, we select only those records that have enough information to be of value.
Although it is difficult to give detailed rules for this kind of operation, this is a situation that
occurs frequently in practice. In most tables that are collected from operational data, a lot of
desirable data is missing and most of it is impossible to retrieve. A general rule states that any
deletion of data must be a conscious decision, made after a thorough analysis of the possible
consequences. In some cases, especially fraud detection, lack of information can be a
valuable indication of interesting patterns.
Table 2.5 : Coded Database
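A minimal sketch of the coding step follows. The records, the field names, and the rule "keep a record if at most one value is missing" are assumptions chosen for illustration; the text gives no fixed threshold.

```python
# Hypothetical enriched records; None marks a missing value.
records = [
    {"number": 101, "age": 54, "kidney": "yes", "heart": "no", "stroke": "no"},
    {"number": 102, "age": None, "kidney": None, "heart": None, "stroke": "no"},
    {"number": 103, "age": 47, "kidney": "no", "heart": "yes", "stroke": None},
]
DISEASE_FIELDS = ("kidney", "heart", "stroke")


def encode(row):
    """Code yes/no disease attributes as 1/0, leaving missing values as None."""
    coded = dict(row)
    for field in DISEASE_FIELDS:
        if row[field] is not None:
            coded[field] = 1 if row[field] == "yes" else 0
    return coded


def has_enough_information(row, max_missing=1):
    """Assumed selection rule: tolerate at most max_missing missing values."""
    missing = sum(1 for value in row.values() if value is None)
    return missing <= max_missing


coded = [encode(r) for r in records if has_enough_information(r)]
for row in coded:
    print(row)
```

Record 102 is dropped because three of its values are missing; in line with the general rule above, such a threshold should be chosen deliberately rather than by default.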
2.7.8 Data Mining
The discovery stage of the KDD process is fascinating. Here we shall discuss some of the
most important machine-learning and pattern recognition algorithms, and in this way get
an idea of the opportunities that are available as well as some of the problems that occur
during the discovery stage. We shall see that some learning algorithms do well on one part
of the set where others fail, and this clearly indicates the need for hybrid learning. We shall
also see how relationships are detected during the data mining stage.
Data mining is not so much a single technique as the idea that there is more knowledge
hidden in the data than shows itself on the surface. From this point of view, data mining is
really an ‘anything goes’ affair. Any technique that helps extract more out of our data is
useful, so data mining techniques form quite a heterogeneous group. Although various
different techniques are used for different purposes, those that are of interest in the present
context are:
▪ Query tools
▪ Statistical techniques
▪ Visualization
▪ Online analytical processing
▪ Case based learning (k-nearest neighbour)
▪ Decision trees
▪ Association rules
▪ Neural networks
▪ Genetic algorithms
2.7.9 Visualization/Interpretation/Evaluation
Visualization techniques are a very useful method of discovering patterns in datasets, and
may be used at the beginning of a data mining process to get a rough feeling for the quality
of the data set and where patterns are to be found. Interesting possibilities are offered by
object-oriented three-dimensional tool kits, such as Inventor, which enable the user to
explore three-dimensional structures interactively.
Advanced graphical techniques in virtual reality enable people to wander through artificial
data spaces, while the historic development of data sets can be displayed as a kind of animated
movie. These simple methods can provide us with a wealth of information.
An elementary technique that can be of great value is the so-called scatter diagram. Scatter
diagrams can be used to identify interesting subsets of the data sets so that we can focus
on them in the rest of the data mining process. There is a whole field of research, called
projection pursuit, dedicated to the search for interesting projections of data sets.
2.8 OPERATIONS OF DATA MINING
Four operations are associated with discovery-driven data mining.
2.8.1 Creation of Prediction and Classification Models
This is the most commonly used operation, primarily because of the proliferation of
automatic model development techniques. The goal of this operation is to use the contents
of the database, which reflect historical data, i.e., data about the past, to automatically
generate a model that can predict future behaviour. Model creation has traditionally been
pursued using statistical techniques.
The value added by data mining techniques in this operation is in their ability to generate
models that are comprehensible and explainable, since many data mining modeling
techniques express models as sets of if... then... rules.
2.8.2 Association
Whereas the goal of the modelling operation is to create a generalized description that
characterizes the contents of a database, the goal of Association is to establish relations
between the records in a database.
2.8.3 Database Segmentation
As databases grow and are populated with diverse types of data it is often necessary to
partition them into collections of related records either as a means of obtaining a summary
of each database, or before performing a data mining operation such as model creation.
2.8.4 Deviation Detection
This operation is the exact opposite of database segmentation. In particular, its goal is to
identify outlying points in a particular data set, and explain whether they are due to noise
or other impurities being present in the data, or due to causal reasons. It is usually applied
in conjunction with database segmentation. It is usually the source of true discovery since
outliers express deviation from some previously known expectation and norm.
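One simple way to flag outlying points is a z-score rule: a value is reported when it lies more than a chosen number of standard deviations from the mean. This rule, the threshold of 2.0, and the sample values are assumptions for illustration; the text does not prescribe a particular detection method.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical disease durations in years; 40 is the suspicious entry.
durations = [4, 5, 5, 6, 5, 4, 6, 5, 40]
print(find_outliers(durations))
```

Whether a flagged value is noise or a genuine discovery cannot be decided by the rule itself; that judgment needs domain knowledge.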
2.9 ARCHITECTURE OF DATA MINING
A typical data mining system built on databases or a data warehouse may have the
following components, depicted in figure 2.2.
Database, data warehouse or other information repository: This is one or a set of databases,
data warehouses, spreadsheets or other kinds of information repositories. Data cleaning
and data integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is responsible
for fetching the relevant data, based on the user’s data mining request.
Figure 2.2 : Architecture of Data mining
2.9.1 Knowledge base
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness,
may also be included.
2.9.2 Data mining engine
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association, classification, cluster analysis, and
evolution and deviation analysis.
2.9.3 Pattern evaluation module
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search towards interesting patterns. For efficient data
mining, it is highly recommended to push the evaluation of pattern interestingness as deep
as possible into the mining process so as to confine the search to only the interesting patterns.
2.9.4 Graphical user interface
This module communicates between the user and the data mining system, allowing the user
to interact with the system by specifying a data mining query or task, providing information
to help focus the search, and performing exploratory data mining based on the
intermediate data mining results. In addition, this component allows the user to browse
database and data warehouse schemas or data structures, evaluate mined patterns, and
visualize the patterns in different forms.
Answer the Questions
1. What is learning?
2. Write notes on inductive learning.
3. Explain about multi-disciplinary approach in data mining.
4. Differentiate supervised and un-supervised learning.
5. What is machine learning?
6. Explain the role of visualization in KDD.
7. List out the difference between KDD and ML.
8. Write in detail about the types of knowledge.
9. Explain the KDD process.
10. Elaborate the stages of data mining.
11. Describe the architecture of the data mining.
MODULE III
3.1 Data Mining Techniques
There are several different methods used to perform data mining tasks. These techniques
not only require specific types of data structures, but also imply certain types of algorithmic
approaches. In this chapter, we briefly examine some of the data mining techniques.
3.2 Classification
Classification is learning a function that maps a data item into one of several predefined
classes. Examples of classification methods used as part of knowledge discovery
applications include classifying trends in financial markets and automated identification of
objects of interest in large image databases.
Prediction involves using some variables or fields in the database to predict unknown or
future values of other variables of interest. Description focuses on finding human
interpretable patterns describing the data.
3.3 Neural Networks
Neural Networks are analytic techniques modeled after the (hypothesized) processes of
learning in the cognitive system and the neurological functions of the brain and capable of
predicting new observations (on specific variables) from other observations (on the same
or other variables) after executing a process of so-called learning from existing data.
The neural network is then subjected to the process of “training.” In that phase, neurons
apply an iterative process to the number of inputs (variables) to adjust the weights of the
network in order to optimally predict (in traditional terms one could say, find a “fit” to)
the sample data on which the “training” is performed. After the phase of learning from an
existing data set, the neural network is ready and it can then be used to generate predictions
as shown in figure 3.1.
Neural Networks techniques can also be used as a component of analyses designed to build
explanatory models because Neural Networks can help explore data sets in search for
relevant variables or groups of variables; the results of such explorations can then facilitate
the process of model building.
Figure 3.1: Neural Networks
Advantages
An advantage of Neural Networks is that, theoretically, they are capable of approximating
any continuous function, and thus the researcher does not need to have any hypotheses about
the underlying model or even, to some extent, about which variables matter.
Disadvantages
The final solution depends on the initial conditions of the network. It is virtually impossible
to “interpret” the solution in traditional, analytic terms, such as those used to build theories
that explain phenomena.
3.4 Decision Tree technique
Decision trees are powerful and popular tools for classification and prediction. Decision
trees represent rules. A decision tree is a classifier in the form of a tree structure where each
node is either:
▪ a leaf node, indicating a class of instances, or
▪ a decision node that specifies some test to be carried out on a single attribute value,
with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and
moving through it until a leaf node is reached, which provides the classification of the instance.
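The classification walk from root to leaf can be sketched in a few lines. The nested-dictionary layout and the small example tree (with the attributes fever and cough) are hypothetical choices made for this illustration.

```python
# Assumed layout: a decision node is a dict with the attribute to test and
# one branch per outcome; a leaf node is simply a class label (a string).
tree = {
    "attribute": "fever",
    "branches": {
        "no": "healthy",
        "yes": {
            "attribute": "cough",
            "branches": {"yes": "flu", "no": "other"},
        },
    },
}


def classify(node, instance):
    """Start at the root and follow the matching branch until a leaf is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(tree, {"fever": "yes", "cough": "yes"}))
```

Each test looks at a single attribute value, so the cost of classifying one instance is bounded by the depth of the tree.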
3.5 Constructing decision trees
Decision tree programs construct a decision tree T from a set of training cases. The original
idea of construction of decision trees goes back to the work of Hoveland and Hunt on
Concept Learning Systems (CLS) in the late 1950s.
The algorithm consists of five steps.
1. T ← the whole training set. Create a T node.
2. If all examples in T are positive, create a ‘P’ node with T as its parent and stop.
3. If all examples in T are negative, create a ‘N’ node with T as its parent and stop.
4. Select an attribute X with values v1, v2, …, vN and partition T into subsets T1, T2, …,
TN according to their values on X. Create N nodes Ti (i = 1,..., N) with T as their parent and
X = vi as the label of the branch from T to Ti.
5. For each Ti do: T ← Ti and go to step 2.
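The five steps can be sketched as a short recursive function. Two points are assumptions of this sketch rather than part of the steps as listed: attributes are tried in the order given (the steps do not say how X is selected), and a majority vote is used when the attributes run out before the examples are pure.

```python
def build_cls_tree(examples, attributes):
    """examples: list of (features, label) pairs with label 'P' or 'N'."""
    if not examples:
        raise ValueError("at least one training example is required")
    labels = {label for _, label in examples}
    if labels == {"P"}:  # step 2: all positive
        return "P"
    if labels == {"N"}:  # step 3: all negative
        return "N"
    if not attributes:
        # Assumption: fall back to the majority class when no attribute is left.
        positives = sum(1 for _, label in examples if label == "P")
        return "P" if positives * 2 >= len(examples) else "N"
    # Step 4: select an attribute (here simply the next one) and partition T.
    attribute, remaining = attributes[0], attributes[1:]
    subsets = {}
    for features, label in examples:
        subsets.setdefault(features[attribute], []).append((features, label))
    # Step 5: repeat the procedure for every subset Ti.
    return {
        "attribute": attribute,
        "branches": {
            value: build_cls_tree(subset, remaining)
            for value, subset in subsets.items()
        },
    }


# Hypothetical training set, for illustration only.
training = [
    ({"outlook": "sunny", "wind": "weak"}, "N"),
    ({"outlook": "sunny", "wind": "strong"}, "N"),
    ({"outlook": "overcast", "wind": "weak"}, "P"),
    ({"outlook": "rain", "wind": "weak"}, "P"),
    ({"outlook": "rain", "wind": "strong"}, "N"),
]
cls_tree = build_cls_tree(training, ["outlook", "wind"])
print(cls_tree)
```

Because the attribute order is fixed in advance, the resulting tree can be larger than necessary; choosing the attribute by an informativeness measure addresses exactly this.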
An example of decision tree is represented in figure 3.2.
Figure 3.2 : An example of simple decision tree
Decision tree induction is a typical inductive approach to learn knowledge on
classification. The key requirements to do mining with decision trees are:
▪ Predefined classes: The categories to which cases are to be assigned must have
been established beforehand (supervised data).
▪ Discrete classes: A case does or does not belong to a particular class, and there
must be far more cases than classes.
▪ Sufficient data: Usually hundreds or even thousands of training cases.
▪ “Logical” classification model: A classifier that can be expressed only as decision
trees or sets of production rules.
3.6 ID3 algorithm
J. Ross Quinlan originally developed ID3 at the University of Sydney. His widely cited
description of ID3 appeared in the journal Machine Learning, vol. 1, no. 1 (1986). ID3 is
based on the Concept Learning System (CLS) algorithm.
▪ In the decision tree each node corresponds to a non-goal attribute and each arc to
a possible value of that attribute.
▪ A leaf of the tree specifies the expected value of the goal attribute for the records
described by the path from the root to that leaf. [This defines what a decision tree
is.]
▪ Each node in the decision tree should be associated with the non-goal attribute
that is most informative among the attributes not yet considered in the path
from the root. Entropy is used to measure how informative a node is.
Which attribute is the best classifier?
The estimation criterion in the decision tree algorithm is the selection of an attribute to test
at each decision node in the tree. The goal is to select the attribute that is most useful for
classifying examples. A good quantitative measure of the worth of an attribute is a
statistical property called information gain that measures how well a given attribute
separates the training examples according to their target classification. This measure is
used to select among the candidate attributes at each step while growing the tree.
Entropy—a measure of homogeneity of the set of examples
In order to define information gain precisely, we need to define a measure commonly used
in information theory, called entropy, that characterizes the (im)purity of an arbitrary
collection of examples. Given a set S, containing only positive and negative examples of
some target concept (a 2-class problem), the entropy of set S relative to this simple, binary
classification is defined as:
Entropy(S) = – pp log2(pp) – pn log2(pn)
where pp is the proportion of positive examples in S and pn is the proportion of negative
examples in S. In all calculations involving entropy we define 0 log2(0) to be 0.
To illustrate, suppose S is a collection of 25 examples, including 15 positive and 10
negative examples [15+, 10–]. Then the entropy of S relative to this classification is
Entropy(S) = – (15/25) log2 (15/25) – (10/25) log2 (10/25) = 0.971
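The calculation can be checked with a few lines of Python. This is a minimal sketch for the two-class case; the function name and its arguments are chosen for this illustration.

```python
from math import log2


def entropy(positives, negatives):
    """Two-class entropy, with 0 * log2(0) treated as 0."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:  # skip empty classes: their contribution is defined as 0
            p = count / total
            result -= p * log2(p)
    return result


print(round(entropy(15, 10), 3))
```

For the [15+, 10–] collection this evaluates to roughly 0.97, while a pure collection gives 0 and an evenly split one gives 1.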
3.6.1 Example of ID3
Suppose we want ID3 to decide whether the weather is amenable to playing cricket. Over
the course of 2 weeks, data is collected to help ID3 build a decision tree (see Table 3.1).
The target classification is “should we play cricket” which can be yes or no. The weather
attributes are outlook, temperature, humidity, and wind speed. They can have the
following values:
outlook = {sunny, overcast, rain}
temperature = {hot, mild, cool}
humidity = {high, normal}
wind = {weak, strong}
Examples of set S are:
TABLE 3.1: Weather Report
If S is a collection of 14 examples with 9 YES and 5 NO examples, then
Entropy(S) = –(9/14) log2 (9/14) – (5/14) log2 (5/14) = 0.940
Notice that entropy is 0 if all members of S belong to the same class (the data is perfectly
classified). For a two-class problem, entropy ranges from 0 (“perfectly classified”) to 1
(“totally random”).
Gain(S, A), the information gain of example set S on attribute A, is defined as
Gain(S, A) = Entropy(S) – Σ ((|Sv|/|S|) * Entropy(Sv))
Where:
Σ = sum over each value v of all possible values of attribute A
Sv = subset of S for which attribute A has value v
|Sv| = number of elements in Sv
|S| = number of elements in S.
Suppose S is a set of 14 examples in which one of the attributes is wind speed. The values
of wind can be Weak or Strong. The classification of these 14 examples is 9 YES and
5 NO. For attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6
occurrences of Wind = Strong. For Wind = Weak, 6 of the examples are YES and 2 are
NO. For Wind = Strong, 3 are YES and 3 are NO. Therefore,
Entropy(Sweak) = – (6/8)*log2(6/8) – (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = – (3/6)*log2(3/6) – (3/6)*log2(3/6) = 1.00
Gain(S, Wind) = Entropy(S) – (8/14)*Entropy(Sweak) – (6/14)*Entropy(Sstrong)
= 0.940 – (8/14)*0.811 – (6/14)*1.00 = 0.048.
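The Wind calculation can be reproduced with a short sketch. The function names and the representation of each subset as a list of class counts [YES, NO] are assumptions made for this illustration.

```python
from math import log2


def entropy(counts):
    """Entropy of a collection described by its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)


def information_gain(parent_counts, partition_counts):
    """Gain(S, A): parent entropy minus the weighted entropy of each subset Sv."""
    total = sum(parent_counts)
    remainder = sum(
        (sum(subset) / total) * entropy(subset) for subset in partition_counts
    )
    return entropy(parent_counts) - remainder


# S = [9 YES, 5 NO]; Wind = Weak gives [6, 2], Wind = Strong gives [3, 3].
gain_wind = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain_wind, 3))
```

Running the same function once per attribute and picking the largest value is how the attribute for a decision node is chosen.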
For each attribute, the gain is calculated and the highest gain is used in the decision node.
We need to find which attribute will be the root node in our decision tree. The gain is
calculated for all four attributes:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
The Outlook attribute has the highest gain; therefore, it is used as the decision attribute in the
root node. Since Outlook has three possible values, the root node has three branches (sunny,
overcast, rain). The next question is “what attribute should be tested at the Sunny branch
node?” Since we have used Outlook at the root, we only decide among the remaining three
attributes: Humidity, Temperature, or Wind.
The examples from Table 3.1 with outlook = sunny are:
Ssunny = {D1, D2, D8, D9, D11}, so |Ssunny| = 5
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process
goes on until all data is classified perfectly or we run out of attributes.
The result is the final decision tree.
The decision tree can also be expressed in rule format:
IF outlook = sunny AND humidity = high THEN cricket = no
IF outlook = sunny AND humidity = normal THEN cricket = yes
IF outlook = overcast THEN cricket = yes
IF outlook = rain AND wind = strong THEN cricket = no
IF outlook = rain AND wind = weak THEN cricket = yes.
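A rule set of this kind translates directly into code. The sketch below assumes the tree derived above (Outlook at the root, Humidity tested under sunny, Wind tested under rain) and assumes that the training data is the commonly used 14-example weather set, for which sunny days with normal humidity and rainy days with weak wind are the "yes" cases.

```python
def play_cricket(outlook, humidity, wind):
    """Apply the decision rules; temperature is not needed by this tree."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "rain":
        return "no" if wind == "strong" else "yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")


print(play_cricket("sunny", "high", "weak"))
print(play_cricket("rain", "normal", "weak"))
```

Every path from the root to a leaf corresponds to exactly one rule, so the rule set and the tree always classify an instance identically.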
ID3 has been incorporated in a number of commercial rule-induction packages. Some
specific applications include medical diagnosis, credit risk assessment of loan applications,
classification of equipment malfunctions by their cause, classification of soybean diseases,
and web search classification. The decision tree based on the concept for playing cricket is
shown in figure 3.3.
Figure 3.3 : Decision tree for the concept play cricket
3.6.2 Strengths and Weaknesses of Decision Tree Methods
The strengths of decision tree methods
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical variables.
4. Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods
Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous variable such as income, blood pressure, or interest rate. Decision trees are
also problematic for time-series data unless a lot of effort is put into presenting the data in
such a way that trends and sequential patterns are made visible.
1. Error-Prone with Too Many Classes.
2. Computationally Expensive to Train.
3. Trouble with Non-Rectangular Regions.
3.7 Genetic Algorithm
Genetic Algorithms/Evolutionary algorithms are based on Darwin’s theory of survival of
the fittest. Here the best program or logic survives from a pool of solutions. Two programs,
called chromosomes, combine to produce a third program, called a child. The reproduction
process goes through Crossover and Mutation operations.
3.7.1 Crossover
Using simple single point crossover reproduction mechanism, a point along the
chromosome length is randomly chosen at which the crossover is done as illustrated.
(a) Single Point Crossover: In this type, one crossover point is selected and the string from
the beginning of chromosome to the crossover point is copied from one parent and the rest
is copied from the second parent resulting in a child. For instance, consider the following
chromosomes and crossover point at position 4 shown in figure 3.4.
Figure 3.4 : Single Point Crossover
(b) Two Point Crossover: In this type, two crossover points are selected and the string from
beginning of one chromosome to the first crossover point is copied from one parent, the
part from the first to the second crossover point is copied from the second parent and the
rest is copied from the first parent. This type of crossover is mainly employed in
permutation encoding and value encoding where a single point crossover would result in
inconsistencies in the child chromosomes. For instance, consider the following
chromosomes and crossover points at positions 2 and 5 as shown in figure 3.5.
Figure 3.5: Two Point Crossover
Here we observe that the crossover regions between the crossover points are
interchanged in the children.
(c) Tree Crossover: The tree crossover method is most suitable when tree encoding is
employed. One crossover point is chosen at random, the parents are divided at that point,
and the parts below the crossover point are exchanged to produce new offspring, as depicted in
figure 3.6.
Figure 3.6 : Crossover
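Single point and two point crossover can be sketched with sequence slicing. The function names and the example bit strings are hypothetical; in a running genetic algorithm the crossover points would be chosen at random, while here they are passed in explicitly so the result is easy to follow.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Copy parent_a up to the crossover point, the rest from parent_b."""
    return parent_a[:point] + parent_b[point:]


def two_point_crossover(parent_a, parent_b, first, second):
    """Copy the segment between the two points from parent_b, the rest from parent_a."""
    return parent_a[:first] + parent_b[first:second] + parent_a[second:]


# Hypothetical binary-encoded parents.
parent_a = "11111111"
parent_b = "00000000"
print(single_point_crossover(parent_a, parent_b, 4))
print(two_point_crossover(parent_a, parent_b, 2, 5))
```

Swapping the roles of the two parents in each call yields the second child of the pair.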
3.7.2 Mutation
As new individuals are generated, each character is mutated with a given probability. In a
binary-coded Genetic Algorithm, mutation may be done by flipping a bit, while in a non-
binary-coded GA, mutation involves randomly generating a new character in a specified
position. Mutation produces incremental random changes in the offspring generated
through crossover. When used by itself without any crossover, mutation is equivalent to a
random search consisting of incremental random modification of the existing solution and
acceptance if there is improvement. However, when used in the GA, its behavior changes
radically. In the GA, mutation serves the crucial role of replacing the gene values lost
from the population during the selection process so that they can be tried in a new context,
or of providing gene values that were not present in the initial population. As with
crossover, the type of mutation used depends on the type of encoding employed.
The various types are as follows:
(a) Bit Inversion: This mutation type is employed for a binary encoded problem. Here, a
bit is randomly selected and inverted, i.e., a bit is changed from 0 to 1 or vice versa. For
instance, consider the mutation shown in figure 3.7.
Figure 3.7 : Bit inversion
(b) Order Changing: This type of mutation is specifically used in permutation-encoded
problems. Here, two random points in the chromosome are chosen and interchanged,
for instance as shown in figure 3.8.
Figure 3.8 : Order Inversion
(c) Value Manipulation: Value manipulation refers to selecting random point or points in
the chromosome and adding or subtracting a small number from it. Hence, this method is
specifically useful for real value encoded problems. For instance,
(1.29 5.68 2.86 4.11 5.55) => (1.29 5.68 2.73 4.22 5.55)
(d) Operator Manipulation: This method involves changing the operators randomly in an
operator tree and hence is used with tree-encoded problems, as shown in figure 3.9.
Figure 3.9 : Operation Manipulation
Here we observe that the divide operator in the parent is randomly changed to the
multiplication operator.
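The first three mutation types above can be sketched in a few lines of Python. This is only an illustration: the function names, the step size of 0.15 and the sample chromosomes are assumptions made for the example, not part of the course text.

```python
import random

def bit_inversion(bits, rng):
    """Flip one randomly chosen bit of a binary-encoded chromosome."""
    child = list(bits)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def order_changing(perm, rng):
    """Swap two randomly chosen positions of a permutation-encoded chromosome."""
    child = list(perm)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def value_manipulation(values, rng, step=0.15):
    """Add a small random amount in [-step, step] to one real-valued gene."""
    child = list(values)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-step, step)
    return child

rng = random.Random(0)  # fixed seed so that repeated runs give the same mutations
print(bit_inversion([1, 1, 0, 0, 1, 0], rng))
print(order_changing([1, 2, 3, 4, 5, 6], rng))
print(value_manipulation([1.29, 5.68, 2.86, 4.11, 5.55], rng))
```

Each function returns a new chromosome and leaves the parent unchanged, so the GA can decide afterwards whether to keep the mutated offspring.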
Answer the Questions
1. List any four data mining techniques.
2. Explain the decision tree algorithm with a suitable example.
3. Write notes on (i) Classification (ii) Neural network.
4. Describe the ID3 algorithm with a suitable example.
5. Write the two factors that influence the decision tree.
6. Write in detail on the genetic algorithm.
MODULE IV
4.1 Clustering:
Clustering can be considered the most important unsupervised learning problem; like
every other problem of this kind, it deals with finding a structure in a collection of unlabeled
data. A definition of clustering could be “the process of organizing objects into groups whose
members are similar in some way”. A cluster is therefore a collection of objects which are
“similar” to one another and “dissimilar” to the objects belonging to other clusters, as
shown in Figure 4.1.
Figure 4.1 : Clusters
In the above example, we easily identify the 4 clusters into which the data can be
divided; the similarity criterion is distance: two or more objects belong to the same
cluster if they are “close” according to a given distance (in this case geometrical
distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the
same cluster if it defines a concept common to all those objects. In other words,
objects are grouped according to their fit to descriptive concepts, not according to simple
similarity measures.
4.1.1 Distance function
Given two p-dimensional data objects i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp), the
following common distance functions can be defined:
Euclidean Distance Function
d(i, j) = sqrt((xi1 – xj1)^2 + (xi2 – xj2)^2 + ... + (xip – xjp)^2)
Manhattan Distance Function
d(i, j) = |xi1 – xj1| + |xi2 – xj2| + ... + |xip – xjp|
When the Euclidean distance function is used only to compare distances, it is not necessary to
calculate the square root: distances are never negative, so for two squared
distances d1 and d2, sqrt(d1) > sqrt(d2) if and only if d1 > d2. If an object’s attributes are
measured along different scales, then when using the Euclidean distance function,
attributes with larger scales of measurement may overwhelm attributes measured on a
smaller scale. To prevent this problem, the attribute values are often normalized to lie
between 0 and 1.
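As a minimal sketch, the two distance functions and the normalization to the range 0 to 1 can be written as follows. The function names and the small (age, income) data set are hypothetical and serve only to show the effect of differing scales.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def min_max_normalize(rows):
    """Rescale every attribute (column) so that its values lie between 0 and 1."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical objects described by (age in years, income in rupees).
people = [(25, 30000), (35, 90000), (45, 50000)]
normalized = min_max_normalize(people)

print(euclidean(people[0], people[1]))          # dominated by the income attribute
print(euclidean(normalized[0], normalized[1]))  # both attributes now contribute
print(manhattan((2, 10), (5, 8)))
```

Before normalization the income difference swamps the age difference; after normalization both attributes lie on the same scale.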
Applications
Clustering algorithms can be applied in many fields, for instance:
▪ Marketing: finding groups of customers with similar behavior given a large
database of customer data containing their properties and past buying records;
▪ Biology: classification of plants and animals given their features;
▪ Libraries: book ordering;
▪ Insurance: identifying groups of motor insurance policy holders with a high
average claim cost; identifying frauds;
▪ City-planning: identifying groups of houses according to their house type, value
and geographical location;
▪ Earthquake studies: clustering observed earthquake epicenters to identify
dangerous zones;
▪ WWW: document classification; clustering weblog data to discover groups of
similar access patterns.
4.2 K-means algorithm
K-Means clustering is an unsupervised iterative clustering technique. It partitions the
given data set into k predefined distinct clusters. A cluster is defined as a collection of data
points exhibiting certain similarities.
Figure 4.2 – K-Means Clustering
It partitions the data set such that-
▪ Each data point belongs to a cluster with the nearest mean.
▪ Data points belonging to one cluster have a high degree of similarity.
▪ Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm-
Step-01 :
Choose the number of clusters K.
Step-02:
Randomly select any K data points as cluster centers.
Select cluster centers in such a way that they are as far as possible from each other.
Step-03:
Calculate the distance between each data point and each cluster center.
Step-04:
Assign each data point to some cluster.
A data point is assigned to that cluster whose center is nearest to that data point.
Step-05:
Re-compute the center of newly formed clusters.
The center of a cluster is computed by taking mean of all the data points contained in that
cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping
criteria is met-
▪ The centers of the newly formed clusters do not change
▪ The data points remain in the same clusters
▪ The maximum number of iterations is reached
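The six steps above can be sketched as a short Python function. This is a minimal illustration under stated assumptions, not a reference implementation: the name k_means, its parameters (init, max_iter, seed) and the sample points are all hypothetical, squared Euclidean distance is assumed, and the initial centers are simply drawn at random rather than chosen to be far apart.

```python
import random

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, k, init=None, max_iter=100, seed=0):
    # Step-01 is the caller's choice of k.
    # Step-02: use the supplied centers, or pick k data points at random.
    centers = list(init) if init else random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):                 # Step-06: cap on iterations
        # Step-03 and Step-04: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:              # Step-06: nothing moved, so stop
            break
        labels = new_labels
        # Step-05: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                       # an empty cluster keeps its old center
                centers[c] = mean_point(members)
    return centers, labels

# Hypothetical data: two visibly separate groups of points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(data, k=2, init=[(0, 0), (10, 10)])
print(centers)
print(labels)
```

Passing init makes a run repeatable; leaving it out falls back to a seeded random choice, which is closer to Step-02 as stated.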
Advantages
It is relatively efficient with time complexity O(nkt) where-
▪ n = number of instances
▪ k = number of clusters
▪ t = number of iterations.
However, it often terminates at a local optimum rather than the global optimum.
Techniques such as simulated annealing or genetic algorithms may be used to search for the
global optimum.
Disadvantages
The number of clusters (k) must be specified in advance.
It cannot handle noisy data and outliers.
It is not suitable for identifying clusters with non-convex shapes.
EXAMPLES ON K-MEANS CLUSTERING ALGORITHM.
PROBLEM
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution :
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:
We calculate the distance of each point from each of the three cluster centers. The
distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and
each of the three cluster centers.
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In a similar manner, we calculate the distance of the other points from each of the three
cluster centers.
Next, we draw a table showing all the results. Using the table, we decide which point
belongs to which cluster. A given point belongs to the cluster whose center is nearest
to it.
New clusters are
Cluster-01:
First cluster contains points- A1(2, 10)
Cluster-02:
Second cluster contains points - A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03:
Third cluster contains points - A2(2, 5), A7(1, 2)
Now, we re-compute the new cluster centers. The new cluster center is computed by
taking the mean of all the points contained in that cluster.
For Cluster-01: We have only one point A1(2, 10) in Cluster-01. So, cluster center
remains the same.
For Cluster-02: Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03: Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-01.
Iteration-02:
We calculate the distance of each point from each of the three cluster centers. The
distance is calculated by using the given distance function. The following illustration
shows the calculation of the distance between point A1(2, 10) and each of the three
cluster centers.
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(6, 6)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7
In a similar manner, we calculate the distance of the other points from each of the three
cluster centers.
Next, we draw a table showing all the results.
Using the table, we decide which point belongs to which cluster. A given point belongs
to the cluster whose center is nearest to it.
New clusters are
Cluster-01:
First cluster contains points- A1(2, 10), A8(4, 9)
Cluster-02:
Second cluster contains points- A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)
Cluster-03:
Third cluster contains points- A2(2, 5), A7(1, 2)
Now, we re-compute the new cluster centers. The new cluster center is computed by taking
the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-02.
After the second iteration, the centers of the three clusters are-
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
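The two iterations of this worked example can be reproduced with a short script. It uses the eight points, the three initial centers and the distance function Ρ given in the problem; only the variable names are chosen for illustration.

```python
def manhattan(a, b):
    # The distance function from the problem: |x2 - x1| + |y2 - y1|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
centers = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4 and A7

for iteration in (1, 2):
    # Assign each point to the cluster whose center is nearest.
    clusters = [[] for _ in centers]
    for name, p in points.items():
        distances = [manhattan(p, c) for c in centers]
        clusters[distances.index(min(distances))].append(name)
    # Re-compute each center as the mean of the points in its cluster.
    centers = [
        (sum(points[n][0] for n in members) / len(members),
         sum(points[n][1] for n in members) / len(members))
        for members in clusters
    ]
    print(f"Iteration-{iteration:02d}: clusters = {clusters}, centers = {centers}")
```

After the second pass, the variable centers holds the same three centers as listed above.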
4.3 Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters on the data set. This hierarchical tree
shows levels of clustering, with each lower level having a larger number of smaller clusters.
Hierarchical algorithms can be either agglomerative or divisive, that is, bottom-up or
top-down, respectively.
All agglomerative hierarchical clustering algorithms begin with each object as a separate
group. These groups are successively combined based on similarity until there is only one
group remaining or a specified termination condition is satisfied. For n objects, n-1
mergings are done.
Hierarchical algorithms are rigid in that once a merge has been done, it cannot be undone.
Although this keeps computational costs low, it can also cause problems if
an erroneous merge is done. As such, merge points need to be chosen carefully. Here we
describe a simple agglomerative clustering algorithm.
Figure 4.3 : Hierarchical Clustering
In the context of hierarchical clustering, the hierarchy graph is called a dendrogram. Fig. 4.3
shows a sample dendrogram that could be produced from a hierarchical clustering
algorithm. Unlike with the k-means algorithm, the number of clusters (k) is not specified
in hierarchical clustering. After the hierarchy is built, the user can specify the number of
clusters required, from 1 to n. The top level of the hierarchy represents one cluster, or k =
1. To examine more clusters, we simply need to traverse down the hierarchy.
Fig. 4.4 shows a simple hierarchical algorithm. The distance function in this algorithm can
determine similarity of clusters through many methods, including single link and group-
average. Single link calculates the distance between two clusters as the shortest distance
between any two objects contained in those clusters. Group-average first finds the average
values for all objects in the group (i.e., cluster) and calculates the distance between clusters
as the distance between the average values.
Each object in X is initially used to create a cluster containing a single object. These
clusters are successively merged into new clusters, which are added to the set of clusters, C. When
a pair of clusters is merged, the original clusters are removed from C. Thus, the number of
clusters in C decreases until there is only one cluster remaining, containing all the objects
from X. The hierarchy of clusters is implicitly represented in the nested sets of C.
Figure 4.4 : Hierarchical Algorithm
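The merging procedure described above can be sketched in Python as follows. This is an illustrative version using the single link method with the Manhattan distance; the function names and the five sample points are hypothetical and are not the data of the example that follows.

```python
from itertools import combinations

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def single_link(c1, c2, points):
    """Shortest distance between any object of c1 and any object of c2."""
    return min(manhattan(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [frozenset([i]) for i in range(len(points))]  # one cluster per object
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair; ties go to the first pair found.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(pair[0], pair[1], points))
        merges.append((sorted(a), sorted(b), single_link(a, b, points)))
        # Remove the two originals and add the merged cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Hypothetical input: five points, indexed 0 to 4.
X = [(0, 0), (1, 0), (5, 5), (5, 6), (9, 0)]
for left, right, distance in agglomerate(X):
    print(left, "+", right, "at distance", distance)
```

For n objects the history contains n-1 merges, and reading it from last to first traverses the hierarchy from the single top-level cluster downward.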
Example: Suppose the input to the simple agglomerative algorithm described above is the
set X, shown in Fig. 4.5 represented in matrix and graph form. We will use the Manhattan
distance function and the single link method for calculating distance between clusters. The
set X contains n = 10 elements, x1 to x10, where x1 = (0,0).
Figure 4.5 : Input to the simple agglomerative algorithm in graph and matrix form
Step 1: Initially, each element xi of X is placed in a cluster ci, where ci is a
member of the set of clusters C.
C = {{x1},{x2},{x3}, {x4},{x5},{x6},{x7}, {x8},{x9},{x10}}
Step 2: Set l = 11.
Step 3: (First iteration of while loop) C.size = 10
The minimum single link distance between two clusters is 1. This occurs in two places,
between c2 and c10 and between c3 and c10.
Depending on how our minimum function works we can choose either pair of clusters.
Arbitrarily we choose the first.
(cmin1,cmin2) = (c2,c10)
Since l = 11, c11 = c2 U c10 = {{x2},{x10}}
Remove c2 and c10 from C.
Add c11 to C.
C = {{x1},{x3}, {x4},{x5},{x6},{x7}, {x8},{x9},{{x2}, {x10}}}
Set l = l + 1 = 12.
Step 3: (Second iteration) C.size = 9
The minimum single link distance between two clusters is 1. This occurs
between c3 and c11 because the distance between x3 and x10 is 1, where x10 is in c11.
(cmin1,cmin2) = (c3,c11)
c12 = c3 U c11 = {{{x2},{x10}},{x3}}
Remove c3 and c11 from C.
Add c12 to C.
C = {{x1}, {x4},{x5},{x6},{x7}, {x8},{x9},{{{x2}, {x10}}, {x3}}}
Set l = 13.
Step 3: (Third iteration) C.size = 8
(cmin1,cmin2) = (c1,c12)
C = {{x4},{x5},{x6},{x7}, {x8},{x9},{{{{x2}, {x10}}, {x3}},{x1}}}
Step 3: (Fourth iteration) C.size = 7
(cmin1,cmin2) = (c4,c8)
C = {{x5},{x6},{x7}, {x9},{{{{x2}, {x10}}, {x3}},{x1}},{{x4},{x8}}}
Step 3: (Fifth iteration) C.size = 6
(cmin1,cmin2) = (c5,c7)
C = {{x6}, {x9},{{{{x2}, {x10}}, {x3}},{x1}},{{x4},{x8}}, {{x5},{x7}}}
Step 3: (Sixth iteration) C.size = 5
(cmin1,cmin2) = (c9,c13)
C = {{x6}, {{x4},{x8}}, {{x5},{x7}},{{{{{x2}, {x10}}, {x3}},{x1}},{x9}}}
Step 3: (Seventh iteration) C.size = 4
(cmin1,cmin2) = (c6,c15)
C = {{{x4},{x8}}, {{{{{x2}, {x10}}, {x3}},{x1}},{x9}},{{x6}, {{x5},{x7}}}}
Step 3: (Eighth iteration) C.size = 3
(cmin1,cmin2) = (c14,c16)
C = { {{x6}, {{x5},{x7}}}, {{{x4},{x8}}, {{{{{x2}, {x10}},
{x3}},{x1}},{x9}}}}
Step 3: (Ninth iteration) C.size = 2
(cmin1,cmin2) = (c17,c18)
C = {{{{x4},{x8}}, {{{{{x2}, {x10}}, {x3}},{x1}},{x9}}}, {{x6},
{{x5},{x7}}}}
Step 3: (Tenth iteration) C.size = 1. Algorithm done.
The clusters created by this algorithm can be seen in Fig. 4.6. The corresponding
dendrogram formed from the hierarchy in C is shown in Fig. 4.7. The points which appeared
most closely together on the graph of input data in Fig. 4.5 are grouped together more
closely in the hierarchy.
Figure 4.6 : Clusters
Figure 4.7 : Dendrogram
4.4 Association rules:
Association rule mining finds interesting associations and/or correlation relationships
among large sets of data items. Association rules show attribute-value conditions that occur
frequently together in a given dataset. A typical and widely-used example of association
rule mining is Market Basket Analysis.
Market Basket Analysis is a modeling technique based on
the theory that if you buy a certain group of items then you are more (or less) likely to
buy another group of items. The set of items a customer buys is referred to as an item
set, and market basket analysis seeks to find relationships between purchases.
Typically, the relationship will be in the form of a rule:
IF {bread} THEN {butter}.
This rule captures hidden information: a customer who buys bread is also likely to
buy butter to go with it.
Given a set of transactions, the goal of association rule mining is to find the rules that allow
us to predict the occurrence of a specific item based on the occurrences of the other items
in the transaction. An association rule consists of two parts:
an antecedent (if) and a consequent (then)
An antecedent is something found in data, and a consequent is something located in
conjunction with the antecedent.
For a quick understanding, consider the following association rule:
“If a customer buys bread, there is a 70% likelihood that he also buys milk.”
Bread is the antecedent in the given association rule, and milk is the consequent.
Association rules are evaluated at two levels:
▪ Support Level
▪ Confidence Level
4.4.1 Rules for Support Level
The support level is the minimum percentage of instances in the database that contain all
items listed in a given association rule.
▪ Support of an item set
Let T be the set of all transactions under consideration, e.g., let T be the set of all
“baskets” or “carts” of products bought by the customers from a supermarket – say
on a given day. The support of an item set S is the percentage of those transactions
in T which contain S. In the supermarket example this is the percentage of “baskets”
that contain a given set S of products, for example S = {bread, butter, milk}. If U
is the set of all transactions that contain all items in S, then
Support(S) = (|U|/|T|) * 100%
where |U| and |T| are the number of elements in U and T, respectively. For example,
if a customer buys the set X = {milk, bread, apples, banana, sausages, cheese,
onions, potatoes} then S is obviously a subset of X, and hence X is in U. If there are
318 customers and 242 of them buy such a set X or a similar one that contains S,
then Support(S) = (242/318) * 100% = 76.1%.
4.4.2 Rules for Confidence Level
For a rule “If A then B”, rule confidence is the conditional probability that B is true when A
is known to be true.
To evaluate association rules, the confidence of an association rule R = “A and B –> C” is
the support of the set of all items that appear in the rule divided by the support of the
antecedent of the rule, i.e.,
Confidence (R) = (support ({A, B, C})/support ({A, B})) *100%
More intuitively, the confidence of a rule is the number of cases in which the rule is
correct relative to the number of cases in which it is applicable.
For example, let R = “butter and bread –> milk”.
If a customer buys butter and bread, then the rule is applicable and it says that he/she can
be expected to buy milk. If he/she does not buy butter or does not buy bread or buys neither,
then the rule is not applicable and thus (obviously) does not say anything about this
customer.
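Support and confidence can be computed directly from a list of transactions. The sketch below assumes each transaction is stored as a Python set; the five baskets are hypothetical and are used only to illustrate the rule “butter and bread –> milk”.

```python
def support(itemset, transactions):
    """Percentage of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of all items in the rule divided by support of the antecedent, in percent."""
    applicable = support(antecedent, transactions)
    if applicable == 0:
        return None  # the rule is never applicable, so it says nothing
    return 100.0 * support(antecedent | consequent, transactions) / applicable

# Hypothetical baskets, one set per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk", "bread"},
    {"milk"},
]

print(support({"bread", "butter"}, baskets))               # rule is applicable
print(support({"bread", "butter", "milk"}, baskets))       # rule is correct
print(confidence({"bread", "butter"}, {"milk"}, baskets))
```

Here the rule is applicable in three of the five baskets and correct in two of them, so its confidence is two thirds.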
Apriori algorithm
The first algorithm to generate all frequent sets and confident association rules was the
AIS algorithm by Agrawal et al., which was given together with the introduction of this mining
problem. Shortly after that, the algorithm was improved and renamed Apriori by Agrawal
et al. by exploiting the monotonic property of the frequency of item sets and the confidence
of association rules.
4.4.3 Frequent item set mining problem
A transactional database consists of a sequence of transactions: T = (t1, …, tn). A transaction
is a set of items (t ⊆ I, where I is the set of all items). Transactions are often called baskets, referring to the primary
application domain (i.e., market-basket analysis). A set of items is often called an item set
by the data mining community. The (absolute) support or the occurrence of X (denoted by
Supp(X)) is the number of transactions that are supersets of X (i.e., that contain X). The
relative support is the absolute support divided by the number of transactions (i.e., n).
An item set is frequent if its support is greater than or equal to a threshold value.
4.4.4 Association rule mining problem
Beyond frequent item sets, association rules can also be mined. An association rule is like an
implication: X –> Y means that if item set X occurs in a transaction, then item set Y also
occurs with high probability. This probability is given by the confidence of the rule. It is
like an approximation of p(Y|X): it is the number of transactions that contain both X and
Y divided by the number of transactions that contain X, thus conf(X–>Y) =
Supp(XUY)/Supp(X). An association rule is valid if its confidence and support are greater
than or equal to the corresponding threshold values.
In the frequent item set mining problem, a transaction database and a relative support
threshold (traditionally denoted by min_supp) are given and we have to find all frequent
item sets.
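For a very small database, both problems can be solved by brute force, which makes the definitions concrete. The sketch below is not an efficient algorithm: it simply tries every possible item set. The function names, the threshold name min_conf and the four transactions are assumptions made for illustration.

```python
from itertools import combinations

def supp(itemset, transactions):
    """Absolute support: number of transactions that are supersets of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_supp):
    """All item sets whose relative support is at least min_supp."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = supp(frozenset(combo), transactions)
            if s / n >= min_supp:
                result[frozenset(combo)] = s
    return result

def valid_rules(freq, min_conf):
    """Rules X -> Y built from frequent item sets with conf(X -> Y) >= min_conf."""
    rules = []
    for itemset, s in freq.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                x = frozenset(lhs)
                conf = s / freq[x]  # Supp(X U Y) / Supp(X); X is frequent too
                if conf >= min_conf:
                    rules.append((set(x), set(itemset - x), conf))
    return rules

# Hypothetical transaction database.
T = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
freq = frequent_itemsets(T, min_supp=0.5)
print(freq)
print(valid_rules(freq, min_conf=0.9))
```

The lookup freq[x] is safe because every subset of a frequent item set is itself frequent, which is the monotonic property that Apriori exploits.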
Applications of Association Rule Mining
Some of the applications of Association Rule Mining are as follows:
1) Market-Basket Analysis
In most supermarkets, data is collected using barcode scanners. This database is called the
“market basket” database. It contains a large number of past transaction records. Every
record contains the names of all the items a customer purchases in one transaction. From
this data, the stores learn the inclinations and item choices of their customers,
and according to this information, they decide the store layout and optimize the cataloging
of different items.
A single record contains a list of all the items purchased by a customer in a single
transaction. Knowing which groups are inclined toward which set of items allows these
stores to adjust the store layout and catalog to place them optimally next to one another.
2) Medical Diagnosis
Association rules in medical diagnosis can help physicians diagnose and treat patients.
Diagnosis is a difficult process with many potential errors that can lead to unreliable
results. You can use relational association rule mining to determine the likelihood of illness
based on various factors and symptoms. This application can be further expanded using
some learning techniques on the basis of symptoms and their relationships in accordance
with diseases.
3) Census Data
The concept of Association Rule Mining is also used in dealing with the massive amount
of census data. If properly aligned, this information can be used in planning efficient public
services and businesses.
Algorithms of Association Rule Mining
Some of the algorithms which can be used to generate association rules are as follows:
▪ Apriori Algorithm
▪ Eclat Algorithm
▪ FP-Growth Algorithm
4.5 Apriori Algorithm:
The Apriori algorithm is used to find association rules between objects, that is, how
two or more objects are related to one another. In other words, Apriori is an
association rule learning algorithm that analyzes whether people who bought product A
also bought product B.
Apriori steps are as follows:
▪ Count item occurrences to determine the frequent item sets.
▪ Generate candidate item sets.
▪ Count the support of the candidate item sets; the pruning process ensures that the
subsets of each candidate are already known to be frequent item sets.
▪ Use the frequent item sets to generate the desired rules.
The primary objective of the Apriori algorithm is to create association rules between
different objects. An association rule describes how two or more objects are related to one
another. Apriori is a frequent pattern mining algorithm. Generally, you operate
the Apriori algorithm on a database that consists of a huge number of transactions. Let's
understand the Apriori algorithm with the help of an example: suppose you go to Big Bazar
and buy different products. Knowing which products are bought together helps the customers
buy their products with ease and increases the sales performance of the Big Bazar.
We take an example to understand the concept better. You must have noticed that a pizza
shop seller makes a pizza, soft drink, and breadstick combo together. He also offers a
discount to customers who buy these combos. Have you ever thought about why he does
so? He thinks that customers who buy pizza also buy soft drinks and breadsticks.
By making combos, he makes it easy for the customers. At the same time, he
also increases his sales performance.
Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled
together. It shows that the shopkeeper makes it comfortable for the customers to buy these
products in the same place.
The above two examples are the best examples of Association Rules in Data Mining. It
helps us to learn the concept of apriori algorithms.
The Apriori algorithm is used for mining frequent product sets and
relevant association rules. Generally, it operates on a database
containing a huge number of transactions, for example, the items customers buy at a Big
Bazar.
The Apriori algorithm helps the customers to buy their products with ease and increases the
sales performance of the particular store.
4.5.1 Components of Apriori algorithm
The given three components comprise the apriori algorithm.
1. Support
2. Confidence
3. Lift
Support
Support refers to the default popularity of any product. You find the support as the quotient
of the number of transactions comprising that product divided by the total number
of transactions. Hence, we get
Support (Biscuits) = (Transactions containing biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought
chocolates. So, you need to divide the number of transactions that comprise both
biscuits and chocolates by the number of transactions involving biscuits to get the confidence.
Hence,
Confidence = (Transactions containing both biscuits and chocolates) / (Total transactions
involving biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits also bought chocolates.
Lift
Consider the above example; lift refers to the increase in the ratio of the sale of
chocolates when you sell biscuits. The mathematical equation of lift is given below.
Lift = (Confidence (Biscuits –> Chocolates)) / (Support (Chocolates))
= 50/10 = 5
This calculation assumes that chocolates have the same 10 percent support as biscuits.
It means that customers who buy biscuits are five times more likely to buy chocolates
than customers in general. If the lift value is below one, people are unlikely to buy
both the items together. The larger the value, the better the combination.
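The three measures can be computed from raw transaction counts as sketched below. Lift is taken here in its commonly used form, the confidence of the rule divided by the support of the consequent. The counts of 4000 transactions, 400 with biscuits and 200 with both come from the example above; the count of 400 transactions with chocolates is an assumption, since the text does not state it.

```python
def measures_from_counts(n_total, n_a, n_b, n_both):
    """Support of A, confidence of A -> B and lift of A -> B from transaction counts."""
    support_a = n_a / n_total
    support_b = n_b / n_total
    confidence_ab = n_both / n_a
    lift_ab = confidence_ab / support_b
    return support_a, confidence_ab, lift_ab

# n_b = 400 (transactions with chocolates) is assumed, not given in the text.
support_biscuits, confidence, lift = measures_from_counts(
    n_total=4000, n_a=400, n_b=400, n_both=200
)
print(f"Support(Biscuits) = {support_biscuits:.0%}")
print(f"Confidence(Biscuits -> Chocolates) = {confidence:.0%}")
print(f"Lift = {lift:.1f}")
```

A lift above one means the two products are bought together more often than would be expected if the purchases were independent; a lift below one means less often.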
How does the Apriori Algorithm work in Data Mining?
We will understand this algorithm with the help of an example
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk,
Apple}. The database represented in Table 4.1 comprises six transactions where 1
represents the presence of the product and 0 represents the absence of the product.
Table 4.1 : Product Set P as database
The Apriori Algorithm makes the given assumptions
▪ All subsets of a frequent item set must be frequent.
▪ All supersets of an infrequent item set must be infrequent.
▪ Fix a threshold support level. In our case, we have fixed it at 50 percent.
Step 1
Make a frequency table of all the products that appear in all the transactions. Now, shorten
the frequency table to include only those products with a support level of over 50
percent. We find the given frequency table represented in Table 4.2.
Table 4.2 : Frequency Table
The above table indicates the products frequently bought by the customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table represented in the table 4.3.
Table 4.3 : Frequency table
Step 3
Apply the same threshold support of 50 percent and keep only the pairs that
exceed it. In our case, the frequency must be more than 3.
Thus, we get RP, RO, PO, and PM.
Step 4
Now, look for a set of three products that the customers buy together. We get the given
combination.
RP and RO give RPO
PO and PM give POM
Step 5
Calculate the frequency of these two item sets, and you will get the given frequency table.
Table 4.4 : No. of Frequencies
If you apply the threshold, you can figure out that the set of three products
that customers frequently buy together is RPO. We have considered an easy example to
discuss the Apriori algorithm in data mining. In reality, you find thousands of such combinations.
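The level-by-level procedure of Steps 1 to 5 can be sketched as a small Python function. The six transactions below are hypothetical and are not the contents of Table 4.1; the function name apriori and its parameters are likewise chosen for illustration. The sketch treats an item set as frequent when its support is at least 50 percent, and it prunes a candidate before counting if any of its subsets is infrequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent item set with its frequency (absolute support)."""
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k for c, k in counts.items() if k / n >= min_support}

    # Step 1: frequency table of the single products.
    level = frequent({frozenset([item]) for t in transactions for item in t})
    all_frequent = dict(level)
    k = 2
    while level:
        previous = set(level)
        # Join: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: every (k-1)-item subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k - 1))}
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent

# Hypothetical database of six transactions over P = {Rice, Pulse, Oil, Milk, Apple}.
baskets = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil"},
    {"Rice", "Pulse", "Oil", "Milk", "Apple"},
]
result = apriori(baskets, min_support=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Because of the prune step, a candidate such as {Pulse, Oil, Milk} is discarded without scanning the database as soon as one of its pairs is known to be infrequent.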
4.6 Real Time Applications and Future Scope.
Today, there is hardly a field where data mining is not applied. From
marketing to the medical field, data mining tools and techniques have found a niche.
They have become the “Hobson’s choice” in a number of industrial applications.
4.6.1 Data Mining in the Banking Sector
Worldwide, the banking sector is ahead of many other industries in using mining techniques
on its vast customer databases. Although banks have employed statistical analysis tools
with some success for several years, previously unseen patterns of customer behavior are
now coming into clear focus with the aid of new data mining tools. Statistical tools
and even OLAP find the answers, but more advanced data mining tools provide
insight into the answers. Some of the applications of data mining in this industry are:
(i) Predict customer reaction to the change of interest rates
(ii) Identify customers who will be most receptive to new product offers
(iii) Identify “loyal” customers
(iv) Pin point which clients are at the highest risk for defaulting on a loan
(v) Find out persons or groups who will opt for each type of loan in the following
year
(vi) Detect fraudulent activities in credit card transactions.
(vii) Predict clients who are likely to change their credit card affiliation in the next
quarter
(viii) Determine customer preference of the different modes of transaction namely
through teller or through credit cards, etc.
4.6.2 Data Mining in Bio-Informatics and Biotechnology
Bio-informatics is a rapidly developing research area that has its roots both in biology and
information technology.
Various applications of data mining techniques in this field are:
▪ prediction of structures of various proteins
▪ determining the intricate structures of several drugs
▪ mapping of DNA structure with base to base accuracy
4.6.3 Data Mining in the Insurance Sector
Insurance companies can benefit from modern data mining methodologies, which help
companies to reduce costs, increase profits, retain current customers, acquire new
customers, and develop new products. This can be done through:
▪ Evaluating the risk of the assets being insured taking into account the
characteristics of the asset as well as the owner of the asset.
▪ Formulating Statistical Modeling of Insurance Risks
▪ Using the Joint Poisson/Log-Normal Model of mining to optimize
insurance policies
▪ Finding the actuarial Credibility of the risk groups among insurers.
4.6.4 Data Mining in the Retail Industries
Data mining techniques have gone hand in glove with CRM (Customer Relationship
Management) by developing models for
▪ predicting the propensity of customers to buy
▪ assessing risk in each transaction
▪ knowing the geographical location and distribution of customers
▪ analyzing customer loyalty in credit-based transactions
▪ Assessing the competitive threat in a locality.
4.6.5 Data Mining in E-Commerce and the World Wide Web
Sophisticated or not, various forms of data-mining development are being undertaken by
companies looking to make sense of the raw data that has been mounting relentlessly in
recent years. E-commerce is the newest and hottest field to use data mining techniques. A
recent article in the Engineering News-Record noted that e-commerce has empowered
companies to collect vast amounts of data on customers, everything from the number of
Web surfers in a home to the value of the cars in their garage. This was possible, the article
says, with the use of data mining techniques. A few of the ways in which data mining tools
find use in e-commerce are:
▪ By formulating market tactics in business operations.
▪ By automating business interactions with customers, so that customers can
transact with all the players in supply chain.
▪ By developing common market place services using a shared external ICT
▪ Infrastructure to manage suppliers and logistics and implement electronic
match
making mechanisms.
▪ This is widely used in today’s web world.
4.6.6. Data Mining in Stock Market and Investment
The rapid evolution of computer technology in the last few decades has provided
investment professionals (and amateurs) with the capability to access and analyze
tremendous amounts of financial data. Data archaeology, another name for data mining,
sifts through this ocean of stock market data to extract the most valuable information,
in ways such as:
(i) Helping stock market researchers to predict future stock price movements.
(ii) Mining past prices and related variables to help discover stock
market anomalies such as the hawala scandal.
4.6.7 Data Mining in Supply Chain Analysis
Supply chain analysis is the analysis of the various data regarding the
transactions between the supplier and the purchaser and the use of this analysis to the
advantage of either or both of them.
Data mining techniques have found wide application in supply chain analysis. It is of use
to the supplier in the following ways:
(i) It analyses the process data to manage buyer rating.
(ii) It mines payment data to advantageously update pricing policies
(iii) Demand analysis and forecasting helps the supplier to determine the optimum
levels of stocks and spare parts.
Coming to the purchaser side in supply chain analysis, data mining techniques help them
by:
(i) Knowing vendor rating to choose the beneficial supplier.
(ii) Analyzing fulfillment data to manage volume purchase contracts.
4.7 FUTURE SCOPE
The era of DBMS began with the relational data model and SQL. At present, data mining is
little more than a set of tools to uncover hidden information from a database. While there
are many such tools at present, there is no all-encompassing model or approach. Over
the next few years, not only will there be more efficient algorithms with better interface
techniques, but steps will also be taken to develop an all-encompassing model for data
mining. While it may not look like the relational model, it will probably include similar
items: algorithms, a data model and metrics for goodness. The manual definition of requests
and interpretation of results required by current data mining tools may decrease as
automation increases. As data mining applications are of diverse types, development of a
“complete” data mining model is desirable. A major development will be the creation of a
sophisticated “query language” that includes everything from SQL functions to data mining
applications.
Answer the Questions
1. Define clustering.
2. What is a distance function?
3. List the applications of clustering.
4. Explain the K-Means algorithm with an example.
5. Write notes on hierarchical clustering.
6. Elaborate on association rules with an example.
7. What is market-basket analysis?
8. What is support?
9. What is confidence?
10. Explain Apriori algorithm with example.
11. Write notes on the three components of the Apriori algorithm.
12. Elaborate on the real time applications of data mining.
MODULE V
5.1 Introduction
Data Warehousing (DW) is the process of collecting and managing data from varied
sources to provide meaningful business insights. A data warehouse is typically used to
connect and analyze business data from heterogeneous sources. The data warehouse is the
core of the BI system, which is built for data analysis and reporting.
It is a blend of technologies and components which aids the strategic use of data. It is
the electronic storage of a large amount of information by a business, designed for
query and analysis instead of transaction processing. It is a process of transforming data
into information and making it available to users in a timely manner to make a difference.
A data warehouse system is also known by the following names:
▪ Decision Support System (DSS)
▪ Executive Information System
▪ Management Information System
▪ Business Intelligence Solution
▪ Analytic Application
▪ Data Warehouse
5.2 Goals
▪ The Data Warehouse must assist in decision making process
▪ The Data Warehouse must meet the requirements of the business community
▪ The Data Warehouse must provide easy access to information
▪ The Data Warehouse must present information consistently and accurately
▪ The Data Warehouse must be adaptive and resilient to change
▪ The Data Warehouse must provide a secured access to information
▪ To help reporting as well as analysis
▪ Maintain the organization's historical information
▪ Be the foundation for decision making.
5.3 Data Warehouse Features
The key features of a data warehouse are discussed below :
▪ Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations, rather it focuses on modelling
and analysis of data for decision making.
▪ Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.
▪ Time Variant − The data collected in a data warehouse is identified with a particular
time period. The data in a data warehouse provides information from the historical
point of view.
▪ Non-volatile − Non-volatile means the previous data is not erased when new data
is added to it. A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not reflected in the data
warehouse.
5.4 Three-Tier Data Warehouse Architecture
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers
of the data warehouse architecture.
▪ Bottom Tier − The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use the back end tools and utilities
to feed data into the bottom tier. These back end tools and utilities perform the
Extract, Clean, Load, and refresh functions.
▪ Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
▪ By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional data
to standard relational operations.
▪ By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
▪ Top-Tier − This tier is the front-end client layer. This layer holds the query tools
and reporting tools, analysis tools and data mining tools.
Figure 5.1 depicts the three-tier architecture of the data warehouse.
Figure 5.1 : Three-tier Architecture of the data warehouse.
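The middle-tier idea that ROLAP maps operations on multidimensional data to standard relational operations can be illustrated with a small sketch. It assumes a hypothetical `sales` table with two dimensions (`city`, `quarter`) and one measure (`amount`); the table, the figures and the helper `totals_by` are illustrative only and are not taken from any particular OLAP product.

```python
import sqlite3

# Hypothetical fact table held in a relational database (bottom tier).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Chennai", "Q1", 100), ("Chennai", "Q2", 150),
     ("Delhi", "Q1", 200), ("Delhi", "Q2", 50)],
)

def totals_by(dimension):
    """Aggregate the measure along one dimension using a plain SQL GROUP BY."""
    # Column names cannot be bound as SQL parameters, so check against a whitelist.
    if dimension not in ("city", "quarter"):
        raise ValueError(f"unknown dimension: {dimension}")
    rows = conn.execute(
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return dict(rows.fetchall())

by_city = totals_by("city")        # view the measure along the city dimension
by_quarter = totals_by("quarter")  # view the same measure along the time dimension
```

A MOLAP server would instead keep such totals in a multidimensional array structure rather than computing them with SQL on request.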
5.4.1 Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data
warehouse models −
▪ Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It
is easy to build a virtual warehouse. Building a virtual warehouse requires excess
capacity on operational database servers.
▪ Data Mart
Data mart contains a subset of organization-wide data. This subset of data is
valuable to specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular
group. For example, the marketing data mart may contain data related to items,
customers, and sales. Data marts are confined to subjects.
The key characteristics of data marts are:
• Windows-based or Unix/Linux-based servers are used to implement data marts. They
are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
• The life cycle of a data mart may be complex in the long run if its planning and design
are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
▪ Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire
organization. It provides enterprise-wide data integration. The data is integrated from
operational systems and external information providers. This information can vary from
a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Process Manager
Process managers are responsible for maintaining the flow of data both into and out of
the data warehouse. There are three different types of process managers −
• Load manager
• Warehouse manager
• Query manager
5.5 Load Manager
The load manager performs the following functions, which are represented in figure 5.2.
▪ Extract data from the source system.
▪ Fast load the extracted data into temporary data store.
▪ Perform simple transformations into structure similar to the one in the data
warehouse.
Figure 5.2 : Load Manager
Extract Data from Source
The data is extracted from the operational databases or the external information providers.
Gateways are the application programs that are used to extract data. A gateway is supported
by the underlying DBMS and allows the client program to generate SQL to be executed at a
server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are
examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
▪ Transformations affect the speed of data processing.
▪ It is more effective to load the data into a relational database prior to applying
transformations and checks.
▪ Gateway technology is not suitable, since it is inefficient when large data
volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After completing
simple transformations, we can do complex checks. Suppose we are loading EPOS
(electronic point of sale) sales transactions; we need to perform the following checks:
▪ Strip out all the columns that are not required within the warehouse.
▪ Convert all the values to required data types.
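The two simple transformations listed above can be sketched as follows. This is only an illustration: the EPOS column names (`txn_id`, `store`, `amount`, `cashier_note`, `till_fw`) and the `REQUIRED` mapping are hypothetical and chosen for the example.

```python
# Hypothetical EPOS rows as they might arrive from the source system (all strings).
raw_rows = [
    {"txn_id": "1001", "store": "S01", "amount": "249.50",
     "cashier_note": "ok", "till_fw": "v2"},
    {"txn_id": "1002", "store": "S02", "amount": "80",
     "cashier_note": "", "till_fw": "v2"},
]

# Columns the warehouse needs, each with the type it should be converted to.
REQUIRED = {"txn_id": int, "store": str, "amount": float}

def simple_transform(row):
    """Strip out columns that are not required and convert the rest to the required types."""
    return {column: cast(row[column]) for column, cast in REQUIRED.items()}

# Rows ready for the temporary data store.
staged = [simple_transform(row) for row in raw_rows]
```

More complex checks, such as validating a store code against a reference table, would follow once the rows are in the temporary data store.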
5.6 Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists
of third-party system software, C programs, and shell scripts. The size and complexity
of a warehouse manager vary between specific solutions. The architecture is shown in
figure 5.3.
A warehouse manager includes the following
▪ The controlling process
▪ Stored procedures or C with SQL
▪ Backup/Recovery tool
▪ SQL scripts
Figure 5.3 : Warehouse Manager Architecture
5.6.1 Functions of Warehouse Manager
A warehouse manager performs the following functions
▪ Analyzes the data to perform consistency and referential integrity checks.
▪ Creates indexes, business views, and partition views against the base data.
▪ Generates new aggregations and updates the existing aggregations.
▪ Generates normalizations.
▪ Transforms and merges the source data of the temporary store into the published
data warehouse.
▪ Backs up the data in the data warehouse.
▪ Archives the data that has reached the end of its captured life.
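Two of the functions above, the referential integrity check and the generation of aggregations, can be sketched with plain SQL. The tables `product`, `sales` and `sales_by_product` are hypothetical; a real warehouse manager would run comparable statements from stored procedures or SQL scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount INTEGER);
    INSERT INTO product VALUES (1, 'Pen'), (2, 'Book');
    INSERT INTO sales VALUES (10, 1, 5), (11, 2, 20), (12, 1, 7), (13, 99, 3);
""")

# Referential integrity check: sales rows whose product_id has no matching product.
orphans = conn.execute("""
    SELECT s.sale_id
    FROM sales s
    LEFT JOIN product p ON p.product_id = s.product_id
    WHERE p.product_id IS NULL
""").fetchall()

# Aggregation: build a summary table from the rows that do have a matching product.
conn.execute("""
    CREATE TABLE sales_by_product AS
    SELECT p.name AS name, SUM(s.amount) AS total
    FROM sales s
    JOIN product p ON p.product_id = s.product_id
    GROUP BY p.name
""")
summary = dict(conn.execute("SELECT name, total FROM sales_by_product").fetchall())
```

Here sale 13 refers to a product that does not exist, so it is reported as an orphan and left out of the aggregation.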
5.7 Query Manager
The query manager is responsible for directing the queries to
suitable tables. By directing the queries to appropriate tables, it
speeds up the query request and response process. In addition,
the query manager is responsible for scheduling the execution
of the queries posted by the user.
A query manager includes the following components −
▪ Query redirection via C tool or RDBMS
▪ Stored procedures
▪ Query management tool
▪ Query scheduling via C tool or RDBMS
▪ Query scheduling via third-party software
Figure 5.4 : Query Manager Architecture
Functions of Query Manager
▪ It presents the data to the user in a form they understand.
▪ It schedules the execution of the queries posted by the end-user.
▪ It stores query profiles to allow the warehouse manager to determine which
indexes and aggregations are appropriate.
What is ETL?
ETL is a process that extracts the data from different source systems,
then transforms the data (like applying calculations, concatenations, etc.)
and finally loads the data into the Data Warehouse system. Full form of
ETL is Extract, Transform and Load.
It is tempting to think that creating a data warehouse is simply a matter of extracting
data from multiple sources and loading it into the database of a data warehouse.
This is far from the truth; it requires a complex ETL process. The ETL
process requires active inputs from various stakeholders, including
developers, analysts, testers and top executives, and is technically challenging.
In order to maintain its value as a tool for decision-makers, a data
warehouse system needs to change with business changes. ETL is a
recurring activity (daily, weekly, monthly) of a data warehouse system and
needs to be agile, automated, and well documented.
Why do you need ETL?
There are many reasons for adopting ETL in the organization:
• It helps companies to analyze their business data for taking critical business
decisions.
• Transactional databases cannot answer the complex business questions that
can be answered using ETL.
• A data warehouse provides a common data repository.
• ETL provides a method of moving the data from various sources into a
data warehouse.
• As data sources change, the data warehouse will automatically update.
• A well-designed and documented ETL system is almost essential to the
success of a data warehouse project.
• It allows verification of data transformation, aggregation and calculation
rules.
• The ETL process allows sample data comparison between the source and the
target system.
• The ETL process can perform complex transformations and requires an
extra area to store the data.
• ETL helps to migrate data into a data warehouse, converting the various
formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source data
into the target database.
• ETL in a data warehouse offers deep historical context for the business.
• It helps to improve productivity because it codifies and reuses processes without a
need for technical skills.
Step 1) Extraction
In this step of ETL architecture, data is extracted from the source system into
the staging area. Transformations, if any, are done in the staging area so that the
performance of the source system is not degraded. Also, if corrupted data is copied
directly from the source into the data warehouse database, rollback will be a
challenge. The staging area gives an opportunity to validate extracted data before
it moves into the data warehouse.
Three Data Extraction methods:
▪ Full Extraction
▪ Partial Extraction- without update notification.
▪ Partial Extraction- with update notification
Irrespective of the method used, extraction should not affect the performance and
response time of the source systems. These source systems are live production
databases. Any slowdown or locking could affect the company's bottom line.
Some validations are done during extraction:
▪ Reconcile records with the source data
▪ Make sure that no spam/unwanted data is loaded
▪ Data type check
▪ Remove all types of duplicate/fragmented data
▪ Check whether all the keys are in place or not
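A minimal sketch of some of these extraction-time validations is given below. The field names (`customer_id`, `age`), the sample rows and the rejection reasons are assumptions made for the example; in practice the rules depend on the source system.

```python
def validate_extract(rows, key="customer_id"):
    """Split extracted rows into accepted rows and (row, reason) rejections."""
    seen, accepted, rejected = set(), [], []
    for row in rows:
        value = row.get(key)
        if value is None:                              # key check
            rejected.append((row, "missing key"))
        elif value in seen:                            # duplicate check
            rejected.append((row, "duplicate"))
        elif not isinstance(row.get("age"), int):      # data type check
            rejected.append((row, "bad type"))
        else:
            seen.add(value)
            accepted.append(row)
    return accepted, rejected

# Hypothetical rows pulled into the staging area.
source = [
    {"customer_id": 1, "age": 30},
    {"customer_id": 1, "age": 30},       # duplicate of the first row
    {"customer_id": None, "age": 41},    # key missing
    {"customer_id": 2, "age": "n/a"},    # wrong data type
    {"customer_id": 3, "age": 25},
]

accepted, rejected = validate_extract(source)

# Reconciliation: every source record is accounted for, either accepted or rejected.
reconciled = len(accepted) + len(rejected) == len(source)
```

Keeping the rejected rows together with a reason, rather than silently dropping them, makes it possible to reconcile the staging area against the source.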
Step 2) Transformation
Data extracted from the source server is raw and not usable in its original form.
Therefore it needs to be cleansed, mapped and transformed. In fact, this is the
key step where the ETL process adds value and changes data such that insightful
BI reports can be generated.
It is one of the important ETL concepts, where you apply a set of functions on
extracted data. Data that does not require any transformation is called direct
move or pass-through data.
In the transformation step, you can perform customized operations on data. For
instance, the user may want a sum-of-sales revenue figure which is not in the database,
or the first name and the last name in a table may be in different columns; it is
possible to concatenate them before loading.
Data Integration Issues
Following are data integrity problems:
▪ Different spellings of the same person's name, like Jon, John, etc.
▪ There are multiple ways to denote a company name, like Google, Google
Inc.
▪ Use of different names like Cleaveland, Cleveland.
▪ There may be a case that different account numbers are generated by
various applications for the same customer.
▪ In some data, required fields remain blank.
▪ An invalid product may be recorded at the POS, as manual entry can lead to mistakes.
Validations are done during this stage:
▪ Filtering – Select only certain columns to load.
▪ Using rules and lookup tables for data standardization.
▪ Character set conversion and encoding handling.
▪ Conversion of units of measurement, like date-time conversion,
currency conversions, numerical conversions, etc.
▪ Data threshold validation check. For example, age cannot be more than
two digits.
▪ Data flow validation from the staging area to the intermediate tables.
▪ Required fields should not be left blank.
▪ Cleaning (for example, mapping NULL to 0, or Gender Male to “M”
and Female to “F”, etc.)
▪ Splitting a column into multiple columns and merging multiple columns into a single
column.
▪ Transposing rows and columns.
▪ Using lookups to merge data.
▪ Using any complex data validation (e.g., if the first two columns in a
row are empty, then the row is automatically rejected from processing).
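Several of the transformations and validations listed above can be combined in one small function, sketched here. The lookup tables, the field names and the choice to return `None` for a rejected row are assumptions made for the illustration.

```python
CITY_LOOKUP = {"Cleaveland": "Cleveland"}     # standardization via a lookup table
GENDER_CODES = {"Male": "M", "Female": "F"}   # cleaning rule from the list above

def transform(row):
    """Return a cleaned row, or None if the row fails validation."""
    if not row.get("first_name") or not row.get("last_name"):
        return None                            # required fields must not be blank
    age = row.get("age")
    if age is None or not 0 <= age <= 99:
        return None                            # threshold check: at most two digits
    city = row["city"]
    return {
        "full_name": f'{row["first_name"]} {row["last_name"]}',     # merge two columns
        "gender": GENDER_CODES.get(row["gender"], row["gender"]),
        "city": CITY_LOOKUP.get(city, city),
        "age": age,
        "discount": row["discount"] if row["discount"] is not None else 0,  # NULL -> 0
    }

# Hypothetical staged rows.
rows = [
    {"first_name": "John", "last_name": "Smith", "gender": "Male",
     "city": "Cleaveland", "age": 42, "discount": None},
    {"first_name": "", "last_name": "Doe", "gender": "Female",
     "city": "Cleveland", "age": 130, "discount": 5},
]

cleaned = [t for t in map(transform, rows) if t is not None]
```

The second row is rejected because a required field is blank; it would also fail the age threshold check.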
Step 3) Loading
Loading data into the target data warehouse database is the last step of the ETL
process. In a typical data warehouse, a huge volume of data needs to be loaded
in a relatively short period (nights). Hence, the load process should be optimized
for performance.
In case of load failure, recovery mechanisms should be configured to restart
from the point of failure without loss of data integrity. Data warehouse admins
need to monitor, resume, and cancel loads as per prevailing server performance.
Types of Loading:
▪ Initial Load: populating all the data warehouse tables.
▪ Incremental Load: applying ongoing changes periodically, as and when needed.
▪ Full Refresh: erasing the contents of one or more tables and reloading with
fresh data.
Load verification:
▪ Ensure that the key field data is neither missing nor null.
▪ Test modeling views based on the target tables.
▪ Check the combined values and calculated measures.
▪ Run data checks on the dimension tables as well as the history tables.
▪ Check the BI reports on the loaded fact and dimension tables.
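The three types of loading and one of the verification checks can be sketched against an in-memory SQLite table. The table `dim_customer` and the sample rows are hypothetical, and `INSERT OR REPLACE` is just one simple way to apply changes; real warehouses often use more elaborate merge logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)

def initial_load(rows):
    """Populate the empty table for the first time."""
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

def incremental_load(rows):
    """Apply ongoing changes: new keys are added, existing keys are overwritten."""
    conn.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?)", rows)

def full_refresh(rows):
    """Erase the table contents and reload with fresh data."""
    conn.execute("DELETE FROM dim_customer")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

def snapshot():
    return conn.execute(
        "SELECT customer_id, name FROM dim_customer ORDER BY customer_id"
    ).fetchall()

initial_load([(1, "Asha"), (2, "Ravi")])
after_initial = snapshot()

incremental_load([(2, "Ravi K"), (3, "Meena")])
after_incremental = snapshot()

full_refresh([(7, "Noor")])
after_refresh = snapshot()

# Load verification: the key field must be neither missing nor NULL.
null_keys = conn.execute(
    "SELECT COUNT(*) FROM dim_customer WHERE customer_id IS NULL"
).fetchone()[0]
```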
ETL Tools
There are many ETL tools available in the market. Here are some of the most
prominent ones:
1. MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier
and faster using an array of enterprise features. It can query different types of
data like documents, relationships, and metadata. Data warehouse needs to
integrate systems that have different.
5.8 Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP) is a category of software that allows users to
analyze information from multiple database systems at the same time. It is a technology
that enables analysts to extract and view business data from different points of view.
Analysts frequently need to group, aggregate and join data. These
operations are resource intensive. With OLAP, data can be pre-calculated and
pre-aggregated, making analysis faster.
OLAP databases are divided into one or more cubes. The cubes are designed
in such a way that creating and viewing reports becomes easy.
5.8.1 OLAP cube:
The OLAP Cube consists of numeric facts called measures which are categorized by
dimensions. OLAP Cube is also called the hypercube. It is shown in figure 5.5.
Figure 5.5 : OLAP Cube
Usually, data operations and analysis are performed using a simple
spreadsheet, where data values are arranged in row and column format. This is
ideal for two-dimensional data. However, OLAP contains multidimensional
data, usually obtained from different and unrelated sources, so a
spreadsheet is not an optimal option. The cube can store and analyze
multidimensional data in a logical and orderly manner.
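One simple way to picture a cube in code is a mapping from dimension coordinates to a measure. The sketch below assumes three hypothetical dimensions (item, city, quarter) and a single sales measure; real OLAP servers use far more compact storage.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, city, quarter, sales).
facts = [
    ("Mobile", "Chennai", "Q1", 10),
    ("Mobile", "Delhi", "Q1", 20),
    ("Modem", "Chennai", "Q2", 5),
    ("Mobile", "Chennai", "Q1", 15),   # falls into the same cell as the first row
]

def build_cube(rows):
    """Map each (item, city, quarter) coordinate to its summed measure."""
    cube = defaultdict(int)
    for item, city, quarter, sales in rows:
        cube[(item, city, quarter)] += sales
    return dict(cube)

cube = build_cube(facts)
mobile_chennai_q1 = cube[("Mobile", "Chennai", "Q1")]
```

Each key is one cell of the cube, and the value is the pre-aggregated measure for that cell.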
5.8.2 How does it work?
A data warehouse extracts information from multiple data sources and
formats, like text files, Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. Data is then loaded into an OLAP
server (or OLAP cube), where information is pre-calculated for
further analysis.
The four basic analytical operations of OLAP are:
▪ Roll-up
▪ Drill-down
▪ Slice and dice
▪ Pivot (rotate)
5.8.3 Roll-up:
Roll-up is also known as “consolidation” or “aggregation.” The roll-up
operation can be performed in 2 ways:
▪ Reducing dimensions
▪ Climbing up the concept hierarchy
A concept hierarchy is a system of grouping things based on their order or level.
Consider the following figure.
Figure 5.6 : Roll-up operation in OLAP
▪ In this example, the cities New Jersey and Los Angeles are rolled up into the
country USA.
▪ The sales figures of New Jersey and Los Angeles are 440 and 1560
respectively. They become 2000 after roll-up.
▪ In this aggregation process, data in the location hierarchy moves up from
city to country.
▪ In the roll-up process, one or more dimensions need to be
removed. In this example, the Cities dimension is removed.
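The roll-up in this example can be reproduced in a few lines. The sales figures are the ones quoted above; the `CITY_TO_COUNTRY` mapping and the `roll_up` helper are names introduced only for this sketch.

```python
from collections import defaultdict

# One level of the location concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"New Jersey": "USA", "Los Angeles": "USA"}

# Sales figures quoted in the text.
city_sales = {"New Jersey": 440, "Los Angeles": 1560}

def roll_up(sales, hierarchy):
    """Aggregate a measure by climbing one level up the concept hierarchy."""
    totals = defaultdict(int)
    for member, value in sales.items():
        totals[hierarchy[member]] += value
    return dict(totals)

country_sales = roll_up(city_sales, CITY_TO_COUNTRY)
```

After the roll-up, the city level no longer appears in the result; only the country total of 2000 remains.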
5.8.4 Drill-down
In drill-down data is fragmented into smaller parts. It is the opposite of the
rollup process. It can be done via
▪ Moving down the concept hierarchy
▪ Increasing a dimension
Figure 5.7 : Drill-down operation in OLAP
Consider the diagram in figure 5.7 above. Quarter Q1 is drilled down to the
months January, February, and March. The corresponding sales are also recorded.
In this example, the months dimension is added.
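Drill-down can be sketched as the reverse of roll-up, provided the detailed data is still available. The monthly figures below are hypothetical and do not come from figure 5.7.

```python
# Time concept hierarchy: month -> quarter.
MONTH_TO_QUARTER = {"January": "Q1", "February": "Q1", "March": "Q1", "April": "Q2"}

# Hypothetical sales stored at the most detailed (month) level.
monthly_sales = {"January": 150, "February": 100, "March": 150, "April": 90}

def quarter_total(quarter):
    """The rolled-up view: one figure per quarter."""
    return sum(v for m, v in monthly_sales.items() if MONTH_TO_QUARTER[m] == quarter)

def drill_down(quarter):
    """The drilled-down view: the same quarter broken into its months."""
    return {m: v for m, v in monthly_sales.items() if MONTH_TO_QUARTER[m] == quarter}

q1_total = quarter_total("Q1")
q1_months = drill_down("Q1")
```

The monthly values of the drilled-down view always add up to the quarter total they were derived from.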
5.8.5 Slice:
Here, one dimension is selected, and a new sub-cube is created. The
diagram in figure 5.8 explains how the slice operation is performed:
Figure 5.8 : Slice operation in OLAP
Dimension Time is Sliced with Q1 as the filter. A new cube is created
altogether.
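A slice can be sketched as fixing one dimension of the cube to a single value. The cube below is hypothetical, with coordinates ordered as (item, city, quarter); `slice_cube` is a helper name chosen for the example.

```python
# Hypothetical cube: (item, city, quarter) -> sales.
cube = {
    ("Mobile", "Chennai", "Q1"): 25,
    ("Mobile", "Delhi", "Q1"): 20,
    ("Modem", "Chennai", "Q2"): 5,
}

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the coordinates."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }

# Slice the Time dimension (axis 2) with Q1 as the filter.
q1_slice = slice_cube(cube, axis=2, value="Q1")
```

The result is a new sub-cube with one dimension fewer, here keyed by (item, city) only.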
5.8.6 Dice:
This operation is similar to a slice. The difference is that in dice you select 2 or
more dimensions, which results in the creation of a sub-cube, as shown in figure 5.9.
Figure 5.9 : Dice operation in OLAP
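A dice can be sketched as a filter on two or more dimensions at once. The cube and the selection criteria below are hypothetical; coordinates are ordered as (item, city, quarter).

```python
# Hypothetical cube: (item, city, quarter) -> sales.
cube = {
    ("Mobile", "Chennai", "Q1"): 25,
    ("Mobile", "Delhi", "Q1"): 20,
    ("Modem", "Chennai", "Q2"): 5,
    ("Modem", "Delhi", "Q3"): 8,
}

def dice(cube, criteria):
    """Keep the cells whose coordinate on each listed axis is in the allowed set."""
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[axis] in allowed for axis, allowed in criteria.items())
    }

# Select on two dimensions: city (axis 1) and quarter (axis 2).
sub_cube = dice(cube, {1: {"Chennai"}, 2: {"Q1", "Q2"}})
```

Unlike the slice, the dice keeps all dimensions in the result and only narrows the range of values on each selected dimension.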
5.8.7 Pivot
In Pivot, you rotate the data axes to provide an alternative presentation of the data.
In the following example, the pivot is based on item types.
Figure 5.10 : Pivot operation in OLAP
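For a two-dimensional view, a pivot amounts to swapping the row and column axes. The sketch below uses a hypothetical view with items as rows and cities as columns.

```python
# Hypothetical 2-D view of a cube: rows are items, columns are cities.
view = {
    "Mobile": {"Chennai": 25, "Delhi": 20},
    "Modem": {"Chennai": 5},
}

def pivot(table):
    """Rotate the axes: former column labels become row labels and vice versa."""
    rotated = {}
    for row_label, cells in table.items():
        for col_label, value in cells.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated

pivoted = pivot(view)
```

No values change during a pivot; only the presentation does, so rotating twice gives back the original view.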
5.9 Types of OLAP
▪ Relational OLAP (ROLAP): ROLAP is an extended RDBMS along
with multidimensional data mapping to perform the standard relational
operations.
▪ Multidimensional OLAP (MOLAP): MOLAP implements operations
directly on multidimensional data.
▪ Hybrid Online Analytical Processing (HOLAP): In the HOLAP approach,
the aggregated totals are stored in a multidimensional database while the
detailed data is stored in the relational database. This offers both the data
efficiency of the ROLAP model and the performance of the MOLAP
model.
▪ Desktop OLAP (DOLAP): In Desktop OLAP, a user downloads a part
of the data from the database to their desktop and analyzes it locally.
DOLAP is relatively cheap to deploy, as it offers very few
functionalities compared to other OLAP systems.
▪ Web OLAP (WOLAP): Web OLAP is an OLAP system
accessible via the web browser. WOLAP is a three-tiered architecture.
It consists of three components: client, middleware, and a database
server.
▪ Mobile OLAP: Mobile OLAP helps users to access and analyze
OLAP data using their mobile devices.
▪ Spatial OLAP (SOLAP): SOLAP is created to facilitate management of both
spatial and non-spatial data in a Geographic Information System (GIS).
Fig 5.11 : Types of OLAP
ROLAP
ROLAP works with data that exist in a relational database. Facts and
dimension tables are stored as relational tables. It also allows multidimensional
analysis of data and is the fastest growing OLAP.
Advantages of ROLAP model:
▪ High data efficiency. It offers high data efficiency because query
performance and access language are optimized particularly for the
multidimensional data analysis.
▪ Scalability. This type of OLAP system offers scalability for managing
large volumes of data, and even when the data is steadily increasing.
Drawbacks of ROLAP model:
▪ Demand for higher resources: ROLAP needs high utilization of
manpower, software, and hardware resources.
▪ Aggregate data limitations. ROLAP tools use SQL for all calculations
of aggregate data. However, SQL is not well suited to handling complex
computations.
▪ Slow query performance. Query performance in this model is slow when
compared with MOLAP.
MOLAP
▪ MOLAP uses array-based multidimensional storage engines to display
multidimensional views of data. Basically, it uses an OLAP cube.
Hybrid OLAP
▪ Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers fast
computation of MOLAP and higher scalability of ROLAP. HOLAP uses
two databases.
▪ Aggregated or computed data is stored in a multidimensional OLAP
cube. Detailed information is stored in a relational database.
Benefits of Hybrid OLAP:
▪ This kind of OLAP helps to economize on disk space, and it also
remains compact, which helps to avoid issues related to access speed and
convenience.
▪ HOLAP uses cube technology, which allows faster
performance for all types of data.
▪ ROLAP data is updated instantly, and HOLAP users have access to this
real-time data. MOLAP brings cleaning and conversion of
data, thereby improving data relevance. This brings the best of both worlds.
Drawbacks of Hybrid OLAP:
▪ Greater complexity level: The major drawback of HOLAP systems is
that they support both ROLAP and MOLAP tools and applications. Thus,
they are very complicated.
▪ Potential overlaps: There are higher chances of overlap, especially
in their functionalities.
Advantages of OLAP
▪ OLAP is a platform for all types of business activity, including planning,
budgeting, reporting, and analysis.
▪ Information and calculations are consistent in an OLAP cube. This is a
crucial benefit.
▪ Quickly create and analyze “What if” scenarios.
▪ Easily search the OLAP database for broad or specific terms.
▪ OLAP provides the building blocks for business modeling tools, data
mining tools, and performance reporting tools.
▪ Allows users to slice and dice cube data by various dimensions,
measures, and filters.
▪ It is good for analyzing time series.
▪ Finding some clusters and outliers is easy with OLAP.
▪ It is a powerful online analytical processing and visualization system which
provides faster response times.
Disadvantages of OLAP
▪ OLAP requires organizing data into a star or snowflake schema. These
schemas are complicated to implement and administer.
▪ You cannot have a large number of dimensions in a single OLAP cube.
▪ Transactional data cannot be accessed with an OLAP system.
▪ Any modification in an OLAP cube needs a full update of the cube. This
is a time-consuming process.
Answer the Questions
1. List the goals of the data warehouse.
2. What are the features of a data warehouse?
3. Explain the 3-tier data warehouse architecture.
4. Describe the process manager in detail.
5. Write notes on (i) Load manager (ii) Warehouse manager (iii) Query
manager.
6. What is ETL?
7. List the purposes of ETL.
8. Explain ETL in detail.
9. Elaborate on OLAP and its types in detail.
10. Write notes on the OLAP cube.
MODULE VI
6.1 Dimension Modeling
Dimension modeling reduces the response time of queries compared with relational
(normalized) designs. The concept behind dimensional modeling is all about the
conceptual design. First, let us introduce dimensional modeling and see how it is
different from a traditional data model design. A data model is a representation of how
data is stored in a database; it is usually a diagram of a few tables and the relationships
that exist between them. Dimensional modeling is designed to read, summarize and compute
numeric data from a data warehouse. A data warehouse is an example of a system that
requires a small number of large tables. This is because many users use the application to
read a lot of data. A characteristic of a data warehouse is that data is written once and read
many times over, so the read operation is dominant. For example, consider a data warehouse
that holds customer-related information in a single table. This makes it much easier for
analysts to, say, count the number of customers by country, because the use of fewer tables
in the data warehouse simplifies query processing.
The main objective of dimension modeling is to provide an easy architecture for the end
user to write queries and also, to reduce the number of relationships between the tables and
dimensions hence providing efficient query handling.
Dimensional modeling populates data in a cube as a logical representation with
OLAP data management. The concept was developed by Ralph Kimball. It has “fact” and
“dimension” as its two important concepts. A transaction record is divided into
“facts”, which consist of numerical business transaction data, and “dimensions”, which are
the reference information that gives context to the facts. Facts and
dimensions are explained in more detail in the subsequent sections.
The following are the steps in dimension modeling, as shown in figure 6.1.
1. Identify Business Process
2. Identify Grain (level of detail)
3. Identify dimensions and attributes
4. Build Schema
The model should describe the Why, How much, When/Where/Who and What
of your business process.
Figure 6.1 : Dimension modeling
Step 1: Identify the Business Objectives
Selecting the right business process for which to build a data warehouse and identifying the
business objectives is the first step in dimension modeling. This is a very important step;
getting it wrong can lead to repeated work and software defects.
Step 2: Identifying Granularity
The grain is the level of detail at which the business problem is recorded. Identifying it
means decomposing the large and complex problem into the lowest level of information. For
example, if the data is kept month-wise, the table would contain details of all the months
in a year. The choice depends on the report to be submitted to the management. This affects
the size of the data warehouse.
Step 3: Identifying Dimensions and attributes
The dimensions of the data warehouse can be understood as the entities of the database,
like items, products, date, stocks, time, etc. The primary keys and
foreign keys are also identified and specified in this step.
Step 4: Build the Schema
The database structure, or arrangement of columns in a database table, decides the schema.
There are various popular schemas, like the star, snowflake, and fact constellation schemas.
In summary, the process runs from the selection of the business process to identifying the
finest level of detail of the business transactions.
Identifying the significant dimensions and attributes helps to build the schema.
Strengths of Dimensional Modeling
Following are some of the strengths of Dimensional Modeling:
▪ It provides a simple architecture or schema that various stakeholders, from
warehouse designers to business clients, can understand and handle.
▪ It reduces the number of relationships between different data elements.
▪ It promotes data quality by enforcing foreign key constraints as a form of referential
integrity check on a data warehouse. Dimensional modeling helps the database
administrators to maintain the reliability of the data.
▪ The aggregate functions used in the schemas optimize the performance of queries
posed by the customers. As the data warehouse keeps growing, optimization
becomes a concern, and dimension modeling makes it easier to address.
6.2 DWH Objects
The following types of objects are commonly used in dimensional data warehouse
schemas:
Fact tables are the large tables in your warehouse schema that store business
measurements.
Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables
represent data, usually numeric and additive, that can be analyzed and examined. Examples
include sales, cost, and profit.
Figure 6.2 : Fact and Dimension tables
Dimension tables, also known as lookup or reference tables, contain the relatively static
data in the warehouse. Dimension tables store the information you normally use to constrain
queries. Dimension tables are usually textual and descriptive, and you can use them as the
row headers of the result set. Examples are customers, location, time, suppliers, and
products. These are the elementary objects used to build the schema.
6.3 FACT TABLE & DIMENSION TABLE
Facts and Fact table: A fact is an event. It is a measure that represents a business
transaction together with its associated context data. The fact table contains the primary
keys of the dimension tables used in the business process, which act as foreign keys in the
fact table. It also holds measures to which aggregate functions are applied to evaluate the
business process for some entity. A measure is a numeric attribute of a fact, representing
the performance or behavior of the business relative to the dimensions. The fact table has
fewer columns than a dimension table and is in a more normalized form.
A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables. A fact
table contains either detail-level facts or facts that have been aggregated. Fact tables that
contain aggregated facts are often called SUMMARY TABLES. A fact table usually
contains facts with the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive.
▪ Additive facts can be aggregated by simple arithmetical addition. A common
example of this is sales.
▪ Non-additive facts cannot be added at all. An example of this is averages.
▪ Semi-additive facts can be aggregated along some of the dimensions and not along
others. An example of this is inventory levels, where you cannot tell what a level
means simply by looking at it.
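The three kinds of facts can be illustrated with a minimal sketch. The rows, store names, and values below are invented purely for illustration; they are not taken from any case study in this material.

```python
# Hypothetical daily snapshot rows: (day, store, sales, inventory_level)
rows = [
    ("Mon", "A", 100, 50),
    ("Mon", "B", 200, 80),
    ("Tue", "A", 150, 40),
    ("Tue", "B", 250, 70),
    ("Tue", "C", 300, 60),
]

# Additive: sales can be summed across every dimension (stores and days).
total_sales = sum(r[2] for r in rows)

# Semi-additive: inventory can be summed across stores for a single day,
# but adding Monday's level to Tuesday's level would be meaningless.
inventory_tue = sum(r[3] for r in rows if r[0] == "Tue")

# Non-additive: averages cannot be added or averaged again; the overall
# average has to be recomputed from the underlying values.
avg_by_day = {}
for day in ("Mon", "Tue"):
    day_sales = [r[2] for r in rows if r[0] == day]
    avg_by_day[day] = sum(day_sales) / len(day_sales)
overall_avg = total_sales / len(rows)
average_of_averages = sum(avg_by_day.values()) / len(avg_by_day)
```

Here `overall_avg` and `average_of_averages` differ because the two days contain a different number of rows, which is why an average is treated as non-additive.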
Fact tables contain business event details for summarization. Fact tables are
often very large, containing hundreds of millions of rows and consuming hundreds of
gigabytes or multiple terabytes of storage. Because dimension tables contain records that
describe facts, the fact table can be reduced to columns for dimension foreign keys and
numeric fact values. Text, BLOBs, and denormalized data are typically not stored in the
fact table.
Creating a new fact table: You must define a fact table for each star schema. From a
modeling standpoint, the primary key of the fact table is usually a composite key that is
made up of all of its foreign keys.
Multiple Fact Tables: Multiple fact tables are used in data warehouses that address
multiple business functions, such as sales, inventory, and finance. Each business function
should have its own fact table and will probably have some unique dimension tables. Any
dimensions that are common across the business functions must represent the dimension
information in the same way, as discussed earlier in “Dimension Tables”.
Dimensions and Dimension table: A dimension table is a collection of data that describes
one business dimension. Dimensions provide the contextual background for the facts, and
they are the framework over which OLAP is performed. Dimension tables establish the
context of the facts: the table stores fields that describe the facts. The data in the table
is in denormalized form, so it contains a large number of columns compared to the fact
table. The attributes in a dimension table are used as row and column headings in a
document or query results display.
Each business function will typically have its own schema that contains a fact table, several
conforming dimension tables, and some dimension tables unique to the specific business
function. Such business-specific schemas may be part of the central data warehouse or
implemented as data marts.
Example: In a student registration case study, the fact table for registration to any
particular course can have attributes like student_id, course_id, program_id,
date_of_registration, and fee_id. The course dimension can hold the course name, duration
of the course, etc. The student dimension can hold personal details about the student like
name, address, contact details, etc.
Student Registration
Fact Table (student_id, course_id, program_id, date_of_registration, fee_id)
Measure: Sum (Fee_amount)
Dimension Tables (Student_details, Course_details, Program_details, Fee_details, Date)
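The student registration example can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the column types, the sample rows, and the student and course names are invented, and the Program_details and Date dimensions are left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_details (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course_details  (course_id  INTEGER PRIMARY KEY, course_name TEXT);
CREATE TABLE fee_details     (fee_id     INTEGER PRIMARY KEY, fee_amount REAL);
-- Fact table: foreign keys to the dimensions plus the registration date.
CREATE TABLE student_registration (
    student_id INTEGER REFERENCES student_details(student_id),
    course_id  INTEGER REFERENCES course_details(course_id),
    fee_id     INTEGER REFERENCES fee_details(fee_id),
    date_of_registration TEXT
);
INSERT INTO student_details VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO course_details  VALUES (10, 'Data Mining'), (20, 'Statistics');
INSERT INTO fee_details     VALUES (100, 5000.0), (200, 3000.0);
INSERT INTO student_registration VALUES
    (1, 10, 100, '2023-07-01'),
    (2, 10, 100, '2023-07-02'),
    (2, 20, 200, '2023-07-02');
""")

# Measure: Sum(Fee_amount), grouped by an attribute of the course dimension.
fee_by_course = dict(conn.execute("""
    SELECT c.course_name, SUM(f.fee_amount)
    FROM student_registration r
    JOIN course_details c ON c.course_id = r.course_id
    JOIN fee_details f    ON f.fee_id = r.fee_id
    GROUP BY c.course_name
""").fetchall())
```

The fact table carries only keys and the date; every descriptive attribute used in the query comes from a dimension table.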
6.4 DWH USERS
The success of a data warehouse is measured solely by its acceptance by users. Without
users, historical data might as well be archived to magnetic tape and stored in the basement.
Successful data warehouse design starts with understanding the users and their needs.
Data warehouse users can be divided into four categories: statisticians, knowledge
workers, information consumers, and executives. Each type makes up a portion of the user
population, as illustrated in Figure 6.3.
Figure 6.3 : Data Warehouse Users
▪ Statisticians: There are typically only a handful of statisticians and operations
research types in any organization. Their work can contribute to closed loop
systems that deeply influence the operations and profitability of the company.
▪ Knowledge Workers: A relatively small number of analysts perform the bulk of
new queries and analyses against the data warehouse. These are the users who get
the Designer or Analyst versions of user access tools. They will figure out how to
quantify a subject area. After a few iterations, their queries and reports typically
get published for the benefit of the Information Consumers. Knowledge Workers
are often deeply engaged with the data warehouse design and place the greatest
demands on the ongoing data warehouse operations team for training and support.
▪ Information Consumers: Most users of the data warehouse are Information
Consumers; they will probably never compose a true ad hoc query. They use static
or simple interactive reports that others have developed. This group includes a
large number of people, and published reports are highly visible. Set up a great
communication infrastructure for distributing information widely, and gather
feedback from these users to improve the information sites over time.
▪ Executives: Executives are a special case of the Information Consumers group.
6.5 Data Warehouse Schemas
We can arrange schema objects in the schema models designed for data warehousing in
a variety of ways. Most data warehouses use a dimensional model. The model of your
source data and the requirements of your users help you design the data warehouse
schema. You can sometimes get the source model from your company’s enterprise data
model and reverse-engineer the logical data model for the data warehouse from this. The
physical implementation of the logical data warehouse model may require some changes
to adapt it to your system parameters—size of machine, number of users, storage capacity,
type of network, and software.
6.5.1 Dimensional Model Schemas
The principal characteristic of a dimensional model is a set of detailed business facts
surrounded by multiple dimensions that describe those facts. When realized in a database,
the schema for a dimensional model contains a central fact table and multiple dimension
tables. A dimensional model may produce a star schema or a snowflake schema.
6.6 Star Schemas
A schema is called a star schema if all dimension tables can be joined directly to the fact
table.
The following diagram shows a classic star schema. In the star schema design, a single
object (the fact table) sits in the middle and is radially connected to other surrounding
objects (dimension lookup tables) like a star. A star schema can be simple or complex. A
simple star consists of one fact table; a complex star can have more than one fact table.
Steps in Designing Star Schema
▪ Identify a business process for analysis (like sales).
▪ Identify measures or facts (sales dollar).
▪ Identify dimensions for facts (product dimension, location dimension, time
dimension, organization dimension).
▪ List the columns that describe each dimension (region name, branch name, sub
region name).
▪ Determine the lowest level of summary in a fact table (sales dollar).
Figure 6.4 : Star Schema
▪ Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation; for example, in a time dimension, a
hierarchy might be used to aggregate data from the month level to the quarter level, and
from the quarter level to the year level. A hierarchy can also be used to define a
navigational drill path, regardless of whether the levels in the hierarchy represent
aggregated totals or not.
▪ Level
A position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the month, quarter, and year levels.
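The month, quarter, and year levels of a time hierarchy can be sketched as successive roll-ups. The monthly figures and the helper name `quarter_of` are assumptions made for this illustration only.

```python
from collections import defaultdict

# Hypothetical monthly sales facts keyed by (year, month).
monthly_sales = {
    (2023, 1): 10, (2023, 2): 20, (2023, 3): 30,
    (2023, 4): 40, (2023, 12): 50, (2024, 1): 60,
}


def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


# Roll up from the month level to the quarter level.
quarterly_sales = defaultdict(int)
for (year, month), amount in monthly_sales.items():
    quarterly_sales[(year, quarter_of(month))] += amount

# Roll up from the quarter level to the year level.
yearly_sales = defaultdict(int)
for (year, _quarter), amount in quarterly_sales.items():
    yearly_sales[year] += amount
```

Reading the dictionaries in the opposite order (year, then quarter, then month) corresponds to following a drill path down the same hierarchy.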
6.6.1 Fact Table
A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact table
typically has two types of columns: those that contain facts and those that are foreign keys
to dimension tables. The primary key of a fact table is usually a composite key that is made
up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
usually contains facts with the same level of aggregation.
Features of Star Schema
▪ The data is stored in denormalized form.
▪ It provides quick query response.
▪ A star schema is flexible; it can be changed or extended easily.
▪ It reduces the complexity of metadata for developers and end users.
Example 1: Suppose a star schema is composed of a Sales fact table, as shown in
Figure 6.5, and several dimension tables connected to it for Time, Branch, Item and
Location.
Fact Table
Sales is the fact table. Unit_Sold and Dollars_Sold are the measures.
Dimension Tables
The Time table has columns for day, month, quarter, year, etc.
The Item table has columns for item_key, item_name, brand, type and supplier_type.
The Branch table has columns for branch_key, branch_name and branch_type.
The Location table has columns of geographic data, including street, city, state, and
country.
Figure 6.5 : Sales Fact table
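How a query walks this kind of star can be sketched in plain Python. The sample rows, the key values, and the lower-case field names are invented for illustration, and only the Item and Location dimensions are modeled; Time and Branch would be handled the same way.

```python
from collections import defaultdict

# Dimension tables keyed by their primary keys (sample rows are invented).
item = {
    1: {"item_name": "Laptop", "brand": "Acme"},
    2: {"item_name": "Phone", "brand": "Zenith"},
}
location = {
    10: {"city": "Chennai", "country": "India"},
    20: {"city": "Mumbai", "country": "India"},
}

# Fact rows hold only foreign keys plus the two measures.
sales = [
    {"item_key": 1, "location_key": 10, "units_sold": 2, "dollars_sold": 1800.0},
    {"item_key": 2, "location_key": 10, "units_sold": 5, "dollars_sold": 2500.0},
    {"item_key": 1, "location_key": 20, "units_sold": 1, "dollars_sold": 900.0},
]

# Each fact row reaches every dimension in a single hop through its key.
dollars_by_city_brand = defaultdict(float)
for row in sales:
    city = location[row["location_key"]]["city"]
    brand = item[row["item_key"]]["brand"]
    dollars_by_city_brand[(city, brand)] += row["dollars_sold"]
```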
6.6.2 Advantages of Star Schema
Star schemas are easy for end users and applications to understand and navigate. With a
well-designed schema, users can quickly analyze large, multidimensional data sets. The
main advantages of star schemas in a decision-support environment are:
▪ Query performance: Because a star schema database has a small number of tables
and clear join paths, queries run faster than they do against an OLTP system. Small
single-table queries, usually of dimension tables, are almost instantaneous.
Large join queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are linked only through the central
fact table. When two dimension tables are used in a query, only one join path,
intersecting the fact table, exists between those two tables. This design feature
enforces accurate and consistent query results.
▪ Load performance and administration: Structural simplicity also reduces the time
required to load large batches of data into a star schema database. By defining facts
and dimensions and separating them into different tables, the impact of a load
operation is reduced. Dimension tables can be populated once and occasionally
refreshed. You can add new facts regularly and selectively by appending records
to a fact table.
▪ Built-in referential integrity: A star schema has referential integrity built in when
data is loaded. Referential integrity is enforced because each record in a dimension
table has a unique primary key, and all keys in the fact tables are legitimate foreign
keys drawn from the dimension tables. A record in the fact table that is not related
correctly to a dimension cannot be given the correct key value to be retrieved.
▪ Easily understood: A star schema is easy to understand and navigate, with
dimensions joined only through the fact table. These joins are more significant to
the end user, because they represent the fundamental relationship between parts of
the underlying business. Users can also browse dimension table attributes before
constructing a query.
6.6.3 Disadvantages of Star Schema
As mentioned before, improving read queries and analysis in a star schema could
involve certain challenges:
▪ Decreased data integrity: Because of the denormalized data structure, star
schemas do not enforce data integrity very well. Although star schemas use
countermeasures to prevent anomalies from developing, a simple insert or update
command can still cause data incongruities.
▪ Less capable of handling diverse and complex queries: Database designers build
and optimize star schemas for specific analytical needs. As denormalized data
sets, they work best with a relatively narrow set of simple queries.
Comparatively, a normalized schema permits a far wider variety of more complex
analytical queries.
▪ No many-to-many relationships: Because they offer a simple dimension schema,
star schemas do not work well for many-to-many data relationships.
6.7 Snowflake Schemas
A schema is called a snowflake schema if one or more dimension tables do not join directly
to the fact table but must join through other dimension tables. For example, a dimension
that describes products may be separated into three tables (snowflaked).
Fig 6.6 : Snow Flake Schema
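Separating a product dimension into three tables can be sketched as follows. The choice of product, category, and department as the three tables is an assumption made for this illustration, as are the sample rows and the helper `department_of`.

```python
# Denormalized (star) product dimension; sample rows are invented:
# (product_key, product_name, category_name, department_name)
product_dim = [
    (1, "Espresso beans", "Coffee", "Grocery"),
    (2, "Filter coffee", "Coffee", "Grocery"),
    (3, "Green tea", "Tea", "Grocery"),
]

departments = {}  # department_name -> department_key
categories = {}   # category_name -> (category_key, department_key)
products = []     # (product_key, product_name, category_key)

for product_key, product_name, category_name, department_name in product_dim:
    dept_key = departments.setdefault(department_name, len(departments) + 1)
    if category_name not in categories:
        categories[category_name] = (len(categories) + 1, dept_key)
    products.append((product_key, product_name, categories[category_name][0]))


def department_of(product_key):
    """Resolve a product's department by following the snowflaked joins."""
    category_key = next(p[2] for p in products if p[0] == product_key)
    dept_key = next(d for k, d in categories.values() if k == category_key)
    return next(name for name, k in departments.items() if k == dept_key)
```

Each category and department name is now stored once, but answering "which department does this product belong to" takes two extra lookups, which mirrors the storage versus join trade-off of snowflaking.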
The snowflake schema is an extension of the star schema where each point of the star
explodes into more points. The main advantage of the snowflake schema is the
improvement in query performance due to minimized disk storage requirements and
joining smaller lookup tables. The main disadvantage of the snowflake schema is the
additional maintenance effort needed due to the increased number of lookup tables.
Features of Snowflake Schema
Following are the important features of snowflake schema:
▪ It has normalized tables.
▪ It occupies less disk space.
▪ It requires more lookup time, as many interconnected tables extend the
dimensions.
6.7.1 ADVANTAGES AND DISADVANTAGES OF SNOWFLAKE SCHEMA
Following are the advantages of Snowflake schema:
▪ A snowflake schema occupies a much smaller amount of disk space compared to
the star schema. Less disk space means more convenience and less hassle.
▪ A snowflake schema offers some protection from various data integrity issues.
▪ Many people prefer the snowflake schema because of how safe it is.
▪ Data is easy to maintain and more structured.
▪ Data quality is better than in a star schema.
Disadvantages of Snowflake Schema Architecture
▪ Complex data schemas: As you might imagine, snowflake schemas create many
levels of complexity while normalizing the attributes of a star schema. This
complexity results in more complicated source query joins. In offering a more
efficient way to store data, snowflake can result in performance declines while
browsing these complex joins. Still, processing technology advancements have
resulted in improved snowflake schema query performance in recent years, which
is one of the reasons why snowflake schemas are rising in popularity.
▪ Slower at processing cube data: In a snowflake schema, the complex joins result in
slower cube data processing. The star schema is generally better for cube data
processing.
▪ Lower data integrity levels: While snowflake schemas offer greater normalization
and fewer risks of data corruption after performing UPDATE and INSERT
commands, they do not provide the level of transactional assurance that comes with
a traditional, highly normalized database structure. Therefore, when loading data
into a snowflake schema, it's vital to be careful and double-check the quality of
information post-loading.
Example
Figure 6.7 shows the snowflake schema for a case study of customers, sales, products,
and locations, in which the location-wise quantity sold and the number of items sold are
calculated. The primary keys of the customer, product, date, and store dimensions are
stored in the fact table, where they act as foreign keys.
You will observe that two aggregate functions can be applied to calculate quantity sold
and amount sold. Further, some dimensions are extended: customer into customer type,
and store into territory-wise information. Note that date has been expanded into date,
month, and year. This schema gives more opportunity to perform detailed queries.
Figure 6.7: Snow Flake Schema example
6.7.2 Star Schema Vs Snowflake Schema
Table 6.1: Difference between Star and Snowflake Schema
6.8 FACT CONSTELLATION SCHEMA
The fact constellation is another schema for representing a multidimensional model. The
term fact constellation likens the schema to a galaxy containing several stars. It is a
collection of star schemas whose fact tables have one or more dimension tables in
common, as shown in the figure below. This logical representation is mainly used in
designing complex database systems.
Figure 6.8 : Fact constellation schema
In the above figure, it can be observed that there are two fact tables. The two dimension
tables in the pink boxes are the common dimension tables connecting both the star
schemas.
For example, suppose we are designing a fact constellation schema for university
students. The problem specifies two fact tables:
Placement (Stud_roll, Company_id, TPO_id), used to calculate the number of students
eligible and the number of students placed.
Workshop (Stud_roll, Institute_id, TPO_id), used to find the number of students
selected and the number of students who attended the workshop.
So, there are two fact tables, Placement and Workshop, which are part of two different
star schemas having:
i) dimension tables Company, Student and TPO in the star schema with fact table
Placement
ii) dimension tables Training Institute, Student and TPO in the star schema with fact
table Workshop.
The two star schemas have two dimension tables in common (Student and TPO), hence
forming a fact constellation or galaxy schema.
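The shared dimensions in this example can be checked with a small sketch. The dictionary layout is an assumption made for illustration; the fact and dimension names come from the example.

```python
# Dimensions referenced by each fact table in the university example.
fact_dimensions = {
    "Placement": {"Student", "Company", "TPO"},
    "Workshop": {"Student", "Training Institute", "TPO"},
}

# Dimensions used by both star schemas.
shared = fact_dimensions["Placement"] & fact_dimensions["Workshop"]

# More than one fact table with at least one common dimension.
is_constellation = len(fact_dimensions) > 1 and len(shared) > 0
```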
6.8.1 Advantages and Disadvantages of Fact Constellation Schema
Advantage
This schema is more flexible and gives a wider perspective on the data warehouse
system.
Disadvantage
Because this schema connects two or more fact tables to form a constellation, the
structure is complex to implement and maintain.
Answer the Questions
1. What is dimension modeling?
2. Explain the steps in dimension modeling.
3. List the strengths of dimensional modeling.
4. Differentiate fact table and dimension table.
5. Write notes on (i) fact table (ii) dimension table.
6. Write about the data warehouse users.
7. Explain star schema in detail.
8. Write in detail about snowflake schema.
9. Explain fact constellation schema.
10. Differentiate star schema and snowflake schema.
MODULE VII
7.1 Data Warehouse Partitioning
Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It optimizes
the hardware performance and simplifies the management of data warehouse by
partitioning each fact table into multiple separate partitions. In this chapter, we will discuss
different partitioning strategies.
Partitioning is important for the following reasons −
▪ For easy management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size.
This huge size of fact table is very hard to manage as a single entity. Therefore it
needs partitioning.
▪ To assist backup/recovery
If we do not partition the fact table, then we have to load the complete fact table with
all the data. Partitioning allows us to load only as much data as is required on a
regular basis. It reduces the time to load and also enhances the performance of the
system.
▪ To enhance performance.
By partitioning the fact table into sets of data, the query procedures can be
enhanced. Query performance is enhanced because now the query scans only those
partitions that are relevant. It does not have to scan the whole data.
7.2 Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning,
we have to keep in mind the requirements for manageability of the data warehouse.
Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here
each time period represents a significant retention period within the business. For example,
if the user queries for month-to-date data, then it is appropriate to partition the data into
monthly segments. We can reuse the partitioned tables by removing the data in them.
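Partitioning by time into equal monthly segments can be sketched in a few lines. The fact rows and the helper name `month_to_date_total` are assumptions made for this illustration; a real warehouse would rely on the database's own partitioning feature.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (transaction_date, amount).
facts = [
    (date(2023, 1, 5), 100),
    (date(2023, 1, 20), 50),
    (date(2023, 2, 3), 70),
    (date(2023, 3, 15), 30),
]

# Horizontal partitioning: each row goes, whole, into one monthly segment.
partitions = defaultdict(list)
for txn_date, amount in facts:
    partitions[(txn_date.year, txn_date.month)].append((txn_date, amount))


def month_to_date_total(today):
    """Scan only the partition for the current month, not the whole table."""
    rows = partitions.get((today.year, today.month), [])
    return sum(amount for txn_date, amount in rows if txn_date <= today)
```

A month-to-date query touches exactly one segment, which is the reason the text recommends aligning the segment size with the way users query.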
Partition by Time into Different-sized Segments
This kind of partitioning is done where the aged data is accessed infrequently. It is
implemented as a set of small partitions for relatively current data and larger partitions
for inactive data.
Figure 7.1: Partition by Time into Different-sized Segments
The detailed information remains available online. The number of physical tables is kept
relatively small, which reduces the operating cost.
This technique is suitable where a mix of dipping into recent history and mining
through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis,
because repartitioning will increase the operating cost of the data warehouse.
7.3 Vertical Partitioning
Vertical partitioning splits the data vertically. The following image depicts how vertical
partitioning is done.
Figure 7.2: Vertical Partitioning
Vertical partitioning can be performed in the following two ways −
• Normalization
• Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method,
duplicate rows are collapsed into a single row, which reduces space. Take a look at the
following tables that show how normalization is performed.
Table 7.1: Before Normalization
Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        san         Mumbai     W
45          7    5.66   3-Sep-13    16        sunny       Bangalore  S
Table 7.2: After Normalization
Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W
Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16
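The normalization step can be reproduced in a short sketch using the rows of Table 7.1. The variable names are chosen for illustration only.

```python
# Rows of Table 7.1:
# (product_id, qty, value, sales_date, store_id, store_name, location, region)
before = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]

stores = {}  # store_id -> (store_name, location, region), kept once per store
sales = []   # (product_id, qty, value, sales_date, store_id)
for product_id, qty, value, sales_date, store_id, name, location, region in before:
    stores[store_id] = (name, location, region)
    sales.append((product_id, qty, value, sales_date, store_id))
```

The three repeated descriptions of store 16 collapse into a single entry in `stores`, while `sales` keeps one row per transaction and refers to the store only by `store_id`.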
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to
perform a major join operation between two partitions.
Identify Key to Partition
It is crucial to choose the right partition key. Choosing a wrong partition key will
lead to reorganizing the fact table. Let's take an example. Suppose we want to partition
the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
• region
• transaction_date
Suppose the business is organized in 30 geographical regions and each region has a
different number of branches. That will give us 30 partitions, which is reasonable. This
partitioning is good enough because our requirements capture has shown that a vast
majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transaction from
every region will be in one partition. Now the user who wants to look at data within his
own region has to query across multiple partitions. Hence it is worth determining the
right partitioning key.
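The effect of the partition key on a region-restricted query can be sketched by counting the partitions the query would have to touch. The transactions, the region names, and the helper `partitions_touched` are assumptions made for this illustration.

```python
from datetime import date

# Hypothetical transactions: (transaction_date, region).
txns = [
    (date(2023, 1, 10), "north"),
    (date(2023, 1, 25), "south"),
    (date(2023, 2, 8), "north"),
    (date(2023, 3, 2), "north"),
    (date(2023, 3, 9), "south"),
]


def partitions_touched(rows, key_fn, predicate):
    """Return the set of partitions holding at least one row the query needs."""
    return {key_fn(row) for row in rows if predicate(row)}


def wants_north(row):
    """A query restricted to the user's own region."""
    return row[1] == "north"


by_region = partitions_touched(txns, lambda row: row[1], wants_north)
by_month = partitions_touched(
    txns, lambda row: (row[0].year, row[0].month), wants_north
)
```

With region as the key the query reads one partition; with a monthly date key the same query is spread over three.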
7.4 Hardware Partitioning
When a system is constrained by I/O capabilities, it has an I/O bottleneck. When a system
is constrained by having limited CPU resources, it has a CPU bottleneck. Database architects
frequently use RAID (Redundant Arrays of Inexpensive Disks) systems to overcome I/O
bottlenecks and to provide higher availability. RAID can be implemented in several levels,
ranging from 0 to 7.
RAID is a storage technology often used for databases larger than a few gigabytes. RAID
can provide both performance and fault tolerance benefits. A variety of RAID controllers
and disk configurations offer tradeoffs among cost, performance, and fault tolerance.
Performance
Hardware RAID controllers divide read/writes of all data from Windows NT 4.0 and
Windows 2000 and applications (like SQL Server) into slices (usually 16–128 KB) that
are then spread across all disks participating in the RAID array. Splitting data across
physical drives like this has the effect of distributing the read/write I/O workload evenly
across all physical hard drives participating in the RAID array. This increases disk I/O
performance because the hard disks participating in the RAID array, as a whole are kept
equally busy, instead of some disks becoming a bottleneck due to uneven distribution of
the I/O requests.
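The idea of spreading slices across the disks of an array can be sketched as a round-robin deal. This is only a toy model: real controllers do this in hardware with slices of 16 to 128 KB, whereas the function name `stripe` and the 16-byte slice size below are assumptions chosen to keep the example small.

```python
def stripe(data: bytes, num_disks: int, slice_size: int):
    """Split data into fixed-size slices and deal them round-robin to disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), slice_size):
        slice_index = offset // slice_size
        disks[slice_index % num_disks].append(data[offset:offset + slice_size])
    return disks


# 192 bytes in 16-byte slices gives 12 slices, dealt over 4 disks.
disks = stripe(bytes(192), num_disks=4, slice_size=16)
slices_per_disk = [len(d) for d in disks]
```

Every disk ends up holding the same number of slices, which is the even distribution of I/O workload that the paragraph above describes.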
Fault Tolerance
RAID also provides protection from hard disk failure and accompanying data loss by using
two methods: mirroring and parity.
Mirroring
It is implemented by writing information onto a second (mirrored) set of drives. If there is
a drive loss with mirroring in place, the data for the lost drive can be rebuilt by replacing
the failed drive and rebuilding the mirror set. Most RAID controllers provide the ability to
do this failed drive replacement and re-mirroring while Windows NT 4.0 and Windows
2000 and RDBMS are online. Such RAID systems are commonly referred to as “Hot Plug”
capable drives.
Advantages
It offers the best performance among RAID options if fault tolerance is required. Bear in
mind that each RDBMS write to the mirrorset results in two disk I/O operations, once to
each side of the mirrorset. Another advantage is that mirroring provides more fault
tolerance than parity RAID implementations. Mirroring can enable the system to survive
at least one failed drive and may be able to support the system through failure of up to half
of the drives in the mirrorset without forcing the system administrator to shut down the
server and recover from the file backup.
Disadvantages
The disk cost of mirroring is one extra drive for each drive worth of data. This essentially
doubles your storage cost, which, for a data warehouse, is often one of the most expensive
components needed. Both RAID 1 and its hybrid, RAID 0+1 (sometimes referred to as
RAID 10 or 0/1) are implemented through mirroring.
7.5 Software partitioning Methods
▪ Range Partitioning
▪ Hash Partitioning
▪ Index Partitioning
▪ Composite Partitioning
▪ List Partitioning
Range Partitioning
Range partitioning maps data to partitions based on ranges of partition key values that you
establish for each partition. It is the most common type of partitioning and is often used
with dates. For example, you might want to partition sales data into weekly, monthly or
yearly partitions.
Range partitioning is defined by the partitioning specification for a table or index:
PARTITION BY RANGE (column_list)
and by the partitioning specifications for each individual partition:
VALUES LESS THAN (value_list)
where:
▪ column_list is an ordered list of columns that determines the partition to which a
row or an index entry belongs. These columns are called the partitioning columns.
The values in the partitioning columns of a particular row constitute that row’s
partitioning key.
▪ value_list is an ordered list of values for the columns in the column list.
Only the VALUES LESS THAN clause is allowed. This clause specifies a non-inclusive
upper bound for the partitions. All partitions, except the first, have an implicit low value
specified by the VALUES LESS THAN literal on the previous partition. Any binary values
of the partition key equal to or higher than this literal are added to the next higher partition.
The highest partition is the one where the MAXVALUE literal is defined. The keyword
MAXVALUE represents a virtual infinite value that sorts higher than any other value for
the data type, including the null value.
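The non-inclusive upper bound rule can be sketched with a sorted list of bounds. The partition names, the month-number bounds, and the use of infinity to stand in for MAXVALUE are assumptions made for this illustration; null handling is not modeled.

```python
from bisect import bisect_right

MAXVALUE = float("inf")  # stands in for the MAXVALUE keyword

# Each partition is described by its non-inclusive upper bound
# (VALUES LESS THAN ...). Names and bounds are hypothetical.
partition_names = ["p_q1", "p_q2", "p_q3", "p_rest"]
upper_bounds = [4, 7, 10, MAXVALUE]  # month numbers


def partition_for(month):
    """Return the first partition whose upper bound is greater than the key."""
    return partition_names[bisect_right(upper_bounds, month)]
```

A key equal to a bound, such as month 4, falls into the next higher partition, because the bound itself is excluded from the partition it closes.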
Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies
to a partitioning key that you identify. The hashing algorithm evenly distributes rows
among partitions, giving partitions approximately the same size. Hash partitioning is the
ideal method for distributing data evenly across devices. Hash partitioning is a good and
easy-to-use alternative to range partitioning when data is not historical and there is no
obvious column or column list where logical range partition pruning can be advantageous.
Oracle uses a linear hashing algorithm; to prevent data from clustering within specific
partitions, you should define the number of partitions as a power of two (for example, 2,
4, 8). For example, a table sales_hash can be hash partitioned on the salesman_id field,
with its partitions stored in the tablespaces data1, data2, data3, and data4.
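The even spread that hash partitioning aims for can be sketched with a toy hash function. The modulo hash below is an assumption made for illustration and is not the algorithm Oracle uses; the salesman IDs are invented.

```python
from collections import Counter

NUM_PARTITIONS = 4  # a power of two, in line with the advice above


def hash_partition(salesman_id: int) -> int:
    """Toy hash: spread integer keys over the partitions with a modulo."""
    return salesman_id % NUM_PARTITIONS


# Distribution of 1000 consecutive hypothetical salesman IDs.
counts = Counter(hash_partition(sid) for sid in range(1, 1001))
```

Unlike range or list partitioning, the caller has no say in which partition a given key lands in; the only guarantee sought is that the partitions end up about the same size.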
List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. You do this
by specifying a list of discrete values for the partitioning column in the description for each
partition. This is different from range partitioning, where a range of values is associated
with a partition, and from hash partitioning, where you have no control of the row-to-partition
mapping. The advantage of list partitioning is that you can group and organize
unordered and unrelated sets of data in a natural way.
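Explicit row-to-partition mapping can be sketched as a lookup table. The partition names, the state values, and the function name list_partition_for below are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical mapping of partitions to the discrete values they accept.
PARTITION_VALUES = {
    "sales_east": ["New York", "Virginia", "Florida"],
    "sales_west": ["California", "Oregon", "Hawaii"],
    "sales_central": ["Texas", "Illinois"],
}

# Invert the mapping once so each row lookup is a single dictionary access.
VALUE_TO_PARTITION = {
    value: partition
    for partition, values in PARTITION_VALUES.items()
    for value in values
}


def list_partition_for(state):
    """Return the partition that explicitly lists state, or raise if none does."""
    try:
        return VALUE_TO_PARTITION[state]
    except KeyError:
        raise ValueError(f"no partition lists the value {state!r}") from None


print(list_partition_for("Oregon"))
```

Unlike the range and hash methods, the values grouped in one partition need not be ordered or related; the designer decides the grouping.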
Composite Partitioning
Composite partitioning combines range and hash partitioning. Oracle first distributes data
into partitions according to boundaries established by the partition ranges. Then Oracle
uses a hashing algorithm to further divide the data into subpartitions within each range
partition.
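The two-step mapping can be sketched as follows. This is an illustration under assumptions: the quarterly bounds on a month number, the count of four subpartitions, the modulo hash, and the function name composite_partition_for are all invented for the example.

```python
from bisect import bisect_right

QUARTER_BOUNDS = [4, 7, 10, 13]           # non-inclusive upper bounds on the month number
QUARTER_NAMES = ["q1", "q2", "q3", "q4"]
SUBPARTITIONS = 4                          # hash subpartitions inside each range partition


def composite_partition_for(month, customer_id):
    """Pick the range partition by month, then the subpartition by hashing customer_id."""
    range_part = QUARTER_NAMES[bisect_right(QUARTER_BOUNDS, month)]
    subpartition = customer_id % SUBPARTITIONS  # stand-in hash, not Oracle's algorithm
    return f"{range_part}_sp{subpartition}"


print(composite_partition_for(4, 10))
```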
Index Partitioning
You can choose whether or not to inherit the partitioning strategy of the underlying tables.
You can create both local and global indexes on a table partitioned by range, hash, or
composite methods. Local indexes inherit the partitioning attributes of their related tables.
For example, if you create a local index on a composite table, Oracle automatically
partitions the local index using the composite method.
Answer the Questions
1. List the reasons for partitioning.
2. Write notes on horizontal partitioning.
3. Explain vertical partitioning in detail.
4. Describe hardware partitioning.
5. Explain various software partitioning methods.
MODULE VIII
8.1 Data Aggregation
Data aggregation is the process where data is collected and presented in summarized
format for statistical analysis and to effectively achieve business objectives. Data
aggregation is vital to data warehousing as it helps to make decisions based on vast
amounts of raw data. Data aggregation provides the ability to forecast future trends and
aids in predictive modeling. Effective data aggregation techniques help to minimize
performance problems.
Types of aggregation with mathematical functions:
▪ Sum: adds together all the specified data to get a total.
▪ Average: computes the average value of the specified data.
▪ Max: displays the highest value for each category.
▪ Min: displays the lowest value for each category.
▪ Count: counts the total number of data entries for each category.
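The five functions can be tried out with Python's built-in sqlite3 module. The table name sales, its columns, and the sample rows below are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("books", 10.0), ("books", 30.0), ("toys", 5.0), ("toys", 15.0), ("toys", 40.0)],
)

# One output row per category, with all five aggregates side by side.
rows = conn.execute(
    """
    SELECT category,
           SUM(amount),
           AVG(amount),
           MAX(amount),
           MIN(amount),
           COUNT(*)
    FROM sales
    GROUP BY category
    ORDER BY category
    """
).fetchall()

for row in rows:
    print(row)
```

Five detail rows are reduced to two summary rows, one per category.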
Benefits of Data Aggregation:
Improved performance: Aggregated data requires less storage space and can be queried
more efficiently, resulting in faster response times.
Simplified analysis: Aggregated data provides a high-level overview, making it easier
to identify trends, patterns, and insights.
Enhanced data quality: Aggregation helps in reducing redundancies, errors, and
inconsistencies present in the raw data.
8.2 Designing Summary tables
Summary tables
Summary tables, also known as aggregate tables, store data at a higher level of summarization
than the level at which it was initially captured and saved. Summary tables are an important
part of creating a high-performance data warehouse. A summary table stores data that has
been aggregated in a way that answers a common (or resource-intensive) business query.
Summary tables are all about speed. They’re smaller than fact tables, which means they
generally respond more quickly (fewer rows to query), and they deliver answers without
calculating every result from scratch.
8.2.1 Benefits of Summary tables
• Summary information speeds up the performance of common queries.
• It increases the operational cost.
• It needs to be updated whenever new data is loaded into the data warehouse.
• It may not have been backed up, since it can be generated fresh from the detailed information.
Types of Summary Tables:
Roll-up summary tables: These tables contain data aggregated to a coarser level of
granularity. For example, sales data can be summarized at a monthly or yearly level.
Drill-down summary tables: These tables store data aggregated at a finer level of
granularity. For instance, sales data can be summarized by day or even by hour.
Pivot summary tables: These tables allow multidimensional analysis by summarizing
data along multiple dimensions, such as product, region, and time.
Strategies for Creating Summary Tables:
Aggregating during ETL (Extract, Transform, Load): Aggregations can be performed
during the data loading process to create summary tables.
Incremental updates: Summary tables can be updated incrementally as new data arrives,
reducing the need for full table recalculations.
Materialized views: Some data warehouse platforms provide materialized views, which
store the pre-computed results of a query, such as aggregated results.
Maintaining Summary Tables:
▪ Refresh frequency: Decide how often summary tables need to be updated based on
the data volatility and business requirements.
▪ Incremental updates: Utilize incremental update strategies to efficiently update
summary tables without recalculating the entire dataset.
▪ Partitioning: Partitioning summary tables based on relevant attributes, such as time,
can enhance performance and simplify updates.
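An incremental update can be sketched in a few lines of Python. The sketch assumes that the summary keeps a running total and a row count per month, so the average can be derived without rereading old detail rows. The names summary, apply_batch, and average are hypothetical.

```python
from collections import defaultdict

# Summary state per month: running total and row count.
summary = defaultdict(lambda: {"total": 0.0, "count": 0})


def apply_batch(new_sales):
    """Fold a batch of (month, amount) rows into the summary in place."""
    for month, amount in new_sales:
        entry = summary[month]
        entry["total"] += amount
        entry["count"] += 1


def average(month):
    """Derive the average sale amount for a month from the stored total and count."""
    entry = summary[month]
    return entry["total"] / entry["count"]


apply_batch([("2024-01", 100.0), ("2024-01", 50.0), ("2024-02", 70.0)])
apply_batch([("2024-01", 30.0)])  # a later load touches only the affected month

print(dict(summary))
print(average("2024-01"))
```

Storing the total and the count, rather than the average itself, is what makes the update incremental: an average alone cannot be corrected without knowing how many rows produced it.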
Remember, the specific implementation of data aggregation and summary tables may vary
depending on the data warehouse technology and the specific requirements of your
organization.
Let's consider a scenario where we have a data warehouse for a retail business that stores
data related to sales, customers, and products. We will create summary tables for sales
analysis.
The fact table contains the sales transactional data and serves as the primary table for
aggregating data.
Table 8.1: Sales data (Sales Fact Table)
SaleID (PK): Primary key of the sales fact table.
DateID (FK): Foreign key referencing the Date dimension table, representing the
date of the sale.
ProductID (FK): Foreign key referencing the Product dimension table, representing
the product sold.
CustomerID (FK): Foreign key referencing the Customer dimension table, representing
the customer who made the purchase.
Quantity Sold: The quantity of products sold in each transaction.
Sale Amount: The total sale amount for each transaction.
Monthly Sales Summary Table:
This summary table aggregates sales data by month.
Table 8.2: Sales Summary data (Monthly Sales Summary)
DateID (FK): Foreign key referencing the Date dimension table, representing the month.
Total Sales: The aggregated total sales for each month.
Total Quantity Sold: The aggregated total quantity sold for each month.
Average Sale Amount: The average sale amount for each month.
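The step from the fact table to the monthly summary can be sketched with Python's built-in sqlite3 module. The snake_case table and column names are adapted from Tables 8.1 and 8.2. The date_dim table, its month column stored as text, and the sample rows are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE sales_fact (
        sale_id       INTEGER PRIMARY KEY,
        date_id       INTEGER REFERENCES date_dim(date_id),
        product_id    INTEGER,
        customer_id   INTEGER,
        quantity_sold INTEGER,
        sale_amount   REAL
    );
    INSERT INTO date_dim VALUES (1, '2024-01'), (2, '2024-01'), (3, '2024-02');
    INSERT INTO sales_fact VALUES
        (1, 1, 10, 100, 2, 40.0),
        (2, 2, 11, 101, 1, 20.0),
        (3, 3, 10, 100, 3, 90.0);

    -- Build the summary table once from the detailed fact rows.
    CREATE TABLE monthly_sales_summary AS
    SELECT d.month              AS month,
           SUM(f.sale_amount)   AS total_sales,
           SUM(f.quantity_sold) AS total_quantity_sold,
           AVG(f.sale_amount)   AS average_sale_amount
    FROM sales_fact f
    JOIN date_dim d ON d.date_id = f.date_id
    GROUP BY d.month;
    """
)

summary_rows = conn.execute(
    "SELECT * FROM monthly_sales_summary ORDER BY month"
).fetchall()
for row in summary_rows:
    print(row)
```

Queries for monthly figures can now read monthly_sales_summary directly instead of scanning and grouping sales_fact each time.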
8.3 Data Marts: Introduction
A data warehouse is a cohesive data model that defines the central data repository for an
organization. A data mart is a data repository for a specific user group. It contains
summarized data that the user group can easily understand, process, and apply. A data mart
cannot stand alone; it requires a data warehouse. Because each data warehousing effort is
unique, your company’s data warehousing environment may differ slightly from what we
are about to introduce. Each data mart is a collection of tables organized according to the
particular requirements of a user or group of users. Retrieving a collection of different
kinds of data from a “normalized” warehouse can be complex and time-consuming. Hence
the need to rearrange the data so they can be retrieved more easily. The notion of a “mart”
suggests that it is organized for the ultimate consumers — with the potato chips, and video
tapes all next to each other.
This organization does not have to follow any particular inherent rules or structures.
Indeed, it may not even make sense. And however the marts are organized initially, the
requirements are almost certain to change once the user has seen the implications of the
request.
This means that the creation of data marts requires:
▪ Understanding of the business involved.
▪ Responsiveness to the user’s stated objectives.
▪ Sufficient facility with database modeling and design to produce new tables
quickly.
▪ Tools to convert models into data marts quickly.
Figure 8.1:Data Warehouse and data mart
8.4 Type of Data Mart
There are three main types of data marts:
1. Dependent: Dependent data marts are created by drawing data from a central data
warehouse, which is itself loaded from operational sources, external sources, or both.
2. Independent: An independent data mart is created without the use of a central data
warehouse.
3. Hybrid: This type of data mart can take data from data warehouses or operational
systems.
Dependent Data Mart
A dependent data mart allows sourcing organization's data from a single Data Warehouse.
It offers the benefit of centralization. If you need to develop one or more physical data
marts, then you need to configure them as dependent data marts.
Dependent data marts can be built in two different ways: either a user can access
both the data mart and the data warehouse, depending on need, or access is limited only
to the data mart. The second approach is not optimal, as it produces what is sometimes referred to
as a data junkyard. In the data junkyard, all data begins with a common source, but it
is scrapped and mostly junked.
Independent Data Mart
An independent data mart is created without the use of a central data warehouse. This kind
of data mart is an ideal option for smaller groups within an organization.
An independent data mart has neither a relationship with the enterprise data warehouse nor
with any other data mart. In an independent data mart, the data is input separately, and its
analyses are also performed autonomously.
Implementation of independent data marts is antithetical to the motivation for building a
data warehouse. First of all, you need a consistent, centralized store of enterprise data
which can be analyzed by multiple users with different interests who want widely varying
information.
Figure 8.2: Independent Data Mart
Hybrid data Mart:
A hybrid data mart combines input from a data warehouse with input from other sources. This could
be helpful when you want ad-hoc integration, like after a new group or product is added to
the organization.
It is best suited for multiple database environments and fast implementation turnaround for
any organization. It also requires the least data cleansing effort. A hybrid data mart also supports
large storage structures, and its flexibility makes it well suited for smaller data-centric
applications.
Figure 8.3 Hybrid Data marts
Benefits of Data Marts
▪ Allows the data to be accessed in less time
▪ Cost-efficient alternative to the bulky data warehouse
▪ Easy to use, as it is designed according to the needs of a specific user group
▪ Speeds up business processes
8.5 Design Data Mart
A data mart should be designed as a smaller version of the snowflake schema within the data
warehouse, and should match the database design of the data warehouse. This helps in maintaining
control over database instances. The design is shown in Figure 8.4.
Figure 8.4:Design of data mart
The summary table helps to summarize the data mart in the same way as in the data warehouse.
The major steps in building a data mart are:
▪ Designing
▪ Constructing
▪ Populating
▪ Accessing
▪ Managing
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks between
initiating the request for a data mart to gathering information about the requirements.
Finally, we create the logical and physical design of the data mart.
The design step involves the following tasks:
▪ Gathering the business and technical requirements and identifying data sources.
▪ Selecting the appropriate subset of data.
▪ Designing the logical and physical structure of the data mart.
Data could be partitioned based on following criteria:
▪ Date
▪ Business or Functional Unit
▪ Geography
▪ Any combination of above
Data could be partitioned at the application or DBMS level. Partitioning at the application
level is recommended, however, as it allows different data models each year as the
business environment changes.
Constructing
This is the second phase of implementation. It involves creating the physical database and
the logical structures.
This step involves the following tasks:
▪ Implementing the physical database designed in the earlier phase. For instance,
database schema objects like table, indexes, views, etc. are created.
▪ Storage management: An RDBMS stores and manages the data to create, add, and
delete data.
▪ Fast data access: With a SQL query you can easily access data based on certain
conditions/filters.
▪ Data protection: The RDBMS also offers a way to recover from system failures
such as power failures. It also allows restoring data from backups in case the disk
fails.
▪ Multiuser support: The data management system offers concurrent access, the ability for
multiple users to access and modify data without interfering with or overwriting changes made
by another user.
▪ Security: The RDBMS also provides a way to regulate access by users to objects and
certain types of operations.
Populating:
In the third phase, data is populated in the data mart.
The populating step involves the following tasks:
▪ Source data to target data mapping
▪ Extraction of source data
▪ Cleaning and transformation operations on the data
▪ Loading data into the data mart
▪ Creating and storing metadata
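The five populating tasks can be sketched as a tiny pipeline. Everything named here is hypothetical: the source column names, the COLUMN_MAPPING dictionary, the in-memory list standing in for the data mart, and the cleaning rules were chosen only to show one task per step.

```python
# Task 1: source-to-target mapping (source column -> data mart column).
COLUMN_MAPPING = {"cust_nm": "customer_name", "amt": "sale_amount"}


def extract():
    """Task 2: stand-in for reading rows from a source system."""
    return [
        {"cust_nm": "  alice ", "amt": "40.50"},
        {"cust_nm": "BOB", "amt": "not-a-number"},  # rejected during cleaning
    ]


def transform(row):
    """Task 3: rename columns, tidy the text, and convert the amount to a number."""
    renamed = {COLUMN_MAPPING[column]: value for column, value in row.items()}
    try:
        amount = float(renamed["sale_amount"])
    except ValueError:
        return None  # signal that the row could not be cleaned
    return {
        "customer_name": renamed["customer_name"].strip().title(),
        "sale_amount": amount,
    }


def populate():
    """Tasks 4 and 5: load the clean rows and record metadata about the load."""
    data_mart = []
    rejected = 0
    for row in extract():
        clean = transform(row)
        if clean is None:
            rejected += 1
        else:
            data_mart.append(clean)
    metadata = {"loaded": len(data_mart), "rejected": rejected, "mapping": COLUMN_MAPPING}
    return data_mart, metadata


loaded_rows, load_metadata = populate()
print(loaded_rows)
print(load_metadata)
```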
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data,
creating reports and charts, and publishing them. End users submit queries to the database and
display the results of the queries.
The accessing step needs to perform the following tasks:
▪ Set up a meta layer that translates database structures and object names into
business terms. This helps non-technical users to access the data mart easily.
▪ Set up and maintain database structures.
▪ Set up API and interfaces if required
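A meta layer of the kind described in the first task can be as simple as a lookup from physical names to business terms. The column names and the function name to_business_view below are hypothetical.

```python
# Hypothetical physical column names mapped to the business terms shown to users.
BUSINESS_TERMS = {
    "sls_amt": "Sale Amount",
    "qty_sld": "Quantity Sold",
    "cust_id": "Customer",
}


def to_business_view(row):
    """Relabel a result row; columns without a business term keep their physical name."""
    return {BUSINESS_TERMS.get(column, column): value for column, value in row.items()}


report_row = to_business_view({"sls_amt": 120.0, "qty_sld": 3, "load_ts": "2024-01-05"})
print(report_row)
```

Keeping the translation in one place means a physical column can be renamed without changing what non-technical users see in their reports.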
Managing
This is the last step of the data mart implementation process. This step covers management
tasks such as:
▪ Ongoing user access management.
▪ System optimizations and fine-tuning to achieve enhanced performance.
▪ Adding and managing fresh data in the data mart.
▪ Planning recovery scenarios and ensuring system availability in case the
system fails.
▪ Metadata management.
Answer the Questions
1. List and explain the aggregation functions.
2. What are the benefits of data aggregation?
3. How is a summary table designed?
4. Explain the types of summary table.
5. Write detailed notes on summary tables.
6. Describe the data mart in detail.
7. What is a data mart? Explain its types.
8. How is a data mart designed?
MODULE IX
9.1 Meta Data
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that metadata is the summarized data that
leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.
▪ Metadata is the road-map to a data warehouse.
▪ Metadata in a data warehouse defines the warehouse objects.
▪ Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
9.2 Role of Metadata
Metadata has a very important role in a data warehouse. Although the role of metadata is
different from that of the warehouse data, it is no less important. The various
roles of metadata are explained below.
▪ Metadata acts as a directory. This directory helps the decision support system to
locate the contents of the data warehouse.
▪ Metadata helps the decision support system to map data when it is
transformed from the operational environment to the data warehouse environment.
▪ Metadata helps in summarization between current detailed data and highly
summarized data.
▪ Metadata also helps in summarization between lightly detailed data and highly
summarized data.
▪ Metadata is used for query tools.
▪ Metadata is used in extraction and cleansing tools.
▪ Metadata is used in reporting tools.
▪ Metadata is used in transformation tools.
▪ Metadata plays an important role in loading functions.
9.3 Categories of Metadata
Metadata can be broadly categorized into three categories −
▪ Business Metadata − It has the data ownership information, business definition,
and changing policies.
▪ Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes
structural information such as primary and foreign key attributes and indices.
▪ Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means
the history of data migrated and transformation applied on it.
Figure 9.1:Metadata and its types
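The three categories can be pictured as three small records attached to one warehouse table. The class names, field names, and sample values in this Python sketch are assumptions. A real metadata repository would hold many more attributes.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessMetadata:
    owner: str
    business_definition: str


@dataclass
class TechnicalMetadata:
    table_name: str
    column_types: dict  # column name -> data type
    primary_key: str


@dataclass
class OperationalMetadata:
    currency: str  # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)  # ordered history of transformations


@dataclass
class TableMetadata:
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata


sales_metadata = TableMetadata(
    business=BusinessMetadata(
        owner="Sales department",
        business_definition="One row per completed sale",
    ),
    technical=TechnicalMetadata(
        table_name="sales_fact",
        column_types={"sale_id": "INTEGER", "sale_amount": "REAL"},
        primary_key="sale_id",
    ),
    operational=OperationalMetadata(
        currency="active",
        lineage=["extracted from point-of-sale system", "amounts converted to one currency"],
    ),
)
print(sales_metadata.operational.currency)
```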
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −
▪ Definition of data warehouse − It includes the description of structure of data
warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
▪ Business metadata − It contains the data ownership information, business
definition, and changing policies.
▪ Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means
the history of data migrated and transformation applied on it.
▪ Data for mapping from operational environment to data warehouse − It includes
the source databases and their contents, data extraction, data partitioning, cleaning,
transformation rules, data refresh and purging rules.
▪ Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
9.4 Challenges for Metadata Management
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy
of reports, validates data transformation, and ensures the accuracy of calculations.
Metadata also enforces the definition of business terms to business end-users. With all
these uses of metadata, it also has its challenges. Some of the challenges are discussed
below.
▪ Metadata in a big organization is scattered across the organization. This metadata
is spread in spreadsheets, databases, and applications.
▪ Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
▪ There are no industry-wide accepted standards. Data management solution vendors
have narrow focus.
▪ There are no easy and accepted methods of passing metadata.
9.5 Legacy Systems
The task of documenting legacy systems and mapping their columns to those of the central
data warehouse is the worst part of any data warehouse project. Moreover, the particular
characteristics of the project vary, depending on the nature, technology, and sites of those
systems. Essential Strategies, Inc. has extended Oracle’s Designer/2000 product to allow
for reverse engineering of legacy systems—specifically providing the facility for mapping
each column of a legacy system to an attribute of the corporate data model. Combining
this with the mapping of each column in a data mart to an attribute of the data model
provides direct mapping of original data to the view of them seen by users.
9.6 Types of backups
A data warehouse is a complex system and it contains a huge volume of data. Therefore,
it is important to back up all the data so that it becomes available for recovery in future as
per requirement. In this chapter, we will discuss the issues in designing the backup strategy.
The important issues that affect the backup plan are:
▪ Realize that the restore is almost more important than the backup
▪ Create a hot backup
▪ Move your backup offline
▪ Perfect the connection between your production and backup servers
▪ Consider a warm backup
▪ Always reassess the situation
Before proceeding further, some of the backup types are discussed below.
▪ Complete backup − It backs up the entire database at the same time. This backup
includes all the database files, control files, and journal files.
▪ Partial backup − As the name suggests, it does not create a complete backup of the
database. Partial backups are very useful in large databases because they allow a
strategy whereby various parts of the database are backed up in a round-robin
fashion on a day-to-day basis, so that the whole database is backed up effectively
once a week.
▪ Cold backup − Cold backup is taken while the database is completely shut down.
In a multi-instance environment, all the instances should be shut down.
▪ Hot backup − Hot backup is taken when the database engine is up and running. The
requirements of hot backup vary from RDBMS to RDBMS.
▪ Online backup − It is quite similar to hot backup.
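The round-robin strategy mentioned under partial backup can be sketched as a weekly plan. The tablespace names and the function name round_robin_schedule are hypothetical, and the sketch assumes one backup run per day.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def round_robin_schedule(parts):
    """Assign each part of the database to a weekday in rotation."""
    schedule = {day: [] for day in DAYS}
    for index, part in enumerate(parts):
        schedule[DAYS[index % len(DAYS)]].append(part)
    return schedule


# Ten hypothetical tablespaces spread over seven days.
tablespaces = [f"ts_{n:02d}" for n in range(1, 11)]
weekly_plan = round_robin_schedule(tablespaces)

for day, parts in weekly_plan.items():
    print(day, parts)
```

Every tablespace appears exactly once in the plan, so the whole database is covered within one week while each daily run stays small.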
9.7 Backup the data warehouse
Oracle RDBMS is basically a collection of physical database files. Backup and recovery
problems are most likely to occur at this level. Three types of files must be backed up:
database files, control files, and online redo log files. If you omit any of these files, you
have not made a successful backup of the database.
Cold backups are taken while the database is shut down. Hot backups are taken while the database is
functioning. There are also supplemental backup methods, such as exports. Each type of
backup has its advantages and disadvantages. The major types of instance recovery are
cold restore, full database recovery, time-based recovery, and cancel-based recovery.
Oracle provides a server-managed infrastructure for backup, restore, and recovery tasks
that enables simpler, safer operations at terabyte scale. Some of the highlights are:
▪ Details related to backup, restore, and recovery operations are maintained by the
server in a recovery catalog and automatically used as part of these operations. This
reduces administrative burden and minimizes the possibility of human errors.
▪ Backup and recovery operations are fully integrated with partitioning. Individual
partitions, when placed in their own tablespaces, can be backed up and restored
independently of the other partitions of a table.
▪ Oracle includes support for incremental backup and recovery, enabling operations
to be completed efficiently within times proportional to the amount of changes,
rather than the overall size of the database.
▪ The backup and recovery technology is highly scalable, and provides tight
interfaces to industry-leading media management subsystems. This provides for
efficient operations that can scale up to handle very large volumes of data. It is available on
open platforms, for more hardware options, and on enterprise-level platforms.
Putting in place a backup and recovery plan for data warehouses is imperative. Even
though most of the data comes from operational systems originally, you cannot always
rebuild data warehouses in the event of a media failure (or a disaster). As operational data
ages, it is removed from the operational databases, but it may still exist in the data
warehouse. Furthermore, data warehouses often contain external data that, if lost, may
have to be purchased.
Some online backup services allow any number of computers to be backed up to a single account,
charging only a small one-time fee for each additional computer, with all computers sharing a
common pool of storage space.
Keeping a local backup as well allows you to save data well in excess of the storage that you
purchase for online backup.
Create a Hot Backup
The best solution for critical systems is always to restore the backup once it has been made
onto another machine. This will also provide you with a “hot” backup that is ready to be
used, should your production database ever crash. If you can’t afford this higher cost, you
should at least do periodic test restores.
Consider a Warm (Not Hot) Backup
▪ After you install hot backups and build a solid network, can you relax? Sort of,
except there’s one thing you still need to worry about.
▪ Hot backups are great because they are extremely current. The minute data is
changed on the primary server, you propagate that change to your backup server.
▪ But that very same close binding also comes with its own dangers. Consider what
happens if someone accidentally deletes the wrong data. For example, say an
administrator accidentally drops the accounts table in the database. (I’m not making
this up; this actually happened at a client of mine.) The inadvertent deletion will
immediately get propagated to the backup, too.
▪ If every second of downtime is expensive, you should consider creating a copy of
the database that is kept “warm”—i.e., fairly recent but not as recent as your hot
backup. For example, you might set the hot backup within a minute of the primary
server, but deliberately keep your warm backup 30 minutes behind your primary.
That way, you only have to restore the last several transaction logs rather than start
from scratch.
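The deliberate delay of a warm backup can be sketched as a filter on pending transaction logs. The 30-minute lag comes from the example above. The log entries, timestamps, and the function name logs_to_apply are hypothetical.

```python
from datetime import datetime, timedelta

WARM_LAG = timedelta(minutes=30)  # the warm backup stays this far behind the primary


def logs_to_apply(pending_logs, now):
    """Return the transaction logs old enough to apply to the warm backup.

    pending_logs is a list of (commit_time, description) tuples in commit order.
    Logs newer than the lag window are held back, which leaves time to stop a
    mistake before it reaches the warm copy.
    """
    cutoff = now - WARM_LAG
    return [log for log in pending_logs if log[0] <= cutoff]


now = datetime(2024, 1, 1, 12, 0)
pending = [
    (datetime(2024, 1, 1, 11, 10), "insert orders"),
    (datetime(2024, 1, 1, 11, 29), "update accounts"),
    (datetime(2024, 1, 1, 11, 55), "drop accounts table"),  # accidental, still held back
]
safe = logs_to_apply(pending, now)
print([description for _, description in safe])
```

The accidental drop is five minutes old, so the hot backup has already received it, but the warm backup has not and can still be used for recovery.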
Tape Backup Storage
▪ Tape backup of customers’ data on SAN
▪ Tape backup of Veritas disk storage
▪ Tape backup of customers’ data direct from network
▪ Tape archiving
▪ Off-site tape storage
9.8 Sure West Online Backup
SureWest Online Backup transfers your data using a secure connection and safely stores it in the SureWest Broadband Data Storage
Warehouse until it is needed. If you use your computer for business or personal use, you
need to always have a backup of your data files in the event of software, hardware or
human error. Ideally you should keep a copy of this data at some location other than your
home or place of business. If your house burns down and your backups are sitting on the
shelf next to your computer, they too will be lost.
You need to make backups of your important data to prevent a total loss in the event of
any kind of system failure or human error. Simple data backup can be done by saving
your data to multiple locations, copying the data from its original location to removable
media, another hard drive, or another computer’s hard drive on the network.
You also need to have multiple versions of your information backed up. If you are working
on a spreadsheet over a period of several days, it’s best to keep a snapshot of every day’s
work.
This is easily done with an online backup service.
To make this task easier, SureWest Broadband has licensed Backup Software from
NovaStor, a leader in the backup business, to enable users to make backup copies of their
data. Most users do not want to be bothered with managing the tasks necessary for
maintaining their own backups, so a new type of backup system was needed. It’s known
as online backup.
Online backup software is designed to routinely copy your important files to a private,
secure location on the Internet by transmitting your data over your existing Internet
connection. All of your data is encrypted before it is sent to our storage network to protect
your privacy. If you have a working Internet connection on your computer, you can use
SureWest Broadband’s Online Backup Service to keep your important files safe from
disaster on a daily basis.
Answer the questions
1. What is metadata?
2. Explain the role of metadata.
3. List the categories of metadata.
4. What are legacy systems?
5. Explain the types of backups.
6. Differentiate hot backup and cold backup.
7. List the factors which affect the backup plan.
8. How is a hot backup created?
MODULE X
10.1 RECOVERY –STRATEGIES
Various Testing Strategies
Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing performed on a data warehouse −
▪ Unit testing
▪ Integration testing
▪ System testing
10.1.1 Unit Testing
▪ In unit testing, each component is separately tested.
▪ Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
▪ This test is performed by the developer.
10.1.2 Integration Testing
▪ In integration testing, the various modules of the application are brought together and
then tested against a number of inputs.
▪ It is performed to test whether the various components work well after integration.
10.1.3 System Testing
▪ In system testing, the whole data warehouse application is tested together.
▪ The purpose of system testing is to check whether the entire system works correctly
together or not.
▪ System testing is performed by the testing team.
▪ Since the size of the whole data warehouse is very large, it is usually possible to
perform only minimal system testing before the test plan can be enacted.
10.1.4 Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In this
schedule, we predict the estimated time required for the testing of the entire data warehouse
system.
There are different methodologies available to create a test schedule, but none of them is
perfect because the data warehouse is very complex and large. Also, the data warehouse
system is evolving in nature. One may face the following issues while creating a test
schedule −
▪ A seemingly simple problem may require a large query that can take a day or more to
complete, i.e., the query does not complete in a desired time scale.
▪ There may be hardware failures such as losing a disk or human errors such as
accidentally deleting a table or overwriting a large table.
10.2 VARIOUS TESTING STRATEGIES
Data warehousing projects are becoming larger, more complex, and increasingly important
to the organizations that implement them.
Data is loaded from a growing number of diverse sources across the enterprise to create
larger, richer assemblages of both text and numerical information.
Data is loaded into either a high-volume test area or in the user acceptance testing (UAT)
environments.
Regression testing: ensures that existing functionality remains intact each time a new
release of ETL code and data is completed.
Performance, load, and scalability tests: ensure that data loads and queries perform within
expected periods and that the technical architecture is scalable.
Acceptance testing: includes verifications of data model completeness to meet the
reporting needs of the project, reviewing table designs, validation of data to be loaded in
the production data warehouse, a review of the periodic data upload procedures, and finally
application reports.
Verifications that need a strategy: Because data warehouse testing is different
from most software testing, a best practice is to break the testing and validation process
into several well-defined, high-level focal areas for data warehouse projects. Doing so
allows targeted planning for each focus area, such as integration and data validation.
Data validations: includes reviewing the ETL mapping encoded in the ETL tool as well as
reviewing samples of the data loaded into the test environment.
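Reviewing samples of loaded data is often complemented by simple automated checks. The following Python sketch compares a row count and an amount total between source and target. The function name validate_load, the sale_amount key, and the sample rows are assumptions made for illustration.

```python
def validate_load(source_rows, target_rows, amount_key="sale_amount"):
    """Compare row counts and an amount total between source and target.

    Returns a list of human-readable issues; an empty list means both checks passed.
    """
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(
            f"row count mismatch: source={len(source_rows)} target={len(target_rows)}"
        )
    source_total = sum(row[amount_key] for row in source_rows)
    target_total = sum(row[amount_key] for row in target_rows)
    if abs(source_total - target_total) > 1e-9:
        issues.append(f"total mismatch: source={source_total} target={target_total}")
    return issues


source = [{"sale_amount": 40.0}, {"sale_amount": 20.0}]
complete_load = [{"sale_amount": 40.0}, {"sale_amount": 20.0}]
partial_load = [{"sale_amount": 40.0}]

print(validate_load(source, complete_load))
print(validate_load(source, partial_load))
```

Checks like these do not prove that every value is correct, but they catch dropped or duplicated rows cheaply after each load.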
Data integration tests: tasks include reviewing and accepting the logical data model
captured with a data modeling tool (e.g., ERWin), converting the models to actual physical
database tables in the test environment, creating the proper indexes, documenting the
relevant metadata, and testing the ETL programs created by your ETL tool or stored
procedures.
System testing: involves increasing the volume of the test data to be loaded, estimating and
measuring load times and loading errors, placing data
10.2.1 Testing Backup Recovery
Testing the backup recovery strategy is extremely important. Here is the list of scenarios
for which this testing is needed −
▪ Media failure
▪ Loss or damage of table space or data file
▪ Loss or damage of redo log file
▪ Loss or damage of control file
▪ Instance failure
▪ Loss or damage of archive file
▪ Loss or damage of table
▪ Failure during data movement
10.2.2 Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
▪ Security − A separate security document is required for security testing. This
document contains a list of disallowed operations and the tests devised for each.
▪ Scheduler − Scheduling software is required to control the daily operations of a
data warehouse. It needs to be tested during system testing. The scheduling
software requires an interface with the data warehouse, which will need the
scheduler to control overnight processing and the management of aggregations.
▪ Disk configuration − Disk configuration also needs to be tested to identify I/O
bottlenecks. The test should be performed multiple times with different
settings.
▪ Management tools − All the management tools need to be tested during system
testing. Here is the list of tools that need to be tested.
▪ Event manager
▪ System manager
▪ Database manager
▪ Configuration manager
▪ Backup recovery manager
10.2.3 Testing the Database
The database is tested in the following three ways −
▪ Testing the database manager and monitoring tools − To test the database manager
and the monitoring tools, they should be used in the creation, running, and
management of a test database.
▪ Testing database features − Here is the list of features that we have to test −
▪ Querying in parallel
▪ Create index in parallel
▪ Data load in parallel
▪ Testing database performance − Query execution plays a very important role in
data warehouse performance measures. There are sets of fixed queries that need to
be run regularly, and they should be tested. To test ad hoc queries, one should go
through the user requirement document and understand the business completely.
Take time to test the most awkward queries that the business is likely to ask against
different index and aggregation strategies.
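Running a fixed query set against different index strategies can be sketched as follows. This is only an illustration, assuming an in-memory SQLite table named `sales` as a stand-in for a warehouse fact table; the query names, the index name, and the helper functions are hypothetical. Timings vary from machine to machine, so the sketch also checks that every strategy returns the same answers.

```python
import sqlite3
import time

# A small, fixed set of queries that would be run regularly.
FIXED_QUERIES = {
    "sales_by_region": (
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ),
    "one_region": "SELECT COUNT(*) FROM sales WHERE region = 'north'",
}


def build_warehouse(with_index):
    """Create a test table with 20,000 rows, optionally indexed on region."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER)"
    )
    regions = ["north", "south", "east", "west"]
    rows = [(i, regions[i % 4], i % 100) for i in range(20000)]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    if with_index:
        conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
    return conn


def run_fixed_queries(conn):
    """Run each fixed query once; return {name: (rows, elapsed_seconds)}."""
    results = {}
    for name, sql in FIXED_QUERIES.items():
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        results[name] = (rows, time.perf_counter() - start)
    return results


def compare_strategies():
    """Compare answers and timings with and without the index."""
    plain = run_fixed_queries(build_warehouse(with_index=False))
    indexed = run_fixed_queries(build_warehouse(with_index=True))
    return {
        name: {
            "same_answer": plain[name][0] == indexed[name][0],
            "seconds_no_index": plain[name][1],
            "seconds_with_index": indexed[name][1],
        }
        for name in FIXED_QUERIES
    }
```

In a real test the same idea would be applied to the production-sized test database, with each query run several times so that a single slow run does not distort the comparison.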
10.2.4 Testing the Application
▪ All the managers should be integrated correctly and work together so that the
end-to-end load, index, aggregation, and query steps perform as expected.
▪ Each function of each manager should work correctly.
▪ It is also necessary to test the application over a period of time.
▪ Week-end and month-end tasks should also be tested.
10.2.5 Logistics of the Test
▪ The aim of the system test is to test all of the following areas −
▪ Scheduling software
▪ Day-to-day operational procedures
▪ Backup recovery strategy
▪ Management and scheduling tools
▪ Overnight processing
▪ Query performance
10.3 VARIOUS RECOVERY MODELS
There are three recovery models: full, bulk-logged, and simple. The recovery model of a
new database is inherited from the model database when the new database is created. The
model for a database can be changed after the database has been created.
▪ Full recovery model − It provides the most flexibility for recovering the
database to an earlier point in time.
▪ Bulk-logged recovery model − Bulk-logged recovery provides higher
performance and lowers log space consumption for certain large-scale
operations.
▪ Simple recovery model − Simple recovery provides the highest performance
and lower log space consumption but with significant exposure to data loss in
the event of a system failure.
Each recovery model addresses a different need. Knowledgeable administrators can use
this recovery model feature to significantly speed up data loads and bulk operations.
However, the amount of exposure to data loss varies with the model chosen. It is imperative
that the risks be thoroughly understood before choosing a recovery model.
Trade-offs are made depending on the model you choose. They pertain to performance,
space utilization (disk or tape), and protection against data loss. When you choose a
recovery model, you are deciding among the following business requirements:
▪ Performance of large-scale operations (for example, index creation or bulk loads)
▪ Data loss exposure (for example, the loss of committed transactions)
▪ Transaction log space consumption
▪ Simplicity of backup and recovery procedures
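The text above does not name a database product. The three model names match the recovery models of Microsoft SQL Server, where the model is changed with an `ALTER DATABASE ... SET RECOVERY` statement. Under that assumption, the following sketch builds the statements an administrator might issue around a bulk load. The helper names and the database name `SalesDW` are hypothetical, and the sketch only produces SQL text; it does not connect to a server.

```python
# Maps the model names used in the text to T-SQL keywords.
# Assumption: the target system is Microsoft SQL Server.
RECOVERY_KEYWORDS = {
    "full": "FULL",
    "bulk-logged": "BULK_LOGGED",
    "simple": "SIMPLE",
}


def set_recovery_statement(database, model):
    """Build the statement that switches a database to a recovery model."""
    key = model.strip().lower()
    if key not in RECOVERY_KEYWORDS:
        raise ValueError(f"unknown recovery model: {model!r}")
    # Bracket-quote the database name and escape any closing bracket.
    quoted = "[" + database.replace("]", "]]") + "]"
    return f"ALTER DATABASE {quoted} SET RECOVERY {RECOVERY_KEYWORDS[key]};"


def bulk_load_plan(database):
    """Statements that bracket a bulk load with the bulk-logged model."""
    return [
        set_recovery_statement(database, "bulk-logged"),
        "-- run the bulk load here",
        set_recovery_statement(database, "full"),
    ]
```

For example, `set_recovery_statement("SalesDW", "simple")` yields `ALTER DATABASE [SalesDW] SET RECOVERY SIMPLE;`. Switching models changes the exposure to data loss, so a plan like this should be reviewed against the business requirements listed above before it is used.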
Depending on what operations you are performing, one model may be more appropriate
than another. Before choosing a recovery model, consider the impact it will have. The
following table provides helpful information.
Table 10.1: Recovery Model
10.4 Disaster Recovery procedure
Disaster recovery (DR) is an organization’s ability to restore access and functionality to IT
infrastructure after a disaster event, whether natural or caused by human action (or error).
DR is considered a subset of business continuity, explicitly focusing on ensuring that the
IT systems that support critical business functions are operational as soon as possible after
a disruptive event occurs.
Data warehouses are the central repository of information for businesses and they play a
critical role in the day-to-day operations for analytics and business intelligence at all levels
of the organization. It is therefore essential that your data warehouse is reliable and allows
for recoverability and continuous operation.
Types of disasters can include:
▪ Natural disasters (for example, earthquakes, floods, tornados, hurricanes, or
wildfires)
▪ Pandemics and epidemics
▪ Cyber attacks (for example, malware, DDoS, and ransomware attacks)
▪ Other intentional, human-caused threats such as terrorist or biochemical attacks
▪ Technological hazards (for example, power outages, pipeline explosions, and
transportation accidents)
▪ Machine and hardware failure
10.4.1 The disaster recovery elements
Disaster recovery relies on having a solid plan to get critical applications and infrastructure
up and running after an outage—ideally within minutes.
An effective DR plan addresses three different elements for recovery:
▪ Preventive: Ensuring your systems are as secure and reliable as possible, using
tools and techniques to prevent a disaster from occurring in the first place. This
may include backing up critical data or continuously monitoring environments for
configuration errors and compliance violations.
▪ Detective: For rapid recovery, you’ll need to know when a response is necessary.
These measures focus on detecting or discovering unwanted events as they happen
in real time.
▪ Corrective: These measures are aimed at planning for potential DR scenarios,
ensuring backup operations to reduce impact, and putting recovery procedures into
action to restore data and systems quickly when the time comes.
10.4.2 Types of disaster recovery
The types of disaster recovery you’ll need will depend on your IT infrastructure, the type
of backup and recovery you use, and the assets you need to protect. Here are some of the
most common technologies and techniques used in disaster recovery:
▪ Backups: With backups, you back up data to an offsite system or ship an external
drive to an offsite location. However, backups do not include any IT infrastructure,
so they are not considered a full disaster recovery solution.
▪ Backup as a service (BaaS): Similar to remote data backups, BaaS solutions
provide regular data backups offered by a third-party provider.
▪ Disaster recovery as a service (DRaaS): Many cloud providers offer DRaaS, along
with cloud service models like IaaS and PaaS. A DRaaS service model allows you
to back up your data and IT infrastructure and host them on a third-party provider’s
cloud infrastructure. During a crisis, the provider will implement and orchestrate
your DR plan to help recover access and functionality with minimal interruption to
operations.
▪ Point-in-time snapshots: Also known as point-in-time copies, snapshots replicate
data, files, or even an entire database at a specific point in time. Snapshots can be
used to restore data as long as the copy is stored in a location unaffected by the
event. However, some data loss can occur depending on when the snapshot was
made.
▪ Virtual DR: Virtual DR solutions allow you to back up operations and data or even
create a complete replica of your IT infrastructure and run it on offsite virtual
machines (VMs). In the event of a disaster, you can reload your backup and resume
operation quickly. This solution requires frequent data and workload transfers to
be effective.
▪ Disaster recovery sites: These are locations that organizations can temporarily use
after a disaster event, which contain backups of data, systems, and other technology
infrastructure.
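The data loss associated with point-in-time snapshots, described in the list above, can be made concrete with a small calculation. The following is a minimal sketch, assuming snapshot times are available as Python `datetime` values; the function names are hypothetical.

```python
from datetime import datetime


def latest_snapshot_before(snapshot_times, incident_time):
    """Return the most recent snapshot taken at or before the incident."""
    candidates = [t for t in snapshot_times if t <= incident_time]
    return max(candidates) if candidates else None


def data_loss_window(snapshot_times, incident_time):
    """Return how much recent work is lost when restoring from a snapshot."""
    chosen = latest_snapshot_before(snapshot_times, incident_time)
    if chosen is None:
        return None  # no usable snapshot exists
    return incident_time - chosen
```

With snapshots at 00:00, 06:00, and 12:00 and an incident at 09:30, the restore would use the 06:00 snapshot, and the changes made in the following three and a half hours would be lost.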
10.4.3 Planning a disaster recovery strategy
A comprehensive disaster recovery strategy should include detailed emergency
response requirements, backup operations, and recovery procedures. DR strategies and
plans often help form a broader business continuity strategy, which includes
contingency plans to mitigate impact beyond IT infrastructure and systems, allowing
all business areas to resume normal operations as soon as possible. Business continuity
and disaster recovery plans need to account for the following key metrics and
processes:
▪ Recovery time objective (RTO): The maximum acceptable length of time that
systems and applications can be down without causing significant damage to the
business. For example, some applications can be offline for an hour, while others
might need to recover in minutes.
▪ Recovery point objective (RPO): The maximum age of data you need to recover to
resume operations after a major event. RPO helps to define the frequency of backups.
▪ Failover is the disaster recovery process of automatically offloading tasks to backup
systems in a way that is seamless to users. You might fail over from your primary
data center to a secondary site, with redundant systems that are ready to take over
immediately.
▪ Failback is the disaster recovery process of switching back to the original systems.
Once the disaster has passed and your primary data center is back up and running,
you should be able to fail back seamlessly as well.
▪ Restore is the process of transferring backup data to your primary system or data
center. The restore process is generally considered part of backup rather than disaster
recovery.
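The RTO and RPO metrics above can be checked against the results of a recovery drill. The following is a minimal sketch, assuming that the worst-case age of recovered data equals the interval between backups (a failure just before the next backup is due); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable age of recovered data


def evaluate_drill(objectives, backup_interval, measured_recovery_time):
    """Check a backup schedule and a measured recovery time against targets."""
    # Assumption: in the worst case the failure happens just before the next
    # backup, so the recovered data is one full backup interval old.
    worst_case_data_age = backup_interval
    return {
        "meets_rpo": worst_case_data_age <= objectives.rpo,
        "meets_rto": measured_recovery_time <= objectives.rto,
    }
```

For example, with an RTO of one hour and an RPO of 15 minutes, a 30-minute backup interval fails the RPO check even if the drill recovered in 45 minutes. This shows how the RPO drives the frequency of backups.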
Benefits of disaster recovery
• Stronger business continuity: Every second counts when your business goes
offline, impacting productivity, customer experience, and your company’s
reputation. Disaster recovery helps safeguard critical business operations by
ensuring they can recover with minimal or no interruption.
• Enhanced security: DR plans use data backup and other procedures that strengthen
your security posture and limit the impact of attacks and other security risks. For
example, cloud-based disaster recovery solutions offer built-in security
capabilities, such as advanced encryption, identity and access management, and
organizational policy.
• Faster recovery: Disaster recovery solutions make restoring your data and
workloads easier so you can get business operations back online quickly after a
catastrophic event. DR plans leverage data replication and often rely on automated
recovery to minimize downtime and data loss.
• Reduced recovery costs: The monetary impacts of a disaster event can be
significant, ranging from loss of business and productivity to data privacy penalties
to ransoms. With disaster recovery, you can avoid, or at least minimize, some of
these costs. Cloud DR processes can also reduce the operating costs of running and
maintaining a secondary location.
• High availability: Many cloud-based services come with high availability (HA)
features that can support your DR strategy. HA capabilities help ensure an agreed
level of performance and offer built-in redundancy and automatic failover,
protecting data against equipment failure and other smaller-scale events that may
impact data availability.
• Better compliance: DR planning supports compliance requirements by considering
potential risks and defining a set of specific procedures and protections for your
data and workloads in the event of a disaster. This usually includes strong data
backup practices, DR sites, and regularly testing your DR plan to ensure that your
organization is prepared.
Answer the questions
1. What are the three basic levels of testing?
2. List the issues addressed while designing a test schedule.
3. Elaborate on various testing strategies.
4. Explain the logistics of the test.
5. Describe the disaster recovery procedure.
6. Write notes on the elements addressed by the disaster recovery plan.
7. Write notes on types of disaster recovery.
8. Describe the steps to plan a disaster recovery strategy.
**************