Hierarchical Clustering
LECTURE HANDOUTS L 01
MCA II / III
Course Name with Code : Data Mining And Data Warehousing / 19CAC16
Course Faculty
Verified by HOD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram - 637 408, Namakkal Dist., Tamil Nadu
LECTURE HANDOUTS L 02
1. Relational databases
2. Data warehouses
3. Transactional Databases
4. Advanced database systems
5. Object-relational
6. Spatial and Temporal
7. Time-series
8. Multimedia Data Mining
9. Text Mining
10. Web Mining
LECTURE HANDOUTS L 03
2. Retail Industry :
Data mining has great application in the retail industry because the industry collects large amounts of data
on sales, customer purchasing history, goods transportation, consumption and services.
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Customer Retention.
Product recommendation and cross-referencing of items.
3. Telecommunication Industry :
The telecommunication industry is one of the fastest-emerging industries, providing various services such
as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc.
Multidimensional Analysis of Telecommunication data.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
6. Intrusion Detection :
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of
network resources.
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, aggregation to help select and build discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place.
It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. The major issues are −
Mining Methodology and User Interaction.
Performance Issues.
Diverse Data Types Issues.
LECTURE HANDOUTS L 04
1. Mining Methodology and User Interaction Issues :
Mining different kinds of knowledge in databases − Different users may be interested in different
kinds of knowledge.
Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs
to be interactive because interaction allows users to focus the search for patterns.
Incorporation of background knowledge − Background knowledge may be used to express the
discovered patterns not only in concise terms but also at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − A data mining query language that allows
the user to describe ad hoc mining tasks should be integrated with a data warehouse query language.
Presentation and visualization of data mining results − Once patterns are discovered, they need to
be expressed in high-level languages and visual representations.
Handling noisy or incomplete data − Data cleaning methods are required to handle noise and
incomplete objects while mining data regularities.
Pattern evaluation − Discovered patterns may be uninteresting because they represent
common knowledge or lack novelty, so measures are needed to evaluate how interesting a pattern is.
2. Performance Issues :
Efficiency and scalability of data mining algorithms − In order to effectively extract information
from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the development
of parallel and distributed data mining algorithms.
LECTURE HANDOUTS L 05
1. Data Cleaning
Data can contain many irrelevant and missing parts. Data cleaning handles these: it
involves treating missing data, noisy data, etc.
Data cleaning tasks :
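As an illustration only (the handout's own task list is not reproduced here), two common cleaning steps can be sketched in Python: filling missing values with the attribute mean, and smoothing noisy values by bin means. The sample values and function names are hypothetical.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and
    replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

ages = [23, None, 31, 27, None]                 # hypothetical attribute with gaps
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]     # hypothetical noisy attribute

filled_ages = fill_missing_with_mean(ages)
smoothed_prices = smooth_by_bin_means(prices, 3)
print(filled_ages)
print(smoothed_prices)
```

Mean filling is only one possible choice; the most frequent value or a model-based estimate could be substituted without changing the structure of the sketch.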
LECTURE HANDOUTS L 06
1. Schema Integration and Object Matching: Integrate metadata from different sources.
Matching real-world entities across multiple sources is referred to as the entity identification
problem.
Entity identification problem: identify real-world entities from multiple data sources,
e.g., A.cust-id = B.cust-id
2. Redundancy :
Clustering :
Partition data set into clusters, and one can store cluster representation only.
Can be very effective if data is clustered but not if data is “smeared”.
Can have hierarchical clustering and be stored in multi-dimensional index tree structures.
Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data.
Noisy: Containing random errors or outliers.
Inconsistent: Containing discrepancies in codes or names.
LECTURE HANDOUTS L 07
LECTURE HANDOUTS L 08
1. Data cube aggregation : Aggregation operation is applied to data for the construction of the data
cube.
The lowest level of a data cube
The aggregated data for an individual entity of interest
e.g., a customer in a phone calling data warehouse.
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with it.
Reference appropriate levels
Use the smallest representation which is enough to solve the task.
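The idea of moving between aggregation levels can be sketched as follows. This is a minimal illustration, assuming hypothetical call records of the form (customer, year, quarter, minutes); the function name roll_up is not from the handout.

```python
from collections import defaultdict

# Hypothetical base-level records: (customer, year, quarter, call_minutes)
calls = [
    ("alice", 2023, "Q1", 120),
    ("alice", 2023, "Q1", 30),
    ("alice", 2023, "Q2", 90),
    ("bob", 2023, "Q1", 45),
    ("bob", 2024, "Q1", 60),
]

def roll_up(records, key):
    """Sum call minutes over all records that share the same key."""
    totals = defaultdict(int)
    for record in records:
        totals[key(record)] += record[3]
    return dict(totals)

# Lowest level kept here: one cell per customer, year and quarter.
by_customer_quarter = roll_up(calls, lambda r: (r[0], r[1], r[2]))
# Higher level: quarters are aggregated away.
by_customer_year = roll_up(calls, lambda r: (r[0], r[1]))
# Highest level in this sketch: only the year remains.
by_year = roll_up(calls, lambda r: r[1])

print(by_customer_quarter)
print(by_customer_year)
print(by_year)
```

A task that only needs yearly totals can work from by_year, the smallest of the three representations.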
2. Dimensionality Reduction :
This reduces the size of the data by encoding mechanisms.
It can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is called lossy.
The two effective methods of dimensionality reduction are:
1. Wavelet Transforms
2. PCA (Principal Component Analysis).
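The difference between lossless and lossy reduction can be illustrated with one level of an unnormalized Haar wavelet transform. This is a sketch under stated assumptions: the input length is even, the sample data are hypothetical, and practical wavelet methods apply several levels with normalization.

```python
def haar_step(signal):
    """One Haar level: pairwise averages and pairwise half-differences."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details

def haar_inverse(averages, details):
    """Rebuild the signal from averages and detail coefficients."""
    signal = []
    for avg, det in zip(averages, details):
        signal.extend([avg + det, avg - det])
    return signal

data = [2, 2, 0, 2, 3, 5, 4, 4]          # hypothetical attribute values
averages, details = haar_step(data)

# Keeping every coefficient gives back the original data (lossless).
lossless = haar_inverse(averages, details)
# Dropping the detail coefficients halves the storage but loses information (lossy).
lossy = haar_inverse(averages, [0.0] * len(details))

print(lossless)
print(lossy)
```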
4. Numerosity Reduction :
For example:
Regression Models.
Linear regression
Multiple regression
Log-linear regression
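As a minimal sketch of numerosity reduction by linear regression, the data points below (hypothetical) are replaced by just two stored parameters, the slope and intercept of a least-squares line. The function name fit_line is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                   # hypothetical data lying on y = 2x + 1

slope, intercept = fit_line(xs, ys)
# Only the two model parameters need to be stored; values are re-estimated on demand.
estimate_at_6 = slope * 6 + intercept
print(slope, intercept, estimate_at_6)
```

Real data rarely lie exactly on a line, so the stored model is normally an approximation of the original values.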
An attribute may be redundant if it can be derived from another attribute or set of
attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
The same attribute may have different names in different databases. Careful integration of the data
from multiple sources may help reduce or avoid redundancies and inconsistencies and improve
mining speed and quality.
LECTURE HANDOUTS L 09
Reduce the number of values for a given continuous attribute by dividing the range of the attribute
into intervals.
Interval labels can then be used to replace actual data values.
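Replacing values by interval labels can be sketched with equal-width binning. This is one possible discretization scheme, chosen here for illustration; the sample ages are hypothetical and the sketch assumes the values are not all identical.

```python
def equal_width_labels(values, num_bins):
    """Map each value to the label of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    labels = []
    for v in values:
        # The maximum value is clamped into the last interval.
        index = min(int((v - low) / width), num_bins - 1)
        start = low + index * width
        labels.append(f"[{start:g}, {start + width:g})")
    return labels

ages = [13, 15, 22, 25, 33, 35, 46, 52, 70, 73]   # hypothetical continuous attribute
labels = equal_width_labels(ages, 3)
for age, label in zip(ages, labels):
    print(age, label)
```

Because the maximum is clamped, the last interval is effectively closed on the right even though its label is printed half-open.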
Three types of attributes:
Concept Hierarchy Generation : Attributes are converted from a lower level to a higher level in
the hierarchy.
LECTURE HANDOUTS L 10
Rank correlation coefficients are used only to establish the type of the relationship, not to
investigate it in detail as Pearson’s correlation coefficient does.
They are also used to reduce the calculations and make the results less dependent on the
non-normality of the distributions considered.
Association is a concept, but correlation is a measure of association and mathematical tools are
provided to measure the magnitude of the correlation.
Pearson’s product moment correlation coefficient establishes the presence of a linear relationship
and determines the nature of the relationship (whether the variables are proportional or inversely
proportional).
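The contrast between the two kinds of coefficient can be sketched as follows. Spearman's rank correlation is used here as one example of a rank correlation coefficient (the handout does not name a specific one), computed as Pearson's coefficient on the ranks; the data are hypothetical and the ranking helper assumes no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's product moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 (smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient applied to ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]     # monotone but not linear (y = x**3)

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(r_pearson, r_spearman)
```

On this monotone but nonlinear data the rank coefficient reports a perfect increasing relationship, while Pearson's coefficient stays below 1 because the relationship is not a straight line.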
LECTURE HANDOUTS L 11
LECTURE HANDOUTS L 12
A Pattern-Growth Approach for Mining Frequent Itemsets
To facilitate tree traversal, an item header table is built so that each item points to its occurrences
in the tree via a chain of node-links.
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial suffix
pattern), construct its conditional pattern base (a “sub-database,” which consists of the set of prefix
paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree,
and perform mining recursively on the tree.
The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns
generated from a conditional FP-tree.
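The first part of this procedure, building the FP-tree with its header table and reading off a conditional pattern base, can be sketched as below. This is a simplified illustration rather than a full FP-growth implementation: the recursive mining step is omitted, a plain list per item stands in for the chain of node-links, and the transactions and names are hypothetical.

```python
from collections import Counter, defaultdict

class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}
    root = Node(None, None)
    header = defaultdict(list)   # item -> its nodes in the tree
    for t in transactions:
        # Keep frequent items, ordered by descending support (ties by name).
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths that co-occur with the suffix item, with their counts."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

transactions = [   # hypothetical transaction database
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header = build_fp_tree(transactions, min_support=2)
base_i5 = conditional_pattern_base("I5", header)
print(base_i5)
```

A complete miner would build a conditional FP-tree from each pattern base and recurse, concatenating the suffix with the patterns found there.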
1. Strong Rules Are Not Necessarily Interesting
Whether or not a rule is interesting can be assessed either subjectively or objectively.
Ultimately, only the user can judge if a given rule is interesting, and this judgment, being
subjective, may differ from one user to another.
However, objective interestingness measures, based on the statistics “behind” the data, can
be used as one step toward the goal of weeding out uninteresting rules that would otherwise
be presented to the user.
2. From Association Analysis to Correlation Analysis
As we have seen so far, the support and confidence measures are insufficient at filtering
out uninteresting association rules.
To tackle this weakness, a correlation measure can be used to augment the support–
confidence framework for association rules.
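Lift is one commonly used correlation measure for this purpose; the handout does not name a specific measure here, so its use is an assumption. The sketch below computes support, confidence and lift for a rule A ⇒ B from hypothetical transaction counts.

```python
def rule_measures(n_total, n_a, n_b, n_ab):
    """Support, confidence and lift of the rule A => B from raw counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    support = p_ab
    confidence = p_ab / p_a
    lift = p_ab / (p_a * p_b)      # 1 means A and B are independent
    return support, confidence, lift

# Hypothetical counts: 10,000 transactions, 6,000 contain A,
# 7,500 contain B, and 4,000 contain both.
support, confidence, lift = rule_measures(10_000, 6_000, 7_500, 4_000)
print(support, confidence, lift)
```

With these counts the rule looks strong under a support-confidence view, yet its lift is below 1, which indicates that A and B are negatively correlated.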
LECTURE HANDOUTS L 13
To review, a frequent pattern is a pattern (or itemset) that satisfies a minimum support threshold.
A pattern p is a closed pattern if there is no super-pattern p′ with the same support as p.
Frequent patterns can also be mapped into association rules, or other kinds of rules based on
interestingness measures.
Sometimes we may also be interested in infrequent or rare patterns (i.e., patterns that occur rarely
but are of critical importance) or negative patterns (i.e., patterns that reveal a negative correlation
between items).
Based on the number of dimensions involved in the rule or pattern: If the items or attributes in an
association rule or pattern reference only one dimension, it is a single-dimensional association
rule/pattern.
Association rules generated from mining data at multiple levels of abstraction are called multiple-level
or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence
framework. In general, a top-down strategy is employed, where counts are accumulated for
the calculation of frequent itemsets at each concept level, starting at concept level 1 and working
downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets
can be found.
Using uniform minimum support for all levels (referred to as uniform support): The same minimum
support threshold is used when mining at each level of abstraction. For example, in Figure 5.11, a
minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to
“laptop computer”). Both “computer” and “laptop computer” are found to be frequent, while “desktop
computer” is not.
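The uniform-support search can be sketched as a top-down walk over a concept hierarchy. The support values below are assumed for illustration (chosen so that the outcome matches the description above); only the 5% threshold comes from the text.

```python
# Hypothetical concept hierarchy: parent item -> more specific child items.
hierarchy = {"computer": ["laptop computer", "desktop computer"]}

# Assumed supports, as fractions of all transactions.
support = {
    "computer": 0.10,
    "laptop computer": 0.06,
    "desktop computer": 0.04,
}

def frequent_items_top_down(roots, hierarchy, support, min_support):
    """Visit items level by level, using one threshold at every level.
    Children are examined only when their ancestor is frequent."""
    frequent = []
    frontier = list(roots)
    while frontier:
        item = frontier.pop(0)
        if support[item] >= min_support:
            frequent.append(item)
            frontier.extend(hierarchy.get(item, []))
    return frequent

result = frequent_items_top_down(["computer"], hierarchy, support, 0.05)
print(result)
```

Raising the threshold above the support of "computer" would stop the search at level 1, since no descendant would be examined.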
When a uniform minimum support threshold is used, the search procedure is simplified. The method is
also simple in that users are required to specify only one minimum support threshold. An Apriori-like
optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its
descendants: The search avoids examining itemsets containing any item whose ancestors do not have
minimum support.
LECTURE HANDOUTS L 14
• Items often form a hierarchy. Items at the lower level are expected to have lower support.
• Rules regarding itemsets at appropriate levels could be quite useful.
• The transaction database can be encoded based on dimensions and levels.
• We can explore shared multi-level mining.
• The figure shows the mining of multiple-level association rules from transactional databases.
LECTURE HANDOUTS L 15
A rule is called a single-dimensional or intradimensional association rule when it contains a single distinct
predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within
the rule).
Such rules are commonly mined from transactional data.
Rather than considering transactional data only, sales and related information are often linked with
relational data or integrated into a data warehouse. Such data stores are multidimensional in
nature.
Additional relational information regarding the customers who purchased the items (e.g.,
customer age, occupation, credit rating, income, and address) may also be stored. Considering
each database attribute or warehouse dimension as a predicate, we can therefore mine
association rules containing multiple predicates such as
age(X, “20...29”)∧occupation(X, “student”)⇒buys(X, “laptop”)
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules.
This rule contains three predicates (age, occupation, and buys), each of which occurs only once in
the rule. Hence, we say that it has no repeated predicates.
Multidimensional association rules with no repeated predicates are called interdimensional
association rules.
We can also mine multidimensional association rules with repeated predicates, which contain
multiple occurrences of some predicates.
These rules are called hybrid-dimensional association rules.
An example of such a rule is the following, where the predicate buys is repeated:
age(X, “20...29”)∧buys(X, “laptop”)⇒buys(X, “HP printer”)
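The three rule types differ only in how many distinct predicates appear and whether any is repeated, which the following sketch makes explicit. The rule encoding as lists of (predicate, value) pairs and the function name rule_kind are assumptions for illustration; the third rule is a hypothetical single-predicate example.

```python
from collections import Counter

def rule_kind(antecedent, consequent):
    """Classify a rule given as two lists of (predicate, value) pairs."""
    predicates = [name for name, _ in antecedent + consequent]
    distinct = len(set(predicates))
    repeated = any(c > 1 for c in Counter(predicates).values())
    if distinct == 1:
        return "single-dimensional (intradimensional)"
    return "hybrid-dimensional" if repeated else "interdimensional"

# age(X, "20...29") AND occupation(X, "student") => buys(X, "laptop")
kind_1 = rule_kind([("age", "20...29"), ("occupation", "student")],
                   [("buys", "laptop")])
# age(X, "20...29") AND buys(X, "laptop") => buys(X, "HP printer")
kind_2 = rule_kind([("age", "20...29"), ("buys", "laptop")],
                   [("buys", "HP printer")])
# Hypothetical rule with one predicate: buys(X, "laptop") => buys(X, "HP printer")
kind_3 = rule_kind([("buys", "laptop")], [("buys", "HP printer")])

print(kind_1, kind_2, kind_3, sep="\n")
```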
In the first approach, quantitative attributes are discretized using predefined concept hierarchies.
This discretization occurs before mining.
For instance, a concept hierarchy for income may be used to replace the original numeric values
of this attribute by interval labels such as “0..20K,” “21K..30K,” “31K..40K,” and so on. Here,
discretization is static and predetermined. Chapter 3 on data preprocessing gave several
techniques for discretizing numeric attributes.
The discretized numeric attributes, with their interval labels, can then be treated as nominal
attributes (where each interval is considered a category). We refer to this as mining
multidimensional association rules using static discretization of quantitative attributes.
In the second approach, quantitative attributes are discretized or clustered into “bins” based on
the data distribution.
These bins may be further combined during the mining process. The discretization process is
dynamic and established so as to satisfy some mining criteria, such as maximizing the confidence
of the rules mined.
Because this strategy treats the numeric attribute values as quantities rather than as predefined
ranges or categories, association rules mined from this approach are also referred to as (dynamic)
quantitative association rules.
LECTURE HANDOUTS L 16
Knowledge type constraints: These specify the type of knowledge to be mined, such as association,
correlation, classification, or clustering.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, the
abstraction levels, or the level of the concept hierarchies to be used in mining.
Rule constraints: These specify the form of, or conditions on, the rules to be mined. Such constraints
may be expressed as metarules (rule templates), as the maximum or minimum number of predicates
that can occur in the rule antecedent or consequent, or as relationships among
attributes, attribute values, and/or aggregates. These constraints can be specified using a high-level
declarative data mining query language and user interface.
P1 ∧ P2 ∧ ··· ∧ Pl ⇒ Q1 ∧ Q2 ∧ ··· ∧ Qr
where Pi (i = 1, ..., l) and Qj (j = 1, ..., r) are either instantiated predicates or predicate variables. Let
the number of predicates in the metarule be p = l + r.
To find interdimensional association rules satisfying the template, we need to find all frequent
p-predicate sets, Lp.
We must also have the support or count of the l-predicate subsets of Lp to compute the
confidence of rules derived from Lp.
This is a typical case of mining multidimensional association rules. By extending such
methods using the constraint-pushing techniques described in the following section, we can
derive efficient methods for metarule-guided mining.
Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space
Rule constraints specify expected set/subset relationships of the variables in the mined rules,
constant initiation of variables, and constraints on aggregate functions and other forms of
constraints. Users typically employ their knowledge of the application or data to specify rule
constraints for the mining task.
These rule constraints may be used together with, or as an alternative to, metarule-guided mining.
In this section, we examine rule constraints as to how they can be used to make the mining process
more efficient.
Associative classification, where association rules are generated from frequent patterns and used
for classification.
The general idea is that we can search for strong associations between frequent patterns
(conjunctions of attribute–value pairs) and class labels. The next is Discriminative frequent
pattern–based classification, where frequent patterns serve as combined features, which are
considered in addition to single features when building a classification model.
LECTURE HANDOUTS L 17
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is
required to predict class labels such as “Safe” and “Risky”.
A bank loans officer needs analysis of her data in order to learn which loan applicants are “safe” and
which are “risky” for the bank. A marketing manager at AllElectronics needs data analysis to help
guess whether a customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data in order to predict which one of three specific
treatments a patient should receive. In each of these examples, the data analysis task is classification,
where a model or classifier is constructed to predict categorical labels, such as “safe” or “risky” for
the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or
“treatment C” for the medical data.
These categories can be represented by discrete values, where the ordering among values has no
meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where
there is no ordering implied among this group of treatment regimes.
Learning Step (Training Phase): Construction of the classification model. Different algorithms are
used to build a classifier by making the model learn from the available training set. The model has to
be trained in order to predict accurate results.
Issues Regarding Classification and Prediction :
Data Cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a
missing value with the most commonly occurring value for that attribute, or with the most probable
value based on statistics). Although most classification algorithms have some mechanisms for
handling noisy or missing data, this step can help reduce confusion during learning.
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be
used to identify whether any two given attributes are statistically related. For example, a strong
correlation between attributes A1 and A2 would suggest that one of the two could be removed from
further analysis. A database may also contain irrelevant attributes.
Attribute subset selection can be used in these cases to find a reduced set of attributes such that the
resulting probability distribution of the data classes is as close as possible to the original distribution
obtained using all attributes.
It can be used to detect attributes that do not contribute to the classification or prediction task.
Including such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally,
the time spent on relevance analysis, when added to the time spent on learning from the resulting
“reduced” attribute (or feature) subset, should be less than the time that would have been spent on
learning from the original set of attributes.
Data transformation and reduction: The data may be transformed by normalization, particularly
when neural networks or methods involving distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for example, this
would prevent attributes with initially large ranges (like, say, income) from outweighing attributes
with initially smaller ranges (such as binary attributes).
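Min-max normalization is one way to scale an attribute into such a range; the handout does not name a specific method, so this choice is an assumption. The income values are hypothetical, and the sketch assumes the attribute is not constant.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min          # assumed to be nonzero
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]

incomes = [12_000, 73_600, 98_000]    # hypothetical income attribute

unit_range = min_max(incomes)               # scaled to 0.0 .. 1.0
symmetric = min_max(incomes, -1.0, 1.0)     # scaled to -1.0 .. 1.0
print(unit_range)
print(symmetric)
```

After scaling, a difference in income contributes to a distance computation on the same footing as a difference in a binary attribute.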
Video Content / Details of website for further learning (if any):
https://www.jigsawacademy.com/blogs/data-science/classification-and-prediction-in-data-mining/
Important Books/Journals for further learning including the page nos.:
Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Second Edition, Morgan
Kaufmann Publication, 2010 (Pg. No: 285 - 288)
LECTURE HANDOUTS L 18
Nominal: When more than two outcomes are possible. The values are labels (text) rather than
integers.
Example: One needs to choose some material that comes in different colors, so the color might be Yellow,
Green, Black, or Red.
Generative: It models the distribution of individual classes and tries to learn the model that generates
the data behind the scenes by estimating the assumptions and distributions of the model. It is used to
predict unseen data.
Example: detecting spam emails by looking at previous data. Suppose there are 100 emails divided
1:3, i.e., Class A: 25% (emails that contain the word cheap) and Class B: 75% (emails that do not). We
want to know whether an email that contains the word cheap should be termed spam.
In Class A (i.e., in 25% of the data), 20 out of 25 emails are spam and the rest are not.
In Class B (i.e., in 75% of the data), 70 out of 75 emails are not spam and the rest are spam.
So, if an email contains the word cheap, the probability of it being spam is 20/25 = 80%.
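The 80% figure can be checked with Bayes' rule. The sketch assumes one reading of the example: of 100 emails, 25 contain the word cheap (20 of them spam) and 75 do not (5 of them spam). That reading, and all variable names, are assumptions for illustration.

```python
# Assumed counts of emails by (contains "cheap", is spam).
cheap_spam = 20
cheap_not_spam = 5
no_cheap_spam = 5
no_cheap_not_spam = 70
total = cheap_spam + cheap_not_spam + no_cheap_spam + no_cheap_not_spam

# Quantities a generative model would estimate from the data.
p_spam = (cheap_spam + no_cheap_spam) / total               # prior P(spam)
p_cheap_given_spam = cheap_spam / (cheap_spam + no_cheap_spam)
p_cheap = (cheap_spam + cheap_not_spam) / total             # evidence P(cheap)

# Bayes' rule: P(spam | cheap) = P(cheap | spam) * P(spam) / P(cheap)
p_spam_given_cheap = p_cheap_given_spam * p_spam / p_cheap

# Direct count for comparison: spam emails among those containing "cheap".
direct = cheap_spam / (cheap_spam + cheap_not_spam)
print(p_spam_given_cheap, direct)
```

Both routes give the same posterior, since Bayes' rule is just a rearrangement of the same counts.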
LECTURE HANDOUTS L 19
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node)
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node
(or terminal node) holds a class label.
The topmost node in a tree is the root node.
It represents the concept buys_computer, that is, it predicts whether a customer at AllElectronics
is likely to purchase a computer.
Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
Some decision tree algorithms produce only binary trees (where each internal node branches to
exactly two other nodes), whereas others can produce nonbinary trees.
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data
partition, D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions
the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly,
either a split-point or splitting subset.
Information Gain
ID3 uses information gain as its attribute selection measure.
This measure is based on pioneering work by Claude Shannon on information theory, which
studied the value or “information content” of messages.
Let node N represent or hold the tuples of partition D.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
We can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) =
0.048 bits. Because age has the highest information gain among the attributes (Gain(age) = 0.246 bits
in the textbook example), it is selected as the splitting attribute.
Node N is labeled with age, and branches are grown for each of the attribute’s values. The tuples
are then partitioned accordingly. Notice that the tuples falling into the partition for age = middle_aged
all belong to the same class. Because they all belong to class “yes,” a leaf should therefore be
created at the end of this branch and labeled with “yes.”
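The gain values quoted above can be reproduced with a short computation. The training table is not included in this handout, so the 14 tuples below are an assumption: they are reconstructed to match the textbook example the handout cites, and should be treated as illustrative. Gain is computed as the entropy of D minus the expected entropy after partitioning on the attribute.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# Assumed training tuples: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute_index):
    """Entropy of the class labels minus the weighted entropy
    of the partitions induced by the attribute."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute_index], []).append(row[-1])
    expected = sum(len(p) / len(rows) * entropy(p)
                   for p in partitions.values())
    return entropy(labels) - expected

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"Gain({name}) = {gain:.3f} bits")
print("splitting attribute:", best)
```

Small differences in the third decimal place relative to the quoted values can arise from rounding intermediate results.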
LECTURE HANDOUTS L 20
LECTURE HANDOUTS L 21
Rule-Based Classification
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based
classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression
of the form
IF condition THEN conclusion.
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition. The
“THEN” part (or right side) is the rule consequent.
In the rule antecedent, the condition consists of one or more attribute tests (e.g., age = youth
and student = yes) that are logically ANDed.
The rule’s consequent contains a class prediction (in this case, we are predicting whether a
customer will buy a computer).
R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).
A rule R can be assessed by its coverage and accuracy.
Given a tuple, X, from a class-labeled data set, D, let n_covers be the number of tuples
covered by R; n_correct be the number of tuples correctly classified by R; and |D| be the
number of tuples in D.
We can define the coverage and accuracy of R as coverage(R) = n_covers / |D| and
accuracy(R) = n_correct / n_covers.
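A minimal sketch of these two measures follows, taking coverage as n_covers / |D| and accuracy as n_correct / n_covers. The small data set, the dictionary encoding of tuples, and the function name are hypothetical.

```python
def coverage_and_accuracy(condition, predicted_class, dataset):
    """Return (coverage, accuracy) of a rule on a class-labeled data set.
    A tuple is covered when it satisfies the rule antecedent."""
    covered = [t for t in dataset if condition(t)]
    n_covers = len(covered)
    n_correct = sum(1 for t in covered if t["class"] == predicted_class)
    coverage = n_covers / len(dataset)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# Hypothetical class-labeled tuples.
D = [
    {"age": "youth", "student": "yes", "class": "yes"},
    {"age": "youth", "student": "yes", "class": "no"},
    {"age": "youth", "student": "no", "class": "no"},
    {"age": "senior", "student": "yes", "class": "yes"},
]

# Rule: IF age = youth AND student = yes THEN class = yes
coverage, accuracy = coverage_and_accuracy(
    lambda t: t["age"] == "youth" and t["student"] == "yes", "yes", D)
print(coverage, accuracy)
```

Here the rule covers two of the four tuples and classifies one of those two correctly.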
Rule Induction Using a Sequential Covering Algorithm
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
LECTURE HANDOUTS L 22
MCA II / III
Course Name with Code : Data Mining And Data Warehousing / 19CAC16
Each node in the directed acyclic graph represents a random variable. The variables may be discrete or
continuous-valued. They may correspond to actual attributes given in the data or to “hidden variables”
believed to form a relationship (e.g., in the case of medical data, a hidden variable may indicate a
syndrome, representing a number of symptoms that, together, characterize a specific disease).
Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a
parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally
independent of its nondescendants in the graph, given its parents.
Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labeled training tuples into individual classes.
If we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure (i.e., all the tuples that fall into a given partition would
belong to the same class).
Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
Information Gain
ID3 uses information gain as its attribute selection measure.
This measure is based on pioneering work by Claude Shannon on information theory, which
studied the value or “information content” of messages.
Let node N represent or hold the tuples of partition D.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
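As a rough sketch of how this selection works, the Python below uses the standard entropy-based definitions, Info(D) = -Σ p_i log2(p_i) and Gain(A) = Info(D) - Σ_j (|D_j|/|D|) Info(D_j). The handout does not state these formulas, so they are supplied here as an assumption, and the four tuples are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Expected information (entropy) needed to classify a tuple: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attribute, label):
    """Gain(A) = Info(D) - sum over partitions D_j of (|D_j| / |D|) * Info(D_j)."""
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[label])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([r[label] for r in rows]) - info_a


# Hypothetical partition D held by node N.
rows = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "no", "buys_computer": "no"},
]

gain_student = info_gain(rows, "student", "buys_computer")
gain_age = info_gain(rows, "age", "buys_computer")
best = max(["age", "student"], key=lambda a: info_gain(rows, a, "buys_computer"))
print(gain_student, gain_age, best)
```

In this toy partition, splitting on student yields pure partitions, so it has the higher gain and would be chosen as the splitting attribute.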
A belief network has one conditional probability table (CPT) for each variable. The CPT for a
variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents
of Y. Figure (b) shows a CPT for the variable LungCancer. The conditional probability for each
known value of LungCancer is given for each possible combination of values of its parents.
LECTURE HANDOUTS L 23
The back propagation algorithm performs learning on a multilayer feed-forward neural network.
It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden layers,
and an output layer. Each layer is made up of units. The inputs to the network correspond to the
attributes measured for each training tuple.
The inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer of “neuron
like” units, known as a hidden layer. The outputs of the hidden layer units can be input to another
hidden layer, and so on. The number of hidden layers is arbitrary, although in practice, usually
only one is used.
Algorithm: Back propagation. Neural network learning for classification or numeric prediction,
using the back propagation algorithm.
Input:
D, a data set consisting of the training tuples and their associated target values;
l, the learning rate;
network, a multilayer feed-forward network.
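The handout gives only the inputs of the algorithm, so the following is a minimal sketch of one back propagation update for a single training tuple, assuming sigmoid units, one hidden layer, and one output unit. The initial weights, biases, and the tuple itself are illustrative values, not taken from the handout; `l` is the learning rate as named above.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Propagate the inputs forward; return hidden-unit outputs and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def backprop_step(x, target, w_hidden, b_hidden, w_out, b_out, l):
    """One update: compute errors at the output, propagate them back, adjust weights."""
    hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
    err_out = output * (1 - output) * (target - output)
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]
    new_w_out = [w + l * err_out * h for w, h in zip(w_out, hidden)]
    new_b_out = b_out + l * err_out
    new_w_hidden = [[w + l * e * xi for w, xi in zip(ws, x)]
                    for ws, e in zip(w_hidden, err_hidden)]
    new_b_hidden = [b + l * e for b, e in zip(b_hidden, err_hidden)]
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out


# Illustrative network: 3 inputs, 2 hidden units, 1 output unit (assumed values).
x, target, l = [1, 0, 1], 1, 0.9
w_hidden = [[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]]
b_hidden = [-0.4, 0.2]
w_out, b_out = [-0.3, -0.2], 0.1

_, before = forward(x, w_hidden, b_hidden, w_out, b_out)
updated = backprop_step(x, target, w_hidden, b_hidden, w_out, b_out, l)
_, after = forward(x, *updated)
print(before, after)
```

After the update the output for this tuple moves toward its target value of 1, which is the effect each iteration of the algorithm aims for.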
LECTURE HANDOUTS L 24
Support Vector Machines (SVM) : A classification method for both linear and nonlinear data. It
uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new
dimension, it searches for the linear optimal separating hyperplane (i.e., decision boundary). With an
appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be
separated by a hyperplane. SVM finds this hyperplane using support vectors (essential training tuples)
and margins (defined by the support vectors).
Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear
decision boundaries (margin maximization). Used for both classification and numeric prediction.
Applications: handwritten digit recognition, object recognition, speaker identification, and benchmark
time-series prediction tests.
The Case When the Data Are Linearly Separable : An SVM approaches this problem by searching
for the maximum marginal hyperplane. Consider the figure below, which shows two possible
separating hyperplanes and their associated margins. Before we get into the definition of margins, let’s
take an intuitive look at the figure.
Both hyperplanes can correctly classify all of the given data tuples. Intuitively, however, we expect the
hyperplane with the larger margin to be more accurate at classifying future data tuples than the
hyperplane with the smaller margin.
This is why (during the learning or training phase), the SVM searches for the hyperplane with the
largest margin, that is, the maximum marginal hyperplane (MMH).
The associated margin gives the largest separation between classes. Getting to an informal definition of
margin, we can say that the shortest distance from a hyperplane to one side of its margin is equal to the
shortest distance from the hyperplane to the other side of its margin, where the sides of the margin are
parallel to the hyperplane.
When dealing with the MMH, this distance is, in fact, the shortest distance from the MMH to the
closest training tuple of either class.
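To make the comparison of margins concrete, the sketch below measures, for a hyperplane w·x + b = 0, the shortest distance |w·x + b| / ||w|| to any training tuple. The four tuples and the two candidate hyperplanes are assumptions chosen for illustration; this compares given hyperplanes and is not the SVM training procedure itself.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))


def margin(w, b, points):
    """Shortest distance from the hyperplane to the closest training tuple of either class."""
    return min(distance_to_hyperplane(w, b, x) for x in points)


# Hypothetical training tuples: two from each class.
points = [(3, 3), (4, 3), (1, 1), (0, 1)]

# Two hyperplanes that both separate the classes (assumed for illustration).
margin_a = margin((1, 1), -4, points)   # x1 + x2 - 4 = 0
margin_b = margin((1, 0), -2, points)   # x1 - 2 = 0
print(margin_a, margin_b)
```

Both hyperplanes classify the four tuples correctly, but the first leaves the larger margin, so it is the one an SVM would prefer between these two.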
The Case When the Data Are Linearly Inseparable : We learned about linear SVMs for classifying
linearly separable data, but what if the data are not linearly separable, that is, no straight line can be
found that would separate the classes?
The linear SVMs we studied would not be able to find a feasible solution here.
The approach described for linear SVMs can be extended to create nonlinear SVMs for the
classification of linearly inseparable data (also called nonlinearly separable data, or nonlinear data, for
short).
Such SVMs are capable of finding nonlinear decision boundaries (i.e., nonlinear hypersurfaces) in
input space.
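A small sketch may help show why mapping to a higher dimension works. The one-dimensional data and the mapping phi(x) = (x, x^2) below are assumptions picked for illustration; a real nonlinear SVM chooses the mapping implicitly through a kernel function.

```python
# Hypothetical 1-D data: the positive class lies on both sides of the negative class.
xs = [-3.0, -1.0, 0.0, 1.0, 3.0]
ys = [+1, -1, -1, -1, +1]


def separable_by_threshold(xs, ys):
    """In 1-D a linear boundary is a single threshold; try every gap and both orientations."""
    ordered = sorted(xs)
    thresholds = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    for t in thresholds:
        for sign in (+1, -1):
            if all((sign if x > t else -sign) == y for x, y in zip(xs, ys)):
                return True
    return False


def phi(x):
    """Assumed nonlinear mapping into two dimensions."""
    return (x, x * x)


# In the mapped space the hyperplane x^2 - 4 = 0 (w = (0, 1), b = -4) separates the classes.
w, b = (0.0, 1.0), -4.0
predictions = [1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1 for p in map(phi, xs)]
print(separable_by_threshold(xs, ys), predictions)
```

No threshold separates the original data, yet a straight line in the mapped space does, and that line corresponds to a nonlinear boundary in the original input space.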
Lazy Learners (or Learning from Your Neighbors) : The classification methods like decision tree
induction, Bayesian classification, rule-based classification, classification by back propagation, support
vector machines, and classification based on association rule mining are all examples of eager learners.
Eager learners, when given a set of training tuples, will construct a generalization (i.e., classification)
model before receiving new (e.g., test) tuples to classify.
We can think of the learned model as being ready and eager to classify previously unseen tuples.
In a contrasting lazy approach, the learner instead waits until the last minute before doing any
model construction to classify a given test tuple. That is, when given a training tuple, a lazy learner
simply stores it (or does only a little minor processing) and waits until it is given a test tuple.
Only when it sees the test tuple does it perform generalization to classify the tuple based on its
similarity to the stored training tuples.
LECTURE HANDOUTS L 25
k-nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test
tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each
tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in
an n-dimensional pattern space.
When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k
training tuples that are closest to the unknown tuple. These k training tuples are the k “nearest
neighbors” of the unknown tuple. Closeness is defined in terms of a distance metric, such as Euclidean
distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2
= (x21, x22, ..., x2n), is
$\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$
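The search for the k nearest neighbors can be sketched as follows. The training tuples and class labels are hypothetical, and classifying by majority vote among the neighbors is the usual choice, assumed here because the handout does not spell out the voting step.

```python
import math
from collections import Counter


def euclidean(x1, x2):
    """Euclidean distance between two n-dimensional tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_classify(training, unknown, k):
    """Return the most common class among the k training tuples closest to `unknown`."""
    neighbors = sorted(training, key=lambda t: euclidean(t[0], unknown))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training tuples stored in a 2-dimensional pattern space.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

label_near_a = knn_classify(training, (2, 2), k=3)
label_near_b = knn_classify(training, (6.5, 6.5), k=3)
print(label_near_a, label_near_b)
```

Note that all the work happens at classification time, which is what makes this a lazy learner.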
Case-Based Reasoning : Case-based reasoning (CBR) classifiers use a database of problem solutions
to solve new problems. Unlike nearest-neighbor classifiers, which store training tuples as points in
Euclidean space, CBR stores the tuples or “cases” for problem solving as complex symbolic
descriptions.
Business applications of CBR include problem resolution for customer service help desks, where cases
describe product-related diagnostic problems. CBR has also been applied to areas such as engineering
and law, where cases are either technical designs or legal rulings, respectively. Medical education is
another area for CBR, where patient case histories and treatments are used to help diagnose and treat
new patients.
Introducing Ensemble Methods
Bagging, boosting, and random forests are examples of ensemble methods. An ensemble combines
a series of k learned models (or base classifiers), M1, M2,..., Mk, with the aim of creating an
improved composite classification model, M∗.
A given data set, D, is used to create k training sets, D1, D2,..., Dk, where Di (1≤i≤k) is used to
generate classifier Mi.
Given a new data tuple to classify, the base classifiers each vote by returning a class prediction.
The ensemble returns a class prediction based on the votes of the base classifiers.
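The voting step can be sketched in a few lines. The three base classifiers below are hypothetical hand-written rules standing in for learned models M1, M2, M3, and simple majority voting is assumed as the way the votes are combined.

```python
from collections import Counter


def ensemble_predict(base_classifiers, x):
    """Each base classifier votes; the ensemble returns the majority class."""
    votes = [m(x) for m in base_classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical base classifiers (in practice each Mi is learned from training set Di).
def m1(x):
    return "yes" if x["income"] > 50 else "no"


def m2(x):
    return "yes" if x["age"] < 30 else "no"


def m3(x):
    return "yes" if x["student"] else "no"


new_tuple = {"income": 60, "age": 45, "student": True}
prediction = ensemble_predict([m1, m2, m3], new_tuple)
print(prediction)
```

Here M1 and M3 vote "yes" and M2 votes "no", so the composite model M* predicts "yes".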
Oversampling, undersampling, and threshold moving do not involve any changes to the construction
of the classification model.
That is, oversampling and undersampling change the distribution of tuples in the training set;
threshold moving affects how the model makes decisions when classifying new data.
Both oversampling and undersampling change the training data distribution so that the rare
(positive) class is well represented. Oversampling works by resampling the positive tuples so that
the resulting training set contains an equal number of positive and negative tuples.
Undersampling works by decreasing the number of negative tuples. It randomly eliminates tuples
from the majority (negative) class until there are an equal number of positive and negative tuples.
The threshold-moving approach to the class imbalance problem does not involve any sampling. It
applies to classifiers that, given an input tuple, return a continuous output value. That is, for an input
tuple, X, such a classifier returns as output a mapping, f (X)→[0,1].
In the simplest approach, tuples for which f (X)≥t, for some threshold, t, are considered positive,
while all other tuples are considered negative.
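The three techniques can be sketched as below. The tuple names, the fixed random seed, and the assumption that the negative class is the larger one are all illustrative choices, not details from the handout.

```python
import random


def oversample(positives, negatives, rng):
    """Resample positive tuples (with replacement) until both classes are the same size."""
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return positives + extra, negatives


def undersample(positives, negatives, rng):
    """Randomly keep only as many negative tuples as there are positive tuples."""
    return positives, rng.sample(negatives, len(positives))


def classify_with_threshold(score, t):
    """Threshold moving: tuples with f(X) >= t are considered positive."""
    return "positive" if score >= t else "negative"


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
positives = ["p1", "p2", "p3"]              # rare class (hypothetical tuples)
negatives = [f"n{i}" for i in range(10)]    # majority class

over_pos, over_neg = oversample(positives, negatives, rng)
under_pos, under_neg = undersample(positives, negatives, rng)
print(len(over_pos), len(over_neg), len(under_pos), len(under_neg))
print(classify_with_threshold(0.4, t=0.5), classify_with_threshold(0.4, t=0.3))
```

Lowering t from 0.5 to 0.3 makes the same score count as positive, which is how threshold moving makes rare-class tuples easier to detect without touching the training data.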
LECTURE HANDOUTS L 26
The goal of the model is to make the sum of the squares as small as possible. The sum of squares is a
measure that tracks how far the Y observations vary from the nonlinear (curved) function that is used
to predict Y.
It is computed by first finding the difference between the fitted nonlinear function and every Y point
of data in the set. Then, each of those differences is squared. Lastly, all of the squared figures are
added together. The smaller the sum of these squared figures, the better the function fits the data
points in the set. Nonlinear regression uses logarithmic functions, trigonometric functions, exponential
functions, power functions, Lorenz curves, Gaussian functions, and other fitting methods.
Nonlinear regression modeling is similar to linear regression modeling in that both seek to track a
particular response from a set of variables graphically. Nonlinear models are more complicated than
linear models to develop because the function is created through a series of approximations (iterations)
that may stem from trial-and-error. Mathematicians use several established methods, such as the
Gauss-Newton method and the Levenberg-Marquardt method.
One example of how nonlinear regression can be used is to predict population growth over time. A
scatterplot of changing population data over time shows that there seems to be a relationship between
time and population growth, but that it is a nonlinear relationship, requiring the use of a nonlinear
regression model. A logistic population growth model can provide estimates of the population for
periods that were not measured, and predictions of future population growth.
Nonlinear Regression Prediction Model
Due to the complex nature and variety of real-world data, using a simple linear relationship to
describe the changes and trend of a time series is clumsy and inaccurate.
Nonlinear regression modeling should be used to better describe such data, as shown by the red
curve (dark gray in print versions) in Fig. 3A. Between each pair of partition blocks, a nonlinear
regression function should be calculated for prediction. For example, as shown in Fig. 3B, with the
sampled data points from i to i + l, a function should be calculated by a nonlinear regression
model for prediction within that section.
How can we model data that does not show a linear dependence? For example, what if a given
response variable and predictor variable have a relationship that may be modeled by a polynomial
function? Think back to the straight-line linear regression case, where the dependent response
variable, y, is modeled as a linear function of a single independent predictor variable, x.
What if we can get a more accurate model using a nonlinear model, such as a parabola or some other
higher-order polynomial? Polynomial regression is often of interest when there is just one predictor
variable.
It can be modeled by adding polynomial terms to the basic linear model. By applying transformations
to the variables, we can convert the nonlinear model into a linear one that can then be solved by the
method of least squares.
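The transformation idea can be sketched as follows: treating x, x^2, ... as new variables turns a polynomial model into a linear one, which least squares can solve through the normal equations. The solver, the function names, and the data (generated from y = 1 + 2x + 3x^2) are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = w0 + w1*x + ... + wd*x^d via the normal equations."""
    rows = [[x ** p for p in range(degree + 1)] for x in xs]   # transformed variables
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data lying exactly on the parabola y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coefficients = fit_polynomial(xs, ys, degree=2)
print(coefficients)
```

The recovered coefficients are approximately w0 = 1, w1 = 2, w2 = 3, showing that the model is linear in its weights even though it is nonlinear in x.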
Frequent patterns and their corresponding association or correlation rules characterize interesting
relationships between attribute conditions and class labels, and thus have been recently used for
effective classification.
Association rules show strong associations between attribute-value pairs (or items) that occur
frequently in a given data set. Association rules are commonly used to analyze the purchasing patterns
of customers in a store. Such analysis is useful in many decision-making processes, such as product
placement, catalog design, and cross-marketing.
The discovery of association rules is based on frequent itemset mining. Many methods for frequent
itemset mining and the generation of association rules were described earlier. In this section, we look
at associative classification, where association rules are generated and analyzed for use in
classification. The general idea is that we can search for strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels. Because association rules explore highly
confident associations among multiple attributes, this approach may overcome some constraints
introduced by decision tree induction, which considers only one attribute at a time.
LECTURE HANDOUTS L 27
Regression analysis is a good choice when all of the predictor variables are continuous-valued
as well. Many problems can be solved by linear regression, and even more can be
tackled by applying transformations to the variables so that a nonlinear problem can be
converted to a linear one.
For reasons of space, we cannot give a fully detailed treatment of regression. Instead, this
section provides an intuitive introduction to straight-line regression analysis (which
involves a single predictor variable) and multiple linear regression analysis (which involves
two or more predictor variables).
Several software packages exist to solve regression problems. Examples include SAS
(www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com).
Another useful resource is the book Numerical Recipes in C, by Press, Flannery, Teukolsky,
and Vetterling, and its associated source code.
Straight-line regression models the response variable, y, as a linear function of a single
predictor variable, x:
y = b + wx,
where the variance of y is assumed to be constant, and b and w are regression coefficients
specifying the Y-intercept and slope of the line, respectively. The regression coefficients, w
and b, can also be thought of as weights, so that we can equivalently write
y = w0 + w1x.
These coefficients can be solved for by the method of least squares, which estimates the
best-fitting straight line as the one that minimizes the error between the actual data and the
estimate of the line.
Let D be a training set consisting of values of predictor variable, x, for some population and
their associated values for response variable, y.
The training set contains |D| data points of the form (x1, y1), (x2, y2),..., (x|D| , y|D| ).
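The regression coefficients can be estimated using this method with the following equations:
$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$
$w_0 = \bar{y} - w_1 \bar{x}$
where $\bar{x}$ is the mean value of x1, x2,..., x|D| and $\bar{y}$ is the mean value of y1, y2,..., y|D|.
Note that earlier, we had used the notation (Xi, yi) to refer to a training tuple Xi and its
associated class label yi.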
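The least-squares estimates of the slope and intercept for straight-line regression can be computed directly from the training set. The sketch below uses the standard closed-form estimates; the data points are hypothetical and lie exactly on the line y = 1 + 2x so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w0 (intercept) and w1 (slope) for y = w0 + w1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical training set D of (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

w0, w1 = fit_line(xs, ys)
predicted = w0 + w1 * 6.0   # predict the response for a new value x = 6
print(w0, w1, predicted)
```

With real data the points would scatter around the line, and the same two formulas would give the line that minimizes the squared error.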
Accuracy − The accuracy of a classifier refers to its ability to predict the class label
correctly; the accuracy of a predictor refers to how well it can guess the value of the
predicted attribute for new data.
Speed − This refers to the computational cost of generating and using the classifier or
predictor.
Robustness − This refers to the ability of the classifier or predictor to make correct predictions
from noisy data.
Scalability − This refers to the ability to construct the classifier or predictor efficiently
given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor can be understood.
Suppose a marketing manager needs to predict how much a given customer will spend during a sale
at the company. In this example we are concerned with predicting a numeric value. Therefore the data
analysis task is an example of numeric prediction. In this case, a model or a predictor will be
constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
LECTURE HANDOUTS L 28
Clustering is an unsupervised learning concept. It can be used as a stand-alone tool to gain insight
into the data distribution, or as a preprocessing step for other algorithms.
Applications of Clustering
Biology: Taxonomy of living things like kingdom, phylum, class, order, family, genus and species.
Information retrieval: Document clustering.
Land use: Identification of areas of similar land use in an earth observation database.
Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs.
1. Interval-scaled variables :
Interval-scaled variables are continuous measurements on a roughly linear scale.
Example : weight and height, latitude and longitude coordinates (e.g., when clustering houses),
and weather temperature.
2. Binary variables :
A binary variable is a variable that can take only two values.
Example : a gender variable recorded as either male or female.
3. Ordinal variables :
An ordinal variable can be discrete or continuous; its values have a meaningful order.
Example : rank.
4. Ratio-scaled variables :
A ratio-scaled variable is a positive measurement on a nonlinear scale, such as an exponential scale.
Example : $Ae^{Bt}$ or $Ae^{-Bt}$.
LECTURE HANDOUTS L 29
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
where p is an object in cluster C_i and c_i is the center of C_i.
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimal: exhaustively enumerate all partitions.
Heuristic methods: k-means and k-medoids algorithms.
k-means : Each cluster is represented by the center of the cluster.
k-medoids or PAM (Partitioning Around Medoids) : Each cluster is represented by one of the objects
in the cluster.
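A minimal k-means sketch follows, assuming Euclidean distance and user-supplied initial centers; the six points and the starting centers are hypothetical. It also evaluates the sum of squared distances from each object to the center of its cluster, which is the usual reading of the criterion E.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centers, max_iter=100):
    """Alternate between assigning objects to the nearest center and recomputing centers."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


def criterion_e(centers, clusters):
    """Sum of squared distances from every object to the center of its cluster."""
    return sum(squared_distance(p, c) for c, cluster in zip(centers, clusters) for p in cluster)


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, centers=[(1.0, 1.0), (9.0, 8.0)])
e_value = criterion_e(centers, clusters)
print(centers, e_value)
```

Because k-means is a heuristic, a different choice of initial centers can lead to a different (possibly worse) local optimum of E.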
LECTURE HANDOUTS L 30
Hierarchical Methods:
This method does not require the number of clusters k as an input, but needs a termination condition.
CHAMELEON (G. Karypis, E. H. Han, and V. Kumar, 1999) measures similarity based on a
dynamic model.
Two clusters are merged only if the interconnectivity and closeness (proximity) between the two
clusters are high relative to the internal interconnectivity of the clusters and the closeness of
items within the clusters.
It is a graph-based, two-phase algorithm:
1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small
sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly
combining these sub-clusters.
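The agglomerative idea, together with the need for a termination condition, can be sketched as below. This is plain single-link agglomerative clustering on one-dimensional points, not CHAMELEON's dynamic model; the data, the function names, and the `max_merge_distance` stopping rule are assumptions for illustration.

```python
def single_link(c1, c2):
    """Distance between two clusters: the smallest distance between any pair of members."""
    return min(abs(a - b) for a in c1 for b in c2)


def agglomerative(points, max_merge_distance):
    """Start with singleton clusters and repeatedly merge the two closest clusters.

    Termination condition: stop when the closest pair is farther apart than
    max_merge_distance (or when a single cluster remains).
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        pairs = [(single_link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > max_merge_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


points = [1.0, 1.5, 2.0, 10.0, 10.5, 20.0]
result = agglomerative(points, max_merge_distance=2.0)
print(result)
```

No value of k is supplied; the number of clusters that comes out depends entirely on the termination condition.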
LECTURE HANDOUTS L 31
Major Features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Two parameters:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points in an Eps-neighborhood of a point.
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn,
p1 = q, pn = p such that pi+1 is directly density-reachable from pi.
Density-connected :
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and
q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise :
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
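A compact sketch of the DBSCAN procedure is given below, assuming Euclidean distance and counting a point as a member of its own Eps-neighborhood. The points and parameter values are hypothetical, and noise is marked with the label -1 by convention.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise (outliers)."""
    noise = -1
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise            # may later turn out to be a border point
            continue
        labels[i] = cluster_id           # i is a core point: start a new cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == noise:
                labels[j] = cluster_id   # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also core: keep expanding
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The two dense groups become clusters 0 and 1, while the isolated point (5, 5) has too few neighbors and is reported as noise.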
LECTURE HANDOUTS L 32
Each cell at a high level is partitioned into a number of smaller cells at the next lower level. Statistical
information about each cell is calculated and stored beforehand and is used to answer queries.
Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
count, mean, s (standard deviation), min, max, and the type of distribution (normal, uniform, etc.).
A top-down approach is used to answer spatial data queries:
1. Start from a pre-selected layer, typically one with a small number of cells.
2. For each cell in the current level, compute the confidence interval.
3. Remove the irrelevant cells from further consideration.
4. When the current layer has been examined, proceed to the next lower level.
5. Repeat this process until the bottom layer is reached.
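The claim that a higher-level cell's parameters follow from its lower-level cells can be checked with a small sketch. The dictionary layout and the two child cells are assumptions, `s` is taken to be the population standard deviation, and every child is assumed to be non-empty.

```python
import math


def merge_cells(children):
    """Compute count, mean, s, min, and max of a parent cell from its child cells."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # E[x^2] of the parent, recovered from each child's s^2 + mean^2.
    second_moment = sum(c["count"] * (c["s"] ** 2 + c["mean"] ** 2) for c in children) / n
    return {
        "count": n,
        "mean": mean,
        "s": math.sqrt(max(second_moment - mean ** 2, 0.0)),
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }


# Hypothetical child cells summarizing the values [1, 3] and [6, 10].
cell_a = {"count": 2, "mean": 2.0, "s": 1.0, "min": 1.0, "max": 3.0}
cell_b = {"count": 2, "mean": 8.0, "s": 2.0, "min": 6.0, "max": 10.0}

parent = merge_cells([cell_a, cell_b])
print(parent)
```

The parent's statistics match those of the combined values [1, 3, 6, 10] without revisiting the raw data, which is why queries can be answered from the stored cell summaries alone.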
Advantages:
Query-independent, easy to parallelize, and supports incremental updates.
Query processing time is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
A multi-resolution clustering approach applies a wavelet transform to the feature space; it is both
grid-based and density-based.
Wavelet transform: A signal processing technique that decomposes a signal into different frequency
sub-bands.
Data are transformed to preserve the relative distance between objects at different levels of resolution.
This allows natural clusters to become more distinguishable.
How to apply the wavelet transform to find clusters:
1. Summarize the data by imposing a multidimensional grid structure onto the data space.
2. Represent these multidimensional spatial data objects in an n-dimensional feature space.
3. Apply the wavelet transform on the feature space to find the dense regions in the feature space.
4. Apply the wavelet transform multiple times, which results in clusters at different scales from
fine to coarse.
Major Features:
1. Complexity O(N)
2. Detect arbitrary shaped clusters at different scales.
3. Not sensitive to noise, not sensitive to input order.
4. Only applicable to low dimensional data.
CLIQUE automatically identifies subspaces of a high-dimensional data space that allow better
clustering than the original space.
CLIQUE can be considered both density-based and grid-based.
It partitions each dimension into the same number of equal-length intervals.
LECTURE HANDOUTS L 33
Cluster analysis aims to find hidden categories. A hidden category (i.e., probabilistic cluster) is a
distribution over the data space, which can be mathematically represented using a probability density
function (or distribution function).
Example :
Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} such that P(O|Θ) is
maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian
distribution. We initially assign random values to the parameters θj, then iteratively conduct the E- and
M-steps until convergence or until the change is sufficiently small. At the E-step, for each object oi,
calculate the probability that oi belongs to each distribution. At the M-step, adjust the parameters θj
so that P(O|Θ) is maximized.
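The E- and M-steps for univariate Gaussian clusters can be sketched as follows. Equal mixing proportions are assumed, the initial parameters are fixed rather than random so the run is repeatable, a lower bound on sigma guards against collapse, and the six objects are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def em_gaussian_mixture(objects, thetas, iterations=50, min_sigma=1e-3):
    """Fit k univariate Gaussians theta_j = (mu_j, sigma_j) with the EM algorithm."""
    for _ in range(iterations):
        # E-step: probability that each object belongs to each distribution.
        memberships = []
        for o in objects:
            densities = [gaussian_pdf(o, mu, sigma) for mu, sigma in thetas]
            total = sum(densities)
            memberships.append([d / total for d in densities])
        # M-step: re-estimate each (mu_j, sigma_j) from the weighted objects.
        new_thetas = []
        for j in range(len(thetas)):
            weights = [m[j] for m in memberships]
            weight_sum = sum(weights)
            mu = sum(w * o for w, o in zip(weights, objects)) / weight_sum
            var = sum(w * (o - mu) ** 2 for w, o in zip(weights, objects)) / weight_sum
            new_thetas.append((mu, max(math.sqrt(var), min_sigma)))
        thetas = new_thetas
    return thetas


objects = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
thetas = em_gaussian_mixture(objects, thetas=[(0.0, 1.0), (6.0, 1.0)])
print(thetas)
```

With these well-separated objects the estimated means settle near 1.0 and 5.0; with a poor starting point the same procedure can stop at a local optimum instead.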
Advantages :
Mixture models are more general than partitioning and fuzzy clustering.
Clusters can be characterized by a small number of parameters.
The results may satisfy the statistical assumptions of the generative models.
Disadvantages :
May converge to a local optimum (to mitigate this, run multiple times with random initialization).
Computationally expensive if the number of distributions is large or the data set contains very
few observed data points.
Need large data sets.
Hard to estimate the number of clusters.
COBWEB
A popular and simple method of incremental conceptual learning. It creates a hierarchical clustering
in the form of a classification tree. Each node refers to a concept and contains a probabilistic
description of that concept.
A self-organizing map (SOM) maps all the points in a high-dimensional source space into a 2- to 3-D
target space such that the distance and proximity relationships (i.e., topology) are preserved as much
as possible. It is similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the
feature space. Clustering is performed by having several units compete for the current object.
SOMs are believed to resemble processing that can occur in the brain. They are useful for visualizing
high-dimensional data in 2- or 3-D space.
Video Content / Details of website for further learning (if any):
https://www.lancaster.ac.uk/stor-i-student-sites/hamish-thorburn/2020/02/23/model-based-clustering/
Important Books/Journals for further learning including the page nos.:
Data Mining Concepts and Techniques Jiawei Han and Micheline Kamber Second Edition,Morgan
Kaufmann Publication 2010 (Pg.No : 429 - 433)
LECTURE HANDOUTS L 34
Major challenges:
Many irrelevant dimensions may mask clusters. Distance measures become meaningless because of
equidistance. Clusters may exist only in some subspaces.
METHODS :
Subspace clustering: Search for clusters existing in subspaces of the given high-dimensional data
space. Examples: CLIQUE, ProClus, and bi-clustering approaches.
Dimensionality reduction approaches: Construct a much lower-dimensional space and search for
clusters there (new dimensions may be constructed by combining some dimensions in the original data).
Examples: dimensionality reduction methods and spectral clustering.
EXAMPLE :
Traditional distance measures could be dominated by noise in many dimensions.
Ex. Which pairs of customers are more similar?
Subspace High-Dimensional Clustering Methods :
LECTURE HANDOUTS L 35
The analysis of outlier data is referred to as outlier mining. Outliers may be detected using statistical
tests that assume a distribution or probability model for the data. Many data mining algorithms try to
minimize the influence of outliers or eliminate them altogether. Outlier detection and analysis is an
interesting data mining task.
The easiest way to detect outliers is to create a graph. Plots such as box plots, scatter plots, and
histograms can help to detect outliers. Alternatively, we can use the mean and standard deviation to list
the outliers. The interquartile range and quartiles can also be used to detect outliers.
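Both numeric approaches can be sketched with Python's standard `statistics` module. The data values, the z-score threshold of 2.5, and the conventional 1.5 × IQR fences are assumptions chosen for illustration rather than values given in the handout.

```python
import statistics


def zscore_outliers(data, threshold):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) > threshold * sigma]


def iqr_outliers(data, k=1.5):
    """Values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]


# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100]

by_zscore = zscore_outliers(data, threshold=2.5)
by_iqr = iqr_outliers(data)
print(by_zscore, by_iqr)
```

Both methods flag the value 100 here. The mean and standard deviation are themselves pulled by extreme values, so the quartile-based fences are often the more robust of the two on small samples.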
Here is another illustration of an outlier. If you look at the Histogram below, you will see that one value
lies far to the left of all other data. This data point is an outlier.
How Can Outlier Detection Improve Business Analysis?
Outlier data points can represent either a) items that are so far outside the norm that they need not be
considered or b) the illustration of a very unique and singular category or variable that is worth
exploring either to capitalize on a niche or find an area where an organization can offer a unique focus.
When considering the use of Outlier analysis, a business should first think about why they want to find
the outliers and what they will do with that data. That focus will help the business to select the right
method of analysis, graphing or plotting to reveal the results they need to see and understand.
When considering the use of Outlier analysis, it is important to recognize that, when the Outlier
analysis is applied to certain datasets, the results will indicate that outliers should be discounted, while
in other cases, the outlier results will indicate that the organization focus solely on those outliers.
For example, if an outlier indicates a risk or a mistake, that outlier should be identified and the risk or
mistake should be addressed.
If an outlier indicates an exceptional result, such as a person that recovered from a particular disease in
spite of the fact that most other patients did not survive, the organization will want to perform further
analysis on the outlier result to identify the unique aspects that may be responsible for the patient’s
recovery.
When a business uses Outlier analysis, it is important to test the results and analyze the overall dataset
and environment to be sure that the presence of outliers does not indicate that the data set may be more
complex than anticipated and may require a different form of analysis.
Errors such as computational errors or the incorrect entry of an object can cause outliers. Outliers
differ from noise as follows:
Noise is a random error or variance in a measured variable.
Before detecting the outliers present in a dataset, it is advisable to remove the noise.
LECTURE HANDOUTS L 36
Distance-based outlier detection depends on the overall distribution of the given set of data points,
so it has difficulty identifying outliers when the data is not uniformly distributed.
The density-based local outlier approach is therefore used in such cases.
E.g., the figure below shows a simple 2-D data set containing 502 objects, with two obvious clusters.
Cluster C1 contains 400 objects, while cluster C2 contains 100 objects.
Two additional objects, o1 and o2, are clearly outliers.
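The difficulty described above can be illustrated with a small sketch. The code below is not the algorithm from the textbook; it is a simplified local-density score (a point's average distance to its k nearest neighbours, divided by the same quantity averaged over those neighbours), applied to a made-up 1-D data set with one tight cluster, one loose cluster and one planted outlier. All names and values are assumptions for illustration.

```python
def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]

def avg_knn_dist(points, i, k):
    """Average distance from points[i] to its k nearest neighbours (a global score)."""
    return sum(abs(points[j] - points[i]) for j in knn(points, i, k)) / k

def local_score(points, i, k):
    """Own k-NN distance relative to the neighbours' k-NN distances (a local score)."""
    own = avg_knn_dist(points, i, k)
    neighbour_mean = sum(avg_knn_dist(points, j, k) for j in knn(points, i, k)) / k
    return own / neighbour_mean

dense = [i / 10 for i in range(10)]           # tight cluster, spacing 0.1
sparse = [float(i) for i in range(10, 20)]    # loose cluster, spacing 1.0
points = dense + [2.0] + sparse               # 2.0 is the planted outlier (index 10)
k = 2

global_dist = [avg_knn_dist(points, i, k) for i in range(len(points))]
scores = [local_score(points, i, k) for i in range(len(points))]
print(scores.index(max(scores)))  # index of the point with the highest local score
```

With a single global distance threshold, the planted outlier looks less unusual than the edge points of the loose cluster, because its neighbour distance is smaller than theirs. The local score compares each point with its own neighbourhood and ranks the planted outlier highest.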
The deviation-based approach identifies outliers by examining the main characteristics of objects in a
group. Objects that "deviate" from this description are considered outliers.
Hence, in this approach the term deviation is typically used to refer to outliers.
There are two techniques for deviation-based outlier detection:
1. A technique that sequentially compares objects in a set.
2. An OLAP data cube approach.
Video Content / Details of website for further learning (if any):
https://towardsdatascience.com/5-outlier-detection-methods-that-every-data-enthusiast-must-know-f917bf439210
Important Books/Journals for further learning including the page nos.:
Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Second Edition, Morgan
Kaufmann Publication 2010 (Pg.No : 452 - 458)
LECTURE HANDOUTS L 37
They perform conversions, summarization, key changes, structural changes and condensation.
Data transformation is required so that the information can be used by decision support
tools.
The transformation tools produce programs, control statements, JCL code, COBOL code, UNIX
scripts, SQL DDL code, etc., to move the data into the data warehouse from multiple
operational systems.
Managed query tools: used to generate SQL queries. They place a meta layer of software between
users and databases, which offers point-and-click creation of SQL statements.
These tools are a preferred choice of users for segment identification, demographic
analysis, territory management, preparation of customer mailing lists, etc.
Application development tools: a graphical data access environment that integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: used to analyze data in multidimensional and complex views.
To enable multidimensional properties they use MDDB and MRDB, where MDDB refers to a
multidimensional database and MRDB refers to a multirelational database.
Data mining tools: used to discover knowledge from the data warehouse data; they can also be
used for data visualization and data correction.
Video Content / Details of website for further learning (if any):
https://www.herzing.edu/blog/what-data-warehousing-and-why-it-important
Important Books/Journals for further learning including the page nos.:
Alex Berson and Stephen J. Smith Data Warehousing, Data Mining & OLAP Tata McGraw Hill
Edition 2007, Pg.No 113-114
LECTURE HANDOUTS L 38
1. Data warehouse database
This is the central part of the data warehousing environment. It is item number 2 in the
above architecture diagram. It is implemented using RDBMS technology.
Data heterogeneity: It refers to the different ways the data is defined and used in different
modules. E.g., Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.
3. Metadata
It is data about data. It is used for maintaining, managing and using the data warehouse. It is
classified into two:
Technical Metadata:
It contains information about data warehouse data used by warehouse designers and administrators to
carry out development and management tasks. It includes:
Information about data stores
Transformation descriptions, that is, mapping methods from the operational db to the warehouse
db
Warehouse object and data structure definitions for target data
The rules used to perform clean-up and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data
acquisition history, data access, etc.
4. Access tools
Their purpose is to provide info to business users for decision making. There are
five categories:
Data query and reporting tools
Application development tools
Executive info system (EIS) tools
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of
reporting tools:
Production reporting tools, used to generate regular operational reports
Desktop report writers, inexpensive desktop tools designed for end users
Managed query tools: used to generate SQL queries. They place a meta layer of software between users
and databases, which offers point-and-click creation of SQL statements. These tools are a preferred
choice of users for segment identification, demographic analysis, territory management,
preparation of customer mailing lists, etc.
Application development tools: a graphical data access environment that integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: used to analyze data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to a multidimensional
database and MRDB refers to a multirelational database.
Data mining tools: used to discover knowledge from the data warehouse data; they can also be
used for data visualization and data correction.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a dedicated
user group. They are used for rapid delivery of enhanced decision support functionality to end
users. A data mart is used in the following situations:
LECTURE HANDOUTS L 39
Operational Data:
Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non updateable
Metadata
It is data about data. It is used for maintaining, managing and using the data warehouse. It is
classified into two:
Technical Metadata:
It contains information about data warehouse data used by warehouse designer, administrator to
carry out development and management tasks. It includes,
Information about data stores
Transformation descriptions. That is mapping methods from operational db to warehouse
db
Warehouse Object and data structure definitions for target data
The rules used to perform clean up, and data enhancement
Access authorization, backup history, archive history, info delivery history.
Business Metadata:
It contains information that describes, for users, the information stored in the data warehouse. It includes:
Subject areas, and info object types including queries, reports, images, video, audio clips,
etc.
Internet home pages
Info related to the info delivery system
Data warehouse operational info such as ownerships, audit trails, etc.
Data warehouse and DBMS specialization
Very large database sizes and the need to process complex ad hoc queries in a short time.
The most important requirements for the data warehouse DBMS are performance,
throughput and scalability.
Implementation considerations
Intangible benefits (not easy to quantify): improvement in productivity by keeping all data in a single
location and eliminating re-keying of data, reduced redundant processing, enhanced customer
relations.
Video Content / Details of website for further learning (if any):
https://www.healthcatalyst.com/insights/database-vs-data-warehouse-a-comparative-review/
Important Books/Journals for further learning including the page nos.:
Alex Berson and Stephen J. Smith Data Warehousing, Data Mining & OLAP Tata McGraw Hill
Edition 2007, Pg.No 139-140
LECTURE HANDOUTS L 40
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimension time, item, and location.
These dimensions allow the store to keep track of things, for example, monthly sales of items and
the locations at which the items were sold.
Each dimension has a table related to it, called a dimensional table, which describes the dimension
further. For example, a dimensional table for an item may contain the attributes item_name, brand,
and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts, or measures, as well as keys to each of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table.
In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and
the item dimension (classified according to the types of item sold). The fact or measure displayed is
rupee_sold (in thousands).
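As a rough sketch of how such a 2D view can be derived from fact rows, the Python fragment below pivots (quarter, item type, rupee_sold) records into a quarter-by-item table. The rows and amounts are invented for illustration and are not the figures from the handout's table; the function name is also an assumption.

```python
from collections import defaultdict

# Hypothetical fact rows for Delhi: (quarter, item_type, rupee_sold in thousands).
facts = [
    ("Q1", "phone", 605), ("Q1", "computer", 825),
    ("Q2", "phone", 680), ("Q2", "computer", 952),
    ("Q1", "phone", 20),   # a second Q1 phone record, summed into the same cell
]

def pivot(rows):
    """Build a 2D view: time dimension as rows, item dimension as columns."""
    table = defaultdict(lambda: defaultdict(int))
    for quarter, item_type, rupee_sold in rows:
        table[quarter][item_type] += rupee_sold
    return {quarter: dict(items) for quarter, items in table.items()}

view = pivot(facts)
for quarter in sorted(view):
    print(quarter, view[quarter])
```

Each cell of the view holds the measure (rupee_sold) for one combination of dimension values; adding a third dimension such as location would turn the table into a cube.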
LECTURE HANDOUTS L 41
Topic of Lecture : Schemas for Multidimensional Databases - Star & Snowflake Schemas
Introduction :
The basic concepts of dimensional modeling are: facts, dimensions and measures.
A fact is a collection of related data items, consisting of measures and context data.
It typically represents business items or business transactions.
A dimension is a collection of data that describe one business dimension.
Dimensions determine the contextual background for the facts; they are the parameters over which
we want to perform OLAP.
Prerequisite knowledge for Complete understanding and learning of Topic:
Dimension modeling
OLAP
Facts
Dimension
Detailed content of the Lecture:
A measure is a numeric attribute of a fact, representing the performance or behavior of the
business relative to the dimensions.
In a relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
1. Star schema
The star schema is a database schema design that expresses the multidimensional view of data
using relational database semantics. The basis of the star schema is that
information can be classified into two groups:
Facts
Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables (the dimension
tables) arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.
The following diagram shows the sales data of a company with respect to four
dimensions, namely time, item, branch, and location.
Each dimension in a star schema is represented by only one dimension table.
This dimension table contains the set of attributes.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
Main characteristics of the star schema:
Simple structure -> easy-to-understand schema
High query effectiveness -> small number of tables to join
Relatively long time to load data into dimension tables -> because of denormalization
and redundant data, the tables can be large
The most commonly used schema in data warehouse implementations -> widely
supported by a large number of business intelligence tools
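The structure described above (one central fact table holding keys to the time, item, branch and location dimensions plus the measures dollars sold and units sold) can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows below are assumptions made for the example, not the exact layout of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
-- Made-up sample rows.
INSERT INTO time_dim VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
INSERT INTO item_dim VALUES (1, 'Phone A', 'BrandX', 'phone');
INSERT INTO branch_dim VALUES (1, 'Main');
INSERT INTO location_dim VALUES (1, 'Delhi', 'India');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 100.0, 2), (1, 1, 1, 1, 50.0, 1), (2, 1, 1, 1, 75.0, 1);
""")

# A typical star query: join the fact table to the dimensions it needs, then aggregate.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    WHERE l.city = 'Delhi'
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
print(rows)
```

The query touches only the fact table and the two dimension tables it filters or groups by, which is the "small number of tables to join" property listed among the characteristics.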
The star schema suffers from the following performance problems.
1. Indexing
The multipart key presents some problems in the star schema:
• It requires multiple metadata definitions (one for each component) to design a single table.
• Since the fact table must carry all key components as part of its primary key, addition or deletion of
levels in the hierarchy will require physical modification of the affected table, which is a time-consuming
process that limits flexibility.
• Carrying all the segments of the compound dimensional key in the fact table increases the
size of the index, thus impacting both performance and scalability.
2. Level indicator
The dimension table design includes a level-of-hierarchy indicator for every record.
Every query that retrieves detail records from a table that stores both details and aggregates must use this
indicator as an additional constraint to obtain a correct result.
If the user is not aware of the level indicator, or its values are incorrect, an otherwise valid query may
return a totally invalid answer.
An alternative to using the level indicator is the snowflake schema. Aggregate fact tables are created
separately from detail tables. The snowflake schema contains separate fact tables for each level of aggregation.
Other problems with the star schema design - Pairwise Join Problem
Joining 5 tables requires joining the first two tables, then joining the result with the third table, and so
on. The intermediate result of every join operation is used to join with the next table.
Selecting the best order of pairwise joins can rarely be solved in a reasonable amount of time.
A five-table query has 5! = 120 possible join orders.
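The 5! = 120 figure can be checked by enumerating the orderings directly. The short Python sketch below does this for five hypothetical table names (the names are placeholders) and shows how quickly the count grows.

```python
from itertools import permutations
from math import factorial

# Placeholder names for a five-table star query.
tables = ["sales_fact", "time", "item", "branch", "location"]

# Every ordering of the tables is one candidate sequence of pairwise joins.
orders = list(permutations(tables))
print(len(orders))       # number of orderings for 5 tables
print(factorial(10))     # the same count for a 10-table query
```

The factorial growth is the reason an optimizer cannot simply try every join order once the number of tables becomes large.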
2. Snowflake schema:
The snowflake schema is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables, forming
a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.
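The decomposition can be sketched as follows. In this illustrative Python fragment a flat location dimension is split into a location table and a separate city table, using the many-to-one relationship from street-level rows to (city, country). The column names, the surrogate city_key and the sample rows are assumptions made for the example.

```python
def snowflake_location(rows):
    """Split a flat location dimension into a location table and a city table."""
    city_keys = {}   # (city, country) -> surrogate city_key
    locations = []
    for row in rows:
        pair = (row["city"], row["country"])
        if pair not in city_keys:
            city_keys[pair] = len(city_keys) + 1
        # The location row now refers to the city by key instead of repeating it.
        locations.append({
            "location_key": row["location_key"],
            "street": row["street"],
            "city_key": city_keys[pair],
        })
    cities = [
        {"city_key": key, "city": city, "country": country}
        for (city, country), key in city_keys.items()
    ]
    return locations, cities

# Star-style (denormalized) dimension: city and country repeat on every row.
flat = [
    {"location_key": 1, "street": "Street A", "city": "Delhi", "country": "India"},
    {"location_key": 2, "street": "Street B", "city": "Delhi", "country": "India"},
    {"location_key": 3, "street": "Street C", "city": "Mumbai", "country": "India"},
]
locations, cities = snowflake_location(flat)
print(cities)
```

The redundancy is removed (each city is stored once), at the cost of one more join whenever a query needs city or country attributes.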
Red Brick's RDBMS indexes, called STAR indexes, are used for STAR join performance. STAR
indexes are created on one or more foreign key columns of a fact table. A STAR index contains information
that relates the dimensions of a fact table to the rows that contain those dimensions. STAR indexes are
very space-efficient. The presence of a STAR index allows Red Brick's RDBMS to quickly identify which
target rows of the fact table are of interest for a particular set of dimensions. Also, because STAR indexes
are created over foreign keys, no assumptions are made about the type of queries which can use the STAR
indexes.
Video Content / Details of website for further learning (if any):
https://www.guru99.com/star-snowflake-data-warehousing.html
LECTURE HANDOUTS L 42
For example, the cardinality of the pin code in address data is 50 (50 possible values), and the
cardinality of gender data is only 2 (male and female).
If the bit for a given index is "on", the value exists in the record. Here, a 10,000-row employee
table that contains the "gender" column is bitmap-indexed for this value.
Bitmap indexes can become bulky and even unsuitable for high-cardinality data where the range of
possible values is large. For example, values like "income" or "revenue" may have an almost infinite
number of values.
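A minimal sketch of the idea, assuming a tiny in-memory table: each distinct column value gets one bitmap, stored here as a Python integer whose bit i is "on" when row i holds that value. The column contents and function names are made up for illustration.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is 1 when row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Made-up low-cardinality columns of a five-row employee table.
gender = ["M", "F", "F", "M", "F"]
city = ["Delhi", "Delhi", "Pune", "Pune", "Delhi"]

gender_index = build_bitmap_index(gender)
city_index = build_bitmap_index(city)

# "gender = F AND city = Delhi" becomes a single bitwise AND of two bitmaps.
matches = rows_of(gender_index["F"] & city_index["Delhi"])
print(matches)
```

The index needs one bitmap per distinct value, which is why it stays compact for a column like gender (2 bitmaps) but becomes bulky for a column such as income, where almost every row may have its own value.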
LECTURE HANDOUTS L 43
OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology which
enables analysts, managers, and executives to gain insight into information through fast, consistent,
interactive access to a wide variety of possible views of data that has been transformed from raw
information to reflect the real dimensionality of the enterprise as understood by the users.
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Sales and Marketing
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model more
intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using tabular models.
OLAP Guidelines-Need
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. With a
multidimensional view, it is possible to carry out operations like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing operations, and the
dissimilar nature of source data totally transparent to users. Such transparency helps to improve the
efficiency and productivity of the users.
3) Accessibility: The system provides access only to the data that is actually required to perform the
particular analysis, and presents a single, coherent, and consistent view to the users. The OLAP system must
map its own logical schema to the heterogeneous physical data stores and perform any necessary
transformations. The OLAP system should sit between data sources (e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: Users should not experience any significant degradation
in reporting performance as the number of dimensions or the size of the database increases. That is, the
performance of OLAP should not suffer as the number of dimensions is increased. Users must observe
consistent run time, response time, and machine utilization every time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that
various clients can be attached with a minimum of effort and integration programming. The server should be
capable of mapping and consolidating data between dissimilar databases.
6) Generic Dimensionality: An OLAP system should treat each dimension as equivalent in both its
structure and operational capabilities. Additional operational capabilities may be granted to selected
dimensions, but such additional capabilities should be grantable to any dimension.
7) Dynamic Sparse Matrix Handling: The physical schema should adapt to the specific analytical model being
created and loaded so that sparse matrix handling is optimized. When encountering a sparse matrix, the system
must be able to dynamically deduce the distribution of the data and adjust the storage and access paths to
obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and access
security.
9) Unrestricted Cross-dimensional Operations: The system should be able to recognize
dimensional hierarchies and automatically perform roll-up and drill-down operations within a dimension or
across dimensions.
10) Intuitive Data Manipulation: Data manipulation inherent in the consolidation path, such as
reorientation (pivoting), drill-down and roll-up, and other manipulations, should be accomplished naturally
and precisely via point-and-click and drag-and-drop actions on the cells of the analytical model. This avoids
the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It gives business users the ability to arrange columns, rows, and
cells in a manner that facilitates easy manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be unlimited.
Each of these generic dimensions must allow a practically unlimited number of user-defined
aggregation levels within any given consolidation path.
OLAP Operations :
Roll-Up
The roll-up operation (also known as drill-up or aggregation operation) performs aggregation on
a data cube, either by climbing up a concept hierarchy or by dimension reduction. Roll-up is
like zooming out on the data cube.
The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for
location is defined as the order street < city < province or state < country.
The roll-up operation aggregates the data by ascending the location hierarchy from the level of the
city to the level of the country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from
the cube.
For example, consider a sales data cube having two dimensions, location and time. Roll-up may be
performed by removing the time dimension, resulting in an aggregation of the total sales by
location, rather than by location and by time.
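Both forms of roll-up can be sketched on a small cube held as a Python dictionary keyed by (location, quarter). The city names, the city-to-country mapping and the sales figures below are invented for illustration and do not come from the figure.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
city_to_country = {
    "Chicago": "USA", "New York": "USA",
    "Toronto": "Canada", "Vancouver": "Canada",
}

# Hypothetical cube cells: (city, quarter) -> sales.
cube = {
    ("Chicago", "Q1"): 440, ("New York", "Q1"): 1560,
    ("Toronto", "Q1"): 395, ("Vancouver", "Q1"): 605,
    ("Chicago", "Q2"): 100,
}

def roll_up_location(cells, hierarchy):
    """Roll-up by ascending the hierarchy: replace each city by its country and sum."""
    out = defaultdict(int)
    for (city, quarter), sales in cells.items():
        out[(hierarchy[city], quarter)] += sales
    return dict(out)

def roll_up_drop_time(cells):
    """Roll-up by dimension reduction: remove the time dimension and sum."""
    out = defaultdict(int)
    for (location, _quarter), sales in cells.items():
        out[location] += sales
    return dict(out)

by_country = roll_up_location(cube, city_to_country)
by_country_total = roll_up_drop_time(by_country)
print(by_country)
print(by_country_total)
```

The first function keeps both dimensions but at a coarser level of location; the second leaves a one-dimensional result with total sales per location.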
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is
like zooming in on the data cube. It navigates from less detailed data to more detailed data.
Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding
additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept
hierarchy which is defined as day < month < quarter < year. Drill-down occurs by descending the
time hierarchy from the level of the quarter to the more detailed level of the month.
Because a drill-down adds more detail to the given data, it can also be performed by adding a new
dimension to a cube. For example, a drill-down on the central cube of the figure can occur by
introducing an additional dimension, such as customer group.
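A small sketch of drill-down on the time dimension, assuming the detailed (monthly) data is still available underneath the quarterly summary. The month-to-quarter mapping covers only half a year and the sales figures are made up for the example.

```python
from collections import defaultdict

# Hypothetical slice of the time hierarchy: month -> quarter.
month_to_quarter = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Hypothetical detailed data: sales per month.
monthly_sales = {"Jan": 150, "Feb": 100, "Mar": 150, "Apr": 200, "May": 120, "Jun": 180}

def by_quarter(monthly):
    """The summarized (rolled-up) view: total sales per quarter."""
    out = defaultdict(int)
    for month, sales in monthly.items():
        out[month_to_quarter[month]] += sales
    return dict(out)

def drill_down(monthly, quarter):
    """Step down from one quarter to the months that make it up."""
    return {m: s for m, s in monthly.items() if month_to_quarter[m] == quarter}

quarterly = by_quarter(monthly_sales)
q1_detail = drill_down(monthly_sales, "Q1")
print(quarterly)
print(q1_detail)
```

Drill-down cannot create detail that was never stored: the monthly values must exist in the warehouse, and the quarterly totals are simply their sums.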
https://www.tutorialspoint.com/dwh/dwh_olap.htm
LECTURE HANDOUTS L 44
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand between a relational
back-end server and client front-end tools. They use a relational or extended-relational DBMS to store
and manage warehouse data, and OLAP middleware to support missing pieces.
ROLAP servers include optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of
MicroStrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through
array-based multidimensional storage engines.
They map multidimensional views directly to data cube array structures. The advantage of using a
data cube is that it allows fast indexing to precomputed summarized data. Notice that with
multidimensional data stores, the storage utilization may be low if the data set is sparse.
In such cases, sparse matrix compression techniques should be explored. Many MOLAP servers
adopt a two-level storage representation to handle dense and sparse data sets: denser subcubes are
identified and stored as array structures, whereas sparse subcubes employ compression technology
for efficient storage utilization.
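The storage-utilization point can be made concrete with a short sketch. Here a sparse cube is stored as a dictionary of its non-empty cells only, which is a simple stand-in for the compression technology mentioned above and not how any particular MOLAP server works. The cube shape and cell values are assumptions.

```python
def cell_count(shape):
    """Number of cells a dense array of this shape would allocate."""
    total = 1
    for size in shape:
        total *= size
    return total

def storage_utilisation(cells, shape):
    """Fraction of the dense array's cells that actually hold data."""
    return len(cells) / cell_count(shape)

def lookup(cells, coords):
    """Read a cell; coordinates that were never stored count as empty (0.0)."""
    return cells.get(coords, 0.0)

# Hypothetical cube: 4 items x 3 cities x 2 quarters, with only two non-empty cells.
shape = (4, 3, 2)
sparse_cells = {(0, 0, 0): 5.0, (3, 2, 1): 7.5}

print(cell_count(shape))
print(storage_utilisation(sparse_cells, shape))
print(lookup(sparse_cells, (3, 2, 1)))
```

A dense array would reserve all 24 cells to hold 2 values, while the dictionary stores only the 2 occupied cells at the price of a slower, hash-based lookup. This trade-off is why a two-level representation keeps arrays for dense subcubes and a compressed form for sparse ones.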
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting
from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server
may allow large volumes of detail data to be stored in a relational database, while aggregations are
kept in a separate MOLAP store.
Microsoft SQL Server 2000 supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand for OLAP processing in relational databases, some
database system vendors implement specialized SQL servers that provide advanced query language and
query processing support for SQL queries over star and snowflake schemas in a read-only environment.
LECTURE HANDOUTS L 45