UNIT-1
LECTURE 1
What is classification and prediction?
Data classification is a two step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples, or objects. The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is "supervised" in that it is told to which class each training sample belongs). It contrasts with unsupervised learning (or clustering), in which the class labels of the training samples are not known, and the number or set of classes to be learned may not be known in advance.
Typically, the learned model is represented in the form of classification rules,
decision trees, or mathematical formulae. For example, given a database of customer credit
information, classification rules can be learned to identify customers as having either excellent or
fair credit ratings (see the figure below). The rules can be used to categorize future data samples, as well as provide a better understanding of the database contents.
In the second step, shown in the figure below, the model is used for classification. First, the predictive accuracy of the model is estimated. The holdout method is a simple technique which uses a test set of class-labeled samples. These samples are randomly selected and are independent of the training samples.
Classification Process (1): Model Construction
Training data (name, rank, years, tenured):
Mike, Assistant Prof, 3, no
Mary, Assistant Prof, 7, yes
Bill, Professor, 2, yes
Jim, Associate Prof, 7, yes
Dave, Assistant Prof, 6, no
Anne, Associate Prof, 3, no
The classification algorithm learns the classifier, here expressed as the rule:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction
Test data (name, rank, years, tenured):
Tom, Assistant Prof, 2, no
Merlisa, Associate Prof, 7, no
George, Professor, 5, yes
Joseph, Assistant Prof, 7, yes
Figure 7.1: The data classification process: a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is credit rating, and the learned model or classifier is represented in the form of classification rules. b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
The accuracy of a model on a given test set is the
percentage of test set samples that are correctly classified by the model. For each test sample, the
known class label is compared with the learned model's class prediction for that sample. Note that if the accuracy of the model were estimated based on the training data set, this estimate could be optimistic, since the learned model tends to overfit the data (that is, it may have incorporated some particular anomalies of the training data which are not present in the overall sample population). Therefore, a test set is used.
If the accuracy of the model is considered acceptable, the model can be used to classify future
data tuples or objects for which the class label is not known. (Such data are also referred to in the
machine learning literature as "unknown" or "previously unseen" data). For example, the classification rules learned in Figure 7.1a from the analysis of data from existing customers can
be used to predict the credit rating of new or future (i.e., previously unseen) customers.
“How is prediction different from classification?” Prediction can be viewed as the construction
and use of a model to assess the class of an unlabeled object, or to assess the value or value
ranges of an attribute that a given object is likely to have. In this view, classification and
regression are the two major types of prediction problems where classification is used to predict
discrete or nominal values, while regression is used to predict continuous or ordered values. In our view, however, we refer to the use of prediction to predict class labels as classification and the use of prediction to predict continuous values (e.g., using regression techniques) as prediction. This view is commonly accepted in data mining.
Lecture 2
Issues regarding classification and prediction
Preparing the data for classification and prediction: The following preprocessing steps may be
applied to the data in order to help improve the accuracy, efficiency, and scalability of the
classification or prediction process.
1. Data cleaning. This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques, for example), and the treatment of missing values (e.g., by
replacing a missing value with the most commonly occurring value for that attribute, or with the
most probable value based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or missing
data, this step can help reduce confusion during learning.
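For illustration, a minimal sketch of this missing-value treatment (pandas, with made-up attribute values; the column names are hypothetical):

import pandas as pd

# Hypothetical training data with a missing value in the categorical 'credit_rating' attribute
data = pd.DataFrame({
    "age": [25, 40, 35, 50],
    "credit_rating": ["fair", "excellent", None, "excellent"],
})

# Replace the missing value with the most commonly occurring value (the mode) of that attribute
most_common = data["credit_rating"].mode()[0]
data["credit_rating"] = data["credit_rating"].fillna(most_common)
print(data)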
2. Relevance analysis. Many of the attributes in the data may be irrelevant to the classification
or prediction task. For example, data recording the day of the week on which a bank loan
application was filed is unlikely to be relevant to the success of the application. Furthermore,
other attributes may be redundant. Hence, relevance analysis may be performed on the data with
the aim of removing any irrelevant or redundant attributes from the learning process. In machine
learning, this step is known as feature selection. Including such attributes may otherwise slow
down, and possibly mislead, the learning step.
Ideally, the time spent on relevance analysis, when added to the time spent on learning from the
resulting “reduced” feature subset, should be less than the time that would have been spent on
learning from the original set of features. Hence, such analysis can help improve classification
efficiency and scalability.
3. Data transformation. The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes. For example, numeric values for the attribute income may be generalized to discrete
ranges such as low, medium, and high. Similarly, nominal-valued attributes, like street, can be
generalized to higher-level concepts, like city. Since generalization compresses the original
training data, fewer input/output operations may be involved during learning. The data may also
be normalized, particularly when neural networks or methods involving distance measurements
are used in the learning step. Normalization involves scaling all values for a given attribute so
that they fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. In methods which use
distance measurements, for example, this would prevent attributes with initially large ranges
(like, say, income) from outweighing attributes with initially smaller ranges (such as binary
attributes).
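A minimal sketch of the min-max normalization described above, scaling a numeric attribute into the range 0.0 to 1.0 (plain Python; the income values are made up):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale all values of an attribute into [new_min, new_max]
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))   # [0.0, 0.267..., 0.534..., 1.0]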
Comparing classification methods. Classification and prediction methods can be compared and
evaluated according to the following criteria:
1. Predictive accuracy. This refers to the ability of the model to correctly predict the class label of new or previously unseen data.
2. Speed. This refers to the computation costs involved in generating and using the model.
3. Robustness. This is the ability of the model to make correct predictions given noisy data or data with missing values.
4. Scalability. This refers to the ability of the learned model to perform efficiently on large amounts of data.
5. Interpretability. This refers to the level of understanding and insight that is provided by the learned model.
Lecture 3
Classification by Decision Tree Induction:
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class
distributions. The topmost node in a tree is the root node. A typical decision tree is shown in
Figure 7.2. It represents the concept buys computer, that is, it predicts whether or not a customer
at All Electronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and
leaf nodes are denoted by ovals.
Algorithm for Decision Tree Induction:
Basic algorithm (a greedy algorithm)
1. Tree is constructed in a top-down recursive divide-and-conquer manner
2. At start, all the training examples are at the root
3. Attributes are categorical (if continuous-valued, they are discretized in advance)
4. Examples are partitioned recursively based on selected attributes
5. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
1. All samples for a given node belong to the same class
2. There are no remaining attributes for further partitioning — majority voting is
employed for classifying the leaf
3. There are no samples left
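To see this greedy, top-down induction in action, the following sketch applies scikit-learn's DecisionTreeClassifier (not part of the lecture, used only as an illustration) to the tenure data of Figure 7.1, with rank encoded as an integer since the algorithm assumes categorical or discretized attributes:

from sklearn.tree import DecisionTreeClassifier, export_text

# Training samples: rank (0 = Assistant Prof, 1 = Associate Prof, 2 = Professor), years
X = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]
y = ["no", "yes", "yes", "yes", "no", "no"]

# Entropy-based splitting corresponds to the information gain measure described below
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)

print(export_text(tree, feature_names=["rank", "years"]))
print(tree.predict([[2, 1]]))   # a professor with 1 year of service -> 'yes'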
Attribute Selection Measure:
1. Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued attributes
Select the attribute with the highest information gain
Assume there are two classes, P and N. Let the set of examples S contain p elements of class P and n elements of class N.
The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
I(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}
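A small sketch of this expected-information computation in plain Python (the class counts in the example call are illustrative):

import math

def info(p, n):
    # Expected information I(p, n) needed to classify an arbitrary example as P or N
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                      # treat 0 * log2(0) as 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result

print(info(9, 5))   # about 0.940 bits for 9 positive and 5 negative examples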
2. Gini index (IBM Intelligent Miner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes
If a data set T contains examples from n classes, the gini index, gini(T), is defined as
gini(T) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data, gini_split(T), is defined as
gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points need to be enumerated for each attribute).
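A minimal sketch of these gini computations in plain Python (the class counts are made up for illustration):

def gini(class_counts):
    # gini(T) = 1 - sum of squared relative class frequencies
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split(left_counts, right_counts):
    # Weighted gini index of splitting T into subsets T1 and T2
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return (n1 / n) * gini(left_counts) + (n2 / n) * gini(right_counts)

# 10 samples (6 of class P, 4 of class N) split into (5 P, 1 N) and (1 P, 3 N)
print(gini([6, 4]))                 # 0.48
print(gini_split([5, 1], [1, 3]))   # about 0.317 -- lower, so this is a useful split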
Lecture 4
Bayesian Classification
“What are Bayesian classifiers?”
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the
probability that a given sample belongs to a particular class.
Bayesian classification is based on Bayes theorem, described below. Studies comparing classification
algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be
comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have
also exhibited high accuracy and speed when applied to large databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of
the values of the other attributes. This assumption is called class conditional independence. It is made to
simplify the computations involved, and in this sense, is considered “naive”. Bayesian belief networks are graphical models, which unlike naive Bayesian classifiers allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.
Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows the Bayes theorem:
P(h|D) = \frac{P(D|h) P(h)}{P(D)}
MAP (maximum a posteriori) hypothesis:
h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h) P(h)
Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.
Naive Bayes Classifier
Simplified assumption: attributes are conditionally independent:
P(x_1, \ldots, x_k | C) = P(x_1|C) \times \cdots \times P(x_k|C)
This greatly reduces the computation cost: only the class distribution needs to be counted.
Given a training set, we can compute these probabilities. For the weather example with classes P and N:
Outlook: sunny P 2/9, N 3/5; overcast P 4/9, N 0; rain P 3/9, N 2/5
Temperature: hot P 2/9, N 2/5; mild P 4/9, N 2/5; cool P 3/9, N 1/5
Humidity: high P 3/9, N 4/5; normal P 6/9, N 1/5
Windy: true P 3/9, N 3/5; false P 6/9, N 2/5
Using a-posteriori probabilities:
P(C|X) = probability that the sample tuple X = <x_1, ..., x_k> is of class C
E.g. P(class = N | outlook = sunny, windy = true, ...)
Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C) \cdot P(C) / P(X)
P(X) is constant for all classes
P(C) = relative frequency of class C samples
Choose C such that P(C|X) is maximum, i.e. C such that P(X|C) \cdot P(C) is maximum
Problem: computing P(X|C) is unfeasible
Naive Bayesian Classification:
Naive assumption: attribute independence
P(x_1, \ldots, x_k | C) = P(x_1|C) \cdot \ldots \cdot P(x_k|C)
If the i-th attribute is categorical:
P(x_i|C) is estimated as the relative frequency of samples having value x_i as the i-th attribute in class C
If the i-th attribute is continuous:
P(x_i|C) is estimated through a Gaussian density function
Computationally easy in both cases
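A compact sketch of naive Bayesian classification for categorical attributes, in the spirit of the weather example above (the training tuples below are made up; relative frequencies serve as the P(x_i | C) estimates):

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # Estimate P(C) and P(attribute value | C) as relative frequencies
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for sample, label in zip(samples, labels):
        for i, value in enumerate(sample):
            cond[(label, i)][value] += 1
    return priors, cond

def predict(sample, priors, cond):
    # Pick the class C maximizing P(C) * product over i of P(x_i | C)
    total = sum(priors.values())
    best_class, best_score = None, -1.0
    for c, prior_count in priors.items():
        score = prior_count / total
        for i, value in enumerate(sample):
            score *= cond[(c, i)][value] / prior_count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Attributes: (outlook, windy); classes: P = play, N = don't play
samples = [("sunny", "true"), ("sunny", "false"), ("overcast", "false"), ("rain", "true")]
labels = ["N", "P", "P", "N"]
priors, cond = train_naive_bayes(samples, labels)
print(predict(("overcast", "false"), priors, cond))   # -> 'P'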
Lecture 5
Classification by Backpropagation
A multilayer feed-forward neural network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network
An example of such a network is shown in the figure below. The inputs correspond to the attributes
measured for each training sample.
The inputs are fed simultaneously into a layer of units making up the input layer. The weighted
outputs of these units are, in turn, fed simultaneously to a second layer of “neuron-like” units,
known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden
layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only one
is used. The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction for given samples. The units in the hidden layers and
output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as
output units. The multilayer neural network shown in Figure 7.9 has two layers of output units.
Therefore, we say that it is a two-layer neural network. Similarly, a network containing two
hidden layers is called a three-layer neural network, and so on. The network is feed-forward in
that none of the weights cycle back to an input unit or to an output unit of a previous layer. It is
fully connected in that each unit provides input to each unit in the next forward layer. Multilayer
feed-forward networks of linear threshold functions, given enough hidden units, can closely
approximate any function.
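A minimal sketch of a single forward pass through such a fully connected network with one hidden layer (NumPy; the weights and the sigmoid activation are illustrative choices, and backpropagation would subsequently adjust the weights from the output error):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three input attributes, two hidden units, one output unit (weights are made up)
x = np.array([0.5, 0.1, 0.9])                 # attribute values of one training sample
W_hidden = np.array([[0.2, -0.4, 0.1],
                     [0.7, 0.3, -0.5]])       # hidden layer weights (2 x 3)
b_hidden = np.array([0.1, -0.2])
W_out = np.array([[0.6, -0.3]])               # output layer weights (1 x 2)
b_out = np.array([0.05])

hidden = sigmoid(W_hidden @ x + b_hidden)     # weighted sums, then activation
output = sigmoid(W_out @ hidden + b_out)      # the network's prediction for this sample
print(output)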
Neural Networks:
1. Advantages
prediction accuracy is generally high
robust, works when training examples contain errors,
output may be discrete, real-valued, or a vector of several discrete or real-valued
attributes
fast evaluation of the learned target function
2. Criticism
long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge
Neuron
[Figure: a single neuron computes a weighted sum of the input vector x and weight vector w, applies an activation function, and produces the output y]
Lecture 6
Classification Based on Concepts from Association Rule Mining
Several methods for association-based classification
ARCS: Quantitative association mining and clustering of association rules (Lent et
al’97)
It beats C4.5 in (mainly) scalability and also accuracy
Associative classification (Liu et al'98)
It mines high support and high confidence rules in the form of “cond_set =>
y”, where y is a class label
CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
Emerging patterns (EPs): the item sets whose support increases significantly
from one class to another
Mine EPs based on minimum support and growth rate
Lecture 7
Other Classification Methods
1. k-nearest neighbor classifier
All instances correspond to points in the n-D space.
The nearest neighbor is defined in terms of Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q (see the sketch below).
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
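A small sketch of the k-nearest neighbor rule for a discrete-valued target (plain Python, Euclidean distance, made-up training points):

import math
from collections import Counter

def knn_predict(query, training_points, labels, k=3):
    # Return the most common class among the k training points nearest to the query
    distances = [(math.dist(query, point), label)
                 for point, label in zip(training_points, labels)]
    distances.sort(key=lambda pair: pair[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
labels = ["A", "A", "B", "B"]
print(knn_predict((1.1, 1.0), points, labels, k=3))   # -> 'A'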
2. case-based reasoning
Also uses: lazy evaluation + analyze similar instances
Difference: Instances are not “points in a Euclidean space”
Example: Water faucet problem in CADET (Sycara et al’92)
Methodology
Instances represented by rich symbolic descriptions (e.g., function graphs)
Multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
Indexing based on a syntactic similarity measure and, when failure occurs, backtracking and adapting to additional cases
3. Genetic algorithm
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly generated rules
e.g., IF A1 and Not A2 then C2 can be encoded as the bit string 100
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation.
4. Rough set approach:
Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard but
a discernibility matrix is used to reduce the computation intensity.
5. Fuzzy set approaches:
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
(such as using fuzzy membership graph)
Attribute values are converted to fuzzy values, e.g., income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed
Lecture 8
Prediction and Classifier Accuracy:
The prediction of continuous values can be modeled by statistical techniques of regression. For
example, we may like to develop a model to predict the salary of college graduates with 10 years
of work experience, or the potential sales of a new product given its price. Many problems can
be solved by linear regression, and even more can be tackled by applying transformations to the
variables so that a nonlinear problem can be converted to a linear one.
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + βX
The two parameters, α and β, specify the line and are to be estimated by using the data at hand, applying the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ....
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: p(a, b, c, d) = α_{ab} β_{ac} χ_{ad} δ_{bcd}
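A minimal sketch of estimating the two parameters of Y = α + βX with the least squares criterion (plain Python; the data points are made up):

# Least squares estimates for the line Y = a + b*X
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b and intercept a that minimize the sum of squared errors
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

print(f"Y = {a:.2f} + {b:.2f} X")   # roughly Y = 0.14 + 1.96 X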
[Figures: Prediction with numerical data; prediction with categorical data (sales by outlet type, e.g. mass marketer, sports chain, camping chain, independent outlet).]
Estimating Error Rates:
Partition: training-and-testing
use two independent data sets, e.g., training set (2/3) and test set (1/3)
used for data sets with a large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation; see the sketch below)
for data sets of moderate size
Bootstrapping (leave-one-out)
for small data sets
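A minimal sketch of k-fold cross-validation as described above (plain Python; train_and_score is a hypothetical callback that trains a classifier on the training split and returns its accuracy on the test split):

def k_fold_indices(n_samples, k):
    # Split the sample indices into k folds by round-robin assignment
    indices = list(range(n_samples))
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, train_and_score, k=10):
    # Average accuracy over k folds: train on k-1 folds, test on the held-out fold
    folds = k_fold_indices(len(samples), k)
    accuracies = []
    for test_fold in folds:
        test_set = set(test_fold)
        train_idx = [i for i in range(len(samples)) if i not in test_set]
        accuracies.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_fold], [labels[i] for i in test_fold]))
    return sum(accuracies) / len(accuracies)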