UNIT III
FREQUENT PATTERNS, ASSOCIATIONS AND CLASSIFICATIONS
Market Basket Analysis:
What Is an Item set?
A set of items together is called an item set. An item set containing k items is called a k-item
set. An item set may consist of one or more items, and an item set that occurs frequently is called a
frequent item set. Thus frequent item set mining is a data mining technique to identify the items
that often occur together.
For Example, Bread and butter, Laptop and Antivirus software, etc.
What Is A Frequent Item set?
A set of items is called frequent if it satisfies minimum threshold values for support and
confidence. Support indicates how frequently the items appear together in transactions.
Confidence indicates how often a transaction containing one item (or item set) also contains the other.
For the frequent item set mining method, we consider only those item sets which meet the
minimum support and confidence threshold requirements. Insights from these mining
algorithms offer many benefits, including cost-cutting and improved competitive advantage.
There is a tradeoff between the time taken to mine the data and the volume of data to be mined.
An efficient frequent mining algorithm uncovers the hidden patterns of item sets
in a short time and with low memory consumption.
Frequent Pattern Mining (FPM)
The frequent pattern mining algorithm is one of the most important techniques of data
mining to discover relationships between different items in a dataset. These relationships are
represented in the form of association rules. It helps to find the irregularities in data.
FPM has many applications in the field of data analysis, software bugs, cross-marketing,
sale campaign analysis, market basket analysis, etc.
Frequent item sets discovered through Apriori have many applications in data mining
tasks, such as finding interesting patterns in the database and discovering sequences;
mining of association rules is the most important of them.
Association rules apply to supermarket transaction data, that is, to examine the customer
behavior in terms of the purchased products. Association rules describe how often the items are
purchased together.
Association Rules:
Association Rule Mining is defined as: “Let I= { …} be a set of ‘n’ binary attributes called
items. Let D= { ….} be a set of transactions called the database. Each transaction in D has a unique
transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form
X -> Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items X and Y are called the antecedent and consequent of
the rule, respectively.”
Learning of association rules is used to find relationships between attributes in large
databases. An association rule, A => B, will be of the form “for a set of transactions, some value of
item set A determines the values of item set B, under the condition that minimum support
and confidence are met”.
Support and Confidence can be represented by the following example:
Bread => Butter [support = 2%, confidence = 60%]
The above statement is an example of an association rule. It means that 2% of the
transactions contain bread and butter together, and 60% of the transactions that contain
bread also contain butter.
Support and Confidence for item sets A and B are given by the formulas:
Support(A => B) = (number of transactions containing both A and B) / (total number of transactions)
Confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)
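As a small illustration of these two measures, the following is a minimal Python sketch (the transaction data, function names and values below are illustrative assumptions, not taken from the text) that computes support and confidence for a rule such as Bread => Butter.

# Illustrative transactions (not from the text)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    # Fraction of all transactions that contain every item in `itemset`
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction that also
    # contain the consequent
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))        # 2 of 4 transactions = 0.5
print(confidence({"bread"}, {"butter"}, transactions))   # 2 of 3 bread transactions ≈ 0.67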
Association rule mining consists of 2 steps:
1. Find all the frequent item sets.
2. Generate association rules from the above frequent itemsets.
Why Frequent Item set Mining?
Frequent item set or pattern mining is broadly used because of its wide applications in
mining association rules, correlations, and graph-pattern constraints based on frequent
patterns, sequential patterns, and many other data mining tasks.
Apriori Algorithm – Frequent Pattern Algorithms
The Apriori algorithm was the first algorithm proposed for frequent item set mining.
It was later improved by R. Agrawal and R. Srikant and came to be known as Apriori. This
algorithm uses two steps, “join” and “prune”, to reduce the search space. It is an iterative
approach to discover the most frequent item sets.
Apriori says:
If P(I) < minimum support threshold, then the item set I is not frequent.
If P(I ∪ A) < minimum support threshold, then I ∪ A is not frequent either, where A is any
additional item.
If an item set has a support value less than the minimum support, then all of its supersets will also
fall below the minimum support, and thus can be ignored. This property is called the antimonotone
property.
The steps followed in the Apriori Algorithm of data mining are:
1. Join Step: This step generates (k+1)-item set candidates from k-item sets by joining each
k-item set with itself.
2. Prune Step: This step scans the support count of each candidate item set in the database. If a
candidate does not meet the minimum support, then it is regarded as infrequent and is removed.
This step is performed to reduce the size of the candidate item sets.
Steps In Apriori:
Apriori algorithm is a sequence of steps to be followed to find the most frequent item set
in the given database. This data mining technique follows the join and the prune steps iteratively
until the most frequent item set is achieved. A minimum support threshold is given in the
problem or it is assumed by the user.
1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The
algorithm counts the occurrences of each item.
2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose
occurrence satisfies min_sup is determined. Only those candidates whose count is greater
than or equal to min_sup are taken ahead to the next iteration; the others are pruned.
3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-
itemsets are generated by forming groups of two from the frequent 1-itemsets.
4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table will have
2-itemsets with min_sup only.
5) The next iteration forms 3-itemsets using the join and prune steps. This iteration follows the
antimonotone property: the 2-itemset subsets of each candidate 3-itemset must themselves
meet min_sup. If all 2-itemset subsets are frequent then the superset may be frequent,
otherwise it is pruned.
6) The next step forms 4-itemsets by joining 3-itemsets with themselves and pruning any
candidate whose subsets do not meet the min_sup criterion. The algorithm stops when the most
frequent itemset is achieved.
Example of Apriori: Support threshold=50%, Confidence= 60%
TABLE-1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
1. Count Of Each Item
TABLE-2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup=3, thus it is deleted; only I1,
I2, I3, I4 meet the min_sup count.
TABLE-3
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemset. From TABLE-1 find out the occurrences of 2-itemset.
TABLE-4
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that the item sets {I1, I4} and {I3, I4} do not meet min_sup, thus
they are deleted.
TABLE-5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemset. From the TABLE- 1 find out occurrences of 3-itemset.
From TABLE-5, find out the 2-itemset subsets which support min_sup.
We can see for itemset {I1, I2, I3} subsets, {I1, I2}, {I1, I3}, {I2, I3} are occurring in TABLE-5 thus
{I1, I2, I3} is frequent.
For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4} and {I2, I4}; {I1, I4} is not frequent, as it
does not occur in TABLE-5, thus {I1, I2, I4} is not frequent and hence it is deleted.
TABLE-6
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4
Only {I1, I2, I3} is frequent.
6. Generate Association Rules: From the frequent itemset discovered above, the association
rules could be:
{I1, I2} => {I3}
Confidence = support {I1, I2, I3} / support {I1, I2} = (3/ 4)* 100 = 75%
{I1, I3} => {I2}
Confidence = support {I1, I2, I3} / support {I1, I3} = (3/ 3)* 100 = 100%
{I2, I3} => {I1}
Confidence = support {I1, I2, I3} / support {I2, I3} = (3/ 4)* 100 = 75%
{I1} => {I2, I3}
Confidence = support {I1, I2, I3} / support {I1} = (3/ 4)* 100 = 75%
{I2} => {I1, I3}
Confidence = support {I1, I2, I3} / support {I2} = (3/ 5)* 100 = 60%
{I3} => {I1, I2}
Confidence = support {I1, I2, I3} / support {I3} = (3/ 4)* 100 = 75%
This shows that all the above association rules are strong if minimum confidence threshold is
60%.
The Apriori Algorithm: Pseudo Code
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated by joining Lk with itself;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with at least min_sup support;
end
return ∪k Lk;
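The pseudo code can be turned into a compact, runnable Python sketch of the join-and-prune loop (the transaction data is TABLE-1 from the worked example above; the function and variable names are my own, and this is an illustration rather than an optimized implementation):

from itertools import combinations

# Transactions from TABLE-1
transactions = [
    {"I1", "I2", "I3"},
    {"I2", "I3", "I4"},
    {"I4", "I5"},
    {"I1", "I2", "I4"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3", "I4"},
]
min_sup = 3   # absolute support count (50% of 6 transactions)

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup}

frequent = {}
k = 1
while Lk:
    frequent[k] = Lk
    # Join step: build (k+1)-item candidates from pairs of frequent k-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Prune step (antimonotone property): every k-subset of a candidate must be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
    # Keep only the candidates that meet min_sup
    Lk = {c for c in candidates if support_count(c) >= min_sup}
    k += 1

print(frequent)   # frequent[3] contains only {I1, I2, I3}, matching the worked example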
Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large databases
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum support is
kept very low.
2. The entire database needs to be scanned repeatedly.
Methods To Improve Apriori Efficiency
Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and its corresponding count. It uses a hash function for
generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in later
iterations. Transactions which do not contain any frequent items are marked or removed (see the sketch after this list).
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from Database D and then searches for
frequent itemset in S. It may be possible to lose a global frequent itemset. This can be
reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any
marked start point of the database during the scanning of the database.
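The transaction-reduction idea in point 2 above can be sketched in a few lines of Python (the transactions and the set of frequent items are illustrative assumptions): once the frequent 1-itemsets are known, a transaction containing none of them can never contribute to a larger frequent itemset, so it is dropped from later scans.

# Illustrative transactions and frequent items (assumed output of the first scan)
transactions = [
    {"I1", "I2", "I3"},
    {"I4", "I5"},
    {"I1", "I2", "I4"},
]
frequent_items = {"I1", "I2", "I3"}

# Keep only transactions that contain at least one frequent item;
# the others cannot contribute to any larger frequent itemset.
reduced = [t for t in transactions if t & frequent_items]
print(reduced)    # {"I4", "I5"} is dropped: it contains no frequent item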
Applications Of Apriori Algorithm
Some fields where Apriori is used:
1. In Education Field: Extracting association rules in data mining of admitted students
through characteristics and specialties.
2. In the Medical field: For example Analysis of the patient’s database.
3. In Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
4. Apriori is used by many companies, such as Amazon in its recommender system and
Google for the auto-complete feature.
Classification & Prediction
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, while prediction models predict
continuous-valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the expenditures
in dollars of potential customers on computer equipment given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to
determine whether the customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no for
marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we need to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model or a
predictor will be constructed that predicts a continuous-valued-function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and their
associated class labels.
Each tuple that constitutes the training set is assumed to belong to a predefined category or
class. These tuples can also be referred to as samples, objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate
the accuracy of classification rules. The classification rules can be applied to the new data tuples
if the accuracy is considered acceptable.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −
Data Cleaning − Data cleaning involves removing the noise and treating missing
values. The noise is removed by applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with the most commonly occurring
value for that attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
Data Transformation and reduction − The data can be transformed by any of the
following methods.
o Normalization − The data is transformed using normalization. Normalization
involves scaling all values for a given attribute in order to make them fall within a
small specified range. Normalization is used when, in the learning step, neural
networks or methods involving distance measurements are used.
o Generalization − The data can also be transformed by generalizing it to
higher-level concepts. For this purpose we can use the concept hierarchies.
Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each
leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a test
on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm
known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, which was the successor of
ID3. ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees
are constructed in a top-down recursive divide-and-conquer manner.
Generating a decision tree from training tuples of data partition D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute selection method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A Decision Tree
Method:
create a node N;
if tuples in D are all of the same class, C, then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list)
    to find the “best” splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
    multiway splits are allowed then // not restricted to binary trees
    attribute_list = attribute_list − splitting_attribute; // remove splitting_attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
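The algorithm above can be illustrated with a small runnable Python sketch of ID3-style induction, using information gain as the attribute-selection method (the dataset, attribute names and function names below are illustrative assumptions, not taken from the text):

import math
from collections import Counter

def entropy(labels):
    # Expected information (in bits) needed to classify a tuple in `labels`
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Reduction in entropy obtained by splitting the tuples on attribute `attr`
    total = len(labels)
    split_info = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        split_info += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_info

def build_tree(rows, labels, attributes):
    # Tuples all of the same class: return a leaf labeled with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Attribute list empty: return a leaf labeled with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Attribute selection: split on the attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attributes if a != best])
    return node

# Tiny illustrative training set: will the customer buy a computer?
rows = [
    {"age": "youth",  "student": "no"},
    {"age": "youth",  "student": "yes"},
    {"age": "middle", "student": "no"},
    {"age": "senior", "student": "yes"},
]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))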
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
Pre-pruning − The tree is pruned by halting its construction early.
Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
Number of leaves in the tree, and
Error rate of the tree
Bayesian Classification
Why Bayesian?
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most
practical approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
Posterior Probability:
For classification problems, determine P(H|X): the probability that the hypothesis H holds given
the observed data sample X
Prior Probability:
P(H): prior probability of hypothesis H
It is the initial probability before we observe any data
It reflects the background knowledge
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X, given that the
hypothesis H holds
Bayesian Theorem:
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:
P(H|X) = P(X|H) P(H) / P(X)
Naïve Bayesian Classifier
Example:
Given the following data sample :
X=(age<=30,Income=medium,Student=yes,Credit_rating=Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
P(age=“<30” | buys_computer=“yes”) = 2/9=0.222
P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4
X=(age<=30 ,income =medium, student=yes,credit_rating=fair)
P(X|buys_computer=“yes”)= 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019
P(buys_computer=“yes”) = 9 / 14 = 0.643
P(buys_computer=“no”) = 5 / 14 = 0.357
To find the class Ci that maximizes P(X|Ci) P(Ci), we compute
P(X|buys_computer=“yes”) P(buys_computer=“yes”)=0.044 X 0.643=0.028
P(X|buys_computer=“no”) P(buys_computer=“no”)=0.019 X 0.357 = 0.007
Therefore , the naïve Bayesian classifier predicts buy_computer=yes for tuple X.
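The hand computation above can be reproduced with a short Python sketch (the variable and function names are my own); it estimates the class prior and the class-conditional probabilities from the 14 training tuples in the table and then classifies X:

from collections import Counter

# The 14 training tuples from the table above:
# (age, income, student, credit_rating) -> buys_computer
data = [
    (("<=30",   "high",   "no",  "fair"),      "no"),
    (("<=30",   "high",   "no",  "excellent"), "no"),
    (("31..40", "high",   "no",  "fair"),      "yes"),
    ((">40",    "medium", "no",  "fair"),      "yes"),
    ((">40",    "low",    "yes", "fair"),      "yes"),
    ((">40",    "low",    "yes", "excellent"), "no"),
    (("31..40", "low",    "yes", "excellent"), "yes"),
    (("<=30",   "medium", "no",  "fair"),      "no"),
    (("<=30",   "low",    "yes", "fair"),      "yes"),
    ((">40",    "medium", "yes", "fair"),      "yes"),
    (("<=30",   "medium", "yes", "excellent"), "yes"),
    (("31..40", "medium", "no",  "excellent"), "yes"),
    (("31..40", "high",   "yes", "fair"),      "yes"),
    ((">40",    "medium", "no",  "excellent"), "no"),
]

X = ("<=30", "medium", "yes", "fair")    # the unknown tuple to classify

class_counts = Counter(label for _, label in data)   # 9 "yes", 5 "no"

def p_x_given_c(x, c):
    # Naive assumption: the attributes are conditionally independent given the class,
    # so P(X|C) is the product of the per-attribute probabilities P(x_i|C).
    rows = [attrs for attrs, label in data if label == c]
    prob = 1.0
    for i, value in enumerate(x):
        prob *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prob

# P(X|Ci) P(Ci) for each class: about 0.028 for "yes" and 0.007 for "no"
scores = {c: p_x_given_c(X, c) * class_counts[c] / len(data) for c in class_counts}
print(scores)
print("predicted:", max(scores, key=scores.get))      # "yes"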
Advantages:
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., in hospitals a patient is described by a profile (age, family history, etc.), symptoms
(fever, cough, etc.) and disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
Rule Based Classification
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part, the condition, consists of one or more attribute tests, and these tests
are logically ANDed.
The consequent part consists of the class prediction.
Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
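These two measures can be computed directly, as in the following Python sketch (the rule is R1 from above; the tuples and attribute values are illustrative assumptions):

D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]

# R1: IF age = youth AND student = yes THEN buys_computer = yes
condition = lambda t: t["age"] == "youth" and t["student"] == "yes"
predicted_class = "yes"

covered = [t for t in D if condition(t)]                                  # n_covers
correct = [t for t in covered if t["buys_computer"] == predicted_class]   # n_correct

coverage = len(covered) / len(D)                              # 1/3
accuracy = len(correct) / len(covered) if covered else 0.0    # 1/1 = 1.0
print(coverage, accuracy)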
If more than one rule is triggered, need conflict resolution
Size ordering: assign the highest priority to the triggering rule that has the
toughest requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or misclassification cost per class
Rule-based ordering (decision list): rules are organized into one long priority list,
according to some measure of rule quality or by experts
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Example: Rule extraction from our buys_computer decision-tree
Rule Extraction from the Training Data
Sequential covering algorithm: Extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but
none (or few) of the tuples of other classes
steps:
Rules are learned one at a time
Each time a rule is learned, the tuples covered by the rules are removed
The process repeats on the remaining tuples until a termination condition is met, e.g., when
there are no more training examples or when the quality of a rule returned is below a
user-specified threshold
Rule Induction Using Sequential Covering Algorithm
Sequential Covering Algorithm can be used to extract IF-THEN rules form the training
data. We do not require to generate a decision tree first. In this algorithm, each rule for a given
class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by
the rule are removed and the process continues for the rest of the tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously,
because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class
at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of
class Ci only and no tuple from any other class.
Algorithm: Sequential Covering
Input:
D, a data set class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output: A Set of IF-THEN rules.
Method:
Rule_set = { };   // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;   // add the new rule to the rule set
    until termination condition;
end for
return Rule_set;
Rule Pruning
The rule is pruned due to the following reason −
The Assessment of quality is made on the original set of training data. The rule may
perform well on training data but less well on subsequent data. That's why the rule
pruning is required.
The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has
greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational
analogues of neurons
A neural network: A set of connected input/output units where each connection has
a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able
to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters typically best determined empirically, e.g., the network
topology or “structure”
Poor interpretability: Difficult to interpret the symbolic meaning behind the learned
weights and of “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from trained neural
networks.
A Neuron (= a perceptron)
A Multi-Layer Feed-Forward Neural Network
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to
an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: Given enough
hidden units and enough training samples, they can closely approximate any function
Example:
To classify an unknown tuple, X, the tuple is input to the trained network, and the net
input and output of each unit are computed. (There is no need for computation and/or
backpropagation of the error.) If there is one output node per class, then the output node with
the highest value determines the predicted class label for X. If there is only one output node,
then output values greater than or equal to 0.5 may be considered as belonging to the positive
class, while values less than 0.5 may be considered negative.
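To make this classification step concrete, the following is a hedged NumPy sketch of a single forward pass through a small two-layer feed-forward network with sigmoid units (the weights, biases and input values are arbitrary illustrative numbers, not learned ones); the output node with the highest value gives the predicted class label:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative, already-"trained" weights and biases (arbitrary values, not learned here):
# 3 input units -> 2 hidden units -> 2 output units (one output node per class).
W_hidden = np.array([[ 0.2, -0.3],
                     [ 0.4,  0.1],
                     [-0.5,  0.2]])
b_hidden = np.array([-0.4, 0.2])
W_output = np.array([[-0.3, 0.3],
                     [-0.2, 0.4]])
b_output = np.array([0.1, -0.1])

X = np.array([1.0, 0.0, 1.0])          # attribute values of the unknown tuple

hidden_out = sigmoid(X @ W_hidden + b_hidden)           # net input and output of each hidden unit
net_output = sigmoid(hidden_out @ W_output + b_output)  # output layer emits the prediction

print(net_output)
print("predicted class:", int(np.argmax(net_output)))   # output node with the highest value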
Lazy Learners (or Learning from Your Neighbors)
The classification methods discussed so far in this chapter—decision tree induction,
Bayesian classification, rule-based classification, classification by backpropagation, support
vector machines, and classification based on association rule mining—are all examples of eager
learners. Eager learners, when given a set of training tuples, will construct a generalization (i.e.,
classification) model before receiving new (e.g., test) tuples to classify. We can think of the
learned model as being ready and eager to classify previously unseen tuples.
k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is
labor intensive when given large training sets, and did not gain popularity until the 1960s when
increased computing power became available. It has since been widely used in the area of
pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it. The training tuples are described
by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the
training tuples are stored in an n-dimensional pattern space. When given an unknown tuple,
a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are
closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the
unknown tuple.
“Closeness” is defined in terms of a distance metric, such as Euclidean distance. The
Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 =
(x21, x22, ..., x2n), is
dist(X1, X2) = √( Σ_{i=1}^{n} (x1i − x2i)² )
In other words, for each numeric attribute, we take the difference between the
corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and
accumulate it. The square root of the total accumulated distance is then taken.
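A short Python sketch of a k-nearest-neighbor classifier built on this Euclidean distance follows (the training tuples, the unknown tuple and the value of k are illustrative assumptions):

import math
from collections import Counter

def euclidean(x1, x2):
    # dist(X1, X2) = sqrt( sum_i (x1i - x2i)^2 )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(unknown, training, k=3):
    # Return the majority class among the k training tuples closest to `unknown`.
    neighbors = sorted(training, key=lambda pair: euclidean(unknown, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Illustrative training tuples: (attribute vector, class label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((4.8, 5.2), "B"), ((5.1, 4.9), "B")]

print(knn_classify((1.1, 0.9), training, k=3))   # two "A" neighbors outvote one "B" -> "A"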
Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions to solve new
problems. Unlike nearest-neighbor classifiers, which store training tuples as points in Euclidean
space, CBR stores the tuples or “cases” for problem solving as complex symbolic descriptions.
Business applications of CBR include problem resolution for customer service help desks,
where cases describe product-related diagnostic problems. CBR has also been applied to areas
such as engineering and law, where cases are either technical designs or legal rulings,
respectively. Medical education is another area for CBR, where patient case histories and
treatments are used to help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if an identical
training case exists. If one is found, then the accompanying solution to that case is returned. If
no identical case is found, then the case-based reasoner will search for training cases having
components that are similar to those of the new
case. Conceptually, these training cases may be considered as neighbors of the new case. If
cases are represented as graphs, this involves searching for subgraphs that are similar to
subgraphs within the new case. The case-based reasoner tries to combine the solutions of the
neighboring training cases in order to propose a solution for the new case. If incompatibilities
arise with the individual solutions, then backtracking to search for other solutions may be
necessary. The case-based reasoner may employ background knowledge and problem-solving
strategies in order to propose a feasible combined solution.
Other Classification Methods
Genetic Algorithms
Genetic Algorithm: based on an analogy to biological evolution
An initial population is created consisting of randomly generated rules
Each rule is represented by a string of bits
E.g., if A1 and ¬A2 then C2 can be encoded as 100 (a small sketch of this encoding follows this list)
If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population is formed to consist of the
fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training
examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves in which each rule satisfies
a prespecified fitness threshold
Slow but easily parallelizable
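As a toy illustration of the encoding and the crossover/mutation operators mentioned above, here is a small Python sketch (the training examples, population size and number of generations are arbitrary assumptions, and the fitness measure is a simplified stand-in for classification accuracy):

import random
random.seed(0)

# The rule "IF A1 AND NOT A2 THEN C2" is encoded as the bit string "100":
# bit 1 = A1, bit 2 = A2, bit 3 = class (0 standing for C2).

training = [(1, 0, 0), (1, 0, 0), (0, 1, 1)]   # (A1, A2, class) examples

def fitness(rule):
    # Fitness = accuracy on the training examples; an example is counted correct
    # only when the rule's attribute bits and class bit all match it.
    return sum(1 for a1, a2, cls in training
               if a1 == int(rule[0]) and a2 == int(rule[1]) and int(rule[2]) == cls) / len(training)

def crossover(p, q):
    # Single-point crossover of two parent bit strings.
    point = random.randrange(1, len(p))
    return p[:point] + q[point:]

def mutate(rule, rate=0.1):
    # Flip each bit with a small probability.
    return "".join(str(1 - int(b)) if random.random() < rate else b for b in rule)

# Initial population of randomly generated rules.
population = ["".join(random.choice("01") for _ in range(3)) for _ in range(6)]

for _ in range(10):                                    # evolve a few generations
    population.sort(key=fitness, reverse=True)
    parents = population[:3]                           # survival of the fittest
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(3)]
    population = parents + offspring

best = max(population, key=fitness)
print(best, fitness(best))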
Rough Set Approach:
Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation
(certain to be in C) and an upper approximation (cannot be described as not belonging
to C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but
a discernibility matrix (which stores the differences between attribute values for each
pair of data tuples) is used to reduce the computation intensity
Fuzzy Set approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
(such as using fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values
calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed, and these sums are
combined
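As an illustration of the idea, here is a brief Python sketch mapping a numeric income to fuzzy membership degrees in the discrete categories {low, medium, high} (the membership functions and breakpoints are arbitrary assumptions, not values from the text):

def fuzzy_income(income):
    # Map a numeric income (in $1000s, say) to fuzzy membership degrees in the
    # discrete categories {low, medium, high}; the breakpoints are arbitrary
    # illustrative choices.
    low    = max(0.0, min(1.0, (40 - income) / 20))      # 1 below 20k, falls to 0 at 40k
    medium = max(0.0, 1.0 - abs(income - 50) / 30)       # peaks at 50k, 0 outside 20k..80k
    high   = max(0.0, min(1.0, (income - 60) / 20))      # 0 below 60k, reaches 1 at 80k
    return {"low": low, "medium": medium, "high": high}

# For an income of 70k, more than one fuzzy value applies:
print(fuzzy_income(70))   # {'low': 0.0, 'medium': 0.333..., 'high': 0.5}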