UNIT-2
CLASSIFICATION AND CLUSTERING
Machine Learning algorithms are generally categorized based upon the type of
output variable and the type of problem that needs to be addressed. These algorithms are
broadly divided into three types i.e. Regression, Clustering, and Classification. Regression
and Classification are types of supervised learning algorithms while Clustering is a type of
unsupervised algorithm.
When the output variable is continuous, then it is a regression problem whereas
when it contains discrete values, it is a classification problem. Clustering algorithms are
generally used when we need to create the clusters based on the characteristics of the data
points.
Classification
Classification is a type of supervised machine learning algorithm. For any given input,
the classification algorithms help in the prediction of the class of the output variable. There
can be multiple types of classifications like binary classification, multi-class classification,
etc. It depends upon the number of classes in the output variable.
Types of Classification algorithms
Logistic Regression: – It is one of the linear models which can be used for
classification. It uses the sigmoid function to calculate the probability of a certain event
occurring. It is an ideal method for the classification of binary variables.
K-Nearest Neighbours (kNN): – It uses distance metrics like Euclidean distance,
Manhattan distance, etc. to calculate the distance of one data point from every other data
point. To classify the output, it takes a majority vote from k nearest neighbors of each data
point.
Decision Trees: – It is a non-linear model that overcomes a few of the drawbacks of
linear algorithms like Logistic regression. It builds the classification model in the form of a
tree structure that includes nodes and leaves. This algorithm involves multiple if-else
statements which help in breaking down the structure into smaller structures and
eventually providing the final outcome. It can be used for regression as well as classification
problems.
Random Forest: – It is an ensemble learning method that involves multiple decision
trees to predict the outcome of the target variable. Each decision tree provides its own
outcome. In the case of the classification problem, it takes the majority vote of these
multiple decision trees to classify the final outcome. In the case of the regression problem, it
takes the average of the values predicted by the decision trees.
Naïve Bayes: – It is an algorithm that is based upon Bayes’ theorem. It assumes that
any particular feature is independent of the inclusion of other features. i.e. They are not
correlated to one another. It generally does not work well with complex data due to this
assumption as in most of the data sets there exists some kind of relationship between the
features.
Support Vector Machine: – It represents the data points in multi-dimensional space.
These data points are then segregated into classes with the help of hyperplanes. It plots an
n-dimensional space for the n number of features in the dataset and then tries to create the
hyperplanes such that it divides the data points with maximum margin.
Clustering
Clustering is a type of unsupervised machine learning algorithm. It is used to group
data points having similar characteristics as clusters. Ideally, the data points in the same
cluster should exhibit similar properties and the points in different clusters should be as
dissimilar as possible.
Clustering is divided into two groups – hard clustering and soft clustering. In hard
clustering, the data point is assigned to one of the clusters only whereas in soft clustering, it
provides a probability likelihood of a data point to be in each of the clusters.
Types of Clustering algorithms
K-Means Clustering: – It initializes a pre-defined number of k clusters and uses
distance metrics to calculate the distance of each data point from the centroid of each
cluster. It assigns the data points into one of the k clusters based on its distance.
Agglomerative Hierarchical Clustering (Bottom-Up Approach): – It considers each
data point as a cluster and merges these data points on the basis of distance metric and the
criterion which is used for linking these clusters.
Divisive Hierarchical Clustering (Top-Down Approach): – It initializes with all the
data points as one cluster and splits these data points on the basis of distance metric and
the criterion. Agglomerative and Divisive clustering can be represented as a dendrogram
and the number of clusters to be selected by referring to the same.
DBSCAN (Density-based Spatial Clustering of Applications with Noise): – It is a
density-based clustering method. Algorithms like K-Means work well on the clusters that are
fairly separated and create clusters that are spherical in shape. DBSCAN is used when the
data is in arbitrary shape and it is also less sensitive to the outliers. It groups the data points
that have many neighbouring data points within a certain radius.
OPTICS (Ordering Points to Identify Clustering Structure): – It is another type of
density-based clustering method and it is similar in process to DBSCAN except that it
considers a few more parameters. But it is more computationally complex than DBSCAN.
Also, it does not separate the data points into clusters, but it creates a reachability plot
which can help in the interpretation of creating clusters.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): – It creates
clusters by generating a summary of the data. It works well with huge datasets as it first
summarises the data and then uses the same to create clusters. However, it can only deal
with numeric attributes that can be represented in space.
Difference Between Clustering and Classification
Characteristics Classification Clustering
Type Supervised Unsupervised
Process Classifying the input data as Data points are grouped as
one of the class labels from clusters based on their
the output variable. similarities.
Prediction Classification involves the Clustering is generally used
prediction of the input to analyze the data and draw
variable based on the model inferences from it for better
building. decision making
Splitting of data Classification algorithms Clustering algorithms do not
need the data to be split as need the splitting of data for
training and test data for its use.
predicting and evaluating the
model.
Data Label Classification algorithms deal Clustering algorithms deal
with labelled data with unlabeled data.
Stages Classification process Clustering process involves
involves two stages – only the grouping of data.
Training and Testing.
Complexity High complexity because Low complexity because
classification deals with a Clustering algorithms whose
greater number of stages. aim is only to group the data.
Classification and Prediction
There are two forms of data analysis that can be used for extracting models
describing important classes or to predict future data trends. These two forms are as
follows −
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income
and occupation.
Example - Classification
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which
customer (loan applicant) are risky or which are safe.
A marketing manager at a company needs to analyze a customer with a
given profile, who will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no for
marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are bothered to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model
or a predictor will be constructed that predicts a continuous-valued-function or ordered
value.
Classifier Model
The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and
their associated class labels.
Each tuple that constitutes the training set is referred to as a category or
class. These tuples can also be referred to as sample, object or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be applied to the
new data tuples if the accuracy is considered acceptable.
Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Bayesian classification
It is based on Bayes' Theorem. Bayesian classifiers are the statistical classifiers.
Bayesian classifiers can predict class membership probabilities such as the probability that
a given tuple belongs to a particular class.
Baye's Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
Posterior Probability [P(H/X)]
Prior Probability [P(H)]
where X is data tuple and H is some hypothesis.
According to Bayes' Theorem, P(H/X)= P(X/H)P(H) / P(X)
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are
also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
A Belief Network allows class conditional independencies to be defined
between subsets of variables.
It provides a graphical model of causal relationship on which learning can
be performed.
We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
Directed acyclic graph
A set of conditional probability tables
Directed Acyclic Graph
Each node in a directed acyclic graph represents a random variable.
These variables may be discrete or continuous valued.
These variables may correspond to the actual attribute given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arc in the diagram allows representation of causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as whether or not
the person is a smoker. It is worth noting that the variable PositiveXray is independent of
whether the patient has a family history of lung cancer or that the patient is a smoker,
given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC)
showing each possible combination of the values of its parent nodes, FamilyHistory (FH),
and Smoker (S) is as follows −
Rule Based Classification
IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following from −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buy_computer = yes
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part the condition consist of one or more attribute tests and these
tests are logically ANDed.
The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes))(buys computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules
from a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Types of data
At the highest level, two kinds of data exist: quantitative and qualitative.
Quantitative data deals with numbers and things you can measure objectively:
dimensions such as height, width, and length, Temperature and humidity, Prices, Area and
volume.
Qualitative data deals with characteristics and descriptors that can't be easily
measured, but can be observed subjectively—such as smells, tastes, textures,
attractiveness, and color.
There are two types of quantitative data, which is also referred to as numeric
data: continuous and discrete. As a general rule, counts are discrete and measurements are
continuous.
Discrete data is a count that can't be made more precise. Typically it involves
integers. For instance, the number of children (or adults, or pets) in your family is discrete
data, because you are counting whole, indivisible entities: you can't have 2.5 kids, or 1.3
pets.
Continuous data, on the other hand, could be divided and reduced to finer and finer
levels. For example, you can measure the height of your kids at progressively more precise
scales—meters, centimeters, millimeters, and beyond—so height is continuous data.
There are three main kinds of qualitative data.
1. Binomial
2. Nominal
3. Ordinal
Binary data place things in one of two mutually exclusive categories: right/wrong,
true/false, or accept/reject.
When collecting unordered or nominal data, we assign individual items to named
categories that do not have an implicit or natural value or rank.
We also can have ordered or ordinal data, in which items are assigned to categories
that do have some kind of implicit or natural order, such as "Short, Medium, or Tall."
Categorization of Major Clustering Methods
Cluster is a group of objects that belongs to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Applications of Cluster Analysis
Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer
base. And they can characterize their customer groups based on the purchasing
patterns.
In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information
discovery.
Clustering is also used in outlier detection applications such as detection of
credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It means
that it will classify the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Note −
For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to other.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are close
to one another. It keep on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In the continuous iteration, a cluster is split up into smaller
clusters. It is down until each object in one cluster or the termination condition holds. This
method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some threshold, i.e.,
for each data point within a given cluster, the radius of a given cluster has to contain at
least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite
number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data
for a given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields robust
clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation or the
properties of desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or the
application requirement.
Outlier Analysis
It is one of the Data mining function or process. Outliers may be defined as
following:
A database may contain data objects that do not comply with the general
behavior or
Model of the data. Such data objects, which are grossly different from or
inconsistent with the remaining set of data, are called outliers.
Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can
be more interesting than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects
that are a substantial distance from any other cluster are considered outliers.
Rather than using statistical or distance measures, deviation-based methods
identify outliers by examining differences in the main characteristics of
objects in a group.
Outliers can be caused by measurement or execution error.
Outliers may be the result of inherent data variability.
Many data mining algorithms try to minimize the influence of outliers or
eliminate them all together.
This, however, could result in the loss of important hidden information
because one person’s noise could be another person’s signal.
Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
Data Mining Applications
Data mining is widely used in diverse areas. There are a number of commercial data
mining systems available today and yet there are many challenges in this field. In this
tutorial, we will discuss the applications and the trend of data mining.
Here is the list of areas where data mining is widely used −
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
The financial data in banking and financial industry is generally reliable and of high
quality which facilitates systematic data analysis and data mining. Some of the typical cases
are as follows −
Design and construction of data warehouses for multidimensional data
analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data Mining has its great application in Retail Industry because it collects large
amount of data from on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue to
expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data
mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the most emerging industries
providing various services such as fax, pager, cellular phone, internet messenger, images, e-
mail, web data transmission, etc. Due to the development of new computer and
communication technologies, the telecommunication industry is rapidly expanding. This is
the reason why data mining is become very important to help and understand the
business.
Data mining in telecommunication industry helps in identifying the
telecommunication patterns, catch fraudulent activities, make better use of resource, and
improve quality of service. Here is the list of examples for which data mining improves
telecommunication services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen a tremendous growth in the field of biology such as
genomics, proteomics, functional Genomics and biomedical research. Biological data
mining is a very important part of Bioinformatics. Following are the aspects in which data
mining contributes for biological data analysis −
Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
Alignment, indexing, similarity search and comparative analysis multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous
data sets for which the statistical techniques are appropriate. Huge amount of data have
been collected from scientific domains such as geosciences, astronomy, etc. A large
amount of data sets is being generated because of the fast numerical simulations in various
fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.
Following are the applications of data mining in the field of Scientific Applications −
Data Warehouses and data preprocessing.
Graph-based mining.
Visualization and domain specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the
availability of network resources. In this world of connectivity, security has become the
major issue. With increased usage of internet and availability of the tools and tricks for
intruding and attacking network prompted intrusion detection to become a critical
component of network administration. Here is the list of areas in which data mining
technology may be applied for intrusion detection −
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.