MACHINE LEARNING
UNIT-1
Clustering
TOPICS TO BE COVERED
Clustering
Reinforcement Learning
Decision Tree Learning
Bayesian Networks
Support Vector Machine
Genetic Algorithm
Issues in Machine Learning
Data Science Vs Machine Learning
Clustering: An Unsupervised Learning
UNSUPERVISED LEARNING:
● Draw inferences from datasets consisting of input data without labeled
responses
● Describe hidden structure from unlabeled data
Unsupervised learning problems can be divided into two categories:
1. CLUSTERING
2. ASSOCIATION RULE
Clustering: An Unsupervised Learning
● The task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar to each other than to
those in other clusters.
● It makes unlabeled data easier to understand and manipulate.
● The machine learns the attributes and trends by itself, without any provided
input-output mapping.
● Clustering algorithms extract patterns from the data objects and then
group the objects into discrete clusters accordingly.
Clustering: Applications in Biology
Clustering: Other Applications
❖ Google News Clustering.
❖ Marketing: Customer Segmentation based on a database of customer data containing their
properties, and past buying records.
❖ Recognizing Communities in social networks.
Major Clustering Approaches
● Partitioning: Construct various partitions and then evaluate them by some criterion
● Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
● Model-Based: Hypothesize a model for each cluster and find best fit of models to data
● Density-Based: Guided by connectivity and density functions
● Graph-Theoretic: Treat items as nodes of a similarity-weighted graph and partition the graph
Aspects of Clustering
The quality of a clustering result depends on the algorithm, the distance function, and the application.
● A clustering algorithm such as
➔ Partitioning Clustering, e.g. K-Means
➔ Hierarchical Clustering, e.g. Agglomerative Hierarchical Clustering (AHC)
➔ Mixture of Gaussians
● A Distance or Similarity Function
➔ Such as Euclidean, Minkowski, Cosine
● Clustering Quality
➔ Inter-cluster distance: maximized
➔ Intra-cluster distance: minimized
Partitioning Algorithms
● Partitioning Method: Construct a partition of a database D of m objects into a set of k clusters
● Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
➔ Global Optimum: Exhaustively enumerate all partitions (feasible only for very small m)
➔ Heuristic Method: k-means (MacQueen, 1967)
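The k-means heuristic named above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the squared Euclidean criterion, the random choice of initial centroids, and the toy data are all chosen here for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Alternate assignment and centroid update until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: k distinct data points as start
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Because k-means is a heuristic, it may stop at a local optimum that depends on the initial centroids; rerunning with several seeds is a common precaution.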
Hierarchical Clustering
➔ Example: hierarchical clustering of animals into vertebrates and invertebrates.
➔ Produce a nested sequence of clusters.
➔ One approach: Recursive application of a partitional clustering algorithm.
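A nested sequence of clusters can also be built bottom-up. The sketch below uses agglomerative clustering with single linkage, which is a different route from the recursive (top-down) approach mentioned above. The function name `single_linkage`, the Euclidean distance, and the toy points are assumptions made for illustration.

```python
import math


def single_linkage(points, num_clusters):
    """Merge the two closest clusters repeatedly until num_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history, which records the nested sequence of clusters.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, merges


# Toy data (hypothetical): two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, num_clusters=2)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

Reading the merge history from first to last gives the hierarchy from the leaves upward; stopping at a different `num_clusters` cuts the same hierarchy at a different level.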
Model Based Clustering
➔ A model is hypothesized
➔ E.g. Assume data is generated by a mixture of underlying probability distributions
➔ Fit the data to model
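As one possible instance of these steps, the sketch below hypothesizes a mixture of two one-dimensional Gaussians and fits it with expectation-maximization (EM). The choice of EM, the two-component restriction, the initialization from the minimum and maximum, and all names (`fit_two_gaussians`, `normal_pdf`) are assumptions for illustration only.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    # Crude initialization (an assumption): extremes as means, shared variance.
    means = [min(data), max(data)]
    overall = sum(data) / len(data)
    var0 = sum((x - overall) ** 2 for x in data) / len(data)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            scores = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsing component
    return weights, means, variances


# Toy data (hypothetical): values scattered around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_two_gaussians(data)
print(weights, means, variances)
```

Unlike k-means, the fitted model gives each point a soft membership (the responsibilities) rather than a hard label.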
Density Based Clustering
➔ Based on density connected points
➔ Locates regions of high density separated by regions of low density
➔ E.g., DBSCAN
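The idea behind DBSCAN can be sketched in a few lines. This is a simplified, brute-force illustration, not an optimized implementation: the parameter names `eps` and `min_pts`, the helper `region_query`, the use of -1 for noise, and the toy data are assumptions made here.

```python
import math

NOISE = -1  # label used here for points in low-density regions


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts):
    """Grow clusters outward from core points (at least min_pts neighbours)."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Toy data (hypothetical): two dense squares and one isolated point.
points = [
    (0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
    (5, 5), (5, 5.5), (5.5, 5), (5.5, 5.5),
    (10, 10),
]
labels = dbscan(points, eps=0.8, min_pts=3)
print(labels)
```

Note that the number of clusters is not given in advance; it follows from the density parameters.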
Graph Theoretic Clustering
➔ Weights of edges between items (nodes) based on similarity.
➔ E.g., look for minimum cut in a graph
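To make the minimum-cut idea concrete, the sketch below tries every two-way split of a tiny similarity graph and keeps the split whose crossing edges have the smallest total weight. Brute-force enumeration is used only because the graph is tiny; the function name `min_cut_bruteforce`, the edge representation, and the example weights are assumptions for illustration.

```python
from itertools import combinations


def min_cut_bruteforce(nodes, weights):
    """Find the two-way split with the smallest total crossing similarity.

    weights maps frozenset({u, v}) to the similarity of the edge u-v.
    Exponential in the number of nodes, so only suitable for tiny graphs.
    """
    best_cost, best_side = None, None
    first, rest = nodes[0], nodes[1:]
    # Fix the first node on one side so each split is examined once;
    # stopping before len(rest) keeps the other side non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side = {first, *extra}
            # An edge crosses the cut when exactly one endpoint is in `side`.
            cost = sum(w for edge, w in weights.items() if len(edge & side) == 1)
            if best_cost is None or cost < best_cost:
                best_cost, best_side = cost, side
    return best_cost, best_side, set(nodes) - best_side


# Toy graph (hypothetical): two tightly linked triangles joined by one weak edge.
nodes = ["a", "b", "c", "d", "e", "f"]
weights = {
    frozenset({"a", "b"}): 0.9, frozenset({"b", "c"}): 0.8, frozenset({"a", "c"}): 0.85,
    frozenset({"d", "e"}): 0.9, frozenset({"e", "f"}): 0.8, frozenset({"d", "f"}): 0.85,
    frozenset({"c", "d"}): 0.1,
}
cost, side_a, side_b = min_cut_bruteforce(nodes, weights)
print(cost, sorted(side_a), sorted(side_b))
```

On larger graphs a plain minimum cut tends to split off single weakly connected nodes, which is one reason practical methods often use a size-normalized cut criterion instead.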
(Dis)similarity Measures
➔ Distance Metric (Scale-Dependent)
➔ Minkowski Family of distance measures
➔ Cosine Distance
➔ Correlation coefficients (scale-invariant)
➔ Mahalanobis distance
➔ Pearson Correlation
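Three of the measures listed above can be written directly from their standard definitions. The sketch below is illustrative; the function names and example vectors are chosen here. Mahalanobis distance is left out because it needs the inverse covariance matrix of the data, which is awkward without a linear algebra library.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (dev_a * dev_b)


# Example vectors (hypothetical): b is a scaled copy of a.
a = (1.0, 2.0, 3.0)
b = (2.0, 4.0, 6.0)
manhattan = minkowski(a, b, 1)
euclidean = minkowski(a, b, 2)
cos_dist = cosine_distance(a, b)
corr = pearson(a, b)
print(manhattan, euclidean, cos_dist, corr)
```

Since `b` is just `a` multiplied by 2, the Minkowski distances are clearly nonzero, while the cosine distance is (numerically) zero and the Pearson correlation is (numerically) one. Rescaling a vector changes the former but not the latter two.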
Quality of Clustering
Internal Evaluation: Assigns the best score to the algorithm that produces clusters with high similarity
within a cluster and low similarity between clusters, e.g., Davies-Bouldin Index
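A minimal sketch of the Davies-Bouldin index follows, assuming Euclidean distance and cluster centroids as representatives; the function names and the toy clusterings are chosen here for illustration. For each cluster, the index takes the worst ratio of combined within-cluster scatter to between-centroid distance, then averages over clusters, so lower values indicate better clustering.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a clustering given as a list of point lists."""
    centroids = [centroid(c) for c in clusters]
    # Scatter: average distance from a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two clusterings of the same four points (hypothetical data).
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]  # compact, far apart
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]   # stretched, overlapping
good_score = davies_bouldin(good)
bad_score = davies_bouldin(bad)
print(good_score, bad_score)
```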
External Evaluation: Evaluates the result against information not used for clustering, such as known class
labels and external benchmarks, e.g. Rand Index, Jaccard Index, F-measure
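The Rand and Jaccard indices can both be computed by counting pairs of objects. The sketch below is an illustration under the usual pair-counting definitions; the function names and label vectors are chosen here.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count object pairs by whether each labelling puts them together."""
    both = true_only = pred_only = neither = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            both += 1
        elif same_true:
            true_only += 1
        elif same_pred:
            pred_only += 1
        else:
            neither += 1
    return both, true_only, pred_only, neither


def rand_index(labels_true, labels_pred):
    """Fraction of pairs on which the two labellings agree."""
    both, true_only, pred_only, neither = pair_counts(labels_true, labels_pred)
    return (both + neither) / (both + true_only + pred_only + neither)


def jaccard_index(labels_true, labels_pred):
    """Like the Rand index, but ignoring pairs separated in both labellings."""
    both, true_only, pred_only, _ = pair_counts(labels_true, labels_pred)
    return both / (both + true_only + pred_only)


# Known classes versus a clustering result (hypothetical labels).
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 0, 0, 0, 0]
ri = rand_index(labels_true, labels_pred)
ji = jaccard_index(labels_true, labels_pred)
print(ri, ji)
```

Both indices compare groupings rather than label names, so a clustering that matches the known classes under different cluster numbers still scores 1.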
Issues in Machine Learning
1. How much training data is sufficient?
2. How much testing data is sufficient?
3. What methods should be used to reduce learning overhead?
4. Which methods should be used for which type of data?
5. What algorithms should be used?
6. Which algorithms perform best for which type of problem?
Machine Learning Basics: Relationship between AI, ML, and DS
Machine Learning Basics: Relationship between ML and DS
DS covers the whole spectrum of data processing, while ML covers the algorithmic and
statistical aspects.
Difference between ML and DS
Data science is a field that studies data and how to extract meaning from it, whereas machine learning is a
field devoted to understanding and building methods that utilize data to improve performance or inform
predictions.