K-MEANS CLUSTERING
What is clustering?
• Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often proximity under a defined distance measure.
K-MEANS CLUSTERING
• The k-means algorithm clusters n objects into k partitions based on their attributes, where k < n.
• It is similar to the expectation-maximization
algorithm for mixtures of Gaussians in that
they both attempt to find the centers of
natural clusters in the data.
• Simply speaking, k-means clustering is an algorithm to classify or group objects into K groups based on their attributes/features.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid (see the formula below).
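Formally, the objective minimized by k-means can be written as (a standard formulation, stated here for reference):

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where C_j is the j-th cluster and \mu_j is its centroid.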
How does the K-Means Clustering algorithm work?
Steps
• Step 1: Begin with a decision on the value of k = number of clusters.
• Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
• Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
• Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments. (A minimal code sketch of these steps follows.)
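As an illustration, the steps above can be sketched in a few lines of Python with NumPy. Note that the steps describe a sequential, sample-by-sample update; this sketch uses the more common batch variant, and the function name is our own:

import numpy as np

def kmeans(X, k, max_iters=100):
    """Minimal batch k-means sketch. X: (n, d) array; k: number of clusters.
    Initialization follows Step 2 above: the first k samples act as seeds.
    (Assumes no cluster ever becomes empty.)"""
    centroids = X[:k].astype(float)
    for _ in range(max_iters):
        # Step 3: distance from every sample to every centroid; assign nearest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned samples.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: converged when a full pass changes nothing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids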
How to choose the value of K (the number of clusters) in K-means clustering?
• The Elbow method is one of the most popular ways to find the optimal number of clusters.
• This method uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares, and it measures the total variation within the clusters. The formula to calculate the WCSS value (for 3 clusters) is given below:
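In its standard form (reconstructing the formula the text refers to), the WCSS for 3 clusters is:

\mathrm{WCSS} = \sum_{P_i \in \mathrm{Cluster}_1} d(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster}_2} d(P_i, C_2)^2 + \sum_{P_i \in \mathrm{Cluster}_3} d(P_i, C_3)^2

where d(P_i, C_j) is the distance between a data point P_i and the centroid C_j of cluster j.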
To find the optimal number of clusters, the elbow method follows these steps:
• Execute K-means clustering on the given dataset for different values of K (e.g., ranging from 1 to 10).
• For each value of K, calculate the WCSS value.
• Plot a curve of the calculated WCSS values against the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is taken as the best value of K (see the sketch below).
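A minimal sketch of this procedure, assuming scikit-learn and matplotlib are available (the synthetic data from make_blobs is purely illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data; replace with your own (n_samples, n_features) array.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

wcss = []
for k in range(1, 11):                      # K values from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                # inertia_ is the WCSS

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                                  # choose K at the 'elbow'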
Example
Consider eight points A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9), with initial seeds seed1=A1=(2,10), seed2=A4=(5,8), seed3=A7=(1,2), using the Euclidean distance. (The point-to-seed distance matrix is computed first; it is omitted here.)
After the 1st epoch the results would be:
1: {A1}, 2: {A3, A4, A5, A6, A8}, 3: {A2, A7}
with centers of the new clusters C1=(2, 10), C2=(6, 6) and C3=(1.5, 3.5).
After the 2nd epoch the results would be:
1: {A1, A8}, 2: {A3, A4, A5, A6}, 3: {A2, A7}
with centers C1=(3, 9.5), C2=(6.5, 5.25) and C3=(1.5, 3.5).
After the 3rd epoch, the results would be:
1: {A1, A4, A8}, 2: {A3, A5, A6}, 3: {A2, A7}
with centers C1=(3.67, 9), C2=(7, 4.33) and C3=(1.5, 3.5).
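These epochs can be verified with a short NumPy script that re-runs the assignment/update loop from the given seeds (the variable names are ours):

import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)  # A1..A8
centers = points[[0, 3, 6]].copy()  # seeds A1, A4, A7

for epoch in range(1, 4):
    # Assign each point to the nearest center (Euclidean distance).
    labels = np.linalg.norm(points[:, None] - centers[None, :], axis=2).argmin(axis=1)
    # Recompute each center as the mean of its cluster.
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    clusters = [[f"A{i+1}" for i in np.where(labels == j)[0]] for j in range(3)]
    print(f"epoch {epoch}: {clusters} centers={centers.round(2).tolist()}")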
Advantages and Disadvantages
Advantages
• It is very easy to understand and implement.
• With a large number of variables, K-means is typically faster than hierarchical clustering.
• When centroids are recomputed, an instance can change its cluster.
• K-means tends to form tighter clusters than hierarchical clustering.
Disadvantages
• It is difficult to predict the number of clusters, i.e., the value of k.
• The output is strongly affected by the initial inputs, such as the number of clusters (the value of k) and the initial seeds.
• The order of the data can have a strong impact on the final output.
• It is very sensitive to rescaling: normalizing or standardizing the data can completely change the final output.
Applications
• Market segmentation
• Document Clustering
• Image segmentation
• Customer segmentation
• Analyzing trends in dynamic data
Apriori Algorithm
What Is An Itemset?
• A set of items together is called an itemset. If any itemset has k-items it is
called a k-itemset. An itemset consists of two or more items. An itemset
that occurs frequently is called a frequent itemset. Thus frequent itemset
mining is a data mining technique to identify the items that often occur
together.
• For Example, Bread and butter, Laptop and Antivirus software, etc.
What Is A Frequent Itemset?
• An itemset is called frequent if its support meets a given minimum support threshold. (Confidence thresholds apply to the association rules generated from frequent itemsets.)
Apriori Algorithm
Worked example (tables from the slides, not reproduced here): frequent itemsets are mined from a transaction table twice, once with minimum support = 2 and once with minimum support = 3.
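To make the idea concrete, here is a small, self-contained sketch of Apriori-style frequent itemset mining on a toy transaction set. The transactions are illustrative (not the ones from the slides), and minimum support = 2 echoes the first worked example:

from itertools import combinations

# Toy transaction database (illustrative).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "laptop", "antivirus"},
    {"bread", "milk"},
    {"laptop", "antivirus"},
]
min_support = 2  # an itemset is frequent if it appears in >= 2 transactions

def support(itemset):
    """Count the transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Level 1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Level k: join frequent (k-1)-itemsets, prune, keep candidates with enough support.
# (Apriori principle: every subset of a frequent itemset must itself be frequent.)
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(set(itemset), "support =", support(itemset))

Running this prints each frequent itemset with its support, e.g. {bread, butter} with support 2; the loop stops at the first level that yields no frequent itemsets.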
Advantages of Apriori Algorithm
• It is used to find large (frequent) itemsets.
• It is simple to understand and apply.
Disadvantages of Apriori Algorithm
• Apriori is an expensive method for computing support, since each calculation requires a pass through the whole database.
• Sometimes a huge number of candidate itemsets must be generated, which makes it computationally even more expensive.