K Means Algorithm
January 7, 2018
1 K-Means Clustering
Asir Mosaddek Sakib ID: 1109044
1.1 Introduction
K-means clustering is a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining. k-means clustering aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean, serving as
a prototype of the cluster.
1.2 Algorithm
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
for i = 1 to m:
c(i):= index (from 1 to K) of cluster centroid closest to x(i)
for k = 1 to K:
mu(k):= average (mean) of points assigned to cluster k
Where, k is the number of clusters, m is the number of data, mu(k) contains the cluster centroids
and c(i) contains the index of ith cluster
1.3 Implementation
The following code is an implementation of K-Means clustering in python. The following libraries
are needed to run the script:
1. NumPy
2. MatPlotLib
After installing those libraries, run the following scripts in a python file. change the value of
K for number of different clusters.
To genrate data for the first time uncomment GenerateData() method.
In [2]: # import statements
import random
import matplotlib.pyplot as plt
import numpy as np
1
In [3]: '''
This method is used to generate random points and
save them into a file
'''
def GenerateData(file_name, max_points, max_value):
min_value = (-1) * max_value
with open(file_name, 'w') as file:
for i in range(max_points):
x = random.randint(min_value, max_value)
y = random.randint(min_value, max_value)
file.write("%d %d\n" % (x, y))
pass
In [4]: '''
This method is used to read 2D points
Each point is stored in a line, seperated by space
'''
def ReadData(file_name):
# Read the file, splitting by lines
f = open(file_name, 'r');
lines = f.read().splitlines();
f.close();
points = [];
for i in range(1, len(lines)):
line = lines[i].split(' ');
x_y = [];
for j in range(len(line)):
v = float(line[j]);
x_y.append(v);
points.append(x_y);
#random.shuffle(points);
return points;
In [5]: class KMeans:
def __init__(self, data, k):
self.data = data
self.k = k
self.centroids = []
self.cluster_points = []
2
def getSquareDistance(self, A, B):
return (A[0]-B[0]) * (A[0]-B[0]) + (A[1]-B[1])*(A[1]-B[1])
def performAlgo(self):
data = self.data
k = self.k
n = len(data)
c = [0] * n
old_c = [1] * n
centroids = []
for i in range (k):
centroids.append(data[i])
self.cluster_points.append([])
changed = True
while changed == True:
flag = False
cluster_size = [0]*k
cluster_points_sum = []
for j in range(k):
cluster_points_sum.append([0, 0])
for i in range(n):
old_c[i] = c[i]
c[i] = 0
distance = self.getSquareDistance(data[i], centroids[0])
for j in range(1, k):
new_distance = self.getSquareDistance(data[i], centroids[j])
if new_distance < distance:
distance = new_distance
c[i] = j
cluster_points_sum[ c[i] ][0] += data[i][0]
cluster_points_sum[ c[i] ][1] += data[i][1]
# increment cluster size
cluster_size[ c[i] ] += 1
if c[i] != old_c[i] and flag == False:
flag = True
if flag == False:
changed = False
3
# Now Calculate Average
for j in range (k):
centroids[j][0] = cluster_points_sum[j][0] / cluster_size[j]
centroids[j][1] = cluster_points_sum[j][1] / cluster_size[j]
for i in range(n):
self.cluster_points[ c[i] ].append( data[i] )
self.centroids = centroids
return centroids
In [15]: file_name = "data.txt"
max_points = 2000
max_value = 100
# Uncomment the following method
# if you are running for the first time
#GenerateData(file_name, max_points, max_value)
data = ReadData(file_name)
# change the value of k
k = 20
k_means = KMeans(data, k)
centroids = k_means.performAlgo()
cluster_points = k_means.cluster_points
# plot data
for j in range(k):
print("Cluster Number %d: Center: (%f, %f)\n" % ( (j+1), centroids[j][0], centroid
x = np.array(cluster_points[j])[:, 0]
y = np.array(cluster_points[j])[:, 1]
plt.plot(x, y, 'ro', c=np.random.rand(3,1))
plt.show()
Cluster Number 1: Center: (13.656863, 21.058824)
Cluster Number 2: Center: (79.333333, -5.322917)
Cluster Number 3: Center: (-35.218487, 20.680672)
4
Cluster Number 4: Center: (-80.370000, 30.700000)
Cluster Number 5: Center: (75.141176, 36.470588)
Cluster Number 6: Center: (73.323529, -43.127451)
Cluster Number 7: Center: (-1.140351, -75.833333)
Cluster Number 8: Center: (-41.414894, 73.606383)
Cluster Number 9: Center: (81.694444, 79.833333)
Cluster Number 10: Center: (-77.765766, 76.981982)
Cluster Number 11: Center: (-78.699029, -20.563107)
Cluster Number 12: Center: (80.816901, -84.211268)
Cluster Number 13: Center: (27.241379, 88.931034)
Cluster Number 14: Center: (40.140494, -75.999984)
Cluster Number 15: Center: (45.873786, 59.553398)
Cluster Number 16: Center: (-5.802326, 69.127907)
Cluster Number 17: Center: (27.471069, -20.239654)
Cluster Number 18: Center: (-43.417391, -74.113043)
Cluster Number 19: Center: (-80.500000, -73.977778)
Cluster Number 20: Center: (-23.956897, -27.801724)
5
6