
K-Means Clustering from Scratch

Algorithm Explanation

1. Initialization: Choose k random data points as the initial centroids.
2. Assignment: Assign each data point to the nearest centroid based on Euclidean distance.
3. Update: Recompute each centroid as the mean of the points assigned to its cluster.
4. Repeat: Continue the assign-update steps until convergence (the centroids no longer change).

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def initialize_centroids(X, k):
    """Randomly pick k data points as the initial centroids."""
    indices = np.random.choice(X.shape[0], k, replace=False)
    return X[indices]

def assign_clusters(X, centroids):
    """Assign each data point the index of its closest centroid."""
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    return np.argmin(distances, axis=1)

def update_centroids(X, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

def kmeans(X, k, max_iters=100):
    """K-Means clustering: alternate assignment and update until convergence."""
    centroids = initialize_centroids(X, k)
    for _ in range(max_iters):
        labels = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, labels, k)
        # Stop when the centroids no longer move (allclose is safer than == for floats)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Load the dataset from a CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values

# Apply K-Means from scratch
k = 3  # Number of clusters
labels, centroids = kmeans(X, k)

# Plot the clustered points and mark the centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X')
plt.show()
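
To quantify how tight the resulting clusters are, you can compute the inertia (the sum of squared distances from each point to its assigned centroid); this is the same quantity sklearn's KMeans exposes as inertia_. A minimal sketch, reusing the labels and centroids returned above:

python
# Inertia: sum of squared distances from each point to its assigned centroid
inertia = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(k))
print('Inertia:', inertia)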

K-Means Using sklearn

python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the dataset from a CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values

# Apply K-Means using sklearn
kmeans = KMeans(n_clusters=3, n_init=10)  # n_init set explicitly to avoid version-dependent warnings
labels = kmeans.fit_predict(X)

# Plot the clusters and mark the fitted centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', marker='X')
plt.show()
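
The number of clusters is fixed at 3 here. If you are unsure what k suits your data, a common heuristic is the elbow method: fit KMeans over a range of k values and look for the point where inertia stops dropping sharply. A minimal sketch, assuming X, KMeans, and plt from the block above:

python
# Elbow method: plot inertia for a range of cluster counts
inertias = []
for n in range(1, 10):
    km = KMeans(n_clusters=n, n_init=10).fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()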

K-Nearest Neighbors (KNN) from Scratch

Algorithm Explanation

1. Distance Calculation: For a given test point, calculate the Euclidean distance to all training points.
2. Neighbors Selection: Select the k nearest neighbors.
3. Majority Voting: Assign the test point the class held by the majority of these neighbors.

python
import numpy as np
import pandas as pd
from collections import Counter

def euclidean_distance(x1, x2):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test point's label by majority vote among its k nearest neighbors."""
    predictions = []
    for test_point in X_test:
        # Distances from the test point to every training point
        distances = [euclidean_distance(test_point, x) for x in X_train]
        # Indices of the k closest training points
        k_indices = np.argsort(distances)[:k]
        k_nearest_labels = [y_train[i] for i in k_indices]
        # Majority vote among the k neighbors
        most_common = Counter(k_nearest_labels).most_common(1)[0][0]
        predictions.append(most_common)
    return np.array(predictions)

# Load the dataset from a CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values
y = data['Label'].values

# Split the data (for simplicity, take the first 80 rows for training)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Apply KNN from scratch
k = 3
predictions = knn_predict(X_train, y_train, X_test, k)

# Compare predictions with the actual test labels
print('Predicted labels:', predictions)
print('Actual labels:', y_test)
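
Rather than eyeballing the two printed arrays, you can score the from-scratch classifier the same way the sklearn version below does, with a one-line accuracy check:

python
# Fraction of test points whose predicted label matches the true label
accuracy = np.mean(predictions == y_test)
print('Accuracy:', accuracy)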

KNN Using sklearn

python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset from a CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values
y = data['Label'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Apply KNN using sklearn
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, predictions)
print('Accuracy:', accuracy)
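
The choice of n_neighbors matters: a small k is sensitive to noise, while a large k over-smooths the decision boundary. One way to pick it is cross-validation; a minimal sketch using sklearn's cross_val_score, assuming X, y, and KNeighborsClassifier from the block above:

python
from sklearn.model_selection import cross_val_score

# Try several neighbor counts and report the mean 5-fold CV accuracy of each
for n in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=n), X, y, cv=5)
    print(n, scores.mean())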

Explanation of Code

1. K-Means from Scratch:
   o We initialize centroids randomly, then iteratively assign clusters and update centroids until convergence.
   o We visualize the clusters by plotting the points and centroids.
2. KNN from Scratch:
   o The Euclidean distance is calculated from each test point to all training points.
   o The k nearest points are selected, and the majority class label is predicted.
3. Using sklearn:
   o Both the K-Means and KNN implementations are simplified by using the sklearn library, allowing for faster execution and easier integration with different datasets.

Make sure to adjust the CSV file path and the feature columns based on your dataset.
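
If you don't have a data.csv handy, you can generate a toy dataset to exercise both implementations. A minimal sketch using sklearn's make_blobs, reusing pd from above; the column names Feature1, Feature2, and Label match those assumed throughout:

python
from sklearn.datasets import make_blobs

# Generate 100 labeled 2-D points in 3 blobs and save them as data.csv
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
pd.DataFrame({'Feature1': X[:, 0], 'Feature2': X[:, 1], 'Label': y}).to_csv('data.csv', index=False)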
