K-Means Clustering from Scratch
Algorithm Explanation
1. Initialization: Choose k random centroids.
2. Assignment: Assign each data point to the nearest centroid based on Euclidean distance.
3. Update: Recompute the centroids as the mean of the points assigned to each cluster.
4. Repeat: Continue the assign-update steps until convergence, i.e., the centroids stop changing (a minimal one-iteration sketch follows this list).
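To make steps 2 and 3 concrete before the full implementation, here is a minimal one-iteration sketch on a tiny hand-made 2-D dataset (the points and starting centroids below are made up purely for illustration):

```python
import numpy as np

# Hypothetical toy data: four 2-D points and k = 2 starting centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.0], [9.0, 9.5]])  # step 1: initialization

# Step 2: assignment -- Euclidean distance from every point to every centroid
distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
labels = np.argmin(distances, axis=1)           # -> [0, 0, 1, 1]

# Step 3: update -- each centroid becomes the mean of its assigned points
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(centroids)  # [[1.25 1.5 ] [8.5  8.75]]

# Step 4: repeat steps 2-3 until the centroids stop changing
```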
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def initialize_centroids(X, k):
    """Randomly pick k distinct data points as the initial centroids."""
    indices = np.random.choice(X.shape[0], k, replace=False)
    return X[indices]

def assign_clusters(X, centroids):
    """Assign each data point to the closest centroid."""
    # Pairwise Euclidean distances, shape (n_samples, k)
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    return np.argmin(distances, axis=1)

def update_centroids(X, labels, k):
    """Recalculate each centroid as the mean of its assigned points."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

def kmeans(X, k, max_iters=100):
    """K-Means clustering."""
    centroids = initialize_centroids(X, k)
    for _ in range(max_iters):
        labels = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, labels, k)
        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Load CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values

# Apply K-Means from scratch
k = 3  # Number of clusters
labels, centroids = kmeans(X, k)

# Plot the clustered points and the final centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X')
plt.show()
```
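If no data.csv is at hand, the implementation above can be smoke-tested on synthetic data. A minimal sketch, assuming the kmeans function and the imports from the block above are in scope (make_blobs is used here only to generate toy points):

```python
from sklearn.datasets import make_blobs

# Synthetic data: 300 points scattered around 3 centers (illustrative only)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
                       random_state=42)

labels_demo, centroids_demo = kmeans(X_demo, k=3)
plt.scatter(X_demo[:, 0], X_demo[:, 1], c=labels_demo, cmap='viridis')
plt.scatter(centroids_demo[:, 0], centroids_demo[:, 1], s=300, c='red',
            marker='X')
plt.show()
```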
K-Means Using sklearn
```python
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

# Load CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values

# Apply K-Means using sklearn
# (explicit n_init and a fixed random_state keep the result reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Plot the clusters and the fitted centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', marker='X')
plt.show()
```
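The choice n_clusters=3 above is arbitrary. One common heuristic for picking it is the elbow plot of the model's inertia_ (the within-cluster sum of squared distances). A minimal sketch, assuming X is the feature matrix loaded above:

```python
# Elbow heuristic: plot inertia for several candidate cluster counts
inertias = []
for k_try in range(1, 10):
    model = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()
```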
K-Nearest Neighbors (KNN) from Scratch
Algorithm Explanation
1. Distance Calculation: For a given test point, calculate the Euclidean distance to all training points.
2. Neighbors Selection: Select the k nearest neighbors.
3. Majority Voting: Assign the class held by the majority of these neighbors to the test point (see the worked sketch after this list).
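Here is the worked sketch referred to above: a tiny hand-made example with five training points and one test point (all numbers are made up for illustration):

```python
import numpy as np
from collections import Counter

# Hypothetical training data: five 2-D points with two class labels
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array(['A', 'A', 'B', 'B', 'B'])
test_point = np.array([1.5, 1.5])

# Step 1: distances from the test point to every training point
distances = np.linalg.norm(X_train - test_point, axis=1)

# Step 2: indices of the k = 3 nearest neighbors
k_indices = np.argsort(distances)[:3]   # -> points [1,1], [2,1], [1,2]

# Step 3: majority vote among their labels ('A', 'A', 'B' -> 'A')
prediction = Counter(y_train[k_indices]).most_common(1)[0][0]
print(prediction)  # 'A'
```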
```python
import numpy as np
import pandas as pd
from collections import Counter

def euclidean_distance(x1, x2):
    """Calculate the Euclidean distance between two points."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn_predict(X_train, y_train, X_test, k=3):
    """KNN classification: majority vote among the k nearest neighbors."""
    predictions = []
    for test_point in X_test:
        distances = [euclidean_distance(test_point, x) for x in X_train]
        k_indices = np.argsort(distances)[:k]
        k_nearest_labels = [y_train[i] for i in k_indices]
        most_common = Counter(k_nearest_labels).most_common(1)[0][0]
        predictions.append(most_common)
    return np.array(predictions)

# Load CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values
y = data['Label'].values

# Simple split: first 80 rows for training, the rest for testing
# (this assumes the rows are already shuffled)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Apply KNN from scratch
k = 3
predictions = knn_predict(X_train, y_train, X_test, k)

# Compare predictions with the actual test labels
print('Predicted labels:', predictions)
print('Actual labels:', y_test)
```
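The from-scratch version above only prints the two label arrays. To get an accuracy figure comparable to the sklearn run below, the arrays can be compared directly; a one-line addition, assuming predictions and y_test from the block above:

```python
# Fraction of test points whose predicted label matches the true label
accuracy = np.mean(predictions == y_test)
print('Accuracy:', accuracy)
```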
KNN Using sklearn
```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load CSV file
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2']].values
y = data['Label'].values

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Apply KNN using sklearn
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, predictions)
print('Accuracy:', accuracy)
```
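The value n_neighbors=3 is again an arbitrary choice. One standard way to pick it is cross-validation on the training set; a minimal sketch using sklearn's cross_val_score, assuming X_train and y_train from the block above:

```python
from sklearn.model_selection import cross_val_score

# Mean 5-fold cross-validated accuracy for a few candidate k values
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)
    print(k, scores.mean())
```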
Explanation of Code:
1. K-Means from Scratch:
   - Centroids are initialized randomly, then clusters are assigned and centroids updated iteratively until convergence.
   - The clusters are visualized by plotting the points and the final centroids.
2. KNN from Scratch:
   - The Euclidean distance is calculated for each test point against all training points.
   - The k nearest points are selected, and the majority class label among them is predicted.
3. Using sklearn:
   - Both the K-Means and KNN implementations are simplified by using the sklearn library, allowing for faster execution and easier integration with different datasets.
Make sure to adjust the CSV file path and the feature and label column names (Feature1, Feature2, Label) to match your dataset.