Experiment 11: K-Nearest Neighbor Algorithm
Aim
To implement the K-Nearest Neighbor (KNN) algorithm to classify a dataset and evaluate the classifier's
accuracy, precision, and recall.
Algorithm
1. Read the dataset from a CSV file and load it into a pandas DataFrame.
2. Preprocess the data by extracting features and target labels.
3. Split the dataset into training and testing sets.
4. Normalize the features to a common scale, fitting the scaler on the training set only.
5. Import the K-Nearest Neighbor classifier from the sklearn library.
6. Instantiate the KNN classifier with a chosen number of neighbors (k).
7. Fit the classifier to the training data.
8. Predict the class labels for the test data.
9. Calculate the accuracy, precision, and recall of the classifier.
10. Plot the confusion matrix as a heatmap and save it as an image.
Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset and separate the features from the target labels.
data = pd.read_csv('student_performance_knn.csv')
X = data[['StudyHours', 'Attendance']].values
y = data['Performance'].map({'Low': 0, 'Medium': 1, 'High': 2}).values
# Hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# AML311 - PML Lab
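# Fit the scaler on the training set only, then apply the same transform to the test set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train a KNN classifier with k = 5 and predict the test labels.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)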
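# Evaluate the predictions; 'weighted' averages per-class scores by class frequency.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
# Fix the label order so the matrix is always 3x3 and matches the plot's tick labels.
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])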
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
# Plot the confusion matrix as a heatmap and save it to a file.
plt.figure(figsize=(6, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Low', 'Medium', 'High'], yticklabels=['Low', 'Medium', 'High'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('confusion_matrix_knn.png')
plt.close()
Result
The K-Nearest Neighbor (KNN) classifier was successfully implemented to classify the given dataset. The
accuracy, precision, and recall values were calculated.
Viva Questions and Answers
1. What is the K-Nearest Neighbor (KNN) algorithm?
KNN is a simple, non-parametric, instance-based learning algorithm used for classification and
regression. It classifies a new sample by finding the 'K' nearest data points in the training data and
assigning the majority class among those neighbors.
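The majority-vote idea described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the program: the function name knn_predict and the toy data (study hours, attendance percentage, and their labels) are made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (study hours, attendance %) with made-up labels.
train_points = [(1, 50), (2, 55), (3, 60), (7, 85), (8, 90), (9, 95)]
train_labels = ["Low", "Low", "Low", "High", "High", "High"]

prediction = knn_predict(train_points, train_labels, query=(8, 88), k=3)
print(prediction)
```

Note that there is no training step: the "model" is simply the stored data, which is why KNN is called instance-based.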
2. What is the role of 'K' in KNN?
The value of 'K' in KNN represents the number of nearest neighbors considered to classify a new
point. A larger value of 'K' reduces the effect of noise but may also smooth out the boundaries
between classes.
3. How is the distance between neighbors calculated in KNN?
Common distance metrics include Euclidean, Manhattan, and Minkowski distance. Euclidean distance,
the straight-line distance between two points in a multi-dimensional space, is the most commonly
used.
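The three metrics are closely related: Minkowski distance of order p reduces to Manhattan distance for p = 1 and to Euclidean distance for p = 2. The sketch below shows this with a hypothetical helper, minkowski_distance, written for illustration only.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (3, 4)
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

In scikit-learn, KNeighborsClassifier uses the Minkowski metric with p = 2 by default, so the program above measures Euclidean distance unless the metric or p argument is changed.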
4. What happens if K is too small or too large?
If K is too small (e.g., K=1), the model becomes sensitive to noise and may overfit the training
data. If K is too large, the model may oversimplify the decision boundary, leading to underfitting
and reduced accuracy.
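Both failure modes can be seen on a tiny example. The sketch below uses a made-up one-dimensional training set containing one deliberately mislabelled point; the data and the predict helper are assumptions for illustration and are not part of the experiment's dataset.

```python
from collections import Counter

# Hypothetical 1-D training set: class "A" clusters near 1-4, class "B" near 6-11.
# The point at 2.5 is a deliberately mislabelled (noisy) sample.
train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (4.0, "A"),
         (2.5, "B"),
         (6.0, "B"), (7.0, "B"), (8.0, "B"), (9.0, "B"), (10.0, "B"), (11.0, "B")]


def predict(query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


query = 2.6  # lies inside the "A" cluster, right next to the noisy point
for k in (1, 3, len(train)):
    print(f"k={k}: {predict(query, k)}")
```

With k = 1 the prediction copies the noisy neighbor, with k = 3 the two genuine "A" neighbors outvote it, and with k equal to the training set size every query receives the overall majority class regardless of where it lies.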
5. Is KNN a parametric or non-parametric algorithm?
KNN is a non-parametric algorithm because it does not assume any underlying probability distribution
for the data and does not learn a fixed set of parameters. Instead, it stores the training data and
makes decisions based on the proximity of neighboring data points.
6. What is meant by feature scaling in KNN? Why is it important?
Feature scaling transforms the features to a comparable range (for example, zero mean and unit
variance) so that all features contribute equally to the distance calculation. Since KNN relies on
distance measurements, features with larger scales may dominate the results if scaling is not applied.
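A small numeric sketch makes the effect concrete. The samples below are hypothetical (study hours per day, attendance percentage), and the standardize helper is written for illustration; it uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import math
import statistics

# Hypothetical samples: (study hours per day, attendance %).
data = [(2, 80), (9, 81), (2, 70), (5, 60), (8, 95), (1, 50)]
query, far_in_hours, far_in_attendance = data[0], data[1], data[2]


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


scaled = standardize(data)

# Distances from the query before scaling.
raw_d_hours = math.dist(query, far_in_hours)
raw_d_att = math.dist(query, far_in_attendance)
# Distances between the same points after scaling.
scaled_d_hours = math.dist(scaled[0], scaled[1])
scaled_d_att = math.dist(scaled[0], scaled[2])

print(f"raw:    {raw_d_hours:.2f} vs {raw_d_att:.2f}")
print(f"scaled: {scaled_d_hours:.2f} vs {scaled_d_att:.2f}")
```

Before scaling, the student who studies 7 hours more per day looks closer to the query than the student with identical study hours and 10 points lower attendance, simply because attendance is measured on a larger scale. After standardization the ordering reverses.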
7. What are the advantages of the KNN algorithm?
Advantages of KNN include simplicity, easy implementation, and effectiveness in low-dimensional
spaces. It also makes no assumptions about the underlying data distribution.
8. What are the disadvantages of the KNN algorithm?
Disadvantages include high computation time during classification, sensitivity to the choice of 'K',
and performance degradation in high-dimensional data due to the curse of dimensionality.
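One way to see the curse of dimensionality is that, as the number of dimensions grows, the nearest and farthest points end up at almost the same distance from a query, so "nearest" carries less information. The sketch below illustrates this with uniformly random points; the helper name, point count, dimensions, and seed are arbitrary choices for the example.

```python
import math
import random


def nearest_to_farthest_ratio(n_dims, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


low = nearest_to_farthest_ratio(2)
high = nearest_to_farthest_ratio(500)
print(f"2 dimensions:   {low:.2f}")
print(f"500 dimensions: {high:.2f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, which is the situation in which KNN's distance-based votes become unreliable.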
CEK