Machine Learning Laboratory BCSL606
Program 1: Develop a program to create histograms for all numerical
features and analyze the distribution of each feature.
Generate box plots for all numerical features and identify any outliers.
Use the California Housing dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
housing_data = fetch_california_housing(as_frame=True)
data = housing_data['data']
print(data)
data['MedHouseVal'] = housing_data['target']  # Adding target variable for completeness

# Histograms for all numerical features
print("Creating histograms for all numerical features...")
for column in data.columns:
    plt.figure(figsize=(8, 5))
    plt.hist(data[column], bins=50, edgecolor='k', alpha=0.7)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

# Box plots for all numerical features
print("Creating box plots for all numerical features to identify outliers...")
for column in data.columns:
    plt.figure(figsize=(8, 5))
    plt.boxplot(data[column], vert=False, patch_artist=True,
                boxprops=dict(facecolor='skyblue', color='blue'))
    plt.title(f'Box Plot of {column}')
    plt.xlabel(column)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.show()

# Identify outliers using IQR
print("Identifying potential outliers using the IQR method...")
outliers = {}
for column in data.columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers[column] = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"{column}:")
    print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
    print(f"Number of outliers: {len(outliers[column])}")
    print("---")
Output:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37
Longitude
0 -122.23
1 -122.22
2 -122.24
3 -122.25
4 -122.25
... ...
20635 -121.09
20636 -121.21
20637 -121.22
20638 -121.32
20639 -121.24
[20640 rows x 8 columns]
Creating histograms for all numerical features...
Identifying potential outliers using the IQR method...
MedInc:
Lower Bound: -0.7063750000000004, Upper Bound: 8.013024999999999
Number of outliers: 681
---
HouseAge:
Lower Bound: -10.5, Upper Bound: 65.5
Number of outliers: 0
---
AveRooms:
Lower Bound: 2.023219161170969, Upper Bound: 8.469878027106942
Number of outliers: 511
---
AveBedrms:
Lower Bound: 0.8659085155701288, Upper Bound: 1.2396965968190603
Number of outliers: 1424
---
Population:
Lower Bound: -620.0, Upper Bound: 3132.0
Number of outliers: 1196
---
AveOccup:
Lower Bound: 1.1509614824735064, Upper Bound: 4.5610405893536905
Number of outliers: 711
---
Latitude:
Lower Bound: 28.259999999999998, Upper Bound: 43.38
Number of outliers: 0
---
Longitude:
Lower Bound: -127.48499999999999, Upper Bound: -112.32500000000002
Number of outliers: 0
---
MedHouseVal:
Lower Bound: -0.9808749999999995, Upper Bound: 4.824124999999999
Number of outliers: 1071
---
Program 2: Develop a program to Compute the correlation matrix to
understand the relationships between pairs of features. Visualize the
correlation matrix using a heatmap to know which variables have strong
positive/negative correlations. Create a pair plot to visualize pairwise
relationships between features. Use the California Housing dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
housing_data = fetch_california_housing(as_frame=True)
data = housing_data['data']
data['MedHouseVal'] = housing_data['target']  # Adding target variable for completeness

# Compute the correlation matrix
print("Computing the correlation matrix...")
correlation_matrix = data.corr()
print(correlation_matrix)

# Visualize the correlation matrix using a heatmap
print("Visualizing the correlation matrix using a heatmap...")
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            cbar=True, square=True)
plt.title("Correlation Matrix Heatmap")
plt.show()

# Create a pair plot to visualize pairwise relationships between features
print("Creating a pair plot to visualize pairwise relationships between features...")
sns.pairplot(data, diag_kind='kde', corner=True)
plt.show()
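To read the strongest relationships off the heatmap programmatically, the correlations of every feature with the target can be sorted. This optional sketch reuses the correlation_matrix computed above and is not part of the required program:

target_corr = correlation_matrix['MedHouseVal'].drop('MedHouseVal')
print(target_corr.sort_values(ascending=False))  # strongest positive correlations first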
Output:
Computing the correlation matrix...
MedInc HouseAge AveRooms AveBedrms Population AveOccup \
MedInc 1.000000 -0.119034 0.326895 -0.062040 0.004834 0.018766
HouseAge -0.119034 1.000000 -0.153277 -0.077747 -0.296244 0.013191
AveRooms 0.326895 -0.153277 1.000000 0.847621 -0.072213 -0.004852
AveBedrms -0.062040 -0.077747 0.847621 1.000000 -0.066197 -0.006181
Population 0.004834 -0.296244 -0.072213 -0.066197 1.000000 0.069863
AveOccup 0.018766 0.013191 -0.004852 -0.006181 0.069863 1.000000
Latitude -0.079809 0.011173 0.106389 0.069721 -0.108785 0.002366
Longitude -0.015176 -0.108197 -0.027540 0.013344 0.099773 0.002476
MedHouseVal 0.688075 0.105623 0.151948 -0.046701 -0.024650 -0.023737
Latitude Longitude MedHouseVal
MedInc -0.079809 -0.015176 0.688075
HouseAge 0.011173 -0.108197 0.105623
AveRooms 0.106389 -0.027540 0.151948
AveBedrms 0.069721 0.013344 -0.046701
Population -0.108785 0.099773 -0.024650
AveOccup 0.002366 0.002476 -0.023737
Latitude 1.000000 -0.924664 -0.144160
Longitude -0.924664 1.000000 -0.045967
MedHouseVal -0.144160 -0.045967 1.000000
Visualizing the correlation matrix using a heatmap...
Program 3: Develop a program to implement Principal Component
Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4
features to 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from numpy.linalg import eig
# Load the Iris dataset
iris = load_iris()
iris_data = iris.data
iris_target = iris.target
iris_feature_names = iris.feature_names
# Convert to DataFrame
df = pd.DataFrame(iris_data, columns=iris_feature_names)
df['Target'] = iris_target
# Example Data (First 5 Samples for Explanation)
example_data = iris_data[:5]
print("Example Data (First 5 Samples):")
print(example_data)
# Step 1: Standardize the Data
scaler = StandardScaler()
iris_data_scaled = scaler.fit_transform(iris_data)
example_data_scaled = scaler.transform(example_data)
print("\nStandardized Example Data:")
print(example_data_scaled)
# Step 2: Compute Covariance Matrix Manually
n_samples = iris_data_scaled.shape[0]
mean_vector = np.mean(iris_data_scaled, axis=0)
X_centered = iris_data_scaled - mean_vector
cov_matrix_manual = (1 / (n_samples - 1)) * np.dot(X_centered.T, X_centered)
print("\nManually Computed Covariance Matrix:")
print(cov_matrix_manual)
# Step 3: Compute Eigenvalues and Eigenvectors Manually
eigenvalues_manual, eigenvectors_manual = eig(cov_matrix_manual)
print("\nManually Computed Eigenvalues:")
print(eigenvalues_manual)
print("\nManually Computed Eigenvectors:")
print(eigenvectors_manual)
# Step 4: Select Top 2 Principal Components
sorted_indices = np.argsort(eigenvalues_manual)[::-1]
top_2_indices = sorted_indices[:2]
top_2_eigenvectors = eigenvectors_manual[:, top_2_indices]
print("\nTop 2 Eigenvectors:")
print(top_2_eigenvectors)
# Step 5: Transform Data to 2D
iris_pca = np.dot(iris_data_scaled, top_2_eigenvectors)
example_pca = np.dot(example_data_scaled, top_2_eigenvectors)
print("\nReduced 2D Example Data:")
print(example_pca)
# Step 6: Visualize PCA Results
iris_pca_df = pd.DataFrame(data=iris_pca, columns=["Principal Component 1", "Principal Component 2"])
iris_pca_df['Target'] = iris_target
plt.figure(figsize=(8, 6))
sns.scatterplot(
x="Principal Component 1", y="Principal Component 2", hue="Target",
data=iris_pca_df,
palette="viridis", s=100, alpha=0.8
)
plt.title("PCA of Iris Dataset")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Target", labels=iris.target_names)
plt.grid(alpha=0.5)
plt.show()
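As an optional sanity check (not part of the required program), scikit-learn's built-in PCA should produce the same two-dimensional projection, up to the sign of each component, since eigenvector directions are arbitrary:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
iris_pca_sklearn = pca.fit_transform(iris_data_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
# iris_pca_sklearn should match iris_pca column-wise, possibly with flipped signs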
Output:
Program 4: For a given set of training data examples stored in a .CSV
file, implement and demonstrate the Find-S algorithm to output a
description of the set of all hypotheses consistent with the training
examples.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Implement Find-S algorithm
print("Implementing Find-S algorithm...")

def find_s_algorithm(csv_file):
    # Load the dataset
    dataset = pd.read_csv(csv_file)
    attributes = dataset.iloc[:, :-1].values
    labels = dataset.iloc[:, -1].values
    for i, label in enumerate(labels):
        if label == 'Yes':  # First positive example found
            hypothesis = list(attributes[i])
            break  # Stop after finding the first "Yes"
    for i in range(len(labels)):
        if labels[i] == 'Yes':  # Only process positive examples
            for j in range(len(hypothesis)):
                if hypothesis[j] != attributes[i][j]:
                    hypothesis[j] = '?'  # Generalize
    return hypothesis

csv_file = "/content/find_s_example.csv"  # Provide the path to your CSV file
final_hypothesis = find_s_algorithm(csv_file)
print("Final Hypothesis:", final_hypothesis)
Output:
Implementing Find-S algorithm...
Final Hypothesis: ['Sunny', 'Warm', '?', '?', '?', '?']
Program 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the dataset generated.
1. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2.
2. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Step 1: Generate 100 random values in the range [0,1]
np.random.seed(42) # For reproducibility
x = np.random.rand(100).reshape(-1, 1) # Reshape for sklearn compatibility
print(x[:5])
# Step 2: Label the first 50 points
labels = np.array([1 if xi <= 0.5 else 2 for xi in x[:50]])  # Class 1 if xi <= 0.5, else Class 2

# Step 3: Train KNN classifier
k_values = [1, 2, 3, 4, 5, 20, 30]
classified_labels = {}
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x[:50], labels)  # Train using first 50 points
    classified_labels[k] = knn.predict(x[50:])  # Classify remaining 50 points

# Step 4: Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(x[:50], labels, color='blue', label='Training Data')
plt.scatter(x[50:], classified_labels[1], color='red', marker='x', label='Classified Data (k=1)')
plt.xlabel('X values')
plt.ylabel('Class')
plt.title('KNN Classification of Random Values')
plt.legend()
plt.show()

# Print classification results for different k values
for k in k_values:
    print(f"Classification results for k={k}: {classified_labels[k]}")
Output:
Classification results for k=1: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 1 2 1 1 1]
Classification results for k=2: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 1 2 1 1 1]
Classification results for k=3: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=4: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=5: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=20: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=30: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 1 2 1 1 1]
Program 6: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
import numpy as np
import matplotlib.pyplot as plt
def gaussian_kernel(x, x_query, tau):
    """Compute the Gaussian weight for each training sample."""
    return np.exp(-np.square(x - x_query) / (2 * tau**2))

def locally_weighted_regression(x_train, y_train, x_query, tau):
    """Perform Locally Weighted Regression (LWR) for a given query point."""
    m = len(x_train)
    W = np.diag(gaussian_kernel(x_train, x_query, tau))  # Compute weights
    X_bias = np.c_[np.ones(m), x_train]  # Add bias term
    theta = np.linalg.pinv(X_bias.T @ W @ X_bias) @ (X_bias.T @ W @ y_train)
    return np.array([1, x_query]) @ theta  # Predict output for x_query

# Generate synthetic dataset
np.random.seed(42)
x_train = np.linspace(0, 10, 100)
y_train = np.sin(x_train) + np.random.normal(0, 0.2, 100)  # Sinusoidal data with noise

# Define tau (bandwidth parameter)
tau_values = [0.1, 0.5, 1, 5]
x_test = np.linspace(0, 10, 100)  # Test data

plt.figure(figsize=(12, 8))
for tau in tau_values:
    y_pred = np.array([locally_weighted_regression(x_train, y_train, xq, tau) for xq in x_test])
    plt.plot(x_test, y_pred, label=f'tau={tau}')
# Plot training data
plt.scatter(x_train, y_train, color='black', label='Training Data', alpha=0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Locally Weighted Regression (LWR) with Different Tau Values')
plt.legend()
plt.show()
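For each query point x_q, the function above solves the weighted least-squares normal equation theta = (X^T W X)^(-1) X^T W y, where W is the diagonal matrix of Gaussian weights w_i = exp(-(x_i - x_q)^2 / (2*tau^2)), and the prediction is the dot product of [1, x_q] with theta. A single prediction can also be obtained directly; the query point and tau below are illustrative values only:

y_hat = locally_weighted_regression(x_train, y_train, 2.5, 0.5)  # prediction near x = 2.5
print(y_hat)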
Output: (plot of the training data and the LWR fits for tau = 0.1, 0.5, 1 and 5)
Program 7: Develop a program to demonstrate the working of Linear
Regression and Polynomial Regression. Use Boston Housing Dataset for
Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency
prediction) for Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
# Load Boston Housing Dataset from CSV
boston_df = pd.read_csv('/content/Boston.csv')
print("Boston CSV Columns:", boston_df.columns)
X_boston = boston_df[['rm']].values
y_boston = boston_df['medv'].values
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, test_size=0.2, random_state=42)
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)
# Plot results
plt.figure(figsize=(10, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Housing Price')
plt.title('Linear Regression on Boston Housing Dataset')
plt.legend()
plt.show()
print(f"Mean Squared Error (Linear Regression): {mean_squared_error(y_test,
y_pred)}")
# Polynomial Regression on Auto MPG Dataset
auto_mpg_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model_year', 'origin']
auto_df = pd.read_csv(auto_mpg_url, delim_whitespace=True, names=column_names,
                      na_values='?', comment='\t')  # comment='\t' skips the trailing car-name field
auto_df = auto_df.dropna()  # Remove rows with missing values
X_auto = auto_df[['horsepower']].astype(float).values  # Using 'horsepower' as feature
y_auto = auto_df['mpg'].values
X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

# Polynomial Regression (degree=3)
poly_model = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(), LinearRegression())
poly_model.fit(X_train, y_train)
y_poly_pred = poly_model.predict(X_test)

# Plot results (sort the test points so the curve is drawn left to right)
X_test_sorted, y_poly_pred_sorted = zip(*sorted(zip(X_test.flatten(), y_poly_pred)))
plt.figure(figsize=(10, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test_sorted, y_poly_pred_sorted, color='red', linewidth=2, label='Predicted')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression on Auto MPG Dataset')
plt.legend()
plt.show()
print(f"Mean Squared Error (Polynomial Regression):
{mean_squared_error(y_test, y_poly_pred)}")
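Besides MSE, the polynomial fit can also be summarised with R² (note that y_test at this point refers to the Auto MPG test split, since the variable was reused). An optional sketch using sklearn.metrics.r2_score, not part of the required program:

from sklearn.metrics import r2_score
print(f"R^2 (Polynomial Regression): {r2_score(y_test, y_poly_pred)}")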
Output:
Mean Squared Error (Linear Regression): 46.144775347317264
Program 8: Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
from collections import Counter
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
def calculate_entropy(labels):
    total = len(labels)
    counts = Counter(labels)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * np.log2(p)
    return entropy

entropy_dataset = calculate_entropy(y)
print(f"\nOverall Entropy of Target (Malignant vs Benign): {entropy_dataset:.4f}")

print("\nInformation Gain for Each Feature (using median split):")
for i, feature in enumerate(feature_names):
    feature_values = X[:, i]
    median_value = np.median(feature_values)
    # Split dataset
    left_mask = feature_values <= median_value
    right_mask = feature_values > median_value
    y_left = y[left_mask]
    y_right = y[right_mask]
    entropy_left = calculate_entropy(y_left)
    entropy_right = calculate_entropy(y_right)
    weighted_entropy = (len(y_left) / len(y)) * entropy_left + (len(y_right) / len(y)) * entropy_right
    info_gain = entropy_dataset - weighted_entropy
    print(f"{feature}: IG = {info_gain:.4f}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=target_names, filled=True, rounded=True)
plt.title("Decision Tree Visualization for Breast Cancer Dataset")
plt.show()
new_sample = np.array([[17.99, 10.38, 122.8, 1001.0, 0.1184,
0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399,
0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622,
0.6656, 0.7119, 0.2654, 0.4601, 0.1189]])
prediction = clf.predict(new_sample)
print("\nPrediction for new sample:")
print("Class:", target_names[prediction[0]])
Output:
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Target names: ['malignant' 'benign']
Overall Entropy of Target (Malignant vs Benign): 0.9526
Information Gain for Each Feature (using median split):
mean radius: IG = 0.3416
mean texture: IG = 0.1445
mean perimeter: IG = 0.3507
mean area: IG = 0.3416
mean smoothness: IG = 0.0660
mean compactness: IG = 0.2325
mean concavity: IG = 0.3695
mean concave points: IG = 0.3995
mean symmetry: IG = 0.0627
mean fractal dimension: IG = 0.0000
radius error: IG = 0.1824
texture error: IG = 0.0000
perimeter error: IG = 0.2192
area error: IG = 0.2910
smoothness error: IG = 0.0023
compactness error: IG = 0.0990
concavity error: IG = 0.1601
concave points error: IG = 0.1445
symmetry error: IG = 0.0037
fractal dimension error: IG = 0.0284
worst radius: IG = 0.4588
worst texture: IG = 0.1298
worst perimeter: IG = 0.4436
worst area: IG = 0.4556
worst smoothness: IG = 0.0990
worst compactness: IG = 0.1882
worst concavity: IG = 0.3792
worst concave points: IG = 0.4209
worst symmetry: IG = 0.0762
worst fractal dimension: IG = 0.0452
Classification Report:
precision recall f1-score support
0 0.97 0.91 0.94 43
1 0.95 0.99 0.97 71
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
Accuracy: 0.956140350877193
Prediction for new sample:
Class: malignant
Program 9: Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
faces = fetch_olivetti_faces()
X = faces.data # Flattened images: 400 x 4096
y = faces.target # Labels: 0 to 39 (40 classes)
images = faces.images # Original image shapes: 64 x 64
print(f"Total samples: {X.shape[0]}")
print(f"Image shape: {images[0].shape}")
print(f"Total classes: {len(np.unique(y))}")
X_train, X_test, y_train, y_test, img_train, img_test = train_test_split(
X, y, images, test_size=0.3, random_state=42, stratify=y)
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy)
def show_predictions(images, true_labels, predicted_labels, n=8):
    plt.figure(figsize=(15, 5))
    for i in range(n):
        plt.subplot(1, n, i + 1)
        plt.imshow(images[i], cmap='gray')
        plt.title(f"True: {true_labels[i]}\nPred: {predicted_labels[i]}")
        plt.axis('off')
    plt.tight_layout()
    plt.suptitle("Sample Test Predictions", fontsize=16)
    plt.subplots_adjust(top=0.75)
    plt.show()

show_predictions(img_test, y_test, y_pred, n=8)
Output:
Classification Report:
precision recall f1-score support
0 1.00 0.67 0.80 3
1 1.00 0.67 0.80 3
2 0.43 1.00 0.60 3
3 1.00 0.33 0.50 3
4 1.00 0.33 0.50 3
5 1.00 1.00 1.00 3
6 1.00 0.67 0.80 3
7 0.60 1.00 0.75 3
8 1.00 1.00 1.00 3
9 1.00 0.33 0.50 3
10 1.00 0.67 0.80 3
11 1.00 1.00 1.00 3
12 1.00 1.00 1.00 3
13 1.00 0.67 0.80 3
14 1.00 1.00 1.00 3
15 0.50 1.00 0.67 3
16 1.00 0.33 0.50 3
17 0.00 0.00 0.00 3
18 1.00 1.00 1.00 3
19 1.00 1.00 1.00 3
20 1.00 1.00 1.00 3
21 1.00 1.00 1.00 3
22 1.00 1.00 1.00 3
23 1.00 1.00 1.00 3
24 1.00 0.67 0.80 3
25 0.75 1.00 0.86 3
26 1.00 0.67 0.80 3
27 1.00 1.00 1.00 3
28 1.00 1.00 1.00 3
29 1.00 1.00 1.00 3
30 0.75 1.00 0.86 3
31 1.00 0.67 0.80 3
32 1.00 1.00 1.00 3
33 1.00 0.67 0.80 3
34 0.43 1.00 0.60 3
35 0.75 1.00 0.86 3
36 1.00 1.00 1.00 3
37 1.00 0.33 0.50 3
38 1.00 1.00 1.00 3
39 0.33 1.00 0.50 3
accuracy 0.82 120
macro avg 0.89 0.82 0.81 120
weighted avg 0.89 0.82 0.81 120
Accuracy: 0.8166666666666667
Program 10: Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names
print("Data Shape:", X.shape)
print("Classes:", target_names)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)
labels_mapped = np.where(clusters == 1, 0, 1)
print("\nConfusion Matrix:")
print(confusion_matrix(y, labels_mapped))
print("Accuracy:", accuracy_score(y, labels_mapped))
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.6)
centers_pca = pca.transform(kmeans.cluster_centers_)  # project the centroids into the same PCA plane
plt.scatter(centers_pca[:, 0], centers_pca[:, 1],
            s=250, marker='X', c='red', label='Centroids')
plt.title("K-Means Clustering of Breast Cancer Dataset (PCA-2D)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend()
plt.grid(True)
plt.show()
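The line labels_mapped = np.where(clusters == 1, 0, 1) assumes that cluster 1 corresponds to the malignant class, which depends on the random initialisation of k-means. A more robust optional alternative (a sketch, not part of the required program) assigns each cluster the majority true label of its members:

labels_mapped_auto = np.zeros_like(clusters)
for c in np.unique(clusters):
    majority = np.bincount(y[clusters == c]).argmax()  # most common true label in cluster c
    labels_mapped_auto[clusters == c] = majority
print("Accuracy (majority mapping):", accuracy_score(y, labels_mapped_auto))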
Output:
Data Shape: (569, 30)
Classes: ['malignant' 'benign']
Confusion Matrix:
[[176 36]
[ 18 339]]
Accuracy: 0.9050966608084359