1. Compute Central Tendency Measures: Mean, Median, Mode; Measures of Dispersion: Variance, Standard Deviation.
2. Apply the following Pre-processing techniques for a given dataset.
a. Attribute selection
b. Handling Missing Values
c. Discretization
d. Elimination of Outliers
3. Apply KNN algorithm for classification and regression
4. Demonstrate decision tree algorithm for a classification problem and perform
parameter tuning for better results
5. Demonstrate decision tree algorithm for a regression problem
6. Apply Random Forest algorithm for classification and regression
7. Demonstrate Naïve Bayes Classification algorithm.
8. Apply Support Vector algorithm for classification
9. Demonstrate simple linear regression algorithm for a regression problem
10. Apply Logistic regression algorithm for a classification problem
11. Demonstrate Multi-layer Perceptron algorithm for a classification problem
12. Implement the K-means algorithm and apply it to the data you selected.
Evaluate performance by measuring the sum of the Euclidean distance of each
example from its class center. Test the performance of the algorithm as a function
of the parameter K.
13. Demonstrate the use of Fuzzy C-Means Clustering
14. Demonstrate the use of Expectation Maximization based clustering algorithm
1) Compute Central Tendency Measures: Mean, Median, Mode; Measures of Dispersion: Variance, Standard Deviation.
Program:
import numpy as np
import scipy.stats as stats
# Sample data
data = [10, 20, 30, 40, 50]
# Calculate central tendency
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=False)  # keepdims=False (SciPy >= 1.9) gives scalar results
# Calculate dispersion (NumPy defaults to population variance/std, i.e. ddof=0)
variance = np.var(data)
std_dev = np.std(data)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode.mode)
print("Variance:", variance)
print("Standard Deviation:", std_dev)
OUTPUT:
Mean: 30.0
Median: 30.0
Mode: 10
Variance: 200.0
Standard Deviation: 14.142135623730951
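2) Apply Pre-processing techniques for a given dataset. No worked program appears in this record for exercise 2, so the following is a minimal sketch of the four steps on a small hypothetical pandas DataFrame (the column names and values are illustrative, not from a real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: one missing value, one outlier, one uninformative column
df = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 120],       # np.nan is missing; 120 is an outlier
    "salary": [30000, 42000, 39000, 52000, 50000],
    "id": [1, 2, 3, 4, 5],                  # identifier, carries no signal
})

# a. Attribute selection: drop the identifier column
df = df.drop(columns=["id"])

# b. Handling missing values: impute with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# c. Discretization: bin salary into 3 equal-width intervals
df["salary_bin"] = pd.cut(df["salary"], bins=3, labels=["low", "mid", "high"])

# d. Elimination of outliers: keep rows within 1.5*IQR of the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]

print(df)
```

With these toy values the imputed age is 55.0 and the 1.5*IQR rule removes the row with age 120.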
3) Apply KNN algorithm for classification and regression
Program:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Step 1: Load the dataset (example using the Iris dataset)
from sklearn.datasets import load_iris
data = load_iris()
# Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Step 2: Preprocess data (if needed, e.g., scaling features)
X = df.drop('target', axis=1) # Features
y = df['target'] # Target
# Feature Scaling (important for KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3,
random_state=42)
# Step 4: Apply KNN Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5) # You can tune 'k'
knn_classifier.fit(X_train, y_train)
# Step 5: Make predictions and evaluate the model
y_pred = knn_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)
OUTPUT:
Accuracy: 100.00%
Confusion Matrix:
[[19 0 0]
[ 0 13 0]
[ 0 0 13]]
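Exercise 3 also asks for KNN regression. A minimal sketch using KNeighborsRegressor on synthetic data (the dataset and the choice of k=5 are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic one-feature regression problem
X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# KNN regression predicts the mean target of the k nearest neighbours
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R-squared Score: {r2_score(y_test, y_pred):.2f}")
```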
4) Demonstrate decision tree algorithm for a classification problem and
perform parameter tuning for better results
Program:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
from sklearn import tree
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target (class labels)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Train a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
# Make predictions
y_pred = dt_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Define the parameter grid for tuning
param_grid = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
param_grid=param_grid,
cv=5, # 5-fold cross-validation
scoring='accuracy', # Metric to optimize for
n_jobs=-1, # Use all available CPUs
verbose=1)
# Fit GridSearchCV
grid_search.fit(X_train, y_train)
# Best parameters from GridSearchCV
print("Best Parameters:", grid_search.best_params_)
# Get the best model
best_dt_classifier = grid_search.best_estimator_
# Make predictions with the best model
y_pred_best = best_dt_classifier.predict(X_test)
# Evaluate the tuned model
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy after tuning: {accuracy_best * 100:.2f}%")
# Classification Report for the tuned model
print("\nClassification Report for Tuned Model:")
print(classification_report(y_test, y_pred_best))
# Visualize the decision tree
plt.figure(figsize=(12, 8))
tree.plot_tree(best_dt_classifier, filled=True, feature_names=data.feature_names,
class_names=data.target_names)
plt.title("Decision Tree Visualization")
plt.show()
OUTPUT:
Accuracy: 100.00%
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Fitting 5 folds for each of 135 candidates, totalling 675 fits
Best Parameters: {'max_depth': None, 'max_features': None,
'min_samples_leaf': 4, 'min_samples_split': 2}
Accuracy after tuning: 100.00%
Classification Report for Tuned Model:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
5) Demonstrate decision tree algorithm for a regression problem
Program:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=15,
random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
# Train the model
dt_regressor.fit(X_train, y_train)
# Predict on the test set
y_pred = dt_regressor.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Visualize the decision tree's predictions
X_grid = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)
y_grid_pred = dt_regressor.predict(X_grid)
plt.scatter(X, y, color="blue", label="Data")
plt.plot(X_grid, y_grid_pred, color="red", label="Decision Tree Prediction")
plt.title("Decision Tree Regression")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.legend()
plt.show()
OUTPUT:
Mean Squared Error: 396.47
6) Apply Random Forest algorithm for classification and regression
Program:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf_classifier.fit(X_train, y_train)
# Predict on the test set
y_pred = rf_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
OUTPUT:
Accuracy: 1.00
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
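Exercise 6 also calls for regression. A minimal sketch with RandomForestRegressor on synthetic data (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic multi-feature regression problem
X, y = make_regression(n_samples=300, n_features=4, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Random Forest regression averages the predictions of many trees
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)
y_pred = rf_regressor.predict(X_test)

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R-squared Score: {r2_score(y_test, y_pred):.2f}")
```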
7) Demonstrate Naïve Bayes Classification algorithm.
Program:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the Naïve Bayes classifier
nb_classifier = GaussianNB()
# Train the model
nb_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
OUTPUT:
Accuracy: 1.00
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
8) Apply Support Vector algorithm for classification
Program:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the Support Vector Classifier with a linear kernel
svc_model = SVC(kernel='linear', random_state=42)
# Train the model
svc_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = svc_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
OUTPUT:
Accuracy: 1.00
Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 1.00 1.00 9
virginica 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
9) Demonstrate simple linear regression algorithm for a regression problem
Program:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate synthetic data
np.random.seed(42)
X = 2.5 * np.random.randn(100, 1) + 1.5 # Feature (Independent Variable)
y = 1.2 * X + np.random.randn(100, 1) * 0.8 # Target (Dependent Variable)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the Linear Regression model
lr_model = LinearRegression()
# Train the model
lr_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = lr_model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared (R²) Score: {r2:.2f}")
# Visualize the results
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color="blue", label="Actual Data")
plt.plot(X_test, y_pred, color="red", label="Regression Line")
plt.title("Simple Linear Regression")
plt.xlabel("Independent Variable (X)")
plt.ylabel("Dependent Variable (y)")
plt.legend()
plt.show()
OUTPUT:
Mean Squared Error: 0.56
R-squared (R²) Score: 0.90
10) Apply Logistic regression algorithm for a classification problem
Data set:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
Program:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset (Pima Indians Diabetes Dataset)
# You can replace this with your local dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=columns)
# Separate features (X) and target (y)
X = data.drop('Outcome', axis=1)
y = data['Outcome']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the Logistic Regression model
log_reg_model = LogisticRegression(max_iter=200, random_state=42)
# Train the model
log_reg_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = log_reg_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["No Diabetes",
"Diabetes"]))
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
xticklabels=["No Diabetes", "Diabetes"], yticklabels=["No Diabetes",
"Diabetes"])
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
OUTPUT:
Accuracy: 0.75
Classification Report:
precision recall f1-score support
No Diabetes 0.81 0.79 0.80 99
Diabetes 0.64 0.67 0.65 55
accuracy 0.75 154
macro avg 0.73 0.73 0.73 154
weighted avg 0.75 0.75 0.75 154
11. Demonstrate Multi-layer Perceptron algorithm for a classification
problem
Program:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
# Load Iris dataset
data = load_iris()
X, y = data.data, data.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Define and train the MLP classifier
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam',
max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
# Make predictions
y_pred = mlp.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
OUTPUT:
Accuracy: 0.9333
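MLPs are sensitive to feature scale, so standardizing the inputs often improves convergence and accuracy. A sketch of the same experiment wrapped in a pipeline with StandardScaler (the pipeline is an addition for illustration, not part of the original exercise):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    test_size=0.2, random_state=42)

# Pipeline: standardize features, then train the MLP on the scaled data
pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(hidden_layer_sizes=(10,), activation='relu',
                                   solver='adam', max_iter=500, random_state=42))
pipe.fit(X_train, y_train)
acc = accuracy_score(y_test, pipe.predict(X_test))
print(f'Accuracy with scaling: {acc:.4f}')
```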
12. Implement the K-means algorithm and apply it to the data you selected.
Evaluate performance by measuring the sum of the Euclidean distance of each
example from its class center. Test the performance of the algorithm as a
function of the parameter K.
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate synthetic data
np.random.seed(42)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0)
# Function to evaluate K-Means clustering performance
def evaluate_kmeans(X, max_k=10):
    sse = []  # Sum of Squared Errors (SSE)
    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        sse.append(kmeans.inertia_)  # SSE (inertia) for the given K
    return sse
# Evaluate performance for different values of K
max_k = 10
sse_values = evaluate_kmeans(X, max_k)
# Plot SSE vs K
plt.figure(figsize=(8, 5))
plt.plot(range(1, max_k + 1), sse_values, marker='o', linestyle='--', color='b')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title("K-Means Performance Evaluation (Elbow Method)")
plt.grid()
plt.show()
OUTPUT:
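Note that kmeans.inertia_ is the sum of *squared* distances, while the exercise statement asks for the sum of plain Euclidean distances of each example from its class center. A small sketch computing that quantity directly for several values of K (synthetic data; the range 2–5 is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

sums = {}
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    # Euclidean distance of every point to its own cluster centre, summed (not squared)
    dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
    sums[k] = dists.sum()
    print(f"K={k}: sum of Euclidean distances = {sums[k]:.2f}")
```

As with inertia, this total decreases as K grows, so the "elbow" is what matters, not the raw minimum.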
13. Demonstrate the use of Fuzzy C-Means Clustering
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=42)
# Initialize cluster centers randomly
cntr = X[np.random.choice(len(X), 3, replace=False)]
for _ in range(50):  # Reduce iterations for simplicity
    dist = np.linalg.norm(X[:, None] - cntr, axis=2)
    u = 1 / (dist ** 2 + 1e-8)  # Avoid division by zero
    u /= u.sum(axis=1, keepdims=True)
    new_cntr = (u.T @ X) / u.sum(axis=0, keepdims=True).T
    if np.linalg.norm(new_cntr - cntr) < 1e-5:
        break
    cntr = new_cntr
# Assign clusters
labels = np.argmax(u, axis=1)
# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(cntr[:, 0], cntr[:, 1], c='red', marker='x', s=200)
plt.title("Fuzzy C-Means Clustering")
plt.show()
OUTPUT:
14. Demonstrate the use of Expectation Maximization based clustering
algorithm
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Generate synthetic data
np.random.seed(42)
X1 = np.random.normal(loc=2, scale=1.0, size=(100, 2)) # Cluster 1
X2 = np.random.normal(loc=-2, scale=1.0, size=(100, 2)) # Cluster 2
X3 = np.random.normal(loc=5, scale=1.5, size=(100, 2)) # Cluster 3
X = np.vstack((X1, X2, X3)) # Combine all clusters
# Apply Gaussian Mixture Model (Expectation-Maximization)
gmm = GaussianMixture(n_components=3, covariance_type='full',
random_state=42)
gmm.fit(X)
# Predict cluster labels
labels = gmm.predict(X)
# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], s=200, c='red', marker='X',
label='Centroids')
plt.title("Expectation Maximization Clustering (GMM)")
plt.legend()
plt.show()
OUTPUT:
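A common follow-up to EM clustering is choosing the number of mixture components. A minimal sketch using the Bayesian Information Criterion (BIC) on the same kind of synthetic data as above (the candidate range 1–6 is an illustrative choice):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Regenerate the three synthetic Gaussian clusters
np.random.seed(42)
X = np.vstack([
    np.random.normal(loc=2, scale=1.0, size=(100, 2)),
    np.random.normal(loc=-2, scale=1.0, size=(100, 2)),
    np.random.normal(loc=5, scale=1.5, size=(100, 2)),
])

# Fit GMMs with 1..6 components and keep the BIC of each fit
bics = []
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, covariance_type='full',
                          random_state=42).fit(X)
    bics.append(gmm.bic(X))

# Lower BIC is better; it balances fit quality against model complexity
best_n = int(np.argmin(bics)) + 1
print("Best number of components by BIC:", best_n)
```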