Aryabhatta College
University of Delhi
DATA MINING I PRACTICAL
SEMESTER IV
Submitted by : Priyanshu Kumar
Roll No : CSC/23/54 45
University Roll No : 23059570007
23059570039
Submitted to : Dr. Sonal Linda
Course : B.Sc. (H) Computer Science
INDEX
S.No.  Practical
1. Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include handling missing values, outliers, and inconsistent values. A set of validation rules can be prepared based on the dataset and validations can be performed.
2. Apply data pre-processing techniques such as standardization/normalization, transformation, aggregation, discretization/binarization, sampling, etc., on any dataset.
3. Run the Apriori algorithm to find frequent itemsets and association rules on 2 real datasets and use appropriate evaluation measures to compute the correctness of the obtained patterns:
   a) Use minimum support as 50% and minimum confidence as 75%
   b) Use minimum support as 60% and minimum confidence as 60%
4. Use Naive Bayes, K-Nearest Neighbours, and Decision Tree classification algorithms and build classifiers on any two datasets. Divide the dataset into training and test sets. Compare the accuracy of the different classifiers under the following situations:
   I. a) Training set = 75%, Test set = 25%
      b) Training set = 66.6% (2/3rd of total), Test set = 33.3%
   II. Training set is chosen by
      a) hold-out method
      b) random subsampling
      c) cross-validation.
   Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.
5. Use the Simple K-means algorithm for clustering on any dataset. Compare the performance of clusters by changing the parameters involved in the algorithm. Plot MSE computed after each iteration using a line plot for any set of parameters.
Practical 1
Question :
Apply data cleaning techniques on any dataset (e.g., wine dataset).
Techniques may include handling missing values, outliers, and
inconsistent values. A set of validation rules can be prepared
based on the dataset and validations can be performed.
CODE :
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
import seaborn as sns
import matplotlib.pyplot as plt
# Step 2: Load Dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print("Original Dataset:")
print(df.head())
# Step 3: Introduce Some Missing Values for Practice
df.loc[5:10, 'ash'] = np.nan
df.loc[15:18, 'alcalinity_of_ash'] = np.nan
# Step 4: Handling Missing Values
# Fill 'ash' with the column mean and 'alcalinity_of_ash' with the column median
df['ash'] = df['ash'].fillna(df['ash'].mean())
df['alcalinity_of_ash'] = df['alcalinity_of_ash'].fillna(df['alcalinity_of_ash'].median())
# Step 5: Handling Outliers (Using IQR)
def remove_outliers(col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df[col] = np.where((df[col] < lower) | (df[col] > upper), df[col].median(), df[col])

# Apply on a few numeric columns
for col in ['alcohol', 'malic_acid', 'color_intensity']:
    remove_outliers(col)
# Step 6: Inconsistent Values
# Suppose we mistakenly type a wrong value in 'target'
df.loc[0, 'target'] = 5 # Invalid, valid target = 0,1,2
# Fix: Replace with mode or set a rule
df['target'] = df['target'].apply(lambda x: x if x in [0, 1, 2] else df['target'].mode()[0])
# Step 7: Validation Rules
validation_results = {
    'no_negative_values': (df.select_dtypes(include=[np.number]) >= 0).all().all(),
    'target_in_range': df['target'].isin([0, 1, 2]).all(),
    'ash_not_null': df['ash'].isnull().sum() == 0,
}
print("Validation Results:")
print(validation_results)
# Optional Visualization
sns.boxplot(x=df['alcohol'])
plt.title("Boxplot for Alcohol after Outlier Handling")
plt.show()
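Note: the Step 7 rules are evaluated once into a single dictionary. As an optional variation (not part of the original script), the same checks can be stored as named functions so that each rule reports PASS/FAIL on its own and new rules are added in one place. This is a minimal sketch that assumes df is the cleaned DataFrame produced by the code above (pandas and numpy are already imported there).

# Optional, hedged variation: validation rules as named checks.
# Assumes `df` is the cleaned DataFrame from the script above.
rules = {
    'no_negative_values': lambda d: (d.select_dtypes(include=[np.number]) >= 0).all().all(),
    'target_in_range': lambda d: d['target'].isin([0, 1, 2]).all(),
    'ash_not_null': lambda d: d['ash'].isnull().sum() == 0,
}
for name, check in rules.items():
    print(f"{name}: {'PASS' if check(df) else 'FAIL'}")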
Screenshot : (i), (ii)
Practical 2
Question :
Apply data pre-processing techniques such as
standardization/normalization, transformation, aggregation,
discretization/binarization, sampling, etc., on any dataset.
CODE :
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler, MinMaxScaler, KBinsDiscretizer, Binarizer
from sklearn.model_selection import train_test_split
# Load dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print("Original Dataset:")
print(df.head())
# 1. Standardization (mean=0, std=1)
scaler = StandardScaler()
standardized = scaler.fit_transform(df.iloc[:, :-1]) # without target
df_standardized = pd.DataFrame(standardized, columns=wine.feature_names)
print("\nStandardized Data:")
print(df_standardized.head())
# 2. Normalization (min-max scaling)
minmax = MinMaxScaler()
normalized = minmax.fit_transform(df.iloc[:, :-1])
df_normalized = pd.DataFrame(normalized, columns=wine.feature_names)
print("\nNormalized Data:")
print(df_normalized.head())
# 3. Transformation (log transformation on skewed column)
df['log_proline'] = np.log(df['proline'] + 1) # +1 to avoid log(0)
print("\nLog Transformed 'proline':")
print(df[['proline', 'log_proline']].head())
# 4. Aggregation (mean values by class)
aggregated = df.groupby('target').mean()
print("\nAggregated Mean by Target Class:")
print(aggregated)
# 5. Discretization (binning alcohol into 3 categories)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df['alcohol_bin'] = discretizer.fit_transform(df[['alcohol']])
print("\nDiscretized 'alcohol':")
print(df[['alcohol', 'alcohol_bin']].head())
# 6. Binarization (proline high/low based on threshold)
binarizer = Binarizer(threshold=750)
df['proline_bin'] = binarizer.fit_transform(df[['proline']])
print("\nBinarized 'proline':")
print(df[['proline', 'proline_bin']].head())
# 7. Sampling (random 20% of data)
sampled_df = df.sample(frac=0.2, random_state=42)
print("\nSampled 20% of Dataset:")
print(sampled_df.head())
OUTPUT :
Original Dataset:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
0 14.23 1.71 2.43 15.6 127.0 2.80
1 13.20 1.78 2.14 11.2 100.0 2.65
2 13.16 2.36 2.67 18.6 101.0 2.80
3 14.37 1.95 2.50 16.8 113.0 3.85
4 13.24 2.59 2.87 21.0 118.0 2.80
flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \
0 3.06 0.28 2.29 5.64 1.04
1 2.76 0.26 1.28 4.38 1.05
2 3.24 0.30 2.81 5.68 1.03
3 3.49 0.24 2.18 7.80 0.86
4 2.69 0.39 1.82 4.32 1.04
od280/od315_of_diluted_wines proline target
0 3.92 1065.0 0
1 3.40 1050.0 0
2 3.17 1185.0 0
3 3.45 1480.0 0
4 2.93 735.0 0
Standardized Data:
alcohol malic_acid ash alcalinity_of_ash magnesium \
0 1.518613 -0.562250 0.232053 -1.169593 1.913905
1 0.246290 -0.499413 -0.827996 -2.490847 0.018145
2 0.196879 0.021231 1.109334 -0.268738 0.088358
3 1.691550 -0.346811 0.487926 -0.809251 0.930918
4 0.295700 0.227694 1.840403 0.451946 1.281985
total_phenols flavanoids nonflavanoid_phenols proanthocyanins \
0 0.808997 1.034819 -0.659563 1.224884
1 0.568648 0.733629 -0.820719 -0.544721
2 0.808997 1.215533 -0.498407 2.135968
3 2.491446 1.466525 -0.981875 1.032155
4 0.808997 0.663351 0.226796 0.401404
color_intensity hue od280/od315_of_diluted_wines proline
0 0.251717 0.362177 1.847920 1.013009
1 -0.293321 0.406051 1.113449 0.965242
2 0.269020 0.318304 0.788587 1.395148
3 1.186068 -0.427544 1.184071 2.334574
4 -0.319276 0.362177 0.449601 -0.037874
Normalized Data:
alcohol malic_acid ash alcalinity_of_ash magnesium \
0 0.842105 0.191700 0.572193 0.257732 0.619565
1 0.571053 0.205534 0.417112 0.030928 0.326087
2 0.560526 0.320158 0.700535 0.412371 0.336957
3 0.878947 0.239130 0.609626 0.319588 0.467391
4 0.581579 0.365613 0.807487 0.536082 0.521739
total_phenols flavanoids nonflavanoid_phenols proanthocyanins \
0 0.627586 0.573840 0.283019 0.593060
1 0.575862 0.510549 0.245283 0.274448
2 0.627586 0.611814 0.320755 0.757098
3 0.989655 0.664557 0.207547 0.558360
4 0.627586 0.495781 0.490566 0.444795
color_intensity hue od280/od315_of_diluted_wines proline
0 0.372014 0.455285 0.970696 0.561341
1 0.264505 0.463415 0.780220 0.550642
2 0.375427 0.447154 0.695971 0.646933
3 0.556314 0.308943 0.798535 0.857347
4 0.259386 0.455285 0.608059 0.325963
Log Transformed 'proline':
proline log_proline
0 1065.0 6.971669
1 1050.0 6.957497
2 1185.0 7.078342
3 1480.0 7.300473
4 735.0 6.601230
Aggregated Mean by Target Class:
alcohol malic_acid ash alcalinity_of_ash magnesium \
target
0 13.744746 2.010678 2.455593 17.037288 106.338983
1 12.278732 1.932676 2.244789 20.238028 94.549296
2 13.153750 3.333750 2.437083 21.416667 99.312500
total_phenols flavanoids nonflavanoid_phenols proanthocyanins \
target
0 2.840169 2.982373 0.290000 1.899322
1 2.258873 2.080845 0.363662 1.630282
2 1.678750 0.781458 0.447500 1.153542
color_intensity hue od280/od315_of_diluted_wines proline \
target
0 5.528305 1.062034 3.157797 1115.711864
1 3.086620 1.056282 2.785352 519.507042
2 7.396250 0.682708 1.683542 629.895833
log_proline
target
0 6.998383
1 6.212565
2 6.430818
Discretized 'alcohol':
alcohol alcohol_bin
0 14.23 2.0
1 13.20 1.0
2 13.16 1.0
3 14.37 2.0
4 13.24 1.0
Binarized 'proline':
proline proline_bin
0 1065.0 1.0
1 1050.0 1.0
2 1185.0 1.0
3 1480.0 1.0
4 735.0 0.0
Sampled 20% of Dataset:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
19 13.64 3.10 2.56 15.2 116.0 2.70
45 14.21 4.04 2.44 18.9 111.0 2.85
140 12.93 2.81 2.70 21.0 96.0 1.54
30 13.73 1.50 2.70 22.5 101.0 3.00
67 12.37 1.17 1.92 19.6 78.0 2.11
flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \
19 3.03 0.17 1.66 5.10 0.96
45 2.65 0.30 1.25 5.24 0.87
140 0.50 0.53 0.75 4.60 0.77
30 3.25 0.29 2.38 5.70 1.19
67 2.00 0.27 1.04 4.68 1.12
od280/od315_of_diluted_wines proline target log_proline alcohol_bin \
19 3.36 845.0 0 6.740519 2.0
45 3.33 1080.0 0 6.985642 2.0
140 2.31 600.0 2 6.398595 1.0
30 2.71 1285.0 0 7.159292 2.0
67 3.48 510.0 1 6.236370 1.0
proline_bin
19 1.0
45 1.0
140 0.0
30 1.0
67 0.0
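Note: the script imports train_test_split but never uses it. As a hedged addition (not in the original code), the same import supports stratified sampling, which keeps the class proportions of 'target' in the sample, unlike the simple random sample in step 7. The sketch below assumes df as built by the script above.

# Hedged sketch: stratified 20% sample that preserves the target class proportions.
# Assumes `df` from the script above; train_test_split is already imported.
stratified_sample, _ = train_test_split(df, train_size=0.2, stratify=df['target'], random_state=42)
print("\nStratified 20% Sample (class proportions preserved):")
print(stratified_sample['target'].value_counts(normalize=True))
print(df['target'].value_counts(normalize=True))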
Screenshot : (i), (ii), (iii)
Practical 3
Question :
Run the Apriori algorithm to find frequent itemsets and
association rules on 2 real datasets and use appropriate
evaluation measures to compute the correctness of the obtained patterns
a) Use minimum support as 50% and minimum confidence as 75%
b) Use minimum support as 60% and minimum confidence as 60%
CODE :
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
# Dataset 1: Groceries (example style)
data = [['milk', 'bread', 'eggs'],
['milk', 'bread'],
['milk', 'eggs'],
['bread', 'eggs'],
['milk', 'bread', 'eggs', 'butter'],
['bread', 'butter']]
# Convert to dataframe (one-hot encoded format)
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data, columns=te.columns_)
def run_apriori(df, min_support, min_confidence):
    print(f"\nRunning Apriori with min_support={min_support}, min_confidence={min_confidence}")
    freq_items = apriori(df, min_support=min_support, use_colnames=True)
    rules = association_rules(freq_items, metric="confidence", min_threshold=min_confidence)
    # Evaluation: Show Lift, Confidence, Support
    rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
    print("\nFrequent Itemsets:\n", freq_items)
    print("\nAssociation Rules:\n", rules)
    return rules
# a) Support=50%, Confidence=75%
run_apriori(df, min_support=0.5, min_confidence=0.75)
# b) Support=60%, Confidence=60%
run_apriori(df, min_support=0.6, min_confidence=0.6)
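A quick hand check of the thresholds on the six transactions above: {milk, bread} occurs in 3 of 6 transactions, so its support is 3/6 = 0.50, and confidence(milk → bread) = 3/4 = 0.75 (milk occurs in 4 transactions), so this rule just meets the bounds of case (a). The reverse rule bread → milk has confidence 3/5 = 0.60, below the 75% bound of case (a), and in case (b) the itemset itself falls short of the 60% support bound.

The question asks for two real datasets while the script encodes only the grocery-style list, so the sketch below shows how a second transaction file could be pushed through the same pipeline. The file name retail_transactions.csv and its one-transaction-per-line, comma-separated layout are assumptions, not part of the original code.

# Hedged sketch for a second transaction dataset (hypothetical file and layout):
# each line is assumed to be a comma-separated list of items, e.g. "milk,bread,butter".
with open('retail_transactions.csv') as f:
    data2 = [line.strip().split(',') for line in f if line.strip()]

te2 = TransactionEncoder()
df2 = pd.DataFrame(te2.fit(data2).transform(data2), columns=te2.columns_)

# Re-use the same thresholds as in parts (a) and (b)
run_apriori(df2, min_support=0.5, min_confidence=0.75)
run_apriori(df2, min_support=0.6, min_confidence=0.6)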
Screenshot : (i)
Practical 4
Question :
Use Naive Bayes, K-Nearest Neighbours, and Decision Tree
classification algorithms and build classifiers on any two
datasets. Divide the dataset into training and test sets.
Compare the accuracy of the different classifiers under
the following situations:
I. a) Training set = 75%, Test set = 25%
   b) Training set = 66.6% (2/3rd of total), Test set = 33.3%
II. Training set is chosen by
   i) hold-out method
   ii) random subsampling
   iii) cross-validation.
Compare the accuracy of the classifiers obtained.
Data needs to be scaled to standard format.
CODE :
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load Dataset 1: Iris
iris = datasets.load_iris()
X1 = pd.DataFrame(iris.data, columns=iris.feature_names)
y1 = pd.Series(iris.target)
# Load Dataset 2: Wine
wine = datasets.load_wine()
X2 = pd.DataFrame(wine.data, columns=wine.feature_names)
y2 = pd.Series(wine.target)
# Standardize the data
scaler = StandardScaler()
X1_scaled = scaler.fit_transform(X1)
X2_scaled = scaler.fit_transform(X2)
# Classifiers
models = {
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0)
}

# Function to evaluate models
def evaluate_models(X, y, dataset_name):
    print(f"\n Results for {dataset_name} Dataset:\n")

    # I. a) Train 75% / Test 25%
    print("I. a) 75% Train / 25% Test:")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f" {name}: {acc:.2f}")

    # I. b) Train 66.6% / Test 33.3%
    print("\nI. b) 66.6% Train / 33.3% Test:")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=42)
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f" {name}: {acc:.2f}")

    # II. Hold-out (already done above)

    # II. Random Subsampling (Repeat holdout multiple times)
    print("\nII. Random Subsampling (10 runs):")
    for name, model in models.items():
        scores = []
        for _ in range(10):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            scores.append(accuracy_score(y_test, y_pred))
        print(f" {name}: Avg Accuracy: {sum(scores)/len(scores):.2f}")

    # II. Cross-validation
    print("\nII. Cross Validation (5-fold):")
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f" {name}: Avg Accuracy: {scores.mean():.2f}")
# Run evaluation on both datasets
evaluate_models(X1_scaled, y1, "Iris")
evaluate_models(X2_scaled, y2, "Wine")
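Note: random subsampling in part II is implemented above as a manual 10-run loop; scikit-learn's ShuffleSplit expresses the same idea and plugs straight into cross_val_score. The sketch below is a minimal, optional alternative, assuming models, X1_scaled and y1 from the script above (the Wine data can be swapped in the same way).

from sklearn.model_selection import ShuffleSplit

# Random subsampling as 10 independent 70/30 splits driven by cross_val_score.
# Assumes `models`, `X1_scaled`, `y1` from the script above.
subsampler = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X1_scaled, y1, cv=subsampler)
    print(f"{name}: Avg Accuracy over 10 random subsamples: {scores.mean():.2f}")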
Screenshot : (i)
Practical 5
Question :
Use Simple K-means algorithm for clustering
on any dataset. Compare the performance of clusters by
changing the parameters involved in the algorithm. Plot
MSE computed after each iteration using a line plot for
any set of parameters.
CODE :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load Iris dataset
iris = datasets.load_iris()
X = iris.data
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Try different values of k (clusters)
cluster_range = [2, 3, 4, 5, 6]
print(" Cluster Performance (Silhouette Score):\n")
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, random_state=42)
    kmeans.fit(X_scaled)
    labels = kmeans.labels_
    silhouette = silhouette_score(X_scaled, labels)
    print(f"k = {k} → Silhouette Score = {silhouette:.4f}, Inertia (MSE) = {kmeans.inertia_:.2f}")
# --------------------------
# MSE vs Iteration Plot for k=3
print("\n Plotting MSE per iteration for k=3")
kmeans = KMeans(n_clusters=3, init='random', max_iter=10, n_init=1, verbose=1, random_state=42)
kmeans.fit(X_scaled)
# Plot inertia over iterations (Using verbose=True helps print it)
# Note: sklearn doesn't expose inertia per iteration, so we'll do manual tracking
inertias = []
X_input = X_scaled
for i in range(1, 11):
    kmeans = KMeans(n_clusters=3, init='random', max_iter=i, n_init=1, random_state=42)
    kmeans.fit(X_input)
    inertias.append(kmeans.inertia_)
# Plotting the inertia (MSE) after each iteration
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertias, marker='o', linestyle='-')
plt.title("MSE (Inertia) after each iteration (k=3)")
plt.xlabel("Iteration")
plt.ylabel("MSE (Inertia)")
plt.grid(True)
plt.show()
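A note on terminology: scikit-learn's inertia_ is the sum of squared distances of samples to their nearest centroid (an SSE), so the curve above is strictly SSE; dividing by the number of samples gives MSE in the usual sense. The sketch below is a minimal, optional conversion, assuming inertias and X_scaled from the loop above.

# Convert inertia (SSE) to MSE by dividing by the number of samples.
# Assumes `inertias` and `X_scaled` from the loop above.
mse_values = [sse / X_scaled.shape[0] for sse in inertias]
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), mse_values, marker='o', linestyle='-')
plt.title("MSE after each iteration (k=3)")
plt.xlabel("Iteration (max_iter)")
plt.ylabel("MSE = Inertia / n_samples")
plt.grid(True)
plt.show()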
Screenshot : (i), (ii), (iii)
Thank You