TRIBHUVAN UNIVERSITY
INSTITUTE OF SCIENCE AND TECHNOLOGY
BIRENDRA MULTIPLE CAMPUS
Data Warehousing and Data Mining
BIT 454
Submitted by
Aaditya Pageni (BIT 267/077)
Submitted to
Lab 01:
Implementation of K-means clustering algorithm
Objective:
Write a Python program to implement K-means Clustering algorithm. Generate 1000 2D data
points in the range 0-100 randomly. Divide data points into 3 clusters.
Required Theory:
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning
a dataset into a predetermined number of clusters. The goal of K-means is to minimize the
within-cluster variance, also known as inertia: the sum of squared distances from each point in a
cluster to that cluster's centroid. Here is a step-by-step overview of how the algorithm works:
1. Initialization: Choose the number of clusters (K) and randomly initialize K centroids.
Centroids are the points that represent the center of each cluster.
2. Assignment Step: Assign each data point to the nearest centroid. This is typically done by
calculating the Euclidean distance between each point and each centroid, and assigning each
point to the cluster with the nearest centroid.
3. Update Step: After all points have been assigned to clusters, calculate the mean of the points
in each cluster and update the centroid to be the mean. This moves the centroid to the center of
its cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids no
longer change significantly or when a maximum number of iterations is reached.
5. Final Step: Once convergence is reached, the algorithm outputs the final centroids and the
cluster assignments for each data point.
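The five steps above can also be sketched from scratch with NumPy, independently of the scikit-learn code used below. This is a minimal illustration; the function and variable names are chosen for this sketch, not taken from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, k=3, max_iter=100, tol=1e-4):
    """Minimal K-means following the five steps described above."""
    # 1. Initialization: pick k random data points as starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: label each point with its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    # 5. Final centroids and per-point cluster assignments
    return centroids, labels

data = rng.random((1000, 2)) * 100
centers, labels = kmeans(data, k=3)
```

In practice library implementations such as scikit-learn's add refinements (multiple restarts, smarter initialization) on top of this basic loop.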
Executable Python code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate 1000 random 2D data points in the range 0-100
data = np.random.rand(1000, 2) * 100

# Fit K-means with 3 clusters and random centroid initialization
km = KMeans(n_clusters=3, init="random", n_init=10)
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster centers: ", *centers)
# print("Cluster Labels: ", *labels)

# Plot each point with a color and marker according to its cluster
colors = ["r", "g", "b"]
markers = ["+", "x", "*"]
for i in range(len(data)):
    plt.plot(data[i][0], data[i][1], color=colors[labels[i]],
             marker=markers[labels[i]])

# Mark the cluster centers with large squares
plt.scatter(centers[:, 0], centers[:, 1], marker="s", s=100, linewidths=5)
plt.show()
Output:
Cluster centers: [82.29904926 50.2243326 ] [31.71356124 22.45600779]
[31.52067957 77.620532 ]
Conclusion:
Hence, we implemented the K-means clustering algorithm in Python using Google Colab.
Lab 02:
Implementation of Apriori algorithm
Objective:
Write a Python program to utilize the Apriori algorithm on a retail dataset and identify
significant association rules between items in customer transactions.
Required Theory:
The Apriori algorithm is a classical algorithm in data mining and machine learning used for
association rule mining in transactional databases. It aims to find interesting relationships or
associations among a set of items in large datasets. The most common application of the Apriori
algorithm is in market basket analysis, where it helps identify associations between products that
are frequently purchased together.
Algorithm Steps:
1. Generating Candidate Itemsets:
- The Apriori algorithm starts by generating candidate itemsets of length 1 (individual items)
and then iteratively generates larger itemsets.
- New candidate itemsets are generated by joining pairs of frequent itemsets found in the
previous iteration.
2. Pruning Candidate Itemsets:
- After generating candidate itemsets, the algorithm scans the dataset to count the support of
each candidate itemset.
- Candidate itemsets that do not meet the minimum support threshold are pruned from further
consideration.
3. Generating Association Rules:
- Once frequent itemsets are identified, association rules are generated from these itemsets.
- Association rules are generated by partitioning frequent itemsets into non-empty subsets and
calculating support, confidence, and lift for each rule.
- Rules that meet the minimum confidence and lift thresholds are considered significant and are
returned as the final output of the algorithm.
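As a toy illustration of steps 1-3, the following sketch counts support over five hand-made transactions, prunes infrequent itemsets, and derives rules with their confidence. The transactions, the threshold, and the helper names are invented for this example:

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread"},
]
min_support = 0.4

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: candidate 1-itemsets; step 2: prune by minimum support
items = {i for t in transactions for i in t}
L1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Join frequent 1-itemsets into candidate 2-itemsets, prune again
L2 = [a | b for a, b in combinations(L1, 2) if support(a | b) >= min_support]

# Step 3: generate rules from each frequent 2-itemset with their confidence
for itemset in L2:
    for item in itemset:
        antecedent = itemset - {item}
        confidence = support(itemset) / support(antecedent)
        print(set(antecedent), "->", {item}, f"(conf={confidence:.2f})")
```

A real implementation such as apyori, used below, continues joining to longer itemsets and also filters rules by lift.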
Dataset Description:
We are using a dataset from a retail store which contains 7501 customer transactions (rows) in
a CSV file. A snapshot of the dataset transactions is given below:
Executable Python code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

path = "/content/store_data.csv"
dataset = pd.read_csv(path, header=None)  # the CSV has no header row
dataset.head()

# Convert each row of the DataFrame into a list of item strings
records = []
for i in range(len(dataset)):
    row = dataset.iloc[i].dropna()
    records.append([str(item) for item in row])

# Mine association rules with minimum support, confidence, and lift
association_rules = apriori(
    records, min_support=0.005, min_confidence=0.2,
    min_lift=3, min_length=2
)
association_results = list(association_rules)

# Print each rule as antecedent -> consequent
for item in association_results:
    rule = item.ordered_statistics[0]
    print(list(rule.items_base), '->', list(rule.items_add))
Output:
Association rules generated:
['mushroom cream sauce'] -> ['escalope']
['pasta'] -> ['escalope']
['herb & pepper'] -> ['ground beef']
['tomato sauce'] -> ['ground beef']
['whole wheat pasta'] -> ['olive oil']
['pasta'] -> ['shrimp']
['chocolate', 'frozen vegetables'] -> ['shrimp']
['spaghetti', 'frozen vegetables'] -> ['ground beef']
['shrimp', 'mineral water'] -> ['frozen vegetables']
['spaghetti', 'frozen vegetables'] -> ['olive oil']
['spaghetti', 'frozen vegetables'] -> ['shrimp']
['spaghetti', 'frozen vegetables'] -> ['tomatoes']
['spaghetti', 'grated cheese'] -> ['ground beef']
['herb & pepper', 'mineral water'] -> ['ground beef']
['herb & pepper', 'spaghetti'] -> ['ground beef']
['shrimp', 'ground beef'] -> ['spaghetti']
['milk', 'spaghetti'] -> ['olive oil']
['mineral water', 'soup'] -> ['olive oil']
['pancakes', 'spaghetti'] -> ['olive oil']
Conclusion:
Hence, we implemented the Apriori algorithm in Python using Google Colab.
Lab 03:
Implementation of ID3 decision tree algorithm
Objective:
Write a Python program to predict diabetes using the ID3 Decision Tree Classifier.
Required Theory:
The ID3 (Iterative Dichotomiser 3) algorithm is a classic and straightforward algorithm used for
constructing decision trees. It was developed by Ross Quinlan in 1986 and is particularly
popular for its simplicity and ease of understanding.
ID3 algorithm works as:
1. Input Data: ID3 algorithm starts with a dataset containing features and corresponding target
labels.
2. Feature Selection: It selects the best attribute to split the data at each node based on a
criterion called Information Gain. Information Gain measures how much entropy (uncertainty or
randomness) is reduced in the dataset after splitting on a particular attribute.
3. Tree Construction: It recursively constructs the decision tree by selecting the best attribute to
split the data at each node. This process continues until one of the stopping criteria is met, such
as:
- All instances at a node belong to the same class.
- No more attributes are left to split on.
- The tree reaches a maximum depth.
4. Output: The resulting decision tree is used for classification by following the decision paths
from the root to the leaf nodes based on the values of the features of the input data.
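The information-gain criterion from step 2 can be computed directly from the entropy formula. Below is a minimal sketch with a toy weather-style example; the attribute values, labels, and function names are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    """Reduction in entropy from splitting the data on one attribute."""
    base = entropy(labels)
    # Group labels by the value of the chosen attribute
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    # Weighted average entropy of the resulting subsets
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - weighted

# Toy example: a single "Outlook" attribute vs. a yes/no target
rows = [("sunny",), ("sunny",), ("overcast",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes", "no"]
print(information_gain(rows, labels, 0))
```

ID3 evaluates this gain for every remaining attribute at each node and splits on the attribute with the highest value; scikit-learn's `DecisionTreeClassifier(criterion="entropy")`, used below, applies the same entropy criterion.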
Dataset Description:
We are using a hospital dataset which contains 768 patient records (rows) in a
CSV file. A snapshot of the dataset records is given below:
Executable Python code:
import pandas as pd
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

path = "/content/Diabetes.csv"
dataset = pd.read_csv(path)
print("Dataset Size: ", len(dataset))

# 70/30 train-test split
split = int(len(dataset) * 0.7)
train, test = dataset.iloc[:split], dataset.iloc[split:]

# Feature column names exactly as they appear in the CSV
features = ["Pragnency", "Glucose", "Blod Pressure", "Skin Thikness",
            "Insulin", "BMI", "DFP", "Age"]

traininput = train[features].values
trainlabels = train["Diabetes"].values

# Entropy criterion selects splits by information gain, as in ID3
model = DecisionTreeClassifier(criterion="entropy", max_depth=4)
model.fit(traininput, trainlabels)

testinput = test[features].values
testlabels = test["Diabetes"].values
predicted = model.predict(testinput)
# print('Actual Class:', *testlabels)
# print('Predicted Class:', *predicted)

print("Confusion Matrix:")
print(metrics.confusion_matrix(testlabels, predicted))
print("\nClassification Measures:")
print("Accuracy:", metrics.accuracy_score(testlabels, predicted))
print("Recall:", metrics.recall_score(testlabels, predicted))
print("Precision:", metrics.precision_score(testlabels, predicted))
print("F1-score:", metrics.f1_score(testlabels, predicted))
Output:
Dataset Size: 767
Confusion Matrix:
[[117 35]
[ 17 62]]
Classification Measures:
Accuracy: 0.7748917748917749
Recall: 0.7848101265822784
Precision: 0.6391752577319587
F1-score: 0.7045454545454545
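The measures above follow directly from the printed confusion matrix (scikit-learn prints actual classes as rows and predicted classes as columns, so TN = 117, FP = 35, FN = 17, TP = 62). A quick hand check:

```python
# Values read off the confusion matrix printed above
tn, fp = 117, 35
fn, tp = 17, 62
total = tn + fp + fn + tp

accuracy = (tp + tn) / total         # fraction of correct predictions
recall = tp / (tp + fn)              # fraction of actual positives found
precision = tp / (tp + fp)           # fraction of predicted positives correct
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
print("F1-score:", f1)
```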
Conclusion:
Hence, we implemented the ID3 decision tree algorithm in Python using Google Colab.