Association Rule Mining
Apriori:
The Apriori algorithm is used for mining frequent item sets from a transactional database. It
uses “support” parameter to put constraint on quality of item sets. Support refers to
frequency of occurrence of itemsets to be “frequent”. “min_sup” is used as a threshold to
support measure and is denoted by “𝜎”.
It uses a key concept in “Apriori property” of itemsets which is the anti-monotonicity of the
support measure.
1. All subsets of a frequent itemset must be frequent – Apriori property
2. For any infrequent itemset, all its supersets must be infrequent. – Anti-monotone
property.
Problem Statement: Generating the frequent itemsets and association rules from
transactional dataset.
Apriori Algorithm:
Following are the main steps of the algorithm:
1. Calculate the support of item sets of size k (= 1) in the transactional database. This is
called the set candidate 1-itemsets.
2. Prune candidates by eliminating itemsets with a support less than the given threshold 𝜎
by scanning the dataset D.
3. Join the frequent itemsets to form sets of size k + 1 e.g. size of 2 – in 2-itemsets.
4. Repeat step 2&3 until no more itemsets can be formed. This will happen when the set(s)
formed have a support less than the given support threshold.
Required libraries:
Working in Jupiter notebook:
1. Install library for Apriori algorithm using: !pip install mlxtend.
2. Load the “basket” dataset using the required command.
3. Perform pre-processing (if required). Select only products bought by a customer. Do not
include customer details.
4. Apply Apriori algorithm using given value for support and conYidence using the required
library.
Generating Association rules:
To Yind the association rules, the obtained frequent itemsets are generated using the
conYidence measure calculated on the basis of support of itemsets.
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.25, use_colnames=True/False)
frequent_itemsets
Assignment Questions
1. Find frequent itemsets in the dataset using Apriori min_sup 0.1
2. Find the association rules in the dataset having min_conf 10%
3. Find association rules having minimum antecedent_len 2 & conYidence greater than 0.75
4. Load the "zoo" data
5. Perform pre-processing (if required)
Hint: All the atrributes will be transformed using one hot encoding
6. Find Frequent itemsets in zoo dataset having min support 0.5
7. Find Frequent association rules having min conYidence 0.5
8. Convert the dataset into two classes "Mammal" and "others"
9. Partition the dataset into training and testing part (70:30)
10. Generate association rules for "mammal" class (training data) with min_sup 0.4 and
conYidence as 1
11. Test the rules generated on testing dataset and Yind precision and recall for the rule
based classiYier
12. Apply decision tree on the dataset and calculate the performance evaluation measures
13. Which out of the two performs better.
Clustering using K-Means
K-Means:
K-means is an iterative algorithm that tries to partition the dataset into K pre-deYined distinct
non-overlapping subgroups (clusters) where each data point belongs to only one group. It
tries to make the intra-cluster similarities as high as possible while also keeping inter-cluster
similarities as low as possible. It assigns data points to a cluster whose centroid is at
minimum distance.
Algorithm:
Following are the main steps of the algorithm:
1. Specify number of clusters ‘K’.
2. Initialize centroids by Yirst shufYling the dataset and then randomly selecting K data
points without replacement.
3. Assign each data point to the closest cluster (centroid).
4. Compute the centroids for the clusters by taking the average of the all data points
that belong to each cluster
5. Keep iterating until there is no change to the centroids i.e. assignment of data points
to clusters isn’t changing.
Required libraries:
To load dataset: import pandas as pd
Preprocessing: from sklearn import preprocessing
ploting graph: import matplotlib.pyplot as plt k-
means: from sklearn.cluster import KMeans
Working in Jupiter notebook:
1. Load the required libraries for K-means algorithm.
2. Load the given dataset using the required command.
3. Perform pre-processing if required
4. Apply K-means algorithm. The various parameters used are as follows:
a. init controls the initialization technique. The standard version of the k-means
algorithm is implemented by setting init to "random". Setting this to "k-means++"
employs an advanced trick to speed up convergence.
b. n_clusters sets k for the clustering step. This is the most important parameter for
kmeans.
c. n_init sets the number of initializations to perform. This is important because two
runs can converge on different cluster assignments. The default behaviour for the
scikit-learn algorithm is to perform ten k-means runs and return the results of the one
with the lowest SSE.
d. max_iter sets the number of maximum iterations for each initialization of the kmeans
algorithm.
Assignment Questions
1. Load the dataset and perform standardization as pre-processing task.
2. Apply K-means on the dataset. Use random initialization, number of clusters = 5,
number of initializations = 10, maximum iterations = 300, random state = 42
3. Plot the clusters
4. Find the lowest SSE values and Yinal locations of the centroids.
5. Use initialization technique as "K-means++" and calculate the above values again
6. Draw a graph between K and SSE. Vary the value of K from 1 to 10, using same
parameters for K-Means as above
Hint: The elbow nick method is commonly used to evaluate the appropriate number of
clusters. To perform the elbow nick method, run several K-means, increment K with each
iteration, and record the SSE.
DBSCAN
DBSCAN:
The main concept of DBSCAN algorithm is to locate regions of high density that are separated
from one another by regions of low density.
To measure density of a region:
• Density at a point p: Number of points within a hypersphere of Radius Eps (ϵ) from point p.
• Dense Region: For each point in the cluster, the hypersphere with radius ϵ contains at least
minimum number of points (MinPts).
The Epsilon neighborhood of a point P in the database D is deYined as:
𝑁𝜖(p) = {𝑞 ∈ 𝐷 | 𝑑𝑖𝑠𝑡(𝑝, 𝑞) ≤ 𝜖
Following the deYinition of dense region, a point can be classiYied as a
1. Core Point if |𝑁𝜖 (p)|≥ MinPts (including p). The Core Points, as the name suggests, lie
usually within the interior of a cluster.
2. Border Point has fewer than MinPts within its 𝑁𝜖 , but it lies in the neighborhood of
another core point.
3. Noise is any data point that is neither core nor border point.
Advantages of DBSCAN:
1. DBSCAN is good at separating clusters of high density versus clusters of low density
within a given dataset.
2. It handles outliers well.
3. It can discover clusters of arbitrary shape.
4. EfYicient for large database
Disadvantages of DBSCAN:
1. Does not work well when dealing with clusters of varying densities. This is
because varying will either Yind cluster of high density or put many
clusters into one.
2. DBSCAN is sensitive to its parameters ϵ and MinPts. Steps of
Algorithm:
Following are the main steps of the algorithm:
1. All the data points are tagged as unprocessed. The algorithm starts with an arbitrary
unprocessed point for its neighborhood information from dataset.
2. If this point contains MinPts within ϵ neighborhood, cluster formation starts. Otherwise
the point is labelled as processed. This point may be later found within the ϵ
neighborhood of a different point and, thus can be made a part of the cluster. So we don’t
label it as noise.
3. If a point is found to be a core point then the points within the ϵ neighborhood are also
part of the cluster. So all the points found within ϵ neighborhood are added to queue Q.
4. Repeat this process for every point in Q until queue is empty. The core point is tagged as
‘core’ and processed. If the point from Q is not core tag, it is border point and processed.
In the last step, tag all the points which are not core or border as ‘noise’.
5. This completes evolution of one cluster. Repeat this process till all the clusters are
identiYied.
Required libraries:
import pandas as pd
from sklearn import preprocessing
import numpy as np
from sklearn import metrics # for evaluations
import matplotlib.pyplot as plt
%matplotlibinlineKMeans
Hierarchical Clustering
Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster
unlabelled data points. Like K-means clustering, hierarchical clustering also groups together
the data points with similar characteristics.
There are two types of hierarchical clustering: Agglomerative and Divisive. In the former, data
points are clustered using a bottom-up approach starting with individual data points as
individual clusters. The closest clusters are merged together as the clustering process evolves
till one big cluster is obtained. While in the latter top-down approach is followed where all
the data points are treated as one big cluster and the clustering process involves dividing the
one big cluster into several small clusters.
Steps for agglomerative clustering
1. Treat each data point as an individual cluster. Therefore, the number of clusters at the
start will be n, while n is an integer representing the number of data points.
2. Form a cluster by joining the two closest data points resulting in n-1 clusters.
3. Next two closest clusters resulting in n-2 clusters.
4. Repeat this step until one big cluster is formed.
5. Each merge will happen at an agglomerative level. Draw dendrogram of agglomerative
level vs no. of clusters. Cut the dendrogram to get right set of clusters.
import scipy.cluster.hierarchy as shc
plt.Yigure(Yigsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(X, method='single'))
Assignment Questions
1. Load the “s1_modiYied” dataset and perform standardization as pre-processing task.
2. DBSCAN Algorithm with Scikit-Learn using epsilon as 0.3 and min points as 50. Take
euclidean distance for calculating distance between points.
3. Plot the clusters.
4. Detect the outliers using DBSCAN
5. Import the required libraries for agglomerative clustering Hint: from sklearn.cluster
import AgglomerativeClustering
6. Load dataset into a data frame and perform pre-processing if required
7. Apply agglomerative clustering with single link
8. Plot the clusters 9. Plot the dendrograms.
10. Apply agglomerative clustering using complete link and wards method and plot their
dendrogram.