Comparison Of K-Means, DBSCAN, Gaussian Mixture
Using Traffic Dataset
Kavya¹, Nisha¹
Department of Computer Science, Mahatma Gandhi Memorial College, Udupi, 576102
e-mail: kavyashankarkulal@gmail.com, nishaamin0611@gmail.com
Abstract: This study compares three clustering algorithms, K-Means, Density-Based Spatial
Clustering of Applications with Noise (DBSCAN), and Gaussian Mixture Models (GMM),
applied to a traffic dataset containing the attributes datetime, junction, vehicles, and ID. The
main objective is to examine how well these algorithms cluster traffic data.
Data pre-processing includes standardizing the dataset. Clustering performance is measured
using the silhouette score, which evaluates cluster consistency, and scatter plots are used to
visualize the results.
The research identifies significant differences in the performance and results of the algorithms.
K-Means, a centroid-based method, uses the Elbow Method to determine the optimal number of
clusters. DBSCAN, a density-based method, is evaluated for its ability to handle noise and to
find clusters of varied shapes, with the epsilon value determined from a k-distance graph. GMM
treats data points as generated from overlapping Gaussian distributions and performs soft
clustering by assigning each data point a probability of belonging to each cluster. The
comparison reveals what each algorithm does best when clustering traffic data, along with the
strengths and weaknesses of each algorithm for traffic grouping.
Keywords: K-Means, DBSCAN, Gaussian Mixture Models, Clustering, Traffic Data, Silhouette
Score, Principal Component Analysis (PCA), Traffic Management.
______________________________________________________________________________
1.Introduction: Traffic demands have increased considerably due to rapid economic expansion
and urbanization. There is a growing contradiction between the expansion of urban traffic on the
one hand and the environment's carrying capacity and the restricted availability of urban
resources on the other. As a result, the issue of urban traffic congestion has grown considerably
worse. In addition to making daily travel and living more difficult, traffic congestion raises
exhaust emissions, fuel consumption, and noise pollution, and causes significant financial
losses. [1]
Among the common consequences of traffic jams, the most crucial is the delay of emergency
services such as medical emergency services, police, fire, and rescue operations. Undeniably,
community safety, individual lives, and the financial stability of institutions and the economy in
the event of incidents, criminal attacks, and robberies depend heavily on the efficient and timely
response of emergency vehicle services. According to road traffic statistics, the frequent and
increasing number of vehicle crashes is a very serious challenge. These crashes often occur due
to speeding in congested areas. The negative consequences of car accidents harm individuals,
groups, and the community as a whole, and worsen when emergency vehicles are involved in
crashes. Despite the many solutions employed, most large cities worldwide still suffer from
traffic jams. Understanding the different types of traffic problems will help researchers
contribute solutions that address existing problems or at least reduce their socio-economic
impact. Two major types of congestion can be identified: recurrent and non-recurrent. Recurrent
jams occur when many vehicles crowd very limited road capacity at the same time (e.g., during
peak hours). Non-recurrent jams usually happen due to irregular events such as Christmas,
public holidays, etc. [2]
Effective traffic management strategies require a comprehensive understanding of traffic
patterns, which can be achieved through data clustering. Clustering algorithms, such as K-Means
Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and
Gaussian Mixture Models (GMM), offer powerful tools to identify and group similar traffic
patterns from large datasets. By analyzing historical traffic data, these algorithms can reveal
hidden structures within the data, such as peak congestion periods, frequent bottlenecks, and
areas with irregular traffic flows. This information is crucial for traffic management authorities,
as it allows for more informed decision-making, targeted interventions, and ultimately, more
efficient use of road networks.
2.Methodology:
2.1 K-Means Clustering:
The k-means algorithm takes an input data set D and a parameter k (the number of clusters) and
partitions the n objects of D into k groups. The partition is made on the basis of similarity, so
that objects within a cluster are similar to one another and dissimilar to objects in other clusters;
in other words, there is high intra-cluster similarity and low inter-cluster similarity. Euclidean
distance is used to measure the distance between each data point and the cluster centroids.
Additionally, the Elbow Method is used to determine the optimal number of clusters by plotting
the within-cluster sum of squares against the number of clusters and identifying the "elbow
point" where the marginal gain drops. [3]
Algorithm:
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. Randomly choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based on the
mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;
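The steps above can be written out as a short, self-contained implementation (an illustrative numpy sketch, not the exact code used in the experiments):

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Plain k-means following the four steps above: random initial centers,
    assignment by Euclidean distance, mean update, repeat until no change."""
    rng = np.random.default_rng(seed)
    # 1. randomly choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # 3. (re)assign each object to the nearest cluster center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. update the cluster means (keep old center if a cluster empties)
        new_centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 5. until no change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points, the returned labels separate the groups and the centers converge to the group means.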
2.2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
The DBSCAN algorithm is based on density reachability and density connectivity, and uses two
parameters for clustering: the epsilon (ε) distance and the minimum number of points (MinPts).
Epsilon is the radius around a point that defines its neighborhood. If the ε-neighborhood of a
point q contains at least the minimum number of points, then q is a core point. All objects within
the ε distance of a core point q are density-reachable from q. Furthermore, points p and q are
density-connected if there exists a point o such that both p and q are density-reachable from o.
A cluster is defined as the set of points in the dataset that are density-connected to a particular
core point. Points that do not belong to any cluster are noise points. An object that lies within
the ε-neighborhood of a core point but whose own neighborhood contains fewer than the
minimum number of points is a border point. [4]
A graph, called the k-NN (k-distance) plot, is computed from the k-nearest-neighbor distances
of the data points: the distance from every point to its k-th nearest neighbor is computed, and
these k-distances are sorted in non-descending order and plotted. A sharp change is likely to
appear in this plot, and the point of sharp change corresponds to an appropriate value of
epsilon. [5]
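The values behind such a plot can be computed as follows (an illustrative numpy sketch; the brute-force pairwise distances are an assumption suitable only for small datasets):

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbour, sorted in
    non-descending order, as plotted in a k-NN (k-distance) graph."""
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # after sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest neighbour
    kth = np.sort(d, axis=1)[:, k]
    return np.sort(kth)
```

Epsilon is then read off at the sharp bend of the sorted curve.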
Working Algorithm:
Algorithm: DBSCAN: a density-based clustering algorithm.
Input:
D: a data set containing n objects,
epsilon: the radius parameter, and
MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
(1) mark all objects as unvisited;
(2) do
(3) randomly select an unvisited object p;
(4) mark p as visited;
(5) if the epsilon-neighborhood of p has at least MinPts objects
(6) create a new cluster C, and add p to C;
(7) let N be the set of objects in the epsilon-neighborhood of p;
(8) for each point p′ in N
(9) if p′ is unvisited
(10) mark p′ as visited;
(11) if the epsilon-neighborhood of p′ has at least MinPts points, add those points to N;
(12) if p′ is not yet a member of any cluster, add p′ to C;
(13) end for
(14) output C;
(15) else mark p as noise;
(16) until no object is unvisited; [3]
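The pseudocode above can be turned into a runnable sketch (an illustrative numpy implementation; the label -1 marks noise points):

```python
import numpy as np

def dbscan(D, eps, min_pts):
    """DBSCAN following the pseudocode above; returns one label per point,
    with -1 meaning noise."""
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # epsilon-neighborhood of each point (includes the point itself)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)               # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = -1
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:   # not a core point: noise for now
            continue                      # (may later be claimed as a border point)
        cluster += 1                      # create a new cluster C
        labels[p] = cluster
        seeds = list(neighbors[p])
        while seeds:
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    seeds.extend(neighbors[q])   # q is core: expand the cluster
            if labels[q] == -1:                  # border or core point joins C
                labels[q] = cluster
    return labels
```

On two dense groups plus a distant outlier, the sketch finds two clusters and marks the outlier as noise.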
2.3. Gaussian Mixture Clustering:
Gaussian Mixture Clustering groups data points into clusters using the Gaussian Mixture Model.
Gaussian Mixture Models (GMMs) are a probabilistic approach to clustering that assumes the
data is generated from a mixture of several Gaussian distributions, each representing a different
cluster. A GMM is a generative model, and it provides a way to model the underlying
distribution of the data in a dataset.
A Gaussian distribution is symmetric about its mean and is described by the mean and the
standard deviation. A GMM is an unsupervised clustering technique that forms ellipsoid-shaped
clusters based on probability density estimates obtained with the Expectation-Maximization
(EM) algorithm. Each cluster is modelled as a Gaussian distribution. Because each cluster is
described by a covariance matrix rather than only a mean, as in K-Means, GMMs can provide a
better quantitative measure of fitness per number of clusters. [6] Principal Component Analysis
(PCA) is used to reduce the dimensionality of the dataset to 6 principal components; this
simplifies the dataset and improves the clustering results.
The EM algorithm is used to find the maximum likelihood estimates of the parameters of a
GMM. It is a two-step iterative process. In the expectation step (E-step), the expected values of
the latent variables are calculated given the current parameter estimates. In the maximization
step (M-step), the parameters are updated so as to maximize the expected log-likelihood found
in the E-step. This process is repeated until convergence, i.e., until the changes in the parameter
values become sufficiently small.
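For intuition, the E-step and M-step can be written out for a one-dimensional mixture (an illustrative sketch; the experiments use multivariate data, and the evenly spaced initial means and fixed iteration count are assumptions):

```python
import numpy as np

def gmm_em(x, k, n_iter=200):
    """EM for a one-dimensional Gaussian mixture: returns mixing weights,
    means, and variances after n_iter iterations."""
    mu = np.linspace(x.min(), x.max(), k)   # evenly spaced initial means
    var = np.full(k, x.var())               # broad initial variances
    pi = np.full(k, 1.0 / k)                # uniform mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        pdf = (pi / np.sqrt(2 * np.pi * var)) * \
              np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-6)         # guard against variance collapse
    return pi, mu, var
```

On a sample drawn from two well-separated Gaussians, the estimated means converge close to the true component means.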
3.Experimental Results and Discussions:
Dataset description:
The traffic dataset used in this study was obtained from Kaggle. The dataset, titled "Traffic
Prediction Dataset," is available at https://www.kaggle.com/datasets/fedesoriano/traffic-
prediction-dataset. It includes the attributes datetime, junction, vehicles, and ID. Pre-processing
involved converting the datetime attribute into numerical features (year, month, day, hour) and
standardizing the dataset. This dataset is used to analyze traffic patterns with the three clustering
algorithms.
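The pre-processing step can be sketched as follows (illustrative only; the row layout and timestamp format are assumptions about the CSV file):

```python
import numpy as np
from datetime import datetime

def preprocess(rows):
    """Expand each (datetime string, junction, vehicles) row into
    (year, month, day, hour, junction, vehicles) and z-score every column."""
    feats = []
    for dt_str, junction, vehicles in rows:
        dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M:%S")
        feats.append([dt.year, dt.month, dt.day, dt.hour, junction, vehicles])
    M = np.array(feats, dtype=float)
    # standardize every column to zero mean and unit variance
    std = M.std(axis=0)
    std[std == 0] = 1.0        # guard for constant columns (e.g. a single year)
    return (M - M.mean(axis=0)) / std
```

After this step every feature is on a comparable scale, which matters for the Euclidean distances used by all three algorithms.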
K-Means Clustering:
Parameter Selection:
o Number of clusters (k): The optimal number of clusters is determined by the Elbow Method,
which plots the within-cluster sum of squares (WCSS) against the number of clusters and
identifies the point (the elbow point) where the gain drops significantly. Figure 3.1 shows the
Elbow Method plot for the traffic dataset; the WCSS decreases sharply up to k = 2, after which
the gain levels off.
o Distance measure: Euclidean distance is used to measure the distance between data points
and cluster centroids.
Figure 3.1 Elbow method plot
K-Means clustering is initialized to generate 2 clusters, the optimal number determined by the
Elbow Method. In the Elbow Method, the number of clusters is plotted against the within-cluster
sum of squares (WCSS), and the optimal number of clusters is found at the "elbow point,"
where the rate of decrease in WCSS slows down significantly.
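The curve behind the Elbow Method can be obtained by running k-means for each candidate k and recording the WCSS (an illustrative numpy sketch, not the code used to produce Figure 3.1):

```python
import numpy as np

def wcss(D, k, n_init=10, seed=0):
    """Best within-cluster sum of squares over several random initialisations
    of a simple k-means (Lloyd) loop."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        centers = D[rng.choice(len(D), k, replace=False)]
        for _ in range(50):
            labels = np.linalg.norm(D[:, None, :] - centers[None, :, :],
                                    axis=2).argmin(axis=1)
            centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        best = min(best, float(((D - centers[labels]) ** 2).sum()))
    return best

# Elbow curve: evaluate wcss(D, k) for k = 1, 2, 3, ... and look for the bend.
```

Plotting wcss(D, k) against k reproduces the shape of the elbow plot: a steep drop followed by diminishing returns.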
Euclidean distance is used to measure the distance between points and centroids. The clustering
shows a consistent pattern of lower traffic during the early hours of the day and late at night,
with distinct peaks in the morning and evening. This consistency helps in planning traffic
management strategies and resource allocation. The plot indicates distinct clusters, with large
vehicle counts during peak hours and smaller counts during off-peak hours. The yellow clusters
likely represent high-traffic periods, while the dark clusters represent lower-traffic periods.
Figure 3.2 K-Means Clustering based on hour and number of vehicles
By analyzing the clustering results, we can conclude that traffic is heaviest during typical
commuting hours (morning and evening rush hours) and lightest during late-night and early-
morning hours. This information is crucial for urban planning, traffic management, and
optimizing public transportation schedules.
The Silhouette score is 0.224.
Time taken for K-Means clustering execution: 0.79 seconds.
DBSCAN(Density-Based Spatial Clustering of Applications with Noise):
Parameter Selection :
In the DBSCAN algorithm, the epsilon value is determined from a k-NN distance plot, which
plots the sorted distance of each point to its k-th nearest neighbor against the point index. The
"elbow" point of this plot, where the distance increases sharply, indicates a good value of
epsilon. In our case, the epsilon value was observed to be 0.3.
Figure 3.3 k-NN Distance plot to determine epsilon value.
A suitable value of MinPts is given by the heuristic k = 2 * n_dimensions, where `n_dimensions`
is the number of features selected. In this case, we use 13 as our MinPts.
We then applied the DBSCAN algorithm and recorded the number of clusters, the number of
noise points, and the execution time. The Silhouette score measures how similar an object is to
its own cluster compared to other clusters. The value ranges from -1 to 1, where -1 indicates
poor clustering and +1 indicates good clustering.
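The silhouette score can be computed directly from its definition (an illustrative sketch; it assumes at least two clusters and does not include any special handling of DBSCAN's noise label):

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient: for each point, (b - a) / max(a, b),
    where a is the mean intra-cluster distance and b is the mean distance
    to the nearest other cluster."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() < 2:
            scores.append(0.0)        # convention for singleton clusters
            continue
        a = d[i, same].sum() / (same.sum() - 1)
        b = min(d[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Well-separated clusters give a score close to +1, overlapping clusters a score near 0, and misassigned points push it toward -1.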
Figure 3.4 Estimated number of clusters
The observed results are:
- Estimated number of clusters: 93. There are quite a large number of distinct groups in the
dataset.
- Estimated number of noise points: 7107. A large number of points did not fit into any cluster.
- Time taken for DBSCAN clustering execution: 3.016 seconds, so the computation is not
prohibitively expensive.
- Silhouette Score: -0.082, indicating poor clustering performance.
Analyzing the characteristics of the identified clusters reveals that they are not well-
separated. The absence of meaningful traffic patterns—such as consistent traffic flows during
specific times or in particular areas—confirms that DBSCAN is not effective in this context.
Gaussian Mixture:
Parameter Selection:
Principal Component Analysis (PCA) is a technique that reduces the number of features
(dimensions) of the data while preserving as much information as possible. The original data is
transformed into a set of new, uncorrelated variables called principal components. PCA is often
applied before GMM clustering: by transforming the data into a lower-dimensional space, it
simplifies the problem for the GMM, making clusters easier and faster to identify.
Number of Principal Components: To determine the optimal number of principal components,
we perform PCA. Each principal component has an explained variance ratio, which indicates
the proportion of the total variance that the component explains. The cumulative explained
variance is calculated to determine how much of the total variance is explained as more
components are included. We choose the smallest number of components that captures most of
the cumulative variance.
Figure 3.5 shows the cumulative explained variance as a function of the number of principal
components, from which we determine the number of components for the GMM clustering.
Figure 3.5 Plot showing the optimal number of principal components
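Selecting the number of components from the cumulative explained variance can be sketched as follows (illustrative; the 95% threshold is an assumption, not the criterion used in this paper):

```python
import numpy as np

def n_components_for(X, threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches the given threshold."""
    Xc = X - X.mean(axis=0)
    # singular values of the centered data give the variance along each axis
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = s ** 2 / (s ** 2).sum()        # explained variance ratio per component
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)
```

On data whose variance lives in a low-dimensional subspace, the function recovers that intrinsic dimension.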
In Gaussian Mixture clustering, the dimensionality is reduced to six principal components with
PCA. This step makes the data easier to handle while retaining most of the variance. At the heart
of this process is the fitting of a GMM with two components to the PCA-transformed data. The
GMM gives a probabilistic view of clustering, in which the data are assumed to be generated
from a mixture of Gaussian distributions, one per cluster; the EM algorithm updates the
parameters iteratively until convergence.
The results we observed:
Figure 3.6 Gaussian Mixture Clustering with Silhouette score
- Time taken for Gaussian Mixture clustering execution: 0.59 seconds. This indicates that the
dataset size and complexity were well handled by the algorithm, likely aided by the
dimensionality reduction step.
- Silhouette score: 0.27. A score close to 1 indicates that clusters are very well separated, a
score near 0 indicates overlapping or poorly separated clusters, and a score near -1 indicates
misassigned points. Hence, a silhouette score of 0.27 indicates fairly distinct clustering with
some overlap: not perfect, but reasonably good.
____________________________________________________________________________
4.Conclusion:
The results exhibit significant differences in the performance and effectiveness of the clustering
algorithms. K-Means, which is a centroid-based method, is efficient but may struggle with non-
globular clusters. DBSCAN excels in identifying clusters of arbitrary shapes and handling noise
but is sensitive to the choice of epsilon and minimum samples. GMM provides a flexible
probabilistic approach but can be computationally intensive.
This study provides a comparative analysis of K-Means, DBSCAN, and Gaussian Mixture
Models on traffic data clustering. Each algorithm has its own strengths and weaknesses, and the
choice of algorithm depends on the specific requirements of the traffic management application.
Future work could explore hybrid approaches and real-time clustering for dynamic traffic
management. In this paper, we compared K-Means clustering, DBSCAN, and a Gaussian
Mixture Model on a traffic dataset. Among them, GMM was found to be the best because it
balanced clustering quality against execution time; its higher Silhouette Score corresponds to a
better capture of the data structure than the other methods. DBSCAN performed poorly due to
its sensitivity to parameters, leading to a negative Silhouette Score and many noise points.
K-Means was the fastest algorithm but not as effective as GMM due to its assumption of
spherical clusters. Overall, GMM is recommended for traffic management applications because
of its balance of accuracy and execution time.
_________________________________________________________________________
5.References:
[1] Pei Y., Cai X., Li J., Song K., & Liu R. (2021, August). Method for identifying the traffic
congestion situation of the main road in cold-climate cities based on the clustering analysis
algorithm.
[2] Doreswamy, & Ghoneim O. A. (2018). Traffic jams detection and congestion avoidance in
smart city using parallel K-Means clustering algorithm.
[3] Han J., Kamber M., & Pei J. (2011). Data mining: Concepts and techniques.
[4] Ester M., Kriegel H.-P., Sander J., & Xu X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. Institute for Computer Science,
University of Munich.
[5] Kurumalla S., & Rao P. S. (2016, October). K-nearest neighbor based DBSCAN clustering
algorithm for image segmentation.
[6] Patel E., & Kushwaha D. S. (2020). Clustering cloud workloads: K-Means vs Gaussian