Comparison Of K-Means, DBSCAN, Gaussian Mixture
Using Traffic Dataset
Kavya¹, Nisha¹
Department of Computer Science, Mahatma Gandhi Memorial College, Udupi, 576102
e-mail: kavyashankarkulal@gmail.com, nishaamin0611@gmail.com
Abstract: This study compares three clustering algorithms, K-Means, Density-Based Spatial
Clustering of Applications with Noise (DBSCAN), and Gaussian Mixture Models (GMM),
applied to a traffic dataset containing the attributes datetime, junction, vehicles, and ID. The
main objective is to examine how well these algorithms cluster traffic data.
Data pre-processing includes standardizing the dataset. Clustering performance is measured
using the silhouette score, which evaluates cluster consistency, and scatter plots are used to
visualize the results.
The research identifies significant differences in the performance and results of the algorithms.
K-Means, a centroid-based method, uses the Elbow Method to determine the optimal number of
clusters. DBSCAN, a density-based method, is evaluated for its ability to handle noise and to
find clusters of varied shapes, with the epsilon value determined from a k-distance graph. GMM
treats data points as generated from overlapping Gaussian distributions and performs soft
clustering by assigning each data point a probability of belonging to each cluster. The
comparison reveals what each algorithm does best when clustering traffic data, along with the
strengths and weaknesses of each algorithm for traffic grouping.
Keywords: K-Means, DBSCAN, Gaussian Mixture Models, Clustering, Traffic Data, Silhouette
Score, Principal Component Analysis (PCA), Traffic Management.
______________________________________________________________________________
1.Introduction: Traffic demands have increased considerably due to rapid economic expansion
and urbanization. There is a growing contradiction between the expansion of urban traffic on the
one hand and the environment's carrying capacity and the restricted availability of urban
resources on the other. As a result, the issue of urban traffic congestion has grown considerably
worse. In addition to making daily travel and living more difficult, traffic congestion raises
exhaust emissions, fuel consumption, and noise pollution, and causes significant financial
losses. [1]
Among the common consequences of traffic jams, the most crucial is the delay of emergency
services such as medical emergency services, police, fire, and rescue operations. Undeniably,
community safety, individual lives, and the financial stability of institutions and the economy in
the event of incidents, criminal attacks, and robberies depend heavily on the efficient and timely
response of emergency vehicle services. According to road traffic statistics, the frequent and
increasing number of vehicle crashes is a very serious challenge. These crashes often occur due
to speeding in congested areas. The negative consequences of car accidents harm individuals,
groups, and the community as a whole, and worsen when emergency vehicles are involved in
crashes. Despite the many solutions employed, most large cities worldwide still suffer from
traffic jams. Understanding the different types of traffic problems will help researchers
contribute solutions that address existing problems or at least reduce their socio-economic
impact. Two major types of congestion can be identified: recurrent and non-recurrent. Recurrent
jams occur when many vehicles crowd very limited road capacity at the same time (e.g., during
peak hours). Non-recurrent jams usually happen due to irregular events such as Christmas,
public holidays, etc. [2]
Effective traffic management strategies require a comprehensive understanding of traffic
patterns, which can be achieved through data clustering. Clustering algorithms, such as K-Means
Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and
Gaussian Mixture Models (GMM), offer powerful tools to identify and group similar traffic
patterns from large datasets. By analyzing historical traffic data, these algorithms can reveal
hidden structures within the data, such as peak congestion periods, frequent bottlenecks, and
areas with irregular traffic flows. This information is crucial for traffic management authorities,
as it allows for more informed decision-making, targeted interventions, and ultimately, more
efficient use of road networks.
2.Methodology:
2.1 K-Means Clustering:
The k-means algorithm takes an input data set D and a parameter k (the number of clusters) and
partitions the n objects of D into k groups. The partition is made on the basis of similarity, so
that objects within a cluster are similar to one another and dissimilar to objects in other clusters;
in other words, there is high intra-cluster similarity and low inter-cluster similarity. Euclidean
distance is used to measure the distance between each data point and the cluster centroids.
Additionally, the Elbow Method is used to determine the optimal number of clusters by plotting
the within-cluster sum of squares against the number of clusters and identifying the "elbow
point" where the marginal gain drops. [3]
Algorithm:
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. Randomly choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based on the
mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;
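The steps above can be written out as a short, self-contained implementation (an illustrative numpy sketch, not the exact code used in the experiments):

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Plain k-means following the four steps above: random initial centers,
    assignment by Euclidean distance, mean update, repeat until no change."""
    rng = np.random.default_rng(seed)
    # 1. randomly choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # 3. (re)assign each object to the nearest cluster center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. update the cluster means (keep old center if a cluster empties)
        new_centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 5. until no change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points, the returned labels separate the groups and the centers converge to the group means.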
2.2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
The DBSCAN algorithm is based on density reachability and density connectivity, and uses two
parameters for clustering: the epsilon (ε) distance and the minimum number of points (MinPts).
Epsilon is the radius around a point that defines its neighborhood. If the ε-neighborhood of a
point q contains at least the minimum number of points, then q is a core point. All objects within
the ε distance of a core point q are density-reachable from q. Furthermore, points p and q are
density-connected if there exists a point o such that both p and q are density-reachable from o.
A cluster is defined as the set of points in the dataset that are density-connected to a particular
core point. Points that do not belong to any cluster are noise points. An object that lies within
the ε-neighborhood of a core point but whose own neighborhood contains fewer than the
minimum number of points is a border point. [4]
A graph, called the k-NN (k-distance) plot, is computed from the k-nearest-neighbor distances
of the data points: the distance from every point to its k-th nearest neighbor is computed, and
these k-distances are sorted in non-descending order and plotted. A sharp change is likely to
appear in this plot, and the point of sharp change corresponds to an appropriate value of
epsilon. [5]
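The values behind such a plot can be computed as follows (an illustrative numpy sketch; the brute-force pairwise distances are an assumption suitable only for small datasets):

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbour, sorted in
    non-descending order, as plotted in a k-NN (k-distance) graph."""
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # after sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest neighbour
    kth = np.sort(d, axis=1)[:, k]
    return np.sort(kth)
```

Epsilon is then read off at the sharp bend of the sorted curve.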
Working Algorithm:
Algorithm: DBSCAN: a density-based clustering algorithm.
Input:
D: a data set containing n objects,
epsilon: the radius parameter, and
MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
(1) mark all objects as unvisited;
(2) do
(3) randomly select an unvisited object p;
(4) mark p as visited;
(5) if the epsilon-neighborhood of p has at least MinPts objects
(6) create a new cluster C, and add p to C;
(7) let N be the set of objects in the epsilon-neighborhood of p;
(8) for each point p′ in N
(9) if p′ is unvisited
(10) mark p′ as visited;
(11) if the epsilon-neighborhood of p′ has at least MinPts points, add those points to N;
(12) if p′ is not yet a member of any cluster, add p′ to C;
(13) end for
(14) output C;
(15) else mark p as noise;
(16) until no object is unvisited; [3]
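The pseudocode above can be turned into a runnable sketch (an illustrative numpy implementation; the label -1 marks noise points):

```python
import numpy as np

def dbscan(D, eps, min_pts):
    """DBSCAN following the pseudocode above; returns one label per point,
    with -1 meaning noise."""
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # epsilon-neighborhood of each point (includes the point itself)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)               # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = -1
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:   # not a core point: noise for now
            continue                      # (may later be claimed as a border point)
        cluster += 1                      # create a new cluster C
        labels[p] = cluster
        seeds = list(neighbors[p])
        while seeds:
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    seeds.extend(neighbors[q])   # q is core: expand the cluster
            if labels[q] == -1:                  # border or core point joins C
                labels[q] = cluster
    return labels
```

On two dense groups plus a distant outlier, the sketch finds two clusters and marks the outlier as noise.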
2.3. Gaussian Mixture Clustering:
Gaussian Mixture Clustering groups data points into clusters using the Gaussian Mixture Model.
Gaussian Mixture Models (GMMs) are a probabilistic approach to clustering that assumes the
data is generated from a mixture of several Gaussian distributions, each representing a different
cluster. A GMM is a generative model, and it provides a way to model the underlying
distribution of the data in a dataset.
A Gaussian distribution is symmetric about its mean and is described by the mean and the
standard deviation. A GMM is an unsupervised clustering technique that forms ellipsoid-shaped
clusters based on probability density estimates obtained with the Expectation-Maximization
(EM) algorithm. Each cluster is modelled as a Gaussian distribution. Because each cluster is
described by a covariance matrix rather than only a mean, as in K-Means, GMMs can provide a
better quantitative measure of fitness per number of clusters. [6] Principal Component Analysis
(PCA) is used to reduce the dimensionality of the dataset to 6 principal components; this
simplifies the dataset and improves the clustering results.
The EM algorithm is used to find the maximum likelihood estimates of the parameters of a
GMM. It is a two-step iterative process. In the expectation step (E-step), the expected values of
the latent variables are calculated given the current parameter estimates. In the maximization
step (M-step), the parameters are updated so as to maximize the expected log-likelihood found
in the E-step. This process is repeated until convergence, i.e., until the changes in the parameter
values become sufficiently small.
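For intuition, the E-step and M-step can be written out for a one-dimensional mixture (an illustrative sketch; the experiments use multivariate data, and the evenly spaced initial means and fixed iteration count are assumptions):

```python
import numpy as np

def gmm_em(x, k, n_iter=200):
    """EM for a one-dimensional Gaussian mixture: returns mixing weights,
    means, and variances after n_iter iterations."""
    mu = np.linspace(x.min(), x.max(), k)   # evenly spaced initial means
    var = np.full(k, x.var())               # broad initial variances
    pi = np.full(k, 1.0 / k)                # uniform mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        pdf = (pi / np.sqrt(2 * np.pi * var)) * \
              np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-6)         # guard against variance collapse
    return pi, mu, var
```

On a sample drawn from two well-separated Gaussians, the estimated means converge close to the true component means.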
3.Experimental Results and Discussions:
Dataset description:
The traffic dataset used in this study was obtained from Kaggle. The dataset, titled "Traffic
Prediction Dataset," is available at https://www.kaggle.com/datasets/fedesoriano/traffic-
prediction-dataset. It includes the attributes datetime, junction, vehicles, and ID. Pre-processing
involved converting the datetime attribute into numerical features (year, month, day, hour) and
standardizing the dataset. This dataset is used to analyze traffic patterns with the three clustering
algorithms.
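The pre-processing step can be sketched as follows (illustrative only; the row layout and timestamp format are assumptions about the CSV file):

```python
import numpy as np
from datetime import datetime

def preprocess(rows):
    """Expand each (datetime string, junction, vehicles) row into
    (year, month, day, hour, junction, vehicles) and z-score every column."""
    feats = []
    for dt_str, junction, vehicles in rows:
        dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M:%S")
        feats.append([dt.year, dt.month, dt.day, dt.hour, junction, vehicles])
    M = np.array(feats, dtype=float)
    # standardize every column to zero mean and unit variance
    std = M.std(axis=0)
    std[std == 0] = 1.0        # guard for constant columns (e.g. a single year)
    return (M - M.mean(axis=0)) / std
```

After this step every feature is on a comparable scale, which matters for the Euclidean distances used by all three algorithms.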
K-Means Clustering:
Parameter Selection:
o Number of clusters (k): The optimal number of clusters is determined by the Elbow Method,
which plots the within-cluster sum of squares (WCSS) against the number of clusters and
identifies the point (the elbow point) where the gain drops significantly. Figure 3.1 shows the
Elbow Method plot for the traffic dataset; the WCSS decreases sharply up to k = 2, after which
the gain levels off.
o Distance measure: Euclidean distance is used to measure the distance between data points
and cluster centroids.
Figure 3.1 Elbow method plot
K-Means clustering is initialized to generate 2 clusters, the optimal number determined by the
Elbow Method. In the Elbow Method, the number of clusters is plotted against the within-cluster
sum of squares (WCSS), and the optimal number of clusters is found at the "elbow point,"
where the rate of decrease in WCSS slows down significantly.
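The curve behind the Elbow Method can be obtained by running k-means for each candidate k and recording the WCSS (an illustrative numpy sketch, not the code used to produce Figure 3.1):

```python
import numpy as np

def wcss(D, k, n_init=10, seed=0):
    """Best within-cluster sum of squares over several random initialisations
    of a simple k-means (Lloyd) loop."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        centers = D[rng.choice(len(D), k, replace=False)]
        for _ in range(50):
            labels = np.linalg.norm(D[:, None, :] - centers[None, :, :],
                                    axis=2).argmin(axis=1)
            centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        best = min(best, float(((D - centers[labels]) ** 2).sum()))
    return best

# Elbow curve: evaluate wcss(D, k) for k = 1, 2, 3, ... and look for the bend.
```

Plotting wcss(D, k) against k reproduces the shape of the elbow plot: a steep drop followed by diminishing returns.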
Euclidean distance is used to measure the distance between points and centroids. The clustering
shows a consistent pattern of lower traffic during the early hours of the day and late at night,
with distinct peaks in the morning and evening. This consistency helps in planning traffic
management strategies and resource allocation. The plot indicates distinct clusters, with large
vehicle counts during peak hours and smaller counts during off-peak hours. The yellow clusters
likely represent high-traffic periods, while the dark clusters represent lower-traffic periods.
Figure 3.2 K-Means Clustering based on hour and number of vehicles
By analyzing the clustering results, we can conclude that traffic is heaviest during typical
commuting hours (morning and evening rush hours) and lightest during late-night and early-
morning hours. This information is crucial for urban planning, traffic management, and
optimizing public transportation schedules.
The Silhouette score is 0.224.
Time taken for K-Means clustering execution: 0.79 seconds.
DBSCAN(Density-Based Spatial Clustering of Applications with Noise):
Parameter Selection :
In the DBSCAN algorithm, the epsilon value is determined from a k-NN distance plot, which
plots the sorted distance of each point to its k-th nearest neighbor against the point index. The
"elbow" point of this plot, where the distance increases sharply, indicates a good value of
epsilon. In our case, the epsilon value was observed to be 0.3.
Figure 3.3 k-NN Distance plot to determine epsilon value.
A suitable value of MinPts is given by the heuristic k = 2 * n_dimensions, where `n_dimensions`
is the number of features selected. In this case, we use 13 as our MinPts.
We then applied the DBSCAN algorithm and recorded the number of clusters, the number of
noise points, and the execution time. The Silhouette score measures how similar an object is to
its own cluster compared to other clusters. The value ranges from -1 to 1, where -1 indicates
poor clustering and +1 indicates good clustering.
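The silhouette score can be computed directly from its definition (an illustrative sketch; it assumes at least two clusters and does not include any special handling of DBSCAN's noise label):

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient: for each point, (b - a) / max(a, b),
    where a is the mean intra-cluster distance and b is the mean distance
    to the nearest other cluster."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() < 2:
            scores.append(0.0)        # convention for singleton clusters
            continue
        a = d[i, same].sum() / (same.sum() - 1)
        b = min(d[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Well-separated clusters give a score close to +1, overlapping clusters a score near 0, and misassigned points push it toward -1.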
Figure 3.4 Estimated number of clusters
The observed results are:
- Estimated number of clusters: 93. There are quite a large number of distinct groups in the
dataset.
- Estimated number of noise points: 7107. A large number of points did not fit into any cluster.
- Time taken for DBSCAN clustering execution: 3.016 seconds, so the computation is not
prohibitively expensive.
- Silhouette Score: -0.082, indicating poor clustering performance.
Analyzing the characteristics of the identified clusters reveals that they are not well-
separated. The absence of meaningful traffic patterns—such as consistent traffic flows during
specific times or in particular areas—confirms that DBSCAN is not effective in this context.
Gaussian Mixture:
Parameter Selection:
Principal Component Analysis (PCA) is a technique that reduces the number of features
(dimensions) of the data while preserving as much information as possible. The original data is
transformed into a set of new, uncorrelated variables called principal components. PCA is often
applied before GMM clustering: by transforming the data into a lower-dimensional space, it
simplifies the problem for the GMM, making clusters easier and faster to identify.
Number of Principal Components: To determine the optimal number of principal components,
we perform PCA. Each principal component has an explained variance ratio, which indicates
the proportion of the total variance that the component explains. The cumulative explained
variance is calculated to determine how much of the total variance is explained as more
components are included. We choose the smallest number of components that captures most of
the cumulative variance.
Figure 3.5 shows the cumulative explained variance as a function of the number of principal
components, from which we determine the number of components for the GMM clustering.
Figure 3.5 Plot showing the optimal number of principal components
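Selecting the number of components from the cumulative explained variance can be sketched as follows (illustrative; the 95% threshold is an assumption, not the criterion used in this paper):

```python
import numpy as np

def n_components_for(X, threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches the given threshold."""
    Xc = X - X.mean(axis=0)
    # singular values of the centered data give the variance along each axis
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = s ** 2 / (s ** 2).sum()        # explained variance ratio per component
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)
```

On data whose variance lives in a low-dimensional subspace, the function recovers that intrinsic dimension.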
In Gaussian Mixture clustering, the dimensionality is reduced to six principal components with
PCA. This step makes the data easier to handle while retaining most of the variance. At the heart
of this process is the fitting of a GMM with two components to the PCA-transformed data. The
GMM gives a probabilistic view of clustering, in which the data are assumed to be generated
from a mixture of Gaussian distributions, one per cluster; the EM algorithm updates the
parameters iteratively until convergence.
The results we observed:
Figure 3.6 Gaussian Mixture Clustering with Silhouette score
- Time taken for Gaussian Mixture clustering execution: 0.59 seconds. This indicates that the
dataset size and complexity were well handled by the algorithm, likely aided by the
dimensionality reduction step.
- Silhouette score: 0.27. A score close to 1 indicates that clusters are very well separated, a
score near 0 indicates overlapping or poorly separated clusters, and a score near -1 indicates
misassigned points. Hence, a silhouette score of 0.27 indicates fairly distinct clustering with
some overlap: not perfect, but reasonably good.
____________________________________________________________________________
4.Conclusion:
The results exhibit significant differences in the performance and effectiveness of the clustering
algorithms. K-Means, which is a centroid-based method, is efficient but may struggle with non-
globular clusters. DBSCAN excels in identifying clusters of arbitrary shapes and handling noise
but is sensitive to the choice of epsilon and minimum samples. GMM provides a flexible
probabilistic approach but can be computationally intensive.
This study provides a comparative analysis of K-Means, DBSCAN, and Gaussian Mixture
Models on traffic data clustering. Each algorithm has its own strengths and weaknesses, and the
choice of algorithm depends on the specific requirements of the traffic management application.
Future work could explore hybrid approaches and real-time clustering for dynamic traffic
management. In this paper, we compared K-Means clustering, DBSCAN, and a Gaussian
Mixture Model on a traffic dataset. Among them, GMM was found to be the best because it
balanced clustering quality against execution time; its higher Silhouette Score corresponds to a
better capture of the data structure than the other methods. DBSCAN performed poorly due to
its sensitivity to parameters, leading to a negative Silhouette Score and many noise points.
K-Means was the fastest algorithm but not as effective as GMM due to its assumption of
spherical clusters. Overall, GMM is recommended for traffic management applications because
of its balance of accuracy and execution time.
_________________________________________________________________________
5.References:
[1] Pei Y., Cai X., Li J., Song K., & Liu R. (2021, August). Method for identifying the traffic
congestion situation of the main road in cold-climate cities based on the clustering analysis
algorithm.
[2] Doreswamy, & Ghoneim O. A. (2018). Traffic jams detection and congestion avoidance in
smart city using parallel K-Means clustering algorithm.
[3] Han J., Kamber M., & Pei J. (2011). Data mining: Concepts and techniques.
[4] Ester M., Kriegel H.-P., Sander J., & Xu X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. Institute for Computer Science,
University of Munich.
[5] Kurumalla S., & Rao P. S. (2016, October). K-nearest neighbor based DBSCAN clustering
algorithm for image segmentation.
[6] Patel E., & Kushwaha D. S. (2020). Clustering cloud workloads: K-Means vs Gaussian