K-Means Clustering in Python: A Practical Guide — Real Python
What Is Clustering?
Clustering is a set of techniques used to partition data into groups, or
clusters. Clusters are loosely defined as groups of data objects that
are more similar to other objects in their cluster than they are to data
objects in other clusters. In practice, clustering helps identify two
qualities of data:
1. Meaningfulness
2. Usefulness
Meaningful clusters expand domain knowledge. For example, in the
medical field, researchers applied clustering to gene expression
experiments. The clustering results identified groups of patients who
respond differently to medical treatments.
Useful clusters, on the other hand, serve as an intermediate step in a
data pipeline. For example, businesses use clustering for customer
segmentation. The clustering results segment customers into groups
with similar purchase histories, which businesses can then use to
create targeted advertising campaigns.
Note: You'll learn about unsupervised machine learning techniques in
this tutorial. If you're interested in learning more about supervised
machine learning techniques, then check out Logistic Regression in
Python.
There are many other applications of clustering, such as document
clustering and social network analysis. These applications are relevant
in nearly every industry, making clustering a valuable skill for
professionals working with data in any field.
Density-Based Clustering
Density-based clustering determines cluster assignments based on
the density of data points in a region. Clusters are assigned where
there are high densities of data points separated by low-density
regions.
Unlike the other clustering categories, this approach doesn't require
the user to specify the number of clusters. Instead, there is a distance-
based parameter that acts as a tunable threshold. This threshold
determines how close points must be to each other to be considered
cluster members.

Examples of density-based clustering algorithms include Density-Based
Spatial Clustering of Applications with Noise, or DBSCAN, and Ordering
Points To Identify the Clustering Structure, or OPTICS. A short DBSCAN
sketch follows the lists below.
The strengths of density-based clustering methods include the
following:
* They excel at identifying clusters of nonspherical shapes.
* They're resistant to outliers.
The weaknesses of density-based clustering methods include the
following:
* They aren't well suited for clustering in high-dimensional spaces.
* They have trouble identifying clusters of varying densities.
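The original tutorial doesn't include density-based code, but here's a minimal
sketch of DBSCAN from scikit-learn on an interleaved half-moons dataset. The
eps and min_samples values are illustrative assumptions, not taken from the
article:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a nonspherical shape that density-based
# clustering handles well (illustrative data, not the tutorial's dataset)
moons, _ = make_moons(n_samples=250, noise=0.05, random_state=42)

# eps is the distance-based threshold described above, and min_samples
# is the minimum neighborhood size for a point to count as a core point
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(moons)

# Points labeled -1 were treated as noise (outliers)
print(set(dbscan.labels_))

Note that you never pass a number of clusters to DBSCAN; the cluster count
falls out of the density threshold.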
How to Perform K-Means Clustering in Python
In this section, you'll take a step-by-step tour of the conventional
version of the k-means algorithm. Understanding the details of the
algorithm is a fundamental step in the process of writing your k-means
clustering pipeline in Python. What you learn in this section will help
you decide if k-means is the right choice to solve your clustering
problem.
Understanding the K-Means Algorithm
Conventional k-means requires only a few steps. The first step is to
randomly select k centroids, where k is equal to the number of clusters
you choose. Centroids are data points representing the center of a
cluster.

The main element of the algorithm works by a two-step process called
expectation-maximization. The expectation step assigns each data
point to its nearest centroid. Then, the maximization step computes
the mean of all the points for each cluster and sets the new centroid.
Here's what the conventional version of the k-means algorithm looks
like:
Algorithm 1: k-means algorithm
1: Specify the number k of clusters to assign.
2: Randomly initialize k centroids.
3: repeat
4:     expectation: Assign each point to its closest centroid.
5:     maximization: Compute the new centroid (mean) of each cluster.
6: until The centroid positions do not change.
The quality of the cluster assignments is determined by computing the
sum of the squared error (SSE) after the centroids converge, or match
the previous iteration’s assignment. The SSE is defined as the sum of
the squared Euclidean distances of each point to its closest centroid.
Since this is a measure of error, the objective of k-means is to try to
minimize this value.
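As a quick illustration, not part of the original tutorial, here's how you
could compute the SSE by hand with NumPy for a handful of made-up points and
centroids:

import numpy as np

# Made-up points and centroids, purely for illustration
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])
centroids = np.array([[1.2, 1.5], [6.5, 8.0]])

# Distance from every point to every centroid: shape (n_points, n_centroids)
distances = np.linalg.norm(points[:, np.newaxis, :] - centroids, axis=2)

# Each point contributes the squared distance to its closest centroid
sse = np.sum(distances.min(axis=1) ** 2)
print(sse)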
The figure below shows the centroids and SSE updating through the
first five iterations from two different runs of the k-means algorithm on
the same dataset:

[Figure: the first k-means iteration for two different centroid
initializations on the same data, with the SSE shown for each run
(Initialization #1: SSE 6735.8; Initialization #2: SSE 5238.5).]
The purpose of this figure is to show that the initialization of the
centroids is an important step. It also highlights the use of SSE as a
measure of clustering performance. After choosing a number of
clusters and the initial centroids, the expectation-maximization step is
repeated until the centroid positions reach convergence and are
unchanged.
The random initialization step causes the k-means algorithm to be
nondeterministic, meaning that cluster assignments will vary if you run
the same algorithm twice on the same dataset. Researchers
commonly run several initializations of the entire k-means algorithm
and choose the cluster assignments from the initialization with the
lowest SSE.
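To make that concrete, here's a sketch, not from the original article, of
running several single-initialization fits with scikit-learn and keeping the
one with the lowest SSE. The n_init parameter you'll meet below automates
exactly this:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data (these values are assumptions)
features, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Run k-means several times with one random initialization each and
# keep the fit with the lowest SSE, which scikit-learn exposes as inertia_
best_fit = None
for seed in range(10):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    km.fit(features)
    if best_fit is None or km.inertia_ < best_fit.inertia_:
        best_fit = km

print(best_fit.inertia_)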
Writing Your First K-Means Clustering Code in Python
Thankfully, there's a robust implementation of k-means clustering in
Python from the popular machine learning package scikit-learn. You'll
learn how to write a practical implementation of the k-means algorithm
using the scikit-learn version of the algorithm.

Note: If you're interested in gaining a deeper understanding of how to
write your own k-means algorithm in Python, then check out the
Python Data Science Handbook.
The code in this tutorial requires some popular external Python
packages and assumes that you've installed Python with Anaconda.
For more information on setting up your Python environment for
machine learning in Windows, read through Setting Up Python for
Machine Learning on Windows.
Otherwise, you can begin by installing the required packages:
(base) $ conda install matplotlib numpy pandas seaborn
scikit-learn ipython
(base) $ conda install -c conda-forge kneed
The code is presented so that you can follow along in an IPython
console or Jupyter Notebook. Click the prompt (>>>) at the top right of
each code block to see the code formatted for copy-paste. You can
also download the source code used in this article by clicking on the
link below:

Download the sample code: Click here to get the code you'll use to
learn how to write a k-means clustering pipeline in this tutorial.
This step will import the modules needed for all the code in this
section:
>>>
In [1]: import matplotlib.pyplot as plt
   ...: from kneed import KneeLocator
   ...: from sklearn.datasets import make_blobs
   ...: from sklearn.cluster import KMeans
   ...: from sklearn.metrics import silhouette_score
   ...: from sklearn.preprocessing import StandardScaler
You can generate synthetic data to work with using make_blobs(), a
convenience function in scikit-learn used to generate synthetic
clusters. make_blobs() uses these parameters:

* n_samples is the total number of samples to generate.
* centers is the number of centers to generate.
* cluster_std is the standard deviation.

make_blobs() returns a tuple of two values:

1. A two-dimensional NumPy array with the x- and y-values for each of
the samples
2. A one-dimensional NumPy array containing the cluster labels for
each sample

Note: Many scikit-learn algorithms rely heavily on NumPy in their
implementations. If you want to learn more about NumPy arrays, check
out Look Ma, No For Loops: Array Programming With NumPy.
Generate the synthetic data and labels:
>>>
In [2]: features, true_labels = make_blobs(
   ...:     n_samples=200,
   ...:     centers=3,
   ...:     cluster_std=2.75,
   ...:     random_state=42,
   ...: )
Nondeterministic machine learning algorithms like k-means are
difficult to reproduce. The random_state parameter is set to an
integer value so you can follow the data presented in the tutorial. In
practice, it's best to leave random_state as the default value, None.
Here's a look at the first five elements for each of the variables
returned by make_blobs():
>>>
In [3]: features[:5]
Out[3]:
array([[  9.77075874,   3.27621022],
       [ -9.71349666,  11.27451802],
       [ -6.91330582,  -9.34755911],
       [-10.86185913, -10.75063497],
       [ -8.50038027,  -4.54370383]])

In [4]: true_labels[:5]
Out[4]: array([1, 0, 2, 2, 2])
Data sets usually contain numerical features that have been measured
in different units, such as height (in inches) and weight (in pounds). A
machine learning algorithm would consider weight more important
than height only because the values for weight are larger and have
higher variability from person to person.
Machine learning algorithms need to consider all features on an even
playing field. That means the values for all features must be
transformed to the same scale.

The process of transforming numerical features to use the same scale
is known as feature scaling. It's an important data preprocessing step
for most distance-based machine learning algorithms because it can
have a significant impact on the performance of your algorithm.
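As a toy illustration, with made-up height and weight values rather than
anything from the article, compare Euclidean distances before and after
scaling:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up heights (inches) and weights (pounds) for three people
people = np.array([[65.0, 130.0], [70.0, 200.0], [66.0, 135.0]])

# Unscaled, the distance is dominated by the weight column
print(np.linalg.norm(people[0] - people[1]))  # roughly 70.2
print(np.linalg.norm(people[0] - people[2]))  # roughly 5.1

# After standardization, both features contribute on comparable terms
scaled = StandardScaler().fit_transform(people)
print(np.linalg.norm(scaled[0] - scaled[1]))
print(np.linalg.norm(scaled[0] - scaled[2]))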
There are several approaches to implementing feature scaling. A great
way to determine which technique is appropriate for your dataset is to
read scikit-learn's preprocessing documentation.
In this example, you'll use the StandardScaler class. This class
implements a type of feature scaling called standardization.
Standardization scales, or shifts, the values for each numerical feature
in your dataset so that the features have a mean of 0 and standard
deviation of 1:
>>>
In [5]: scaler = StandardScaler()
   ...: scaled_features = scaler.fit_transform(features)
Take a look at how the values have been scaled in scaled_features:
>>>
In [6]: scaled_features[:5]
Out[6]:
array([[ 2.13082109,  0.25604351],
       [-1.52698523,  1.41036744],
       [-1.00130152, -1.56583175],
       [-1.74256891, -1.76832509],
       [-1.29924521, -0.87253446]])

Now the data are ready to be clustered. The KMeans estimator class in
scikit-learn is where you set the algorithm parameters before fitting the
estimator to the data. The scikit-learn implementation is flexible,
providing several parameters that can be tuned.
Here are the parameters used in this example:
* init controls the initialization technique. The standard version of
the k-means algorithm is implemented by setting init to
"random". Setting this to "k-means++" employs an advanced trick
to speed up convergence, which you'll use later.

* n_clusters sets k for the clustering step. This is the most
important parameter for k-means.

* n_init sets the number of initializations to perform. This is
important because two runs can converge on different cluster
assignments. The default behavior for the scikit-learn algorithm is
to perform ten k-means runs and return the results of the one with
the lowest SSE.

* max_iter sets the number of maximum iterations for each
initialization of the k-means algorithm.
Instantiate the KMeans class with the following arguments:
>>>
In [7]: kmeans = KMeans(
   ...:     init="random",
   ...:     n_clusters=3,
   ...:     n_init=10,
   ...:     max_iter=300,
   ...:     random_state=42,
   ...: )
The parameter names match the language that was used to describe
the k-means algorithm earlier in the tutorial. Now that the KMeans
class is ready, the next step is to fit it to the data in scaled_features.
This will perform ten runs of the k-means algorithm on your data with a
maximum of 300 iterations per run:
>>>
In [8]: kmeans.fit(scaled_features)
Out[8]:
KMeans(init='random', n_clusters=3, random_state=42)
Statistics from the initialization run with the lowest SSE are available
as attributes of kmeans after calling .fit():

>>>
In [9]: # The lowest SSE value
   ...: kmeans.inertia_
Out[9]: 74.57960106819854

In [10]: # Final locations of the centroids
   ...: kmeans.cluster_centers_
Out[10]:
array([[ 1.19539276,  0.13158148],
       [-0.25813925,  1.05589975],
       [-0.91941183, -1.18551732]])

In [11]: # The number of iterations required to converge
   ...: kmeans.n_iter_
Out[11]: 6
Finally, the cluster assignments are stored as a one-dimensional
NumPy array in kmeans.labels_. Here's a look at the first five
predicted labels:

>>>
In [12]: kmeans.labels_[:5]
Out[12]: array([0, 1, 2, 2, 2], dtype=int32)

Note that the order of the cluster labels for the first two data objects
was flipped. The order was [1, 0] in true_labels but [0, 1] in
kmeans.labels_, even though those data objects are still members of
their original clusters in kmeans.labels_.
This behavior is normal, as the ordering of cluster labels is dependent
on the initialization. Cluster 0 from the first run could be labeled cluster
1 in the second run and vice versa. This doesn't affect clustering
evaluation metrics.
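As a quick illustration of that last point, not part of the original
tutorial, you can check that a metric like adjusted_rand_score only compares
groupings, not label names:

from sklearn.metrics import adjusted_rand_score

true = [0, 0, 1, 1, 2, 2]
# The same grouping, but with the names of clusters 0 and 1 swapped
predicted = [1, 1, 0, 0, 2, 2]

# ARI compares groupings, not label names, so the score is still perfect
print(adjusted_rand_score(true, predicted))  # 1.0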
How to Build a K-Means Clustering Pipeline in
Python
Now that you have a basic understanding of k-means clustering in
Python, it's time to perform k-means clustering on a real-world dataset.
These data contain gene expression values from a manuscript
authored by The Cancer Genome Atlas (TCGA) Pan-Cancer analysis
project investigators.

There are 801 samples (rows) representing five distinct cancer
subtypes. Each sample has gene expression values for 20,531 genes
(columns). The dataset is available from the UC Irvine Machine
Learning Repository, but you can use the Python code below to obtain
the data programmatically.
To follow along with the examples below, you can download the source
code by clicking on the following link:
In this section, you'll build a robust k-means clustering pipeline. Since
you'll perform multiple transformations of the original input data, your
pipeline will also serve as a practical clustering framework.
Building a K-Means Clustering Pipeline
Assuming you want to start with a fresh namespace, import all the
modules needed to build and evaluate the pipeline, including pandas
and seaborn for more advanced visualizations:
>>>
In [1]: import tarfile
   ...: import urllib
   ...:
   ...: import numpy as np
   ...: import matplotlib.pyplot as plt
   ...: import pandas as pd
   ...: import seaborn as sns
   ...:
   ...: from sklearn.cluster import KMeans
   ...: from sklearn.decomposition import PCA
   ...: from sklearn.metrics import silhouette_score, adjusted_rand_score
   ...: from sklearn.pipeline import Pipeline
   ...: from sklearn.preprocessing import LabelEncoder, MinMaxScaler
Download and extract the TCGA dataset from UCI:
>>>
In [2]: uci_tcga_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00401/"
   ...: archive_name = "TCGA-PANCAN-HiSeq-801x20531.tar.gz"
   ...:
   ...: # Build the url
   ...: full_download_url = urllib.parse.urljoin(uci_tcga_url, archive_name)
   ...:
   ...: # Download the file
   ...: r = urllib.request.urlretrieve(full_download_url, archive_name)
   ...:
   ...: # Extract the data from the archive
   ...: tar = tarfile.open(archive_name, "r:gz")
   ...: tar.extractall()
   ...: tar.close()
After the download and extraction is completed, you should have a
directory that looks like this:
TCGA-PANCAN-HiSeq-801x20531/
|-- data.csv
|-- labels.csv

The KMeans class in scikit-learn requires a NumPy array as an
argument. The NumPy package has a helper function to load the data
from the text file into memory as NumPy arrays:
>>>
In [3]: datafile = "TCGA-PANCAN-HiSeq-801x20531/data.csv"
   ...: labels_file = "TCGA-PANCAN-HiSeq-801x20531/labels.csv"
   ...:
   ...: data = np.genfromtxt(
   ...:     datafile,
   ...:     delimiter=",",
   ...:     usecols=range(1, 20532),
   ...:     skip_header=1,
   ...: )
   ...:
   ...: true_label_names = np.genfromtxt(
   ...:     labels_file,
   ...:     delimiter=",",
   ...:     usecols=(1,),
   ...:     skip_header=1,
   ...:     dtype="str",
   ...: )

Check out the first three columns of data for the first five samples as
well as the labels for the first five samples:

>>>
In [4]: data[:5, :3]
Out[4]:
array([[0.        , 2.01720929, 3.26552691],
       [0.        , 0.59273209, 1.58842082],
       [0.        , 3.51175898, 4.32719872],
       [0.        , 3.66361787, 4.50764878],
       [0.        , 2.65574107, 2.82154696]])

In [5]: true_label_names[:5]
Out[5]: array(['PRAD', 'LUAD', 'PRAD', 'PRAD', 'BRCA'], dtype='<U4')
>>>
In [9]: label_encoder.classes_
Out[9]: array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype='<U4')
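The excerpt also skips the line that stores the number of unique classes,
which the clusterer below relies on as n_clusters. Based on how it's used, it
presumably looks like this (a reconstruction, not copied from the excerpt):

>>>
In [10]: n_clusters = len(label_encoder.classes_)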
>>>
In [11]: preprocessor = Pipeline(
    ...:     [
    ...:         ("scaler", MinMaxScaler()),
    ...:         ("pca", PCA(n_components=2, random_state=42)),
    ...:     ]
    ...: )
Now that you've built a pipeline to process the data, you'll build a
separate pipeline to perform k-means clustering. You'll override the
following default arguments of the KMeans class:

* init: You'll use "k-means++" instead of "random" to ensure
centroids are initialized with some distance between them. In most
cases, this will be an improvement over "random".

* n_init: You'll increase the number of initializations to ensure you find
a stable solution.

* max_iter: You'll increase the number of iterations per initialization to
ensure that k-means will converge.
Build the k-means clustering pipeline with user-defined arguments in
the KMeans constructor:
>>>
In [12]: clusterer = Pipeline(
    ...:     [
    ...:         (
    ...:             "kmeans",
    ...:             KMeans(
    ...:                 n_clusters=n_clusters,
    ...:                 init="k-means++",
    ...:                 n_init=50,
    ...:                 max_iter=500,
    ...:                 random_state=42,
    ...:             ),
    ...:         ),
    ...:     ]
    ...: )
The Pipeline class can be chained to form a larger pipeline. Build an
end-to-end k-means clustering pipeline by passing the
"preprocessor" and "clusterer" pipelines to Pipeline:

>>>
In [13]: pipe = Pipeline(
    ...:     [
    ...:         ("preprocessor", preprocessor),
    ...:         ("clusterer", clusterer),
    ...:     ]
    ...: )
Calling .fit() with data as the argument performs all the pipeline
steps on the data:
>>>
In [14]: pipe.fit(data)
Out[14]:
Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('scaler', MinMaxScaler()),
                                 ('pca',
                                  PCA(n_components=2,
                                      random_state=42))])),
                ('clusterer',
                 Pipeline(steps=[('kmeans',
                                  KMeans(max_iter=500, n_clusters=5,
                                         n_init=50,
                                         random_state=42))]))])
The pipeline performs all the necessary steps to execute k-means
clustering on the gene expression data! Depending on your Python
REPL, .fit() may print a summary of the pipeline. Objects defined
inside pipelines are accessible using their step name.
Evaluate the performance by calculating the silhouette coefficient:
>>>
In [15]: preprocessed_data = pipe["preprocessor"].transform(data)

In [16]: predicted_labels = pipe["clusterer"]["kmeans"].labels_

In [17]: silhouette_score(preprocessed_data, predicted_labels)
Out[17]: 0.5118775528450304
Calculate ARI, too, since the ground truth cluster labels are available:
>>>
In [18]: adjusted_rand_score(true_labels, predicted_labels)
Out[18]: 0.722276752060253

As mentioned earlier, the scale for each of these clustering
performance metrics ranges from -1 to 1. A silhouette coefficient of 0
indicates that clusters are significantly overlapping one another, and a
silhouette coefficient of 1 indicates clusters are well-separated. An ARI
score of 0 indicates that cluster labels are randomly assigned, and an
ARI score of 1 means that the true labels and predicted labels form
identical clusters.
Since you specified n_components=2 in the PCA step of the k-means
clustering pipeline, you can also visualize the data in the context of the
true labels and predicted labels. Plot the results using a pandas
DataFrame and the seaborn plotting library:
>>>
In [19]: pcadf = pd.DataFrame(
    ...:     pipe["preprocessor"].transform(data),
    ...:     columns=["component_1", "component_2"],
    ...: )
    ...:
    ...: pcadf["predicted_cluster"] = pipe["clusterer"]["kmeans"].labels_
    ...: pcadf["true_label"] = label_encoder.inverse_transform(true_labels)

In [20]: plt.style.use("fivethirtyeight")
    ...: plt.figure(figsize=(8, 8))
    ...:
    ...: scat = sns.scatterplot(
    ...:     "component_1",
    ...:     "component_2",
    ...:     s=50,
    ...:     data=pcadf,
    ...:     hue="predicted_cluster",
    ...:     style="true_label",
    ...:     palette="Set2",
    ...: )
    ...:
    ...: scat.set_title(
    ...:     "Clustering results from TCGA Pan-Cancer\nGene Expression Data"
    ...: )
    ...: plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
    ...: plt.show()
Here's what the plot looks like:
[Figure: "Clustering results from TCGA Pan-Cancer Gene Expression
Data" — a scatter plot of component_1 vs. component_2, colored by
predicted cluster and styled by true label.]
The visual representation of the clusters confirms the results of the
two clustering evaluation metrics. The performance of your pipeline
was pretty good. The clusters only slightly overlapped, and cluster
assignments were much better than random.

Tuning a K-Means Clustering Pipeline
Your first k-means clustering pipeline performed well, but there's still
room to improve. That's why you went through the trouble of building
the pipeline: you can tune the parameters to get the most desirable
clustering results.
The process of parameter tuning consists of sequentially altering one
of the input values of the algorithm's parameters and recording the
results. At the end of the parameter tuning process, you'll have a set of
performance scores, one for each new value of a given parameter.
Parameter tuning is a powerful method to maximize performance from
your clustering pipeline.
By setting the PCA parameter n_components=2, you squished all the
features into two components, or dimensions. This value was
convenient for visualization on a two-dimensional plot. But using only
two components means that the PCA step won't capture all of the
explained variance of the input data.
Explained variance measures the discrepancy between the PCA-
transformed data and the actual input data. The relationship between
n_components and explained variance can be visualized in a plot to
show you how many components you need in your PCA to capture a
certain percentage of the variance in the input data. You can also use
clustering performance metrics to evaluate how many components are
necessary to achieve satisfactory clustering results.
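If you want to see that explained-variance plot, here's a sketch, not part of
the original code, that uses PCA's explained_variance_ratio_ on the min-max
scaled data:

# Assumes data, np, plt, PCA, and MinMaxScaler from the imports above
scaled = MinMaxScaler().fit_transform(data)
pca = PCA(n_components=10, random_state=42).fit(scaled)

# Cumulative share of variance captured as components are added
plt.plot(range(1, 11), np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("n_components")
plt.ylabel("Cumulative explained variance")
plt.show()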
In this example, you'll use clustering performance metrics to identify
the appropriate number of components in the PCA step. The Pipeline
class is powerful in this situation. It allows you to perform basic
parameter tuning using a for loop.

Iterate over a range of n_components and record evaluation metrics
for each iteration:
>>>
In [21]: # Empty lists to hold evaluation metrics
    ...: silhouette_scores = []
    ...: ari_scores = []
    ...:
    ...: for n in range(2, 11):
    ...:     # This sets the number of components for the PCA step,
    ...:     # but leaves other steps unchanged
    ...:     pipe["preprocessor"]["pca"].n_components = n
    ...:     pipe.fit(data)
    ...:
    ...:     silhouette_coef = silhouette_score(
    ...:         pipe["preprocessor"].transform(data),
    ...:         pipe["clusterer"]["kmeans"].labels_,
    ...:     )
    ...:     ari = adjusted_rand_score(
    ...:         true_labels,
    ...:         pipe["clusterer"]["kmeans"].labels_,
    ...:     )
    ...:
    ...:     # Add metrics to their lists
    ...:     silhouette_scores.append(silhouette_coef)
    ...:     ari_scores.append(ari)
Plot the evaluation metrics as a function of n_components to visualize
the relationship between adding components and the performance of
the k-means clustering results:

>>>
In [22]: plt.style.use("fivethirtyeight")
    ...: plt.figure(figsize=(6, 6))
    ...: plt.plot(
    ...:     range(2, 11),
    ...:     silhouette_scores,
    ...:     c="#008fd5",
    ...:     label="Silhouette Coefficient",
    ...: )
    ...: plt.plot(range(2, 11), ari_scores, c="#fc4f30", label="ARI")
    ...:
    ...: plt.xlabel("n_components")
    ...: plt.legend()
    ...: plt.title("Clustering Performance as a Function of n_components")
    ...: plt.tight_layout()
    ...: plt.show()
The above code generates a plot showing performance metrics as a
function of n_components:

[Figure: "Clustering Performance as a Function of n_components" —
the silhouette coefficient and ARI plotted against n_components from
2 to 10.]
There are two takeaways from this figure:

1. The silhouette coefficient decreases linearly. The silhouette
coefficient depends on the distance between points, so as the
number of dimensions increases, the sparsity increases.

2. The ARI improves significantly as you add components. It appears
to start tapering off after n_components=7, so that would be the
value to use for presenting the best clustering results from this
pipeline (a short refit sketch follows this list).
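As a follow-up that isn't part of the original code, you could lock in that
best-performing setting by refitting the pipeline at n_components=7 and
recomputing ARI for the presented results:

# Refit the full pipeline at the best-performing setting from the loop above
pipe["preprocessor"]["pca"].n_components = 7
pipe.fit(data)

# Recompute ARI for the final, presented results
adjusted_rand_score(true_labels, pipe["clusterer"]["kmeans"].labels_)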
Like most machine learning decisions, you must balance optimizing
clustering evaluation metrics with the goal of the clustering task. In
situations when cluster labels are available, as is the case with the
cancer dataset used in this tutorial, ARI is a reasonable choice. ARI
quantifies how accurately your pipeline was able to reassign the
cluster labels.
The silhouette coefficient, on the other hand, is a good choice for
exploratory clustering because it helps to identify subclusters. These
subclusters warrant additional investigation, which can lead to new
and important insights.
Conclusion
You now know how to perform k-means clustering in Python. Your final
k-means clustering pipeline was able to cluster patients with different
cancer types using real-world gene expression data. You can use the
techniques you learned here to cluster your own data, understand how
to get the best clustering results, and share insights with others.

In this tutorial, you learned:

* What the popular clustering techniques are and when to use them
* What the k-means algorithm is
* How to implement k-means clustering in Python
* How to evaluate the performance of clustering algorithms
* How to build and tune a robust k-means clustering pipeline in
Python
* How to analyze and present clustering results from the k-means
algorithm
You also took a whirlwind tour of scikit-learn, an accessible and
extensible tool for implementing k-means clustering in Python. If you'd
like to reproduce the examples you saw above, then be sure to
download the source code by clicking on the following link:
You're now ready to perform k-means clustering on datasets you find
interesting. Be sure to share your results in the comments below!
Note: The dataset used in this tutorial was obtained from the UCI
Machine Learning Repository. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California,
School of Information and Computer Science.
The original dataset is maintained by The Cancer Genome Atlas Pan-
Cancer analysis project.