PCA & Clustering

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a large set of variables into a smaller set of variables called principal components. The principal components contain most of the information in the original dataset. PCA reduces the number of variables in a dataset while minimizing loss of accuracy. This makes the dataset easier to explore, visualize, and process for machine learning algorithms. K-means clustering is an unsupervised learning algorithm that partitions data into k groups where each data point belongs to the cluster with the nearest mean. The k-means algorithm assigns data to clusters to minimize total within-cluster sum of squares and iteratively updates cluster means until convergence is reached.

Uploaded by

Tan Boon Pin

Principal Component Analysis

Principal Component Analysis is a dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets by transforming a large set of variables into a
smaller one that still contains most of the information in the data set. In addition, factor
analysis of mixed data (FAMD) is a principal component method dedicated to analysing a
data set containing both quantitative and qualitative variables.

The objective of PCA is to reduce the number of variables of a data set. This naturally comes
at the expense of accuracy, but the idea of dimensionality reduction is to trade a little
accuracy for simplicity: smaller data sets are easier to explore and visualize, and they make
analysing data much easier and faster for machine learning algorithms, which have no
extraneous variables to process.

Figure 3.1: Contribution of Variables in 5 Dimensions

Based on the graph above, we can see that Group_TrafficType contributes the most across
the 5 dimensions. This means that Group_TrafficType is vital to all of these dimensions.
In contrast, SpecialDay contributes the least across the 5 dimensions, so it is the least
important variable.

Figure 3.2: Contribution of Variables

Based on the graph above, total_visited_pgs and total_duration contribute the most in the
first 2 dimensions, whereas SpecialDay, Month and Group_TrafficType contribute the least
in the first 2 dimensions.

We can see that one grouping includes total_visited_pgs, total_duration, Month, and
PageValues. This grouping makes sense, as the amount of time spent on pages, the number
of pages visited and the average page value of the pages visited by the visitor all influence
the Revenue.

Figure 3.3: The values of the 9 PCs

In the figure above, the standard deviation of each principal component is the square root
of its eigenvalue once the data is standardized. The proportion of variance is the share of
the total variance in the data that a component accounts for. For example, PC1 alone
accounts for more than 24% of the total variance in the data. The cumulative proportion is
the accumulated amount of explained variance. For instance, the first 4 principal
components together account for more than 65% of the total variance in the data.

PC1 has a proportion of variance of 0.2408, which means that a single principal component
explains only 24.08% of the variation in the data. Using PC1 alone is therefore not optimal,
because it oversimplifies the data. We need more principal components so that the
prediction can be more accurate. For example, PC1 and PC2 together explain only 41.64%.
The representation is considered accurate if the retained PCs account for at least 95% of
the variation.
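
As a worked check of these figures, the relations between standard deviation, eigenvalue and proportion of variance can be written out. This is only a sketch: it assumes that all nine variables were standardized, so that the eigenvalues sum to 9, and the symbols $\lambda_j$ (eigenvalue of PC $j$) and $s_j$ (its standard deviation) are notation introduced here rather than taken from the report.

```latex
\begin{align*}
\text{proportion}_j &= \frac{\lambda_j}{\sum_{i=1}^{9} \lambda_i} = \frac{s_j^2}{9}, \\
\lambda_1 &\approx 0.2408 \times 9 \approx 2.167, \qquad s_1 = \sqrt{\lambda_1} \approx 1.472, \\
\text{proportion}_2 &= 0.4164 - 0.2408 = 0.1756 .
\end{align*}
```

Under that assumption, PC2 on its own explains about 17.56% of the variance, which is the difference between the cumulative proportion after two components and the proportion of PC1.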

Figure 3.4: Variance Plot (left) and Cumulative Variance Plot (right)

We notice that the first 6 components (in the cumulative variance plot) explain almost
90% of the variance. We can effectively reduce the dimensionality from 9 to 6 while only
"losing" about 10% of the variance. We can also see that just the first 4 components
explain more than 60% of the variance.
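
The choice of how many components to keep can be sketched in code. The example below is a minimal illustration in Python (the report's own analysis appears to have been done in R), and `components_needed` is a hypothetical helper. Only the first two proportions (0.2408 and 0.1756) follow from the figures quoted in the text; the remaining seven values are made-up placeholders chosen to be roughly consistent with the plots described, not values read from Figure 3.4.

```python
from itertools import accumulate


def components_needed(proportions, threshold):
    """Return the smallest number of leading components whose
    cumulative proportion of variance reaches `threshold`."""
    for n, cumulative in enumerate(accumulate(proportions), start=1):
        if cumulative >= threshold:
            return n
    # Threshold never reached (e.g. rounding): keep every component.
    return len(proportions)


# Proportion of variance per PC. Only the first two values come from the
# text; the rest are hypothetical placeholders for illustration.
proportions = [0.2408, 0.1756, 0.14, 0.12, 0.11, 0.10, 0.06, 0.035, 0.0186]

for target in (0.60, 0.85, 0.95):
    print(target, components_needed(proportions, target))
```

With real output, `proportions` would be replaced by the "Proportion of Variance" row of the PCA summary.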

K-means Clustering

K-means clustering is an unsupervised machine learning algorithm that aims to partition a
given dataset into k groups (clusters). The objective of K-means is to group similar data
points together and discover underlying patterns. To achieve this objective, K-means looks
for a fixed number k of clusters in a dataset, where k is the number of centroids needed.
Every data point is allocated to one of the clusters so as to reduce the within-cluster sum
of squares.

K-means algorithm

The algorithm identifies k centroids and then allocates every data point to the nearest
cluster, while keeping the total within-cluster sum of squares as small as possible.

$$\text{tot.withinss} = \sum_{k=1}^{k} W(C_k) = \sum_{k=1}^{k} \sum_{x_i \in C_k} (x_i - \mu_k)^2$$

where $x_i$ is a data point belonging to the cluster $C_k$, and $\mu_k$ is the mean value of
the points assigned to the cluster $C_k$.

1. Specify the number of clusters (k).

2. Randomly select k objects from the dataset as the initial cluster centers or means.

3. Assign each observation to its closest centroid, based on the Euclidean distance between
the object and the centroid.

4. For each of the k clusters, update the cluster centroid by calculating the new mean values
of all the data points in the cluster. The centroid of the kth cluster is a vector of length p
(the number of variables) containing the means of all variables for the observations in the
kth cluster.

5. Iteratively minimize the total within-cluster sum of squares: repeat steps 3 and 4 until the
cluster assignments stop changing or the maximum number of iterations is reached.
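
The five steps above can be sketched as a short program. This is a minimal illustration in plain Python, not the routine used for the report (which appears to rely on R's kmeans()). The function name, the `max_iter` and `seed` parameters, and the six sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=10, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k observations at random as the initial centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each observation to its closest center (Euclidean).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its assigned observations.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # Total within-cluster sum of squares of the final partition.
    tot_withinss = sum(
        math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment)
    )
    return assignment, centers, tot_withinss


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers, tot_withinss = kmeans(points, k=2)
print(labels, centers, tot_withinss)
```

Because the initial centers are random, the cluster numbering can differ between seeds even when the partition itself is the same.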

The kmeans() function returns a list of components, including:

cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
centers: A matrix of cluster centers (cluster means)
totss: The total sum of squares (TSS), $\sum (x_i - \bar{x})^2$. TSS measures the total variance in the data.
withinss: Vector of within-cluster sums of squares, one component per cluster
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss)
betweenss: The between-cluster sum of squares, i.e. totss − tot.withinss
size: The number of observations in each cluster
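
To make the relationship between these components concrete, the sketch below recomputes them by hand for a given cluster assignment. It is written in plain Python as an illustration and is not R's implementation. The helper names (`mean`, `sum_of_squares`, `kmeans_summary`) and the toy data are hypothetical, and the code assumes that every cluster label from 1 to k is used at least once.

```python
def mean(points):
    """Column-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def kmeans_summary(points, cluster, k):
    """Recompute the summary components for labels 1..k in `cluster`."""
    groups = [[p for p, c in zip(points, cluster) if c == j] for j in range(1, k + 1)]
    centers = [mean(g) for g in groups]
    withinss = [sum_of_squares(g, m) for g, m in zip(groups, centers)]
    totss = sum_of_squares(points, mean(points))
    tot_withinss = sum(withinss)
    return {
        "centers": centers,
        "totss": totss,
        "withinss": withinss,
        "tot.withinss": tot_withinss,
        "betweenss": totss - tot_withinss,
        "size": [len(g) for g in groups],
    }


# Hypothetical toy data with a hand-written assignment into 2 clusters.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
cluster = [1, 1, 1, 2, 2, 2]
summary = kmeans_summary(points, cluster, k=2)
print(summary["betweenss"] / summary["totss"])  # share of variance between clusters
```

The ratio betweenss / totss printed at the end is a common way to judge how much of the total variance the clustering explains.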

Figure 4.1: Within clusters sum of squares

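We know that the model is not accurate at prediction because the percentage reported
alongside the within-cluster sums of squares is only 66.57%. (In R's kmeans output, this
percentage is the ratio between_SS / total_SS.)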

Figure 4.2: K-Means Clusters

We can observe that the clusters overlap considerably when plotted on PC1 and PC2. This
means that the model is not accurate.

Figure 4.3: Confusion Matrix table (K-Means)

From the figure above, we observe that this model has a low accuracy of just 54.72%.
Moreover, the TPR (online shoppers with revenue) is 64.89%, a fairly high recall, while the
PPV is 21.22%, a low precision. The TNR (online shoppers without revenue) is 52.72%, a
lower recall, while the NPV is 88.44%, a high precision. Hence, it is not a good model for
the dataset.
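
For reference, the five measures quoted above are all derived from the four cells of the confusion matrix. The sketch below shows the standard definitions in Python. The function name and the counts passed to it are hypothetical, chosen only to illustrate the formulas. They are not the counts from Figure 4.3.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard measures from the four cells of a 2x2 confusion matrix.

    tp: positives predicted positive    fn: positives predicted negative
    fp: negatives predicted positive    tn: negatives predicted negative
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "TPR": tp / (tp + fn),  # recall for the positive class
        "TNR": tn / (tn + fp),  # recall for the negative class
        "PPV": tp / (tp + fp),  # precision for the positive class
        "NPV": tn / (tn + fn),  # precision for the negative class
    }


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=30, fn=10, fp=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

When the positive class is rare, as with shoppers who generate revenue, a model can have a reasonable TPR and still a low PPV, because the false positives drawn from the large negative class outnumber the true positives.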
