CONGLOMERATE ANALYSIS
Hierarchical clustering
1. Hierarchical clustering by RNA-seq in a time course/different conditions/samples
• It is used to find which genes (rows) behave similarly to other genes across different samples,
conditions, or time points (dots).
• You can also run hierarchical clustering by sample instead of by gene, but the result will be very similar to
simply comparing the samples across all the genes.
• Steps to do hierarchical clustering:
1. Calculate the pairwise distance between the log(TPM) profile of one gene (its values in all samples/conditions)
and the profile of another gene. For example, compare the log(TPM) of dot 1 of gene 1 with dot 1 of gene 2,
dot 2 of gene 1 with dot 2 of gene 2, and so on, and combine these differences into a single distance between
gene 1 and gene 2; then do the same for gene 1 and gene 3… In total, n(n-1)/2 pairwise distances for n genes.
2. Cluster the two genes that have the smallest distance across conditions. This generates a node. Then
calculate the distance between the new node and every gene or node in the remaining data.
3. Choose the smallest distance again among all remaining genes and nodes (it may or may not involve the
previous node). Generate a new node. Then calculate the distance between the new node and every gene or
node in the remaining data.
4. Choose the smallest distance again among all remaining genes and nodes. Generate a
new node. Then calculate the distance between the new node and every gene or node in the
remaining data.
5. Choose the smallest distance again among all remaining genes and nodes. Generate a
new node. Then calculate the distance between the new node and every gene or node in the
remaining data, and repeat until no more comparisons are possible (a single node remains).
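The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the log(TPM) values are made up, Euclidean distance and average linkage are assumptions (the notes do not fix either choice), and the variable names are hypothetical.

```python
import math
from itertools import combinations

# Toy log(TPM) profiles: one entry per gene, one value per sample (made-up numbers).
log_tpm = {
    "gene1": [1.0, 2.0, 3.0],
    "gene2": [1.1, 2.1, 2.9],
    "gene3": [5.0, 4.0, 1.0],
    "gene4": [5.2, 3.9, 1.1],
}

# Step 1: Euclidean distance between every pair of genes, n*(n-1)/2 pairs in total.
pair_dist = {
    frozenset((g, h)): math.dist(log_tpm[g], log_tpm[h])
    for g, h in combinations(log_tpm, 2)
}

def average_linkage(c1, c2):
    """Mean of all gene-to-gene distances between two clusters (assumed linkage)."""
    dists = [pair_dist[frozenset((g, h))] for g in c1 for h in c2]
    return sum(dists) / len(dists)

# Steps 2-5: repeatedly merge the two closest nodes until a single node remains.
clusters = [(g,) for g in log_tpm]  # every gene starts as its own node
merges = []                         # one (node_a, node_b, distance) record per merge
while len(clusters) > 1:
    a, b = min(combinations(clusters, 2), key=lambda p: average_linkage(*p))
    merges.append((a, b, average_linkage(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]

for a, b, d in merges:
    print(a, "+", b, f"merged at distance {d:.2f}")
```

With these toy values, gene1 and gene2 merge first, then gene3 and gene4, and the last merge joins the two pairs. In practice a library routine would normally be used instead of this quadratic loop.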
2. Hierarchical clustering by RNA-seq in a PCA
1. In contrast to K-means clustering, it is not necessary to fix in advance the number of clusters to be
generated (K). The hierarchical clustering method calculates the distance between each event (point)
and all the other events.
2. The smallest distance between events (points) is determined and these two points are clustered
together. The position of the new cluster relative to the others is then determined by the linkage method,
which can be chosen in advance.
3. Recalculate the distances and cluster the closest points together.
4. Eventually all the groups are formed. Nevertheless, the boundaries between
groups can differ depending on the linkage method that has been selected.
5. If you let the algorithm run to completion, it will put all the events in a single cluster. To avoid this you
should apply a cutoff to the dendrogram: either a height at which to cut it or a preselected
number of clusters (similar to choosing K in K-means clustering, for example with the elbow rule).
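The two stopping rules in point 5 can be illustrated with a small sketch. It assumes single linkage and hypothetical (PC1, PC2) coordinates, and the function `agglomerate` and its parameters are names made up for this example. Stopping the merging early is used here as a stand-in for cutting the finished dendrogram, which gives the same groups when merge heights never decrease, as is the case for single linkage.

```python
import math
from itertools import combinations

def single_linkage(c1, c2):
    """Smallest point-to-point distance between two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, max_height=None, n_clusters=None):
    """Merge the closest clusters until a stopping rule is met.

    max_height: stop before any merge whose distance exceeds this height.
    n_clusters: stop once this many clusters remain.
    With neither rule, every point ends up in one cluster.
    """
    clusters = [[p] for p in points]
    target = n_clusters if n_clusters is not None else 1
    while len(clusters) > target:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = single_linkage(clusters[i], clusters[j])
        if max_height is not None and height > max_height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Hypothetical (PC1, PC2) coordinates for six samples forming two tight groups.
pcs = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]

print(len(agglomerate(pcs)))                  # no cutoff: 1 cluster
print(len(agglomerate(pcs, max_height=1.0)))  # cut by height: 2 clusters
print(len(agglomerate(pcs, n_clusters=2)))    # cut by number of clusters: 2 clusters
```

The height cutoff of 1.0 is arbitrary and only works because the toy groups are far apart; on real data the cutoff is usually chosen after inspecting the dendrogram.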
3. Hierarchical generalities
• The dendrogram allows the identification of distances between groups (clusters).
• Repeatedly:
§ Merge two nodes (either a gene or a cluster) that are closest to each other (the smallest distance).
§ Re-calculate the distance from the newly formed node to all the other nodes (genes or clusters).
§ Branch length represents distances.
• Linkage: the distance from the newly formed node to all other nodes. There are different ways to do the linkage
(methods to measure the distance between two clusters):
§ Single linkage: take as the distance the minimum pairwise distance between the elements of the two
clusters.
§ Complete linkage (the default method of hclust in R): take as the distance the maximum pairwise distance
between the elements of the two clusters.
§ Average linkage: take as the distance the mean (average) of all pairwise distances between the elements
of the two clusters.
§ Centroid linkage: take as the distance the distance between the centroids of the two
clusters.
§ Ward’s method for linkage: merge the two clusters whose union gives the smallest increase in
within-cluster variance. The objective is to minimize the total within-cluster variance.
• Different linkage methods generate different clustering results and dendrograms. You should
try several of them and select the one that fits your data best. Nevertheless, if the clustering is stable, a similar
result should be obtained across the different linkage methods.
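To make the linkage definitions concrete, the sketch below computes each between-cluster distance for two small clusters. It assumes Euclidean distance and made-up 2D points, the function names are hypothetical, and the Ward value uses the standard formula for the increase in within-cluster sum of squares, n1·n2/(n1+n2) times the squared distance between centroids.

```python
import math

def pairwise(c1, c2):
    """All point-to-point Euclidean distances between two clusters."""
    return [math.dist(p, q) for p in c1 for q in c2]

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def single(c1, c2):
    return min(pairwise(c1, c2))

def complete(c1, c2):
    return max(pairwise(c1, c2))

def average(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)

def centroid_link(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

def ward_increase(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 were merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * math.dist(centroid(c1), centroid(c2)) ** 2

# Two made-up clusters of two points each.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]

for name, fn in [("single", single), ("complete", complete),
                 ("average", average), ("centroid", centroid_link),
                 ("ward increase", ward_increase)]:
    print(f"{name:>13}: {fn(A, B):.3f}")
```

Here single and centroid linkage both give 3, complete linkage gives sqrt(13), and average linkage falls in between, which shows why the choice of linkage can change which clusters are merged next.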
4. Brain Teasers
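• Given n data points (each point is a gene):
• How many internal nodes are in the hierarchical cluster → n-1 (each merge creates one internal node, and it
takes n-1 merges to go from n leaves to a single root).
• How many possible ways to draw the hierarchical cluster → 2^(n-1) (the two branches under each internal node
can be swapped without changing the clustering).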
• In RNA-seq, people usually run hierarchical clustering on the expression matrix first for the genes (gene-level
clustering) and then for the samples (sample-level clustering).
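The two counts in the brain teaser can be checked by brute force on a small tree. This sketch assumes a binary dendrogram written as nested tuples with made-up gene names; it counts the internal nodes and enumerates every left/right ordering of the leaves.

```python
def count_internal(node):
    """Number of internal (merge) nodes in a binary tree of nested tuples."""
    if isinstance(node, str):  # a leaf is a gene name
        return 0
    left, right = node
    return 1 + count_internal(left) + count_internal(right)

def leaf_orders(node):
    """All leaf orders obtained by optionally swapping the two branches at each node."""
    if isinstance(node, str):
        return [[node]]
    left, right = node
    orders = []
    for l in leaf_orders(left):
        for r in leaf_orders(right):
            orders.append(l + r)  # branches drawn as given
            orders.append(r + l)  # branches swapped
    return orders

# Hypothetical dendrogram for n = 4 genes.
tree = ((("g1", "g2"), "g3"), "g4")

n_internal = count_internal(tree)
distinct = {tuple(order) for order in leaf_orders(tree)}
print(n_internal)     # n - 1 = 3 internal nodes
print(len(distinct))  # 2^(n-1) = 8 ways to draw the same tree
```

The same clustering can therefore appear with very different left-to-right gene orders, so adjacency of two leaves in a drawn dendrogram does not by itself mean they are close.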