Hierarchical clustering in Python and beyond

Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla

Hierarchical
Clustering
Theory Practice Visualisation
Origins & definitions
Methods & considerations
Hierachical theory
Metrics & performance
My use case
Python libraries
Example
Static
Interactive
Further ideas
All opinions expressed are my own

Who am I?
All opinions expressed are my own

Attribution: www.alexmaclean.com
Clustering: a recap

Clustering is an unsupervised learning
problem
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
based on some
notion of similarity.
whereby we aim to
group subsets of
entities with one
another

Origins
1930s:
Anthropology
&
Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html

Diverse applications
Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/

Two
main
purposes
Exploratory analysis – standalone tool
(Data mining)
As a component of a supervised learning
pipeline (in which distinct classifiers or
regression models are trained for each
cluster).
(Machine Learning)

Clustering considerations
Partitioning
criteria
(single /
multi level)
Separation
Exclusive /
non-
exclusive
Clustering
space
(Full-space /
sub-space)
Similarity
measure
(distance /
connectivity)

Use case: search keywords
RD
P
P
P
KW
KW
KW
KW
KW
CP
CP
KW
KW
KW
The
competition!
KW
KW
CP
CD
You
Opportunity!
CD = Competing domains
CP = Competitor’s pages
RD = Ranking domain
P = Your page
KW = Keyword

Use case: search keywords
KW…so we have found 100,000 new ‘s – now what?
How do we summarise and present these to a client?

Clients’ questions…
• Do search categories in general
align with my website structure?
• Which categories of opportunity
keywords have the highest
search volume, bring the most
visitors, revenue etc.?
• Which keywords are not
relevant?

Requirements
• Need: visual insights;
structure
• Allow targeting of
problem in hand
• May develop into a
semi- supervised
solution

• High-dimensional and sparse
data set
• Values correspond to word
frequencies
• Recommended methods
include: hierarchical
clustering, Kmeans with an
appropriate distance measure,
topic modelling (LDA, LSI),
co-clustering
Options for text
clustering?

Hierarchical Clustering
bringing structure

2 types
Agglomerative
Divisive Deterministic algorithms!
Attribution: Wikipedia

Agglomerative
Start with many
“singleton” clusters
…
Merge 2 at a time
continuously
…
Build a hierarchy
Divisive
Start with a huge “macro”
cluster
…
Iteratively split into 2
groups
…
Build a hierarchy

Agglomerative method:
Linkage types
• Single (similarity between
most similar – based on nearest
neighbour - two elements)
• Complete (similarity between
most dissimilar two elements)
Attribution: https://www.coursera.org/course/clusteranalysis

Agglomerative method:
Linkage types
Average link
( avg. of similarity between
all inter-cluster pairs )
Computationally expensive (Na*Nb)
Trick: Centroid link (similarity
between centroid of two clusters)

Ward’s criterion
• Minimise a function: total in-cluster variance
• As defined by, e.g.:
• Once merged, then the SSE will increase
(cluster becomes bigger) by:
https://en.wikipedia.org/wiki/Ward's_method

Divisive clustering
• Top-down approach
• Criterion to split: Ward’s criterion
• Handling noise: Use a threshold to determine
the termination criteria

Similarity measures
This will certainly influence the shape of the
clusters!
• Numerical: Use a variation of the Manhattan
distance (e.g. City block, Euclidean)
• Binary: Manhattan, Jaccard co-efficient,
Hamming
• Text: Cosine similarity.

Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word/ topic/ phrase)
If d1 and d2 are two term vectors,
…can thus calculate the similarity between them

Gather word documents = keyword phrases

Aggregate search words with URL “words”

Text clustering:
preparations
• Add features where possible
o I added URL words to my word set
• Stem words
o Choose the right stemmer – too severe can be bad
• Stop words
o NLTK tokeniser
o Scikit learn TF-IDF tokeniser
• Low frequency cut-off
o 2 => words appearing less than twice in whole corpus
• High frequency cut-off
o 0.5 => words that appear in more than 50% of documents
• N-grams
o Single words, bi-grams, tri-grams
• Beware of foreign languages
o Separate datasets if possible

Dimensionality
• Get a sparse matrix
o Mostly zeros
• Reduce the number of dimensions
o PCA
o Spectral clustering
• The “curse” of dimensionality

Assess the quality of your
clusters
• Internal: Purity, completeness & homogeneity
• External: Adjusted Rand index, Normalised
Information index

Hierarchical Clustering
Beyond Python (!?)

Life on the inside:
Elasticsearch
• Why not perform pre-processing and clustering
inside elasticsearch?
• Document store
• TF-IDF and other
• Stop words
• Language specific analysers

Elasticsearch
- try it ! -
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up

Lingo 3G algorithm
• Lingo 3G: Hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html

Elasticsearch with
clustering – Utopia?
Carrot2’s Lingo3G in action :
http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for
large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html

Limitations of hierarchical
clustering
• Can’t undo what’s done (divisive method, work
on sub clusters, cannot re-merge). Even true for
agglomerative (once merged will never split it
again)
• Every split or merge must be refined
• Methods may not scale well, checking all possible
pairs, complexity goes high
There are extensions: BIRCH, CURE and
CHAMELEON

Thank you!
A decent introductory course to clustering;
https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python:
http://scikit-
learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with
Carrot2:http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / frank.kelly@cantab.net
Attribution: http://wynway.com/

Extra slide: Why work
inside the database?
1. Sharing data (management of)
Support concurrent access by multiple readers and writers
2. Data Model Enforcement
Make sure all applications see clean, organised data
3. Scale
Work with datasets too large to fit in memory (over a certain size,
need specialised algorithms to deal with the data -> bottleneck)
The database organises and exposes algorithms for you
conveniently
4. Flexibility
Use the data in new, unanticipated ways -> anticipate a broad set
of ways of accessing the data

Hierarchical clustering in Python and beyond

In this document