Unsupervised ML SOM
and how to choose a
Data Science method
Quick Review of the methods we learned
> Statistical analysis
> Supervised ML
   –   Linear regression
   –   NN,
   –   KNN,
   –   Decision Tree
   –   SVM
> Unsupervised ML
   – K-Means Clustering
   – PCA
   – … Why another one?
Why another method: SOM
> Demonstrate some data science methods that are not
  widely used or well known, but can be very useful
  for materials informatics studies
> Introduce a method I have used and find well suited
  to the particular needs of many materials study
  applications
> Demonstrate how various data science methods can
  be used together to drive improved results
> Demonstrate a few projects using the same methods
  so that we can understand a method from a user's point
  of view
What is a Self-Organizing Map (SOM)
> An unsupervised ML method
> Dimensionality reduction, enabling powerful
  visualizations of the data:
   – K-Means does clustering, but neither dimensionality
     reduction nor visualization
   – PCA does dimensionality reduction, enabling visualization to
     a certain extent (not applicable if the first 3 principal
     components do not represent the data well). However, it
     does not perform clustering, and the visualization does
     not preserve the original topological information.
> Gives some insight into how the data is clustered in high
  dimensions
What is SOM
> You can think of a SOM as an artificial neural network
  with a single layer of neurons arranged in a
  two-dimensional matrix.
   – The 2D matrix can be seen as a position map that
     captures the characteristics of the data
> Merits of SOM
   – Effective for training on large datasets
   – Because the map is a 2D matrix, the result can be
     visualized directly
   – Preserves the topology of the original data
   – Can present the Euclidean distance between data
     points
Algorithm of SOM
– Normalization: scale the input data so that all features are distributed
  in a more balanced way
– Initialization: each (x,y) position in the map is assigned a weight for each
  input neuron, thus associating a weight vector with each map position.
– Iteration:
    > Choose a sample from the dataset
    > Calculate the Euclidean distance between that sample and each weight vector
    > The (x,y) position “closest” to the sample is declared the Best Matching Unit (BMU)
    > The weight vector of the BMU is adjusted to match the sample more closely.
      The amount of adjustment (learning) decreases as the iterations proceed
    > The weight vectors of the BMU's neighbors are also adjusted, to a lesser extent.
      The number of neighbors and how much they are adjusted depend on the
      hyperparameters and the number of iterations.
– Convergence:
    > Maximum number of iterations
    > Monitoring of the topographic error
– Reference: https://link.springer.com/article/10.1007/BF00337288
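The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular package: the function name `train_som`, the linear decay schedule, the Gaussian neighborhood, and the default hyperparameter values are all assumptions chosen for readability.

```python
import numpy as np

def train_som(data, rows=6, cols=6, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM; returns (weights, normalized data)."""
    rng = np.random.default_rng(seed)
    # Normalization: z-score each feature so that none dominates the distance.
    std = data.std(axis=0)
    std[std == 0] = 1.0
    x = (data - data.mean(axis=0)) / std
    # Initialization: one random weight vector per (row, col) map position.
    weights = rng.normal(size=(rows, cols, x.shape[1]))
    # Grid coordinates of every node, used to measure distance on the map.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        # Learning rate and neighborhood radius shrink as iterations proceed
        # (linear decay is an assumption; other schedules are common).
        decay = 1.0 - t / n_iter
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.5)
        # Choose a random sample from the dataset.
        sample = x[rng.integers(len(x))]
        # Best Matching Unit: node whose weight vector is closest to the sample.
        dists = np.linalg.norm(weights - sample, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood: the BMU moves most, nearby nodes move less.
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (sample - weights)
    return weights, x

# Toy data (made up for illustration): two blobs in 3 dimensions.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (50, 3)), rng.normal(3, 0.3, (50, 3))])
weights, x = train_som(data)
print(weights.shape)
```

Here convergence is handled only by the maximum number of iterations; monitoring an error measure would need an extra check inside the loop.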
Self-Organizing Map (SOM)
How does it work?
$x_i = (a_i, b_i, c_i, d_i, e_i, f_i)^{\mathsf{T}}$
Two Dimensional Mesh structure
                Each connection can deform
[Figure sequence: a two-dimensional mesh with nodes numbered 1 to 12, drawn among labels a to f]
Self-Organizing Map (SOM) Algorithm
> Dragging Nodes
> “Flattening a crumpled paper”
U-matrix and how to use it to get insights for
clustering
> After training, the nodes in
  the 2D key map are not
  evenly distributed. Adjacent
  points on the map might
  not be similar to each
  other in the higher-
  dimensional space.
> The U-matrix uses a heat
  map to show the Euclidean
  distance between adjacent
  nodes' weight vectors
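As an illustration of the idea, the sketch below computes a U-matrix from a grid of weight vectors with NumPy. It assumes a rectangular map with at least two nodes, 4-connected neighbors, and the mean as the summary of neighbor distances; packages differ in these details. The name `u_matrix` and the toy weights are made up for this example.

```python
import numpy as np

def u_matrix(weights):
    """Mean Euclidean distance from each node's weight vector
    to the weight vectors of its 4-connected grid neighbors."""
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            umat[r, c] = np.mean(dists)
    return umat

# Toy codebook: left half of the map sits near 0, right half near 5.
weights = np.zeros((4, 6, 3))
weights[:, 3:, :] = 5.0
umat = u_matrix(weights)
print(np.round(umat, 2))
```

In the printed matrix the two middle columns carry the large values: they mark the boundary between two groups of nodes that are neighbors on the map but far apart in the original space.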
Using SOM in conjunction with other methods
> Since SOM is a dimensionality
  reduction method, for smaller
  datasets you can initialize your
  SOM using the first 2
  principal components,
  essentially the 2D PCA map
> K-Means can also be run on the
  same dataset, and the
  corresponding clusters can be
  visualized on the SOM.
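One way to picture PCA initialization is to lay the initial weight vectors on the plane spanned by the first two principal components. The sketch below is an assumption about how this could be done, not the code of SOMPY or any other package; the function name `pca_init` and the choice to spread nodes over plus or minus one standard deviation are illustrative.

```python
import numpy as np

def pca_init(data, rows, cols):
    """Place initial SOM weights on the plane of the first two principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Standard deviation of the data along PC1 and PC2.
    scale = s[:2] / np.sqrt(len(data) - 1)
    # Spread nodes evenly from -1 to +1 standard deviation along each component.
    a = np.linspace(-1.0, 1.0, rows)[:, None, None] * scale[0] * vt[0]
    b = np.linspace(-1.0, 1.0, cols)[None, :, None] * scale[1] * vt[1]
    return mean + a + b

# Placeholder data: 100 samples with 5 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
init_weights = pca_init(data, rows=5, cols=7)
print(init_weights.shape)
```

Starting from this plane instead of random weights gives the map a sensible orientation before training begins.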
 K-Means clustering and U-Matrix
 They can be compared to validate the results!
> SOM can provide a means to visualize K-Means!
> If the K-Means cluster boundaries match the U-matrix boundaries
  well, then the training is successful
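A rough sketch of the overlay idea: cluster the data with K-Means, find each sample's best matching node, and color each node by the majority cluster of the samples it attracts. Everything here is illustrative. The small Lloyd's K-Means, the function names, and the toy weights (which stand in for a trained map) are assumptions, and real packages do this differently.

```python
import numpy as np

def nearest(data, centers):
    """Index of the closest center for every sample."""
    return np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=-1), axis=1)

def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means; returns one cluster label per sample."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(data, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return nearest(data, centers)

def cluster_map(weights, data, labels, k):
    """Majority K-Means label per SOM node; -1 marks nodes that win no sample."""
    rows, cols, _ = weights.shape
    votes = np.zeros((rows, cols, k), dtype=int)
    for sample, label in zip(data, labels):
        d = np.linalg.norm(weights - sample, axis=-1)
        r, c = np.unravel_index(np.argmin(d), d.shape)
        votes[r, c, label] += 1
    out = votes.argmax(axis=-1)
    out[votes.sum(axis=-1) == 0] = -1
    return out

# Toy data: two tight blobs. Toy weights: the node in column c has weight (c, c).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(4.0, 0.1, (30, 2))])
weights = np.zeros((3, 5, 2))
weights[:, :, :] = np.arange(5.0)[None, :, None]
labels = kmeans(data, k=2)
cmap = cluster_map(weights, data, labels, k=2)
print(cmap)
```

The resulting label map can be drawn next to the U-matrix of the same weights to check whether the cluster boundaries fall on the high-distance ridges.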
Different Implementations of SOM
> SOM is just an algorithm; there are many
  packages you can use that implement it
> We will introduce
  – An augmented version of SOMPY, to which our group has
    contributed
  – MiniSOM
The uniqueness and functions of augmented
SOMPY
https://github.com/DataScienceUWMSE/SOM
> Utilizes PCA for initialization, and includes a K-Means
  clustering overlay
> “Heat maps” provide a way to visualize each
  feature after training
> Projection function helps users find additional
  correlations or patterns among features,
  including for categorical data
“Heatmap” concept
> Map each node’s
  weight for one feature
  onto the 2D map
> The number of heat maps
  equals the number of
  input variables
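In array terms, each heat map is one slice of the trained weight array along the feature axis. The sketch below assumes weights stored as an array of shape (rows, cols, n_features); the function name and the feature names are hypothetical, and the random weights only stand in for a trained map.

```python
import numpy as np

def component_planes(weights, feature_names):
    """Return one 2D heat map per input variable, keyed by feature name."""
    return {name: weights[:, :, k] for k, name in enumerate(feature_names)}

# Stand-in for trained weights: a 3 x 4 map with 2 input variables.
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4, 2))
planes = component_planes(weights, ["density", "modulus"])  # hypothetical names
for name, plane in planes.items():
    print(name, plane.shape)
```

Each returned array can be passed to a plotting call such as matplotlib's `imshow` to draw the heat map for that feature.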
Example of utilizing the
heatmap on materials research
Example 1 Granta Data Set: Experimental Commercial Materials
Property Dataset
> Training data set
  contains 398 commercial
  materials and 21
  numerical properties
 Example of utilizing the heatmap on materials
 research
Example 1 Granta Data Set: Experimental Commercial Materials
Property Dataset (continued)
Information projection concept
> Overlay one specific data
  property onto the SOM; even
  categorical values can be
  used
> Easily identify patterns
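A minimal sketch of projecting a categorical property: each sample is sent to its best matching node, and each node shows the most common category among the samples it receives. The function name, the toy weights, and the category labels are made up for illustration; this is not the projection function of the augmented SOMPY.

```python
import numpy as np
from collections import Counter

def project_labels(weights, data, labels):
    """Most common label per node among the samples whose BMU is that node."""
    rows, cols, _ = weights.shape
    counts = {}
    for sample, label in zip(data, labels):
        d = np.linalg.norm(weights - sample, axis=-1)
        node = tuple(int(i) for i in np.unravel_index(np.argmin(d), d.shape))
        counts.setdefault(node, Counter())[label] += 1
    grid = np.full((rows, cols), "", dtype=object)  # "" marks nodes with no samples
    for (r, c), counter in counts.items():
        grid[r, c] = counter.most_common(1)[0][0]
    return grid

# Toy 2 x 2 map with one feature; node weights are 0, 1, 2, 3.
weights = np.arange(4.0).reshape(2, 2, 1)
data = np.array([[0.1], [0.2], [2.9], [1.1]])
labels = ["metal", "metal", "ceramic", "polymer"]  # hypothetical categories
label_grid = project_labels(weights, data, labels)
print(label_grid)
```

Because the property is only overlaid after training, it does not need to be one of the numerical inputs used to train the map.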
Example of utilizing the projection function on
materials research
Example 1 Granta Data Set: Experimental Commercial Materials
Property Dataset (continued): finding what makes the outliers unique
Example of utilizing the projection function on
materials research
Example 2 OPV materials study using an experimental dataset
Reference: Y. Huang, J. Phys. Chem. C 2020, 124, 12871−12882
> Dataset includes 1203 donor
  polymers of Donor-Acceptor
  pairs, with properties
  related to the proficiency of
  the charge transfer.
Molecular Descriptors
Python packages for molecular descriptors
> There are Python tools that extract molecular
  structural or geometrical information from a
  notation of the molecule, such as SMILES (Simplified
  Molecular-Input Line-Entry System)
> We will introduce Mordred (covered in the hands-on
  session)
The advantage of using MiniSOM
> SOMPY is not as easy to use as the other packages
  introduced in this class.
  – The augmented SOMPY has contributions from a few
    Materials Science researchers in our group, including
    your TA Jimin, Qian
> MiniSOM is relatively easy to use, well
  documented, and actively maintained, and
  has a basic implementation of the SOM
  algorithm
What MiniSOM provides
> It has:
   –   The core implementation of SOM
   –   Visualization
   –   U-Matrix (“distance map” in MiniSOM)
   –   Projection of a chosen feature onto the SOM
> It doesn't have:
   – PCA initialization
   – Heat maps for each feature
   – K-Means clustering
Hyperparameters of SOM
> Length of input vectors (the number of properties)
> Map size, the most important one
> Map topology – rectangular or hexagonal
   – Important in defining the notion of “neighbors”
> Sigma – spread of the neighborhood function
> Learning Rate – initial learning rate, decreases with the
  number of iterations
> Decay function – defines how much learning rate and sigma
  decrease with the number of iterations
> Neighborhood function – defines how much the neighbors of
  the BMU are affected at each iteration (e.g., Gaussian,
  bubble, …)
> Activation distance function (e.g., Euclidean distance)
> Initialization method – random or PCA
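To make a few of these hyperparameters concrete, the sketch below writes out two neighborhood functions and one decay function. The exact formulas vary between packages; the ones here, including the inverse-time decay and the function names, are illustrative assumptions.

```python
import numpy as np

def gaussian_neighborhood(grid_dist, sigma):
    """Smooth falloff: influence shrinks with squared grid distance from the BMU."""
    return np.exp(-grid_dist ** 2 / (2.0 * sigma ** 2))

def bubble_neighborhood(grid_dist, sigma):
    """Hard cutoff: nodes within sigma of the BMU update fully, others not at all."""
    return (grid_dist <= sigma).astype(float)

def inverse_time_decay(initial, t, max_iter):
    """One possible schedule: falls from `initial` at t=0 to initial/3 at t=max_iter."""
    return initial / (1.0 + t / (max_iter / 2.0))

grid_dist = np.array([0.0, 1.0, 2.0, 3.0])   # distance from the BMU on the map
gauss = gaussian_neighborhood(grid_dist, sigma=1.0)
bubble = bubble_neighborhood(grid_dist, sigma=1.0)
lr_start = inverse_time_decay(0.5, t=0, max_iter=100)
lr_end = inverse_time_decay(0.5, t=100, max_iter=100)
print(np.round(gauss, 3), bubble, lr_start, round(lr_end, 3))
```

The same decay function can be applied to both the learning rate and sigma, so the map makes large, broad adjustments early and small, local ones late.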
Hands-on session and HW for this week