Understanding the Machine Learning Algorithms
Rupak Roy
Supervised vs. Unsupervised Learning
• Supervised Learning: learns from known (labeled) examples.
• Unsupervised Learning: tries to find hidden structure in unlabeled data.
• Reinforcement Learning: learns by interacting with its environment and adapting to the responses (rewards) it receives.
Supervised Learning: Regression, K-Nearest Neighbors, SVM, Decision Trees, Random Forest, Neural Networks
Unsupervised Learning: K-Means Clustering, PCA, Association Rules (Eclat, Apriori), Neural Networks
Reinforcement Learning: Genetic Algorithm
Machine Learning by types
Unsupervised Learning:
• Dimension Reduction: PCA, LSA, SVD, LDA, T-SNE
• Pattern Search: Eclat, Apriori
• Clustering: K-means, Agglomerative, DBSCAN, Fuzzy C-means
Supervised Learning:
• Classification: K-NN, Naïve Bayes, SVM, Decision Trees, Logistic
Regression.
• Regression: Linear Regression, Polynomial Regression, Lasso
Regression.
Machine Learning by types
Reinforcement Learning
Genetic algorithm, A3C, SARSA, Q-learning, Deep Q-Network (DQN)
Ensemble Methods
• Boosting: AdaBoost, CatBoost, LightGBM, XGBoost
• Bagging: Random Forest
• Stacking
Neural Nets and Deep Learning
• Convolutional Neural Networks (CNN): DCNN
• Recurrent Neural Networks (RNN): LSM, LSTM, GRU
• Generative Adversarial Networks (GAN)
• Autoencoders: Seq2seq
• Perceptron, Multi-Layer Perceptron (MLP)
Algorithm Map
Here's a handy map of the algorithms.
Classification
Classification is a technique where the algorithm splits the data into homogeneous groups (classes).
Examples:
1) From an album of tagged photos, recognize someone in a new picture.
2) Analyze bank data for weird-looking transactions and flag those as possible fraud; spam filtering works the same way.
3) Given someone's music choices and a bunch of features of that music, predict which new songs they will like.
4) Group university students into types based on their learning styles.
5) Recognize handwritten characters, as well as the language they are written in.
Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-
Nearest Neighbors, Support Vector Machine
Naive Bayes Rule
Naive Bayes has been widely used for spam filtering. The algorithm counts how often a particular word appears in the spam list versus in normal mail (the good-word list), turns those counts into probabilities, and then combines the probabilities using the Bayes equation.
Later, spammers figured out how to trick spam filters by adding lots of "good" words at the end of an email; this method is called Bayesian poisoning.
[Slide figure: example word frequencies from the good-word and spam lists (Great 235, Opportunities 3, Speak 44, Meeting 246, Collaborative 3, Sales 77, Scope 98, 100% 642, Fast 78, Hurry 40); an incoming message "hello" is scored with Bayes' rule and classified as Not Spam.]
Bayes' rule: P(A|B) = P(B|A) × P(A) / P(B)
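As an illustration (not from the deck), here is a minimal Python sketch of this word-count style of scoring; the word lists, counts, priors, and the score_message helper are all hypothetical:

```python
# Hypothetical word counts: how often each word appeared in spam vs. normal ("ham") mail.
spam_counts = {"100%": 642, "hurry": 40, "fast": 78, "sales": 77}
ham_counts = {"meeting": 246, "great": 235, "speak": 44, "opportunities": 3}

spam_total = sum(spam_counts.values())
ham_total = sum(ham_counts.values())

P_SPAM, P_HAM = 0.5, 0.5  # assumed priors

def score_message(words, alpha=1.0):
    """Return P(spam | words) using Bayes' rule with add-one (Laplace) smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    p_spam, p_ham = P_SPAM, P_HAM
    for w in words:
        # Per-word likelihood under each class; smoothing keeps unseen words from zeroing the score.
        p_spam *= (spam_counts.get(w, 0) + alpha) / (spam_total + alpha * len(vocab))
        p_ham *= (ham_counts.get(w, 0) + alpha) / (ham_total + alpha * len(vocab))
    return p_spam / (p_spam + p_ham)  # normalize so the two scores sum to 1

print(score_message(["hurry", "100%", "fast"]))      # close to 1 -> spam
print(score_message(["meeting", "great", "speak"]))  # close to 0 -> not spam
```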
Naive Bayes Rule
Naive Bayes ignores a few things: word order and message length. It just looks at word frequencies to do the classification.
Naive Bayes strengths and weaknesses
Advantage:
Being a supervised classification algorithm, it is easy to implement.
Weakness:
It breaks in funny ways. For example, when people used to search Google for "Chicago Bulls", early systems returned the animal rather than the city's team, because phrases made of multiple words with distinct meanings don't work well with Naive Bayes. It also requires a categorical variable as the target.
Assumptions: bag of words (word position doesn't matter) and conditional independence, e.g. the word "great" occurring in a document is assumed not to depend on the word "fabulous" occurring in the same document.
Naive Bayes Rule
Prior probability of Green = number of green objects / total number of objects
Prior probability of Red = number of red objects / total number of objects
Green: 40/60 = 4/6
Red: 20/60 = 2/6
The prior probability is computed without any knowledge about the new point; the likelihood is computed after seeing where the data point falls.
Likelihood of 'x' given Red = number of red points in the neighborhood of 'x' / total number of red points = 3/20
Likelihood of 'x' given Green = number of green points in the neighborhood of 'x' / total number of green points = 1/40
Posterior probability of 'x' being Green = prior probability of Green × likelihood of 'x' given Green = 4/6 × 1/40 = 1/60 ≈ 0.017
Posterior probability of 'x' being Red = prior probability of Red × likelihood of 'x' given Red = 2/6 × 3/20 = 1/20 = 0.05
Prior probability × test evidence (likelihood) = posterior probability (up to normalization)
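A minimal Python sketch of this arithmetic, using the counts assumed on the slide (40 green objects, 20 red, and a neighborhood around 'x' containing 1 green and 3 red points):

```python
# Counts from the slide's example.
total_green, total_red = 40, 20   # objects of each class overall
neigh_green, neigh_red = 1, 3     # points of each class in the neighborhood of 'x'

total = total_green + total_red
prior_green = total_green / total       # 4/6
prior_red = total_red / total           # 2/6

like_green = neigh_green / total_green  # 1/40
like_red = neigh_red / total_red        # 3/20

post_green = prior_green * like_green   # ~0.017 (unnormalized posterior)
post_red = prior_red * like_red         # 0.05  (unnormalized posterior)

print("x is classified as", "Red" if post_red > post_green else "Green")
```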
Naive Bayes Rule
Finally, we classify 'x' as Red, since that class membership achieves the largest posterior probability.
Formula to remember: prior probability × likelihood ∝ posterior probability.
In Naive Bayes we simply take the class with the maximum posterior and convert that into the Yes/No classification.
Naive Bayes Rule
Suppose two people, Marty and Alica, write emails with the following word probabilities:
Marty: Love = 0.1, Deal = 0.8, Life = 0.1
Alica: Love = 0.5, Deal = 0.2, Life = 0.3
Assume equal prior probabilities: P(Alica) = 0.5, P(Marty) = 0.5.
Who wrote a mail containing "Love Life"?
Marty: 0.1 × 0.1 × 0.5 (prior) = 0.005
Alica: 0.5 × 0.3 × 0.5 (prior) = 0.075, so it's Alica (easy to see).
Who wrote "Life Deal"?
Marty: 0.1 × 0.8 × 0.5 (prior) = 0.04
Alica: 0.2 × 0.3 × 0.5 (prior) = 0.03, so it's Marty.
We can also normalize these into posterior probabilities:
P(Marty | "Life Deal") = 0.04 / (0.04 + 0.03) = 4/7 ≈ 0.57 (57%)
P(Alica | "Life Deal") = 0.03 / 0.07 = 3/7 ≈ 0.43 (43%)
(Dividing by 0.04 + 0.03 = 0.07 scales/normalizes the two scores so they sum to 1.)
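A minimal Python sketch of this word-probability example (values taken from the slide):

```python
# Per-author word probabilities and priors from the slide's example.
word_probs = {
    "Marty": {"love": 0.1, "deal": 0.8, "life": 0.1},
    "Alica": {"love": 0.5, "deal": 0.2, "life": 0.3},
}
priors = {"Marty": 0.5, "Alica": 0.5}

def posteriors(message):
    """Score each author on the message, then normalize the scores so they sum to 1."""
    words = message.lower().split()
    scores = {}
    for author, probs in word_probs.items():
        score = priors[author]
        for w in words:
            score *= probs[w]  # naive (conditional independence) assumption
        scores[author] = score
    total = sum(scores.values())
    return {author: score / total for author, score in scores.items()}

print(posteriors("Life Deal"))  # {'Marty': ~0.57, 'Alica': ~0.43}
print(posteriors("Love Life"))  # Alica wins easily
```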
Support Vector Machine
One of the most popular methods of classical classification. It tries to draw a separating line between the data points of the two classes with the largest possible margin on either side.
Which is the line that best separates the data, and why is it the best line?
The best line maximizes the distance to the nearest points on either side; this distance is called the MARGIN. The margin is the distance between the separating line and the nearest points of the two classes.
Support Vector Machine
Which line here is the best line?
The blue line maximizes the distance between the data points but misclassifies part of one class, which is called a class error. So the second (green) line is the best: it maximizes the distance between the two classes while still separating them. A Support Vector Machine first classifies the classes correctly and only then maximizes the margin.
How do we handle outliers?
SVMs are good at finding decision boundaries that maximize the distance between classes while at the same time tolerating individual outliers.
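As an illustration (not part of the deck), a minimal scikit-learn sketch of fitting a maximum-margin SVM on made-up 2-D data:

```python
from sklearn.svm import SVC

# Made-up 2-D points for two classes.
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel looks for the separating line with the largest margin.
# C controls the trade-off: a smaller C tolerates more margin violations (outliers).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # the points that define the margin
print(clf.predict([[3, 2], [7, 6]]))  # [0 1]
```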
Support Vector Machine (SVM)
Non-Linear Data
Yes, SVM will still work!
The SVM takes features x and y and converts them into a label (either Blue or Red).
If we add a new feature z = x² + y², we get a three-dimensional space in which the classes can be separated linearly: the inner (blue) class has small values of z, while the outer class has large values, because z measures the (squared) distance from the origin.
So is this linearly separable? Yes! The straight separating line in the (x, z) plane actually represents a circle in the original (x, y) plane.
[Slide figure: the original x-y scatter plot and the transformed x-z plot, with z = x² + y² and the labels fed into the SVM.]
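A minimal sketch of this feature transform (my own illustration with made-up circular data, not the deck's figure):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up non-linear data: an inner disc (class 0) surrounded by an outer ring (class 1).
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

# The slide's trick done by hand: add z = x^2 + y^2 as a third feature,
# after which a plain linear SVM can separate the classes.
z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, z])
print(SVC(kernel="linear").fit(X3, y).score(X3, y))  # ~1.0

# The RBF kernel performs a similar (implicit) transformation automatically.
print(SVC(kernel="rbf").fit(X, y).score(X, y))       # ~1.0
```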
Decision Tree Classifier
Example: should the bank give a loan?
Decision trees can separate non-linear data by splitting it into linear decision surfaces.
A decision tree splits based on node purity.
Entropy controls how a decision tree decides where to split the data; entropy is a measure of how disorganized a system is.
Common problem: over-fitting. A solution to this is to use ensemble methods.
[Slide figure: an example loan-decision tree that splits on Credit History (Good/Bad), Debt < 1000, and Time > 18, ending in a leaf probability of P = 0.3.]
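A minimal sketch of entropy-based splitting with scikit-learn (the loan data below is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up applicants: [credit_history_good (1/0), debt, months_at_current_job]
X = [[1, 500, 24], [1, 3000, 36], [0, 800, 12], [0, 4000, 6],
     [1, 900, 20], [0, 2500, 30], [1, 4500, 10], [0, 700, 40]]
y = [1, 1, 0, 0, 1, 0, 0, 1]  # 1 = give the loan, 0 = refuse

# criterion="entropy" makes the tree pick the splits that most reduce disorder;
# max_depth limits the tree's size, a simple guard against over-fitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["credit_good", "debt", "months_at_job"]))
print(tree.predict([[1, 1200, 18]]))
```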
Bias-Variance Dilemma
A highly biased machine learning algorithm is one that practically ignores the data. For example, a (biased) self-driving car trained this way keeps doing the same thing and never does anything differently, which is bad for machine learning.
On the other hand, a completely unbiased car will also perform very poorly, because it has no bias at all to help it generalize to new situations.
So in reality we want something in between, and we call this the bias-variance trade-off: the algorithm uses a biased model to generalize, but still stays open to listening to new data (the unbiased side).
Clustering
K-NN (K-nearest neighbors), also called memory-based reasoning, is a powerful data mining technique that can be used to solve a wide variety of data mining problems. It is a classification technique that groups together observations that are close to each other, using a distance function to measure the similarity between observations.
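A minimal K-NN sketch with scikit-learn (the observations below are made up):

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up observations: [height_cm, weight_kg] with two class labels.
X = [[150, 50], [155, 55], [160, 54], [180, 80], [185, 90], [178, 77]]
y = ["small", "small", "small", "large", "large", "large"]

# k = 3: a new point takes the majority label of its 3 nearest neighbors,
# where "nearest" is measured by a distance function (Euclidean here).
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

print(knn.predict([[158, 52], [182, 85]]))  # ['small' 'large']
```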
Feature Scaling
Feature scaling is a method used to normalize the range of independent variables or features of the data. In data processing it is also known as data normalization, and it is generally performed during the data preprocessing step.
Which algorithms are affected by feature scaling?
* Decision Tree: no, because each split considers one feature at a time, so there is no trade-off between feature scales.
* SVM: yes, especially with the RBF kernel, since it relies on distances.
* K-means clustering: yes, since it relies on distances.
* Linear Regression: no, because each variable gets its own coefficient independently of the other variables' scales.
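A minimal scaling sketch with scikit-learn (the numbers are made up):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: [salary_in_dollars, years_of_experience].
# The large salary scale would dominate any distance-based algorithm (SVM, k-means, K-NN).
X = [[40_000, 2], [85_000, 10], [120_000, 15], [55_000, 4]]

# Min-max scaling squeezes every feature into the range [0, 1].
print(MinMaxScaler().fit_transform(X))

# Standardization rescales every feature to mean 0 and unit variance.
print(StandardScaler().fit_transform(X))
```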
PCA: Principal component analysis
Principal component analysis is a method for finding the directions in the data onto which we can project the data while losing a minimal amount of information. In other words, it is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
It is like compression while preserving the information.
When to use PCA
• Latent features (latent features are 'hidden' features, as opposed to observed features. An example is text analysis: the 'words' extracted from the documents are the observed features; if we factorize the words we get 'topics', where a 'topic' is a group of words with semantic relevance. These are variables that cannot be measured directly.)
• To reduce noise for other algorithms, enabling them to process faster.
• Face recognition: images have many pixels, i.e. a high-dimensional space; with PCA we can reduce that high-dimensional space so that an SVM or another classification algorithm can do the actual classification of the picture faster.
PCA: Principal component analysis
And how do we condense our N features down to a few so that we really get to the heart of the information?
Let's see how we can do that.
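As a minimal forward-looking sketch (my own, not the deck's walkthrough), here is PCA with scikit-learn condensing a 4-feature dataset down to 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# A small 4-feature dataset, projected onto its 2 main directions.
X = load_iris().data               # shape (150, 4)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # shape (150, 2)

# How much of the original information (variance) each component keeps.
print(pca.explained_variance_ratio_)  # roughly [0.92, 0.05]
print(X_reduced[:3])
```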