Understanding the Machine Learning Algorithms
Rupak Roy
Supervised vs. Unsupervised Learning
• Supervised Learning: learns from known (labeled) examples.
• Unsupervised Learning: tries to find hidden structure in unlabeled data.
• Reinforcement Learning: learns by interacting with its environment and adapting to the responses (rewards) it receives.
Supervised Learning: Regression, K-Nearest Neighbors, SVM, Decision Trees, Random Forest, Neural Networks
Unsupervised Learning: K-Means Clustering, PCA, Association Rules (Eclat, Apriori), Neural Networks
Reinforcement Learning: Genetic Algorithm
Machine Learning by types
Unsupervised Learning:
• Dimension Reduction: PCA, LSA, SVD, LDA, T-SNE
• Pattern Search: Eclat, Apriori
• Clustering: K-means, Agglomerative, DBSCAN, Fuzzy C-means
Supervised Learning:
• Classification: K-NN, Naïve Bayes, SVM, Decision Trees, Logistic
Regression.
• Regression: Linear Regression, Polynomial Regression, Lasso
Regression.
Machine Learning by types
Reinforcement Learning
Genetic algorithm, A3C, SARSA, Q-learning, Deep Q-Network (DQN)
Ensemble Methods
• Boosting: AdaBoost, CatBoost, LightGBM, XGBoost
• Bagging: Random Forest
• Stacking
Neural Nets and Deep Learning
• Convolutional Neural Networks (CNN): DCNN
• Recurrent Neural Networks (RNN): LSM, LSTM, GRU
• Generative Adversarial Networks (GAN)
• Autoencoders: Seq2seq
• Perceptron, Multi-Layer Perceptron (MLP)
Algorithm Map
Here's a handy map of the algorithms.
Classification
Classification is a technique where the algorithm splits the data into homogeneous groups (classes).
Examples:
1) From an album of tagged photos, recognize someone in a new picture.
2) Analyze bank data for weird-looking transactions and flag those as possible fraud; spam filtering works the same way.
3) Given someone's music choices and a bunch of features of that music, predict which new songs they will like.
4) Group university students into types based on their learning styles.
5) Recognize handwritten characters, as well as the language they are written in.
Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-
Nearest Neighbors, Support Vector Machine
Naive Bayes Rule
Naive Bayes has been widely used for spam filtering. The algorithm counts how often a particular word appears in the spam list versus in normal mail (the good-word list), turns those counts into probabilities, and then combines the probabilities using the Bayes equation.
Later, spammers figured out how to trick spam filters by adding lots of "good" words at the end of an email; this method is called Bayesian poisoning.
[Slide figure: example word frequencies from the good-word and spam lists (Great 235, Opportunities 3, Speak 44, Meeting 246, Collaborative 3, Sales 77, Scope 98, 100% 642, Fast 78, Hurry 40); an incoming message "hello" is scored with Bayes' rule and classified as Not Spam.]
Bayes' rule: P(A|B) = P(B|A) × P(A) / P(B)
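As an illustration (not from the deck), here is a minimal Python sketch of this word-count style of scoring; the word lists, counts, priors, and the score_message helper are all hypothetical:

```python
# Hypothetical word counts: how often each word appeared in spam vs. normal ("ham") mail.
spam_counts = {"100%": 642, "hurry": 40, "fast": 78, "sales": 77}
ham_counts = {"meeting": 246, "great": 235, "speak": 44, "opportunities": 3}

spam_total = sum(spam_counts.values())
ham_total = sum(ham_counts.values())

P_SPAM, P_HAM = 0.5, 0.5  # assumed priors

def score_message(words, alpha=1.0):
    """Return P(spam | words) using Bayes' rule with add-one (Laplace) smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    p_spam, p_ham = P_SPAM, P_HAM
    for w in words:
        # Per-word likelihood under each class; smoothing keeps unseen words from zeroing the score.
        p_spam *= (spam_counts.get(w, 0) + alpha) / (spam_total + alpha * len(vocab))
        p_ham *= (ham_counts.get(w, 0) + alpha) / (ham_total + alpha * len(vocab))
    return p_spam / (p_spam + p_ham)  # normalize so the two scores sum to 1

print(score_message(["hurry", "100%", "fast"]))      # close to 1 -> spam
print(score_message(["meeting", "great", "speak"]))  # close to 0 -> not spam
```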
Naive Bayes Rule
Naive Bayes ignores a few things: word order and message length. It just looks at word frequencies to do the classification.
Naive Bayes strengths and weaknesses
Advantage:
Being a supervised classification algorithm, it is easy to implement.
Weakness:
It breaks in funny ways. For example, when people used to search Google for "Chicago Bulls", early systems returned the animal rather than the city's team, because phrases made of multiple words with distinct meanings don't work well with Naive Bayes. It also requires a categorical variable as the target.
Assumptions: bag of words (word position doesn't matter) and conditional independence, e.g. the word "great" occurring in a document is assumed not to depend on the word "fabulous" occurring in the same document.
Naive Bayes Rule
Prior probability of Green = number of green objects / total number of objects
Prior probability of Red = number of red objects / total number of objects
Green: 40/60 = 4/6
Red: 20/60 = 2/6
The prior probability is computed without any knowledge about the new point; the likelihood is computed after seeing where the data point falls.
Likelihood of 'x' given Red = number of red points in the neighborhood of 'x' / total number of red points = 3/20
Likelihood of 'x' given Green = number of green points in the neighborhood of 'x' / total number of green points = 1/40
Posterior probability of 'x' being Green = prior probability of Green × likelihood of 'x' given Green = 4/6 × 1/40 = 1/60 ≈ 0.017
Posterior probability of 'x' being Red = prior probability of Red × likelihood of 'x' given Red = 2/6 × 3/20 = 1/20 = 0.05
Prior probability × test evidence (likelihood) = posterior probability (up to normalization)
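A minimal Python sketch of this arithmetic, using the counts assumed on the slide (40 green objects, 20 red, and a neighborhood around 'x' containing 1 green and 3 red points):

```python
# Counts from the slide's example.
total_green, total_red = 40, 20   # objects of each class overall
neigh_green, neigh_red = 1, 3     # points of each class in the neighborhood of 'x'

total = total_green + total_red
prior_green = total_green / total       # 4/6
prior_red = total_red / total           # 2/6

like_green = neigh_green / total_green  # 1/40
like_red = neigh_red / total_red        # 3/20

post_green = prior_green * like_green   # ~0.017 (unnormalized posterior)
post_red = prior_red * like_red         # 0.05  (unnormalized posterior)

print("x is classified as", "Red" if post_red > post_green else "Green")
```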
Naive Bayes Rule
Finally, we classify 'x' as Red, since that class membership achieves the largest posterior probability.
Formula to remember: prior probability × likelihood ∝ posterior probability.
In Naive Bayes we simply take the class with the maximum posterior and convert that into the Yes/No classification.
Naive Bayes Rule
Suppose two people, Marty and Alica, write emails with the following word probabilities:
Marty: Love = 0.1, Deal = 0.8, Life = 0.1
Alica: Love = 0.5, Deal = 0.2, Life = 0.3
Assume equal prior probabilities: P(Alica) = 0.5, P(Marty) = 0.5.
Who wrote a mail containing "Love Life"?
Marty: 0.1 × 0.1 × 0.5 (prior) = 0.005
Alica: 0.5 × 0.3 × 0.5 (prior) = 0.075, so it's Alica (easy to see).
Who wrote "Life Deal"?
Marty: 0.1 × 0.8 × 0.5 (prior) = 0.04
Alica: 0.2 × 0.3 × 0.5 (prior) = 0.03, so it's Marty.
We can also normalize these into posterior probabilities:
P(Marty | "Life Deal") = 0.04 / (0.04 + 0.03) = 4/7 ≈ 0.57 (57%)
P(Alica | "Life Deal") = 0.03 / 0.07 = 3/7 ≈ 0.43 (43%)
(Dividing by 0.04 + 0.03 = 0.07 scales/normalizes the two scores so they sum to 1.)
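A minimal Python sketch of this word-probability example (values taken from the slide):

```python
# Per-author word probabilities and priors from the slide's example.
word_probs = {
    "Marty": {"love": 0.1, "deal": 0.8, "life": 0.1},
    "Alica": {"love": 0.5, "deal": 0.2, "life": 0.3},
}
priors = {"Marty": 0.5, "Alica": 0.5}

def posteriors(message):
    """Score each author on the message, then normalize the scores so they sum to 1."""
    words = message.lower().split()
    scores = {}
    for author, probs in word_probs.items():
        score = priors[author]
        for w in words:
            score *= probs[w]  # naive (conditional independence) assumption
        scores[author] = score
    total = sum(scores.values())
    return {author: score / total for author, score in scores.items()}

print(posteriors("Life Deal"))  # {'Marty': ~0.57, 'Alica': ~0.43}
print(posteriors("Love Life"))  # Alica wins easily
```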
Support Vector Machine
One of the most popular methods of classical classification. It tries to draw a separating line between the data points of the two classes with the largest possible margin on either side.
Which is the line that best separates the data, and why is it the best line?
The best line maximizes the distance to the nearest points on either side; this distance is called the MARGIN. The margin is the distance between the separating line and the nearest points of the two classes.
Support Vector Machine
Which line here is the best line?
The blue line maximizes the distance between the data points but misclassifies part of one class, which is called a class error. So the second (green) line is the best: it maximizes the distance between the two classes while still separating them. A Support Vector Machine first classifies the classes correctly and only then maximizes the margin.
How do we handle outliers?
SVMs are good at finding decision boundaries that maximize the distance between classes while at the same time tolerating individual outliers.
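As an illustration (not part of the deck), a minimal scikit-learn sketch of fitting a maximum-margin SVM on made-up 2-D data:

```python
from sklearn.svm import SVC

# Made-up 2-D points for two classes.
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel looks for the separating line with the largest margin.
# C controls the trade-off: a smaller C tolerates more margin violations (outliers).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # the points that define the margin
print(clf.predict([[3, 2], [7, 6]]))  # [0 1]
```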
Support Vector Machine (SVM)
Non-Linear Data
Yes, SVM will still work!
The SVM takes features x and y and converts them into a label (either Blue or Red).
If we add a new feature z = x² + y², we get a three-dimensional space in which the classes can be separated linearly: the inner (blue) class has small values of z, while the outer class has large values, because z measures the (squared) distance from the origin.
So is this linearly separable? Yes! The straight separating line in the (x, z) plane actually represents a circle in the original (x, y) plane.
[Slide figure: the original x-y scatter plot and the transformed x-z plot, with z = x² + y² and the labels fed into the SVM.]
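A minimal sketch of this feature transform (my own illustration with made-up circular data, not the deck's figure):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up non-linear data: an inner disc (class 0) surrounded by an outer ring (class 1).
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

# The slide's trick done by hand: add z = x^2 + y^2 as a third feature,
# after which a plain linear SVM can separate the classes.
z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, z])
print(SVC(kernel="linear").fit(X3, y).score(X3, y))  # ~1.0

# The RBF kernel performs a similar (implicit) transformation automatically.
print(SVC(kernel="rbf").fit(X, y).score(X, y))       # ~1.0
```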
Decision Tree Classifier
Example: should the bank give a loan?
Decision trees can separate non-linear data by splitting it into linear decision surfaces.
A decision tree splits based on node purity.
Entropy controls how a decision tree decides where to split the data; entropy is a measure of how disorganized a system is.
Common problem: over-fitting. A solution to this is to use ensemble methods.
[Slide figure: an example loan-decision tree that splits on Credit History (Good/Bad), Debt < 1000, and Time > 18, ending in a leaf probability of P = 0.3.]
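A minimal sketch of entropy-based splitting with scikit-learn (the loan data below is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up applicants: [credit_history_good (1/0), debt, months_at_current_job]
X = [[1, 500, 24], [1, 3000, 36], [0, 800, 12], [0, 4000, 6],
     [1, 900, 20], [0, 2500, 30], [1, 4500, 10], [0, 700, 40]]
y = [1, 1, 0, 0, 1, 0, 0, 1]  # 1 = give the loan, 0 = refuse

# criterion="entropy" makes the tree pick the splits that most reduce disorder;
# max_depth limits the tree's size, a simple guard against over-fitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["credit_good", "debt", "months_at_job"]))
print(tree.predict([[1, 1200, 18]]))
```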
Bias-Variance Dilemma
A highly biased machine learning algorithm is one that practically ignores the data. For example, a (biased) self-driving car trained this way keeps doing the same thing and never does anything differently, which is bad for machine learning.
On the other hand, a completely unbiased car will also perform very poorly, because it has no bias at all to help it generalize to new situations.
So in reality we want something in between, and we call this the bias-variance trade-off: the algorithm uses a biased model to generalize, but still stays open to listening to new data (the unbiased side).
Clustering
K-NN (K-nearest neighbors), also called memory-based reasoning, is a powerful data mining technique that can be used to solve a wide variety of data mining problems. It is a classification technique that groups together observations that are close to each other, using a distance function to measure the similarity between observations.
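A minimal K-NN sketch with scikit-learn (the observations below are made up):

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up observations: [height_cm, weight_kg] with two class labels.
X = [[150, 50], [155, 55], [160, 54], [180, 80], [185, 90], [178, 77]]
y = ["small", "small", "small", "large", "large", "large"]

# k = 3: a new point takes the majority label of its 3 nearest neighbors,
# where "nearest" is measured by a distance function (Euclidean here).
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

print(knn.predict([[158, 52], [182, 85]]))  # ['small' 'large']
```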
Feature Scaling
Feature scaling is a method used to normalize the range of independent variables or features of the data. In data processing it is also known as data normalization, and it is generally performed during the data preprocessing step.
Which algorithms are affected by feature scaling?
* Decision Tree: no, because each split considers one feature at a time, so there is no trade-off between feature scales.
* SVM: yes, especially with the RBF kernel, since it relies on distances.
* K-means clustering: yes, since it relies on distances.
* Linear Regression: no, because each variable gets its own coefficient independently of the other variables' scales.
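A minimal scaling sketch with scikit-learn (the numbers are made up):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: [salary_in_dollars, years_of_experience].
# The large salary scale would dominate any distance-based algorithm (SVM, k-means, K-NN).
X = [[40_000, 2], [85_000, 10], [120_000, 15], [55_000, 4]]

# Min-max scaling squeezes every feature into the range [0, 1].
print(MinMaxScaler().fit_transform(X))

# Standardization rescales every feature to mean 0 and unit variance.
print(StandardScaler().fit_transform(X))
```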
PCA: Principal component analysis
Principal component analysis is a method for finding the directions in the data onto which we can project the data while losing a minimal amount of information. In other words, it is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
It is like compression while preserving the information.
When to use PCA
• Latent features (latent features are 'hidden' features, as opposed to observed features. An example is text analysis: the 'words' extracted from the documents are the observed features; if we factorize the words we get 'topics', where a 'topic' is a group of words with semantic relevance. These are variables that cannot be measured directly.)
• To reduce noise for other algorithms, enabling them to process faster.
• Face recognition: images have many pixels, i.e. a high-dimensional space; with PCA we can reduce that high-dimensional space so that an SVM or another classification algorithm can do the actual classification of the picture faster.
PCA: Principal component analysis
And how do we condense our N features down to a few so that we really get to the heart of the information?
Let's see how we can do that.
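As a minimal forward-looking sketch (my own, not the deck's walkthrough), here is PCA with scikit-learn condensing a 4-feature dataset down to 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# A small 4-feature dataset, projected onto its 2 main directions.
X = load_iris().data               # shape (150, 4)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # shape (150, 2)

# How much of the original information (variance) each component keeps.
print(pca.explained_variance_ratio_)  # roughly [0.92, 0.05]
print(X_reduced[:3])
```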