1 
Introduction to Machine Learning 
Kiran Lonikar
2 
What is learning? 
Tom Mitchell: a program is said to learn from experience E with respect to some task T and performance measure P if its performance at T, as measured by P, improves with experience E. 
In plain English: performing some task better with experience and training… 
Key Elements: 
• Remember or memorize the past experiences E 
• Generalize from the experiences E 
Observe how kids learn to read words: they make mistakes even when reading previously known words, then correct themselves. This happens especially with words containing silent letters and words ending in “-tion”. 
Warning: This is a highly mathematical subject!
3 
What is Machine Learning? 
How would you build a computer program which “learns” from experience? 
Generally a three-phase process: 
• Express experience E mathematically: build a set of features related to the experiences (feature extraction from raw data) 
• Memorize and generalize: build a mathematical model or a set of rules from the experiences (training) 
• Apply the mathematical model to the features of future tasks
4 
Machine Learning in Action… 
• Word Lens mobile app 
• OCR in web pages: 
http://newscarousel.herokuapp.com/scribble-js/Scribble.html
5 
Types of ML Systems 
• Supervised Learning 
• Classification 
• Logistic Regression, SVM, Naive Bayes (NB), Decision Trees, ANN, etc. 
• Regression 
• Recommender Systems* 
• User-user/item-item similarity, matrix factorization etc. 
• Unsupervised Learning 
• Clustering 
• K-means, Fuzzy K-Means, model-based clustering (e.g., LDA), etc. 
• Dimensionality reduction 
• Principal Component Analysis (PCA) 
• Anomaly Detection
6 
Classification 
Identify a speaker’s gender from the voice spectrum 
[Figure: voice samples plotted by amplitude vs. frequency, labeled by gender] 
• Training: build a model using data {(a1, f1, g1), (a2, f2, g2), …, (am, fm, gm)} 
• Logistic Regression (LR): p(g = F | a, f; θ) = hθ(x) = sigmoid(θ0 + θ1·a + θ2·f) 
• Decision boundary: if p < 0.5 predict g = M, else predict g = F
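To make this concrete, here is a minimal sketch of scoring one voice sample against such a model (Python; the parameter values in theta and the feature scales are made-up assumptions for illustration, not taken from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical trained parameters: [theta0 (bias), theta1 (amplitude), theta2 (frequency)]
theta = [-9.0, 0.02, 0.04]

def p_female(amplitude, frequency_hz):
    """p(g = F | a, f; theta) for one voice sample."""
    z = theta[0] + theta[1] * amplitude + theta[2] * frequency_hz
    return sigmoid(z)

p = p_female(60.0, 210.0)          # a sample with a 210 Hz fundamental
gender = "F" if p >= 0.5 else "M"  # decision boundary at p = 0.5
```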
7 
Logistic Regression 
• Let y = 1 when g = F and y = 0 when g = M, and define the feature vector x = [a, f]. 
• Define the hypothesis hθ(x) = sigmoid(θT x), where sigmoid(z) = 1/(1 + e^(-z)). It represents the probability p(y = 1 | x; θ). 
• Cost: J(θ) = -Σ(y·log(h) + (1 - y)·log(1 - h)) + λθTθ, summed over all training examples, for some regularization parameter λ. 
• Optimization (gradient descent): find the θ that minimizes J(θ) by repeatedly updating θ := θ - α∇J(θ). 
• Fit the model θ to cross-validation data, varying λ for the best fit. 
• Test the model θ against test data: if hθ(x) ≥ 0.5 predict gender = F, otherwise predict M.
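A minimal end-to-end training sketch of the above in plain NumPy (the synthetic data, learning rate alpha, and λ value are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lam=0.1, alpha=1e-4, iters=20000):
    """Batch gradient descent on the regularized cross-entropy cost J(theta).

    X: (m, n) feature matrix with a leading column of 1s for the bias theta0.
    y: (m,) labels in {0, 1} (1 = F, 0 = M in the gender example).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)                   # h_theta(x) for every example
        grad = X.T @ (h - y) + 2 * lam * theta   # gradient of J(theta)
        theta -= alpha * grad                    # theta := theta - alpha * grad(J)
    return theta

# Tiny synthetic training set: columns are [1, amplitude, frequency]
X = np.array([[1, 55, 120], [1, 60, 210], [1, 50, 110], [1, 65, 230]], float)
y = np.array([0, 1, 0, 1])                       # M, F, M, F
theta = train_logistic_regression(X, y)
predict_female = sigmoid(X @ theta) >= 0.5       # decision boundary at 0.5
```

In practice, scaling the amplitude and frequency features to comparable ranges speeds up convergence considerably.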
8 
Recommender Systems 
• User j specifies a rating for item i: y(i, j). These known ratings are the training data. 
• Guess the ratings for the other items: the blanks 
[Figure: an Items × Users matrix of ratings 1 to 5, with many blank cells to be predicted] 
• Collaborative Filtering: k latent features for each item: 
• Feature vector xi for item i: {xi1, xi2, …, xik} 
• Parameter vector θj for user j: {θj1, θj2, …, θjk} 
• User j’s estimated rating for item i: (θj)T xi
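As a sketch of that last bullet (all numbers here are hypothetical, with k = 3 latent features):

```python
import numpy as np

x_i = np.array([0.9, 0.1, 0.4])      # item i's learned feature vector (k = 3)
theta_j = np.array([4.5, 0.2, 1.0])  # user j's learned preference vector

rating = theta_j @ x_i               # (theta_j)^T x_i -> predicted rating ~= 4.5
```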
9 
Recommender Systems 
• Learn xi and θj: 
• Given the xi, minimize Σ((θj)T xi - y(i,j))² over all i where user j has rated item i, to find the optimum θj. 
• Given the θj, minimize Σ((θj)T xi - y(i,j))² over all j where user j has rated item i, to find the optimum xi. 
• Simultaneously: minimize Σ((θj)T xi - y(i,j))² over all (i, j) where user j has rated item i, to find the optimum θj and xi. 
• Equivalently, find factors X and Θ of the ratings matrix Y such that Y ≈ XΘT. 
• Other algorithms: user-user similarity, item-item similarity. 
• Useful even when the users are not humans, e.g., wiki documents as “users” and links as “items”.
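A minimal alternating-least-squares sketch of this factorization in plain NumPy (the ridge term lam, iteration count, and random initialization are my own choices; the slide only specifies the squared-error objective):

```python
import numpy as np

def als(Y, M, k=2, lam=0.1, iters=20):
    """Factor Y ~= X @ Theta.T by alternating minimization.

    Y: (items, users) ratings matrix; M: same-shape 0/1 mask, 1 where rated.
    Returns X (items, k) and Theta (users, k).
    """
    n_items, n_users = Y.shape
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n_items, k))
    Theta = rng.normal(size=(n_users, k))
    I = np.eye(k)
    for _ in range(iters):
        # Given X, solve for each user's theta_j (regularized least squares)
        for j in range(n_users):
            rated = M[:, j] == 1
            A = X[rated]
            Theta[j] = np.linalg.solve(A.T @ A + lam * I, A.T @ Y[rated, j])
        # Given Theta, solve for each item's x_i
        for i in range(n_items):
            rated = M[i, :] == 1
            A = Theta[rated]
            X[i] = np.linalg.solve(A.T @ A + lam * I, A.T @ Y[i, rated])
    return X, Theta

# Predicted rating of item i by user j: X[i] @ Theta[j]
```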
10 
Clustering 
• Example: documents plotted by the counts of their two most frequent terms 
• Training set: {x1, x2, x3, …, xm}, each xi a feature vector 
• No labels (yi) specified 
[Figure: documents scattered on #Term 1 vs. #Term 2 axes]
11 
Clustering: Applications 
• Computer Science 
• Document Clustering 
• Google News: organizing similar news items from different sources 
• News categorization 
• Social network analysis 
• Feature reduction: speeding up ML pipelines 
• Cluster centroids as new features 
• Image compression (reducing the number of colors): pre-processing for faster, more memory-efficient computations (see the sketch after this list) 
• Deep Learning: alternating supervised and unsupervised learning 
• Recommender Systems 
• Physics: 
• Astronomy 
• Particle physics 
• Market segmentation 
• http://en.wikipedia.org/wiki/Cluster_analysis#Applications
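A sketch of the image-compression idea: cluster all pixel colors and replace each pixel with its cluster centroid. This uses scikit-learn's KMeans for brevity; the function name and the n_colors default are my own:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, n_colors=16):
    """Compress an (H, W, 3) RGB image down to n_colors distinct colors."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_colors, n_init=10).fit(pixels)
    # Replace every pixel with the color of its cluster centroid
    quantized = km.cluster_centers_[km.labels_]
    return quantized.reshape(h, w, 3).astype(np.uint8)
```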
12 
K-Means Clustering 
1. Randomly choose initial cluster centroids 
[Figure: the #Term 1 vs. #Term 2 document scatter, with the chosen centroids marked] 
2. Assign each training example to a cluster: pick the closest centroid 
3. Move the centroids: re-compute each centroid as the average of the training points assigned to it 
4. Repeat steps 2 and 3 until a maximum iteration count or convergence
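A minimal NumPy sketch of these four steps (Lloyd's algorithm; the seed and iteration cap are arbitrary choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Steps 1-4 above. X: (m, n) training points; returns centroids and labels."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose initial centroids from the training points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each example to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move centroids: average of the points assigned to each cluster
        moved = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                          else centroids[c] for c in range(k)])
        # 4. Stop early on convergence
        if np.allclose(moved, centroids):
            break
        centroids = moved
    return centroids, labels
```

As the editor's note for this slide points out, the averaging step only guarantees convergence for Euclidean distance measures.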
13 
Popular Machine Learning Tools 
• Apache Mahout: 
• Various recommender-system, clustering, and classification algorithms 
• Java-based, with some algorithms having Hadoop MapReduce implementations; recently started Spark implementations, with a new ML DSL 
• Stable, widely used in production, with community support 
• R: 
• Popular in the statistics world; has its own language 
• GNU license 
• Spark MLlib, MLbase (http://www.mlbase.org/): 
• Scala-based; runs on Spark (in-memory, distributed)
14 
Popular Machine Learning Tools 
• Weka: 
• Java based 
• GNU License 
• Vowpal Wabbit: http://hunch.net/~vw/, 
https://github.com/JohnLangford/vowpal_wabbit 
• Google Prediction API 
• http://en.wikipedia.org/wiki/Machine_learning#Software
15 
Machine Learning In Action 
• Mobile: 
• Speech Recognition: Google Now, Siri 
• Languages/NLP: Google Translate 
• Vision: face recognition in cameras and online photos, OCR 
• Misc: handwriting-driven MyScript Calculator and Stylus keyboard 
• Applications 
• OCR of printed documents and handwriting 
• Automatic tagging of photos based on similar faces 
• Biology and Medicine: 
• DNA analysis for likelihood of diseases, personalized drugs 
etc.
16 
Resources 
• Online Courses: 
• Coursera: Machine Learning (Andrew Ng) 
• Coursera: Neural Networks for Machine Learning (Geoffrey 
Hinton) 
• Udacity: Intro to Artificial Intelligence (Peter Norvig, Sebastian 
Thrun) 
• CMU: Introduction to Machine Learning (Alex Smola) 
• Berkeley: Scalable Machine Learning (Alex Smola) 
• Books: 
• Pattern Recognition and Machine Learning: Christopher Bishop 
• Machine Learning: Tom Mitchell 
• Mahout In Action 
• Artificial Intelligence: A Modern Approach (http://aima.cs.berkeley.edu/) 
• Machine Learning in Action
17 
Resources 
• Quora: 
• http://www.quora.com/How-do-you-explain-Machine-Learning-and-Data-Mining-to-non-Computer-Science-people 
• http://www.quora.com/Machine-Learning 
• Misc.: 
• http://fastml.com/ 
• http://alex.smola.org/ 
• https://funnel.hasgeek.com/fifthel2014/1132-realizing-large-scale-distributed-deep-learning-ne 
• http://spark-summit.org/2014/agenda 
• Tutorial on HMM, Speech Recognition: Rabiner 
• Tesseract OCR library


Editor's Notes

  • #5 Let's look at some real-life applications. Word Lens is a very popular mobile app which performs OCR, translation, and inline display of the translated text on the app screen. It uses a chain of ML classification algorithms: it detects areas of text in the image, performs OCR, then translation. Scribble-js performs classification of scribbled text using two pre-trained models: Logistic Regression and an Artificial Neural Network. Applications in particle physics: http://www.techrepublic.com/blog/european-technology/cern-where-the-big-bang-meets-big-data/ https://developers.google.com/events/io/sessions/333315382
  • #6 Recommender systems are a special kind of supervised learning; here the features are learnt from the user preferences. Clustering has applications in image compression too, apart from classical ML applications. Canopy clustering is another clustering algorithm, usually used to pick initial cluster centroids before running k-means clustering.
  • #8 https://github.com/klonikar/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkLRMultiClass.scala
  • #12 Particle Physics: http://www.lpthe.jussieu.fr/~salam/repository/docs/kt-cgta-v2.pdf Higgs Boson: http://www.exploratorium.edu/origins/cern/ideas/higgs.html
  • #13 Step 3 of moving cluster centroids using the average minimizes distance for Euclidean distance measures. For non-Euclidean distance measures, the algorithm may not converge.