Introduction to Machine Learning with Mahout
John Ternent
@jaternent
Orlando Data Science – www.orlandods.com
May 13, 2014
Welcome!
Updates
Social Media
• Facebook.com/orlandodata
• Twitter.com/orlandodata
• LinkedIn
OrlandoDS.com
• Social Network
• Forum
• Articles and Content
• And More
• Send articles to: scott@orlandods.com
Orlando Wiki
• Completely Open
• Aggregate Learning Resources!
• Go NUTS
May 28th Event
• Full Sail, UCF, and Florida Polytechnic
• Submit Your Questions! @orlandodata
Member Survey
• Need n=30!!!
• OrlandoDS.com/member-survey
• OR: find it in our past meetup announcements
Learn Hadoop
• First Class: June 3rd
• Location: Here
Future Plans
• Establish Non-Profit
• Increase Global Following
• Become Strong Networking and Education Resource for YOU
A (very) little bit about me…
• Consultant (Management & Technology)
• Open Source Evangelist
• Full-spectrum data nerd
A little about you!
• Rate yourself (1 – 10) on Mahout
• Rate yourself (1 – 10) on Machine Learning/Data Mining
• Rate yourself (1 – 10) on Big Data/Hadoop
• Please wait… optimizing presentation…
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by
P, improves with experience E.
-- Tom M. Mitchell, 1997
Data mining is defined as the process of
discovering patterns in data. The process
must be automatic or (more usually)
semiautomatic. The patterns discovered must
be meaningful in that they lead to … an
economic advantage.
-- Ian H. Witten & Eibe Frank, 2005
If you’re in academia, you call it “machine
learning.” If you’re in business, you call it
“data mining.”
-- Mark Hall
[Cartoon contrasting the two: "I create or improve general-purpose algorithms for machine learning" vs. "I use multiple machine learning algorithms for practical data discovery." Source: xkcd]
Machine Learning Uses
Clustering
Classification
Recommendation
Machine Learning Algorithms
• Regression
• K-means Clustering
• K-NN
• CART
• Neural Networks
• Support Vector Machines
• Association Rules
• Principal Component Analysis
• Singular Value Decomposition
• Ensemble Methods
• Naïve Bayes
• …
Real-World Applications
• Recommender Systems
• Image recognition
• Signal Processing
• Propensity to buy/churn
• Fraud analysis
• Text analytics
• Spam filtering
• Forecasting methods
• Revenue management
• …
The Problem … and Opportunity
Big Data™
"If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset."
-- Omar Tawakol

Weka Explorer can handle ~1M instances, 25 attributes (50 MB file).
-- Ian Witten
Potential Solutions
• Expand RAM (scale up)
• Use incremental algorithms
• Use distributable algorithms (scale out)
Hadoop in 30 seconds
[Diagram: input splits → Map (K,V) → Shuffle/Sort → Reduce → output]
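To make the diagram concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class name, tokenization, and input/output paths are illustrative assumptions, not something from the deck:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: after the shuffle/sort groups values by key, sum the counts per word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper emits (word, 1) pairs from its input split, the framework shuffles and sorts them by key, and each reducer sums the counts for its group of keys.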
Finally -- Mahout
• A Java-based library of machine learning algorithms designed to support distributed processing
• Initially built on MapReduce, now leaning heavily towards Spark
• Primarily focused on recommenders, clustering, and classification
Running Mahout
• Locally – download the Mahout distro. bin/mahout is the wrapper script; by default it lists all the available example programs. Lots of tools are included to convert data into vector formats and pre-process text – worth a look.
• Amazon EC2 – configure the stack from scratch on EC2 servers.
• Amazon EMR – quicker start; much of the build is already optimized for MapReduce jobs. Just add Mahout as a custom JAR and pass the script as a parameter.
Running Recommenders
• Multiple recommender algorithms
  – User-based
  – Item-based
• A recommender needs (see the sketch below):
  – DataModel (e.g. FileDataModel)
  – Similarity driver (e.g. PearsonCorrelationSimilarity)
  – Neighborhood (NearestNUserNeighborhood, ThresholdUserNeighborhood)
  – Recommender
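A minimal sketch of wiring those four pieces together with Mahout's Taste API; the file name ratings.csv, the user ID 42, and the neighborhood size of 10 are placeholder assumptions:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // DataModel: CSV rows of userID,itemID,preference (path is a placeholder)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Similarity driver
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Neighborhood: the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    // Recommender: ties the three together
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}

FileDataModel expects simple userID,itemID,preference rows, which keeps the example self-contained.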
Running Recommenders
• Tip: If you have no preference values, there are Boolean equivalents of the recommender classes
• Evaluate user vs. item similarities (see the evaluation sketch below)
• Example
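One way to compare user- vs. item-based approaches is Taste's hold-out evaluation. A hedged sketch, reusing the same placeholder ratings.csv and an arbitrary 70/30 train/test split:

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class EvaluateRecommenderSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Builds a fresh recommender for each training split
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel trainingData) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(trainingData);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, trainingData);
        return new GenericUserBasedRecommender(trainingData, neighborhood, similarity);
      }
    };

    // Train on 70% of each user's preferences, test on the rest, over the full data set
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);

    // Lower is better: average absolute difference between estimated and actual preference
    System.out.println("Mean absolute error: " + score);
  }
}

For preference-free (Boolean) data, the same pattern applies with the Boolean variants, e.g. GenericBooleanPrefUserBasedRecommender paired with a similarity such as LogLikelihoodSimilarity.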
Clustering Algorithms
• To cluster you need (see the sketch after this list):
  – A location in n-dimensional space
  – A distance metric
  – A threshold
• K-means
• Canopy
• Dirichlet
• Fuzzy K-means
• Spectral Clustering
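The three ingredients map directly onto Mahout's APIs: points are Vectors and the metric is a DistanceMeasure. A small in-memory illustration with made-up coordinates; a real k-means run goes through the clustering drivers over vectors stored as sequence files:

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class ClusteringIngredientsSketch {
  public static void main(String[] args) {
    // Location in n-dimensional space: each point is a Vector
    Vector a = new DenseVector(new double[] {1.0, 1.0});
    Vector b = new DenseVector(new double[] {4.0, 5.0});

    // Distance metric: how "far apart" two points are
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    double d = measure.distance(a, b); // 5.0 for these two points

    // Threshold: e.g. a convergence delta or canopy distance, passed to the
    // clustering driver (k-means, canopy, fuzzy k-means, ...) along with the
    // vectors and the distance measure
    System.out.println("distance(a, b) = " + d);
  }
}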
Clustering
Clustering Text
• Identify k topics in a document corpus
• Requires conversion of text into vectors
• Lucene utilities are available to vectorize text and apply stop-word or weighting criteria
• seqdirectory – from a directory of text files
• lucene.vector – from a Lucene index
Classifiers
• Naïve Bayes
• Random forests
• Logistic regression (SGD) – see the sketch after this list
• Hidden Markov models
• Example: 20 Newsgroups
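Of the classifiers above, the SGD logistic regression has a simple in-memory API (Naïve Bayes, as in the 20 Newsgroups example, typically runs as a MapReduce job). A toy two-class sketch; the features, labels, and hyperparameters are invented for illustration:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdClassifierSketch {
  public static void main(String[] args) {
    // 2 categories, 3 features (including a constant bias term), L1 prior
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1())
            .learningRate(0.1)
            .lambda(0.001);

    // Toy training data: [bias, x1, x2] -> label 0 or 1
    double[][] features = {{1, 0.1, 0.2}, {1, 0.9, 0.8}, {1, 0.2, 0.1}, {1, 0.8, 0.9}};
    int[] labels = {0, 1, 0, 1};

    // SGD: one online update per training example, repeated over a few passes
    for (int pass = 0; pass < 20; pass++) {
      for (int i = 0; i < features.length; i++) {
        learner.train(labels[i], new DenseVector(features[i]));
      }
    }

    // classifyScalar returns the estimated probability of category 1
    Vector test = new DenseVector(new double[] {1, 0.85, 0.95});
    System.out.println("P(class 1) = " + learner.classifyScalar(test));
  }
}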
Sidebar: Risks of Big Data
• Unsupervised Learning