Introduction to Machine Learning with Mahout
John Ternent
@jaternent
Orlando Data Science – www.orlandods.com
May 13, 2014
Welcome!
Updates
Social Media
• Facebook.com/orlandodata
• Twitter.com/orlandodata
• LinkedIn
OrlandoDS.com
• Social Network
• Forum
• Articles and Content
• And More
• Send articles to: scott@orlandods.com
Orlando Wiki
• Completely Open
• Aggregate Learning Resources!
• Go NUTS
May 28th Event
• Full Sail, UCF, and Florida Polytechnic
• Submit Your Questions! @orlandodata
Member Survey
• Need n=30!!!
• OrlandoDS.com/member-survey
• OR: find it in our past meetup announcements
Learn Hadoop
• First Class: June 3rd
• Location: Here
Future Plans
• Establish Non-Profit
• Increase Global Following
• Become Strong Networking and Education Resource for YOU
A (very) little bit about me…
• Consultant (Management & Technology)
• Open Source Evangelist
• Full-spectrum data nerd
A little about you!
• Rate yourself (1 – 10) on Mahout
• Rate yourself (1 – 10) on Machine Learning/Data Mining
• Rate yourself (1 – 10) on Big Data/Hadoop
• Please wait… optimizing presentation…
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by
P, improves with experience E.
-- Tom M. Mitchell, 1997
Data mining is defined as the process of
discovering patterns in data. The process
must be automatic or (more usually)
semiautomatic. The patterns discovered must
be meaningful in that they lead to … an
economic advantage.
-- Ian H. Witten & Eibe Frank, 2005
If you’re in academia, you call it “machine
learning.” If you’re in business, you call it
“data mining.”
-- Mark Hall
[Cartoon contrasting the two: "I create or improve general-purpose algorithms for machine learning" vs. "I use multiple machine learning algorithms for practical data discovery." Source: xkcd]
Machine Learning Uses
Clustering
Classification
Recommendation
Machine Learning Algorithms
• Regression
• K-means Clustering
• K-NN
• CART
• Neural Networks
• Support Vector Machines
• Association Rules
• Principal Component Analysis
• Singular Value Decomposition
• Ensemble Methods
• Naïve Bayes
• …
Real-World Applications
• Recommender Systems
• Image recognition
• Signal Processing
• Propensity to buy/churn
• Fraud analysis
• Text analytics
• Spam filtering
• Forecasting methods
• Revenue management
• …
The Problem … and Opportunity
Big Data™
"If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset."
-- Omar Tawakol

Weka Explorer can handle ~1M instances, 25 attributes (50 MB file).
-- Ian Witten
Potential Solutions
• Expand RAM (scale up)
• Use incremental algorithms
• Use distributable algorithms (scale out)
Hadoop in 30 seconds
[Diagram: input splits → Map (K,V) → Shuffle/Sort → Reduce → output]
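To make the diagram concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class name, tokenization, and input/output paths are illustrative assumptions, not something from the deck:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: after the shuffle/sort groups values by key, sum the counts per word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper emits (word, 1) pairs from its input split, the framework shuffles and sorts them by key, and each reducer sums the counts for its group of keys.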
Finally -- Mahout
• A Java-based library of machine learning algorithms designed to support distributed processing
• Initially built on MapReduce, now leaning heavily towards Spark
• Primarily focused on recommenders, clustering, and classification
Running Mahout
• Locally – download the Mahout distro. bin/mahout is the wrapper script; by default it lists all the available example programs. Lots of tools are included to convert data into vector formats and pre-process text – worth a look.
• Amazon EC2 – configure the stack from scratch on EC2 servers.
• Amazon EMR – quicker start; much of the build is already optimized for MapReduce jobs. Just add Mahout as a custom JAR and pass the script as a parameter.
Running Recommenders
• Multiple recommender algorithms
  – User-based
  – Item-based
• A recommender needs (see the sketch below):
  – DataModel (e.g. FileDataModel)
  – Similarity driver (e.g. PearsonCorrelationSimilarity)
  – Neighborhood (NearestNUserNeighborhood, ThresholdUserNeighborhood)
  – Recommender
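A minimal sketch of wiring those four pieces together with Mahout's Taste API; the file name ratings.csv, the user ID 42, and the neighborhood size of 10 are placeholder assumptions:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // DataModel: CSV rows of userID,itemID,preference (path is a placeholder)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Similarity driver
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Neighborhood: the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    // Recommender: ties the three together
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}

FileDataModel expects simple userID,itemID,preference rows, which keeps the example self-contained.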
Running Recommenders
• Tip: If you have no preference values, there are Boolean equivalents of the recommender classes
• Evaluate user vs. item similarities (see the evaluation sketch below)
• Example
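One way to compare user- vs. item-based approaches is Taste's hold-out evaluation. A hedged sketch, reusing the same placeholder ratings.csv and an arbitrary 70/30 train/test split:

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class EvaluateRecommenderSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Builds a fresh recommender for each training split
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel trainingData) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(trainingData);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, trainingData);
        return new GenericUserBasedRecommender(trainingData, neighborhood, similarity);
      }
    };

    // Train on 70% of each user's preferences, test on the rest, over the full data set
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);

    // Lower is better: average absolute difference between estimated and actual preference
    System.out.println("Mean absolute error: " + score);
  }
}

For preference-free (Boolean) data, the same pattern applies with the Boolean variants, e.g. GenericBooleanPrefUserBasedRecommender paired with a similarity such as LogLikelihoodSimilarity.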
Clustering Algorithms
• To cluster you need (see the sketch after this list):
  – A location in n-dimensional space
  – A distance metric
  – A threshold
• K-means
• Canopy
• Dirichlet
• Fuzzy K-means
• Spectral Clustering
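The three ingredients map directly onto Mahout's APIs: points are Vectors and the metric is a DistanceMeasure. A small in-memory illustration with made-up coordinates; a real k-means run goes through the clustering drivers over vectors stored as sequence files:

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class ClusteringIngredientsSketch {
  public static void main(String[] args) {
    // Location in n-dimensional space: each point is a Vector
    Vector a = new DenseVector(new double[] {1.0, 1.0});
    Vector b = new DenseVector(new double[] {4.0, 5.0});

    // Distance metric: how "far apart" two points are
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    double d = measure.distance(a, b); // 5.0 for these two points

    // Threshold: e.g. a convergence delta or canopy distance, passed to the
    // clustering driver (k-means, canopy, fuzzy k-means, ...) along with the
    // vectors and the distance measure
    System.out.println("distance(a, b) = " + d);
  }
}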
Clustering
Clustering Text
• Identify k topics in a document corpus
• Requires conversion of text into vectors
• Lucene utilities are available to vectorize text and apply stop-word or weighting criteria
• seqdirectory – from a directory of text files
• lucene.vector – from a Lucene index
Classifiers
• Naïve Bayes
• Random forests
• Logistic regression (SGD) – see the sketch after this list
• Hidden Markov models
• Example: 20 Newsgroups
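Of the classifiers above, the SGD logistic regression has a simple in-memory API (Naïve Bayes, as in the 20 Newsgroups example, typically runs as a MapReduce job). A toy two-class sketch; the features, labels, and hyperparameters are invented for illustration:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdClassifierSketch {
  public static void main(String[] args) {
    // 2 categories, 3 features (including a constant bias term), L1 prior
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1())
            .learningRate(0.1)
            .lambda(0.001);

    // Toy training data: [bias, x1, x2] -> label 0 or 1
    double[][] features = {{1, 0.1, 0.2}, {1, 0.9, 0.8}, {1, 0.2, 0.1}, {1, 0.8, 0.9}};
    int[] labels = {0, 1, 0, 1};

    // SGD: one online update per training example, repeated over a few passes
    for (int pass = 0; pass < 20; pass++) {
      for (int i = 0; i < features.length; i++) {
        learner.train(labels[i], new DenseVector(features[i]));
      }
    }

    // classifyScalar returns the estimated probability of category 1
    Vector test = new DenseVector(new double[] {1, 0.85, 0.95});
    System.out.println("P(class 1) = " + learner.classifyScalar(test));
  }
}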
Sidebar: Risks of Big Data
• Unsupervised Learning