Machine Learning and Apache Mahout: An Introduction
Varad Meru
Software Development Engineer
Orzota, Inc.
about.me/vrdmr

© Varad Meru, 2013

Who Am I
• Orzota, Inc.
  • Making BigData Easy
  • Designing a Cloud-based platform for ETL, Analytics
• Past Work Experience
  • Persistent Systems Ltd.
  • Recommendation Engines and User Behavior Analytics.
• Area of Interest
  • Machine Learning
  • Distributed Systems
  • Recommendation Engines

Outline
• Introduction
• Machine Learning
  • Introduction and History
  • Types of Learning Algorithms
  • Applications
  • What’s New
• Apache Mahout
  • History
  • Architecture
  • Applications and Examples
• Conclusion

Machine Learning
Rise of the Machine-Era


Introduction
“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
• Term coined by Arthur Samuel
  • “Field of study that gives computers the ability to learn without being explicitly programmed”.
• Branch of Artificial Intelligence and Statistics
• Focuses on prediction based on known properties
• Used as a sub-process in Data Mining.
  • Data Mining focuses on discovering new, unknown properties.

Learning Algorithms
• Supervised Learning
  • Labelled input data.
  • Creating classifiers to predict unseen inputs.
• Unsupervised Learning
  • Unlabelled input data.
  • Creating a function to predict the relation and output
• Semi-Supervised Learning
  • Combines Supervised and Unsupervised Learning methodology
• Reinforcement Learning
  • Reward-Punishment based agent.

Supervised Learning
Introduction
• Learn from the Data
• Data is already labelled
  • Expert, Crowd-sourced or case-based labelling of data.
• Applications
  • Handwriting Recognition
  • Spam Detection
  • Information Retrieval
    • Personalisation based on ranks
  • Speech Recognition

Supervised Learning
Algorithms
• Decision Trees
• k-Nearest Neighbours
• Naive Bayes
• Logistic Regression
• Perceptron and Multi-layer Perceptrons
• Neural Networks
• SVM and Kernel estimation

Supervised Learning
Example: Naive Bayes Classifier
• Word map of President Obama’s speech

Supervised Learning
Example: Naive Bayes Classifier
• Word map of a spam document

Supervised Learning
Example: Naive Bayes Classifier
• Running a test on the Classifier
“Order a trial Adobe chicken daily EABList new summer savings, welcome!” → Classifier → Spam Bin
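
The spam test on this slide can be sketched with a tiny multinomial Naive Bayes classifier. This is only an illustration in plain Python, not Mahout's implementation: the training sentences, the labels "spam" and "ham", and the function names `tokenise`, `train` and `classify` are all made up for the example, and Laplace (add-one) smoothing is an assumed choice.

```python
import math
import re
from collections import Counter

# Toy training data, invented purely for illustration.
TRAIN = [
    ("spam", "order a free trial now and get summer savings"),
    ("spam", "new savings daily order now"),
    ("ham", "the president spoke about the economy and jobs"),
    ("ham", "we welcome the new jobs report"),
]

def tokenise(text):
    """Lower-case the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count documents and words per label; also collect the vocabulary."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in examples:
        words = tokenise(text)
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenise(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
message = "Order a trial Adobe chicken daily EABList new summer savings, welcome!"
print(classify(message, model))
```

With this toy data the message lands in the spam bin because words such as "order", "trial" and "savings" occur only in the spam examples.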

Unsupervised Learning
Introduction
• Finding hidden structure in data
• Unlabelled Data
• Subject-matter experts (SMEs) are needed after processing to verify, validate and use the output
• Used in exploratory analysis rather than predictive analytics
• Applications
  • Pattern Recognition
  • Groupings based on a distance measure
    • Group of People, Objects, ...

Unsupervised Learning
Algorithms
• Clustering
  • k-Means, MinHash, Hierarchical Clustering
• Hidden Markov Models
• Feature Extraction methods
• Self-organizing Maps (Neural Nets)

Unsupervised Learning
Example: K-Means

Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
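
As a rough companion to the k-means example, the sketch below runs the usual two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python. The sample points, the starting centroids and the function name `kmeans` are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of hypothetical 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids; here they are chosen by hand so that the two groups separate after one update.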

Learning Problem
Cat and Dog Problem
• Humans can easily classify which is a cat and which is a dog.
• But how can a computer do that?
• Some attempts used clustering mechanisms to solve it: Co-occurrence Clustering, Deep Learning

Apache Mahout
Scalable Machine Learning Library


History and Etymology
• Inspired by “MapReduce for Machine Learning on Multicore”, Ng et al.
• Written in Java. Apache License.
• Founders
  • Mahout – Isabel Drost, Grant Ingersoll, Karl Wettin.
  • Taste – Sean Owen
• Mahout – Keeper/Driver of Elephants.
• Current Release – 0.8 (stable)

Need
• BigData
  • Ever-growing data.
  • Yesterday’s methods to process tomorrow’s data
  • Cheap Storage
• Scalable from Ground Up
  • Should be built on top of any existing Distributed Systems framework
  • Should contain distributed version of ML algorithms

Size, Classification and Tools
• Lines (Sample Data)
  • Analysis and Visualisation: Whiteboard, Bash, ...
• KBs – low MBs (Prototype Data)
  • Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
• MBs – low GBs (Online Data)
  • Storage: MySQL (DBs), ...
  • Analysis: NumPy, SciPy, Pandas, Weka, ...
  • Visualisation: Flare, AmCharts, Raphael
• GBs – TBs – PBs
  • Storage: HDFS, HBase, Cassandra, ...
  • Analysis: Hive, Giraph, Hama, Mahout

Mahout Modules
• Applications
• Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM (Frequent Pattern Mining), Dimension Reduction
• Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (Primitives)
• Hadoop

Recommender Systems

Recommender Systems
Introduction
• Types of Recommender Systems
  • Content Based Recommendations
  • Collaborative Filtering Recommendations
    • User-User Recommendations
    • Item-Item Recommendations
  • Dimensionality Reduction (SVD) Recommendations
• Applications
  • Products you would like to buy
  • People you might want to connect with
  • Potential Life-Partners
  • Recommending Songs you might like
  • ...

Recommender Systems
Collaborative Filtering in Action
• Assuming people have seen at least one movie.
  • Cold Start?
• Figure legend: 1: seen, 0: not seen

Collaborative Filtering in Action
• Tanimoto Coefficient

$$T(a, b) = \frac{N_C}{N_A + N_B - N_C}$$

• $N_A$ – Number of Customers who bought A
• $N_B$ – Number of Customers who bought B
• $N_C$ – Number of Customers who bought A and B
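
A minimal sketch of the Tanimoto coefficient for two items, computed from the sets of customers who bought each one: the customers who bought both, divided by the customers who bought either. The customer ids and the function name `tanimoto` are hypothetical.

```python
def tanimoto(buyers_a, buyers_b):
    """Customers who bought both items divided by customers who bought either."""
    n_a = len(buyers_a)             # customers who bought A
    n_b = len(buyers_b)             # customers who bought B
    n_c = len(buyers_a & buyers_b)  # customers who bought A and B
    either = n_a + n_b - n_c
    return n_c / either if either else 0.0

# Hypothetical purchase data: ids of the customers who bought each item.
bought_a = {"u1", "u2", "u3", "u4"}
bought_b = {"u3", "u4", "u5"}
print(tanimoto(bought_a, bought_b))  # 2 / (4 + 3 - 2) = 0.4
```

The value ranges from 0 (no shared customers) to 1 (exactly the same customers).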


Collaborative Filtering in Action
• Cosine Coefficient

$$C(a, b) = \frac{N_C}{\sqrt{N_A}\,\sqrt{N_B}}$$

• $N_A$ – Number of Customers who bought A
• $N_B$ – Number of Customers who bought B
• $N_C$ – Number of Customers who bought A and B
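
For bought / not-bought data the cosine coefficient reduces to the number of shared customers divided by the square root of the product of the two customer counts. The sketch below assumes that binary setting; the customer ids and the function name `cosine` are hypothetical.

```python
import math

def cosine(buyers_a, buyers_b):
    """Cosine similarity of two items described by the sets of their buyers."""
    n_a = len(buyers_a)             # customers who bought A
    n_b = len(buyers_b)             # customers who bought B
    n_c = len(buyers_a & buyers_b)  # customers who bought A and B
    if n_a == 0 or n_b == 0:
        return 0.0                  # undefined for an item nobody bought
    return n_c / math.sqrt(n_a * n_b)

# Hypothetical purchase data: ids of the customers who bought each item.
bought_a = {"u1", "u2", "u3", "u4"}
bought_b = {"u3", "u4", "u5", "u6"}
print(cosine(bought_a, bought_b))  # 2 / sqrt(4 * 4) = 0.5
```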


Apache Mahout
Recommender System
Architecture
• Two Modes
  • Stand-alone non distributed (“Taste”)
  • Scalable Distributed Algorithmic version for Collaborative Filtering
• Top-level Packages
  • Data Model
  • User Similarity
  • Item Similarity
  • User Neighbourhood
  • Recommender
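
To make the roles of these packages concrete, here is a small plain-Python sketch of how a data model, a user similarity, a user neighbourhood and a recommender can be wired together for user-based collaborative filtering. The class and method names are hypothetical and only loosely follow the package names on the slide; this is not Mahout's Java API, and the item-similarity path is left out.

```python
class DataModel:
    """Data model: maps each user to the set of items they have seen."""
    def __init__(self, prefs):
        self.prefs = prefs

    def users(self):
        return list(self.prefs)

    def items_of(self, user):
        return self.prefs[user]


class TanimotoUserSimilarity:
    """User similarity: overlap between two users' item sets."""
    def __init__(self, model):
        self.model = model

    def similarity(self, u, v):
        a, b = self.model.items_of(u), self.model.items_of(v)
        union = len(a | b)
        return len(a & b) / union if union else 0.0


class NearestNUserNeighbourhood:
    """User neighbourhood: the n users most similar to a given user."""
    def __init__(self, n, similarity, model):
        self.n, self.similarity, self.model = n, similarity, model

    def neighbours(self, user):
        others = [v for v in self.model.users() if v != user]
        others.sort(key=lambda v: self.similarity.similarity(user, v), reverse=True)
        return others[: self.n]


class UserBasedRecommender:
    """Recommender: scores unseen items by the similarity of neighbours who saw them."""
    def __init__(self, model, neighbourhood, similarity):
        self.model, self.neighbourhood, self.similarity = model, neighbourhood, similarity

    def recommend(self, user, how_many=3):
        seen = self.model.items_of(user)
        scores = {}
        for v in self.neighbourhood.neighbours(user):
            weight = self.similarity.similarity(user, v)
            for item in self.model.items_of(v) - seen:
                scores[item] = scores.get(item, 0.0) + weight
        # Highest score first; ties broken by item id for a stable order.
        return sorted(scores, key=lambda i: (-scores[i], i))[:how_many]


# Hypothetical data: which movies each user has seen.
model = DataModel({
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
})
similarity = TanimotoUserSimilarity(model)
neighbourhood = NearestNUserNeighbourhood(2, similarity, model)
recommender = UserBasedRecommender(model, neighbourhood, similarity)
print(recommender.recommend("alice"))
```

Keeping the four pieces separate means the similarity measure or the neighbourhood size can be swapped without touching the recommender itself.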

Naive Bayes Classifier
“Order a trial Adobe chicken daily EABList new summer savings, welcome!” → Classifier

Naive Bayes Classifier
• Naive Bayes is a fairly involved process in Mahout: training the classifier requires four separate Hadoop jobs.
• Training:
  • Read the Features
  • Calculate per-Document Statistics
  • Normalize across Categories
  • Calculate normalizing factor of each label
• Testing
  • Classification (fifth job, explicitly invoked)

K-Means Clustering
Iterations

K-Means Clustering
MapReduce Version
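
One common way to phrase k-means as MapReduce is sketched below in plain Python, with no Hadoop involved: the map step emits the index of the nearest centroid for every point, the reduce step averages the points that share an index, and a driver repeats the round until the centroids stop moving. This is an assumed illustration of the idea for 2-D points, not Mahout's job code; the function names and sample data are made up.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Mapper: emit (index of nearest centroid, (point, 1)) for every point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        yield nearest, (p, 1)

def reduce_phase(pairs, centroids):
    """Reducer: average the points emitted for each centroid index."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for key, (p, n) in pairs:
        sums[key][0] += p[0]
        sums[key][1] += p[1]
        counts[key] += n
    # A centroid that received no points keeps its previous position.
    return [
        (sums[i][0] / counts[i], sums[i][1] / counts[i]) if counts[i] else c
        for i, c in enumerate(centroids)
    ]

def kmeans_mapreduce(points, centroids, max_iterations=10):
    """Driver: one map/reduce round per iteration until the centroids stop moving."""
    for _ in range(max_iterations):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Hypothetical 2-D points forming two groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final = kmeans_mapreduce(points, [(0.0, 0.0), (10.0, 10.0)])
print(final)
```

Because each mapper only needs the current centroids and its own slice of the points, the map step can run in parallel over partitions of the data, which is what makes the formulation attractive at scale.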

Summary
• Machine Learning
  • Learning Algorithms
  • Varied Applications
• Mahout
  • Scaling to Giga/Tera/Peta Scale
  • Free and Open Source

More Info.
1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. RecSys 2012.
2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson. Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
3. http://mahout.apache.org/ - Apache Mahout Project Page
4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.

Questions?


Thank You
Go BigData!!! 

© Varad Meru, 2014
