Apache Mahout

   The Elephant Driver
          Presenters:
      Antonio Loureiro Severien
     Emmanouil Dimogerontakis
     Muhammad Anis uddin Nasir
What is Apache Mahout?
● Machine learning and data mining framework for
  classification, clustering and recommendation

● The goal of the free Apache Mahout machine learning
  library is to build scalable machine learning tools for
  analysing big data in a distributed manner
Machine Learning
"Machine Learning is programming computers to optimize a
performance criterion using example data or past
experience" - Alpaydin, 2004

Machine learning is concerned with the design and
development of algorithms that allow machines to make
decisions or even evolve behaviors based on a collection of
empirical data.
Data Mining
Data mining, also called knowledge discovery in
databases (KDD), is the process of discovering interesting
and useful patterns and relationships in large volumes of
data.
Combines tools from:
    ● statistics
    ● artificial intelligence (such as neural networks and
       machine learning)
with database management to analyze large data sets.
-Britannica Online Encyclopedia
Why Machine Learning and Data
Mining?

● Data, Data, DATA!!!


● Tasks too Hard to Program


● Customizing software
Available Machine Learning Tools


●   WEKA
●   R
●   KEEL
●   Others...


Not enough?
Apache Mahout vs others?
Many open source Machine Learning
libraries either:
● Lack Community
● Lack Documentation and Examples
● Lack the Apache License
    (business opportunity)
● Are research-oriented
    (not fit for production yet)
● Lack Scalability
Mahout = Elephant Driver?
Why do we need scalability?
● Big Data
Applications
● Recommendation features
● Clustering of information
● Classification

Examples: Movie recommendations, stock
analysis, fraud detection, ad-sense
recommendation, etc...

            How do we do this?
Supported Algorithms
●   Classification
●   Clustering
●   Recommender / Collaborative Filtering
●   Evolutionary Algorithms
●   Pattern Mining
●   Regression
●   Dimension reduction
●   Similarity Vectors
Classification
(learn to assign categories to documents)

Fully functional
 ● Logistic Regression (SGD)
 ● Bayesian

Integrated into Mahout development
 ● Random Forests (integrated)
 ● Online Passive Aggressive (integrated)
 ● Boosting (awaiting patch commit)

Open to be worked on...
 ● Hidden Markov Models (HMM) - Training is done in Map-Reduce
 ● Support Vector Machines (SVM) (open)
 ● Perceptron and Winnow (open)
 ● Neural Network (open)
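
To make the "Logistic Regression (SGD)" entry concrete, here is a minimal sketch of logistic regression trained with stochastic gradient descent. It is plain Python for illustration only, not Mahout code or Mahout's API; the function names, the toy data, the learning rate and the epoch count are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(examples, n_features, epochs=200, learning_rate=0.1, seed=0):
    """examples: list of (features, label) pairs with label in {0, 1}."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in random order each epoch
        for features, label in data:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(score) - label  # gradient of the log loss w.r.t. score
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data: class 1 sits far from the origin, class 0 near it.
training = [
    ([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0),
    ([2.0, 2.0], 1), ([2.0, 3.0], 1), ([3.0, 2.0], 1),
]
weights, bias = train_logistic_sgd(training, n_features=2)
```

Each update touches a single example, which is what makes the method attractive for data that does not fit in memory: the model can be trained in one streaming pass over the examples.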
Clustering
(group items that are topically related)

Fully functional
 ● Expectation Maximization (EM)
 ● Hierarchical Clustering

Integrated into Mahout development
 ● Canopy Clustering
 ● K-Means Clustering
 ● Fuzzy K-Means
 ● Mean Shift Clustering
 ● Dirichlet Process Clustering
 ● Latent Dirichlet Allocation
 ● Spectral Clustering
 ● Minhash Clustering
 ● Top Down Clustering
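
As an illustration of the "K-Means Clustering" entry, the sketch below alternates the two classic steps: assign every point to its nearest centroid, then move each centroid to the mean of its members. It is a single-machine toy in plain Python, not Mahout's distributed implementation; passing the initial centroids explicitly is an assumption made to keep the example deterministic.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D data with two well separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, initial_centroids=[(1.0, 1.0), (9.0, 8.0)])
```

The assignment step is independent per point and the update step is a per-cluster average, which is why the algorithm maps naturally onto a distributed map and reduce phase.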
Recommenders /
Collaborative Filtering
(find items a user might like /
find items that appear together)

Integrated into Mahout development
●   Non-distributed recommenders ("Taste") (integrated)
●   Distributed Item-Based Collaborative Filtering (integrated)
●   Collaborative Filtering using a parallel matrix factorization (integrated)
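
The idea behind item-based collaborative filtering can be sketched in a few lines: compare items by the users who rated them, then score unseen items by their similarity to what the user already rated. This is plain Python for illustration, not the "Taste" API; the cosine similarity, the scoring rule (sum of similarity times rating) and the ratings are assumptions chosen for this example.

```python
import math
from collections import defaultdict


def item_vectors(ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    vectors = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, rating in prefs.items():
            vectors[item][user] = rating
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse item vectors keyed by user.
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        # Score = similarity to each item the user rated, weighted by that rating.
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[item]) * rating
            for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"matrix": 5, "inception": 4},
    "bob": {"matrix": 5, "inception": 5, "interstellar": 5},
    "carol": {"titanic": 5, "notebook": 4},
    "dave": {"titanic": 4, "notebook": 5, "matrix": 1},
}
suggestions = recommend(ratings, "alice")
```

Here `suggestions` ranks "interstellar" first for alice, because it shares an enthusiastic rater (bob) with both items she rated highly.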
Who is using it?
Opportunities
●   Developers
●   Researchers
●   Small Business
●   Large Business
●   Consultancy...
    ○ on Mahout
    ○ on specific data analysis
● Open data
● etc...
Apache Mahout
Business?

Ideas?

Suggestions?

Questions?
Where to start?
● Wikipedia Bayes Example
   ○   https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html


● What does it do?
   ○ Classifies a Wikipedia data dump by country.
   ○ Objective: Predict what country an unseen article
     should be categorized into.
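
The sketch below shows the kind of computation a Bayes text classifier performs for such a task: count words per country at training time, then pick the country with the highest log-probability for an unseen text. It is a toy multinomial naive Bayes with add-one smoothing in plain Python, not the Mahout example itself; the four training sentences and all function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: iterable of (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(model, text):
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior of the class
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-in for articles labelled by country.
docs = [
    ("paris is the capital of france", "France"),
    ("the louvre museum is in paris", "France"),
    ("berlin is the capital of germany", "Germany"),
    ("the brandenburg gate is in berlin", "Germany"),
]
model = train_naive_bayes(docs)
```

With this toy model, `classify(model, "a museum in paris")` returns "France" and `classify(model, "the gate in berlin")` returns "Germany".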
References
General
http://www.slideshare.net/sdec2011/sdec2011-mahout-the-what-the-how-and-the-why
http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop
http://www.slideshare.net/aneeshabakharia/lca2011-mahout
Hands-on
http://www.slideshare.net/OReillyOSCON/hands-on-mahout
Who is using it?
https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
Apache Mahout
http://mahout.apache.org/
Quickstart
https://cwiki.apache.org/MAHOUT/quickstart.html
