Intro to Mahout
Ofer Vugman
 May 2012
Agenda and such…


   What is ML (Machine Learning)
   ML Common Use Cases
   Mahout Overview
   Algorithms in Mahout
   Mahout Commercial Use
   Mahout Summary
What is ML



 “Machine Learning is programming computers to optimize a performance
  criterion using example data or past experience”


 Introduction to Machine Learning by E. Alpaydin
ML Common Use Cases


 Recommendation
ML Common Use Cases


 Classification
ML Common Use Cases


 Clustering
ML Common Libraries
Mahout Overview – What ?


A mahout is a person who keeps and drives
  an elephant
Mahout Overview – What ?


 A scalable machine learning library
Mahout Overview – What ?


 Began life in 2008 as a subproject of
  Apache’s Lucene project
 In 2010 Mahout became a top-level
  Apache project in its own right
 Implemented in Java
 Built upon Apache’s Hadoop (Look! An
  elephant!)
Mahout Overview – Why ?


 Many open source ML libraries:
   Lack community
   Lack documentation and examples
   Lack scalability
   Lack the Apache license
   Are research oriented
   Are not well tested
   Are not built over existing production-quality
    libraries
Mahout Overview – Why ?


 Scalability
   Scalable to reasonably large datasets (core
    algorithms implemented in Map/Reduce,
    runnable on Hadoop)
   Scalable to support your business case
    (Apache License)
   Scalable community
Mahout Overview – Why ?


 Built over existing production quality
  libraries
Mahout Overview – Use Cases


 Mahout currently supports four main
  use cases:
  1. Recommendation
  2. Clustering
  3. Classification
  4. Frequent Itemset Mining
Mahout Overview - Technical


 System Requirements
     Linux (or Cygwin on Windows)
     Java 1.6.x or greater
     Maven 2.0.11 or greater to build the source
      code
     Hadoop 0.2 or greater*


* Not all algorithms are implemented to work on Hadoop clusters
Algorithms in Mahout


 We’ll focus on one example:
   Collaborative Filtering (Recommenders)



 Yet there are many (many!) more; you
  can find them all at
  https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Algorithms Examples –
Recommendation

 Help users find items they might like
  based on historical preferences




 Based on example by Sebastian Schelter in “Distributed Itembased
  Collaborative Filtering with Apache Mahout”
Algorithms Examples –
Recommendation




              Matrix   Alien   Inception
      Alice   5        1       4
      Bob     ?        2       5
      Peter   4        3       2
Algorithms Examples –
Recommendation

 Algorithm
   Neighborhood-based approach
   Works by finding similarly rated items in the
    user-item matrix (e.g. using cosine, Pearson
    correlation, or the Tanimoto coefficient)
   Estimates a user's preference towards an
    item by looking at their preferences
    towards similar items
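
The three similarity measures named above can be sketched in plain Python. This is a minimal illustration, not Mahout's implementation: the function names are made up for this example, and it assumes each item is represented by the ratings of the users who rated both items (or, for Tanimoto, by the set of users who rated it).

```python
import math

def cosine(xs, ys):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0

def pearson(xs, ys):
    """Pearson correlation: cosine of the mean-centered vectors (non-empty input assumed)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return cosine([x - mean_x for x in xs], [y - mean_y for y in ys])

def tanimoto(users_a, users_b):
    """Tanimoto coefficient: overlap of the two sets of users, ratings ignored."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Alien and Inception as rated by Alice, Bob and Peter
print(pearson([1, 2, 3], [4, 5, 2]))
# Matrix was rated by two users, Alien by three
print(tanimoto({"Alice", "Peter"}, {"Alice", "Bob", "Peter"}))
```

Cosine and Pearson use the rating values; Tanimoto only looks at who rated what, which suits data without explicit ratings.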
Algorithms Examples –
Recommendation

 Prediction: Estimate Bob's preference
  towards “The Matrix”
  1. Look at all items that
        a) are similar to “The Matrix”
        b) have been rated by Bob
           => “Alien”, “Inception”
  2. Estimate the unknown preference with a
     weighted sum
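
A small sketch of the weighted sum in step 2. The similarity values are the ones this deck reports for the phase 2 reduce step; the function name and the normalization by the sum of absolute similarities are assumptions made here, chosen because they reproduce the deck's final value for Bob.

```python
def predict(similarities, user_ratings):
    """Weighted sum of the user's ratings, weighted by item similarity.

    similarities: similarity of each rated item to the target item
    user_ratings: the user's ratings for those items
    """
    numerator = sum(similarities[item] * rating
                    for item, rating in user_ratings.items())
    # Assumption: normalize by the sum of absolute similarities
    denominator = sum(abs(similarities[item]) for item in user_ratings)
    return numerator / denominator if denominator else None

# Similarity of each item to "The Matrix", as given later in the deck
sims_to_matrix = {"Alien": -0.47, "Inception": 0.47}
# Bob's known ratings
bob = {"Alien": 2, "Inception": 5}

print(round(predict(sims_to_matrix, bob), 2))  # 1.5
```

Written out: (-0.47 * 2 + 0.47 * 5) / (0.47 + 0.47) = 1.41 / 0.94 = 1.5.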
Algorithms Examples –
Recommendation

 MapReduce phase 1
   Map – Make user the key
    (Alice, Matrix, 5)     =>  Alice (Matrix, 5)
    (Alice, Alien, 1)      =>  Alice (Alien, 1)
    (Alice, Inception, 4)  =>  Alice (Inception, 4)
    (Bob, Alien, 2)        =>  Bob (Alien, 2)
    (Bob, Inception, 5)    =>  Bob (Inception, 5)
    (Peter, Matrix, 4)     =>  Peter (Matrix, 4)
    (Peter, Alien, 3)      =>  Peter (Alien, 3)
    (Peter, Inception, 2)  =>  Peter (Inception, 2)
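
The map step above, sketched as a plain Python generator rather than a Hadoop mapper. The name `map_phase1` is hypothetical; the input tuples are the ratings from this slide.

```python
def map_phase1(record):
    """Turn (user, item, rating) into a key/value pair keyed by user."""
    user, item, rating = record
    yield user, (item, rating)

ratings = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

# Run the mapper over every input record, as the framework would
mapped = [pair for record in ratings for pair in map_phase1(record)]
for key, value in mapped:
    print(key, value)
```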
Algorithms Examples –
Recommendation

 MapReduce phase 1
   Reduce – Create inverted index
    Alice (Matrix, 5)
    Alice (Alien, 1)
    Alice (Inception, 4)   =>  Alice (Matrix, 5) (Alien, 1) (Inception, 4)
    Bob (Alien, 2)
    Bob (Inception, 5)     =>  Bob (Alien, 2) (Inception, 5)
    Peter (Matrix, 4)
    Peter (Alien, 3)
    Peter (Inception, 2)   =>  Peter (Matrix, 4) (Alien, 3) (Inception, 2)
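
The reduce step above amounts to grouping values by key. A minimal sketch follows, with a hypothetical `reduce_phase1` standing in for the shuffle and reduce that Hadoop would perform; the input pairs are the map output from the previous slide.

```python
from collections import defaultdict

def reduce_phase1(pairs):
    """Collect all (item, rating) values that share the same user key."""
    rows = defaultdict(list)
    for user, item_rating in pairs:
        rows[user].append(item_rating)
    return dict(rows)

mapped = [
    ("Alice", ("Matrix", 5)), ("Alice", ("Alien", 1)), ("Alice", ("Inception", 4)),
    ("Bob", ("Alien", 2)), ("Bob", ("Inception", 5)),
    ("Peter", ("Matrix", 4)), ("Peter", ("Alien", 3)), ("Peter", ("Inception", 2)),
]

user_rows = reduce_phase1(mapped)
for user, row in user_rows.items():
    print(user, row)
```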
Algorithms Examples –
Recommendation

 MapReduce phase 2
   Map – Isolate all co-occurring ratings (all
    cases where a user rated both items)
    Alice (Matrix, 5) (Alien, 1) (Inception, 4)  =>  Matrix, Alien (5,1)
                                                     Matrix, Inception (5,4)
                                                     Alien, Inception (1,4)
    Bob (Alien, 2) (Inception, 5)                =>  Alien, Inception (2,5)
    Peter (Matrix, 4) (Alien, 3) (Inception, 2)  =>  Matrix, Alien (4,3)
                                                     Matrix, Inception (4,2)
                                                     Alien, Inception (3,2)
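
A sketch of this map step: for each user row, emit every pair of items the user rated together with the two ratings. `map_phase2` is a hypothetical name, and the sketch assumes every row lists its items in the same fixed order (Matrix, Alien, Inception) so that a given pair always produces the same key.

```python
from itertools import combinations

def map_phase2(user_row):
    """Emit ((item_a, item_b), (rating_a, rating_b)) for each co-rated pair."""
    for (item_a, rating_a), (item_b, rating_b) in combinations(user_row, 2):
        yield (item_a, item_b), (rating_a, rating_b)

user_rows = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}

co_ratings = [pair for row in user_rows.values() for pair in map_phase2(row)]
for key, value in co_ratings:
    print(key, value)
```

Bob never rated Matrix, so his row contributes only the Alien, Inception pair.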
Algorithms Examples –
Recommendation

 MapReduce phase 2
   Reduce – Compute similarities
    Matrix, Alien (5,1)
    Matrix, Alien (4,3)        =>  Matrix, Alien (-0.47)
    Matrix, Inception (5,4)
    Matrix, Inception (4,2)    =>  Matrix, Inception (0.47)
    Alien, Inception (1,4)
    Alien, Inception (2,5)
    Alien, Inception (3,2)     =>  Alien, Inception (-0.63)
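
A sketch of the shape of this reduce step: group the co-ratings by item pair, then reduce each group to one similarity value. The deck does not say which measure produced the figures above, so this example plugs in plain cosine similarity as an assumption; its output is therefore not expected to match the slide's numbers. `reduce_phase2` is a hypothetical name.

```python
import math
from collections import defaultdict

def reduce_phase2(pairs):
    """Group co-ratings by item pair and compute one similarity per pair."""
    grouped = defaultdict(list)
    for item_pair, ratings in pairs:
        grouped[item_pair].append(ratings)

    similarities = {}
    for item_pair, co in grouped.items():
        # Assumption: cosine similarity over the co-rated values only
        dot = sum(a * b for a, b in co)
        norm_a = math.sqrt(sum(a * a for a, _ in co))
        norm_b = math.sqrt(sum(b * b for _, b in co))
        similarities[item_pair] = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return similarities

co_ratings = [
    (("Matrix", "Alien"), (5, 1)), (("Matrix", "Alien"), (4, 3)),
    (("Matrix", "Inception"), (5, 4)), (("Matrix", "Inception"), (4, 2)),
    (("Alien", "Inception"), (1, 4)), (("Alien", "Inception"), (2, 5)),
    (("Alien", "Inception"), (3, 2)),
]

item_sims = reduce_phase2(co_ratings)
for item_pair, sim in item_sims.items():
    print(item_pair, round(sim, 2))
```

Swapping in another measure only changes the body of the inner loop; the grouping stays the same.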
Algorithms Examples –
Recommendation




              Matrix   Alien   Inception
      Alice   5        1       4
      Bob     1.5      2       5
      Peter   4        3       2
Mahout Commercial Use


 Commercial use
Mahout Resources

 Mahout website - http://mahout.apache.org/
 Introducing Apache Mahout –
  http://www.ibm.com/developerworks/java/library/j-mahout/
 “Mahout In Action” by Sean Owen and Robin
  Anil
Mahout Summary


 ML is all over the web today
 Mahout is about scalable machine
  learning
 Mahout has functionality for many of
  today’s common machine learning tasks
 MapReduce magic in
  action
Mahout Summary




     Thank you and good night


Editor's Notes

  • #14 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers (2008). Apache Lucene(TM) is a high-performance, full-featured text search engine library (2005).