Machine Learning with Apache Mahout
http://twitter.com/danielglauser
http://www.linkedin.com/in/danglauser
danglauser@gmail.com
What is Machine Learning?

A branch of Artificial Intelligence
Creative use of statistics
Smart decisions from large data sets
All of the above
Common Applications
Spam Filtering
Credit Card Fraud
Medical Diagnostics
Search Engines
Sentiment Analysis
Math Alert

If you want to go big with Machine Learning, math is necessary
What math?
Statistics
Discrete Math
Linear Algebra
Probability
Apache Mahout

A platform for Machine Learning
Roll your own algorithm, use the platform
Easy integration with Hadoop
History


• 2005 The Taste framework
• 2008 Services built on Lucene
Mahout is composed of...
Recommender Engines
Classification
Clustering
Frequent itemsets
A brief intro to:

Recommender Engines
Classification
Clustering
Recommendations

For a given set of inputs, make a recommendation
Rank the best out of many possibilities
Recommenders are typically

User based
or
Item based
Neighborhood
Nearest N Users
Threshold
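
The two neighborhood styles can be sketched in a few lines. This is plain Python for illustration, not Mahout's API; the function names and the similarity scores are made up.

```python
def nearest_n(similarities, n):
    """Keep the n most similar users, however similar they are."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]


def threshold(similarities, minimum):
    """Keep every user whose similarity is at least `minimum`."""
    return sorted(user for user, score in similarities.items() if score >= minimum)


# Hypothetical similarity scores between one user and everyone else.
scores = {"ann": 0.9, "bob": 0.2, "cat": 0.75, "dan": -0.4}

print(nearest_n(scores, 2))    # ['ann', 'cat']
print(threshold(scores, 0.7))  # ['ann', 'cat']
```

Nearest N always returns a fixed-size neighborhood; a threshold returns as many or as few users as clear the bar, possibly none.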
Similarity
PearsonCorrelationSimilarity

   Produces a value between 1 and -1
   Tendency of two series to move together
PearsonCorrelationSimilarity

   1 - the two series move together
   0 - no linear relationship
   -1 - the two series move in opposite directions
PearsonCorrelationSimilarity
        Problems
   Doesn’t take into account how many items overlap between users
   Cannot find similarity between two users if they only have one item in common
   Undefined if either user's preference values are all identical (zero variance)
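
A minimal sketch of the Pearson correlation over the items two users share. It is plain Python, not Mahout's implementation, and the users and ratings are hypothetical. It returns None where the correlation cannot be computed: fewer than two shared items, or zero variance in either user's shared ratings.

```python
from math import sqrt


def pearson(prefs_a, prefs_b):
    """Pearson correlation over the items both users rated, or None if undefined."""
    shared = sorted(set(prefs_a) & set(prefs_b))
    if len(shared) < 2:
        return None  # one shared item (or none) gives no series to correlate
    xs = [prefs_a[item] for item in shared]
    ys = [prefs_b[item] for item in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # all ratings identical: division by zero
    return cov / sqrt(var_x * var_y)


# Hypothetical preferences: item id -> rating.
alice = {"a": 1.0, "b": 2.0, "c": 3.0}
bob = {"a": 2.0, "b": 4.0, "c": 6.0}
carol = {"a": 3.0, "b": 2.0, "c": 1.0}
dave = {"a": 5.0}
erin = {"a": 4.0, "b": 4.0, "c": 4.0}

print(pearson(alice, bob))    # 1.0
print(pearson(alice, carol))  # -1.0
print(pearson(alice, dave))   # None
print(pearson(alice, erin))   # None
```

Note that alice and bob score a perfect 1.0 on only three shared items; the result says nothing about how much overlap backs it up.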
Similarity Algorithms

PearsonCorrelationSimilarity
EuclideanDistanceSimilarity
TanimotoCoefficientSimilarity
LogLikelihoodSimilarity
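
To give a feel for two of the alternatives, here is a sketch of a Tanimoto coefficient and a Euclidean-distance-based similarity. Plain Python, not Mahout's code. The 1 / (1 + d) mapping from distance to similarity is one common convention and is an assumption here, as are the function names and data.

```python
from math import sqrt


def tanimoto(items_a, items_b):
    """Shared items divided by all items. Ignores the rating values entirely."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def euclidean_similarity(prefs_a, prefs_b):
    """Distance over shared items, mapped into (0, 1] with 1 / (1 + d)."""
    shared = set(prefs_a) & set(prefs_b)
    if not shared:
        return 0.0
    d = sqrt(sum((prefs_a[item] - prefs_b[item]) ** 2 for item in shared))
    return 1.0 / (1.0 + d)


print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))                 # 0.5
print(euclidean_similarity({"a": 3.0, "b": 4.0}, {"a": 3.0, "b": 4.0}))  # 1.0
```

Tanimoto fits data where you only know that a user touched an item, not how much they liked it.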
To the code!
How big is a Java Object?
GenericPreference

user id - long - 8 bytes
item id - long - 8 bytes
preference value - float - 4 bytes
PreferenceArray

Why not just use an array or an ArrayList?

A little overhead x millions of items = a *lot* of overhead
GenericUserPreferenceArray

   Per preference (x millions):
   item id - long - 8 bytes
   preference value - float - 4 bytes

   Once per user:
   one user id - long - 8 bytes
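
The arithmetic behind the two layouts, as a back-of-the-envelope sketch. The field sizes come from the slides. The 16-byte object header and 8-byte reference are assumptions that vary by JVM and settings, and array headers and alignment padding are ignored.

```python
# Field sizes from the slides.
LONG_BYTES = 8
FLOAT_BYTES = 4

# Assumptions, not measurements: these depend on the JVM and its settings.
OBJECT_HEADER_BYTES = 16
REFERENCE_BYTES = 8


def array_of_objects_bytes(n):
    """n preference objects (user id, item id, value) plus one reference to each."""
    per_object = OBJECT_HEADER_BYTES + LONG_BYTES + LONG_BYTES + FLOAT_BYTES
    return n * (per_object + REFERENCE_BYTES)


def parallel_arrays_bytes(n):
    """One shared user id, plus n item ids and n preference values in primitive arrays."""
    return LONG_BYTES + n * (LONG_BYTES + FLOAT_BYTES)


n = 1_000_000
print(array_of_objects_bytes(n))  # 44000000
print(parallel_arrays_bytes(n))   # 12000008
```

Under these assumptions, a million preferences for one user cost roughly 44 MB as individual objects and roughly 12 MB as parallel primitive arrays.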
Phew!
Clustering

Surface naturally occurring groups of data
A notion of similarity (and dissimilarity)
Clustering

Algorithms do not require training
Stopping condition - iterate until close enough
Common Clustering Algorithms
K-Means
Fuzzy K-Means
Mean Shift
Centroid generation
Dirichlet clustering
Representing Data

Feature Selection
Vectorization
Feature Selection


Figure out what features of your data are
interesting
Vectorization


Represent the interesting features in an n-dimensional space
N-Dimensional Space
Every word in a group of documents
Size, shape, color of an object
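
For the "every word in a group of documents" case, vectorization can be sketched as a word count per dimension. Plain Python with made-up documents and function names; real pipelines usually add weighting such as TF-IDF, which is left out here.

```python
def build_vocabulary(documents):
    """Give each distinct word its own dimension (index), in sorted order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


# Hypothetical documents.
docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
vocab = build_vocabulary(docs)  # cat, dog, down, sat, saw, the -> dimensions 0..5
vectors = [vectorize(doc, vocab) for doc in docs]

print(vectors[2])  # [1, 1, 0, 0, 1, 2]
```

Six distinct words give a six-dimensional space; a real corpus gives tens of thousands of dimensions, most of them zero for any one document.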
Representing Vectors
DenseVector
RandomAccessSparseVector
SequentialAccessSparseVector
[Diagram label: Random Seek]
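
A rough analogy for the three layouts, in plain Python rather than Mahout's classes. A list with explicit zeros stands in for a dense vector, a dict for a sparse vector with random access, and an index-sorted list of pairs for a sparse vector meant to be read in order. The function names are hypothetical.

```python
def to_dense(sparse, size):
    """Expand {index: value} into a full list with explicit zeros."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {index: value for index, value in enumerate(dense) if value != 0.0}


def dot_sequential(a, b):
    """Dot product of two index-sorted lists of (index, value) pairs.

    Walks both lists front to back, the access pattern a sequential layout favors.
    """
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            total += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return total


print(to_sparse([0.0, 2.0, 0.0, 0.0, 5.0]))                      # {1: 2.0, 4: 5.0}
print(dot_sequential([(1, 2.0), (4, 5.0)], [(0, 1.0), (4, 3.0)]))  # 15.0
```

The dict answers "what is the value at index 4?" quickly; the sorted pairs are better when an algorithm sweeps through every entry in order.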
Hadoop SequenceFiles

Input vectors       SequenceFile(s)
Initial Centroids   SequenceFile(s)
K-Means

50+ years old, in common use for 25 years
Set the number of clusters - k
Works well even if you don’t pick a good k
K-Means

Guess at initial placement of the centers (centroids)
Expectation - assign the nearest points to each centroid
Maximization - reposition the centroid
Wash, rinse, repeat (the expectation and maximization steps)
[Animation: centroids C1, C2, and C3 shift position on each iteration until they stop moving]
Stop!
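
The guess, assign, reposition, repeat loop can be sketched in plain Python. This is a single-machine illustration, not Mahout's distributed implementation. The points, starting centroids, tolerance, and function names are all made up.

```python
from math import dist


def assign(points, centroids):
    """Expectation: group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))
        clusters[nearest].append(point)
    return clusters


def reposition(clusters, old_centroids):
    """Maximization: move each centroid to the mean of its points."""
    moved = []
    for cluster, old in zip(clusters, old_centroids):
        if not cluster:  # an empty cluster keeps its centroid where it was
            moved.append(old)
            continue
        dims = len(cluster[0])
        moved.append(tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims)))
    return moved


def k_means(points, centroids, tolerance=1e-6, max_iterations=100):
    """Repeat both steps until no centroid moves more than `tolerance`."""
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        moved = reposition(clusters, centroids)
        if all(dist(a, b) <= tolerance for a, b in zip(centroids, moved)):
            return moved, clusters  # close enough: stop
        centroids = moved
    return centroids, assign(points, centroids)


# Hypothetical 2-D points forming two obvious groups, with k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The tolerance is the "close enough" stopping condition; `max_iterations` is a safety net in case the centroids never settle.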
Classification
BFF39D   577335     B3E631   D0F5B0   90B073   AFCF3C




                  Green
Attributes of Classification Algorithms

  Require training (supervised)
  Make a single decision with a very limited set of outcomes
Classification


Typical answers naturally fit into categories
Examples of Classification

  Credit card fraud prediction
  Customer attrition
  Diabetes detector
  Search Engine
Training - the learning process that produces a model
Model - output of the training algorithm
Predictor variable - input for classification model
Target variable - what we are trying to predict
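
The four terms map onto code like this. The sketch is a toy nearest-average classifier in plain Python, not one of Mahout's algorithms. The labeled training colors are made up; the colors being classified are the hex codes from the earlier Classification slide.

```python
def hex_to_rgb(code):
    """Predictor variables: the red, green, and blue channels of a hex color."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def train(examples):
    """Training: average the predictor variables per label. The result is the model."""
    sums, counts = {}, {}
    for code, label in examples:
        rgb = hex_to_rgb(code)
        running = sums.get(label, (0, 0, 0))
        sums[label] = tuple(t + c for t, c in zip(running, rgb))
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(t / counts[label] for t in sums[label]) for label in sums}


def classify(model, code):
    """Predict the target variable: the label whose average color is closest."""
    rgb = hex_to_rgb(code)
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(rgb, model[label])))


# Hypothetical labeled training data. The label is the target variable.
training_examples = [
    ("00FF00", "Green"), ("66CC33", "Green"),
    ("FF0000", "Red"), ("CC3333", "Red"),
    ("0000FF", "Blue"), ("3333CC", "Blue"),
]
model = train(training_examples)

slide_colors = ["BFF39D", "577335", "B3E631", "D0F5B0", "90B073", "AFCF3C"]
print([classify(model, code) for code in slide_colors])  # all 'Green'
```

None of the slide colors appear in the training data; the model generalizes from the examples it was given, which is the whole point of training.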
Classification
Common Algorithms
Stochastic Gradient Descent (SGD)
Support Vector Machine (SVM)
Naive Bayes
Complementary Naive Bayes
Random Forest
Going Distributed
Overhead


Parallel processing requires management overhead
Especially when spread over multiple machines
Vector                     SequenceFile

SequenceFile keys implement WritableComparable
SequenceFile values implement Writable
Hadoop               Java

WritableComparable   Comparable
Writable             Serializable
Recap
Recommender


Rank large datasets
Clustering

Group your data
Classification

Train me to think like you
Integration with Hadoop


Through SequenceFiles and Map/Reduce jobs
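
The shape of a Map/Reduce job, sketched in memory with plain Python. There is no Hadoop here and no SequenceFile; the function names and input are hypothetical. The point is only the pattern: map each record to key/value pairs, group by key, reduce each group.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(document):
    """Mapper: emit (word, 1) for every word."""
    return [(word, 1) for word in document.lower().split()]


def total(key, values):
    """Reducer: add up the counts for one word."""
    return sum(values)


print(map_reduce(["a b a", "b c"], count_words, total))  # {'a': 2, 'b': 2, 'c': 1}
```

On a cluster the mapper and reducer run on different machines, which is why keys and values have to be serializable and keys have to be comparable.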
                                                             Thanks!