Hadoop Summit 2010: Machine Learning Using Hadoop
Machine Learning on Hadoop
              Krishna Prasad Chitrapura
              Sr. Scientist, Yahoo! Labs
              pkrishna@yahoo-inc.com
Outline
  •  ML 101
      –  Basic formulation
    –  ML is not Data mining
          Generalization and Optimality
  •  Issues using Hadoop for ML
      –  Iterations
      –  Sparseness
  •  Case Study:
      –  Learning URL Patterns for Webpage De-duplication, published in
         WSDM 2010.
      –  PLANET: Massively Parallel Learning of Tree Ensembles with
         MapReduce, VLDB 2009.
ML 101
             •  Basic problem:
                 –  Matrix of data points and features.
                 –  Each data point is labeled.
                 –  Learn the labeling function and predict the labels of unseen data
                    points.
                        If the label is numeric the task is regression; otherwise it is
                         classification (a sketch of this setup follows the figure below).
                     [Figure: an N × M table of N data points by M features/attributes, with a column of labels.]
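
To make this formulation concrete, here is a minimal Python sketch (purely illustrative: the toy matrix, the labels, and the use of scikit-learn's DecisionTreeClassifier are assumptions, not part of the talk):

    # Illustrative only: an N x M matrix of data points and features, each row labeled.
    from sklearn.tree import DecisionTreeClassifier

    # N = 4 data points, M = 3 features; the labels are classes ("Pass"/"Fail"),
    # so this is classification -- numeric labels would make it regression.
    X = [[88, 76, 43],
         [60, 45, 32],
         [91, 80, 70],
         [40, 38, 55]]
    y = ["Pass", "Fail", "Pass", "Fail"]

    model = DecisionTreeClassifier()
    model.fit(X, y)                        # learn the labeling function
    print(model.predict([[85, 70, 50]]))   # predict the label of an unseen point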
Data Mining vs Machine Learning
  •  Machine learning is about finding a guaranteed generalized
     approximation to the boundary separating the classes.
  •  Data mining is about describing the data using simple algebra.
      –  Hadoop is perfect for data processing and Mining.
  •  An Example (student marks → Class: Pass/Fail)

 Student   Course1   Course2   Course3   Course4   Course5   Course6   Course7   Class
 R1        88        76        43        54        90        55        49        Pass
 R2        60        45        32        51        80        53        60        Fail
 …         …         …         …         …         …         …         …         …

  •  A hard problem
      –  Students who fail may not all fail because of the same course.
      –  Finding the boundary per course is not easy (lenient courses/evaluation).
How does a typical learning algorithm solve this?
  •  Intuition 1: Courses in which everyone fails or everyone passes are not of
     much use here (comments? Let us assume the range is unknown).
  •  Intuition 2: Courses in which roughly 50% pass and 50% fail? (Good, but these
     can over-fit if there is a big spread in marks.)
  •  Overall Intuition: Courses which have high density of labels and good
     separation are best.
  •  Optimality:
    –  Criteria:
           Separability assumption – convex guarantee (we do not pass
            someone who got low marks in a course based on their
            performance in other courses).
           Metric space of features (triangle inequality).
     –  An approximation to optimality can be obtained by greedy iterations or
        hill climbing (see the sketch below).
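
As a concrete example of the greedy search, here is a minimal sketch (an assumption for illustration, not the authors' code) that scans candidate thresholds on a single course and scores each by how cleanly it separates Pass from Fail:

    def best_threshold(marks, labels):
        # Greedy scan over candidate thresholds on one course: score each split by
        # how many students it classifies correctly when everyone at or above the
        # threshold is predicted "Pass".
        best = (None, -1)
        for t in sorted(set(marks)):
            predictions = ["Pass" if m >= t else "Fail" for m in marks]
            score = sum(p == l for p, l in zip(predictions, labels))
            if score > best[1]:
                best = (t, score)
        return best   # (threshold, number of correctly separated students)

    # Toy data: marks in one course and the final class.
    marks  = [88, 60, 35, 72, 41]
    labels = ["Pass", "Fail", "Fail", "Pass", "Fail"]
    print(best_threshold(marks, labels))   # (72, 5): a clean separation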
A Typical Tree:

[Figure: a small example decision tree with split nodes such as B >= 45 and D >= 35.]
How does ML work – continued?
 •  An Old class of learners – Tree induction.
     –  [Split] Choose attribute (subject) which can best describe the final
        class with least encoding.
           If the {attribute {=,≤,≥} value} can homogeneously describe the
            outcome you are done.
           Else for each {attribute {=,≤,≥} value} group choose another
            attribute and iterate from above.
      –  Intuition: Look at the toughest course – whoever gets low marks there
         also fails overall. Amongst the ones who passed this course, look at
         which course they failed and split on that (and so on).
      –  When do we stop? What do we mean by homogeneous?
      –  What is over-fit? How do we prune? (A sketch of the induction loop
         follows below.)
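
To make the induction loop concrete, here is a minimal Python sketch (an illustration under simplifying assumptions -- in-memory data, binary >= splits, entropy as the homogeneity measure, and a fixed depth limit -- not the PLANET implementation):

    import math
    from collections import Counter

    def entropy(labels):
        # 0 when the group is homogeneous, larger when the classes are mixed.
        counts = Counter(labels)
        return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

    def best_split(rows, labels):
        # Pick the (attribute, threshold) pair whose two groups are most homogeneous,
        # i.e. have the lowest weighted entropy.
        best, best_score = None, float("inf")
        for a in range(len(rows[0])):
            for t in {r[a] for r in rows}:
                ge = [l for r, l in zip(rows, labels) if r[a] >= t]
                lt = [l for r, l in zip(rows, labels) if r[a] < t]
                if not ge or not lt:
                    continue
                score = (len(ge) * entropy(ge) + len(lt) * entropy(lt)) / len(labels)
                if score < best_score:
                    best, best_score = (a, t), score
        return best

    def induce(rows, labels, depth=0, max_depth=3):
        split = best_split(rows, labels)
        # Stop when the node is homogeneous, no useful split exists, or depth runs out.
        if len(set(labels)) == 1 or split is None or depth == max_depth:
            return Counter(labels).most_common(1)[0][0]        # leaf: majority class
        a, t = split
        ge = [(r, l) for r, l in zip(rows, labels) if r[a] >= t]
        lt = [(r, l) for r, l in zip(rows, labels) if r[a] < t]
        return {"split": (a, t),
                "ge": induce([r for r, _ in ge], [l for _, l in ge], depth + 1, max_depth),
                "lt": induce([r for r, _ in lt], [l for _, l in lt], depth + 1, max_depth)}

Pruning and the over-fit question are not addressed here; the depth limit simply stands in for a stopping rule.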
How would I implement this in Map-Reduce?
      •  Series of Map-Reduces
      •  Each Stage:
        –  Map:
              Collect stats
                –  {Attribute {=,≤,≥} value}, {#Class1,#Class2,….}
        –  Reducer:
              Choose the best split (E.g.: Gain Ratio)
                 ∀k ∈ K:  IG(k) = Entropy(C) − Σ_{v ∈ values(c(k))} ( #{c(k) = v} / #c(k) ) · Entropy(C | c(k) = v)
                 (the sum runs over the distinct values v of the candidate attribute c(k))
      •  How good is this?
         –  Pretty bad (3B data points took well over 100 hours on 100 nodes). Why?
               The map stage blows up the output to (N×M) × the number of maps.
         –  One quick solution: combiners (see the streaming sketch below).
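
A rough Hadoop Streaming-style sketch of one such stage (illustrative only: the mapper.py/reducer.py names, the tab-separated record layout, and in-mapper counting standing in for a separate combiner script are all assumptions, not the authors' code). The mapper emits class counts per {attribute = value}; the reducer, which receives every count for one attribute, computes the information gain defined above:

    # ---- mapper.py : one count per (attribute, value, class) seen in this map's input.
    # Counting locally before emitting plays the combiner's role and keeps the map
    # output from blowing up to (N x M) x number of maps.
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        *features, label = line.rstrip("\n").split("\t")
        for attr, value in enumerate(features):
            counts[(attr, value, label)] += 1
    for (attr, value, label), n in counts.items():
        print(f"{attr}\t{value}\t{label}\t{n}")

    # ---- reducer.py : with the attribute as the partition/sort key, each reducer
    # sees the full per-value class histogram for one attribute and can score it.
    import sys, math
    from collections import defaultdict

    def entropy(hist):
        total = sum(hist.values())
        return -sum(c / total * math.log2(c / total) for c in hist.values() if c)

    per_value = defaultdict(lambda: defaultdict(int))   # value -> {class: count}
    current_attr = None

    def flush(attr):
        overall = defaultdict(int)
        for hist in per_value.values():
            for cls, n in hist.items():
                overall[cls] += n
        total = sum(overall.values())
        gain = entropy(overall) - sum(
            sum(h.values()) / total * entropy(h) for h in per_value.values())
        print(f"{attr}\t{gain}")

    for line in sys.stdin:
        attr, value, label, n = line.rstrip("\n").split("\t")
        if current_attr is not None and attr != current_attr:
            flush(current_attr)
            per_value.clear()
        current_attr = attr
        per_value[value][label] += int(n)
    if current_attr is not None:
        flush(current_attr)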
What else is bad?
•  Data sparsity on the Internet:
    –  Any attribute we choose on the Internet follows a power law
       (the layman's 80:20 rule).
    –  Lots of attribute values occur only once.
•  Why is this bad? (Not a blame game.)
        Hadoop's problem:
          –  Too many files.
          –  Each file becomes its own map.
          –  Empty reducers.
        Our problem: the majority of the splits are useless.
What tricks did we use?
  •  Observations:
      –  The first split is the hardest (you have to look at all the data).
             In fact, it is difficult to beat the performance of a single box
              that uses sampling.
      –  Most of the long tail can be grouped together.
  •  Tricks:
      –  Speculation helps:
             Not only Hadoop's speculative execution.
             While doing the first split, you can already choose the candidate
              splits for the next few levels.
      –  At each split, group together all attribute values whose counts are too
         small to matter (also use Gnu Natural Hash); a sketch of this grouping
         follows below.
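
A minimal sketch of the grouping trick (illustrative only; the min_count threshold and the __OTHER__ bucket name are assumptions): count how often each attribute value occurs and fold the long tail into one shared bucket before generating split candidates.

    from collections import Counter

    def group_tail(values, min_count=5, bucket="__OTHER__"):
        # Replace attribute values that occur fewer than min_count times with one
        # shared bucket, so the power-law tail of one-off values does not generate
        # millions of useless split candidates (and tiny files/maps downstream).
        counts = Counter(values)
        return [v if counts[v] >= min_count else bucket for v in values]

    hosts = ["a.com"] * 5 + ["b.com", "rare1.com", "rare2.com", "rare3.com"]
    print(group_tail(hosts))
    # ['a.com', 'a.com', 'a.com', 'a.com', 'a.com', '__OTHER__', '__OTHER__', '__OTHER__', '__OTHER__']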
Performance

                  [Chart: time taken (s) vs. depth of the tree (1–10), comparing Single Node (Sampling), 100 Node (No grouping), 100 Node (Grouping), and 100 Node (Speculation); our observations shown alongside Panda et al.]
To Conclude
 •  Hadoop is a great tool for data aggregations.
 •  With careful handling, one can obtain perfect scale-ups.
 •  Lots of research still needs to go on to build ML tools on Hadoop
     –  http://lucene.apache.org/mahout/
   –  Main Pieces to Build
         Smart way to carry information across iterations.
         Smart ways to avoid data sparsity.
   –  Small things Hadoop can help with
          Avoid unnecessary small files (maps across a single file).
          Automatic, balanced distribution of keys across reducers (a sketch
           follows below).
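
As a sketch of what an automatic balanced key distribution could look like (purely an assumption, not an existing Hadoop feature): estimate key frequencies from a sample, then assign keys to reducers greedily so that estimated load, rather than a plain hash, decides the placement.

    import heapq
    from collections import Counter

    def balanced_assignment(sample_keys, num_reducers):
        # Greedy bin packing: hand each heavy key to the currently lightest reducer,
        # using sampled counts as the load estimate.
        counts = Counter(sample_keys)
        heap = [(0, r) for r in range(num_reducers)]   # (estimated load, reducer id)
        heapq.heapify(heap)
        assignment = {}
        for key, n in counts.most_common():
            load, r = heapq.heappop(heap)
            assignment[key] = r
            heapq.heappush(heap, (load + n, r))
        return assignment

    print(balanced_assignment(["a"] * 90 + ["b"] * 5 + ["c"] * 5, 2))
    # {'a': 0, 'b': 1, 'c': 1} -- the heavy key gets a reducer of its own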
