Running with Elephants
Predictive Analytics with Mahout & HDInsight
Introduction
Chris Price
Senior BI Consultant with Pragmatic Works
Author
Regular Speaker
Data Geek & Super Dad!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
You are the demo….
SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net
Create an Account… Rate some beers…
Don’t worry your info
will only be sold to the
HIGHEST bidder
Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
Making the Business Case
Objective
Increase
Revenue
Increase #
of Orders
Increase
Items per
Order
Increase
Average
Item Price
Up-Sell Website
Navigational
Inefficiency
Cross-Sell
Business Case Example
Increased
Revenue
Recommendation Engines
• Take observation data and use data mining/machine
learning algorithms to predict outcomes
• Assumptions:
• People with similar interests have common preferences
• Sufficiently large number of preferences available
Recommendation Options
• Collaborative Filtering (Mahout)
• User-Based
• Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
• Association
• Clustering
Technology
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (Seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (On-Premise Solution)
Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
• Build User/Item matrix
• Calculate User Similarity
• Form Neighborhoods
• Generate Recommendations
Sources of Data
• Explicit
• Ratings
• Feedback
• Demographics
• Psychographics (Personality/Lifestyle/Attitude)
• Ephemeral Need (a need of the moment)
• Implicit
• Purchase History
• Click/Browse History
• Product/Item
• Taxonomy
• Attributes
• Descriptions
Our focus for today
Data Preparation
• Clean-Up:
• Remove Outliers (Z-Score)
• Remove frequent buyers (Skew)
• Normalize Data (Unity-Based)
• Format Data into CSV input file:
<User ID>, <Item ID>, <Rating>
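A minimal sketch of the clean-up steps above (the cutoff value and helper names are illustrative, not from the deck): drop ratings whose z-score exceeds a cutoff, then unity-normalize the rest to [0, 1].

```java
import java.util.*;
import java.util.stream.*;

public class DataPrep {
    // Keep only values within |z| <= cutoff standard deviations of the mean.
    public static List<Double> removeOutliers(List<Double> ratings, double cutoff) {
        double mean = ratings.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = ratings.stream().mapToDouble(r -> (r - mean) * (r - mean)).average().orElse(0);
        double std = Math.sqrt(var);
        return ratings.stream()
                .filter(r -> std == 0 || Math.abs((r - mean) / std) <= cutoff)
                .collect(Collectors.toList());
    }

    // Unity-based normalization: (x - min) / (max - min).
    public static List<Double> normalize(List<Double> ratings) {
        double min = Collections.min(ratings), max = Collections.max(ratings);
        return ratings.stream()
                .map(r -> max == min ? 0.0 : (r - min) / (max - min))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> raw = Arrays.asList(3.0, 4.0, 5.0, 2.0, 50.0); // 50.0 is an outlier
        List<Double> clean = removeOutliers(raw, 1.5);
        System.out.println(normalize(clean));
    }
}
```

Each cleaned row is then written out as the `<User ID>, <Item ID>, <Rating>` CSV line shown above.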
How It Works
• Build a User/Item Matrix
(User/item matrix: users 1…N as rows, items 1…n as columns; a cell contains 1 where that user has a preference for that item, and is empty otherwise.)
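The matrix above is sparse, so in practice it can be sketched as a map from each user ID to the set of item IDs that user has a preference for (a hypothetical helper, not the Mahout representation):

```java
import java.util.*;

public class UserItemMatrix {
    // Build a sparse user/item matrix from (userID, itemID) preference pairs.
    public static Map<Integer, Set<Integer>> build(int[][] prefs) {
        Map<Integer, Set<Integer>> matrix = new TreeMap<>();
        for (int[] p : prefs) {
            matrix.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        }
        return matrix;
    }

    public static void main(String[] args) {
        // (userID, itemID) pairs, e.g. user 1 has preferences for items 2, 5 and 9.
        int[][] prefs = { {1, 2}, {1, 5}, {1, 9}, {2, 2}, {2, 7}, {3, 5} };
        System.out.println(build(prefs)); // {1=[2, 5, 9], 2=[2, 7], 3=[5]}
    }
}
```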
Neighborhood Formation
(Diagram: users U1–U7 plotted by similarity; nearby users, such as U1, U5 and U7, form a neighborhood.)
Neighborhood Formation
• Requires some experimentation
• Similarity Metrics
• Pearson Correlation
• Euclidean Distance
• Spearman Correlation
• Cosine
• Tanimoto Coefficient
• Log-Likelihood
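Three of the metrics above, sketched over two users' ratings (a toy illustration, not the Mahout implementations): cosine over the raw vectors, Pearson as cosine of the mean-centered vectors, and Tanimoto over the sets of rated items.

```java
import java.util.*;

public class Similarity {
    // Cosine: dot(a, b) / (|a| * |b|)
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pearson correlation: cosine of the mean-centered vectors.
    public static double pearson(double[] a, double[] b) {
        double ma = Arrays.stream(a).average().orElse(0);
        double mb = Arrays.stream(b).average().orElse(0);
        double[] ca = new double[a.length], cb = new double[b.length];
        for (int i = 0; i < a.length; i++) { ca[i] = a[i] - ma; cb[i] = b[i] - mb; }
        return cosine(ca, cb);
    }

    // Tanimoto (Jaccard) coefficient on the item sets two users rated, ignoring values.
    public static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    public static void main(String[] args) {
        double[] u1 = {4, 3, 5}, u2 = {5, 2, 5};
        System.out.printf("cosine=%.3f pearson=%.3f%n", cosine(u1, u2), pearson(u1, u2));
        System.out.println(tanimoto(Set.of(1, 2, 3), Set.of(2, 3, 4))); // 0.5
    }
}
```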
How It Works
• Find users similar to U5
• Use a similarity metric to find the k nearest neighbors (kNN)
• U1 & U7 are identified as most similar to U5
(The same user/item matrix, with the rows for U1, U5 and U7 highlighted to show their overlapping preferences.)
How It Works
• Generate Recommendations:
• Find items U5 has not yet reviewed (I1 and I6)
• Predict each missing rating by taking a similarity-weighted sum
(The same matrix with the predicted ratings, 0.5 and 0.7, filled into previously empty cells for the unreviewed items.)
Pseudo-Code Implementation
for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    add v's preference for i, weighted by s, to a running average
return the top-ranked items (by weighted average)
To scale, restrict v to u's neighborhood
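The pseudo-code above, sketched in plain Java (hypothetical types, not the Mahout API): predict u's rating for item i as the similarity-weighted average of the neighbors' ratings for i.

```java
import java.util.*;

public class WeightedPrediction {
    // ratings: userID -> (itemID -> rating); sims: neighborID -> similarity to u
    public static double predict(int item,
                                 Map<Integer, Map<Integer, Double>> ratings,
                                 Map<Integer, Double> sims) {
        double weighted = 0, totalSim = 0;
        for (Map.Entry<Integer, Double> e : sims.entrySet()) {
            Double r = ratings.getOrDefault(e.getKey(), Map.of()).get(item);
            if (r != null) {                   // neighbor v has a preference for i
                weighted += e.getValue() * r;  // weight v's rating by similarity s
                totalSim += e.getValue();
            }
        }
        return totalSim == 0 ? Double.NaN : weighted / totalSim;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> ratings = Map.of(
                1, Map.of(6, 4.0),   // U1 rated I6 as 4.0
                7, Map.of(6, 2.0));  // U7 rated I6 as 2.0
        Map<Integer, Double> sims = Map.of(1, 0.75, 7, 0.25); // U1 is more similar
        System.out.println(predict(6, ratings, sims)); // (0.75*4 + 0.25*2) / 1.0 = 3.5
    }
}
```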
Mahout Implementation
• Real-Time Recommendations
• Write Java Code and host in JVM Instance
• Limited scalability
• Requires Training Data
• Integration typically handled through web services
• Batch-Based Recommendations
• Uses MapReduce jobs on Hadoop
• Offline and slow, but scalable
• Out-of-the-box recommender jobs
Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
• Compute Weights
• Compute Similarities
• Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (combine steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend
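At its core, the pipeline above multiplies the item similarity matrix by each user's preference vector and recommends the highest-scoring unrated items. A dense toy sketch of that multiply-and-aggregate step (the real Mahout jobs do this sparsely across MapReduce tasks):

```java
public class CooccurrenceScore {
    // Score every item for one user: similarity matrix times preference vector.
    public static double[] score(double[][] itemSim, double[] userPrefs) {
        double[] scores = new double[itemSim.length];
        for (int i = 0; i < itemSim.length; i++)
            for (int j = 0; j < userPrefs.length; j++)
                scores[i] += itemSim[i][j] * userPrefs[j]; // partial multiply + aggregate
        return scores;
    }

    public static void main(String[] args) {
        // 3 items; the user has rated item 0 (5.0) and item 2 (3.0).
        double[][] sim = {
                {1.0, 0.2, 0.8},
                {0.2, 1.0, 0.4},
                {0.8, 0.4, 1.0}};
        double[] prefs = {5.0, 0.0, 3.0};
        System.out.println(java.util.Arrays.toString(score(sim, prefs)));
    }
}
```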
Integrating Mahout
• Real-Time
• Requires Java coding
• Web Service
• Process:
• Load training data (memory pressure)
• Generate recommendations
• Batch
• ETL from source
• Generate input file (UserID, ItemID, Rating)
• Load to HDFS
• Process with Mahout/Hadoop
• ETL output from HDFS/Hadoop
• 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
• UserID [ItemID:Estimated Rating, …]
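Consuming that batch output during the ETL step amounts to parsing each line into a user ID plus a map of item IDs to estimated ratings. A small sketch (hypothetical helper, not part of Mahout):

```java
import java.util.*;

public class OutputParser {
    // Parse the "[itemID:rating,...]" portion of one output line.
    public static Map<Integer, Double> parse(String line) {
        int open = line.indexOf('['), close = line.indexOf(']');
        Map<Integer, Double> recs = new LinkedHashMap<>();
        for (String pair : line.substring(open + 1, close).split(",")) {
            String[] kv = pair.split(":");
            recs.put(Integer.parseInt(kv[0].trim()), Double.parseDouble(kv[1]));
        }
        return recs;
    }

    public static void main(String[] args) {
        String line = "7\t[1:4.5,2:4.5,3:4.5]";
        int userId = Integer.parseInt(line.substring(0, line.indexOf('[')).trim());
        System.out.println(userId + " -> " + parse(line)); // 7 -> {1=4.5, 2=4.5, 3=4.5}
    }
}
```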
Handling Recommendations
Storing Recommendations:
• Hive
• Data Warehouse system for Hadoop
• Hive ODBC Driver
• MongoDB
• A leading NoSQL database
• JSON-like storage with flexible schema
• C#/.Net MongoDB Driver
• HBase
• Open-source distributed, column-oriented database modeled after
Google’s BigTable
• Use Pig/MapReduce to process output files and load HBase table
• Java API for easy reading
• Source System (SQL Server, etc.)
Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into training & test sets:
• Average Difference
• Root-Mean Square
• How it works:

                     I1    I2    I3
Estimated Review     3.5   4.0   1.5
Actual Review        4.0   2.0   2.0
Absolute Difference  0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) = √1.5 ≈ 1.22
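The arithmetic above in code: mean absolute difference and root-mean-square error between the estimated and actual reviews.

```java
public class EvalMetrics {
    // Mean absolute difference between estimates and actuals.
    public static double mae(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) sum += Math.abs(est[i] - act[i]);
        return sum / est.length;
    }

    // Root-mean-square error between estimates and actuals.
    public static double rmse(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) sum += (est[i] - act[i]) * (est[i] - act[i]);
        return Math.sqrt(sum / est.length);
    }

    public static void main(String[] args) {
        double[] est = {3.5, 4.0, 1.5};
        double[] act = {4.0, 2.0, 2.0};
        System.out.println(mae(est, act));  // 1.0
        System.out.println(rmse(est, act)); // ~1.22
    }
}
```

RMSE penalizes the large miss on I2 more heavily than the average difference does, which is why it comes out higher.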
Evaluating the Recommendations
DataModel model = new FileDataModel(new File("ratings.csv"));
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Form neighborhoods of the 10 nearest users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};
// Train on 70% of the data and evaluate against the remaining 30%
double score = eval.evaluate(bldr, model, 0.7, 1.0);
Challenges
1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality
Context Challenges
(Slide visual: it's January, 20 degrees and snowing; which recommendation fits? Ratings alone don't capture the moment.)
Other Challenges
• Cold Start
• Occurs when either a new item or new user is introduced
• Can be handled by:
• Substituting an average item/user profile
• Using another recommendation technique (Content-Based)
• Data Sparsity
• Users rate only a tiny fraction of the items, so overlapping preferences are hard to find
• Popularity Bias
• Skewed towards popular items, people with “unique” taste are
left out
• Curse of Dimensionality
• More items/user leads to more noise and greater error
Resources
Mahout in Action
Sean Owen, Robin Anil, Ted Dunning,
Ellen Friedman
Hadoop: The Definitive Guide
Tom White
Thank You!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
QUESTIONS???
