Running with Elephants
Predictive Analytics with Mahout & HDInsight
Introduction
Chris Price
Senior BI Consultant with Pragmatic Works
Author
Regular Speaker
Data Geek & Super Dad!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
You are the demo….
SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net
Create an Account… Rate some beers…
Don’t worry your info
will only be sold to the
HIGHEST bidder
Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
Making the Business Case
Objective
Increase
Revenue
Increase #
of Orders
Increase
Items per
Order
Increase
Average
Item Price
Up-Sell Website
Navigational
Inefficiency
Cross-Sell
Business Case Example
Increased
Revenue
Recommendation Engines
• Take observation data and use data mining/machine
learning algorithms to predict outcomes
• Assumptions:
• People with similar interests have common preferences
• Sufficiently large number of preferences available
Recommendation Options
• Collaborative Filtering (Mahout)
• User-Based
• Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
• Association
• Clustering
Technology
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (Seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (On-Premise Solution)
Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
• Build User/Item matrix
• Calculate User Similarity
• Form Neighborhoods
• Generate Recommendations
Sources of Data
• Explicit
• Ratings
• Feedback
• Demographics
• Psychographics (Personality/Lifestyle/Attitude)
• Ephemeral Need (a need of the moment)
• Implicit
• Purchase History
• Click/Browse History
• Product/Item
• Taxonomy
• Attributes
• Descriptions
Our focus for today
Data Preparation
• Clean-Up:
• Remove Outliers (Z-Score)
• Remove frequent buyers (Skew)
• Normalize Data (Unity-Based)
• Format Data into CSV input file:
<User ID>, <Item ID>, <Rating>
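A minimal sketch of the clean-up steps above (the cutoff value and helper names are illustrative, not from the deck): drop ratings whose z-score exceeds a cutoff, then unity-normalize the rest to [0, 1].

```java
import java.util.*;
import java.util.stream.*;

public class DataPrep {
    // Keep only values within |z| <= cutoff standard deviations of the mean.
    public static List<Double> removeOutliers(List<Double> ratings, double cutoff) {
        double mean = ratings.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = ratings.stream().mapToDouble(r -> (r - mean) * (r - mean)).average().orElse(0);
        double std = Math.sqrt(var);
        return ratings.stream()
                .filter(r -> std == 0 || Math.abs((r - mean) / std) <= cutoff)
                .collect(Collectors.toList());
    }

    // Unity-based normalization: (x - min) / (max - min).
    public static List<Double> normalize(List<Double> ratings) {
        double min = Collections.min(ratings), max = Collections.max(ratings);
        return ratings.stream()
                .map(r -> max == min ? 0.0 : (r - min) / (max - min))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> raw = Arrays.asList(3.0, 4.0, 5.0, 2.0, 50.0); // 50.0 is an outlier
        List<Double> clean = removeOutliers(raw, 1.5);
        System.out.println(normalize(clean));
    }
}
```

Each cleaned row is then written out as the `<User ID>, <Item ID>, <Rating>` CSV line shown above.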
How It Works
• Build a User/Item Matrix
(User/item matrix: users 1…N as rows, items 1…n as columns; a cell contains 1 where that user has a preference for that item, and is empty otherwise.)
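The matrix above is sparse, so in practice it can be sketched as a map from each user ID to the set of item IDs that user has a preference for (a hypothetical helper, not the Mahout representation):

```java
import java.util.*;

public class UserItemMatrix {
    // Build a sparse user/item matrix from (userID, itemID) preference pairs.
    public static Map<Integer, Set<Integer>> build(int[][] prefs) {
        Map<Integer, Set<Integer>> matrix = new TreeMap<>();
        for (int[] p : prefs) {
            matrix.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        }
        return matrix;
    }

    public static void main(String[] args) {
        // (userID, itemID) pairs, e.g. user 1 has preferences for items 2, 5 and 9.
        int[][] prefs = { {1, 2}, {1, 5}, {1, 9}, {2, 2}, {2, 7}, {3, 5} };
        System.out.println(build(prefs)); // {1=[2, 5, 9], 2=[2, 7], 3=[5]}
    }
}
```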
Neighborhood Formation
(Diagram: users U1–U7 plotted by similarity; nearby users, such as U1, U5 and U7, form a neighborhood.)
Neighborhood Formation
• Requires some experimentation
• Similarity Metrics
• Pearson Correlation
• Euclidean Distance
• Spearman Correlation
• Cosine
• Tanimoto Coefficient
• Log-Likelihood
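Three of the metrics above, sketched over two users' ratings (a toy illustration, not the Mahout implementations): cosine over the raw vectors, Pearson as cosine of the mean-centered vectors, and Tanimoto over the sets of rated items.

```java
import java.util.*;

public class Similarity {
    // Cosine: dot(a, b) / (|a| * |b|)
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pearson correlation: cosine of the mean-centered vectors.
    public static double pearson(double[] a, double[] b) {
        double ma = Arrays.stream(a).average().orElse(0);
        double mb = Arrays.stream(b).average().orElse(0);
        double[] ca = new double[a.length], cb = new double[b.length];
        for (int i = 0; i < a.length; i++) { ca[i] = a[i] - ma; cb[i] = b[i] - mb; }
        return cosine(ca, cb);
    }

    // Tanimoto (Jaccard) coefficient on the item sets two users rated, ignoring values.
    public static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    public static void main(String[] args) {
        double[] u1 = {4, 3, 5}, u2 = {5, 2, 5};
        System.out.printf("cosine=%.3f pearson=%.3f%n", cosine(u1, u2), pearson(u1, u2));
        System.out.println(tanimoto(Set.of(1, 2, 3), Set.of(2, 3, 4))); // 0.5
    }
}
```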
How It Works
• Find users similar to U5
• Use a similarity metric to find the k nearest neighbors (kNN)
• U1 & U7 are identified as most similar to U5
(The same user/item matrix, with the rows for U1, U5 and U7 highlighted to show their overlapping preferences.)
How It Works
• Generate Recommendations:
• Find items U5 has not yet reviewed (I1 and I6)
• Predict each missing rating by taking a similarity-weighted sum
(The same matrix with the predicted ratings, 0.5 and 0.7, filled into previously empty cells for the unreviewed items.)
Pseudo-Code Implementation
for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    add v's preference for i, weighted by s, to a running average
return the top-ranked items (by weighted average)
To scale, restrict v to u's neighborhood
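The pseudo-code above, sketched in plain Java (hypothetical types, not the Mahout API): predict u's rating for item i as the similarity-weighted average of the neighbors' ratings for i.

```java
import java.util.*;

public class WeightedPrediction {
    // ratings: userID -> (itemID -> rating); sims: neighborID -> similarity to u
    public static double predict(int item,
                                 Map<Integer, Map<Integer, Double>> ratings,
                                 Map<Integer, Double> sims) {
        double weighted = 0, totalSim = 0;
        for (Map.Entry<Integer, Double> e : sims.entrySet()) {
            Double r = ratings.getOrDefault(e.getKey(), Map.of()).get(item);
            if (r != null) {                   // neighbor v has a preference for i
                weighted += e.getValue() * r;  // weight v's rating by similarity s
                totalSim += e.getValue();
            }
        }
        return totalSim == 0 ? Double.NaN : weighted / totalSim;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> ratings = Map.of(
                1, Map.of(6, 4.0),   // U1 rated I6 as 4.0
                7, Map.of(6, 2.0));  // U7 rated I6 as 2.0
        Map<Integer, Double> sims = Map.of(1, 0.75, 7, 0.25); // U1 is more similar
        System.out.println(predict(6, ratings, sims)); // (0.75*4 + 0.25*2) / 1.0 = 3.5
    }
}
```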
Mahout Implementation
• Real-Time Recommendations
• Write Java Code and host in JVM Instance
• Limited scalability
• Requires Training Data
• Integration typically handled through web services
• Batch-Based Recommendations
• Uses MapReduce jobs on Hadoop
• Offline and slow, but scalable
• Out-of-the-box recommender jobs
Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
• Compute Weights
• Compute Similarities
• Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (combine steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend
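At its core, the pipeline above multiplies the item similarity matrix by each user's preference vector and recommends the highest-scoring unrated items. A dense toy sketch of that multiply-and-aggregate step (the real Mahout jobs do this sparsely across MapReduce tasks):

```java
public class CooccurrenceScore {
    // Score every item for one user: similarity matrix times preference vector.
    public static double[] score(double[][] itemSim, double[] userPrefs) {
        double[] scores = new double[itemSim.length];
        for (int i = 0; i < itemSim.length; i++)
            for (int j = 0; j < userPrefs.length; j++)
                scores[i] += itemSim[i][j] * userPrefs[j]; // partial multiply + aggregate
        return scores;
    }

    public static void main(String[] args) {
        // 3 items; the user has rated item 0 (5.0) and item 2 (3.0).
        double[][] sim = {
                {1.0, 0.2, 0.8},
                {0.2, 1.0, 0.4},
                {0.8, 0.4, 1.0}};
        double[] prefs = {5.0, 0.0, 3.0};
        System.out.println(java.util.Arrays.toString(score(sim, prefs)));
    }
}
```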
Integrating Mahout
• Real-Time
• Requires Java coding
• Web Service
• Process:
• Load training data (memory pressure)
• Generate recommendations
• Batch
• ETL from source
• Generate input file (UserID, ItemID, Rating)
• Load to HDFS
• Process with Mahout/Hadoop
• ETL output from HDFS/Hadoop
• 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
• UserID [ItemID:Estimated Rating, …]
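Consuming that batch output during the ETL step amounts to parsing each line into a user ID plus a map of item IDs to estimated ratings. A small sketch (hypothetical helper, not part of Mahout):

```java
import java.util.*;

public class OutputParser {
    // Parse the "[itemID:rating,...]" portion of one output line.
    public static Map<Integer, Double> parse(String line) {
        int open = line.indexOf('['), close = line.indexOf(']');
        Map<Integer, Double> recs = new LinkedHashMap<>();
        for (String pair : line.substring(open + 1, close).split(",")) {
            String[] kv = pair.split(":");
            recs.put(Integer.parseInt(kv[0].trim()), Double.parseDouble(kv[1]));
        }
        return recs;
    }

    public static void main(String[] args) {
        String line = "7\t[1:4.5,2:4.5,3:4.5]";
        int userId = Integer.parseInt(line.substring(0, line.indexOf('[')).trim());
        System.out.println(userId + " -> " + parse(line)); // 7 -> {1=4.5, 2=4.5, 3=4.5}
    }
}
```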
Handling Recommendations
Storing Recommendations:
• Hive
• Data Warehouse system for Hadoop
• Hive ODBC Driver
• MongoDB
• A leading NoSQL database
• JSON-like storage with flexible schema
• C#/.Net MongoDB Driver
• HBase
• Open-source distributed, column-oriented database modeled after
Google’s BigTable
• Use Pig/MapReduce to process output files and load HBase table
• Java API for easy reading
• Source System (SQL Server, etc.)
Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into training & test sets:
• Average Difference
• Root-Mean Square
• How it works:

                     I1    I2    I3
Estimated Review     3.5   4.0   1.5
Actual Review        4.0   2.0   2.0
Absolute Difference  0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) = √1.5 ≈ 1.22
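The arithmetic above in code: mean absolute difference and root-mean-square error between the estimated and actual reviews.

```java
public class EvalMetrics {
    // Mean absolute difference between estimates and actuals.
    public static double mae(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) sum += Math.abs(est[i] - act[i]);
        return sum / est.length;
    }

    // Root-mean-square error between estimates and actuals.
    public static double rmse(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) sum += (est[i] - act[i]) * (est[i] - act[i]);
        return Math.sqrt(sum / est.length);
    }

    public static void main(String[] args) {
        double[] est = {3.5, 4.0, 1.5};
        double[] act = {4.0, 2.0, 2.0};
        System.out.println(mae(est, act));  // 1.0
        System.out.println(rmse(est, act)); // ~1.22
    }
}
```

RMSE penalizes the large miss on I2 more heavily than the average difference does, which is why it comes out higher.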
Evaluating the Recommendations
DataModel model = new FileDataModel(new File("ratings.csv"));
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Form neighborhoods of the 10 nearest users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};
// Train on 70% of the data and evaluate against the remaining 30%
double score = eval.evaluate(bldr, model, 0.7, 1.0);
Challenges
1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality
Context Challenges
(Slide visual: it's January, 20 degrees and snowing; which recommendation fits? Ratings alone don't capture the moment.)
Other Challenges
• Cold Start
• Occurs when either a new item or new user is introduced
• Can be handled by:
• Substituting an average item/user profile
• Using another recommendation technique (Content-Based)
• Data Sparsity
• Users rate only a tiny fraction of the items, so overlapping preferences are hard to find
• Popularity Bias
• Skewed towards popular items, people with “unique” taste are
left out
• Curse of Dimensionality
• More items/user leads to more noise and greater error
Resources
Mahout in Action
Sean Owen, Robin Anil, Ted Dunning,
Ellen Friedman
Hadoop: The Definitive Guide
Tom White
Thank You!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
QUESTIONS???
