Building and Deploying Big Data Analytic Models
Robert L. Grossman
Collin Bennett
Open Data Group
December 4, 2012
5.1 SAMS

(Figure: the SAMS view of deployed analytics. Models produce Scores; Scores drive Actions; Actions are tracked with Measures and dashboards by the model consumer.)

Analytic Diamond
(Figure: four facets. Analytic strategy & governance; analytic algorithms & models; analytic infrastructure; analytic operations, security & compliance.)

Building and consuming models
(Figure: a Model Producer builds a model from data using algorithms such as CART, SVM, or k-means; the model is exported as PMML to a Model Consumer, which produces scores, actions, and measures displayed on dashboards.)
Life Cycle of Predictive Model
(Figure: in the analytic modeling environment, exploratory data analysis (R) and data processing (Hadoop) feed model building in a dev/modeling environment; the model is exported as PMML and deployed in operational systems with a scoring application (Hadoop, streams & Augustus); log files from operations flow back to refine the model, and eventually the model is retired and an improved model is deployed.)
5.2 Analytic Algorithms & Models
Key Modeling Activities
• Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
• Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
• Building models – goal is to derive statistically valid models from the learning set
• Evaluating models – goal is to develop a meaningful report so that the performance of one model can be compared to another
?    Naïve Bayes
?    k-nearest neighbor
1957 k-means
1977 EM Algorithm
1984 CART
1993 C4.5
1994 Apriori
1995 SVM
1997 AdaBoost
1998 PageRank

Beware of any vendor whose unique value proposition includes some “secret analytic sauce.”

Source: X. Wu et al., “Top 10 algorithms in data mining,” Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
Pessimistic View: We Get a Significantly
New Algorithm Every Decade or So
1970s: neural networks
1980s: classification and regression trees
1990s: support vector machines
2000s: graph algorithms
In general, understanding the data through
exploratory analysis and generating good
features is much more important than the
type of predictive model you use.
Questions About Datasets
• Number of records
• Number of data fields
• Size
– Does it fit into memory?
– Does it fit on a single disk or disk array?
• How many missing values?
• How many duplicates?
• Are there labels (for the dependent variable)?
• Are there keys?
Questions About Data Fields
• Data fields
– What is the mean, variance, …?
– What is the distribution? Plot the histograms.
– What are extreme values, missing values, unusual
values?
• Pairs of data fields
– What is the correlation? Plot the scatter plots.
• Visualize the data
– Histograms, scatter plots, etc.
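A minimal sketch of this kind of exploratory pass, using pandas and matplotlib; the file name events.csv and the columns amount and balance are hypothetical placeholders.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input file and column names; substitute your own.
df = pd.read_csv("events.csv")

# Per-field summaries: mean, variance, extremes, and missing values.
print(df.describe())
print(df.isna().sum())

# Histogram of a single field.
df["amount"].hist(bins=50)
plt.show()

# Correlation and scatter plot for a pair of fields.
print(df[["amount", "balance"]].corr())
df.plot.scatter(x="amount", y="balance")
plt.show()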
(Figure: a cluster visualization showing stable clusters and a new cluster.)
Computing Count Distributions

def valueDistrib(b, field, select=None):
    # Given a table b of records (so b[i][field] is the value of attribute
    # field in record i), return a dict mapping each value of the field to
    # its count, for example the class distribution when field is the label.
    # If select is given, only the rows whose indices are in select are counted.
    ccdict = {}
    if select:
        for i in select:
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    else:
        for i in range(len(b)):
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    return ccdict
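For example, a hypothetical call on a small list of records (the field names are invented for illustration):

records = [
    {"state": "IL", "label": 1},
    {"state": "IL", "label": 0},
    {"state": "CA", "label": 1},
]
print(valueDistrib(records, "state"))                 # {'IL': 2, 'CA': 1}
print(valueDistrib(records, "label", select=[0, 2]))  # {1: 2}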
Features
• Normalize
– [0, 1] or [-1, 1]
• Aggregate
• Discretize or bin
• Transform continuous to continuous or
discrete to discrete
• Compute ratios
• Use percentiles
• Use indicator variables
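A brief sketch of a few of these transformations with pandas; the column name amount and the values are made up for illustration.

import pandas as pd

df = pd.DataFrame({"amount": [5.0, 20.0, 75.0, 300.0, 1200.0]})

# Normalize to [0, 1].
amin, amax = df["amount"].min(), df["amount"].max()
df["amount_norm"] = (df["amount"] - amin) / (amax - amin)

# Discretize / bin into quartiles (percentile-based bins).
df["amount_bin"] = pd.qcut(df["amount"], q=4, labels=False)

# Indicator variable.
df["is_large"] = (df["amount"] > 100).astype(int)

# Ratio relative to the mean.
df["amount_ratio"] = df["amount"] / df["amount"].mean()

print(df)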
Predicting Office Building Renewal Using Text-based Features

Classification   Avg R        Classification   Avg R
imports          5.17         technology       2.03
clothing         4.17         apparel          2.03
van              3.18         law              2.00
fashion          2.50         casualty         1.91
investment       2.42         bank             1.88
marketing        2.38         technologies     1.84
oil              2.11         trading          1.82
air              2.09         associates       1.81
system           2.06         staffing         1.77
foundation       2.04         securities       1.77
Twitter Intention & Prediction
• Use machine learning
classification
techniques:
– Support Vector
Machines
– Trees
– Naïve Bayes
• Models require clean
data and lots of hand-scored examples
(Figure: Namibia floods classified by the R package e1071; results for SVM, Naïve Bayes, and trees shown side by side.)
• Results are often implementation dependent.
If You Can Ask For Something
• Ask for more data
• Ask for “orthogonal data”
5.3 Multiple Models
Trees
For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone,
Classification and Regression Trees, 1984, Chapman & Hall.
Trees Partition Feature Space
• Trees partition the feature space into regions by asking whether an
attribute is less than a threshold.
(Figure: a decision tree that splits on M < 2.45, then asks R < 1.75? and M < 4.95?, together with the corresponding rectangular regions A, B, and C in the (M, R) plane defined by the thresholds M = 2.45, M = 4.95, and R = 1.75.)
Split Using Entropy
(Figure: objects have attributes; two ways of splitting the same set of objects, each with four candidate splits, are compared. The information before splitting is 0.64; the good split yields an increase in information of 0.32, while the poor split yields an increase of 0.)
Growing the Tree
Step 1. Class proportions. Node u contains n objects: n1 of class A (red), n2 of class B (blue), etc.
Step 2. Entropy: I(u) = - Σj (nj/n) log(nj/n)
Step 3. Split proportions: m1 objects are sent to child node u1, m2 to child node u2, etc.
Step 4. Choose the attribute that maximizes the gain D = I(u) - Σj (mj/n) I(uj)
(Figure: a node u of red and blue objects split into child nodes u1 and u2.)
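A small illustrative sketch of the entropy and information-gain computation just described, using base-2 logarithms (the slide leaves the base of the logarithm unspecified) and a two-way split:

import math
from collections import Counter

def entropy(labels):
    # I(u) = - sum over classes j of (nj/n) * log2(nj/n) at node u
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    # D = I(u) - sum over children j of (mj/n) * I(uj), for a two-way split
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = ["red", "red", "blue", "blue", "blue"]
print(information_gain(parent, ["red", "red"], ["blue", "blue", "blue"]))
# The split separates the classes perfectly, so the gain equals I(u), about 0.971.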
Multiple Models: Ensembles
Key Idea: Combine Weak Learners
• Average several weak models, rather than build
one complex model.
(Figure: three weak models, Model 1, Model 2, and Model 3, combined into a single ensemble.)
Combining Weak Learners
(Figure: Pascal's triangle of binomial coefficients: 1; 1 1; 1 2 1; 1 3 3 1; 1 4 6 4 1; 1 5 10 10 5 1.)

Accuracy of a single weak classifier vs. a majority vote of 3 or 5 such classifiers:
1 Classifier   3 Classifiers   5 Classifiers
55%            57.40%          59.30%
60%            64.0%           68.20%
65%            71.00%          76.50%
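The usual calculation behind a table like this uses the binomial coefficients from Pascal's triangle: with k independent classifiers, each correct with probability p, a strict majority vote is correct with the probability computed below. This sketch is for illustration; it matches the 5-classifier column closely and is near, though not identical to, the 3-classifier column as printed on the slide.

from math import comb

def majority_vote_accuracy(p, k):
    # Probability that a strict majority of k independent classifiers,
    # each correct with probability p, votes for the correct class.
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

for p in (0.55, 0.60, 0.65):
    print(p, round(majority_vote_accuracy(p, 3), 3), round(majority_vote_accuracy(p, 5), 3))
# 0.55 0.575 0.593
# 0.6 0.648 0.683
# 0.65 0.718 0.765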
Ensembles are unreasonably effective at building models.
Average for regression trees; use majority voting for classification. For example, an ensemble of 64 regression trees scores with the average
f(x) = ( f1(x) + f2(x) + … + f64(x) ) / 64
Building ensembles over clusters of
commodity workstations has been used
since the 1990’s.
Ensemble Models
1. Partition data and
scatter
2. Build models (e.g.
tree-based model)
3. Gather models
4. Form collection of
models into ensemble
(e.g. majority vote for
classification &
averaging for
regression)
Example: Fraud
• Ensembles of trees have proved remarkably
effective in detecting fraud
• Trees can be very large
• Very fast to score
• Lots of variants
– Random forests
– Random trees
• Sometimes need to reverse engineer reason
codes.
CUSUM
• Assume that we have a mean and standard deviation for both the baseline and the observed distributions.
(Figure: the baseline density f0(x) and the observed density f1(x).)
f0(x) = baseline density
f1(x) = observed density
g(x) = log( f1(x) / f0(x) )
Z0 = 0
Zn+1 = max{ Zn + g(xn+1), 0 }
Alert if Zn+1 > threshold
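A minimal sketch of this CUSUM recursion, assuming both the baseline and the observed densities are Gaussian with known means and standard deviations; the data and parameter values below are made up.

import math

def gaussian_logpdf(x, mu, sigma):
    # Log density of N(mu, sigma^2) at x.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def cusum(xs, mu0, sigma0, mu1, sigma1, threshold):
    # Z0 = 0; Zn+1 = max(Zn + g(xn+1), 0); alert when Z exceeds the threshold.
    z = 0.0
    for n, x in enumerate(xs):
        g = gaussian_logpdf(x, mu1, sigma1) - gaussian_logpdf(x, mu0, sigma0)
        z = max(z + g, 0.0)
        if z > threshold:
            return n  # index at which the change is flagged
    return None

# Baseline N(0, 1); the observations shift to roughly N(2, 1) halfway through.
data = [0.1, -0.4, 0.3, 0.2, -0.1, 2.1, 1.8, 2.4, 1.9, 2.2]
print(cusum(data, mu0=0.0, sigma0=1.0, mu1=2.0, sigma1=1.0, threshold=5.0))  # flags index 7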
Cubes of Models (Segmented Models)
Build a separate segmented model for every entity of interest.
Divide & conquer (segment) the data using multidimensional data cubes.
For each distinct cube, establish separate baselines for each quantity of interest.
Detect changes from the baselines.
(Figure: a data cube with dimensions such as entity (bank, etc.), geospatial region, and type of transaction; separate baselines are estimated for each quantity of interest in each cell.)
Examples - Change Detection
Using Cubes of Models
• Highway Traffic Data
– each day (7) x each hour (24) x each sensor
(hundreds) x each weather condition (5) x each
special event (dozens)
– 50,000 baseline models used in the current testbed
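One way to picture such a cube of baselines is a dictionary of running statistics keyed by the cube cell. The sketch below is illustrative only, and the field names (day, hour, sensor, weather, special, volume) are assumptions rather than the testbed's actual schema.

from collections import defaultdict

class Baseline:
    # Running mean and variance for one cube cell (Welford's online update).
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

# One baseline per (day of week, hour, sensor id, weather condition, special event).
baselines = defaultdict(Baseline)

def cell(event):
    return (event["day"], event["hour"], event["sensor"], event["weather"], event["special"])

def update_baseline(event):
    baselines[cell(event)].update(event["volume"])

def is_anomalous(event, k=3.0):
    # Flag a change when the observation is more than k standard deviations
    # from its cell's baseline (and the cell has enough history).
    b = baselines[cell(event)]
    return b.n > 30 and abs(event["volume"] - b.mean) > k * b.std()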
Multiple Models in Hadoop
• Trees in Mapper
• Build lots of small trees (over historical data in
Hadoop)
• Load all of them into the mapper
• Mapper
– For each event, score against each tree
– Map emits
• key = event id, value = (tree id, score)
• Reducer
– Aggregate scores for each event
– Select a “winner” for each event
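A sketch of this pattern as two Hadoop Streaming scripts in Python; the JSON event format and the two toy scoring functions are invented for illustration and stand in for the pre-built trees.

# mapper.py -- score every incoming event against every (toy) tree.
import sys, json

trees = [
    lambda e: 1.0 if e.get("amount", 0) > 100 else 0.0,   # stand-in for tree 0
    lambda e: 1.0 if e.get("amount", 0) > 500 else 0.0,   # stand-in for tree 1
]

for line in sys.stdin:
    event = json.loads(line)                # e.g. {"id": "e1", "amount": 250.0}
    for tree_id, tree in enumerate(trees):
        # key and value separated by a tab: event id, then (tree id, score)
        print(f"{event['id']}\t{tree_id},{tree(event)}")

# reducer.py -- aggregate the per-tree scores for each event and select a winner.
import sys
from collections import defaultdict

scores = defaultdict(list)
for line in sys.stdin:
    event_id, payload = line.rstrip("\n").split("\t")
    _tree_id, score = payload.split(",")
    scores[event_id].append(float(score))

for event_id, s in scores.items():
    winner = 1 if sum(1 for x in s if x >= 0.5) > len(s) / 2 else 0   # majority vote
    print(f"{event_id}\t{winner}")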
5.4 Evaluating and
Comparing Models
The Take Home Message
• Given two predictive models, develop a
methodology to determine which is better.
• Get a predictive model up in development
quickly, no matter how bad. Call it the
Champion.
• Two methodologies
– Champion-Challenger
– Automated Testing and Deployment Environment
Champion-Challenger
Methodology
• Develop a new model each week. Call it the
Challenger.
• Meet each week to discuss the performance
of the model. Compare the two models and
keep the better one as the new Champion.
Automated Testing & Deployment Environment
• Develop a custom environment for the large scale
testing and deployment of many (tens, hundreds,
thousands) of different models.
• Develop an appropriate experimental design.
• Test on a small scale.
• Deploy those that test well on a large(r) scale.
• Continuously improve the design and automation
of the process.
• Think of it as an automated scale-up of A/B testing.
Have Weekly Meetings
• Have the modeler, the business owner, and an engineer from the operations side meet once a week.
• Discuss misclassifications.
• Discuss the business value of the alerts.
• Avoid jargon (Detection rate, False Positives,
etc.) whenever possible. Instead…
• Always use visualizations.
2-Class Confusion Matrix
• There are two rates:
  true positive rate  = TPrate = (#TP)/(#P)
  false positive rate = FPrate = (#FP)/(#N)

                       Predicted positive   Predicted negative
True positive (#P)     #TP                  #P - #TP
True negative (#N)     #FP                  #N - #FP
Precision and Accuracy
• Recall the TP and FP rates are defined by:
– TPrate = TP / P = recall
– FPrate = FP / N
• Precision and accuracy are defined by:
– Precision = TP / (TP + FP)
– Classifier assigns TP + FP to the positive class
– Accuracy = (TP + TN) / (P + N)
Graph these rates to produce the ROC curve.
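A small sketch of these definitions in Python; the counts passed in are arbitrary illustrative values.

def rates(tp, fn, fp, tn):
    # P = tp + fn and N = fp + tn are the true positive and negative counts.
    p, n = tp + fn, fp + tn
    return {
        "TPrate (recall)": tp / p,
        "FPrate": fp / n,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
    }

print(rates(tp=60, fn=40, fp=20, tn=80))
# {'TPrate (recall)': 0.6, 'FPrate': 0.2, 'precision': 0.75, 'accuracy': 0.7}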
Why visualize performance?
• A single number tells only part of the story
– There are fundamentally two numbers of interest
(FP and TP), which cannot be summarized with a
single number.
– How are errors distributed across the classes?
• The shapes of the curves are much more informative.
Example: 3 classifiers

Classifier 1 (TPrate = 0.4, FPrate = 0.3)
                 Predicted pos   Predicted neg
True pos         40              60
True neg         30              70

Classifier 2 (TPrate = 0.7, FPrate = 0.5)
                 Predicted pos   Predicted neg
True pos         70              30
True neg         50              50

Classifier 3 (TPrate = 0.6, FPrate = 0.2)
                 Predicted pos   Predicted neg
True pos         60              40
True neg         20              80
ROC plot for the 3 Classifiers
(Figure: ROC plot with False Positive Rate on the x-axis and True Positive Rate on the y-axis; the perfect classifier sits at the top-left corner and a random classifier lies along the diagonal.)
ROC Curve
• Recall the TP and FP rates are defined by:
– TPrate = TP / P
– FPrate = FP / N
• The ROC Curve is defined by:
– x = FPrate
– y = TPrate
Creating an ROC Curve
• A classifier produces a single ROC point.
• If the classifier has a threshold or sensitivity
parameter, varying it produces a series of ROC
points (confusion matrices).
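A brief sketch of sweeping a score threshold to trace out ROC points; the scores and labels are made up.

def roc_points(scores, labels):
    # Each threshold yields one (FPrate, TPrate) point.
    p = sum(labels)
    n = len(labels) - p
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n, tp / p))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(roc_points(scores, labels))
# approximately [(0.0, 0.33), (0.0, 0.67), (0.33, 0.67), (0.33, 1.0), (0.67, 1.0), (1.0, 1.0)]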
(Figure: an example ROC curve.)
Performance of a Random Classifier
• When there are an equal number of Positives
(P) and Negatives (N), then a random classifier
is accurate about 50% of the time.
• The lift measures the relative performance of a classifier compared to a random classifier.
• Especially important with imbalanced classes.
Lift Curve
• The Lift Curve is defined by:
– x = (TP + FP) / (P + N)
– y = TP
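Using the same made-up scores and labels, a sketch of the lift-curve points under the definition above:

def lift_points(scores, labels):
    # x = fraction of the population flagged positive = (TP + FP) / (P + N)
    # y = number of true positives captured so far = TP
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = len(labels)
    tp, points = 0, []
    for k, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((k / total, tp))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(lift_points(scores, labels))
# approximately [(0.17, 1), (0.33, 2), (0.5, 2), (0.67, 3), (0.83, 3), (1.0, 3)]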
Questions?
For the most current version of these slides,
please see tutorials.opendatagroup.com
About Open Data
• Open Data began operations in 2001 and has
built predictive models for companies for over
ten years
• Open Data provides management consulting,
outsourced analytic services, & analytic staffing
• For more information
• www.opendatagroup.com
• info@opendatagroup.com