Building and Deploying Big Data Analytic Models
Robert L. Grossman
Collin Bennett
Open Data Group
December 4, 2012
5.1 SAMS

(Figure: the SAMS view of deployed analytics. Models produce Scores; Scores drive Actions; Actions are tracked with Measures and dashboards by the model consumer.)

Analytic Diamond
(Figure: four facets. Analytic strategy & governance; analytic algorithms & models; analytic infrastructure; analytic operations, security & compliance.)

Building and consuming models
(Figure: a Model Producer builds a model from data using algorithms such as CART, SVM, or k-means; the model is exported as PMML to a Model Consumer, which produces scores, actions, and measures displayed on dashboards.)
Life Cycle of Predictive Model
(Figure: in the analytic modeling environment, exploratory data analysis (R) and data processing (Hadoop) feed model building in a dev/modeling environment; the model is exported as PMML and deployed in operational systems with a scoring application (Hadoop, streams & Augustus); log files from operations flow back to refine the model, and eventually the model is retired and an improved model is deployed.)
5.2 Analytic Algorithms & Models
Key Modeling Activities
• Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
• Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
• Building models – goal is to derive statistically valid models from the learning set
• Evaluating models – goal is to develop a meaningful report so that the performance of one model can be compared to another
?    Naïve Bayes
?    k-nearest neighbor
1957 k-means
1977 EM Algorithm
1984 CART
1993 C4.5
1994 Apriori
1995 SVM
1997 AdaBoost
1998 PageRank

Beware of any vendor whose unique value proposition includes some “secret analytic sauce.”

Source: X. Wu et al., “Top 10 algorithms in data mining,” Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
Pessimistic View: We Get a Significantly
New Algorithm Every Decade or So
1970s: neural networks
1980s: classification and regression trees
1990s: support vector machines
2000s: graph algorithms
In general, understanding the data through
exploratory analysis and generating good
features is much more important than the
type of predictive model you use.
Questions About Datasets
• Number of records
• Number of data fields
• Size
– Does it fit into memory?
– Does it fit on a single disk or disk array?
• How many missing values?
• How many duplicates?
• Are there labels (for the dependent variable)?
• Are there keys?
Questions About Data Fields
• Data fields
– What is the mean, variance, …?
– What is the distribution? Plot the histograms.
– What are extreme values, missing values, unusual
values?
• Pairs of data fields
– What is the correlation? Plot the scatter plots.
• Visualize the data
– Histograms, scatter plots, etc.
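A minimal sketch of this kind of exploratory pass, using pandas and matplotlib; the file name events.csv and the columns amount and balance are hypothetical placeholders.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input file and column names; substitute your own.
df = pd.read_csv("events.csv")

# Per-field summaries: mean, variance, extremes, and missing values.
print(df.describe())
print(df.isna().sum())

# Histogram of a single field.
df["amount"].hist(bins=50)
plt.show()

# Correlation and scatter plot for a pair of fields.
print(df[["amount", "balance"]].corr())
df.plot.scatter(x="amount", y="balance")
plt.show()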
(Figure: a cluster visualization showing stable clusters and a new cluster.)
Computing Count Distributions

def valueDistrib(b, field, select=None):
    # Given a table b of records (so b[i][field] is the value of attribute
    # field in record i), return a dict mapping each value of the field to
    # its count, for example the class distribution when field is the label.
    # If select is given, only the rows whose indices are in select are counted.
    ccdict = {}
    if select:
        for i in select:
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    else:
        for i in range(len(b)):
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    return ccdict
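For example, a hypothetical call on a small list of records (the field names are invented for illustration):

records = [
    {"state": "IL", "label": 1},
    {"state": "IL", "label": 0},
    {"state": "CA", "label": 1},
]
print(valueDistrib(records, "state"))                 # {'IL': 2, 'CA': 1}
print(valueDistrib(records, "label", select=[0, 2]))  # {1: 2}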
Features
• Normalize
– [0, 1] or [-1, 1]
• Aggregate
• Discretize or bin
• Transform continuous to continuous or
discrete to discrete
• Compute ratios
• Use percentiles
• Use indicator variables
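A brief sketch of a few of these transformations with pandas; the column name amount and the values are made up for illustration.

import pandas as pd

df = pd.DataFrame({"amount": [5.0, 20.0, 75.0, 300.0, 1200.0]})

# Normalize to [0, 1].
amin, amax = df["amount"].min(), df["amount"].max()
df["amount_norm"] = (df["amount"] - amin) / (amax - amin)

# Discretize / bin into quartiles (percentile-based bins).
df["amount_bin"] = pd.qcut(df["amount"], q=4, labels=False)

# Indicator variable.
df["is_large"] = (df["amount"] > 100).astype(int)

# Ratio relative to the mean.
df["amount_ratio"] = df["amount"] / df["amount"].mean()

print(df)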
Predicting Office Building Renewal Using Text-based Features

Classification   Avg R        Classification   Avg R
imports          5.17         technology       2.03
clothing         4.17         apparel          2.03
van              3.18         law              2.00
fashion          2.50         casualty         1.91
investment       2.42         bank             1.88
marketing        2.38         technologies     1.84
oil              2.11         trading          1.82
air              2.09         associates       1.81
system           2.06         staffing         1.77
foundation       2.04         securities       1.77
Twitter Intention & Prediction
• Use machine learning
classification
techniques:
– Support Vector
Machines
– Trees
– Naïve Bayes
• Models require clean
data and lots of hand-scored examples
(Figure: Namibia floods classified by the R package e1071; results for SVM, Naïve Bayes, and trees shown side by side.)
• Results are often implementation dependent.
If You Can Ask For Something
• Ask for more data
• Ask for “orthogonal data”
5.3 Multiple Models
Trees
For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone,
Classification and Regression Trees, 1984, Chapman & Hall.
Trees Partition Feature Space
• Trees partition the feature space into regions by asking whether an
attribute is less than a threshold.
(Figure: a decision tree that splits on M < 2.45, then asks R < 1.75? and M < 4.95?, together with the corresponding rectangular regions A, B, and C in the (M, R) plane defined by the thresholds M = 2.45, M = 4.95, and R = 1.75.)
Split Using Entropy
(Figure: objects have attributes; two ways of splitting the same set of objects, each with four candidate splits, are compared. The information before splitting is 0.64; the good split yields an increase in information of 0.32, while the poor split yields an increase of 0.)
Growing the Tree
Step 1. Class proportions. Node u contains n objects: n1 of class A (red), n2 of class B (blue), etc.
Step 2. Entropy: I(u) = - Σj (nj/n) log(nj/n)
Step 3. Split proportions: m1 objects are sent to child node u1, m2 to child node u2, etc.
Step 4. Choose the attribute that maximizes the gain D = I(u) - Σj (mj/n) I(uj)
(Figure: a node u of red and blue objects split into child nodes u1 and u2.)
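A small illustrative sketch of the entropy and information-gain computation just described, using base-2 logarithms (the slide leaves the base of the logarithm unspecified) and a two-way split:

import math
from collections import Counter

def entropy(labels):
    # I(u) = - sum over classes j of (nj/n) * log2(nj/n) at node u
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    # D = I(u) - sum over children j of (mj/n) * I(uj), for a two-way split
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = ["red", "red", "blue", "blue", "blue"]
print(information_gain(parent, ["red", "red"], ["blue", "blue", "blue"]))
# The split separates the classes perfectly, so the gain equals I(u), about 0.971.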
Multiple Models: Ensembles
Key Idea: Combine Weak Learners
• Average several weak models, rather than build
one complex model.
(Figure: three weak models, Model 1, Model 2, and Model 3, combined into a single ensemble.)
Combining Weak Learners
(Figure: Pascal's triangle of binomial coefficients: 1; 1 1; 1 2 1; 1 3 3 1; 1 4 6 4 1; 1 5 10 10 5 1.)

Accuracy of a single weak classifier vs. a majority vote of 3 or 5 such classifiers:
1 Classifier   3 Classifiers   5 Classifiers
55%            57.40%          59.30%
60%            64.0%           68.20%
65%            71.00%          76.50%
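The usual calculation behind a table like this uses the binomial coefficients from Pascal's triangle: with k independent classifiers, each correct with probability p, a strict majority vote is correct with the probability computed below. This sketch is for illustration; it matches the 5-classifier column closely and is near, though not identical to, the 3-classifier column as printed on the slide.

from math import comb

def majority_vote_accuracy(p, k):
    # Probability that a strict majority of k independent classifiers,
    # each correct with probability p, votes for the correct class.
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

for p in (0.55, 0.60, 0.65):
    print(p, round(majority_vote_accuracy(p, 3), 3), round(majority_vote_accuracy(p, 5), 3))
# 0.55 0.575 0.593
# 0.6 0.648 0.683
# 0.65 0.718 0.765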
Ensembles are unreasonably effective at building models.
Average for regression trees; use majority voting for classification. For example, an ensemble of 64 regression trees scores with the average
f(x) = ( f1(x) + f2(x) + … + f64(x) ) / 64
Building ensembles over clusters of
commodity workstations has been used
since the 1990’s.
Ensemble Models
1. Partition data and
scatter
2. Build models (e.g.
tree-based model)
3. Gather models
4. Form collection of
models into ensemble
(e.g. majority vote for
classification &
averaging for
regression)
Example: Fraud
• Ensembles of trees have proved remarkably
effective in detecting fraud
• Trees can be very large
• Very fast to score
• Lots of variants
– Random forests
– Random trees
• Sometimes need to reverse engineer reason
codes.
CUSUM
• Assume that we have a mean and standard deviation for both the baseline and the observed distributions.
(Figure: the baseline density f0(x) and the observed density f1(x).)
f0(x) = baseline density
f1(x) = observed density
g(x) = log( f1(x) / f0(x) )
Z0 = 0
Zn+1 = max{ Zn + g(xn+1), 0 }
Alert if Zn+1 > threshold
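A minimal sketch of this CUSUM recursion, assuming both the baseline and the observed densities are Gaussian with known means and standard deviations; the data and parameter values below are made up.

import math

def gaussian_logpdf(x, mu, sigma):
    # Log density of N(mu, sigma^2) at x.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def cusum(xs, mu0, sigma0, mu1, sigma1, threshold):
    # Z0 = 0; Zn+1 = max(Zn + g(xn+1), 0); alert when Z exceeds the threshold.
    z = 0.0
    for n, x in enumerate(xs):
        g = gaussian_logpdf(x, mu1, sigma1) - gaussian_logpdf(x, mu0, sigma0)
        z = max(z + g, 0.0)
        if z > threshold:
            return n  # index at which the change is flagged
    return None

# Baseline N(0, 1); the observations shift to roughly N(2, 1) halfway through.
data = [0.1, -0.4, 0.3, 0.2, -0.1, 2.1, 1.8, 2.4, 1.9, 2.2]
print(cusum(data, mu0=0.0, sigma0=1.0, mu1=2.0, sigma1=1.0, threshold=5.0))  # flags index 7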
Cubes of Models (Segmented Models)
Build a separate segmented model for every entity of interest.
Divide & conquer (segment) the data using multidimensional data cubes.
For each distinct cube, establish separate baselines for each quantity of interest.
Detect changes from the baselines.
(Figure: a data cube with dimensions such as entity (bank, etc.), geospatial region, and type of transaction; separate baselines are estimated for each quantity of interest in each cell.)
Examples - Change Detection
Using Cubes of Models
• Highway Traffic Data
– each day (7) x each hour (24) x each sensor
(hundreds) x each weather condition (5) x each
special event (dozens)
– 50,000 baseline models used in the current testbed
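One way to picture such a cube of baselines is a dictionary of running statistics keyed by the cube cell. The sketch below is illustrative only, and the field names (day, hour, sensor, weather, special, volume) are assumptions rather than the testbed's actual schema.

from collections import defaultdict

class Baseline:
    # Running mean and variance for one cube cell (Welford's online update).
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

# One baseline per (day of week, hour, sensor id, weather condition, special event).
baselines = defaultdict(Baseline)

def cell(event):
    return (event["day"], event["hour"], event["sensor"], event["weather"], event["special"])

def update_baseline(event):
    baselines[cell(event)].update(event["volume"])

def is_anomalous(event, k=3.0):
    # Flag a change when the observation is more than k standard deviations
    # from its cell's baseline (and the cell has enough history).
    b = baselines[cell(event)]
    return b.n > 30 and abs(event["volume"] - b.mean) > k * b.std()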
Multiple Models in Hadoop
• Trees in Mapper
• Build lots of small trees (over historical data in
Hadoop)
• Load all of them into the mapper
• Mapper
– For each event, score against each tree
– Map emits
• key = event id, value = (tree id, score)
• Reducer
– Aggregate scores for each event
– Select a “winner” for each event
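A sketch of this pattern as two Hadoop Streaming scripts in Python; the JSON event format and the two toy scoring functions are invented for illustration and stand in for the pre-built trees.

# mapper.py -- score every incoming event against every (toy) tree.
import sys, json

trees = [
    lambda e: 1.0 if e.get("amount", 0) > 100 else 0.0,   # stand-in for tree 0
    lambda e: 1.0 if e.get("amount", 0) > 500 else 0.0,   # stand-in for tree 1
]

for line in sys.stdin:
    event = json.loads(line)                # e.g. {"id": "e1", "amount": 250.0}
    for tree_id, tree in enumerate(trees):
        # key and value separated by a tab: event id, then (tree id, score)
        print(f"{event['id']}\t{tree_id},{tree(event)}")

# reducer.py -- aggregate the per-tree scores for each event and select a winner.
import sys
from collections import defaultdict

scores = defaultdict(list)
for line in sys.stdin:
    event_id, payload = line.rstrip("\n").split("\t")
    _tree_id, score = payload.split(",")
    scores[event_id].append(float(score))

for event_id, s in scores.items():
    winner = 1 if sum(1 for x in s if x >= 0.5) > len(s) / 2 else 0   # majority vote
    print(f"{event_id}\t{winner}")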
5.4 Evaluating and
Comparing Models
The Take Home Message
• Given two predictive models, develop a
methodology to determine which is better.
• Get a predictive model up in development
quickly, no matter how bad. Call it the
Champion.
• Two methodologies
– Champion-Challenger
– Automated Testing and Deployment Environment
Champion-Challenger
Methodology
• Develop a new model each week. Call it the
Challenger.
• Meet each week to discuss the performance
of the model. Compare the two models and
keep the better one as the new Champion.
Automated Testing & Deployment Environment
• Develop a custom environment for the large scale
testing and deployment of many (tens, hundreds,
thousands) of different models.
• Develop an appropriate experimental design.
• Test on a small scale.
• Deploy those that test well on a large(r) scale.
• Continuously improve the design and automation
of the process.
• Think of it as an automated scale-up of A/B testing.
Have Weekly Meetings
• Have the modeler, the business owner, and an engineer from the operations side meet once a week.
• Discuss misclassifications.
• Discuss the business value of the alerts.
• Avoid jargon (Detection rate, False Positives,
etc.) whenever possible. Instead…
• Always use visualizations.
2-Class Confusion Matrix
• There are two rates:
  true positive rate  = TPrate = (#TP)/(#P)
  false positive rate = FPrate = (#FP)/(#N)

                       Predicted positive   Predicted negative
True positive (#P)     #TP                  #P - #TP
True negative (#N)     #FP                  #N - #FP
Precision and Accuracy
• Recall the TP and FP rates are defined by:
– TPrate = TP / P = recall
– FPrate = FP / N
• Precision and accuracy are defined by:
– Precision = TP / (TP + FP)
– Classifier assigns TP + FP to the positive class
– Accuracy = (TP + TN) / (P + N)
Graph these rates to produce the ROC curve.
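A small sketch of these definitions in Python; the counts passed in are arbitrary illustrative values.

def rates(tp, fn, fp, tn):
    # P = tp + fn and N = fp + tn are the true positive and negative counts.
    p, n = tp + fn, fp + tn
    return {
        "TPrate (recall)": tp / p,
        "FPrate": fp / n,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
    }

print(rates(tp=60, fn=40, fp=20, tn=80))
# {'TPrate (recall)': 0.6, 'FPrate': 0.2, 'precision': 0.75, 'accuracy': 0.7}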
Why visualize performance?
• A single number tells only part of the story
– There are fundamentally two numbers of interest
(FP and TP), which cannot be summarized with a
single number.
– How are errors distributed across the classes?
• The shapes of the curves are much more informative.
Example: 3 classifiers

Classifier 1 (TPrate = 0.4, FPrate = 0.3)
                 Predicted pos   Predicted neg
True pos         40              60
True neg         30              70

Classifier 2 (TPrate = 0.7, FPrate = 0.5)
                 Predicted pos   Predicted neg
True pos         70              30
True neg         50              50

Classifier 3 (TPrate = 0.6, FPrate = 0.2)
                 Predicted pos   Predicted neg
True pos         60              40
True neg         20              80
ROC plot for the 3 Classifiers
(Figure: ROC plot with False Positive Rate on the x-axis and True Positive Rate on the y-axis; the perfect classifier sits at the top-left corner and a random classifier lies along the diagonal.)
ROC Curve
• Recall the TP and FP rates are defined by:
– TPrate = TP / P
– FPrate = FP / N
• The ROC Curve is defined by:
– x = FPrate
– y = TPrate
Creating an ROC Curve
• A classifier produces a single ROC point.
• If the classifier has a threshold or sensitivity
parameter, varying it produces a series of ROC
points (confusion matrices).
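A brief sketch of sweeping a score threshold to trace out ROC points; the scores and labels are made up.

def roc_points(scores, labels):
    # Each threshold yields one (FPrate, TPrate) point.
    p = sum(labels)
    n = len(labels) - p
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n, tp / p))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(roc_points(scores, labels))
# approximately [(0.0, 0.33), (0.0, 0.67), (0.33, 0.67), (0.33, 1.0), (0.67, 1.0), (1.0, 1.0)]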
(Figure: an example ROC curve.)
Performance of a Random Classifier
• When there are an equal number of Positives
(P) and Negatives (N), then a random classifier
is accurate about 50% of the time.
• The lift measures the relative performance of a classifier compared to a random classifier.
• Especially important with imbalanced classes.
Lift Curve
• The Lift Curve is defined by:
– x = (TP + FP) / (P + N)
– y = TP
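Using the same made-up scores and labels, a sketch of the lift-curve points under the definition above:

def lift_points(scores, labels):
    # x = fraction of the population flagged positive = (TP + FP) / (P + N)
    # y = number of true positives captured so far = TP
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = len(labels)
    tp, points = 0, []
    for k, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((k / total, tp))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(lift_points(scores, labels))
# approximately [(0.17, 1), (0.33, 2), (0.5, 2), (0.67, 3), (0.83, 3), (1.0, 3)]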
Questions?
For the most current version of these slides,
please see tutorials.opendatagroup.com
About Open Data
• Open Data began operations in 2001 and has
built predictive models for companies for over
ten years
• Open Data provides management consulting,
outsourced analytic services, & analytic staffing
• For more information
• www.opendatagroup.com
• info@opendatagroup.com