Anomaly Detection and Curve Fitting
Introduction
We are drowning in the deluge of data being collected world-wide, while starving for knowledge at the same time.
Anomalous events occur relatively infrequently.
However, when they do occur, their consequences can be quite dramatic, and quite often in a negative sense.
• Anomaly (or outlier) analysis includes investigating whether the data are valid or invalid. "Mining needle in a haystack. So much hay and so little time."
• States define what value or combination of values are outside the expected norm.
What are anomalies?
•Anomaly is a pattern in the data that does not
conform to the expected behavior
•Also referred to as outliers, exceptions,
peculiarities, surprise, etc.
•Anomalies translate to significant (often critical)
real life entities
•Cyber intrusions
•Credit card fraud
What is an anomaly (outlier)?
•Invalid outliers are the target of anomaly
(outlier) analysis, as they represent errors in
the data.
•Valid outliers may appear to be outside the
norm, but investigation demonstrates that
the data are not in error.
•Valid outliers may occur due to random
variation, which occurs due to chance and is
inherent in a system.
Anomaly detection
• Very often, in large data sets, there exist samples that do not comply
with the general behavior of the data model.
• Such samples, which are significantly different or inconsistent with the
remaining set of data, are called anomalies (outliers).
• If, for example, a person's age in the database is −20, the value is obviously not correct, and the error could have been caused by a default setting of the field "unrecorded age" in the computer program.
• On the other hand, if, in the database, the number of children for one person is 20, it is unusual and must be checked. This could be a typographical error, or it could be correct and represent the real value of the "number of children" attribute.
Anomaly detection
• Build models of normal data
• Detect any deviation from normal data
• Flag deviation as suspect
• Identify new types of intrusions as
deviation from normal behavior
Anomaly detection
• Compare characteristics of the system with expected values
• Report when statistics do not match
• Threshold metric: when statistics deviate from normal by a threshold, sound the alarm
• E.g., number of failed logins
• Statistical moments: based on mean/standard deviation of observations
• Number of user events in a system
• Time periods of user activity
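The threshold and statistical-moment checks above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides, and the sample login counts are invented:

```python
def zscore_anomalies(values, threshold=3.0):
    """Flag observations more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Hypothetical daily counts of failed logins; the last day is suspicious.
logins = [2, 1, 3, 2, 2, 1, 3, 2, 40]
print(zscore_anomalies(logins, threshold=2.0))  # [40]
```

Lowering the threshold makes the detector more sensitive but produces more false alarms, which is the trade-off the threshold metric must balance.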
The Machine Learning Way
• Supervised: rare class mining
• Semi-supervised: novelty detection
• Unsupervised: outlier detection
Supervised Anomaly Detection
• Data available with good and bad labels
• Bad labels are rare ("rare class mining")
• We assume that any new anomaly will be similar to some past anomaly
Novelty Detection
• Data available with only normal labels
• Detect abnormalities among new observations
Unsupervised Anomaly Detection
• No labels available whatsoever
• The only information is that anomalies are "rare" and "isolated", in a sense to be determined with regard to the rest of the data
• Cluster-based detectors are a typical approach
Real World Anomalies
•Credit Card Fraud
•An abnormally high purchase
made on a credit card
•Cyber Intrusions
•A web server involved in ftp
traffic
Motivation for anomaly detection
• Fraud Detection (Credit card, telecommunications,
criminal activity in e-Commerce)
• Customized Marketing (high/low income buying
habits)
• Medical Treatments (unusual responses to various
medicines)
• Analysis of performance statistics (professional
athletes)
• Weather Prediction
• Financial Applications (loan approval, stock
tracking)
Simple Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
[Figure: scatter plot in the X-Y plane showing regions N1, N2, O3 and points o1, o2.]
Convex vs. Concave
• A polygon P is convex if for every pair of points x and y in P, the line segment xy is also in P; otherwise, it is called concave.
[Figure: a concave polygon and a convex polygon, each with two points x and y.]
Which polygons are convex?
[Figure: four numbered polygons to classify as convex or concave.]
The convex hull problem
• The convex hull of a set of planar points is the smallest convex polygon containing all of the points.
[Figure: a concave polygon and a convex polygon illustrating the hull.]
Convex Hull Method
• Extreme points are assumed to be anomalies (outliers)
• Use convex hull method to detect extreme values
• What if the anomaly occurs in the middle of the data?
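As a sketch of the convex-hull method, the following pure-Python implementation uses Andrew's monotone chain algorithm (an assumed choice; the slides do not name a specific hull algorithm) and treats the hull vertices as extreme-value anomaly candidates:

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

data = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
extremes = convex_hull(data)   # candidate outliers on the boundary
print((2, 2) in extremes)      # False: an interior point is never flagged
```

This makes the slide's closing question concrete: a point in the middle of the data can never lie on the hull, so a hull-based detector is blind to interior anomalies.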
Input Data
• The most common form of data handled by anomaly detection techniques is record data
• Univariate
• Multivariate

Tid | SrcIP | Start time | Dest IP | Dest Port | Number of bytes | Attack
1 | 206.135.38.95 | 11:07:20 | 160.94.179.223 | 139 | 192 | No
2 | 206.163.37.95 | 11:13:56 | 160.94.179.219 | 139 | 195 | No
3 | 206.163.37.95 | 11:14:29 | 160.94.179.217 | 139 | 180 | No
4 | 206.163.37.95 | 11:14:30 | 160.94.179.255 | 139 | 199 | No
5 | 206.163.37.95 | 11:14:32 | 160.94.179.254 | 139 | 19 | Yes
6 | 206.163.37.95 | 11:14:35 | 160.94.179.253 | 139 | 177 | No
7 | 206.163.37.95 | 11:14:36 | 160.94.179.252 | 139 | 172 | No
8 | 206.163.37.95 | 11:14:38 | 160.94.179.251 | 139 | 285 | Yes
9 | 206.163.37.95 | 11:14:41 | 160.94.179.250 | 139 | 195 | No
10 | 206.163.37.95 | 11:14:44 | 160.94.179.249 | 139 | 163 | Yes
Input Data – Nature of Attributes
• Nature of attributes
• Binary
• Categorical
• Continuous
• Hybrid

Tid | SrcIP | Duration | Dest IP | Number of bytes | Internal
1 | 206.163.37.81 | 0.10 | 160.94.179.208 | 150 | No
2 | 206.163.37.99 | 0.27 | 160.94.179.235 | 208 | No
3 | 160.94.123.45 | 1.23 | 160.94.179.221 | 195 | Yes
4 | 206.163.37.37 | 112.03 | 160.94.179.253 | 199 | No
5 | 206.163.37.41 | 0.32 | 160.94.179.244 | 181 | No
Types of Anomaly
•Point Anomalies
•Contextual Anomalies
•Collective Anomalies
Point Anomalies
• An individual data instance is anomalous with regard to the rest of the data
[Figure: the X-Y scatter plot from the earlier example, with anomalies o1, o2 and region O3.]
Contextual Anomalies
• An individual data instance is anomalous within a context
• Requires a notion of context
[Figure: time series with one point labeled "Anomaly" and a similar point labeled "Normal".]
Collective Anomalies
• A collection of related data instances is anomalous
• Requires a relationship among data instances
• Sequential Data
• Spatial Data
• Graph Data
• The individual instances within a collective anomaly are not
anomalous by themselves
[Figure: time series with an anomalous subsequence highlighted.]
Fraud Detection
• Fraud detection refers to detection of criminal activities
occurring in commercial organizations
• Malicious users might be the actual customers of the
organization or might be posing as a customer (also known as
identity theft).
• Types of fraud
• Credit card fraud
• Insurance claim fraud
• Mobile / cell phone fraud
• Insider trading
• Challenges
• Fast and accurate real-time detection
• Misclassification cost is very high
Healthcare Informatics
• Detect anomalous patient records
• Indicate disease outbreaks, instrumentation errors, etc.
• Key Challenges
• Only normal labels available
• Misclassification cost is very high
• Data can be complex
Nearest Neighbor Based Techniques
• Uses k “closest” points (nearest neighbors) for
performing classification
• Key assumption: normal points have close neighbors
while anomalies are located far from other points
• General two-step approach
1. Compute neighborhood for each data record
2. Analyze the neighborhood to determine whether data
record is anomaly or not
• Distance based methods
• Anomalies are data points most distant from
other points
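The two-step approach above can be sketched as a simple distance-based scorer: rate each point by its mean distance to its k nearest neighbors, so isolated points receive large scores. The data points here are invented for illustration:

```python
import math

def knn_anomaly_scores(points, k=2):
    """Mean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four clustered points plus one far-away point.
data = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (8.0, 8.0)]
scores = knn_anomaly_scores(data)
print(scores.index(max(scores)))  # 4: the isolated point (8, 8)
```

Thresholding these scores (or taking the top-n) turns them into anomaly flags; the O(n²) distance computation is fine for a classroom example but needs spatial indexing at scale.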
Nearest Neighbor Classifiers
K-Nearest Neighbor (KNN)
•Features
•All instances correspond to points in an n-
dimensional Euclidean space
•Classification is delayed till a new instance
arrives
•Classification done by comparing feature
vectors of the different points
•Target function may be discrete or real-
valued
1-Nearest Neighbor
3-Nearest Neighbor
Points X1 (Acid Durability ) X2(strength) Y=Classification
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
Points X1(Acid Durability) X2(Strength) Y(Classification)
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 ?
KNN
Euclidean distance of P5(3,7) from each training point:

Point | Coordinates | Distance to P5 | Class
P1 | (7,7) | 4.00 | BAD
P2 | (7,4) | 5.00 | BAD
P3 | (3,4) | 3.00 | GOOD
P4 | (1,4) | 3.61 | GOOD

The 3 nearest neighbors are P3, P4, and P1; the majority class is GOOD.
Points X1(Durability) X2(Strength) Y(Classification)
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 GOOD
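The worked example above can be reproduced directly with a 3-NN majority vote over the four labelled points:

```python
import math
from collections import Counter

# Training points from the table: (Acid Durability, Strength) -> class.
train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]
query = (3, 7)  # P5

# Sort training points by Euclidean distance to the query, vote among the 3 nearest.
by_dist = sorted(train, key=lambda t: math.dist(query, t[0]))
votes = Counter(label for _, label in by_dist[:3])
print(votes.most_common(1)[0][0])  # GOOD, matching the table above
```

The three nearest points are P3 (distance 3.00), P4 (3.61), and P1 (4.00); two of the three are GOOD, so P5 is classified GOOD.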
Base Rate Fallacy
Bayes theorem:
P(A | B) = P(B | A) P(A) / P(B)
More generally, for a partition A1, …, An:
P(Ai | B) = P(B | Ai) P(Ai) / Σj P(B | Aj) P(Aj)
Why Bayes?
Because Bayes answers the questions we really care
about.
Pr(I have disease | test +) vs Pr(test + | disease)
Pr(A better than B | data) vs Pr(extreme data | A=B)
Note: the first probability in each pair is the Bayesian question, the second the frequentist one.
Probability using Bayes' Theorem
Prior Probabilities → New Information → Application of Bayes' Theorem → Posterior Probabilities
Example
You are waiting on a subway platform for a train that is
known to run on a regular schedule, only you don't know
how much time is scheduled to pass between train arrivals,
nor how long it's been since the last train departed. As
more time passes, do you (a) grow more confident that the
train will arrive soon, since its eventual arrival can only be
getting closer, not further away, or (b) grow less confident
that the train will arrive soon, since the longer you wait, the
more likely it seems that either the scheduled arrival times
are far apart or else that you happened to arrive just after
the last train left – or both.
If you choose (a), you're thinking like a frequentist. If you
choose (b), you're thinking like a Bayesian.
Example (Bayes probability)
Suppose a test is 95% accurate when a
disease is present and 97% accurate when
the disease is absent. Suppose that 1% of
the population has the disease. What is
P(have the disease | test +)?
P(dis | test+) = P(dis) P(test+ | dis) / [P(dis) P(test+ | dis) + P(no dis) P(test+ | no dis)]
= (0.01)(0.95) / [(0.01)(0.95) + (0.99)(0.03)]
= 0.0095 / (0.0095 + 0.0297) ≈ 0.24
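The calculation can be checked numerically; this is a minimal sketch of the slide's arithmetic:

```python
def posterior(prior, sensitivity, false_pos_rate):
    """P(disease | test+) by Bayes' theorem."""
    # Total probability of a positive test, over both hypotheses.
    evidence = prior * sensitivity + (1 - prior) * false_pos_rate
    return prior * sensitivity / evidence

p = posterior(prior=0.01, sensitivity=0.95, false_pos_rate=0.03)
print(round(p, 2))  # 0.24
```

Despite the test being 95%/97% accurate, the low 1% base rate drags the posterior down to about 24%, which is exactly the base rate fallacy the section warns about.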
Example (Bayes probability)
A consulting firm submitted a bid for a large project. The firm's management initially felt they had a 50-50 chance of getting the project. However, the agency to which the bid was submitted subsequently asked for additional information on the bid. Past experience indicates that for 75% of successful bids and 40% of unsuccessful bids the agency requested additional information. Compute the posterior probability that the bid will be successful given a request for additional information.
P(S1 | B) = P(S1 ∩ B) / [P(S1 ∩ B) + P(S2 ∩ B)]
= (.5)(.75) / [(.5)(.75) + (.5)(.4)]
= .375 / .575 ≈ .652
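The same computation in code form, using the bid example's numbers:

```python
p_success = 0.5          # prior P(S1): 50-50 chance of winning the bid
p_req_given_s = 0.75     # P(B | S1): info requested for successful bids
p_req_given_f = 0.40     # P(B | S2): info requested for unsuccessful bids

joint_s = p_success * p_req_given_s          # P(S1 ∩ B) = 0.375
joint_f = (1 - p_success) * p_req_given_f    # P(S2 ∩ B) = 0.200
posterior = joint_s / (joint_s + joint_f)
print(round(posterior, 3))  # 0.652
```

The request for information raises the probability of success from the 0.5 prior to about 0.652, because such requests are more common for successful bids.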
Curve fitting
The fitting of experimental data to a mathematical
equation is called regression. Regression may be
characterized by different adjectives according to the
mathematical form being used for the fit and the
number of variables. For example, linear regression
involves using a straight-line or linear equation for the
fit.
Linear regression minimizes the squared distance
between data points and the equation modeling the
data points. This prevents positive and negative
“errors” from canceling.
Motivation
Given a set of experimental data:

x | 1 | 2 | 3
y | 5.1 | 5.9 | 6.3

• The relationship between x and y may not be clear.
• Find a function f(x) that best fits the data.
[Figure: the three data points plotted for x = 1, 2, 3.]
Data Modeling
• Given: data points and a functional form, find the constants in the function
• Example: given (xi, yi), find the line through them; i.e., find a and b in y = ax + b
[Figure: seven data points (x1,y1)…(x7,y7) with the fitted line y = ax + b.]
Curve Fitting
Least-Squares Regression
• Function is "best fit" to data.
•Does not necessarily pass through
points.
• Used for scattered data (experimental)
•Can develop models for
analysis/design.
Curve Fitting by Least-Squares
Regression
Objective:
Obtain low order approximation (curve or surface) that
"best fits" data
Note:
• Because the order of the approximation is less than the number of data points, the curve or surface cannot pass through all points.
• We will need a consistent criterion for determining the
"best fit."
Typical Usage:
Scattered (experimental) data
Develop empirical models for analysis/design.
Least-Squares Regression
1. In laboratory, apply x, measure y, tabulate data.
2. Plot data and examine the relationship.
[Figure: scatter plot of the measured yi against xi.]
Procedure
•Given paired data points (xi,yi).
•Produce a scatterplot of the paired data
points.
•Fit a linear equation
Y = aX+b
and its graph (curve) to the data.
•Evaluate the fit.
•Analyze the result.
Table of Data
Scatterplot of Data
[Figure: scatterplot "Breeding Chows and Vizslas" with Appearance (0–9) on the x-axis and Disposition (0–10) on the y-axis.]
Plotting the Means: (x̄, ȳ) = (5, 5.82)
[Figure: the same scatterplot with the point of means marked.]
Adding a Trendline
[Figure: the scatterplot with a trendline; residuals e1, e2, e3 are marked as vertical distances from points to the line.]
Least Squares Fit
• Minimize the error sum of squares Q = Σ ei²
• The smaller Q, the closer the correlation coefficient R² is to 1.
• A perfect fit (all points on the line) has R² = 1.
Trendline and Equation
[Figure: "Breeding Chows and Vizslas" scatterplot with fitted trendline D = 1.0476A + 0.5801, R² = 0.6619.]
No Correlation
• This graph shows essentially no correlation between the variables X and Y.
[Figure: scatterplot with R² = 0.0003.]
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six Y-vs-X scatterplots with r = −1, −0.6, 0 (top row) and r = +1, +0.3, 0 (bottom row).]
Linear Correlation
[Figure: scatterplots contrasting linear relationships with curvilinear relationships.]
Linear Correlation
[Figure: scatterplots contrasting strong relationships with weak relationships.]
Linear Correlation
[Figure: scatterplot showing no relationship.]
High Correlation
• This graph shows a high degree of correlation between X and Y.
[Figure: scatterplot with fitted line Y = 1.6927X + 3.0273, R² = 0.9996.]
Outliers
• Outliers affect the degree of correlation.
• Outliers affect the fitted curve.
[Figure: the same data with an outlier added; fitted line Y = 1.5698X + 3.9045, R² = 0.7968.]
Least Squares Fit of a Straight Line: Example
Fit a straight line to the x and y values in the following table:

xi | yi | xi·yi | xi²
1 | 0.5 | 0.5 | 1
2 | 2.5 | 5 | 4
3 | 2 | 6 | 9
4 | 4 | 16 | 16
5 | 3.5 | 17.5 | 25
6 | 6 | 36 | 36
7 | 5.5 | 38.5 | 49
Σ | 28 | 24.0 | 119.5 | 140

Σxi = 28, Σyi = 24.0, Σxiyi = 119.5, Σxi² = 140
x̄ = 28/7 = 4, ȳ = 24/7 = 3.428571
Least Squares Fit of a Straight Line: Example (cont'd)
a1 = (n Σxiyi − Σxi Σyi) / (n Σxi² − (Σxi)²)
   = (7 · 119.5 − 28 · 24) / (7 · 140 − 28²) = 0.8392857
a0 = ȳ − a1 x̄ = 3.428571 − 0.8392857 · 4 = 0.07142857
Y = 0.07142857 + 0.8392857 x
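The normal-equation formulas above are easy to verify in code; this sketch uses the example's seven data points:

```python
def fit_line(xs, ys):
    """Least-squares coefficients a0, a1 for the line y = a0 + a1*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a0 = sum_y / n - a1 * sum_x / n
    return a0, a1

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
a0, a1 = fit_line(xs, ys)
print(round(a0, 8), round(a1, 7))  # 0.07142857 0.8392857
```

The computed intercept and slope match the hand calculation on the slide.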
Least Squares Fit of a Straight Line: Example (Error Analysis)

xi | yi | (yi − ȳ)² | ei² = (yi − ŷi)²
1 | 0.5 | 8.5765 | 0.1687
2 | 2.5 | 0.8622 | 0.5625
3 | 2.0 | 2.0408 | 0.3473
4 | 4.0 | 0.3265 | 0.3265
5 | 3.5 | 0.0051 | 0.5896
6 | 6.0 | 6.6122 | 0.7972
7 | 5.5 | 4.2908 | 0.1993
Σ | 28 | 24.0 | 22.7143 | 2.9911

St = Σ(yi − ȳ)² = 22.7143
Sr = Σ ei² = 2.9911
r² = (St − Sr) / St = 0.868
r = √0.868 = 0.932
Least Squares Fit of a Straight Line: Example (Error Analysis, cont'd)
• The standard deviation (quantifies the spread around the mean):
sy = √(St / (n − 1)) = √(22.7143 / 6) = 1.9457
• The standard error of the estimate (quantifies the spread around the regression line):
sy/x = √(Sr / (n − 2)) = √(2.9911 / 5) = 0.7734
Because sy/x < sy, the linear regression model has a good fit.
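The error-analysis quantities can be recomputed from the fitted line Y = 0.07142857 + 0.8392857x; a sketch over the same seven points:

```python
import math

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
a0, a1 = 0.07142857, 0.8392857   # fitted line from the example

ybar = sum(ys) / len(ys)
St = sum((y - ybar) ** 2 for y in ys)                        # spread around the mean
Sr = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))   # spread around the line
r2 = (St - Sr) / St
sy = math.sqrt(St / (len(ys) - 1))     # standard deviation
syx = math.sqrt(Sr / (len(ys) - 2))    # standard error of the estimate
print(round(r2, 3), round(sy, 4), round(syx, 4))  # 0.868 1.9457 0.7734
```

Since sy/x is well below sy, the regression line explains most of the variation in y, confirming the conclusion above.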