1
Data Mining & Analytics
— Unit 5 —
— (Chapter 12) —
Outlier Detection
2
Chapter 12. Outlier Analysis
Outlier and Outlier Analysis
Outlier Detection Methods
Statistical Approaches
Proximity-Base Approaches
Clustering-Base Approaches
Classification Approaches
3
What Are Outliers?
Outlier: A data object that deviates significantly from
the normal objects as if it were generated by a different
mechanism
Examples of Outliers:
1.Bank Transactions: A customer making a transaction of
₹10,000 daily but suddenly making a ₹10,00,000 transaction.
2.Student Scores: A class scoring between 40-80 marks, but
one student scores 5 or 99.
3.Sensor Readings: A temperature sensor recording 100°C
when the normal range is between 20°C - 30°C.
4. Unusual credit card purchase, sports: Michael J ordon,
Wayne Gretzky, ...
4
Outliers Vs Noise
Outliers are different from the noise data
Noise is random error or variance in a measured variable
Noise should be removed before outlier detection
Noise is irrelevant data that should be filtered out to improve accuracy.
5
What Are Outliers?
Outliers are interesting: It violates the mechanism that generates the
normal data
Outlier detection vs. novelty detection: early stage, outlier; but later
merged into the model
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
6
Outliers
7
Detecting Outlier:
Clustering based outlier detection using distance to the
closest cluster:
K-Means clustering
technique
Algorithm:
1.Calculate the mean of each cluster
2.Initialize the Threshold value
3.Calculate the distance of the test data from each
cluster mean
4.Find the nearest cluster to the test data
5.If (Distance > Threshold) then, Outlier
8
K-Means Outlier Detection
We have the following dataset:
9
10
11
•Clusters found:
•C1: (1,2), (2,1), (3,2)
•C2: (8,8), (9,8), (8,9)
•Outlier: (50,50) is too far from any cluster.
12
Types of Outliers (I)
Three kinds: global, contextual and collective outliers
13
Types of Outliers (I)
Global outlier (or point anomaly)
A Global Outlier (Point Anomaly) is a single data point that
significantly deviates from the rest of the dataset.
It does not follow the general pattern of the data and appears far
away from other points.
Object is Og if it significantly deviates from the rest of the data set
Ex. Intrusion detection in computer networks
Issue: Find an appropriate measurement of deviation
14
15
16
How to Detect Global Outliers?
17
Types of Outliers (I)
Visualization Methods
•Boxplots (Shows extreme values)
•Scatter plots (Detects outliers in 2D)
•Histogram (Identifies rare values)
Global Outlier
Global Outlier
18
Types of Outliers (II)
Contextual outlier (or conditional outlier)
A Contextual Outlier is a data point that is only considered
an outlier in a specific context but appears normal
otherwise.
Unlike global outliers, which are extreme across the
entire dataset, contextual outliers depend on certain
conditions or attributes (contextual features).
19
Example: Temperature Data
Consider temperature readings in different seasons:
The same temperature (25°C) could be normal in summer but an outlier in winter.
20
Example 2: Credit Card Transactions
21
Types of Outliers (II)
Contextual outlier (or conditional outlier)
Object is O if it deviates significantly based on a selected context
c
Ex. 80o F in Urbana: outlier? (depending on summer or winter?)
Attributes of data objects should be divided into two groups
Contextual attributes: defines the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
Issue: How to define or formulate meaningful context?
22
How to Detect Contextual Outliers?
Since contextual outliers depend on conditions, we use context-aware
methods:
1.Time-Series Methods (for detecting anomalies over time)
1. Seasonal Decomposition: Identifies unusual values based on
seasonality.
2. LSTMs & RNNs: Learn patterns and detect deviations in sequential
data.
2.Regression-Based Methods (for detecting unexpected values given
conditions)
1. Linear Regression: Predict expected values; if a data point deviates
significantly, it's an outlier.
3.Density-Based Approaches
1. DBSCAN (Density-Based Spatial Clustering): Finds low-density points
based on neighbors.
23
Types of Outliers (III)
Collective Outliers
• A Collective Outlier is a group of data points that may appear normal
individually but together form an anomalous pattern.
• Unlike global outliers (single extreme values) and contextual outliers
(outliers in specific conditions), collective outliers are only anomalous
when viewed as a group.
24
Example : Network Intrusion Detection
•Each individual login attempt is normal
•But multiple failed logins +unusual access together indicate a potential attack (e.g., brute-force attack).
25
Example 2: Credit Card Fraud Detection
•A single international transaction may not be suspicious, but...
•A sudden sequence of high-value transactions across different countries is abnormal
(potential fraud).
26
Types of Outliers (III)
Collective Outliers
A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
Applications: E.g., intrusion detection: Collective Outlier
When a number of computers keep sending
denial-of-service packages to each other
Detection of collective outliers
Consider not only behavior of individual objects, but also that of
groups of objects
Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure
on objects.
A data set may have multiple types of outlier
One object may belong to more than one type of outlier
27
Challenges in Detecting Outliers
•High Dimensionality: In datasets with many features (e.g.,
medical records, finance), defining an outlier is difficult.
•Concept Drift: In real-time data (e.g., stock market, weather),
normal patterns change over time, making past outliers valid.
•No Ground Truth: It is often unclear whether an unusual
value is a genuine outlier or a meaningful pattern.
•Imbalanced Data: If outliers are rare, a model may ignore
them, treating them as noise.
•Masked Outliers: Some outliers may blend within clusters,
making them hard to detect.
28
Challenges of Outlier Detection
Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in an application
The border between normal and outlier objects is often a gray area
Application-specific outlier detection
Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
E.g., clinic data: a small deviation could be an outlier; while in
marketing analysis, larger fluctuations
Handling noise in outlier detection
Noise may distort the normal objects and blur the distinction between
normal objects and outliers. It may help hide outliers and reduce the
effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
29
Thank You!!!
30