
Unit 5 - Lecture 1 - Outlier Detection

The document discusses outlier detection in data mining, defining outliers as data points that significantly deviate from normal patterns. It categorizes outliers into global, contextual, and collective types, and outlines various detection methods including statistical, proximity-based, and clustering approaches. The document also highlights challenges in detecting outliers, such as high dimensionality, concept drift, and noise interference.

Uploaded by

julybabies2804

Data Mining & Analytics


— Unit 5 —
— (Chapter 12) —

Outlier Detection

Chapter 12. Outlier Analysis

• Outlier and Outlier Analysis
• Outlier Detection Methods
  • Statistical Approaches
  • Proximity-Based Approaches
  • Clustering-Based Approaches
  • Classification Approaches

What Are Outliers?

• Outlier: A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
Examples of Outliers:
1. Bank Transactions: A customer who usually makes transactions of ₹10,000 a day suddenly makes a ₹10,00,000 transaction.
2. Student Scores: A class scores between 40 and 80 marks, but one student scores 5 or 99.
3. Sensor Readings: A temperature sensor records 100°C when the normal range is 20°C to 30°C.
4. Others: an unusual credit card purchase; in sports, Michael Jordan, Wayne Gretzky, ...

Outliers vs. Noise
• Outliers are different from noise
• Noise is random error or variance in a measured variable
• Noise should be removed before outlier detection
• Noise is irrelevant data that should be filtered out to improve accuracy.

What Are Outliers?

• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a novel pattern is an outlier at an early stage, but is later merged into the model
• Applications:
  • Credit card fraud detection
  • Telecom fraud detection
  • Customer segmentation
  • Medical analysis

Outliers

Detecting Outliers
Clustering-based outlier detection using the distance to the closest cluster (K-Means clustering technique)
Algorithm:
1. Calculate the mean of each cluster
2. Initialize the threshold value
3. Calculate the distance of the test data from each cluster mean
4. Find the nearest cluster to the test data
5. If (Distance > Threshold) then, Outlier

K-Means Outlier Detection
We have the following dataset:

• Clusters found:
  • C1: (1,2), (2,1), (3,2)
  • C2: (8,8), (9,8), (8,9)
• Outlier: (50,50) is too far from any cluster.
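
The distance-to-nearest-cluster algorithm and the result above can be sketched in Python. This is a minimal illustration rather than the lecture's own implementation: the clusters are the ones listed on the slide, while the threshold of 5.0 and the function names `cluster_mean`, `nearest_cluster`, and `is_outlier` are assumptions made for this example.

```python
import math

# Clusters as listed on the slide (assumed to be already found by K-Means).
clusters = {
    "C1": [(1, 2), (2, 1), (3, 2)],
    "C2": [(8, 8), (9, 8), (8, 9)],
}

THRESHOLD = 5.0  # assumption: the lecture does not give a value


def cluster_mean(points):
    """Step 1: mean (centroid) of a cluster of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


means = {name: cluster_mean(pts) for name, pts in clusters.items()}


def nearest_cluster(point):
    """Steps 3-4: distance to every cluster mean, then pick the nearest."""
    distances = {name: math.dist(point, m) for name, m in means.items()}
    name = min(distances, key=distances.get)
    return name, distances[name]


def is_outlier(point, threshold=THRESHOLD):
    """Step 5: outlier if even the nearest cluster mean is beyond the threshold."""
    _, distance = nearest_cluster(point)
    return distance > threshold


print(is_outlier((50, 50)))  # far from both cluster means
print(is_outlier((2, 2)))    # lies inside C1
```

The threshold controls how strict the test is; in practice it would be tuned to the spread of the clusters rather than fixed by hand.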

Types of Outliers (I)

• Three kinds: global, contextual, and collective outliers

Types of Outliers (I)
• Global outlier (or point anomaly)
  • A global outlier (point anomaly) is a single data point that significantly deviates from the rest of the dataset.
  • It does not follow the general pattern of the data and appears far away from other points.
  • An object is a global outlier, O_g, if it significantly deviates from the rest of the data set
  • Ex. Intrusion detection in computer networks
  • Issue: Find an appropriate measurement of deviation
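
One common measurement of deviation is the z-score, the number of standard deviations a value lies from the mean. The sketch below is only an illustration: the readings are made-up values in the spirit of the sensor example, and the cutoff of 3 is a conventional assumption, not a value the lecture specifies.

```python
import statistics

# Hypothetical sensor readings in °C; the normal range is roughly 20 to 30.
readings = [22, 24, 25, 23, 26, 27, 21, 28, 24, 25, 29, 100]

Z_CUTOFF = 3.0  # assumption: a conventional choice, not from the lecture


def zscore_outliers(values, cutoff=Z_CUTOFF):
    """Return the values whose absolute z-score exceeds the cutoff."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > cutoff]


global_outliers = zscore_outliers(readings)
print(global_outliers)
```

Note that the extreme reading itself inflates both the mean and the standard deviation, so on small samples a z-score can fail to flag a genuine outlier.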

How to Detect Global Outliers?

Types of Outliers (I)
Visualization Methods
• Boxplots (show extreme values)
• Scatter plots (detect outliers in 2D)
• Histograms (identify rare values)
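
The rule a boxplot conventionally uses to mark extreme values can also be applied numerically: values beyond 1.5 times the interquartile range (IQR) from the quartiles are flagged. The sketch below assumes that convention; the marks are hypothetical values modeled on the student-score example, and `boxplot_outliers` is a name chosen for illustration.

```python
import statistics

# Hypothetical marks for a class; most lie between 40 and 80.
scores = [62, 55, 70, 5, 66, 58, 99, 60, 73, 64, 52, 68]


def boxplot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot whiskers)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


extreme_scores = boxplot_outliers(scores)
print(extreme_scores)
```

Because quartiles are barely affected by a few extreme values, this rule is less sensitive to the outliers themselves than a mean-based test.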

[Figure: Global Outlier]

Types of Outliers (II)
• Contextual outlier (or conditional outlier)
  • A contextual outlier is a data point that is considered an outlier only in a specific context and appears normal otherwise.
  • Unlike global outliers, which are extreme across the entire dataset, contextual outliers depend on certain conditions or attributes (contextual features).

Example: Temperature Data
Consider temperature readings in different seasons:

The same temperature (25°C) could be normal in summer but an outlier in winter.

Example 2: Credit Card Transactions

Types of Outliers (II)
• Contextual outlier (or conditional outlier)
  • An object is a contextual outlier, O_c, if it deviates significantly based on a selected context
  • Ex. 80°F in Urbana: outlier? (depends on whether it is summer or winter)
  • Attributes of data objects should be divided into two groups
    • Contextual attributes: define the context, e.g., time & location
    • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
  • Can be viewed as a generalization of local outliers, whose density significantly deviates from that of their local area
  • Issue: How to define or formulate meaningful context?
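
The split into contextual and behavioral attributes can be sketched as follows: group the records by the contextual attribute, then judge the behavioral attribute only against its own group. This is a hedged illustration, not the lecture's method. The readings are hypothetical, the season is assumed to be the contextual attribute, and the z-score cutoff of 2 is an assumption suited to these small groups.

```python
import statistics
from collections import defaultdict

# Hypothetical (season, temperature in °C) readings.
# season = contextual attribute, temperature = behavioral attribute.
readings = [
    ("summer", 24), ("summer", 26), ("summer", 25), ("summer", 27),
    ("summer", 23), ("summer", 25), ("summer", 28), ("summer", 26),
    ("winter", 2), ("winter", 4), ("winter", 3), ("winter", 5),
    ("winter", 1), ("winter", 0), ("winter", 3), ("winter", 25),
]

Z_CUTOFF = 2.0  # assumption: chosen for these small groups


def contextual_outliers(records, cutoff=Z_CUTOFF):
    """Flag records whose value is extreme within their own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        mean = statistics.fmean(group)
        std = statistics.pstdev(group)
        if std > 0 and abs(value - mean) / std > cutoff:
            flagged.append((context, value))
    return flagged


print(contextual_outliers(readings))
```

Here 25°C is flagged only in winter; the same value in summer is close to the seasonal mean and passes unnoticed.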

How to Detect Contextual Outliers?

Since contextual outliers depend on conditions, we use context-aware methods:
1. Time-Series Methods (for detecting anomalies over time)
   1. Seasonal Decomposition: Identifies unusual values based on seasonality.
   2. LSTMs & RNNs: Learn patterns and detect deviations in sequential data.
2. Regression-Based Methods (for detecting unexpected values given conditions)
   1. Linear Regression: Predicts expected values; a data point that deviates significantly from its prediction is an outlier.
3. Density-Based Approaches
   1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Finds low-density points based on neighbors.
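
As a minimal sketch of the regression-based idea, the code below fits a least-squares line and flags points whose residual is large relative to the spread of all residuals. Everything here is assumed for illustration: the data are hypothetical, the helper names are invented, and the cutoff of 2.5 residual standard deviations is an arbitrary choice.

```python
import statistics

# Hypothetical data: y follows roughly 2*x + 1, except at x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3, 5, 7, 9, 11, 30, 15, 17, 19, 21]

RESIDUAL_CUTOFF = 2.5  # assumption: in units of residual standard deviation


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_outliers(xs, ys, cutoff=RESIDUAL_CUTOFF):
    """Flag (x, y) pairs that deviate strongly from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # perfect fit, no deviations
    return [(x, y) for x, y, r in zip(xs, ys, residuals)
            if abs(r) / spread > cutoff]


print(regression_outliers(xs, ys))
```

The fit here includes the suspicious point, which pulls the line toward it; a more careful version would fit on data known to be normal or use a robust regression.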

Types of Outliers (III)
• Collective Outliers
  • A collective outlier is a group of data points that may appear normal individually but together form an anomalous pattern.
  • Unlike global outliers (single extreme values) and contextual outliers (outliers in specific conditions), collective outliers are only anomalous when viewed as a group.

Example: Network Intrusion Detection

• Each individual login attempt is normal
• But multiple failed logins plus unusual access together indicate a potential attack (e.g., a brute-force attack).
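
One simple way to look at events as a group rather than one by one is a sliding time window over failed logins. The sketch below is an assumption-laden illustration: the timestamps are hypothetical, and the 60-second window and limit of 3 failures are arbitrary values, not taken from the lecture.

```python
from collections import deque

# Hypothetical timestamps (in seconds) of failed login attempts.
failed_login_times = [0, 300, 900, 1000, 1005, 1010, 1015, 1020, 2000]


def find_bursts(times, window_seconds=60, max_failures=3):
    """Return the timestamps at which the trailing window holds more than
    max_failures failed logins. Each login alone is unremarkable; only the
    group inside the window is anomalous."""
    recent = deque()
    alerts = []
    for t in sorted(times):
        recent.append(t)
        # Drop attempts that have fallen out of the trailing window.
        while recent[0] <= t - window_seconds:
            recent.popleft()
        if len(recent) > max_failures:
            alerts.append(t)
    return alerts


alerts = find_bursts(failed_login_times)
print(alerts)
```

The isolated failures at 0, 300, 900, and 2000 seconds raise no alert; only the cluster of attempts around 1000 seconds does.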

Example 2: Credit Card Fraud Detection

• A single international transaction may not be suspicious, but...
• A sudden sequence of high-value transactions across different countries is abnormal (potential fraud).

Types of Outliers (III)
• Collective Outliers
  • A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
  • Applications: e.g., intrusion detection, when a number of computers keep sending denial-of-service packets to each other
• Detection of collective outliers
  • Consider not only the behavior of individual objects, but also that of groups of objects
  • Need to have background knowledge of the relationship among data objects, such as a distance or similarity measure on objects.
• A data set may have multiple types of outliers
• One object may belong to more than one type of outlier

Challenges in Detecting Outliers

• High Dimensionality: In datasets with many features (e.g., medical records, finance), defining an outlier is difficult.
• Concept Drift: In real-time data (e.g., stock market, weather), normal patterns change over time, so points that were once outliers may become normal.
• No Ground Truth: It is often unclear whether an unusual value is a genuine outlier or a meaningful pattern.
• Imbalanced Data: Because outliers are rare, a model may ignore them, treating them as noise.
• Masked Outliers: Some outliers may blend into clusters, making them hard to detect.

Challenges of Outlier Detection
• Modeling normal objects and outliers properly
  • Hard to enumerate all possible normal behaviors in an application
  • The border between normal and outlier objects is often a gray area
• Application-specific outlier detection
  • The choice of distance measure among objects and the model of relationship among objects are often application-dependent
  • E.g., in clinical data a small deviation could be an outlier, while in marketing analysis larger fluctuations are expected
• Handling noise in outlier detection
  • Noise may distort the normal objects and blur the distinction between normal objects and outliers. It may help hide outliers and reduce the effectiveness of outlier detection
• Understandability
  • Understand why these are outliers: justification of the detection
  • Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism

Thank You!!!
