Exploratory Data Analysis (EDA)
Techniques
Dr. Nana Yaw Duodu
Computer Science Department
Accra Technical University
DATA VISUALISATION COMPUTER SCIENCE DEPARTMENT
7/15/2025 FACULTY OF APPLIED SCIENCES 2
EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES
• Exploratory Data Analysis (EDA) is the process of systematically examining and
summarising the main characteristics of a dataset, often using visual methods
and summary statistics, before applying formal statistical models or machine
learning techniques.
• EDA helps analysts by ensuring the data is well understood, properly prepared, and ready for more advanced analysis, modelling, or interpretation.
• EDA is regarded as the first step in any data analysis workflow.
EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES
• Exploratory Data Analysis (EDA) involves:
1. Understanding the data by examining distributions, relationships, and basic descriptive
statistics.
2. Identifying patterns or trends that may exist within the dataset.
3. Detecting anomalies or outliers that could impact subsequent analysis.
4. Checking underlying assumptions that need to be met before formal modelling, such as
normality or linearity.
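The first of these steps can be sketched with Python's standard library alone. The ages below are hypothetical values, not drawn from any real dataset:

```python
import statistics as st

# Hypothetical sample: patient ages from a heart disease dataset
ages = [63, 37, 41, 56, 57, 45, 68, 54, 48, 61]

# Step 1: basic descriptive statistics to understand the distribution
summary = {
    "count": len(ages),
    "mean": st.mean(ages),
    "median": st.median(ages),
    "min": min(ages),
    "max": max(ages),
    "stdev": st.stdev(ages),  # sample standard deviation
}
print(summary)
```

Comparing the mean (53.0) with the median (55.0) already hints at step 4: the two are close, so the distribution is roughly symmetric.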
DATA ANALYSIS WORKFLOW
• A data analysis workflow defines a systematic, repeatable, and scalable process
for analysing data.
• It comprises several distinct stages, each with its prescribed tasks and
objectives, offering a structured approach to ensure methodical data analysis.
EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES
• EDA is the first step in any data analysis workflow.
• Its purpose is to understand the data, spot patterns, detect anomalies, and check assumptions.
EXPLAIN EDA TECHNIQUES IN LAYERS
• EDA can be broken down into four layers, each building on the previous: descriptive statistics, visual analysis, deeper data characteristics, and advanced techniques.
DESCRIPTIVE STATISTICS
• Measures of central tendency (mean, median, mode)
• Measures of variability (variance, standard deviation, range, interquartile range)
DESCRIPTIVE STATISTICS-CENTRAL TENDENCIES
• 1. Describing Typical Values
• Mean (average): Represents the arithmetic average of all data points. It is useful for understanding the overall “centre” of normally distributed data.
• Example: Average cholesterol level in a heart disease dataset.
• Median: Represents the middle value when the data is ordered. It is less sensitive to outliers
than the mean.
• Example: Median age of patients can be insightful when data contains extreme values.
• Mode: Represents the most frequently occurring value. It is particularly useful for categorical
data.
• Example: Mode of the ‘chest pain type’ variable to see the most common type of chest pain among
patients.
• 2. Comparing Distributions
• Measures of central tendency help compare different subgroups (e.g. patients with and
without heart disease).
• Example: Comparing the mean cholesterol levels between these groups can highlight important differences.
• Example: Comparing mean ages of patients with and without heart disease.
• Example: Comparing median cholesterol levels by gender.
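The subgroup comparisons above can be sketched in a few lines of Python. All the records here are made-up illustrative values, not real patient data:

```python
import statistics as st

# Hypothetical (has_disease, cholesterol) records
records = [(True, 286), (True, 254), (False, 204), (False, 236),
           (True, 294), (False, 192), (True, 263), (False, 199)]

with_disease = [c for has, c in records if has]
without_disease = [c for has, c in records if not has]

# Comparing mean cholesterol between the two subgroups
print(st.mean(with_disease), st.mean(without_disease))

# Mode suits categorical data, e.g. the most common chest pain type
pain_types = ["typical", "atypical", "typical", "non-anginal", "typical"]
print(st.mode(pain_types))
```

In this toy sample the diseased group's mean cholesterol (274.25) is clearly higher than the other group's (207.75), exactly the kind of difference the comparison is meant to surface.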
Assessing Symmetry and Skewness
• 3. Assessing Symmetry and Skewness, by comparing the mean and median:
• If mean ≈ median, the distribution is approximately symmetric.
• If mean > median, the distribution is right-skewed.
• If mean < median, the distribution is left-skewed.
• 4. Guiding Further Analysis
• These measures can inform decisions on which statistical tests are appropriate.
• Example: If the data is skewed, non-parametric tests may be preferred.
• 5. Establishing Baselines
• They provide a baseline for comparing new data points or groups.
• Example: Knowing the average age or cholesterol level allows for identifying unusual cases.
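The mean-versus-median rule of thumb above translates directly into code. The sample and the 5% tolerance below are arbitrary illustrative choices:

```python
import statistics as st

# Hypothetical right-skewed sample (a few unusually large values)
values = [34, 37, 41, 43, 45, 48, 52, 55, 77, 88]

mean, median = st.mean(values), st.median(values)
if abs(mean - median) < 0.05 * median:  # tolerance is an arbitrary choice
    shape = "approximately symmetric"
elif mean > median:
    shape = "right-skewed"
else:
    shape = "left-skewed"
print(mean, median, shape)
```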
DETECTING OUTLIERS AND ANOMALIES
• Outliers affect the mean more than the median.
• By comparing these measures, analysts can detect potential outliers or anomalies that may require further investigation.
• This complements the guidance above: normally distributed data (mean ≈ median ≈ mode) might suggest parametric tests, while skewed data may require non-parametric tests.
MEASURES OF DISPERSION
• Measures of dispersion, also called measures of variability, "describe the extent to which the values of a variable are different" (Wallace & Van Fleet, 2012, p. 293).
• The most common measures of dispersion are range, variance, standard deviation, and the coefficient of variation.
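All four measures can be computed with the standard library. The blood pressure readings below are hypothetical:

```python
import statistics as st

# Hypothetical resting blood pressure readings
bp = [120, 130, 140, 120, 110, 150, 128, 132]

data_range = max(bp) - min(bp)   # range: max minus min
variance = st.pvariance(bp)      # population variance
std_dev = st.pstdev(bp)          # population standard deviation
cv = std_dev / st.mean(bp)       # coefficient of variation (unitless)
print(data_range, variance, round(cv, 3))
```

The coefficient of variation is the standard deviation expressed relative to the mean, which makes spreads comparable across variables measured on different scales.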
USEFULNESS OF MEASURES OF VARIABILITY IN EDA
• 1. Understanding Data Spread
• Variance and standard deviation measure how much the data points deviate from the mean.
• A high variance or standard deviation suggests data points are widely spread, while a low variance or standard deviation suggests they are clustered close to the mean.
• 2. Detecting Outliers
• Range (difference between maximum and minimum) can highlight extreme values.
• Interquartile Range (IQR) focuses on the middle 50% of the data, reducing the influence of outliers compared to the range.
• 3. Comparing Subgroups
• Measures of variability help compare the consistency of different subgroups.
• For example, comparing standard deviations of cholesterol levels between males and females in a heart disease dataset can reveal which group has more variability.
• 4. Checking Assumptions
• Many statistical models assume homogeneity of variance (e.g. equal spread across groups).
• EDA using these measures helps determine if this assumption holds or if data transformation is needed.
• 5. Guiding Feature Selection and Engineering
• Variables with low variability may not be useful in predictive modelling, while variables with high variability can be more informative.
• This helps in selecting relevant features and avoiding noise.
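Point 2 above is commonly operationalised as the 1.5 × IQR rule. A minimal sketch, using hypothetical cholesterol values with one deliberately extreme reading:

```python
import statistics as st

# Hypothetical cholesterol values; 564 is deliberately extreme
chol = [233, 250, 204, 236, 192, 294, 263, 199, 168, 564]

q1, _, q3 = st.quantiles(chol, n=4)  # quartiles (default exclusive method)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in chol if x < low or x > high]
print(outliers)
```

Because Q1 and Q3 come from the middle of the data, the fences they define are far less sensitive to the extreme value than the raw range would be.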
VISUAL ANALYSIS
• Histograms, box plots, scatter plots, bar charts, and heatmaps.
• Each type of visualisation helps answer different questions:
• Distribution → histograms, box plots
• Relationships → scatter plots, heatmaps
• Comparisons → bar charts
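What a histogram does to a distribution can be mimicked in plain Python by counting values per bin (in practice, matplotlib's `hist`, `boxplot`, and `scatter` functions produce the real charts). The ages and the decade bin width below are illustrative choices:

```python
from collections import Counter

# Hypothetical ages, binned into decades to mimic a histogram
ages = [34, 37, 41, 43, 45, 48, 52, 55, 57, 61, 63, 68]
width = 10
bins = Counter((a // width) * width for a in ages)

# Text rendering: one '#' per observation in the bin
for start in sorted(bins):
    print(f"{start}-{start + width - 1}: {'#' * bins[start]}")
```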
DEEPER DATA CHARACTERISTICS
• Skewness, kurtosis, and normality checks (e.g. Q-Q plots) reveal deeper characteristics of the data.
• Checking for missing values and outliers, using both visual and statistical methods, is equally important.
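A sketch of the moment-based definitions of skewness and excess kurtosis, plus a simple missing-value count, using made-up data:

```python
import statistics as st

# Hypothetical right-skewed sample
x = [34, 37, 41, 43, 45, 48, 52, 55, 77, 88]
n, mu, sigma = len(x), st.mean(x), st.pstdev(x)

# Third and fourth standardised moments
skewness = sum((v - mu) ** 3 for v in x) / (n * sigma ** 3)        # > 0: right-skewed
excess_kurtosis = sum((v - mu) ** 4 for v in x) / (n * sigma ** 4) - 3

# Missing-value check on a column that may contain gaps
chol = [200, None, 230, 180, None, 250]
missing = sum(v is None for v in chol)
print(round(skewness, 3), missing)
```

A positive skewness confirms what comparing the mean and median already suggested for this sample.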
ADVANCED TECHNIQUES
• Correlation analysis, often visualised with heatmaps, quantifies relationships between pairs of variables.
• Dimensionality reduction (e.g. PCA) is introduced here conceptually; detailed treatment is reserved for later advanced courses.
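Each cell of a correlation heatmap holds a Pearson correlation coefficient, which can be computed from first principles. The paired age and heart-rate values below are hypothetical:

```python
# Pearson correlation: covariance scaled by the two standard deviations
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical paired measurements: age vs. maximum heart rate
age = [29, 34, 40, 45, 50, 55, 60, 65]
max_hr = [190, 185, 178, 172, 165, 160, 152, 148]
r = pearson(age, max_hr)
print(round(r, 3))
```

The strongly negative `r` here indicates that, in this toy sample, maximum heart rate decreases steadily with age; a heatmap simply arranges such coefficients for every variable pair.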
HANDS-ON EXERCISES
• Use the dataset provided to complete the following:
1. Calculate basic statistics.
2. Create relevant visualisations.
3. Identify outliers and missing values.
4. Interpret relationships between variables.
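A starter sketch for tasks 1 and 3, with an inline CSV string standing in for the provided dataset (the column names and rows are made up for illustration):

```python
import csv
import io
import statistics as st

# Inline stand-in for the provided dataset; replace with open("your_file.csv")
raw = "age,chol\n63,233\n41,\n56,236\n57,354\n45,NA\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Task 3: count missing values before computing statistics
chol = [float(r["chol"]) for r in rows if r["chol"] not in ("", "NA")]
missing = len(rows) - len(chol)

# Task 1: basic statistics on the cleaned column
print(missing, st.mean(chol), st.median(chol))
```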
NATURE OF VISUALISATION