Exploratory Data Analysis (EDA) Techniques - Lect 5

Exploratory Data Analysis (EDA) is a systematic approach to examining and summarizing dataset characteristics using visual methods and statistics, serving as the first step in data analysis workflows. EDA involves understanding data distributions, identifying patterns, detecting anomalies, and checking assumptions before applying formal models. Techniques include descriptive statistics, measures of variability, and various visualizations to aid in data interpretation and analysis.


Exploratory Data Analysis (EDA) Techniques
Dr. Nana Yaw Duodu
Computer Science Department
Accra Technical University
DATA VISUALISATION COMPUTER SCIENCE DEPARTMENT

7/15/2025 FACULTY OF APPLIED SCIENCES 2




EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES

• Exploratory Data Analysis (EDA) is the process of systematically examining and summarising the main characteristics of a dataset, often using visual methods and summary statistics, before applying formal statistical models or machine learning techniques.

• EDA helps analysts ensure the data is well understood, properly prepared, and ready for more advanced analysis, modelling, or interpretation.

• EDA is regarded as the first step in any data analysis workflow.



EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES

• Exploratory Data Analysis (EDA) involves:


1. Understanding the data by examining distributions, relationships, and basic descriptive statistics.

2. Identifying patterns or trends that may exist within the dataset.

3. Detecting anomalies or outliers that could impact subsequent analysis.

4. Checking underlying assumptions that need to be met before formal modelling, such as normality or linearity.
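The four activities above are each a short pandas expression. A minimal sketch, using a made-up miniature table in the style of the heart disease examples later in these slides (column names and values are illustrative, not the course dataset):

```python
import pandas as pd

# Illustrative patient records; values and column names are made up.
df = pd.DataFrame({
    "age":  [29, 45, 61, 50, 38, 62, 47, 55],
    "chol": [180, 230, 320, 250, 210, 500, 240, 260],
    "cp_type": ["typical", "atypical", "typical", "non-anginal",
                "typical", "asymptomatic", "atypical", "typical"],
})

summary = df.describe()                 # 1. distributions and descriptive statistics
pattern = df[["age", "chol"]].corr()    # 2. relationships between variables
max_chol = df["chol"].max()             # 3. an extreme value worth investigating
age_skew = df["age"].skew()             # 4. a rough symmetry check before modelling

print(summary)
print(max_chol)
```

Even on eight rows, `describe()` immediately surfaces the suspicious cholesterol value of 500 through the gap between the mean and the maximum.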



DATA ANALYSIS WORKFLOW

• A data analysis workflow defines a systematic, repeatable, and scalable process for analysing data.

• It comprises several distinct stages, each with its prescribed tasks and objectives, offering a structured approach to ensure methodical data analysis.





EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES

• Introduce EDA as the first step in any data analysis workflow.

• Emphasise its purpose: to understand the data, spot patterns, detect anomalies, and check assumptions.



EXPLAIN EDA TECHNIQUES IN LAYERS

• Break down EDA into four layers, each building on the previous: descriptive statistics, visual analysis, deeper data characteristics, and advanced techniques.



DESCRIPTIVE STATISTICS

• Measures of central tendency (mean, median, mode)

• Measures of variability (variance, standard deviation, range, interquartile range)
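Both groups of measures are one-line computations in pandas. A small sketch with made-up cholesterol values:

```python
import pandas as pd

chol = pd.Series([180, 230, 320, 250, 210, 500, 240, 260])  # illustrative values

# Central tendency
mean = chol.mean()
median = chol.median()
mode = chol.mode()                 # may hold several values if there are ties

# Variability
variance = chol.var()              # sample variance (ddof=1 by default)
std_dev = chol.std()
value_range = chol.max() - chol.min()
iqr = chol.quantile(0.75) - chol.quantile(0.25)

print(mean, median, value_range, iqr)
```

Note how the single extreme value (500) pulls the mean (273.75) well above the median (245) and inflates the range, while the IQR barely moves.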



DESCRIPTIVE STATISTICS - CENTRAL TENDENCIES
• Mean (average): Represents the arithmetic average of all data points. It is useful for
understanding the overall “center” of normally distributed data.
• Example: Average cholesterol level in a heart disease dataset.
• Median: Represents the middle value when the data is ordered. It is less sensitive to outliers
than the mean.
• Example: Median age of patients can be insightful when data contains extreme values.
• Mode: Represents the most frequently occurring value. It is particularly useful for categorical
data.
• Example: Mode of the ‘chest pain type’ variable to see the most common type of chest pain among
patients.
• 2. Comparing Distributions
• Measures of central tendency help compare different subgroups (e.g. patients with and
without heart disease).
• Example: Comparing the mean cholesterol levels between these groups can highlight important differences.
• Example: Comparing mean ages of patients with and without heart disease.
• Example: Comparing median cholesterol levels by gender.
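The subgroup comparisons above are a `groupby` away in pandas. A minimal sketch with made-up patient records (the `disease` flag and values are illustrative):

```python
import pandas as pd

# Made-up patient records; 'disease' marks presence of heart disease.
df = pd.DataFrame({
    "disease": [1, 1, 1, 0, 0, 0],
    "sex":     ["M", "F", "M", "F", "M", "F"],
    "age":     [60, 58, 65, 40, 45, 38],
    "chol":    [300, 280, 320, 200, 210, 190],
})

mean_chol = df.groupby("disease")["chol"].mean()        # mean cholesterol per group
mean_age = df.groupby("disease")["age"].mean()          # mean age per group
median_chol_by_sex = df.groupby("sex")["chol"].median() # median cholesterol by gender

print(mean_chol)
```

The gap between the group means (300 vs 200 here) is exactly the kind of difference EDA should flag before any formal test.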





Assessing Symmetry and Skewness

• 3. By comparing the mean and median:
• If mean ≈ median, the distribution is approximately symmetric.
• If mean > median, the distribution is right-skewed.
• If mean < median, the distribution is left-skewed.
• 4. Guiding Further Analysis
• These measures can inform decisions on which statistical tests are appropriate.
• Example: If the data is skewed, non-parametric tests may be preferred.
• 5. Establishing Baselines
• They provide a baseline for comparing new data points or groups.
• Example: Knowing the average age or cholesterol level allows for identifying unusual cases.
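The mean-versus-median rule above can be wrapped in a small helper. This is a sketch with an assumed tolerance heuristic (the 5%-of-standard-deviation threshold is a made-up cut-off, not a standard):

```python
import pandas as pd

def skew_direction(values, tol=0.05):
    """Classify a distribution by comparing its mean and median.

    tol is the gap (relative to the standard deviation) below which
    the two are treated as equal; the threshold is a heuristic.
    """
    s = pd.Series(values, dtype=float)
    gap = s.mean() - s.median()
    if abs(gap) <= tol * s.std():
        return "approximately symmetric"
    return "right-skewed" if gap > 0 else "left-skewed"

symmetric = skew_direction([1, 2, 3, 4, 5])      # mean == median
right = skew_direction([1, 2, 3, 4, 100])        # long right tail pulls the mean up
```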



Assessing Symmetry and Skewness

• 3. Detecting Outliers and Anomalies
• Outliers affect the mean more than the median.
• By comparing these measures, analysts can detect potential outliers or anomalies that
may require further investigation.
• 4. Guiding Further Analysis
• Measures of central tendency inform the choice of statistical tests and models.
• For instance, normally distributed data (mean ≈ median ≈ mode) might suggest
parametric tests, while skewed data may require non-parametric tests.
• 5. Comparing Subgroups
• These measures allow comparison between different subgroups in the data.
• Example: Comparing mean ages of patients with and without heart disease.
• Example: Comparing median cholesterol levels by gender.



MEASURES OF DISPERSION

• Measures of dispersion, also called measures of variability, "describe the extent to which the values of a variable are different" (Wallace & Van Fleet, 2012, p. 293).

• The most common measures of dispersion are range, variance, standard deviation, and the coefficient of variation.



USEFULNESS OF MEASURES OF VARIABILITY IN EDA
• 1. Understanding Data Spread
• Variance and standard deviation measure how much the data points deviate from the mean.
• A high variance or standard deviation suggests data points are widely spread, while a low variance or standard deviation suggests they are clustered close to the mean.
• 2. Detecting Outliers
• Range (difference between maximum and minimum) can highlight extreme values.
• Interquartile Range (IQR) focuses on the middle 50% of the data, reducing the influence of outliers compared to the range.
• 3. Comparing Subgroups
• Measures of variability help compare the consistency of different subgroups.
• For example, comparing standard deviations of cholesterol levels between males and females in a heart disease dataset can reveal which group has more variability.
• 4. Checking Assumptions
• Many statistical models assume homogeneity of variance (e.g. equal spread across groups).
• EDA using these measures helps determine if this assumption holds or if data transformation is needed.
• 5. Guiding Feature Selection and Engineering
• Variables with low variability may not be useful in predictive modelling, while variables with high variability can be more informative.
• This helps in selecting relevant features and avoiding noise.
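The IQR-based outlier check in point 2 is commonly implemented with Tukey's 1.5 × IQR fences. A minimal sketch on made-up cholesterol values:

```python
import pandas as pd

chol = pd.Series([180, 230, 320, 250, 210, 500, 240, 260])  # illustrative values

q1, q3 = chol.quantile(0.25), chol.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey's 1.5*IQR fences
outliers = chol[(chol < lower) | (chol > upper)]

print(outliers.tolist())
```

Only the value 500 falls outside the fences, while the plain range (max − min) gives no such per-point flag.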





VISUAL ANALYSIS

• Histograms, box plots, scatter plots, bar charts, and heatmaps.


• Show how each visualisation helps answer different questions:
• Distribution → histograms, box plots
• Relationships → scatter plots, heatmaps
• Comparisons → bar charts
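The question-to-chart mapping above can be sketched with matplotlib on synthetic data (ages and cholesterol values below are randomly generated, not real):

```python
import matplotlib
matplotlib.use("Agg")               # render off-screen; no display required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
age = rng.normal(54, 9, 200)                      # synthetic patient ages
chol = 150 + 2 * age + rng.normal(0, 20, 200)     # loosely tied to age

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(age, bins=20)          # distribution -> histogram
axes[0].set_title("Age distribution")
axes[1].boxplot(chol)               # spread / outliers -> box plot
axes[1].set_title("Cholesterol spread")
axes[2].scatter(age, chol, s=10)    # relationship -> scatter plot
axes[2].set_title("Age vs cholesterol")
fig.savefig("eda_plots.png")
```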



DEEPER DATA CHARACTERISTICS
• Discuss skewness, kurtosis, and normality checks (e.g., Q-Q plots).
• Highlight the importance of checking missing values and outliers using visual and statistical methods.
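Skewness, kurtosis, and missing-value counts are all built into pandas (note that `.kurt()` reports excess kurtosis, so values near 0 indicate a normal-like shape). A sketch on synthetic samples:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
symmetric = pd.Series(rng.normal(0, 1, 2000))          # roughly normal sample
right_tailed = pd.Series(rng.exponential(1.0, 2000))   # strong right tail

print(symmetric.skew(), symmetric.kurt())   # both near 0 for normal-like data
print(right_tailed.skew())                  # clearly positive

# Missing values belong in the same deeper check
print(symmetric.isna().sum())               # 0 here; rarely 0 in real data
```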



ADVANCED TECHNIQUES
• Introduce correlation analysis using heatmaps.
• Introduce dimensionality reduction (PCA) conceptually, but reserve detailed explanation for later advanced courses if needed.
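A conceptual sketch of both techniques on synthetic data. The correlation matrix is what a heatmap would colour; PCA is shown via a plain SVD on standardised columns purely to convey the idea (in practice `sklearn.decomposition.PCA` is the usual tool):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.normal(size=300)
df = pd.DataFrame({
    "age":   x,                                      # synthetic driver variable
    "chol":  0.8 * x + 0.2 * rng.normal(size=300),   # strongly tied to age
    "noise": rng.normal(size=300),                   # unrelated column
})

corr = df.corr()    # the matrix a correlation heatmap visualises

# Conceptual PCA: standardise, take the SVD, read off variance shares.
z = (df - df.mean()) / df.std()
_, s, _ = np.linalg.svd(z.to_numpy(), full_matrices=False)
explained = s**2 / (s**2).sum()     # variance share of each component

print(corr.loc["age", "chol"], explained)
```

Because `age` and `chol` are nearly collinear, the first component carries most of the variance, which is exactly the redundancy PCA is meant to expose.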



HANDS-ON EXERCISES

• Use the dataset provided to complete the following:

1. Calculate basic statistics.

2. Create relevant visualisations.

3. Identify outliers and missing values.

4. Interpret relationships between variables.
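A starter skeleton for tasks 1, 3, and 4 (tasks 2's plots follow the visual analysis slide). The demo frame is a stand-in; load the provided dataset in its place:

```python
import pandas as pd

def quick_eda(df: pd.DataFrame) -> dict:
    """Starter helper for the exercise; pass in the provided dataset."""
    numeric = df.select_dtypes("number")
    return {
        "stats":   df.describe(),     # task 1: basic statistics
        "missing": df.isna().sum(),   # task 3: missing values per column
        "corr":    numeric.corr(),    # task 4: relationships (numeric columns)
    }

# Tiny demo frame; replace with the course dataset loaded via pd.read_csv.
demo = pd.DataFrame({"a": [1.0, 2.0, None, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})
result = quick_eda(demo)
```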



NATURE OF VISUALIZATION
