Exploratory Data Analysis (EDA)
Techniques
Dr. Nana Yaw Duodu
Computer Science Department
Accra Technical University
DATA VISUALISATION COMPUTER SCIENCE DEPARTMENT
7/15/2025 FACULTY OF APPLIED SCIENCES 2
EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES
• Exploratory Data Analysis (EDA) is the process of systematically examining and
summarising the main characteristics of a dataset, often using visual methods
and summary statistics, before applying formal statistical models or machine
learning techniques.
• EDA helps analysts by ensuring the data is well understood, properly prepared, and ready for more advanced analysis, modelling, or interpretation.
• EDA is regarded as the first step in any data analysis workflow.
EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES
• Exploratory Data Analysis (EDA) involves:
1. Understanding the data by examining distributions, relationships, and basic descriptive
statistics.
2. Identifying patterns or trends that may exist within the dataset.
3. Detecting anomalies or outliers that could impact subsequent analysis.
4. Checking underlying assumptions that need to be met before formal modelling, such as
normality or linearity.
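The first of these steps can be sketched with Python's standard library alone. The ages below are hypothetical values, not drawn from any real dataset:

```python
import statistics as st

# Hypothetical sample: patient ages from a heart disease dataset
ages = [63, 37, 41, 56, 57, 45, 68, 54, 48, 61]

# Step 1: basic descriptive statistics to understand the distribution
summary = {
    "count": len(ages),
    "mean": st.mean(ages),
    "median": st.median(ages),
    "min": min(ages),
    "max": max(ages),
    "stdev": st.stdev(ages),  # sample standard deviation
}
print(summary)
```

Comparing the mean (53.0) with the median (55.0) already hints at step 4: the two are close, so the distribution is roughly symmetric.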
DATA ANALYSIS WORKFLOW
• A data analysis workflow defines a systematic, repeatable, and scalable process
for analysing data.
• It comprises several distinct stages, each with its prescribed tasks and
objectives, offering a structured approach to ensure methodical data analysis.
EXPLORATORY DATA ANALYSIS (EDA) TECHNIQUES
• EDA is the first step in any data analysis workflow.
• Its purpose is to understand the data, spot patterns, detect anomalies, and check assumptions.
EXPLAIN EDA TECHNIQUES IN LAYERS
• EDA can be broken down into four layers, each building on the previous: descriptive statistics, visual analysis, deeper data characteristics, and advanced techniques.
DESCRIPTIVE STATISTICS
• Measures of central tendency (mean, median, mode)
• Measures of variability (variance, standard deviation, range, interquartile range)
DESCRIPTIVE STATISTICS-CENTRAL TENDENCIES
• 1. Describing Typical Values
• Mean (average): Represents the arithmetic average of all data points. It is useful for understanding the overall “centre” of normally distributed data.
• Example: Average cholesterol level in a heart disease dataset.
• Median: Represents the middle value when the data is ordered. It is less sensitive to outliers
than the mean.
• Example: Median age of patients can be insightful when data contains extreme values.
• Mode: Represents the most frequently occurring value. It is particularly useful for categorical
data.
• Example: Mode of the ‘chest pain type’ variable to see the most common type of chest pain among
patients.
• 2. Comparing Distributions
• Measures of central tendency help compare different subgroups (e.g. patients with and
without heart disease).
• Example: Comparing the mean cholesterol levels between these groups can highlight important differences.
• Example: Comparing mean ages of patients with and without heart disease.
• Example: Comparing median cholesterol levels by gender.
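The subgroup comparisons above can be sketched in a few lines of Python. All the records here are made-up illustrative values, not real patient data:

```python
import statistics as st

# Hypothetical (has_disease, cholesterol) records
records = [(True, 286), (True, 254), (False, 204), (False, 236),
           (True, 294), (False, 192), (True, 263), (False, 199)]

with_disease = [c for has, c in records if has]
without_disease = [c for has, c in records if not has]

# Comparing mean cholesterol between the two subgroups
print(st.mean(with_disease), st.mean(without_disease))

# Mode suits categorical data, e.g. the most common chest pain type
pain_types = ["typical", "atypical", "typical", "non-anginal", "typical"]
print(st.mode(pain_types))
```

In this toy sample the diseased group's mean cholesterol (274.25) is clearly higher than the other group's (207.75), exactly the kind of difference the comparison is meant to surface.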
Assessing Symmetry and Skewness
• 3. Assessing Symmetry and Skewness, by comparing the mean and median:
• If mean ≈ median, the distribution is approximately symmetric.
• If mean > median, the distribution is right-skewed.
• If mean < median, the distribution is left-skewed.
• 4. Guiding Further Analysis
• These measures can inform decisions on which statistical tests are appropriate.
• Example: If the data is skewed, non-parametric tests may be preferred.
• 5. Establishing Baselines
• They provide a baseline for comparing new data points or groups.
• Example: Knowing the average age or cholesterol level allows for identifying unusual cases.
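The mean-versus-median rule of thumb above translates directly into code. The sample and the 5% tolerance below are arbitrary illustrative choices:

```python
import statistics as st

# Hypothetical right-skewed sample (a few unusually large values)
values = [34, 37, 41, 43, 45, 48, 52, 55, 77, 88]

mean, median = st.mean(values), st.median(values)
if abs(mean - median) < 0.05 * median:  # tolerance is an arbitrary choice
    shape = "approximately symmetric"
elif mean > median:
    shape = "right-skewed"
else:
    shape = "left-skewed"
print(mean, median, shape)
```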
DETECTING OUTLIERS AND ANOMALIES
• Outliers affect the mean more than the median.
• By comparing these measures, analysts can detect potential outliers or anomalies that may require further investigation.
• This complements the guidance above: normally distributed data (mean ≈ median ≈ mode) might suggest parametric tests, while skewed data may require non-parametric tests.
MEASURES OF DISPERSION
• Measures of dispersion, also called measures of variability, "describe the extent to which the values of a variable are different" (Wallace & Van Fleet, 2012, p. 293).
• The most common measures of dispersion are range, variance, standard deviation, and the coefficient of variation.
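All four measures can be computed with the standard library. The blood pressure readings below are hypothetical:

```python
import statistics as st

# Hypothetical resting blood pressure readings
bp = [120, 130, 140, 120, 110, 150, 128, 132]

data_range = max(bp) - min(bp)   # range: max minus min
variance = st.pvariance(bp)      # population variance
std_dev = st.pstdev(bp)          # population standard deviation
cv = std_dev / st.mean(bp)       # coefficient of variation (unitless)
print(data_range, variance, round(cv, 3))
```

The coefficient of variation is the standard deviation expressed relative to the mean, which makes spreads comparable across variables measured on different scales.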
USEFULNESS OF MEASURES OF VARIABILITY IN EDA
• 1. Understanding Data Spread
• Variance and standard deviation measure how much the data points deviate from the mean.
• A high variance or standard deviation suggests data points are widely spread, while a low variance or standard deviation suggests they are clustered close to the mean.
• 2. Detecting Outliers
• Range (difference between maximum and minimum) can highlight extreme values.
• Interquartile Range (IQR) focuses on the middle 50% of the data, reducing the influence of outliers compared to the range.
• 3. Comparing Subgroups
• Measures of variability help compare the consistency of different subgroups.
• For example, comparing standard deviations of cholesterol levels between males and females in a heart disease dataset can reveal which group has more variability.
• 4. Checking Assumptions
• Many statistical models assume homogeneity of variance (e.g. equal spread across groups).
• EDA using these measures helps determine if this assumption holds or if data transformation is needed.
• 5. Guiding Feature Selection and Engineering
• Variables with low variability may not be useful in predictive modelling, while variables with high variability can be more informative.
• This helps in selecting relevant features and avoiding noise.
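Point 2 above is commonly operationalised as the 1.5 × IQR rule. A minimal sketch, using hypothetical cholesterol values with one deliberately extreme reading:

```python
import statistics as st

# Hypothetical cholesterol values; 564 is deliberately extreme
chol = [233, 250, 204, 236, 192, 294, 263, 199, 168, 564]

q1, _, q3 = st.quantiles(chol, n=4)  # quartiles (default exclusive method)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in chol if x < low or x > high]
print(outliers)
```

Because Q1 and Q3 come from the middle of the data, the fences they define are far less sensitive to the extreme value than the raw range would be.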
VISUAL ANALYSIS
• Histograms, box plots, scatter plots, bar charts, and heatmaps.
• Each type of visualisation helps answer different questions:
• Distribution → histograms, box plots
• Relationships → scatter plots, heatmaps
• Comparisons → bar charts
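What a histogram does to a distribution can be mimicked in plain Python by counting values per bin (in practice, matplotlib's `hist`, `boxplot`, and `scatter` functions produce the real charts). The ages and the decade bin width below are illustrative choices:

```python
from collections import Counter

# Hypothetical ages, binned into decades to mimic a histogram
ages = [34, 37, 41, 43, 45, 48, 52, 55, 57, 61, 63, 68]
width = 10
bins = Counter((a // width) * width for a in ages)

# Text rendering: one '#' per observation in the bin
for start in sorted(bins):
    print(f"{start}-{start + width - 1}: {'#' * bins[start]}")
```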
DEEPER DATA CHARACTERISTICS
• Skewness, kurtosis, and normality checks (e.g. Q-Q plots) reveal deeper characteristics of the data.
• Checking for missing values and outliers, using both visual and statistical methods, is equally important.
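A sketch of the moment-based definitions of skewness and excess kurtosis, plus a simple missing-value count, using made-up data:

```python
import statistics as st

# Hypothetical right-skewed sample
x = [34, 37, 41, 43, 45, 48, 52, 55, 77, 88]
n, mu, sigma = len(x), st.mean(x), st.pstdev(x)

# Third and fourth standardised moments
skewness = sum((v - mu) ** 3 for v in x) / (n * sigma ** 3)        # > 0: right-skewed
excess_kurtosis = sum((v - mu) ** 4 for v in x) / (n * sigma ** 4) - 3

# Missing-value check on a column that may contain gaps
chol = [200, None, 230, 180, None, 250]
missing = sum(v is None for v in chol)
print(round(skewness, 3), missing)
```

A positive skewness confirms what comparing the mean and median already suggested for this sample.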
ADVANCED TECHNIQUES
• Correlation analysis, often visualised with heatmaps, quantifies relationships between pairs of variables.
• Dimensionality reduction (e.g. PCA) is introduced here conceptually; detailed treatment is reserved for later advanced courses.
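Each cell of a correlation heatmap holds a Pearson correlation coefficient, which can be computed from first principles. The paired age and heart-rate values below are hypothetical:

```python
# Pearson correlation: covariance scaled by the two standard deviations
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical paired measurements: age vs. maximum heart rate
age = [29, 34, 40, 45, 50, 55, 60, 65]
max_hr = [190, 185, 178, 172, 165, 160, 152, 148]
r = pearson(age, max_hr)
print(round(r, 3))
```

The strongly negative `r` here indicates that, in this toy sample, maximum heart rate decreases steadily with age; a heatmap simply arranges such coefficients for every variable pair.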
HANDS-ON EXERCISES
• Use the dataset provided to complete the following:
1. Calculate basic statistics.
2. Create relevant visualisations.
3. Identify outliers and missing values.
4. Interpret relationships between variables.
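A starter sketch for tasks 1 and 3, with an inline CSV string standing in for the provided dataset (the column names and rows are made up for illustration):

```python
import csv
import io
import statistics as st

# Inline stand-in for the provided dataset; replace with open("your_file.csv")
raw = "age,chol\n63,233\n41,\n56,236\n57,354\n45,NA\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Task 3: count missing values before computing statistics
chol = [float(r["chol"]) for r in rows if r["chol"] not in ("", "NA")]
missing = len(rows) - len(chol)

# Task 1: basic statistics on the cleaned column
print(missing, st.mean(chol), st.median(chol))
```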
NATURE OF VISUALISATION