EXPLORATORY DATA ANALYSIS
WHAT IS EDA?
• The analysis of datasets based on various numerical methods and
graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
• It facilitates discovering the unexpected as well as confirming the
expected.
• Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).
AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
• The goal of EDA is to open-mindedly explore data.
• EDA is detective work: unless the detective finds the clues, the judge or jury
has nothing to consider.
• Here, the judge or jury is confirmatory data analysis.
• Confirmatory data analysis goes further, assessing the strength of
the evidence.
• With EDA, we can examine the data and try to understand the meaning
of the variables, including what their abbreviations stand for.
Exploratory vs Confirmatory Data Analysis
EDA                                  CDA
• No hypothesis at first             • Start with a hypothesis
• Generate hypotheses                • Test the null hypothesis
• Uses graphical methods (mostly)    • Uses statistical models
STEPS OF EDA
• Generate good research questions.
• Data restructuring: you may need to create new variables from the existing ones.
- Instead of using two variables separately, compute a rate or percentage from them.
- Create dummy variables for categorical variables.
• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships, anomalies,
and unexpected behaviors.
• Try to identify confounding variables, interactions and multicollinearity, if
any.
• Handle missing observations.
• Decide whether a transformation is needed (on the response and/or explanatory variables).
• Decide on the hypotheses based on your research questions.
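The data restructuring step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the records, field names and values are hypothetical and are not taken from any dataset in these slides.

```python
# Hypothetical records; the field names and values are illustrative only.
records = [
    {"city": "A", "crimes": 120, "population": 40000, "region": "north"},
    {"city": "B", "crimes": 300, "population": 150000, "region": "south"},
    {"city": "C", "crimes": 45, "population": 9000, "region": "north"},
]

# Derived variable: crimes per 1,000 residents replaces two raw columns.
for row in records:
    row["crime_rate_per_1000"] = 1000 * row["crimes"] / row["population"]

# Dummy (indicator) variables: one 0/1 column per category, dropping the
# first category (in sorted order) as the reference level.
levels = sorted({row["region"] for row in records})
for row in records:
    for level in levels[1:]:
        row[f"region_{level}"] = int(row["region"] == level)

for row in records:
    print(row["city"], row["crime_rate_per_1000"], row["region_south"])
```

Dropping one level keeps the dummy columns from being perfectly collinear with an intercept, which matters later if the variables are used in a regression model.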
AFTER EDA
• Confirmatory data analysis: test the hypotheses with formal statistical
analysis.
• Draw conclusions and present your results clearly.
Classification of EDA*
• Exploratory data analysis is generally cross-classified in two ways. First,
each method is either non-graphical or graphical. And second, each
method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics,
while graphical methods obviously summarize the data in a diagrammatic
or pictorial way.
• Univariate methods look at one variable (data column) at a time, while
multivariate methods look at two or more variables at a time to explore
relationships. Usually our multivariate EDA will be bivariate (looking at
exactly two variables), but occasionally it will involve three or more
variables.
• It is almost always a good idea to perform univariate EDA on each of the
components of a multivariate EDA before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
EXAMPLE 1
Data from the Places Rated Almanac (Boyer and Savageau, 1985)
9 variables for 329 metropolitan areas in the USA
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city
Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding location)?
3. Are nearby cities similar?
4. Is there any relation between economic outlook and crime?
5. What else?
Examples of Variables
• Identifier(s):
- patient number,
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date,
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times
- any variables that may change over time
- outcome variable
Data Types and Measurement Scales
• Variables may be one of several types, and have a defined set of
valid values.
• Two main classes of variables are:
Continuous Variables: (Quantitative, numeric).
Continuous data can be rounded or binned to create categorical data.
Categorical Variables: (Discrete, qualitative).
Some categorical variables (e.g. counts) are sometimes treated as
continuous.
Categorical Data
• Unordered categorical data (nominal)
- 2 possible values (binary or dichotomous)
Examples: gender, alive/dead, yes/no.
- More than 2 possible values, with no order to the categories
Examples: marital status, religion, country of birth, race.
• Ordered categorical data (ordinal)
- Ratings or preferences
- Quality of life scales
- BCCI contracts
- IPL contracts (base price: 30 L, 50 L, 75 L, 1 cr, 2 cr)
- Number of copies of a recessive gene (0, 1 or 2)
EDA Part 2: Summarizing Data With Tables and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.
• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.
Summarizing Variables
• Categorical variables
- Frequency table: how many observations are in each category?
- Relative frequency table: what percent is in each category?
- Bar chart and other plots.
• Continuous variables
- Bin the observations (create categories, e.g., 0-10, 11-20, etc.), then treat them as
ordered categorical.
- Plots specific to continuous variables.
The goal for both categorical and continuous data is data reduction
while preserving/extracting key information about the process under
investigation.
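Binning a continuous variable can be sketched as follows. The values are hypothetical, and the sketch assumes non-negative data and half-open bins of width 10 (so 10.0 falls in "10-20"), which differs slightly from the closed integer bins written above.

```python
from collections import Counter

# Hypothetical continuous measurements (illustrative values only).
values = [3.2, 7.5, 10.0, 12.1, 15.8, 19.9, 24.3, 28.0, 31.4]

def bin_label(x, width=10):
    """Return a label such as '10-20' for the half-open bin [10, 20)."""
    lower = int(x // width) * width
    return f"{lower}-{lower + width}"

# Frequency table of the binned (now ordered categorical) variable.
binned = Counter(bin_label(v) for v in values)

# Print bins in numeric order of their lower edge.
for label in sorted(binned, key=lambda s: int(s.split("-")[0])):
    print(label, binned[label])
```

The bin width is a choice: wider bins reduce the data more but hide detail, so it is worth trying more than one width.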
Categorical Data Summaries
• A survey of 100 people asking for their favorite color results in the
following categories:
• Red: 30
• Blue: 40
• Green: 20
• Yellow: 10
Frequency Table
• Mode (Most Frequent Category)
• Example: From a set of responses about preferred transportation (Car,
Bike, Walk, Bus), if the counts are:
• Car: 50
• Bike: 30
• Walk: 10
• Bus: 10
then the mode is Car, the category with the highest count (50).
• Frequency Table: Categories with counts
• Relative Frequency Table: Percentage in each category
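A frequency table and the mode for the transportation example can be computed with the standard library. This is a minimal sketch: the raw responses are rebuilt from the counts given on the slide, since the individual responses are not available.

```python
from collections import Counter

# Rebuild raw responses from the counts given in the transportation example.
responses = ["Car"] * 50 + ["Bike"] * 30 + ["Walk"] * 10 + ["Bus"] * 10

freq = Counter(responses)                  # frequency table: category -> count
mode, mode_count = freq.most_common(1)[0]  # most frequent category and its count

for category, count in freq.most_common():
    print(f"{category:<5} {count}")
print("Mode:", mode)
```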
Relative Frequency Table
• A relative frequency table shows the proportion of each category
compared to the total.
• Example: In a class of 30 students:
• Male: 12
• Female: 18
• Relative Frequencies:
• Male: 12/30 = 0.4 (or 40%)
• Female: 18/30 = 0.6 (or 60%)
• Summary: 40% of students are male, and 60% are female.
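The same arithmetic, sketched in Python for the class of 30 students:

```python
# Counts from the class example: 12 male and 18 female students.
counts = {"Male": 12, "Female": 18}
total = sum(counts.values())

# Relative frequency = category count / total count.
rel_freq = {category: n / total for category, n in counts.items()}

for category, p in rel_freq.items():
    print(f"{category}: {p:.2f} ({p:.0%})")
```

Relative frequencies always sum to 1 (or 100%), which is a quick check that no category was missed.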
Graphing a Frequency Table - Bar Chart
A bar chart can visually represent the frequency or percentage of each
category.
Example: Plot a bar chart with the categories (Red, Blue, Green, Yellow)
on the x-axis and the frequency or percentage on the y-axis. The
heights of the bars represent the counts or percentages for each color.
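As a rough stand-in for a graphical bar chart, the sketch below prints a plain-text version for the color survey. The bars are drawn horizontally, so bar length (not height) represents the count. The helper name `text_bar_chart` and its `scale` parameter are made up for this illustration. A plotting library would normally be used instead.

```python
# Counts from the favorite color survey.
counts = {"Red": 30, "Blue": 40, "Green": 20, "Yellow": 10}

def text_bar_chart(data, scale=2):
    """Return one line per category; each '#' stands for `scale` observations."""
    width = max(len(name) for name in data)
    return [f"{name:<{width}} | {'#' * (n // scale)} {n}" for name, n in data.items()]

for line in text_bar_chart(counts):
    print(line)
```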
Chi-Square Test (for association between categorical variables)
• Example: If you want to test whether there is an association between
gender (Male, Female) and whether people prefer watching movies
at home or in the theater, you would collect data and perform a
chi-square test on the contingency table.
• Summary: The test indicates whether gender and movie-watching
preference are significantly associated, helping you assess
patterns or trends.
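The calculation behind the test can be sketched by hand for a 2x2 table. The counts below are made up for illustration, and the sketch applies no continuity correction. The closed-form p-value used here holds only for 1 degree of freedom; larger tables would need a statistics library.

```python
import math

# Hypothetical 2x2 contingency table (made-up counts, for illustration only).
# Rows: gender; columns: preference (home, theater).
observed = {
    "Male": {"home": 30, "theater": 20},
    "Female": {"home": 20, "theater": 30},
}

row_totals = {g: sum(cells.values()) for g, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for pref, n in cells.items():
        col_totals[pref] = col_totals.get(pref, 0) + n
grand_total = sum(row_totals.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for g, cells in observed.items():
    for pref, n in cells.items():
        expected = row_totals[g] * col_totals[pref] / grand_total
        chi2 += (n - expected) ** 2 / expected

dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# For 1 degree of freedom, the upper-tail chi-square probability has the
# closed form erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2)) if dof == 1 else None

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
```

A small p-value suggests that the two variables are associated. It says nothing about the strength or direction of the association, so the table itself should still be inspected.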
Continuous Data - Tables
Pie Chart
A pie chart can be used to show the proportion of each category in a whole.
• Example: For the above color preference survey, a pie chart would divide a circle
into segments, each representing the percentage of people who chose each color
(Blue, Red, Green, Yellow). The chart would visually display that Blue takes up the
largest segment.
Plotting Functions
R has several distinct plotting systems
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
lattice package
ggplot2 package
Techniques involved in Exploratory Data Analysis
1. Data Collection and Understanding
• Data Sources: Understanding where the data is coming from (e.g., databases, APIs, CSV files, spreadsheets).
• Types of Data: Distinguishing between numerical and categorical (nominal or ordinal) data.
• Data Structure: Exploring rows, columns, and the data types in the dataset (e.g., integer, float, object, etc.).
• Initial Data Inspection: Using basic functions such as pandas' head(), info(), and describe() to get a quick summary.
2. Data Cleaning
• Handling Missing Data: Imputing values (mean/median imputation, forward/backward fill, etc.), removing
incomplete records, or using algorithms that handle missing values directly.
• Handling Duplicates: Identifying and removing duplicate records.
• Data Transformation: Converting data types, changing the format, encoding categorical variables as
numbers, or scaling features.
• Outlier Detection and Treatment: Identifying outliers and deciding how to handle them (removal, capping,
transformation).
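Two of the cleaning steps, duplicate removal and median imputation, can be sketched with the standard library. The records are hypothetical, with None standing in for a missing value. In practice a data-frame library would usually handle this.

```python
import statistics

# Hypothetical records; None marks a missing age. Values are illustrative only.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 51},
    {"id": 3, "age": 51},   # exact duplicate of the previous record
    {"id": 4, "age": 29},
]

# 1. Remove exact duplicates, keeping the first occurrence.
seen = set()
deduped = []
for row in rows:
    key = tuple(sorted(row.items()))   # hashable fingerprint of the record
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Median imputation: fill missing ages with the median of the observed ones.
observed_ages = [r["age"] for r in deduped if r["age"] is not None]
median_age = statistics.median(observed_ages)
cleaned = [{**r, "age": median_age if r["age"] is None else r["age"]} for r in deduped]

print(cleaned)
```

Duplicates are removed before imputing so that repeated records do not distort the median.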
3. Univariate Analysis
• Summary Statistics: Mean, median, mode, range, variance, standard deviation,
skewness, and kurtosis.
• Histograms: Plotting the distribution of single variables to visualize their
frequency.
• Boxplots: Identifying the spread and central tendency, and detecting potential
outliers.
• Bar Charts: Visualizing the distribution of categorical variables.
• Density Plots: Visualizing the smooth distribution of a variable (Kernel Density
Estimation).
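The summary statistics listed above can be computed with Python's statistics module. The sample is hypothetical. Skewness and kurtosis are computed from population moments, which is one of several conventions, so other tools may report slightly different values.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values only).
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 20]

n = len(data)
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
value_range = max(data) - min(data)
variance = statistics.variance(data)      # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)          # sample standard deviation

# Moment-based skewness and excess kurtosis using population moments
# (libraries may apply bias corrections, so values can differ slightly).
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print(mean, median, mode, value_range)
print(round(variance, 2), round(std_dev, 2), round(skewness, 2), round(excess_kurtosis, 2))
```

Here the mean is larger than the median because the single large value pulls it upward, which is what the positive skewness reflects.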
4. Bivariate Analysis
• Scatter Plots: Analyzing the relationship between two continuous variables.
• Correlation Matrix: Identifying linear (Pearson) or monotonic (Spearman)
relationships between numerical features using correlation coefficients.
• Boxplots and Violin Plots: Comparing distributions of continuous data across
categorical groups.
• Heatmaps: Visualizing the correlation matrix or missing data patterns.
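The difference between the Pearson and Spearman coefficients can be sketched from first principles. The helper functions and the data are written for this illustration only. Spearman is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y = x**3 is monotonic in x but not linear.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print(round(r_pearson, 3), round(r_spearman, 3))
```

Because the relationship is perfectly monotonic but curved, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1.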
5. Multivariate Analysis
• Pairplots: Visualizing relationships between multiple continuous variables at once.
• Heatmaps for Correlation: Analyzing the correlation matrix between several
features.
• Principal Component Analysis (PCA): Reducing the dimensionality of the
dataset by projecting it onto the directions of greatest variance, which also helps
visualize high-dimensional data.
• t-SNE (t-Distributed Stochastic Neighbor Embedding): A non-linear
dimensionality reduction technique for high-dimensional data visualization.
• Pairwise Comparisons: Comparing distributions or relationships across multiple
dimensions.
6. Feature Engineering
• Feature Creation: Creating new variables based on existing data (e.g., date
extraction, text vectorization).
• Feature Scaling: Normalization and standardization techniques (e.g., Min-Max
Scaling, Z-Score Standardization).
• Feature Encoding: Techniques for encoding categorical variables (e.g., One-Hot
Encoding, Label Encoding, Target Encoding).
• Dimensionality Reduction: Using techniques like PCA, t-SNE, and Autoencoders
to reduce the number of features while retaining essential information.
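Scaling and encoding can be sketched without any external library. The feature values and names below are hypothetical. The z-scores use the population standard deviation, which is an assumption, since some tools divide by the sample standard deviation instead.

```python
import statistics

# Hypothetical feature values (illustrative only).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
colors = ["red", "blue", "green", "blue"]

# Min-max scaling maps the values onto [0, 1].
lo, hi = min(heights_cm), max(heights_cm)
min_max = [(h - lo) / (hi - lo) for h in heights_cm]

# Z-score standardization: subtract the mean, divide by the standard deviation
# (population standard deviation here; some tools use the sample version).
mu = statistics.mean(heights_cm)
sigma = statistics.pstdev(heights_cm)
z_scores = [(h - mu) / sigma for h in heights_cm]

# One-hot encoding: one 0/1 indicator per category, in sorted category order.
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Label encoding: map each category to an integer code.
codes = {cat: i for i, cat in enumerate(categories)}
label_encoded = [codes[c] for c in colors]

print(min_max)
print(categories, one_hot, label_encoded)
```

Label encoding imposes an arbitrary order on the categories, so it suits ordinal variables or tree-based models better than nominal variables in linear models.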
7. Handling Skewed Data
• Data Transformation: Applying log, square root, or Box-Cox transformations to
handle skewed distributions.
• Identifying Skewness: Inspecting histograms, or computing skewness and kurtosis
statistics and the tests based on them.
• Dealing with Skewed Target Variable: Transforming the target in regression
models to deal with non-normal target distributions.
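The effect of a log transformation on skewness can be sketched as follows. The values are hypothetical, and the skewness helper uses population moments without bias correction, which is one convention among several.

```python
import math

def skewness(data):
    """Moment-based skewness (population moments, no bias correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed values, e.g. incomes (illustrative only).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

# log1p computes log(1 + x), which is also defined at x = 0.
log_incomes = [math.log1p(x) for x in incomes]

skew_before = skewness(incomes)
skew_after = skewness(log_incomes)

print(round(skew_before, 2), round(skew_after, 2))
```

The transformation reduces the skewness for this sample but does not remove it, so it is worth re-checking the distribution after transforming.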
8. Data Visualization
• Plotting Techniques: Understanding how to use different types of plots like line
plots, histograms, boxplots, bar plots, heatmaps, and pie charts to visualize data.
• Seaborn/Matplotlib: Using Python libraries to create advanced visualizations and
customizing plots.
• Faceted Plots: Creating subsets of plots based on different categories or values to
explore relationships in the data.
• Interactive Plots: Using tools like Plotly and Dash for more advanced, interactive
visualizations.
9. Detecting Anomalies and Outliers
• Visual Techniques: Using scatter plots and box plots to spot
anomalous points.
• Statistical Methods: Using z-scores, modified z-scores, or formal tests (e.g., Grubbs' test)
for outlier detection.
• Robust Statistics: Using measures that are not sensitive to outliers, such as the median
and the interquartile range (IQR).
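Two common outlier rules can be sketched with the standard library: the 1.5 x IQR rule used by box plots, and the modified z-score based on the median absolute deviation (MAD). The data are hypothetical, and the 1.5 and 3.5 cutoffs are conventional choices, not fixed rules. The sketch assumes the MAD is not zero.

```python
import statistics

# Hypothetical measurements with one suspicious value (illustrative only).
data = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]

# IQR rule (Tukey's fences): flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

# Modified z-score: 0.6745 * (x - median) / MAD, with |score| > 3.5 as a
# commonly used cutoff. MAD is the median absolute deviation from the median.
median = statistics.median(data)
mad = statistics.median([abs(x - median) for x in data])
modified_z = [0.6745 * (x - median) / mad for x in data]
mz_outliers = [x for x, z in zip(data, modified_z) if abs(z) > 3.5]

print(iqr_outliers, mz_outliers)
```

Both rules rely on the median and quartiles, so the extreme value itself has little influence on the thresholds used to detect it.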
10. Time Series Analysis (if applicable)
• Trend Analysis: Identifying trends and seasonality in time series data.
• Autocorrelation: Using autocorrelation plots (ACF, PACF) to check for repeating
patterns in time series data.
• Decomposition: Decomposing time series into trend, seasonal, and residual
components.
• Stationarity Tests: Checking if the data is stationary using tests like the
Augmented Dickey-Fuller (ADF) test.
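The numbers behind an ACF plot can be sketched directly. The function below is the usual sample autocorrelation estimator, and the series is a made-up pattern with period 4. A statistics library would normally compute and plot this, along with the PACF.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at a given lag (the usual ACF estimator)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return numer / denom

# Hypothetical series with a repeating pattern of period 4 (illustrative only).
series = [1, 3, 5, 3] * 6

acf = {lag: autocorrelation(series, lag) for lag in range(0, 9)}

for lag, value in acf.items():
    print(lag, round(value, 3))
```

The autocorrelation is strongly positive at lags 4 and 8 and strongly negative at lag 2, which is the signature of a repeating pattern with period 4.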
11. Data Sampling Techniques
• Random Sampling: Drawing random samples from the dataset for quick analysis.
• Stratified Sampling: Ensuring samples represent all key segments of the
population.
• Bootstrapping: A method of resampling to estimate the variability of a statistic.
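Bootstrapping the mean can be sketched as follows. The sample, the seed and the number of resamples are arbitrary choices for illustration. The interval shown is the simple percentile interval, which is one of several bootstrap interval methods.

```python
import random
import statistics

# Hypothetical sample (illustrative values only).
sample = [12, 15, 9, 20, 17, 14, 11, 18, 16, 13]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
n_resamples = 2000

# Each bootstrap resample draws len(sample) values with replacement.
boot_means = [
    statistics.mean(rng.choices(sample, k=len(sample)))
    for _ in range(n_resamples)
]

# The spread of the resampled means estimates the standard error of the mean.
boot_se = statistics.stdev(boot_means)

# Percentile 95% interval: roughly the 2.5th and 97.5th percentiles
# of the resampled means.
ordered = sorted(boot_means)
ci_low = ordered[int(0.025 * n_resamples)]
ci_high = ordered[int(0.975 * n_resamples) - 1]

print(round(boot_se, 2), ci_low, ci_high)
```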
12. Handling Categorical Variables
• Frequency Distribution: Checking the distribution of values in categorical
variables.
• Chi-Square Test: Testing for independence between categorical variables.
• Cross-tabulation: Analyzing the relationship between two categorical variables
using contingency tables.
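A cross-tabulation can be built by counting pairs of values. The observations below are made up for illustration. A data-frame library would typically provide this as a single call.

```python
from collections import Counter

# Hypothetical paired observations (gender, preference); illustrative only.
observations = [
    ("Male", "home"), ("Male", "theater"), ("Female", "home"),
    ("Female", "home"), ("Male", "home"), ("Female", "theater"),
    ("Female", "home"), ("Male", "theater"),
]

cell_counts = Counter(observations)
rows = sorted({g for g, _ in observations})
cols = sorted({p for _, p in observations})

# Build the contingency table as nested dicts: table[row][col] -> count.
table = {r: {c: cell_counts[(r, c)] for c in cols} for r in rows}

# Print the table with a header row of column categories.
print(f"{'':<8}" + "".join(f"{c:>9}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{table[r][c]:>9}" for c in cols))
```

The resulting table is the input that a chi-square test of independence works on.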
13. Data Integrity and Quality Assessment
• Consistency Checks: Ensuring data is consistent across variables and records.
• Missing Data Patterns: Identifying missing data patterns and deciding whether
they are missing completely at random (MCAR), missing at random (MAR), or
missing not at random (MNAR).
• Data Quality Metrics: Evaluating the quality of data by checking for accuracy,
completeness, consistency, and timeliness.
14. Multicollinearity
• Variance Inflation Factor (VIF): Measuring multicollinearity between predictor
variables.
• Condition Number: Checking the stability of the regression model when
multicollinearity is present.
• Removing Highly Correlated Features: Addressing multicollinearity by
removing redundant variables.
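The VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. With exactly two predictors, R_j^2 equals the squared Pearson correlation between them, which makes a small sketch possible without a regression library. The data are hypothetical, and the general case with more predictors needs a full regression.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predictors: x2 is almost a multiple of x1 (illustrative only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]

r = pearson(x1, x2)

# VIF_j = 1 / (1 - R_j^2); with exactly two predictors, R_j^2 equals r^2.
vif = 1.0 / (1.0 - r ** 2)

print(round(r, 4), round(vif, 1))
```

A VIF well above common rule-of-thumb thresholds (5 or 10 are often quoted) suggests that one of the two predictors is redundant.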
15. Handling Imbalanced Data
• Resampling Methods: Over-sampling the minority class (e.g., with SMOTE) or
under-sampling the majority class.
• Class Weight Adjustment: Adjusting the weight of classes in models
to give more importance to the minority class.
• Anomaly Detection: Using techniques like Isolation Forest or
One-Class SVM to detect rare events or classes.
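Class weighting and over-sampling can be sketched on a toy label vector. The labels are hypothetical. The weights use a common inverse-frequency heuristic, and the resampling shown is plain random over-sampling, not SMOTE, which creates synthetic points instead of duplicates.

```python
import random
from collections import Counter

# Hypothetical imbalanced labels: 90 negatives, 10 positives (illustrative only).
labels = [0] * 90 + [1] * 10
counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Inverse-frequency class weights: n_samples / (n_classes * class_count).
# Rarer classes receive larger weights.
class_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Random over-sampling: duplicate minority examples (drawn with replacement)
# until both classes have the same size.
rng = random.Random(0)          # fixed seed so the sketch is reproducible
majority_size = max(counts.values())
minority_idx = [i for i, y in enumerate(labels) if y == 1]
extra_idx = rng.choices(minority_idx, k=majority_size - len(minority_idx))
balanced_labels = labels + [labels[i] for i in extra_idx]

print(class_weights)
print(Counter(balanced_labels))
```

Resampling should be applied to the training data only, so that the evaluation data still reflects the real class proportions.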