Exploratory Data Analysis (EDA) in Data Analytics
Understanding and Visualizing Your Data
What is EDA?
Initial step in data analysis.
Helps summarize main characteristics of data.
Lays foundation for further analysis.
Importance of EDA
•Detects patterns, anomalies, and relationships.
•Ensures data quality and understanding.
Objectives of EDA
• Identify data distribution and variability.
• Detect missing values and outliers.
• Examine relationships between variables.
• Generate hypotheses for further analysis.
Types of Data in EDA
Quantitative Data: Numerical values (e.g., sales, age).
Qualitative Data: Categorical values (e.g., gender, location).
Time-Series Data: Observations over time.
Multivariate Data: Multiple variables analyzed simultaneously.
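
As a rough illustration of the data types above, the following sketch sorts the columns of a small table by type using pandas. The table, its column names, and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy table (invented values) mixing the data types listed above.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "sales": [120.5, 98.0, 143.2],
    "age": [34, 51, 27],
    "location": ["north", "south", "north"],
})
# Mark the text column as categorical (qualitative) data.
df["location"] = df["location"].astype("category")

# Group column names by the kind of data they hold.
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(include="category").columns.tolist()
time_columns = df.select_dtypes(include="datetime").columns.tolist()

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
print("Time-series index candidates:", time_columns)
```

Knowing which columns fall into which group is one way to decide which summaries and plots are appropriate for each.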
Key Steps in EDA
Data Collection
•Gather data from relevant sources.
Data Cleaning
•Handle missing, duplicate, or inconsistent data.
Data Transformation
•Normalize or encode variables.
Data Visualization
•Create charts and graphs to identify patterns.
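
The cleaning and transformation steps above can be sketched with pandas as follows. This is a minimal example under stated assumptions: the data is a hypothetical toy table, and median imputation, min-max scaling, and one-hot encoding are just one reasonable choice for each step.

```python
import pandas as pd

# Hypothetical toy data: one duplicate row, one missing age, inconsistent casing.
raw = pd.DataFrame({
    "age": [22.0, 35.0, None, 35.0, 58.0],
    "city": ["Paris", "london", "Paris", "london", "Berlin"],
    "sales": [100.0, 250.0, 175.0, 250.0, 400.0],
})

# Data cleaning: fix inconsistent labels, drop exact duplicates, fill missing age.
clean = raw.copy()
clean["city"] = clean["city"].str.title()
clean = clean.drop_duplicates().reset_index(drop=True)
clean["age"] = clean["age"].fillna(clean["age"].median())

# Data transformation: min-max normalize sales to [0, 1], one-hot encode city.
sales_range = clean["sales"].max() - clean["sales"].min()
clean["sales_scaled"] = (clean["sales"] - clean["sales"].min()) / sales_range
encoded = pd.get_dummies(clean, columns=["city"])

print(encoded)
```

The order matters here: standardizing the labels first lets `drop_duplicates` recognize rows that differ only in casing.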
Tools for EDA
Programming Languages: Python (Pandas, NumPy, Matplotlib,
Seaborn), R.
Software Tools: Excel, Tableau, Power BI.
Specialized Libraries: Scikit-learn and Plotly (Python); ggplot2 and dplyr (R).
Statistical Techniques in EDA
Descriptive Statistics: Mean, median, mode, standard deviation.
Correlation Analysis: Pearson and Spearman coefficients.
Hypothesis Testing: t-tests, chi-square tests.
Descriptive statistics: a set of methods used to summarize and describe the main features of a dataset, such as its central tendency, variability, and distribution.
Correlation analysis: a statistical method that measures the strength and direction of the association between two variables. The Pearson coefficient captures linear relationships; the Spearman coefficient captures monotonic ones.
Hypothesis testing: a statistical method that assesses whether sample data provide enough evidence to reject a null hypothesis.
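
The three techniques above can be sketched on a small table as follows. The data is hypothetical, and the test statistic is computed by hand (Welch's form of the two-sample t statistic) purely to show what is being measured; in practice a library such as SciPy (`scipy.stats.ttest_ind`, `scipy.stats.chi2_contingency`) would also supply the p-value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 74, 90],
    "group": ["a", "a", "a", "b", "b", "b"],
})

# Descriptive statistics: central tendency and spread of one variable.
summary = {
    "mean": df["score"].mean(),
    "median": df["score"].median(),
    "std": df["score"].std(),  # sample standard deviation (ddof=1)
}

# Correlation analysis.
pearson = df["hours"].corr(df["score"])                 # linear association
spearman = df["hours"].rank().corr(df["score"].rank())  # Pearson on ranks = Spearman

# Hypothesis testing: Welch's t statistic for the difference in group means.
a = df.loc[df["group"] == "a", "score"]
b = df.loc[df["group"] == "b", "score"]
t_stat = (a.mean() - b.mean()) / np.sqrt(a.var() / len(a) + b.var() / len(b))

print(summary)
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
print(f"Welch t statistic = {t_stat:.3f}")
```

In this toy data the scores rise with every extra hour, so the Spearman coefficient is 1 even though the Pearson coefficient is slightly lower; the relationship is perfectly monotonic but not perfectly linear.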
Visualization Techniques
Univariate Analysis: Histograms, box plots, pie charts.
Bivariate Analysis: Scatter plots, bar charts.
Multivariate Analysis: Heatmaps, pair plots.
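
As a rough sketch, one plot from each category above can be drawn with Matplotlib as shown here. The data is synthetic (randomly generated with a fixed seed), and the variable names `x`, `y`, `z` and the output filename are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends on x, z is unrelated noise.
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 2 * x + rng.normal(0, 5, 200)
z = rng.normal(0, 1, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Univariate: distribution of a single variable.
axes[0].hist(x, bins=20)
axes[0].set_title("Univariate: histogram of x")

# Bivariate: relationship between two variables.
axes[1].scatter(x, y, s=8)
axes[1].set_title("Bivariate: x vs. y")

# Multivariate: correlation matrix of all variables shown as a heatmap.
corr = np.corrcoef(np.vstack([x, y, z]))
im = axes[2].imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(3))
axes[2].set_xticklabels(["x", "y", "z"])
axes[2].set_yticks(range(3))
axes[2].set_yticklabels(["x", "y", "z"])
axes[2].set_title("Multivariate: correlation heatmap")
fig.colorbar(im, ax=axes[2])

fig.tight_layout()
fig.savefig("eda_plots.png")
```

Seaborn offers higher-level versions of the same plots; plain Matplotlib is used here only to keep the dependencies small.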
Terms
• Univariate analysis: a statistical method that examines a single variable in a dataset.
• Bivariate analysis: the analysis of two variables (often denoted X and Y) to determine the empirical relationship between them.
• Multivariate analysis: a statistical method that analyzes multiple variables at once to identify patterns and relationships.
Handling Outliers and Missing Data
Techniques for Missing Data: Imputation, deletion, interpolation.
Dealing with Outliers: Z-score analysis, IQR method.
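
The techniques above can be sketched with pandas as follows. The series is hypothetical, and the cutoffs (1.5 × IQR and |z| > 2) are assumptions: 1.5 × IQR is the conventional fence, while the z-score threshold is lowered from the commonly quoted 3 because a sample this small cannot produce a z-score that large.

```python
import pandas as pd

# Hypothetical readings: two gaps and one suspicious spike.
s = pd.Series([10.0, 12.0, None, 11.0, 13.0, None, 12.0, 95.0, 11.0, 10.0])

# Missing data: three common options.
dropped = s.dropna()             # deletion
imputed = s.fillna(s.median())   # imputation with the median
interpolated = s.interpolate()   # linear interpolation between neighbours

# Outliers, IQR method: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = dropped.quantile(0.25), dropped.quantile(0.75)
iqr = q3 - q1
iqr_outliers = dropped[(dropped < q1 - 1.5 * iqr) | (dropped > q3 + 1.5 * iqr)]

# Outliers, z-score analysis: flag values far from the mean in std units.
# With only 8 values the largest possible |z| is about 2.47, so 2 is used here.
z = (dropped - dropped.mean()) / dropped.std()
z_outliers = dropped[z.abs() > 2]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

The median is used for imputation because, unlike the mean, it is barely affected by the spike. The same spike inflates the mean and standard deviation used by the z-score, which is one reason the IQR method is often preferred on small or skewed samples.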
Challenges in EDA
Large datasets and computational complexity.
Missing or inconsistent data.
Overfitting to visual patterns.
Misinterpreting visualizations.
Case Study/Example
Dataset: Use a popular dataset like Titanic, Iris, or a real-world business dataset.
Steps: Show how EDA was performed with key insights and visualizations.