Exploratory Data Analysis

1. What is exploratory data analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using statistical graphics and other visual methods. The primary goal of EDA is to explore, understand, and investigate the data to discover patterns, anomalies, and relationships, and to generate initial hypotheses rather than to confirm pre-existing ones.

Here’s what it involves:

1. Looking at the data
   ○ Checking how many rows and columns there are.
   ○ Seeing what type of data each column has (numbers, text, dates, etc.).
2. Summarizing the data
   ○ Finding the average, minimum, maximum, and other stats for numbers.
   ○ Counting how many times each category appears for categorical data.
3. Finding patterns and relationships
   ○ Seeing if two things are related (e.g., does more study time usually mean higher scores?).
   ○ Checking for trends or unusual spikes.
4. Detecting problems
   ○ Missing values (empty spots).
   ○ Outliers (values that are way off from the rest).
   ○ Mistakes or inconsistencies.
5. Visualizing the data
   ○ Making charts and graphs to understand things faster.
   ○ Common ones: histograms, scatter plots, box plots, bar charts.

2. Exploratory data analysis includes:


1. Data Collection & Loading

● Gathering and importing data from CSV, Excel, SQL databases, APIs, etc.
● Checking data formats and consistency.

Example tools: pandas, numpy, SQL.
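
A minimal loading sketch with pandas (the file name sales_data.csv is hypothetical):

```python
import pandas as pd

# Read a CSV file into a DataFrame (file name is hypothetical)
df = pd.read_csv("sales_data.csv")

# Excel and SQL sources follow the same pattern:
# df = pd.read_excel("sales_data.xlsx")
# df = pd.read_sql("SELECT * FROM sales", connection)

# Quick consistency check: column names and data types
print(df.dtypes)
```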

2. Data Cleaning

● Handling missing values (imputation, removal, etc.)
● Removing duplicates
● Correcting inconsistent data types
● Detecting and handling outliers or incorrect entries.
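
A short cleaning sketch, assuming df is a pandas DataFrame and that order_date, price, and quantity are hypothetical column names:

```python
import pandas as pd

# Drop exact duplicate rows
df = df.drop_duplicates()

# Fix an inconsistent data type: parse a date column,
# turning unparseable entries into NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Handle missing values: drop rows missing a critical column,
# then impute a numeric column with its median
df = df.dropna(subset=["price"])
df["quantity"] = df["quantity"].fillna(df["quantity"].median())
```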

3. Data Profiling (Basic Statistics)

● Checking data types (numeric, categorical, datetime)
● Summary statistics:
   ○ Numerical: mean, median, standard deviation, min, max, quantiles
   ○ Categorical: frequency counts, mode
● Shape and size of data: df.shape, df.info()
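
These checks are one-liners in pandas (df is assumed loaded; region is a hypothetical column):

```python
# Shape and schema
print(df.shape)   # (rows, columns)
df.info()         # column dtypes and non-null counts

# Numerical summary: mean, std, min, max, quartiles
print(df.describe())

# Categorical summary: frequency counts and mode
print(df["region"].value_counts())
print(df["region"].mode())
```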

4. Univariate Analysis

Analyzing one variable at a time.

● Numerical variables: Histograms, boxplots, density plots
● Categorical variables: Bar plots, count plots, pie charts

Goal: Understand distribution and spot anomalies.
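
A quick sketch with seaborn, assuming df is a pandas DataFrame with hypothetical columns price (numerical) and region (categorical):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numerical variable: distribution and outliers
sns.histplot(df["price"])    # histogram of the distribution
plt.show()
sns.boxplot(x=df["price"])   # boxplot highlights outliers
plt.show()

# Categorical variable: frequency of each category
sns.countplot(x="region", data=df)
plt.show()
```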


5. Bivariate / Multivariate Analysis

Exploring relationships between two or more variables.

● Numerical vs Numerical: Scatter plots, correlation matrix, pair plots
● Numerical vs Categorical: Grouped boxplots, violin plots, ANOVA tests
● Categorical vs Categorical: Crosstabs, stacked bar charts, heatmaps
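
One sketch per pairing (df and all column names are hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Numerical vs Numerical: scatter plot
sns.scatterplot(x="hours_studied", y="exam_score", data=df)
plt.show()

# Numerical vs Categorical: grouped boxplot
sns.boxplot(x="region", y="price", data=df)
plt.show()

# Categorical vs Categorical: crosstab rendered as a heatmap
ct = pd.crosstab(df["region"], df["segment"])
sns.heatmap(ct, annot=True, fmt="d")
plt.show()
```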

6. Correlation and Feature Relationships

● Check correlations using:
   ○ Pearson, Spearman, or Kendall coefficients
● Identify multicollinearity (important for machine learning models).
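
A sketch of both checks with pandas (df is assumed loaded; the |r| > 0.9 threshold is an illustrative choice):

```python
import numpy as np

# Correlation matrix over numeric columns only;
# method can be "pearson", "spearman", or "kendall"
corr = df.select_dtypes("number").corr(method="pearson")
print(corr)

# Rough multicollinearity check: list feature pairs with |r| > 0.9
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle only
pairs = corr.where(mask).stack()
print(pairs[pairs.abs() > 0.9])
```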
7. Data Visualization

Using plots to see patterns, trends, or anomalies:

● Histograms & Boxplots – distribution & outliers
● Heatmaps – correlation
● Pairplots – relationships between multiple features
● Time-series plots – trends over time

Tools: matplotlib, seaborn, plotly, Power BI, Tableau.

8. Outlier Detection & Handling

● Visual inspection with boxplots or scatter plots
● Statistical methods like Z-score, IQR (Interquartile Range)
● Decide whether to remove, cap, or transform them.
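
Both statistical methods take only a few lines of pandas/numpy (price is a hypothetical numeric column):

```python
import numpy as np

s = df["price"]  # hypothetical numeric column

# Z-score: flag values more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 3]

# IQR: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```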

9. Feature Engineering Insights

● Detect skewness and the need for scaling or transformation (log, sqrt, etc.)
● Identify potential new features or interactions.
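
For example, checking skewness and applying a log transform (price is a hypothetical column):

```python
import numpy as np

# Skewness near 0 means roughly symmetric;
# a large positive value means a long right tail
print(df["price"].skew())

# A log transform often tames right-skewed, non-negative data
df["log_price"] = np.log1p(df["price"])  # log(1 + x) also handles zeros
```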

10. Reporting Insights

● Summarizing findings through reports or dashboards
● Documenting data issues, distributions, and correlations to guide modeling

3. EDA in the data analysis and data science process:


1. Reality → Raw Data Collected

● Data originates from the real world:
   ○ Business systems
   ○ Sensors, IoT devices
   ○ Surveys, transactions, logs
   ○ External sources (APIs, public datasets)
● This stage captures raw, unprocessed data, often messy and unstructured.
2. Raw Data → Data is Processed

● The raw data is processed into a structured format.
● Activities include:
   ○ Formatting or parsing files
   ○ Merging multiple data sources
   ○ Handling missing or inconsistent values
3. Data is Processed → Clean Dataset

● Cleaning and transforming data to make it usable.
● Key tasks:
   ○ Removing duplicates
   ○ Converting data types (e.g., text to datetime)
   ○ Handling outliers
   ○ Normalizing values
   ○ Encoding categorical variables

A clean dataset is now ready for analysis.

4. Clean Dataset → Exploratory Data Analysis (EDA)

● At this stage, EDA is performed to:
   ○ Summarize and visualize the data
   ○ Identify patterns, trends, correlations
   ○ Detect outliers or anomalies
   ○ Guide feature engineering for modeling

This step helps analysts understand the data deeply before modeling.

5. Clean Dataset / EDA → Models & Algorithms

● Insights from EDA inform the modeling phase:
   ○ Selecting appropriate algorithms (e.g., regression, decision trees, clustering)
   ○ Defining feature sets
   ○ Handling imbalanced data or scaling needs
● Models are then trained, validated, and optimized.
6. Models / Clean Dataset → Data Product

● The output of models (predictions, forecasts, scores, clusters) becomes a data product, such as:
   ○ Recommendation systems
   ○ Dashboards
   ○ Predictive scoring tools
7. Data Product → Visualize Report

● Results and key insights are visualized for better interpretation.
● Common visualization tools: Tableau, Power BI, Matplotlib, Seaborn
8. Visualize Report → Make Decisions

● Decision-makers use the insights and visual reports to:
   ○ Improve business processes
   ○ Launch new strategies
   ○ Monitor KPIs
   ○ Optimize operations
4. What is visualization?

EDA visualizations are pictures that help you understand your data. Instead of reading numbers in a table, you can see patterns, trends, and problems right away.

Histogram

● Shows how values are spread.
● Example: Ages of people in a survey. You can see if most people are young, old, or in between.

Bar Chart

● Shows counts of categories.
● Example: Number of people in each favorite color category.

Box Plot

● Shows the distribution and outliers.
● Example: Salaries in a company. You can see the median, range, and any very high or low salaries.

Scatter Plot

● Shows the relationship between two numbers.
● Example: Study hours vs. exam scores. You can see if more study usually means higher scores.

Line Plot

● Shows trends over time.
● Example: Stock prices over a year. You can see rises and falls.

Pie Chart (less common in serious data science)

● Shows parts of a whole.
● Example: Market share of different phone brands.
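
All six chart types are one-liners in matplotlib; a sketch with hypothetical columns (age, salary, hours, score, date, price) and toy category data:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 6))

axes[0, 0].hist(df["age"])                             # histogram
axes[0, 1].bar(["red", "blue", "green"], [40, 25, 15]) # bar chart (toy counts)
axes[0, 2].boxplot(df["salary"].dropna())              # box plot
axes[1, 0].scatter(df["hours"], df["score"])           # scatter plot
axes[1, 1].plot(df["date"], df["price"])               # line plot
axes[1, 2].pie([55, 30, 15], labels=["A", "B", "C"])   # pie chart

plt.tight_layout()
plt.show()
```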

5. Steps involved in EDA

1. Data sourcing: Data sourcing is the process of identifying, collecting, and integrating relevant data from various internal and external sources to achieve specific business goals, inform decision-making, and support operations. It involves defining objectives, finding suitable sources such as databases, APIs, or public records, and systematically gathering the data, which is then prepared for analysis and use within an organization's data infrastructure.
1. Internal Sources
   ○ Data your organization already has.
   ○ Examples: sales records, customer info, website logs.
2. External Sources
   ○ Public datasets or data from outside your organization.
   ○ Examples: government databases, Kaggle datasets, APIs (like Twitter or weather data).
3. Generated Data
   ○ You can also create data yourself using surveys, experiments, or simulations.
2. Data cleaning: After collecting the data, the next step is data cleaning. Data cleaning means getting rid of any information that doesn't need to be there and cleaning up mistakes. It is the process of cleaning the data to improve its quality for further analysis and for building a machine learning model. The benefit of data cleaning is that all the incorrect and irrelevant data is gone, leaving good-quality data that helps improve the accuracy of our machine learning model. The following are some steps involved in data cleaning:

>Handle Missing Values:

1. Delete rows/columns: This is the most commonly used method to handle missing values. Rows can be deleted if they have an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.

2. Replace with mean/median/mode: This method can be used on independent variables when they are numerical. On categorical features, we apply the mode to fill in the missing values.

3. Algorithm imputation: Some machine learning algorithms, such as KNN, Naive Bayes, and random forest, can handle missing values in the dataset natively.
4. Predicting the missing values: A prediction model is one of the more advanced methods to handle missing values. In this method, the rows with no missing values become the training set, the rows with missing values become the test set, and the missing variable is treated as the target variable.
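
A sketch of the simpler strategies plus one model-based imputer (column names are hypothetical; scikit-learn's KNNImputer is one option for algorithm imputation):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Deletion and simple imputation with pandas
df = df.dropna(subset=["target"])                          # delete rows
df["income"] = df["income"].fillna(df["income"].median())  # numeric: median
df["city"] = df["city"].fillna(df["city"].mode()[0])       # categorical: mode

# Model-based imputation: fill gaps using the 5 most similar rows
num_cols = df.select_dtypes("number").columns
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
```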

>Standardization of the data: Standardization and feature scaling are techniques used to adjust the range or distribution of data so that all features (columns) are comparable.

1. Standard Scaler: Standardization scales each feature based on the standard normal distribution.
Formula: for each value x,

z = (x − μ) / σ

where:

● x = the original value
● μ = mean of the column
● σ = standard deviation of the column

2. Normalization: Normalization scales your features down to the range 0 to 1, typically via min-max scaling: x' = (x − min) / (max − min).

Think of it like making sure all players in a game start at the same line — no unfair advantages.
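
Both scalers are available in scikit-learn; a sketch with hypothetical column names:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_cols = ["age", "income"]  # hypothetical numeric columns

# Standardization: mean 0, standard deviation 1
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Normalization: rescale to the [0, 1] range
# (apply one or the other, not both)
# df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```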

>Outlier Treatment: Outliers are the most extreme values in the data. They are abnormal observations that deviate from the norm; outliers do not fit the normal behavior of the data.

<Detect outliers using the following methods:
1. Boxplot
2. Histogram
3. Scatter plot
4. Z-score
5. Interquartile range (values beyond 1.5 times the IQR)


<Handle outliers using the following methods:
1. Remove the outliers.
2. Replace outliers with suitable values using the quantile method or the interquartile range.
3. Use ML models that are robust to outliers, such as decision trees, Naive Bayes, and ensemble methods.
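
For instance, the quantile method can be implemented by clipping values at chosen percentiles (price is a hypothetical column):

```python
# Cap outliers at the 1st and 99th percentiles
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(lower=low, upper=high)
```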
>Handle invalid values
1. Encode Unicode properly: If the data is being read as junk characters, try changing the encoding, e.g., cp1252 instead of utf-8.

2. Convert incorrect data types: Convert incorrectly typed data to the correct data types for ease of analysis. For example, if numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc.
Some common data type corrections are: string to number (“12,300” to 12300), string to date (“2013-aug” to “2013/08”), and number to string (“PIN code 110001” stored as the string “110001”), etc.
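
These corrections map directly onto pandas conversions (column names are hypothetical):

```python
import pandas as pd

# String to number: strip the thousands separator first
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")

# String to date: parse values like "2013-aug"
df["month"] = pd.to_datetime(df["month"], format="%Y-%b", errors="coerce")

# Number to string: PIN codes are identifiers, not quantities
df["pin_code"] = df["pin_code"].astype(str)
```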

3. Correct values that go beyond range: If some values are beyond the plausible range, e.g., a temperature below −273.15 °C (0 K), you would need to correct them as required. A close look will help you check whether there is scope for correction or whether the value needs to be removed.

4. Correct wrong structure: Values that don't follow a defined structure can be removed. E.g., in a dataset containing PIN codes of Indian cities, a PIN code of 12 digits would be an invalid value and needs to be removed; similarly, a phone number of 12 digits would be an invalid value.

3. Types of Data

1. Qualitative: A variable that describes a quality of the population (categorical values).

>Nominal
>Ordinal

2. Quantitative: A variable that quantifies the population (numerical values).

>Discrete
>Continuous

4. Types of analysis

1. Univariate analysis: Univariate Analysis is a type of statistical analysis that focuses on one variable at a time. It summarizes and describes the main characteristics of that variable to understand its distribution, central tendency, and dispersion. “Uni” means one. It's often the first step in Exploratory Data Analysis (EDA).

>Types of Univariate Analysis

The type of analysis depends on the type of data:
a) For Numerical (Continuous) Data
● Descriptive Statistics
   ○ Mean, Median, Mode
   ○ Minimum, Maximum
   ○ Variance, Standard Deviation, Range, Interquartile Range (IQR)
● Visualizations
   ○ Histogram
   ○ Box plot
   ○ Density plot

b) For Categorical Data
● Frequency Distribution
   ○ Counts and percentages of each category
● Visualizations
   ○ Bar plot
   ○ Pie chart

<Key Objectives

● Understand data distribution
● Identify outliers
● Detect data entry errors
● Check assumptions for further analysis (like normality for parametric tests)

2. Bivariate Analysis: Bivariate analysis is a statistical method that involves the analysis of two variables to determine the relationship, correlation, or association between them. It is often used in data analysis to understand how one variable changes with respect to another. It is done for both numerical and categorical data.

Key Objectives of Bivariate Analysis

● To examine the relationship between two variables.
● To determine the strength and direction of the relationship.
● To check if the relationship is causal or just an association.

Types of Bivariate Analysis: The type of analysis depends on the data types of the two variables (categorical or numerical), as summarized below.

Variable 1  | Variable 2  | Analysis / Test                                      | Example
Numerical   | Numerical   | Correlation, Regression, Scatter Plot                | Height vs. Weight
Numerical   | Categorical | t-test, ANOVA, Box Plot                              | Income vs. Gender
Categorical | Categorical | Chi-square test, Cross-tabulation, Stacked Bar Chart | Gender vs. Smoking Status
Techniques of Bivariate Analysis

1. Numerical vs Numerical

● Correlation Coefficient (Pearson, Spearman) → Checks strength and direction of a linear/non-linear relationship.
● Scatter Plot → Visual representation of the relationship.
● Simple Linear Regression → Predicts one variable based on the other.

Example: Relationship between hours studied and exam scores.
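
A sketch of the correlation checks with SciPy, using the hours-studied example (column names are hypothetical):

```python
from scipy import stats

# Pearson: strength and direction of the linear relationship
r, p = stats.pearsonr(df["hours_studied"], df["exam_score"])
print(f"Pearson r = {r:.2f}, p-value = {p:.3f}")

# Spearman: rank-based alternative for monotonic, non-linear relationships
rho, p = stats.spearmanr(df["hours_studied"], df["exam_score"])
print(f"Spearman rho = {rho:.2f}, p-value = {p:.3f}")
```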
2. Numerical vs Categorical

● t-test → Compares means between two groups.
● ANOVA (Analysis of Variance) → Compares means across more than two groups.
● Box Plots → Visualize distribution across categories.

Example: Average salary across different education levels.
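
A sketch of the ANOVA for the salary example, using SciPy (column names are hypothetical):

```python
from scipy import stats

# One group of salaries per education level
groups = [g["salary"].dropna() for _, g in df.groupby("education")]

# One-way ANOVA: do mean salaries differ across levels?
f_stat, p = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p-value = {p:.3f}")
```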
3. Categorical vs Categorical

● Cross-tabulation (Contingency Table) → Shows frequency distribution.
● Chi-square test of independence → Checks if variables are independent.
● Stacked Bar Charts → Visualize proportions.
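
A sketch of all three techniques with pandas and SciPy, using the gender vs. smoking example (column names are hypothetical):

```python
import pandas as pd
from scipy import stats

# Cross-tabulation: frequency of each category combination
table = pd.crosstab(df["gender"], df["smoking_status"])
print(table)

# Chi-square test of independence on the contingency table
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p:.3f}")

# Stacked bar chart of row-wise proportions
table.div(table.sum(axis=1), axis=0).plot(kind="bar", stacked=True)
```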

3. Multivariate Analysis: Multivariate analysis (MVA) is a set of statistical techniques used to analyze data that involves more than one dependent or independent variable simultaneously. It helps uncover relationships, patterns, and structures in complex datasets.

It's used when:

● You want to see how several factors influence an outcome.
● You need to find patterns or groups in complex data.
● You want to predict something using multiple inputs.
Situation               | Variables                            | Purpose
Predicting house prices | Size, location, age, number of rooms | See how all factors together affect price
Health study            | Age, diet, exercise, stress levels   | Find what combination impacts health most
Marketing               | Age, income, preferences, location   | Identify customer groups

Common Methods

● Multiple Regression – predicts one outcome based on several inputs.
● Factor Analysis – reduces many variables into fewer underlying factors.
● Cluster Analysis – groups similar data points together.
● MANOVA (Multivariate Analysis of Variance) – compares multiple dependent variables across groups.
● Principal Component Analysis (PCA) – simplifies data by finding key components.
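
As one illustration, a PCA sketch with scikit-learn (assuming df is a pandas DataFrame with numeric columns):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardize first
X = StandardScaler().fit_transform(df.select_dtypes("number").dropna())

# Keep 2 components and check how much variance they retain
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
```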
