Exploratory Data Analysis

1. What is exploratory data analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using statistical graphics and other visual methods. The primary goal of EDA is to explore, understand, and investigate the data to discover patterns, anomalies, and relationships, and to generate initial hypotheses rather than to confirm pre-existing ones.

Here’s what it involves:

1. Looking at the data
   ○ Checking how many rows and columns there are.
   ○ Seeing what type of data each column has (numbers, text, dates, etc.).
2. Summarizing the data
   ○ Finding the average, minimum, maximum, and other stats for numbers.
   ○ Counting how many times each category appears for categorical data.
3. Finding patterns and relationships
   ○ Seeing if two things are related (e.g., does more study time usually mean higher scores?).
   ○ Checking for trends or unusual spikes.
4. Detecting problems
   ○ Missing values (empty spots).
   ○ Outliers (values that are way off from the rest).
   ○ Mistakes or inconsistencies.
5. Visualizing the data
   ○ Making charts and graphs to understand things faster.
   ○ Common ones: histograms, scatter plots, box plots, bar charts.

2. Exploratory data analysis includes:


1. Data Collection & Loading

● Gathering and importing data from CSV, Excel, SQL databases, APIs, etc.
● Checking data formats and consistency.

Example tools: pandas, numpy, SQL.
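
A minimal loading sketch with pandas (the file name sales_data.csv is hypothetical):

```python
import pandas as pd

# Read a CSV file into a DataFrame (file name is hypothetical)
df = pd.read_csv("sales_data.csv")

# Excel and SQL sources follow the same pattern:
# df = pd.read_excel("sales_data.xlsx")
# df = pd.read_sql("SELECT * FROM sales", connection)

# Quick consistency check: column names and data types
print(df.dtypes)
```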

2. Data Cleaning

● Handling missing values (imputation, removal, etc.)
● Removing duplicates
● Correcting inconsistent data types
● Detecting and handling outliers or incorrect entries.
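
A short cleaning sketch, assuming df is a pandas DataFrame and that order_date, price, and quantity are hypothetical column names:

```python
import pandas as pd

# Drop exact duplicate rows
df = df.drop_duplicates()

# Fix an inconsistent data type: parse a date column,
# turning unparseable entries into NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Handle missing values: drop rows missing a critical column,
# then impute a numeric column with its median
df = df.dropna(subset=["price"])
df["quantity"] = df["quantity"].fillna(df["quantity"].median())
```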

3. Data Profiling (Basic Statistics)

● Checking data types (numeric, categorical, datetime)
● Summary statistics:
   ○ Numerical: mean, median, standard deviation, min, max, quantiles
   ○ Categorical: frequency counts, mode
● Shape and size of data: df.shape, df.info()
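
These checks are one-liners in pandas (df is assumed loaded; region is a hypothetical column):

```python
# Shape and schema
print(df.shape)   # (rows, columns)
df.info()         # column dtypes and non-null counts

# Numerical summary: mean, std, min, max, quartiles
print(df.describe())

# Categorical summary: frequency counts and mode
print(df["region"].value_counts())
print(df["region"].mode())
```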

4. Univariate Analysis

Analyzing one variable at a time.

● Numerical variables: Histograms, boxplots, density plots
● Categorical variables: Bar plots, count plots, pie charts

Goal: Understand distribution and spot anomalies.
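
A quick sketch with seaborn, assuming df is a pandas DataFrame with hypothetical columns price (numerical) and region (categorical):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numerical variable: distribution and outliers
sns.histplot(df["price"])    # histogram of the distribution
plt.show()
sns.boxplot(x=df["price"])   # boxplot highlights outliers
plt.show()

# Categorical variable: frequency of each category
sns.countplot(x="region", data=df)
plt.show()
```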


5. Bivariate / Multivariate Analysis

Exploring relationships between two or more variables.

● Numerical vs Numerical: Scatter plots, correlation matrix, pair plots
● Numerical vs Categorical: Grouped boxplots, violin plots, ANOVA tests
● Categorical vs Categorical: Crosstabs, stacked bar charts, heatmaps
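
One sketch per pairing (df and all column names are hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Numerical vs Numerical: scatter plot
sns.scatterplot(x="hours_studied", y="exam_score", data=df)
plt.show()

# Numerical vs Categorical: grouped boxplot
sns.boxplot(x="region", y="price", data=df)
plt.show()

# Categorical vs Categorical: crosstab rendered as a heatmap
ct = pd.crosstab(df["region"], df["segment"])
sns.heatmap(ct, annot=True, fmt="d")
plt.show()
```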

6. Correlation and Feature Relationships

● Check correlations using:
   ○ Pearson, Spearman, or Kendall coefficients
● Identify multicollinearity (important for machine learning models).
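
A sketch of both checks with pandas (df is assumed loaded; the |r| > 0.9 threshold is an illustrative choice):

```python
import numpy as np

# Correlation matrix over numeric columns only;
# method can be "pearson", "spearman", or "kendall"
corr = df.select_dtypes("number").corr(method="pearson")
print(corr)

# Rough multicollinearity check: list feature pairs with |r| > 0.9
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle only
pairs = corr.where(mask).stack()
print(pairs[pairs.abs() > 0.9])
```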
7. Data Visualization

Using plots to see patterns, trends, or anomalies:

● Histograms & Boxplots – distribution & outliers
● Heatmaps – correlation
● Pairplots – relationships between multiple features
● Time-series plots – trends over time

Tools: matplotlib, seaborn, plotly, Power BI, Tableau.

8. Outlier Detection & Handling

● Visual inspection with boxplots or scatter plots
● Statistical methods like Z-score, IQR (Interquartile Range)
● Decide whether to remove, cap, or transform them.
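
Both statistical methods take only a few lines of pandas/numpy (price is a hypothetical numeric column):

```python
import numpy as np

s = df["price"]  # hypothetical numeric column

# Z-score: flag values more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 3]

# IQR: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```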

9. Feature Engineering Insights

● Detect skewness and the need for scaling or transformation (log, sqrt, etc.)
● Identify potential new features or interactions.
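
For example, checking skewness and applying a log transform (price is a hypothetical column):

```python
import numpy as np

# Skewness near 0 means roughly symmetric;
# a large positive value means a long right tail
print(df["price"].skew())

# A log transform often tames right-skewed, non-negative data
df["log_price"] = np.log1p(df["price"])  # log(1 + x) also handles zeros
```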

10. Reporting Insights

● Summarizing findings through reports or dashboards
● Documenting data issues, distributions, and correlations to guide modeling

3. EDA in the data analysis and data science process:


1. Reality → Raw Data Collected

● Data originates from the real world:
   ○ Business systems
   ○ Sensors, IoT devices
   ○ Surveys, transactions, logs
   ○ External sources (APIs, public datasets)
● This stage captures raw, unprocessed data, often messy and unstructured.
2. Raw Data → Data is Processed

● The raw data is processed into a structured format.
● Activities include:
   ○ Formatting or parsing files
   ○ Merging multiple data sources
   ○ Handling missing or inconsistent values
3. Data is Processed → Clean Dataset

● Cleaning and transforming data to make it usable.
● Key tasks:
   ○ Removing duplicates
   ○ Converting data types (e.g., text to datetime)
   ○ Handling outliers
   ○ Normalizing values
   ○ Encoding categorical variables

A clean dataset is now ready for analysis.

4. Clean Dataset → Exploratory Data Analysis (EDA)

● At this stage, EDA is performed to:
   ○ Summarize and visualize the data
   ○ Identify patterns, trends, correlations
   ○ Detect outliers or anomalies
   ○ Guide feature engineering for modeling

This step helps analysts understand the data deeply before modeling.

5. Clean Dataset / EDA → Models & Algorithms

● Insights from EDA inform the modeling phase:
   ○ Selecting appropriate algorithms (e.g., regression, decision trees, clustering)
   ○ Defining feature sets
   ○ Handling imbalanced data or scaling needs
● Models are then trained, validated, and optimized.
6. Models / Clean Dataset → Data Product

● The output of models (predictions, forecasts, scores, clusters) becomes a data product, such as:
   ○ Recommendation systems
   ○ Dashboards
   ○ Predictive scoring tools
7. Data Product → Visualize Report

● Results and key insights are visualized for better interpretation.
● Common visualization tools: Tableau, Power BI, Matplotlib, Seaborn
8. Visualize Report → Make Decisions

● Decision-makers use the insights and visual reports to:
   ○ Improve business processes
   ○ Launch new strategies
   ○ Monitor KPIs
   ○ Optimize operations
4. What is visualization?

EDA visualizations are pictures that help you understand your data. Instead of reading numbers in a table, you can see patterns, trends, and problems right away.

Histogram

● Shows how values are spread.
● Example: Ages of people in a survey. You can see if most people are young, old, or in between.

Bar Chart

● Shows counts of categories.
● Example: Number of people in each favorite color category.

Box Plot

● Shows the distribution and outliers.
● Example: Salaries in a company. You can see the median, range, and any very high or low salaries.

Scatter Plot

● Shows the relationship between two numbers.
● Example: Study hours vs. exam scores. You can see if more study usually means higher scores.

Line Plot

● Shows trends over time.
● Example: Stock prices over a year. You can see rises and falls.

Pie Chart (less common in serious data science)

● Shows parts of a whole.
● Example: Market share of different phone brands.
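
All six chart types are one-liners in matplotlib; a sketch with hypothetical columns (age, salary, hours, score, date, price) and toy category data:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 6))

axes[0, 0].hist(df["age"])                             # histogram
axes[0, 1].bar(["red", "blue", "green"], [40, 25, 15]) # bar chart (toy counts)
axes[0, 2].boxplot(df["salary"].dropna())              # box plot
axes[1, 0].scatter(df["hours"], df["score"])           # scatter plot
axes[1, 1].plot(df["date"], df["price"])               # line plot
axes[1, 2].pie([55, 30, 15], labels=["A", "B", "C"])   # pie chart

plt.tight_layout()
plt.show()
```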

5. Steps involved in EDA

1. Data sourcing: Data sourcing is the process of identifying, collecting, and integrating relevant data from various internal and external sources to achieve specific business goals, inform decision-making, and support operations. It involves defining objectives, finding suitable sources such as databases, APIs, or public records, and systematically gathering the data, which is then prepared for analysis and use within an organization's data infrastructure.
1. Internal Sources
   ○ Data your organization already has.
   ○ Examples: sales records, customer info, website logs.
2. External Sources
   ○ Public datasets or data from outside your organization.
   ○ Examples: government databases, Kaggle datasets, APIs (like Twitter or weather data).
3. Generated Data
   ○ You can also create data yourself using surveys, experiments, or simulations.
2. Data cleaning: After collecting the data, the next step is data cleaning. Data cleaning means getting rid of any information that doesn't need to be there and cleaning up mistakes. It is the process of cleaning the data to improve its quality for further analysis and for building a machine learning model. The benefit of data cleaning is that all the incorrect and irrelevant data is gone, leaving good-quality data that helps improve the accuracy of our machine learning model. The following are some steps involved in data cleaning:

>Handle Missing Values:

1. Delete rows/columns: This is the most commonly used method to handle missing values. Rows can be deleted if they have an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.

2. Replace with mean/median/mode: This method can be used on independent variables when they are numerical. On categorical features, we apply the mode to fill in the missing values.

3. Algorithm imputation: Some machine learning algorithms, such as KNN, Naive Bayes, and random forest, can handle missing values in the dataset natively.
4. Predicting the missing values: A prediction model is one of the more advanced methods to handle missing values. In this method, the rows with no missing values become the training set, the rows with missing values become the test set, and the missing variable is treated as the target variable.
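
A sketch of the simpler strategies plus one model-based imputer (column names are hypothetical; scikit-learn's KNNImputer is one option for algorithm imputation):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Deletion and simple imputation with pandas
df = df.dropna(subset=["target"])                          # delete rows
df["income"] = df["income"].fillna(df["income"].median())  # numeric: median
df["city"] = df["city"].fillna(df["city"].mode()[0])       # categorical: mode

# Model-based imputation: fill gaps using the 5 most similar rows
num_cols = df.select_dtypes("number").columns
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
```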

>Standardization of the data: Standardization and feature scaling are techniques used to adjust the range or distribution of data so that all features (columns) are comparable.

1. Standard Scaler: Standardization scales each feature based on the standard normal distribution.
Formula: for each value x,

z = (x − μ) / σ

where:

● x = the original value
● μ = mean of the column
● σ = standard deviation of the column

2. Normalization: Normalization scales your features down to the range 0 to 1, typically via min-max scaling: x' = (x − min) / (max − min).

Think of it like making sure all players in a game start at the same line — no unfair advantages.
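
Both scalers are available in scikit-learn; a sketch with hypothetical column names:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_cols = ["age", "income"]  # hypothetical numeric columns

# Standardization: mean 0, standard deviation 1
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Normalization: rescale to the [0, 1] range
# (apply one or the other, not both)
# df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```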

>Outlier Treatment: Outliers are the most extreme values in the data. They are abnormal observations that deviate from the norm; outliers do not fit the normal behavior of the data.

<Detect outliers using the following methods:
1. Boxplot
2. Histogram
3. Scatter plot
4. Z-score
5. Interquartile range (values beyond 1.5 times the IQR)


<Handle outliers using the following methods:
1. Remove the outliers.
2. Replace outliers with suitable values using the quantile method or the interquartile range.
3. Use ML models that are robust to outliers, such as decision trees, Naive Bayes, and ensemble methods.
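
For instance, the quantile method can be implemented by clipping values at chosen percentiles (price is a hypothetical column):

```python
# Cap outliers at the 1st and 99th percentiles
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(lower=low, upper=high)
```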
>Handle invalid values
1. Encode Unicode properly: If the data is being read as junk characters, try changing the encoding, e.g., cp1252 instead of utf-8.

2. Convert incorrect data types: Convert incorrectly typed data to the correct data types for ease of analysis. For example, if numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc.
Some common data type corrections are: string to number (“12,300” to 12300), string to date (“2013-aug” to “2013/08”), and number to string (“PIN code 110001” stored as the string “110001”), etc.
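
These corrections map directly onto pandas conversions (column names are hypothetical):

```python
import pandas as pd

# String to number: strip the thousands separator first
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")

# String to date: parse values like "2013-aug"
df["month"] = pd.to_datetime(df["month"], format="%Y-%b", errors="coerce")

# Number to string: PIN codes are identifiers, not quantities
df["pin_code"] = df["pin_code"].astype(str)
```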

3. Correct values that go beyond range: If some values are beyond the plausible range, e.g., a temperature below −273.15 °C (0 K), you would need to correct them as required. A close look will help you check whether there is scope for correction or whether the value needs to be removed.

4. Correct wrong structure: Values that don't follow a defined structure can be removed. E.g., in a dataset containing PIN codes of Indian cities, a PIN code of 12 digits would be an invalid value and needs to be removed; similarly, a phone number of 12 digits would be an invalid value.

3. Types of Data

1. Qualitative: A variable that describes a quality of the population (categorical values).

>Nominal
>Ordinal

2. Quantitative: A variable that quantifies the population (numerical values).

>Discrete
>Continuous

4. Types of analysis

1. Univariate analysis: Univariate Analysis is a type of statistical analysis that focuses on one variable at a time. It summarizes and describes the main characteristics of that variable to understand its distribution, central tendency, and dispersion. “Uni” means one. It's often the first step in Exploratory Data Analysis (EDA).

>Types of Univariate Analysis

The type of analysis depends on the type of data:
a) For Numerical (Continuous) Data
● Descriptive Statistics
   ○ Mean, Median, Mode
   ○ Minimum, Maximum
   ○ Variance, Standard Deviation, Range, Interquartile Range (IQR)
● Visualizations
   ○ Histogram
   ○ Box plot
   ○ Density plot

b) For Categorical Data
● Frequency Distribution
   ○ Counts and percentages of each category
● Visualizations
   ○ Bar plot
   ○ Pie chart

<Key Objectives

● Understand data distribution
● Identify outliers
● Detect data entry errors
● Check assumptions for further analysis (like normality for parametric tests)

2. Bivariate Analysis: Bivariate analysis is a statistical method that involves the analysis of two variables to determine the relationship, correlation, or association between them. It is often used in data analysis to understand how one variable changes with respect to another. It is done for both numerical and categorical data.

Key Objectives of Bivariate Analysis

● To examine the relationship between two variables.
● To determine the strength and direction of the relationship.
● To check if the relationship is causal or just an association.

Types of Bivariate Analysis: The type of analysis depends on the data types of the two variables (categorical or numerical), as summarized below.

Variable 1  | Variable 2  | Analysis / Test                                      | Example
Numerical   | Numerical   | Correlation, Regression, Scatter Plot                | Height vs. Weight
Numerical   | Categorical | t-test, ANOVA, Box Plot                              | Income vs. Gender
Categorical | Categorical | Chi-square test, Cross-tabulation, Stacked Bar Chart | Gender vs. Smoking Status
Techniques of Bivariate Analysis

1. Numerical vs Numerical

● Correlation Coefficient (Pearson, Spearman) → Checks strength and direction of a linear/non-linear relationship.
● Scatter Plot → Visual representation of the relationship.
● Simple Linear Regression → Predicts one variable based on the other.

Example: Relationship between hours studied and exam scores.
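
A sketch of the correlation checks with SciPy, using the hours-studied example (column names are hypothetical):

```python
from scipy import stats

# Pearson: strength and direction of the linear relationship
r, p = stats.pearsonr(df["hours_studied"], df["exam_score"])
print(f"Pearson r = {r:.2f}, p-value = {p:.3f}")

# Spearman: rank-based alternative for monotonic, non-linear relationships
rho, p = stats.spearmanr(df["hours_studied"], df["exam_score"])
print(f"Spearman rho = {rho:.2f}, p-value = {p:.3f}")
```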
2. Numerical vs Categorical

● t-test → Compares means between two groups.
● ANOVA (Analysis of Variance) → Compares means across more than two groups.
● Box Plots → Visualize distribution across categories.

Example: Average salary across different education levels.
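
A sketch of the ANOVA for the salary example, using SciPy (column names are hypothetical):

```python
from scipy import stats

# One group of salaries per education level
groups = [g["salary"].dropna() for _, g in df.groupby("education")]

# One-way ANOVA: do mean salaries differ across levels?
f_stat, p = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p-value = {p:.3f}")
```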
3. Categorical vs Categorical

● Cross-tabulation (Contingency Table) → Shows frequency distribution.
● Chi-square test of independence → Checks if variables are independent.
● Stacked Bar Charts → Visualize proportions.
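
A sketch of all three techniques with pandas and SciPy, using the gender vs. smoking example (column names are hypothetical):

```python
import pandas as pd
from scipy import stats

# Cross-tabulation: frequency of each category combination
table = pd.crosstab(df["gender"], df["smoking_status"])
print(table)

# Chi-square test of independence on the contingency table
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p:.3f}")

# Stacked bar chart of row-wise proportions
table.div(table.sum(axis=1), axis=0).plot(kind="bar", stacked=True)
```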

3. Multivariate Analysis: Multivariate analysis (MVA) is a set of statistical techniques used to analyze data that involves more than one dependent or independent variable simultaneously. It helps uncover relationships, patterns, and structures in complex datasets.

It's used when:

● You want to see how several factors influence an outcome.
● You need to find patterns or groups in complex data.
● You want to predict something using multiple inputs.
Situation               | Variables                            | Purpose
Predicting house prices | Size, location, age, number of rooms | See how all factors together affect price
Health study            | Age, diet, exercise, stress levels   | Find what combination impacts health most
Marketing               | Age, income, preferences, location   | Identify customer groups

Common Methods

● Multiple Regression – predicts one outcome based on several inputs.
● Factor Analysis – reduces many variables into fewer underlying factors.
● Cluster Analysis – groups similar data points together.
● MANOVA (Multivariate Analysis of Variance) – compares multiple dependent variables across groups.
● Principal Component Analysis (PCA) – simplifies data by finding key components.
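
As one illustration, a PCA sketch with scikit-learn (assuming df is a pandas DataFrame with numeric columns):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardize first
X = StandardScaler().fit_transform(df.select_dtypes("number").dropna())

# Keep 2 components and check how much variance they retain
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
```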
