The Yenepoya Institute of Arts, Science, Commerce and
Management (YIASCM)
Course: IV Semester BCA (All specializations) & BSc
Statistics for Machine Learning
Unit-2: Exploratory Data Analysis and Visualization
Objective
At the end of this session, the learner will be able to understand:
Data pre-processing
Steps for data pre-processing
Data Cleaning
How to clean data for Machine Learning?
What is Data Pre-processing?
Data pre-processing includes the steps we need to follow to
transform or encode data so that it can be easily parsed by
the machine.
For a model to be accurate and precise in its predictions,
the algorithm must be able to easily
interpret the data's features.
Data Pre-processing
Data pre-processing is a process of preparing the raw data
and making it suitable for a machine learning model.
It is the first and crucial step while creating a machine
learning model.
When creating a machine learning project, we do not always
come across clean and formatted data.
Before doing any operation with data, it is necessary to
clean it and put it in a formatted way.
For this, we use data pre-processing.
Why is Data Pre-processing
important?
• The majority of real-world datasets for machine learning are highly susceptible to missing,
inconsistent, and noisy values because of their heterogeneous origins. Noisy data means
data containing random errors or meaningless values.
• Applying data mining algorithms to this noisy data would not give quality results, as they
would fail to identify patterns effectively.
• Data pre-processing is, therefore, important to improve the overall data quality.
Duplicate or missing values may give an incorrect view of the overall statistics of
data.
Outliers and inconsistent data points often tend to disturb the model’s overall
learning, leading to false predictions.
Quality decisions must be based on quality data.
Data Pre-processing is important to get this quality data, without which it would just
be a Garbage In, Garbage Out scenario.
Why do we need Data Pre-processing?
• Real-world data generally contains noise and missing values, and may be in
an unusable format that cannot be used directly by machine learning
models.
• Data pre-processing is the set of tasks required to clean the data and make it
suitable for a machine learning model, which also increases the accuracy
and efficiency of the model.
Steps for data pre-processing
Data Cleaning: This is particularly done as part of data pre-processing to clean the data by filling missing
values, smoothing the noisy data, resolving the inconsistency, and removing outliers.
1. Missing values
Here are a few ways to solve this issue:
Ignore those tuples
This method should be considered when the dataset is huge and numerous
missing values are present within a tuple.
Fill in the missing values
There are many methods to achieve this, such as filling in the values manually,
predicting the missing values using regression method, or numerical methods like
attribute mean.
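The two options above can be sketched in a few lines of Python. This is only an illustration: the records, the column names (age, salary) and the helper names are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "salary": 50000},
    {"age": None, "salary": 62000},
    {"age": 31, "salary": None},
    {"age": 40, "salary": 58000},
]

def drop_incomplete(rows):
    """Option 1: ignore tuples that contain any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_with_mean(rows, column):
    """Option 2: replace missing values in one column with the attribute mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    column_mean = mean(observed)
    return [
        {**r, column: column_mean if r[column] is None else r[column]}
        for r in rows
    ]

complete_rows = drop_incomplete(rows)
filled_rows = fill_with_mean(fill_with_mean(rows, "age"), "salary")
```

Dropping keeps only the two complete records, while mean filling keeps all four and fills each gap with the average of the observed values in that column.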
2. Noisy Data
Noise is a random error or variance in a measured variable. It can be removed
with the help of the following techniques:
Binning
It is the technique that works on sorted data values to smoothen any noise
present in it. The data is divided into equal-sized bins, and each bin/bucket is dealt with
independently. All data in a segment can be replaced by its mean, median or boundary
values.
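A minimal sketch of binning, assuming equal-frequency bins of three values each. The sample prices and the function names are illustrative only.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's lowest and highest value."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so the ends are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample
bins = equal_frequency_bins(prices, 3)
by_mean = smooth_by_means(bins)          # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
by_boundary = smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by the median works the same way, with the bin median used in place of the mean.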
Regression
This data mining technique is generally used for prediction. It helps to smoothen
noise by fitting the data points to a regression function. Simple linear regression
is used if there is only one independent attribute; otherwise multiple linear
regression is used.
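As a rough illustration of smoothing by regression, the sketch below fits a least-squares line to one independent attribute and replaces each noisy reading with the fitted value. The data and the function name fit_line are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # hypothetical noisy readings

slope, intercept = fit_line(xs, ys)
# Smoothed values lie exactly on the fitted line.
smoothed = [slope * x + intercept for x in xs]
```

For this sample the fitted line is approximately y = 1.97x + 0.09.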
Clustering
Creation of groups/clusters from data having similar values. The values that don't
lie in the cluster can be treated as noisy data and can be removed.
3. Removing outliers:
Outliers are data points that differ significantly from the other
observations in a given dataset. They can occur because of variability in
measurement or because of mistakes made while recording data points.
Most common causes of outliers on a data set:
Data Entry Errors: Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
Measurement Error (instrument errors): It is the most common source of
outliers. This is caused when the measurement instrument used turns out to be
faulty.
Experimental errors (data extraction or experiment planning errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation or data set unintended mutations)
Sampling errors (extracting or mixing data from wrong or various sources)
How to detect Outliers?
1. Z-score method
2. Robust Z-score
3. I.Q.R method
4. Winsorization method (Percentile Capping)
5. DBSCAN Clustering
6. Isolation Forest
7. Linear Regression Models (PCA, LMS)
8. Standard Deviation
9. Percentile
10. Visualizing the data
Z-score method
This method assumes that the variable has a Gaussian distribution. It represents the
number of standard deviations an observation is away from the mean.
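The z-score of a value x is z = (x - mean) / standard deviation. A minimal sketch follows, assuming the common rule of thumb that |z| > 3 marks an outlier. The threshold, the readings and the function names are illustrative choices, not fixed rules.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold (assumed cut-off)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical readings: nineteen typical values and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
            12, 11, 10, 12, 11, 13, 12, 11, 10, 50]

outliers = z_score_outliers(readings)  # [50]
```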
IQR Method
In this method, we detect outliers using the interquartile range (IQR).
Q1 represents the 1st quartile (25th percentile) of the data.
Q2 represents the 2nd quartile (the median) of the data.
Q3 represents the 3rd quartile (75th percentile) of the data.
IQR = Q3 - Q1 tells us the spread of the middle 50% of the data set.
(Q1 - 1.5 x IQR) is the lower limit and
(Q3 + 1.5 x IQR) is the upper limit for typical values in the data set.
Any value below the lower limit or above the upper limit is treated
as an outlier.
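A short sketch of the IQR rule, where the limits are Q1 - 1.5 x IQR and Q3 + 1.5 x IQR. It uses Python's statistics.quantiles with its default settings. Different tools compute quartiles slightly differently, so the exact limits may vary. The data and function names are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _q2, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Values that fall outside the IQR limits."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

data = [7, 8, 9, 10, 11, 12, 13, 40]  # hypothetical sample
lower, upper = iqr_fences(data)
outliers = iqr_outliers(data)  # [40]
```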
Visualizing the data
Data visualization is useful for data cleaning, exploring data,
detecting outliers and unusual groups, identifying trends and
clusters, etc. Here is a list of data visualization plots used to spot
outliers:
Box and whisker plot (box plot)
Scatter plot
Histogram
Distribution Plot
QQ plot
Methods to treat outliers:
1. Deleting observations
2. Transforming values
3. Imputation
4. Separately treating
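Three of these treatments can be sketched as follows, assuming the lower and upper limits for typical values have already been chosen (for example with the IQR method). The limits, the data and the function names here are assumptions made for illustration. Capping is shown as one simple way of transforming values.

```python
from statistics import median

def delete_outliers(values, lower, upper):
    """Deleting observations: keep only values inside the limits."""
    return [v for v in values if lower <= v <= upper]

def cap_outliers(values, lower, upper):
    """Transforming values: pull extreme values back to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]

def impute_with_median(values, lower, upper):
    """Imputation: replace outliers with the median of the remaining values."""
    typical = median(delete_outliers(values, lower, upper))
    return [v if lower <= v <= upper else typical for v in values]

data = [7, 8, 9, 10, 11, 12, 13, 40]  # hypothetical sample
lower, upper = 1.5, 19.5              # assumed limits for typical values

deleted = delete_outliers(data, lower, upper)
capped = cap_outliers(data, lower, upper)
imputed = impute_with_median(data, lower, upper)
```

Treating outliers separately is not shown. It means analysing the outlying group on its own rather than changing it.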
Data Exploration:
Data exploration refers to the initial step in data analysis.
Data analysts use data visualization and statistical techniques to describe
dataset characteristics, such as size, quantity, and accuracy, to understand
the nature of the data better.
Data exploration techniques include both manual analysis and automated data
exploration software solutions that visually explore and identify relationships
between different data variables, the structure of the dataset, the presence of
outliers, and the distribution of data values to reveal patterns and points of
interest, enabling data analysts to gain greater insight into the raw data.
Data is often gathered in large, unstructured volumes from various sources.
Data analysts must first understand and develop a comprehensive view of the
data before extracting relevant data for further analysis, such as univariate,
bivariate, multivariate, and principal components analysis.
Why is Data Exploration Important?
Humans process visual data better than numerical data.
Therefore it is extremely challenging for data scientists and data analysts to
assign meaning to thousands of rows and columns of data points and
communicate that meaning without any visual components.
Data visualization in data exploration leverages familiar visual cues such as
shapes, dimensions, colours, lines, points, and angles so that data analysts
can effectively visualize and define the metadata and then perform data
cleansing.
Performing the initial step of data exploration enables data analysts to
understand better and visually identify anomalies and relationships that
might otherwise go undetected.
What can Data Exploration Do?
• The goals of data exploration fall into three categories:
1. Archival: Data exploration can convert data from physical formats (such as
books, newspapers, and invoices) into digital formats (such as databases)
for backup.
2. Transferring the data format: If you want to move data from your current
website to a new website under development, you can collect the data from
your own website by extracting it.
3. Data analysis: As the most common goal, the extracted data can be further
analysed to generate insights. This may sound similar to the data analysis
process in data mining, but note that data analysis is the goal of data
exploration, not part of its process. What's more, the data is analysed
differently. For example, e-store owners extract product details from
eCommerce websites like Amazon to monitor competitors' strategies.
Data Visualization
Data visualization is a crucial aspect of
machine learning that enables
analysts to understand and make
sense of data patterns, relationships,
and trends.
Through data visualization, insights
and patterns in data can be easily
interpreted and communicated to a
wider audience, making it a critical
component of machine learning
Significance of Data Visualization
Data visualization helps machine learning analysts to better understand and
analyze complex data sets by presenting them in an easily understandable
format.
Data visualization is an essential step in data preparation and analysis as it
helps to identify outliers, trends, and patterns in the data that may be
missed by other forms of analysis.
With the increasing availability of big data, it has become more important
than ever to use data visualization techniques to explore and understand
the data.
Machine learning algorithms work best when they have high-quality and
clean data, and data visualization can help to identify and remove any
inconsistencies or anomalies in the data.
Types of Data Visualization Approaches
Machine learning may make use of a wide variety of data visualization
approaches. They are:
Line Charts
Scatter Plots
Bar Charts
Heat Maps
Tree Maps
Box Plots
Line Charts:
In a line chart, each data point is
represented by a point on the
graph, and these points are
connected by a line.
We may find patterns and trends
in the data across time by using
line charts.
Time-series data is frequently
displayed using line charts
Scatter Plots:
A quick and efficient method of
displaying the relationship between
two variables is to use scatter plots.
With one variable plotted on the x-axis
and the other variable drawn on the y-
axis, each data point in a scatter plot is
represented by a point on the graph.
We may use scatter plots to visualize
data to find patterns, clusters, and
outliers
Heat Maps:
Heat maps are a type of graphical
representation that displays data
in a matrix format.
The value of the data point that
each matrix cell represents
determines its hue.
Heat maps are often used to
visualize the correlation between
variables or to identify patterns in
time-series data.
Tree Maps:
Tree maps are used to display
hierarchical data in a compact
format and are useful in showing
the relationship between different
levels of a hierarchy
Box Plots:
Box plots are a graphical representation of
the distribution of a set of data.
In a box plot, the median is shown by a
line inside the box, while the centre box
depicts the interquartile range (Q1 to Q3).
The whiskers extend from the box to the
highest and lowest values in the data,
excluding outliers.
Box plots can help us to identify the
spread and skewness of the data.
Uses of Data Visualization:
• Data visualization has several uses in machine learning:
Identify trends and patterns in data: It may be challenging to spot trends and
patterns in data using conventional approaches, but data visualization tools may
be utilized to do so.
Communicate insights to stakeholders: Data visualization can be used to
communicate insights to stakeholders in a format that is easily understandable
and can help to support decision-making processes.
Monitor machine learning models: Data visualization can be used to monitor
machine learning models in real time and to identify any issues or anomalies in
the data.
Improve data quality: Data visualization can be used to identify outliers and
inconsistencies in the data and to improve data quality by removing them.
Feature selection:
Feature selection is the method of reducing the number of input
variables to your model by using only relevant data and
getting rid of noise in the data.
It is the process of automatically choosing relevant features
for your machine learning model based on the type of
problem you are trying to solve
Why Feature Selection?
Machine learning models follow a simple rule: whatever goes in, comes
out. If we put garbage into our model, we can expect the output to be
garbage too. In this case, garbage refers to noise in our data.
To train a model, we collect enormous quantities of data to help the
machine learn better. Usually, a good portion of the data collected is
noise, while some of the columns of our dataset might not contribute
significantly to the performance of our model.
Further, having a lot of data can slow down the training process and
cause the model to be slower.
The model may also learn from this irrelevant data and be inaccurate.
Hence apart from choosing the right model for our data, we need to
choose the right data to put in our model.
In the below table, we can see that the model of the car, the year of
manufacture, and the miles it has traveled are pretty important to find out if
the car is old enough to be crushed or not. However, the name of the
previous owner of the car does not decide if the car should be crushed or not.
Further, it can confuse the algorithm into finding patterns between names
and the other features. Hence we can drop the column.
Feature Selection Models
Feature selection models are of two types:
Supervised Models: Supervised feature
selection refers to the method which uses
the output label class for feature selection.
They use the target variables to identify the
variables which can increase the efficiency
of the model
Unsupervised Models: Unsupervised
feature selection refers to the method
which does not need the output label class
for feature selection. We use them for
unlabeled data.
Types of Supervised models:
We can further divide the supervised models into three :
1. Filter Method: In this method, features are dropped based on their relation to the
output, or how they are correlating to the output. We use correlation to check if the
features are positively or negatively correlated to the output labels and drop features
accordingly. Eg: Information Gain, Chi-Square Test, Fisher’s Score, etc.
2. Wrapper Method: We form subsets of the features and train a model on each subset. Based
on the output of the model, we add or remove features and train the model again. It
forms the subsets using a greedy approach and evaluates the accuracy of the
candidate combinations of features. Eg: Forward Selection, Backwards Elimination, etc.
3. Intrinsic (Embedded) Method: This method combines the qualities of both the Filter and Wrapper
methods to create the best subset. Feature selection happens as part of the model's iterative
training process, which keeps the computation cost low. Eg: Lasso and
Ridge Regression.
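A minimal sketch of the filter method, using Pearson's correlation with the output as the score. The feature names, the values and the 0.5 threshold are hypothetical choices made for this illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / sqrt(x_var * y_var)

def filter_features(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores

# Hypothetical car data: two informative columns and one irrelevant column.
features = {
    "age_years": [1, 3, 5, 7, 9],
    "mileage_thousand_km": [10, 35, 48, 72, 95],
    "owner_name_length": [3, 1, 4, 1, 5],
}
wear_score = [2, 4, 6, 8, 10]  # hypothetical numerical output

selected, scores = filter_features(features, wear_score)
# selected keeps age_years and mileage_thousand_km; owner_name_length is dropped.
```

Because each feature is scored independently of any model, a filter like this is cheap to run, but it cannot detect features that are only useful in combination.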
How to Choose a Feature Selection Model?
How do we know which feature selection model will work out for our
model? The process is relatively simple, with the model depending on
the types of input and output variables.
Variables are of two main types:
Numerical Variables: Which include integers and floating-point numbers.
Categorical Variables: Which include labels, strings, Boolean variables,
etc.
Based on whether we have numerical or categorical variables as inputs
and outputs, we can choose our feature selection model as follows:
Input Variable   Output Variable   Feature Selection Model
Numerical        Numerical         • Pearson’s correlation coefficient
                                   • Spearman’s rank coefficient
Numerical        Categorical       • ANOVA correlation coefficient (linear)
                                   • Kendall’s rank coefficient (nonlinear)
Categorical      Numerical         • Kendall’s rank coefficient (linear)
                                   • ANOVA correlation coefficient (nonlinear)
Categorical      Categorical       • Chi-Squared test (contingency tables)
                                   • Mutual Information
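For the categorical input and categorical output case, the Chi-Squared statistic can be computed directly from a contingency table. The sketch below only computes the statistic, not a p-value. The counts and the function name are hypothetical.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are feature categories, columns are output classes.
observed = [[30, 10],
            [20, 40]]

chi2 = chi_squared(observed)  # about 16.67
```

A larger statistic means the feature and the output are less likely to be independent, so the feature is a stronger candidate to keep.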
What are the advantages and limitations of z-score method, IQR method and
visualizing the data method of detecting outliers?
1. Z-score method:
Advantages:
• It provides a standardized score indicating how many standard deviations an
observation is from the mean.
• It's easy to understand and implement.
• It works well for normally distributed data.
Limitations:
• It assumes that the data is normally distributed, which may not always be the
case.
• It can be sensitive to extreme values, especially in small datasets.
• It may not be effective for skewed distributions.
2. Interquartile Range (IQR) method:
Advantages:
• It's robust to outliers and resistant to skewed distributions.
• It's simple to calculate and understand.
• It provides a measure of the spread of the middle 50% of the data.
Limitations:
• It relies on quartiles, which may not be representative if the dataset is small.
• It may not be as informative about the location of outliers as the z-score
method.
• It's less effective for normally distributed data compared to the z-score
method.
3. Visualizing the data method:
Advantages:
• It allows for a quick and intuitive understanding of the distribution of data.
• It can reveal patterns and anomalies that may not be evident from summary
statistics alone.
• It's particularly useful for identifying outliers in multidimensional datasets.
Limitations:
• It can be subjective and dependent on the choice of visualization technique.
• It may not be suitable for large datasets with many variables, as it can be
time-consuming.
• It requires some expertise to interpret the visualizations accurately.
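The sensitivity of the z-score method in small datasets can be seen in a short sketch. With only ten values, one extreme value inflates the mean and the standard deviation so much that its own z-score stays below 3, while the IQR limits still flag it. The sample, the threshold of 3 and the multiplier 1.5 are assumptions made for this illustration.

```python
from statistics import mean, pstdev, quantiles

def z_score_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

small_sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]  # hypothetical

by_z = z_score_outliers(small_sample)  # [] : the value 50 is missed
by_iqr = iqr_outliers(small_sample)    # [50]
```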
What are the potential benefits of reducing the number of features in a dataset,
and how can feature selection improve model performance and interpretability?
Reducing the number of features in a dataset, also known as feature selection, can offer
several potential benefits:
1. Improved model performance:
• By removing irrelevant or redundant features, feature selection can reduce
overfitting, where the model learns noise in the data rather than the underlying
patterns. This can lead to better generalization performance on unseen data.
• It can also reduce computational complexity, making the training process faster
and more efficient, especially for algorithms that are sensitive to the curse of
dimensionality.
2. Enhanced interpretability:
• With fewer features, it becomes easier to understand and interpret the model.
Simplifying the model can help identify the most important factors driving the
predictions, making it easier to communicate the results to stakeholders.
• Feature selection can highlight the most relevant variables, allowing domain
experts to gain insights into the underlying processes and relationships in the
data.
3. Reduced overfitting:
• Feature selection helps to mitigate the risk of overfitting by reducing the model's
reliance on noisy or irrelevant features. This allows the model to capture the
underlying patterns in the data more effectively, leading to better generalization
performance on new data.
• Removing irrelevant features can also improve the model's robustness to changes
in the dataset, such as missing values or outliers.
4. Faster training and inference:
• With fewer features, models require less computational resources for training and
inference. This can lead to faster model development and deployment, which is
particularly important in real-time or resource-constrained applications.
5. Improved data visualization:
• Feature selection can result in a reduced-dimensional space, making it easier to
visualize and explore the data. This can help identify relationships and patterns
that may not be apparent in high-dimensional spaces.
How can outliers arise in a dataset?
Outliers can arise in a dataset for various reasons, and they can have
different underlying causes depending on the nature of the data and the
context of the problem:
Measurement errors: Outliers can occur due to errors in data collection,
recording, or measurement. These errors could be human errors, instrument
malfunctions, or data entry mistakes. For example, a sensor malfunctioning
could record an extreme value, leading to an outlier.
Natural variation: In many real-world phenomena, there can be inherent
variability or randomness. Occasionally, extreme or unusual events can
occur, resulting in outliers. For instance, in a dataset of daily temperatures,
an unusually hot or cold day could be considered an outlier.
Data entry errors: Outliers may arise from mistakes during data entry or
data preprocessing. Human error or typos when entering data into a
database or spreadsheet can result in values that are far from the typical
range.
Sampling errors: Outliers can occur due to issues with the sampling process.
If the sample size is too small or if the sampling method is biased, it may not
accurately represent the underlying population, leading to outliers.
Genuine extreme values: Sometimes, outliers represent real and
meaningful observations that are genuinely extreme. These could be rare
events, anomalies, or exceptional cases that are legitimately part of the data
distribution.
Data transformation: Outliers can also arise during data transformation or
aggregation processes. For example, when summarizing data from multiple
sources, inconsistencies or extreme values in one source can lead to outliers in
the aggregated dataset.
Intentional data points: In some cases, outliers may be deliberately
introduced into the dataset. This could be done for various reasons, such as
testing the robustness of a model or representing special cases in the analysis.