The Yenepoya Institute of Arts, Science, Commerce and
Management (YIASCM)
Course: IV Semester BCA (All specializations) & BSc
Introduction to Machine Learning
Lecture 4: Unit 1: Introduction to Machine Learning
Objective
• At the end of this session, the learner will be able to understand:
⮚ Data pre-processing
⮚ Steps for data pre-processing
⮚ Data Cleaning
⮚ How to clean data for Machine Learning?
⮚ Data Wrangling
Data Pre-processing
• Data pre-processing is the process of preparing raw data and
making it suitable for a machine learning model.
• It is the first and crucial step while creating a machine learning model.
• When creating a machine learning project, we do not always come
across clean and formatted data.
• And while doing any operation with data, it is mandatory to clean it
and put it in a formatted way.
• So for this, we use the data pre-processing task.
What is Data Pre-processing?
• Data Pre-processing includes the steps we need to follow to transform
or encode data so that it may be easily parsed by the machine.
• For a model to be accurate and precise in its predictions, the
algorithm must be able to easily interpret the data's features.
Why is Data Pre-processing important?
• The majority of real-world datasets used for machine learning are highly
susceptible to missing values, inconsistencies, and noise due to their
heterogeneous origin.
• Applying data mining algorithms on this noisy data would not give quality
results as they would fail to identify patterns effectively.
• Data pre-processing is, therefore, important to improve the overall data
quality.
❖Duplicate or missing values may give an incorrect view of the overall statistics of
data.
❖Outliers and inconsistent data points often tend to disturb the model’s overall
learning, leading to false predictions.
• Quality decisions must be based on quality data.
• Data Pre-processing is important to get this quality data, without
which it would just be a Garbage In, Garbage Out scenario.
Why do we need Data Pre-processing?
• Real-world data generally contains noise and missing values, and may be
in an unusable format that cannot be directly used by machine
learning models.
• Data pre-processing is the required task of cleaning the data and making it
suitable for a machine learning model, which also increases the accuracy
and efficiency of the model.
Steps for data pre-processing
The main steps, covered in turn below, are data cleaning, data integration,
data transformation, data reduction, and data quality assessment.
Data Cleaning
• Data Cleaning is particularly done as part of data pre-processing to clean the
data by filling missing values, smoothing the noisy data, resolving the
inconsistency, and removing outliers.
1. Missing values
Here are a few ways to solve this issue:
• Ignore those tuples
This method should be considered when the dataset is huge and
numerous missing values are present within a tuple.
• Fill in the missing values
There are many ways to achieve this, such as filling in the values
manually, predicting the missing values using regression, or using numerical
methods like the attribute mean.
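A minimal Python sketch of these two strategies, assuming a pandas DataFrame with hypothetical columns age and salary:

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (column names are illustrative).
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "salary": [30000, 42000, np.nan, 58000, 61000],
})

# Strategy 1: ignore (drop) the tuples that contain missing values.
df_dropped = df.dropna()

# Strategy 2: fill in the missing values, e.g. with the attribute mean.
df_filled = df.fillna(df.mean(numeric_only=True))

print(df_dropped)
print(df_filled)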
2. Noisy Data
• It involves removing a random error or variance in a measured
variable. It can be done with the help of the following techniques:
a. Binning
It is a technique that works on sorted data values to smooth out
any noise present in them. The data is divided into equal-sized bins, and
each bin/bucket is dealt with independently. All data in a segment can
be replaced by its mean, median, or boundary values (see the sketch
after these techniques).
b. Regression
This data mining technique is generally used for prediction. It
helps to smooth out noise by fitting all the data points to a regression
function. A linear regression equation is used if there is only one
independent attribute; otherwise, polynomial equations are used.
c. Clustering
Groups/clusters are created from data having similar values. The
values that don't lie in any cluster can be treated as noisy data and can
be removed.
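As an illustrative sketch of the binning technique described above (one of several possible smoothing approaches), sorted values can be split into equal-frequency bins with pandas and each value replaced by the mean of its bin; the data here is made up:

import pandas as pd

# Hypothetical noisy attribute values, sorted first.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]).sort_values()

# Split the sorted values into three equal-frequency bins.
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))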
3. Removing outliers
Clustering techniques group together similar data points. The
tuples that lie outside the clusters are outliers/inconsistent data; a hedged
sketch of this idea follows.
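A hedged sketch of this clustering idea using DBSCAN from scikit-learn, which labels points that do not fall into any cluster as noise (label -1); the data is synthetic:

import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2-D data: one tight cluster plus two far-away points.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([cluster, outliers])

# Points DBSCAN cannot assign to any cluster receive the label -1.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

X_clean = X[labels != -1]   # keep only the clustered (non-outlier) points
print("removed", int((labels == -1).sum()), "outliers")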
Data Integration
• Data Integration is one of the data preprocessing steps that are used
to merge the data present in multiple sources into a single larger data
store like a data warehouse.
• Data Integration is needed especially when we are aiming to solve a
real-world scenario like detecting the presence of nodules from CT
Scan images.
• The only option is to integrate the images from multiple medical
nodes to form a larger database.
• We might run into some issues while adopting Data Integration as one
of the Data Pre-processing steps:
❖Schema integration and object matching: The data can be
present in different formats and with differently named attributes,
which might cause difficulty in data integration.
❖Removing redundant attributes from all data sources.
❖Detection and resolution of data value conflicts.
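A minimal sketch of data integration with pandas, assuming two hypothetical patient tables that describe the same entities under different schemas:

import pandas as pd

# Two hypothetical sources with differing schemas.
source_a = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34, 51, 47]})
source_b = pd.DataFrame({"id": [1, 2, 4], "scan_result": ["clear", "nodule", "clear"]})

# Schema integration / object matching: align the key column names first.
source_b = source_b.rename(columns={"id": "patient_id"})

# Merge into a single store; an outer join keeps records found in either source.
integrated = source_a.merge(source_b, on="patient_id", how="outer")

# Remove redundant records that may appear after integration.
integrated = integrated.drop_duplicates()
print(integrated)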
Data Transformation
• Once data cleaning has been done, we need to consolidate the
quality data into alternate forms by changing the value,
structure, or format of the data using the below-mentioned Data
Transformation strategies.
a. Generalization
b. Normalization
c. Attribute Selection
d. Aggregation
• Generalization
The low-level or granular data that we have is converted to
high-level information by using concept hierarchies. For example, we can
transform primitive data in an address, such as the city, into
higher-level information such as the country.
• Normalization
It is the most widely used Data Transformation technique. The
numerical attributes are scaled up or down to fit within a
specified range. In this approach, we constrain a data attribute
to a particular range so that different data points become
comparable (a sketch of the three variants follows the list below).
• Normalization can be done in multiple ways, which are highlighted
here:
1. Min-max normalization
2. Z-Score normalization
3. Decimal scaling normalization
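The three variants can be written directly as formulas; here is a hedged sketch over a hypothetical numeric attribute x (note that z-score uses the sample standard deviation here):

import numpy as np
import pandas as pd

x = pd.Series([200, 300, 400, 600, 1000], dtype=float)   # hypothetical attribute

# 1. Min-max normalization: rescale into the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# 2. Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by 10^j, where j is the smallest integer
#    such that every scaled absolute value is below 1.
j = int(np.ceil(np.log10(x.abs().max() + 1)))
decimal_scaled = x / (10 ** j)

print(pd.DataFrame({"min_max": min_max, "z_score": z_score, "decimal": decimal_scaled}))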
• Attribute Selection
New properties of the data are created from existing attributes to
help the data mining process. For example, a date-of-birth
attribute can be transformed into another property like is_senior_citizen
for each tuple, which can directly influence predictions of diseases,
chances of survival, etc.
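A short sketch of the date-of-birth example; the column names, reference date, and the threshold of 60 years are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({"date_of_birth": ["1950-04-12", "1988-09-30", "1959-01-05"]})

# Derive a new attribute from an existing one.
reference_date = pd.Timestamp("2024-01-01")          # assumed reference date
dob = pd.to_datetime(df["date_of_birth"])
age_years = (reference_date - dob).dt.days // 365

df["is_senior_citizen"] = age_years >= 60             # assumed threshold
print(df)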
• Aggregation
It is a method of storing and presenting data in a summary
format. For example, sales data can be aggregated and transformed to
show it in a per-month and per-year format.
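A hedged sketch of aggregating hypothetical transaction-level sales into a per-month summary:

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-11", "2023-02-28"]),
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# Aggregate individual transactions into monthly totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)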
Data Reduction
• The size of the dataset in a data warehouse can be too large to be
handled by data analysis and data mining algorithms.
• One possible solution is to obtain a reduced representation of the
dataset that is much smaller in volume but produces the same quality
of analytical results.
• Data Reduction strategies are
a. Data cube aggregation
It is a way of data reduction, in which the gathered data is expressed
in a summary form.
b. Dimensionality reduction
Dimensionality reduction techniques are used to perform feature
extraction. The dimensionality of a dataset refers to the attributes or
individual features of the data. This technique aims to reduce the number
of redundant features we consider in machine learning algorithms.
Dimensionality reduction can be done using techniques like Principal
Component Analysis (PCA).
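A minimal sketch of dimensionality reduction with scikit-learn's PCA on a synthetic feature matrix (the number of components kept is an arbitrary choice here):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic dataset: 100 samples with 10 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))

# PCA is sensitive to scale, so standardize first, then keep 3 components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (100, 3)
print(pca.explained_variance_ratio_)     # variance retained per component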
c. Data compression
By using encoding technologies, the size of the data can be significantly
reduced. Compression can be either lossless or lossy: if the original data
can be recovered exactly after reconstruction from the compressed data, it is
referred to as lossless reduction; otherwise, it is referred to as lossy reduction.
d. Discretization
Data discretization is used to divide attributes of a continuous nature
into data with intervals. This is done because continuous features tend to have
a smaller chance of correlation with the target variable, which can make
the results harder to interpret. After discretizing a variable, groups corresponding to
the target can be interpreted. For example, the attribute age can be discretized
into bins like below 18, 18-44, 44-60, and above 60.
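The age example above can be written with pandas; this is just an illustrative sketch using the same bin edges:

import pandas as pd

age = pd.Series([12, 19, 30, 45, 52, 63, 70])

# Discretize the continuous ages into the intervals mentioned above.
age_group = pd.cut(
    age,
    bins=[0, 18, 44, 60, 120],
    labels=["below 18", "18-44", "44-60", "above 60"],
)
print(pd.DataFrame({"age": age, "age_group": age_group}))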
e. Numerosity reduction
The data can be represented by a model or equation, such as a
regression model. This saves the burden of storing the huge dataset;
only the model is stored instead.
f. Attribute subset selection
It is very important to be specific in the selection of attributes.
Otherwise, it might lead to high dimensional data, which are difficult to
train due to underfitting/overfitting problems. Only attributes that add
more value towards model training should be considered, and the rest
all can be discarded.
Data Quality Assessment
• Data Quality Assessment includes the statistical approaches one
needs to follow to ensure that the data has no issues.
• Data is to be used for operations, customer management, marketing
analysis, and decision making—hence it needs to be of high quality.
Main components of Data Quality Assessment
• Completeness, with no missing attribute values
• Accuracy and reliability of the information
• Consistency across all features
• Validity of the data
• No redundancy
• The Data Quality Assessment process involves three main activities:
1. Data profiling: It involves exploring the data to identify data
quality issues. Once the issues have been analysed, the data is
summarized in terms of the duplicates, blank values, etc. that were
identified (a small profiling sketch follows this list).
2. Data cleaning: It involves fixing data issues.
3. Data monitoring: It involves maintaining data in a clean state and
having a continuous check on business needs being satisfied by the
data.
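As a hedged sketch, the data-profiling step referred to above might start with a quick pandas summary of shape, missing values, and duplicates (the file name is hypothetical):

import pandas as pd

df = pd.read_csv("customers.csv")        # hypothetical input file

# Profile the data: size, blank values, duplicates, and basic statistics.
print("rows, columns:", df.shape)
print("missing values per column:")
print(df.isnull().sum())
print("duplicate rows:", df.duplicated().sum())
print(df.describe(include="all"))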
Data Pre-processing: Best practices
• The first step in Data Pre-processing is to understand your data. Just looking
at your dataset can give you an intuition of what things you need to focus on.
• Use statistical methods or pre-built libraries that help you visualize the
dataset and give a clear image of how your data looks in terms of class
distribution.
• Summarize your data in terms of the number of duplicates, missing values,
and outliers present in the data.
• Drop the fields you think have no use for the modelling or are closely related
to other attributes. Dimensionality reduction is one of the very important
aspects of Data Pre-processing.
• Do some feature engineering and figure out which attributes contribute most
towards model training.
Data Cleaning
• Data cleaning is one of the important parts of machine learning. It plays
a significant part in building a model.
• It surely isn’t the fanciest part of machine learning and at the same time,
there aren’t any hidden tricks or secrets to uncover.
• However, the success or failure of a project relies on proper data
cleaning.
• Professional data scientists usually invest a very large portion of their
time in this step because of the belief that “Better data beats fancier
algorithms”.
What is data cleaning?
• Data cleaning is the process of preparing data for analysis by weeding
out information that is irrelevant or incorrect.
• This is generally data that can have a negative impact on the model or
algorithm it is fed into by reinforcing a wrong notion.
• Data cleaning not only refers to removing chunks of unnecessary data,
but it’s also often associated with fixing incorrect information within the
train-validation-test dataset and reducing duplicates.
The importance of data cleaning
• Data cleaning is a key step before any form of analysis can be made on it.
• Datasets in pipelines are often collected in small groups and merged
before being fed into a model.
• Merging multiple datasets means that redundancies and duplicates are
formed in the data, which then need to be removed.
• Also, incorrect and poorly collected datasets can often lead to models
learning incorrect representations of the data, thereby reducing their
decision-making powers.
• It's far from ideal.
• The reduction in model accuracy, however, is actually the least of the
problems that can occur when unclean data is used directly.
• Models trained on raw datasets are forced to treat noise as
information, and this can lead to accurate predictions when the noise is
uniform within the training and testing sets, only to fail when new,
cleaner data is shown to them.
• Data cleaning is therefore an important part of any machine learning
pipeline, and you should not ignore it.
Data cleaning vs. data transformation
• As we’ve seen, data cleaning refers to the removal of unwanted data in
the dataset before it’s fed into the model.
• Data transformation, on the other hand, refers to the conversion or
transformation of data into a format that makes processing easier.
• In data processing pipelines, the incoming data goes through a data
cleansing phase before any form of transformation can occur.
• The data is then transformed, often going through stages like
normalization and standardization before further processing takes place.
5 characteristics of quality data
• Data typically has five characteristics that can be used to determine
its quality.
• These five characteristics are:
1. Validity
2. Accuracy
3. Completeness
4. Consistency
5. Uniformity
Validity
• Validity refers to the degree to which the collected data is accurate,
correct, and appropriate for its intended purpose, ensuring it sticks to
predefined rules and logical standards (e.g., correct formats, realistic
values, and relevance).
• Data collection often involves a large group of people presenting their
details in various forms (including phone numbers, addresses, and
birthdays) in a document that is stored digitally.
• Modern methods of data collection find validity an easy-to-maintain
characteristic as they can control the data that is being entered into
digital documents and forms.
• Typical constraints applied on forms and documents to ensure data
validity are:
a. Data-type constraints: Data-type constraints help prevent
inconsistencies arising due to incorrect data types in the wrong fields.
Typically, these are found in fields like age, phone number, and name
where the original data is constrained to contain only alphabetical or
numerical values.
b. Range constraints: Range constraints are applied in fields where
prior information about the possible data is already present. These
fields include—but are not limited to—date, age, height, etc.
c. Unique constraints:
Unique constraints are checked each time a
participant enters data into the document or form. This type of constraint
prevents multiple participants from entering the same details for
parameters that are supposed to be unique. Generally, these constraints
are applied to fields like social security number, passport number, and
username.
d. Foreign key constraints:
Foreign Key constraints are applicable to fields where the data can be
limited to a set of previously decided keys. These fields are typically
country and state fields where the range of information that can be
provided is easily known beforehand.
e. Cross-field validation:
Cross-field validation is more of a check than a constraint to make
sure that multiple fields in the document correspond to each other.
For example, if the participant enters a group of values that should
come to a particular number or amount, that amount serves as a
validator that stops the participant from entering the wrong values.
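A hedged sketch of how a few of these validity constraints could be checked programmatically on a hypothetical registration table (the rules themselves are assumptions, not part of the original material):

import pandas as pd

forms = pd.DataFrame({
    "age":      [25, 150, 34],                   # 150 violates the range constraint
    "country":  ["India", "India", "Atlantis"],  # "Atlantis" violates the key constraint
    "username": ["asha", "ravi", "asha"],        # duplicates violate uniqueness
})

# Range constraint on age.
valid_age = forms["age"].between(0, 120)

# Foreign-key-style constraint: country must come from a known set.
valid_country = forms["country"].isin({"India", "USA", "UK"})

# Unique constraint on username.
unique_username = ~forms["username"].duplicated(keep=False)

print(forms[valid_age & valid_country & unique_username])   # rows passing all checks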
Accuracy
• Accuracy refers to how feasible and correct the collected data is.
• It’s almost impossible to guarantee perfectly accurate data, thanks to the fact
that it contains personal information that’s only available to the participant.
• However, we can make near-accurate assumptions by observing the feasibility
of that data.
• Data in the form of locations, for example, can easily be cross-checked to
confirm whether the location exists or not, or if the postal code matches the
location or not.
• Similarly, feasibility can be a solid criterion for judging.
• A person cannot be 100 feet tall, nor can they weigh a thousand pounds, so
data going along these lines can be easily rejected.
Completeness
• Completeness refers to the degree to which the entered data is
present in its entirety.
• Missing fields and missing values are often impossible to fix, resulting
in the entire data row being dropped.
• The presence of incomplete data, however, can be appropriately fixed
with the help of proper constraints that prevent participants from
filling up incomplete information or leaving out certain fields.
Consistency
• Consistency refers to how the data responds to cross-checks with
other fields.
• Studies are often held where the same participant fills out multiple
surveys which are cross-checked for consistency.
• Cross checks are also included for the same participant in more than a
single field.
How to clean data for Machine Learning?
• As research suggests, data cleaning is often the least enjoyable part of
data science, and also the longest.
• Indeed, cleaning data is an arduous task that requires manually combing
through a large amount of data in order to:
a) reject irrelevant information.
b) analyze whether a column needs to be dropped or not.
• Automation of the cleaning process usually requires extensive
experience in dealing with dirty data.
• It is also tricky to implement in a manner that doesn't bring about data
loss.
Remove duplicate or irrelevant data
• Data that’s processed in the form of data frames often has duplicates
across columns and rows that need to be filtered out.
• Duplicates can come about either from the same person participating in
a survey more than once or the survey itself having multiple fields on a
similar topic, thereby eliciting a similar response in a large number of
participants.
• While the latter is easy to remove, the former requires investigation and
algorithms to be employed. Columns in a data frame can also contain
data highly irrelevant to the task at hand, resulting in these columns
being dropped before the data is processed further.
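A minimal pandas sketch of both steps; the column names and the choice of which column is irrelevant are assumptions:

import pandas as pd

df = pd.DataFrame({
    "respondent_id":   [1, 2, 2, 3],
    "age":             [23, 35, 35, 41],
    "favourite_color": ["red", "blue", "blue", "green"],   # assumed irrelevant here
})

# Remove duplicate rows (e.g. the same person submitting the survey twice).
df = df.drop_duplicates(subset=["respondent_id"])

# Drop columns judged irrelevant to the task at hand.
df = df.drop(columns=["favourite_color"])
print(df)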
Fix syntax errors
• Data collected over a survey often contains syntactic and grammatical issues,
due mainly to the fact that a huge demographic is represented through it.
• Common syntax issues like date, birthday and age are simple enough to fix,
but syntax issues involving spelling mistakes require more effort.
• Algorithms and methods which find and fix these errors have to be
employed and iterated through the data for the removal of typos and
grammatical and spelling mistakes.
• Syntax errors, meanwhile, can be prevented altogether by structuring the
format in which data is collected, before running checks to ensure that the
participants have not wrongly filled in known fields.
• Setting strict boundaries for fields like State, Country, and School goes a long
way to ensuring quality data.
Filter out unwanted outliers
• Unwanted data in the form of outliers has to be removed before it can be
processed further.
• Outliers are the hardest to detect amongst all other inaccuracies within
the data.
• Thorough analysis is generally conducted before a data point or a set of
data points can be rejected as an outlier.
• Specific models that have a very low outlier tolerance can be easily
manipulated by a good number of outliers, therefore bringing down the
prediction quality.
Handle missing data
• Unfortunately, missing data is unavoidable in poorly designed data
collection procedures.
• It needs to be identified and dealt with as soon as possible.
• While these artifacts are easy to identify, filling up missing regions often
requires careful consideration, as random fills can have unexpected
outcomes on the model quality.
• Often, rows containing missing data are dropped as it’s not worth the
hassle to fill up a single data point accurately.
• When multiple data points have missing data for the same attributes,
the entire column is dropped.
• Under completely unavoidable circumstances and in the face of low
data, data scientists have to fill in missing data with calculated guesses.
• These calculations often require observation of two or more data
points similar to the one under scrutiny and filling in an average value
from these points in the missing regions.
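A hedged sketch of the three tactics described above: drop columns that are mostly empty, drop the remaining incomplete rows, or fill the gaps from similar records (here, the mean of the same city; all names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B"],
    "income": [52000, np.nan, 61000, 59000, np.nan],
    "notes":  [np.nan, np.nan, np.nan, np.nan, "ok"],   # mostly empty column
})

# Drop columns where most values are missing (keep columns with >= 60% present).
df = df.dropna(axis=1, thresh=int(0.6 * len(df)))

# Option A: drop the remaining rows that still contain missing values.
dropped = df.dropna()

# Option B: fill missing income from similar data points (same city) using the mean.
df["income"] = df.groupby("city")["income"].transform(lambda s: s.fillna(s.mean()))

print(dropped)
print(df)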
Validate data accuracy
• Data accuracy needs to be validated via cross-checks within data frame
columns to ensure that the data which is being processed is as accurate
as possible.
• Ensuring the accuracy of data is, however, hard to gauge and is possible
only in specific areas where a predefined idea of the data is known.
• Fields like countries, continents, and addresses can only have a set of
predefined values that can be easily validated against.
• In data frames constructed from more than a single source/survey,
cross-checks across sources can be another procedure to validate data
accuracy.
Data Wrangling
• Data Wrangling is a technique that is executed at the time of making an
interactive model.
• In other words, it is used to convert raw data into a format that is
convenient for consumption.
• This technique is also known as Data Munging.
• This method follows certain steps: after extracting the data
from different data sources, the data is sorted using certain algorithms,
decomposed into a different structured format, and
finally stored in another database.
Why is Data Wrangling necessary?
• Data Wrangling is an important aspect of implementing the model.
• Therefore, data is converted to the proper feasible format before applying
any model to it.
• By performing filtering and grouping, and by selecting appropriate data,
the accuracy and performance of the model can be increased.
• Another consideration is that when time-series data has to be handled,
different algorithms expect it in different forms.
• Therefore, wrangling is used to convert the time-series data into the
format required by the applied model.
• In simple words, the complex data is transformed into a usable format for
performing analysis on it.
Benefits of Data Wrangling
• Data wrangling helps to improve data usability as it converts data into
a compatible format for the end system.
• It helps to quickly build data flows within an intuitive user interface
and easily schedule and automate the data-flow process.
• Integrates various types of information and their sources (like
databases, web services, files, etc.)
• Helps users to process very large volumes of data and to easily
share data-flow techniques.
Why Data Wrangling Matters in Machine Learning
• Data wrangling has become essential for various purposes like data
analysis and machine learning.
• In cases of analysis and business intelligence operations, data wrangling
brings data closer to analysts and data scientists in the following ways:
a. Data exploration: Data wrangling helps with exploratory data analysis.
Data mapping, a crucial part of the data wrangling process, helps
establish relationships between data and provides analysts and data
scientists with a comprehensive view of their data and how best to use it
to draw insights from it.
b. Grants access to unified, structured, and high-quality data: Data
wrangling involves data cleaning and validation, which helps remove
noisy data and other unnecessary variables, leading to the production
of high-quality data.
c. Improves data workflows: Automated data wrangling helps create
workflows that ensure an organization’s continuous data flow. Data
workflows help accelerate analysis and other organizational processes
reliant on such data.
How Data Wrangling Fits into the Machine Learning Data Preparation Process
• Data wrangling leads to the creation of more efficient machine learning
models.
• In machine learning, the first model built is rarely the best, so data
scientists and ML engineers typically revisit and fine-tune the data
wrangling step of data preparation.
• This is an iterative process, and there may be several passes over the
design of the model until the engineer arrives at a
satisfactory and accurate model that fits the use case.
• Data wrangling here may include:
❖The removal of data irrelevant to the analysis.
❖Creation of a new column by aggregation
❖Using feature extraction to create a new column, for example
identifying sex by extracting prefixes from names like Mr and Miss
(a short sketch follows this list).
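A short sketch of the prefix example; the names, regular expression, and mapping are illustrative only:

import pandas as pd

passengers = pd.DataFrame({"name": ["Mr. Arun Rao", "Miss Leela Nair", "Mrs. Maya Shet"]})

# Extract the title prefix and map it to a (hypothetical) sex column.
passengers["title"] = passengers["name"].str.extract(r"^(Mr|Mrs|Miss)\b", expand=False)
passengers["sex"] = passengers["title"].map({"Mr": "male", "Mrs": "female", "Miss": "female"})
print(passengers)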
What are the best Data Wrangling Tools?
• Tabula: Tabula is a tool that is used to convert tabular data
present in PDF files into a structured form of data, i.e., a spreadsheet.
• OpenRefine: OpenRefine is open-source software that provides a
friendly Graphical User Interface (GUI) that helps to manipulate the
data according to your problem statement and makes the Data
Preparation process simpler. Therefore, it is highly useful software for
non-data scientists.
• R: R is an important programming language for the data scientist. It
provides various packages like dplyr, tidyr, etc. for performing data
manipulation.
• Data Wrangler: Data Wrangler is a tool that is used to convert real-
world data into the structured format. After the conversion, the file can
be imported into the required application like Excel, R, etc. Therefore,
less time will be spent on formatting data manually.
• CSVKit: CSVKit is a toolkit that provides the facility of conversion of CSV
files into different formats like CSV to JSON, JSON to CSV, and much
more. It makes the process of wrangling easy.
• Python with Pandas: Python, together with its Pandas library, helps
data scientists deal with complex problems and makes the Data
Preparation process efficient.
• Mr. Data Converter: Mr. Data Converter is a tool that takes an Excel file as
input and converts it into the required formats. It supports
conversion to HTML, XML, and JSON formats.
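As a tiny hedged example of what wrangling with Pandas can look like in practice (the file name and column names are hypothetical):

import pandas as pd

# Extract: read raw data from a hypothetical CSV source.
raw = pd.read_csv("raw_sales.csv", parse_dates=["order_date"])

# Transform: filter, clean, and reshape into an analysis-friendly structure.
wrangled = (
    raw.dropna(subset=["amount"])                     # remove unusable rows
       .query("amount > 0")                           # filter out invalid records
       .assign(month=lambda d: d["order_date"].dt.to_period("M"))
       .groupby(["region", "month"], as_index=False)["amount"].sum()
)

# Load: store the wrangled data for the next stage of the pipeline.
wrangled.to_csv("monthly_sales_by_region.csv", index=False)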
Summary
• Data cleaning focuses on identifying and rectifying errors and
inconsistencies, while data wrangling involves transforming and
organizing the data to make it more suitable for analysis or modeling
purposes.
• Both processes are crucial for ensuring the reliability and accuracy of
insights derived from the data.