Data pre-processing
• Data pre-processing consists of a series of steps that transform raw data
into a form suitable for analysis.
• Data Preprocessing can be defined as a process of converting raw data
into a format that is understandable and usable for further analysis. It
is an important step in the Data Preparation stage.
• It ensures that the outcome of the analysis is accurate, complete,
and consistent.
Why is Data Preprocessing Important?
• Data Preprocessing is an important step in the Data Preparation stage
of a Data Science development lifecycle that will ensure reliable,
robust, and consistent results.
• The main objective of this step is to check and ensure the quality of the
data before applying any Machine Learning or Data Mining methods.
• Accuracy - Data Preprocessing will ensure that input data is accurate
and reliable by ensuring there are no manual entry errors, no
duplicates, etc.
• Completeness - It ensures that missing values are handled, and data is
complete for further analysis.
• Consistency - Data Preprocessing ensures that input data is consistent,
i.e., the same data kept in different places should match.
• Timeliness - Whether data is updated regularly and on a timely basis
or not.
• Trustworthiness - Whether data is coming from trustworthy sources or not.
• Interpretability - Raw data is generally unusable, and Data
Preprocessing converts raw data into an interpretable format.
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether the data is available or was not
recorded.
• Consistency: To check whether copies of the same data kept in
different places match.
• Timeliness: The data should be kept up to date.
• Believability: The data should be trustworthy.
• Interpretability: How easily the data can be understood.
Preprocessing Operations
Data Cleaning | Data Integration | Data Transformation | Data Reduction | Data Discretization
1. Data Cleaning: Fills in missing values, smooths noisy data, and resolves
inconsistencies.
2. Data Integration: Integration of multiple databases, data cubes, or
files.
3. Data Transformation: Normalization, or changing data from one
format or structure to another to make it easier to analyze or use.
4. Data Reduction: Obtains a representation that is reduced in volume but
produces the same or similar analytical results.
5. Data Discretization: Process of converting continuous data into
distinct, separate categories or intervals.
• Ex.: a person's age grouped as 0-17, 18-25, … etc.
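The age example can be sketched in a few lines of Python. This is only an illustration: the slide gives the first two intervals, so the third bin ("26+") and the names `LOWER_BOUNDS`, `LABELS`, and `discretize_age` are assumptions.

```python
from bisect import bisect_right

# Lower bound of each interval. The slide lists only 0-17 and 18-25;
# the open-ended "26+" bin is an assumption made for this sketch.
LOWER_BOUNDS = [0, 18, 26]
LABELS = ["0-17", "18-25", "26+"]

def discretize_age(age):
    """Map a non-negative age to the label of the interval that contains it."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many lower bounds are <= age.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

ages = [5, 17, 18, 25, 40]
print([discretize_age(a) for a in ages])
```

Each continuous age is replaced by one of a small number of interval labels, which is what discretization means here.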
What is Data?
• Data is a collection of objects and their attributes.
• Attribute: A property or characteristic of an object.
Ex.: name of a person, gender, salary.
• -> An attribute is also known as a variable, field,
characteristic, or feature.
• Object: A collection of attributes describes an
object.
• -> An object is also known as a record, point, or entity.
Attributes
• Attributes are qualities or characteristics that describe
an object, individual, or phenomenon.
• Attributes can be categorical, representing distinct
categories or classes, such as colors, types, or labels.
Types of Attributes
• Nominal Attributes :
Nominal means "relating to names". The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical.
• Example : Suppose that skin color and education status are two attributes
describing person objects. Possible values for skin color
are dark, white, brown. The attribute education status can take the values
undergraduate, postgraduate, matriculate.
• Binary Attributes :
A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where 0 typically means that the attribute is absent and 1 means that
it is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.
• Example - Given the attribute drinker describing a patient object, 1 indicates that the
patient drinks, while 0 indicates that the patient does not. Similarly, suppose the
patient undergoes a medical test that has two possible outcomes; the attribute recording the result is then also binary.
Types of Data Attribute
• Nominal Attribute
• Binary Attribute
• Ordinal Attribute
• Numeric Attribute: Interval-Scaled or Ratio-Scaled
1. Nominal Attribute:
- Its values are symbols or names of things.
- Each value represents some kind of category, code, or state.
- It is also referred to as “categorical”.
- Values do not have a meaningful order.
- E.g. HairColor = {black, brown, grey, red, white}
- ZipCode = {411023, 411028, 444009}
- MaritalStatus = {married, unmarried, divorced}
2. Binary Attribute:
- It is a nominal attribute with only two categories or states, namely 0
(absent) and 1 (present).
- Binary attributes are referred to as Boolean (true/false).
Symmetric binary: both outcomes are equally important.
Ex. Gender = {male, female}
Asymmetric binary: the outcomes are not equally important.
One value describes the problematic condition and
the other the normal condition.
Ex. Medical Test = {positive, negative}
Fraud Detection = {yes, no}
3. Ordinal Attribute:
- An attribute whose possible values have a meaningful order or ranking
based on a specific characteristic.
- Ex. Grade (A, B, C), height (tall/medium/short), ranking (1-10),
cold-drink bottle (small/medium/large).
4. Numeric Attributes:
- A numeric attribute is a measurable quantity.
- Represented as integer or real values.
4.1 Interval-Scaled:
- Measured on a scale of equal-size units.
- Values of interval-scaled attributes have an order and can be positive, zero, or negative.
- E.g. Temperature in Celsius and Fahrenheit.
1 degree Celsius = 33.8 degrees Fahrenheit
100 degrees Celsius = 212 degrees Fahrenheit
4.2 Ratio-Scaled:
- Has a true zero point, so a value can be expressed as a multiple of another value.
- Values are ordered, and we can compute the difference
between values, as well as the mean, median, and mode.
- Ex. Height: 6 feet is twice 3 feet.
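A small sketch of the difference between the two numeric scales, using the temperature and height examples above. The function names and the sample values 20 and 10 degrees are chosen for illustration only.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32

def feet_to_inches(ft):
    """Convert a length in feet to inches."""
    return ft * 12

# Ratio-scaled: the ratio of two heights is the same in feet and in inches.
print(6 / 3, feet_to_inches(6) / feet_to_inches(3))

# Interval-scaled: the ratio of two temperatures depends on the unit,
# so saying that 20 degrees is "twice as hot" as 10 degrees is not meaningful.
print(20 / 10, celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10))
```

The height ratio is unchanged by the unit conversion, while the temperature ratio is not, because the zero of the Celsius and Fahrenheit scales is arbitrary.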
Discrete v/s Continuous Attributes
1. Discrete Attribute:
- Has only a finite or countably infinite set of values.
- It may have numeric values, such as 0, 1, 2, ...
- E.g. Number of students in a class (40, 41, 45, ...).
You cannot count 45.5 students.
Blood Groups (A, B, O, AB)
2. Continuous Attribute:
- Has real numbers as attribute values.
- Typically represented as a floating-point variable.
- E.g. Temperature, speed, time, distance, age
• Discrete Attribute
• Has only a finite or countably infinite set of values
Ex: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Continuous Attribute
• Has real numbers as attribute values
• Ex: temperature, height, or weight.
• Practically, real values can only be measured and represented using a
finite number of digits.
• Discrete Attribute :
A discrete attribute has a finite or countably infinite set of values,
which may or may not be represented as integers.
• Ex: The attributes skin color, drinker, medical report, and drink size each
have a finite number of values, and so are discrete.
• Continuous Attribute :
A continuous attribute has real numbers as attribute values.
• Ex - Height, weight, and temperature have real values. Real values can
only be represented and measured using a finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Feature | Discrete Attributes | Continuous Attributes
Definition | Take on distinct, countable values | Take on any value within a range, including decimals
Value Type | Whole numbers only (integers) | Real numbers (can be fractional or decimal)
Possible Values | Finite or countably infinite set | Uncountably infinite set
Examples | Number of children, cars, clicks, exam questions | Height, weight, temperature, speed
Measured or Counted? | Counted | Measured
Graphical Representation | Bar chart, pie chart, histogram (discrete bins) | Histogram (with bins), line plot, density curve
Common Uses | Event counting, population stats, categorical encodings | Regression tasks, sensor data, time series
Typical Data Types | Integer (int) | Float (decimal/double)
Interval Between Values | Gaps between values (e.g., 1, 2, 3) | No gaps; values are continuous along a range
Can apply mathematical ops? | Only certain operations (e.g., addition, count, frequency) | Most mathematical operations are applicable
Examples in Real Data | Number of products sold; number of students; shoe size | Temperature readings; weight of a package; distance traveled
Data Preprocessing - Data Wrangling
Data Quality:
• “The ability of a given dataset to serve an intended purpose.”
• Data have quality if they satisfy the requirements of our intended
use. Ex. of out-of-range values: Income = -100,
Age = 4000.
• If much irrelevant and redundant information is present, or the data is
noisy and unreliable, then knowledge discovery is more difficult.
• Data preparation and filtering steps can take a considerable amount
of processing time.
• There are many possible reasons for inaccurate, incomplete, and inconsistent data in
real-world databases and data warehouses.
1. Accuracy:
• It means having correct attribute values.
• Inaccuracy arises from typing errors or garbage entries during automated
transmission of data.
Reasons for inaccurate data:
- Human or computer errors may have occurred at data
entry.
- Users may purposely submit incorrect values for mandatory
fields. Ex. entering only "1 Jan" as DOB when the accurate value is 1 Jan 2021.
2. Consistency:
• Incorrect and redundant data may result from inconsistencies in naming
conventions, data codes, or formats (e.g., dates).
• Duplicate tuples also require data
cleaning.
• Consistent data looks the same everywhere.
Example: One place should not say “MH” while
another says “Maharashtra”.
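One way to enforce this kind of consistency is to map every known spelling to a single canonical value. The sketch below assumes a hand-written lookup table; `CANONICAL_STATE` and `normalize_state` are hypothetical names, not part of any library.

```python
# Hypothetical lookup from known spellings (lower-cased) to one canonical name.
CANONICAL_STATE = {"mh": "Maharashtra", "maharashtra": "Maharashtra"}

def normalize_state(value):
    """Return the canonical state name, or the trimmed input if it is not in the lookup."""
    key = value.strip().lower()
    return CANONICAL_STATE.get(key, value.strip())

records = ["MH", "Maharashtra", " mh "]
print([normalize_state(r) for r in records])
```

Values that are not in the lookup are passed through unchanged, so unknown codes can still be reviewed by hand.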
3. Incompleteness:
Cust_id | Name | Address | Age | Occupation | Category
C01 | Ravi Shukla | Mumbai | 45 | Services | (missing)
C02 | Rohit Joshi | Pune | (missing) | Grocery Shop | Gold
4. Timeliness:
- Refers to the relevance and value of the data being analyzed or used, relative
to the time at which it is accessed.
- Timely data is data that is up to date and available when needed, which
is crucial for making accurate decisions.
- Ex. Fraud detection in banking, weather forecasting.
Data Munging / Data Wrangling Operations:
• “The task of converting data into a feasible format that is suitable for the
consumption of the data.”
• Goal of Data Wrangling - Assure Quality.
Data Munging includes the operations:
- Data Cleaning
- Data Transformation
- Data Reduction
- Data Discretization
1. Data Cleaning / Data Cleansing / Scrubbing:
- Done by handling irrelevant or missing data.
- How to clean?
- Filling in missing values
- Smoothing noisy data
- Removing outliers
- Resolving any inconsistencies
1.1 Missing Values:
- Some values in the data may not be filled in for various reasons and
hence are considered missing.
- If some tuples have no recorded value for some attribute, then it is
difficult to proceed with the data.
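There are 3 cases of missing data:
1. MCAR (Missing Completely At Random): the missingness is unrelated to any value in the data,
e.g., someone forgot to fill in the value or the information was lost.
2. MAR (Missing At Random): the missingness depends on other observed attributes but not on the
missing value itself, e.g., a particular group of respondents skips a field, mainly for privacy reasons.
3. MNAR (Missing Not At Random): the missingness depends on the missing value itself,
e.g., a value that is withheld or not available because of what it is.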
Handling Missing Values:
1. Ignore the tuple:
- This method is not very effective, unless the tuple contains several
attributes with missing values.
[Figure: DataSet and Updated DataSet]
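A minimal sketch of ignoring tuples with missing values, assuming each record is a dictionary and `None` marks a missing value. The sample records are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"name": "A", "age": 25, "city": "Pune"},
    {"name": "B", "age": None, "city": "Mumbai"},
    {"name": "C", "age": 30, "city": None},
    {"name": "D", "age": 41, "city": "Nagpur"},
]

def drop_incomplete(records):
    """Keep only the records in which no attribute is missing."""
    return [r for r in records if all(v is not None for v in r.values())]

cleaned = drop_incomplete(rows)
print([r["name"] for r in cleaned])
```

Half of the records are lost here even though each dropped record is missing only one value, which illustrates why this method is often not very effective.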
2. Fill in the missing value manually:
- It is time consuming and may not be feasible for a large dataset with
many missing values.
3. Use a global constant to fill in the missing value:
- Replace all missing attribute values with the same constant, such as a label like
“unknown” or “infinity”.
[Figure: DataSet and Updated DataSet]
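Filling with a global constant can be sketched as follows, again assuming dictionary records with `None` for missing values. The function name `fill_with_constant` and the sample data are illustrative.

```python
def fill_with_constant(records, constant="unknown"):
    """Return copies of the records with every missing (None) value replaced by a constant."""
    return [
        {key: (constant if value is None else value) for key, value in r.items()}
        for r in records
    ]

# Hypothetical records; None marks a missing value.
rows = [{"name": "A", "city": "Pune"}, {"name": "B", "city": None}]
filled = fill_with_constant(rows)
print(filled)
```

The original records are left untouched, so the filled copy can be compared against the raw data.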
4. Use a measure of central tendency for the attribute (mean/median) to fill in missing values:
- Use the mean, median, or mode.
[Figure: DataSet and Updated DataSet]
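A sketch of imputation with a measure of central tendency, using Python's `statistics` module. The `impute` helper and the sample ages are assumptions for illustration.

```python
from statistics import mean, median

def impute(values, strategy=mean):
    """Replace None entries with a central-tendency statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages; 110 is an extreme value that pulls the mean upwards.
ages = [20, 30, None, 40, 110]
print(impute(ages, mean))    # missing value becomes 50
print(impute(ages, median))  # missing value becomes 35.0
```

The median is less affected by the extreme value, which is why it is often preferred for skewed data.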
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
[Figure: Dataset and Updated Dataset]
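The class-wise variant can be sketched by grouping the observed values per class first. The attribute names `income` and `category` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_by_class(records, attr, class_attr):
    """Fill missing `attr` values with the mean of `attr` among records of the same class.

    Assumes every class has at least one observed value for `attr`.
    """
    by_class = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            by_class[r[class_attr]].append(r[attr])
    class_mean = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_mean[r[class_attr]]} if r[attr] is None else dict(r)
        for r in records
    ]

# Hypothetical customers grouped by category; None marks a missing income.
rows = [
    {"category": "Gold", "income": 100},
    {"category": "Gold", "income": None},
    {"category": "Gold", "income": 120},
    {"category": "Silver", "income": 50},
    {"category": "Silver", "income": None},
]
filled = impute_by_class(rows, "income", "category")
print([r["income"] for r in filled])
```

Each missing income is filled from its own category, so a Gold customer is not assigned a value that is dominated by Silver customers.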
6. Use the most probable value to fill in the missing value:
- A prediction algorithm is used to estimate the missing value.
- Identify the most probable value.
For age:
ages – 25, 30
mode – age is 25
For gender:
ages – 25, 30
mode – gender is male
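The example above uses the mode as the most probable value. The sketch below does the same; it is the simplest stand-in for a prediction algorithm, not a full predictive model, and the sample values are hypothetical.

```python
from statistics import mode

def fill_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    most_common = mode(observed)
    return [most_common if v is None else v for v in values]

# Hypothetical columns; None marks a missing value.
ages = [25, 30, 25, None]
genders = ["male", "female", "male", None]
print(fill_with_mode(ages))
print(fill_with_mode(genders))
```

The same helper works for numeric and categorical columns because the mode only requires counting values.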
1.2 Noisy Data:
- Contains errors, outliers, or inaccuracies.
- Creates problems in analysis.
- Arises when data gets corrupted.
Ex. i. Typographical error: “John Smith” -> “Jhn Smith”.
It creates inconsistency.
ii. Outliers
iii. Inconsistent formatting:
Some entries are in DD/MM/YYYY.
Some entries are in MM/DD/YYYY.
iv. Invalid values: a quantity should not be negative (e.g., -12).
Duplicate Entries:
- Duplicates are a big problem.
- Before starting the analysis, such duplicates should be identified and handled
properly.
- Data duplication can also occur when you combine data from
various sources.
- It degrades data quality, which affects the outcomes of data analysis.
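A sketch of removing duplicates after combining two sources, assuming that records with the same key fields describe the same entity. The field names and sample data are hypothetical.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record seen for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical customer lists from two sources; C01 appears in both.
source_a = [{"cust_id": "C01", "city": "Mumbai"}, {"cust_id": "C02", "city": "Pune"}]
source_b = [{"cust_id": "C01", "city": "Mumbai"}, {"cust_id": "C03", "city": "Nagpur"}]
combined = drop_duplicates(source_a + source_b, key_fields=["cust_id"])
print([r["cust_id"] for r in combined])
```

Choosing the key fields is the important design decision: too few fields merge distinct entities, too many let near-duplicates through.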
NULL: (when a value is not known)
- When it comes to analytics, NULLs cannot be processed by many
algorithms. In such cases, it is necessary to replace the missing values with
some reasonable proxy.
Huge Outliers:
- An outlier is a data point that differs significantly from other observations.
- Outliers are extreme values that deviate from the other observations in the data.
Causes of outliers:
1. Data Entry Error (human error)
2. Measurement Error (instrument error)
3. Sampling Error (mixing data from wrong source)
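The slides do not name a detection method. One common choice is the interquartile range (IQR) rule, sketched below as an assumption: a value is flagged if it lies more than `k` times the IQR below the first quartile or above the third quartile.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings; 95 could be a data entry or measurement error.
readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

A flagged value is only a candidate outlier; whether to remove or correct it depends on which of the causes above applies.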
Out of Date Data:
- Refers to information in a dataset that is no longer current or relevant
due to the passage of time.
- Data becomes outdated because of changes in real-world conditions or
updates in technology.
- This may lead to inaccurate information.
- Ex. Customer & contact info
- Person – address
- Product – Price
- Software -- versions
Formatting Issue:
i. Extra Whitespaces:
- Occurs when there are unintended spaces in the data that can affect
data processing, analysis & consistency.
- Ex. “John Smith” --- “John Smith ”
Email-id “abc@gmail.com” --- “abc@gmail.com ”
ii. Invalid Characters:
Some data files contain invalid bytes or characters in the middle of a value.
Ex. J@hn
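Both formatting issues can be checked with a few lines of Python. The rule for what counts as a valid name character (letters, spaces, hyphens, apostrophes) is an assumption made for this sketch.

```python
import re

def strip_extra_whitespace(text):
    """Trim both ends and collapse internal runs of whitespace to a single space."""
    return " ".join(text.split())

def has_invalid_name_chars(name):
    """Flag names containing anything other than letters, spaces, hyphens, or apostrophes."""
    return re.search(r"[^A-Za-z\s'\-]", name) is not None

print(repr(strip_extra_whitespace("John Smith ")))
print(has_invalid_name_chars("J@hn"), has_invalid_name_chars("John Smith"))
```

After trimming, “John Smith ” and “John Smith” compare as equal, so they are no longer treated as two different values.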
2. Data Transformation
- Process of converting raw data into a format or structure that is
more suitable for data analysis.
- Conversion of raw data into a single, easy-to-read format to facilitate
easy analysis.
- Ex. Time format: 12-hour format to 24-hour format
Date format: MM-DD-YYYY to YYYY-MM-DD
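The two format conversions can be sketched with the standard `datetime` module. The input layout for the time string ("02:30 PM") is an assumption; the date layouts are the ones named above.

```python
from datetime import datetime

def to_24_hour(time_12h):
    """Convert a 12-hour time such as '02:30 PM' to 24-hour 'HH:MM'."""
    return datetime.strptime(time_12h, "%I:%M %p").strftime("%H:%M")

def to_iso_date(date_mdy):
    """Convert a date from 'MM-DD-YYYY' to 'YYYY-MM-DD'."""
    return datetime.strptime(date_mdy, "%m-%d-%Y").strftime("%Y-%m-%d")

print(to_24_hour("02:30 PM"))
print(to_iso_date("03-25-2024"))
```

Parsing into a `datetime` first also validates the input: a string that does not match the expected format raises a `ValueError` instead of being silently converted.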
- Structural Transformation:
Renaming, moving, and combining columns in a database.
Benefits of data transformation:
- Data becomes better organized.
- Easier for both humans and computers to use.
- Improves data quality.
- Facilitates compatibility between applications, systems, and types of
data.
Data Transformation Strategies
Rescaling | Normalizing | Binarizing | Standardizing | Labelling | One hot encoding
1. Rescaling:
- Transforming the data so that it fits within a specific scale, like 0-100 or
0-1.
- On a 0-1 scale, 0 corresponds to the minimum value and 1 to the maximum value.
- Many statistical or machine learning techniques prefer the
attributes to be rescaled to fall within a given scale.
2. Normalizing:
- Process used to adjust the scale of data so that it fits within a standard
range, usually [0, 1].
- It is used to bring the features of a dataset onto a common scale.
- Ex. Changing measurement units from meters to inches for height, or
from kilograms to pounds for weight, may lead to very different results;
normalization removes this dependence on the choice of units.
- Min-Max technique
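A sketch of the min-max technique, which also covers rescaling to an arbitrary range such as 0-100. The function name, parameter names, and sample heights are chosen for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal; min-max scaling is undefined")
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170, 190]
print(min_max_scale(heights_cm))          # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(heights_cm, 0, 100))  # [0.0, 25.0, 50.0, 100.0]
```

Converting the heights to inches before scaling gives the same result up to floating-point rounding, which is the unit independence mentioned above.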
3. Binarizing:
- Process of converting data to either 0 or 1 based on a threshold value.
- Values above the threshold are marked as 1.
- Values equal to or below the threshold are marked as 0.
- Ex. Speed limit (above 60 km/hr: pay a fine | 60 km/hr or below:
within legal limits)
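The speed limit example as a short sketch. The function name and the sample speeds are illustrative.

```python
def binarize(values, threshold):
    """Mark values above the threshold as 1 and all other values as 0."""
    return [1 if v > threshold else 0 for v in values]

# Hypothetical speeds in km/hr; 1 means "pay a fine".
speeds = [45, 60, 61, 80]
print(binarize(speeds, threshold=60))  # [0, 0, 1, 1]
```

A speed of exactly 60 maps to 0, matching the rule that values equal to the threshold are within the limit.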
4. Standardizing:
- Also called mean removal.
- Standardization is another scaling technique where the values are
centered around the mean with a unit standard deviation.
- That is, the mean of the attribute becomes zero and the resulting distribution
has a unit standard deviation.
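A sketch of standardization, computing z = (value - mean) / standard deviation for each value. Using the population standard deviation (`pstdev`) is a choice made for this sketch; the sample scores are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; standardization is undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical scores with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardize(scores)
print(z)
print(mean(z), pstdev(z))
```

After the transformation the values have mean 0 and standard deviation 1, regardless of the original unit or range.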
5. Label Encoding:
- Used to convert textual labels into numeric form so that they
can be used in a machine-readable form.
- Labels are assigned a value from 0 to (n-1), where n is the number of distinct
values of a particular categorical feature.
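A sketch of label encoding. Assigning codes in sorted label order is an assumption made here so that the mapping is reproducible; any fixed order would do.

```python
def label_encode(labels):
    """Assign each distinct label an integer code from 0 to n-1, in sorted label order."""
    mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

# Hypothetical categorical column.
fruits = ["Apple", "Banana", "Orange", "Banana"]
codes, mapping = label_encode(fruits)
print(codes)    # [0, 1, 2, 1]
print(mapping)
```

The integer codes imply an order (Apple < Banana < Orange) that the data does not have, which is one reason one-hot encoding is often preferred for nominal attributes.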
6. One-Hot Encoding:
- Refers to splitting a column that contains categorical
data into many columns, depending on the number of categories present
in that column.
- It creates a binary column for each category; in each row, only one of these columns
has a value of 1, and the rest are 0.
Fruit Apple Banana Orange
Apple 1 0 0
Banana 0 1 0
Orange 0 0 1
Banana 0 1 0
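The table above can be reproduced with a short sketch. The function name `one_hot` is illustrative; the category columns are produced in sorted order.

```python
def one_hot(labels):
    """Return the sorted categories and one binary row per label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

fruits = ["Apple", "Banana", "Orange", "Banana"]
categories, rows = one_hot(fruits)
print(categories)
for fruit, row in zip(fruits, rows):
    print(fruit, row)
```

Every row contains exactly one 1, in the column of its own category.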
3. Data Reduction
• Data Reduction is the process of reducing the volume or
complexity of data without losing valuable information.
• When data is collected from different data sources for analysis, it results
in a huge amount of data.
• It is difficult to run complex queries on a huge amount of data.
• Such queries take a long time.
• It is a process that reduces the volume of the original data and represents
it in a smaller volume.
1. Dimensionality Reduction:
Dimensionality Reduction is the process of reducing the number of
features (variables or columns) in a dataset while preserving as much
information as possible.
- Process of reducing the number of random variables or attributes
under consideration.
- In ML techniques such as classification and clustering, features are
studied to obtain the analysis output.
- The higher the number of features, the greater the difficulty of the analysis.
Dimensionality Reduction
Feature Selection | Feature Extraction
1. Feature Selection:
- Process of extracting a subset of features from the original set of all
features of a dataset to obtain a smaller subset that can be used to
model a given problem.
- Feature selection is the process of selecting a subset of relevant,
informative, and non-redundant features (variables) from the
original dataset.
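The slides do not name a selection method. As one simple illustration, the sketch below drops features whose variance does not exceed a threshold, since a column that never changes carries no information. It does not check relevance to a target or redundancy between features. The column names and data are hypothetical.

```python
from statistics import pvariance

def select_by_variance(table, threshold=0.0):
    """Keep the columns of `table` (name -> list of numbers) whose variance exceeds the threshold."""
    return {name: col for name, col in table.items() if pvariance(col) > threshold}

# Hypothetical dataset with one constant, uninformative column.
data = {
    "height": [150, 160, 170, 180],
    "constant_flag": [1, 1, 1, 1],
    "weight": [50, 65, 70, 90],
}
selected = select_by_variance(data)
print(list(selected))
```

The selected features are a subset of the original columns, unchanged, which is what distinguishes feature selection from feature extraction.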
2. Feature Extraction:
• Feature Extraction is a dimensionality reduction technique that
creates new features from the original dataset by transforming the
data into a lower-dimensional space.
• Unlike feature selection (which removes irrelevant features), feature
extraction builds new features that are combinations or
representations of the original ones, often capturing the most
important patterns or structure.