
Data Pre-processing
• Data pre-processing consists of a series of steps that transform raw data into a clean, analysis-ready form.
• Data Preprocessing can be defined as the process of converting raw data into a format that is understandable and usable for further analysis. It is an important step in the Data Preparation stage.
• It ensures that the outcome of the analysis is accurate, complete, and consistent.
Why is Data Preprocessing Important?

• Data Preprocessing is an important step in the Data Preparation stage of the Data Science development lifecycle that ensures reliable, robust, and consistent results.
• The main objective of this step is to check and ensure the quality of data before applying any Machine Learning or Data Mining methods.
• Accuracy - Data Preprocessing will ensure that input data is accurate
and reliable by ensuring there are no manual entry errors, no
duplicates, etc.
• Completeness - It ensures that missing values are handled, and data is
complete for further analysis.
• Consistent - Data Preprocessing ensures that input data is consistent,
i.e., the same data kept in different places should match.
• Timeliness - Whether data is updated regularly and on a timely basis
or not.
• Trustable - Whether data is coming from trustworthy sources or not.
• Interpretability - Raw data is generally unusable, and Data
Preprocessing converts raw data into an interpretable format.
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether the data is available or whether values were left unrecorded.
• Consistency: To check whether the same data kept in different places matches.
• Timeliness: The data should be updated correctly and on time.
• Believability: The data should be trustworthy.
• Interpretability: The understandability of the data.
Preprocessing Operations

Data Cleaning | Data Integration | Data Transformation | Data Reduction | Data Discretization
1. Data Cleaning: Fills in missing values, smooths noisy data, and resolves inconsistencies.
2. Data Integration: Integration of multiple databases, data cubes, or files.
3. Data Transformation: Normalization, or changing data from one format or structure to another to make it easier to analyze or use.
4. Data Reduction: Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results.
5. Data Discretization: The process of converting continuous data into distinct, separate categories or intervals.
• Ex.: people's ages binned as 0-17, 18-25, etc. (see the sketch below).
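A minimal sketch of such binning with pandas; the column name, the bin edges beyond 0-17 and 18-25, and the labels are illustrative assumptions:

```python
import pandas as pd

# Hypothetical ages; the first two bins follow the 0-17, 18-25 example above
df = pd.DataFrame({"age": [5, 16, 19, 24, 37, 62]})
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 17, 25, 40, 60, 100],
    labels=["child", "young adult", "adult", "middle-aged", "senior"],
)
print(df)
```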
What is Data?
• Data is a collection of objects and their attributes.
• Attribute: A property or characteristic of an object.
  Ex.: name of a person, gender, salary.
• An attribute is also known as a variable, field, characteristic, or feature.
• Object: A collection of attributes describes an object.
• An object is also known as a record, point, or entity.
Attributes
• Attributes are qualities or characteristics that describe
an object, individual, or phenomenon.
• Attributes can be categorical, representing distinct
categories or classes, such as colors, types, or labels.
Types of Attributes
• Nominal Attributes:
Nominal means "relating to names". The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, so nominal attributes are also referred to as categorical.
• Example: Suppose that skin color and education status are two attributes describing person objects. Possible values for skin color are dark, white, and brown. The attribute education status can contain the values undergraduate, postgraduate, and matriculate.
• Binary Attributes:
A binary attribute is a category of nominal attributes that contains only two states: 0 or 1, where 0 typically means the attribute is absent and 1 means it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.
• Example: Given the attribute drinker describing a patient, 1 specifies that the patient drinks, while 0 specifies that the patient does not. Similarly, suppose the patient undergoes a medical test that has two possible outcomes.
Types of Data Attributes
• Nominal Attribute
• Binary Attribute
• Ordinal Attribute
• Numeric Attribute
  - Interval-Scaled
  - Ratio-Scaled
1. Nominal Attribute:
- Its values are symbols or names of things.
- Each value represents some kind of category, code, or state.
- It is also referred to as "categorical".
- Values do not have a meaningful order.
- E.g. HairColor = {black, brown, grey, red, white}
- ZipCode = {411023, 411028, 444009}
- MaritalStatus = {married, unmarried, divorced}
2. Binary Attribute:
- A nominal attribute with only two categories or states, namely 0 (absent) and 1 (present).
- Binary attributes are also referred to as Boolean (true/false).
- Symmetric binary: both outcomes are equally important.
  Ex. Gender = {male, female}
- Asymmetric binary: the outcomes are not equally important; one value describes the problematic condition and the other the normal condition.
  Ex. Medical Test = {positive, negative}
  Fraud Detection = {yes, no}
3. Ordinal Attribute:
- An attribute whose possible values have a meaningful order or ranking based on a specific characteristic.
- Ex. Grade (A, B, C), height (tall/medium/short), ranking (1-10), cold-drink bottle (small/medium/large).
4. Numeric Attributes:
- A measurable quantity.
- Represented as integer or real values.

4.1 Interval-Scaled:
- Measured on a scale of equal-size units.
- Values of interval-scaled attributes have order and can be positive, zero, or negative.
- E.g. temperature in Celsius and Fahrenheit:
  1 degree Celsius = 33.8 degrees Fahrenheit
  100 degrees Celsius = 212 degrees Fahrenheit

4.2 Ratio-Scaled:
- A value can be interpreted as a multiple of another value (there is a true zero point).
- Values are ordered, and we can compute differences between values, as well as the mean, median, and mode.
- Ex. Height: 6 feet is twice 3 feet.
Discrete v/s Continuous Attributes
1. Discrete Attribute:
- Has only a finite or countably infinite set of values.
- It may have numeric values, such as 0, 1, 2, ...
- E.g. number of students in a class (40, 41, 45, ...); you cannot count 45.5.
  Blood groups (A, B, O, AB).
2. Continuous Attribute:
- Has real numbers as attribute values.
- Typically represented as a floating-point variable.
- E.g. temperature, speed, time, distance, age.
• Discrete Attribute
• Has only a finite or countably infinite set of values.
  Ex: zip codes, counts, or the set of words in a collection of documents.
• Often represented as integer variables.
• Continuous Attribute
• Has real numbers as attribute values.
• Ex: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite number of digits.
• Discrete Attribute:
A discrete attribute has a finite or countably infinite set of values, which may appear as integers.
• Ex: The attributes skin color, drinker, medical report, and drink size each have a finite number of values, and so are discrete.
• Continuous Attribute:
A continuous attribute has real numbers as attribute values.
• Ex: Height, weight, and temperature have real values. Real values can only be represented and measured using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
| Feature | Discrete Attributes | Continuous Attributes |
|---|---|---|
| Definition | Take on distinct, countable values | Take on any value within a range, including decimals |
| Value Type | Whole numbers only (integers) | Real numbers (can be fractional or decimal) |
| Possible Values | Finite or countably infinite set | Uncountably infinite set |
| Examples | Number of children, cars, clicks, exam questions | Height, weight, temperature, speed |
| Measured or Counted? | Counted | Measured |
| Graphical Representation | Bar chart, pie chart, histogram (discrete bins) | Histogram (with bins), line plot, density curve |
| Common Uses | Event counting, population stats, categorical encodings | Regression tasks, sensor data, time series |
| Typical Data Types | Integer (int) | Float (decimal/double) |
| Interval Between Values | Gaps between values (e.g., 1, 2, 3) | No gaps; values are continuous along a range |
| Can apply mathematical ops? | Only certain operations (e.g., addition, count, frequency) | Most mathematical operations are applicable |
| Examples in Real Data | Number of products sold, number of students, shoe size | Temperature readings, weight of a package, distance traveled |
Data Preprocessing - Data Wrangling

Data Quality:
• The ability of a given dataset to serve an intended purpose.
• Data have quality if they satisfy the requirements of the intended use.
  Ex. Income = -100 (out-of-range value), age = 4000.

 If there is much irrelevant and redundant information, or noisy and unreliable data, then knowledge discovery becomes more difficult.
 Data preparation and filtering steps can take a considerable amount of processing time.
• There are many reasons for inaccurate, incomplete, and inconsistent data in real-world databases and data warehouses.

1. Accuracy:
• It means having correct attribute values.
• Inaccuracy can happen due to typing errors or garbage entries during automated data transmission.
Reasons for inaccurate data:
- There may have been human or computer errors at data entry.
- Users may purposely submit incorrect values for mandatory fields.
  Ex. entering only "1 Jan" for a date of birth when the accurate value is "1 Jan 2021".
2. Consistency:
• Incorrect and redundant data may result from inconsistencies in naming conventions or data codes.
• Duplicate tuples also require data cleaning (e.g., differing date formats).
• Consistent data looks the same everywhere.
  Example: one place should not say "MH" while another says "Maharashtra".
3. Incompleteness:
Some attribute values are missing, as in the example below:

| Cust_id | Name | Address | Age | Occupation | Category |
|---|---|---|---|---|---|
| C01 | Ravi Shukla | Mumbai | 45 | Services | |
| C02 | Rohit Joshi | Pune | | Grocery Shop | Gold |
4. Timeliness:
- Refers to the relevance and value of the data being analyzed or used, relative to the time at which it is accessed.
- Timely data is up to date and available when needed, which is crucial for making accurate decisions.
- Ex. fraud detection in banking, weather forecasting.
Data Munging / Data Wrangling Operations:
• The task of converting data into a feasible format that is suitable for consumption and analysis.
• Goal of Data Wrangling: assure quality.
Data Munging includes the following operations:
 Data Cleaning
 Data Transformation
 Data Reduction
 Data Discretization
1. Data Cleaning / Data Cleansing / Scrubbing:
- Done by handling irrelevant or missing data.
- How to clean?
  - Filling in missing values
  - Smoothing noisy data
  - Removing outliers
  - Resolving any inconsistencies
1.1 Missing Values:
- Some values in the data may not be filled in for various reasons and hence are considered missing.
- If some tuples have no recorded value for an attribute, it is difficult to proceed with the analysis.

There are 3 cases of missing data:
1. MCAR (Missing Completely At Random): occurs when someone simply forgot to fill in the value or the information was lost; the missingness is unrelated to the data itself.
2. MAR (Missing At Random): occurs when someone purposely does not fill in the data, mainly due to privacy concerns.
3. MNAR (Missing Not At Random): occurs when the data is simply not available.
 Handling Missing Values:
1. Ignore the tuple:
- This method is not very effective, unless the tuple contains several
attributes with missing values.

2. Fill in the missing value manually:
- It is time-consuming and may not be feasible for a large dataset with many missing values.
3. Use a global constant to fill in the missing value:
- Replace all missing attribute values by the same constant, such as a label like "Unknown" or "infinity".
4. Use a measure of central tendency for the attribute (mean/median) to fill in missing values:
- Use the mean, median, or mode of the attribute.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value:
- A prediction algorithm is used to estimate the missing value.
- Identify the most probable value from the observed data. For example, if the observed ages are 25 and 30 with 25 occurring most often, a missing age is filled with the mode 25; likewise, a missing gender is filled with the modal gender (e.g., male).
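A minimal pandas sketch of several of these strategies; the column names and values are illustrative assumptions, not taken from the slides:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", None, "male"],
    "age": [25, 30, None, 25],
    "city": ["Pune", None, "Mumbai", "Pune"],
})

# 1. Ignore (drop) tuples that contain missing values
dropped = df.dropna()

# 3. Fill with a global constant
df["city"] = df["city"].fillna("Unknown")

# 4./6. Fill a numeric attribute with its mean, and a categorical
#       attribute with its most probable (modal) value
df["age"] = df["age"].fillna(df["age"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
print(df)
```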
1.2 Noisy Data:
- Contains errors, outliers, or inaccuracies.
- Creates problems in analysis.
- The data has become corrupted.

Ex.
i. Typographical error: "John Smith" -> "Jhn Smith"; this creates inconsistency.
ii. Outliers.
iii. Inconsistent formatting: some entries use DD/MM/YYYY while others use MM/DD/YYYY.
iv. A quantity should not be negative (e.g., -12).
Duplicate Entries:
- Duplicates are a significant problem.
- Before starting the analysis, such duplication should be identified and handled properly.
- Data duplication can also occur when you are combining data from various sources.
- It degrades data quality, which affects the outcomes of data analysis.
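A minimal pandas sketch of finding and dropping duplicates; the column names and the key column are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "cust_id": ["C01", "C02", "C02", "C03"],
    "name": ["Ravi", "Rohit", "Rohit", "Asha"],
})

print(df.duplicated().sum())                # count of fully duplicated rows
df = df.drop_duplicates()                   # drop exact duplicate rows
df = df.drop_duplicates(subset="cust_id")   # or deduplicate on a key column
```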

 NULL (when a value is not known):
- When it comes to analytics, NULLs cannot be processed by many algorithms. In such cases, it is necessary to replace missing values with some reasonable proxy.
Huge Outliers:
- An outlier is a data point that differs significantly from the other observations.
- Outliers are extreme values that deviate from the other observations in the data.

Causes of outliers:
1. Data entry error (human error)
2. Measurement error (instrument error)
3. Sampling error (mixing data from the wrong source)
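A minimal sketch of flagging outliers with the interquartile range (IQR) rule; this is one common convention rather than a method prescribed by the slides, and the values are made up:

```python
import pandas as pd

s = pd.Series([21, 22, 23, 24, 25, 26, 27, 400])   # 400 looks suspicious

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)   # flags 400
```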
Out-of-Date Data:
- Refers to information in a dataset that is no longer current or relevant due to the passage of time.
- Data becomes outdated because of changes in real-world conditions or updates in technology.
- This may lead to inaccurate information.
- Ex. customer and contact information:
  - Person: address
  - Product: price
  - Software: version
Formatting Issues:
i. Extra whitespace:
- Occurs when there are unintended spaces in the data that can affect data processing, analysis, and consistency.
- Ex. "John Smith" vs. "John Smith " (trailing space)
  Email id: "abc@gmail.com" vs. "abc@gmail.com " (trailing space)
ii. Invalid characters:
- Some data files will randomly have invalid bytes in the middle of them.
- Ex. J@hn
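A minimal pandas sketch of cleaning up both issues; the sample values and the set of "allowed" characters are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"name": ["  John Smith ", "J@hn D#e"],
                   "email": ["abc@gmail.com  ", "xyz@gmail.com"]})

# Trim leading/trailing whitespace and collapse repeated inner spaces
df["name"] = df["name"].str.strip().str.replace(r"\s+", " ", regex=True)
df["email"] = df["email"].str.strip()

# Drop characters that are not letters, spaces, or a few allowed symbols
df["name"] = df["name"].str.replace(r"[^A-Za-z .'-]", "", regex=True)
print(df)
```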
2. Data Transformation
- The process of converting raw data into a format or structure that is more suitable for data analysis.
- Conversion of raw data into a single, easy-to-read format to facilitate analysis.
- Ex. time format: 12-hour to 24-hour; date format: MM-DD-YYYY to YYYY-MM-DD.
- Structural transformation: renaming, moving, and combining columns in a database.
Benefits of data transformation:
- Data becomes better organized.
- Easier for both humans and computers to use.
- Improves data quality.
- Facilitates compatibility between applications, systems, and types of data.
Data Transformation Strategies

Rescaling | Normalizing | Binarizing | Standardizing | Label Encoding | One-Hot Encoding
1. Rescaling:
- Transforming the data so that it fits within a specific scale, like 0-100 or 0-1.
- Here 0 corresponds to the minimum value and 1 to the maximum value.
- Many statistical and machine learning techniques prefer attributes rescaled to fall within a given range.
2. Normalizing:
- A process used to adjust the scale of data so that it fits within a standard range, usually [0, 1].
- It is used to scale and standardize the features of a dataset.
- Without it, changing measurement units, e.g., from meters to inches for height or from kilograms to pounds for weight, may lead to very different results.
- Min-max technique: v' = (v - min) / (max - min).
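A minimal sketch of min-max scaling; scikit-learn's MinMaxScaler is one common implementation, and the sample values are assumptions:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

heights_cm = np.array([[150.0], [165.0], [180.0], [195.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(heights_cm)
print(scaled.ravel())   # [0.0, 0.333..., 0.666..., 1.0]
```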
3. Binarizing:
- The process of converting data to either 0 or 1 based on a threshold value.
- Values above the threshold are marked as 1.
- Values equal to or below the threshold are marked as 0.
- Ex. speed limit (above 60 km/hr: pay a fine | at or below 60 km/hr: within legal limits).
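A minimal sketch following the speed-limit example; the threshold of 60 comes from the slide, while the speed values are assumptions (scikit-learn's Binarizer behaves the same way for arrays):

```python
import numpy as np

speeds = np.array([45, 60, 72, 58, 90])

# 1 if above the threshold (pay a fine), 0 if equal to or below (within limits)
over_limit = (speeds > 60).astype(int)
print(over_limit)   # [0 0 1 0 1]
```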
4. Standardizing:
- Also called mean removal.
- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation: z = (x - mean) / standard deviation.
- This means the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
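A minimal sketch using scikit-learn's StandardScaler, which applies the z-score per feature; the sample values are assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
z = scaler.fit_transform(x)
print(z.ravel())            # centered, scaled values
print(z.mean(), z.std())    # ~0.0 and ~1.0
```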
5. Label Encoding:
- Used to convert textual labels into numeric form in order to prepare the data for a machine-readable format.
- Labels are assigned a value from 0 to (n-1), where n is the number of distinct values of the particular categorical feature.
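A minimal sketch with scikit-learn's LabelEncoder; the category values are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "medium", "large", "medium"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sizes)
print(list(encoder.classes_))   # ['large', 'medium', 'small'] (sorted)
print(encoded)                  # [2 1 0 1], values in 0..n-1
```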
6. One-Hot Encoding:
- Refers to splitting a column that contains categorical data into many columns, depending on the number of categories present in that column.
- It creates a binary column for each category, where only one column has the value 1 and the rest are 0.

| Fruit | Apple | Banana | Orange |
|---|---|---|---|
| Apple | 1 | 0 | 0 |
| Banana | 0 | 1 | 0 |
| Orange | 0 | 0 | 1 |
| Banana | 0 | 1 | 0 |
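A minimal sketch reproducing the Fruit table above with pandas:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange", "Banana"]})

one_hot = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(one_hot)
#    Fruit_Apple  Fruit_Banana  Fruit_Orange
# 0            1             0             0
# 1            0             1             0
# 2            0             0             1
# 3            0             1             0
```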
3. Data Reduction
• Data reduction is the process of reducing the volume or complexity of data without losing valuable information.
• When data is collected from different data sources for analysis, it results in a huge amount of data.
• It is difficult to run complex queries on a huge amount of data, and doing so takes a long time.
• Data reduction produces a reduced representation of the original data that is much smaller in volume yet yields the same or similar analytical results.
1. Dimensionality Reduction:
- Dimensionality reduction is the process of reducing the number of features (variables or columns) in a dataset while preserving as much information as possible.
- It reduces the number of random variables or attributes under consideration.
- In ML techniques such as classification and clustering, features are studied to obtain the analysis output.
- The higher the number of features, the greater the difficulty of the analysis.
Dimensionality Reduction

Feature Selection | Feature Extraction
1. Feature Selection:
- Process of extracting a subset of features from the original set of all
features of a dataset to obtain a smaller subset that can be used to
model a given problem.
- Feature selection is the process of selecting a subset of relevant,
informative, and non-redundant features (variables) from the
original dataset.
Feature Extraction
• Feature Extraction is a dimensionality reduction technique that
creates new features from the original dataset by transforming the
data into a lower-dimensional space.
• Unlike feature selection (which removes irrelevant features), feature
extraction builds new features that are combinations or
representations of the original ones, often capturing the most
important patterns or structure.
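A minimal sketch of feature extraction with PCA from scikit-learn, one widely used technique of this kind; the data here is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 original features

pca = PCA(n_components=2)              # derive 2 new features (components)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```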
