Title: Data Processing
Subtitle: Day 2
Instructor Name: Ei Cho Zin
Date: 30 - 01 - 2025
AGENDA
Data Preprocessing: Preparing Data for Analysis
Steps in Data Preprocessing
Uses of Data Preprocessing
Advantages & Disadvantages of Data Preprocessing
Data Cleaning in Machine Learning
Key Steps in Data Cleaning
Data Cleansing Tools
Advantages & Disadvantages of Data Cleaning in Machine Learning
Data Preprocessing: Preparing
Data for Analysis
What is Data Preprocessing?
The process of transforming raw data into a
usable format for analysis.
In data mining, it specifically prepares raw
data for mining algorithms.
Involves cleaning, transforming, and
organizing data.
Why Preprocess Data?
Goal: Improve data quality.
Handles missing values, duplicates, and
normalization.
Ensures accuracy and consistency.
STEPS IN DATA PREPROCESSING
Data Cleaning
Identifies and corrects errors/inconsistencies in the dataset.
Ensures accuracy and reliability for analysis.
Key Tasks:
Handling Missing Values: Ignore, fill with mean, or use the
most probable value.
Handling Noisy Data:
Binning Method: Smooth data using mean or boundary
values.
Regression: Fit data to a function for prediction.
Clustering: Group similar data points, detect outliers.
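As a minimal sketch of two of the tasks above, the snippet below fills missing values with the mean and smooths noisy data by bin means. The function names, the sample lists, and the choice of equal-frequency bins of three values are illustrative assumptions, not part of the original slides.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into equal-frequency bins, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical sample data for illustration only.
ages = [20, None, 30, 40]
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

filled_ages = fill_missing_with_mean(ages)
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(filled_ages)
print(smoothed_prices)
```

Smoothing by bin boundaries would work the same way, except that each value is replaced by the nearer of its bin's minimum and maximum instead of the bin mean.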
Data Integration
o Combines data from multiple sources into a unified dataset.
o Challenges: Differences in formats, structures, and meanings.
o Techniques:
o Record Linkage: Matches records referring to the same entity.
o Data Fusion: Combines data for a more comprehensive dataset.
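Record linkage and data fusion can be sketched roughly as follows. The two sources (`crm`, `billing`), the field names, and the use of a normalized email address as the matching key are all assumptions made for this example; real linkage often needs fuzzy matching on several fields.

```python
def normalize_key(record):
    """Build a comparison key from the email address, ignoring case and
    surrounding whitespace. (Using email as the key is an assumption.)"""
    return record["email"].strip().lower()

def link_and_fuse(source_a, source_b):
    """Match records that share a key and merge their fields into one record."""
    index_b = {normalize_key(r): r for r in source_b}
    fused = []
    for rec_a in source_a:
        key = normalize_key(rec_a)
        rec_b = index_b.get(key)
        if rec_b is not None:
            # Fields from source A win on conflict; source B fills in the rest.
            merged = {**rec_b, **rec_a}
            merged["email"] = key  # store the normalized form
            fused.append(merged)
    return fused

# Hypothetical records from two systems describing overlapping customers.
crm = [{"email": "Ann@Example.com ", "name": "Ann Lee"},
       {"email": "bob@example.com", "name": "Bob Ray"}]
billing = [{"email": "ann@example.com", "plan": "pro"},
           {"email": "cy@example.com", "plan": "free"}]

linked = link_and_fuse(crm, billing)
print(linked)
```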
Data Transformation
o Converts data into a suitable format for analysis.
o Techniques:
o Normalization: Scales data to a common range.
o Standardization: Adjusts data to have zero mean and unit variance.
o Discretization: Converts continuous data into discrete categories.
o Data Aggregation: Summarizes data (e.g., averages, totals).
o Concept Hierarchy Generation: Organizes data into higher-level concepts.
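The first three techniques above can be illustrated with a short sketch. It assumes min-max scaling to [0, 1] for normalization, the population standard deviation for standardization, and hand-picked age boundaries for discretization; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit variance (z-scores).
    Assumes the values are not all identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds the value.
    Expects one more label than edges; the last label is the catch-all."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

ages = [18, 25, 40, 62, 75]  # hypothetical sample
normalized = min_max_normalize(ages)
z_scores = standardize(ages)
age_groups = [discretize(a, edges=[30, 60], labels=["young", "middle", "senior"])
              for a in ages]
print(normalized)
print(z_scores)
print(age_groups)
```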
Data Reduction
o Reduces dataset size while preserving key information.
o Techniques:
o Feature Selection: Chooses the most relevant features.
o Feature Extraction: Transforms data into lower-dimensional space.
o Dimensionality Reduction: E.g., Principal Component Analysis (PCA).
o Numerosity Reduction: Reduces data points via sampling.
o Data Compression: Encodes data in a more compact form.
Data Splitting
o Split the dataset into training, validation, and test sets (e.g., 70% training, 15% validation, 15% test).
o Ensure the splits are representative of the overall data distribution.
Data Formatting
o Ensure the data is in the correct format for analysis or modeling (e.g., converting dates to datetime objects, ensuring numerical columns are of the correct data type).
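The formatting and splitting steps can be sketched together as below. The field names, the day-month-year date layout, and the fixed random seed are assumptions for illustration. A plain random shuffle is only one simple way to aim for representative splits; for imbalanced classes, stratified sampling would be a better fit.

```python
import random
from datetime import datetime

def format_record(raw):
    """Convert string fields to proper types: date text to a date object,
    numeric text to float."""
    return {
        "date": datetime.strptime(raw["date"], "%d-%m-%Y").date(),
        "amount": float(raw["amount"]),
    }

def split_dataset(rows, train=0.70, val=0.15, seed=42):
    """Shuffle a copy of the rows and cut it into train/validation/test parts.
    Part sizes are rounded to whole rows; the test set takes the remainder."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical raw rows in which every field arrives as text.
raw_rows = [{"date": f"{day:02d}-01-2025", "amount": str(day * 1.5)}
            for day in range(1, 21)]

rows = [format_record(r) for r in raw_rows]
train_set, val_set, test_set = split_dataset(rows)
print(len(train_set), len(val_set), len(test_set))
```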
USES OF DATA PREPROCESSING
Data preprocessing transforms raw data into a usable format
for analysis and decision-making. It is applied in various fields:
1.Data Warehousing
o Cleans, integrates, and structures data before storage.
o Ensures consistency and reliability for queries and reporting.
2.Data Mining
o Prepares raw data for analysis.
o Identifies patterns and extracts insights from large datasets.
3.Machine Learning
• Prepares data for model training.
• Tasks include:
• Handling missing values.
• Normalizing features.
• Encoding categorical variables.
• Splitting datasets into training and testing sets.
• Improves model performance and accuracy.
4.Data Science
• Ensures data is clean, structured, and relevant.
• Enhances the quality of insights and predictive models.
5.Web Mining
• Analyzes web usage logs to extract user behavior patterns.
• Informs marketing strategies and improves user experience.
6.Business Intelligence (BI)
• Organizes and cleans data for dashboards and reports.
• Provides actionable insights for decision-makers.
7.Deep Learning
• Normalizes or enhances input data features.
• Optimizes model training processes.
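Of the machine learning tasks listed above, encoding categorical variables is perhaps the least self-explanatory. One common approach is one-hot encoding, sketched here under the assumption that the categories have no natural order; the function name and sample values are hypothetical.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector with one position per distinct category.
    Categories are sorted so the column order is deterministic."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
categories, encoded = one_hot_encode(colors)
print(categories)
for color, vector in zip(colors, encoded):
    print(color, vector)
```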
ADVANTAGES & DISADVANTAGES
OF DATA PREPROCESSING
Advantages
o Improved Data Quality: Clean, consistent, and reliable data for analysis.
o Better Model Performance: Reduced noise and irrelevant data for more accurate predictions.
o Efficient Data Analysis: Streamlined data for faster and easier processing.
o Enhanced Decision-Making: Clear, well-organized data for better business decisions.
Disadvantages
o Time-Consuming: Significant time and effort required.
o Resource-Intensive: Demands computational power and skilled personnel.
o Potential Data Loss: Risk of losing valuable information if handled incorrectly.
o Complexity: Challenges with large datasets or diverse formats.
DATA CLEANING IN MACHINE
LEARNING
Importance:
• A critical step in the machine learning
pipeline.
• Ensures data is accurate, consistent, and
error-free.
• Raw data is often noisy, incomplete, and
inconsistent, which can negatively impact
model accuracy and insights.
• Key Belief:
• “Better data beats fancier algorithms.”
• Clean datasets enhance EDA (Exploratory
Data Analysis) and improve interpretability
for actionable insights.
KEY STEPS IN DATA CLEANING
🔹 Understand the Data
• Analyze structure to identify missing values, duplicates, and outliers.
🔹 Remove Unwanted Observations
• Eliminate irrelevant, duplicate, or redundant data to reduce noise.
🔹 Fix Structural Errors
• Standardize formats, variable types, and inconsistencies for uniformity.
🔹 Manage Outliers
• Identify and handle extreme values to improve model accuracy.
🔹 Handle Missing Data
• Use imputation techniques or remove records to maintain data integrity.
🔹 Document & Validate
• Track changes for transparency and validate improvements iteratively.
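The steps above can be sketched as one small cleaning pass. The record layout (`id`, `city`, `income`), the 1.5 × IQR outlier rule, and median imputation are assumptions chosen for illustration; each step has several reasonable alternatives.

```python
from statistics import median, quantiles

def clean(records):
    # Fix structural errors: unify the text format of the city field.
    fixed = [{**r, "city": r["city"].strip().title()} for r in records]

    # Remove unwanted observations: drop exact duplicates, keep the first.
    seen, unique = set(), []
    for r in fixed:
        key = (r["id"], r["city"], r["income"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Manage outliers: drop incomes more than 1.5 * IQR beyond the quartiles.
    incomes = [r["income"] for r in unique if r["income"] is not None]
    q1, _, q3 = quantiles(incomes, n=4, method="inclusive")
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    kept = [r for r in unique
            if r["income"] is None or low <= r["income"] <= high]

    # Handle missing data: impute with the median of the remaining incomes.
    fill = median(r["income"] for r in kept if r["income"] is not None)
    cleaned = [{**r, "income": fill if r["income"] is None else r["income"]}
               for r in kept]

    # Document & validate: report how many rows each stage kept.
    print(f"input={len(records)} unique={len(unique)} kept={len(cleaned)}")
    return cleaned

# Hypothetical raw records with a duplicate, a missing value, and an outlier.
records = [
    {"id": 1, "city": " yangon", "income": 500},
    {"id": 1, "city": "Yangon", "income": 500},
    {"id": 2, "city": "MANDALAY", "income": 520},
    {"id": 3, "city": "Yangon", "income": None},
    {"id": 4, "city": "Bago", "income": 480},
    {"id": 5, "city": "Yangon", "income": 510},
    {"id": 6, "city": "Bago", "income": 9000},
]
cleaned = clean(records)
```

Dropping the outlier before imputing keeps the extreme value from distorting the fill value, which is one reason the order of the steps matters.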
DATA CLEANSING TOOLS
TOOLS DESIGNED TO CLEAN, TRANSFORM, AND PREPARE DATA FOR
ANALYSIS:
1.OpenRefine
• Open-source tool for cleaning and transforming messy data.
• Features:
• Removes duplicates.
• Enriches data.
• Easy-to-use interface.
2.Trifacta Wrangler
• User-friendly tool for data cleaning and transformation.
• Features:
• AI-powered transformation suggestions.
• Streamlines workflows.
3.TIBCO Clarity
• Tool for profiling, standardizing, and enriching data.
• Features:
• Ensures high-quality data.
• Maintains consistency across datasets.
4.Cloudingo
• Cloud-based tool for data cleansing and management.
• Features:
• Focuses on de-duplication and record management.
• Ensures data accuracy.
5.IBM InfoSphere QualityStage
• Ideal for large-scale and complex data.
• Features:
• Handles advanced data cleansing tasks.
• Ensures data quality for enterprise-level needs.
ADVANTAGES & DISADVANTAGES OF
DATA CLEANING IN MACHINE
LEARNING
Advantages
✔ Improved Model Performance – Eliminates errors, inconsistencies, and irrelevant data, enhancing learning.
✔ Increased Accuracy – Ensures data is accurate, consistent, and error-free.
✔ Better Data Representation – Transforms data to better reflect relationships and patterns.
✔ Enhanced Data Quality – Increases reliability and accuracy.
✔ Improved Data Security – Identifies and removes sensitive
Disadvantages
⚠ Time-Consuming – Particularly challenging for large and complex datasets.
⚠ Error-Prone – Risk of losing valuable information.
⚠ Cost & Resource-Intensive – Requires time, expertise, and specialized tools.
⚠ Risk of Overfitting – Excessive cleaning may remove important data.
THANK YOU