
Data Processing

The document outlines the process of data preprocessing, which involves transforming raw data into a usable format for analysis through steps like data cleaning, integration, transformation, reduction, and formatting. It discusses the importance of data preprocessing in various fields such as data mining, machine learning, and business intelligence, highlighting its advantages and disadvantages. Additionally, it emphasizes the significance of data cleaning in machine learning to ensure model accuracy and performance.

Uploaded by

eczhyena

Title: Data Processing

Subtitle: Day 2
Instructor Name: Ei Cho Zin
Date: 30 - 01 - 2025
AGENDA
Data Preprocessing: Preparing Data for Analysis
Steps in Data Preprocessing
Uses of Data Preprocessing
Advantages & Disadvantages of Data Preprocessing
Data Cleaning in Machine Learning
Key Steps in Data Cleaning
Data Cleansing Tools
Advantages & Disadvantages of Data Cleaning in Machine Learning
Data Preprocessing: Preparing Data for Analysis

What is Data Preprocessing?
➢ The process of transforming raw data into a usable format for analysis.
➢ In data mining, it specifically prepares raw data for mining algorithms.
➢ Involves cleaning, transforming, and organizing data.
Why Preprocess Data?
➢ Goal: Improve data quality.
➢ Handles missing values, duplicates, and normalization.
➢ Ensures accuracy and consistency.
STEPS IN DATA PREPROCESSING
Data Cleaning
➢ Identifies and corrects errors/inconsistencies in the dataset.
➢ Ensures accuracy and reliability for analysis.
➢ Key Tasks:
  ▪ Handling Missing Values: Ignore, fill with mean, or use the most probable value.
  ▪ Handling Noisy Data:
    ➢ Binning Method: Smooth data using mean or boundary values.
    ➢ Regression: Fit data to a function for prediction.
    ➢ Clustering: Group similar data points, detect outliers.
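The missing-value and binning tasks above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a prescribed method: the function names, the `ages` list, and the choice of equal-frequency bins of size 3 are assumptions made for the example.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into equal-frequency bins, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical data: None marks a missing value.
ages = [25, None, 31, 40, None, 28]
filled = fill_missing_with_mean(ages)   # missing entries become 31

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
# bins [4, 8, 15], [21, 21, 24], [25, 28, 34] become 9, 22, 29
```

Mean imputation is only one of the options listed; dropping the record or predicting the most probable value may suit other datasets better.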
Data Integration
o Combines data from multiple sources into a unified dataset.
o Challenges: Differences in formats, structures, and meanings.
o Techniques:
o Record Linkage: Matches records referring to the same entity.
o Data Fusion: Combines data for a more comprehensive dataset.
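As a rough sketch of record linkage followed by data fusion, the example below matches records from two sources on a normalized name and email, then merges their fields. The source names (`crm`, `billing`), the field names, and the rule that the first non-empty value wins are all assumptions for illustration; real linkage often needs fuzzy matching.

```python
def normalize_key(name, email):
    """Build a comparison key that ignores case and extra whitespace."""
    return (" ".join(name.lower().split()), email.strip().lower())


def link_and_fuse(source_a, source_b):
    """Link records that share a normalized key and fuse their fields.

    For each field, the first non-empty value encountered is kept,
    so values from source_a take priority over source_b.
    """
    fused = {}
    for record in source_a + source_b:
        key = normalize_key(record["name"], record["email"])
        merged = fused.setdefault(key, {})
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return list(fused.values())


# Hypothetical records describing the same customer in two systems.
crm = [{"name": "Ada  Lovelace", "email": "ADA@example.com", "phone": ""}]
billing = [{"name": "ada lovelace", "email": "ada@example.com ",
            "phone": "555-0100", "plan": "pro"}]

customers = link_and_fuse(crm, billing)
# One fused record: name and email from crm, phone and plan from billing.
```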

Data Transformation
o Converts data into a suitable format for analysis.
o Techniques:
o Normalization: Scales data to a common range.
o Standardization: Adjusts data to have zero mean and unit variance.
o Discretization: Converts continuous data into discrete categories.
o Data Aggregation: Summarizes data (e.g., averages, totals).
o Concept Hierarchy Generation: Organizes data into higher-level concepts.
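Three of the transformation techniques above are simple enough to show directly. The sketch below assumes min-max scaling to the range [0, 1] for normalization, the population standard deviation for standardization, and hand-picked bin edges for discretization; the `incomes` data and the labels are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    labels must have one more entry than edges; the last label
    covers everything above the final edge.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


incomes = [20, 30, 40, 50, 60]           # hypothetical values
scaled = min_max_normalize(incomes)      # [0.0, 0.25, 0.5, 0.75, 1.0]
z = standardize(incomes)                 # mean 0, variance 1
groups = [discretize(v, edges=[30, 50], labels=["low", "mid", "high"])
          for v in incomes]              # low, low, mid, mid, high
```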
Data Reduction
o Reduces dataset size while preserving key information.
o Techniques:
  ➢ Feature Selection: Chooses the most relevant features.
  ➢ Feature Extraction: Transforms data into lower-dimensional space.
  ➢ Dimensionality Reduction: E.g., Principal Component Analysis (PCA).
  ➢ Numerosity Reduction: Reduces data points via sampling.
  ➢ Data Compression: Encodes data in a more compact form.

Data Splitting
o Split the dataset into training, validation, and test sets (e.g., 70% training, 15% validation, 15% test).
o Ensure the splits are representative of the overall data distribution.

Data Formatting
o Ensure the data is in the correct format for analysis or modeling (e.g., converting dates to datetime objects, ensuring numerical columns are of the correct data type).
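The 70/15/15 split and the type conversions mentioned above might look like the following. This is a sketch under stated assumptions: the split is a plain seeded shuffle, which only approximates a representative split and does not stratify by class; the column names `signup` and `amount` and the day-month-year date format are hypothetical.

```python
import random
from datetime import datetime


def train_val_test_split(rows, train=0.70, val=0.15, seed=42):
    """Shuffle a copy of rows and cut it into train, validation, and test parts.

    Whatever remains after the train and validation shares goes to the test set.
    """
    if not 0 < train < 1 or not 0 <= val < 1 or train + val >= 1:
        raise ValueError("train and val must be fractions that sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def format_row(raw):
    """Convert a raw row of strings into typed values."""
    return {
        "signup": datetime.strptime(raw["signup"], "%d-%m-%Y").date(),
        "amount": float(raw["amount"]),
    }


rows = list(range(100))                     # stand-in for 100 records
train_set, val_set, test_set = train_val_test_split(rows)
# Sizes are 70, 15, and 15; every record lands in exactly one part.

typed = format_row({"signup": "30-01-2025", "amount": "19.90"})
```

For classification data with rare classes, a stratified split is usually the safer way to keep each part representative.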
USES OF DATA PREPROCESSING

Data preprocessing transforms raw data into a usable format for analysis and decision-making. It is applied in various fields:

1. Data Warehousing
o Cleans, integrates, and structures data before storage.
o Ensures consistency and reliability for queries and reporting.

2. Data Mining
o Prepares raw data for analysis.
o Identifies patterns and extracts insights from large datasets.

3. Machine Learning
• Prepares data for model training.
• Tasks include:
  • Handling missing values.
  • Normalizing features.
  • Encoding categorical variables.
  • Splitting datasets into training and testing sets.
• Improves model performance and accuracy.

4. Data Science
• Ensures data is clean, structured, and relevant.
• Enhances the quality of insights and predictive models.

5. Web Mining
• Analyzes web usage logs to extract user behavior patterns.
• Informs marketing strategies and improves user experience.

6. Business Intelligence (BI)
• Organizes and cleans data for dashboards and reports.
• Provides actionable insights for decision-makers.

7. Deep Learning
• Normalizes or enhances input data features.
• Optimizes model training processes.
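Of the machine learning tasks listed above, encoding categorical variables is the one not illustrated elsewhere in these slides. A minimal sketch of one common approach, one-hot encoding, follows; the function name and the `colors` data are assumptions for the example, and other encodings (ordinal, target) may fit better depending on the model.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector with one slot per distinct category.

    Categories are sorted so the column order is deterministic.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]   # hypothetical feature
categories, encoded = one_hot_encode(colors)
# categories: ['blue', 'green', 'red']
# encoded:    [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```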
ADVANTAGES & DISADVANTAGES OF DATA PREPROCESSING

Advantages
o Improved Data Quality: Clean, consistent, and reliable data for analysis.
o Better Model Performance: Reduced noise and irrelevant data for more accurate predictions.
o Efficient Data Analysis: Streamlined data for faster and easier processing.
o Enhanced Decision-Making: Clear, well-organized data for better business decisions.

Disadvantages
o Time-Consuming: Significant time and effort required.
o Resource-Intensive: Demands computational power and skilled personnel.
o Potential Data Loss: Risk of losing valuable information if handled incorrectly.
o Complexity: Challenges with large datasets or diverse formats.
DATA CLEANING IN MACHINE LEARNING

Importance:
• A critical step in the machine learning pipeline.
• Ensures data is accurate, consistent, and error-free.
• Raw data is often noisy, incomplete, and inconsistent, which can negatively impact model accuracy and insights.
Key Belief:
• “Better data beats fancier algorithms.”
• Clean datasets enhance EDA (Exploratory Data Analysis) and improve interpretability for actionable insights.
KEY STEPS IN DATA CLEANING
🔹 Understand the Data
• Analyze structure to identify missing values, duplicates, and outliers.
🔹 Remove Unwanted Observations
• Eliminate irrelevant, duplicate, or redundant data to reduce noise.
🔹 Fix Structural Errors
• Standardize formats, variable types, and inconsistencies for uniformity.
🔹 Manage Outliers
• Identify and handle extreme values to improve model accuracy.
🔹 Handle Missing Data
• Use imputation techniques or remove records to maintain data integrity.
🔹 Document & Validate
• Track changes for transparency and validate improvements iteratively.
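Three of the steps above (fixing structural errors, removing duplicate observations, and managing outliers) can be sketched with the standard library. The field names, the sample records, and the 1.5 × IQR rule for flagging outliers are assumptions chosen for illustration; the slides do not prescribe a specific outlier rule.

```python
from statistics import quantiles


def fix_structure(record):
    """Standardize formats: trim and title-case text, cast numbers to float."""
    return {"city": record["city"].strip().title(),
            "temp_c": float(record["temp_c"])}


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical raw records: the first two differ only in formatting.
raw = [
    {"city": " yangon", "temp_c": "31"},
    {"city": "Yangon ", "temp_c": "31.0"},
    {"city": "mandalay", "temp_c": "33"},
]
clean = drop_duplicates([fix_structure(r) for r in raw])
# Two records remain: Yangon (31.0) and Mandalay (33.0).

readings = [30, 31, 32, 33, 31, 30, 32, 95]   # hypothetical sensor values
outliers = iqr_outliers(readings)             # flags 95
```

Fixing structure before removing duplicates matters here: the two Yangon rows only become identical once whitespace, case, and number formats are standardized.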
DATA CLEANSING TOOLS
TOOLS DESIGNED TO CLEAN, TRANSFORM, AND PREPARE DATA FOR ANALYSIS:
1. OpenRefine
• Open-source tool for cleaning and transforming messy data.
• Features:
  • Removes duplicates.
  • Enriches data.
  • Easy-to-use interface.
2. Trifacta Wrangler
• User-friendly tool for data cleaning and transformation.
• Features:
  • AI-powered transformation suggestions.
  • Streamlines workflows.
3. TIBCO Clarity
• Tool for profiling, standardizing, and enriching data.
• Features:
  • Ensures high-quality data.
  • Maintains consistency across datasets.
4. Cloudingo
• Cloud-based tool for data cleansing and management.
• Features:
  • Focuses on de-duplication and record management.
  • Ensures data accuracy.
5. IBM InfoSphere QualityStage
• Ideal for large-scale and complex data.
• Features:
  • Handles advanced data cleansing tasks.
  • Ensures data quality for enterprise-level needs.
ADVANTAGES & DISADVANTAGES OF DATA CLEANING IN MACHINE LEARNING

Advantages
✔ Improved Model Performance – Eliminates errors, inconsistencies, and irrelevant data, enhancing learning.
✔ Increased Accuracy – Ensures data is accurate, consistent, and error-free.
✔ Better Data Representation – Transforms data to better reflect relationships and patterns.
✔ Enhanced Data Quality – Increases reliability and accuracy.
✔ Improved Data Security – Identifies and removes sensitive

Disadvantages
⚠ Time-Consuming – Particularly challenging for large and complex datasets.
⚠ Error-Prone – Risk of losing valuable information.
⚠ Cost & Resource-Intensive – Requires time, expertise, and specialized tools.
⚠ Risk of Overfitting – Excessive cleaning may remove important data.
THANK YOU
