
Module 3
The Machine Learning Process
Learning Outcomes
By the end of this unit the learner will be able to:

 Describe the steps involved in data collection and preparation.

 Understand the process of model selection, training, and evaluation.

 Explain various metrics for evaluating model performance and validation techniques.



Data Collection and Preparation
Gathering Data
Data collection and preparation are foundational steps in the machine learning process,
critical to the success and accuracy of models. This process involves gathering relevant data
sources, cleaning and preprocessing the data, and ensuring its quality and suitability for
analysis. Adherence to data protection regulations such as GDPR is crucial, ensuring that data
collection and handling practices are ethical and compliant. In this section, we will discuss
the stages of data collection and preparation in machine learning, focusing on best practices
and considerations:

Identifying Data Sources

Data Collection: Identifying and accessing relevant data sources is the initial step in the data
collection process. This involves:

 Data Identification: Identifying the types of data needed for the ML project, such as
structured, unstructured, or semi-structured data.

 Data Access: Gaining access to data through internal databases, APIs, third-party data
providers, or data scraping techniques.

Best Practices

1. Data Relevance: Ensure that the data collected is relevant to the problem being
addressed and aligns with the project objectives.

2. Legal and Ethical Compliance: Adhere to data protection regulations such as GDPR,
ensuring that data collection practices are lawful and ethical.

Data Cleaning and Preprocessing

Data Cleaning: Data cleaning involves identifying and correcting errors or inconsistencies in
the dataset. This includes:

 Handling Missing Data: Imputing missing values or removing incomplete records based on domain knowledge.

 Removing Noise: Filtering out outliers or irrelevant data points that may affect model
performance.


Data Preprocessing: Data preprocessing prepares the data for analysis and model training.
Steps include:

 Normalization and Standardization: Scaling numerical data to a standard range to prevent features from dominating the model.

 Feature Engineering: Creating new features from existing data to improve model
performance.

 Text Preprocessing: Tokenization, stemming, and removing stop words in natural language processing tasks.

Best Practices

1. Automated Tools: Use automated tools and scripts to streamline data cleaning and
preprocessing, ensuring consistency and efficiency.

2. Data Quality Checks: Perform thorough data quality checks to validate the accuracy,
completeness, and consistency of the dataset.

Ensuring Data Quality

Data Quality Assurance: Ensuring the quality of data is essential to prevent biases and
inaccuracies that can lead to erroneous predictions. This includes:

 Data Validation: Validating data against business rules and domain knowledge to
ensure it meets quality standards.

 Data Profiling: Profiling data to understand its characteristics, such as distribution and
variance.

Best Practices

1. Data Governance: Implement data governance practices to maintain data integrity, security, and compliance.

2. Regular Audits: Conduct regular audits and quality assessments to monitor and
maintain data quality over time.

Data Integration and Transformation

Data Integration: Integrating data from multiple sources to create a unified dataset for
analysis. This involves:

 Data Fusion: Combining data from different sources to enrich the dataset and provide
a comprehensive view.

 Schema Integration: Resolving schema conflicts and inconsistencies when integrating diverse data sources.


Data Transformation: Transforming data into a format suitable for analysis and model training.
This includes:

 Dimensionality Reduction: Reducing the number of input variables while preserving important information.

 Aggregation and Discretization: Aggregating data into meaningful groups or discretizing continuous variables.

Best Practices

1. Scalability: Ensure that data integration and transformation processes are scalable to
handle large volumes of data.

2. Version Control: Implement version control to track changes made during data
transformation and ensure reproducibility.

Fig 3.1: Gathering Data (stages: Identifying Data Sources, Data Cleaning and Preprocessing, Ensuring Data Quality, Data Integration and Transformation)


Data collection and preparation are fundamental stages in the machine learning process,
influencing the quality and reliability of models. By following best practices in identifying data
sources, cleaning and preprocessing data, ensuring data quality, and integrating and
transforming data, organizations can enhance the effectiveness of their machine learning
initiatives. Adherence to data protection regulations and ethical considerations is paramount,
ensuring that data collection and handling practices are lawful and maintain individual privacy.
By adopting a systematic approach to data collection and preparation, organizations can
maximize the value of their data assets and leverage machine learning to drive innovation and
achieve business goals.

Cleaning and Preprocessing Data


Cleaning and preprocessing data are crucial steps in the machine learning process, ensuring
that the data used for analysis and model training is accurate, reliable, and suitable for the
intended purpose. Adherence to data protection regulations such as GDPR is essential,
ensuring that data handling practices are ethical and compliant. In this section, we will discuss
in detail the stages of cleaning and preprocessing data in machine learning, focusing on
best practices and considerations:

Data Cleaning

Handling Missing Data: Missing data is a common issue in datasets and needs to be addressed
to prevent biases and inaccuracies in the model.

 Missing Data Detection: Identify missing values in the dataset using statistical
methods or visualization techniques.

 Imputation: Replace missing values with mean, median, or mode values for numerical
data, or use techniques like forward or backward filling for time series data.

 Dropping Missing Values: Remove rows with missing data if they cannot be imputed
effectively.

Handling Noisy Data: Noisy data, which includes outliers and irrelevant information, can
adversely affect model performance.

 Outlier Detection: Use statistical methods like Z-score or IQR to detect outliers.

 Filtering or Transforming Outliers: Apply techniques such as trimming, winsorization, or logarithmic transformation to handle outliers.
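
The following Python sketch illustrates both detection rules; the data values are invented, and the Z-score threshold of 2 is one common choice (3 is another):

import numpy as np

values = np.array([10.0, 12.0, 11.5, 9.8, 10.4, 95.0])  # 95.0 is an obvious outlier

# Z-score rule: flag points far from the mean in standard-deviation units.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both rules flag 95.0 on this toy data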

Data Normalization and Standardization

Normalization: Normalization scales numerical data to a standard range to prevent features with large ranges from dominating the model.

 MinMax Scaling: Rescales data to a fixed range (e.g., [0, 1]).



 Normalization by Z-score: Standardizes data to have mean 0 and variance 1.

Standardization: Standardization transforms data to have a mean of 0 and a standard deviation of 1, making the data distribution centred around 0.

 Scaling Data: Use scaling techniques like mean centring and variance scaling to
standardize numerical features.

 Robust Scaling: Use robust scaling techniques that are less prone to the influence of
outliers.

Handling Categorical Data

Encoding Categorical Variables: Categorical data needs to be converted into a numerical format suitable for machine learning models.

 Label Encoding: Converts categorical data into numerical format with integer values.

 One-Hot Encoding: Creates binary columns for each category and assigns a 1 or 0.
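
A short Python sketch of both encodings, using pandas and scikit-learn; the "colour" column is an invented example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})  # invented column

# Label encoding: one integer per category (implies an ordering, so use with care).
df["colour_label"] = LabelEncoder().fit_transform(df["colour"])

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)
print(one_hot)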

Text Data Preprocessing

Tokenization: Tokenization breaks text into individual words or phrases (tokens) for analysis.

 Tokenization Techniques: Use techniques like word tokenization, sentence tokenization, and n-gram tokenization.

Text Cleaning: Cleaning text data by removing stopwords, punctuation, and special characters.

 Removing Stopwords: Filter out common words that do not add meaning to the text
analysis.

 Removing Special Characters: Strip out symbols, emojis, and other non-alphabetic
characters.
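
A minimal, self-contained Python sketch of this cleaning pipeline; the stopword list is a tiny illustrative subset, and real projects would typically use a fuller list from a library such as NLTK or spaCy:

import re

STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}  # tiny illustrative subset

def preprocess(text):
    """Lowercase, strip non-alphabetic characters, tokenize, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # remove punctuation, digits, symbols
    tokens = text.split()                          # simple whitespace word tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The model's accuracy is 92%!"))  # ['model', 's', 'accuracy']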

Cleaning and preprocessing data are essential steps in the machine learning process to ensure
that the data is accurate, reliable, and suitable for analysis and model training. Organizations
must adhere to data protection regulations such as GDPR to ensure that data handling
practices are ethical and compliant. By following best practices in handling missing data,
dealing with noisy data, normalizing and standardizing numerical data, encoding categorical
variables, and preprocessing text data, organizations can enhance the effectiveness of their
machine learning models. These steps contribute to improving model performance, ensuring
that machine learning applications deliver valuable insights and predictions that drive
business decisions and innovation.


Feature Engineering
Feature engineering is a crucial step in the machine learning process, involving the creation
and selection of relevant features from raw data to improve model performance and
predictive accuracy. In the UK, where data protection regulations such as GDPR are stringent,
feature engineering plays a vital role in ensuring that models derive meaningful insights while
respecting individual privacy rights. Below we discuss in detail the stages of feature
engineering in machine learning:

Feature Selection

Identifying Relevant Features: Identifying features that are most relevant to the problem
being solved is the first step in feature engineering.

 Domain Knowledge: Utilize domain expertise to identify features that are likely to
have a significant impact on the target variable.

 Exploratory Data Analysis (EDA): Conduct exploratory data analysis to identify correlations between features and the target variable.

Feature Importance Techniques: Various techniques can be used to quantify the importance
of features and prioritize them for model training.

 Statistical Tests: Perform statistical tests such as ANOVA or chi-square to assess the
significance of features.

 Feature Importance Algorithms: Utilize algorithms such as Random Forest, Gradient Boosting, or Lasso Regression to rank features based on their importance.

Dimensionality Reduction

Principal Component Analysis (PCA): PCA is a commonly used technique for reducing the
dimensionality of datasets while preserving as much variance as possible.

 Eigenvalue Decomposition: Decompose the covariance matrix of the dataset into its
eigenvectors and eigenvalues.

 Dimensionality Reduction: Project the data onto a lower-dimensional subspace defined by the principal components.

Feature Transformation

Feature Scaling: Scaling features to a similar range can improve the performance of certain
machine learning algorithms.

 Standardization: Transform features to have a mean of 0 and a standard deviation of 1.


 Normalization: Scale features to a fixed range, such as [0, 1], to prevent features with
large magnitudes from dominating the model.

Feature Encoding

One-Hot Encoding: One-hot encoding is used to convert categorical variables into a binary
format suitable for machine learning algorithms.

 Creation of Binary Columns: Create binary columns for each category, where a 1
indicates the presence of the category and a 0 indicates absence.

 Sparse Matrix Representation: Handle high cardinality categorical variables efficiently by representing them as sparse matrices.

Handling Time-Series Data

Temporal Features: Incorporating temporal features into models can capture time-dependent
patterns and improve predictive performance.

 Lag Features: Create lag features by incorporating past values of variables as features.

 Rolling Statistics: Calculate rolling statistics such as moving averages or rolling sums to
capture trends over time.
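
A minimal pandas sketch of both ideas; the sales series and dates are invented for illustration:

import pandas as pd

# Invented daily sales series.
sales = pd.Series([100, 120, 90, 130, 110],
                  index=pd.date_range("2024-01-01", periods=5, freq="D"))

features = pd.DataFrame({
    "sales": sales,
    "lag_1": sales.shift(1),                    # yesterday's value as a lag feature
    "rolling_mean_3": sales.rolling(3).mean(),  # 3-day moving average captures the trend
})
print(features)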

Fig 3.2: Feature Engineering (stages: Feature Selection, Dimensionality Reduction, Feature Encoding, Handling Time-Series Data)


Feature engineering is a critical component of the machine learning process, enabling the
creation and selection of relevant features from raw data to improve model performance and
predictive accuracy. By following best practices in feature selection, dimensionality reduction,
feature transformation, feature encoding, and handling time-series data, organizations can
enhance the effectiveness of their machine learning models and derive valuable insights that
drive business decisions and innovation.

Model Selection and Training


Choosing the Right Model
Model selection and training are pivotal stages in the machine learning process, where the
appropriate algorithm is chosen and trained on data to make predictions or derive insights.
Below we discuss in detail the stages of model selection and training in machine learning:

Understanding Different Model Types

Classification Models: Classification models are used to predict categorical outcomes based
on input variables.

 Logistic Regression: Suitable for binary classification tasks, where the target variable
has two classes.

 Decision Trees: Effective for both classification and regression tasks, offering
interpretability and handling non-linear relationships.

 Support Vector Machines (SVM): Useful for both linear and non-linear classification
tasks by finding the optimal hyperplane that best separates classes.

 Random Forest: Ensemble method combining multiple decision trees to improve predictive accuracy and handle complex relationships.

Regression Models: Regression models predict continuous outcomes.

 Linear Regression: Suitable for tasks with a linear relationship between input and
output variables.

 Ridge Regression: Helps prevent overfitting by penalizing the squared size of the coefficients (an L2 penalty).

 Lasso Regression: Encourages sparsity by penalizing the absolute size of coefficients.

 Gradient Boosting Machines: Iteratively improves the model by correcting errors of previous models.


Clustering Models: Clustering models group data points into clusters based on similarities.

 K-Means Clustering: Partitions data into K clusters based on similarity.

 Hierarchical Clustering: Creates a tree of clusters to represent the data structure.

 DBSCAN: Density-based clustering to identify clusters of varying shapes and sizes.
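
As an illustrative Python sketch, here is K-Means clustering with scikit-learn on synthetic data; make_blobs generates invented groupings purely for the example:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groupings.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment for the first ten points
print(kmeans.cluster_centers_)  # coordinates of the three cluster centres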

Dimensionality Reduction Models: Dimensionality reduction models reduce the number of input variables.

 Principal Component Analysis (PCA): Reduces dimensionality while preserving as much variance as possible.

 t-Distributed Stochastic Neighbour Embedding (t-SNE): Visualizes high-dimensional data by reducing dimensionality.

Best Practices in Model Selection

1. Understand the Problem and Data: Before choosing a model, thoroughly understand the problem and the characteristics of the data.

2. Evaluate Multiple Algorithms: Compare the performance of different algorithms using cross-validation and appropriate metrics.

3. Consider Model Complexity: Balance model complexity and interpretability based on the problem requirements.

4. Iterative Improvement: Continuously refine the model by tuning hyperparameters and evaluating its performance.

Model Training and Evaluation

1. Splitting Data: Divide the dataset into training and testing sets to train the model on
one set and evaluate its performance on the other.

2. Cross-Validation: Use techniques like k-fold cross-validation to ensure that the model
is trained and tested on different subsets of the data.

3. Hyperparameter Tuning: Adjust hyperparameters such as learning rate, number of trees, or regularization parameters to optimize model performance.
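
A compact Python sketch combining all three steps with scikit-learn; the hyperparameter grid shown is an invented example rather than a recommended setting:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune tree count and depth with 5-fold cross-validation on the training set only.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))  # held-out test accuracy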

Model Evaluation Metrics

 Classification Metrics: Accuracy, precision, recall, F1-score, ROC-AUC.

 Regression Metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared.

 Clustering Metrics: Silhouette score, Davies-Bouldin index, Adjusted Rand Index (ARI).


Fig 3.3: Choosing the Right Model (model families: Classification Models, Regression Models, Clustering Models, Dimensionality Reduction Models)

Model selection and training are crucial steps in the machine learning process, influencing the
accuracy and effectiveness of predictive models. By understanding the problem, selecting
appropriate algorithms, and rigorously evaluating and tuning models, organizations can
develop robust machine learning solutions that provide valuable insights and predictions. By
following best practices and leveraging appropriate tools and techniques, businesses can
harness the power of machine learning to drive innovation and make informed decisions.

Training the Model and Evaluating Model Performance


Training the model and evaluating its performance are critical stages in the machine learning
process, where the selected algorithm is trained on the training data and then assessed using
evaluation metrics to gauge its effectiveness. In this section, we will discuss the stages
of training the model and evaluating model performance in machine learning:

Splitting the Data

Training Data: The training dataset is used to fit the model's parameters and learn from the
patterns present in the data.

 Features and Labels: The training data consists of input features (X) and corresponding
labels or target variables (y).

 Size of Training Set: Typically, around 70-80% of the total dataset is allocated to the
training set.

Testing Data: The testing dataset is used to evaluate the model's performance and assess its
generalization to unseen data.

 Unseen Data: The testing set should contain data that the model has not been exposed
to during training.

 Size of Testing Set: The remaining 20-30% of the dataset is allocated to the testing set.

Model Training

Fitting the Model: The selected algorithm is trained on the training data to learn the
underlying patterns and relationships.

 Learning Algorithm: The algorithm iteratively adjusts its parameters to minimize the
error between predicted and actual values.

 Optimization Techniques: Techniques such as gradient descent or stochastic gradient descent are used to optimize the model's parameters.
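
To show the idea behind gradient descent, here is a self-contained Python sketch that fits a one-variable linear model by repeatedly stepping the parameters against the gradient of the mean squared error; the data is synthetic and the learning rate is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, 50)  # synthetic data: true w = 3, b = 2

w, b, lr = 0.0, 0.0, 0.01                     # initial parameters and learning rate
for _ in range(2000):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)      # gradient of MSE with respect to w
    grad_b = 2 * np.mean(pred - y)            # gradient of MSE with respect to b
    w -= lr * grad_w                          # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))               # approaches roughly 3 and 2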

Validation Data

Validation Set: A separate validation dataset may be used to fine-tune hyperparameters and
assess the model's performance during training.

 Hyperparameter Tuning: Hyperparameters such as learning rate or regularization strength are adjusted based on performance on the validation set.

 Cross-Validation: Techniques like k-fold cross-validation may be employed to ensure robustness of model evaluation.

Model Evaluation

Evaluation Metrics: Evaluation metrics are used to assess the model's performance and
determine its effectiveness in making predictions.

 Classification Metrics: Accuracy, precision, recall, F1-score, ROC-AUC.

 Regression Metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared.

 Clustering Metrics: Silhouette score, Davies-Bouldin index, Adjusted Rand Index (ARI).

Performance Visualization

Confusion Matrix: For classification tasks, the confusion matrix provides insights into the
model's performance across different classes.

 True Positive (TP): Instances correctly classified as positive.

 False Positive (FP): Instances incorrectly classified as positive.


 True Negative (TN): Instances correctly classified as negative.

 False Negative (FN): Instances incorrectly classified as negative.
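
A minimal Python sketch that recovers these four counts from scikit-learn's confusion_matrix; the labels and predictions are invented:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # invented actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # invented model predictions

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=3 FP=1 TN=3 FN=1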

ROC Curve and Precision-Recall Curve: These curves visualize the trade-off between true
positive rate and false positive rate or precision and recall, respectively.

Training the model and evaluating its performance are essential steps in the machine learning
process, ensuring that models are effective in making predictions and generalizing to unseen
data. By splitting the data into training and testing sets, fitting the model to the training data,
and evaluating its performance using appropriate metrics, organizations can develop robust
machine learning solutions that provide valuable insights and predictions. By following best
practices and leveraging appropriate evaluation techniques, businesses can harness the
power of machine learning to drive innovation and make informed decisions.

Model Evaluation and Validation


Metrics for Evaluating Performance (Accuracy, Precision, Recall, F1 Score)
In machine learning, evaluating model performance is crucial to ensure the effectiveness and
reliability of predictive models. Various metrics are used to assess different aspects of model
performance, such as accuracy, precision, recall, and F1 score. In this section, we will discuss
these metrics for evaluating model performance in machine learning, focusing on their
definitions, calculations, and interpretation:

Accuracy

Definition: Accuracy measures the proportion of correctly classified instances among the total
number of instances.

 Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

 Interpretation: Accuracy provides an overall measure of how often the model makes
correct predictions. However, it may not be suitable for imbalanced datasets.

Precision

Definition: Precision measures the proportion of correctly predicted positive instances among
all instances predicted as positive.

 Formula: Precision = TP / (TP + FP)


 Interpretation: Precision indicates the model's ability to avoid false positives. It is useful when the cost of false positives is high.

Recall (Sensitivity)

Definition: Recall measures the proportion of correctly predicted positive instances among all
actual positive instances.

 Formula: Recall = TP / (TP + FN)

 Interpretation: Recall indicates the model's ability to identify all positive instances. It
is useful when the cost of false negatives is high.

F1 Score

Definition: F1 score is the harmonic mean of precision and recall, providing a single metric
that balances both measures.

 Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

 Interpretation: F1 score considers both precision and recall, providing a balanced measure of model performance. It is particularly useful when there is an uneven class distribution.
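
The following Python sketch computes all four metrics with scikit-learn on an invented set of labels and predictions; the expected values are noted in the comments:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # invented actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # invented predictions (TP=3, FP=1, TN=3, FN=1)

print("accuracy:", accuracy_score(y_true, y_pred))    # (TP+TN)/total = 6/8 = 0.75
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print("recall:", recall_score(y_true, y_pred))        # TP/(TP+FN) = 3/4 = 0.75
print("f1:", f1_score(y_true, y_pred))                # harmonic mean = 0.75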

Considerations and Best Practices

1. Interpretability: Understand the business context and implications of false positives and false negatives when selecting metrics.

2. Threshold Selection: Adjust classification thresholds to optimize the trade-off between precision and recall based on business needs.

3. Imbalanced Datasets: Use metrics like F1 score when dealing with imbalanced datasets to account for uneven class distributions.

4. Cross-Validation: Employ techniques such as k-fold cross-validation to ensure robustness of metric evaluation and avoid overfitting.

Metrics such as accuracy, precision, recall, and F1 score are fundamental in evaluating model
performance in machine learning. They provide insights into how well a model is performing
and help in optimizing and fine-tuning machine learning algorithms. By understanding these
metrics and their interpretations, organizations can develop effective machine learning
solutions that drive innovation and informed decision-making.

Validation Techniques (Train/Test Split, Cross-Validation) and Avoiding Overfitting and Underfitting
In machine learning, validation techniques are essential to assess the performance and
generalization ability of models. Techniques such as train/test split and cross-validation help
in evaluating model performance while avoiding common pitfalls like overfitting and
underfitting. In this section, we will discuss these validation techniques and strategies
to prevent overfitting and underfitting in machine learning, focusing on their definitions,
implementations, and best practices:

Train/Test Split

Definition: The train/test split is a simple validation technique where the dataset is divided
into two subsets: one for training the model and another for testing its performance.

 Implementation: Typically, 70-80% of the data is used for training, and the remaining
20-30% is used for testing.

 Advantages: Easy to implement, provides a quick evaluation of model performance on unseen data.

Cross-Validation

Definition: Cross-validation is a resampling technique that involves partitioning the data into
multiple subsets (folds) and using each fold as a testing set while the remaining folds are used
for training.

 K-Fold Cross-Validation: The dataset is divided into K subsets (folds), and the model is
trained and evaluated K times, each time using a different fold as the testing set.

 Advantages: Provides a more accurate estimate of model performance compared to a single train/test split, reduces variability, and maximizes data usage.

Avoiding Overfitting

Definition: Overfitting occurs when a model learns the training data too well, capturing noise
and random fluctuations that do not generalize to unseen data.

 Strategies to Avoid Overfitting:

 Cross-Validation: Helps in detecting overfitting by providing a more realistic estimate of model performance across different subsets of data.


 Regularization: Adds a penalty to the model's complexity, discouraging overly complex models that fit noise.

 Early Stopping: Stops training when performance on the validation set stops
improving, preventing the model from learning noise.
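
As a hedged illustration of regularization, the following Python sketch compares ordinary least squares with ridge regression (an L2 penalty) on noisy synthetic data where overfitting is likely; the alpha value is an arbitrary example, and in practice it would itself be tuned:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Noisy synthetic data with many features but little true signal, prone to overfitting.
X, y = make_regression(n_samples=60, n_features=40, n_informative=5,
                       noise=20, random_state=0)

for model in (LinearRegression(), Ridge(alpha=10.0)):  # alpha = strength of the L2 penalty
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean().round(3))  # penalized model generalizes better here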

Avoiding Underfitting

Definition: Underfitting occurs when a model is too simple to capture the underlying patterns
in the data, leading to poor performance on both training and testing datasets.

 Strategies to Avoid Underfitting:

 Feature Selection/Engineering: Choose relevant features and engineer new features that capture important patterns in the data.

 Increasing Model Complexity: Use more complex models or ensembles of models that can capture non-linear relationships in the data.

 Hyperparameter Tuning: Adjust hyperparameters such as learning rate, number of trees in ensemble methods, or kernel parameters in SVMs to improve model performance.

Best Practices

1. Use Both Train/Test Split and Cross-Validation

 Use train/test split for a quick initial assessment of model performance.

 Use cross-validation to obtain a more accurate estimate of model performance and to detect overfitting.

2. Monitor Learning Curves

 Plot learning curves to visualize model performance on training and validation datasets.

 Monitor for signs of overfitting (gap between training and validation performance) or underfitting (low overall performance).

3. Regularization Techniques

 Apply regularization techniques (e.g., L1/L2 regularization) to penalize overly complex models and prevent overfitting.

Validation techniques such as train/test split and cross-validation are essential in evaluating
model performance and ensuring generalization to unseen data in machine learning. By
implementing these techniques and strategies to avoid overfitting and underfitting,
organizations can develop robust machine learning models that provide accurate predictions
and insights. By following best practices and leveraging appropriate validation techniques,
businesses can harness the power of machine learning to drive innovation and make informed
decisions.

The machine learning process is a systematic approach that begins with collecting and
preparing data. This step ensures that the data is clean and suitable for analysis. Next, the
appropriate model is chosen and trained to learn patterns and make predictions. Finally, the
model's performance is evaluated and validated to ensure it can accurately generalize to new
data. This iterative process allows for continuous improvement and refinement of the model.
By following these steps, machine learning applications in various fields such as healthcare,
finance, and marketing can achieve more reliable and effective results, benefiting society in
numerous ways.

Further Reading:

 Artificial Intelligence for Beginners: Explore Multiple Industries Mastering the Use of Generative AI from Machine Learning to Neural Networks, Natural Language Processing, & more! by B S Meade III | May 9, 2024

 Signal Processing and Machine Learning with Applications by Michael M. Richter, Sheuli Paul, et al. | Oct 1, 2022

Copyright © OHSC (Oxford Home Study Centre). All Rights Reserved.
