ML Module 1

ML unit 1

Uploaded by

Viman
MACHINE LEARNING

Module 1
 Introduction to machine learning
 Issues in ML
 Application of ML
 Steps of developing an ML application
 Types of learning
 Concept of classification
 Clustering and prediction
 Training, testing and validation dataset
 Cross validation
 Overfitting and Underfitting of model
 Confusion matrix

Module 2
 System of Linear equations
 Norms
 Inner Product
 Length of Vector
 Distance Between Vectors
 Orthogonal Vectors
 Symmetric Positive Definite matrices
 Determinant
 Trace
 Eigen values and vectors
 Orthogonal projection
 Diagonalization
 SVD and its application

Module 3
 Least square method
 Multivariate linear regression
 Regularized regression
 Using least squares regression for classification
 Support Vector Machine(SVM)

Module 4
 Hebbian learning rule
 Expectation maximization algorithm for clustering

Module 5
 Introduction to classification model
 Fundamental concept
 Evolution of neural networks
 Biological neuron
 Artificial neural network
 NN architecture
 McCulloch-Pitts Model
 Designing a simple network
 Non-separable patterns
 Perceptron model with bias
 Activation functions
 Binary, bipolar, continuous, Ramp.
 Limitations of perceptron.
 Perceptron learning rule
 Delta learning rule(LMS-Widrow Hoff)
 Multi-layer perceptron network
 Adjusting weights of hidden layers
 Error back propagation algorithm
 Logistic regression

Module 6
 Curse of Dimensionality
 Feature selection and feature extraction
 Dimensionality reduction techniques
 Principal component analysis.

Module 1: Introduction to Machine Learning

- Introduction to machine learning:


Machine learning is a subset of artificial intelligence that focuses on the
development of algorithms allowing computers to learn and make predictions or
decisions from data without being explicitly programmed.

- Issues in ML:
 Machine learning faces several challenges that impact the performance
and reliability of models.
 Overfitting occurs when a model learns to capture noise or random
fluctuations in the training data, leading to poor performance on new,
unseen data.
 Underfitting arises when a model is too simple to capture the underlying
patterns in the data, resulting in poor performance on both training and
test data.
 The bias-variance tradeoff refers to the delicate balance between bias
(error due to overly simplistic assumptions) and variance (error due to
sensitivity to small fluctuations in the training data) in a model.
 Lack of interpretability arises when complex models are difficult to
understand and explain to humans.
 Data quality issues such as missing values, noisy data, and imbalanced
datasets can also degrade the performance of machine learning models.
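
For squared-error loss, the bias-variance tradeoff listed above is commonly written as a decomposition of the expected prediction error. The notation is an assumption made for illustration and does not come from these notes: the data follow y = f(x) + ε with noise variance σ², and f̂ is the model fitted on a randomly drawn training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Underfitting corresponds to a large bias term and overfitting to a large variance term. The noise term cannot be reduced by any choice of model.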

- Application of ML:
 Machine learning finds applications in various domains, including
healthcare, finance, marketing, manufacturing, and more.
 In healthcare, ML is used for disease diagnosis, drug discovery,
personalized treatment planning, and medical image analysis.
 In finance, it's applied to fraud detection, risk management, algorithmic
trading, and credit scoring.
 In marketing, ML powers recommendation systems, customer
segmentation, churn prediction, and sentiment analysis.
 In manufacturing, it's used for predictive maintenance, quality control,
supply chain optimization, and demand forecasting.

- Steps of developing an ML application:
Developing a machine learning (ML) application involves several key steps, from
defining the problem to maintaining the deployed model:
 Problem Formulation:
- Define the problem you want to solve using machine learning.
- Clearly articulate the objectives and requirements of the project.
- Determine the type of machine learning task (e.g., classification,
regression, clustering) that best suits the problem.
 Data Collection:
- Gather relevant data from various sources, including databases,
APIs, files, or web scraping.
- Ensure the data is representative, diverse, and sufficiently large to
train a robust model.
- Address data quality issues such as missing values, outliers, and
inconsistencies.
 Data Preprocessing:
- Clean the data by handling missing values, outliers, and noisy
observations.
- Normalize or standardize the features to ensure they have similar
scales and distributions.
- Encode categorical variables into numerical representations using
techniques like one-hot encoding or label encoding.
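- Split the data into training, validation, and testing sets to evaluate
the model's performance. Fit scalers and encoders on the training set
only and then apply them to the other sets, so that no information from
the validation or testing data leaks into training.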
 Feature Engineering:
- Extract relevant features from the raw data that are informative for
the learning task.
- Create new features by combining or transforming existing features
to capture meaningful patterns in the data.
- Perform dimensionality reduction techniques to reduce the number
of features and remove redundant information.
 Model Selection:
- Choose the appropriate machine learning algorithm(s) based on the
nature of the problem, data characteristics, and computational
resources.
- Consider various factors such as model complexity, interpretability,
and scalability when selecting the algorithm.
- Experiment with multiple algorithms and compare their
performance using evaluation metrics.
 Model Training:
- Train the selected model(s) on the training data using optimization
techniques such as gradient descent, stochastic gradient descent, or
genetic algorithms.
- Tune the hyperparameters of the model(s) to improve performance
and prevent overfitting.
- Monitor the training process and evaluate the model's performance
on the validation set to ensure it's learning effectively.
 Model Evaluation:
- Assess the performance of the trained model(s) using appropriate
evaluation metrics such as accuracy, precision, recall, F1-score, or
area under the ROC curve.
- Analyze the model's strengths, weaknesses, and failure cases to
identify areas for improvement.
- Validate the model's generalization performance on the testing set
to ensure it performs well on unseen data.
 Model Interpretation:
- Interpret the trained model to understand how it makes predictions
or decisions.
- Analyze the importance of features in influencing the model's
output using techniques like feature importance or SHAP (SHapley
Additive exPlanations).
- Visualize model predictions, decision boundaries, or feature
interactions to gain insights into its behavior.
 Model Deployment:
- Deploy the trained model into production environments such as
web servers, cloud platforms, or edge devices.
- Integrate the model into existing systems or workflows to make
predictions or automate decision-making processes.
- Monitor the deployed model's performance, reliability, and
scalability in real-world scenarios.
- Implement mechanisms for model versioning, rollback, and
updates to ensure continuous improvement and maintenance.
 Feedback Loop:
- Collect feedback from users, stakeholders, or domain experts to
evaluate the model's effectiveness and relevance.
- Incorporate feedback into the model development process to
iteratively improve its performance and adapt to changing
requirements.
- Continuously monitor and update the model based on new data,
feedback, or emerging trends to ensure its long-term viability and
usefulness.
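
A few of the steps above (splitting, scaling, training, and evaluation) can be sketched end to end with only the Python standard library. This is a minimal illustration under stated assumptions, not a reference pipeline: the toy data, the function names (`split`, `fit_scaler`, `train`, `mse`), the learning rate, and the epoch count are all hypothetical choices.

```python
import random

def split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]

def fit_scaler(values):
    """Mean and standard deviation, computed from training values only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, std or 1.0  # avoid dividing by zero for a constant feature

def train(xs, ys, lr=0.1, epochs=500):
    """Fit y ≈ w*x + b by batch gradient descent on mean squared error."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(error^2) with respect to w and b
        w -= lr * (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        b -= lr * (2.0 / n) * sum(errors)
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the line w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noise-free data: price = 3 * area + 50
data = [(float(a), 3.0 * a + 50.0) for a in range(40, 140, 5)]  # 20 rows

train_rows, val_rows, test_rows = split(data)
mean, std = fit_scaler([x for x, _ in train_rows])

def prepare(rows):
    """Standardize the feature with the training statistics."""
    return [(x - mean) / std for x, _ in rows], [y for _, y in rows]

(train_x, train_y), (val_x, val_y), (test_x, test_y) = map(
    prepare, (train_rows, val_rows, test_rows))

w, b = train(train_x, train_y)
print("validation MSE:", mse(val_x, val_y, w, b))
print("test MSE:", mse(test_x, test_y, w, b))
```

The scaler is fitted on the training rows only and then reused for the validation and test rows, so the held-out data never influence training. In a real project the validation error would guide hyperparameter choices, and the test error would be read once at the end.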

- Types of learning:
Learning in machine learning can be categorized into different types based on
the presence or absence of supervision and feedback during the training process.
Here are the main types of learning:
 Supervised Learning:
- In supervised learning, the algorithm learns from labeled data,
where each training example is paired with a corresponding target
label or output.
- The goal is to learn a mapping from input features to output labels
by minimizing the discrepancy between predicted and true labels.
- Supervised learning tasks include classification (predicting discrete
labels) and regression (predicting continuous values).
- Examples: Email spam classification, handwritten digit
recognition, house price prediction.
 Unsupervised Learning:
- In unsupervised learning, the algorithm learns from unlabeled data,
where only input features are provided without any corresponding
target labels.
- The goal is to discover hidden patterns, structures, or relationships
in the data without explicit guidance.
- Unsupervised learning tasks include clustering (grouping similar
data points), dimensionality reduction (reducing the number of
features), and density estimation (estimating the probability
distribution of the data).
- Examples: Customer segmentation, anomaly detection, topic
modeling.
 Semi-supervised Learning:
- Semi-supervised learning combines elements of both supervised
and unsupervised learning by using a small amount of labeled data
along with a larger amount of unlabeled data.
- The algorithm leverages the labeled data to guide the learning
process and improve the model's performance, especially in cases
where obtaining labeled data is expensive or time-consuming.
- Semi-supervised learning techniques aim to exploit the underlying
structure of the data present in the unlabeled samples to enhance
the model's generalization ability.
- Examples: Text classification with limited labeled data, image
recognition with a small number of labeled images.
 Reinforcement Learning:
- In reinforcement learning, an agent learns to interact with an
environment to achieve a specific goal by taking actions and
receiving feedback in the form of rewards or penalties.
- The agent learns through trial and error by exploring the
environment, selecting actions based on learned policies, and
receiving feedback on the quality of its actions.
- Reinforcement learning tasks involve maximizing cumulative
rewards over time through optimal decision-making and policy
learning.
- Examples: Game playing (e.g., AlphaGo), robot control,
autonomous driving.
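
The trial-and-error loop described for reinforcement learning can be illustrated with a multi-armed bandit, a simplified single-state setting often used to introduce the idea. The sketch below is an assumption-laden toy: the rewards are fixed and deterministic, and the function name `epsilon_greedy_bandit`, the exploration rate, and the step count are chosen only for illustration.

```python
import random

def epsilon_greedy_bandit(rewards, steps=1000, epsilon=0.1, seed=0):
    """Estimate the value of each action by trial and error.

    rewards[a] is the (hypothetical, deterministic) reward for action a.
    """
    rng = random.Random(seed)
    q = [0.0] * len(rewards)       # current value estimate per action
    counts = [0] * len(rewards)    # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))                    # explore
        else:
            action = max(range(len(rewards)), key=lambda a: q[a])   # exploit
        reward = rewards[action]
        counts[action] += 1
        # Incremental average of the rewards seen for this action
        q[action] += (reward - q[action]) / counts[action]
    return q, counts

q, counts = epsilon_greedy_bandit([0.2, 0.8, 0.5])
best = max(range(len(q)), key=lambda a: q[a])
print("estimated values:", q)
print("best action:", best)
```

With probability `epsilon` the agent tries a random action (exploration); otherwise it picks the action with the highest estimated value (exploitation). Full reinforcement learning problems add states and delayed rewards on top of this loop.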

- Concept of classification:
 Classification is a fundamental task in supervised learning where the goal
is to assign input data points to predefined categories or classes.
 It's commonly used for tasks such as spam detection, sentiment analysis,
document categorization, and image recognition.
 Classification models learn decision boundaries that separate different
classes in the input feature space, enabling them to classify new instances
into appropriate categories.
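
As one concrete instance of a classifier, the following sketch implements k-nearest neighbours, which labels a new point by majority vote among the closest training points. The algorithm is picked here only as an example and is not prescribed by these notes; the data and the name `knn_predict` are hypothetical.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbours
    dists = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points in two classes
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.8, 1.6)))  # A
print(knn_predict(points, labels, (6.2, 6.4)))  # B
```

The decision boundary is implicit here: it is the set of points where the vote flips from one class to the other.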

- Clustering and prediction:


 Clustering is an unsupervised learning task where the goal is to partition a
dataset into groups or clusters such that data points within the same
cluster are more similar to each other than to those in other clusters.
 It's used for tasks such as customer segmentation, anomaly detection, and
image segmentation.
 Prediction involves estimating unknown or future outcomes from patterns
learned in historical data.
 It's used for tasks such as sales forecasting, stock price prediction,
weather forecasting, and demand prediction.
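
The clustering task described above can be sketched with k-means (Lloyd's algorithm), one common clustering method among many. The points, the hand-picked starting centroids, and the fixed iteration count are assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        assignments = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
# Initial centroids are picked by hand here; real uses often pick them at random
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

No labels are used anywhere: the groups emerge only from distances between points, which is what makes the task unsupervised.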

- Training, testing, and validation dataset:


 Training Dataset:
- The training dataset is a subset of the available data that is used to
train the machine learning model.
- It consists of input features and their corresponding target labels or
outputs.
- During training, the model learns the underlying patterns and
relationships in the data by adjusting its parameters or weights
based on the provided input-output pairs.
- The training dataset should be representative of the overall data
distribution and cover a diverse range of scenarios to ensure that
the model learns to generalize well to unseen instances.
 Testing Dataset:
- The testing dataset is a separate subset of the available data that is
used to evaluate the performance of the trained model.
- It consists of input features and their corresponding ground truth
labels or outputs, but the model has not seen these instances during
training.
- After training the model on the training dataset, it's evaluated on
the testing dataset to assess its ability to make accurate predictions
on unseen data.
- The testing dataset serves as an independent measure of the
model's generalization performance and helps identify potential
issues such as overfitting or underfitting.
 Validation Dataset:
- The validation dataset is an additional subset of the available data
that is used during the model training process to tune
hyperparameters and assess model performance.
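- It serves as a proxy for unseen data and helps detect overfitting
during development. Because it is used repeatedly to make tuning
decisions, its estimate becomes optimistic, which is why a separate
testing dataset is kept for the final assessment.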
- The validation dataset is used iteratively during training to evaluate
different model configurations, select hyperparameters, and make
adjustments to the model architecture.
- Once the model's hyperparameters are optimized based on the
validation performance, the final model is trained on the entire
training dataset and evaluated on the testing dataset for the final
performance assessment.

- Cross-validation:
Cross-validation is a resampling technique used to assess the performance and
generalization ability of machine learning models. It's particularly useful when
the dataset is limited and needs to be efficiently utilized for both training and
evaluation. Here's an explanation of cross-validation:

1. Concept:
- Cross-validation involves partitioning the dataset into multiple
subsets, called folds, where each fold is used alternately for
training and validation.
- The model is trained on k - 1 folds (training set) and evaluated on
the remaining fold (validation set). This process is repeated k
times, with each fold used exactly once as the validation set.
- The k results are then averaged to produce a single performance
metric, such as accuracy or mean squared error, which represents
the overall performance of the model.

2. Types of Cross-Validation:
- K-Fold Cross-Validation: The dataset is divided into k equal-sized
folds, and the model is trained and evaluated k times, each
time using a different fold as the validation set.
- Stratified K-Fold Cross-Validation: Similar to k-fold cross-
validation, but it ensures that each fold contains approximately the
same proportion of classes as the original dataset, which is useful
for imbalanced datasets.
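- Leave-One-Out Cross-Validation (LOOCV): Each data point is
used as a validation set once, with the rest of the data used for
training; this is k-fold cross-validation with k equal to the number
of data points. It is computationally expensive because the model is
trained once per data point, and although the estimate has low bias,
it can have high variance.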
- Repeated K-Fold Cross-Validation: The k-fold cross-validation
process is repeated multiple times with different random splits of
the data to reduce variability and obtain more stable performance
estimates.
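
The k-fold procedure can be sketched in a few lines of standard-library Python. The helper names (`k_fold_indices`, `cross_val_score`) and the trivial "predict the training mean" model are hypothetical and serve only to make the fold bookkeeping visible.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cross_val_score(xs, ys, k, fit, score):
    """Average a score over k folds; fit and score are caller-supplied functions."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Hypothetical "model": always predict the mean of the training targets
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

xs = list(range(6))
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_val_score(xs, ys, k=3, fit=fit_mean, score=mse))  # 6.25
```

Calling `k_fold_indices(n, n)` gives leave-one-out cross-validation. The folds here are contiguous, so the data should be shuffled beforehand if their order carries meaning. Stratified and repeated variants change only how the indices are drawn.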

- Overfitting and underfitting of the model:


Overfitting and underfitting are two common problems encountered when
training machine learning models. Both leave the model performing poorly on
unseen data: an overfit model does well on the training data but fails to
generalize, while an underfit model does poorly even on the training data.
Each is described below:

1. Overfitting:
- Overfitting occurs when a model learns to capture noise or irrelevant
patterns in the training data, resulting in poor generalization to new, unseen
data.
- Characteristics of overfitting:
 The model performs well on the training data but poorly on the
testing data.
 The model captures noise or outliers in the training data, leading
to high variance in predictions.
 The model may exhibit complex decision boundaries that fit the
training data too closely, resulting in poor performance on new
instances.
- Causes of overfitting:
 Model Complexity: Complex models with a large number of
parameters have higher capacity to memorize the training data,
increasing the risk of overfitting.
 Insufficient Training Data: Limited training data or imbalanced
datasets may not provide enough diverse examples for the
model to learn meaningful patterns, leading to overfitting.
 Incorrect Hyperparameters: Poor choices of hyperparameters,
such as learning rate, regularization strength, or network
architecture, can exacerbate overfitting.
2. Underfitting:
- Underfitting occurs when a model is too simple to capture the underlying
patterns in the training data, resulting in poor performance on both the training
and testing data.
- Characteristics of underfitting:
 The model performs poorly on both the training and testing
data, indicating a failure to capture the underlying relationships
in the data.
 The model may exhibit high bias, resulting in systematic errors
and an inability to represent the true underlying data
distribution.
 The model may be too simplistic or have insufficient capacity to
learn complex patterns in the data.
- Causes of underfitting:
 Model Complexity: Models that are too simple or have too few
parameters may lack the capacity to capture the underlying
patterns in the data, leading to underfitting.
 Insufficient Training: Inadequate training or insufficient
exposure to diverse examples may prevent the model from
learning meaningful representations of the data.
 Inappropriate Features: If the input features do not adequately
capture the relevant information in the data, the model may
underfit.
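
The contrast between the two failure modes can be seen by comparing training and testing error as model capacity changes. The sketch below uses k-nearest-neighbour regression on hypothetical noisy data. With k = 1 the model memorizes the training set, and with k equal to the training-set size it predicts the same average everywhere. The data, the seed, and the choice of model are assumptions for illustration.

```python
import random

def knn_regress(train_x, train_y, x, k):
    """Predict the mean target of the k training points nearest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN regressor on the points (xs, ys)."""
    preds = [knn_regress(train_x, train_y, x, k) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

rng = random.Random(0)

def sample(xs):
    """Hypothetical data: a quadratic trend plus Gaussian noise."""
    return [x * x + rng.gauss(0.0, 1.0) for x in xs]

train_x = [i * 0.5 for i in range(20)]          # 0.0, 0.5, ..., 9.5
test_x = [i * 0.5 + 0.25 for i in range(20)]    # points between the training inputs
train_y, test_y = sample(train_x), sample(test_x)

for k in (1, 4, len(train_x)):
    print("k =", k,
          "train MSE =", mse(train_x, train_y, train_x, train_y, k),
          "test MSE =", mse(train_x, train_y, test_x, test_y, k))
```

With k = 1 the training error is exactly zero while the testing error is not, which is the signature of overfitting. With k = 20 both errors are large, which is the signature of underfitting. The exact numbers depend on the random seed.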

- Confusion matrix:
 A confusion matrix is a table used to evaluate the performance of a
classification model.
 For a binary problem, it summarizes the true positive (TP), true
negative (TN), false positive (FP), and false negative (FN) predictions
made by the model.
 Metrics such as accuracy, precision, recall, F1-score, and specificity
are computed from these four counts.
 True Positives (TP): The number of instances that were correctly
predicted as positive by the model.
 True Negatives (TN): The number of instances that were correctly
predicted as negative by the model.
 False Positives (FP): The number of instances that were incorrectly
predicted as positive by the model (false alarms).
 False Negatives (FN): The number of instances that were incorrectly
predicted as negative by the model (missed detections).
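
The four counts and the metrics derived from them can be computed directly. The sketch below assumes a binary problem with labels 1 (positive) and 0 (negative). The example labels and the function names `confusion_counts` and `metrics` are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Derive the usual summary metrics; 0.0 is returned when a denominator is zero."""
    def ratio(num, den):
        return num / den if den else 0.0
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 5 1 1
print(metrics(tp, tn, fp, fn))
```

Precision is TP / (TP + FP), recall is TP / (TP + FN), specificity is TN / (TN + FP), and the F1-score is the harmonic mean of precision and recall.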
