MACHINE LEARNING
Syllabus UNIT-1
INTRODUCTION-Artificial Intelligence, Machine Learning,
Deep learning, Types of Machine Learning Systems, Main
Challenges of Machine Learning.
STATISTICAL LEARNING: Introduction, Supervised and
Unsupervised Learning, Training and Test Loss, Tradeoffs in
Statistical Learning, Estimating Risk Statistics, Sampling
distribution of an estimator, Empirical Risk Minimization
1. ARTIFICIAL INTELLIGENCE, MACHINE LEARNING, DEEP
LEARNING:
Artificial Intelligence is a branch of computer science
that helps us to create smart, intelligent machines.
AI aims to make a machine work like a
human.
What is Artificial Intelligence (AI)?
Artificial Intelligence is defined as a field of science and
engineering that deals with making intelligent machines
or computers to perform human-like activities.
AI, machine learning, and deep learning are
related but distinct terms, although all these
technologies are used to build intelligent
machines or applications that behave like a human.
Artificial Intelligence (AI) can be understood as an umbrella
that consists of both Machine learning and deep learning.
We can say deep learning and machine learning both are
subsets of artificial intelligence.
The primary goal of Artificial Intelligence is to
simulate human-like intelligence in machines,
empowering them to carry out complex tasks
and decision-making processes autonomously.
The objectives of AI encompass a wide range
of applications:
1. Problem-Solving and Decision Making
2. Natural Language Processing (NLP)
3. Machine Learning and Deep Learning
4. Robotics and Automation
5. Enhancing Healthcare and Medicine
6. Fostering Creativity and Innovation
MACHINE LEARNING
Definition:
“Machine learning is a subfield of artificial
intelligence that uses algorithms trained on
data sets to create self-learning models that
are capable of predicting outcomes and
classifying information without human
intervention.”
Machine learning is used today for a wide
range of commercial purposes, including
1) Suggesting products to consumers based on
their past purchases.
2) Predicting stock market fluctuations.
3) Translating text from one language to
another.
Deep learning is the sub-branch of ML that
trains ML models on huge amounts of input
data with complex algorithms, and mainly
works with neural networks.
In deep learning, the word “deep” refers to the
number of layers in a neural network.
Deep Learning is a set of algorithms inspired
by the structure and function of the human
brain.
It uses a huge amount of structured as well as
unstructured data to teach computers and
predict accurate results.
The main difference between machine
learning and deep learning technologies is
the presentation of data: machine learning
learns from structured/unstructured data
with conventional algorithms, while deep
learning learns its models through neural
networks.
Need for Machine Learning:
As per research, we generate around 2.5
quintillion bytes of data every single day!
It is estimated that by 2020, 1.7MB of
data will be created every second for
every person on earth.
With the availability of so much
data, it is finally possible to build
predictive models that can study and
analyze complex data to find useful
insights and deliver more accurate
results.
Top-tier companies such as Netflix and
Amazon build such Machine Learning
models using tons of data in order to
identify profitable opportunities and avoid
unwanted risks.
LIST OF REASONS WHY MACHINE
LEARNING IS SO IMPORTANT:
1) INCREASE IN DATA GENERATION: Due to excessive
production of data, we need a method that can be used to
structure, analyze and draw useful insights from data.
2) IMPROVE DECISION MAKING: By making use of
various algorithms, Machine Learning can be used to make better
business decisions.
For example, Machine Learning is used to forecast sales, predict
downfalls in the stock market, identify risks and anomalies, etc.
3) Uncover patterns & trends in data:
Finding hidden patterns and extracting
key insights from data is the most
essential part of Machine Learning.
4) Solve complex problems: From
detecting the genes linked to the deadly
ALS disease to building self-driving cars,
Machine Learning can be used to solve
the most complex problems.
Amyotrophic lateral sclerosis (ALS) is a group of
progressive diseases that affect the nerve cells
in the brain and spinal cord that control
muscle movement.
• Machine learning is a subset of Artificial
Intelligence (AI) which provides machines the
ability to learn automatically & improve
from experience without being explicitly
programmed to do so.
KEY DEFINITIONS:
• Algorithm: A Machine Learning algorithm is a
set of rules and statistical techniques used to
learn patterns from data and draw significant
information from it.
• Model: A model is the main component of
Machine Learning.
A model is trained by using a Machine
Learning Algorithm. An algorithm maps
all the decisions that a model is supposed
to take based on the given input, in order
to get the correct output.
• Predictor Variable: It is a feature (or features) of the data that can
be used to predict the output.
• Response Variable: It is the feature or the output variable
that needs to be predicted by using the predictor
variable(s).
• Training Data: The Machine Learning
model is built using the training data. The
training data helps the model to identify
key trends and patterns essential to predict
the output.
• Testing Data: After the model is trained, it
must be tested to evaluate how
accurately it can predict an outcome.
This is done by the testing data set.
MACHINE LEARNING PROCESS:
Step 1: Define the objective of the
Problem Statement
• At this step, we must understand what
exactly needs to be predicted.
Example: the objective is to predict the
possibility of rain by studying weather
conditions.
Step 2: Data Gathering
At this stage, you must be asking questions
such as,
What kind of data is needed to solve this
problem?
Is the data available?
How can I get the data?
Step 3: Data Preparation
The data you collected is almost never in
the right format.
You will encounter a lot of
inconsistencies in the data set such as
missing values, redundant variables,
duplicate values, etc.
At this stage, you scan the data set for
any inconsistencies and fix them.
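A minimal sketch of this cleaning step, assuming pandas and a small made-up weather data set (the column names are illustrative only):

    import numpy as np
    import pandas as pd

    # Made-up weather records showing the inconsistencies described above.
    df = pd.DataFrame({
        "temperature": [30, 25, 25, np.nan, 22],
        "humidity":    [40, 80, 80, 75, np.nan],
    })

    df = df.drop_duplicates()               # remove duplicate rows
    df = df.dropna(subset=["temperature"])  # drop rows missing a key value
    df["humidity"] = df["humidity"].fillna(df["humidity"].median())  # impute the rest
    print(df)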
Step 4: Exploratory Data Analysis
Data Exploration involves understanding
the patterns and trends in the data.
At this stage, all the useful insights are
drawn and correlations between the
variables are understood.
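A small exploration sketch in pandas (again on made-up data); describe() summarizes each variable and corr() exposes correlations between them:

    import pandas as pd

    # Small made-up weather data set to explore.
    df = pd.DataFrame({
        "temperature": [30, 25, 28, 35, 22],
        "humidity":    [40, 80, 75, 30, 90],
        "rain":        [0, 1, 1, 0, 1],
    })

    print(df.describe())               # summary statistics for each variable
    print(df.corr(numeric_only=True))  # correlations, e.g., humidity vs. rain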
Step 5: Building a Machine Learning Model
This stage always begins by splitting the
data set into two parts, training data, and
testing data.
The training data will be used to build
and analyze the model. The logic of the
model is based on the Machine Learning
Algorithm that is being implemented.
Example:
In the case of predicting rainfall, the
output will be in the form of True (it
will rain tomorrow) or False (no rain
tomorrow), so a classification algorithm
is a natural choice.
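A minimal sketch of this splitting and training step, assuming scikit-learn and a toy, made-up weather data set:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: [temperature, humidity] -> will it rain tomorrow?
    X = [[30, 40], [25, 80], [28, 75], [35, 30], [22, 90], [33, 45]]
    y = [False, True, True, False, True, False]

    # Split the data set into training data and testing data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    # Build the model on the training data only.
    model = DecisionTreeClassifier().fit(X_train, y_train)
    print(model.predict(X_test))  # predictions for the held-out examples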
Step 6: Model Evaluation & Optimization
After building a model by using the training
data set,
it is finally time to put the model to a
test. The testing data set is used to check
the efficiency of the model and how
accurately it can predict the outcome.
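One way to score the model on the testing data set (a sketch; the labels here are made up):

    from sklearn.metrics import accuracy_score

    # Actual outcomes vs. the model's predictions on the test set.
    y_test = [True, False, True, True]
    y_pred = [True, False, False, True]

    print(accuracy_score(y_test, y_pred))  # 0.75 -> 3 of 4 predictions correct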
Step 7: Predictions
Once the model is evaluated and
improved, it is finally used to make
predictions.
• Example: For predicting the occurrence of
rainfall, the output will be a categorical
variable.
MACHINE LEARNING TYPES:
A machine can learn to solve a
problem by following any one of the
three approaches.
1. Supervised learning.
2. Unsupervised learning.
3. Reinforcement Learning.
SUPERVISED LEARNING:
“Supervised learning is a technique in
which we teach or train the machine
using data which is well labeled.”
The labeled data set is nothing but the
training data set.
Example:
As kids we all needed guidance to solve math
problems. Our teachers helped us understand
what addition is and how it is done.
Similarly, you can think of supervised learning
as a type of Machine Learning that involves a
guide. The labeled data set is the teacher that
will train you to understand patterns in the
data.
Supervised learning examples:
1) Classification 2) Regression
Classification: Classification is a supervised
machine learning method where the model
tries to predict the correct label of a given
input data.
Common Classification algorithms are:
• Linear Classifiers
• Support Vector Machines (SVM)
• Decision trees
• K-Nearest Neighbor
• Random Forest.
2) Regression: It is used to understand the
relationship between dependent and
independent variables.
It is commonly used to make projections, such
as for sales revenue for a given business.
Common Regression algorithms are:
1) Linear regression
2) Logistic regression
3) Polynomial regression
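A minimal regression sketch using scikit-learn's LinearRegression (the sales figures are invented for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up example: project sales revenue from advertising spend.
    ad_spend = np.array([[10], [20], [30], [40]])  # independent variable
    revenue = np.array([55, 95, 150, 205])         # dependent variable

    reg = LinearRegression().fit(ad_spend, revenue)
    print(reg.predict([[25]]))  # projected revenue for a new spend of 25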
UNSUPERVISED LEARNING
Unsupervised learning involves training by
using unlabeled data and allowing the model
to act on that information without guidance.
EXAMPLE:
Think of unsupervised learning as a smart kid
that learns without any guidance.
FOR EXAMPLE, given unlabeled images of Tom
and Jerry, the model identifies prominent
features of Tom, such as pointy ears and bigger
size, to understand that this image is of
type 1. Similarly, it finds such features in Jerry
and knows that this image is of type 2.
Therefore, it classifies the images into two
different classes without knowing who Tom
or Jerry is.
TYPES OF UNSUPERVISED LEARNING:
Unsupervised learning is classified into two categories of
algorithms:
1) Clustering: A clustering problem is where we want to
discover the inherent groupings in the data.
EX: such as grouping customers by purchasing behavior.
2) Association: An association rule learning problem is
where we want to discover rules that
describe large portions of our data.
EX: people that buy X also tend to buy Y.
Common unsupervised learning algorithms
(clustering and dimensionality reduction):
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
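As a sketch of clustering, k-means can group customers by purchasing behavior; this example uses invented data and scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans

    # Made-up customers: [annual spend, visits per month].
    customers = np.array([[500, 2], [520, 3], [80, 1],
                          [90, 1], [1000, 8], [980, 7]])

    # Discover 3 inherent groupings in the data.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)  # cluster index assigned to each customer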
MAIN CHALLENGES OF MACHINE
LEARNING
During the development phase, our focus is to
select a learning algorithm and train it on
some data.
The two things that might be a problem are a
bad algorithm or bad data, or perhaps both of
them.
The following are some of the challenges
of ML:
1. Not enough training data.
Even for very simple problems you typically
need thousands of examples, and for
complex problems such as image or speech
recognition you may need millions of
examples
2. Poor Quality of data:
If your training data has lots of errors,
outliers, and noise, it will make it impossible
for your machine learning model to detect a
proper underlying pattern. Hence, it will not
perform well.
“Most Data Scientists spend a significant part
of their time in cleaning data”.
3. Irrelevant Features:
“Garbage in, garbage out (GIGO).”
Our training data should contain mostly
relevant features and as few irrelevant ones
as possible.
4. Non-representative training data:
To make sure that our model generalizes well,
our training data should be representative of
the new cases that we want to generalize to.
5. Underfitting the Training Data
It occurs when your model is too simple to
learn the underlying structure of the data.
If the model underfits:
--- Training error will be high (high bias)
--- Testing error will also be high
Variance (definition)
“The amount of variability in the target function
in response to a change in the training data is
known as Variance”.
6. Overfitting the Training Data
Overfitting happens when the model is too
complex relative to the amount and noisiness
of the training data.
If the model overfits:
--- Training error will be low (low bias)
--- Testing error will be high (high variance)
The possible solutions are:
1) To simplify the model by selecting one with
fewer parameters
2) By reducing the number of attributes in the
training data or by constraining the model
3) To gather more training data
4) To reduce the noise in the training data (e.g.,
fix data errors and remove outliers)
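As one illustration of constraining a model (solution 2 above), ridge regression penalizes large weights; a sketch with synthetic data and scikit-learn:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Synthetic, noisy data for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=50)

    # Larger alpha = stronger constraint on the weights = simpler model.
    model = Ridge(alpha=1.0).fit(X, y)
    print(model.coef_)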
STATISTICAL LEARNING:
“When the goal is to interpret the model and
quantify the uncertainty in the data, this
analysis is usually referred to as statistical
learning”.
TRAINING LOSS AND TESTING LOSS:
• Training loss and testing loss are essential
concepts in machine learning that reflect how
well a model is performing during training
and generalizing to new, unseen data.
What is loss?
• Loss is the penalty for a bad prediction.
• That is, loss is a number indicating how bad
the model's prediction was on a single
example.
NOTE: If the model's prediction is perfect, the
loss is zero; otherwise, the loss is greater.
• The goal of training a model is to find a set of
weights and biases that have low loss.
TRAINING LOSS:
• Training loss, also known as the training error
or training cost, measures how well a machine
learning model fits the training data.
• In supervised learning, a machine learning
algorithm builds a model by examining many
examples and attempting to find a model that
minimizes loss; this process is called empirical
risk minimization
• It quantifies the difference between the
model's predictions and the actual target
values in the training dataset.
Figure 3. High loss in the left model; low loss in the right model.
The arrows represent loss.
The blue lines represent predictions.
• Clearly, the line in the right plot is a much
better predictive model than the line in the
left plot.
• We might be wondering whether we could
create a mathematical function, a loss
function, that would aggregate the individual
losses in a meaningful fashion.
1) SQUARED LOSS: a popular loss function
• The linear regression models we'll examine
here use a loss function called squared loss
(also known as L2 loss). Averaged over the
whole data set, squared loss gives the mean
squared error:

$$\mathrm{MSE} = \frac{1}{N} \sum_{(x,y) \in D} \bigl(y - \mathrm{prediction}(x)\bigr)^2$$

where
(x, y) is an example in which
x is the set of features
y is the example's label
prediction(x) is the model's output for the features x
D is a data set containing many labeled
examples, which are (x, y) pairs
N is the number of examples in D.
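This formula translates directly into code; a minimal NumPy sketch:

    import numpy as np

    def mse(y_true, y_pred):
        """Mean squared error: squared loss averaged over all examples."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean((y_true - y_pred) ** 2)

    print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.4]))  # small loss for close predictions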
• Low Training Loss: A low training loss
indicates that the model is fitting the training
data well.
• High Training Loss: A high training loss
suggests that the model is struggling to
capture the patterns in the data. It might be
underfitting, where the model is too simplistic
to learn the complexities of the data.
TESTING LOSS:
• Testing loss, also known as test error, measures
the model's performance on a separate
dataset that it hasn't seen during training. This
dataset is often referred to as the test
dataset.
• The purpose of testing loss is to evaluate how
well the model generalizes to new data.
Bias
“Bias is defined as the inability of the model
to capture the true relationship, because of
which there is some difference or error
between the model’s predicted value and the
actual value.”
Low Bias:
In this case, the model will closely match the
training dataset.
• High Bias:
In this case, the model will not match the
training dataset closely.
Variance
“The amount of variability in the target function
in response to a change in the training data is
known as Variance. “
• When a model takes into consideration the
noise and fluctuation in the data, it is said to
be of High Variance.
DIFFERENT TRADE-OFFS IN STATISTICAL
LEARNING
• These trade-offs involve making decisions that
balance various factors to achieve the best
possible model performance, generalization,
and practicality.
1. Bias-Variance Trade-off:
Bias: the difference or error occurring between
the model’s predicted value and the actual value.
Variance: The model's sensitivity to small
fluctuations in the training data.
Trade-off: Finding the right level of
complexity to minimize both bias and
variance for optimal generalization.
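One way to see this trade-off is to vary model complexity directly; the sketch below (synthetic data, scikit-learn) fits polynomials of increasing degree and compares training and testing loss:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    X_train = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
    y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(scale=0.2, size=30)
    X_test = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
    y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(scale=0.2, size=30)

    # Degree 1 underfits (high bias); degree 15 overfits (high variance).
    for degree in (1, 4, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        print(degree,
              mean_squared_error(y_train, model.predict(X_train)),  # training loss
              mean_squared_error(y_test, model.predict(X_test)))    # testing loss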
2. Overfitting vs. Underfitting:
Overfitting: Overfitting happens when the model is
too complex relative to the amount and noisiness
of the training data.
Underfitting: Occurs when a model is too simple to
capture the patterns in the data, resulting in poor
performance on both training and testing data.
Trade-off: Balancing model complexity to avoid
overfitting (high variance) while ensuring that the
model can capture the true patterns.
3. Model Complexity vs. Interpretability:
Model Complexity: More complex models can
capture intricate relationships in the data.
Interpretability: Simpler models are often more
interpretable and provide insights into the
underlying process.
Trade-off: Choosing between complexity for
accuracy and simplicity for explainability,
depending on the application's requirements.
4. Underfitting vs. Feature Engineering:
Underfitting: Simplistic models may not fully utilize
available features.
Feature Engineering: Adding or transforming
features can improve model performance but may
also introduce noise.
Trade-off: Balancing the need for effective feature
utilization with the potential complexity added
through feature engineering.
5. Model Performance vs. Computational
Resources:
Model Performance: Using more complex
models or larger datasets can lead to better
performance.
Computational Resources: Complex models may
require more computation time, memory, or
processing power.
Trade-off: Achieving the desired performance
within the available computational
constraints.
6. Data Quality vs. Model Performance:
Data Quality: High-quality, clean data leads to
more accurate models.
Model Performance: Models can't perform
better than the data they're trained on; noisy
or inaccurate data can degrade performance.
Trade-off: Balancing the effort invested in
improving data quality with the expected
improvements in model performance.
7. Model Selection vs. Model Complexity:
Model Selection: Choosing the best algorithm or
model for a given task.
Model Complexity: Different models have
varying levels of complexity and may perform
differently on different datasets.
Trade-off: Evaluating different models while
considering their complexity and performance
on the available data.
ESTIMATING RISK STATISTICS:
Although machine learning can be used as a risk
management tool, it also contains many risks
itself.
“Quantifying the generalization risk by
estimating the loss is a process called risk
estimation or risk characterization.”
Below are a few risks associated with
Machine Learning:
1. Poor Data:
Poor data or dirty data includes errors in training
data, outliers, and unstructured data, which
cannot be adequately interpreted by the
model.
2. Overfitting
• An overfitted model fits the training data so
perfectly that it fails to learn the general
variability of the data.
• It means it won't be able to generalize well
when it comes to testing on real data.
3. Biased data
• Bias means a systematic error that occurs in
the machine learning model itself due to
incorrect assumptions in the ML process.
4. Lack of strategy and experience:
• A lack of strategy and experience due to
fewer resources leads to wastage of time and
money, as well as negatively affecting the
organization's production and revenue.
5. Security Risks
• Data scientists and machine learning experts have reported 3
types of attacks, primarily for machine learning models. These
are as follows:
1) Evasion attacks (an act or instance of escaping or
avoiding): supplying malicious data that gets classified as genuine.
2) Data Poisoning attacks:
- the raw data used to train the ML model is
tampered with at its source, so the resulting
machine learning model loses accuracy.
3) Model Stealing:
Model stealing techniques are used to create a
clone model, with a similar look and feel, based
on information or data used in the training of a
base model.
6. Data privacy and confidentiality:
To achieve good results, we need to secure data
by defining some privacy terms and conditions
as well as making it confidential.
7. Third-party risks
• Third-party risks generally exist when
someone outsources their business to third-
someone outsources their business to third-
party service providers who may fail to
properly govern a machine learning solution.
SAMPLING DISTRIBUTION OF AN ESTIMATOR:
A sample is a smaller part of a larger group.
This smaller portion is meant to be
representative of the population as a whole.
The sampling distribution of a statistic is the
probability distribution of that statistic over
many random samples drawn from the
population.
• An estimator is a rule or formula used to
calculate an estimate of a population
parameter based on sample data.
How the concept of the sampling distribution
of an estimator works:
1. Population Parameter:
- Consider a population with an unknown
parameter, such as the population mean (μ) or
population proportion (p).
2. Random Sampling:
- Draw multiple random samples of a fixed size
(n) from the population. Each sample provides
a set of observed data.
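A small simulation sketch of this idea in NumPy, using the sample mean as the estimator (the population and sizes are invented):

    import numpy as np

    rng = np.random.default_rng(42)
    population = rng.exponential(scale=2.0, size=100_000)  # true mean = 2.0

    # Draw many random samples of fixed size n; compute the estimator each time.
    n, repeats = 50, 10_000
    sample_means = np.array(
        [rng.choice(population, size=n).mean() for _ in range(repeats)])

    # The spread of these means approximates the estimator's sampling distribution.
    print(population.mean(), sample_means.mean(), sample_means.std())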
EMPIRICAL RISK MINIMIZATION:
• Empirical risk refers to the training error,
which is the difference between the model's
predictions and the actual target values in the
training dataset.
• Minimizing empirical risk is a key objective in
training machine learning models to ensure
that they learn patterns from the training data.
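In symbols (a standard formulation, using the loss notation introduced earlier), the learner picks, from a family of candidate models, the one with the smallest average training loss:

$$\hat{f} = \arg\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}\bigl(y_i, f(x_i)\bigr)$$

where the sum runs over the N training examples (x_i, y_i); the average itself is the empirical risk.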
Strategies to Minimize Empirical Risk:
1) Regularization: Regularization helps prevent
overfitting by balancing between fitting the
training data and keeping the model's
parameters small.
2) Feature Engineering: Create meaningful
features or representations that capture the
essential information from the data.
3) Cross-Validation: Use techniques like k-fold
cross-validation to estimate the model's
generalization performance on different data
splits. This helps detect overfitting and aids in
model selection.
4) Early Stopping: Monitor the model's
performance on a validation set during
training. Stop training when the validation
performance starts deteriorating, preventing
overfitting.
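Finally, a sketch of strategy 3 (k-fold cross-validation) with scikit-learn and synthetic data:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Synthetic regression data for illustration.
    X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

    # 5-fold cross-validation: train on 4 folds, validate on the 5th, rotate.
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
    print(scores.mean())  # average R^2 across the 5 validation folds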