
UNIT I INTRODUCTION AND MATHEMATICAL FOUNDATIONS 12

What is Machine Learning? - Need - History - Definitions - Applications - Advantages, Disadvantages & Challenges - Types of Machine Learning Problems - Mathematical Foundations - Linear Algebra & Analytical Geometry - Probability and Statistics - Bayesian Conditional Probability - Vector Calculus & Optimization - Decision Theory - Information Theory

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that machine learning can perform tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot manually process huge amounts of data, so we need computer systems, and machine learning makes this easy for us.

We can train machine learning algorithms by providing them huge amounts of data, letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by a cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood through its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and more. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data

o Solving complex problems that are difficult for a human

o Decision making in various sectors, including finance

o Finding hidden patterns and extracting useful information from data

History of Machine Learning

The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence. The synonym "self-teaching computers" was also used in this period.

By the early 1960s an experimental "learning machine" with punched tape memory, called Cybertron,
had been developed by Raytheon Company to analyze sonar signals, electrocardiograms, and speech
patterns using rudimentary reinforcement learning. It was repetitively "trained" by a human
operator/teacher to recognize patterns and equipped with a "goof" button to cause it to re-evaluate
incorrect decisions. A representative book on research into machine learning during the 1960s was
Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.
Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in
1973. In 1981 a report was given on using teaching strategies so that a neural network learns to
recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.

Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." This definition of the tasks with which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is replaced with the question "Can machines do what we (as thinking entities) can do?"

Modern-day machine learning has two objectives: one is to classify data based on models that have been developed; the other is to make predictions about future outcomes based on those models. A hypothetical algorithm specific to classifying data may use computer vision of moles coupled with supervised learning in order to train it to classify cancerous moles. A machine learning algorithm for stock trading may inform the trader of potential future price movements.

Machine Learning

Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959.

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.

Machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together to create predictive models. It constructs or uses algorithms that learn from historical data: the more information we provide, the better the performance.

Machine learning is a growing technology which enables computers to learn automatically from past
data. Machine learning uses various algorithms for building mathematical models and making
predictions using historical data or information. Currently, it is being used for various tasks such as
image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender system,
and many more.

Applications of Machine learning

Below are some of the most popular real-world applications of machine learning:

1. Image Recognition

Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "Deep Face," which is responsible for face recognition and
person identification in the picture.

2. Speech Recognition

While using Google, we get the option to "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition." At present, machine learning algorithms are
widely used by various applications of speech recognition. Google assistant, Siri, Cortana, and Alexa
are using speech recognition technology to follow the voice instructions.

3. Traffic prediction

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts traffic conditions such as whether traffic is clear, slow-moving, or heavily congested with the help of two sources:

o Real-time location of the vehicle from the Google Maps app and sensors

o Average time taken on past days at the same time of day

Everyone who uses Google Maps helps make the app better: it takes information from the user and sends it back to its database to improve performance.

4. Product recommendations

Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests products according to customer interest.

Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars

One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a leading car manufacturer, is working on self-driving cars and uses unsupervised learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter

o Header filter

o General blacklists filter

o Rules-based filters

o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just through our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them to a server on the cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for a fraudulent transaction; hence the system detects it and makes our online transactions more secure.

9. Stock Market trading

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis

In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation

Nowadays, if we visit a new place and do not know the language, it is not a problem at all, as machine learning helps us by converting the text into a language we know. Google's GNMT (Google Neural Machine Translation) provides this feature: it is a neural machine translation system that translates the text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with image recognition and translates text from one language to another.

Advantages, Disadvantages & Challenges of Machine learning

Advantages of Machine learning

Alexa, Google, Cortana, self-driving cars, and Siri are some well-known real-world examples of the application of machine learning. Being very wide in its applications, machine learning is now part of professional courses in many universities. The advantages of machine learning will help you form a positive outlook on this technological advancement.

1. Artificial Intelligence

The future of automation looks toward fully automated artificial intelligence. An advanced AI system will be capable of handling various applications where human involvement can be avoided. In robotics, research is underway to build machines capable of thinking like the human brain.

2. Frequent Upgrades and Improvement

As more and more data is fed to the systems, the output improves consistently. With time and
learning, the machine can produce more accurate results without the need for new codes and
algorithms. The algorithm analyzes various data statistics and predicts the possible options, and gives
the best option according to gained experience.

3. Wide Applications are well-known advantages of Machine Learning

With numerous applications and use, there will be more advantages of machine learning apparently.
The fields in which a developed machine learning system can be applied are huge in number. From
medical, engineering, aviation, space technology, etc., the use of this technology is unlimited. It will
create an environment where once the system is designed, it will be capable of taking appropriate
actions without human intervention.

4. Security Upgrades

Most industries are prone to various man-made accidents and mishaps, which can be avoided using machine learning. Moreover, once the system has gained enough experience from recorded data, it can mitigate the risk of errors that cause various mechanical and human failures.

5. Sorting bulk data

Whether it is finding the best deals or searching for relevant user-based results, computer programs can do the work a lot faster than humans. For example, some jobs may require thousands of humans to sort the data, a process that may take as long as a month to complete; the same process can be completed within minutes by automated systems. The lower cost and better time utilization are highly valued benefits of machine learning for companies and employees.

Disadvantages of Machine learning

This technology is still relatively young, and full automation still requires a breakthrough in current technology. Although it has complemented humans with search and voice features, the accuracy of results is still computer-driven rather than tailored to the user's needs. Apart from making humans lazy, some other machine learning disadvantages are listed below.

1. Computational Errors

The results obtained from machines may contain errors due to statistical reasoning. Most automated systems generate results based on previous searches and data loaded into a computer program, so new experiences or data may not produce accurate results or output.

2. Huge Data Requirement

Initially, a huge amount of time is invested in building machine learning programs, and the data requirement is also very large. Long programs are needed to make the machine learn initial responses and essential functions; then, based on user searches and requirements, the machine gives results and continuously improves its decision quality. Even small logical errors or faulty inputs can become a heavy disadvantage of the machine learning process.

3. Still in the development phase

The results might be satisfactory, but a completely automated system requires a lot of research and
analysis. Scientists and programmers are continuously trying to figure out more advanced techniques
for improving machine outputs.

It will be a long journey before we can achieve an AI close to human interpretation. The fact that the technology is in its infancy also raises questions about its acceptance and flexibility in the face of rapidly changing technology.

4. Impossible to identify multiple choices

A machine can only identify the choices it has been designed to recognize, whereas human behaviour varies widely. The machine is made to choose mostly correct decisions, but there can be situations where a machine cannot make the optimal decision.
5. Storage Services are among major disadvantages of Machine Learning

The backups and servers needed to maintain and record the acquired data keep piling up, and so does the cost. For a machine to learn, the possible data is unlimited, and there is always a need to store it. Existing storage and cloud services are still not sufficient to make room for this amount of data.

Challenges of Machine Learning

There are a lot of challenges that machine learning faces:

1. Poor Quality of Data

Data plays a significant role in the machine learning process. One of the significant issues that
machine learning professionals face is the absence of good quality data. Unclean and noisy data can
make the whole process extremely exhausting. We don’t want our algorithm to make inaccurate or
faulty predictions. Hence the quality of data is essential to enhance the output. Therefore, we need to
ensure that the process of data pre-processing which includes removing outliers, filtering missing
values, and removing unwanted features, is done with the utmost level of perfection.

2. Underfitting of Training Data

Underfitting occurs when the model is unable to establish an accurate relationship between the input and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture a precise relationship. To overcome this issue:

o Increase the training time of the model

o Enhance the complexity of the model

o Add more features to the data

o Reduce the regularization parameters

3. Overfitting of Training Data

Overfitting refers to a machine learning model that fits its training data too closely, which negatively affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals. It typically means the algorithm has been trained on noisy or biased data, which affects its overall performance.

We can tackle this issue by:

o Analyzing the data with the utmost level of care

o Using data augmentation techniques

o Removing outliers from the training set

o Selecting a model with fewer features


4. Machine Learning is a Complex Process

The machine learning industry is young and continuously changing; rapid trial-and-error experiments are being carried out. The process is still evolving, so there is a high chance of error, which makes learning complex. It includes analyzing the data, removing data bias, training the model, applying complex mathematical calculations, and much more. Hence it is a complicated process, which is another big challenge for machine learning professionals.

5. Lack of Training Data

The most important task in the machine learning process is to train the model on enough data to achieve accurate output. Too little training data will produce inaccurate or overly biased predictions. A machine learning algorithm needs a lot of data to distinguish between classes; for complex problems, it may even require millions of examples. Therefore, we need to ensure that machine learning algorithms are trained with sufficient amounts of data.

6. Slow Implementation

This is one of the common issues faced by machine learning professionals. Machine learning models can be highly effective at providing accurate results, but they may take a tremendous amount of time. Slow programs, data overload, and excessive requirements mean it often takes a long time to produce accurate results. Further, models require constant monitoring and maintenance to deliver the best output.

7. Imperfections in the Algorithm When Data Grows

So you have found quality data, trained the model well, and the predictions are precise and accurate. But there is a twist: the model may become useless in the future as the data grows. The best model of the present may become inaccurate in the future and require further adjustment. So you need regular monitoring and maintenance to keep the algorithm working. This is one of the most exhausting issues faced by machine learning professionals.

Types of Machine Learning Problems

1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Semi-Supervised Machine Learning

4. Reinforcement Learning
1. Supervised Machine Learning

As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on this training, the machine predicts the output. Here, labelled data means that some of the inputs are already mapped to outputs. More precisely, we first train the machine with the inputs and corresponding outputs, and then we ask the machine to predict the output for a test dataset.

The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

o Classification

o Regression

a) Classification

Classification algorithms are used to solve classification problems in which the output variable is
categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm

o Decision Tree Algorithm

o Logistic Regression Algorithm

o Support Vector Machine Algorithm

b) Regression

Regression algorithms are used to solve regression problems in which there is a linear relationship
between input and output variables. These are used to predict continuous output variables, such as
market trends, weather prediction, etc.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm

o Multivariate Regression Algorithm

o Decision Tree Algorithm

o Lasso Regression
Advantages of Supervised Learning:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.

o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages of Supervised Learning:

o These algorithms are not able to solve complex tasks.

o It may predict the wrong output if the test data is different from the training data.

o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning

 Image Segmentation: Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
 Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
 Fraud Detection: Supervised learning classification algorithms are used for identifying fraud transactions, fraud customers, etc. This is done by using historic data to identify the patterns that can lead to possible fraud.
 Spam Detection: In spam detection and filtering, classification algorithms are used. These algorithms classify an email as spam or not spam. The spam emails are sent to the spam folder.
 Speech Recognition: Supervised learning algorithms are also used in speech recognition. The algorithm is trained with voice data, and various identifications can be done using it, such as voice-activated passwords, voice commands, etc.
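
To make the supervised workflow described above concrete (train on labelled data, then predict on a held-out test set), here is a minimal sketch in Python. It assumes scikit-learn is installed; the Iris dataset and logistic regression are illustrative choices, not part of the original notes.

    # Minimal supervised-learning sketch: train on labelled data, predict on a test set.
    # Assumes scikit-learn is installed; dataset and model choice are illustrative only.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # features (X) and labels (y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)    # hold out 20% as the test dataset

    model = LogisticRegression(max_iter=200)     # a simple classification algorithm
    model.fit(X_train, y_train)                  # learn the input -> output mapping

    y_pred = model.predict(X_test)               # predict outputs for unseen inputs
    print("Test accuracy:", accuracy_score(y_test, y_pred))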

2. Unsupervised Machine Learning


 Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. In unsupervised machine learning, the machine is trained using an unlabelled dataset and predicts the output without any supervision.
 In unsupervised learning, the models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
 The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset. The machine discovers patterns and differences, such as colour or shape differences, and predicts the output when tested with the test dataset.

Categories of Unsupervised Machine Learning

o Clustering
o Association
a) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a way to
group the objects into a cluster such that the objects with the most similarities remain in one group
and have fewer or no similarities with the objects of other groups. An example of the clustering
algorithm is grouping the customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:
o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
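
As a minimal sketch of clustering, the following groups unlabelled 2-D points with K-Means. It assumes scikit-learn and NumPy are available; the synthetic data are purely illustrative.

    # Minimal clustering sketch: group unlabelled points by similarity with K-Means.
    # Assumes scikit-learn is installed; the synthetic data are illustrative only.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two loose groups of 2-D points (e.g. customers described by two features).
    data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    print("Cluster centres:\n", kmeans.cluster_centers_)
    print("First 10 cluster labels:", kmeans.labels_[:10])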

b) Association

Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it can
generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage
mining, continuous production, etc. Some popular algorithms of Association rule learning are Apriori
Algorithm, Eclat, FP-growth algorithm.

Advantages of Unsupervised Learning Algorithm:

o These algorithms can be used for complicated tasks compared to the supervised ones because these
algorithms work on the unlabeled dataset.

o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is easier as
compared to the labelled dataset.

Disadvantages of Unsupervised Learning Algorithm:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the
algorithms are not trained with the exact output beforehand.

o Working with Unsupervised learning is more difficult as it works with the unlabelled dataset that
does not map with the output.

Applications of Unsupervised Learning

o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright issues in document network analysis of text data for scholarly articles.

o Recommendation Systems: Recommendation systems widely use unsupervised learning techniques for building recommendation applications for different web applications and e-commerce websites.

o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which can identify unusual data points within the dataset. It is used to discover fraudulent transactions.

o Singular Value Decomposition: Singular Value Decomposition (SVD) is used to extract particular information from a database, for example, extracting information on each user located at a particular location.
3. Semi-Supervised Learning

Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning. It represents the intermediate ground between Supervised (With
Labelled training data) and Unsupervised learning (with no labelled training data) algorithms and uses
the combination of labelled and unlabeled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, the data mostly consists of unlabelled examples. Labels are costly to obtain, so for practical purposes a dataset may contain only a few of them. This setting differs from both supervised and unsupervised learning, which are defined by the presence or absence of labels.

To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the concept
of Semi-supervised learning is introduced. The main aim of semi-supervised learning is to effectively
use all the available data, rather than only labelled data like in supervised learning.

Advantages of Semi-Supervised Learning:

o It is simple and easy to understand the algorithm.

o It is highly efficient.

o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages of Semi-Supervised Learning:

o Iteration results may not be stable.

o We cannot apply these algorithms to network-level data.

o Accuracy is low.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial: taking actions, learning from experience, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents learn from
their experiences only.

Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using Markov Decision Process(MDP). In MDP,
the agent constantly interacts with the environment and performs actions; at each action, the
environment responds and generates a new state.
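
As a toy sketch of this agent-environment loop, the following runs tabular Q-learning on a made-up 1-D corridor where the agent is rewarded for reaching the rightmost state. The environment, reward, and all parameter values are illustrative assumptions, not part of the original notes.

    # Toy sketch of the MDP agent-environment loop with Q-learning (Python).
    # The 1-D "corridor" environment and all parameter values are invented for illustration.
    import random

    N_STATES = 5          # states 0..4; reaching state 4 ends the episode with a reward
    ACTIONS = [-1, +1]    # move left or move right
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, epsilon = 0.5, 0.9, 0.3   # learning rate, discount, exploration rate

    for episode in range(200):
        state = 0
        while state != N_STATES - 1:
            # epsilon-greedy action selection: mostly exploit, sometimes explore
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state = min(max(state + action, 0), N_STATES - 1)
            reward = 1.0 if next_state == N_STATES - 1 else 0.0   # environment feedback
            # Q-learning update: move the estimate toward reward + discounted future value
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state

    # Learned greedy action for each non-terminal state (should be +1, i.e. move right).
    print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)})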
Categories of Reinforcement Learning

o Positive Reinforcement Learning: Positive reinforcement learning means increasing the tendency that the required behaviour will occur again by adding something. It strengthens the behaviour of the agent and impacts it positively.

o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that the specific behaviour will occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning

o Video Games: RL algorithms are very popular in gaming applications, where they are used to achieve superhuman performance. Some popular game-playing systems that use RL algorithms are AlphaGo and AlphaGo Zero.

o Resource Management: The paper "Resource Management with Deep Reinforcement Learning" showed how to use RL to automatically learn to schedule computer resources among waiting jobs in order to minimize average job slowdown.

o Robotics: RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas, and these robots are made more capable with reinforcement learning. Different industries have their own vision of building intelligent robots using AI and machine learning technology.

o Text Mining: Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by Salesforce.

Advantages of Reinforcement Learning

o It helps in solving complex real-world problems that are difficult to solve with general techniques.

o The learning model of RL is similar to how human beings learn; hence highly accurate results can be obtained.

o It helps in achieving long-term results.

Disadvantages of Reinforcement Learning

o RL algorithms are not preferred for simple problems.

o RL algorithms require huge data and computations.

o Too much reinforcement learning can lead to an overload of states which can weaken the results.

Mathematical Foundations of Machine Learning

Linear Algebra & Analytical Geometry of Machine Learning

Machine learning has a strong connection with mathematics. Every machine learning algorithm is based on mathematical concepts, and mathematics also helps one choose the correct algorithm by considering training time, complexity, number of features, etc. Linear algebra is an essential field of mathematics that covers the study of vectors, matrices, planes, mappings, and lines required for linear transformations.

The term linear algebra was initially introduced in the early 18th century to find the unknowns in linear equations and solve them easily; it is an important branch of mathematics that helps in studying data. Linear algebra is undoubtedly a primary requirement for working with machine learning applications, and it is a prerequisite for starting to learn machine learning and data science.

Linear algebra plays a vital role as a key foundation of machine learning, and it enables ML algorithms to run on huge datasets.

The concepts of linear algebra are widely used in developing algorithms in machine learning.

Linear Algebra also helps to create better supervised as well as unsupervised Machine Learning
algorithms.

A few supervised learning algorithms that can be created using linear algebra are as follows:

o Logistic Regression

o Linear Regression

o Decision Trees

o Support Vector Machines (SVM)

Further, below are some unsupervised learning algorithms that can also be created with the help of
linear algebra:

o Singular Value Decomposition (SVD)

o Clustering

o Components Analysis

Examples
o Datasets and Data Files
o Linear Regression
o Recommender Systems
o One-hot encoding
o Regularization
o Principal Component Analysis
o Images and Photographs
o Singular-Value Decomposition
o Deep Learning
o Latent Semantic Analysis

Datasets and Data Files


Each machine learning project works on the dataset, and we fit the machine learning model using
this dataset.
Each dataset resembles a table-like structure consisting of rows and columns, where each row represents an observation and each column represents a feature/variable. This dataset is handled as a matrix, which is a key data structure in linear algebra.
Further, when this dataset is divided into inputs and outputs for a supervised learning model, it is represented as a matrix (X) and a vector (y), where the vector is another important concept of linear algebra.
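
A minimal sketch of this split, assuming NumPy and using made-up numbers: the last column of the table is treated as the output vector y and the remaining columns as the input matrix X.

    # Sketch: a tabular dataset as a matrix X (rows = observations, columns = features)
    # and a target vector y. The numbers are invented for illustration.
    import numpy as np

    dataset = np.array([
        [5.1, 3.5, 0],   # last column is the output/label
        [4.9, 3.0, 0],
        [6.2, 2.9, 1],
        [6.7, 3.1, 1],
    ])
    X = dataset[:, :-1]   # input matrix (4 observations x 2 features)
    y = dataset[:, -1]    # output vector
    print(X.shape, y.shape)   # (4, 2) (4,)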

Images and Photographs


In machine learning, images/photographs are used for computer vision applications. Each image is an example of a matrix from linear algebra, because an image is a table-like structure of pixel values with a height and a width.
Moreover, different operations on images, such as cropping, scaling, and resizing, are performed using the notations and operations of linear algebra.

One Hot Encoding


In machine learning, sometimes, we need to work with categorical data. These categorical
variables are encoded to make them simpler and easier to work with, and the popular encoding
technique to encode these variables is known as one-hot encoding.
In the one-hot encoding technique, a table is created that shows a variable with one column for
each category and one row for each example in the dataset. Further, each row is encoded as a
binary vector, which contains either zero or one value. This is an example of sparse
representation, which is a subfield of Linear Algebra.
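
A minimal one-hot encoding sketch in plain NumPy; the colour categories are invented for illustration.

    # Sketch of one-hot encoding: each category becomes a binary column.
    # The categories and samples are illustrative, not from the notes.
    import numpy as np

    categories = ["red", "green", "blue"]
    samples = ["red", "blue", "blue", "green"]

    index = {c: i for i, c in enumerate(categories)}
    one_hot = np.zeros((len(samples), len(categories)), dtype=int)
    for row, value in enumerate(samples):
        one_hot[row, index[value]] = 1

    print(one_hot)
    # [[1 0 0]
    #  [0 0 1]
    #  [0 0 1]
    #  [0 1 0]]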

Linear Regression

Linear regression is a popular machine learning technique borrowed from statistics. It describes the relationship between input and output variables and is used in machine learning to predict numerical values. The most common way to solve linear regression problems is least-squares optimization, which is carried out with the help of matrix factorization methods. Commonly used matrix factorization methods include LU decomposition and singular-value decomposition, both concepts from linear algebra.
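
As a hedged sketch of least-squares linear regression, the following uses NumPy's lstsq solver (which factorizes the design matrix internally via SVD); the data and true coefficients are invented for illustration.

    # Sketch of solving linear regression by least squares with NumPy.
    # Data and the "true" coefficients are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    true_w = np.array([2.0, -3.0])
    y = X @ true_w + 0.5 + rng.normal(scale=0.1, size=100)

    # Add a column of ones so the intercept is estimated too.
    X_design = np.column_stack([np.ones(len(X)), X])
    coef, residuals, rank, sing_vals = np.linalg.lstsq(X_design, y, rcond=None)
    print("intercept and weights:", coef)   # approx [0.5, 2.0, -3.0]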

Regularization

In machine learning, we usually look for the simplest possible model that achieves the best outcome for the specific problem. Simpler models generalize well from specific examples to unseen data. These simpler models are often considered to be models with smaller coefficient values.

A technique used to minimize the size of coefficients of a model while it is being fit on data is
known as regularization. Common regularization techniques are L1 and L2 regularization. Both of
these forms of regularization are, in fact, a measure of the magnitude or length of the coefficients
as a vector and are methods lifted directly from linear algebra called the vector norm.
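
A small sketch showing that the L1 and L2 penalties are simply vector norms of the coefficient vector (NumPy assumed; the coefficient values are made up).

    # Sketch: L1 and L2 regularization penalties are vector norms of the coefficients.
    import numpy as np

    w = np.array([0.5, -1.2, 3.0, 0.0])   # illustrative model coefficients
    l1_penalty = np.linalg.norm(w, 1)     # sum of absolute values
    l2_penalty = np.linalg.norm(w, 2)     # Euclidean length
    print(l1_penalty, l2_penalty)         # 4.7 and about 3.27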

Principal Component Analysis

Generally, each dataset contains thousands of features, and fitting the model with such a large
dataset is one of the most challenging tasks of machine learning. Moreover, a model built with
irrelevant features is less accurate than a model built with relevant features. There are several
methods in machine learning that automatically reduce the number of columns of a dataset, and
these methods are known as Dimensionality reduction. The most commonly used dimensionality
reductions method in machine learning is Principal Component Analysis or PCA. This technique
makes projections of high-dimensional data for both visualizations and training models. PCA uses
the matrix factorization method from linear algebra.
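
A minimal PCA sketch, assuming scikit-learn is available; the random data stand in for a real feature matrix.

    # Sketch of PCA as a dimensionality-reduction step (scikit-learn assumed).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 10))          # 200 samples, 10 features (illustrative)
    pca = PCA(n_components=2)                  # project onto the top 2 principal components
    reduced = pca.fit_transform(data)
    print(reduced.shape)                       # (200, 2)
    print(pca.explained_variance_ratio_)       # variance captured by each component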

Singular-Value Decomposition

Singular-Value decomposition is also one of the popular dimensionality reduction techniques and
is also written as SVD in short form.

It is a matrix factorization method from linear algebra, and it is widely used in different applications such as feature selection, visualization, noise reduction, and many more.
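
A short SVD sketch with NumPy, factorizing a small (arbitrary) matrix and reconstructing it from its factors.

    # Sketch: singular-value decomposition of a small matrix with NumPy.
    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0],
                  [0.0, 2.0]])               # illustrative matrix
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    print("singular values:", S)
    # Reconstruct A from its factors to confirm A = U * diag(S) * V^T.
    print(np.allclose(A, U @ np.diag(S) @ Vt))   # True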

Latent Semantic Analysis

Natural Language Processing or NLP is a subfield of machine learning that works with text and
spoken words.

NLP represents a text document as large matrices with the occurrence of words. For example, the
matrix column may contain the known vocabulary words, and rows may contain sentences,
paragraphs, pages, etc., with cells in the matrix marked as the count or frequency of the number of
times the word occurred. It is a sparse matrix representation of text. Documents processed in this
way are much easier to compare, query, and use as the basis for a supervised machine learning
model.

This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also
known by the name Latent Semantic Indexing or LSI.

Recommender System

A recommender system is a sub-field of machine learning, a predictive modelling problem that provides recommendations of products. For example, online recommendation of books based on the customer's previous purchase history, or recommendation of movies and TV series, as we see on Amazon and Netflix.

The development of recommender systems is mainly based on linear algebra methods. We can
understand it as an example of calculating the similarity between sparse customer behaviour
vectors using distance measures such as Euclidean distance or dot products.

Different matrix factorization methods, such as singular-value decomposition, are used in recommender systems to query, search, and compare user data.

Deep Learning

Artificial Neural Networks (ANNs) are non-linear ML algorithms that process and transfer information from one layer to another in a way loosely inspired by the brain.

Deep learning studies these neural networks, using newer and faster hardware to train and develop larger networks on huge datasets. Deep learning methods achieve great results on different challenging tasks such as machine translation, speech recognition, etc. The core processing of neural networks is based on linear algebra data structures, which are multiplied and added together. Deep learning algorithms also work with vectors, matrices, and tensors (matrices with more than two dimensions) of inputs and coefficients across multiple dimensions.
Linear Algebra

Linear algebra provides a framework for handling and manipulating data, which is often
represented as vectors and matrices. In machine learning, linear algebra operations are used
extensively in various stages, from data preprocessing to model training and evaluation. For
instance, operations such as matrix multiplication, eigenvalue decomposition, and singular value
decomposition are pivotal in dimensionality reduction techniques like Principal Component
Analysis (PCA). Similarly, the concepts of vector spaces and linear transformations are integral to
understanding neural networks and optimization algorithms.

Definition of Linear Algebra

Linear algebra is the branch of mathematics that deals with vector spaces and linear mappings
between these spaces. It encompasses the study of vectors, matrices, linear equations, and their
properties.

Fundamental Concepts

Vectors

 Vectors are quantities that have both magnitude and direction, often represented as arrows in
space.

Matrices

 Matrices are rectangular arrays of numbers, arranged in rows and columns.

 Matrices are used to represent linear transformations, systems of linear equations, and data
transformations in machine learning.

Scalars

 Scalars are single numerical values; they have magnitude only and no direction.

 Scalars are used to scale vectors or matrices through operations like multiplication.

Operations in Linear Algebra

1. Addition and Subtraction

 Addition and subtraction of vectors or matrices involve adding or subtracting corresponding elements.
2. Scalar Multiplication

 Scalar multiplication involves multiplying each element of a vector or matrix by a scalar.

3. Dot Product (Vector Multiplication)

 The dot product of two vectors measures the similarity of their directions.

 It is computed by multiplying corresponding elements of the two vectors and summing the results.

Example: Given two vectors u = [u1, u2, u3] and v = [v1, v2, v3], their dot product is calculated as u · v = u1·v1 + u2·v2 + u3·v3.

4. Cross Product (Vector Multiplication)

 The cross product of two vectors in three-dimensional space produces a vector orthogonal to
the plane containing the original vectors.

 It is used less frequently in machine learning compared to the dot product.

 Example: Given two vectors u = [u1, u2, u3] and v = [v1, v2, v3], their cross product u × v is calculated as u × v = (u2·v3 − u3·v2, u3·v1 − u1·v3, u1·v2 − u2·v1).
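
A quick numerical check of both products with NumPy (the vectors are arbitrary examples):

    # Sketch of the dot and cross products with NumPy; u and v are illustrative.
    import numpy as np

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([4.0, 5.0, 6.0])

    print(np.dot(u, v))      # 1*4 + 2*5 + 3*6 = 32
    print(np.cross(u, v))    # [-3.  6. -3.], orthogonal to both u and v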
Linear Transformations

Linear transformations are fundamental operations in linear algebra that involve transforming vectors and matrices while preserving certain properties such as linearity and proportionality. In the context of machine learning, linear transformations play a crucial role in data preprocessing, feature engineering, and model training.

A. Definition and Explanation

Linear transformations are functions that map vectors from one vector space to another in a linear manner. Formally, a transformation T is considered linear if it satisfies two properties:

1. Additivity: T(u+v)=T(u)+T(v) for all vectors u and v.

2. Homogeneity: T(kv)=kT(v) for all vectors v and scalars k.

Linear transformations can be represented by matrices, and their properties are closely related to the
properties of matrices.

B. Common Linear Transformations in Machine Learning

1. Translation:

 Translation involves shifting the position of vectors without changing their orientation or magnitude.

 In machine learning, translation is commonly used for data normalization and centering, where the mean of the data is subtracted from each data point.

2. Scaling:

 Scaling involves stretching or compressing vectors along each dimension.

 Scaling is frequently applied in feature scaling, where features are scaled to have
similar ranges to prevent dominance of certain features in machine learning models.

3. Rotation:

 Rotation involves rotating vectors around an axis or point in space.

 While less common in basic machine learning algorithms, rotation can be useful in
advanced applications such as computer vision and robotics.
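
A minimal sketch of these three transformations applied to 2-D points with NumPy; the points and angle are arbitrary examples.

    # Sketch of translation (centering), scaling, and rotation on 2-D points.
    import numpy as np

    points = np.array([[1.0, 2.0],
                       [3.0, 0.0]])            # two illustrative points as rows

    centered = points - points.mean(axis=0)    # translation: subtract the mean (centering)

    scale = np.diag([2.0, 0.5])                # scaling matrix: stretch x, compress y
    scaled = points @ scale

    theta = np.pi / 2                          # rotation by 90 degrees
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    rotated = points @ rotation.T              # rotate each row vector

    print(centered, scaled, rotated, sep="\n")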
Matrix Operations

Matrix operations form the cornerstone of linear algebra, providing essential tools for manipulating
and analyzing data in machine learning. In this section, we explore key matrix operations, including
multiplication, transpose, inverse, and determinant, along with their significance and applications.

A. Matrix Multiplication

Matrix multiplication is a fundamental operation in linear algebra, involving the multiplication of two matrices to produce a new matrix. The resulting matrix's dimensions are determined by the number of rows in the first matrix and the number of columns in the second matrix.

 Definition: Given two matrices A and B, the product matrix C=A⋅B is computed by taking
the dot product of each row of matrix A with each column of matrix B.

 Significance: Matrix multiplication is widely used in machine learning for various tasks,
including transformation of feature vectors, computation of model parameters, and neural
network operations such as feedforward and backpropagation.

B. Transpose and Inverse of Matrices

1. Transpose:

 The transpose of a matrix involves flipping its rows and columns, resulting in a new
matrix where the rows become columns and vice versa.

 It is denoted by A^T, and its dimensions are the reverse of the original matrix.

 Transpose is used in applications such as solving systems of linear equations, computing matrix derivatives, and performing matrix factorization.

2. Inverse:

 The inverse of a square matrix A is another matrix, denoted by A^-1, such that A·A^-1 = A^-1·A = I, where I is the identity matrix.

 Not all matrices have inverses; only square matrices whose determinant is not equal to zero are invertible.

 Inverse matrices are used in solving systems of linear equations, computing solutions
to optimization problems, and performing transformations.

C. Determinants

The determinant of a square matrix is a scalar value that encodes various properties of the
matrix, such as its volume, orientation, and invertibility.

 Significance: The determinant is used to determine whether a matrix is invertible, to calculate the volume of the parallelepiped spanned by vectors, and to analyze the stability of numerical algorithms.
 Properties: The determinant satisfies several properties, including linearity, multiplicativity, and the property that a matrix is invertible if and only if its determinant is non-zero.
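
A short NumPy sketch of the operations above (multiplication, transpose, determinant, inverse) on arbitrary 2x2 matrices:

    # Sketch of basic matrix operations with NumPy; A and B are illustrative matrices.
    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    B = np.array([[5.0, 6.0],
                  [7.0, 8.0]])

    print(A @ B)               # matrix multiplication
    print(A.T)                 # transpose
    print(np.linalg.det(A))    # determinant: 1*4 - 2*3 = -2, non-zero so A is invertible
    A_inv = np.linalg.inv(A)
    print(np.allclose(A @ A_inv, np.eye(2)))   # A * A^-1 = I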

Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors are fundamental concepts in linear algebra that play a significant role in
machine learning algorithms and applications. In this section, we explore the definition, significance,
and applications of eigenvalues and eigenvectors.

A. Definition and Significance

1. Eigenvalues:

 Eigenvalues of a square matrix A are scalar values that represent how the transformation represented by A stretches or compresses vectors in certain directions.

 Eigenvalues quantify the scale of the transformation along the corresponding eigenvectors and are crucial for understanding the behaviour of linear transformations.

2. Eigenvectors:

 Eigenvectors are non-zero vectors that are transformed by a matrix only by a scalar
factor, known as the eigenvalue.

 They represent the directions in which a linear transformation represented by a matrix stretches or compresses space.

 Eigenvectors corresponding to distinct eigenvalues are linearly independent and form a basis for the vector space.
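
A minimal NumPy sketch that computes eigenvalues and eigenvectors of a small symmetric matrix and verifies the defining relation A·v = λ·v; the matrix is an arbitrary example.

    # Sketch: eigenvalues and eigenvectors with NumPy; A is an illustrative matrix.
    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)                 # eigenvalues 3 and 1 (order may vary)
    v = eigenvectors[:, 0]             # eigenvector for the first eigenvalue
    print(np.allclose(A @ v, eigenvalues[0] * v))   # A v = lambda v -> True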
B. Applications in Machine Learning

1. Dimensionality Reduction:

 Techniques such as Principal Component Analysis (PCA) utilize eigenvalues and eigenvectors to identify the principal components (directions of maximum variance) in high-dimensional data and project it onto a lower-dimensional subspace.

 Eigenvalues represent the amount of variance explained by each principal component, allowing for effective dimensionality reduction while preserving as much information as possible.

2. Graph-based Algorithms:

 Eigenvalues and eigenvectors play a crucial role in graph-based algorithms such as spectral clustering and PageRank.

 In spectral clustering, eigenvalues and eigenvectors of the graph Laplacian matrix are used to partition data into clusters based on spectral properties.

3. Matrix Factorization:

 Techniques like Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) rely on eigenvalue decomposition to factorize matrices into lower-dimensional representations.

 Eigenvalue decomposition facilitates the extraction of meaningful features or components from high-dimensional data matrices, enabling efficient data representation and analysis.

Probability

Random variable

A random variable, X, is a quantity that can take a different value each time it is inspected, as in measurements in experiments. The following is an example.

X = the value obtained after rolling a die

Here, each time, the value could be anything from the finite set {1, 2, 3, 4, 5, 6}.

Types of Random variable

There are two types of random variables:

1. Discrete random variable

2. Continuous random variable

Discrete random variable

A discrete random variable is one which may take on only a countable number of distinct values.

For example, rolling a die can give values from the set {1, 2, 3, 4, 5, 6}.

Continuous random variable

A continuous random variable is one which can take infinitely many possible values.

For example, a light-sensor measurement can take any of infinitely many values if the measuring device has sufficiently high precision.

Predicting random variable value

Due to the nature of a random variable, it is impossible to predict its exact value.


Predicting randomness

While we might not be able to predict a specific value, it is often the case that some values might be
more likely than others. We might be able to say something about how often a certain number will
appear when drawing many examples.

Probability Density (or Distribution) function

How likely each value is for a random variable x, is captured by the probability density
function pdf(x) in the continuous case and by the probability mass function P(x) in the discrete case.

Probability mass function for discrete random variable

Let us take the example of tossing a fair coin three times, where the random variable X represents the number of heads after the three tosses.

The probabilities of the possible head counts are:

X = 0: 1/8, X = 1: 3/8, X = 2: 3/8, X = 3: 1/8

Example

The probability mass function can be defined as the probability that a discrete random variable will be exactly equal to some particular value. The probability mass function assigns a particular probability to every possible value of a discrete random variable. Suppose a fair coin is tossed twice and the sample space is recorded as S = {HH, HT, TH, TT}. The probability of getting heads needs to be determined. Let X be the random variable that shows how many heads are obtained. X can take on the values 0, 1, 2. The probability that X will be equal to 1 is 0.5. Thus, it can be said that the probability mass function of X evaluated at 1 will be 0.5.
The probability mass function table for the random variable X is given as follows:

X = 0: P(X) = 0.25, X = 1: P(X) = 0.5, X = 2: P(X) = 0.25
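
As a small sketch, the same PMF can be obtained in Python by enumerating the sample space of two fair coin tosses:

    # Sketch: PMF of the number of heads in two fair coin tosses, by enumeration.
    from itertools import product
    from collections import Counter

    sample_space = list(product("HT", repeat=2))       # [('H','H'), ('H','T'), ...]
    heads = Counter(outcome.count("H") for outcome in sample_space)

    pmf = {x: count / len(sample_space) for x, count in sorted(heads.items())}
    print(pmf)   # {0: 0.25, 1: 0.5, 2: 0.25}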

What Are the Three Types of Probability?

 Theoretical or Classical Probability

 Experimental Probability

 Axiomatic Probability

Theoretical or Classical Probability

Theoretical (classical) probability is based on reasoning rather than experiments: it is the ratio of the number of favourable outcomes to the total number of equally likely possible outcomes.

Experimental Probability

Experimental probability is the ratio of the number of times a favourable outcome occurs to the total number of times the experiment is repeated.
Axiomatic Probability

Axiomatic probability is one more way to describe the outcomes of an event.

There are three rules or axioms which apply to all types of probability.

These rules were defined by Kolmogorov and are called Kolmogorov's axioms.

The three axioms are as follows:

 For any event, the probability is greater than or equal to 0.

 The probability of the sample space (the set of all possible outcomes) is 1.

 If A and B are two mutually exclusive events (two events that cannot occur at the same time), then the probability of A or B occurring is the probability of A plus the probability of B.

What Are the Five Rules of Probability?

 The probability of an impossible event (the empty set, ∅) is 0.

 The maximum probability is that of the sample space, which is 1 (the sample space contains all possible outcomes).

 The probability of any event lies between 0 and 1; 0 can also be a probability.

 There cannot be a negative probability for an event.

 If A and B are two mutually exclusive events (two events that cannot occur at the same time), then the probability of A or B occurring is the probability of A plus the probability of B.

Joint Probability

Joint Probability refers to the likelihood of two or more events happening together or in conjunction
with each other.

Formula for Joint Probability


The formula for calculating joint probability hinges on whether the events are independent or
dependent:

1. For Independent Events

When events A and B are independent, meaning that the occurrence of one event does not impact the
other, we use the multiplication rule:

P(A∩B) = P(A) x P(B)

Here, P(A) is the probability of occurrence of event A, P(B) is the probability of occurrence of event
B, and P(A∩B) is the joint probability of events A and B.

2. For Dependent Events

Events are often dependent on each other, meaning that one event’s occurrence influences the
likelihood of the other. Here, we employ a modified formula:

P(A∩B) = P(A) x P(B|A)

Here, P(A) is the probability of occurrence of event A, P(B|A) is the conditional probability of
occurrence of event B when event A has already occurred, and P(A∩B) is the joint probability of
events A and B.

Examples of Joint Probability

Example 1: Independent Events

Suppose you are running an e-commerce platform, and you want to find the probability of a customer
purchasing a red shirt (event A) and a blue hat (event B) independently. Find out the Joint Probability
where

P(A): The probability of a customer buying a red shirt is 0.3.

P(B): The probability of a customer purchasing a blue hat is 0.2.

Solution:

P(A∩B) = P(A) x P(B)

P(A∩B) = P(customer buying a red shirt) x P(customer buying a blue hat)

P(A∩B) = 0.3 x 0.2

P(A∩B) = 0.06

Example 2: Dependent Events

Imagine you are in the insurance business, and you want to determine the probability of a customer
filing a claim (event A) and receiving a payout (event B), given that a claim was filed. Find out the
Joint Probability where

P(A): The probability of a customer filing a claim is 0.1.

P(B|A): The probability of a customer receiving a payout given that a claim was filed is 0.8.
Solution:

P(A∩B) = P(A) x P(B|A)

P(A∩B) = P(customer filing a claim) x P(customer receiving a payout given that a claim was filed)

P(A∩B) = 0.1 x 0.8

P(A∩B) = 0.08
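
The two calculations above, written as a few lines of Python for quick verification:

    # Sketch: the joint-probability examples above, computed directly.
    p_red_shirt, p_blue_hat = 0.3, 0.2            # independent events
    print(p_red_shirt * p_blue_hat)               # P(A ∩ B) = 0.06

    p_claim, p_payout_given_claim = 0.1, 0.8      # dependent events
    print(p_claim * p_payout_given_claim)         # P(A ∩ B) = 0.08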

Difference between Joint Probability and Conditional Probability

Joint Probability (P(A∩B))

Joint Probability addresses the simultaneous occurrence of events A and B without considering any
specific order or sequence. It quantifies the combined probability of events occurring together,
providing insights into their co-occurrence in a business context.

Conditional Probability (P(B|A))

Conditional Probability focuses on the probability of event B happening, given that event A has
already occurred. This kind of probability is utilised when the occurrence of one event influences the
likelihood of another event, making it a valuable tool for understanding cause-and-effect relationships
in business statistics.

Statistics

Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and visualizing
empirical data. Descriptive statistics and inferential statistics are the two major areas of statistics.
Descriptive statistics are for describing the properties of sample and population data (what has
happened). Inferential statistics use those properties to test hypotheses, reach conclusions, and make
predictions (what can you expect).

Population and Sample

Population:

In statistics, the population comprises all observations (data points) about the subject under study.

An example of a population is studying the voters in an election. In the 2019 Lok Sabha elections,
nearly 900 million voters were eligible to vote in 543 constituencies.

Sample:

In statistics, a sample is a subset of the population. It is a small portion of the total observed
population.

An example of a sample is analyzing the first-time voters for an opinion poll.

Measures of Central Tendency


Measures of central tendency are the measures that are used to describe the distribution of data using a
single value. Mean, Median and Mode are the three measures of central tendency.

Measures of Central Value

Sometimes when we are working with large amounts of data, we need one single value to represent
the whole set. In math, there are three measures to find the central value of a given set of data.

They are

 Mean

 Median

 Mode

Mean

Mean represents an average of a given set of numbers.

For example in the given data set,

3,5,7,9

The mean is obtained by adding all the numbers and dividing the sum by the total count of numbers.

Mean = (3 + 5 + 7 + 9) / 4 = 24 / 4 = 6

Median

Median represents the central or middle value of a data set.

For example in the given data set,

2,4,6,8,9,10,11,13,15,17,18

The median is obtained by choosing the middle value which is 10

Mode

Mode is the most frequently occurring data item.

For example in the given data set,

30,30,31,31,31,32,32,33,33,33,33,34,35,36

The mode is obtained by choosing the most frequently occurring item in the data set.

Here, the mode is 33.
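
As a quick illustration (not part of the original notes), Python's built-in statistics module can reproduce the three central values computed above.

import statistics

print(statistics.mean([3, 5, 7, 9]))                                # 6
print(statistics.median([2, 4, 6, 8, 9, 10, 11, 13, 15, 17, 18]))   # 10
print(statistics.mode([30, 30, 31, 31, 31, 32, 32, 33,
                       33, 33, 33, 34, 35, 36]))                    # 33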

Measures of Spread
While measuring a central value, we are given a data set. To know how widely the values in the data set are spread, we use measures of spread.

They also give us a better picture of whether the calculated central value (mean, median, or mode) correctly represents the set of values.

For example,

Here we have the marks obtained by students in a test

25,40,55,60,70,86,90,100

While calculating the average marks obtained by students in a 100-mark test, the mean only tells us the average score; it does not tell us how widely the marks are spread between 0 and 100.

In such cases, measures of spread are useful.

The most-used measures of spread are

 Range

 Quartiles and InterQuartiles Range

 Standard Deviation

 Variance

Range

Range represents the difference between the minimum and maximum values in a data set.

For example, consider a data set whose minimum value is 3 and maximum value is 13.

Range = 13 − 3 = 10

Quartile

As the name suggests, a quartile is a measure of spread that groups a given set of values in quarters.

To use this measure, we first arrange the values in increasing order and then divide them into four equal groups. The three cut points are the first quartile (Q1), the second quartile (Q2, which is the median), and the third quartile (Q3).

Interquartile

The interquartile range is obtained by subtracting the first quartile (Q1) from the third quartile (Q3). It covers the middle 50% of the values and therefore indicates where most of the values in a data set lie.

For example, if Q3 = 6 and Q1 = 2, then the Interquartile Range = Q3 − Q1 = 6 − 2 = 4

Variance

 Variance measures the spread of data as the average of the squared deviations of the values from the mean.

 The variance of a data set is also used to calculate its standard deviation.

Standard Deviation

 Standard deviation is a measure that tells us how far, on average, a data value lies from the mean.

 It is obtained by taking the square root of the variance.

Formula

Variance: σ² = Σ(xi − μ)² / N, where μ is the mean and N is the number of values.

Standard deviation: σ = √( Σ(xi − μ)² / N )
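
A minimal Python sketch of the four measures of spread, assuming the 8-student marks data used earlier in this section; note that the exact quartile values depend on the interpolation method the library uses.

import statistics

marks = [25, 40, 55, 60, 70, 86, 90, 100]

data_range = max(marks) - min(marks)            # Range = maximum - minimum
q1, q2, q3 = statistics.quantiles(marks, n=4)   # quartiles (q2 is the median)
iqr = q3 - q1                                   # interquartile range
variance = statistics.pvariance(marks)          # population variance
std_dev = statistics.pstdev(marks)              # population standard deviation = sqrt(variance)

print(data_range, iqr, variance, std_dev)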

Hypothesis Testing
Hypothesis testing is a statistical analysis to make decisions using experimental data. It allows you to
statistically back up some findings you have made in looking at the data. In hypothesis testing, you
make a claim and the claim is usually about population parameters such as mean, median, standard
deviation, etc.

The assumption made for a statistical test is called the null hypothesis (H0).

The Alternative hypothesis (H1) contradicts the null hypothesis stating that the assumptions do not
hold true at some level of significance.

Hypothesis testing lets you decide to either reject or retain a null hypothesis.

Example: H0: The average BMI of boys and girls in a class is the same

H1: The average BMI of boys and girls in a class is not the same

To determine whether a finding is statistically significant, you need to interpret the p-value. It is
common to compare the p-value to a threshold value called the significance level.

The level of significance is often set to 5% (0.05).

If the p-value > 0.05, we fail to reject (retain) the null hypothesis.

If the p-value < 0.05, we reject the null hypothesis.

Some popular hypothesis tests are:

 Chi-square test

 T-test

 Z-test

 Analysis of Variance (ANOVA)
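
For instance, a two-sample t-test for the BMI example above could be run with SciPy as sketched below; the BMI values here are made-up illustrative numbers, not real data.

from scipy import stats

boys = [21.5, 22.1, 23.0, 20.8, 22.7, 21.9, 23.4, 22.2]    # hypothetical BMI sample
girls = [21.8, 22.0, 22.5, 21.2, 22.9, 21.7, 23.1, 22.4]   # hypothetical BMI sample

t_stat, p_value = stats.ttest_ind(boys, girls)

if p_value < 0.05:
    print("Reject H0: the average BMIs differ significantly")
else:
    print("Fail to reject H0: no significant difference detected")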

Bayesian Conditional Probability

Bayes' theorem is one of the most popular machine learning concepts; it helps to calculate the probability of one event occurring, under uncertain knowledge, when another event has already occurred.

Bayes' theorem can be derived using the product rule and the conditional probability of event X given a known event Y:

o According to the product rule, we can express the joint probability of events X and Y in two equivalent ways:

P(X ∩ Y) = P(X|Y) * P(Y) = P(Y|X) * P(X)

Equating the two expressions and dividing by P(Y) gives:

P(X|Y) = [ P(Y|X) * P(X) ] / P(Y)

The above equation is called Bayes' Rule or Bayes' Theorem.

o P(X|Y) is called the posterior, which is what we need to calculate. It is defined as the updated probability after considering the evidence.

o P(Y|X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.

o P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.

o P(Y) is called the marginal probability. It is defined as the probability of the evidence under any consideration.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
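
The relationship posterior = likelihood × prior / evidence is easy to express directly in code; the sketch below uses made-up illustrative numbers for the three inputs.

def bayes_posterior(likelihood, prior, evidence):
    # P(X|Y) = P(Y|X) * P(X) / P(Y)
    return likelihood * prior / evidence

# Illustrative values only: P(Y|X) = 0.8, P(X) = 0.1, P(Y) = 0.2
print(bayes_posterior(likelihood=0.8, prior=0.1, evidence=0.2))   # 0.4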

Prerequisites for Bayes Theorem

1. Experiment

An experiment is defined as a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.

2. Sample Space

The results we can get from an experiment are called possible outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment involves tossing a coin and recording its outcome, then the sample space will be:

S2 = {Head, Tail}

3. Event

An event is defined as a subset of the sample space of an experiment; in other words, it is a set of outcomes.
Assume in our experiment of rolling a die there are two events A and B such that:

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number is greater than 4 = {5, 6}

o Probability of the event A ''P(A)'' = Number of favourable outcomes / Total number of possible outcomes = 3/6 = 1/2 = 0.5

o Similarly, probability of the event B ''P(B)'' = Number of favourable outcomes / Total number of possible outcomes = 2/6 = 1/3 ≈ 0.333

o Union of event A and B: A∪B = {2, 4, 5, 6}

Intersection of event A and B: A∩B= {6}

Disjoint Events: If the intersection of events A and B is the empty (null) set, then the events are known as disjoint, or mutually exclusive, events.

4. Random Variable:

It is a real-valued function which maps the sample space of an experiment to the real line. A random variable takes on values determined by the outcome of the experiment, each value having some probability. Strictly speaking, it is neither random nor a variable; it is a function, which can be discrete, continuous, or a combination of both.

5. Exhaustive Event:

As the name suggests, a set of events is exhaustive if at least one of the events must occur when the experiment is performed.
Thus, two events A and B are exhaustive if either A or B definitely occurs; for example, while tossing a coin, the outcome is either a Head or a Tail (and in this example the two events are also mutually exclusive).

6. Independent Event:

Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of the outcome of one event does not depend on the other. Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A, given that another event B has already occurred (i.e., A given B). This is represented by P(A|B) and we can define it as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A occurring irrespective of the outcome of any other event B. It is considered the probability of the evidence under any consideration, and by the law of total probability:

P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)

Here ~B represents the event that B does not occur.

How to apply Bayes Theorem or Bayes rule in Machine Learning?

Bayes' theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B), and P(A) and need to determine the fourth term.

The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem; it is used in classification problems and is valued for its speed and accuracy.

Let's understand the use of Bayes theorem in machine learning with below example.

Suppose, we have a vector A with I attributes. It means

A = A1, A2, A3, A4……………Ai

Further, we have n classes represented as C1, C2, C3, C4…………Cn.


Given these two conditions, our machine learning classifier has to predict the class of A; that is, it has to choose the best (most probable) class. With the help of Bayes' theorem, we can write this as:

P(Ci/A)= [ P(A/Ci) * P(Ci)] / P(A)

Here;

P(A) is the condition-independent entity: it remains constant across the classes, i.e., its value does not change when the class changes. Therefore, to maximize P(Ci/A), we only have to maximize the value of the term P(A/Ci) * P(Ci).

With n classes on the probability list, let us assume that each class is equally likely to be the right answer a priori. Considering this factor, we can say that:

P(C1) = P(C2) = P(C3) = P(C4) = ….. = P(Cn)

This process helps us to reduce the computation cost as well as the time. This is how Bayes' theorem plays a significant role in Machine Learning, and the Naïve Bayes assumption simplifies the conditional probability computation without greatly affecting the precision. Under this assumption the attributes are conditionally independent given the class, so we can conclude that:

P(A/Ci) = P(A1/Ci) * P(A2/Ci) * P(A3/Ci) * …… * P(Ai/Ci)

Hence, by using Bayes theorem in Machine Learning we can easily describe the possibilities of
smaller events.

What is Naïve Bayes Classifier in Machine Learning

The Naïve Bayes classifier is a supervised algorithm, based on Bayes' theorem and used to solve classification problems. It is one of the simplest and most effective classification algorithms in Machine Learning, and it enables us to build ML models that make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular applications of Naïve Bayes are spam filtering, sentiment analysis, and classifying articles.

Advantages of Naïve Bayes Classifier in Machine Learning:

 It is one of the simplest and most effective methods for calculating conditional probabilities and for text classification problems.
 A Naïve Bayes classifier performs well compared with other models when the assumption of independent predictors holds true.
 It is easier to implement than many other models.
 It requires only a small amount of training data to estimate its parameters, which minimizes the training time.
 It can be used for binary as well as multi-class classification.

Disadvantages of Naïve Bayes Classifier in Machine Learning:

The main disadvantage of the Naïve Bayes classifier is its assumption of independent predictors: it implicitly assumes that all attributes are independent or unrelated, but in real life it is rarely feasible to obtain mutually independent attributes.
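
As a hedged illustration of these ideas, the sketch below trains scikit-learn's Gaussian Naïve Bayes classifier on a tiny made-up data set; the numbers and class labels are purely illustrative.

from sklearn.naive_bayes import GaussianNB

# Two numeric attributes per sample, two classes (illustrative data)
X_train = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]]
y_train = [0, 0, 1, 1]

model = GaussianNB()              # assumes attributes are independent given the class
model.fit(X_train, y_train)

print(model.predict([[1.1, 2.0]]))         # predicted class (here, class 0)
print(model.predict_proba([[1.1, 2.0]]))   # posterior probabilities P(Ci | A)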
Sums

1. We flip two coins. What is the probability that both are heads, given that at least one of them
is heads?

We are given that there is at least one head. So there are three possibilities HH, HT, TH. Only one
outcome has both coins as heads: HH.

So our answer is = 1/3

2. We roll a six-sided die. What is the probability that the roll is a 6, given that the outcome is an even number?

The possible outcomes when rolling a die are: 1, 2, 3, 4, 5, 6.

The even numbers are: 2, 4, 6 (3 outcomes).

Only one of these is a 6.

So our answer = 1/3

Vector Calculus and Optimization

Vector Calculus

Vector calculus, or vector analysis, is concerned with differentiation and integration of vector fields, primarily in 3-dimensional Euclidean space. The term "vector calculus" is sometimes used as a synonym for
the broader subject of multivariable calculus, which spans vector calculus as well as partial
differentiation and multiple integration. Vector calculus plays an important role in differential
geometry and in the study of partial differential equations. It is used extensively in physics and
engineering, especially in the description of electromagnetic fields, gravitational fields, and fluid flow.

Vector calculus was developed from quaternion analysis by J. Willard Gibbs and Oliver Heaviside
near the end of the 19th century, and most of the notation and terminology was established by Gibbs
and Edwin Bidwell Wilson in their 1901 book, Vector Analysis. In the conventional form using cross products, vector calculus does not generalize to higher dimensions, while the alternative approach of geometric algebra, which uses exterior products, does.

Scalar fields

A scalar field associates a scalar value to every point in a space. The scalar is a mathematical number
representing a physical quantity. Examples of scalar fields in applications include the temperature
distribution throughout space, the pressure distribution in a fluid, and spin-zero quantum fields
(known as scalar bosons), such as the Higgs field. These fields are the subject of scalar field theory.

Vector fields

A vector field is an assignment of a vector to each point in a space. A vector field in the plane, for
instance, can be visualized as a collection of arrows with a given magnitude and direction each
attached to a point in the plane. Vector fields are often used to model, for example, the speed and
direction of a moving fluid throughout space, or the strength and direction of some force, such as the
magnetic or gravitational force, as it changes from point to point.
Vectors and pseudovectors

In more advanced treatments, one further distinguishes pseudovector fields and pseudoscalar fields,
which are identical to vector fields and scalar fields, except that they change sign under an
orientation-reversing map: for example, the curl of a vector field is a pseudovector field, and if one
reflects a vector field, the curl points in the opposite direction. This distinction is clarified and elaborated in geometric algebra.

Operations on Vectors

The main operations performed with vector quantities are addition, subtraction, multiplication of a vector by a scalar, the scalar (dot) product, and the vector (cross) product. The dot and cross products are described later in this section.

Integrals

There are three types of integrals dealt with in vector calculus:

Line Integral

Surface Integral

Volume Integral

Line Integral

Line Integral in mathematics is the integration of a function along the line of the curve. The function
can be a scalar or vector whose line integral is given by summing up the values of the field at all
points on a curve weighted by some scalar function on the curve. Line Integral is also called Path
Integral
Surface Integral

A surface integral in mathematics is the integration of a function over a surface, which need not be flat. The surface is divided into small surface elements, and the integration is given by summing the contributions of all these elements. A surface integral is evaluated as a double integral and is the two-dimensional analogue of a line integral.

Volume Integral

A volume integral, also known as a triple integral, is a mathematical concept used in calculus and
vector calculus to calculate the volume of a three-dimensional region within a space. It is an extension
of the concept of a definite integral in one dimension to three dimensions.

Dot and Cross Products on Vectors

A quantity that is characterized not only by magnitude but also by its direction, is called a vector.
Velocity, force, acceleration, momentum, etc. are vectors.

Vectors can be multiplied in two ways:

 Scalar product or Dot product

 Vector Product or Cross product

Scalar Product/Dot Product of Vectors


 The resultant scalar product/dot product of two vectors is always a scalar quantity. Consider
two vectors a and b. The scalar product is calculated as the product of magnitudes of a, b,
and cosine of the angle between these vectors.
Scalar Product = |a||b| cos α

Here,

 |a| = magnitude of vector a,

 |b| = magnitude of vector b, and

 α = angle between the vectors.

Projection of one vector on other Vector

Vector a can be projected onto the line l along which vector b lies. Let C be the common tail of the two vectors, let A be the head of vector a (so AC is the magnitude of vector a), and drop AD perpendicular to the line l. Then CD represents the projection of vector a on vector b.

Triangle ACD is thus a right-angled triangle, and we can apply trigonometric formulae.

If α is the measure of angle ACD, then

cos α = CD/AC

Or, CD = AC cos α

Thus CD is the projection of vector a on vector b. So, we can conclude that the projection of one vector onto another is the magnitude of the first vector multiplied by the cosine of the angle between them.

Properties of Scalar Product

 Scalar product of two vectors is always a real number (scalar).

 Scalar product is commutative i.e. a.b =b.a= |a||b| cos α

 If α is 90° then Scalar product is zero as cos(90) = 0. So, the scalar product of unit vectors in
x, y directions is 0.

 If α is 0° then the scalar product is the product of magnitudes of a and b |a||b|.

 Scalar product of a unit vector with itself is 1.

 Scalar product of a vector a with itself is |a|²

 If α is 180°, the scalar product for vectors a and b is -|a||b|

 Scalar product is distributive over addition


a.(b + c) = a.b + a.c

 If the component form of the vectors is given as:

a = a1x + a2y + a3z

b = b1x + b2y + b3z

then the scalar product is given as

a.b = a1b1 + a2b2 + a3b3

 The scalar product is zero in the following cases:

o The magnitude of vector a is zero

o The magnitude of vector b is zero

o Vectors a and b are perpendicular to each other

Inequalities Based on Dot Product

There are various inequalities based on the dot product of vectors, such as:

 Cauchy – Schwartz inequality

 Triangle Inequality

Cauchy – Schwartz inequality


According to this principle, for any two vectors a and b, the magnitude of the dot product is always
less than or equal to the product of magnitudes of vector a and vector b
|a.b| ≤ |a| |b|

Triangle Inequality

For any two vectors a and b, we always have

|a+ b| ≤ |a| + | b|

Example 1. Consider two vectors such that |a|=6 and |b|=3 and α = 60°. Find their dot product.

Solution:

a.b = |a| |b| cos α

So, a.b = 6 × 3 × cos(60°)

= 18 × (1/2)

a.b = 9
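
The dot product can be verified numerically with NumPy; Example 1 is reproduced below, together with the component formula a.b = a1b1 + a2b2 + a3b3 applied to two illustrative vectors.

import numpy as np

# Example 1: |a| = 6, |b| = 3, angle = 60 degrees
print(6 * 3 * np.cos(np.radians(60)))   # ~9.0

# Component form with illustrative vectors
a = np.array([1, 2, 3])
b = np.array([4, -5, 6])
print(np.dot(a, b))                      # 1*4 + 2*(-5) + 3*6 = 12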
Cross Product/Vector Product of Vectors

The vector product, or cross product, of two vectors a and b with an angle α between them is a vector whose magnitude is

|a × b| = |a| |b| sin α

and whose direction is perpendicular to both a and b (given by the right-hand rule).

Also, given two vectors a = (a1, a2, a3) and b = (b1, b2, b3), their cross product, denoted by a × b, is calculated as:

a × b = (a2b3 − a3b2, a3b1 − a1b3, a1b2 − a2b1)

Cross product in Determinant Form

If the vector a is represented as a = a1x + a2y + a3z and vector b is represented as

b = b1x + b2y + b3z. Then the cross product a × b can be computed using determinant form

a × b = | x    y    z  |
        | a1   a2   a3 |
        | b1   b2   b3 |

Then, a × b = x(a2b3 − a3b2) + y(a3b1 − a1b3) + z(a1b2 − a2b1)

Example 1. Find the cross product of two vectors a and b if their magnitudes are 5 and 10 respectively, given that the angle between them is 30°.

Solution:

|a × b| = |a| |b| sin(30°) = (5)(10)(1/2) = 25, and the direction of a × b is perpendicular to both a and b.
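
A quick NumPy check of the cross product, using the magnitudes from the example above and then the component formula on two illustrative unit vectors.

import numpy as np

# Magnitude from the example: |a| = 5, |b| = 10, angle = 30 degrees
print(5 * 10 * np.sin(np.radians(30)))   # ~25.0

# Component form with illustrative vectors
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
print(np.cross(a, b))                     # [0 0 1], perpendicular to both a and b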


Optimization

Many important applied problems involve finding the best way to accomplish some task. Often this
involves finding the maximum or minimum value of some function: the minimum time to make a
certain journey, the minimum cost for doing a task, the maximum power that can be generated by a
device, and so on. Many of these problems can be solved by finding the appropriate function and then
using techniques of calculus to find the maximum or the minimum value required.

Generally, such a problem will have the following mathematical form: Find the largest (or smallest)
value of f(x) when a≤x≤b. Sometimes a or b are infinite, but frequently the real world imposes some
constraint on the values that x may have.

Such a problem differs in two ways from the local maximum and minimum problems we encountered
when graphing functions: We are interested only in the function between a and b, and we want to
know the largest or smallest value that f(x) takes on, not merely values that are the largest or smallest
in a small interval. That is, we seek not a local maximum or minimum but a global maximum or
minimum, sometimes also called an absolute maximum or minimum.

Any global maximum or minimum must of course be a local maximum or minimum. If we find all
possible local extrema, then the global maximum, if it exists, must be the largest of the local maxima
and the global minimum, if it exists, must be the smallest of the local minima. We already know where
local extrema can occur: only at those points at which f′(x) is zero or undefined. Actually, there are
two additional points at which a maximum or minimum can occur if the endpoints a and b are not
infinite, namely, at a and b. We have not previously considered such points because we have not been
interested in limiting a function to a small interval. An example should make this clear.

Problem-Solving Strategy: Solving Optimization Problems

1. Introduce all variables. If applicable, draw a figure and label all variables.

2. Determine which quantity is to be maximized or minimized, and for what range of values of
the other variables (if this can be determined at this time).

3. Write a formula for the quantity to be maximized or minimized in terms of the variables. This
formula may involve more than one variable.

4. Write any equations relating the independent variables in the formula from step 3. Use these equations to write the quantity to be maximized or minimized as a function of one variable.

5. Identify the domain of consideration for the function in step 4 based on the physical problem
to be solved.

6. Locate the maximum or minimum value of the function from step 4. This step typically
involves looking for critical points and evaluating a function at endpoints.

Example - Maximizing the Area of a Garden

A rectangular garden is to be constructed using a rock wall as one side of the garden and wire fencing for the other three sides. Given 100 ft of wire fencing, determine the dimensions that would create a garden of maximum area. What is the maximum area?

We want to determine the measurements x and y that will create a garden with a maximum area using 100 ft of fencing.

Solution

Let x denote the length of the side of the garden perpendicular to the rock wall and y denote the length
of the side parallel to the rock wall. Then the area of the garden is

A=x⋅y.

We want to find the maximum possible area subject to the constraint that the total fencing is 100 ft. Since the rock wall forms one side, the total amount of fencing used will be 2x + y. Therefore, the

Constraint equation is

2x+y=100.

Solving this equation for y, we have y=100−2x. Thus, we can write the area as

A(x) = x⋅(100 − 2x) = 100x − 2x².

Before trying to maximize the area function

A(x) = 100x − 2x²,

we need x > 0 and y > 0. Since y = 100 − 2x, if y > 0, then x < 50.

As mentioned earlier, since A is a continuous function on a closed, bounded interval, by the extreme
value theorem, it has a maximum and a minimum. These extreme values occur either at endpoints or
critical points. At the endpoints, A(x)=0. Since the area is positive for all x in the open interval (0,50),
the maximum must occur at a critical point. Differentiating the function A(x), we obtain
A'(x)=100−4x.

Therefore, the only critical point is x=25.

We conclude that the maximum area must occur when x = 25. Then y = 100 − 2(25) = 50 ft, and the maximum area is A(25) = 25 ⋅ 50 = 1250 ft².
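
The same result can be confirmed numerically; the sketch below minimizes the negative of the area function over the feasible interval 0 ≤ x ≤ 50 using SciPy (one of several reasonable ways to do this).

from scipy.optimize import minimize_scalar

def area(x):
    # A(x) = x * (100 - 2x), the garden-area function from the example above
    return x * (100 - 2 * x)

result = minimize_scalar(lambda x: -area(x), bounds=(0, 50), method="bounded")

print(result.x)         # ~25.0 (optimal x in feet)
print(area(result.x))   # ~1250.0 (maximum area in square feet)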

Decision Theory

Decision theory is a framework for making decisions under uncertainty. It involves identifying
options, assessing their consequences, and choosing the best option. Decision theory can be applied to
many fields, including business, economics, and psychology

How it works

 Identify options: List all possible options

 Assess consequences: Consider the positive and negative outcomes of each option

 Determine desirability: Decide which consequences are most desirable

 Choose an option: Select the option that maximizes expected utility

Types of decision theory

 Normative decision theory: Prescribes how decisions should be made by an ideally rational agent

 Descriptive decision theory: Analyzes how decisions are actually made in practice

Applications

 Business: Decision theory can help determine the best course of action to maximize profit or
revenue

 Psychology: Decision theory can help understand how people make decisions

 Economics: Decision theory can help understand how people make decisions about money

 Marketing: Decision theory can help understand how people make decisions about products
and services

In general, the decision problem is how to select the best of the available alternatives. The elements of the problem are the possible alternatives (actions, acts), the possible events (states, outcomes of a random process), the probabilities of these events, the consequences associated with each possible alternative-event combination, and the criterion (decision rule) according to which the best alternative is selected. The consequences are usually laid out in a payoff table, with one row per alternative and one column per event.

To illustrate the construction of such a table, suppose that eggs are bought at 8 ($00 per hundred) and sold at 10 ($00 per hundred), and that 12 (hundred) eggs are ordered. The purchase cost is 12 × 8 = 96 (hundreds of dollars). If 10 (hundred) eggs are demanded, 10 are sold, and the revenue is 10 × 10 = 100 ($00); the profit associated with this alternative-event pair is 100 − 96 = 4 ($00).
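
A small Python sketch of how such a payoff table could be built, assuming the purchase cost of 8 and selling price of 10 ($00 per hundred) implied by the example; the order quantities and demand levels listed are illustrative.

COST, PRICE = 8, 10

order_quantities = [10, 11, 12]   # alternatives: hundreds of eggs ordered
demands = [10, 11, 12]            # events: hundreds of eggs demanded

for order in order_quantities:
    for demand in demands:
        sold = min(order, demand)              # cannot sell more than ordered or demanded
        profit = PRICE * sold - COST * order   # revenue minus purchase cost
        print(f"order={order}, demand={demand}, profit={profit} ($00)")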

Information Theory

Information theory is the mathematical study of how information is stored, communicated, and
quantified. It's also known as the mathematical theory of communication

What does information theory do?

 Quantifies how much information is in a message

 Determines how to encode information in the most efficient way

 Separates real information from noise

 Determines how much capacity a channel needs to transmit information optimally


Who developed information theory?

 Claude Shannon, an American electrical engineer, established information theory in the 1940s

 Harry Nyquist and Ralph Hartley made early contributions in the 1920s

What's information theory used for?

Communication engineering, Psychology, Linguistics, Biology, Behavioral science, Neuroscience,


and Statistical mechanics.

A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty
involved in the value of a random variable or the outcome of a random process. For example,
identifying the outcome of a fair coin flip (which has two equally likely outcomes) provides less
information (lower entropy, less uncertainty) than identifying the outcome from a roll of a die (which
has six equally likely outcomes). Some other important measures in information theory are mutual
information, channel capacity, error exponents, and relative entropy. Important sub-fields of
information theory include source coding, algorithmic complexity theory, algorithmic information
theory and information-theoretic security.

Applications of fundamental topics of information theory include source coding/data compression


(e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). The theory has
also found applications in other areas, including statistical inference, cryptography, neurobiology,
perception, signal processing, linguistics, quantum computing, information retrieval, intelligence
gathering, plagiarism detection, pattern recognition, anomaly detection, etc

Entropy of an information source

Based on the probability mass function of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by

H = − Σi pi log2 pi

where pi is the probability of occurrence of the i-th possible value of the source symbol. This equation gives the entropy in units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 unit of entropy has sometimes been called the shannon in his honor.

Intuitively, the entropy H(X) of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known. If X is the set of all messages {x1, ..., xn} and p(x) is the probability of some message x, then the entropy is the expected value of the self-information:

H(X) = E_X[I(x)] = − Σ p(x) log p(x)

where I(x) = − log p(x) is the self-information, which is the entropy contribution of an individual message, and E_X is the expected value.
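
The entropy formula is straightforward to compute; the sketch below reproduces the fair-coin versus fair-die comparison made earlier in this section.

import math

def shannon_entropy(probabilities):
    # H = -sum(p * log2(p)) over outcomes with non-zero probability
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))    # 1.0 bit for a fair coin
print(shannon_entropy([1/6] * 6))     # ~2.585 bits for a fair die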

The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X,
Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual
entropies.
Mutual information measures the amount of information that can be obtained about one random
variable by observing another. It is important in communication where it can be used to maximize the
amount of information shared between sent and received signals. The mutual information of X
relative to Y is given by:

I(X; Y) = Σ p(x, y) log [ p(x, y) / (p(x) p(y)) ]

where the logarithmic term SI(x, y) = log [ p(x, y) / (p(x) p(y)) ] is the pointwise (specific) mutual information.

Coding theory is one of the most important and direct applications of information theory. It can be
subdivided into source coding theory and channel coding theory. Using a statistical description for
data, information theory quantifies the number of bits needed to describe the data, which is the
information entropy of the source.

Data compression (source coding): There are two formulations for the compression problem:

 lossless data compression: the data must be reconstructed exactly;


 lossy data compression: allocates bits needed to reconstruct the data, within a
specified fidelity level measured by a distortion function.

Error-correcting codes (channel coding): While data compression removes as much redundancy as
possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed
to transmit the data efficiently and faithfully across a noisy channel.
