Unit 1 Notes
Linear Algebra & Analytical Geometry - Probability and Statistics - Bayesian Conditional Probability
The need for machine learning is increasing day by day because it can carry out tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot manually work through huge amounts of data, so we need computer systems, and this is where machine learning makes things easy for us.
We can train machine learning algorithms by providing them with large amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by a cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and more. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyse user interests and recommend products accordingly.
History of Machine Learning
The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the fields of computer gaming and artificial intelligence. The synonym "self-teaching computers" was also used in this time period.
By the early 1960s an experimental "learning machine" with punched tape memory, called Cybertron,
had been developed by Raytheon Company to analyze sonar signals, electrocardiograms, and speech
patterns using rudimentary reinforcement learning. It was repetitively "trained" by a human
operator/teacher to recognize patterns and equipped with a "goof" button to cause it to re-evaluate
incorrect decisions. A representative book on research into machine learning during the 1960s was
Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.
Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in
1973. In 1981 a report was given on using teaching strategies so that a neural network learns to
recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is replaced with the question "Can machines do what we (as thinking entities) can do?"
Modern-day machine learning has two objectives: one is to classify data based on models that have been developed, and the other is to make predictions about future outcomes based on these models. A hypothetical algorithm for classifying data may use computer vision of moles coupled with supervised learning to train it to classify cancerous moles. A machine learning algorithm for stock trading may inform the trader of potential future movements.
Machine Learning
Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959.
Machine learning enables a machine to automatically learn from data, improve performance with experience, and predict things without being explicitly programmed.
Machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models. It constructs or uses algorithms that learn from historical data; the more information we provide, the better the performance.
Machine learning is a growing technology which enables computers to learn automatically from past
data. Machine learning uses various algorithms for building mathematical models and making
predictions using historical data or information. Currently, it is being used for various tasks such as
image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender system,
and many more.
1. Image Recognition
Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, digital images, etc. The popular use case of image recognition and face
detection is, Automatic friend tagging suggestion:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.
It is based on Facebook's project named "DeepFace," which is responsible for face recognition and person identification in pictures.
2. Speech Recognition
When using Google, we get an option to "Search by voice"; this comes under speech recognition, a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two sources of information:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations
Machine learning is widely used by e-commerce and entertainment companies such as Amazon and Netflix for product recommendations. Whenever we search for a product on Amazon, we start getting advertisements for the same product while browsing the internet in the same browser, and this is because of machine learning.
Google understands user interest using various machine learning algorithms and suggests products according to customer interest.
Similarly, when we use Netflix, we receive recommendations for series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a popular car manufacturer, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by
Gmail:
o Content Filter
o Header filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
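As a rough illustration of the Naïve Bayes approach mentioned above, the following sketch trains a small spam classifier with scikit-learn; the four example emails and their labels are made up purely for demonstration.

```python
# A minimal sketch of spam filtering with a Naive Bayes classifier,
# using scikit-learn; the tiny in-line dataset is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",                # spam
    "limited offer click here",            # spam
    "meeting rescheduled to monday",       # not spam
    "please review the attached report",   # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Convert raw text into word-count feature vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the Naive Bayes classifier and score a new message.
model = MultinomialNB()
model.fit(X, labels)
print(model.predict(vectorizer.transform(["free prize offer"])))  # likely [1]
```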
7. Virtual Personal Assistant
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for a fraudulent transaction; the network therefore detects the fraud and makes our online transactions more secure.
9. Stock Market Trading
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of shares going up and down, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
11. Automatic Language Translation
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, because machine learning helps us by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition to translate text from one language to another.
Alexa, Google, Cortana, self-driving cars, and Siri are some well-known real-world examples of the application of machine learning. Because its applications are so wide, machine learning has now been added to professional courses in many universities. The following advantages of machine learning offer a positive outlook on this technological advancement.
Advantages of Machine Learning
1. Artificial Intelligence
The future of automation is moving toward fully automated artificial intelligence. An advanced AI system will be capable of handling various applications where human involvement can be avoided. In robotics, research is being done to build a machine capable of thinking like the human brain.
As more and more data is fed to the systems, the output improves consistently. With time and
learning, the machine can produce more accurate results without the need for new codes and
algorithms. The algorithm analyzes various data statistics and predicts the possible options, and gives
the best option according to gained experience.
With numerous applications and uses, the advantages of machine learning will apparently keep growing. The fields in which a developed machine learning system can be applied are huge in number: medicine, engineering, aviation, space technology, and more; the use of this technology is practically unlimited. It will create an environment where, once the system is designed, it will be capable of taking appropriate actions without human intervention.
4. Security Upgrades
Most industries are prone to various man-made accidents and mishaps, which can be avoided using
the science of machine learning. Moreover, once the system has gained enough experience using
recorded data, it can mitigate the risks of errors which becomes the reason for various mechanical and
human failures.
Whether it is finding the best deals or searching for relevant user-based results, computer programs can do the work a lot faster than humans. For example, some jobs may require thousands of humans to sort data, a process that may take as long as a month to complete; the same process can be completed within minutes by automated systems. The lower cost and better time utilisation are highly valued benefits of machine learning for companies and employees.
This technique is not very old, but full automation still requires a breakthrough in current technology. Although it has complemented humans with search and voice features, the accuracy of results is still computer-based rather than tailored to the user's needs. Apart from making humans lazy, some other machine learning disadvantages are listed below.
1. Computational Errors
The results obtained from machines may have errors due to statistical reasoning. Most automated systems generate results based on previous searches and data loaded into a computer program. Any new experience or data may not yield accurate results or output.
Initially, a huge amount of time is invested in building machine learning programs, and the data requirement is also very large. Long codes and programs are needed to make the machine learn initial responses and essential functions. Then, based on user searches and requirements, the machine gives results and continuously improves its decision quality. Even small logical errors can lead to heavy disadvantages of the machine learning process by producing faulty outputs.
The results might be satisfactory, but a completely automated system requires a lot of research and analysis. Scientists and programmers are continuously trying to figure out more advanced techniques for improving machine outputs. It will be a long journey before we can acquire an AI close to human interpretation. The fact that the technology is in its infancy also raises questions about its acceptance and flexibility amid rapidly changing technology.
Since the machine is only made to identify certain choices, it specifies options based on human behaviour, which can vary widely. The machine is made to choose mostly correct decisions, but there could be situations where a machine cannot make the optimal decision.
5. Storage Services are among major disadvantages of Machine Learning
The backup and servers needed to maintain and record the acquired data keep piling up, and so does the cost. For a machine to learn, the amount of possible data is unlimited, and there is always a need to store this data. Various storage and cloud services are still not sufficient to make room for this amount of data.
Challenges of Machine Learning
Data plays a significant role in the machine learning process. One of the significant issues machine learning professionals face is the absence of good-quality data. Unclean and noisy data can make the whole process extremely exhausting. We do not want our algorithm to make inaccurate or faulty predictions, so the quality of data is essential to enhance the output. Therefore, we need to ensure that data pre-processing, which includes removing outliers, filtering missing values, and removing unwanted features, is done with the utmost care.
Underfitting occurs when the model is unable to establish an accurate relationship between the input and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture the relationship precisely. To overcome this issue, we can increase the model's complexity, add more relevant features, or train for longer.
Overfitting occurs when a machine learning model fits its training data, including its noise and bias, so closely that its performance on new data suffers. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals: an algorithm trained on noisy and biased data will have its overall performance affected.
The machine learning industry is young and continuously changing. Rapid trial-and-error experiments are being carried out, and the process keeps transforming, so there are high chances of error, which makes the learning process complex. It involves analysing the data, removing data bias, training the data, applying complex mathematical calculations, and much more, which makes it a really complicated process and another big challenge for machine learning professionals.
The most important task in the machine learning process is to train the model on data so that it achieves accurate output. Too little training data will produce inaccurate or overly biased predictions. A machine learning algorithm needs a lot of data to learn to distinguish between cases; for complex problems, it may even require millions of examples. Therefore, we need to ensure that machine learning algorithms are trained with a sufficient amount of data.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning models are highly efficient at providing accurate results, but producing those results takes a tremendous amount of time. Slow programs, data overload, and excessive requirements usually mean it takes a long time to provide accurate results. Further, constant monitoring and maintenance are required to deliver the best output.
So you have found quality data, trained your model well, and the predictions are concise and accurate; it seems you have learned how to create a machine learning algorithm. But there is a twist: the model may become useless in the future as the data grows. The best model of the present may become inaccurate in the coming future and require further adjustment, so regular monitoring and maintenance are needed to keep the algorithm working. This is one of the most exhausting issues faced by machine learning professionals.
Types of Machine Learning
Machine learning is broadly classified into four types:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train machines using a "labelled" dataset, and based on that training, the machine predicts the output. Here, labelled data means that some of the inputs are already mapped to outputs. More precisely, we first train the machine with inputs and their corresponding outputs, and then we ask the machine to predict the output for a test dataset.
The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.
o Classification
o Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output variable is
categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems in which there is a linear relationship
between input and output variables. These are used to predict continuous output variables, such as
market trends, weather prediction, etc.
o Lasso Regression
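To make the idea of regression concrete, here is a minimal sketch that fits a linear relationship on synthetic data with scikit-learn; the true slope 3 and intercept 2 used to generate the data are arbitrary illustrative choices.

```python
# A minimal regression sketch: fitting a linear relationship between an
# input variable x and a continuous output y (synthetic data, for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * x.ravel() + 2.0 + rng.normal(scale=1.0, size=50)  # y ~ 3x + 2 plus noise

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)   # roughly [3.0] and 2.0
print(model.predict([[4.0]]))          # prediction for x = 4
```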
Advantages of Supervised Learning:
o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages of Supervised Learning:
o It may predict the wrong output if the test data is different from the training data.
2. Unsupervised Machine Learning
b) Association
Association rule learning is an unsupervised learning technique that finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another and map those variables accordingly so that maximum profit can be generated. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc. Some popular algorithms of association rule learning are the Apriori algorithm, Eclat, and the FP-growth algorithm.
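As a sketch of how association rule learning might look in code, the following uses the Apriori implementation from the third-party mlxtend library (an assumption about tooling; any rule-mining library would do) on a tiny made-up basket dataset.

```python
# A sketch of association rule learning with the Apriori algorithm, assuming
# the third-party mlxtend library is installed; the basket data is made up.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: True means the item appears in that basket.
baskets = pd.DataFrame(
    [
        {"bread": True,  "milk": True,  "butter": False},
        {"bread": True,  "milk": True,  "butter": True},
        {"bread": False, "milk": True,  "butter": True},
        {"bread": True,  "milk": False, "butter": False},
    ]
)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```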
Advantages of Unsupervised Learning:
o These algorithms can be used for more complicated tasks than supervised ones because they work on unlabelled datasets.
o Unsupervised algorithms are preferable for various tasks because an unlabelled dataset is easier to obtain than a labelled one.
Disadvantages of Unsupervised Learning:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with unlabelled data that is not mapped to an output.
Applications of Unsupervised Learning:
o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright issues in document network analysis of text data for scholarly articles.
o Singular Value Decomposition: Singular Value Decomposition (SVD) is used to extract particular information from a database, for example, extracting information about each user located in a particular location.
3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data) and uses a combination of labelled and unlabelled datasets during the training period.
Although semi-supervised learning operates on data that contains a few labels, it mostly consists of unlabelled data. Labels are costly, but for practical purposes a few labels may be available. This differs from supervised and unsupervised learning, which are based on the presence or absence of labels, respectively.
The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms. The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning.
Advantages of Semi-Supervised Learning:
o It is highly efficient.
Disadvantages of Semi-Supervised Learning:
o Accuracy is low.
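A minimal sketch of semi-supervised learning, assuming scikit-learn's LabelPropagation: only a handful of points keep their labels (the value -1 marks unlabelled data), and the algorithm spreads labels to the rest.

```python
# A minimal semi-supervised sketch using scikit-learn's LabelPropagation:
# only a few points are labelled (label -1 marks unlabelled data).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)

y_partial = np.full(200, -1)   # -1 means "no label"
y_partial[:5] = y_true[:5]     # keep labels for only a handful of points
y_partial[-5:] = y_true[-5:]

model = LabelPropagation().fit(X, y_partial)
print((model.transduction_ == y_true).mean())  # fraction of points labelled correctly
```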
4. Reinforcement Learning
In reinforcement learning, unlike supervised learning, there is no labelled data; agents learn only from their own experience.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.
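The following toy sketch illustrates the agent-environment loop described above with tabular Q-learning on a made-up 5-state corridor; the environment, reward structure, and hyper-parameters are illustrative assumptions, not part of the original text.

```python
# A toy sketch of reinforcement learning as an MDP: tabular Q-learning on a
# 5-state corridor where the agent earns a reward only at the right end.
import random

N_STATES, ACTIONS = 5, [0, 1]          # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment response: new state and reward."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = int(Q[state][1] >= Q[state][0])
        nxt, reward = step(state, action)
        # Q-learning update rule
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
        state = nxt

print(Q)  # the learned values should favour moving right in every state
```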
Categories of Reinforcement Learning
o Positive Reinforcement Learning: Positive reinforcement adds something to increase the tendency that the expected behaviour will occur again.
o Negative Reinforcement Learning: Negative reinforcement learning works in exactly the opposite way to positive RL. It increases the tendency that a specific behaviour will occur again by avoiding the negative condition.
o Video Games: RL algorithms are much popular in gaming applications. It is used to gain super-
human performance. Some popular games that use RL algorithms are AlphaGO and AlphaGO Zero.
o Resource Management: The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL to automatically learn to allocate and schedule computer resources for waiting jobs, with the goal of minimizing average job slowdown.
o Robotics: RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning. There are
different industries that have their vision of building intelligent robots using AI and Machine learning
technology.
o Text Mining: Text-mining, one of the great applications of NLP, is now being implemented with the
help of Reinforcement Learning by Salesforce company.
o It helps in solving complex real-world problems that are difficult to solve with conventional techniques.
o The learning model of RL is similar to how human beings learn, so the results can be very accurate.
o However, too much reinforcement learning can lead to an overload of states, which can weaken the results.
Machine learning has a strong connection with mathematics. Every machine learning algorithm is based on mathematical concepts, and mathematics also helps in choosing the correct algorithm by considering training time, complexity, number of features, etc. Linear algebra is an essential field of mathematics that covers the study of vectors, matrices, planes, mappings, and lines required for linear transformations.
The term linear algebra was initially introduced in the early 18th century to find unknowns in linear equations and solve them easily; it is an important branch of mathematics that helps in studying data. No one can deny that linear algebra is undoubtedly a primary requirement for the applications of machine learning; it is also a prerequisite for starting to learn machine learning and data science.
Linear algebra plays a vital role as a key foundation of machine learning, and it enables ML algorithms to run on huge datasets.
The concepts of linear algebra are widely used in developing algorithms in machine learning.
Linear Algebra also helps to create better supervised as well as unsupervised Machine Learning
algorithms.
A few supervised learning algorithms that can be created using linear algebra are as follows:
o Logistic Regression
o Linear Regression
o Decision Trees
Further, below are some unsupervised learning algorithms that can also be created with the help of linear algebra:
o Clustering
o Components Analysis
Examples
o Datasets and Data Files
o Linear Regression
o Recommender Systems
o One-hot encoding
o Regularization
o Principal Component Analysis
o Images and Photographs
o Singular-Value Decomposition
o Deep Learning
o Latent Semantic Analysis
Linear Regression
Linear regression is a popular machine learning technique borrowed from statistics. It describes the relationship between input and output variables and is used in machine learning to predict numerical values. Linear regression problems are most commonly solved using least-squares optimization, which in turn relies on matrix factorization methods. Some commonly used matrix factorization methods are LU decomposition and singular-value decomposition, both concepts from linear algebra.
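A small sketch of solving a linear regression by least squares with matrix methods, using numpy's SVD-based lstsq routine on synthetic data; the true slope 4.0 and intercept 1.5 are arbitrary.

```python
# A sketch of solving linear regression by least squares with matrix methods
# from linear algebra (here via numpy's SVD-based lstsq); data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 4.0 * x + 1.5 + rng.normal(scale=0.5, size=50)

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

coeffs, residuals, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # approximately [4.0, 1.5] (slope, intercept)
```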
Regularization
In machine learning, we usually look for the simplest possible model to achieve the best outcome
for the specific problem. Simpler models generalize well, ranging from specific examples to
unknown datasets. These simpler models are often considered models with smaller coefficient
values.
A technique used to minimize the size of coefficients of a model while it is being fit on data is
known as regularization. Common regularization techniques are L1 and L2 regularization. Both of
these forms of regularization are, in fact, a measure of the magnitude or length of the coefficients
as a vector and are methods lifted directly from linear algebra called the vector norm.
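As a sketch of L2 regularization in practice, the following compares ordinary least squares with Ridge regression in scikit-learn on synthetic data; the penalty strength alpha=10.0 is an arbitrary illustrative choice.

```python
# A sketch of L2 regularization: Ridge regression penalizes the squared L2 norm
# of the coefficient vector, shrinking the coefficients toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))                    # 10 features, only 2 matter
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)              # alpha controls the penalty strength

print(np.abs(plain.coef_).sum())   # larger coefficient magnitudes
print(np.abs(ridge.coef_).sum())   # smaller (shrunk) coefficient magnitudes
```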
Principal Component Analysis
Generally, each dataset contains thousands of features, and fitting the model with such a large
dataset is one of the most challenging tasks of machine learning. Moreover, a model built with
irrelevant features is less accurate than a model built with relevant features. There are several
methods in machine learning that automatically reduce the number of columns of a dataset, and
these methods are known as Dimensionality reduction. The most commonly used dimensionality
reductions method in machine learning is Principal Component Analysis or PCA. This technique
makes projections of high-dimensional data for both visualizations and training models. PCA uses
the matrix factorization method from linear algebra.
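A minimal PCA sketch with scikit-learn: a synthetic 3-dimensional dataset with a redundant column is projected onto two principal components.

```python
# A sketch of dimensionality reduction with PCA: projecting correlated
# 3-dimensional data onto its two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Build a 3-D dataset whose third column is a mix of the first two.
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.5 * base[:, 1]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # most of the variance is kept
```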
Singular-Value Decomposition
Singular-Value decomposition is also one of the popular dimensionality reduction techniques and
is also written as SVD in short form.
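A short sketch of SVD with numpy on a small example matrix, verifying that the factors reconstruct the original matrix.

```python
# A sketch of singular-value decomposition with numpy: A = U * S * V^T.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S)                                    # singular values, largest first
print(np.allclose(A, U @ np.diag(S) @ Vt))  # True: the factorization reconstructs A
```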
Latent Semantic Analysis
Natural Language Processing or NLP is a subfield of machine learning that works with text and spoken words.
NLP represents a text document as large matrices with the occurrence of words. For example, the
matrix column may contain the known vocabulary words, and rows may contain sentences,
paragraphs, pages, etc., with cells in the matrix marked as the count or frequency of the number of
times the word occurred. It is a sparse matrix representation of text. Documents processed in this
way are much easier to compare, query, and use as the basis for a supervised machine learning
model.
This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also
known by the name Latent Semantic Indexing or LSI.
Recommender System
The development of recommender systems is mainly based on linear algebra methods. We can
understand it as an example of calculating the similarity between sparse customer behaviour
vectors using distance measures such as Euclidean distance or dot products.
Deep Learning
Artificial Neural Networks (ANNs) are non-linear ML algorithms that process information in a way loosely modelled on the brain, transferring information from one layer to another.
Deep learning studies these neural networks, which implement newer and faster hardware for the
training and development of larger networks with a huge dataset. All deep learning methods
achieve great results for different challenging tasks such as machine translation, speech
recognition, etc. The core of processing neural networks is based on linear algebra data structures,
which are multiplied and added together. Deep learning algorithms also work with vectors,
matrices, tensors (matrix with more than two dimensions) of inputs and coefficients for multiple
dimensions.
Linear Algebra
Linear algebra provides a framework for handling and manipulating data, which is often
represented as vectors and matrices. In machine learning, linear algebra operations are used
extensively in various stages, from data preprocessing to model training and evaluation. For
instance, operations such as matrix multiplication, eigenvalue decomposition, and singular value
decomposition are pivotal in dimensionality reduction techniques like Principal Component
Analysis (PCA). Similarly, the concepts of vector spaces and linear transformations are integral to
understanding neural networks and optimization algorithms.
Linear algebra is the branch of mathematics that deals with vector spaces and linear mappings
between these spaces. It encompasses the study of vectors, matrices, linear equations, and their
properties.
Fundamental Concepts
Vectors
Vectors are quantities that have both magnitude and direction, often represented as arrows in
space.
Matrices
Matrices are used to represent linear transformations, systems of linear equations, and data
transformations in machine learning.
Scalars
Scalars are used to scale vectors or matrices through operations like multiplication.
The dot product of two vectors measures the similarity of their directions.
The cross product of two vectors in three-dimensional space produces a vector orthogonal to the plane containing the original vectors.
Example: Given two vectors u = (u1, u2, u3) and v = (v1, v2, v3), their cross product u×v is calculated as
u×v = (u2v3 − u3v2, u3v1 − u1v3, u1v2 − u2v1)
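A quick numerical check of the dot and cross products with numpy; u and v are arbitrary example vectors.

```python
# A sketch of the dot and cross products with numpy; u and v are arbitrary
# example vectors in three-dimensional space.
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 1.0])

print(np.dot(u, v))    # 1*4 + 2*0 + 3*1 = 7  (measures similarity of direction)
print(np.cross(u, v))  # [ 2. 11. -8.], orthogonal to both u and v
```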
Linear Transformations
Linear transformations are fundamental operations in linear algebra that involve the transformation of
vectors and matrices while preserving certain properties such as linearity and proportionality. In the
context of machine learning, linear transformations play a crucial role in data preprocessing, feature
engineering, and model training
Linear transformations are functions that map vectors from one vector space to another in a linear manner. Formally, a transformation T is considered linear if it satisfies two properties: additivity, T(u + v) = T(u) + T(v), and homogeneity, T(cu) = cT(u) for any scalar c.
Linear transformations can be represented by matrices, and their properties are closely related to the
properties of matrices.
1. Translation:
2. Scaling:
Scaling is frequently applied in feature scaling, where features are scaled to have
similar ranges to prevent dominance of certain features in machine learning models.
3. Rotation:
While less common in basic machine learning algorithms, rotation can be useful in
advanced applications such as computer vision and robotics.
Matrix Operations
Matrix operations form the cornerstone of linear algebra, providing essential tools for manipulating
and analyzing data in machine learning. In this section, we explore key matrix operations, including
multiplication, transpose, inverse, and determinant, along with their significance and applications.
A. Matrix Multiplication
Definition: Given two matrices A and B, the product matrix C=A⋅B is computed by taking
the dot product of each row of matrix A with each column of matrix B.
Significance: Matrix multiplication is widely used in machine learning for various tasks,
including transformation of feature vectors, computation of model parameters, and neural
network operations such as feedforward and backpropagation.
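A small numpy sketch of matrix multiplication; the matrices A and B are arbitrary examples chosen so the dimensions (2×2 times 2×3) are compatible.

```python
# A sketch of matrix multiplication with numpy: C = A @ B, where each entry of
# C is the dot product of a row of A with a column of B.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])          # 2 x 2
B = np.array([[5, 6, 7],
              [8, 9, 10]])      # 2 x 3

C = A @ B                       # result is 2 x 3
print(C)
# [[21 24 27]
#  [47 54 61]]
```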
1. Transpose:
The transpose of a matrix involves flipping its rows and columns, resulting in a new
matrix where the rows become columns and vice versa.
It is denoted by A^T, and its dimensions are the reverse of the original matrix.
2. Inverse:
The inverse of a square matrix A is a matrix A^-1 such that A·A^-1 equals the identity matrix. Not all matrices have inverses; only square matrices with a determinant not equal to zero are invertible.
Inverse matrices are used in solving systems of linear equations, computing solutions
to optimization problems, and performing transformations.
C. Determinants
The determinant of a square matrix is a scalar value that encodes various properties of the
matrix, such as its volume, orientation, and invertibility.
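A short numpy sketch of the transpose, determinant, and inverse for an arbitrary 2×2 example matrix.

```python
# A sketch of the transpose, determinant, and inverse of a square matrix with numpy.
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

print(A.T)               # transpose: rows become columns
print(np.linalg.det(A))  # determinant: 4*6 - 7*2 = 10, nonzero so A is invertible
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A times its inverse is the identity
```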
Eigenvalues and eigenvectors are fundamental concepts in linear algebra that play a significant role in
machine learning algorithms and applications. In this section, we explore the definition, significance,
and applications of eigenvalues and eigenvectors.
1. Eigenvalues:
Eigenvalues of a square matrix A are scalar values λ for which there exists a non-zero vector v with Av = λv; they represent how the transformation represented by A stretches or compresses vectors in certain directions.
2. Eigenvectors:
Eigenvectors are non-zero vectors that are transformed by a matrix only by a scalar
factor, known as the eigenvalue.
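A minimal numpy sketch of eigenvalues and eigenvectors, using a simple diagonal example matrix so the eigenvalues are easy to read off.

```python
# A sketch of eigenvalues and eigenvectors with numpy: for each eigenpair,
# A @ v equals the eigenvalue times v.
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)            # [2. 3.]
v = eigenvectors[:, 0]        # eigenvector for the first eigenvalue
print(np.allclose(A @ v, eigenvalues[0] * v))  # True
```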
Applications of eigenvalues and eigenvectors include:
1. Dimensionality Reduction: In PCA, data is projected onto the eigenvectors of the covariance matrix associated with the largest eigenvalues.
2. Spectral Clustering: In spectral clustering, eigenvalues and eigenvectors of the graph Laplacian matrix are used to partition data into clusters based on spectral properties.
3. Matrix Factorization: Eigendecomposition and singular-value decomposition factor a matrix into simpler components built from its eigenvalues or singular values.
Probability
Random variable
A random variable, X, is a quantity that can take a different value each time it is inspected, such as in measurements in experiments. For example, for the roll of a die, each observed value is drawn from the finite set {1, 2, 3, 4, 5, 6}.
A discrete random variable is one which may take on only a countable number of distinct values.
For example, rolling dice can have values from the set {1, 2, 3, 4, 5, 6}.
A continuous random variable is one which takes an infinite number of possible values.
For example, measuring light sensors can have any infinite value if you are using a machine which
can detect with high precision.
While we might not be able to predict a specific value, it is often the case that some values might be
more likely than others. We might be able to say something about how often a certain number will
appear when drawing many examples.
How likely each value is for a random variable x, is captured by the probability density
function pdf(x) in the continuous case and by the probability mass function P(x) in the discrete case.
Let us take the example of tossing a coin three times, with the random variable X representing the number of heads after the three tosses.
The probability mass function gives the probability that a discrete random variable is exactly equal to some particular value; it assigns a probability to every possible value of a discrete random variable. Suppose a fair coin is tossed twice and the sample space is recorded as S = {HH, HT, TH, TT}, and the probability of getting heads needs to be determined. Let X be the random variable that counts how many heads are obtained; X can take the values 0, 1, 2. The probability that X equals 1 is 0.5, so the probability mass function of X evaluated at 1 is 0.5.
The probability mass function table for this random variable X is:
X = x: 0, 1, 2
P(X = x): 0.25, 0.5, 0.25
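The same probability mass function can be computed programmatically; this sketch uses the binomial formula for two tosses of a fair coin, matching the table above.

```python
# A sketch of the probability mass function for X = number of heads when a
# fair coin is tossed twice.
from math import comb

n, p = 2, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
print(pmf)  # {0: 0.25, 1: 0.5, 2: 0.25}
```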
Experimental Probability
Axiomatic Probability
Experimental Probability
Experimental probability is the ratio of the number of favourable outcomes to the total number of times the experiment is repeated.
Axiomatic Probability
There are three rules or axioms which apply to all types of probability.
The probability of any event lies between 0 and 1; 0 is also a valid probability.
The probability of the entire sample space (the set of all possible outcomes) is 1, so the maximum probability of an event is 1.
If A and B are two mutually exclusive outcomes (two events that cannot occur at the same time), then the probability of A or B occurring is the probability of A plus the probability of B.
Joint Probability
Joint Probability refers to the likelihood of two or more events happening together or in conjunction
with each other.
When events A and B are independent, meaning that the occurrence of one event does not impact the other, we use the multiplication rule:
P(A∩B) = P(A) × P(B)
Here, P(A) is the probability of occurrence of event A, P(B) is the probability of occurrence of event
B, and P(A∩B) is the joint probability of events A and B.
Events are often dependent on each other, meaning that one event's occurrence influences the likelihood of the other. In that case, we employ a modified formula:
P(A∩B) = P(A) × P(B|A)
Here, P(A) is the probability of occurrence of event A, P(B|A) is the conditional probability of
occurrence of event B when event A has already occurred, and P(A∩B) is the joint probability of
events A and B.
Suppose you are running an e-commerce platform, and you want to find the probability of a customer
purchasing a red shirt (event A) and a blue hat (event B) independently. Find out the Joint Probability
where
Solution:
P(A∩B) = 0.6
Imagine you are in the insurance business, and you want to determine the probability of a customer filing a claim (event A) and receiving a payout (event B), given that a claim was filed. Find the joint probability where the probability of a customer filing a claim is 0.1 and
the probability of a customer receiving a payout given that a claim was filed is 0.8.
Solution:
P(A∩B) = P(customer filing a claim) × P(customer receiving a payout given that a claim was filed)
P(A∩B) = 0.1 × 0.8 = 0.08
Joint Probability addresses the simultaneous occurrence of events A and B without considering any
specific order or sequence. It quantifies the combined probability of events occurring together,
providing insights into their co-occurrence in a business context.
Conditional Probability focuses on the probability of event B happening, given that event A has
already occurred. This kind of probability is utilised when the occurrence of one event influences the
likelihood of another event, making it a valuable tool for understanding cause-and-effect relationships
in business statistics.
Statistics
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and visualizing
empirical data. Descriptive statistics and inferential statistics are the two major areas of statistics.
Descriptive statistics are for describing the properties of sample and population data (what has
happened). Inferential statistics use those properties to test hypotheses, reach conclusions, and make
predictions (what can you expect).
Population:
In statistics, the population comprises all observations (data points) about the subject under study.
An example of a population is studying the voters in an election. In the 2019 Lok Sabha elections,
nearly 900 million voters were eligible to vote in 543 constituencies.
Sample:
In statistics, a sample is a subset of the population. It is a small portion of the total observed
population.
Measures of Central Tendency
Sometimes when we are working with large amounts of data, we need one single value to represent the whole set. In math, there are three measures to find the central value of a given set of data.
They are
Mean
Median
Mode
Mean
3,5,7,9
The mean is obtained by adding all the numbers and dividing the sum by the total count of numbers.
Mean=(3+5+7+9)/4
=6
Median
2,4,6,8,9,10,11,13,15,17,18
The median is the middle value when the data are arranged in order. Here there are 11 values, so the median is the 6th value, which is 10.
Mode
30,30,31,31,31,32,32,33,33,33,33,34,35,36
The mode is obtained by choosing the most frequently occurring item in the data set. Here, 33 occurs most often (four times), so the mode is 33.
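The three measures can be checked with Python's built-in statistics module, using the example data sets above.

```python
# A sketch computing the three measures of central tendency for the example
# data sets above, using Python's built-in statistics module.
import statistics

print(statistics.mean([3, 5, 7, 9]))                               # 6
print(statistics.median([2, 4, 6, 8, 9, 10, 11, 13, 15, 17, 18]))  # 10
print(statistics.mode([30, 30, 31, 31, 31, 32, 32,
                       33, 33, 33, 33, 34, 35, 36]))               # 33
```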
Measures of Spread
While measuring a central value, we are given a data set. To know how wide is the data set, we use
measures of spread.
They also give us a better picture if the calculated central value (Mean, median, or mode) correctly
depicts the set of values.
For example, consider the marks obtained by students in a 100-mark test:
25,40,55,60,70,86,90,100
By calculating the mean, we can only find the average marks obtained by the students, but we do not know how spread out the marks are between 0 and 100. The measures of spread answer this question; the common ones are:
Range
Standard Deviation
Variance
Range
Range represents the difference between the minimum and maximum values in a data set. For example, if the minimum value is 3 and the maximum value is 13:
Range = 13 − 3 = 10
Quartile
As the name suggests, a quartile is a measure of spread that groups a given set of values in quarters.
Interquartile
The interquartile range (IQR) is obtained by subtracting the first quartile (Q1) from the third quartile (Q3): IQR = Q3 − Q1. It is the interval in which the middle half of the values of a data set lie.
Variance
Variance is a measure that gives us an idea about the spread of the data; it is the average of the squared deviations of the values from the mean.
The variance is also used to calculate the standard deviation of a data set.
Standard Deviation
Standard deviation is a measure that tells us how far, on average, the data values are from the mean. It is the square root of the variance.
Formula: σ = √( Σ (xi − μ)² / N ), where μ is the mean and N is the number of values.
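A small sketch computing the measures of spread with Python's statistics module; the data set is an arbitrary example whose minimum is 3 and maximum is 13, matching the range example above.

```python
# A sketch computing the measures of spread for a small example data set,
# using Python's statistics module (population variance and standard deviation).
import statistics

data = [3, 5, 7, 9, 11, 13]
print(max(data) - min(data))       # range = 10
print(statistics.pvariance(data))  # variance
print(statistics.pstdev(data))     # standard deviation (square root of variance)
```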
Hypothesis Testing
Hypothesis testing is a statistical analysis to make decisions using experimental data. It allows you to
statistically back up some findings you have made in looking at the data. In hypothesis testing, you
make a claim and the claim is usually about population parameters such as mean, median, standard
deviation, etc.
The assumption made for a statistical test is called the null hypothesis (H0).
The Alternative hypothesis (H1) contradicts the null hypothesis stating that the assumptions do not
hold true at some level of significance.
Hypothesis testing lets you decide to either reject or retain a null hypothesis.
Example: H0: The average BMI of boys and girls in a class is the same
H1: The average BMI of boys and girls in a class is not the same
To determine whether a finding is statistically significant, you need to interpret the p-value. It is
common to compare the p-value to a threshold value called the significance level.
Some commonly used hypothesis tests are:
Chi-square test
T-test
Z-test
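As a sketch of how such a test might be run in code, the following performs a two-sample t-test with scipy on made-up BMI samples for the example hypothesis above.

```python
# A sketch of a two-sample t-test with scipy, e.g. comparing the average BMI of
# two groups as in the hypothesis above; the sample values are made up.
from scipy import stats

boys_bmi = [21.4, 22.1, 20.8, 23.0, 21.9, 22.5]
girls_bmi = [20.9, 21.7, 22.0, 21.2, 20.5, 21.8]

t_stat, p_value = stats.ttest_ind(boys_bmi, girls_bmi)
print(t_stat, p_value)
# If p_value is below the chosen significance level (e.g. 0.05),
# we reject the null hypothesis that the means are equal.
```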
Bayes' theorem is one of the most popular machine learning concepts; it helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of event X given a known event Y:
o According to the product rule, we can express the probability of events X and Y occurring together as
P(X∩Y) = P(X|Y) P(Y) = P(Y|X) P(X)
Rearranging gives Bayes' theorem:
P(X|Y) = P(Y|X) P(X) / P(Y)
o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
o P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence under any consideration.
1. Experiment
An experiment is defined as a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.
2. Sample Space
The results we get during an experiment are called possible outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is related to toss a coin and recording its outcomes, then sample space
will be:
S2 = {Head, Tail}
3. Event
An event is defined as a subset of the sample space of an experiment. Further, it is also called a set of outcomes.
Assume that in our experiment of rolling a die, there are two events A and B such that:
o Probability of event A: P(A) = Number of favourable outcomes / Total number of possible outcomes = 3/6 = 1/2 = 0.5
o Similarly, probability of event B: P(B) = Number of favourable outcomes / Total number of possible outcomes = 2/6 = 1/3 = 0.333
Disjoint Events: If the intersection of events A and B is an empty set (null), then such events are known as disjoint events, also called mutually exclusive events.
4. Random Variable:
A random variable is a real-valued function that maps the sample space of an experiment to the real line. A random variable takes on values, each with some probability. Strictly speaking, it is neither random nor a variable; it behaves as a function, and it can be discrete, continuous, or a combination of both.
5. Exhaustive Event:
As the name suggests, a set of events in which at least one event must occur is called an exhaustive set of events of an experiment.
Thus, two events A and B are said to be exhaustive if either A or B must definitely occur; if they are also mutually exclusive, exactly one of them occurs, e.g., when tossing a coin the outcome is either a Head or a Tail.
6. Independent Event:
Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of the outcome of one event does not depend on the other. Mathematically, two events A and B are said to be independent if:
P(A∩B) = P(A) P(B)
7. Conditional Probability:
Conditional probability is defined as the probability of an event A, given that another event B has already occurred (i.e., A conditioned on B). It is denoted by P(A|B) and defined as:
P(A|B) = P(A∩B) / P(B)
8. Marginal Probability:
Marginal probability is defined as the probability of an event A occurring irrespective of any other event B; it is the probability of the evidence under any consideration. It can be computed as P(A) = P(A|B) P(B) + P(A|¬B) P(¬B).
Bayes' theorem lets us calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B), and P(A) and need to determine the fourth term.
The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem; it is used in classification algorithms to separate data into classes quickly and with good accuracy.
Let us understand the use of Bayes' theorem in machine learning with an example. Suppose we want to assign a data instance A to one of n classes C1, C2, ..., Cn; by Bayes' theorem, P(Ci|A) = P(A|Ci) P(Ci) / P(A).
Here, P(A) remains constant throughout, meaning it does not change its value from class to class. So, to maximize P(Ci|A), we have to maximize the term P(A|Ci) × P(Ci). With n classes on the probability list, let us assume that each class is equally likely to be the right answer. Considering this factor, we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn).
This assumption helps us to reduce both the computation cost and the time. This is how Bayes' theorem plays a significant role in machine learning, and the Naïve Bayes classifier simplifies the conditional probability task without greatly affecting precision. Hence, by using Bayes' theorem in machine learning, we can easily describe the possibilities of smaller events.
The Naïve Bayes classifier is a supervised algorithm based on Bayes' theorem and used to solve classification problems. It is one of the simplest and most effective classification algorithms in machine learning and enables us to build various ML models for quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular applications of Naïve Bayes are spam filtering, sentiment analysis, and classifying articles.
It is one of the simplest and most effective methods for calculating conditional probabilities and solving text classification problems.
A Naïve Bayes classifier performs better than many other models when the assumption of independent predictors holds true.
It is easier to implement than many other models.
It requires only a small amount of training data to estimate the test data, which minimizes the training time.
It can be used for binary as well as multi-class classification.
The main disadvantage of the Naïve Bayes classifier is its reliance on the assumption of independent predictors: it implicitly assumes that all attributes are independent or unrelated, whereas in real life it is rarely feasible to obtain mutually independent attributes.
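A minimal sketch of a Naïve Bayes classifier in practice, using scikit-learn's GaussianNB on the built-in iris data set to illustrate multi-class classification.

```python
# A sketch of a multi-class Naive Bayes classifier on scikit-learn's built-in
# iris data set, illustrating that the method handles more than two classes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))  # typically around 0.9 or higher
```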
Sums
1. We flip two coins. What is the probability that both are heads, given that at least one of them is heads?
We are given that there is at least one head, so there are three equally likely possibilities: HH, HT, TH. Only one outcome has both coins as heads: HH. Therefore, the required probability is 1/3.
2. We roll a six-sided die. What is the probability that the roll is a 6, given that the outcome is an even number?
The even outcomes are {2, 4, 6}, which are equally likely, and only one of them is a 6, so the required probability is 1/3.
Vector Calculus and Optimization
Vector Calculus
Vector calculus, or vector analysis, is concerned with the differentiation and integration of vector fields, primarily in 3-dimensional Euclidean space. The term "vector calculus" is sometimes used as a synonym for the broader subject of multivariable calculus, which spans vector calculus as well as partial differentiation and multiple integration. Vector calculus plays an important role in differential geometry and in the study of partial differential equations. It is used extensively in physics and engineering, especially in the description of electromagnetic fields, gravitational fields, and fluid flow.
Vector calculus was developed from quaternion analysis by J. Willard Gibbs and Oliver Heaviside near the end of the 19th century, and most of the notation and terminology was established by Gibbs and Edwin Bidwell Wilson in their 1901 book, Vector Analysis. In the conventional form using cross products, vector calculus does not generalize to higher dimensions, while the alternative approach of geometric algebra, which uses exterior products, does.
Scalar fields
A scalar field associates a scalar value to every point in a space. The scalar is a mathematical number
representing a physical quantity. Examples of scalar fields in applications include the temperature
distribution throughout space, the pressure distribution in a fluid, and spin-zero quantum fields
(known as scalar bosons), such as the Higgs field. These fields are the subject of scalar field theory.
Vector fields
A vector field is an assignment of a vector to each point in a space. A vector field in the plane, for
instance, can be visualized as a collection of arrows with a given magnitude and direction each
attached to a point in the plane. Vector fields are often used to model, for example, the speed and
direction of a moving fluid throughout space, or the strength and direction of some force, such as the
magnetic or gravitational force, as it changes from point to point.
Vectors and pseudovectors
In more advanced treatments, one further distinguishes pseudovector fields and pseudoscalar fields,
which are identical to vector fields and scalar fields, except that they change sign under an
orientation-reversing map: for example, the curl of a vector field is a pseudovector field, and if one
reflects a vector field, the curl points in the opposite direction. This distinction is clarified and
elaborated in geometric algebra.
Operation in Vector
The main operations performed with vector quantities are vector addition, subtraction, scalar multiplication, the dot (scalar) product, and the cross (vector) product.
Integrals
There are three types of integrals dealt with in Vector Calculus that are
Line Integral
Surface Integral
Volume Integral
Line Integral
Line Integral in mathematics is the integration of a function along the line of the curve. The function
can be a scalar or vector whose line integral is given by summing up the values of the field at all
points on a curve weighted by some scalar function on the curve. Line Integral is also called Path
Integral
Surface Integral
Surface integral in mathematics is the integration of a function over a whole region or surface that is not flat. In a surface integral, the surface is treated as being made up of many small elements, and the integration is given by summing up the contributions of all the small elements of the surface. A surface integral is evaluated as a double integral over the surface.
Volume Integral
A volume integral, also known as a triple integral, is a mathematical concept used in calculus and
vector calculus to calculate the volume of a three-dimensional region within a space. It is an extension
of the concept of a definite integral in one dimension to three dimensions.
A quantity that is characterized not only by magnitude but also by its direction is called a vector. Velocity, force, acceleration, momentum, etc. are vectors.
Scalar Product/Dot Product of Vectors
The scalar (dot) product of two vectors a and b with an angle α between them is defined as
a · b = |a| |b| cos α
If we drop a perpendicular from the tip of a onto b, a right-angled triangle ACD is formed, and we can apply trigonometric formulae:
cos α = CD/AC
Or, CD = AC cos α
From this, it is clear that CD is the projection of vector a on vector b. So, we can conclude that one vector can be projected onto the other by the cosine of the angle between them.
If α is 90°, the scalar product is zero since cos 90° = 0; so the scalar product of the unit vectors in the x and y directions is 0.
There are various inequalities based on the dot product of vectors, such as:
Triangle Inequality
|a + b| ≤ |a| + |b|
Example 1. Consider two vectors such that |a| = 6 and |b| = 3 and α = 60°. Find their dot product.
Solution:
a · b = |a| |b| cos α = 6 × 3 × cos 60° = 18 × (1/2)
a · b = 9
Cross Product/Vector Product of Vectors
The vector product, or cross product, of two vectors a and b with an angle α between them is a vector of magnitude |a| |b| sin α directed perpendicular to the plane containing a and b. In component form, if a = a1x + a2y + a3z and b = b1x + b2y + b3z, then
a × b = (a2b3 − a3b2, a3b1 − a1b3, a1b2 − a2b1),
which can be computed using the determinant form
| x   y   z  |
| a1  a2  a3 |
| b1  b2  b3 |
Example 1. Find the magnitude of the cross product of two vectors a and b if their magnitudes are 5 and 10 respectively, given that the angle between them is 30°.
Solution:
|a × b| = |a| |b| sin α = 5 × 10 × sin 30° = 50 × (1/2) = 25
Many important applied problems involve finding the best way to accomplish some task. Often this
involves finding the maximum or minimum value of some function: the minimum time to make a
certain journey, the minimum cost for doing a task, the maximum power that can be generated by a
device, and so on. Many of these problems can be solved by finding the appropriate function and then
using techniques of calculus to find the maximum or the minimum value required.
Generally, such a problem will have the following mathematical form: Find the largest (or smallest)
value of f(x) when a≤x≤b. Sometimes a or b are infinite, but frequently the real world imposes some
constraint on the values that x may have.
Such a problem differs in two ways from the local maximum and minimum problems we encountered
when graphing functions: We are interested only in the function between a and b, and we want to
know the largest or smallest value that f(x) takes on, not merely values that are the largest or smallest
in a small interval. That is, we seek not a local maximum or minimum but a global maximum or
minimum, sometimes also called an absolute maximum or minimum.
Any global maximum or minimum must of course be a local maximum or minimum. If we find all
possible local extrema, then the global maximum, if it exists, must be the largest of the local maxima
and the global minimum, if it exists, must be the smallest of the local minima. We already know where
local extrema can occur: only at those points at which f′(x) is zero or undefined. Actually, there are
two additional points at which a maximum or minimum can occur if the endpoints a and b are not
infinite, namely, at a and b. We have not previously considered such points because we have not been
interested in limiting a function to a small interval. An example should make this clear.
Problem-Solving Strategy: Solving Optimization Problems
1. Introduce all variables. If applicable, draw a figure and label all variables.
2. Determine which quantity is to be maximized or minimized, and for what range of values of
the other variables (if this can be determined at this time).
3. Write a formula for the quantity to be maximized or minimized in terms of the variables. This
formula may involve more than one variable.
4. Write any equations relating the independent variables in the formula from step 3. Use these equations to write the quantity to be maximized or minimized as a function of one variable.
5. Identify the domain of consideration for the function in step 4 based on the physical problem
to be solved.
6. Locate the maximum or minimum value of the function from step 4. This step typically
involves looking for critical points and evaluating a function at endpoints.
A rectangular garden is to be constructed using a rock wall as one side of the garden and wire fencing for the other three sides. Given 100 ft of wire fencing, determine the dimensions that would create a garden of maximum area. What is the maximum area?
We want to determine the measurements x and y that will create a garden with a maximum area
using 100ft of fencing.
Solution
Let x denote the length of the side of the garden perpendicular to the rock wall and y denote the length
of the side parallel to the rock wall. Then the area of the garden is
A=x⋅y.
We want to find the maximum possible area subject to the constraint that the total fencing is 100 ft. Since the rock wall forms the fourth side, the total amount of fencing used is 2x + y. Therefore, the
Constraint equation is
2x + y = 100.
Solving this equation for y, we have y = 100 − 2x. Thus, we can write the area as
A(x) = x · (100 − 2x) = 100x − 2x².
As mentioned earlier, since A is a continuous function on a closed, bounded interval [0, 50], by the extreme value theorem it has a maximum and a minimum. These extreme values occur either at endpoints or at critical points. At the endpoints, A(x) = 0. Since the area is positive for all x in the open interval (0, 50), the maximum must occur at a critical point. Differentiating the function A(x), we obtain
A′(x) = 100 − 4x.
Setting A′(x) = 0 gives the critical point x = 25. Then y = 100 − 2(25) = 50, and the maximum area is A(25) = 100(25) − 2(25)² = 1250 square feet.
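The same result can be checked symbolically; this sketch uses sympy (an assumption about tooling) to find the critical point and the maximum area.

```python
# A sketch verifying the garden problem symbolically with sympy: maximize
# A(x) = 100x - 2x^2 on the interval [0, 50].
import sympy as sp

x = sp.symbols("x")
A = 100 * x - 2 * x**2

critical_points = sp.solve(sp.diff(A, x), x)  # A'(x) = 100 - 4x = 0  ->  x = 25
print(critical_points)                        # [25]
print(A.subs(x, 25))                          # 1250, the maximum area in square feet
```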
Decision Theory
Decision theory is a framework for making decisions under uncertainty. It involves identifying
options, assessing their consequences, and choosing the best option. Decision theory can be applied to
many fields, including business, economics, and psychology
How it works
Identify options: list the available alternatives
Assess consequences: consider the positive and negative outcomes of each option
Choose the best option according to a decision rule
Normative decision theory: describes the decision an ideally rational agent would make in an ideal situation
Applications
Business: Decision theory can help determine the best course of action to maximize profit or
revenue
Psychology: Decision theory can help understand how people make decisions
Economics: Decision theory can help understand how people make decisions about money
Marketing: Decision theory can help understand how people make decisions about products
and services
The decision problem is how to select the best of the available alternatives. The elements of the problem are the possible alternatives (actions, acts), the possible events (states, outcomes of a random process), the probabilities of these events, the consequences associated with each possible alternative-event combination, and the criterion (decision rule) according to which the best alternative is selected.
To illustrate the construction of a payoff table, suppose that 12 (hundred) eggs are ordered. The purchase cost is 12 × 8, or 96 (hundreds of dollars). If 10 (hundred) eggs are demanded, 10 are sold, and the revenue is 10 × 10, or 100 ($00); the profit associated with this alternative-event pair is 100 − 96, or 4 ($00).
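A small sketch that reproduces the profit calculation above for a few order/demand combinations, assuming (as an illustrative simplification) a purchase cost of 8 and a selling price of 10 per hundred eggs and that unsold eggs are worthless.

```python
# A sketch of the profit (payoff) table from the egg example: purchase cost 8
# and selling price 10 per hundred eggs, with unsold eggs assumed worthless.
order_quantities = [10, 11, 12]   # hundreds of eggs ordered (alternatives)
demands = [10, 11, 12]            # hundreds of eggs demanded (events)

for order in order_quantities:
    row = []
    for demand in demands:
        sold = min(order, demand)
        profit = sold * 10 - order * 8   # revenue minus purchase cost, in $00
        row.append(profit)
    print(order, row)
# e.g. ordering 12 when 10 are demanded gives 10*10 - 12*8 = 4, as in the text.
```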
Information Theory
Information theory is the mathematical study of how information is stored, communicated, and
quantified. It's also known as the mathematical theory of communication
Claude Shannon, an American electrical engineer, established information theory in the 1940s
Harry Nyquist and Ralph Hartley made early contributions in the 1920s
A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty
involved in the value of a random variable or the outcome of a random process. For example,
identifying the outcome of a fair coin flip (which has two equally likely outcomes) provides less
information (lower entropy, less uncertainty) than identifying the outcome from a roll of a die (which
has six equally likely outcomes). Some other important measures in information theory are mutual
information, channel capacity, error exponents, and relative entropy. Important sub-fields of
information theory include source coding, algorithmic complexity theory, algorithmic information
theory and information-theoretic security.
Based on the probability mass function of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by
H = − Σ_i p_i log2(p_i)
where p_i is the probability of occurrence of the i-th possible value of the source symbol. This equation gives the entropy in units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the shannon in Shannon's honour.
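A short sketch computing the Shannon entropy for the fair coin and the fair die mentioned above.

```python
# A sketch computing Shannon entropy H = -sum(p_i * log2(p_i)) for a fair coin
# and a fair six-sided die, matching the comparison above.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit for a fair coin flip
print(entropy([1 / 6] * 6))  # about 2.585 bits for a fair die roll
```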
Equivalently, the entropy can be written as H(X) = E_X[I(x)] = − Σ_{x∈X} p(x) log p(x). Here, I(x) = − log p(x) is the self-information, which is the entropy contribution of an individual message, E_X is the expected value, X is the set of all messages {x1, ..., xn}, and p(x) is the probability of message x.
The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X,
Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual
entropies.
Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by:
I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
Coding theory is one of the most important and direct applications of information theory. It can be
subdivided into source coding theory and channel coding theory. Using a statistical description for
data, information theory quantifies the number of bits needed to describe the data, which is the
information entropy of the source.
Data compression (source coding): There are two formulations for the compression problem: lossless data compression, in which the data must be reconstructed exactly, and lossy data compression, which allocates the bits needed to reconstruct the data within a specified fidelity level.
Error-correcting codes (channel coding): While data compression removes as much redundancy as
possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed
to transmit the data efficiently and faithfully across a noisy channel.