MLT Unit 1 Notes
UNIT-1
1.INTRODUCTION
Machine learning (ML) is a type of artificial intelligence (AI) that enables a system to learn from data
instead of through explicit programming. ML uses algorithms that iteratively learn from data to improve,
describe data, and predict outcomes.
2.NEEDS
Data is the lifeblood of all business. Data-driven decisions increasingly make the difference between
keeping up with competition or falling further behind. Machine learning can be the key to unlocking the
value of corporate and customer data and enacting decisions that keep a company ahead of the
competition.
Advancements in AI for applications like natural language processing (NLP) and computer vision
(CV) are helping industries like financial services, healthcare, and automotive accelerate innovation,
improve customer experience, and reduce costs. Machine learning has applications in all types of
industries, including manufacturing, retail, healthcare and life sciences, travel and hospitality, financial
services, and energy, feedstock, and utilities. Use cases include:
Manufacturing. Predictive maintenance and condition monitoring
Retail. Upselling and cross-channel marketing
Healthcare and life sciences. Disease identification and risk stratification
Travel and hospitality. Dynamic pricing
Financial services. Risk analytics and regulation
Energy. Energy demand and supply optimization
3.HISTORY
It’s all well and good to ask if androids dream of electric sheep, but science fact has evolved to a point
where it’s beginning to coincide with science fiction. No, we don’t have autonomous androids struggling
with existential crises — yet — but we are getting ever closer to what people tend to call “artificial
intelligence.”
Machine Learning is a sub-set of artificial intelligence where computer algorithms are used to
autonomously learn from data and information. In machine learning computers don’t have to be explicitly
programmed but can change and improve their algorithms by themselves.
Today, machine learning algorithms enable computers to communicate with humans, autonomously drive
cars, write and publish sport match reports, and find terrorist suspects. I firmly believe machine learning
will severely impact most industries and the jobs within them, which is why every manager should have
at least some grasp of what machine learning is and how it is evolving.
In this post I offer a quick trip through time to examine the origins of machine learning as well as the
most recent milestones.
1950 — Alan Turing creates the “Turing Test” to determine if a computer has real intelligence. To pass
the test, a computer must be able to fool a human into believing it is also human.
1952 — Arthur Samuel wrote the first computer learning program. The program was the game of
checkers, and the IBM computer improved at the game the more it played, studying which moves made
up winning strategies and incorporating those moves into its program.
1957 — Frank Rosenblatt designed the first neural network for computers (the perceptron), which
simulates the thought processes of the human brain.
1967 — The “nearest neighbor” algorithm was written, allowing computers to begin using very basic
pattern recognition. This could be used to map a route for traveling salesmen, starting at a random city
but ensuring they visit all cities during a short tour.
1979 — Students at Stanford University invent the “Stanford Cart” which can navigate obstacles in a
room on its own.
1981 — Gerald Dejong introduces the concept of Explanation Based Learning (EBL), in which a
computer analyses training data and creates a general rule it can follow by discarding unimportant data.
1985 — Terry Sejnowski invents NetTalk, which learns to pronounce words the same way a baby does.
1990s — Work on machine learning shifts from a knowledge-driven approach to a data-driven approach.
Scientists begin creating programs for computers to analyze large amounts of data and draw conclusions
— or “learn” — from the results.
2006 — Geoffrey Hinton coins the term “deep learning” to explain new algorithms that let computers
“see” and distinguish objects and text in images and videos.
2010 — The Microsoft Kinect can track 20 human features at a rate of 30 times per second, allowing
people to interact with the computer via movements and gestures.
2011 — Google Brain is developed, and its deep neural network can learn to discover and categorize
objects much the way a cat does.
2012 – Google’s X Lab develops a machine learning algorithm that is able to autonomously browse
YouTube videos to identify the videos that contain cats.
2014 – Facebook develops DeepFace, a software algorithm that is able to recognize or verify individuals
on photos to the same level as humans can.
2015 – Microsoft creates the Distributed Machine Learning Toolkit, which enables the efficient
distribution of machine learning problems across multiple computers.
2015 – Over 3,000 AI and Robotics researchers, endorsed by Stephen Hawking, Elon Musk and Steve
Wozniak (among many others), sign an open letter warning of the danger of autonomous weapons which
select and engage targets without human intervention.
2016 – Google’s artificial intelligence algorithm beats a professional player at the Chinese board game
Go, which is considered the world’s most complex board game and is many times harder than chess. The
AlphaGo algorithm developed by Google DeepMind managed to win five games out of five in the Go
competition.
So are we drawing closer to artificial intelligence? Some scientists believe that’s actually the wrong
question.
They believe a computer will never “think” in the way that a human brain does, and that comparing the
computational analysis and algorithms of a computer to the machinations of the human mind is like
comparing apples and oranges.
Regardless, computers’ ability to see, understand, and interact with the world around them is growing at
a remarkable rate. And as the quantities of data we produce continue to grow exponentially, so will our
computers’ ability to process, analyze, and learn from that data.
4.DEFINITIONS
Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms
and statistical models that enable computers to perform tasks without being explicitly programmed for
those tasks. In essence, machine learning algorithms learn from data, iteratively improving their
performance over time as they are exposed to more data.
A commonly cited definition of machine learning is given by Tom Mitchell, a computer scientist and
professor at Carnegie Mellon University:
" A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
In this definition:
"Experience E" refers to the data or examples provided to the algorithm.
"Tasks T" refer to the specific problems or tasks that the algorithm aims to solve or perform.
"Performance measure P" is the metric used to evaluate the algorithm's performance on the tasks T.
5.APPLICATIONS
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, digital images, etc. A popular use case of image recognition and face detection
is automatic friend tagging suggestions:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our
Facebook friends, we automatically get tagging suggestions with names; the technology behind this is
machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "DeepFace," which is responsible for face recognition and
person identification in pictures.
2. Speech Recognition:
When we use Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a
popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech
to text" or "computer speech recognition." At present, machine learning algorithms are widely used in
various applications of speech recognition. Google Assistant, Siri, Cortana, and Alexa use speech
recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the
shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time
Everyone who is using Google Maps is helping this app to become better. It takes information from the
user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product
on Amazon, we start getting advertisements for the same product while surfing the internet in the same
browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products
as per the customer's interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this
is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a
significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on
self-driving cars. It uses an unsupervised learning method to train the car models to detect people and
objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always
receive important mail in our inbox with the important symbol and spam emails in our spam box, and
the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name
suggests, they help us find information using our voice instructions. These assistants can help us
in various ways just through our voice instructions, such as playing music, calling someone, opening an email,
scheduling an appointment, etc.
These assistants record our voice instructions, send them over to a server on the cloud, decode them using ML
algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, there are various ways a fraudulent transaction can take
place, such as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this,
a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the
input for the next round. For each genuine transaction, there is a specific pattern which changes for a
fraudulent transaction; hence, the network detects it and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups
and downs in shares, so machine learning's long short term memory (LSTM) neural network is used for
the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is
growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; machine
learning helps us here by converting the text into languages we know. Google's GNMT (Google Neural
Machine Translation) provides this feature; it is a neural machine translation model that translates the text
into our familiar language, and this is known as automatic translation.
The technology behind the automatic translation is a sequence to sequence learning algorithm, which is
used with image recognition and translates the text from one language to another language.
6.ADVANTAGES
Automation: Machine learning algorithms can automate complex tasks and processes, reducing
the need for human intervention.
Data mining: Machine learning can assess data and find patterns in it.
Better advertising and marketing: Machine learning algorithms can predict which consumers are
the most likely to actually buy a product.
More accurate predictions: Machine learning can analyze vast amounts of data, identify patterns,
and make accurate predictions without being explicitly programmed for every possible scenario.
Data-driven decision making: Machine learning uses real-time and historical data to make
informed decisions, which significantly reduces reliance on decisions based on intuition or
assumptions.
Continuous improvement: Machine learning models can continuously learn and improve over
time.
Enhanced online shopping experience: Machine learning algorithms can analyze user behavior to
offer personalized product suggestions.
7.DISADVANTAGES
Inaccurate data: ML algorithms can be difficult to train due to data errors, unusable data formats,
and the inability to label data.
Biased data: Human biases can creep into datasets and spoil outcomes. For example, the selfie
editor Face App was initially trained to make faces “hotter” by lightening the skin tone.
Ethical concerns: ML can amplify existing biases in the training data, which can raise ethical
concerns related to discrimination, stereotyping, and data privacy.
Security risks: ML relies heavily on data, which raises privacy and security concerns.
Other disadvantages: ML requires significant computational power, and can lead to excessive use
that harms mankind.
8.CHALLENGES
Poor quality data is one of the primary obstacles to machine learning. Algorithms need access to accurate
and relevant information to learn effectively; noisy, incomplete, or poorly labelled data leads to unreliable results.
Data bias
Data bias can occur in collection, analysis, and utilization. Biased data can lead to inaccurate and
unreliable results, making it ineffective at achieving desired goals.
Overfitting
Overfitting occurs when a model is trained too closely on the training data, and as a result, it performs
poorly on new, unseen data.
Underfitting
Underfitting occurs when a machine learning algorithm is unable to capture the relationship between the
input and output variables accurately. This generates a high error rate on both the training set and unseen
data.
Data collection
Most of the time for running machine learning end-to-end is spent on preparing the data, which includes
collecting, cleaning, analyzing, visualizing, and feature engineering.
9.TYPES OF MACHINE LEARNING PROBLEMS
In this post, you will learn about the most common types of machine learning (ML) problems along with
a few examples. Without further ado, let’s look at these problem types and understand the details.
Regression
Classification
Clustering
Time-series forecasting
Anomaly detection
Ranking
Recommendation
Data generation
Optimization
Regression: When the need is to predict numerical values, such kinds of problems are called regression
problems, for example house price prediction. Common algorithms: linear regression, K-NN, random
forest, neural networks.
Classification: When the need is to predict a class label, the problem is a classification problem. When
there are only two classes it is a binary classification problem; when there are multiple classes it is a
multi-nomial classification problem. For example, classify whether a person is suffering from a disease
or otherwise, or classify whether a stock is “buy”, “sell”, or “hold”. Common algorithms: gradient
boosting classifier, neural networks.
Clustering: When there is a need to categorize the data points into similar groupings or clusters, this is
called a clustering problem. Common algorithms: K-Means, DBSCAN, hierarchical clustering, Gaussian
mixture models, BIRCH.
Recommendation: When there is a need to suggest the next items to a user, this is called a
recommendation problem; a ranking algorithm is used to recommend the next items.
Data generation: When there is a need to generate data such as images, videos, articles, posts, etc., the
problem is called a data generation problem. Common algorithms: generative adversarial networks
(GAN), hidden Markov models.
10.MATHEMATICAL FOUNDATIONS
This course is an introduction to key mathematical concepts at the heart of machine learning. The focus is
on matrix methods and statistical models and features real-world applications ranging from classification
and clustering to denoising and recommender systems.
11.LINEAR ALGEBRA
Linear algebra serves as a fundamental mathematical foundation for many aspects of machine learning,
providing the framework for representing and manipulating data, defining models, and solving
optimization problems. Here are some key concepts in linear algebra that are essential for understanding
machine learning:
1. Vectors:
Vectors represent quantities with magnitude and direction, often used to represent features
or observations in machine learning.
2. Matrix Operations:
Matrix multiplication, which involves multiplying rows and columns to produce new
matrices. This operation is fundamental in many machine learning algorithms, such as
linear regression and neural networks.
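As a minimal sketch (not from the notes), the following NumPy lines show matrix-vector multiplication as it appears in a linear model's predictions; the feature matrix X and weight vector w are purely illustrative values.

import numpy as np

# Toy design matrix: 4 observations (rows), 3 features (columns).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 1.0]])

# Illustrative weight vector of a linear model, y_hat = X @ w.
w = np.array([0.5, -1.0, 2.0])

y_hat = X @ w      # each row of X is dotted with w, giving one prediction per observation
print(y_hat)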
3. Transpose and Inverse:
The transpose of a matrix flips it across its diagonal, swapping rows and columns.
The inverse of a square matrix A, denoted as A^-1, is a matrix such that A * A^-1 = I,
where I is the identity matrix. The inverse is crucial for solving systems of linear equations
and for certain optimization problems.
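A small sketch of how the transpose and the inverse appear in practice, assuming a least-squares linear regression problem on synthetic data; the normal-equations solution w = (XᵀX)⁻¹Xᵀy uses both operations.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                   # 20 observations, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=20)    # targets with a little noise

# Normal equations: w = (X^T X)^-1 X^T y, using transpose and inverse explicitly.
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Solving the linear system directly is usually preferred to forming the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat, w_solve)    # both should be close to true_w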
5. Matrix Decompositions:
Singular Value Decomposition (SVD), which decomposes a matrix into singular vectors
and singular values. SVD is used in PCA and other dimensionality reduction techniques.
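A brief NumPy sketch of the singular value decomposition and a low-rank approximation of the kind used in PCA-style dimensionality reduction; the matrix here is random and purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))                        # a small data matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(S) @ Vt

k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]        # best rank-2 approximation of A

print(S)                             # singular values, largest first
print(np.linalg.norm(A - A_k))       # reconstruction error of the rank-2 approximation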
6. Linear Transformations:
Linear transformations map vectors to other vectors while preserving addition and scalar
multiplication properties. Many machine learning algorithms involve linear
transformations, such as feature transformations and linear classifiers.
12.ANALYTICAL GEOMETRY
Analytic geometry is used in physics and engineering, as well as in fields like aviation, rocketry, space
science, and spaceflight. It serves as the cornerstone for many contemporary geometric disciplines,
including algebraic, differential, discrete, and computational geometry.
Analytic geometry, also called coordinate geometry, is a branch of mathematics that combines algebra
and geometry. It deals with geometric figures using coordinate systems and algebraic techniques.
Instead of dealing only with geometric properties and relationships, analytic geometry introduces a
coordinate system that represents geometric figures using numerical coordinates.
Engineering Design
Analytic geometry is widely used in engineering design to model and analyze complex shapes and
structures. Engineers use coordinate systems and equations to design buildings, bridges, and
mechanical components.
Example: Engineers use analytic geometry to design the curves and surfaces of car bodies for
aerodynamics and aesthetics.
Computer Graphics
Analytic geometry forms the basis of computer graphics, allowing programmers to create and
manipulate images on screens. By representing objects as sets of coordinates and equations,
computers can render realistic 2D and 3D graphics.
Physics and Astronomy
Analytic geometry is used in physics and astronomy to study the motion of objects and understand
the behavior of celestial bodies. The equations describing the trajectories of particles and planets
can be analyzed using geometric techniques.
Example: NASA uses analytic geometry to calculate the orbits of spacecraft and predict their
trajectories.
Cartography
Analytic geometry is essential in cartography, the science of mapmaking. Cartographers use
coordinate systems and equations to represent the Earth’s surface accurately and create maps that are
used for navigation, urban planning and environmental studies.
Example: GPS systems rely on analytic geometry to determine the coordinates of the user’s location
and provide directions.
Robotics
Analytic geometry plays a crucial role in robotics, enabling engineers to program robots to perform
tasks accurately and efficiently. Robots use coordinate systems and geometric algorithms to
navigate environments and manipulate objects.
Example: Industrial robots use analytic geometry to precisely control their movements when
assembling products on assembly lines.
ORTHOGONAL PROJECTION
PROBABILITY AND STATISTICS
Probability and statistics are considered the base foundation for ML and data science, used to develop ML
algorithms and build decision-making capabilities. Probability and statistics are also primary
prerequisites for learning ML.
Machine Learning is an interdisciplinary field that uses statistics, probability, algorithms to learn from
data and provide insights which can be used to build intelligent applications. In this article, we will
discuss some of the key concepts widely used in machine learning.
Probability and statistics are related areas of mathematics which concern themselves with analyzing the
relative frequency of events.
Probability deals with predicting the likelihood of future events, while statistics involves the analysis of
the frequency of past events.
Probability
Most people have an intuitive understanding of degrees of probability, which is why we use words like
“probably” and “unlikely” in our daily conversation, but we will talk about how to make quantitative
claims about those degrees [1].
An event can be anything like tossing a coin, rolling a die, or pulling a colored ball out of a bag. In
these examples the outcome of the event is random, so the variable that represents the outcome of these
events is called a random variable.
Let us consider a basic example of tossing a coin. If the coin is fair, then it is just as likely to come up
heads as it is to come up tails. In other words, if we were to repeatedly toss the coin many times, we
would expect about half of the tosses to be heads and half to be tails. In this case, we say that
the probability of getting a head is 1/2 or 0.5.
The empirical probability of an event is given by the number of times the event occurs divided by the total
number of incidents observed. If for n trials we observe s successes, the probability of success is s/n. In
the above example, any sequence of coin tosses may have more or less than exactly 50% heads.
Theoretical probability on the other hand is given by the number of ways the particular event can occur
divided by the total number of possible outcomes. So a head can occur once and possible outcomes are
two (head, tail). The true (theoretical) probability of a head is 1/2.
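A minimal simulation (illustrative only) contrasting the empirical probability s/n with the theoretical probability 1/2 for a fair coin.

import random

random.seed(42)
n = 10_000
heads = sum(random.random() < 0.5 for _ in range(n))   # simulate n fair coin tosses

empirical = heads / n      # s successes out of n trials
theoretical = 1 / 2        # one favourable outcome out of two possible outcomes

print(empirical, theoretical)   # the empirical value approaches 0.5 as n grows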
Joint Probability
The probability of events A and B, denoted by P(A and B) or P(A ∩ B), is the probability that events A and B
both occur: P(A ∩ B) = P(A) · P(B). This only applies if A and B are independent, which means that
if A occurred, that doesn’t change the probability of B, and vice versa.
Conditional Probability
Let us consider that A and B are not independent, because if A occurred, the probability of B is higher. When
A and B are not independent, it is often useful to compute the conditional probability P(A|B), which is
the probability of A given that B occurred: P(A|B) = P(A ∩ B) / P(B).
Similarly, P(B|A) = P(A ∩ B) / P(A). We can write the joint probability of A and B as P(A ∩ B) =
P(A) · P(B|A), which means: “The chance of both things happening is the chance that the first one
happens, and then the second one given the first happened.”
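A short worked sketch of joint and conditional probability using two fair dice (my own example, not from the notes): event A is "the sum is at least 10" and event B is "the first die shows 6".

from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))    # sample space for two fair dice

def P(event):
    return Fraction(len(event), len(space))

A = {s for s in space if s[0] + s[1] >= 10}     # A: the sum is at least 10
B = {s for s in space if s[0] == 6}             # B: the first die shows 6

p_joint = P(A & B)          # P(A ∩ B) = 3/36
p_cond = p_joint / P(B)     # P(A|B) = P(A ∩ B) / P(B) = 1/2

print(p_joint, p_cond)
print(P(A) * P(B) == p_joint)   # False: A and B are not independent, so P(A ∩ B) differs from P(A) · P(B)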
Bayes’ Theorem
Bayes’ theorem is a relationship between the conditional probabilities of two events. For example, if we
want to find the probability of selling ice cream on a hot and sunny day, Bayes’ theorem gives us the
tools to use prior knowledge about the likelihood of selling ice cream on any other type of day (rainy,
windy, snowy, etc.):
P(H|E) = P(E|H) · P(H) / P(E)
where H and E are events, and P(H|E) is the conditional probability that event H occurs given that event E has
already occurred. The probability P(H) in the equation is basically frequency analysis; given
our prior data, what is the probability of the event occurring. The P(E|H) in the equation is called
the likelihood and is essentially the probability that the evidence is correct, given the information from
the frequency analysis. P(E) is the probability that the actual evidence is true.
Consider, for example, a medical test that gives:
a Sensitivity (also called the true positive rate) result for 95% of the patients with the disease, and
a Specificity (also called the true negative rate) result for 95% of the healthy patients.
If we let “+” and “−” denote a positive and negative test result, respectively, then the test accuracies are
the conditional probabilities: P(+|disease) = 0.95, P(−|healthy) = 0.95.
In Bayesian terms, we want to compute the probability of disease given a positive test, P(disease|+).
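A small sketch of this computation, assuming a hypothetical prevalence of 1% since the notes give only the sensitivity and specificity.

# Hypothetical prevalence; the notes specify only sensitivity and specificity.
p_disease = 0.01
sensitivity = 0.95             # P(+ | disease)
specificity = 0.95             # P(- | healthy)

p_healthy = 1 - p_disease
p_pos = sensitivity * p_disease + (1 - specificity) * p_healthy   # total probability of a positive test

# Bayes' theorem: P(disease | +) = P(+ | disease) * P(disease) / P(+)
p_disease_given_pos = sensitivity * p_disease / p_pos

print(round(p_disease_given_pos, 3))   # about 0.161: most positives are false positives at 1% prevalence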
Descriptive Statistics
Descriptive statistics refers to methods for summarizing and organizing the information in a data set. We
will use a table of 10 loan applicants to describe some of the statistical concepts [4].
Elements: The entities for which information is collected are called the elements. In the above table, the
elements are the 10 applicants. Elements are also called cases or subjects.
Variables: The characteristic of an element is called a variable. It can take different values for different
elements, e.g., marital status, mortgage, income, rank, year, and risk. Variables are also called attributes.
Qualitative: A qualitative variable enables the elements to be classified or categorized according to some
characteristic. The qualitative variables are marital status, mortgage, rank, and risk. Qualitative variables
are also called categorical variables.
Quantitative: A quantitative variable takes numeric values and allows arithmetic to be meaningfully
performed on it. The quantitative variables are income and year. Quantitative variables are also
called numerical variables.
Discrete Variable: A numerical variable that can take either a finite or a countable number of values is a
discrete variable, for which each value can be graphed as a separate point, with space between each
point. ‘year’ is an example of a discrete variable.
Continuous Variable: A numerical variable that can take infinitely many values is a continuous variable,
whose possible values form an interval on the number line, with no space between the points. ‘income’ is
an example of a continuous variable.
Population: A population is the set of all elements of interest for a particular problem. A parameter is a
characteristic of a population.
Sample: A sample consists of a subset of the population. A characteristic of a sample is called a statistic.
Random sample: When we take a sample for which each element has an equal chance of being selected.
Measures of Center indicate where on the number line the central part of the data is located.
Mean
The mean is the arithmetic average of a data set. To calculate the mean, add up the values and divide by
the number of values. The sample mean is the arithmetic average of a sample, and is denoted x̄ (“x-bar”).
The population mean is the arithmetic average of a population, and is denoted 𝜇 (“mu”, the Greek letter
for m).
Median
The median is the middle data value, when there is an odd number of data values and the data have been
sorted into ascending order. If there is an even number, the median is the mean of the two middle data
values. When the income data are sorted into ascending order, the two middle values are $32,100 and
$32,200, the mean of which is the median income, $32,150.
Mode
The mode is the data value that occurs with the greatest frequency. Both quantitative and categorical
variables can have modes, but only quantitative variables can have means or medians. Each income value
occurs only once, so there is no mode for income.
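A quick sketch using Python's statistics module; the income values are hypothetical (chosen so that the two middle values are $32,100 and $32,200, matching the median quoted above), and the mode is shown on a categorical variable.

import statistics

incomes = [24_000, 25_000, 28_500, 30_000, 32_100,
           32_200, 36_000, 38_000, 39_500, 41_000]   # hypothetical incomes

print(statistics.mean(incomes))     # arithmetic average
print(statistics.median(incomes))   # mean of the two middle values: 32,150
print(statistics.mode(["single", "married", "married", "other"]))   # mode of a categorical variable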
Range
The range of a variable equals the difference between the maximum and minimum values. The range of
income equals the maximum income minus the minimum income.
Range only reflects the difference between the largest and smallest observations, but it fails to reflect how
the data are centralized.
Variance
Population variance is defined as the average of the squared differences from the mean, and is denoted
𝜎² (“sigma-squared”):
𝜎² = Σ (xᵢ − 𝜇)² / N
A larger variance means the data are more spread out.
The sample variance s² is approximately the mean of the squared deviations, with N replaced by n − 1. This
difference occurs because the sample mean is used as an approximation of the true population mean.
Standard Deviation
The standard deviation or sd of a bunch of numbers tells you how much the individual numbers tend to
differ from the mean.
The sample standard deviation is the square root of the sample variance: s = √s². For example, the incomes
deviate from their mean by $7201.
The population standard deviation is the square root of the population variance: 𝜎 = √𝜎².
Three different data distributions with same mean (100) and different standard deviation (5,10,20)
The smaller the standard deviation, narrower the peak, the data points are closer to the mean. The further
the data points are from the mean, the greater the standard deviation.
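A short sketch contrasting the population formulas (divide by N) with the sample formulas (divide by n − 1), reusing the hypothetical incomes from above.

import statistics

incomes = [24_000, 25_000, 28_500, 30_000, 32_100,
           32_200, 36_000, 38_000, 39_500, 41_000]

print(statistics.pvariance(incomes), statistics.variance(incomes))   # population (divide by N) vs sample (divide by n-1) variance
print(statistics.pstdev(incomes), statistics.stdev(incomes))         # the corresponding standard deviations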
Measures of Position indicate the relative position of a particular data value in the data distribution.
Percentile
The pth percentile of a data set is the data value such that p percent of the values in the data set are at or
below this value. The 50th percentile is the median. For example, the median income is $32,150, and
50% of the data values lie at or below this value.
Percentile rank
The percentile rank of a data value equals the percentage of values in the data set that are at or below that
value. For example, the percentile rank of Applicant 1’s income of $38,000 is 90%, since that is the
percentage of incomes equal to or less than $38,000.
Interquartile Range (IQR)
The first quartile (Q1) is the 25th percentile of a data set; the second quartile (Q2) is the 50th percentile
(median); and the third quartile (Q3) is the 75th percentile.
The IQR measures the spread between the 75th and 25th percentiles using the formula: IQR = Q3 − Q1.
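A minimal NumPy sketch of quartiles and the IQR on the same hypothetical incomes.

import numpy as np

incomes = np.array([24_000, 25_000, 28_500, 30_000, 32_100,
                    32_200, 36_000, 38_000, 39_500, 41_000])

q1, q2, q3 = np.percentile(incomes, [25, 50, 75])   # first quartile, median, third quartile
iqr = q3 - q1                                       # interquartile range: Q3 - Q1

print(q1, q2, q3)
print(iqr)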
Different ways you can describe patterns found in uni-variate data include central tendency : mean, mode
and median and dispersion: range, variance, maximum, minimum, quartiles , and standard deviation.
Pie chart [left] & Bar chart [right] of Marital status from loan applicants table.
The various plots used to visualize uni-variate data typically are Bar Charts, Histograms, Pie Charts. etc.
Bi-variate analysis involves the analysis of two variables for the purpose of determining the empirical
relationship between them. The various plots used to visualize bi-variate data typically are scatter-plot,
box-plot.
Scatter Plots
A scatter plot is the simplest way to visualize the relationship between two quantitative variables, x and y. For two
continuous variables, a scatter plot is a common graph. Each (x, y) point is graphed on a Cartesian plane,
with the x axis on the horizontal and the y axis on the vertical. Scatter plots are sometimes called
correlation plots because they show how two variables are correlated.
Correlation
A correlation is a statistic intended to quantify the strength of the relationship between two variables.
The correlation coefficient r quantifies the strength and direction of the linear relationship between two
quantitative variables. The correlation coefficient is defined as:
r = Σ (xᵢ − x̄)(yᵢ − ȳ) / [(n − 1) · s_x · s_y]
where s_x and s_y are the sample standard deviations of x and y.
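A brief sketch computing r on synthetic data where y depends linearly on x; the numbers are illustrative, not from the notes.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)   # y depends linearly on x, plus noise

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))            # close to +1: a strong positive linear relationship

# A scatter plot of (x, y) would show the points clustered around a rising line:
# import matplotlib.pyplot as plt; plt.scatter(x, y); plt.show()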
Box Plots
A box plot is also called a box and whisker plot and it’s used to picture the distribution of values. When
one variable is categorical and the other continuous, a box-plot is commonly used. When you use a box
plot you divide the data values into four parts called quartiles. You start by finding the median or middle
value. The median splits the data values into halves. Finding the median of each half splits the data values
into four parts, the quartiles.
Each box on the plot shows the range of values from the median of the lower half of the values at the
bottom of the box to the median of the upper half of the values at the top of the box. A line in the middle
of the box occurs at the median of all the data values. The whiskers then point to the largest and smallest
values in the data.
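A small matplotlib sketch, assuming matplotlib is installed; the two income groups are hypothetical and simply illustrate one box (median, quartiles, whiskers) per category.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
single = rng.normal(30_000, 5_000, size=50)     # hypothetical incomes, "single" group
married = rng.normal(36_000, 7_000, size=50)    # hypothetical incomes, "married" group

plt.boxplot([single, married])                  # each box spans Q1..Q3; the inner line marks the median
plt.xticks([1, 2], ["single", "married"])
plt.ylabel("income")
plt.show()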
Bayes’ theorem is used to determine the conditional probability of an event. It is named after the English
statistician Thomas Bayes, whose formulation of the theorem was published in 1763.
Bayes’ theorem is a very important theorem in mathematics that laid the foundation of a unique
statistical inference approach called Bayesian inference.
It is used to find the probability of an event, based on prior knowledge of conditions that might be related
to that event. For example, if we want to find the probability that a white marble drawn at random came
from the first bag, given that a white marble has already been drawn, and there are three bags each
containing some white and black marbles, then we can use Bayes’ Theorem.
This article will explore the Bayes theorem. We will present the statement, proof, derivation, and formula
of the theorem, as well as illustrate its applications with various examples.
Bayes’ theorem (also known as Bayes’ rule or Bayes’ law) is used to determine the conditional
probability of event A when event B has already occurred.
The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the probability of B given A and the probability of A,
divided by the probability of event B,” i.e.
P(A|B) = P(B|A) · P(A) / P(B)
Let E1, E2, …, En be a set of events associated with the sample space S, in which all the events E1, E2,
…, En have a non-zero probability of occurrence. All the events E1, E2, …, En form a partition of S. Let A
be an event from the space S for which we have to find the probability; then, according to Bayes’ theorem,
P(Ei|A) = P(Ei) · P(A|Ei) / Σₖ P(Ek) · P(A|Ek),  for k = 1, 2, 3, …, n
For any two events A and B, the formula for Bayes’ theorem is given by:
P(A|B) = P(B|A) · P(A) / P(B)
where,
P(A) and P(B) are the probabilities of events A and B, and P(B) is never equal to zero.
P(A|B) is the probability of event A when event B happens.
P(B|A) is the probability of event B when event A happens.
The proof of Bayes’ theorem is given as follows. According to the conditional probability formula,
P(Ei|A) = P(Ei ∩ A) / P(A) ……(i)
Using the multiplication rule of probability,
P(Ei ∩ A) = P(Ei) · P(A|Ei) ……(ii)
Using the total probability theorem,
P(A) = Σₖ P(Ek) · P(A|Ek) ……(iii)
Substituting the values of P(Ei ∩ A) and P(A) from eq. (ii) and eq. (iii) into eq. (i), we get
P(Ei|A) = P(Ei) · P(A|Ei) / Σₖ P(Ek) · P(A|Ek)
Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the Ei‘s are a
partition of the sample space S, and at any given time only one of the events Ei occurs. Thus we conclude
that the Bayes’ theorem formula gives the probability of a particular Ei, given the event A has occurred.
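A small sketch of the three-bag marble example mentioned earlier, with hypothetical bag contents since the notes do not give the numbers.

from fractions import Fraction

# Hypothetical contents (white, black); the notes do not specify them.
bags = {"bag1": (3, 1), "bag2": (2, 2), "bag3": (1, 3)}

prior = Fraction(1, 3)                                               # each bag equally likely to be picked
likelihood = {b: Fraction(w, w + k) for b, (w, k) in bags.items()}   # P(white | bag)

evidence = sum(prior * likelihood[b] for b in bags)                  # total probability of drawing white
posterior_bag1 = prior * likelihood["bag1"] / evidence               # Bayes: P(bag1 | white)

print(posterior_bag1)   # 1/2 with these particular contents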
Various terms used in the Bayes theorem are explained below in this article.
After learning about Bayes theorem in detail, let us understand some important terms related to the
concepts we covered in formula and derivation.
Hypotheses
The events E1, E2, …, En occurring in the sample space are called the hypotheses.
Prior Probability
Prior probability is the initial probability of an event occurring before any new data is taken into
account. P(Ei) is the prior probability of hypothesis Ei.
Posterior Probability
Posterior probability is the updated probability of an event after considering new information.
The probability P(Ei|A) is the posterior probability of hypothesis Ei.
Conditional Probability
The probability of an event A based on the occurrence of another event B is termed conditional
Probability. It is denoted as P(A|B) and represents the probability of A when event B has already
happened.
Joint Probability
When the probability of two or more events occurring together at the same time is measured, it is
called the joint probability. For two events A and B, the joint probability is denoted
as P(A ∩ B).
Random Variables
Real-valued variables whose possible values are determined by random experiments are called random
variables. The probability of finding such variables is the experimental probability.
Bayesian inference is very important and has found application in various activities, including medicine,
science, philosophy, engineering, sports, law, etc., and Bayesian inference is directly derived from Bayes’
theorem.
Example: Bayes’ theorem defines the accuracy of the medical test by taking into account how likely a
person is to have a disease and what is the overall accuracy of the test.
The key difference between conditional probability and Bayes’ theorem is that conditional probability,
P(A|B) = P(A ∩ B) / P(B), is computed directly from the joint probability, whereas Bayes’ theorem
expresses the same quantity in terms of the reverse conditional probability, P(A|B) = P(B|A) · P(A) / P(B).
VECTOR CALCULUS AND OPTIMIZATION
In machine learning, an objective function (also known as a loss or cost function) quantifies how well a
model’s predictions match the actual target values in the training data. The goal of optimization is then to
find the set of model parameters that minimizes this objective function, effectively making the model’s
predictions as accurate as possible.
That’s when calculus comes into play, as it provides the mathematical foundation for understanding how
functions change and how to optimize them with the help of tools like derivatives and gradients.
Functions
Roughly speaking, a derivative describes how a function’s output changes with respect to a small
change in its input. Mathematically, the derivative can be defined as follows:
f′(x) = lim(h→0) [f(x + h) − f(x)] / h
A function can have more than one variable; we call it a multivariate function. In such cases, the function
has partial derivatives, obtained by varying one variable at a time while holding the others fixed.
By collecting these partial derivatives in a vector, we obtain the gradient (see where I’m heading?):
∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
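A minimal sketch comparing a finite-difference gradient with the analytic gradient for a simple two-variable function of my own choosing.

def f(x, y):
    return x**2 + 3*x*y + y**2        # a simple multivariate function

def numerical_gradient(x, y, h=1e-6):
    # Partial derivatives: vary one variable at a time (central differences).
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy               # collecting the partials gives the gradient

print(numerical_gradient(1.0, 2.0))   # approximately (8, 7); analytically the gradient is (2x + 3y, 3x + 2y)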
In a similar way to how we defined the derivative of a function f on R, we can define gradients of functions
that take vectors as inputs. We can also encounter situations where we need to take gradients of matrices with respect to vectors (or
other matrices), which yields a multidimensional tensor.
Now, after all these definitions, let’s (finally) explain why this is useful in machine learning algorithms.
Backpropagation
Consider a neural network with multiple layers. The goal of training is to adjust the weights and biases of
the network in such a way that the predicted outputs are as close as possible to the true outputs. This is
achieved by minimizing a loss function that quantifies the discrepancy between predictions and actual
values.
Step 1: Forward Pass
1. Start with the input features x fed into the first layer of the network.
2. Propagate the input through each layer of the network, one by one, to compute the output
activations. At each layer, compute the weighted sum of the inputs, apply the activation function,
and pass the result to the next layer.
3. Continue this process until you reach the output layer, obtaining the predicted output y_pred.
Step 2: Compute Loss
1. Calculate the loss between the predicted output y_pred and the actual target y_true using a suitable
loss function, such as mean squared error or cross-entropy.
Step 3: Backward Pass
1. For each hidden layer l, compute the gradient of the loss L with respect to the weighted sum of
inputs z(l) before the activation function.
2. Compute the gradient of the loss with respect to the layer’s weights and biases using the computed
local gradient and the input activations from the previous layer.
Step 4: Update Parameters
After computing the gradients for all layers, we update the weights and biases using an optimization
algorithm like gradient descent:
θ ← θ − α · ∂L/∂θ
Step 5: Repeat
We continue this process for a predefined number of iterations (epochs) or until the loss converges to a
satisfactory level.
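The steps above can be made concrete with a minimal NumPy sketch of one hidden layer trained by hand-written backpropagation on toy data; the network size, data, and learning rate are all illustrative assumptions, not the notes' own example.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples, 2 features, 1 continuous target (hypothetical).
X = rng.normal(size=(8, 2))
y = X[:, :1] - 2 * X[:, 1:]                 # the target the network should learn

# One hidden layer of 4 tanh units, then a linear output.
W1, b1 = 0.5 * rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = 0.5 * rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.1

for epoch in range(200):
    # Step 1: forward pass through each layer.
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    y_pred = a1 @ W2 + b2

    # Step 2: mean squared error loss.
    loss = np.mean((y_pred - y) ** 2)

    # Step 3: backward pass (chain rule) to get gradients for every layer.
    grad_out = 2 * (y_pred - y) / len(X)    # dL/dy_pred
    grad_W2 = a1.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_a1 = grad_out @ W2.T
    grad_z1 = grad_a1 * (1 - a1 ** 2)       # derivative of tanh(z) is 1 - tanh(z)^2
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    # Step 4: gradient-descent update of weights and biases.
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print(round(float(loss), 4))   # the loss should shrink toward zero over the epochs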
The Gradient Descent algorithm mentioned above provides an overview of Optimization. Training a
machine learning model involves finding a set of parameters that allows the model to make accurate
predictions or classifications based on input data. The process of training is essentially the search for
these parameter values that optimize the model’s performance.
Optimization
In machine learning, most objective functions are designed to be minimized, meaning that the optimal value
corresponds to the minimum. Generally, finding the minimum using analytical methods isn’t feasible;
therefore, we start at an initial value x₀ and then repeatedly follow the negative gradient:
xₜ₊₁ = xₜ − α · ∇f(xₜ)
where alpha is called the step size and ∇f(xₜ) represents the gradient of f evaluated at the point xₜ.
It is important to choose a good step size alpha, also called the learning rate. A step size that is too small
could lead to slow algorithmic performance, while a large one might cause gradient descent to overshoot or diverge.
Another common challenge in optimization is reaching the global minimum rather than getting stuck in a local minimum.
You can mitigate this issue by trying different step sizes, trying different initial points, incorporating momentum
into the optimization algorithm, or using Stochastic Gradient Descent (SGD).
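A tiny sketch of gradient descent on f(x) = (x − 3)², showing how the step size alpha changes the behaviour (slow convergence, good convergence, divergence); the numbers are illustrative.

def f(x):
    return (x - 3) ** 2         # minimum at x = 3

def grad(x):
    return 2 * (x - 3)

for alpha in (0.01, 0.1, 1.1):  # too small, reasonable, and too large step sizes
    x = 0.0                     # initial value x0
    for _ in range(50):
        x = x - alpha * grad(x)     # x_{t+1} = x_t - alpha * grad f(x_t)
    print(alpha, x)             # 0.01 converges slowly, 0.1 reaches about 3, 1.1 diverges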
When dealing with very large datasets, loading the entire dataset into memory for the standard gradient
descent algorithm becomes infeasible. That’s where Stochastic Gradient Descent comes into play.
Instead of working with the whole dataset, SGD splits the training data into mini-batches; for each
mini-batch it computes the gradient of the loss function with respect to the parameters using only the
mini-batch data, and then updates the parameters. Of course, this randomness from mini-batch sampling
introduces noise into the optimization process. That is why many variations of the SGD algorithm have
been developed, such as Adam (Adaptive Moment Estimation), which is commonly used for training deep
learning models.
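A compact sketch of mini-batch SGD for linear regression on synthetic data; the batch size, learning rate, and the data itself are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1_000)

w = np.zeros(3)
lr, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(X))               # shuffle, then split into mini-batches
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient of the MSE on this mini-batch only
        w -= lr * grad                                 # noisy but cheap parameter update

print(np.round(w, 2))   # should be close to [1.0, -2.0, 0.5]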
Constrained Optimization
The problems we have previously considered were free of constraints, but this isn’t always the case. For
example, in various contexts, you may have limited resources to allocate or limits to respect. A general
constrained optimization problem looks like this:
minimize f(x)
subject to gᵢ(x) ≤ 0 for i = 1, …, m, and hⱼ(x) = 0 for j = 1, …, p
Lagrange multipliers are a technique used to solve this kind of problem. The method involves
introducing additional variables (Lagrange multipliers) to incorporate the constraints into the objective
function. This creates a new function, called the Lagrangian, by introducing a Lagrange multiplier λᵢ for
each inequality constraint and νⱼ for each equality constraint. The Lagrangian function is:
L(x, λ, ν) = f(x) + Σᵢ λᵢ · gᵢ(x) + Σⱼ νⱼ · hⱼ(x)
The constrained optimization problem we have previously stated is called the primal problem, to which
we associate a dual problem. The dual function is defined as:
g(λ, ν) = min over x of L(x, λ, ν)
and the dual problem is to maximize g(λ, ν) subject to λ ≥ 0.
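A minimal sketch of a constrained problem solved numerically with SciPy (assuming SciPy is available); the objective and constraint are toy choices of my own, and SciPy handles the multipliers internally.

from scipy.optimize import minimize

# Minimize f(x, y) = (x - 1)^2 + (y - 2)^2 subject to x + y <= 2.
objective = lambda v: (v[0] - 1) ** 2 + (v[1] - 2) ** 2

# SciPy expects inequality constraints written as g(v) >= 0.
constraint = {"type": "ineq", "fun": lambda v: 2 - (v[0] + v[1])}

result = minimize(objective, x0=[0.0, 0.0], constraints=[constraint])
print(result.x)   # approximately [0.5, 1.5]: the unconstrained minimum (1, 2) violates x + y <= 2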
Support Vector Machines (SVMs) utilize the concept of Lagrange multipliers to define an optimal
hyperplane that maximizes the margin between distinct classes of data points. This margin represents the
separation distance between the hyperplane and the nearest data points from each class. In scenarios
where our data doesn’t exhibit perfect linear separability, the concept of the soft margin SVM introduces
the notion of slack variables and an associated penalty term to accommodate misclassified points. These
slack variables, denoted as ξ, come into play when points either fall within the margin or are incorrectly
classified. They quantify the extent by which a data point resides on the “incorrect” side of its respective
margin hyperplane.
Here, C is a hyperparameter controlling the trade-off between maximizing the margin and minimizing
the classification error, and ξi is the slack variable associated with the i-th training example, measuring the
margin violation.
The dual problem of SVM optimization is frequently tackled due to its lower dimensionality in
comparison to the primal problem. It is important to recognize that although SVMs offer a well-structured
optimization problem, addressing it can necessitate substantial computational resources, particularly
when dealing with very large datasets.
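A short scikit-learn sketch (assuming scikit-learn is installed) of a soft-margin linear SVM on overlapping synthetic blobs, showing the effect of the hyperparameter C.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly linearly separable.
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.1, 10.0):                        # C trades margin width against slack (misclassification)
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)                 # support vectors per class; a larger C usually tolerates fewer violations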
16.DECISION THEORY
Decision theory is a study of an agent's rational choices that supports all kinds of progress in technology
such as work on machine learning and artificial intelligence. Decision theory looks at how decisions are
made, how multiple decisions influence one another, and how decision-making parties deal with
uncertainty.
1. Decision-making Process: At its core, machine learning involves making decisions based on
data. Decision theory formalizes this process by considering the available choices (actions),
possible outcomes, and the probabilities associated with each outcome. In the context of
classification, for example, the decision-making process involves selecting a class label for a
given input based on the observed data.
2. Utility Theory: Decision theory often employs utility functions to quantify the desirability of
different outcomes. In machine learning, utility functions may represent various objectives, such
as accuracy, precision, recall, or any other performance metric relevant to the specific application.
By maximizing expected utility (or minimizing expected loss), machine learning algorithms can
make decisions that lead to the most desirable outcomes on average (a short sketch of this appears after this list).
3. Loss Functions: In supervised learning, the choice of a loss function plays a crucial role in
training machine learning models. Loss functions quantify the discrepancy between predicted
outcomes and true outcomes. Different loss functions correspond to different notions of error, and
their selection depends on the specific characteristics of the problem at hand. Decision theory
provides a principled framework for choosing appropriate loss functions based on the decision-
making objectives and the underlying uncertainty.
4. Bayesian Decision Theory: Bayesian decision theory is particularly relevant in machine learning,
especially in probabilistic modeling and Bayesian inference. It combines Bayesian probability
theory with decision theory to make optimal decisions in uncertain environments. In Bayesian
decision theory, decisions are made by considering both prior knowledge (expressed as a
probability distribution) and observed data, leading to posterior decisions that maximize expected
utility.
5. Risk Minimization: Machine learning algorithms often aim to minimize some notion of risk,
which encompasses the expected loss or error over the entire input space. Decision theory
provides a formal framework for risk minimization, allowing practitioners to design learning
algorithms that make decisions with desirable properties, such as robustness to uncertainty and
generalization to unseen data.
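As a minimal sketch of the expected-loss idea referenced above, the posterior probabilities and loss matrix below are hypothetical values chosen only for illustration.

import numpy as np

# Posterior class probabilities for one input, e.g. P(healthy), P(disease) (hypothetical).
posterior = np.array([0.7, 0.3])

# Loss matrix: loss[action, true_class]; missing a disease (row 0, column 1) is costly.
loss = np.array([[0.0, 10.0],    # action: predict "healthy"
                 [1.0,  0.0]])   # action: predict "disease"

expected_loss = loss @ posterior        # expected loss of each action under the posterior
best_action = int(np.argmin(expected_loss))

print(expected_loss)   # [3.0, 0.7]
print(best_action)     # 1: predicting "disease" minimizes expected loss even though P(disease) < 0.5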
17.INFORMATION THEORY
Machine learning aims to extract interesting signals from data and make critical predictions. On the other
hand, information theory studies encoding, decoding, transmitting, and manipulating information.
Information theory plays a crucial role in the mathematical foundation of machine learning, particularly
in understanding and quantifying various aspects of data, learning, and communication. Here's how
information theory is incorporated into machine learning:
1. Entropy: Entropy is a fundamental concept in information theory that measures the uncertainty or
randomness of a random variable. In machine learning, entropy is often used in decision trees and
other classification algorithms to quantify the impurity or disorder of a set of labels. By
minimizing entropy, these algorithms can construct decision boundaries that effectively separate
different classes (a short sketch of entropy appears after this list).
4. Mutual Information: Mutual information measures the amount of information that two random
variables share. In machine learning, mutual information is used in feature selection and feature
extraction to identify informative features that are relevant to the prediction task. By maximizing
mutual information between features and labels, machine learning algorithms can focus on the
most discriminative aspects of the data.
5. Compression and Coding: Information theory provides insights into data compression and
coding techniques, which are essential for reducing the storage and transmission costs of data. In
machine learning, compression algorithms can be used to pre-process data and reduce its
dimensionality, leading to more efficient learning algorithms and improved generalization
performance.
6. Channel Capacity: Channel capacity represents the maximum rate at which information can be
reliably transmitted over a communication channel. In machine learning, understanding channel
capacity can help in designing efficient communication protocols for distributed learning systems,
where data is transmitted between multiple nodes or devices.
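As a minimal sketch of the entropy and mutual-information ideas referenced above, using a hypothetical joint distribution of one binary feature and a binary label:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))        # entropy in bits

print(entropy([0.5, 0.5]))   # 1.0 bit: perfectly mixed labels (maximum uncertainty)
print(entropy([0.9, 0.1]))   # about 0.47 bits: a skewed, "purer" label distribution

# Mutual information from a hypothetical joint distribution of a feature X and a label Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px, py = joint.sum(axis=1), joint.sum(axis=0)
mi = entropy(px) + entropy(py) - entropy(joint.ravel())   # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(round(mi, 3))   # about 0.278 bits: the feature carries information about the label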