UCK358E – INTRODUCTION TO ARTIFICIAL INTELLIGENCE
SPRING ‘24
LECTURE 2
MACHINE LEARNING BASICS
Instructor: Barış Başpınar
Learning Algorithms: Tasks
A learning algorithm is an algorithm that performs a task by using data.
A learning problem usually has three components: a task 𝑇, a performance measure 𝑃 and an experience 𝐸
Here are some common learning tasks
Classification: Find a function that maps input data to one of k categories
Example: Object recognition
Classification with missing inputs: Find a function or a collection of functions that maps input data
to one of k categories, where the input vector might have missing values
Example: Medical diagnosis
Regression: Find a function that predicts a continuous output
Example: Predict the expected claim amount an insured person will make
Transcription: Observe some unstructured data and transcribe it into discrete, textual form
Example: Optical character recognition (image-to-text), speech recognition
Learning Algorithms: Tasks
Here are more learning tasks
Machine Translation: Convert from one language to another
Example: Translation for natural languages
Structured Output: Any task whose output is a vector (or other structure) with important relationships among
its elements
Example: Parsing a sentence into nouns, verbs, etc.; image captioning
Anomaly Detection: Flag unusual or atypical objects
Example: Spam filtering, credit card fraud detection
Data Completion: Estimate missing values in data
Example: Product recommendation
Denoising: Recover a clean example from a corrupt one
Example: Image and video denoising
Density Estimation: Estimate the probability density 𝑝(𝒙) that generated the data
Example: Estimate the structure of a PGM
Learning Algorithms: Performance
We need a quantitative measure 𝑃 to measure how well our algorithm is performing on task 𝑇
For classification and translation tasks, we measure the performance by accuracy,
Just divide the number of correctly classified examples by the total number of examples
Performance is evaluated on test data, data that has never been seen by the algorithm before
This set is usually different from the training data
Sometimes selecting the correct performance criteria can be a complicated task itself
For translation, should we measure the accuracy on a word-by-word basis or a sentence-by-sentence
basis?
For regression, should we penalize medium-sized mistakes made frequently, or large mistakes made rarely?
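As a minimal sketch (assuming NumPy arrays of true and predicted labels; the toy labels below are made up), test-set accuracy can be computed as:

    import numpy as np

    def accuracy(y_pred, y_true):
        # fraction of correctly classified examples
        return np.mean(y_pred == y_true)

    # hypothetical test-set labels and predictions
    y_true = np.array([0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 1])
    print(accuracy(y_pred, y_true))  # 0.8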
Learning Algorithms: Experience
Many learning algorithms are categorized as supervised or unsupervised based on the kind of data
they experience
Roughly speaking, unsupervised algorithms try to determine 𝑝(𝒙) from 𝒙
Whereas supervised algorithms try to determine 𝑝( 𝒚 ∣ 𝒙 ) from (𝒙, 𝒚) pairs
However, sometimes the lines between the two categories get blurry
For example, in semi-supervised algorithms the data is only partially labelled
Unsupervised algorithms can also be used for supervised learning
Simply estimate 𝑝( 𝒙, 𝒚 ), then marginalize over $y$ and condition: $p(y \mid \boldsymbol{x}) = p(\boldsymbol{x}, y) \,/\, \sum_{y'} p(\boldsymbol{x}, y')$
And vice versa:
Use the chain rule $p(\boldsymbol{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$ to turn an unsupervised problem into a collection of $n$
supervised problems
Learning Algorithms: Linear Regression Example
Let’s use a linear regression example to clarify these subjects
In linear regression, our model is $\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$,
where $\boldsymbol{w} \in \mathbb{R}^n$ is a weight vector. Hence the task is to find 𝒘 that does well according to some
performance measure.
A popular performance measure for this type of task is the mean squared error, $\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} \bigl(\hat{y}^{(i)} - y^{(i)}\bigr)^2$
The experience is gained by observing the dataset
It can be shown that the optimal solution is given by the normal equations: $\boldsymbol{w} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$
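A minimal NumPy sketch of this closed-form solution (the toy data and variable names are illustrative, not from the slides):

    import numpy as np

    # toy training data: 100 samples, 3 features
    X = np.random.randn(100, 3)
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + 0.1 * np.random.randn(100)

    # normal equations: w = (X^T X)^{-1} X^T y (solve is used instead of an explicit inverse)
    w = np.linalg.solve(X.T @ X, X.T @ y)

    # mean squared error of the fit on the training set
    print(w, np.mean((X @ w - y) ** 2))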
Capacity, Overfitting and Underfitting: Generalization
A central challenge in machine learning is to perform well on new, previously unseen inputs.
This property is known as generalization. The error on the test data is referred to as generalization
error
But how can we say something about the test data, if all we see is training data?
If these datasets were generated arbitrarily, we can’t...
However, if they come from the same distribution 𝑝(𝒙, 𝑦) , then we can say something! For
instance, their mean should be equal
In general, we want to find a 𝒘 that:
I. Makes the training error small
II. Makes the gap between training and test error small
Unfortunately, we can rarely do both well at the same time
Failing at I is known as underfitting: our model’s capacity is not sufficient to capture 𝑝(𝒙, 𝑦)
Succeeding at I but failing at II is known as overfitting: our model’s capacity is so high that it becomes
overtuned to work only on the training data
Capacity, Overfitting and Underfitting: Capacity
Capacity, Overfitting and Underfitting: Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization
error but not its training error.
The most common form of regularization is to modify the loss function so that large parameters
are penalized, e.g. by adding a weight decay term $\lambda\, \boldsymbol{w}^T \boldsymbol{w}$ to the loss
There are other forms of regularization; for instance ‘dropout’ (randomly dropping some of the
units in your model during training) is very popular in deep learning
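A sketch of the weight-decay idea for linear regression (ridge regression); the loss ‖Xw − y‖² + λ‖w‖² still has a closed-form solution. The data and the value of λ below are arbitrary:

    import numpy as np

    def ridge_fit(X, y, lam):
        # minimize ||Xw - y||^2 + lam * ||w||^2
        # closed form: w = (X^T X + lam * I)^{-1} X^T y
        n = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    X = np.random.randn(50, 5)
    y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.5]) + 0.1 * np.random.randn(50)
    print(ridge_fit(X, y, lam=0.1))  # larger lam shrinks the weights toward zero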
Hyperparameters and Validation Sets
Hyperparameters are parameters that control the capacity of our model, as well as the
behaviour of the learning algorithm
Examples: degree of a polynomial, number of layers in a neural net, structure of a PGM...
These are usually set by domain experts
Can we ’learn’ those parameters as well?
We can try optimizing them, this is called hyperparameter optimization
However, optimizing them on the whole training data does not make sense
The optimization procedure would simply select the hyperparameters that result in the highest capacity (i.e.
overfitting)
Similar to how we separated the whole data into test and training, we can further divide the
training data into two disjoint subsets
The smaller part, which is used for guiding the selection of the hyperparameters, is called the validation set
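A sketch of hyperparameter selection with a validation set, using polynomial degree as the hyperparameter; the 80/20 split, the toy data, and the candidate degrees are all illustrative choices:

    import numpy as np

    # toy 1-D regression data
    x = np.random.uniform(-1, 1, 200)
    y = np.sin(3 * x) + 0.1 * np.random.randn(200)

    # split the available training data into a training part and a validation part
    idx = np.random.permutation(200)
    train, val = idx[:160], idx[160:]

    best_deg, best_err = None, np.inf
    for degree in range(1, 10):                          # candidate hyperparameter values
        coeffs = np.polyfit(x[train], y[train], degree)  # fit on the training part
        val_err = np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2)  # evaluate on the validation part
        if val_err < best_err:
            best_deg, best_err = degree, val_err
    print(best_deg, best_err)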
Estimators, Bias and Variance: Estimation
How can we quantify the concept of overfitting/underfitting?
Classical statistics and estimation theory can help with that
We will take the frequentist view for the rest of this section
Point estimation is an attempt to provide a single best estimate for a quantity of interest. We
denote the point estimate of $\boldsymbol{\theta}$ as $\hat{\boldsymbol{\theta}}$
Let $\{x^{(1)}, \ldots, x^{(m)}\}$ be a set of $m$ i.i.d. data points. A point estimator or statistic is any function
of the data: $\hat{\boldsymbol{\theta}}_m = g\bigl(x^{(1)}, \ldots, x^{(m)}\bigr)$
$\{x^{(1)}, \ldots, x^{(m)}\}$ is generated randomly, hence $\hat{\boldsymbol{\theta}}_m$ is a random variable, even though $\boldsymbol{\theta}$ is not
We are interested in analyzing the relationship between $\hat{\boldsymbol{\theta}}_m$ and $\boldsymbol{\theta}$, both for finite $m$ and as $m \to \infty$
Estimators, Bias and Variance: Bias
The bias of an estimator is defined as $\text{Bias}(\hat{\boldsymbol{\theta}}_m) = \mathbb{E}[\hat{\boldsymbol{\theta}}_m] - \boldsymbol{\theta}$, where the expectation is taken over the data
An estimator is unbiased if $\text{Bias}(\hat{\boldsymbol{\theta}}_m) = 0$ and asymptotically unbiased if $\lim_{m \to \infty} \text{Bias}(\hat{\boldsymbol{\theta}}_m) = 0$
Let $x^{(i)}$ be generated from a Bernoulli distribution with mean $\theta$. Then the estimator
$\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ is unbiased
The same estimator is also unbiased when the $x^{(i)}$ are generated from a Gaussian distribution with
mean $\mu$
However, the variance estimator $\hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^{m} \bigl(x^{(i)} - \hat{\mu}_m\bigr)^2$ is only asymptotically unbiased. We can make it
unbiased by setting $\hat{\sigma}_m^2 = \frac{1}{m-1} \sum_{i=1}^{m} \bigl(x^{(i)} - \hat{\mu}_m\bigr)^2$
It might seem like unbiased estimators are always preferable, but this is not necessarily true; we will
soon see that biased estimators can possess some interesting properties
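A quick empirical check of the two variance estimators above (a sketch; the sample size, number of trials, and true variance are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    m, trials, true_var = 10, 100_000, 4.0

    biased, unbiased = [], []
    for _ in range(trials):
        x = rng.normal(0.0, 2.0, m)                           # Gaussian samples with variance 4
        mu_hat = x.mean()
        biased.append(np.sum((x - mu_hat) ** 2) / m)          # divides by m
        unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1))  # divides by m - 1

    print(np.mean(biased))    # ~3.6: underestimates 4 by the factor (m - 1) / m
    print(np.mean(unbiased))  # ~4.0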
Estimators, Bias and Variance: Variance and Standard Error
Another interesting property of an estimator is how much it varies as a function of the data
sample. This is called the variance, $\text{Var}(\hat{\theta})$; its square root is the standard error, $\text{SE}(\hat{\theta})$
Low-variance estimators give similar outputs regardless of the sampled data. Variance is an
excellent measure for quantifying the uncertainty in our estimates.
For instance, for Gaussian distributions we can say that the true mean falls into the
interval $\bigl(\hat{\mu}_m - 1.96\,\text{SE}(\hat{\mu}_m),\ \hat{\mu}_m + 1.96\,\text{SE}(\hat{\mu}_m)\bigr)$
with 95% probability.
Hence ideally, we want both low bias and low variance.
For the Bernoulli distribution, the variance of the estimator is $\text{Var}(\hat{\theta}_m) = \frac{1}{m}\,\theta(1 - \theta)$
The variance decreases as 𝑚 increases, this is a common property of popular estimators
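A sketch verifying that the variance of the Bernoulli mean estimator shrinks like θ(1 − θ)/m; θ, the sample sizes, and the number of repetitions are illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 0.3
    for m in (10, 100, 1000):
        # draw many datasets of size m and compute the estimator on each
        theta_hat = rng.binomial(1, theta, size=(20_000, m)).mean(axis=1)
        print(m, theta_hat.var(), theta * (1 - theta) / m)  # empirical vs. theoretical variance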
Estimators, Bias and Variance: Bias-Variance Decomposition
Bias and variance are two different sources of error in an estimator
Bias measures the expected deviation from the true value.
Variance measures deviation from the expected estimator value among different subsets of data.
Should we choose a model with low bias or low variance?
The answer is given by the key observation that the mean squared error can be decomposed as follows:
$\text{MSE} = \mathbb{E}\bigl[(\hat{\theta}_m - \theta)^2\bigr] = \text{Bias}(\hat{\theta}_m)^2 + \text{Var}(\hat{\theta}_m)$
Hence, for a fixed MSE, there is a trade-off between variance and bias
In general, increasing capacity increases the variance and decreases the bias.
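The decomposition can be verified in a few lines, using $\mathrm{Var}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m^2] - \mathbb{E}[\hat{\theta}_m]^2$:

    \mathrm{MSE} = \mathbb{E}\bigl[(\hat{\theta}_m - \theta)^2\bigr]
                 = \mathbb{E}[\hat{\theta}_m^2] - 2\theta\,\mathbb{E}[\hat{\theta}_m] + \theta^2
                 = \mathrm{Var}(\hat{\theta}_m) + \mathbb{E}[\hat{\theta}_m]^2 - 2\theta\,\mathbb{E}[\hat{\theta}_m] + \theta^2
                 = \mathrm{Var}(\hat{\theta}_m) + \bigl(\mathbb{E}[\hat{\theta}_m] - \theta\bigr)^2
                 = \mathrm{Var}(\hat{\theta}_m) + \mathrm{Bias}(\hat{\theta}_m)^2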
Estimators, Bias and Variance: Bias-Variance Trade-off
Supervised Learning Algorithms: Probabilistic Supervised Learning
Supervised learning is mainly about estimating the probability distribution 𝑝(𝑦 ∣ 𝒙)
To use MLE for this job, we first define a family of probability distributions 𝑝(𝑦 ∣ 𝒙; 𝜽)
The linear regression problem can be expressed by the family $p(y \mid \boldsymbol{x}; \boldsymbol{\theta}) = \mathcal{N}\bigl(y;\, \boldsymbol{\theta}^T \boldsymbol{x},\, \sigma^2\bigr)$
It can be shown that in this case $\boldsymbol{\theta}_{ML}$ has a closed-form expression, which is the same as the least
squares solution!
For binary classification, we can use the logistic sigmoid function: $p(y = 1 \mid \boldsymbol{x}; \boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^T \boldsymbol{x})$ (logistic regression)
No closed form solution in this case, we need to use numerical optimization methods.
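A minimal gradient-descent sketch for this logistic regression model (maximizing the likelihood is equivalent to minimizing the average negative log-likelihood; the toy data, learning rate, and iteration count are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # toy, linearly separable binary classification data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)

    theta, lr = np.zeros(2), 0.1
    for _ in range(1000):
        p = sigmoid(X @ theta)          # p(y = 1 | x; theta) for every sample
        grad = X.T @ (p - y) / len(y)   # gradient of the average negative log-likelihood
        theta -= lr * grad
    print(theta)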
Supervised Learning Algorithms: Parametric Supervised Learning
Not all supervised learning algorithms are probabilistic. We can simply define a parametric
family of deterministic models 𝑦 = 𝑓(𝒙; 𝒘) and use some loss function to find an optimal 𝒘∗
One of the most famous approaches to this problem is the Support Vector Machines (SVMs)
For binary classification, we let 𝑓(𝒙) = 𝒘𝑇 𝒙 + 𝑏
If 𝑓(𝒙) > 0 we predict that the sample belongs to the positive class, and to the negative class otherwise
SVM is useful for two reasons
Solving for optimal 𝒘 is a convex optimization problem
It can be shown that the linear function can be re-written as $\boldsymbol{w}^T \boldsymbol{x} + b = b + \sum_{i=1}^{m} \alpha_i\, \boldsymbol{x}^T \boldsymbol{x}^{(i)}$, where $\boldsymbol{\alpha}$ is a vector of coefficients
The second part is the key result: we can replace $\boldsymbol{x}^T \boldsymbol{x}^{(i)}$ by any inner product and we can
still use the same algorithm!
This is how we generalize to nonlinear classifiers: $f(\boldsymbol{x}) = b + \sum_{i} \alpha_i\, k\bigl(\boldsymbol{x}, \boldsymbol{x}^{(i)}\bigr)$
The function $k(\cdot, \cdot)$ is called a kernel
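A sketch of this kernelized decision function with a Gaussian (RBF) kernel; the coefficients αᵢ and b are assumed to come from some training procedure, which is not shown here:

    import numpy as np

    def rbf_kernel(x, z, gamma=1.0):
        # Gaussian (RBF) kernel: k(x, z) = exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def decision_function(x, X_train, alpha, b, gamma=1.0):
        # f(x) = b + sum_i alpha_i * k(x, x^(i)); predict the positive class if f(x) > 0
        return b + sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

    # toy example with made-up coefficients (normally obtained by solving the SVM problem)
    X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
    alpha, b = np.array([1.0, -1.0]), 0.0
    print(decision_function(np.array([0.1, 0.1]), X_train, alpha, b))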
Supervised Learning Algorithms: Nonparametric Supervised Learning
For most supervised learning algorithms we need to fix the form and number of parameters in
the model beforehand.
Can we also learn those from the data?
Models that allow this are called nonparametric; in some sense, they let the data speak for itself
One of the simplest nonparametric classifiers is the k-nearest neighbors method
Fix a positive integer k (yes, even nonparametric models have parameters)
For a new point, find the k nearest samples in the dataset, compute the average of their classes/outputs,
and simply predict this value for the new point
There are other popular nonparametric classifiers as well, decision trees, random forests etc.
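A minimal k-nearest-neighbors sketch for classification, using Euclidean distance and a majority vote (all data and names are illustrative):

    import numpy as np

    def knn_predict(x, X_train, y_train, k=3):
        dists = np.linalg.norm(X_train - x, axis=1)    # distance from the query to every training sample
        nearest = np.argsort(dists)[:k]                # indices of the k closest samples
        return np.bincount(y_train[nearest]).argmax()  # majority class among the neighbors

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(np.array([0.2, 0.1]), X_train, y_train, k=3))  # predicts class 0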
Unsupervised Learning
There are unsupervised learning algorithms for dimensionality reduction, e.g. PCA
PCA takes unlabeled, difficult-to-interpret data and transforms it into a much more manageable
form
Another popular form of unsupervised learning is clustering
How can we segment the data in a meaningful way?
k-means clustering is a very popular algorithm for this job (a minimal sketch appears at the end of this slide)
Generating association rules is another task that can be solved via unsupervised learning
Association rule learning is an unsupervised method for finding relationships
between variables in large databases.
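A minimal k-means sketch, as referenced above; random initialization from the data and a fixed number of iterations are simplifications:

    import numpy as np

    def kmeans(X, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # initialize centers from random samples
        for _ in range(iters):
            # assign each point to its nearest center
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=2), axis=1)
            # move each center to the mean of its assigned points
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return centers, labels

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    centers, labels = kmeans(X, k=2)
    print(centers)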
References
I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, MIT Press, 2016.