Intro to Machine Learning Basics
Olivier Colliot
A non-technical introduction
to machine learning
Olivier Colliot*1
1 Sorbonne Université, Institut du Cerveau - Paris Brain Institute - ICM,
CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, F-75013, Paris,
France
* Corresponding author: e-mail address: olivier.colliot@cnrs.fr
Abstract
This chapter provides an introduction to machine learning for a non-
technical readership. Machine learning is an approach to artificial intelli-
gence. The chapter thus starts with a brief history of artificial intelligence
in order to put machine learning into this broader scientific context. We
then describe the main general concepts of machine learning. Readers
with a background in computer science may skip this chapter.
1. Introduction
Machine learning (ML) is a scientific domain which aims at allowing
computers to perform tasks without being explicitly programmed to do
so [1]. To that purpose, the computer is trained by examining
examples or experiences. Machine learning is part of a broader field of computer
science called artificial intelligence (AI) which aims at creating comput-
ers with abilities that are characteristic of human or animal intelligence.
This includes tasks such as perception (the ability to recognize images
or sounds), reasoning, decision making or creativity. Emblematic tasks
which are easy to perform for a human and are inherently difficult for a
computer are for instance recognizing objects, faces or animals in pho-
tographs, or recognizing words in speech. On the other hand, there are
also tasks which are inherently easy for a computer and difficult for a hu-
man, such as computing with large numbers or memorizing exactly huge
amounts of text. Machine learning is the AI technique that has achieved
the most impressive successes over the past years. However, it is not the
only approach to AI and conceptually different approaches also exist.
Machine learning also has close ties to other scientific fields. First,
it has evident strong links to statistics. Indeed, most machine learning
approaches exploit statistical properties of the data. Moreover, some
classical approaches used in machine learning were actually invented in
statistics (for instance linear or logistic regression). Nowadays, there is
a constant interplay between progress in statistics and machine learning.
ML also has important ties to signal and image processing: ML
techniques are effective for many applications in these domains, and
signal/image processing concepts are often key to the design or
understanding of ML techniques. There are also various links to different
branches of mathematics, including optimization and differential geom-
etry. Besides, some inspiration for the design of ML approaches comes
from the observation of biological cognitive systems, hence the connec-
tions with cognitive science and neuroscience. Finally, the term data
science has become commonplace to refer to the use of statistical and
computational methods for extracting meaningful patterns from data. In
practice, machine learning and data science share many concepts, tech-
niques and tools. Nevertheless, data science puts more emphasis on the
discovery of knowledge from the data while machine learning focuses on
solving tasks.
This chapter starts by providing a few historical landmarks regarding
artificial intelligence and machine learning (section 2). It then proceeds
with the main concepts of ML, which are foundational to understanding
other chapters of this book.
2. A bit of history
As a scientific endeavour, artificial intelligence is at least 80 years old.
Here, we provide a very brief overview of this history. For more details,
the reader may refer to [2]. A non-exhaustive timeline of AI is shown
in Figure 1.
Although this is debatable, AI is often considered to have emerged in the
1940s–1950s with a series of important concepts and events. In 1943,
the neurophysiologist Warren McCulloch and the logician Walter Pitts
[Figure 1 (timeline of AI), landmark labels: artificial neuron model, Turing test, Dartmouth workshop, perceptron, expert systems, back-propagation, SVM era, deep learning.]
[Figure 2, panel labels: (a) dendrites, cell body, axon, axon terminal; (b) inputs x1, x2, ..., xp, weighted sum, output y.]
Figure 2: (a) Biological neuron. The synapses form the input of the
neuron. Their signals are combined and if the result exceeds a given
threshold, the neuron is activated and produces an output signal which is
sent through the axon. (b) The perceptron: an artificial neuron which is
inspired by biology. It is composed of the set of inputs (which correspond
to the information entering the synapses) xi , which are linearly combined
with weights wi and then go through a non-linear function g to produce
an output y. Image in panel (a) is courtesy of Thibault Rolland.
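As a minimal sketch of the artificial neuron described in panel (b), the following Python function combines the inputs with weights and passes the result through a non-linear function g. The threshold (step) form of g, the bias term `b` and the example weights are assumptions chosen for illustration; the caption does not specify them.

```python
def step(z):
    """Assumed non-linear function g: returns 1 if z exceeds 0, else 0."""
    return 1 if z > 0 else 0


def perceptron(x, w, b=0.0, g=step):
    """Linearly combine the inputs x with the weights w, then apply g."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g(z)


# Hypothetical weights and bias that make the neuron compute a logical AND
# of two binary inputs (illustrative values, not taken from the chapter).
w_and = [1.0, 1.0]
b_and = -1.5
outputs = [perceptron([x1, x2], w_and, b_and) for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

The neuron is activated only when both inputs are 1, because only then does the weighted sum exceed the threshold.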
[Figure 1 (timeline of AI), period labels on a 1940 to 2020 axis: early research, 1st AI winter, expert systems, 2nd AI winter, machine learning.]
sets of rules. They were difficult to maintain and update. They also
performed poorly on perception tasks such as image and speech recognition.
Academic and industrial funding subsequently dropped. This was
the second AI winter.
At this stage, it is probably useful to come back to the two main
families of AI: symbolic and connexionist (Figure 4). They had important
links at the beginning (see for example the work of McCulloch and Pitts
aiming to perform logical operations using artificial neurons) but they
subsequently developed separately. In short, these two families can be
described as follows. The first operates on symbols through sets of logical
rules. It has strong ties to the domain of predicate logic. Connexionism
aims at training networks of artificial neurons. This is done through
the examination of training examples. More generally, it is acceptable
to put most machine learning methods within the connexionist family,
even though they do not rely on artificial neuron models, because their
underlying principle is also to exploit statistical similarities in the training
data. For a more detailed perspective on the two families of AI, the reader
can refer to the very interesting (and even entertaining!) paper of Cardon
et al [12].
Let us come back to our historical timeline. The 1980s saw a rebirth of
connexionism and, more generally, the start of the rise of machine learn-
ing. Interestingly, it is at that time that two of the main conferences
on machine learning started: the International Conference on Machine
Learning (ICML) in 1980 and Neural Information Processing Systems
(NeurIPS, formerly NIPS) in 1987. It had been known for a long time
that neural networks with multiple layers (as opposed to the original per-
ceptron with a single layer) (Figure 5) could solve non-linearly separable
problems but their training remained difficult. The back-propagation al-
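[Figure 4: the two main families of artificial intelligence. Symbolic AI operates on symbols through logical rules. Connexionism covers neural networks and, more generally, machine learning.]
[Figure 5: diagram of a neural network with multiple layers; labels not recoverable.]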
term deep learning. The building blocks of this solution were already
present in the 1980s but there was neither sufficient computing power nor
sufficiently large training datasets for them to work properly. In the meantime, things
had changed. Computers had become exponentially more powerful and,
in particular, the use of graphics processing units (GPUs) considerably
sped up computations. The expansion of the Internet had provided mas-
sive amounts of data of various sorts such as texts and images. In the
subsequent years, deep learning [26] approaches became increasingly so-
phisticated. In parallel, efficient and mature software packages including
TensorFlow [27], PyTorch [28] or Keras [29], whose development is sup-
ported by major companies such as Google and Facebook, enable deep
learning to be used more easily by scientists and engineers.
Artificial intelligence in medicine as a research field is about 50 years
old. In 1975, an expert system, called MYCIN, was proposed to identify
bacteria causing various infectious diseases [30]. More generally, there
was a growing interest in expert systems for medical applications. Med-
ical image processing also quickly became a growing field. The first con-
ference on Information Processing in Medical Imaging (IPMI) was held
in 1977 (it had existed under a different name since 1969). The first SPIE
Medical Image Processing conference took place in 1986 and the Medical
Image Computing and Computer-Assisted Intervention conference
(MICCAI) started in 1998. Image perception tasks, such as segmentation
or classification, soon became key topics of this field,
even though the methods came mostly from traditional image processing
and not from machine learning. In the 2010s, machine learning
approaches became dominant for medical image processing and more
generally in artificial intelligence in medicine.
To conclude this part, it is important to be clear about the different
terms, in particular those of artificial intelligence, machine learning and
deep learning (Figure 6). Machine learning is one approach to artificial
intelligence and other radically different approaches exist. Deep learning
is a specific type of machine learning approach. It has recently obtained
impressive results on some types of data (in particular images and text)
but this does not mean that it is the universal solution to all problems.
As we will see in this book, there are tasks for which other types of
approaches perform best.
[Figure 6: nested sets. Deep learning is contained in machine learning, which is itself contained in artificial intelligence.]
tions that will directly perform the considered task. Instead, one will
write a program that allows the computer to learn how to perform the
task by examining examples or experiences. The output of this learning
process is a computer program itself that performs the desired task, but
this program was not explicitly written. Instead, it has been learned
automatically by the computer.
In 1997, Tom Mitchell gave a more precise definition of a well-posed
machine learning problem [31]:
[Figure: supervised learning examples (panels: classification, regression); axes: age (in years), body mass index.]
3.1.4. Discussion
Unsupervised learning is obviously attractive because it does not re-
quire labels. Indeed, acquiring labels for a training set is usually time-
consuming and expensive because the labels need to be assigned by a
human. This is even more problematic in medicine because the labels
must be provided by experts in the field. It is thus in principle at-
tractive to adopt unsupervised strategies, even for tasks which could be
framed as supervised learning problems. Nevertheless, up to now, the
performance of supervised approaches is often vastly superior in many
[Figure residue: panel titles "Clustering", "Input", "Learning", "Output"; example labels "Racoons", "Trees", "Cat".]
the training set. In other words, we are looking for the function which
minimizes the average error over the training set. Let us call this average
error the cost function:
\[
J(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y^{(i)}, f(x^{(i)})\right)
\]
Learning will then aim at finding the function $\hat{f}$ which minimizes the
cost function:
\[
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell\left(y^{(i)}, f(x^{(i)})\right)
\]
In the above equation, arg min indicates that we are interested in the
function f that minimizes the cost J(f ) and not in the value of the cost
itself. F is the space that contains all admissible functions. F can for
instance be the set of linear functions or the set of neural networks with
a given architecture.
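The cost function and the arg min can be illustrated with a short Python sketch. It assumes a squared-error loss, a made-up training set and a deliberately tiny space F of four linear functions f(x) = w x; none of these choices comes from the chapter, they only make the definitions concrete.

```python
def squared_loss(y_true, y_pred):
    """Assumed loss: squared difference between true and predicted output."""
    return (y_true - y_pred) ** 2


def cost(f, xs, ys, loss=squared_loss):
    """Average loss of the function f over the training set, i.e. J(f)."""
    n = len(xs)
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / n


# Hypothetical training set (made-up numbers).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# A tiny, finite space F of admissible functions: f(x) = w * x.
candidate_slopes = [0.5, 1.0, 2.0, 3.0]
F = {w: (lambda x, w=w: w * x) for w in candidate_slopes}

# arg min: we keep the function (identified by its slope),
# not the value of the cost itself.
best_slope = min(F, key=lambda w: cost(F[w], xs, ys))
print(best_slope, cost(F[best_slope], xs, ys))
```

In realistic settings F is infinite, so the minimizer cannot be found by enumerating candidates as done here; this is why an optimization procedure is needed.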
The procedure that will aim at finding f that minimizes the cost is
called an optimization procedure. Sometimes, the minimum can be found
analytically (i.e. by directly solving an equation for f ) but this will rarely
be the case. In other cases, one will resort to an iterative procedure (i.e.
an algorithm): the function f is iteratively modified until we find the
function which minimizes the cost. There are cases where we will have
an algorithm that is guaranteed to find the global minimum and others
where one will only find a local minimum.
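To illustrate the analytical case, here is a minimal sketch for one of the few situations where the minimum can be written down directly. It assumes a model with a single parameter, f(x) = w1 x, and a squared-error loss; both are illustrative assumptions, and the data are made up.

```python
def mean_squared_error(w1, xs, ys):
    """Cost J for the one-parameter model f(x) = w1 * x."""
    return sum((y - w1 * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_slope_analytically(xs, ys):
    """Closed-form minimizer of the cost above.

    Setting dJ/dw1 = (2/n) * sum(x * (w1 * x - y)) to zero gives
    w1 = sum(x * y) / sum(x * x).
    """
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical training set (made-up numbers).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]

w1 = fit_slope_analytically(xs, ys)
print(w1, mean_squared_error(w1, xs, ys))
```

No iteration is needed here because the cost is a simple quadratic function of w1. For more flexible families of functions, such a formula is usually not available.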
Minimizing the errors on the training set does not guarantee that
the trained computer will perform well on new examples which were
not part of the training set. A first reason may be that the training
set is too different from the general population (for instance, we have
trained a model on a dataset of young males and we would like to apply
it to patients of any gender and age). Another reason is that, even if
the training set characteristics follow those of the general population,
the learned function f may be too specific to the training set. In other
words, it has learned the training set “by heart” but has not discovered a
more general rule that would work for other examples. This phenomenon
is called overfitting and often arises when the dimensionality of the data
is too high (there are many variables to represent an input), when the
training set is too small or when the function f is too flexible. A way to
prevent overfitting will be to modify the cost function so that it not only
represents the average error across training samples but also constrains
the function f to have some specific properties.
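One common way to modify the cost function in this spirit is to add a penalty term. The sketch below assumes a linear model, a squared-error loss and a penalty on the sum of squared weights (often called an L2 or ridge penalty) scaled by a hypothetical coefficient `lam`; the chapter does not prescribe this particular constraint.

```python
def mean_squared_error(w, xs, ys):
    """Average squared error of the linear model f(x) = sum_j w[j] * x[j]."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(wj * xj for wj, xj in zip(w, x))
        total += (y - prediction) ** 2
    return total / len(xs)


def regularized_cost(w, xs, ys, lam):
    """Average error plus a penalty that discourages large weights."""
    penalty = sum(wj ** 2 for wj in w)
    return mean_squared_error(w, xs, ys) + lam * penalty


# Hypothetical data: two input variables, three training samples.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
w = [1.0, 2.0]

print(regularized_cost(w, xs, ys, lam=0.0))  # error term only
print(regularized_cost(w, xs, ys, lam=0.1))  # error term plus penalty
```

With `lam` set to zero the original cost is recovered. Larger values of `lam` favour functions with smaller weights, at the price of a possibly larger error on the training set.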
However, there are cases where the input is not a vector of numbers.
This is the case when the input is a medical image, a text or a DNA
sequence for instance. Of course, in a computer, everything is stored as
numbers. An image is an array of values representing the gray-scale in-
tensity of each pixel (Figure 10). A text is a sequence of characters which
are each coded as a number. However, unlike in the example presented
in Table 1, these numbers are not meaningful by themselves. For this
reason, a common approach is to extract features, which will be series
of numbers that meaningfully represent the input. For example, if the
input is a brain magnetic resonance image (MRI), relevant features could
be the volumes of different anatomical regions of the brain (this specific
process is done using a technique called image segmentation which is cov-
ered in another chapter). This would result in a series of numbers that
would form an input vector. The development of efficient methods for
extracting meaningful features from raw data is important in machine
learning. Such an approach is often called feature engineering. Deep
learning methods avoid explicit feature extraction by providing an
end-to-end approach from the raw data to the output. In some areas,
this has made feature engineering less important but there are still ap-
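The feature extraction example given above (volumes of anatomical regions computed from a segmented image) can be sketched as follows. The segmentation is assumed to be already available as an array of integer region labels; the function name, the label values and the voxel volume are hypothetical.

```python
from collections import Counter


def region_volumes(label_image, voxel_volume_mm3=1.0):
    """Return the volume of each labelled region of a segmented image.

    label_image is a nested list of integer region labels (0 = background).
    """
    counts = Counter(label for row in label_image for label in row if label != 0)
    return {label: n * voxel_volume_mm3 for label, n in sorted(counts.items())}


# Hypothetical 2D segmentation with two regions (labels 1 and 2).
segmentation = [
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]

volumes = region_volumes(segmentation, voxel_volume_mm3=2.0)

# The features, assembled into an input vector for a learning algorithm.
feature_vector = [volumes[label] for label in sorted(volumes)]
print(feature_vector)  # [6.0, 8.0]
```

The raw image contains twelve numbers that are not meaningful individually, whereas the resulting input vector contains two numbers that each have a clear interpretation.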
possible to directly solve $\frac{dJ}{dw_1} = 0$. In general, however,
this equation cannot be solved analytically.
We will thus resort to an iterative algorithm. One classical iterative
method is gradient descent. In the general case, f depends not only on
one parameter $w_1$ but on a set of parameters $(w_1, \ldots, w_p)$ which can be
assembled into a vector w. Thus, instead of working with the derivative
$\frac{dJ}{dw_1}$, we will work with the gradient $\nabla_w J$. The gradient is a vector that
indicates the direction that one should follow to climb along J. We will
thus follow the opposite of the gradient, hence the name gradient descent.
This process is illustrated in Figure 12, together with the corresponding
algorithm.
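A minimal Python sketch of gradient descent is given below. It assumes a linear model f(x) = w · x, a squared-error loss, a fixed step size (named `learning_rate` here) and a stopping rule based on the size of the gradient; these choices and the data are illustrative assumptions and are not taken from Figure 12.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error with respect to the weights w."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        error = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * error * xj / n
    return grad


def gradient_descent(xs, ys, learning_rate=0.1, tolerance=1e-8,
                     max_iterations=10000):
    """Repeatedly step in the direction opposite to the gradient."""
    w = [0.0] * len(xs[0])  # arbitrary starting point
    for _ in range(max_iterations):
        grad = gradient(w, xs, ys)
        # Assumed convergence criterion: the gradient is almost zero.
        if max(abs(g) for g in grad) < tolerance:
            break
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical training set generated by y = 1 * x1 + 2 * x2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]

w = gradient_descent(xs, ys)
print(w)  # close to [1.0, 2.0]
```

The step size matters in practice: if it is too large the iterations may diverge, and if it is too small convergence becomes very slow.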
4. Conclusion
This chapter provided an introduction to machine learning (ML) for a
non-technical readership (e.g. physicians, neuroscientists). ML is an
approach to artificial intelligence and thus needs to be put into this larger
context. We introduced the main concepts underlying ML that will be
further expanded in Chapters 2 to 6. The reader can find a summary of
these main concepts, as well as notations, in Box 3.
repeat
    $w_1 \leftarrow w_1 - \eta \frac{dJ}{dw_1}$
until convergence;
• The input x
• The output y
• The loss: measures the error between the predicted and the
true output, for a given sample
ℓ(y, f (x))
• The cost function: measures the average error across the train-
ing samples
\[
J(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y^{(i)}, f(x^{(i)})\right)
\]
Acknowledgments
The author would like to thank Johann Faouzi for his insightful com-
ments. This work was supported by the French government under man-
agement of Agence Nationale de la Recherche as part of the “Investisse-
ments d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA
Institute) and reference ANR-10-IAIHU-06 (Institut Hospitalo-Universitaire
ICM).
References
[1] Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM Journal of research and development 3(3):210–229
[2] Russell S, Norvig P (2002) Artificial intelligence: a modern approach. Pearson
[3] McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5(4):115–133
[4] Wiener N (1948) Cybernetics or Control and Communication in the Animal and the Machine. MIT press
[5] Hebb DO (1949) The organization of behavior. Wiley
[6] Turing AM (1950) Computing machinery and intelligence. Mind 59(236):433–460
[7] McCarthy J, Minsky ML, Rochester N, Shannon CE (1955) A proposal for the dartmouth summer research project on artificial intelligence. Research Report URL http://raysolomonoff.com/dartmouth/boxa/dart564props.pdf
[8] Newell A, Simon H (1956) The logic theory machine–a complex information processing system. IRE Transactions on information theory 2(3):61–79
[9] Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review 65(6):386
[10] Buchanan BG, Shortliffe EH (1984) Rule-based expert systems: the MYCIN experiments of the Stanford Heuristic Programming Project. Addison-Wesley
[11] McCarthy J (1960) Recursive functions of symbolic expressions and their computation by machine, part I. Communications of the ACM 3(4):184–195
[12] Cardon D, Cointet JP, Mazières A, Libbrecht E (2018) Neurons spike back. Reseaux 5:173–220, URL https://neurovenge.antonomase.fr/RevengeNeurons_Reseaux.pdf
[13] Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
[14] Le Cun Y (1985) Une procédure d’apprentissage pour réseau à seuil assymétrique. Cognitiva 85:599–604
[15] LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1(4):541–551
[16] Matan O, Baird HS, Bromley J, Burges CJC, Denker JS, Jackel LD, Le Cun Y, Pednault EPD, Satterfield WD, Stenard CE, et al (1992) Reading handwritten digits: A zip code recognition system. Computer 25(7):59–63
[17] Legendre AM (1806) Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot
[18] Pearson K (1901) On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science 2(11):559–572
[19] Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of eugenics 7(2):179–188
[20] Loh WY (2014) Fifty years of classification and regression trees. International Statistical Review 82(3):329–348
[21] Quinlan JR (1986) Induction of decision trees. Machine learning 1(1):81–106
[22] Vapnik V (1999) The nature of statistical learning theory. Springer
[23] Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory, pp 144–152
[24] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, et al (2011) Scikit-learn: Machine learning in python. the Journal of machine Learning research 12:2825–2830
[25] Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol 25, pp 1097–1105
[26] LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
[27] Abadi M, Agarwal A, et al (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/, software available from tensorflow.org
[28] Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems, vol 32, pp 8026–8037
[29] Chollet F, et al (2015) Keras. URL https://github.com/fchollet/keras
[30] Shortliffe E (1976) Computer-based medical consultations: MYCIN. Elsevier
[31] Mitchell T (1997) Machine learning. McGraw Hill
[32] Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
[33] Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. URL https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[34] Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805