Complete Data Science Cheatsheet
Complete Data Science Cheatsheet
wertyuiopasdfghjklzxcvbnmqw
ertyuiopasdfghjklzxcvbnmqwer
           ALL IMPORTANT
            CHEAT SHEETS
tyuiopasdfghjklzxcvbnmqwerty
                   BY
uiopasdfghjklzxcvbnmqwertyui
        STANFORD UNIVERSITY
opasdfghjklzxcvbnmqwertyuiop
                    &
           Massachusetts
asdfghjklzxcvbnmqwertyuiopas
             Institute of
dfghjklzxcvbnmqwertyuiopasdf
             technology
ghjklzxcvbnmqwertyuiopasdfgh
jklzxcvbnmqwertyuiopasdfghjkl
         PROBABILITY, STATISTICS,
zxcvbnmqwertyuiopasdfghjklzx
        ARTIFICIAL INTELLIGENCE,
           MACHINE LEARNING,
cvbnmqwertyuiopasdfghjklzxcv
            & DEEP LEARNING
bnmqwertyuiopasdfghjklzxcvbn
                    15/10/2020
mqwertyuiopasdfghjklzxcvbnm
qwertyuiopasdfghjklzxcvbnmq
wertyuiopasdfghjklzxcvbnmqw
  Probability–the Science of Uncertainty                               Theorem (Bayes’ rule) Given a partition {A1 , A2 , . . .} of the           Discrete random variables
                                                                       sample space, meaning that ⋃ Ai = Ω and the events are disjoint,
                and Data                                                                                  i                                       Probability mass function and expectation
                               by Fabián Kozynski
                                                                       and if P(Ai ) > 0 for all i, then for every event B, the conditional
                                                                                                                                                  Definition (Random variable) A random variable X is a function
                                                                       probabilities P(Ai ∣B) can be obtained from the conditional
                                                                                                                                                  of the sample space Ω into the real numbers (or Rn ). Its range can
                                                                       probabilities P(B∣Ai ) and the initial probabilities P(Ai ) as follows:
                                                                                                                                                  be discrete or continuous.
Probability                                                                                                   P(Ai )P(B∣Ai )                      Definition (Probability mass funtion (PMF)) The probability law
                                                                                          P(Ai ∣B) =                                 .
Probability models and axioms                                                                             ∑j P(Aj )P(B∣Aj )                       of a discrete random variable X is called its PMF. It is defined as
Definition (Sample space) A sample space Ω is the set of all                                                                                                   pX (x) = P(X = x) = P ({ω ∈ Ω ∶ X(ω) = x}) .
                                                                       Independence
possible outcomes. The set’s elements must be mutually exclusive,
collectively exhaustive and at the right granularity.                  Definition (Independence of events) Two events are independent             Properties
Definition (Event) An event is a subset of the sample space.           if occurrence of one provides no information about the other. We
                                                                       say that A and B are independent if                                        pX (x) ≥ 0, ∀ x.
Probability is assigned to events.
Definition (Probability axioms) A probability law P assigns                                     P(A ∩ B) = P(A)P(B).                              ∑x pX (x) = 1.
probabilities to events and satisfies the following axioms:                                                                                       Example (Bernoulli random variable) A Bernoulli random
                                                                       Equivalently, as long as P(A) > 0 and P(B) > 0,
Nonnegativity P(A) ≥ 0 for all events A.                                                                                                          variable X with parameter 0 ≤ p ≤ 1 (X ∼ Ber(p)) takes the
                                                                                       P(B∣A) = P(B)              P(A∣B) = P(A).                  following values:
Normalization P(Ω) = 1.                                                                                                                                                     ⎧
                                                                                                                                                                            ⎪
                                                                                                                                                                            ⎪1 w.p. p,
(Countable) additivity For every sequence of events A1 , A2 , . . .    Remarks                                                                                          X=⎨
                                                                                                                                                                            ⎪
                                                                                                                                                                            ⎪0 w.p. 1 − p.
                                                                           • The definition of independence is symmetric with respect to                                    ⎩
such that Ai ∩ Aj = ∅: P (⋃ Ai ) = ∑ P(Ai ).
                                 i             i                             A and B.                                                             An indicator random variable of an event (IA = 1 if A occurs) is an
Corollaries (Consequences of the axioms)                                   • The product definition applies even if P(A) = 0 or P(B) = 0.         example of a Bernoulli random variable.
                                                                       Corollary If A and B are independent, then A and B c are                   Example (Discrete uniform random variable) A Discrete uniform
    • P(∅) = 0.
                                                                       independent. Similarly for Ac and B, or for Ac and B c .                   random variable X between a and b with a ≤ b (X ∼ Uni[a, b])
    • For any finite collection of disjoint events A1 , . . . , An ,                                                                              takes any of the values in {a, a + 1, . . . , b} with probability b−a+1
                                                                                                                                                                                                                      1
                                                                                                                                                                                                                          .
           n          n                                                Definition (Conditional independence) We say that A and B are
        P ( ⋃ Ai ) = ∑ P(Ai ).                                         independent conditioned on C, where P(C) > 0, if                           Example (Binomial random variable) A Binomial random
           i=1     i=1                                                                                                                            variable X with parameters n (natural number) and 0 ≤ p ≤ 1
    •   P(A) + P(Ac ) =   1.                                                              P(A ∩ B∣C) = P(A∣C)P(B∣C).                              (X ∼ Bin(n, p)) takes values in the set {0, 1, . . . , n} with
    • P(A) ≤ 1.                                                        Definition (Independence of a collection of events) We say that            probabilities pX (i) = (ni)pi (1 − p)n−i .
                                                                       events A1 , A2 , . . . , An are independent if for every collection of     It represents the number of successes in n independent trials where
    • If A ⊂ B, then P(A) ≤ P(B).
                                                                       distinct indices i1 , i2 , . . . , ik , we have                            each trial has a probability of success p. Therefore, it can also be
    • P(A ∪ B) = P(A) + P(B) − P(A ∩ B).                                                                                                          seen as the sum of n independent Bernoulli random variables, each
                                                                                 P(Ai1 ∩ . . . ∩ Aik ) = P(Ai1 ) ⋅ P(Ai2 )⋯P(Aik ).
    • P(A ∪ B) ≤ P(A) + P(B).                                                                                                                     with parameter p.
Example (Discrete uniform law) Assume Ω is finite and consists         Counting                                                                   Example (Geometric random variable) A Geometric random
of n equally likely elements. Also, assume that A ⊂ Ω with k                                                                                      variable X with parameter 0 ≤ p ≤ 1 (X ∼ Geo(p)) takes values in
                                                                       This section deals with finite sets with uniform probability law. In
elements. Then P(A) = n  k
                           .                                                                                                                      the set {1, 2, . . .} with probabilities pX (i) = (1 − p)i−1 p.
                                                                       this case, to calculate P(A), we need to count the number of
                                                                       elements in A and in Ω.                                                    It represents the number of independent trials until (and including)
Conditioning and Bayes’ rule                                                                                                                      the first success, when the probability of success in each trial is p.
                                                                       Remark (Basic counting principle) For a selection that can be
Definition (Conditional probability) Given that event B has                                                                                       Definition (Expectation/mean of a random variable) The
                                                                       done in r stages, with ni choices at each stage i, the number of
occurred and that P (B) > 0, the probability that A occurs is
                                                                       possible selections is n1 ⋅ n2 ⋯nr .                                       expectation of a discrete random variable is defined as
                                             P(A ∩ B)                  Definition (Permutations) The number of permutations
                          P(A∣B) =
                                         △
                                                      .                                                                                                                     E[X] = ∑ xpX (x).
                                                                                                                                                                                  △
                                              P(B)                     (orderings) of n different elements is
                                                                                                                                                                                      x
Remark (Conditional probabilities properties) They are the same                                       n! = 1 ⋅ 2 ⋅ 3⋯n.                           assuming ∑x ∣x∣pX (x) < ∞.
as ordinary probabilities. Assuming P(B) > 0:
                                                                       Definition (Combinations) Given a set of n elements, the number            Properties (Properties of expectation)
    • P(A∣B) ≥ 0.                                                      of subsets with exactly k elements is
    • P(Ω∣B) = 1                                                                                                                                      • If X ≥ 0 then E[X] ≥ 0.
                                                                                                      n       n!
    • P(B∣B) = 1.                                                                                    ( )=            .                                • If a ≤ X ≤ b then a ≤ E[X] ≤ b.
                                                                                                      k   k!(n − k)!
    • If A ∩ C = ∅, P(A ∪ C∣B) = P(A∣B) + P(C∣B).                                                                                                     • If X = c then E[X] = c.
                                                                       Definition (Partitions) We are given an n−element set and
Proposition (Multiplication rule)                                      nonnegative integers n1 , n2 , . . . , nr , whose sum is equal to n. The   Example Expected value of know r.v.
P(A1 ∩A2 ∩⋯∩An ) = P(A1 )⋅P(A2 ∣A1 )⋯P(An ∣A1 ∩A2 ∩⋯∩An−1 ).           number of partitions of the set into r disjoint subsets, with the ith          • If X ∼ Ber(p) then E[X] = p.
                                                                       subset containing exactly ni elements, is equal to
Theorem (Total probability theorem) Given a partition                                                                                                 • If X = IA then E[X] = P(A).
{A1 , A2 , . . .} of the sample space, meaning that ⋃ Ai = Ω and the                                  n                 n!
                                                          i
                                                                                            (                   )=               .                    • If X ∼ Uni[a, b] then E[X] =        a+b
                                                                                                                                                                                                .
                                                                                                n1 , . . . , nr    n1 !n2 !⋯nr !                                                             2
events are disjoint, and for every event B, we have                                                                                                   • If X ∼ Bin(n, p) then E[X] = np.
                                                                       Remark This is the same as counting how to assign n distinct
                      P(B) = ∑ P(Ai )P(B∣Ai ).
                                     i
                                                                       elements to r people, giving each person i exactly ni elements.                • If X ∼ Geo(p) then E[X] =         1
                                                                                                                                                                                          p
                                                                                                                                                                                            .
Theorem (Expected value rule) Given a random variable X and a              Properties (Properties of joint PMF)                                                             Proposition (Expectation of product of independent r.v.) If X
function g ∶ R → R, we construct the random variable Y = g(X).                                                                                                              and Y are discrete independent random variables,
                                                                                • ∑ ⋯ ∑ pX1 ,...,Xn (x1 , . . . , xn ) = 1.
Then                                                                               x1     xn
          ∑ ypY (y) = E[Y ] = E [g(X)] = ∑ g(x)pX (x).                                                                                                                                               E[XY ] = E[X]E[Y ].
            y                                    x
                                                                                • pX1 (x1 ) = ∑ ⋯ ∑ pX1 ,...,Xn (x1 , x2 , . . . , xn ).
                                                                                                 x2     xn                                                                  Remark If X and Y are independent,
Remark (PMF of Y = g(X)) The PMF of Y = g(X) is                                                                                                                             E [g(X)h(Y )] = E [g(X)] E [h(Y )].
                                                                                • pX2 ,...,Xn (x2 , . . . , xn ) = ∑ pX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ).
pY (y) = ∑ pX (x).                                                                                                       x1
         x∶g(x)=y                                                                                                                                                           Proposition (Variance of sum of independent random variables)
                                                                           Definition (Functions of multiple r.v.) If Z = g(X1 , . . . , Xn ),                              IF X and Y are discrete independent random variables,
Remark In general g (E[X]) ≠ E [g(X)]. They are equal if
                                                                           where g ∶ Rn → R, then pZ (z) = P (g(X1 , . . . , Xn ) = z).
g(x) = ax + b.                                                                                                                                                                                Var(X + Y ) = Var(X) + Var(Y ).
                                                                           Proposition (Expected value rule for multiple r.v.) Given
Variance, conditioning on an event, multiple r.v.                          g ∶ Rn → R,
                                                                                                                                                                            Continuous random variables
Definition (Variance of a random variable) Given a random                   E [g(X1 , . . . , Xn )] =      ∑           g(x1 , . . . , xn )pX1 ,...,Xn (x1 , . . . , xn ).   PDF, Expectation, Variance, CDF
variable X with µ = E[X], its variance is a measure of the spread                                       x1 ,...,xn
of the random variable and is defined as                                                                                                                                    Definition (Probability density function (PDF)) A probability
                                                                           Properties (Linearity of expectations)
                                                                                                                                                                            density function of a r.v. X is a non-negative real valued function
            Var(X) = E [(X − µ) ] = ∑(x − µ) pX (x).
                                    2                 2
                       △
                                                                                • E[aX + b] = aE[X] + b.                                                                    fX that satisfies the following
                                          x
                                                                                • E[X1 + ⋯ + Xn ] = E[X1 ] + ⋯ + E[Xn ].                                                           ∞
Definition (Standard deviation)                                                                                                                                                 • ∫ fX (x)dx = 1.
                              √                                            Conditioning on a random variable, independence                                                        −∞
                        σX = Var(X).                                                                                                                                                                 b
                                                                           Definition (Conditional PMF given another random variable)                                           • P(a ≤ X ≤ b) = ∫ fX (x)dx for some random variable X.
Properties (Properties of the variance)                                    Given discrete random variables X, Y and y such that pY (y) > 0                                                          a
    • Var(aX) = a2 Var(X), for all a ∈ R.                                  we define                                                                                        Definition (Continuous random variable) A random variable X is
                                                                                                           △ pX,Y (x, y)                                                    continuous if its probability law can be described by a PDF fX .
    • Var(X + b) = Var(X), for all b ∈ R.                                                       pX∣Y (x∣y) =             .
                                                                                                               pY (y)                                                       Remark Continuous random variables satisfy:
    • Var(aX + b) = a2 Var(X).
                                                                           Proposition (Multiplication rule) Given jointly discrete random
                                                                                                                                                                                • For small δ > 0, P(a ≤ X ≤ a + δ) ≈ fX (a)δ.
    • Var(X) =      E[X 2 ] − (E[X])2 .                                    variables X, Y , and whenever the conditional probabilities are
Example (Variance of known r.v.)                                           defined,                                                                                             • P(X = a) = 0, ∀a ∈ R.
    • If X ∼ Ber(p), then Var(X) = p(1 − p).                                            pX,Y (x, y) = pX (x)pY ∣X (y∣x) = pY (y)pX∣Y (x∣y).                                 Definition (Expectation of a continuous random variable) The
                                                                                                                                                                            expectation of a continuous random variable is
    • If X ∼ Uni[a, b], then Var(X) =
                                              (b−a)(b−a+2)
                                                           .               Definition (Conditional expectation) Given discrete random
                                                   12                                                                                                                                                          ∞
                                                                           variables X, Y and y such that pY (y) > 0 we define                                                                      E[X] = ∫
                                                                                                                                                                                                         △
                                                                                                                                                                                                                    xfX (x)dx.
    • If X ∼ Bin(n, p), then Var(X) = np(1 − p).                                                                                                                                                               −∞
    • If X ∼ Geo(p), then Var(X) =        1−p                                                      E[X∣Y = y] = ∑ xpX∣Y (x∣y).                                                          ∞
                                           p2                                                                                 x                                             assuming ∫ ∣x∣fX (x)dx < ∞.
Proposition (Conditional PMF and expectation, given an event)                                                                                                                          −∞
                                                                           Additionally we have
Given the event A, with P(A) > 0, we have the following                                                                                                                     Properties (Properties of expectation)
    • pX∣A (x) = P(X = x∣A).                                                                   E [g(X)∣Y = y] = ∑ g(x)pX∣Y (x∣y).
                                                                                                                              x
                                                                                                                                                                                • If X ≥ 0 then E[X] ≥ 0.
    • If A is a subset of the range of X, then:                                                                                                                                 • If a ≤ X ≤ b then a ≤ E[X] ≤ b.
                                 ⎧
                                 ⎪
                                                                           Theorem (Total probability and expectation theorems)
                                 ⎪
                                     1
                                        pX (x), if x ∈ A,
      pX∣A (x) = pX∣{X∈A} (x) = ⎨ P(A)
                △                                                          If pY (y) > 0, then                                                                                                  ∞
                                 ⎪
                                 ⎪0,            otherwise.                                                                                                                      • E [g(X)] = ∫ g(x)fX (x)dx.
                                 ⎩                                                                 pX (x) = ∑ pY (y)pX∣Y (x∣y),                                                                −∞
    • ∑x pX∣A (x) = 1.                                                                                             y                                                            • E[aX + b] = aE[X] + b.
    • E[X∣A] = ∑x xpX∣A (x).                                                                       E[X] = ∑ pY (y)E[X∣Y = y].                                               Definition (Variance of a continuous random variable) Given a
                                                                                                               y
    • E [g(X)∣A] = ∑x g(x)pX∣A (x).                                                                                                                                         continuous random variable X with µ = E[X], its variance is
Proposition (Total expectation rule) Given a partition of disjoint  Definition (Independence of a random variable and an event) A                                                                                     ∞
                                                                    discrete random variable X and an event A are independent if                                                       Var(X) = E [(X − µ)2 ] = ∫         (x − µ)2 fX (x)dx.
events A1 , . . . , An such that ∑i P(Ai ) = 1, and P(Ai ) > 0,                                                                                                                                                      −∞
                                                                    P(X = x and A) = pX (x)P(A), for all x.
          E[X] = P(A1 )E[X∣A1 ] + ⋯ + P(An )E[X∣An ].               Definition (Independence of two random variables) Two discrete                                          It has the same properties as the variance of a discrete random
                                                                    random variables X and Y are independent if                                                             variable.
Definition (Memorylessness of the geometric random variable)
When we condition a geometric random variable X on the event        pX,Y (x, y) = pX (x)pY (y) for all x, y.                                                                Example (Uniform continuous random variable) A Uniform
X > n we have memorylessness, meaning that the “remaining time” Remark (Independence of a collection of random variables) A                                                 continuous random variable X between a and b, with a < b,
X − n, given that X > n, is also geometric with the same parameter. collection X1 , X2 , . . . , Xn of random variables are independent if                                  (X ∼ Uni(a, b)) has PDF
Formally,                                                                                                                                                                                                ⎧
                                                                                                                                                                                                         ⎪
                      pX−n∣X>n (i) = pX (i).                             pX1 ,...,Xn (x1 , . . . , xn ) = pX1 (x1 )⋯pXn (xn ), ∀ x1 , . . . , xn .                                                       ⎪ 1 ,       if a < x < b,
                                                                                                                                                                                                fX (x) = ⎨ b−a
                                                                                                                                                                                                         ⎪
                                                                                                                                                                                                         ⎪0,         otherwise.
Definition (Joint PMF) The joint PMF of random variables                   Remark (Independence and expectation) In general,                                                                             ⎩
X1 , X2 , . . . , Xn is                                                    E [g(X, Y )] ≠ g (E[X], E[Y ]). An exception is for linear functions:
                                                                                                                                                                                                                     (b−a)2
pX1 ,X2 ,...,Xn (x1 , . . . , xn ) = P(X1 = x1 , . . . , Xn = xn ).        E[aX + bY ] = aE[X] + bE[Y ].                                                                    We have E[X] =    a+b
                                                                                                                                                                                               2
                                                                                                                                                                                                    and Var(X) =       12
                                                                                                                                                                                                                            .
Example (Exponential random variable) An Exponential random          Theorem (Total probability and expectation theorems) Given a                 Proposition (Expectation of product of independent r.v.) If X
variable X with parameter λ > 0 (X ∼ Exp(λ)) has PDF                 partition of the space into disjoint events A1 , A2 , . . . , An such that   and Y are independent continuous random variables,
                           ⎧                                         ∑i P(Ai ) = 1 we have the following:
                           ⎪
                           ⎪λe−λx ,          if x ≥ 0,                                                                                                                   E[XY ] = E[X]E[Y ].
                  fX (x) = ⎨                                                  FX (x) = P(A1 )FX∣A1 (x) + ⋯ + P(An )FX∣An (x),
                           ⎪
                           ⎪ 0,               otherwise.
                           ⎩
                                                                               fX (x) = P(A1 )fX∣A1 (x) + ⋯ + P(An )fX∣An (x),                    Remark If X and Y are independent,
We have E[X] =   1
                     and Var(X) =         1                                                                                                       E [g(X)h(Y )] = E [g(X)] E [h(Y )].
                 λ                       λ2
                                            .                                   E[X] = P(A1 )E[X∣A1 ] + ⋯ + P(An )E[X∣An ].
Definition (Cumulative Distribution Function (CDF)) The CDF                                                                                       Proposition (Variance of sum of independent random variables)
of a random variable X is FX (x) = P(X ≤ x).                                                                                                      If X and Y are independent continuous random variables,
                                                                     Definition (Jointly continuous random variables) A pair
In particular, for a continuous random variable, we have
                                                                     (collection) of random variables is jointly continuous if there exists
                                                                                                                                                                   Var(X + Y ) = Var(X) + Var(Y ).
                                     x                               a joint PDF fX,Y that describes them, that is, for every set B ⊂ Rn
                       FX (x) = ∫ fX (x)dx,
                                                                                    P ((X, Y ) ∈ B) = ∬ fX,Y (x, y)dxdy.                          Proposition (Bayes’ rule summary)
                                −∞
                                                                                                               B
                                     dFX (x)                                                                                                          • For X, Y discrete: pX∣Y (x∣y) =
                                                                                                                                                                                              pX (x)pY ∣X (y∣x)
                                                                                                                                                                                                                .
                          fX (x) =           .                       Properties (Properties of joint PDFs)                                                                                         pY (y)
                                       dx
                                                                                       ∞
                                                                         • fX (x) = ∫ fX,Y (x, y)dy.                                                                                             fX (x)fY ∣X (y∣x)
Properties (Properties of CDF)                                                                                                                        • For X, Y continuous: fX∣Y (x∣y) =             fY (y)
                                                                                                                                                                                                                   .
                                                                                     −∞
    • If y ≥ x, then FX (y) ≥ FX (x).                                                                                 x     y
                                                                                                                                                                                                            pX (x)fY ∣X (y∣x)
                                                                         • FX,Y (x, y) = P(X ≤ x, Y ≤ y) = ∫ [ ∫ fX,Y (u, v)dv] du.                   • For X discrete, Y continuous: pX∣Y (x∣y) =                            .
    •   lim FX (x) = 0.                                                                                              −∞ −∞                                                                                       fY (y)
        x→−∞
                                                                                        ∂ 2 FX,Y (x,y)                                                                                                      fX (x)pY ∣X (y∣x)
    • lim FX (x) = 1.                                                    • fX,Y (x) =                  .                                              • For X continuous, Y discrete: fX∣Y (x∣y) =                            .
        x→∞                                                                                  ∂x ∂y                                                                                                               pY (y)
Definition (Normal/Gaussian random variable) A Normal random         Example (Uniform joint PDF on a set S) Let S ⊂ R2 with area
                                                                                                                                                  Derived distributions
variable X with mean µ and variance σ 2 > 0 (X ∼ N (µ, σ 2 )) has    s > 0, then the random variable (X, Y ) is uniform over S if it has
PDF                                                                  PDF                                                                          Proposition (Discrete case) Given a discrete random variable X
                                                                                                      ⎧
                                                                                                      ⎪
                 fX (x) = √
                             1    − 1 (x−µ)2
                                 e 2σ2        .                                                       ⎪ 1 , (x, y) ∈ S,                           and a function g, the r.v. Y = g(X) has PMF
                                                                                        fX,Y (x, y) = ⎨ s
                           2πσ 2                                                                      ⎪0, (x, y) ∈/ S.
                                                                                                      ⎪
                                                                                                      ⎩                                                                 pY (y) =     ∑        pX (x).
We have E[X] = µ and Var(X) = σ 2 .
                                                                     Conditioning on a random variable, independence, Bayes’ rule                                                  x∶g(x)=y
Remark (Standard Normal) The standard Normal is N (0, 1).
                                                                     Definition (Conditional PDF given another random variable)
Proposition (Linearity of Gaussians) Given X ∼ N (µ, σ 2 ), and if                                                                                Remark (Linear function of discrete random variable) If
                                                                     Given jointly continuous random variables X, Y and a value y such
a ≠ 0, then aX + b ∼ N (aµ + b, a2 σ 2 ).                            that fY (y) > 0, we define the conditional PDF as                            g(x) = ax + b, then pY (y) = pX ( y−b
                                                                                                                                                                                     a
                                                                                                                                                                                        ).
Using this Y = X−µ   is a standard gaussian.
                 σ                                                                                             fX,Y (x, y)                        Proposition (Linear function of continuous r.v.) Given a
                                                                                            fX∣Y (x∣y) =
                                                                                                           △
                                                                                                                                .
Conditioning on an event, and multiple continuous r.v.                                                             fY (y)                         continuous random variable X and Y = aX + b, with a ≠ 0, we have
Definition (Conditional PDF given an event) Given a continuous       Additionally we define P(X ∈ A∣Y = y) ∫A fX∣Y (x∣y)dx.                                                         1       y−b
random variable X and event A with P (A) > 0, we define the          Proposition (Multiplication rule) Given jointly continuous                                         fY (y) =       fX (     ).
conditional PDF as the function that satisfies                                                                                                                                     ∣a∣       a
                                                                     random variables X, Y , whenever possible we have
                  P(X ∈ B∣A) = ∫ fX∣A (x)dx.                                  fX,Y (x, y) = fX (x)fY ∣X (y∣x) = fY (y)fX∣Y (x∣y).                 Corollary (Linear function of normal r.v.) If X ∼ N (µ, σ 2 ) and
                                         B
                                                                                                                                                  Y = aX + b, with a ≠ 0, then Y ∼ N (aµ + b, a2 σ 2 ).
Definition (Conditional PDF given X ∈ A) Given a continuous          Definition (Conditional expectation) Given jointly continuous
random variable X and an A ⊂ R, with P (A) > 0:                      random variables X, Y , and y such that fY (y) > 0, we define the            Example (General function of a continuous r.v.) If X is a
                                                                     conditional expected value as                                                continuous random variable and g is any function, to obtain the
                            ⎧
                            ⎪                                                                                                                     pdf of Y = g(X) we follow the two-step procedure:
                            ⎪
                                1
                                   fX (x),          x ∈ A,                                                     ∞
               fX∣X∈A (x) = ⎨ P(A)                                                     E[X∣Y = y] = ∫              xfX∣Y (x∣y)dx.
                            ⎪
                            ⎪0,                     x ∈/ A.                                                                                          1. Find the CDF of Y : FY (y) = P(Y ≤ y) = P (g(X) ≤ y).
                            ⎩
                                                                                                           −∞
                                                                     Additionally we have
Definition (Conditional expectation) Given a continuous random                                                                                       2. Differentiate the CDF of Y to obtain the PDF:
variable X and an event A, with P (A) > 0:                                         E [g(X)∣Y = y] = ∫
                                                                                                               ∞
                                                                                                                   g(x)fX∣Y (x∣y)dx.
                                                                                                                                                                 dFY (y)
                                                                                                                                                        fY (y) = dy      .
                                                                                                           −∞
                                     ∞
                     E[X∣A] = ∫          fX∣A (x)dx.                 Theorem (Total probability and total expectation theorems)
                                                                                                                                                  Proposition (General formula for monotonic g) Let X be a
                                  −∞                                                                                                              continuous random variable and g a function that is monotonic
Definition (Memorylessness of the exponential random variable)                    fX (x) = ∫
                                                                                                     ∞
                                                                                                fY (y)fX∣Y (x∣y)dy,                               wherever fX (x) > 0. The PDF of Y = g(X) is given by
                                                                                            −∞
When we condition an exponential random variable X on the event                              ∞
X > t we have memorylessness, meaning that the “remaining time”                  E[X] = ∫      fY (y)E[X∣Y = y]dy.                                                                             dh
                                                                                                                                                                     fY (y) = fX (h(y)) ∣         (y)∣ .
X − t given that X > t is also geometric with the same parameter                           −∞                                                                                                  dy
i.e.,                                                            Definition (Independence) Jointly continuous random variables
                 P(X − t > x∣X > t) = P(X > x).                  X, Y are independent if fX,Y (x, y) = fX (x)fY (y) for all x, y.                 where h = g −1 in the interval where g is monotonic.
Sums of independent r.v., covariance and correlation                           Definition (Conditional variance as a random variable) Given            The Central Limit Theorem
Proposition (Discrete case) Let X, Y be discrete independent                   random variables X, Y the conditional variance Var(X∣Y ) is the         Theorem (Central Limit Theorem (CLT)) Given a sequence of
random variables and Z = X + Y , then the PMF of Z is                          random variable that takes the value Var(X∣Y = y) whenever              independent random variables {X1 , X2 , . . .} with E[Xi ] = µ and
                                                                               Y = y.                                                                  Var(Xi ) = σ 2 , we define
                      pZ (z) = ∑ pX (x)pY (z − x).                             Theorem (Law of total variance)
                                 x                                                                                                                                                  1     n
                                                                                            Var(X) = E [Var(X∣Y )] + Var (E[X∣Y ]) .                                         Zn =   √     ∑ (Xi − µ).
Proposition (Continuous case) Let X, Y be continuous                                                                                                                                σ n   i=1
independent random variables and Z = X + Y , then the PDF of Z is              Proposition (Sum of a random number of independent r.v.)
                                                                                                                                                       Then, for every z, we have
                                 ∞                                             Let N be a nonnegative integer random variable.
                    fZ (z) = ∫           fX (x)fY (z − x)dx.                   Let X, X1 , X2 , . . . , XN be i.i.d. random variables.                                     lim P(Zn ≤ z) = P(Z ≤ z),
                              −∞                                                                                                                                           n→∞
                                                                               Let Y = ∑i Xi . Then
Proposition (Sum of independent normal r.v.) Let X ∼           N (µx , σx2 )                                                                           where Z ∼ N (0, 1).
and Y ∼ N (µy , σy2 ) independent. Then                                                      E[Y ] = E[N ]E[X],
                                                                                                                                                       Corollary (Normal approximation of a binomial) Let
Z = X + Y ∼ N (µx + µy , σx2 + σy2 ).                                                      Var(Y ) = E[N ] Var(X) + (E[X])2 Var(N ).                   X ∼ Bin(n, p) with n large. Then Sn can be approximated by
Definition (Covariance) We define the covariance of random                                                                                             Z ∼ N (np, np(1 − p)).
variables X, Y as                                                                                                                                      Remark (De Moivre-Laplace 1/2 approximation) Let X ∼ Bin,
                                                                               Convergence of random variables                                         then P(X = i) = P (i − 12 ≤ X ≤ i + 12 ) and we can use the CLT to
            Cov(X, Y ) = E [(X − E[X]) (Y − E[Y ])] .
                          △
                                                                                                                                                       approximate the PMF of X.
                                                                               Inequalities, convergence, and the Weak Law of
Properties (Properties of covariance)                                          Large Numbers
    • If X, Y are independent, then Cov(X, Y ) = 0.                            Theorem (Markov inequality) Given a random variable X ≥ 0 and,
    • Cov(X, X) = Var(X).                                                      for every a > 0 we have
    • Cov(aX + b, Y ) = a Cov(X, Y ).                                                                                 E[X]
                                                                                                         P(X ≥ a) ≤        .
    • Cov(X, Y + Z) = Cov(X, Y ) + Cov(X, Z).                                                                           a
                                                                               Theorem (Chebyshev inequality) Given a random variable X with
    • Cov(X, Y ) = E[XY ] − E[X]E[Y ].
                                                                               E[X] = µ and Var(X) = σ 2 , for every  > 0 we have
Proposition (Variance of a sum of r.v.)
                                                                                                                             σ2
        Var(X1 + ⋯ + Xn ) = ∑ Var(Xi ) + ∑ Cov(Xi , Xj ).                                                P (∣X − µ∣ ≥ ) ≤      .
                                     i                i≠j
                                                                                                                             2
                                                                               Theorem (Weak Law of Large Number (WLLN)) Given a
Definition (Correlation coefficient) We define the correlation                 sequence of i.i.d. random variables {X1 , X2 , . . .} with E[Xi ] = µ
coefficient of random variables X, Y , with σX , σY > 0, as                    and Var(Xi ) = σ 2 , we define
                                             Cov(X, Y )                                                           1 n
                        ρ(X, Y ) =
                                         △
                                                        .                                                  Mn =     ∑ Xi ,
                                               σX σY                                                              n i=1
Properties (Properties of the correlation coefficient)                         for every  > 0 we have
    • −1 ≤ ρ ≤ 1.
                                                                                                      lim P (∣Mn − µ∣ ≥ ) = 0.
                                                                                                     n→∞
    • If X, Y are independent, then ρ = 0.
                                                                               Definition (Convergence in probability) A sequence of random
    • ∣ρ∣ = 1 if and only if X − E[X] = c (Y − E[Y ]).
                                                                               variables {Yi } converges in probability to the random variable Y if
    • ρ(aX + b, Y ) = sign(a)ρ(X, Y ).
                                                                                                      lim P (∣Yi − Y ∣ ≥ ) = 0,
                                                                                                     n→∞
Conditional expectation and variance, sum of
random number of r.v.                                                          for every  > 0.
Definition (Conditional expectation as a random variable) Given                Properties (Properties of convergence in probability) If Xn → a
random variables X, Y the conditional expectation E[X∣Y ] is the               and Yn → b in probability, then
random variable that takes the value E[X∣Y = y] whenever Y = y.                    • Xn + Yn → a + b.
Theorem (Law of iterated expectations)                                             • If g is a continuous function, then g(Xn ) → g(a).
                          E [E[X∣Y ]] = E[X].                                      • E[Xn ] does not always converge to a.
CS 229 – Machine Learning                                                                                                                                                         https://stanford.edu/~shervine
       Super VIP Cheatsheet: Machine Learning                                                                                 4 Machine Learning Tips and Tricks                                                                                          10
                                                                                                                                4.1 Metrics . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   10
                                                                                                                                    4.1.1 Classification . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   10
                   Afshine Amidi and Shervine Amidi                                                                                 4.1.2 Regression . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   10
                                                                                                                                4.2 Model selection . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   11
                               October 6, 2018                                                                                  4.3 Diagnostics . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   11
                                                                                                                              5 Refreshers                                                                                                                12
                                                                                                                                5.1 Probabilities and Statistics . . . . . . . . . . . . . . .                        .   .   .   .   .   .   .   .   .   12
Contents                                                                                                                             5.1.1 Introduction to Probability and Combinatorics                              .   .   .   .   .   .   .   .   .   12
                                                                                                                                     5.1.2 Conditional Probability . . . . . . . . . . . .                            .   .   .   .   .   .   .   .   .   12
1 Supervised Learning                                                                                                 2              5.1.3 Random Variables . . . . . . . . . . . . . . .                             .   .   .   .   .   .   .   .   .   13
  1.1 Introduction to Supervised Learning . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2              5.1.4 Jointly Distributed Random Variables . . . . .                             .   .   .   .   .   .   .   .   .   13
  1.2 Notations and general concepts . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2              5.1.5 Parameter estimation . . . . . . . . . . . . .                             .   .   .   .   .   .   .   .   .   14
  1.3 Linear models . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2         5.2 Linear Algebra and Calculus . . . . . . . . . . . . . .                           .   .   .   .   .   .   .   .   .   14
      1.3.1 Linear regression . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2              5.2.1 General notations . . . . . . . . . . . . . . .                            .   .   .   .   .   .   .   .   .   14
                                                                                                                                     5.2.2 Matrix operations . . . . . . . . . . . . . . .                            .   .   .   .   .   .   .   .   .   15
      1.3.2 Classification and logistic regression        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3
                                                                                                                                     5.2.3 Matrix properties . . . . . . . . . . . . . . .                            .   .   .   .   .   .   .   .   .   15
      1.3.3 Generalized Linear Models . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3
                                                                                                                                     5.2.4 Matrix calculus . . . . . . . . . . . . . . . .                            .   .   .   .   .   .   .   .   .   16
  1.4 Support Vector Machines . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3
  1.5 Generative Learning . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
      1.5.1 Gaussian Discriminant Analysis . .            .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
      1.5.2 Naive Bayes . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
  1.6 Tree-based and ensemble methods . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
  1.7 Other non-parametric approaches . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
  1.8 Learning Theory . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   5
2 Unsupervised Learning                                                                                               6
  2.1 Introduction to Unsupervised Learning .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6
  2.2 Clustering . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6
      2.2.1 Expectation-Maximization . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6
      2.2.2 k-means clustering . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6
      2.2.3 Hierarchical clustering . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6
      2.2.4 Clustering assessment metrics . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6
  2.3 Dimension reduction . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   7
      2.3.1 Principal component analysis . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   7
      2.3.2 Independent component analysis .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   7
3 Deep Learning                                                                                                       8
  3.1 Neural Networks . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   8
  3.2 Convolutional Neural Networks . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   8
  3.3 Recurrent Neural Networks . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   8
  3.4 Reinforcement Learning and Control .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   9
1     Supervised Learning                                                                                      r Cost function – The cost function J is commonly used to assess the performance of a model,
                                                                                                               and is defined with the loss function L as follows:
1.1     Introduction to Supervised Learning                                                                                                                m
                                                                                                                                                           X
                                                                                                                                                 J(θ) =          L(hθ (x(i) ), y (i) )
Given a set of data points {x(1) , ..., x(m) } associated to a set of outcomes {y (1) , ..., y (m) }, we
want to build a classifier that learns how to predict y from x.                                                                                            i=1
r Type of prediction – The different types of predictive models are summed up in the table
below:                                                                                                         r Gradient descent – By noting α ∈ R the learning rate, the update rule for gradient descent
                                                                                                               is expressed with the learning rate and the cost function J as follows:
                              Regression                         Classifier
                                                                                                                                                         θ ←− θ − α∇J(θ)
             Outcome          Continuous                            Class
            Examples       Linear regression     Logistic regression, SVM, Naive Bayes
r Type of model – The different models are summed up in the table below:
                                                                                                               Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training
                                                                                                               example, and batch gradient descent is on a batch of training examples.
      Illustration                                                                                             r Likelihood – The likelihood of a model L(θ) given parameters θ is used to find the optimal
                                                                                                               parameters θ through maximizing the likelihood. In practice, we use the log-likelihood `(θ) =
                                                                                                               log(L(θ)) which is easier to optimize. We have:
1.2     Notations and general concepts                                                                         r Newton’s algorithm – The Newton’s algorithm is a numerical method that finds θ such
                                                                                                               that `0 (θ) = 0. Its update rule is as follows:
r Hypothesis – The hypothesis is noted hθ and is the model that we choose. For a given input
                                                                                                                                                                      `0 (θ)
data x(i) , the model prediction output is hθ (x(i) ).                                                                                                    θ←θ−
                                                                                                                                                                      `00 (θ)
r Loss function – A loss function is a function L : (z,y) ∈ R × Y 7−→ L(z,y) ∈ R that takes as
inputs the predicted value z corresponding to the real data value y and outputs how different                  Remark: the multidimensional generalization, also known as the Newton-Raphson method, has
they are. The common loss functions are summed up in the table below:                                          the following update rule:
                                                                                                                                                                        −1
    Least squared            Logistic                Hinge                  Cross-entropy                                                        θ ← θ − ∇2θ `(θ)               ∇θ `(θ)
       1                                                                                             
         (y − z)2       log(1 + exp(−yz))        max(0,1 − yz)      − y log(z) + (1 − y) log(1 − z)
       2                                                                                                       1.3     Linear models
r LMS algorithm – By noting α the learning rate, the update rule of the Least Mean Squares                          Distribution                 η           T (y)              a(η)                  b(y)
(LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff
learning rule, is as follows:                                                                                                                     φ
                                                                                                                                                       
                                                                                                                        Bernoulli          log   1−φ
                                                                                                                                                              y        log(1 + exp(η))                 1
                                               m                                                                                                                                                                   
                                                                                                                                                                                η2                              2
                                               X                                (i)
                                                                                                                                                                                                     exp − y2
                                                                           
                          ∀j,   θ j ← θj + α          y (i) − hθ (x(i) ) xj                                             Gaussian                 µ            y                  2
                                                                                                                                                                                            √1
                                                                                                                                                                                             2π
                                               i=1
                                                                                                                                                                                                       1
                                                                                                                         Poisson             log(λ)           y                 eη
Remark: the update rule is a particular case of the gradient ascent.                                                                                                                                   y!
r LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that                          Geometric          log(1 − φ)         y           log     eη
                                                                                                                                                                                        
                                                                                                                                                                                                       1
                                                                                                                                                                                 1−eη
weights each training example in its cost function by w(i) (x), which is defined with parameter
τ ∈ R as:
                                                                       
                                                          (x(i) − x)2                                   r Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random
                                 w(i) (x) = exp       −
                                                              2τ 2                                      variable y as a function fo x ∈ Rn+1 and rely on the following 3 assumptions:
Remark: there is no closed form solution for the case of logistic regressions.                                                            1
                                                                                                                                    min     ||w||2         such that       y (i) (wT x(i) − b) > 1
                                                                                                                                          2
r Softmax regression – A softmax regression, also called a multiclass logistic regression, is
used to generalize logistic regression when there are more than 2 outcome classes. By convention,
we set θK = 0, which makes the Bernoulli parameter φi of each class i equal to:
                                                exp(θiT x)
                                      φi =
                                             K
                                             X
                                                     exp(θjT x)
                                               j=1
r Kernel – Given a feature mapping φ, we define the kernel K to be defined as:                                  1.5.2    Naive Bayes
                                         K(x,z) = φ(x)T φ(z)                                                    r Assumption – The Naive Bayes model supposes that the features of each data point are all
                                                                                                              independent:
                                                                      2
                                                              ||x−z||
In practice, the kernel K defined by K(x,z) = exp −              2σ 2
                                                                            is called the Gaussian kernel
                                                                                                                                                                                            n
and is commonly used.                                                                                                                                                                       Y
                                                                                                                                    P (x|y) = P (x1 ,x2 ,...|y) = P (x1 |y)P (x2 |y)... =         P (xi |y)
                                                                                                                                                                                            i=1
                                                                                                                r Solutions – Maximizing the log-likelihood gives the following solutions, with k ∈ {0,1},
                                                                                                                l ∈ [[1,L]]
                                                                                                                                                                                                                (j)
                                                                                                                                 1                                                          #{j|y (j) = k and xi      = l}
                                                                                                                   P (y = k) =     × #{j|y (j) = k}       and      P (xi = l|y = k) =
                                                                                                                                 m                                                                  #{j|y (j) = k}
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel
because we actually don’t need to know the explicit mapping φ, which is often very complicated.
Instead, only the values K(x,z) are needed.                                                                     Remark: Naive Bayes is widely used for text classification and spam detection.
                                                                                                                r Random forest – It is a tree-based technique that uses a high number of decision trees
1.5     Generative Learning                                                                                     built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly
                                                                                                                uninterpretable but its generally good performance makes it a popular algorithm.
A generative model first tries to learn how the data is generated by estimating P (x|y), which
we can then use to estimate P (y|x) by using Bayes’ rule.                                                       Remark: random forests are a type of ensemble methods.
                                                                                                                r Boosting – The idea of boosting methods is to combine several weak learners to form a
                                                                                                                stronger one. The main ones are summed up in the table below:
1.5.1    Gaussian Discriminant Analysis
r Setting – The Gaussian Discriminant Analysis assumes that y and x|y = 0 and x|y = 1 are
such that:                                                                                                                               Adaptive boosting                     Gradient boosting
                                           y ∼ Bernoulli(φ)                                                                       - High weights are put on errors to        - Weak learners trained
                                                                                                                                  improve at the next boosting step          on remaining errors
                           x|y = 0 ∼ N (µ0 ,Σ)   and    x|y = 1 ∼ N (µ1 ,Σ)                                                       - Known as Adaboost
r Estimation – The following table sums up the estimates that we find when maximizing the
likelihood:
                                                                                                                1.7     Other non-parametric approaches
                   φ
                   b             µbj (j = 0,1)                                Σ
                                                                              b                                 r k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a
                                                                                                                non-parametric approach where the response of a data point is determined by the nature of its
                                                                                                                k neighbors from the training set. It can be used in both classification and regression settings.
             m                  Pm                            m
         1   X                       1
                                 i=1 {y (i) =j}
                                                x(i)    1    X
                   1{y(i) =1}      m                               (x(i) − µy(i) )(x(i) − µy(i) )T
                                       1                                                                        Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the
                                 P
         m                                              m
             i=1                    i=1 {y (i) =j}           i=1                                                higher the variance.
                                                                                                           r Upper bound theorem – Let H be a finite hypothesis class such that |H| = k and let δ and
                                                                                                           the sample size m be fixed. Then, with probability of at least 1 − δ, we have:
                                                                                                                                                          r
                                                                                                                                                              1      2k
                                                                                                                                                                    
                                                                                                                                          (h) 6 min (h) + 2
                                                                                                                                            b                   log
                                                                                                                                                 h∈H                   2m         δ
r Hoeffding inequality – Let Z1 , .., Zm be m iid variables drawn from a Bernoulli distribution
of parameter φ. Let φ
                    b be their sample mean and γ > 0 fixed. We have:
                                 P (|φ − φ
                                         b| > γ) 6 2 exp(−2γ 2 m)
                                                m
                                            1   X
                                  b(h) =             1{h(x(i) )6=y(i) }
                                            m
                                                i=1
r Shattering – Given a set S = {x(1) ,...,x(d) }, and a set of classifiers H, we say that H shatters
S if for any set of labels {y (1) , ..., y (d) }, we have:
2.1     Introduction to Unsupervised Learning                                                                           We note c(i) the cluster of data point i and µj the center of cluster j.
                                                                                                                        r Algorithm – After randomly initializing the cluster centroids µ1 ,µ2 ,...,µk ∈ Rn , the k-means
r Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data                           algorithm repeats the following step until convergence:
{x(1) ,...,x(m) }.                                                                                                                                                                          m
                                                                                                                                                                                            X
r Jensen’s inequality – Let f be a convex function and X a random variable. We have the                                                                                                           1{c(i) =j} x(i)
following inequality:                                                                                                                                                                       i=1
                                                                                                                                          c(i) = arg min||x(i) − µj ||2     and      µj =     m
                                         E[f (X)] > f (E[X])                                                                                        j                                            X
                                                                                                                                                                                                       1{c(i) =j}
                                                                                                                                                                                                 i=1
2.2     Clustering
2.2.1     Expectation-Maximization
r Latent variables – Latent variables are hidden/unobserved variables that make estimation
problems difficult, and are often denoted z. Here are the most common settings where there are
latent variables:
                                 Xˆ                                
                                                                       P (x(i) ,z (i) ; θ)
                                                                                             
                                                                                                                                 Ward linkage                Average linkage                           Complete linkage
                   θi = argmax                   Qi (z (i) ) log                                 dz (i)
                             θ           z (i)                            Qi (z (i) )                                        Minimize within cluster     Minimize average distance           Minimize maximum distance
                                   i
                                                                                                                                    distance               between cluster pairs               of between cluster pairs
               k                                                            m
               X                                                            X
        Bk =         nc(i) (µc(i) − µ)(µc(i) − µ)T ,            Wk =              (x(i) − µc(i) )(x(i) − µc(i) )T
               j=1                                                          i=1
the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such
that the higher the score, the more dense and well separated the clusters are. It is defined as
follows:
                                                   Tr(Bk )   N −k
                                        s(k) =             ×
                                                   Tr(Wk )   k−1                                                                   2.3.2    Independent component analysis
                                                                                                                                   It is a technique meant to find the underlying generating sources.
2.3     Dimension reduction                                                                                                        r Assumptions – We assume that our data x has been generated by the n-dimensional source
                                                                                                                                   vector s = (s1 ,...,sn ), where si are independent random variables, via a mixing and non-singular
2.3.1    Principal component analysis                                                                                              matrix A as follows:
                                                                                                                                                                                x = As
It is a dimension reduction technique that finds the variance maximizing directions onto which
to project the data.                                                                                                               The goal is to find the unmixing matrix W = A−1 by an update rule.
r Eigenvalue, eigenvector – Given a matrix A ∈ Rn×n , λ is said to be an eigenvalue of A if
there exists a vector z ∈ Rn \{0}, called eigenvector, such that we have:                                                          r Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by
                                                                                                                                   following the steps below:
                                                    Az = λz
                                                                                                                                      • Write the probability of x = As = W −1 s as:
r Spectral theorem – Let A ∈ Rn×n . If A is symmetric, then A is diagonalizable by a real                                                                                           n
orthogonal matrix U ∈ Rn×n . By noting Λ = diag(λ1 ,...,λn ), we have:
                                                                                                                                                                                    Y
                                                                                                                                                                          p(x) =          ps (wiT x) · |W |
                                       ∃Λ diagonal,            A = U ΛU       T                                                                                                     i=1
Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of
matrix A.                                                                                                                             • Write the log likelihood given our training data {x(i) , i ∈ [[1,m]]} and by noting g the
                                                                                                                                        sigmoid function as:
r Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction
technique that projects the data on k dimensions by maximizing the variance of the data as                                                                              m n
                                                                                                                                                                                                                             !
follows:
                                                                                                                                                                        X X                                
                                                                                                                                                                                              0
                                                                                                                                                              l(W ) =               log g         (wjT x(i) )   + log |W |
   • Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.                                                                                        i=1   j=1
                                                                                                                                   Therefore, the stochastic gradient ascent learning rule is such that for each training example
                       (i)                                    m                                     m
            (i)
                      xj − µj                        1        X       (i)                       1   X       (i)                    x(i) , we update W as follows:
           xj     ←                where        µj =                 xj       and       σj2   =           (xj     − µj )   2
                         σj                          m                                          m
                                                               i=1                                  i=1
                                                                                                                                                                    1 − 2g(w1T x(i) )
                                                                                                                                                                                                                          
                                                                                                                                                                  1 − 2g(w2 x ) (i) T
                                                                                                                                                                              T (i)
                                                                                                                                                       W ←− W + α        ..          x + (W T )−1 
                                                                                                                                                                                                     
                                      m
                                  1   X                 T                                                                                                                   .
   • Step 2: Compute Σ =                    x(i) x(i)       ∈ Rn×n , which is symmetric with real eigenvalues.
                                  m                                                                                                                                 1 − 2g(wn T x(i) )
                                      i=1
   • Step 4: Project the data on spanR (u1 ,...,uk ). This procedure maximizes the variance
     among all k-dimensional spaces.
3     Deep Learning
                                                                                                                                       ∂L(z,y)   ∂L(z,y)   ∂a   ∂z
                                                                                                                                               =         ×    ×
3.1    Neural Networks                                                                                                                   ∂w        ∂a      ∂z   ∂w
Neural networks are a class of models that are built with layers. Commonly used types of neural       As a result, the weight is updated as follows:
networks include convolutional and recurrent neural networks.                                                                                            ∂L(z,y)
                                                                                                                                            w ←− w − η
r Architecture – The vocabulary around neural networks architectures is described in the                                                                   ∂w
figure below:
                                                                                                      r Updating weights – In a neural network, weights are updated as follows:
                                                                                                         • Step 1: Take a batch of training data.
                                                                                                      follows:
                                                                                                                                                   xi − µB
                                                                                                                                           xi ←− γ p       +β
                                                                                                                                                      2 +
                                                                                                                                                     σB
                                                                                                      It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
                                                                                                      aims at allowing higher learning rates and reducing the strong dependence on initialization.
r Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z,y) is
commonly used and is defined as follows:                                                              3.3    Recurrent Neural Networks
                                       h                                i
                           L(z,y) = − y log(z) + (1 − y) log(1 − z)                                   r Types of gates – Here are the different types of gates that we encounter in a typical recurrent
                                                                                                      neural network:
3.4    Reinforcement Learning and Control                                                              r Maximum likelihood estimate – The maximum likelihood estimates for the state transition
                                                                                                       probabilities are as follows:
The goal of reinforcement learning is for an agent to learn how to evolve in an environment.
                                                                                                                                        #times took action a in state s and got to s0
                                                                                                                          Psa (s0 ) =
r Markov decision processes – A Markov decision process (MDP) is a 5-tuple (S,A,{Psa },γ,R)                                                   #times took action a in state s
where:
• S is the set of states r Q-learning – Q-learning is a model-free estimation of Q, which is done as follows:
r Value function – For a given policy π and a given state s, we define the value function V π
as follows:
                                    h                                                      i
                     V π (s) = E R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + ...|s0 = s,π
                                                                                               ∗
r Bellman equation – The optimal Bellman equations characterizes the value function V π
of the optimal policy π ∗ :
                               ∗                                                ∗
                                                          X
                          V π (s) = R(s) + max γ                  Psa (s0 )V π (s0 )
                                                a∈A
                                                         s0 ∈S
Remark: we note that the optimal policy π ∗ for a given state s is such that:
                                                      X
                                   π ∗ (s) = argmax           Psa (s0 )V ∗ (s0 )
                                             a∈A
                                                      s0 ∈S
V0 (s) = 0
4.1.1      Classification                                                                                 r AUC – The area under the receiving operating curve, also noted AUC or AUROC, is the
                                                                                                          area below the ROC as shown in the following figure:
In a context of a binary classification, here are the main metrics that are important to track to
assess the performance of the model.
r Confusion matrix – The confusion matrix is used to have a more complete picture when
assessing the performance of a model. It is defined as follows:
                                       Predicted class
                                   +                     –
                                  TP                  FN
                       +                        False Negatives
                             True Positives
                                                 Type II error
      Actual class                                                                                        4.1.2    Regression
                                  FP                  TN
                       –    False Positives                                                              r Basic metrics – Given a regression model f , the following metrics are commonly used to
                                                True Negatives                                           assess the performance of the model:
                              Type I error
                                                                                                             Total sum of squares             Explained sum of squares               Residual sum of squares
r Main metrics – The following metrics are commonly used to assess the performance of
classification models:                                                                                                   m
                                                                                                                         X                                  m
                                                                                                                                                            X                                    m
                                                                                                                                                                                                 X
                                                                                                               SStot =         (yi − y)2          SSreg =         (f (xi ) − y)2       SSres =         (yi − f (xi ))2
                                                                                                                         i=1                                i=1                                  i=1
         Metric             Formula             Interpretation
                           TP + TN
        Accuracy                                Overall performance of model                             r Coefficient of determination – The coefficient of determination, often noted R2 or r2 ,
                      TP + TN + FP + FN
                                                                                                         provides a measure of how well the observed outcomes are replicated by the model and is defined
                              TP                                                                         as follows:
        Precision                               How accurate the positive predictions are
                            TP + FP                                                                                                                                   SSres
                                                                                                                                                       R2 = 1 −
                              TP                                                                                                                                      SStot
          Recall                                Coverage of actual positive sample
                            TP + FN
        Sensitivity                                                                                      r Main metrics – The following metrics are commonly used to assess the performance of
                              TN                                                                         regression models, by taking into account the number of variables n that they take into consid-
        Specificity                             Coverage of actual negative sample                       eration:
                            TN + FP
                             2TP                                                                                                                                                                    Adjusted R2
         F1 score                               Hybrid metric useful for unbalanced classes                   Mallow’s Cp                     AIC                             BIC
                        2TP + FP + FN
                                                                                                            SSres + 2(n + 1)b
                                                                                                                            σ2                                                                       (1 − R2 )(m − 1)
                                                                                                                                      2 (n + 2) − log(L)          log(m)(n + 2) − 2 log(L)       1−
                                                                                                                    m                                                                                     m−n−1
r ROC – The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by
varying the threshold. These metrics are are summed up in the table below:                                where L is the likelihood and b
                                                                                                                                        σ 2 is an estimate of the variance associated with each response.
Once the model has been chosen, it is trained on the entire dataset and tested on the unseen
test set. These are represented in the figure below:                                                                                                                           h                            i
                                                                                                     ... + λ||θ||1                    ... + λ||θ||22                     ... + λ (1 − α)||θ||1 + α||θ||22
                                                                                                     λ∈R                              λ∈R                                λ ∈ R,    α ∈ [0,1]
                                                                                                    r Model selection – Train model on training set, then evaluate on the development set, then
r Cross-validation – Cross-validation, also noted CV, is a method that is used to select a          pick best performance model on the development set, and retrain all of that model on the whole
model that does not rely too much on the initial training set. The different types are summed       training set.
up in the table below:
                                                                                                    4.3     Diagnostics
                        k-fold                              Leave-p-out
           - Training on k − 1 folds and        - Training on n − p observations and                r Bias – The bias of a model is the difference between the expected prediction and the correct
                                                                                                    model that we try to predict for given data points.
           assessment on the remaining one      assessment on the p remaining ones
           - Generally k = 5 or 10              - Case p = 1 is called leave-one-out                r Variance – The variance of a model is the variability of the model prediction for given data
                                                                                                    points.
The most commonly used method is called k-fold cross-validation and splits the training data        r Bias/variance tradeoff – The simpler the model, the higher the bias, and the more complex
into k folds to validate the model on one fold while training the model on the k − 1 other folds,   the model, the higher the variance.
all of this k times. The error is then averaged over the k folds and is named cross-validation
error.
                                                                                                                             Underfitting                   Just right                 Overfitting
                                                                                                                          - High training error        - Training error            - Low training error
                                                                                                          Symptoms        - Training error close       slightly lower than         - Training error much
                                                                                                                          to test error                test error                  lower than test error
                                                                                                                          - High bias                                              - High variance
Regression
r Regularization – The regularization procedure aims at avoiding the model to overfit the
data and thus deals with high variance issues. The following table sums up the different types
of commonly used regularization techniques:
5 Refreshers
                                                                                                        r Axioms of probability – For each event E, we denote P (E) as the probability of event E
  Deep learning                                                                                         occuring. By noting E1 ,...,En mutually exclusive events, we have the 3 following axioms:
                                                                                                                                                                                n
                                                                                                                                                                                           !       n
                                                                                                                                                                                [                  X
                                                                                                                   (1)   0 6 P (E) 6 1     (2)    P (S) = 1      (3)      P           Ei       =         P (Ei )
                                                                                                                                                                                i=1                i=1
                                                                                                        r Partition – Let {Ai , i ∈ [[1,n]]} be such that for all i, Ai 6= ∅. We say that {Ai } is a partition
                                                                                                        if we have:
                                                                                                                                                                         n
                                                                                                                                                                         [
                                                                                                                                     ∀i 6= j, Ai ∩ Aj = ∅     and              Ai = S
                                                                                                                                                                         i=1
                                                                                                                                                                                  n
                                                                                                                                                                                  X
                                                                                                        Remark: for any event B in the sample space, we have P (B) =                       P (B|Ai )P (Ai ).
                                                                                                                                                                                  i=1
r Extended form of Bayes’ rule – Let {Ai , i ∈ [[1,n]]} be a partition of the sample space.                               r Expectation and Moments of the Distribution – Here are the expressions of the expected
We have:                                                                                                                  value E[X], generalized expected value E[g(X)], kth moment E[X k ] and characteristic function
                                                                                                                          ψ(ω) for the discrete and continuous cases:
                                                        P (B|Ak )P (Ak )
                                        P (Ak |B) =   n
                                                      X
                                                             P (B|Ai )P (Ai )                                                  Case             E[X]                   E[g(X)]                            E[X k ]                  ψ(ω)
                                                       i=1                                                                                  n                      n                                  n                      n
                                                                                                                                            X                      X                              X                          X
                                                                                                                                  (D)             xi f (xi )             g(xi )f (xi )                     xki f (xi )             f (xi )eiωxi
r Independence – Two events A and B are independent if and only if we have:                                                                 i=1                    i=1                            i=1                        i=1
                                                                                                                                        ˆ   +∞                 ˆ   +∞                         ˆ   +∞                     ˆ   +∞
                                              P (A ∩ B) = P (A)P (B)                                                              (C)             xf (x)dx                g(x)f (x)dx                      xk f (x)dx              f (x)eiωx dx
                                                                                                                                         −∞                    −∞                                 −∞                     −∞
                                                F (x) = P (X 6 x)
                                                                                                                          r Transformation of random variables – Let the variables X and Y be linked by some
                                                                                                                          function. By noting fX and fY the distribution function of X and Y respectively, we have:
Remark: we have P (a < X 6 B) = F (b) − F (a).
                                                                                                                                                                                                  dx
                                                                                                                                                                         fY (y) = fX (x)
r Probability density function (PDF) – The probability density function f is the probability                                                                                                      dy
that X takes on values between two adjacent realizations of the random variable.
r Relationships involving the PDF and CDF – Here are the important properties to know                                     r Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that
in the discrete (D) and the continuous (C) cases.                                                                         may depend on c. We have:
                                                                                                                                              ˆ                                       ˆ b
                                                                                                                                           ∂      b             ∂b          ∂a              ∂g
                                                                                                                                                    g(x)dx =       · g(b) −    · g(a) +         (x)dx
 Case                   CDF F                         PDF f                     Properties of PDF                                          ∂c    a              ∂c          ∂c           a ∂c
                        X                                                                      X
  (D)         F (x) =           P (X = xi )    f (xj ) = P (X = xj )     0 6 f (xj ) 6 1 and           f (xj ) = 1
                                                                                                                          r Chebyshev’s inequality – Let X be a random variable with expected value µ and standard
                        xi 6x                                                                      j                      deviation σ. For k, σ > 0, we have the following inequality:
                          ˆ   x                                                           ˆ   +∞                                                                                                      1
                                                              dF
  (C)          F (x) =            f (y)dy          f (x) =                f (x) > 0 and            f (x)dx = 1                                                         P (|X − µ| > kσ) 6
                           −∞                                 dx                          −∞                                                                                                          k2
r Marginal density and cumulative distribution – From the joint density probability                                                 5.1.5    Parameter estimation
function fXY , we have:
                                                                                                                                r Random sample – A random sample is a collection of n random variables X1 , ..., Xn that
       Case              Marginal density                                    Cumulative function                                are independent and identically distributed with X.
                                                                                                                                r Estimator – An estimator θ̂ is a function of the data that is used to infer the value of an
                                                                                                                                unknown parameter θ in a statistical model.
                                    X                                                   XX
          (D)       fX (xi ) =            fXY (xi ,yj )               FXY (x,y) =                      fXY (xi ,yj )
                                    j                                               xi 6x yj 6y                                 r Bias – The bias of an estimator θ̂ is defined as being the difference between the expected
                                ˆ                                               ˆ       ˆ                                       value of the distribution of θ̂ and the true value, i.e.:
                                    +∞                                              x        y
          (C)       fX (x) =              fXY (x,y)dy           FXY (x,y) =                        fXY (x0 ,y 0 )dx0 dy 0                                                  Bias(θ̂) = E[θ̂] − θ
                                 −∞                                              −∞         −∞
r Diagonal matrix – A diagonal matrix D ∈ Rn×n is                    a square matrix with nonzero values in                                          ∀i,j,      i,j = Aj,i
                                                                                                                                                               AT
its diagonal and zero everywhere else:
                                      d1 0 · · ·                     0                                        Remark: for matrices A,B, we have (AB)T = B T AT .
                                                                            
                                           ..   ..                    ..
                                              .    .
                                D= 0                                  .
                                                                                                               r Inverse – The inverse of an invertible square matrix A is noted A−1 and is the only matrix
                                                                           
                                        .. ..   ..
                                                                            
                                         .    .    .                  0                                        such that:
                                       0   ···   0                   dn
                                                                                                                                                     AA−1 = A−1 A = I
Remark: we also note D as diag(d1 ,...,dn ).
                                                                                                               Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1 =
5.2.2    Matrix operations                                                                                     B −1 A−1
                                                                                                               r Trace – The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:
r Vector-vector multiplication – There are two types of vector-vector products:
                                                                                                                                                                   n
   • inner product: for x,y ∈ Rn , we have:
                                                                                                                                                                   X
                                                                                                                                                       tr(A) =           Ai,i
                                                                                                                                                                   i=1
                                                          n
                                                          X
                                               xT y =            xi yi ∈ R
                                                                                                               Remark: for matrices A,B, we have tr(AT ) = tr(A) and tr(AB) = tr(BA)
                                                          i=1
                                                                                                               r Determinant – The determinant of a square matrix A ∈ Rn×n , noted |A| or det(A) is
                                                                                                               expressed recursively in terms of A\i,\j , which is the matrix A without its ith row and j th
   • outer product: for x ∈ Rm , y ∈ Rn , we have:
                                                                                                               column, as follows:
                                         x1 y1           ···     x1 yn                                                                                     n
                                                ..                  ..
                                                                                                                                                             X
                               xy T =                                            ∈ Rm×n                                                    det(A) = |A| =          (−1)i+j Ai,j |A\i,\j |
                                                 .                   .
                                           xm y1          ···     xm yn                                                                                      j=1
                                                                                                               Remark: A is invertible if and only if |A| 6= 0. Also, |AB| = |A||B| and |AT | = |A|.
r Matrix-vector multiplication – The product of matrix A ∈ Rm×n and vector x ∈ Rn is a
vector of size Rm , such that:
                                    
                                        aT
                                                                                                              5.2.3    Matrix properties
                                         r,1 x              n
                                           ..
                                                            X
                             Ax =                    =           ac,i xi ∈ R     m
                                                                                                               r Symmetric decomposition – A given matrix A can be expressed in terms of its symmetric
                                            .
                                         T
                                        ar,m x               i=1                                               and antisymmetric parts as follows:
                                                                                                                                                     A + AT              A − AT
    where aT
           r,i are the vector rows and ac,j are the vector columns of A, and xi are the entries
                                                                                                                                                A=          +
                                                                                                                                                        2                   2
of x.                                                                                                                                                | {z }              | {z }
                                                                                                                                                     Symmetric       Antisymmetric
r Matrix-matrix multiplication – The product of matrices A ∈ Rm×n and B ∈ Rn×p is a
matrix of size Rn×p , such that:
                        
                            aT           ···         aT
                                                                                                              r Norm – A norm is a function N : V −→ [0, + ∞[ where V is a vector space, and such that
                             r,1 bc,1                 r,1 bc,p             n
                                                                                                               for all x,y ∈ V , we have:
                                ..                       ..
                                                                           X
                 AB =                                           =              ac,i bT      n×p
                                                                                             ∈R
                                 .                        .                            r,i
                             T
                            ar,m bc,1    ···          T
                                                     ar,m bc,p             i=1                                    • N (x + y) 6 N (x) + N (y)
tively.
                                                                                                                  • if N (x) = 0, then x = 0
r Transpose – The transpose of a matrix A ∈ Rm×n , noted AT , is such that its entries are
flipped:                                                                                                       For x ∈ V , the most commonly used norms are summed up in the table below:
r Linearly dependence – A set of vectors is said to be linearly dependent if one of the vectors                          ∇A tr(ABAT C) = CAB + C T AB T                          ∇A |A| = |A|(A−1 )T
in the set can be defined as a linear combination of the others.
Remark: if no vector can be written this way, then the vectors are said to be linearly independent.
r Matrix rank – The rank of a given matrix A is noted rank(A) and is the dimension of the
vector space generated by its columns. This is equivalent to the maximum number of linearly
independent columns of A.
A = AT and ∀x ∈ Rn , xT Ax > 0
∃Λ diagonal, A = U ΛU T
A = U ΣV T
                                                                                                        Illustration
r Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where
each input is connected to all neurons. If present, FC layers are usually found towards the end
of CNN architectures and can be used to optimize objectives such as class scores.
                                                                                                                                                                                   - Maximum padding
                                                                                                                         - No padding          - Padding such that feature
                                                                                                                                                                      l m          such that end
                                                                                                                                               map size has size          I
                                                                                                                                                                                   convolutions are
                                                                                                                         - Drops last                                     S
                                                                                                          Purpose                                                                  applied on the limits
                                                                                                                         convolution if        - Output size is
                                                                                                                                                                                   of the input
                                                                                                                         dimensions do not     mathematically convenient
                                                                                                                         match                                                     - Filter ’sees’ the input
                                                                                                                                               - Also called ’half’ padding
                                                                                                                                                                                   end-to-end
Remark: the application of K filters of size F × F results in an output feature map of size
O × O × K.
r Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels       Remark: often times, Pstart = Pend , P , in which case we can replace Pstart + Pend by 2P in
by which the window moves after each operation.                                                        the formula above.
r Understanding the complexity of the model – In order to assess the complexity of a                                     ReLU                        Leaky ReLU                        ELU
model, it is often useful to determine the number of parameters that its architecture will have.
In a given layer of a convolutional neural network, it is done as follows:                                                                         g(z) = max(z,z)           g(z) = max(α(ez − 1),z)
                                                                                                                    g(z) = max(0,z)
                                                                                                                                                      with   1                    with α  1
                           CONV                         POOL                      FC
Illustration
r Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input
that each pixel of the k-th activation map can ’see’. By calling Fj the filter size of layer j and       1.6    Object detection
Si the stride value of layer i and with the convention S0 = 1, the receptive field at layer k can
be computed with the formula:                                                                            r Types of models – There are 3 main types of object recognition algorithms, for which the
                                             k                j−1                                        nature of what is predicted is different. They are described in the table below:
                                             X                Y
                                  Rk = 1 +         (Fj − 1)         Si
                                                                                                                                            Classification
                                             j=1              i=0                                          Image classification                                                      Detection
                                                                                                                                            w. localization
In the example below, we have F1 = F2 = 3 and S1 = S2 = 1, which gives R2 = 1+2 · 1+2 · 1 =
5.
                                                                                                        r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the
                                                                                                        following steps:
                                                                                                           • Step 2: For each grid cell, run a CNN that predicts y of the following form:
        Box of center (bx ,by ), height bh
                                                 Reference points (l1x ,l1y ), ...,(lnx ,lny )                                                                              T
        and width bw                                                                                                           y = pc ,bx ,by ,bh ,bw ,c1 ,c2 ,...,cp ,...        ∈ RG×G×k×(5+p)
                                                                                                                                        |           {z              }
                                                                                                                                             repeated k times
r Intersection over Union – Intersection over Union, also known as IoU, is a function that
quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding                   where pc is the probability of detecting an object, bx ,by ,bh ,bw are the properties of the
box Ba . It is defined as:                                                                                    detected bouding box, c1 ,...,cp is a one-hot representation of which of the p classes were
                                                                                                              detected, and k is the number of anchor boxes.
                                                      Bp ∩ Ba
                                     IoU(Bp ,Ba ) =                                                        • Step 3: Run the non-max suppression algorithm to remove any potential duplicate over-
                                                      Bp ∪ Ba
                                                                                                             lapping bounding boxes.
                                                                                                        Remark: when pc = 0, then the network does not detect any object. In that case, the corre-
                                                                                                        sponding predictions bx , ..., cp have to be ignored.
Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box Bp is considered
as being reasonably good if IoU(Bp ,Ba ) > 0.5.                                                         r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algo-
                                                                                                        rithm that first segments the image to find potential relevant bounding boxes and then run the
r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes.               detection algorithm to find most probable objects in those bounding boxes.
In practice, the network is allowed to predict more than one box simultaneously, where each box
prediction is constrained to have a given set of geometrical properties. For instance, the first
prediction can potentially be a rectangular box of a given form, while the second will be another
rectangular box of a different geometrical form.
r Non-max suppression – The non-max suppression technique aims at removing duplicate
overlapping bounding boxes of a same object by selecting the most representative ones. After
having removed all boxes having a probability prediction lower than 0.6, the following steps are
repeated while there are boxes remaining:
r Types of models – Two main types of model are summed up in table below:
                                                                                                          r Content cost function – The content cost function Jcontent (C,G) is used to determine how
                                                                                                          the generated image G differs from the original content image C. It is defined as follows:
                                                                                                                                                                 1 [l](C)
                                                                                                                                         Jcontent (C,G) =          ||a    − a[l](G) ||2
                                                                                                                                                                 2
                                                                                                          r Style matrix – The style matrix G[l] of a given layer l is a Gram matrix where each of its
                                                                                                                     [l]
                                                                                                          elements Gkk0 quantifies how correlated the channels k and k0 are. It is defined with respect to
r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited              activations a[l] as follows:
training set to learn a similarity function that quantifies how different two given images are. The                                                            [l]   [l]
similarity function applied to two images is often noted d(image 1, image 2).                                                                              n nw
                                                                                                                                                           H X
                                                                                                                                                           X
                                                                                                                                                 [l]                        [l]   [l]
                                                                                                                                                Gkk0   =                   aijk aijk0
r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify
                                                                                                                                                           i=1 j=1
how different two images are. For a given input image x(i) , the encoded output is often noted
as f (x(i) ).
                                                                                                          Remark: the style matrix for the style image and the generated image are noted G[l](S) and
r Triplet loss – The triplet loss ` is a loss function computed on the embedding representation           G[l](G) respectively.
of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive
example belong to a same class, while the negative example to another one. By calling α ∈ R+              r Style cost function – The style cost function Jstyle (S,G) is used to determine how the
the margin parameter, this loss is defined as follows:                                                    generated image G differs from the style S. It is defined as follows:
                                                                                                          r Overall cost function – The overall cost function is defined as being a combination of the
                                                                                                          content and style cost functions, weighted by parameters α,β, as follows:
                                                                                                          Remark: a higher value of α will make the model care more about the content while a higher
                                                                                                          value of β will make it care more about the style.
                                                                                                       2.1    Overview
                                                                                                       r Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs,
                                                                                                       are a class of neural networks that allow previous outputs to be used as inputs while having
                                                                                                       hidden states. They are typically as follows:
Remark: use cases using variants of GANs include text to image, music generation and syn-
thesis.
r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a
high number of layers meant to decrease the training error. The residual block has the following
characterizing equation:
                                    a[l+2] = g(a[l] + z [l+2] )                                        For each timestep t, the activation a<t> and the output y <t> are expressed as follows:
r Inception Network – This architecture uses inception modules and aims at giving a try
at different convolutions in order to increase its performance. In particular, it uses the 1 × 1               a<t> = g1 (Waa a<t−1> + Wax x<t> + ba )          and   y <t> = g2 (Wya a<t> + by )
convolution trick to lower the burden of computation.
                                                                                                       where Wax , Waa , Wya , ba , by are coefficients that are shared temporally and g1 , g2 activation
                                                                                                       functions
                                            ?   ?    ?
The pros and cons of a typical RNN architecture are summed up in the table below:
                                                                                                                             Advantages                                     Drawbacks
                                                                                                             - Possibility of processing input of any length    - Computation being slow
                                                                                                             - Model size not increasing with size of input     - Difficulty of accessing information
                                                                                                             - Computation takes into account                   from a long time ago
                                                                                                             historical information                             - Cannot consider any future input
                                                                                                             - Weights are shared across time                   for the current state
                                                                                                       r Applications of RNNs – RNN models are mostly used in the fields of natural language
                                                                                                       processing and speech recognition. The different applications are summed up in the table below:
                                                                                                                                  1                      ez − e−z
                                                                                                                      g(z) =                    g(z) =                     g(z) = max(0,z)
                                                                                                                               1 + e−z                   ez + e−z
       Many-to-one
                                                                       Sentiment classification
      Tx > 1, Ty = 1
      Many-to-many
                                                                       Name entity recognition
         Tx = Ty                                                                                         r Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are
                                                                                                         often encountered in the context of RNNs. The reason why they happen is that it is difficult
                                                                                                         to capture long term dependencies because of multiplicative gradient that can be exponentially
                                                                                                         decreasing/increasing with respect to the number of layers.
      Many-to-many                                                                                       r Gradient clipping – It is a technique used to cope with the exploding gradient problem
                                                                                                         sometimes encountered when performing backpropagation. By capping the maximum value for
                                                                       Machine translation               the gradient, this phenomenon is controlled in practice.
         Tx 6= Ty
r Loss function – In the case of a recurrent neural network, the loss function L of all time
steps is defined based on the loss at every time step as follows:
                                              Ty
                                              X                                                          r Types of gates – In order to remedy the vanishing gradient problem, specific gates are used
                                    y ,y) =
                                  L(b                 y <t> ,y <t> )
                                                    L(b                                                  in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and
                                              t=1                                                        are equal to:
r Backpropagation through time – Backpropagation is done at each point in time. At                       where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones
timestep T , the derivative of the loss L with respect to weight matrix W is expressed as follows:       are summed up in the table below:
                                                                                                            - Noted ow                                      - Noted ew
                                                                                                            - Naive approach, no similarity information     - Takes into account words similarity
  Dependencies                                                                                        r Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps
                                                                                                      its 1-hot representation ow to its embedding ew as follows:
                                                                                                                                                 ew = Eow
Remark: learning the embedding matrix can be done using target/context likelihood models.
Remark: the sign ? denotes the element-wise multiplication between two vectors.
r Variants of RNNs – The table below sums up the other commonly used RNN architectures:               2.3.2    Word embeddings
                       Bidirectional                              Deep                                r Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the
                         (BRNN)                                                                       likelihood that a given word is surrounded by other words. Popular models include skip-gram,
                                                                 (DRNN)                               negative sampling and CBOW.
                                                                                                      r Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word
                                                                                                      embeddings by assessing the likelihood of any given target word t happening with a context
                                                                                                      word c. By noting θt a parameter associated with t, the probability P (t|c) is given by:
                                                          exp(θtT ec )
                                       P (t|c) =
                                                        |V |
                                                        X
                                                               exp(θjT ec )
                                                        j=1
Remark: summing over the whole vocabulary in the denominator of the softmax part makes
this model computationally expensive. CBOW is another word2vec model using the surrounding
words to predict a given word.
r Negative sampling – It is a set of binary classifiers using logistic regressions that aim at                   2.5    Language model
assessing how a given context and a given target words are likely to appear simultaneously, with
the models being trained on sets of k negative examples and 1 positive example. Given a context                  r Overview – A language model aims at estimating the probability of a sentence P (y).
word c and a target word t, the prediction is expressed by:
                                                                                                                 r n-gram model – This model is a naive approach aiming at quantifying the probability that
                                           P (y = 1|c,t) = σ(θtT ec )                                            an expression appears in a corpus by counting its number of appearance in the training data.
Remark: this method is less computationally expensive than the skip-gram model.                                  r Perplexity – Language models are commonly assessed using the perplexity metric, also
                                                                                                                 known as PP, which can be interpreted as the inverse probability of the dataset normalized by
r GloVe – The GloVe model, short for global vectors for word representation, is a word em-                       the number of words T . The perplexity is such that the lower, the better and is defined as
bedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of                        follows:
times that a target i occurred with a context j. Its cost function J is as follows:                                                                                                      ! T1
                                                                                                                                                        T
                                    |V |
                                                                                                                                                        Y                   1
                               1   X                                                                                                            PP =
                      J(θ) =                f (Xij )(θiT ej + bi + b0j − log(Xij ))2
                                                                                                                                                               P|V |        (t)    (t)
                                                                                                                                                                           yj · b
                                                                                                                                                                                yj
                               2                                                                                                                        t=1        j=1
                                   i,j=1
                                                                                                                 Remark: PP is commonly used in t-SNE.
here f is a weighting function such that Xi,j = 0 =⇒ f (Xi,j ) = 0.
                                                                                        (final)
Given the symmetry that e and θ play in this model, the final word embedding           ew         is given
by:                                                                                                              2.6    Machine translation
                                              (final)          e w + θw                                          r Overview – A machine translation model is similar to a language model except it has an
                                             ew         =
                                                                   2                                             encoder network placed before. For this reason, it is sometimes referred as a conditional language
                                                                                                                 model. The goal is to find a sentence y such that:
Remark: the individual components of the learned word embeddings are not necessarily inter-
pretable.                                                                                                                                    y=      arg max           P (y <1> ,...,y <Ty > |x)
                                                                                                                                                  y <1> ,...,y <Ty >
2.4    Comparing words                                                                                           r Beam search – It is a heuristic search algorithm used in machine translation and speech
                                                                                                                 recognition to find the likeliest sentence y given an input x.
r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:
                                                                                                                    • Step 1: Find top B likely words y <1>
                                            w1 · w2
                            similarity =                  = cos(θ)
                                          ||w1 || ||w2 ||                                                           • Step 2: Compute conditional probabilities y <k> |x,y <1> ,...,y <k−1>
Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.                Remark: the attention scores are commonly used in image captioning and machine translation.
r Beam width – The beam width B is a parameter for beam search. Large values of B yield
to better result but with slower performance and increased memory. Small values of B lead to
worse results but is less computationally intensive. A standard value for B is around 10.
r Length normalization – In order to improve numerical stability, beam search is usually ap-
plied on the following normalized objective, often called the normalized log-likelihood objective,
defined as:
                                           Ty
                                      1    X          h                                        i
                    Objective =                   log p(y <t> |x,y <1> , ..., y <t−1> )
                                     Tyα
                                           t=1
Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.
r Error analysis – When obtaining a predicted translation b  y that is bad, one can wonder why          r Attention weight – The amount of attention that the output y <t> should pay to the
                                                                                                                      0                   0
we did not get a good translation y ∗ by performing the following error analysis:                       activation a<t > is given by α<t,t > computed as follows:
                                                                                                                                                                             0
                                                                                                                                                0                exp(e<t,t       >)
                  Case           P (y ∗ |x) > P (b
                                                 y |x)                  P (y ∗ |x) 6 P (b
                                                                                        y |x)                                           α<t,t       >
                                                                                                                                                        =
                                                                                                                                                                Tx
                                                                                                                                                                                  00
                                                                                                                                                                X
              Root cause        Beam search faulty                            RNN faulty                                                                             exp(e<t,t         >
                                                                                                                                                                                           )
                                                                  - Try different architecture                                                              t00 =1
               Remedies        Increase beam width                - Regularize                          Remark: computation complexity is quadratic with respect to Tx .
                                                                  - Get more data
                                                                                                                                                            ?    ?     ?
r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine
translation is by computing a similarity score based on n-gram precision. It is defined as follows:
                                                                  n
                                                                             !
                                                              1   X
                                 bleu score = exp                       pk
                                                              n
                                                                  k=1
2.7    Attention
r Attention model – This model allows an RNN to pay attention to specific parts of the input
that is considered as being important, which improves the performance of the resulting model
                             0
in practice. By noting α<t,t > the amount of attention that the output y <t> should pay to the
               0
activation a <t  > and c <t> the context at time t, we have:
                                             0
                                                 > <t0 >                              0
                               X                                        X
                      c<t> =         α<t,t        a           with            α<t,t       >
                                                                                              =1
                                t0                                       t0
r Data augmentation – Deep learning models usually need a lot of data to be properly trained.             r Epoch – In the context of training a model, epoch is a term used to refer to one iteration
It is often useful to get more data from the existing ones using data augmentation techniques.            where the model sees the whole training set to update its weights.
The main ones are summed up in the table below. More precisely, given the following input
                                                                                                          r Mini-batch gradient descent – During the training phase, updating weights is usually not
image, here are the techniques that we can apply:
                                                                                                          based on the whole training set at once due to computation complexities or one data point due
                                                                                                          to noise issues. Instead, the update step is done on mini-batches, where the number of data
        Original                  Flip                     Rotation             Random crop               points in a batch is a hyperparameter that we can tune.
                                                                                                          r Loss function – In order to quantify how a given model performs, the loss function L is
                                                                                                          usually used to evaluate to what extent the actual outputs y are correctly predicted by the
                                                                                                          model outputs z.
                                                                                                          r Cross-entropy loss – In the context of binary classification in neural networks, the cross-
                                                                                                          entropy loss L(z,y) is commonly used and is defined as follows:
                                                                                                                                                h                              i
                                                                                                                                    L(z,y) = − y log(z) + (1 − y) log(1 − z)
                                                                               - Random focus
                          - Flipped with respect     - Rotation with           on one part of
    - Image without       to an axis for which       a slight angle            the image
                                                                                                          3.2.2    Finding optimal weights
    any modification      the meaning of the         - Simulates incorrect     - Several random
                          image is preserved         horizon calibration       crops can be               r Backpropagation – Backpropagation is a method to update the weights in the neural network
                                                                               done in a row              by taking into account the actual output and the desired output. The derivative with respect
                                                                                                          to each weight w is computed using the chain rule.
follows:
                                              xi − µB
                                      xi ←− γ p       +β
                                                 2 +
                                                σB
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
r Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier                           - Root Mean Square propagation
                                                                                                                                                                           dw                             db
initialization enables to have initial weights that take into account characteristics that are unique       RMSprop       - Speeds up learning algorithm             w − α√                   b ←− b − α √
to the architecture.                                                                                                      by controlling oscillations
                                                                                                                                                                           sdw                             sdb
r Transfer learning – Training a deep learning model requires a lot of data and more impor-                               - Adaptive Moment estimation
tantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets                                                                            vdw                             vdb
                                                                                                               Adam       - Most popular method                     w − α√                  b ←− b − α √
that took days/weeks to train, and leverage it towards our use case. Depending on how much                                                                                   sdw +                          sdb + 
data we have at hand, here are the different ways to leverage this:                                                       - 4 parameters to tune
                                                                                                         r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training
                                                                   Freezes all layers,                   data by dropping out neurons with probability p > 0. It forces the model to avoid relying too
         Small
                                                                   trains weights on softmax             much on particular sets of features.
                                                                                                         Remark: most deep learning frameworks parametrize dropout through the ’keep’ parameter 1−p.
                                                                   Trains weights on layers              r Weight regularization – In order to make sure that the weights are not too large and that
        Large                                                      and softmax by initializing           the model is not overfitting the training set, regularization techniques are usually performed on
                                                                   weights on pre-trained ones           the model weights. The main ones are summed up in the table below:
r Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the
weights get updated. It can be fixed or adaptively changed. The current most popular method
is called Adam, which is a method that adapts the learning rate.
r Adaptive learning rates – Letting the learning rate vary when training a model can reduce
                                                                                                                                                                                       h                          i
the training time and improve the numerical optimal solution. While Adam optimizer is the                          ... + λ||θ||1                   ... + λ||θ||22             ... + λ (1 − α)||θ||1 + α||θ||22
most commonly used technique, others can also be useful. They are summed up in the table                               λ∈R                             λ∈R
below:                                                                                                                                                                                     λ ∈ R,α ∈ [0,1]
r Early stopping – This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.
? ? ?
                                                                                                                             3 Variables-based models                                                                                                                   12
     Super VIP Cheatsheet: Artificial Intelligence                                                                             3.1 Constraint satisfaction problems . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   12
                                                                                                                                    3.1.1 Factor graphs . . . . . . . .                         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   12
                                                                                                                                    3.1.2 Dynamic ordering . . . . . .                          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   12
                    Afshine Amidi and Shervine Amidi                                                                                3.1.3 Approximate methods . . . .                           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   13
                                                                                                                                    3.1.4 Factor graph transformations                          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   13
                              September 8, 2019                                                                                3.2 Bayesian networks . . . . . . . . . .                        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   14
                                                                                                                                    3.2.1 Introduction . . . . . . . .                          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   14
                                                                                                                                    3.2.2 Probabilistic programs . . . .                        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   15
                                                                                                                                    3.2.3 Inference . . . . . . . . . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   15
Contents
                                                                                                                      4 Logic-based models                                                                                                                              16
                                                                                                                        4.1 Basics . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   16
1 Reflex-based models                                                                                               2   4.2 Knowledge base .            .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   17
  1.1 Linear predictors . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2   4.3 Propositional logic         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   18
       1.1.1 Classification . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2   4.4 First-order logic .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   18
       1.1.2 Regression . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2
  1.2 Loss minimization . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2
  1.3 Non-linear predictors . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3
  1.4 Stochastic gradient descent . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3
  1.5 Fine-tuning models . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3
  1.6 Unsupervised Learning . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
       1.6.1 k-means . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
       1.6.2 Principal Component Analysis           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   4
2 States-based models                                                                                                5
  2.1 Search optimization . . . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .    5
       2.1.1 Tree search . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .    5
       2.1.2 Graph search . . . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .    6
       2.1.3 Learning costs . . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .    7
       2.1.4 A? search . . . . . . . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   .    7
       2.1.5 Relaxation . . . . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .    8
  2.2 Markov decision processes . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .    8
       2.2.1 Notations . . . . . . . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   .    8
       2.2.2 Applications . . . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .    9
       2.2.3 When unknown transitions and rewards               .   .   .   .   .   .   .   .   .   .   .   .   .    9
  2.3 Game playing . . . . . . . . . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   .   10
       2.3.1 Speeding up minimax . . . . . . . . .              .   .   .   .   .   .   .   .   .   .   .   .   .   11
       2.3.2 Simultaneous games . . . . . . . . . .             .   .   .   .   .   .   .   .   .   .   .   .   .   11
       2.3.3 Non-zero-sum games . . . . . . . . . .             .   .   .   .   .   .   .   .   .   .   .   .   .   12
                                                                                                      r Linear regression – Given a weight vector w ∈ Rd and a feature vector φ(x) ∈ Rd , the
1.1     Linear predictors                                                                             output of a linear regression of weights w denoted as fw is given by:
In this section, we will go through reflex-based models that can improve with experience, by                                                    fw (x) = s(x,w)
going through samples that have input-output pairs.
r Feature vector – The feature vector of an input x is noted φ(x) and is such that:                   r Residual – The residual res(x,y,w) ∈ R is defined as being the amount by which the prediction
                                                                                                      fw (x) overshoots the target y:
                                            " φ (x) #
                                               1                                                                                             res(x,y,w) = fw (x) − y
                                   φ(x) =        ..      ∈ Rd
                                                  .
                                              φd (x)
                                                    +1   if w · φ(x) > 0
                                                
                      fw (x) = sign(s(x,w)) =       −1   if w · φ(x) < 0
                                                     ?   if w · φ(x) = 0
                                                                                                       Illustration
                                                                                                      r Regression case – The prediction of a sample x of true label y ∈ R with a linear model of
                                                                                                      weights w can be done with the predictor fw (x) , s(x,w). In this situation, a metric of interest
                                                                                                      quantifying the quality of the regression is given by the margin res(x,y,w) and can be used with
                                                                                                      the following loss functions:
                                                                                                         r Stochastic updates – Stochastic gradient descent (SGD) updates the parameters of the
                                                                                                         model one training example (φ(x),y) ∈ Dtrain at a time. This method leads to sometimes noisy,
                                                                                                         but fast updates.
                                                                                                         r Batch updates – Batch gradient descent (BGD) updates the parameters of the model one
                                                                                                         batch of examples (e.g. the entire training set) at a time. This method computes stable update
                                                                                                         directions, at a greater computational cost.
r Logistic function – The logistic function σ, also called the sigmoid function, is defined as:
                                                                                                                                                                         1
                                                                                                                                        ∀z ∈] − ∞, + ∞[,     σ(z) =
                                                                                                                                                                      1 + e−z
where we note w, b, x, z the weight, bias, input and non-activated output of the neuron respec-
tively.
 - Shrinks coefficients to 0                                      Tradeoff between variable                  r Objective function – The loss function for one of the main clustering algorithms, k-means,
                                   Makes coefficients smaller                                                is given by:
 - Good for variable selection                                    selection and small coefficients
                                                                                                                                                                 n
                                                                                                                                                                 X
                                                                                                                                           Lossk-means (x,µ) =         ||φ(xi ) − µzi ||2
                                                                                                                                                                 i=1
                                                                                                             r Algorithm – After randomly initializing the cluster centroids µ1 ,µ2 ,...,µk ∈ Rn , the k-means
                                                                                                             algorithm repeats the following step until convergence:
                                                                                                                                                                                m
                                                                                                                                                                                X
                                                                                                                                                                                       1{zi =j} φ(xi )
                                                                                                                                                                                 i=1
                                                                                                                              zi = arg min||φ(xi ) − µj ||   2
                                                                                                                                                                 and     µj =       m
                                                                                                                                       j                                            X
                                                                        h                            i                                                                                      1{zi =j}
 ... + λ||θ||1                     ... + λ||θ||22                 ... + λ (1 − α)||θ||1 + α||θ||22                                                                                  i=1
 λ∈R                               λ∈R                            λ ∈ R,    α ∈ [0,1]
r Sets vocabulary – When selecting a model, we distinguish 3 different parts of the data that
we have as follows:
r Spectral theorem – Let A ∈ Rn×n . If A is symmetric, then A is diagonalizable by a real                            2     States-based models
orthogonal matrix U ∈ Rn×n . By noting Λ = diag(λ1 ,...,λn ), we have:
                                     ∃Λ diagonal,             A = U ΛU T                                             2.1     Search optimization
                                                                                                                     In this section, we assume that by accomplishing action a from state s, we deterministically
Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of                    arrive in state Succ(s,a). The goal here is to determine a sequence of actions (a1 ,a2 ,a3 ,a4 ,...)
matrix A.                                                                                                            that starts from an initial state and leads to an end state. In order to solve this kind of problem,
                                                                                                                     our objective will be to find the minimum cost path by using states-based models.
r Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction
technique that projects the data on k dimensions by maximizing the variance of the data as
follows:
                                                                                                                     2.1.1     Tree search
   • Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.
                                                                                                                     This category of states-based algorithms explores all possible states and actions. It is quite
                                                                                                                     memory efficient, and is suitable for huge state spaces but the runtime can become exponential
                                                                                                                     in the worst cases.
                       (i)                                    m                              m
            (i)
                      xj − µj                             1   X      (i)                 1   X       (i)
           xj     ←             where         µj =                  xj     and   σj2 =             (xj − µj )2
                         σj                               m                              m
                                                              i=1                            i=1
                                    m
                                1   X                 T
   • Step 2: Compute Σ =                  x(i) x(i)       ∈ Rn×n , which is symmetric with real eigenvalues.
                                m
                                    i=1
   • Step 4: Project the data on spanR (u1 ,...,uk ). This procedure maximizes the variance
                                                                                                                     r Search problem – A search problem is defined with:
     among all k-dimensional spaces.
                                                                                                                         • a starting state sstart
                                                                                                                     r Breadth-first search (BFS) – Breadth-first search is a graph search algorithm that does a
                                                                                                                     level-by-level traversal. We can implement it iteratively with the help of a queue that stores at
each step future nodes to be visited. For this algorithm, we can assume action costs to be equal          r Graph – A graph is comprised of a set of vertices V (also called nodes) as well as a set of
to a constant c > 0.                                                                                      edges E (also called links).
r Depth-first search (DFS) – Depth-first search is a search algorithm that traverses a graph
by following each path as deep as it can. We can implement it recursively, or iteratively with
the help of a stack that stores at each step future nodes to be visited. For this algorithm, action       Remark: a graph is said to be acylic when there is no cycle.
costs are assumed to be equal to 0.
                                                                                                          r State – A state is a summary of all past actions sufficient to choose future actions optimally.
                                                                                                                                      0                                                   if IsEnd(s)
                                                                                                                                  
                                                                                                                FutureCost(s) =
                                                                                                                                                 
                                                                                                                                          min        Cost(s,a) + FutureCost(Succ(s,a))      otherwise
                                                                                                                                      a∈Actions(s)
                                                                                                          Remark: the figure above illustrates a bottom-to-top approach whereas the formula provides the
2.1.2    Graph search                                                                                     intuition of a top-to-bottom problem resolution.
This category of states-based algorithms aims at constructing optimal paths, enabling exponen-            r Types of states – The table below presents the terminology when it comes to states in the
tial savings. In this section, we will focus on dynamic programming and uniform cost search.              context of uniform cost search:
                     State         Explanation                                                                 • decreases the estimated cost of each state-action of the true minimizing path y given by
                                                                                                                 the training data,
                                   States for which the optimal path has
                  Explored E                                                                                   • increases the estimated cost of each state-action of the current predicted path y 0 inferred
                                   already been found
                                                                                                                 from the learned weights.
                                   States seen for which we are still figuring out
                   Frontier F
                                   how to get there with the cheapest cost                                  Remark: there are several versions of the algorithm, one of which simplifies the problem to only
                                                                                                            learning the cost of each action a, and the other parametrizes Cost(s,a) to a feature vector of
                 Unexplored U      States not seen yet                                                      learnable weights.
r Uniform cost search – Uniform cost search (UCS) is a search algorithm that aims at finding                2.1.4    A? search
the shortest path from a state sstart to an end state send . It explores states s in increasing order
of PastCost(s) and relies on the fact that all action costs are non-negative.                               r Heuristic function – A heuristic is a function h over states s, where each h(s) aims at
                                                                                                            estimating FutureCost(s), the cost of the path from s to send .
                                                                                                            r Algorithm – A∗ is a search algorithm that aims at finding the shortest path from a state s to
                                                                                                            an end state send . It explores states s in increasing order of PastCost(s) + h(s). It is equivalent
                                                                                                            to a uniform cost search with edge costs Cost0 (s,a) given by:
                                                                                                                                      Cost0 (s,a) = Cost(s,a) + h(Succ(s,a)) − h(s)
                                                                                                            Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be
                                                                                                            closer to the end state.
Remark 1: the UCS algorithm is logically equivalent to Djikstra’s algorithm.
Remark 2: the algorithm would not work for a problem with negative action costs, and adding a               r Consistency – A heuristic h is said to be consistent if it satisfies the two following properties:
positive constant to make them non-negative would not solve the problem since this would end
up being a different problem.                                                                                  • For all states s and actions a,
r Correctness theorem – When a state s is popped from the frontier F and moved to explored                                                      h(s) 6 Cost(s,a) + h(Succ(s,a))
set E, its priority is equal to PastCost(s) which is the minimum cost path from sstart to s.
r Graph search algorithms summary – By noting N the number of total states, n of which
are explored before the end state send , we have:
Remark: the complexity countdown supposes the number of possible actions per state to be
constant.                                                                                                      • The end state verifies the following:
                                                                                                                                                           h(send ) = 0
2.1.3     Learning costs
Suppose we are not given the values of Cost(s,a), we want to estimate these quantities from a
training set of minimizing-cost-path sequence of actions (a1 , a2 , ..., ak ).
r Correctness – If h is consistent, then A∗ returns the minimum cost path.                                r Max heuristic – Let h1 (s), h2 (s) be two heuristics. We have the following property:
r Admissibility – A heuristic h is said to be admissible if we have:                                                       h1 (s), h2 (s) consistent =⇒ h(s) = max{h1 (s), h2 (s)} consistent
                                      h(s) 6 FutureCost(s)
                                                                                                          2.2     Markov decision processes
r Theorem – Let h(s) be a given heuristic. We have:
                                                                                                          In this section, we assume that performing action a from state s can lead to several states s01 ,s02 ,...
                               h(s) consistent =⇒ h(s) admissible                                         in a probabilistic manner. In order to find our way between an initial state and an end state,
                                                                                                          our objective will be to find the maximum value policy by using Markov decision processes that
                                                                                                          help us cope with randomness and uncertainty.
r Efficiency – A∗ explores all states s satisfying the following equation:
                              PastCost(s) 6 PastCost(send ) − h(s)                                        2.2.1     Notations
                                                                                                          r Definition – The objective of a Markov decision process is to maximize rewards. It is defined
                                                                                                          with:
Remark: larger values of h(s) is better as this equation shows it will restrict the set of states s          • whether an end state was reached IsEnd(s)
going to be explored.
                                                                                                             • a discount factor 0 6 γ 6 1
2.1.5    Relaxation
It is a framework for producing consistent heuristics. The idea is to find closed-form reduced
costs by removing constraints and use them as heuristics.
r Relaxed search problem – The relaxation of search problem P with costs Cost is noted
Prel with costs Costrel , and satisfies the identity:
                                     Costrel (s,a) 6 Cost(s,a)
r Relaxed heuristic – Given a relaxed search problem Prel , we define the relaxed heuristic
h(s) = FutureCostrel (s) as the minimum cost path from s to an end state in the graph of costs
Costrel (s,a).                                                                                            r Transition probabilities – The transition probability T (s,a,s0 ) specifies the probability
                                                                                                          of going to state s0 after action a is taken in state s. Each s0 7→ T (s,a,s0 ) is a probability
r Consistency of relaxed heuristics – Let Prel be a given relaxed problem. By theorem, we                 distribution, which means that:
have:                                                                                                                                                    X
                          h(s) = FutureCostrel (s) =⇒ h(s) consistent                                                                        ∀s,a,                  T (s,a,s0 ) = 1
                                                                                                                                                      s0 ∈ States
r Tradeoff when choosing heuristic – We have to balance two aspects in choosing a heuristic:
                                                                                                          r Policy – A policy π is a function that maps each state s to an action a, i.e.
   • Computational efficiency: h(s) = FutureCostrel (s) must be easy to compute. It has to                                                               π : s 7→ a
     produce a closed form, easier search and independent subproblems.
   • Good enough approximation: the heuristic h(s) should be close to FutureCost(s) and we                r Utility – The utility of a path (s0 , ..., sk ) is the discounted sum of the rewards on that path.
     have thus to not remove too many constraints.                                                        In other words,
                                                               k
                                                                                                                                                       X                                                   
                                                               X                                                                  Qopt (s,a) =                 T (s,a,s0 ) Reward(s,a,s0 ) + γVopt (s0 )
                                         u(s0 ,...,sk ) =              ri γ   i−1
                                                                                                                                                 s0 ∈ States
                                                               i=1
                                                                                                                 r Optimal value – The optimal value Vopt (s) of state s is defined as being the maximum value
                                                                                                                 attained by any policy. It is computed as follows:
                                                                                                                                                   Vopt (s) =           max           Qopt (s,a)
                                                                                                                                                                  a∈ Actions(s)
Remark: the figure above is an illustration of the case k = 4.                                                   r Optimal policy – The optimal policy πopt is defined as being the policy that leads to the
                                                                                                                 optimal values. It is defined by:
r Q-value – The Q-value of a policy π by taking action a from state s, also noted Qπ (s,a), is
the expected utility of taking action a from state s and then following policy π. It is defined as                                           ∀s,        πopt (s) =       argmax         Qopt (s,a)
follows:                                                                                                                                                              a∈ Actions(s)
                                      X                                                       
                    Qπ (s,a) =                 T (s,a,s0 ) Reward(s,a,s0 ) + γVπ (s0 )
                                                                                                                 r Value iteration – Value iteration is an algorithm that finds the optimal value Vopt as well
                                 s0 ∈ States
                                                                                                                 as the optimal policy πopt . It is done as follows:
r Value of a policy – The value of a policy π from state s, also noted Vπ (s), is the expected                      • Initialization: for all states s, we have
utility by following policy π from state s over random paths. It is defined as follows:
                                                                                                                                                                        (0)
                                                                                                                                                                      Vopt (s) ←− 0
                                             Vπ (s) = Qπ (s,π(s))
                                                         (0)
                                                      Vπ (s) ←− 0
                                                                                                                 Remark: if we have either γ < 1 or the MDP graph being acyclic, then the value iteration
                                                                                                                 algorithm is guaranteed to converge to the correct answer.
   • Iteration: for t from 1 to TPE , we have
                                       ∀s,
                                                   (t)
                                               Vπ (s) ←− Qπ
                                                                        (t−1)
                                                                                (s,π(s))                         2.2.3     When unknown transitions and rewards
                                                                                                                 Now, let’s assume that the transition probabilities and the rewards are unknown.
        with
                                                                                                                 r Model-based Monte Carlo – The model-based Monte Carlo method aims at estimating
                                                                                                                 T (s,a,s0 ) and Reward(s,a,s0 ) using Monte Carlo simulation with:
                                        X                              h                                 i
                (t−1)                                              0                       0     (t−1) 0
               Qπ     (s,π(s))   =                 T (s,π(s),s ) Reward(s,π(s),s ) +           γVπ    (s )
                                                                                                                                                                  # times (s,a,s0 ) occurs
                                     s0 ∈ States                                                                                                 Tb(s,a,s0 ) =
                                                                                                                                                                   # times (s,a) occurs
Remark: by noting S the number of states, A the number of actions per state, S 0 the number                         and
of successors and T the number of iterations, then the time complexity is of O(TPE SS 0 ).
                                                                                                                                                                0
                                                                                                                                                   Reward(s,a,s
                                                                                                                                                    \             ) = r in (s,a,r,s0 )
r Optimal Q-value – The optimal Q-value Qopt (s,a) of state s with action a is defined to be
the maximum Q-value attained by any policy starting. It is computed as follows:                                  These estimations will be then used to deduce Q-values, including Qπ and Qopt .
Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not                    • a starting state sstart
depend on the exact policy.
                                                                                                             • possible actions Actions(s) from state s
r Model-free Monte Carlo – The model-free Monte Carlo method aims at directly estimating
Qπ , as follows:                                                                                             • successors Succ(s,a) from states s with actions a
                        bπ (s,a) = average of ut where st−1 = s, at = a
                        Q                                                                                    • whether an end state was reached IsEnd(s)
where ut denotes the utility starting at step t of a given episode.
                                                                                                             • the agent’s utility Utility(s) at end state s
Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent
on the policy π used to generate the data.                                                                   • the player Player(s) who controls state s
r Equivalent formulation – By introducing the constant η =                       1
                                                                                        and for
                                                                         1+(#updates to (s,a))
                                                                                                          Remark: we will assume that the utility of the agent has the opposite sign of the one of the
each (s,a,u) of the training set, the update rule of model-free Monte Carlo has a convex combi-           opponent.
nation formulation:
                                                                                                          r Types of policies – There are two types of policies:
                                   bπ (s,a) ← (1 − η)Q
                                   Q                 bπ (s,a) + ηu
as well as a stochastic gradient formulation:                                                                • Deterministic policies, noted πp (s), which are actions that player p takes in state s.
                               bπ (s,a) ← Q
                               Q          bπ (s,a) − η(Q
                                                       bπ (s,a) − u)                                         • Stochastic policies, noted πp (s,a) ∈ [0,1], which are probabilities that player p takes action
                                                                                                               a in state s.
Remark: we can extract πmax and πmin from the minimax value Vminimax .
                                                                                                        r TD learning – Temporal difference (TD) learning is used when we don’t know the transi-
                                                                                                        tions/rewards. The value is based on exploration policy. To be able to use it, we need to know
r Minimax properties – By noting V the value function, there are 3 properties around                    rules of the game Succ(s,a). For each (s,a,r,s0 ), the update is done as follows:
minimax to have in mind:
                                                                                                                                                                         
                                                                                                                               w ←− w − η V (s,w) − (r + γV (s0 ,w)) ∇w V (s,w)
   • Property 1 : if the agent were to change its policy to any πagent , then the agent would be
     no better off.
                              ∀πagent ,   V (πmax ,πmin ) > V (πagent ,πmin )                           2.3.2    Simultaneous games
                                                                                                        This is the contrary of turn-based games, where there is no ordering on the player’s moves.
   • Property 2 : if the opponent changes its policy from πmin to πopp , then he will be no
     better off.                                                                                        r Single-move simultaneous game – Let there be two players A and B, with given possible
                                                                                                        actions. We note V (a,b) to be A’s utility if A chooses action a, B chooses action b. V is called
                               ∀πopp ,    V (πmax ,πmin ) 6 V (πmax ,πopp )                             the payoff matrix.
In the end, we have the following relationship: • A mixed strategy is a probability distribution over actions:
V (πexptmax ,πmin ) 6 V (πmax ,πmin ) 6 V (πmax ,π) 6 V (πexptmax ,π) ∀a ∈ Actions, 0 6 π(a) 6 1
2.3.1    Speeding up minimax                                                                            r Game evaluation – The value of the game V (πA ,πB ) when player A follows πA and player
                                                                                                        B follows πB is such that:
r Evaluation function – An evaluation function is a domain-specific and approximate estimate                                                          X
of the value Vminimax (s). It is noted Eval(s).                                                                                       V (πA ,πB ) =         πA (a)πB (b)V (a,b)
Remark: FutureCost(s) is an analogy for search problems.                                                                                              a,b
Remark: in any finite-player game with finite number of actions, there exists at least one Nash        3.1.1    Factor graphs
equilibrium.
                                                                                                   r Definition – A factor graph, also referred to as a Markov random field, is a set of variables
                                                                                                   X = (X1 ,...,Xn ) where Xi ∈ Domaini and m factors f1 ,...,fm with each fj (X) > 0.
                                                                                                       r Scope and arity – The scope of a factor fj is the set of variables it depends on. The size of
                                                                                                       this set is called the arity.
                                                                                                       Remark: factors of arity 1 and 2 are called unary and binary respectively.
                                                                                                       r Assignment weight – Each assignment x = (x1 ,...,xn ) yields a weight Weight(x) defined as
                                                                                                       being the product of all factors fj applied to that assignment. Its expression is given by:
                                                                                                                                                           m
                                                                                                                                                           Y
                                                                                                                                             Weight(x) =         fj (x)
                                                                                                                                                           j=1
Here, the constraint j with assignment x is said to be satisfied if and only if fj (x) = 1.
r Backtracking search – Backtracking search is an algorithm used to find maximum weight                   3.1.3    Approximate methods
assignments of a factor graph. At each step, it chooses an unassigned variable and explores
its values by recursion. Dynamic ordering (i.e. choice of variables and values) and lookahead             r Beam search – Beam search is an approximate algorithm that extends partial assignments
(i.e. early elimination of inconsistent options) can be used to explore the graph more efficiently,       of n variables of branching factor b = |Domain| by exploring the K top paths at each step. The
although the worst-case runtime stays exponential: O(|Domain|n ).                                         beam size K ∈ {1,...,bn } controls the tradeoff between efficiency and accuracy. This algorithm
                                                                                                          has a time complexity of O(n · Kb log(Kb)).
r Forward checking – It is a one-step lookahead heuristic that preemptively removes incon-                The example below illustrates a possible beam search of parameters K = 2, b = 3 and n = 5.
sistent values from the domains of neighboring variables. It has the following characteristics:
   • After assigning a variable Xi , it eliminates inconsistent values from the domains of all its
     neighbors.
   • If any of these domains becomes empty, we stop the local backtracking search.
   • If we un-assign a variable Xi , we have to restore the domain of its neighbors.
r Most constrained variable – It is a variable-level ordering heuristic that selects the next
unassigned variable that has the fewest consistent values. This has the effect of making incon-
sistent assignments to fail earlier in the search, which enables more efficient pruning.
r Least constrained value – It is a value-level ordering heuristic that assigns the next value
that yields the highest number of consistent values of neighboring variables. Intuitively, this
procedure chooses first the values that are most likely to work.
Remark: in practice, this heuristic is useful when all factors are constraints.
                                                                                                             • we assign to each element u ∈ Domaini a weight w(u) that is the product of all factors
                                                                                                               connected to that variable,
The example above is an illustration of the 3-color problem with backtracking search coupled              Remark: Gibbs sampling can be seen as the probabilistic counterpart of ICM. It has the advan-
with most constrained variable exploration and least constrained value heuristic, as well as              tage to be able to escape local minima in most cases.
forward checking at each step.
r Arc consistency – We say that arc consistency of variable Xl with respect to Xk is enforced             3.1.4    Factor graph transformations
when for each xl ∈ Domainl :
   • unary factors of Xl are non-zero,                                                                    r Independence – Let A,B be a partitioning of the variables X. We say that A and B are
                                                                                                          independent if there are no edges between A and B and we write:
   • there exists at least one xk ∈ Domaink such that any factor between Xl and Xk is
     non-zero.                                                                                                                             A,B independent ⇐⇒ A ⊥
                                                                                                                                                                ⊥B
r AC-3 – The AC-3 algorithm is a multi-step lookahead heuristic that applies forward checking
to all relevant variables. After a given assignment, it performs forward checking and then                Remark: independence is the key property that allows us to solve subproblems in parallel.
successively enforces arc consistency with respect to the neighbors of variables for which the
                                                                                                          r Conditional independence – We say that A and B are conditionally independent given C
domain change during the process.
                                                                                                          if conditioning on C produces a graph in which A and B are independent. In this case, it is
Remark: AC-3 can be implemented both iteratively and recursively.                                         written:
gj (x) = fj (x ∪ {Xi : v}) Remark: finding the best variable ordering is a NP-hard problem.
r Markov blanket – Let A ⊆ X be a subset of variables. We define MarkovBlanket(A) to be            3.2     Bayesian networks
the neighbors of A that are not in A.
                                                                                                   In this section, our goal will be to compute conditional probabilities. What is the probability of
r Proposition – Let C = MarkovBlanket(A) and B = X\(A ∪ C). Then we have:                          a query given evidence?
                                           A⊥
                                            ⊥ B|C
                                                                                                   3.2.1    Introduction
                                                                                               r Explaining away – Suppose causes C1 and C2 influence an effect E. Conditioning on the
                                                                                               effect E and on one of the causes (say C1 ) changes the probability of the other cause (say C2 ).
                                                                                               In this case, we say that C1 has explained away C2 .
                                                                                               r Directed acyclic graph – A directed acyclic graph (DAG) is a finite directed graph with
                                                                                               no directed cycles.
                                                                                                   r Bayesian network – A Bayesian network is a directed acyclic graph (DAG) that specifies
                                                                                                   a joint distribution over random variables X = (X1 ,...,Xn ) as a product of local conditional
                                                                                                   distributions, one for each node:
r Elimination – Elimination is a factor graph transformation that removes Xi from the graph
and solves a small subproblem conditioned on its Markov blanket as follows:                                                                              n
                                                                                                                                                         Y
                                                                                                                           P (X1 = x1 ,...,Xn = xn ) ,         p(xi |xParents(i) )
   • Consider all factors fi,1 ,...,fi,k that depend on Xi
                                                                                                                                                         i=1
   • Remove Xi and fi,1 ,...,fi,k
   • Add fnew,i (x) defined as:                                                                    Remark: Bayesian networks are factor graphs imbued with the language of probability.
                                                          k
                                                          Y
                                      fnew,i (x) = max          fi,l (x)
                                                     xi
                                                          l=1
r Treewidth – The treewidth of a factor graph is the maximum arity of any factor created by
variable elimination with the best variable ordering. In other words,
                         Treewidth =     min       max         arity(fnew,i )
                                       orderings i∈{1,...,n}
                                                                                                   r Locally normalized – For each xParents(i) , all factors are local conditional distributions.
The example below illustrates the case of a factor graph of treewidth 3.                           Hence they have to satisfy:
                                                                                                                         Hto     ∼               o )
                                                                                                                                         p(Hto |Ht−1
                                                                                                                               o∈{a,b}                                                  Multiple object
                                                                                                    Factorial HMM
                                                                                                                                                                                          tracking
                                    X
                                          p(xi |xParents(i) ) = 1                                                        Et ∼   p(Et |Hta ,Htb )
                                     xi
                                                                                                                           Y ∼ p(Y )                                                      Document
As a result, sub-Bayesian networks and conditional distributions are consistent.                      Naive Bayes
                                                                                                                           Wi ∼ p(Wi |Y )                                                classification
Remark: local conditional distributions are the true conditional distributions.
r Concept – A probabilistic program randomizes variables assignment. That way, we can write      3.2.3    Inference
down complex Bayesian networks that generate assignments without us having to explicitly
specify associated probabilities.                                                                r General probabilistic inference strategy – The strategy to compute the probability
                                                                                                 P (Q|E = e) of query Q given evidence E = e is as follows:
Remark: examples of probabilistic programs include Hidden Markov model (HMM), factorial
HMM, naive Bayes, latent Dirichlet allocation, diseases and symptoms and stochastic block           • Step 1: Remove variables that are not ancestors of the query Q or the evidence E by
models.
                                                                                                      marginalization
r Summary – The table below summarizes the common probabilistic programs as well as their           • Step 2: Convert Bayesian network to factor graph
applications:
                                                                                                    • Step 3: Condition on the evidence E = e
          Program           Algorithm                  Illustration          Example                • Step 5: Run probabilistic inference algorithm (manual, variable elimination, Gibbs sam-
                                                                                                      pling, particle filtering)
                                                                                                    • Step 1: for i ∈ {1,..., L}, compute Fi (hi ) =              Fi−1 (hi−1 )p(hi |hi−1 )p(ei |hi )
                                                                                                                                                       P
                                                                                                                                                           hi−1
        Hidden Markov     Ht ∼ p(Ht |Ht−1 )                                                         • Step 2: for i ∈ {L,..., 1}, compute Bi (hi ) =              Bi+1 (hi+1 )p(hi+1 |hi )p(ei+1 |hi+1 )
                                                                                                                                                       P
                                                                          Object tracking                                                                  hi+1
        Model (HMM)       Et ∼ p(Et |Ht )
                                                                                                    • Step 3: for i ∈ {1,...,L}, compute Si (hi ) =    PFi (hi )Bi (hi )
                                                                                                                                                               Fi (hi )Bi (hi )
                                                                                                                                                          hi
with the convention F0 = BL+1 = 1. From this procedure and these notations, we get that             4       Logic-based models
                                 P (H = hk |E = e) = Sk (hk )
                                                                                                    4.1        Basics
Remark: this algorithm interprets each assignment to be a path where each edge hi−1 → hi is         r Syntax of propositional logic – By noting f,g formulas, and ¬, ∧, ∨, →, ↔ connectives, we
of weight p(hi |hi−1 )p(ei |hi ).                                                                   can write the following logical expressions:
r Gibbs sampling – This algorithm is an iterative approximate method that uses a small set of
assignments (particles) to represent a large probability distribution. From a random assignment                       Name          Symbol           Meaning                Illustration
x, Gibbs sampling performs the following steps for i ∈ {1,...,n} until convergence:
   • For all u ∈ Domaini , compute the weight w(u) of assignment x where Xi = u                                     Affirmation         f                 f
   • Sample v from the probability distribution induced by w: v ∼ P (Xi = v|X−i = x−i )
   • Set Xi = v
Remark: X−i denotes X\{Xi } and x−i represents the corresponding assignment. Negation ¬f not f
r Particle filtering – This algorithm approximates the posterior density of state variables
given the evidence of observation variables by keeping track of K particles at a time. Starting
from a set of particles C of size K, we run the following 3 steps iteratively:
   • Step 1: proposal - For each old particle xt−1 ∈ C, sample x from the transition probability                    Conjunction       f ∧g            f and g
     distribution p(x|xt−1 ) and add x to a set C 0 .
   • Step 2: weighting - Weigh each x of the set C 0 by w(x) = p(et |x), where et is the evidence
     observed at time t.
   • Step 3: resampling - Sample K elements from the set C 0 using the probability distribution                     Disjunction       f ∨g             f or g
     induced by w and store them in C: these are the current particles xt .
Remark: a more expensive version of this algorithm also keeps track of past particles in the
proposal step.
r Maximum likelihood – If we don’t know the local conditional distributions, we can learn                           Implication       f →g           if f then g
them using maximum likelihood.
                                             Y
                                   max               p(X = x; θ)
                                     θ
                                          x∈Dtrain
                                                                                                                   Biconditional      f ↔g       f , that is to say g
r Laplace smoothing – For each distribution d and partial assignment (xParents(i) ,xi ), add λ
to countd (xParents(i) ,xi ), then normalize to get probability estimates.
                                                                                                        Remark: formulas can be built up recursively out of these connectives.
r Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method at
estimating the parameter θ through maximum likelihood estimation by repeatedly constructing             r Model – A model w denotes an assignment of binary weights to propositional symbols.
a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:
                                                                                                        Example: the set of truth values w = {A : 0,B : 1,C : 0} is one possible model to the propositional
   • E-step: Evaluate the posterior probability q(h) that each data point e came from a                 symbols A, B and C.
     particular cluster h as follows:                                                                   r Interpretation function – The interpretation function I(f,w) outputs whether model w
                                                                                                        satisfies formula f :
                                         q(h) = P (H = h|E = e; θ)
                                                                                                                                                 I(f,w) ∈ {0,1}
   • M-step: Use the posterior probabilities q(h) as cluster specific weights on data points e
     to determine θ through maximum likelihood.                                                         r Set of models – M(f ) denotes the set of models w that satisfy formula f . Mathematically
                                                                                                        speaking, we define it as follows:
                                                                                                                                                                              - No model satisfies
                                                                                                          KB                                                                  the constraints after
                                                                                                                          M(KB) ∩ M(f ) = ∅
4.2   Knowledge base                                                                                 contradicts f                                                            adding f
                                                                                                                                                                              Equivalent to KB |= ¬f
r Definition – The knowledge base KB is the conjunction of all formulas that have been
considered so far. The set of models of the knowledge base is the intersection of the set of                                                                                  - f does not contradict
models that satisfy each formula. In other words:                                                                        M(KB) ∩ M(f ) 6= ∅                                   KB
                                                                                                      f contingent
                                                                                                                                and                                           - f adds a non-trivial
                                                                                                         to KB
                                                \                                                                       M(KB) ∩ M(f ) 6= M(KB)                                amount of information
                                   M(KB) =             M(f )
                                                                                                                                                                              to KB
                                               f ∈KB
                                                                                                    r Model checking – A model checking algorithm takes as input a knowledge base KB and
                                                                                                    outputs whether it is satisfiable or not.
                                                                                                    Remark: popular model checking algorithms include DPLL and WalkSat.
                                                                                                                                                f1 ,...,fk
                                                                                                                                                     g
r Probabilistic interpretation – The probability that query f is evaluated to 1 can be seen
as the proportion of models w of the knowledge base KB that satisfy f , i.e.:
                                                                                                    r Forward inference algorithm – From a set of inference rules Rules, this algorithm goes
                                            X                                                       through all possible f1 ,...,fk and adds g to the knowledge base KB if a matching rule exists.
                                                         P (W = w)                                  This process is repeated until no more additions can be made to KB.
                          P (f |KB) =
                                        w∈M(KB)∩M(f )
                                                                                                    r Derivation – We say that KB derives f (written KB ` f ) with rules Rules if f already is in
                                                                                                    KB or gets added during the forward inference algorithm using the set of rules Rules.
                                            X
                                                       P (W = w)
                                          w∈M(KB)                                                   r Properties of inference rules – A set of inference rules Rules can have the following
                                                                                                    properties:
r Satisfiability – The knowledge base KB is said to be satisfiable if at least one model w                  Name         Mathematical formulation                            Notes
satisfies all its constraints. In other words:
                                                                                                                                                               - Inferred formulas are entailed by
                              KB satisfiable ⇐⇒ M(KB) 6= ∅                                                                                                     KB
                                                                                                          Soundness       {f : KB ` f } ⊆ {f : KB |= f }
                                                                                                                                                               - Can be checked one rule at a time
                                                                                                                                                               - "Nothing but the truth"
Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge                                                                   - Formulas entailing KB are either
base.                                                                                                                                                          already in the knowledge base or
                                                                                                        Completeness      {f : KB ` f } ⊇ {f : KB |= f }
r Relation between formulas and knowledge base – We define the following properties                                                                            inferred from it
between the knowledge base KB and a new formula f :                                                                                                            - "The whole truth"
In this section, we will go through logic-based models that use logical formulas and inference              • Step 3: Return unsatisfiable if and only if False is derived
rules. The idea here is to balance expressivity and computational efficiency.
r Horn clause – By noting p1 ,...,pk and q propositional symbols, a Horn clause has the form:            4.4    First-order logic
                                          (p1 ∧ ... ∧ pk ) −→ q
                                                                                                         The idea here is that variables yield compact knowledge representations.
Remark: when q = false, it is called a "goal clause", otherwise we denote it as a "definite              r Model – A model w in first-order logic maps:
clause".
r Modus ponens inference rule – For propositional symbols f1 ,...,fk and p, the modus                       • constant symbols to objects
ponens rule is written:
                                                                                                            • predicate symbols to tuple of objects
                                  f1 ,...,fk ,   (f1 ∧ ... ∧ fk ) −→ p
                                                     p
                                                                                                     r Horn clause – By noting x1 ,...,xn variables and a1 ,...,ak ,b atomic formulas, the first-order
                                                                                                     logic version of a horn clause has the form:
Remark: it takes linear time to apply this rule, as each application generate a clause that
contains a single propositional symbol.                                                                                                     ∀x1 ,...,∀xn ,      (a1 ∧ ... ∧ ak ) → b
r Completeness – Modus ponens is complete with respect to Horn clauses if we suppose that
KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus                  r Substitution – A substitution θ maps variables to terms and Subst(θ,f ) denotes the result
ponens will then derive p.                                                                               of substitution θ on f .
r Conjunctive normal form – A conjunctive normal form (CNF) formula is a conjunction of              r Unification – Unification takes two formulas f and g and returns the most general substitu-
clauses, where each clause is a disjunction of atomic formulas.                                      tion θ that makes them equal:
Remark: in other words, CNFs are ∧ of ∨.
                                                                                                                                  Unify[f,g] = θ         s.t.      Subst[θ,f ] = Subst[θ,g]
r Equivalent representation – Every formula in propositional logic can be written into an
equivalent CNF formula. The table below presents general conversion properties:                          Note: Unify[f,g] returns Fail if no such θ exists.
                        Rule name                   Initial          Converted                           r Modus ponens – By noting x1 ,...,xn variables, a1 ,...,ak and a01 ,...,a0k atomic formulas and
                                                                                                         by calling θ = Unify(a01 ∧ ... ∧ a0k , a1 ∧ ... ∧ ak ) the first-order logic version of modus ponens can
                                      ↔             f ↔g          (f → g) ∧ (g → f )                     be written:
                   Eliminate          →             f →g                 ¬f ∨ g                                                        a01 ,...,a0k   ∀x1 ,...,∀xn (a1 ∧ ... ∧ ak ) → b
                                     ¬¬              ¬¬f                   f                                                                              Subst[θ, b]
                                  ¬ over ∧         ¬(f ∧ g)              ¬f ∨ ¬g
                                                                                                         r Completeness – Modus ponens is complete for first-order logic with only Horn clauses.
                   Distribute     ¬ over ∨         ¬(f ∨ g)              ¬f ∧ ¬g
                                                                                                         r Resolution rule – By noting f1 , ..., fn , g1 , ..., gm , p, q formulas and by calling θ = Unify(p,q),
                                  ∨ over ∧        f ∨ (g ∧ h)     (f ∨ g) ∧ (f ∨ h)                      the first-order logic version of the resolution rule can be written:
                                                                                                                                         f1 ∨ ... ∨ fn ∨ p, ¬q ∨ g1 ∨ ... ∨ gm
r Resolution inference rule – For propositional symbols f1 ,...,fn , and g1 ,...,gm as well as p,                                        Subst[θ,f1 ∨ ... ∨ fn ∨ g1 ∨ ... ∨ gm ]
the resolution rule is written:
                                f1 ∨ ... ∨ fn ∨ p, ¬p ∨ g1 ∨ ... ∨ gm                                    r Semi-decidability – First-order logic, even restricted to only Horn clauses, is semi-decidable.
                                     f1 ∨ ... ∨ fn ∨ g1 ∨ ... ∨ gm
                                                                                                            • if KB |= f , forward inference on complete inference rules will prove f in finite time
Remark: it can take exponential time to apply this rule, as each application generates a clause
that has a subset of the propositional symbols.                                                             • if KB 6|= f , no algorithm can show this in finite time
This document contains cheat sheets on various topics asked during a Machine Learn-
ing/Data science interview. This document is constantly updated to include more topics.
Table of Contents
Basics of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
    1. Bias-Variance Trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
5. Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6. Regularization in ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
8. Famous CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Behavioral Interview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
    1. How to prepare for behavioral interview? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11
                                                                    Page 1 of 14
                Cheat Sheet – Bias-Variance Tradeoff
What is Bias?
• Error between average model prediction and ground truth
• The bias of the estimated function tells us the capacity of the underlying model to
  predict the values
What is Variance?
• Average variability in the model prediction for the given dataset
• The variance of the estimated function tells you how much the function can adjust
  to the change in the dataset
High Bias                   Overly-simplified Model
                            Under-fitting
                            High error on both test and train data
Minimum Error
                                           Page 2 of 14
                     Cheat Sheet – Imbalanced Data in Classification
         Blue: Label 1
                      Accuracy doesn’t always give the correct insight about your trained model
Accuracy: %age correct prediction                      Correct prediction over total predictions        One value for entire network
Precision: Exactness of model                          From the detected cats, how many were            Each class/label has a value
                                                       actually cats
Recall:   Completeness of model                        Correctly detected cats over total cats          Each class/label has a value
F1 Score: Combines Precision/Recall                    Harmonic mean of Precision and Recall            Each class/label has a value
Positive Positive TP + FP TN + FP
                                                                        (Prec x Rec)                                   TP + TN
                                                       F1 score = 2x                            Accuracy =
                                                                       (Prec + Rec)                             TP + FN + FP + TN
                     False         True
 0
                    Negative      Negative                                    TN                                                 TP
                                                       Specificity =                            Recall, Sensitivity =
                                                                         TN     +FP             True +ve rate                TP + FN
                                                             Possible solutions
        1. Data Replication: Replicate the available data until the                     Blue: Label 1
                   number of samples are comparable                                    Green: Label 0
        2. Synthetic Data: Images: Rotate, dilate, crop, add noise to                   Blue: Label 1
                   existing input images and create new data                           Green: Label 0
        3. Modified Loss: Modify the loss to reflect greater error when                    𝑙𝑜𝑠𝑠 = 𝑎 ∗ 𝒍𝒐𝒔𝒔𝒈𝒓𝒆𝒆𝒏 + 𝑏 ∗ 𝒍𝒐𝒔𝒔𝒃𝒍𝒖𝒆     𝑎>𝑏
                   misclassifying smaller sample set
        4. Change the algorithm: Increase the model/algorithm complexity so that the two classes are perfectly
           separable (Con: Overfitting)
                                                                  Increase model
                                                                    complexity
    No straight line (y=ax) passing through origin can perfectly                 Straight line (y=ax+b) can perfectly separate data.
    separate data. Best solution: line y=0, predict all labels blue                Green class will no longer be predicted as blue
Source: https://www.cheatsheets.aqeel-anwar.com
                                                                  Page 3 of 14
                        Cheat Sheet – PCA Dimensionality Reduction
  What is PCA?
  • Based on the dataset find a new set of orthogonal feature vectors in such a way that the
      data spread is maximum in the direction of the feature vector (or dimension)
  • Rates the feature vector in the decreasing order of data spread (or variance)
  • The datapoints have maximum variance in the first feature vector, and minimum variance
      in the last feature vector
  • The variance of the datapoints in the direction of feature vector can be termed as a
      measure of information in that direction.
  Steps
  1. Standardize the datapoints
  2. Find the covariance matrix from the given datapoints
  3. Carry out eigen-value decomposition of the covariance matrix
  4. Sort the eigenvalues and eigenvectors
  Figure 1                                                                             Figure 2
                              Feature # 1 (F1)
FeFeature # 1
                                                                                                                                     Variance
     Variance
                                                                                      1
                                                                                         e#
                                                                                                                                 2
                                                                                       ur
                                                                                                                            e#
                                                                                            at
                                                                                                                          ur
                                                                                                                         at
                                                                                                                      Fe
                                                                                                   ew
                                                                                                                    w
                                                                                                                 Ne
                                                                                                          N
                                                   e#
        u
                                        at
  Fe
Source: https://www.cheatsheets.aqeel-anwar.com
                                                                               Page 4 of 14
          Cheat Sheet – Bayes Theorem and Classifier
What is Bayes’ Theorem?
• Describes the probability of an event, based on prior knowledge of conditions that might be
  related to the event.
                                                                                         P(A B)
• How the probability of an event changes when
  we have knowledge of another event                                       Posterior
                                                                          Probability
            P(A)        P(A B)
                                  Usually a better
                                  estimate than P(A)
                                                                         Bayes’ Theorem
Example
• Probability of fire P(F) = 1%
• Probability of smoke P(S) = 10%
                                                            Likelihood        P(A)            Evidence
• Prob of smoke given there is a fire P(S F) = 90%
• What is the probability that there is a fire given         P(B A)           Prior               P(B)
  we see a smoke P(F S)?                                                   Probability
Bayes’ theorem assumes the features (x1, x2, x3, … ) are i.i.d. i.e
Source: https://www.cheatsheets.aqeel-anwar.com
                                             Page 5 of 14
                           Cheat Sheet – Regression Analysis
What is Regression Analysis?
Fitting a function f(.) to datapoints yi=f(xi) under some error function. Based on the estimated
function and error, we have the following types of regression
1. Linear Regression:
   Fits a line minimizing the sum of mean-squared error
   for each datapoint.
2. Polynomial Regression:
   Fits a polynomial of order k (k+1 unknowns) minimizing
   the sum of mean-squared error for each datapoint.
3. Bayesian Regression:
   For each datapoint, fits a gaussian distribution by
   minimizing the mean-squared error. As the number of
   data points xi increases, it converges to point
   estimates i.e.
4. Ridge Regression:
   Can fit either a line, or polynomial minimizing the sum
   of mean-squared error for each datapoint and the
   weighted L2 norm of the function parameters beta.
5. LASSO Regression:
   Can fit either a line, or polynomial minimizing the the
   sum of mean-squared error for each datapoint and the
   weighted L1 norm of the function parameters beta.
6. Logistic Regression (NOT regression, but classification):
   Can fit either a line, or polynomial with sigmoid
   activation minimizing the sum of mean-squared error for
   each datapoint. The labels y are binary class labels.
Visual Representation:
       Linear Regression           Polynomial Regression            Bayesian Linear Regression             Logistic Regression
                                                                                                 Label 1
                                                                                                   y
y
Label 0
x x x x
Summary:
                                    What does it fit?                      Estimated function              Error Function
    Linear                         A line in n dimensions
    Polynomial                     A polynomial of order k
    Bayesian Linear          Gaussian distribution for each point
    Ridge                            Linear/polynomial
    LASSO                            Linear/polynomial
    Logistic                   Linear/polynomial with sigmoid
Source: https://www.cheatsheets.aqeel-anwar.com
                                                       Page 6 of 14
                      Cheat Sheet – Regularization in ML
What is Regularization in ML?
• Regularization is an approach to address over-fitting in ML.
• Overfitted model fails to generalize estimations on test data
• When the underlying model to be learned is low bias/high
  variance, or when we have small amount of data, the
  estimated model is prone to over-fitting.
• Regularization reduces the variance of the model
Types of Regularization:                                                     Figure 1. Overfitting
1. Modify the loss function:
• L2 Regularization: Prevents the weights from getting too large (defined by L2 norm). Larger
  the weights, more complex the model is, more chances of overfitting.
• L1 Regularization: Prevents the weights from getting too large (defined by L1 norm). Larger
  the weights, more complex the model is, more chances of overfitting. L1 regularization
  introduces sparsity in the weights. It forces more weights to be zero, than reducing the the
  average magnitude of all weights
• Entropy: Used for the models that output probability. Forces the probability distribution
  towards uniform distribution.
                                                 Page 7 of 14
                          Cheat Sheet – Famous CNNs
AlexNet – 2012
Why: AlexNet was born out of the need to improve the results of
the ImageNet challenge.
What: The network consists of 5 Convolutional (CONV) layers and 3
Fully Connected (FC) layers. The activation used is the Rectified
Linear Unit (ReLU).
How: Data augmentation is carried out to reduce over-fitting, Uses
Local response localization.
VGGNet – 2014
Why: VGGNet was born out of the need to reduce the # of
parameters in the CONV layers and improve on training time
What: There are multiple variants of VGGNet (VGG16, VGG19, etc.)
How: The important point to note here is that all the conv kernels are
of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.
ResNet – 2015
Why: Neural Networks are notorious for not being able to find a
simpler mapping when it exists. ResNet solves that.
What: There are multiple versions of ResNetXX architectures where
‘XX’ denotes the number of layers. The most used ones are ResNet50
and ResNet101. Since the vanishing gradient problem was taken care of
(more about it in the How part), CNN started to get deeper and deeper
How: ResNet architecture makes use of shortcut connections do solve
the vanishing gradient problem. The basic building block of ResNet is
a Residual block that is repeated throughout the network.
                                                               Filter
                                                            Concatenation
                          Weight layer
                                                          3x3               5x5
                   f(x)                  x    1x1
                                                         Conv              Conv
                                                                                  1x1 Conv
                               +                                Previous
                      f(x)+x                                     Layer
Source: https://www.cheatsheets.aqeel-anwar.com
                                                    Page 8 of 14
            Cheat Sheet – Convolutional Neural Network
Convolutional Neural Network:
The data gets into the CNN through the input layer and passes
through various hidden layers before getting to the output layer.
The output of the network is compared to the actual labels in
terms of loss or error. The partial derivatives of this loss w.r.t the
trainable weights are calculated, and the weights are updated
through one of the various methods using backpropagation.
CNN Template:
Most of the commonly used hidden layers (not all) follow a
pattern
1. Layer function: Basic transforming function such as
   convolutional or fully connected layer.
a. Fully Connected: Linear functions between the input and the
   output.
a. Convolutional   Layers: These layers are applied to 2D (3D) input feature maps. The trainable weights are a 2D (3D)
   kernel/filter that moves across the input feature map, generating dot products with the overlapping region of the input
   feature map.
b.Transposed Convolutional (DeConvolutional) Layer: Usually used to increase the size of the output feature map
   (Upsampling) The idea behind the transposed convolutional layer is to undo (not exactly) the convolutional layer
                 Fully Connected Layer                       Convolutional Layer
                  w11*x
           x1           1+ b1
                         + b1                y1
                  w21*x2
           x2
                                   1
                          3   +b
                      1*x
           x3      w3
                                                                                   1.5
                                                                                                                                     4.0                                                    0.4
                                                                                   1.0
                                                                                                                                     2.0
                                                                                   0.5                                                                                                      0.2
Source: https://www.cheatsheets.aqeel-anwar.com
                                                          Page 9 of 14
                                             Cheat Sheet – Ensemble Learning in ML
What is Ensemble Learning? Wisdom of the crowd
Combine multiple weak models/learners into one predictive model to reduce bias, variance and/or improve accuracy.
2.Boosting: Trains N different weak models (usually of same types – homogenous) with the complete dataset in a
     sequential order. The datapoints wrongly classified with previous weak model is provided more weights to that they can
     be classified by the next weak leaner properly. In the test phase, each model is evaluated and based on the test error of
     each weak model, the prediction is weighted for voting. Boosting methods decreases the bias of the prediction.
3.Stacking: Trains N different weak models (usually of different types – heterogenous) with one of the two subsets of the
     dataset in parallel. Once the weak learners are trained, they are used to trained a meta learner to combine their
     predictions and carry out final prediction using the other subset. In test phase, each model predicts its label, these set of
     labels are fed to the meta learner which generates the final prediction.
The block diagrams, and comparison table for each of these three methods can be seen below.
                                         Ensemble Method – Boosting                                                                                                             Ensemble Method – Bagging
                                                                Input Dataset                                                        Step #1                                                       Input Dataset
Step #1                                                                                                                              Create N subsets
Assign equal weights                                         Complete dataset                                                        from original                     Subset #1            Subset #2            Subset #3          Subset #4
to all the datapoints                                                                                                                dataset, one for each
in the dataset                                                                                                                       weak model
 Uniform weights
                                                                                                                                     Step #2
                                                                                                                                     Train each weak
                                                                                                                                                              Weak Model                 Weak Model               Weak Model            Weak Model
Step #2a                                                                                  Step #2b                                   model with an
Train a weak model             Train Weak                                                 •   Based on the final error on the        independent                 #1                         #2                       #3                    #4
with equal weights to                                                                         trained weak model, calculate a        subset, in
                                Model #1                                                                                             parallel
all the datapoints                                                                            scalar alpha.
                                                                                          •   Use alpha to increase the weights of
                                                                                              wrongly classified points, and
                                                                                              decrease the weights of correctly
             alpha1         Adjusted weights                                                  classified points
                                                                                                                                     Step #3
                                                                                                                                     In the test phase, predict from
                                                                                                                                     each weak model and vote their                                     Voting
                                                                                          Step #3b                                   predictions to get final prediction
Step #3a                                               Train Weak                         •   Based on the final error on the
Train a weak model                                      Model #2                              trained weak model, calculate a
with adjusted weights                                                                         scalar alpha.
on all the datapoints                                                                     •   Use alpha to increase the weights of
in the dataset                                                                                wrongly classified points, and                                                                    Final Prediction
                                                                                              decrease the weights of correctly
                                         alpha2      Adjusted weights                         classified points
                                                                                                    Train Weak
Step #(n+1)a                                                                                         Model #4                        Step #2
Train a weak model                                                                                                                   Train each weak
with adjusted weights                                                                                                                model with the
                                                                                                                                                               Train Weak                 Train Weak              Train Weak            Train Weak
on all the datapoints                                                                                                                weak learner               Model #1                   Model #2                Model #3              Model #4
in the dataset                                                                                                                       dataset
                                                                                 alpha3
                                         x                  x                    x                         x                                                                                      Input Dataset
                                                                                                                                                                       Subset #1 – Weak Learners                  Subset #2 – Meta Learner
Step #n+2
In the test phase, predict from each
weak model and vote their predictions
weighted by the corresponding alpha to
get final prediction                                                                                                                 Step #3
                                                                  Voting                                                             Train a meta-
                                                                                                                                     learner for which       Trained Weak                Trained Weak            Trained Weak          Trained Weak
                                                                                                                                     the input is the
                                                                                                                                     outputs of the              Model                       Model                   Model                 Model
                                                                                                                                     weak models for              #1                          #2                      #3                    #4
                                                                                                                                     the Meta Learner
                                                                                                                                     dataset
                                                            Final Prediction
Source: https://www.cheatsheets.aqeel-anwar.com
                                                                                                                        Page 10 of 14
                        How to prepare for
1/4                    behavioral interview?
                 Collect stories, assign keywords, practice
                             the STAR format
        Keywords                           List important keywords that will be populated with your personal
                                           stories. Most common keywords are given in the table below
          Conflict                        Compromise to
                       Negotiation                              Creativity        Flexibility   Convincing
         Resolution                        achieve goal
                                                               Another team      Adjust to a
          Handling     Challenging        Working with
                                                               priorities not     colleague     Take Stand
           Crisis       Situation         difficult people
                                                                  aligned           style
        Handling –ve    Coworker          Working with a                           Your          Influence
                                                              Your strength
          feedback     view of you          deadline                              weakness        Others
                        Handling           Converting            Decision
          Handling                                                                 Conflict     Mentorship/
                       unexpected          challenge to       without enough
           failure                                                                Resolution    Leadership
                        situation          opportunity             data
        Stories
        1. List all the organizations you have been a part of. For example
               1. Academia: BSc, MSc, PhD
               2. Industry: Jobs, Internship
               3. Societies: Cultural, Technical, Sports
        2. Think of stories from step 1 that can fall into one of the keywords categories. The
           more stories the better. You should have at least 10-15 stories.
        3. Create a summary table by assigning multiple keywords to each stories. This will help
           you filter out the stories when the question asked in the interview. An example can be
           seen below
                             Story   1:        [Convincing] [Take Stand] [influence other]
                             Story   2:        [Mentorship] [Leadership]
                             Story   3:        [Conflict resolution] [Negotiation]
                             Story   4:        [decision-without-enough-data]
        STAR Format
        Write down the stories in the STAR format as explained in the 2/4 part of this cheat
        sheet. This will help you practice the organization of story in a meaningful way.
Source: https://www.cheatsheets.aqeel-anwar.com
                                                    Page 11 of 14
                       How to prepare for
2/4                   behavioral interview?
              Direct*, meaningful*, personalized*, logical*
                     *(Respective colors are used to identify these characteristics in the example)
Example: “Tell us about a time when you had to convince senior executives”
                                                     S
                                                                                  “I worked as an intern in XYZ company in
              Situation                                                           the summer of 2019. The project details
                                                                                  provided to me was elaborative. After
       Explain the situation and                                                  some initial brainstorming, and research I
                                                                                  realized that the project approach can be
     provide necessary context for                                                modified to make it more efficient in
                                                                                  terms of the underlying KPIs. I decided to
              your story.                                                         talk to my manager about it.”
                                                     T
                  Task                                                           approach and how it could improve the
                                                                                 KPIs. I was able to convince him. He
       Explain the task and your                                                 asked me if I will be able to present my
                                                                                 proposed approach for approval in front of
         responsibility in the                                                   the higher executives. I agreed to it. I was
                                                                                 working out of the ABC(city) office and
               situation                                                         the executives need to fly in from
                                                                                 XYZ(city) office.”
                                                     A
                                                                                 executives to know better about their area
                                                                                 of expertise so that I can convince them
     Walk through the steps and                                                  accordingly. I prepared an elaborative 15
                                                                                 slide presentation starting with explaining
     actions you took to address                                                 their approach, moving onto my proposed
              the issue                                                          approach and finally comparing them on
                                                                                 preliminary results.
                                                     R
                                                                                 was better than the initial one. The
                                                                                 executives proposed a few small changes
       State the outcome of the                                                  to my approach and really appreciated my
         result of your actions                                                  stand. At the end of my internship, I was
                                                                                 selected among the 3 out of 68 interns
                                                                                 who got to meet the senior vice president
                                                                                 of the company over lunch.”
                                                        Page 12 of 14
                        How to answer a
3/4                   behavioral question?
            Understand, Extract, Map, Select and Apply
      Example: “Tell us about a time when you had to convince senior executives”
                                           Page 13 of 14
                        Behavioral Interview
4/4                        Cheat Sheet
                    Summarizing the behavioral interview
       How to
                                    2         Based on all the organizations you have been a part of,
                                             think of all the stories that fall under the keywords above
        for the                          3         Practice each story using the STAR format. You will have
                                                          to answer the question following this format.
Source: https://www.cheatsheets.aqeel-anwar.com
Page 14 of 14