Contents
2 Random variables
2.1 Random variable and expectation
2.2 Two random variables
2.3 Averages and their convergence
2 Random variables
This chapter provides a short introduction to random variables and related concepts; these are
central in data science and will be important in the rest of the course. Indeed we will want to
forecast using ideas from probability theory, in order to design efficient prediction tools but also to
quantify the uncertainties associated with them.
2.1 Random variable and expectation
Definition. A random variable is an object designed to represent uncertainty. We will denote random
variables by capital letters, such as X, Y , X1 , X2 , etc. Random variables take values (called
“realizations”) in a set (called the state space), and each value is associated with a probability. As
a first example, to represent a coin flip, we introduce a random variable X that takes values in the
set {heads, tails}. If the coin lands on heads with probability p, i.e. P(X = heads) = p, then it
must land on tails with probability 1 − p. Having described the probabilities associated with every
possible outcome, we have fully described our coin flip X (see Figure 2.1(a)).
Let us take another example: X follows the Uniform distribution on the interval [0, 1]. It means that
X takes values in [0, 1], and it is equally likely to “land” anywhere in [0, 1]. More formally, X is
such that P(X ∈ (a, b)) = b − a for all a and b such that 0 ≤ a < b ≤ 1: in words, the probability
that X lands in a sub-interval of [0, 1] is equal to the length of that sub-interval (see Figure 2.1(b)).
We will encounter various random variables, taking values in finite or infinite sets such as R (the
set of all real values). When the state space is large or infinite, we cannot describe the probability
that X will be equal to each and every particular value. So how do we describe X?
There are various, perfectly valid ways to define a random variable X. For example,
• We can describe the relation between X and an already-defined variable. For example, if we
say that X = − log(U ) where U is Uniform in [0, 1], we have fully described X, and we can
simulate it on the computer, at least if we assume that we can simulate Uniform variables
on the computer.
Pierre Jacob 1 Forecasting and predictive analytics
Figure 2.1: Two random variables: a biased coin flip (left), and a Uniform variable in [0, 1] (right). Both panels show probability against state. Here,
b − a is equal to d − c, so the variable is equally likely to land in (a, b) or in (c, d).
• We can describe the probability density function of X, denoted by fX , from the state space
to R+ , the set of non-negative reals. For example, the density function of a Uniform(0, 1) variable
is x ↦ 1(x ∈ (0, 1)), and the density function of the Exponential(1) variable is x ↦ exp(−x) for x > 0 (and zero otherwise).
From the probability density function, we can compute the probability that the random
variable lands in any subset of the state space. For example for any a < b in the state space,
we have $P(X \in (a, b)) = \int_a^b f_X(x)\,dx$. In words: the area under the curve of fX between a and
b. Here we represent probabilities as integrals of fX , which is convenient because integrals
can be computed by certain humans and all computers.
We can prove that X = − log(U ) where U is Uniform in [0, 1] is indeed a random variable with
density function x ↦ exp(−x), via the change of variable formula.
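The two descriptions of the same variable can be compared numerically. The sketch below is only an illustration, using Python's standard library: it simulates X = − log(U ) and compares the frequency of landing in an interval (a, b) with the integral of exp(−x) over that interval, which equals exp(−a) − exp(−b). The function names, the seed, the sample size and the interval are arbitrary choices made for this example.

```python
import math
import random

def simulate_exponential(n, seed=1):
    """Draw n values of X = -log(U) with U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids taking log(0).
    return [-math.log(1.0 - rng.random()) for _ in range(n)]

def interval_probability(a, b):
    """Integral of exp(-x) over (a, b), for 0 <= a < b."""
    return math.exp(-a) - math.exp(-b)

samples = simulate_exponential(100_000)
a, b = 0.5, 2.0
# Fraction of simulated values that land in (a, b).
empirical = sum(a < x < b for x in samples) / len(samples)
exact = interval_probability(a, b)
print(f"empirical frequency: {empirical:.3f}, integral of the density: {exact:.3f}")
```

With a large number of draws the two printed numbers should be close, which is what the change of variable argument predicts.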
Change of variable. Suppose that X is a random variable with density fX , and that s is a
one-to-one function with a differentiable inverse. Define Y = s(X). What is the density fY of Y ?
\[
\text{(change of variable)} \qquad f_Y(y) = f_X\big(s^{-1}(y)\big) \times \left| \frac{d s^{-1}}{dy}(y) \right|. \tag{2.1}
\]
In the above equation, $s^{-1}$ is the inverse of s: $s^{-1}(y)$ is the number such that $s(s^{-1}(y)) = y$. The
last term on the right is the absolute value of the derivative of $s^{-1}$ evaluated at y.
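As a worked instance of (2.1), here is one way to carry out the calculation for the Exponential example, taking X to be the Uniform variable U on (0, 1) and s(u) = − log(u), so that Y = − log(U ):

```latex
\[
s(u) = -\log u, \qquad s^{-1}(y) = e^{-y}, \qquad
\left| \frac{d s^{-1}}{dy}(y) \right| = \left| -e^{-y} \right| = e^{-y},
\]
\[
f_Y(y) = f_U\big(e^{-y}\big) \times e^{-y}
       = \mathbb{1}\big(e^{-y} \in (0, 1)\big)\, e^{-y}
       = \mathbb{1}(y > 0)\, e^{-y}.
\]
```

The indicator shows that the density x ↦ exp(−x) applies on the positive reals only, and is zero elsewhere.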
To summarise, we can think of random variables as mathematical objects (more precisely, as
functions), and as concrete objects that we can simulate on the computer (see Figure 2.2 on the
Exponential variable). Both views are useful.
Figure 2.2: Two views on the Exponential random variable. Left: generate X = − log(U ) with
U ∼ Uniform(0, 1) (U plotted against X). Right: probability density function x ↦ exp(−x) (density against state space).
Properties. Once we have defined a variable, we can look at its properties. The expectation of
a random variable X, E[X], also known as its mean, is defined by $E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx$, where
fX is the probability density function of X. The integral is not always finite, or even well defined:
for example, a Cauchy variable, with density $x \mapsto \pi^{-1}(1 + x^2)^{-1}$, has no well-defined expectation
because $\int_{-\infty}^{+\infty} |x| f_X(x)\,dx$ is infinite. Similarly we can define $E[h(X)] = \int_{-\infty}^{+\infty} h(x) f_X(x)\,dx$ for a function h, for example
$E[X^2] = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx$. It is helpful to know that expectations are defined as integrals. But it does not mean
that we have to resort to (scary!) integral calculations every time we meet an expectation, thanks
to fundamental properties recalled below (linearity, and later on, the tower property).
The expectation of a constant is that constant: if X is equal (with probability one) to a
constant real number c, then E[X] = c. The expectation is also linear: for any pair of random
variables X and Y , and any two real numbers a and b, the following holds:
(linearity of expectation) E[aX + bY ] = aE[X] + bE[Y ]. (2.2)
Using linearity, we can find the expectation of a Uniform(a, b) variable X from the expectation of
a Uniform(0, 1) variable U , which is equal to 1/2. Since X has the same distribution as a + (b − a)U , E[X] = a + (b − a)/2 = (a + b)/2.
We can also use the linearity of expectation to show that $E[(X - E[X])^2]$ is equal to $E[X^2] - E[X]^2$.
This is called the variance of X and denoted by V[X]. The variance satisfies:
\[
V[aX + b] = a^2 V[X]. \tag{2.3}
\]
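These identities are easy to check by simulation. The following sketch, which relies only on Python's standard library, builds Uniform(a, b) draws as a + (b − a)U and compares the empirical mean with (a + b)/2 and the ratio of empirical variances with (b − a)². The values of a and b, the seed and the sample size are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(2)
n = 100_000
a, b = 2.0, 5.0

u = [rng.random() for _ in range(n)]      # Uniform(0, 1) draws
x = [a + (b - a) * ui for ui in u]        # Uniform(a, b) draws

mean_x = statistics.fmean(x)
var_u = statistics.pvariance(u)           # empirical variance with 1/n normalization
var_x = statistics.pvariance(x)

print(f"mean of X: {mean_x:.3f}  (theory: {(a + b) / 2:.3f})")
print(f"V[X] / V[U]: {var_x / var_u:.3f}  (theory: {(b - a) ** 2:.3f})")
```

The variance ratio matches (b − a)² up to rounding because the shift by a has no effect and the scaling by (b − a) enters squared, as in (2.3).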
Figure 2.3: Two views on the Normal random variable. Left: normalized histogram of generated
values simulated using (2.5). Right: probability density function ϕ defined in (2.4). Both panels show density against state space.
Univariate Normal distribution. A Normal distribution, denoted by Normal(µ, σ²), has probability density function
\[
\text{(Normal pdf)} \qquad \varphi : x \mapsto \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right), \tag{2.4}
\]
for x ∈ R. This is one way of defining it. It is “standard” if µ = 0 and σ = 1. Alternatively, we can
describe a Normal variable by its relation to the (already defined) Uniform variable. Suppose that
U1 and U2 are independent (more on independence below) Uniform(0, 1) variables. Define Z as
\[
\text{(Box–Muller transform)} \qquad Z = \sqrt{-2 \log(U_1)}\, \cos(2\pi U_2). \tag{2.5}
\]
Then Z is Normal(0, 1), and therefore X = µ + σZ is Normal(µ, σ²). See Figure 2.3.
We write X ∼ Normal(µ, σ²) to specify that X follows that distribution. We can compute:
E[X] = µ and V[X] = σ².
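The Box–Muller transform (2.5) translates directly into code. The sketch below, written with Python's standard library only, draws standard Normal values from pairs of Uniform draws, then shifts and scales them to obtain Normal(µ, σ²) values. The function name, the seed and the values of µ and σ are illustrative assumptions.

```python
import math
import random
import statistics

def box_muller(rng):
    """Return one standard Normal draw built from two independent Uniform(0, 1) draws."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

rng = random.Random(3)
mu, sigma = 1.0, 2.0
z = [box_muller(rng) for _ in range(100_000)]   # Normal(0, 1) draws
x = [mu + sigma * zi for zi in z]               # Normal(mu, sigma^2) draws

print(f"Z: mean {statistics.fmean(z):.3f}, variance {statistics.pvariance(z):.3f}")
print(f"X: mean {statistics.fmean(x):.3f}, variance {statistics.pvariance(x):.3f}")
```

The empirical mean and variance of the simulated values should be close to µ and σ², in line with E[X] = µ and V[X] = σ².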
2.2 Two random variables
Things get more interesting with a pair of variables, as we can introduce the important concepts of
independence, conditional distribution and covariance.
Independence. We now consider a pair of real-valued random variables X and Y . We can put
them in a “vector” V = (X, Y ), in which case, we have one random vector of length 2. The random
vector, like any other random variable, can be described with its probability density function fX,Y .
We say that X and Y are independent if the joint density fX,Y factorizes into a product of marginal
densities fX and fY :
(independent factorization) ∀x, y ∈ R fX,Y (x, y) = fX (x) fY (y) . (2.6)
For example the density of a vector (U1 , U2 ) of independent Uniform(0, 1) variables is the function
(x, y) ↦ 1(x ∈ (0, 1)) × 1(y ∈ (0, 1)). To simulate independent variables, we just simulate them
separately, without sharing of information, recycling or communication between the two simulators.
Conditioning. Consider again a pair of real-valued random variables X and Y , not necessarily
independent. Conditioning on X, or more precisely on the event {X = x}, means considering
the distribution of the random variables while fixing the value of X to some x ∈ R, as if the
random variable X was solidified into the value x. For example, suppose that X ∼ Normal(0, 1)
and Y = aX + b. Then if we “condition” on {X = x}, Y is equal to ax + b.
Conditioning on {X = x}, the variable Y might have a different distribution than if we do not
condition on {X = x}. We denote the conditional density of Y given {X = x} by y ↦ fY |X (y|x).
For any joint distribution fX,Y , we can always write
(general factorization) fX,Y (x, y) = fX (x) fY |X (y|x) = fY (y) fX|Y (x|y) . (2.7)
From this we get the expression fY |X (y|x) = fX,Y (x, y) /fX (x), valid whenever fX (x) > 0, so we can obtain an expression for
the conditional density using the joint and the marginal densities.
Note, since (2.7) is always true, independence as in (2.6) implies that fY |X (y|x) = fY (y) and
that fX|Y (x|y) = fX (x). This is intuitive and corresponds to our human idea of “independence”:
knowing the value of X does not change our understanding of the distribution of Y , and vice versa.
The notion of independence is symmetric in X and Y .
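Tower property. We write E[Y |X = x] for the expectation of the random variable Y
when we condition on {X = x}, that is, when we know that X takes the value x. It is a function of x;
evaluating that function at the random variable X gives E[Y |X], which is itself a random variable.
We have this very useful property, for any pair of random variables X and Y ,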
(tower property of expectation) E[Y ] = E[E[Y |X]]. (2.8)
For example, suppose that X and W are two independent Normal(0, 1) variables and that Y =
aX + bW . If we condition on the event {X = x}, then Y becomes ax + bW and its distribution
is Normal(ax, b²), thus E[Y |X] = aX. On the other hand, unconditionally the expectation E[Y ]
is 0. We can find this by linearity: E[Y ] = aE[X] + bE[W ]. Or by the tower property: E[Y ] =
E[E[Y |X]] = E[aX] = 0.
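The difference between the conditional and the unconditional expectation in this example can be seen by simulation. In the sketch below (Python standard library only), the conditional view freezes X at a chosen value x and simulates only W, while the unconditional view simulates both X and W. The values of a, b and x, the seed and the sample size are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(4)
a, b = 2.0, 0.5
n = 100_000

# Conditional view: X is frozen at x_fixed, only W is random.
x_fixed = 1.5
y_given_x = [a * x_fixed + b * rng.gauss(0.0, 1.0) for _ in range(n)]
cond_mean = statistics.fmean(y_given_x)

# Unconditional view: both X and W are random.
y = [a * rng.gauss(0.0, 1.0) + b * rng.gauss(0.0, 1.0) for _ in range(n)]
uncond_mean = statistics.fmean(y)

print(f"E[Y | X = {x_fixed}] is about {cond_mean:.3f}  (theory: {a * x_fixed:.3f})")
print(f"E[Y] is about {uncond_mean:.3f}  (theory: 0)")
```

The first average should be close to ax and the second close to 0, consistent with E[Y |X] = aX and E[Y ] = E[aX] = 0.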
Products. A useful property about independent variables is that the expectation E [XY ] is equal
to the product E [X] E [Y ]. Indeed, using the tower property:
E [XY ] = E [E[XY |X]] = E[XE[Y |X]] = E[XE[Y ]] = E[X]E[Y ],
where we have used E [Y |X] = E [Y ] by independence. Another useful property is that, for any two
functions g and h, if X and Y are independent then g(X) and h(Y ) are independent.
Covariance. The notion of independence is considered uncontroversial: most people agree on
its mathematical definition and on its use in data analysis. On the other hand, there have been many
attempts at quantifying the amount of dependence between variables. The correlation coefficient
is one of them.
First, the covariance between X and Y is defined as
Cov (X, Y ) = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ] . (2.9)
The second equality can be checked by developing the product, and using the linearity of expec-
tations. The covariance is not a very intuitive notion but note that if X and Y are independent,
then Cov(X, Y ) = 0 because then E[XY ] = E[X]E[Y ]. If Cov(X, Y ) = 0 we say that X and Y are
uncorrelated. Independent variables are always uncorrelated.
Uncorrelated but dependent. There are plenty of pairs of variables X and Y such that
Cov (X, Y ) = 0 and yet X and Y are dependent. Consider X following a symmetric distri-
bution around 0, such as a centered Normal distribution or a Uniform distribution on [−1, 1].
Define Y as Y = X 2 . Then X brings a lot of information on Y (in fact, X determines Y com-
pletely), so intuitively the two variables are dependent. On the other hand, we can compute
Cov (X, Y ) = E[X³] − E[X]E[X²]. Since X is symmetric around 0, we have E[X³] = 0 and
E [X] = 0, and thus Cov (X, Y ) = 0.
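This example can be illustrated numerically. The sketch below (Python standard library only) simulates a standard Normal X, sets Y = X², and computes two empirical covariances: the one between X and Y, which should be near 0, and the one between X² and Y, which should be near V[X²] = 2 for a standard Normal. The second quantity would be zero if X and Y were independent, since functions of independent variables are independent. The helper name, the seed and the sample size are assumptions made for this example.

```python
import random
import statistics

def empirical_cov(xs, ys):
    """Average of products minus product of averages, mirroring (2.9)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return statistics.fmean(x * y for x, y in zip(xs, ys)) - mean_x * mean_y

rng = random.Random(5)
x = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
y = [xi ** 2 for xi in x]            # Y is completely determined by X

cov_xy = empirical_cov(x, y)         # theory: E[X^3] - E[X] E[X^2] = 0
cov_x2_y = empirical_cov(y, y)       # theory: V[X^2] = 2 for a standard Normal

print(f"Cov(X, Y) is about {cov_xy:.3f}")
print(f"Cov(X^2, Y) is about {cov_x2_y:.3f}")
```

A covariance near zero for one pair of functions of X and Y is therefore not enough to conclude that the variables are independent.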
We state some properties of the covariance. With X = Y , we obtain Cov (X, X) = V [X]. The
covariance is symmetric: Cov (X, Y ) = Cov (Y, X). The covariance is bilinear: for numbers a, b, c
and random variables X, Y, W ,
(bilinearity of covariance) Cov (aX + bY, cW ) = ac Cov (X, W ) + bc Cov (Y, W ) . (2.10)
The covariance is invariant by shifts: for all a ∈ R, Cov (X + a, Y ) = Cov (X, Y ).
Figure 2.4: Bivariate Normal random variable, shown in the (X1 , X2 ) plane. Left: samples. Right: contours of the probability
density function.
Correlation. The correlation is a standardized covariance, defined as
\[
\text{(correlation)} \qquad \mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V[X]\, V[Y]}}. \tag{2.11}
\]
Some properties of the correlation are derived from those of the covariance (symmetry, invariance
by shifts). Some properties are specific to the correlation:
• Invariance by positive scalings: Cor (aX, Y ) = Cor (X, Y ) for all a > 0, while a negative a flips
the sign: Cor (aX, Y ) = −Cor (X, Y ) for a < 0. Invariance by shifts and
positive scalings means that the correlation is insensitive to the units used for X and Y .
• The correlation is always between −1 and +1. Indeed, for any pair of random variables X
and Y with finite first two moments (i.e. $E[X^2]$ and $E[Y^2]$ are finite), the Cauchy–Schwarz
inequality states that $E[XY]^2 \le E[X^2]\, E[Y^2]$. If we apply this inequality to the variables
X − E [X] and Y − E [Y ], we obtain $\mathrm{Cov}(X, Y)^2 \le V[X]\, V[Y]$ and thus Cor (X, Y ) ∈ [−1, 1].
Furthermore, the equality holds only if Y = aX + b for some real numbers a and b. Therefore,
we have Cor (X, Y ) = 1 (resp. = −1) if and only if Y = aX + b with a > 0 (resp. with a < 0).
Maximally correlated variables are perfectly aligned.
The latter property hints at a limitation of the correlation coefficient: it really only captures linear
associations.
Bivariate Normal distribution. A multivariate Normal distribution Normal (µ, Σ) of dimension k has probability density function
\[
f_X : x \mapsto (2\pi)^{-k/2}\, |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),
\]
defined for all $x \in \mathbb{R}^k$. Here $\mu \in \mathbb{R}^k$ is a real vector, and $\Sigma \in \mathbb{R}^{k \times k}$ is a symmetric positive definite matrix.
The notation |Σ| refers to the determinant of Σ, $\Sigma^{-1}$ to its inverse, and $(x - \mu)^T$ refers to the
transpose of the column vector (x − µ). We can simulate a vector X distributed as Normal(µ, Σ)
in dimension k, by simulating a vector of independent standard Normals Z = (Z1 , . . . , Zk ) and
computing X = µ + LZ, where L is a matrix such that $L L^T = \Sigma$. Given a matrix Σ we can find
such a matrix L by “Cholesky decomposition”.
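The density function is simpler if we consider the case where k = 2: the “bivariate” Normal, see
Figure 2.4. If we consider the mean µ to be the pair (µ1 , µ2 ) and the covariance matrix Σ to be
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix},
\]
then we can explicitly invert Σ and compute its determinant. After some work, we can write, for
any pair (x1 , x2 ), the joint density fX1 ,X2 (x1 , x2 ) of (X1 , X2 ) ∼ Normal(µ, Σ) as
\[
\frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}
\exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2 + \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 - 2\rho\, \frac{x_1 - \mu_1}{\sigma_1}\, \frac{x_2 - \mu_2}{\sigma_2} \right] \right).
\]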
Note that if ρ = 0, then the off-diagonal elements of Σ are zero, i.e. Cov(X1 , X2 ) = 0. But also
the joint density factorizes into a product of marginal densities as in (2.6). In that case, X1 and
X2 are independent. So for variables that are jointly Normal, lack of correlation is equivalent to
independence. This is not true for general pairs of random variables.
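The simulation recipe X = µ + LZ can be written out by hand in dimension 2. For the covariance matrix Σ above, one valid lower-triangular factor has entries L11 = σ1, L21 = ρσ2 and L22 = σ2 √(1 − ρ²), and multiplying L by its transpose recovers Σ. The sketch below (Python standard library only) uses this factor and checks that the empirical correlation of the simulated pairs is close to ρ. The function name, the seed and the parameter values are illustrative assumptions.

```python
import math
import random
import statistics

def simulate_bivariate_normal(n, mu1, mu2, sigma1, sigma2, rho, seed=6):
    """Draw n pairs (X1, X2) as mu + L Z, with L a lower-triangular factor of Sigma."""
    rng = random.Random(seed)
    # Entries of L for the 2x2 covariance matrix; L times its transpose equals Sigma.
    l11 = sigma1
    l21 = rho * sigma2
    l22 = sigma2 * math.sqrt(1.0 - rho ** 2)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # independent standard Normals
        pairs.append((mu1 + l11 * z1, mu2 + l21 * z1 + l22 * z2))
    return pairs

pairs = simulate_bivariate_normal(100_000, mu1=2.0, mu2=-1.0, sigma1=3.0, sigma2=1.5, rho=-0.6)
x1 = [p[0] for p in pairs]
x2 = [p[1] for p in pairs]
m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
cov12 = statistics.fmean((u - m1) * (v - m2) for u, v in zip(x1, x2))
cor12 = cov12 / math.sqrt(statistics.pvariance(x1) * statistics.pvariance(x2))
print(f"empirical means: {m1:.3f}, {m2:.3f}")
print(f"empirical correlation: {cor12:.3f}  (rho = -0.6)")
```

In higher dimensions the factor L would normally come from a numerical Cholesky routine rather than from a closed-form expression.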
2.3 Averages and their convergence
We can define empirical versions of expectations, based on realizations of random variables. Indeed,
the expectation E [X] can be approximated by an empirical average $n^{-1} \sum_{t=1}^n x_t$, denoted by x̄n ,
where $(x_t)_{t=1}^n$ are n independent realizations of X. This is key: we can approximate a theoretical
quantity such as E [X] with an empirical quantity such as x̄n . In some sense, this is what makes
statistical analysis useful. But why does it work?
Average and expectation. One justification is through the law of large numbers. Assume that
$E[|X|] = \int |x| f_X(x)\,dx < \infty$, and that X1 , X2 , . . . is a sequence of independent copies of X, all with the same distribution as X.
Then the law states:
\[
\text{(law of large numbers)} \qquad \bar{X}_n = \frac{1}{n} \sum_{t=1}^n X_t \xrightarrow[n \to \infty]{\text{a.s.}} E[X]. \tag{2.12}
\]
Figure 2.5: Asymptotics of the average. Left: twenty independent trajectories of X̄n along n.
Right: histogram of √n (X̄n − E[X]) for different n and standard Normal density in solid black line.
In this example each X is Exponential(1), so E[X] = 1 and V[X] = 1.
The convergence “almost sure” or “a.s.” means that $P\left( \lim_{n \to \infty} n^{-1} \sum_{t=1}^n X_t = E[X] \right) = 1$; in
words: with probability one, in an experiment where we generate such a sequence X1 , X2 , etc., for any
tolerance there is an integer N large enough so that X̄n stays within that tolerance of E [X] for all n ≥ N .
This is called an “asymptotic” result, because it
describes a phenomenon occurring when n → ∞ and it does not say anything about any finite
value of n.
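The law of large numbers can be observed on a single simulated trajectory. The sketch below (Python standard library only) computes the running averages of Exponential(1) draws, for which E[X] = 1, and prints the average at a few values of n. The function name, the seed and the chosen values of n are arbitrary.

```python
import math
import random

def running_means(n, seed=7):
    """Running averages of n independent Exponential(1) draws."""
    rng = random.Random(seed)
    total, means = 0.0, []
    for t in range(1, n + 1):
        total += -math.log(1.0 - rng.random())   # Exponential(1) draw, E[X] = 1
        means.append(total / t)
    return means

means = running_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: average = {means[n - 1]:.4f}")
```

The averages should settle around 1 as n grows, although the law itself gives no guarantee about how close they are at any particular n.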
If we assume more, we get more. For example if we assume that V[X] < ∞, then we have
Chebyshev’s inequality that states that for all ε > 0 and for all n ≥ 1:
\[
\text{(Chebyshev)} \qquad P\left( |\bar{X}_n - E[X]| > \varepsilon \right) \le \frac{V[X]}{n \varepsilon^2}. \tag{2.13}
\]
Accordingly, the probability that X̄n is more than ε away from E[X] goes to zero at least as fast as 1/n when n → ∞. Moreover,
Chebyshev is a non-asymptotic result: it works for all n. Under the same assumption V[X] < ∞,
the Central Limit Theorem is a purely asymptotic result that states
\[
\text{(CLT)} \qquad \sqrt{n}\, (\bar{X}_n - E[X]) \xrightarrow[n \to \infty]{d} \mathrm{Normal}(0, V[X]). \tag{2.14}
\]
The convergence is “in distribution”: as n grows, the distribution of the random variable on the left
gets closer and closer to the distribution on the right of the arrow.
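Both results can be probed by repeating the averaging experiment many times. The sketch below (Python standard library only) simulates many independent averages of n Exponential(1) draws, compares the frequency of deviations larger than ε with the Chebyshev bound (2.13), and checks that the rescaled averages in (2.14) have variance close to V[X] = 1. The values of n, ε, the number of replications and the seed are illustrative choices.

```python
import math
import random
import statistics

rng = random.Random(8)
n, replications, eps = 50, 20_000, 0.3

def average_of_exponentials(n):
    """Average of n independent Exponential(1) draws (E[X] = 1, V[X] = 1)."""
    return sum(-math.log(1.0 - rng.random()) for _ in range(n)) / n

averages = [average_of_exponentials(n) for _ in range(replications)]

# Chebyshev (2.13): frequency of deviations larger than eps versus V[X] / (n eps^2).
frequency = sum(abs(m - 1.0) > eps for m in averages) / replications
bound = 1.0 / (n * eps ** 2)

# CLT (2.14): sqrt(n) (average - E[X]) should have mean near 0 and variance near V[X] = 1.
rescaled = [math.sqrt(n) * (m - 1.0) for m in averages]

print(f"frequency of large deviations: {frequency:.4f}, Chebyshev bound: {bound:.4f}")
print(f"rescaled averages: mean {statistics.fmean(rescaled):.3f}, "
      f"variance {statistics.pvariance(rescaled):.3f}")
```

The observed frequency is typically well below the bound: Chebyshev holds for every n but can be quite conservative.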
Empirical covariance and correlation. Consider two samples, x = (x1 , . . . , xn ), and y =
(y1 , . . . , yn ), that are assumed to be realizations of random variables X1 , . . . , Xn and Y1 , . . . , Yn , where
each pair (Xt , Yt ) is distributed as (X, Y ). From (2.9), replacing expectations by averages we obtain the
empirical covariance
\[
\widehat{\mathrm{Cov}}(x_{1:n}, y_{1:n}) = \frac{1}{n} \sum_{t=1}^n (x_t - \bar{x}_n)(y_t - \bar{y}_n) = \frac{1}{n} \sum_{t=1}^n x_t y_t - \bar{x}_n \bar{y}_n. \tag{2.15}
\]
With the same reasoning, if we replace V [X] by the empirical variance $\hat{\sigma}_x^2$ defined as $\widehat{\mathrm{Cov}}(x_{1:n}, x_{1:n})$,
and if we replace V [Y ] by $\hat{\sigma}_y^2$, then from Eq. (2.11) we obtain the empirical correlation as
\[
\widehat{\mathrm{Cor}}(x_{1:n}, y_{1:n}) = \frac{n^{-1} \sum_{t=1}^n (x_t - \bar{x}_n)(y_t - \bar{y}_n)}{\sqrt{\hat{\sigma}_x^2\, \hat{\sigma}_y^2}} = \frac{\sum_{t=1}^n (x_t - \bar{x}_n)(y_t - \bar{y}_n)}{\sqrt{\sum_{t=1}^n (x_t - \bar{x}_n)^2\, \sum_{t=1}^n (y_t - \bar{y}_n)^2}}. \tag{2.16}
\]
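The empirical covariance (2.15) and correlation (2.16) take only a few lines to compute. The sketch below (Python standard library only) implements them with the 1/n normalization used above and applies them to a simulated example where Y = 2X + noise, a hypothetical model chosen for illustration, for which the correlation equals 2/√5. It also shifts and rescales x to show the effect on the empirical correlation.

```python
import math
import random

def empirical_cov(xs, ys):
    """Empirical covariance (2.15), with the 1/n normalization."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def empirical_cor(xs, ys):
    """Empirical correlation (2.16)."""
    return empirical_cov(xs, ys) / math.sqrt(empirical_cov(xs, xs) * empirical_cov(ys, ys))

rng = random.Random(9)
x = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]   # hypothetical example: Y = 2X + noise

cor_xy = empirical_cor(x, y)                                    # theory: 2 / sqrt(5)
cor_shifted = empirical_cor([3.0 * xi + 10.0 for xi in x], y)   # shift and positive scaling
cor_flipped = empirical_cor([-xi for xi in x], y)               # multiplying by -1 flips the sign

print(f"correlation: {cor_xy:.3f}  (theory: {2 / math.sqrt(5):.3f})")
print(f"after shift and positive scaling: {cor_shifted:.3f}")
print(f"after multiplying x by -1: {cor_flipped:.3f}")
```

Many statistical libraries divide by n − 1 rather than n in the covariance; the correlation is unaffected because the normalization cancels in the ratio.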
The ability to approximate “theoretical” quantities such as expectations using samples will be
key in the developments of the next chapters.