Statistics for Data Science - 2
Week 5 Notes
1. Functions of a continuous random variable:
Suppose X is a continuous random variable with CDF FX and PDF fX and suppose
g : R → R is a (reasonable) function. Then, Y = g(X) is a random variable with CDF
FY determined as follows:
• FY (y) = P (Y ≤ y) = P (g(X) ≤ y) = P (X ∈ {x : g(x) ≤ y})
• To evaluate the above probability:
– Convert the subset Ay = {x : g(x) ≤ y} into intervals on the real line.
– Find the probability that X falls in those intervals.
– FY (y) = P (X ∈ Ay ) = ∫_{Ay} fX (x) dx
• If FY has no jumps, you may be able to differentiate and find a PDF.
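A minimal Python sketch of the CDF method, assuming X ∼ Normal(0, 1) and Y = X² (an illustrative choice, using numpy and scipy): the set {x : x² ≤ y} is the interval [−√y, √y], so FY (y) = FX (√y) − FX (−√y), which can be compared with a Monte Carlo estimate.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)        # samples of X ~ Normal(0, 1)
y_samples = x**2                    # samples of Y = g(X) = X^2

for y in [0.5, 1.0, 2.0]:
    cdf_formula = norm.cdf(np.sqrt(y)) - norm.cdf(-np.sqrt(y))   # P(X in [-sqrt(y), sqrt(y)])
    cdf_empirical = np.mean(y_samples <= y)
    print(y, cdf_formula, cdf_empirical)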
2. Theorem: Monotonic differentiable function
Suppose X is a continuous random variable with PDF fX . Let g(x) be monotonic for
x ∈ supp(X) with derivative g ′ (x) = dg(x)/dx. Then, the PDF of Y = g(X) is

fY (y) = fX (g⁻¹(y)) / |g ′ (g⁻¹(y))|
• Translation: Y = X + a
fY (y) = fX (y − a)
• Scaling: Y = aX
fY (y) = (1/|a|) fX (y/a)
• Affine: Y = aX + b
fY (y) = (1/|a|) fX ((y − b)/a)
• Affine transformation of a normal random variable is normal.
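A quick numerical check of the affine case above (a sketch; a = 2, b = 1 and X ∼ Normal(0, 1) are arbitrary choices): the formula (1/|a|) fX ((y − b)/a) should match the Normal(b, a²) density.

import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0
y = np.linspace(-5, 7, 5)
lhs = norm.pdf((y - b) / a) / abs(a)       # (1/|a|) fX((y - b)/a) with fX the standard normal PDF
rhs = norm.pdf(y, loc=b, scale=abs(a))     # density of Normal(b, a^2)
print(np.allclose(lhs, rhs))               # True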
3. Expected value of function of continuous random variable:
Let X be a continuous random variable with density fX (x). Let g : R → R be a function.
The expected value of g(X), denoted E[g(X)], is given by
E[g(X)] = ∫_{−∞}^{∞} g(x) fX (x) dx
whenever the above integral exists.
• The integral may diverge to ±∞ or may not exist in some cases.
4. Expected value (mean) of a continuous random variable:
Mean, denoted E[X] or µX or simply µ is given by
E[X] = ∫_{−∞}^{∞} x fX (x) dx
5. Variance of a continuous random variable:
Variance, denoted Var(X) or σX² or simply σ², is given by

Var(X) = E[(X − E[X])²] = ∫_{−∞}^{∞} (x − µ)² fX (x) dx
• Variance is a measure of spread of X about its mean.
• Var(X) = E[X 2 ] − E[X]2
X                E[X]          Var(X)
Uniform[a, b]    (a + b)/2     (b − a)²/12
Exp(λ)           1/λ           1/λ²
Normal(µ, σ²)    µ             σ²
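The table entries can be checked by evaluating the defining integrals numerically; a sketch for Exp(λ) with λ = 2 (an arbitrary choice), using scipy:

import numpy as np
from scipy.integrate import quad

lam = 2.0
pdf = lambda x: lam * np.exp(-lam * x)                          # Exp(lam) density on x > 0

mean, _ = quad(lambda x: x * pdf(x), 0, np.inf)                 # E[X] = integral of x f(x) dx
var, _ = quad(lambda x: (x - mean) ** 2 * pdf(x), 0, np.inf)    # integral of (x - mu)^2 f(x) dx
print(mean, 1 / lam)       # both ≈ 0.5
print(var, 1 / lam**2)     # both ≈ 0.25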
6. Markov’s inequality:
If X is a continuous random variable with mean µ and non-negative supp(X) (i.e. P (X <
0) = 0), then
P (X > c) ≤ µ/c
7. Chebyshev’s inequality:
If X is a continuous random variable with mean µ and variance σ 2 , then
P (|X − µ| ≥ kσ) ≤ 1/k²
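An empirical illustration of both inequalities (a sketch; X ∼ Exp(1), c = 3 and k = 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)   # Exp(1): mean = 1, variance = 1
mu, sigma = 1.0, 1.0

c = 3.0
print(np.mean(x > c), mu / c)                          # Markov: P(X > c) <= mu/c
k = 2.0
print(np.mean(np.abs(x - mu) >= k * sigma), 1 / k**2)  # Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2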
8. Marginal density: Let (X, Y ) be jointly distributed where X is discrete with range
TX and PMF pX (x).
For each x ∈ TX , we have a continuous random variable Yx with density fYx (y).
fYx (y) : conditional density of Y given X = x, denoted fY |X=x (y).
• Marginal density of Y
– fY (y) = Σ_{x∈TX} pX (x) fY |X=x (y)
9. Conditional probability of discrete given continuous: Suppose X and Y are
jointly distributed with X ∈ TX being discrete with PMF pX (x) and conditional densi-
ties fY |X=x (y) for x ∈ TX . The conditional probability of X given Y = y0 ∈ supp(Y ) is
defined as
• P (X = x | Y = y0 ) = pX (x) fY |X=x (y0 ) / fY (y0 )
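A small worked example of this formula (all numbers are assumptions): X ∈ {0, 1} with pX (0) = 0.7, pX (1) = 0.3, and Y | X = x ∼ Normal(µx , 1).

import numpy as np
from scipy.stats import norm

pX = {0: 0.7, 1: 0.3}
mu = {0: 0.0, 1: 2.0}
y0 = 1.5

fY_y0 = sum(pX[x] * norm.pdf(y0, loc=mu[x]) for x in pX)         # marginal density of Y at y0
post = {x: pX[x] * norm.pdf(y0, loc=mu[x]) / fY_y0 for x in pX}  # P(X = x | Y = y0)
print(post)   # posterior probabilities sum to 1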
Statistics for Data Science - 2
Week 6 Notes
Continuous Random Variables
1. Joint density: A function f (x, y) is said to be a joint density function if
• f (x, y) ≥ 0, i.e. f is non-negative.
• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = 1
2. 2D uniform distribution: Fix some (reasonable) region D in R2 with total area |D|.
We say that (X, Y ) ∼ Uniform(D) if they have the joint density
fXY (x, y) = 1/|D| if (x, y) ∈ D, and 0 otherwise.
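A sketch of working with Uniform(D) when D is the unit disc (an assumed example): rejection sampling from the bounding square gives uniform points on D, and event probabilities follow as area ratios.

import numpy as np

rng = np.random.default_rng(2)
pts = rng.uniform(-1, 1, size=(400_000, 2))           # uniform on the bounding square
inside = pts[np.sum(pts**2, axis=1) <= 1.0]           # keep points inside the disc -> Uniform(D)

# Under Uniform(D), P((X, Y) in A) = |A ∩ D| / |D|; for A = first quadrant this is 1/4.
print(np.mean((inside[:, 0] > 0) & (inside[:, 1] > 0)))   # ≈ 0.25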
3. Marginal density: Suppose (X, Y ) have joint density fXY (x, y). Then,
• X has the marginal density fX (x) = ∫_{y=−∞}^{y=∞} fXY (x, y) dy.
• Y has the marginal density fY (y) = ∫_{x=−∞}^{x=∞} fXY (x, y) dx.
– In general the marginals do not determine joint density.
4. Independence: (X, Y ) with joint density fXY (x, y) are independent if
• fXY (x, y) = fX (x)fY (y)
– If independent, the marginals determine the joint density.
5. Conditional density: Let (X, Y ) be random variables with joint density fXY (x, y).
Let fX (x) and fY (y) be the marginal densities.
• For a such that fX (a) > 0, the conditional density of Y given X = a, denoted as
fY |X=a (y), is defined as
fY |X=a (y) = fXY (a, y) / fX (a)
• For b such that fY (b) > 0, the conditional density of X given Y = b, denoted as
fX|Y =b (x), is defined as
fX|Y =b (x) = fXY (x, b) / fY (b)
6. Properties of conditional density: Joint = Marginal × Conditional, for x = a and
y = b such that fX (a) > 0 and fY (b) > 0.
• fXY (a, b) = fX (a)fY |X=a (b) = fY (b)fX|Y =b (a)
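A numerical check of Joint = Marginal × Conditional (a sketch; the joint density fXY (x, y) = x + y on the unit square is an assumed example):

from scipy.integrate import quad

f_joint = lambda x, y: x + y                            # valid joint density on [0, 1] x [0, 1]
a, b = 0.3, 0.6
fX_a, _ = quad(lambda y: f_joint(a, y), 0, 1)           # marginal density of X at a
f_cond = f_joint(a, b) / fX_a                           # conditional density fY|X=a(b)
print(f_joint(a, b), fX_a * f_cond)                     # both 0.9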
Statistics for Data Science - 2
Week 7 Notes
Statistics from samples and Limit theorems
1. Empirical distribution:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. Let #(Xi = t) denote the number of times
t occurs in the samples. The empirical distribution is the discrete distribution with
PMF
p(t) = #(Xi = t) / n
• The empirical distribution is random because it depends on the actual sample
instances.
• Descriptive statistics: properties of the empirical distribution. Examples:
– Mean of the distribution
– Variance of the distribution
– Probability of an event
• As the number of samples increases, the properties of the empirical distribution
become close to those of the original distribution.
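A short sketch of computing an empirical PMF and descriptive statistics from i.i.d. samples (the die-roll data here is simulated, an assumed example):

import numpy as np

rng = np.random.default_rng(10)
samples = rng.integers(1, 7, size=1_000)                 # i.i.d. rolls of a fair die
values, counts = np.unique(samples, return_counts=True)
p_hat = counts / samples.size                            # empirical PMF p(t) = #(Xi = t)/n
print(dict(zip(values.tolist(), p_hat.round(3))))        # each entry ≈ 1/6
print(samples.mean(), samples.var())                     # descriptive statistics of the sample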
2. Sample mean:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. The sample mean, denoted X, is defined to
be the random variable
X = (X1 + X2 + . . . + Xn )/n
• Given a sampling x1 , . . . , xn , the value taken by the sample mean X is
x = (x1 + x2 + . . . + xn )/n. Often, X and x are both called the sample mean.
3. Expected value and variance of sample mean:
Let X1 , X2 , . . . , Xn be i.i.d. samples whose distribution has a finite mean µ and variance
σ 2 . The sample mean X has expected value and variance given by
E[X] = µ,   Var(X) = σ²/n
• Expected value of sample mean equals the expected value or mean of the distri-
bution.
• Variance of sample mean decreases with n.
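A quick simulation of these two facts (a sketch; Exp(1) samples with n = 25 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 100_000
samples = rng.exponential(scale=1.0, size=(reps, n))   # Exp(1): mu = 1, sigma^2 = 1
xbar = samples.mean(axis=1)                            # one sample mean per replication

print(xbar.mean())          # ≈ mu = 1
print(xbar.var(), 1 / n)    # ≈ sigma^2 / n = 0.04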
4. Sample variance:
Let X1 , X2 , . . . , Xn ∼ X be i.i.d. samples. The sample variance, denoted S 2 , is defined
to be the random variable
S² = [(X1 − X)² + (X2 − X)² + . . . + (Xn − X)²] / (n − 1),
where X is the sample mean.
5. Expected value of sample variance:
Let X1 , X2 , . . . , Xn be i.i.d. samples whose distribution has a finite variance σ 2 . The
sample variance S² = [(X1 − X)² + (X2 − X)² + . . . + (Xn − X)²]/(n − 1) has expected value
given by

E[S²] = σ²
• Values of the sample variance, on average, equal the variance of the distribution.
• The variance of the sample variance decreases with the number of samples (in most cases).
• As n increases, the sample variance takes values close to the distribution variance.
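A simulation check that the (n − 1) divisor makes S² unbiased (a sketch; Normal(0, 4) samples with n = 10 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
n, reps = 10, 200_000
samples = rng.normal(loc=0.0, scale=2.0, size=(reps, n))   # Normal(0, 4): sigma^2 = 4

s2 = samples.var(axis=1, ddof=1)      # sample variance with divisor n - 1
biased = samples.var(axis=1, ddof=0)  # divisor n, for comparison
print(s2.mean())       # ≈ 4 (unbiased)
print(biased.mean())   # ≈ 4 * (n - 1)/n = 3.6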
6. Sample proportion:
The sample proportion of A, denoted S(A), is defined as
S(A) = (number of Xi for which A is true) / n
• As n increases, values of S(A) will be close to P (A).
• Mean of S(A) equals P (A).
• Variance of S(A) tends to 0.
7. Weak law of large numbers:
Let X1 , X2 , . . . , Xn ∼ i.i.d. X with E[X] = µ, Var(X) = σ².
Define the sample mean X = (X1 + X2 + . . . + Xn )/n. Then, for any δ > 0,

P (|X − µ| > δ) ≤ σ²/(nδ²)
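An empirical look at this bound (a sketch; Uniform[0, 1] samples and δ = 0.05 are arbitrary choices): the observed probability stays below σ²/(nδ²) and shrinks with n.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, delta = 0.5, 1 / 12, 0.05    # Uniform[0, 1]: mean 1/2, variance 1/12

for n in [50, 200, 800]:
    xbar = rng.uniform(size=(20_000, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) > delta)
    bound = sigma2 / (n * delta**2)
    print(n, prob, bound)    # observed probability stays below the bound and decreases with n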
Statistics for Data Science - 2
Week 8 Notes
Statistics from samples and Limit theorems
1. Moment generating function (MGF):
Let X be a zero-mean random variable (E[X] = 0). The MGF of X, denoted MX (λ),
is a function from R to R defined as
MX (λ) = E[e^{λX}]
• MX (λ) = E[e^{λX}]
         = E[1 + λX + λ²X²/2! + λ³X³/3! + . . .]
         = 1 + λE[X] + (λ²/2!) E[X²] + (λ³/3!) E[X³] + . . .
  That is, the coefficient of λ^k/k! in the MGF of X gives the kth moment E[X^k].
• If X ∼ Normal(0, σ²), then MX (λ) = e^{λ²σ²/2}
• Let X1 , X2 , . . . , Xn ∼ i.i.d. X and let S = X1 + X2 + . . . + Xn , then
MS (λ) = (E[e^{λX}])^n = [MX (λ)]^n
It implies that the MGF of a sum of independent random variables is the product of
the individual MGFs.
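A numerical check of the product rule (a sketch; X ∼ Normal(0, 1), n = 5 and λ = 0.3 are arbitrary choices): for a normal, MX (λ) = e^{λ²σ²/2} is known in closed form, so [MX (λ)]^n can be compared with a Monte Carlo estimate of E[e^{λS}].

import numpy as np

rng = np.random.default_rng(6)
n, lam, sigma = 5, 0.3, 1.0
S = rng.normal(scale=sigma, size=(500_000, n)).sum(axis=1)   # samples of S = X1 + ... + Xn

mgf_empirical = np.mean(np.exp(lam * S))                     # estimate of E[e^{lam S}]
mgf_formula = np.exp(lam**2 * sigma**2 / 2) ** n             # [MX(lam)]^n
print(mgf_empirical, mgf_formula)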
2. Central limit theorem: Let X1 , X2 , . . . , Xn ∼ iid X with E[X] = µ, Var(X) = σ 2 .
Define Y = X1 + X2 + . . . + Xn . Then, for large n,

(Y − nµ)/(σ√n) ≈ Normal(0, 1).
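A simulation illustrating the approximation (a sketch; Exp(1) samples with n = 40 are arbitrary choices): the CDF of the standardized sum is compared with the standard normal CDF.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, mu, sigma = 40, 1.0, 1.0                           # Exp(1): mu = sigma = 1
Y = rng.exponential(size=(200_000, n)).sum(axis=1)    # sums of n i.i.d. samples
Z = (Y - n * mu) / (np.sqrt(n) * sigma)               # standardized sum

for z in [-1.0, 0.0, 1.5]:
    print(np.mean(Z <= z), norm.cdf(z))               # empirical CDF ≈ standard normal CDF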
3. Gamma distribution:
X ∼ Gamma(α, β) if PDF fX (x) ∝ x^{α−1} e^{−βx} , x > 0
• α > 0 is a shape parameter.
• β > 0 is a rate parameter.
• θ = 1/β is a scale parameter.
• Mean, E[X] = α/β
• Variance, Var(X) = α/β²
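A note on parameterization (a sketch): scipy.stats.gamma takes a shape a = α and a scale θ = 1/β, so Gamma(α, β) in the rate convention used here corresponds to gamma(a=α, scale=1/β).

from scipy.stats import gamma

alpha, beta = 3.0, 2.0
dist = gamma(a=alpha, scale=1 / beta)   # Gamma(alpha, beta) with rate beta
print(dist.mean(), alpha / beta)        # both 1.5
print(dist.var(), alpha / beta**2)      # both 0.75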
4. Beta distribution:
X ∼ Beta(α, β) if PDF fX (x) ∝ x^{α−1} (1 − x)^{β−1} , 0 < x < 1
• α > 0, β > 0 are the shape parameters.
• Mean, E[X] = α/(α + β)
• Variance, Var(X) = αβ/[(α + β)²(α + β + 1)]
5. Cauchy distribution:
X ∼ Cauchy(θ, α²) if PDF fX (x) = (1/π) · α/(α² + (x − θ)²)
• θ is a location parameter.
• α > 0 is a scale parameter.
• Mean and variance are undefined.
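An illustration of the undefined mean (a sketch; the standard Cauchy is assumed): running sample means keep fluctuating instead of converging, unlike for distributions with a finite mean.

import numpy as np

rng = np.random.default_rng(8)
x = rng.standard_cauchy(size=100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[999, 9_999, 99_999]])   # keeps jumping around instead of settling down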
6. Some important results:
• Let Xi ∼ Normal(µi , σi²) be independent and let Y = a1 X1 + a2 X2 + . . . + an Xn . Then,
Y ∼ Normal(µ, σ²)
where µ = a1 µ1 + a2 µ2 + . . . + an µn and σ² = a1²σ1² + a2²σ2² + . . . + an²σn².
That is, a linear combination of independent normal random variables is again a normal
random variable.
• Sum of n i.i.d. Exp(β) is Gamma(n, β).
• Square of Normal(0, σ²) is Gamma(1/2, 1/(2σ²)).
• Suppose X, Y ∼ i.i.d. Normal(0, σ²). Then, X/Y ∼ Cauchy(0, 1).
• Suppose X ∼ Gamma(α, k), Y ∼ Gamma(β, k) are independent random variables, then
X/(X + Y ) ∼ Beta(α, β).
• Sum of n independent Gamma(α, β) is Gamma(nα, β).
• If X1 , X2 , . . . , Xn ∼ i.i.d. Normal(0, σ²), then
X1² + X2² + . . . + Xn² ∼ Gamma(n/2, 1/(2σ²)).
• Gamma(n/2, 1/2) is called the Chi-square distribution with n degrees of freedom, denoted χ²_n.
• Suppose X1 , X2 , . . . , Xn ∼ i.i.d. Normal(µ, σ 2 ). Suppose that X and S 2 denote
the sample mean and sample variance, respectively, then
(i) (n − 1)S²/σ² ∼ χ²_{n−1}
(ii) X and S 2 are independent.
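A simulation check of this last result (a sketch; Normal(2, 9) samples with n = 8 are arbitrary choices):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n, mu, sigma = 8, 2.0, 3.0
samples = rng.normal(mu, sigma, size=(200_000, n))

xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)
w = (n - 1) * s2 / sigma**2                     # should follow chi-square with n - 1 d.o.f.

print(w.mean(), chi2(df=n - 1).mean())          # both ≈ 7
print(w.var(), chi2(df=n - 1).var())            # both ≈ 14
print(np.corrcoef(xbar, s2)[0, 1])              # ≈ 0, consistent with independence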