Probability Theory
Probability Theory
Brice Huang
Spring 2018
   These are my lecture notes for the Spring 2018 iteration of 18.175, Probability
Theory, taught by Prof. Vadim Gorin.
   These notes are written in LATEX during lectures in real time, and may contain
errors. If you find an error, or would otherwise like to suggest improvements,
please contact me at bmhuang@mit.edu.
   Special thanks to Evan Chen and Tony Zhang for the help with formatting,
without which this project would not be possible.
  Special thanks to Ryan Alweiss for proofreading these notes and catching
my errors.
   These notes were last updated 2018-03-23. The permalink to these notes is
http://web.mit.edu/bmhuang/www/notes/18175-notes.pdf.
                                        1
Brice Huang                                                                 Contents
Contents
                                         2
Brice Huang                                                               Contents
                                         3
Brice Huang                                                             Contents
                                      4
Brice Huang        1   February 6, 2018: Probability Theory, and Why We Care
1.1     Administrivia
Lectures are 9:30-11am Tuesdays and Thursdays.
     Office hours are immediately after lecture, or by appointment.
  There are 4 psets (40%) and 2 exams (60%). Collaboration on homework is
OK, but acknowledge sources. Exams are closed book.
    Lemma 1.1
    The following identity holds.
                                      X
                                               (w) = 1.
                                    w∈{0,1}n
                                           5
Brice Huang        1    February 6, 2018: Probability Theory, and Why We Care
Proof. “Who has seen this before?” [Everyone raises hand in unison.]
Observe that the probability of interest has a sharp peak when f (x) is maximized.
By omitted computation:
                                                     
                                         1−x       p
                         f 0 (x) = log         ·        .
                                           x     1−p
Note that
                                  1    1        2
                             √      p         ≤√
                                 2πN x(1 − x)    2π
when each of x, 1 − x ≥ N1 . Because the sum has at most N terms,
                                 
                   SN (w)              2N
              P            < p −  ≤ √ exp(N f (p − )) → 0
                     N                   2π
                                              6
Brice Huang       1     February 6, 2018: Probability Theory, and Why We Care
By a similar argument,
                                                 
                                    SN (w)
                            P              >p+       → 0.
                                      N
                                                                  √
   In other words, we expect |Sn − N p| to be on the order of          N.
      P [SN (w) = M ]
                                                                            3 !!
       1    1               N        p(1 − p) 2                         u
   =√    p         exp −           ·         · u + NO                  √
      2πN p(1 − p)       p(1 −  p)     2N                                N
       1    1
                   exp −u2 /2 .
                             
   =√    p
      2πN p(1 − p)
Summing this from A to B gives us a Riemann sum that converges to the desired
integral.
    While CLT holds much more generally, this proof only works for coin flips.
Replace the coin with a die, and this method doesn’t work. One of the goals of
this class is to develop a framework that will give us results like CLT for free.
                                          7
Brice Huang        1   February 6, 2018: Probability Theory, and Why We Care
                                                             λk
                             P [SN (w) = K] → e−λ               .
                                                             k!
                                                           P∞               k
Remark 1.6. This result implies the identity                 k=0    e−λ λk! = 1.
   As with CLT, there are forms of this result that hold much more generally,
which we’ll need to develop machinery to talk about.
  Proposition 1.7
  There is no way to define P (x ∈ A) for all sets A such that the above
  properties are satisfied.
                                             8
Brice Huang       1   February 6, 2018: Probability Theory, and Why We Care
This doesn’t hold for any p, because the right-hand sum is either 0 or infinite.
   So, it’s still not clear what a continuous probability distribution means.
Overcoming this problem requires a notion of measure, which we will develop
over the next few lectures.
                                           9
Brice Huang                                  2   February 8, 2018: Measure Theory
2.1     σ-Algebras
Definition 2.1. A σ-algebra A is a collection of subsets of Ω, obeying the
following axioms:
    1. ∅ ∈ A;
    2. If A ∈ A, then A = Ω \ A ∈ A;
                                        S
    3. If A1 , A2 , A3 , · · · ∈ A, then n An ∈ A.
A∩B =A∪B ∈A
and
                                A \ B = A ∩ B ∈ A.
    Example 2.2
    If Ω is countable, 2Ω is a σ-algebra. The singletons being in A implies
    everything is in A.
    Example 2.3
    If Ω has some topology (e.g. R), the Borel algebra on Ω is
2.2     Measure
Definition 2.4. A probability measure is a function p : A → R≥0 , obeying
these axioms:
    1. p(∅) = 0;
    2. If A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B);
                                                    S
    3. If A1 ⊂ A2 ⊂ . . . , then limn→∞ p(Ai ) = p ( i Ai ).
                                          10
Brice Huang                               2       February 8, 2018: Measure Theory
  Lemma 2.5
  Given axioms (1) and (2), axioms (3) and (3*) are equivalent.
  Example 2.6
  Finite unions of open/closed intervals in R are an algebra of sets.
    Explicitly, the defining axioms for an algebra of sets A are axioms (1) and
S of a σ-algebra, and the followingSmodification of (3): if A1 ⊂ A2 ⊂ . . . and
(2)
  i Ai ∈ A, then limn→∞ p(Ai ) = p ( i Ai ).
                                                                S
    Analogously, we also modify (3*) to add the hypothesiss i Ai ∈ A. The
modified (3) and (3*) are still equivalent.
   We care about algebras of sets because:
2.4.1   N
                                                                          P
Let Ω = N = {1, 2, 3, . . . , }, and {pi }i∈N be a sequenceP
                                                           with pi ≥ 0,       pi = 1.
Then we can take A = 2Ω and, for any A ∈ A, p(A) = i∈A pi .
                                       11
Brice Huang                                             2     February 8, 2018: Measure Theory
Let the algebra A be finite disjoint unions of rectangles (called simple sets),
with probability measure
                                      X
                            p(A) =         p(rect).
                                                   rect ∈A
  Proposition 2.8
  p(A) is well-defined.
  Proposition 2.9
  p(A) is σ-additive on the algebra of sets A.
                        S∞
Proof. Let
       P A =              n=1   An , where A and the An are simple sets. We want
p(A) = p(An ).
   By writing
                                                                   ∞
                                          k
                                                                               !
                                          X                        [
                             p(A) =              p(An ) + p               An
                                          n=1                     n=k+1
                                                  Pk
and taking k → ∞, we get p(A) ≥                      n=1   p(An ).
   We can find a compact (closed, bounded) A− ⊂ A such that
p(A− ) ≥ p(A) − ,
                                                                   1
                                        p(A+
                                           n ) ≤ p(An ) +            .
                                                                  2n
                                                        −
                A+
        S
Now,        n    n is an open cover of the compact set A , so there is a finite subcover.
Thus,
                                                       N
                                                       X
                                          p(A− ) ≤           p(A+
                                                                n ).
                                                       n=1
                                                      12
Brice Huang                                     2    February 8, 2018: Measure Theory
   This gives us                           X
                                 p(A) ≤        p(An ) + 2
                                           n
                          P
for any . Thus p(A) =       n   p(An ).
   We can make a complete and more explicit construction using the Lebesgue
σ-algebra L([0, 1]2 ). We define the Lebesgue outer measure by
We say A ∈ L([0, 1]2 ) if for all  > 0, there is a simple set B such that
µ∗ (A∆B) < .
   We will take the following theorem for granted.
  Theorem 2.11
  L([0, 1]2 ) is a σ-algebra, and µ∗ is a σ-additive probability measure on it.
                                            13
Brice Huang              3    February 13, 2018: CDFs and measurable functions
      This is because
                                                     !
                                        \
                P ((−∞, y]) = P               (−∞, x]    = lim+ P ((−∞, x]) .
                                                           x→y
                                        x>y
if Fp jumps upward at y.
    Theorem 3.2
    Measures on B(R) are in one-to-one correspondence with distribution
    functions (i.e. functions satisfying 1-3).
                                              14
Brice Huang               3   February 13, 2018: CDFs and measurable functions
  Lemma 3.3
  The measure P on B(R) corresponding to a distribution function F (x) is a
  σ-additive measure.
Proof. Same as the proof for uniform measure in the previous lecture, using the
compact sets trick.
  Theorem 3.4
  B(R) is a σ-algebra spanned by intervals.
We define the Cantor set as follows. Set C0 = [0, 1], and define Cn+1 by taking
out the middle third of each interval in Cn . So, for example:
                     C0       =    [0, 1]
                     C1       =    [0, 31 ] ∪ [ 32 , 1]
                     C1      [0, 19 ] ∪ [ 92 , 13 ] ∪ [ 23 , 79 ] ∪ [ 89 , 1],
                              =
                              T
and define the Cantor set C = i Ci . We will define a pdf with support C.
                       k
   Since Leb(Ck ) = 32 , the uniform measure on Ck has PDF
                                            k
                                            3
                                  pk (x) =      ICk (x).
                                            2
                                                15
Brice Huang                 3    February 13, 2018: CDFs and measurable functions
   Each such PDF has a corresponding CDF Fk (x). Then, we define the CDF
                                      F (x) = lim Fk (x).
                                                     k→∞
3.2     CDFs on Rk
Let P be a probability measure on B(Rk ). We can define a CDF
                 Fp (x1 , . . . , xk ) = P ((−∞, x1 ] × · · · × (−∞, xk ]) .
The same three properties from above hold.
   The examples above also generalize:
                                                         16
Brice Huang             3   February 13, 2018: CDFs and measurable functions
  Example 3.8
  f = IM is measurable iff M ∈ A. This is because any preimage of f is
  either the empty set, M , M , or everything.
  Lemma 3.9
  Suppose for all x ∈ R, f −1 ((−∞, x]) is measurable (i.e. in A). Then f is
  measurable.
Proof. Let
                        H = {H ∈ B(R)|f −1 (H) ∈ A}.
We claim H is a σ-algebra. We verify:
                                !
                           [          [
                        −1
                      f       Hi =      f −1 (Hi ) ∈ A,
                              i
and
                       f −1 H = f −1 (R) \ f −1 (H) ∈ A.
                             
  Since (−∞, x] ∈ H, and B(R) is the minimal σ-algebra containing all intervals
(−∞, x], we must have H = B(R). Therefore f is measurable.
  Lemma 3.10
  Let (Ω, B(Ω)) be some topology. If f : Ω → R is continuous, then f is
  measurable.
  Proposition 3.11
  The pointwise limit of measurable functions is measurable.
Proof. Suppose fn (w) → f (w), for w ∈ Ω. Then, f −1 has the explicit form:
                                \ [ \
             f −1 ((−∞, x]) =              fn−1 ((−∞, x + ]) .
                                  m∈N N ∈N n≥N
                                     1
                                  = m
This is because
                                       17
Brice Huang            3   February 13, 2018: CDFs and measurable functions
  Proposition 3.12
  If f, g are measurable, then f + g is measurable.
   The same result holds for products and quotients of measurable functions.
Proofs are left as an exercise.
   Next time, we’ll develop a notion of expected value, which will require
Lebesgue integration theory.
                                       18
Brice Huang                                   4    February 15, 2018: Lebesgue Integration
4.1        Motivation
Given a measurable function f : (Ω, A, P ) → (R, B(R)),R we would like aRnotion
of expected value. This will be defined via an integral , such that Ef = f dP .
Suppose A ∈ A, and                              (
                                                 1         x∈A
                                   f (x) = IA =                   .
                                                 0         x 6∈ A
Then we define                        Z
                                          IA dP = P (A).
                                                   19
Brice Huang                             4   February 15, 2018: Lebesgue Integration
  Proposition 4.1
  This is well-defined; i.e. the limit exists and does not depend on the choice
  of f n .
  Proposition 4.2
  Any bounded measurable function f is Lebesgue integrable.
                          1
Proof. Define f n (x) =   n bnf (x)c.   It’s clear that f n ⇒ f .
  Proposition 4.3
  Let f ≥ 0 be an unbounded measurable function. f is Lebesgue integrable
  if and only if           X
                              P (f ≥ n) < ∞.
                                 n≥0
Proof. Write        Z              XZ
                          f dP =            f (x)In≤f (x)≤n+1 dP.
                                   n≥0
                                            20
Brice Huang                                 4    February 15, 2018: Lebesgue Integration
yields                                  Z
                 X                                              X
                       P (f ≥ n) ≤              f dP ≤ 1 +             P (f ≥ n).
                 n≥0                                            n≥0
This is well-defined if the two integrals on the right are both well-defined.
   In other words, f is Lebesgue integrable iff its tails are small; that is, if
                X                       X
                    P (f ≥ n) < ∞ and       P (f ≤ −n) < ∞.
                 n≥0                                   n≥0
  Proposition 4.5
  If both Riemann and Lebesgue integrals exist, then they are equal.
    A picture of what’s going on: the Riemann integral takes vertical rectangular
slices of the function, while the Lebesgue integral takes horizontal slices. The
advantage of horizontal slices is that it no longer depends on the structure of
the real line. But, the absolute-summability requirement of Lebesgue integrals
means that we cannot have the Riemann notion of improper integrals.
                                                  21
Brice Huang                          4     February 15, 2018: Lebesgue Integration
  2. (Additivity)          Z                    Z            Z
                               (f + g) dP =         f dP +       g dP.
                                                                         R
  5. (Lebesgue
          R    Dominated Convergence) If |f (x)| ≤ g(x) and                  g(x) dP exists,
     then f (x) dP exists, and
                            Z          Z
                                      
                             f (x) dP  ≤ g dP.
                                      
                           R
  6. (Continuity) Suppose f dP exists. Then, for all  > 0, there exists δ > 0
     such that for all A ∈ A with P (A) < δ,
                                  Z
                                     f IA dP ≤ .
                                           22
Brice Huang                             4   February 15, 2018: Lebesgue Integration
                
and pick δ =   2N .   Then,
                 Z                  Z               Z
                      f + IA dP =   f + IA If + ≤N + f + IA If + ≥N
                                                Z
                                ≤ N · P (A) + f + If + ≥N
                                    1    1
                                ≤      +  = .
                                    2    2
                                            23
Brice Huang                     5   February 22, 2018: Lebesgue Integral Computations
Proof. We’ll first show this when f is an indicator function, then when f is a
simple function, and finally for general f .1
     For indicator functions f = IA , we have
               Z                  X                  XZ
                  f dP = P (A) =       P (A ∩ An ) =    f IAn dP
                                             n                         n
by σ-additivity.
                                       P
     For simple functions f =              fi IA0i , use linearity of the integral.
   For general f , let fm ⇒ f uniformly, where the fm are simple functions.
Fix  > 0, and set N such that sup |fm − f | <  for all m > N . Then, for all
m > N ,         Z          Z        Z                 
                                                       
                  fm dP − f dP  =  (fm − f ) dP  ≤ 
                                                       
and similarly              Z           Z         
                                                 
                            fm IAn dP − f IAn dP  ≤ P (An ).
                                                 
Summing the second equality over  gives the bound
                                                    
                 X Z                XZ              
                       fm IAn dP −          f IAn dP  ≤ .
                                                    
                 
                  n                  n
                                                     
           R             P R
Since          fm dP =      n    fm IAn dP , this implies
                                Z                          
                                           XZ              
                                 f dP −           f IAn dP  ≤ 2.
                                                           
                                            n
                                                            
                                                     24
Brice Huang                   5   February 22, 2018: Lebesgue Integral Computations
     Proposition 5.2
     If A1 , A2 , . . . partition Ω and the bounds
                         Z                    XZ
                            |f |IAn dP < ∞,        |f |IAn dP < ∞,
                                                 n≥1
then f is integrable.
Definition 5.3. We say f (x) is equivalent to g(x) if they are equal almost
surely, i.e.
                           Pr(f (x) 6= g(x)) = 0.
     Proposition 5.4
                                     R                 R
     If f, g are equivalent, then        f (x) dP =        g(x) dP .
     Proposition 5.5
                         R               R
     If for all A ∈ A,       f IA dP =       gIA dP , then f = g almost surely.
A = {f − g > } ∈ A
contradiction.
                                                 25
Brice Huang            5        February 22, 2018: Lebesgue Integral Computations
Proof. For simple functions, the two integrals evaluate to the same sum, so
there is nothing to prove.
   For general functions: if fn are simple functions uniformly approximating f ,
then fn (φ(x)) are simple functions uniformly approximating f (φ(x)). So we are
done.
for A ∈ B(R).
   Great. Now we’re back to a real-valued integral, so we can compute things.
The only wrinkle is we need to know what the measure Pf looks like.
   If Pf is discrete, we directly compute
                                     X
                             E(f ) =   xi P (f = xi ).
                                    Rx
If Pf is continuous, then Pf =       −∞
                                          p(t) dt.
    The following formula lets us explicitly compute random variables’ expecta-
tions:
                                            26
Brice Huang                    5      February 22, 2018: Lebesgue Integral Computations
  Proposition 5.9
  The
  R ∞ following equality of Lebesgue and Riemann integrals holds, provided
   −∞
      |x|p(x) < ∞.
                          Z          Z ∞
                             x dPf =     xp(x) dx.
                                                         −∞
Proof of Change of Variables Formula. We’ll first show these integrals exist iff
the other exists. Note the two-sided bound
                                 Z −n            Z n+1         
       nP (n ≤ |f | ≤ n + 1) = n        p(x) dx +       p(x) dx
                                                  −n−1                              n
                                             Z   −n                             Z    n+1
                                         ≤           |x|p(x) dx +                          |x|p(x) dx
                                             −n−1                               n
                                                         Z   −n                           Z    n+1          
                                         ≤ (n + 1)                     p(x) dx +                      p(x) dx
                                                              −n−1                             n
                                         = (n + 1)P (n ≤ |f | ≤ n + 1)
This implies
               Z   ∞                               X
                        |x|p(x) < ∞ ⇔                    nP (n ≤ |f | ≤ n + 1) < ∞.
                   −∞                              n≥0
The second inequality holds iff the Lebesgue integral is defined, as desired.
   Next, we prove the integrals are equal. We can show that
                       Z                 Z A
                          xI|x|≤A dPf =      xp(x) dx
                                                               −A
                                      1
by approximating x with               n bnxc.
                            Then, we take A → ∞; since
                Z        Z              Z
                  x dPf − xI|x|≤A dPf = xI|x|>A dPf
For simplicity, let’s take Ω = R, and f (x) = x. Consider the Gaussian random
variable N (µ, σ 2 ), with probability density
                                                     1  (x−µ)2
                                        p(x) =      √ e− 2σ2 .
                                                   σ 2π
                           R∞
   First, we’ll verify that −∞ p(x) = 1. By change of variables, this is equivalent
  R∞       2           √
to −∞ e−x /2 dx = 2π. This is because
               Z   ∞   Z   ∞                                    Z      2π   Z   ∞
                                   −x2 /2−y 2 /2                                           2
                               e                   dx dy =                          re−r       /2
                                                                                                    dr dθ
               −∞ −∞                                               0        0
                                                              = 2π.
                                                         27
Brice Huang             5   February 22, 2018: Lebesgue Integral Computations
In practice, the second is usually easier because we don’t have to deal with Pf 2 .
   The squared-Gaussian’s expectation is
                              Z ∞
                        2 2
                EN (µ, σ ) =       x2 p(x) dx
                               −∞
                                     Z ∞
                                 1             (x−µ)2
                            =  √          x2 e− 2σ2 dx
                              σ 2π −∞
                                     Z ∞
                                 1                     x2
                            =  √          (x + µ)2 e− 2σ2 dx
                              σ 2π −∞
                                           Z ∞
                                       1              x2
                               2
                            =µ +     √         x2 e− 2σ2 dx
                                    σ 2π −∞
                                            Z ∞
                                     2 1               x2
                               2
                            =µ +σ      √         x2 e− 2 dx
                                         2π −∞
                               2     2
                            =µ +σ
where the final integral is computed by parts.
  This is the sense of convergence for which, for example, the Law of Large
Numbers (see Lecture 1) holds.
                                        28
Brice Huang              5   February 22, 2018: Lebesgue Integral Computations
  Lemma 5.12
  If ξn →a.s. ξ, then ξn →P ξ.
and                                 \[ \
                               A=               A,n .
                                    >0 N n≥N
                                        29
Brice Huang           6   February 27, 2018: Convergence of Random Variables
In some sense, these just differ in the order that we take limits.
  Last time we showed that (1) implies (2). Today we will define two more
modes of convergence.
    Proposition 6.2
    ξn →d ξ if and only if for each continuous bounded f (x),
             lim Efx (ξn ) ≥ lim Egx,m (ξn ) = Egx,m (ξ) ≥ Efx− m1 (ξ).
            n→∞             n→∞
                                        30
Brice Huang               6    February 27, 2018: Convergence of Random Variables
               lim Efx (ξn ) ≤ lim Ehx,m (ξn ) = Ehx,m (ξ) ≤ Efx+ m1 (ξ).
              n→∞               n→∞
whence                                                
                            1                          1
                     Fξ x −     ≤ lim Fξn (x) ≤ Fξ x +     .
                            m    n→∞                   m
Taking m → ∞ gives Fξn (x) → Fξ (x).
   Conversely, suppose ξn → ξ converges in distribution. Note that Fξ is
monotonic, so it has only countably many discontinuities.2 Take a continuous f
with |f | ≤ C. We want to show Ef (ξn ) → Ef (ξ).
   Pick a point of continuity A > 0 of Fξ , big enough so that Fξn (−A) < δ and
1 − Fξn (A) < δ for all n. This implies the bound
whence
                                     Ef (ξ)I|ξ|>A ≤ 2δC.
Pick points −A = x0 < x1 < · · · < xN = A, such that all xi are points of
continuity of Fξ and |f (x) − f (y)| < δ inside each [xi , xi+1 ]. Thus we have the
bound
Z                                                         
                                                          
 f (ξn )Iξ ∈[x ,x ] dP − f (xi ) (Fξn (xi ) − Fξn (xi−1 )) ≤ δ (Fξn (xi ) − Fξn (xi−1 ) ,
          n   i i−1                                       
whence
                   X                                  
Ef (ξn )I|ξn |≤A −   f (xi ) (Fξn (xi ) − Fξn (xi−1 )) ≤ δ (Fξn (xN ) − Fξn (x0 )) ≤ δ.
                                                      
Similarly we have
                             X                                
              Ef (ξ)I|ξ|≤A −   f (xi ) (Fξ (xi ) − Fξ (xi−1 )) ≤ δ.
                                                              
As n → ∞, we get that
                                                      
                  lim Ef (ξn )I|ξn |≤A − Ef (ξ)I|ξ|≤A  ≤ 2δ.
                                                      
                        n→∞
Along with the bounds on Ef (ξn )I|ξn |>A and Ef (ξ)I|ξ|>A , this implies
                                          
                     lim Ef (ξn ) − Ef (ξ) ≤ 2δ + 4δC.
                                          
                              n→∞
6.2    L1 convergence
Definition 6.3. ξn → ξ in expectation (denoted ξn →L1 ξ) if
                                       E |ξn − ξ| → 0.
  2 Proof:   take one rational number skipped by each discontinuity.
                                              31
Brice Huang          6   February 27, 2018: Convergence of Random Variables
and                                         [
                                     Am =        Am
                                                  n.
                                             n
as n → ∞. Moreover,
                         Z         Z        
                                            
                          ξn I dP  ≤  gI dP  ≤ 
                              A         A   
                                         32
Brice Huang             6   February 27, 2018: Convergence of Random Variables
and similarly                         Z      
                                             
                                       ξI dP  ≤ .
                                         A   
This shows that                  Z     Z     
                                             
                            n→∞ ξn dP − ξ dP  ≤ 2.
                             lim             
Proof. First, we may assume fn ≥ 0 for all n, by adding −f1 to all our functions.
      By Markov’s Inequality,
                                      Z                        Z
                                  1                        1                    K
              P (fn (x) > M ) ≤           fn Ifn >M dP ≤           fn dP =        .
                                  M                        M                    M
Let
                       A = {x|fn (x) is unbounded as n → ∞}.
Then,
                                  P (A) ≤ P (fn (x) > M )
for all M , so in fact P (A) = 0. This shows that fn (x) → f (x) almost surely.
                                                         R
    It remains to show that Efn → Ef . We will show f dP < ∞, which will
allow us to use Dominated Convergence.
      For each N , note the bound
        N
        X                             Z                            Z
              nP (n ≤ f ≤ n + 1) ≤        f If <N +1 dP = lim          fn Ifn <N +1 dP,
                                                           n→∞
        n=1
                                             33
Brice Huang          6   February 27, 2018: Convergence of Random Variables
Proof. Let
                             ϕn (x) = inf fm (x).
                                      m≥n
0 ≤ ϕn (x) ≤ fn (x) ≤ K.
                                      34
Brice Huang                                            7     March 1, 2018: Product Measures
    Lemma 7.1
    P1 ⊗ P2 is σ-additive on A1 × A2 .
                               S
Proof. Suppose A × B =             iAi × Bi as a disjoint union. We want to show
                                             X
                           P1 (A) · P2 (B) =     P1 (Ai ) · P2 (Bi ).
                                                   i
For finite unions, we subdivide the sets Ai × Bi more finely into atoms, like we
did in Lecture 2. For countably-infinite unions, define
fi = IAi P2 (Bi ).
This implies                           X
                                            fi = IA · P2 (B).
                                       i
The integrals of partial sums of the LHS are increasing, so the Monotone
Convergence Theorem applies, and
                    N
                    X                                      N Z
                                                           X              N
                                                                        Z X
                          (P1 ⊗ P2 )(Ai × Bi ) =                 fi =          fi
                    i=1                                    i=1           i=1
has limit                     Z
                                   IA P2 (B) = P1 (A) · P2 (B).
as N → ∞.
    Corollary 7.2
    By Caratheodory’s Theorem this probability measure extends to a proba-
    bility measure on σ(A1 × A2 ).
                                                  35
Brice Huang                                        7   March 1, 2018: Product Measures
  Corollary 7.4
  To show (7.1) it is enough to check
Proof. The intervals of the form Bi = (−∞, bi ] span the Borel algebra B(R).
  Example 7.5
  Let N (0, 1) be the standard Gaussian random variable, with pdf p(x) =
           2
   √1 e−x /2 .    Then, a pair of independent Gaussian random variables
    2π
  (N (0, 1), N (0, 1)) has pdf
                                          1      2        2
                             p(x, y) =      · e−x /2 · e−y /2 .
                                         2π
  Theorem 7.6
  If X, Y are independent and integrable random variables, then
                                              36
Brice Huang                                       7   March 1, 2018: Product Measures
   Note that both sides are bilinear in X and Y , so the result holds for simple
functions X and Y .
   Now suppose X, Y ≥ 0. We consider
                                  1 n                     1 n
                         Xn =       b2 Xc         Yn =      b2 Y c.
                                 2n                      2n
These are independent too, because
                                            1 n       1       1          1
      P (Xn ≤ x, Yn ≤ y) = P (X <             b2 xc + n , Y < n b2n yc + n )
                                           2n        2       2          2
factorizes. So, by the monotone convergence theorem, E(Xn Yn ) → E(XY ) and
E(Xn ) · E(Yn ) → E(X) · E(Y ).
  Proposition 7.7
  If X, Y are independent, then f (X), g(Y ) are as well, for any measurable
  functions f, g.
  Proposition 7.8
  If X1 , . . . , Xm , Xm+1 , . . . , Xn are independent, then
f (X1 , . . . , Xm ), Xm+1 , . . . , Xn
are independent.
Proof. For simplicity, we’ll show this for f (X, Y ) and Z. The general argument
is analogous. We use the identity
                                       = P ((X, Y ) ∈ f −1 (B1 )) · P (Z ∈ B2 )
                                       = P (f (X, Y ) ∈ B1 ) · P (Z ∈ B2 ).
The crucial step is the second equality, which relies on the construction PX ⊗
PY ⊗ PZ being associative. That is, it doesn’t matter whether we construct
this product measure as (PX ⊗ PY ) ⊗ PZ or as PX ⊗ (PY ⊗ PZ ), because in
both cases our base sets are boxes SX × SY × SZ and Caratheodory’s Theorem
guarantees uniqueness of the resulting probability measure. Thus the second
equality is valid and we are done.
                                             37
Brice Huang                                   7    March 1, 2018: Product Measures
Proof Sketch. For elementary functions, this is true by Theorem 7.10. Following
the usual strategy, we extend this first to simple functions, then to arbitrary
functions.
  Theorem 7.12
  Let X, Y be independent random variables. Then
                                Z
                     FX+Y (t) =      Fx (t − u) dPy .
                                      u∈R
                                        38
Brice Huang                                      7    March 1, 2018: Product Measures
Proof. We compute:
                     Fx+y (t) = P (x + y ≤ t)
                                x
                              =    Ix+y≤t dPx ⊗ dPy
                                Z Z              
                              =         Ix≤t−y dPx dPy
                                 y    x
                                Z
                              = Fx (t − y) dPy .
                                    y
whence                              Z   ∞
                       px+y (t) =           px (t − u)py (u) du.
                                    −∞
Joke. “Now let’s compute something! Well, the only thing I know how to
integrate is a Gaussian random variable, so...” - Gorin
  Example 7.13
  Suppose X ∼ N (µ, σ 2 ), Y ∼ N (ν, c2 ) are independent. Then, we will show
X + Y ∼ N (µ + ν, σ 2 + c2 ).
                                                    u2
                                                       
                                       1
                       px (u) =       √ exp − 2
                                    σ 2π           2σ
                                              2
                                      1          u
                       py (u) = √ exp −                .
                                      2π         2
  We compute:
             Z ∞
  px+y (t) =     px (t − u)py (u) du
              −∞
                 Z ∞
                               (t − u)2     u2
                                              
              1
           =           exp −            − 2 du
             2πσ −∞                2       2σ
                                        ∞                              2 !
                          1 t2                   1 σ2 + 1
                                   Z                     
              1                                                   t
           =     exp −                     exp − ·        · u−              du
             2πσ          2 1 + σ2     −∞        2   σ2        1 + σ12
                                    1 t2
                                            
                  1
           =p               exp −              .
               2π(σ 2 + 1)          2 1 + σ2
                                            39
Brice Huang                       8     March 6, 2018: Sequences of Random Variables
    Lemma 8.1
    Ω is compact in the Tikhonov topology.
converges. Then, we can pick a subsequence i21 , i22 , i23 , . . . of i11 , i12 , i13 , . . . such
that
                             (i2 )  (i2 )  (i2 )
                            x2 1 , x2 2 , x2 3 , . . .
converges. Continue this process, so
                                          n           n           n
                                      x(i1 ) , x(i2 ) , x(i3 ) , . . .
                                                      40
Brice Huang                       8    March 6, 2018: Sequences of Random Variables
  Proposition 8.2
  This defines a probability measure on Ω.
where the union on the right is disjoint. If this union is finite, the boxes on the
right differ in finitely many dimensions, so we can reduce to a finite-dimensional
setting. This is handled by the product-measures machinery from last lecture.
   Otherwise, finite additivity implies
                                                   N
                                                   X
                                                         p |ak1 , bk1 | × |ak2 , bk2 | × . . . ,
                                                                                              
           p (|a1 , b1 | × |a2 , b2 | × . . .) ≥
                                                   k=1
whence
                                                   ∞
                                                   X
                                                         p |ak1 , bk1 | × |ak2 , bk2 | × . . . .
                                                                                              
           p (|a1 , b1 | × |a2 , b2 | × . . .) ≥
                                                   k=1
We need to show equality holds. Shrink the box on the left a bit to get a closed
box C with measure
                              p(C) ≥ LHS − ,
and grow each box on the right a bit to get an open box Ck with measure
                                                                    
                                 p(Ck ) ≤ p(k th box) +               .
                                                                   2k
By
SMcompactness, the Ck are an open subcover of C, which has a finite subcover
 k=1 Ck ⊃ C. Then, by finite additivity,
                             ∞
                             X                    N
                                                  X
             RHS +  ≥             p(Ck ) ≥             p(Ck ) ≥ p(C) ≥ LHS − ,
                             k=1                  k=1
whence
                                      RHS ≥ LHS + 2.
Take  → 0 to get LHS = RHS, as desired.
   Cool. This lets us construct sequences of i.i.d. uniform variables. Now, how
do we get arbitrary distributions?
    Supose ξ has density p(x) > 0. Then Fξ (x) : R → (0, 1) is strictly monotone
and continuous. We claim we generate the variable ξ from a uniform random
variable as follows.
  Proposition 8.3
  Let u ∼ uniform(0, 1). Then,
Fξ−1 (u) = ξ.
                                                   41
Brice Huang                 8   March 6, 2018: Sequences of Random Variables
Therefore:
  Corollary 8.4
  Independent random variables with arbitrary distributions exist.
  Lemma 8.7
  For independent ξ and µ, the following hold:
Proof. The first identity follows from linearity of the Lebesgue integral. For the
second identity, compute:
                                                2
      Var(ξ + µ) =E (ξ + µ)2 − E [E(ξ + µ)]
                               
                                                    2          2
                  =E[ξ 2 ] + E[µ2 ] + 2E[ξµ] − (E[ξ]) − (E[µ]) − 2E[ξ]E[µ]
                  = Var(ξ) + Var(µ),
where for the last step we use E[ξµ] = EξEµ, by independence.
   The third identity should be clear.
                                         42
Brice Huang                    8    March 6, 2018: Sequences of Random Variables
  Lemma 8.8
  If random variables ηn obey E[ηn2 ] → 0, then ηn →P 0.
                                                               1
                                                                   Pn
Proof of Weak Law of Large Numbers. Let ηn =                   n    k=1 ξk   − m. Then,
E[ηn ] = 0
and
                                                      n
                                                   1 X
                        E[ηn2 ] = Var(ηn ) =            Var(ξk )
                                                   n2
                                                         k=1
                                1
                               = Var(ξ1 ) → 0.
                                n
Then, by Lemma 8.8, ηn →P 0 and we are done.
As a generalization:
  Corollary 8.9
  If ξi are independent and Var(ξi ) = σi < ∞, and
                                        n
                                     1 X
                                          σk → 0,
                                     n2
                                           k=1
  then                                                 !
                                   n
                           1       X
                                         (ξk − E[ξk ])     →P 0.
                           n
                                   k=1
                                              43
Brice Huang                 8    March 6, 2018: Sequences of Random Variables
  Note that E[xi ] = E[x2i ] = p, so Var(xi ) = p(1 − p). Therefore, by the Weak
LLN,
                                  1X
                                       xi →P p
                                  n i
and by Markov’s Inequality
                       X            
                        1                Var x1
                    P    xk − p >  <
                         n                  n2
                                           p(1 − p)
                                         =
                                              n2
                                             1
                                         ≤      .
                                           4n2
   It remains to show Bn → f uniformly, i.e.
                          lim sup |f (p) − Bn (p)| = 0.
                         n→∞ p∈[0,1]
Pick δ > 0 such that |f (x) − f (y)| < 12  whenever |x − y| < δ. This is possible
because continuous functions on closed intervals are uniformly continuous. Then,
write
                                       
                           X          k      n k
              Bn (p) =             f           p (1 − p)n−k (∗)
                           k
                                      n      k
                       k:| n −p|<δ
                                       
                           X          k     n k
                     +             f           p (1 − p)n−k (∗∗).
                          k
                                      n      k
                       k:| n −p|≥δ
                                C
                           ≤         .
                               4nδ 2
                                             44
Brice Huang                 8   March 6, 2018: Sequences of Random Variables
whence
                                               1     C
                          |Bn (p) − f (p)| ≤     +
                                               2    2nδ 2
                                        C     1
for all p. Take n large enough that 2nδ   2 < 2 , so for all sufficiently large n,
  Corollary 8.11
  Suppose random variables ξ and η satisfy |ξ| < C, |η| < C, and E[ξ k ] =
  E[η k ] for all k. Then ξ = η, in the sense that their distributions are equal.
Proof. The condition E[ξ k ] = E[η k ] for all k implies E[g(ξ)] = E[g(η)] for all
polynomials g. By Weierstrass Approximation, this implies E[f (ξ)] = E[f (η)]
for all continuous f .
   The distribution functions Fξ , Fη are the expectation of a step function,
which can be approximated by arbitrarily steep continuous functions, so we are
done.
                                        45
Brice Huang                      9    March 8, 2018: Strong Law of Large Numbers
Proof. Let’s prove the first assertion first. Note the set equality
                                            ∞ [
                                            \ ∞
                                     B=               An ,
                                            k=1 n=k
            S∞
and that      n=k   An is a decreasing function of k. Therefore, for any k,
                                        ∞
                                              !
                                       [           X
                          P (B) ≤ P       An ≤         P (An ).
                                       n=k              n≥k
                                             46
Brice Huang                               9       March 8, 2018: Strong Law of Large Numbers
compute:
                               ∞                        ∞
                                          !                        !
                               [                        \
                 1−P                 An       =P              An
                               n=k                      n=k
                                                  Y            
                                              =         P An           (by independence)
                                                  n≥k
                                                  Y
                                              =         (1 − P (An ))
                                                  n≥k
                               =0
                                   P (Ai ) = ∞.3
                                 P
where the last step follows from
   The difference between this bound and bounds like Markov is that Kol-
mogorov’s Inequality bounds the random walk X1 + · · · + Xk at all timesteps k,
whereas Markov bounds only the value of the walk at the last timestep.
and
                            Ak = { max |Si | < a and |Sk | ≥ a}.
                                      1≤i≤k−1
                                                                                               Sn
That is, Ak is the event that |Sk | ≥ a first at timestep k, and A =                            k=1   Ak is
the event that |Sk | ≥ a at some point.
   Then, we can compute:
       n
       X
               σk2 = ESn2
       k=1
                  ≥ E[Sn2 IA ]
                    Xn
                  =     E[Sn2 IAk ]
                      k=1
                      Xn
                             E[Sk2 IAk ] + 2E[Sk (Sn − Sk )IAk ] + E(Sn − Sk )2 IAk
                                                                                   
                  =
                      k=1
                      Xn
                             2               
                  ≥          a P (Ak ) + 0 + 0 .
                      k=1
  3 Details:   use the bound
                                                                                         
                  Y                           Y                               X
                       (1 − P (An )) ≤            exp(−P (An )) = exp −            P (An )
                 n≥k                      n≥k                                 n≥k
                                                         47
Brice Huang                    9   March 8, 2018: Strong Law of Large Numbers
as desired.
    Using Kolmogorov, we’ll prove another bound. This isn’t quite Strong LLN
yet, but it’s useful.
  Theorem 9.4
                                                               2
                                                                            σk2 < ∞,
                                                                        P
  Let XP
       1 , X2 , . . . be independent with mean 0 and variance σk . If   k
           ∞
  then n=1 Xk converges almost surely.
                                   σ 2 + · · · + σn+m
                                                  2
                                
           P max |Sn+k − Sn | ≥  ≤ n+1               .
               1≤k≤n                      2
This is because we can take n such that 12 k≥n σk2 < δ.
                                             P
              1
    Take  = M  , for M = 1, 2, 3, . . . . For each , (*) occurs with probability 1.
So, their countable intersection occurs with probability 1.
                                           48
Brice Huang                  9       March 8, 2018: Strong Law of Large Numbers
                                                                       yn
   Here’s one application of this. Suppose bn =P  n and xn =           n .   Then this
                                                   n
theorem says that if n ynn is convergent, then n1 i=1 yi → 0.
                    P
                                                Pk
Proof of Kronecker’s Lemma. Let Sk =             i=1   xk . Using summation by parts,
we compute:
             n            j
           1 X          1 X
               bk xk =      nSk − Sk−1
          bn           bn
              k=1           k=1
                                                n
                           1         1        1 X
                      =      bn Sn − b0 S0 −      Sk−1 (bk − bk−1 ).
                          bn        bn       bn
                                                        k=1
                 Pn                        P∞
The first term is i=1 xi , which converges
                                   P∞ to i=1 xi . The second is 0 because
S0 = 0, and the third converges to i=1 xi by Toeplitz.
  Proposition 9.7
                                                                    2
  Let
  P∞Xi 1be 2independent random variables with mean µi and variance σi . If
    n=1 n2 σn < ∞, then
                                 n
                            1X
                               (Xk − µk ) →a.s. 0.
                            n
                              k=1
Note that this implies Strong LLN, conditioned on the values Eξi2 existing. So
we’re almost there!
Proof. The random variable Xn n−µn has mean 0 and variance n12 σn2 . By Theo-
rem 9.4, n Xn n−µn converges almost surely. By Kronecker, this implies
        P
                                 n
                            1X
                               (Xk − µk ) →a.s. 0.
                            n
                              k=1
Yn = ξn I|ξn |≤n .
                                           49
Brice Huang                           9     March 8, 2018: Strong Law of Large Numbers
  Lemma 9.8
  Let ξ1 , ξ2 , ξ3 , . . . be i.i.d. random variables. If E|ξn | < ∞ for all n, then
  P Var(Yn )
    n    n2      < ∞.
                                                                                1
                                                                                    Pn
As E[Yk ] → Eξi = m by continuity of the Lebesgue integral, we have             n    k=1   E[Yk ] →
m. Therefore,
                        n                n
                     1X               1X
                           Yk →a.s.        E[Yk ] →a.s. m.
                     n                n
                                k=1                k=1
                                                  50
Brice Huang                       9    March 8, 2018: Strong Law of Large Numbers
                                                      Pn
   For the converse, suppose for contradiction that n1 k=1 ξk → m and E|ξi | =
∞. Then, by the Lebesgue integrability condition, for any C we have
                           X  |ξn |         
                               P        > n = ∞.
                                     C
                                 n≥1
By Borel-Cantelli, this means that with probability 1, the event |ξn | > nC
happens for infinitely many ξn . At each ξn where this happens, we have
              n                                        
             1 X           C            1 n−1X          C
                     ξk − m >      or            ξk − m > .
                                                       
             n                 3        n − 1               3
             
                                                         
                      k=1                           k=1
             1
                 Pn
Therefore,   n    k=1 ξk    cannot converge.
                                            51
Brice Huang                                  10      March 13, 2017: Snow Day
                                     52
Brice Huang                             11       March 15, 2017: Characteristic Functions
Note that this definition does not depend at all on the first finitely many terms
of An .
  Example 11.1
  The σ-algebra                        X
                                {w ∈ Ω|  ξn converges}
                                             n
is a tail σ-algebra.
    In the previous example, this means that every series of random variables
either converges with probability 1, or diverges with probability 1. There is no
in between.
P (A ∩ A) = P (A) · P (A),
  Example 11.3
  Let ξ1 , ξ2 , . . . be random variables. Then the radius of convergence of
                                                   ∞
                                                   X
                                       f (t) =           ξi ti
                                                   i=1
   5 i.e. the σ-algebras {ξ −1 (B)|B ∈ B(R)}. It may be helpful to think of this as the σ-algebra
                           k
of events of the form ξk ∈ B, for B ∈ B(R) – so, for example, events like ξk ≤ c.
                                                  53
Brice Huang                             11    March 15, 2017: Characteristic Functions
  Example 11.4
  Let’s be more concrete. Let ξi in the previous example be independent
  Boolean variables:             (
                                   1    p = 12
                           ξi =                .
                                   −1 p = 12
  f (t) clearly blows up when t > 1 and converges when t < 1, so R = 1.
  Example 11.5
  R = 1 when ξi ∼ N (0, 1) as well. In fact we can note that ξi ti ∼ N (0, t2i ),
  and use the theorem from last lecture about sum of variances converges
  implies sum of random variables converges.
  Proposition 11.7
  The characteristic function has the following properties:
1. |ϕξ (t)| ≤ 1.
     2. ϕξ (0) = 1.
     3. ϕξ (t) is uniformly continuous over t ∈ R.
     4. ϕξ (·) is positive-definite, i.e. for any t1 , . . . , tk ∈ R,
                                                54
Brice Huang                                 11    March 15, 2017: Characteristic Functions
  Theorem 11.8
  Any function satisfying (1) through (4) is a characteristic function.
    In fact, we’ll later show how to recover a random variable from its character-
istic function.
  Proposition 11.9
  The characteristic function has the following properties:
  Example 11.10
  Let ξ ∼ N (µ, σ 2 ). We’ll first compute this for µ = 0, σ = 1:
                  Z ∞                             Z ∞
              1               x2        1     1 2        1       2        1 2
    ϕξ (t) = √          eitx− 2 dx = √ e− 2 t         e− 2 (x−it) dx = e− 2 t ,
               2π −∞                    2π         −∞
                       R ∞ − 1 (x−it)2      √
  where we evaluate −∞ e 2             dx = 2π by complex analysis.a
                                                            2 2
                                                                 
     So, applying (1) above, we get ϕξ (t) = exp iµt − σ 2t in the general
  case.
                                        1   2
             contour integrate e− 2 z along the box with vertices R, −R, −R − it, R − it,
     a Details:
         RR    1      2      R R − 1 x2
             −2
  to get −R e (x−it) dx ≈ −R      e 2    dx with error going to 0 as R → ∞.
                                                   55
Brice Huang                               11      March 15, 2017: Characteristic Functions
   Proposition 11.11
  Assume E[ξ k ] exists. Then ϕξ (t) is k times differentiable, and
                                        k
                                    ∂               
                                              ϕξ (t)         = ik E[ξ k ].
                                                    
                                    ∂t                  t=0
Proof. We will induct on k. Suppose E[ξ k+1 ] exists and is finite. Then E[ξ k ]
exists and is finite.6 Now compute:
       (k)
       ϕ (t + ∆) − ϕ(k) (t)                     itξ l ei∆ξ − 1
                                                                     
                                 k+1   itξ k+1 
                                                                         
                            − i     Ee   ξ      = Ee ξ           − iξ  
                ∆                                         ∆           
   Corollary 11.12
  Whenever Eξ m exists,
                                           m
                                           X (it)k
                               ϕξ (t) =                   Eξ k + o(tm ).
                                                   k!
                                           k=0
                                   1 ∞
                                                         −itx2
                                                                − e−itx1
                                    Z                                    
                                                 δ 2 t2  e
        Fξ (x2 ) − Fξ (x1 ) = lim        ϕ(t)e− 2                          dt.
                              δ→0 2π −∞                        −it
The point of the δ is to add a small Gaussian noise, to smooth over any
discontinuities in the original random variable.
   6 Details:   write
                                E[ξ k ] = E[ξ k I|ξ|≤1 ] + E[ξ k I|ξ|>1 ].
The first term is clearly bounded, and the second is bounded by E[ξ k+1 ]
                                                   56
Brice Huang                         11       March 15, 2017: Characteristic Functions
So, we backed out the density Pδ (x) from the characteristic function ϕδ (t).
   Now, we can invert ξ. Let Fδ be the distribution function of ξ + Yδ . Then:
                               Z x2
         Fδ (x2 ) − Fδ (x1 ) =      Pδ (x) dx
                                x1
                                         ∞
                                                                  e−itx2 − e−itx1
                                    Z                                              
                               1                        2 2
                                                     − δ 2t
                            =                ϕ(t)e                                      .
                              2π     −∞                                 −it
  Corollary 11.14
  If                           X 1
                                    E|ξ k |tk < ∞
                                 k!
                                k
                                              57
Brice Huang             12     March 20, 2018: Limits of Characteristic Functions
ϕξ (t) = E exp(itξ).
Characteristic functions have the nice property that ϕξ+η (t) = ϕξ (t)ϕη (t) for
independent variables ξ and η.
    We also derived the Levy Inversion Formula, which lets us recover a random
variable’s distribution from its characteristic function:
                                  1 ∞
                                                       −itx2
                                                              − e−itx1
                                   Z                                   
                                                  2 2
                                               − δ 2t  e
       Fξ (x2 ) − Fξ (x1 ) = lim       ϕξ (t)e                           dt.
                             δ→0 2π −∞                       −it
Recall that the point of the δ is to add a small Gaussian noise, in order to force
the distribution to be continuous.
    For certain well-behaved ξ, it’s possible to pass to the limit explicitly - that
is, we can just take δ to 0.
  Corollary 12.1
    R∞
  If −∞ |ϕξ (t)| dt < ∞, then ξ has a well-defined density, which is
                                           Z   ∞
                                       1
                             p(x) =                e−itx ϕξ (t) dt.
                                      2π   −∞
so by dominated convergence we can move the limδ→0 under the integral sign.
Thus,
                              1 ∞
                                                        −itx2
                                                                − e−itx1
                                Z                                        
                                                   2 2
                                                − δ 2t   e
       Fξ (x2 ) − Fξ (x1 ) =        lim ϕξ (t)e                            dt
                             2π −∞ δ→0                         −it
                              1 ∞
                                           −itx2
                                                     − e−itx1
                                Z                             
                                            e
                           =       ϕξ (t)                       dt
                             2π −∞                 −it
                              1 ∞
                                Z         Z x2
                           =       ϕξ (t)      e−itx dx dt
                             2π −∞         x1
                                  1 ∞ −itx
                             Z x2  Z
                           =             e     ϕξ (t) dt dx (by Fubini).
                              x1 2π −∞
Therefore                                           Z   ∞
                               d           1
                    p(x) =       Fξ (x) =                   e−itx ϕξ (t) dt.
                              dx          2π         −∞
                                               58
Brice Huang             12   March 20, 2018: Limits of Characteristic Functions
Remark 12.2. The characteristic function really just takes a Fourier transform,
which Levy inversion inverts. This is especially apparent when the above
corollary holds.
  Example 12.3
  Let’s see a non-example of this corollary: what happens when ξ doesn’t
  have a density?
       Consider the Bernoulli variable ξ ∼ Ber(p). Then,
  so certainly                 Z   ∞
                                       |ϕξ (t)| dt = ∞.
                                −∞
   Here’s how we should interpret m  ~ and C. When the above equation holds,
m
~ = (m1 , . . . , mk ) must be the vector of means given by mj = Eξj , and C
must be the nonnegative definite matrix, called the covariance matrix, whose
entries are given by
                                           59
Brice Huang               12     March 20, 2018: Limits of Characteristic Functions
                                        ξ~ = Aξ~0 + m,
                                                    ~
where ξ~0 is the Gaussian vector whose coordinates are i.i.d. unit Gaussians
N (0, 1).
Definition 12.7 (Gaussian Vector, Definition 4, kind of). Suppose the covari-
ance matrix C of [ξ1 , . . . , ξk ] is nonsingular. Then [ξ1 , . . . , ξk ] is a Gaussian
vector if
                                                                                
                                     1             1
       Pξ~(x1 , . . . , xk ) = p                         ~ T C −1 (~x − m)
                                             exp − (~x − m)                   ~    .
                                 (2π)k det C       2
  Theorem 12.8
  These four definitions are equivalent.
where                                    
                                k
                                  X           k
                                              X
                   m ~c, ξ~ = E    cj ξj  =   cj mj = (m,
                                                         ~ ~c)
                                        j=1            j=1
and
                             
    h  i2           k
                      X           X
     σ ~c, ξ~ = Var    cj ξj  =   ca cb E(ξa − ma )(ξb − mb ) = (C~c, ~c) .
                           j=1                a,b
                                                60
Brice Huang             12   March 20, 2018: Limits of Characteristic Functions
C = B T diag(λ1 , . . . , λk )B = AT A,
  Corollary 12.10
  A Gaussian vector is uniquely specified by its mean and covariance matrix.
  Corollary 12.11
  For a Gaussian vector ξ,    ~ two coordinates ξa , ξb are uncorrelated (i.e.
  Cov(ξa , ξb ) = 0) iff they are independent.
  Corollary 12.12
                        ~ pairwise independence of coordinates is equivalent
  For a Gaussian vector ξ,
  to joint independence of coordinates.
    It’s worth noting that pairwise independence is generally far weaker than
joint independence.
   We want to use this machinery to prove asymptotic theorems like the central
limit theorem. We introduced characteristic functions as our tool for studying
central tendency, so let’s look next at how they interact with limits.
                                         61
Brice Huang                           12    March 20, 2018: Limits of Characteristic Functions
  Proposition 12.15
  Let {ξn } be a sequence of random variables, such that
Proof. We look at the distribution functions Fξn (x) ∈ [0, 1]. At a fixed point
x, the values of Fξn (x) are just points in [0, 1], which must have a convergent
subsequence.
    We can’t pick a subsequence such that Fξn (x) converges for all x simulta-
neously, but we can pick a subsequence {nk } such that Fξnk → H(x) for all
rational x.7
   H(x) is monotone. Define Fξ (x) as the right-limit of H along rationals. Note
that this doesn’t depend on the approximation scheme, because Fξ (x) is just
the infimum of H(y) over y > x. Also note that Fξ (x) may not equal H(x),
even for rationals x.
   Let’s check that Fξ is a distribution function, by verifying axioms.
Monotonicity. By construction.
Limits at 0,1. By condition (∗),
                                                          62
Brice Huang            12   March 20, 2018: Limits of Characteristic Functions
and
                        Fξnk (r1 ) ≤ Fξnk (x) ≤ Fξnk (r2 ).
As k → ∞, we have Fξnk (r1 ) → H(r1 ) and Fξnk (r1 ) → H(r2 ), so for all
sufficiently large k,
                      H(r1 ) ≤ Fξnk (x) ≤ H(r2 ).
By taking  → 0, we conclude Fξnk (x) → Fξ (x).
   Next time we’ll finish the proof of Theorem 12.14. We’ll show that for
random variables ξn whose characteristic functions converge pointwise to a limit
continuous at t = 0, the condition (∗) holds. Then we’ll show full convergence,
by showing that all subsequential limits are equal.
                                        63
Brice Huang        13     March 22, 2018: Central Limit Theorem and Variations
  Lemma 13.1
  Let X be any random variable. For any u > 0,
                                  1 u
                                  Z
                            2
                  P |X| ≥       ≤       (1 − ϕX (t)) dt.
                           u      u −u
Therefore,                             
                               sin uX               1
                          E 1−              ≥         P (|uX| ≥ 2)
                                 uX                 2
and we are done.
  Lemma 13.2
  The hypothesis of Proposition 12.15 holds. That is,
                                           64
Brice Huang           13    March 22, 2018: Central Limit Theorem and Variations
Since ϕξ (t) → 1 as t → 0, we can pick C large enough that P (|ξn | ≥ 2C) < 
for all sufficiently large n. Increase C as necessary to make this bound hold for
all n.
                                                65
Brice Huang          13   March 22, 2018: Central Limit Theorem and Variations
Indeed, the constant, linear, and quadratic terms are ϕ(0), ϕ0 (0), and 12 ϕ00 (0).
By Proposition 11.11, these are:
                                ϕ(0)     = E[ηi0 ] = 1
                                 0
                               ϕ (0)     = iE[ηi ] = 0
                             1 00          1 2            1
                               ϕ (0)     =   i E[ηi2 ] = − .
                             2             2              2
   Now, set
                                  Pn                         n
                                     i=1 ξi− mn   1 X
                          Xn =            √     =√       ηi .
                                         σ n       n i=1
By properties of characteristic functions,
                     n                 2 n
                                 t2
              
                  t                        t
  ϕXn (t) = ϕ √        = 1−         +o          → exp(−t2 /2) = ϕN (0,1) (t).
                   n             2n        n
where the pointwise convergence is by the definition of the exp function. By
Theorem 12.14, we are done.
   This leaves open the question: in practice, we can’t get as many data points
ξn as we want. How many data points do we need before CLT is useful? The
Berry-Essen Inequality answers this question.
  Theorem 13.4
  Assume the hypotheses of CLT, and that E|ξi |3 = M . Then,
                                              66
Brice Huang        13   March 22, 2018: Central Limit Theorem and Variations
  then
                             n
                          1 X
                                (ξi − E[ξi ]) →d N (0, 1).
                         Sn i=1
  Example 13.8
  Suppose ξi has third moments. Suppose E[ξi ] = 0, E[ξi2 ] = σ 2 , and
  E|ξi3 | < C. Then Sn2 = nσ 2 , so
                              n
                          1 X               Cn
                                 E|ξi |3 ≤ 3/2 3/2 → 0.
                         Sn3 i=1          n σ
                                  n
                              1 X
                          lim  2
                                     E[ξi2 ]I|ξi |≥Sn = 0,
                         n→∞ Sn
                                 i=1
  then
                             n
                          1 X
                                (ξi − E[ξi ]) →d N (0, 1).
                         Sn i=1
Remark 13.10. The Lindeberg CLT condition says the ξi don’t have fluctua-
tions on the order where we expect CLT to hold. As an exercise, show that the
Lyapanov CLT condition implies the Lindeberg CLT condition.
Remark 13.11. In some sense, the Lindeberg CLT is the strongest form of
CLT possible, because a result by Feller says that the Lindeberg CLT condition
is necessary for CLT.
    To show how to check the Lindeberg condition, let’s check it for i.i.d. random
variables.
                                             67
Brice Huang       13   March 22, 2018: Central Limit Theorem and Variations
  Example 13.12
  Let ξi be i.i.d. random variables with variance σ 2 , so Sn = nσ 2 . We need
  that
                        n
                     1 X  2                 1 
                           E ξi I|ξi |≥§n = 2 E ξi2 I|ξi |≥σ√n
                                                                
                      2
                    Sn i=1                  σ
converges to 0. Since E[ξi2 ] < ∞, this follows from continuity of the integral.
   The proof of Lindeberg CLT is similar to the proof of CLT – we just have to
be more careful with the o(t2 ) error term. The proof is in Durrett.
68