MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao

Lecture notes (MIT 18.226)

Probabilistic Methods in Combinatorics

Yufei Zhao
Massachusetts Institute of Technology

Last updated: June 18, 2024


These are lecture notes for a graduate class that I taught at MIT. The main textbook
reference is
Alon and Spencer, The Probabilistic Method, Wiley, 4ed.
Please report errors via the Google Form https://bit.ly/pmnoteserror.

Asymptotic notation convention


We adopt the following asymptotic notation. Each line below has the same meaning
for positive functions 𝑓 and 𝑔 as some parameter, usually 𝑛, tends to infinity:
• 𝑓 ≲ 𝑔, 𝑓 = 𝑂 (𝑔), 𝑔 = Ω( 𝑓 ), 𝑓 ≤ 𝐶𝑔 (for some constant 𝐶 > 0)
• 𝑓 /𝑔 → 0, 𝑓 ≪ 𝑔, 𝑓 = 𝑜(𝑔) (and sometimes 𝑔 = 𝜔( 𝑓 ))
• 𝑓 = Θ(𝑔), 𝑓 ≍ 𝑔, 𝑔 ≲ 𝑓 ≲ 𝑔
• 𝑓 ∼ 𝑔, 𝑓 = (1 + 𝑜(1))𝑔
• whp (= with high probability) means with probability 1 − 𝑜(1)
Warning. Analytic number theorists use ≪ to mean 𝑂 (·) (Vinogradov notation),
differently from how the symbol is used in these notes.
Subscripts (e.g., 𝑂 𝑠 ( ), ≲𝑠 ) are used to emphasize that the hidden constants may depend
on the subscripted parameters. For example, 𝑓 (𝑠, 𝑥) ≲𝑠 𝑔(𝑠, 𝑥) means that for every 𝑠
there is some constant 𝐶𝑠 so that 𝑓 (𝑠, 𝑥) ≤ 𝐶𝑠 𝑔(𝑠, 𝑥) for all 𝑥.
We write [𝑁] := {1, . . . , 𝑁 }.

Contents

1 Introduction
1.1 Lower bounds to Ramsey numbers
1.2 Set systems
1.3 2-colorable hypergraphs
1.4 List chromatic number of 𝐾𝑛,𝑛

2 Linearity of Expectations
2.1 Hamiltonian paths in tournaments
2.2 Sum-free subset
2.3 Turán’s theorem and independent sets
2.4 Sampling
2.5 Unbalancing lights
2.6 Crossing number inequality

3 Alterations
3.1 Dominating set in graphs
3.2 Heilbronn triangle problem
3.3 Markov’s inequality
3.4 High girth and high chromatic number
3.5 Random greedy coloring

4 Second Moment
4.1 Does a typical random graph contain a triangle?
4.2 Thresholds for fixed subgraphs
4.3 Thresholds
4.4 Clique number of a random graph
4.5 Hardy–Ramanujan theorem on the number of prime divisors
4.6 Distinct sums
4.7 Weierstrass approximation theorem

5 Chernoff Bound
5.1 Discrepancy
5.2 Nearly equiangular vectors
5.3 Hajós conjecture counterexample

6 Lovász Local Lemma
6.1 Statement and proof
6.2 Coloring hypergraphs
6.3 Independent transversal
6.4 Directed cycles of length divisible by 𝑘
6.5 Lopsided local lemma
6.6 Algorithmic local lemma

7 Correlation Inequalities
7.1 Harris–FKG inequality
7.2 Applications to random graphs

8 Janson Inequalities
8.1 Probability of non-existence
8.2 Lower tails
8.3 Chromatic number of a random graph

9 Concentration of Measure
9.1 Bounded differences inequality
9.2 Martingales concentration inequalities
9.3 Chromatic number of random graphs
9.4 Isoperimetric inequalities: a geometric perspective
9.5 Talagrand’s inequality
9.6 Euclidean traveling salesman problem

10 Entropy
10.1 Basic properties
10.2 Permanent, perfect matchings, and Steiner triple systems
10.3 Sidorenko’s inequality
10.4 Shearer’s lemma

11 Containers
11.1 Containers for triangle-free graphs
11.2 Graph containers
11.3 Hypergraph container theorem

1 Introduction

The probabilistic method is an important technique in combinatorics. In a typical
application, we wish to prove the existence of something with certain desirable
properties. To do so, we devise a random construction and show that it works with positive
probability.
Let us begin with a simple example of this method.

Theorem 1.0.1 (Large bipartite subgraph)


Every graph with 𝑚 edges has a bipartite subgraph with at least 𝑚/2 edges.

Proof. Let the graph be 𝐺 = (𝑉, 𝐸). Color every vertex black or white,
independently and uniformly at random.
Let 𝐸′ be the set of edges with one black endpoint and one white endpoint. Then
(𝑉, 𝐸′) is a bipartite subgraph of 𝐺.
Every edge belongs to 𝐸′ with probability 1/2. So by the linearity of expectation, the
expected size of 𝐸′ is
$$\mathbb{E}[|E'|] = \frac{1}{2}|E|.$$
Thus there is some coloring with $|E'| \ge \frac{1}{2}|E|$. Then (𝑉, 𝐸′) is the desired subgraph. □
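
As a sanity check of the averaging argument above, here is a minimal Python sketch that enumerates all black/white colorings of a small graph. The example graph (a 5-cycle with one chord) and the helper name `cut_sizes` are illustrative assumptions, not part of the notes.

```python
from itertools import product

def cut_sizes(num_vertices, edges):
    """Yield |E'| (number of bichromatic edges) for every 2-coloring of the vertices."""
    for coloring in product((0, 1), repeat=num_vertices):
        yield sum(1 for u, v in edges if coloring[u] != coloring[v])

# Hypothetical example graph: a 5-cycle plus one chord, so m = 6 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
sizes = list(cut_sizes(5, edges))

average_cut = sum(sizes) / len(sizes)  # equals m/2 by linearity of expectation
best_cut = max(sizes)                  # the maximum is at least the average
```

The maximum over all colorings can never be smaller than the average, which is exactly the step "thus there is some coloring" in the proof.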

1.1 Lower bounds to Ramsey numbers


Ramsey number 𝑹(𝒌, ℓ) = smallest 𝑛 such that in every red-blue edge coloring of 𝐾𝑛 ,
there exists a red 𝐾 𝑘 or a blue 𝐾ℓ .
For example, 𝑅(3, 3) = 6 (every red/blue edge-coloring of 𝐾6 has a monochromatic
triangle, but one can color 𝐾5 without any monochromatic triangle).
Ramsey (1929) proved that 𝑅(𝑘, ℓ) exists (i.e., is finite). This is known as Ramsey’s
theorem.



Paul Erdős (1913–1996) is considered the father of the probabilistic method.


He published around 1,500 papers during his lifetime, and had more than 500
collaborators. To learn more about Erdős, see his biography The man who loved
only numbers by Hoffman and the documentary N is a number (You may be able
to watch this movie for free on Kanopy using your local public library account).

Frank Ramsey (1903–1930) wrote seminal papers in philosophy, economics, and
mathematical logic, before his untimely death at the age of 26 from liver problems.
See a recent profile of him in the New Yorker.

Finding quantitative estimates of Ramsey numbers (and their generalizations) is
a difficult, and often fundamental, problem in Ramsey theory.

Remark 1.1.1 (Hungarian names). Many Hungarian mathematicians, notably due to
Erdős’ influence, made foundational contributions to this field. So we will encounter
many Hungarian names in this subject.
How to type “Erdős” in LaTeX: Erd\H{o}s (incorrect: Erd\"os, which produces
“Erdös”)
How to pronounce Hungarian names:
Hungarian spelling    Sounds like    Examples
s                     sh             Erdős, Simonovits
sz                    s              Szemerédi, Lovász

Erdős’ original proof


One of the earliest applications of the probabilistic method in combinatorics was given by
Erdős in his seminal paper:
P. Erdős, Some remarks on the theory of graphs, Bull. Amer. Math. Soc, 1947.


Here is the main result of this paper.

Theorem 1.1.2 (Lower bound to Ramsey numbers; Erdős 1947)


If $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$, then 𝑅(𝑘, 𝑘) > 𝑛.
In other words, there exists a red-blue edge-coloring of 𝐾𝑛 with no monochromatic
𝐾𝑘 .

In the proof below, we will apply the union bound: for events 𝐸 1 , . . . , 𝐸 𝑚 ,

P(𝐸 1 ∪ · · · ∪ 𝐸 𝑚 ) ≤ P(𝐸 1 ) + · · · + P(𝐸 𝑚 ).

We usually think of each 𝐸𝑖 as a “bad event” that we are trying to avoid.

Proof. Color edges of 𝐾𝑛 with red or blue independently and uniformly at random.

For every fixed subset 𝑆 of 𝑘 vertices, let 𝐴𝑆 denote the event that 𝑆 induces a
monochromatic 𝐾𝑘, so that $\mathbb{P}(A_S) = 2^{1-\binom{k}{2}}$. Then, by the union bound,
$$\mathbb{P}(\text{there is a monochromatic } K_k) = \mathbb{P}\Bigl(\bigcup_{S \in \binom{[n]}{k}} A_S\Bigr) \le \sum_{S \in \binom{[n]}{k}} \mathbb{P}(A_S) = \binom{n}{k} 2^{1-\binom{k}{2}} < 1.$$
Thus, with positive probability, the random coloring gives no monochromatic 𝐾 𝑘 . So
there exists some coloring with no monochromatic 𝐾 𝑘 . □
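
To get a feel for the condition in Theorem 1.1.2, the following Python sketch finds, for a given 𝑘, the largest 𝑛 for which the union bound goes through. The function name `union_bound_n` is a hypothetical helper; it uses exact integer arithmetic by rewriting the condition as 2·C(n,k) < 2^C(k,2).

```python
from math import comb

def union_bound_n(k):
    """Largest n >= k with C(n,k) * 2^(1 - C(k,2)) < 1, or None if n = k already fails."""
    threshold = 2 ** comb(k, 2)        # the condition is 2 * C(n,k) < 2^C(k,2)
    if 2 * comb(k, k) >= threshold:
        return None
    n = k
    while 2 * comb(n + 1, k) < threshold:
        n += 1
    return n

# Each value n returned here certifies R(k, k) > n.
bounds = {k: union_bound_n(k) for k in range(3, 8)}
```

For small 𝑘 the resulting bounds are weak (for instance, the sketch only certifies 𝑅(4, 4) > 6); the strength of the theorem lies in the exponential growth as 𝑘 increases.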

Remark 1.1.3 (Quantitative bound). By optimizing 𝑛 as a function of 𝑘 in the theorem
above (using Stirling’s formula), we obtain
$$R(k,k) > \left(\frac{1}{e\sqrt{2}} + o(1)\right) k\, 2^{k/2}.$$
Erdős’ 1947 paper was actually phrased in terms of counting: of all $2^{\binom{n}{2}}$ possible
colorings, the total number of bad colorings is strictly less than $2^{\binom{n}{2}}$.

In this course, we mostly consider finite probability spaces. While in principle the
finite probability arguments can be rephrased as counting, some of the later more
involved arguments are impractical without a probabilistic perspective.

Remark 1.1.4 (Constructive lower bounds). The above proof only gives the existence
of a red-blue edge-coloring of 𝐾𝑛 without monochromatic cliques. Is there a way to
find one algorithmically? With an appropriate 𝑛, even though a random coloring
achieves the goal with very high probability, there is no known efficient method (in polynomial
running time) to certify that a specific edge-coloring avoids monochromatic cliques.


So even though there are lots of Ramsey colorings, it is hard to find and certify an
actual one. This difficulty has been described as finding hay in a haystack.
Finding constructive lower bounds is a major open problem. There was major progress
on this problem stemming from connections to randomness extractors in computer
science (e.g., Barak et al. 2012, Chattopadhyay & Zuckerman 2016, Cohen 2017).

Remark 1.1.5 (Ramsey number upper bounds). Although Ramsey proved that Ramsey
numbers are finite, his upper bounds are quite large. Erdős–Szekeres (1935) used
a simple and nice inductive argument to show
$$R(k+1, \ell+1) \le \binom{k+\ell}{k}.$$
For diagonal Ramsey numbers 𝑅(𝑘, 𝑘), this bound has the form $R(k,k) \le (4 - o(1))^k$.
Recently, in a major and surprising breakthrough, Campos, Griffiths, Morris, and
Sahasrabudhe (2023+) showed that there is some constant 𝑐 > 0 so that for all sufficiently
large 𝑘,
$$R(k,k) \le (4-c)^k.$$
This is the first exponential improvement over the Erdős–Szekeres bound.

Alteration method
Let us give another argument that slightly improves the earlier lower bound on Ramsey
numbers.
Instead of just taking a random coloring and analyzing it, we first randomly color, and
then fix some undesirable features. This is called the alteration method (sometimes
also the deletion method).

Theorem 1.1.6 (Ramsey lower bound via alteration)

For any 𝑘, 𝑛, we have $R(k,k) > n - \binom{n}{k} 2^{1-\binom{k}{2}}$.

Proof. We construct an edge-coloring of a clique in two steps:

(1) Randomly color each edge of 𝐾𝑛 with red or blue (independently and uniformly
at random);
(2) Delete a vertex from every monochromatic 𝐾 𝑘 .
The process yields a 2-edge-colored clique with no monochromatic 𝐾 𝑘 (since the
second step destroyed all monochromatic cliques).


Let us now analyze how many vertices we get at the end.


Let 𝑋 be the number of monochromatic 𝐾𝑘’s in the first step. Since each 𝐾𝑘 is
monochromatic with probability $2^{1-\binom{k}{2}}$, by the linearity of expectations,
$$\mathbb{E}X = \binom{n}{k} 2^{1-\binom{k}{2}}.$$
In the second step, we delete at most 𝑋 vertices (since we delete one vertex from
every monochromatic clique). Thus the final graph has at least 𝑛 − 𝑋 vertices, and 𝑛 − 𝑋 has
expectation $n - \binom{n}{k} 2^{1-\binom{k}{2}}$.
Thus with positive probability, the remaining graph has at least $n - \binom{n}{k} 2^{1-\binom{k}{2}}$ vertices (and
no monochromatic 𝐾𝑘 by construction). □
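
The bound in Theorem 1.1.6 depends on a free parameter 𝑛, so one may search for the best choice. Below is a small Python sketch that does this with exact rational arithmetic. The helper names and the search cutoff `n_max` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def alteration_bound(k, n):
    """Exact value of n - C(n,k) * 2^(1 - C(k,2))."""
    return n - Fraction(2 * comb(n, k), 2 ** comb(k, 2))

def best_alteration_bound(k, n_max):
    """Maximize the bound over k <= n <= n_max; returns (value, n).

    n_max is a search cutoff chosen by the caller (an assumption of this sketch).
    """
    return max((alteration_bound(k, n), n) for n in range(k, n_max + 1))

value, n_star = best_alteration_bound(4, 20)
```

For 𝑘 = 4 the maximum is attained at 𝑛 = 7, where more than one monochromatic 𝐾4 is expected, so the union bound of Theorem 1.1.2 no longer applies but the alteration argument still yields a coloring.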

Remark 1.1.7 (Quantitative bound). By optimizing the choice of 𝑛 in the theorem,
we obtain
$$R(k,k) > \left(\frac{1}{e} + o(1)\right) k\, 2^{k/2},$$
which improves the previous bound by a constant factor of √2.

Lovász local lemma


Often we wish to avoid a set of “bad events” 𝐸1, . . . , 𝐸𝑛. Here are two easy extremes:
• (Union bound) If $\sum_i \mathbb{P}(E_i) < 1$, then the union bound tells us that we can avoid all
bad events.
• (Independence) If all bad events are independent, then the probability that none
of the 𝐸𝑖 occurs is $\prod_{i=1}^{n} (1 - \mathbb{P}(E_i)) > 0$ (provided that all $\mathbb{P}(E_i) < 1$).
What if we are in some intermediate situation, where the union bound is not good
enough, and the bad events are not independent, but there are only few dependencies?
The Lovász local lemma provides us a solution when each event is independent
of all but a small number of other events.
Here is a version of the Lovász local lemma, which we will prove later in Chapter 6.


Theorem 1.1.8 (Lovász local lemma — random variable model)


Let 𝑥1 , . . . , 𝑥 𝑁 be independent random variables. Let 𝐵1 , . . . , 𝐵𝑚 ⊆ [𝑁]. For each 𝑖,
let 𝐸𝑖 be an event that depends only on the variables indexed by 𝐵𝑖 (i.e., 𝐸𝑖 is allowed
to depend only on {𝑥 𝑗 : 𝑗 ∈ 𝐵𝑖 }).
Suppose, for every 𝑖 ∈ [𝑚], 𝐵𝑖 has non-empty intersections with at most 𝑑 other 𝐵𝑗’s,
and
$$\mathbb{P}[E_i] \le \frac{1}{(d+1)e}.$$
Then with positive probability, none of the events 𝐸𝑖 occur.

Here 𝑒 = 2.71 · · · is the base of the natural logarithm. This constant turns out to be
optimal in the above theorem.
Using the Lovász local lemma, let us give one more improvement to the Ramsey
number lower bounds.

Theorem 1.1.9 (Ramsey lower bound via local lemma; Spencer 1977)
If $\left(\binom{k}{2}\binom{n}{k-2} + 1\right) 2^{1-\binom{k}{2}} < 1/e$, then 𝑅(𝑘, 𝑘) > 𝑛.

Proof. Color the edges of 𝐾𝑛 with red/blue uniformly and independently at random.

For each 𝑘-vertex subset 𝑆, let 𝐸𝑆 be the event that 𝑆 induces a monochromatic 𝐾𝑘. So
$\mathbb{P}[E_S] = 2^{1-\binom{k}{2}}$.

In the setup of the local lemma, we have one independent random variable corresponding
to each edge. Each event 𝐸𝑆 depends only on the variables corresponding to the
edges in 𝑆.
If 𝑆 and 𝑆′ are both 𝑘-vertex subsets, their cliques share an edge if and only if
|𝑆 ∩ 𝑆′| ≥ 2. So for each 𝑆, there are at most $\binom{k}{2}\binom{n}{k-2}$ choices of 𝑘-vertex sets 𝑆′ with
|𝑆 ∩ 𝑆′| ≥ 2. So the local lemma applies provided that
$$2^{1-\binom{k}{2}} < \frac{1}{e} \cdot \frac{1}{\binom{k}{2}\binom{n}{k-2} + 1}.$$

So with positive probability none of the events 𝐸 𝑆 occur, which means an edge-coloring
with no monochromatic 𝐾 𝑘 ’s. □
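
The condition of Theorem 1.1.9 can be explored numerically in the same spirit. The following Python sketch is only an illustration: the function names are hypothetical, and the check is done in floating point, so it should not be trusted when the left-hand side is extremely close to 1 or when 𝑘 is large enough to overflow a float.

```python
from math import comb, e

def local_lemma_holds(k, n):
    """Floating-point check of e * (C(k,2) * C(n,k-2) + 1) * 2^(1 - C(k,2)) < 1."""
    return e * (comb(k, 2) * comb(n, k - 2) + 1) * 2.0 ** (1 - comb(k, 2)) < 1

def local_lemma_n(k):
    """Largest n >= k for which the condition holds, or None if it fails at n = k."""
    if not local_lemma_holds(k, k):
        return None
    n = k
    # The left-hand side is increasing in n, so a linear scan finds the threshold.
    while local_lemma_holds(k, n + 1):
        n += 1
    return n
```

For very small 𝑘 the condition is not useful (it already fails at 𝑛 = 𝑘 for 𝑘 = 3 and 𝑘 = 4); the improvement over the earlier bounds is asymptotic.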

Remark 1.1.10 (Quantitative lower bounds). By optimizing the choice of 𝑛, we
obtain
$$R(k,k) > \left(\frac{\sqrt{2}}{e} + o(1)\right) k\, 2^{k/2},$$
once again improving the previous bound by a constant factor of √2. This is the best
known lower bound to 𝑅(𝑘, 𝑘) to date.

1.2 Set systems


In extremal set theory, we often wish to understand the maximum size of a set system
with some given property. A set system F is a collection of subsets of some ground
set, usually [𝑛]. That is, each element of F is a subset of [𝑛]. We will see some classic
results from extremal set theory each with a clever probabilistic proof.

Sperner’s theorem
We say that a set family F is an antichain if no element of F is a subset of another
element of F (i.e., the elements of F are pairwise incomparable by containment).

Question 1.2.1
What is the maximum number of sets in an antichain of subsets of [𝑛]?

The set $\mathcal{F} = \binom{[n]}{k}$ (i.e., all 𝑘-element subsets of [𝑛]) has size $\binom{n}{k}$. It is an antichain
(why?). The size $\binom{n}{k}$ is maximized when 𝑘 = ⌊𝑛/2⌋ or ⌈𝑛/2⌉. The next result shows that
this is indeed the best we can do.

Theorem 1.2.2 (Sperner’s theorem, 1928)


 
If F is an antichain of subsets of {1, 2, . . . , 𝑛}, then $|\mathcal{F}| \le \binom{n}{\lfloor n/2 \rfloor}$.

In fact, we will show an even stronger result:

Theorem 1.2.3 (LYM inequality; Bollobás 1965, Lubell 1966, Meshalkin 1963, and
Yamamoto 1954)
If F is an antichain of subsets of [𝑛], then
$$\sum_{A \in \mathcal{F}} \binom{n}{|A|}^{-1} \le 1.$$
Sperner’s theorem follows since $\binom{n}{|A|} \le \binom{n}{\lfloor n/2 \rfloor}$ for all 𝐴.

Proof. Let 𝜎(1), . . . , 𝜎(𝑛) be a permutation of 1, . . . , 𝑛 chosen uniformly at random.


Consider the chain:

∅, {𝜎(1)} , {𝜎(1), 𝜎(2)} , {𝜎(1), 𝜎(2), 𝜎(3)} , . . . , {𝜎(1), . . . , 𝜎(𝑛)} .


For each 𝐴 ⊆ {1, 2, . . . , 𝑛}, let 𝐸𝐴 denote the event that 𝐴 appears in the above chain.
Then 𝐸𝐴 occurs if and only if all the elements of 𝐴 appear first in the permutation
𝜎, followed by all the elements of [𝑛] \ 𝐴. The number of such permutations is
|𝐴|!(𝑛 − |𝐴|)!. Hence
$$\mathbb{P}(E_A) = \frac{|A|!\,(n-|A|)!}{n!} = \binom{n}{|A|}^{-1}.$$

Since F is an antichain, if 𝐴, 𝐵 ∈ F are distinct, then 𝐸𝐴 and 𝐸𝐵 cannot both occur
(as otherwise 𝐴 and 𝐵 both appear in the above chain and so one of them contains the
other, violating the antichain hypothesis). So {𝐸𝐴 : 𝐴 ∈ F} is a set of disjoint events,
and thus their probabilities sum to at most 1. So
$$\sum_{A \in \mathcal{F}} \binom{n}{|A|}^{-1} = \sum_{A \in \mathcal{F}} \mathbb{P}(E_A) \le 1.$$
□
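
The two ingredients of this proof can be checked by brute force for small 𝑛. The Python sketch below computes the chain probability exactly by enumerating all permutations and evaluates the LYM sum for a sample antichain. The function names and the sample family are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def chain_probability(A, n):
    """Exact probability that A equals {sigma(1), ..., sigma(|A|)} for a uniform sigma."""
    A = frozenset(A)
    perms = list(permutations(range(1, n + 1)))
    hits = sum(1 for sigma in perms if frozenset(sigma[:len(A)]) == A)
    return Fraction(hits, len(perms))

def lym_sum(family, n):
    """Left-hand side of the LYM inequality for a family of subsets of [n]."""
    return sum(Fraction(1, comb(n, len(A))) for A in family)

# Hypothetical antichain in [4]: no set contains another.
antichain = [{1, 2}, {1, 3}, {2, 3, 4}]
total = lym_sum(antichain, 4)
```

Taking the family of all 2-element subsets of [4] gives a LYM sum of exactly 1, matching the equality case of Sperner's theorem.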

Bollobás’ two families theorem

Theorem 1.2.4 (Bollobás’ two families theorem 1965)


Let 𝐴1, . . . , 𝐴𝑚 be 𝑟-element sets and 𝐵1, . . . , 𝐵𝑚 be 𝑠-element sets such that 𝐴𝑖 ∩ 𝐵𝑖 = ∅
for all 𝑖 and 𝐴𝑖 ∩ 𝐵𝑗 ≠ ∅ for all 𝑖 ≠ 𝑗. Then $m \le \binom{r+s}{r}$.

This bound is sharp: let 𝐴𝑖 range over all 𝑟-element subsets of [𝑟 + 𝑠] and set 𝐵𝑖 =
[𝑟 + 𝑠] \ 𝐴𝑖 .
Let us give an application/motivation for Bollobás’ two families theorem in terms of
transversals. Given a set family F , say that 𝑇 is a transversal for F if 𝑇 ∩ 𝑆 ≠ ∅ for
all 𝑆 ∈ F (i.e., 𝑇 hits every element of F ). Let 𝝉(F), the transversal number of F ,
be the size of the smallest transversal of F . Say that F is 𝝉-critical if 𝜏(F ′) < 𝜏(F )
whenever F ′ is a proper subset of F .

Question 1.2.5
What is the maximum size of a 𝜏-critical 𝑟-uniform F with 𝜏(F ) = 𝑠 + 1?

We claim that the answer is $\binom{r+s}{r}$. Indeed, let F = {𝐴1, . . . , 𝐴𝑚}, and let 𝐵𝑖 be an 𝑠-element
transversal of F \ {𝐴𝑖} for each 𝑖. Then the hypothesis of Bollobás’ two families
theorem is satisfied. Thus $m \le \binom{r+s}{r}$.

Conversely, $\mathcal{F} = \binom{[r+s]}{r}$ is 𝜏-critical 𝑟-uniform with 𝜏(F) = 𝑠 + 1 (why?).

Here is a more general statement of Bollobás’ two families theorem.


Theorem 1.2.6 (Bollobás’ two families theorem 1965)


Let 𝐴1, . . . , 𝐴𝑚 and 𝐵1, . . . , 𝐵𝑚 be finite sets such that 𝐴𝑖 ∩ 𝐵𝑖 = ∅ for all 𝑖 and
𝐴𝑖 ∩ 𝐵𝑗 ≠ ∅ for all 𝑖 ≠ 𝑗. Then
$$\sum_{i=1}^{m} \binom{|A_i| + |B_i|}{|A_i|}^{-1} \le 1.$$

Note that Sperner’s theorem and the LYM inequality are also special cases, since if
{𝐴1, . . . , 𝐴𝑚} is an antichain, then setting 𝐵𝑖 = [𝑛] \ 𝐴𝑖 for all 𝑖 satisfies the hypothesis.

Proof. The proof is a modification of the proof of the LYM inequality earlier.

Consider a uniform random ordering of 𝐴1 ∪ · · · ∪ 𝐴𝑚 ∪ 𝐵1 ∪ · · · ∪ 𝐵𝑚 .


Let 𝐸𝑖 be the event that all elements of 𝐴𝑖 appear before all elements of 𝐵𝑖. Then
$$\mathbb{P}(E_i) = \binom{|A_i| + |B_i|}{|A_i|}^{-1}.$$
Note that the events 𝐸𝑖 are disjoint, since 𝐸𝑖 and 𝐸𝑗 both occurring would contradict
the hypothesis for 𝐴𝑖, 𝐵𝑖, 𝐴𝑗, 𝐵𝑗 (why?). Thus $\sum_i \mathbb{P}(E_i) \le 1$. This yields the claimed
inequality. □
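
The sharpness example mentioned earlier (𝐴𝑖 ranging over all 𝑟-element subsets of [𝑟 + 𝑠], with 𝐵𝑖 the complement of 𝐴𝑖) can be verified mechanically. This is a minimal Python sketch; the parameter choice 𝑟 = 2, 𝑠 = 3 and the function names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def satisfies_hypothesis(pairs):
    """Check that A_i and B_j are disjoint exactly when i == j."""
    for i, (A_i, _) in enumerate(pairs):
        for j, (_, B_j) in enumerate(pairs):
            disjoint = not (A_i & B_j)
            if disjoint != (i == j):
                return False
    return True

def bollobas_sum(pairs):
    """Left-hand side of the inequality in the general two families theorem."""
    return sum(Fraction(1, comb(len(A) + len(B), len(A))) for A, B in pairs)

r, s = 2, 3
ground = frozenset(range(r + s))
pairs = [(frozenset(A), ground - frozenset(A)) for A in combinations(ground, r)]
```

In this example the sum equals 1, so the inequality cannot be improved in general.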

Bollobás’ two families theorem has many interesting generalizations that we will not
discuss here (e.g., see Gil Kalai’s blog post). There are also beautiful linear algebraic
proofs of this theorem and its extensions.

Erdős–Ko–Rado theorem on intersecting families


We say that a family F is intersecting if 𝐴 ∩ 𝐵 ≠ ∅ for all 𝐴, 𝐵 ∈ F . That is, no two
sets in F are disjoint.
Here is an easy warm up.

Question 1.2.7 (Intersecting family—unrestricted sizes)


What is the largest intersecting family of subsets of [𝑛]?

One way to generate a large intersecting family is to include all sets that contain a fixed
element (say, the element 1). This family has size $2^{n-1}$ and is clearly intersecting.
(This isn’t the only example with size $2^{n-1}$; can you think of other intersecting families
with the same size?)


It turns out that one cannot do better than $2^{n-1}$. Indeed, pair up each subset 𝐴 of [𝑛]
with its complement [𝑛] \ 𝐴. At most one of 𝐴 and [𝑛] \ 𝐴 can be in an intersecting family,
and so at most half of all subsets can be in an intersecting family.
The question becomes much more interesting if we restrict to 𝑘-uniform families.

Question 1.2.8 ( 𝑘 -uniform intersecting family)


What is the largest intersecting family of 𝑘-element subsets of [𝑛]?

Example: F = all 𝑘-element subsets containing the element 1. Then F is intersecting and
$|\mathcal{F}| = \binom{n-1}{k-1}$.

Theorem 1.2.9 (Erdős–Ko–Rado 1961; proved in 1938)


If 𝑛 ≥ 2𝑘, then every intersecting family of 𝑘-element subsets of [𝑛] has size at most
$\binom{n-1}{k-1}$.

Remark 1.2.10. The assumption 𝑛 ≥ 2𝑘 is necessary since if 𝑛 < 2𝑘, then the family
of all 𝑘-element subsets of [𝑛] is automatically intersecting by pigeonhole.

Proof. Consider a uniform random circular permutation of 1, 2, . . . , 𝑛 (arrange them
randomly around a circle).
For each 𝑘-element subset 𝐴 of [𝑛], we say that 𝐴 is contiguous if all the elements of
𝐴 lie in a contiguous block on the circle.

The probability that 𝐴 forms a contiguous set on the circle is exactly $n / \binom{n}{k}$.

So the expected number of contiguous sets in F is exactly $n |\mathcal{F}| / \binom{n}{k}$.
Since F is intersecting, there are at most 𝑘 contiguous sets in F (under every circular
ordering of [𝑛]). Indeed, suppose that 𝐴 ∈ F is contiguous. Then there are 2(𝑘 − 1)
other contiguous sets (not necessarily in F) that intersect 𝐴, but they can be paired
off into disjoint pairs (check! Here we use the hypothesis that 𝑛 ≥ 2𝑘). Since F is
intersecting, it contains at most one set from each pair, and so at most 1 + (𝑘 − 1) = 𝑘
contiguous sets.

Combining with the result from the previous paragraph, we see that $n |\mathcal{F}| / \binom{n}{k} \le k$, and
hence $|\mathcal{F}| \le \frac{k}{n} \binom{n}{k} = \binom{n-1}{k-1}$. □
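
For very small parameters, the Erdős–Ko–Rado bound can be confirmed by exhaustive search. The sketch below is not the averaging argument from the proof; it is a brute-force backtracking search (with a hypothetical function name) that is only feasible for tiny 𝑛 and 𝑘.

```python
from itertools import combinations
from math import comb

def max_intersecting_family(n, k):
    """Size of a largest intersecting family of k-subsets of {0, ..., n-1}."""
    sets = [frozenset(c) for c in combinations(range(n), k)]
    best = 0

    def extend(start, chosen):
        # 'chosen' is always an intersecting family; try to add later sets to it.
        nonlocal best
        best = max(best, len(chosen))
        for idx in range(start, len(sets)):
            if all(sets[idx] & other for other in chosen):
                chosen.append(sets[idx])
                extend(idx + 1, chosen)
                chosen.pop()

    extend(0, [])
    return best
```

The search visits every intersecting family once, so its running time grows very quickly; the cases 𝑛 ≤ 6 used below finish almost instantly.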

1.3 2-colorable hypergraphs


A 𝒌-uniform hypergraph (or 𝒌-graph) is a pair 𝐻 = (𝑉, 𝐸), where 𝑉 (vertices)
is a finite set and 𝐸 (edges) is a set of 𝑘-element subsets of 𝑉, i.e., $E \subseteq \binom{V}{k}$. (So
hypergraphs and set families are the same concept, just different names.)


We say that 𝐻 is 𝒓-colorable if the vertices can be colored using 𝑟 colors so that no
edge is monochromatic.
Let 𝒎(𝒌) denote the minimum number of edges in a 𝑘-uniform hypergraph that is
not 2-colorable (elsewhere in the literature, “2-colorable” = “property B”, named after
Bernstein who introduced the concept in 1908). Some small values:
• 𝑚(2) = 3
• 𝑚(3) = 7. Example: the Fano plane is not 2-colorable (the fact that there are no
6-edge non-2-colorable 3-graphs can be proved by exhaustive search).

• 𝑚(4) = 23, proved via exhaustive computer search (Östergård 2014)
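
The claim about the Fano plane is small enough to check by enumerating all 2⁷ vertex colorings. In the Python sketch below, the labeling of the seven lines is one standard choice and is an assumption of this example, as is the function name.

```python
from itertools import product

# One standard labeling of the Fano plane: every pair of points lies on exactly one line.
FANO_LINES = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6)]

def is_two_colorable(vertices, edges):
    """Exhaustively search for a 2-coloring with no monochromatic edge."""
    vertices = list(vertices)
    for colors in product((0, 1), repeat=len(vertices)):
        coloring = dict(zip(vertices, colors))
        if all(len({coloring[v] for v in edge}) == 2 for edge in edges):
            return True
    return False

fano_colorable = is_two_colorable(range(1, 8), FANO_LINES)
```

Deleting any single line makes the hypergraph 2-colorable, which is consistent with 𝑚(3) = 7.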


The exact value of 𝑚(𝑘) is unknown for all 𝑘 ≥ 5. However, we can get some asymptotic
lower and upper bounds using the probabilistic method.

Theorem 1.3.1 (Erdős 1964)


$m(k) \ge 2^{k-1}$ for every 𝑘 ≥ 2.
(In other words, every 𝑘-uniform hypergraph with fewer than $2^{k-1}$ edges is 2-colorable.)

Proof. Let there be $m < 2^{k-1}$ edges. In a uniform random 2-coloring, each edge is
monochromatic with probability $2^{-k+1}$, so by the union bound the probability that there
is a monochromatic edge is at most $2^{-k+1} m < 1$. □

Remark 1.3.2. Later in Section 3.5 we will prove a better lower bound
$m(k) \gtrsim 2^k \sqrt{k/\log k}$, which is the best known to date.

Perhaps somewhat surprisingly, the state of the art upper bound is also proved using
the probabilistic method (random construction).

Theorem 1.3.3 (Erdős 1964)


$m(k) = O(k^2 2^k)$.
(In other words, there exists a 𝑘-uniform hypergraph with $O(k^2 2^k)$ edges that is not
2-colorable.)

Proof. Let $|V| = n = k^2$ (this choice is motivated by the displayed inequality
below). Let 𝐻 be the 𝑘-uniform hypergraph obtained by choosing 𝑚 edges 𝑆1, . . . , 𝑆𝑚
independently and uniformly at random (i.e., with replacement) among $\binom{V}{k}$.


Given a coloring 𝜒 : 𝑉 → [2], if 𝜒 colors 𝑎 vertices with one color and 𝑏 vertices
with the other color, then the probability that the (random) edge 𝑆1 is monochromatic
under the (non-random) coloring 𝜒 is
$$\begin{aligned}
\frac{\binom{a}{k} + \binom{b}{k}}{\binom{n}{k}}
&\ge \frac{2\binom{n/2}{k}}{\binom{n}{k}}
= \frac{2(n/2)(n/2-1)\cdots(n/2-k+1)}{n(n-1)\cdots(n-k+1)}
\ge 2\left(\frac{n/2-k+1}{n-k+1}\right)^{k} \\
&= 2^{-k+1}\left(1 - \frac{k-1}{n-k+1}\right)^{k}
= 2^{-k+1}\left(1 - \frac{k-1}{k^2-k+1}\right)^{k}
\ge c\,2^{-k}
\end{aligned}$$
for some constant 𝑐 > 0.


Since the edges are chosen independently at random, for any coloring 𝜒,
$$\mathbb{P}(\chi \text{ is a proper coloring}) \le (1 - c2^{-k})^{m} \le e^{-c2^{-k}m}$$
(using $1 + x \le e^{x}$ for all real 𝑥). By the union bound,
$$\mathbb{P}(\text{the random hypergraph has a proper 2-coloring}) \le \sum_{\chi} \mathbb{P}(\chi \text{ is a proper coloring}) \le 2^{n} e^{-c2^{-k}m} < 1$$
for some $m = O(k^2 2^k)$ (recall $n = k^2$). Thus there exists a non-2-colorable 𝑘-uniform
hypergraph with 𝑚 edges. □

1.4 List chromatic number of 𝐾𝑛,𝑛


Given a graph 𝐺, its chromatic number 𝝌(𝑮) is the minimum number of colors
required to properly color its vertices.
In list coloring, each vertex of 𝐺 is assigned a list of allowable colors. We say that 𝐺
is 𝒌-choosable (also called 𝒌-list colorable) if it has a proper coloring no matter how
one assigns a list of 𝑘 colors to each vertex.
We write ch(𝑮), called the list chromatic number (also called: choice number,
choosability, list colorability) of 𝐺, to be the smallest 𝑘 so that 𝐺 is 𝑘-choosable.
We have 𝜒(𝐺) ≤ ch(𝐺) by assigning the same list of colors to each vertex. The
inequality may be strict, as we will see below.
For example, while every bipartite graph is 2-colorable, 𝐾3,3 is not 2-choosable.
Indeed, no list coloring of 𝐾3,3 is possible with color lists (check!):


{1, 2} {1, 2}
{1, 3} {1, 3}
{2, 3} {2, 3}

Exercise: check that ch(𝐾3,3 ) = 3.
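
The "(check!)" above can be carried out by machine. In a complete bipartite graph, a list coloring is proper exactly when no color is used on both sides, which the following Python sketch tests by brute force. The function name is a hypothetical helper introduced for this illustration.

```python
from itertools import product

def list_colorable_complete_bipartite(left_lists, right_lists):
    """Is there a choice of one color per list such that no color appears on both sides?"""
    for left in product(*left_lists):
        for right in product(*right_lists):
            if not set(left) & set(right):
                return True
    return False

# The lists displayed above, used on both sides of K_{3,3}.
lists = [(1, 2), (1, 3), (2, 3)]
k33_colorable = list_colorable_complete_bipartite(lists, lists)
```

The search finds no valid choice: each side must use at least two of the three colors, so the two sides always share a color.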

Question 1.4.1
What is the asymptotic behavior of ch(𝐾𝑛,𝑛 )?

First we prove an upper bound on ch(𝐾𝑛,𝑛 ).

Theorem 1.4.2
If $n < 2^{k-1}$, then 𝐾𝑛,𝑛 is 𝑘-choosable.

In other words, $\mathrm{ch}(K_{n,n}) \le \lfloor \log_2(2n) \rfloor + 1$.

Proof. For each color, mark it either L or R independently and uniformly at random.

For any vertex of 𝐾𝑛,𝑛 on the left part, remove all its colors marked R.
For any vertex of 𝐾𝑛,𝑛 on the right part, remove all its colors marked L.
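The probability that some vertex has no colors remaining is at most $2n \cdot 2^{-k} < 1$ by the
union bound. So with positive probability, every vertex has some color remaining.
Assign to each vertex any one of its remaining colors. This is a proper coloring, since
left vertices only use colors marked L and right vertices only use colors marked R. □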
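
The proof is effectively a randomized algorithm: mark the colors at random, and retry if some vertex loses its whole list. Here is a minimal Python sketch of that procedure. The function name, the `max_tries` cutoff, and the example lists are assumptions of this illustration.

```python
import random

def list_color_knn(left_lists, right_lists, rng, max_tries=10_000):
    """Try random L/R markings of the colors until every vertex keeps some color.

    Returns (left_colors, right_colors), or None if max_tries markings all fail.
    """
    colors = sorted(set().union(*left_lists, *right_lists))
    for _ in range(max_tries):
        side = {c: rng.choice("LR") for c in colors}
        left = [[c for c in lst if side[c] == "L"] for lst in left_lists]
        right = [[c for c in lst if side[c] == "R"] for lst in right_lists]
        if all(left) and all(right):  # no vertex lost its entire list
            return [lst[0] for lst in left], [lst[0] for lst in right]
    return None

# Hypothetical lists of size k = 3 for K_{3,3}; here n = 3 < 2^(k-1) = 4.
left_lists = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
right_lists = [(1, 4, 5), (2, 4, 6), (1, 3, 6)]
result = list_color_knn(left_lists, right_lists, random.Random(0))
```

When 𝑛 < 2^(𝑘−1), each attempt succeeds with probability at least 1 − 2𝑛·2^(−𝑘) > 0, so the expected number of attempts is bounded.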

The lower bound on ch(𝐾𝑛,𝑛 ) turns out to follow from the existence of non-2-colorable
𝑘-uniform hypergraph with many edges.

Theorem 1.4.3
If there exists a non-2-colorable 𝑘-uniform hypergraph with 𝑛 edges, then 𝐾𝑛,𝑛 is not
𝑘-choosable.

Proof. Let 𝐻 = (𝑉, 𝐸) be a non-2-colorable 𝑘-uniform hypergraph with |𝐸| = 𝑛 edges.
Now, view 𝑉 as colors and assign to the 𝑖-th vertex of 𝐾𝑛,𝑛 on both the left and right
parts a list of colors given by the 𝑖-th edge of 𝐻. We leave it as an exercise to
check that this 𝐾𝑛,𝑛 is not list colorable. □

Recall from Theorem 1.3.3 that there exists a non-2-colorable 𝑘-uniform hypergraph
with $O(k^2 2^k)$ edges. Thus ch(𝐾𝑛,𝑛) > (1 − 𝑜(1)) log2 𝑛.
Putting these bounds together:


Corollary 1.4.4 (List chromatic number of a complete bipartite graph)


ch(𝐾𝑛,𝑛 ) = (1 + 𝑜(1)) log2 𝑛

It turns out that, unlike the chromatic number, the list chromatic number always
grows with the average degree. The following result was proved using the method of
hypergraph containers, a very important modern development in combinatorics that
we will see a glimpse of in Chapter 11. It provides the optimal asymptotic dependence
(the example of 𝐾𝑛,𝑛 shows optimality).

Theorem 1.4.5 (Saxton and Thomason 2015)


If a graph 𝐺 has average degree 𝑑, then, as 𝑑 → ∞,

ch(𝐺) > (1 + 𝑜(1)) log2 𝑑.

They also proved similar results for the list chromatic number of hypergraphs. For
graphs, a slightly weaker result, off by a factor of 2, was proved earlier by Alon (2000).

Exercises
1. Verify the following asymptotic calculations used in Ramsey number lower
bounds:
a) For each 𝑘, the largest 𝑛 satisfying $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$ has $n = \left(\frac{1}{e\sqrt{2}} + o(1)\right) k\, 2^{k/2}$.
b) For each 𝑘, the maximum value of $n - \binom{n}{k} 2^{1-\binom{k}{2}}$ as 𝑛 ranges over positive
integers is $\left(\frac{1}{e} + o(1)\right) k\, 2^{k/2}$.
c) For each 𝑘, the largest 𝑛 satisfying $e \left(\binom{k}{2}\binom{n}{k-2} + 1\right) 2^{1-\binom{k}{2}} < 1$ satisfies
$n = \left(\frac{\sqrt{2}}{e} + o(1)\right) k\, 2^{k/2}$.

2. Prove that, if there is a real 𝑝 ∈ [0, 1] such that
$$\binom{n}{k} p^{\binom{k}{2}} + \binom{n}{t} (1-p)^{\binom{t}{2}} < 1,$$
then the Ramsey number 𝑅(𝑘, 𝑡) satisfies 𝑅(𝑘, 𝑡) > 𝑛. Using this show that
$$R(4, t) \ge c \left(\frac{t}{\log t}\right)^{3/2}$$
for some constant 𝑐 > 0.


3. Let 𝐺 be a graph with 𝑛 vertices and 𝑚 edges. Prove that 𝐾𝑛 can be written as a
union of $O(n^2 (\log n)/m)$ isomorphic copies of 𝐺 (not necessarily edge-disjoint).
4. Prove that there is an absolute constant 𝐶 > 0 so that for every 𝑛 × 𝑛 matrix
with distinct real entries, one can permute its rows so that no column in the
permuted matrix contains an increasing subsequence of length at least $C\sqrt{n}$. (A
subsequence does not need to be selected from consecutive terms. For example,
(1, 2, 3) is an increasing subsequence of (1, 5, 2, 4, 3).)
5. Generalization of Sperner’s theorem. Let F be a collection of subsets of [𝑛] that
does not contain 𝑘 + 1 elements forming a chain: 𝐴1 ⊊ · · · ⊊ 𝐴𝑘+1. Prove that
F is no larger than the union of the 𝑘 levels of the Boolean lattice closest
to the middle layer.
6. Let 𝐺 be a graph on 𝑛 ≥ 10 vertices. Suppose that adding any new edge to 𝐺
would create a new clique on 10 vertices. Prove that 𝐺 has at least 8𝑛 − 36 edges.
Hint: Apply Bollobás’ two families theorem

7. Let $k \ge 4$ and let $H$ be a $k$-uniform hypergraph with at most $4^{k-1}/3^k$ edges. Prove
that there is a coloring of the vertices of $H$ by four colors so that in every edge
all four colors are represented.
8. Given a set $\mathcal{F}$ of subsets of $[n]$ and $A \subseteq [n]$, write $\mathcal{F}|_A := \{S \cap A : S \in \mathcal{F}\}$ (its
projection onto $A$). Prove that for every $n$ and $k$, there exists a set $\mathcal{F}$ of subsets
of $[n]$ with $|\mathcal{F}| = O(k 2^k \log n)$ such that for every $k$-element subset $A$ of $[n]$,
$\mathcal{F}|_A$ contains all $2^k$ subsets of $A$.
9. Let $A_1, \ldots, A_m$ be $r$-element sets and $B_1, \ldots, B_m$ be $s$-element sets. Suppose
$A_i \cap B_i = \emptyset$ for each $i$, and for each $i \ne j$, either $A_i \cap B_j \ne \emptyset$ or $A_j \cap B_i \ne \emptyset$.
Prove that $m \le (r + s)^{r+s}/(r^r s^s)$.
10. ★ Show that in every non-2-colorable $n$-uniform hypergraph, one can find at
least $\frac{n}{2}\binom{2n-1}{n}$ unordered pairs of edges with each pair intersecting in exactly one
vertex.


2 Linearity of Expectations

Linearity of expectations refers to the following basic fact about the expectation: given
random variables 𝑋1 , . . . , 𝑋𝑛 and constants 𝑐 1 , . . . , 𝑐 𝑛 ,

E[𝑐 1 𝑋1 + · · · + 𝑐 𝑛 𝑋𝑛 ] = 𝑐 1 E[𝑋1 ] + · · · + 𝑐 𝑛 E[𝑋𝑛 ].

This identity does not require any assumption of independence. On the other hand,
generally E[𝑋𝑌 ] ≠ E[𝑋]E[𝑌 ] unless 𝑋 and 𝑌 are uncorrelated (independent random
variables are always uncorrelated).
Here is a simple application (there are also much more involved solutions via enumer-
ation methods).

Question 2.0.1 (Expected number of fixed points)


What is the average number of fixed points of a uniform random permutation of an
𝑛-element set?

Solution. Let 𝑋𝑖 be the indicator of the event that element 𝑖 ∈ [𝑛] is fixed. Then
E[𝑋𝑖 ] = 1/𝑛. The expected number of fixed points is

E[𝑋1 + · · · + 𝑋𝑛 ] = E[𝑋1 ] + · · · + E[𝑋𝑛 ] = 1. □
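As a sanity check on the computation above, here is a minimal sketch that averages the number of fixed points over all permutations of a small set by exhaustive enumeration. The function name `average_fixed_points` is ours, chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations


def average_fixed_points(n):
    """Exact average number of fixed points over all n! permutations of range(n)."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        # i is a fixed point when the permutation sends i to itself
        total += sum(1 for i, image in enumerate(perm) if i == image)
        count += 1
    return Fraction(total, count)


for n in range(1, 7):
    print(n, average_fixed_points(n))
```

The average does not depend on 𝑛, in agreement with the linearity of expectations argument.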

2.1 Hamiltonian paths in tournaments


We frequently use the following fact:
with positive probability, 𝑋 ≥ E[𝑋] (likewise for 𝑋 ≤ E[𝑋]).
A tournament is a directed complete graph. A Hamilton path in a directed graph is a
directed path that contains every vertex exactly once.

Question 2.1.1 (Number of Hamilton paths in a tournament)


What is the maximum (and minimum) number of Hamilton paths in an 𝑛-vertex
tournament?


The minimization problem is easier. The transitive tournament (i.e., respecting a fixed
linear ordering of vertices) has exactly one Hamilton path. On the other hand, every
tournament has at least one Hamilton path (Exercise: prove this! Hint: consider a
longest directed path).
The maximization problem is more difficult and interesting. Here we have some
asymptotic results.

Theorem 2.1.2 (Tournaments with many Hamilton paths; Szele 1943)


There is a tournament on $n$ vertices with at least $n!\,2^{-(n-1)}$ Hamilton paths.

Proof. Consider a random tournament where every edge is given a random orientation
chosen uniformly and independently. Each of the $n!$ permutations of vertices forms a
directed path with probability $2^{-(n-1)}$. So the expected number of Hamilton paths is
$n!\,2^{-(n-1)}$. Thus, there exists a tournament with at least this many Hamilton paths. □
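The averaging argument can be checked by brute force for very small 𝑛. The sketch below (our own illustration, with hypothetical helper names) enumerates every tournament on 4 vertices, counts Hamilton paths in each, and compares the average with $n!/2^{n-1}$.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import factorial


def hamilton_path_counts(n):
    """Number of Hamilton paths in each of the 2^C(n,2) tournaments on range(n)."""
    pairs = list(combinations(range(n), 2))
    counts = []
    for bits in product((False, True), repeat=len(pairs)):
        # beats[(u, v)] is True when the edge between u and v is directed u -> v
        beats = {}
        for (u, v), forward in zip(pairs, bits):
            beats[(u, v)] = forward
            beats[(v, u)] = not forward
        counts.append(sum(
            all(beats[(order[i], order[i + 1])] for i in range(n - 1))
            for order in permutations(range(n))
        ))
    return counts


n = 4
counts = hamilton_path_counts(n)
average = Fraction(sum(counts), len(counts))
print(average, Fraction(factorial(n), 2 ** (n - 1)), min(counts), max(counts))
```

Since the maximum is at least the average, some tournament has at least $n!/2^{n-1}$ Hamilton paths; the enumeration is only feasible for tiny 𝑛.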

This is considered the first use of the probabilistic method. Szele conjectured that the
maximum number of Hamilton paths in a tournament on $n$ vertices is $n!/(2 - o(1))^n$.
This was proved by Alon (1990) using the Minc–Brégman theorem on permanents (we
will see this later in Chapter 10 on the entropy method).

2.2 Sum-free subset


A subset 𝐴 in an abelian group is sum-free if there do not exist 𝑎, 𝑏, 𝑐 ∈ 𝐴 with
𝑎 + 𝑏 = 𝑐.
Does every 𝑛-element set contain a large sum-free set?

Theorem 2.2.1 (Large sum-free subsets; Erdős 1965)


Every set of 𝑛 nonzero integers contains a sum-free subset of size ≥ 𝑛/3.

Proof. Let 𝐴 ⊆ Z \ {0} with | 𝐴| = 𝑛. For 𝜃 ∈ [0, 1], let

𝐴𝜃 := {𝑎 ∈ 𝐴 : {𝑎𝜃} ∈ (1/3, 2/3)}

where {·} denotes fractional part. Then 𝐴𝜃 is sum-free since (1/3, 2/3) is sum-free in
R/Z.
For 𝜃 uniformly chosen at random, {𝑎𝜃} is also uniformly random in [0, 1], so P(𝑎 ∈
𝐴𝜃 ) = 1/3. By linearity of expectations, E| 𝐴𝜃 | = 𝑛/3, so | 𝐴𝜃 | ≥ 𝑛/3 for some 𝜃. □
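The proof can be turned into a small exact search. As 𝜃 varies, the set 𝐴𝜃 only changes when some {𝑎𝜃} crosses 1/3 or 2/3, so it suffices to test one 𝜃 strictly between consecutive breakpoints. The sketch below does this with exact rational arithmetic; the function names are ours, and the search is meant for small inputs only.

```python
from fractions import Fraction
from math import floor


def fractional_part(x):
    return x - floor(x)


def large_sum_free_subset(A):
    """Best A_theta over one theta in each interval between breakpoints in [0, 1]."""
    third, two_thirds = Fraction(1, 3), Fraction(2, 3)
    breakpoints = {Fraction(0), Fraction(1)}
    for a in A:
        m = abs(a)
        for j in range(m):
            # values of theta where {a * theta} equals 1/3 or 2/3
            breakpoints.add(Fraction(3 * j + 1, 3 * m))
            breakpoints.add(Fraction(3 * j + 2, 3 * m))
    cuts = sorted(breakpoints)
    best = []
    for lo, hi in zip(cuts, cuts[1:]):
        theta = (lo + hi) / 2
        subset = [a for a in A if third < fractional_part(a * theta) < two_thirds]
        if len(subset) > len(best):
            best = subset
    return best


def is_sum_free(B):
    members = set(B)
    return all(a + b not in members for a in B for b in B)


A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = large_sum_free_subset(A)
print(B, is_sum_free(B))
```

Because the average of | 𝐴𝜃 | over 𝜃 is exactly 𝑛/3 and | 𝐴𝜃 | is constant between breakpoints, the best interval gives a sum-free subset of size at least 𝑛/3.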


Remark 2.2.2 (Additional results). Alon and Kleitman (1990) noted that one can
improve the bound to ≥ (𝑛 + 1)/3 by noting that | 𝐴𝜃 | = 0 for 𝜃 close to zero (say,
$|\theta| < (3 \max_{a \in A} |a|)^{-1}$), so that | 𝐴𝜃 | < 𝑛/3 with positive probability, and hence
| 𝐴𝜃 | > 𝑛/3 with positive probability. Note that since | 𝐴𝜃 | is an integer, being > 𝑛/3 is
the same as being ≥ (𝑛 + 1)/3.
Bourgain (1997) improved it to ≥ (𝑛 + 2)/3 via a difficult Fourier analytic argument.
This is currently the best lower bound known.
It remains an open problem to prove ≥ (𝑛 + 𝑓 (𝑛))/3 for some function 𝑓 (𝑛) → ∞.
In the other direction, Eberhard, Green, and Manners (2014) showed that there exist
𝑛-element sets of integers whose largest sum-free subset has size ≤ (1/3 + 𝑜(1))𝑛.

2.3 Turán’s theorem and independent sets

Question 2.3.1 (Turán problem)


What is the maximum number of edges in an 𝑛-vertex 𝐾 𝑘 -free graph?

Taking the complement of a graph changes its independent sets to cliques and vice
versa. So the problem is equivalent to one about graphs without large independent
sets.
The following result, due to Caro (1979) and Wei (1981), shows that a graph with
small degrees must contain large independent sets. The probabilistic method proof
shown here is due to Alon and Spencer.

Theorem 2.3.2 (Caro 1979, Wei 1981)


Every graph $G$ contains an independent set of size at least
\[ \sum_{v \in V(G)} \frac{1}{d_v + 1}, \]
where $d_v$ is the degree of vertex $v$.

Proof. Consider a random ordering (permutation) of the vertices. Let $I$ be the set of
vertices that appear before all of their neighbors. Then $I$ is an independent set.
For each $v \in V$, $\mathbb{P}(v \in I) = \frac{1}{1 + d_v}$ (this is the probability that $v$ appears first among
$\{v\} \cup N(v)$). Thus $\mathbb{E}|I| = \sum_{v \in V(G)} \frac{1}{d_v + 1}$. Thus with positive probability, $|I|$ is at least
this expectation. □
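To see the proof in action, the following sketch computes the set 𝐼 for every ordering of the 5-cycle and compares the exact average of |𝐼| with the Caro–Wei sum. The graph and the helper names are our own illustrative choices.

```python
from fractions import Fraction
from itertools import permutations


def first_among_neighbors(adj, order):
    """Vertices that precede all of their neighbors in the given ordering."""
    position = {v: i for i, v in enumerate(order)}
    return {v for v in adj if all(position[v] < position[u] for u in adj[v])}


def caro_wei_bound(adj):
    return sum(Fraction(1, len(adj[v]) + 1) for v in adj)


# adjacency lists of the 5-cycle
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
sizes = [len(first_among_neighbors(cycle5, order)) for order in permutations(cycle5)]
expected_size = Fraction(sum(sizes), len(sizes))
print(expected_size, caro_wei_bound(cycle5), max(sizes))
```

The average over orderings equals the bound exactly, so some ordering must produce an independent set at least that large.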

Remark 2.3.3. Equality occurs if 𝐺 is a disjoint union of cliques.


Remark 2.3.4 (Derandomization). Here is an alternative “greedy algorithm” proof
of the Caro–Wei inequality. At each step, take a vertex of smallest degree in the
remaining graph, and remove it and all its neighbors. If each vertex $v$ is assigned
weight $1/(d_v + 1)$, then the total weight removed at each step is at most 1. Thus there
must be at least $\sum_v 1/(d_v + 1)$ steps, and the vertices taken form an independent set.
Some probabilistic proofs, especially those involving linearity of expectations, can be
derandomized this way into an efficient deterministic algorithm. However, for many
other proofs (such as Ramsey lower bounds from Section 1.1), it is not known how to
derandomize the proof.
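The greedy procedure from the remark can be sketched as follows. This is a minimal illustration under our own naming (`greedy_independent_set`), with the graph given as a dictionary of neighbor sets; it is not tuned for efficiency.

```python
from fractions import Fraction


def greedy_independent_set(adj):
    """Repeatedly take a minimum-degree vertex and delete it with its neighbors."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    chosen = []
    while remaining:
        # degrees are measured in the graph that is left
        v = min(remaining, key=lambda u: len(remaining[u]))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            del remaining[u]
        for u in remaining:
            remaining[u] -= removed
    return chosen


# a path on 5 vertices
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
chosen = greedy_independent_set(path)
bound = sum(Fraction(1, len(nbrs) + 1) for nbrs in path.values())
print(chosen, bound)
```

Each chosen vertex has all its neighbors deleted before the next choice, so the output is independent, and the weight argument shows that its size is at least the Caro–Wei sum.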

By taking the complement of the graph, independent sets become cliques, and so we
obtain the following corollary.

Corollary 2.3.5
Every $n$-vertex graph $G$ contains a clique of size at least
\[ \sum_{v \in V(G)} \frac{1}{n - d_v}. \]

Note that equality is attained when $G$ is complete multipartite.


Now let us answer the earlier question about maximizing the number of edges in a
𝐾𝑟+1 -free graph.
The Turán graph 𝑻𝒏,𝒓 is the complete multipartite graph formed by partitioning 𝑛
vertices into 𝑟 parts with sizes as equal as possible (differing by at most 1).
Example: $T_{10,3} = K_{3,3,4}$ (figure omitted).

It is easy to see that 𝑇𝑛,𝑟 is 𝐾𝑟+1 -free.


Turán’s theorem (1941) tells us that 𝑇𝑛,𝑟 indeed maximizes the number of edges among
𝑛-vertex 𝐾𝑟+1 -free graphs. We will prove a slightly weaker statement, below, which is
tight when 𝑛 is divisible by 𝑟.


Theorem 2.3.6 (Turán’s theorem 1941)


The number of edges in an $n$-vertex $K_{r+1}$-free graph is at most
\[ \left(1 - \frac{1}{r}\right) \frac{n^2}{2}. \]

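Proof. Let $G$ be an $n$-vertex $K_{r+1}$-free graph with $m$ edges. By Corollary 2.3.5, the
size $\omega(G)$ of the largest clique of $G$ satisfies
\[ r \ge \omega(G) \ge \sum_{v \in V} \frac{1}{n - d_v} \ge \frac{n}{n - \frac{1}{n}\sum_v d_v} = \frac{n}{n - \frac{2m}{n}}, \]
where the third inequality uses the convexity of $x \mapsto 1/(n - x)$.
Rearranging gives $m \le \left(1 - \frac{1}{r}\right)\frac{n^2}{2}$. □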
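For a quick numerical comparison, the sketch below computes the number of edges of the Turán graph from its part sizes and sets it against the bound $(1 - 1/r)n^2/2$. The function name `turan_graph_edges` is ours.

```python
from fractions import Fraction


def turan_graph_edges(n, r):
    """Number of edges of the Turan graph T_{n,r} (r parts, sizes differ by <= 1)."""
    q, s = divmod(n, r)
    sizes = [q + 1] * s + [q] * (r - s)
    # edges join vertices in different parts: (n^2 - sum of squared part sizes) / 2
    return (n * n - sum(a * a for a in sizes)) // 2


for n, r in [(10, 3), (12, 3), (7, 2)]:
    bound = (1 - Fraction(1, r)) * n * n / 2
    print(n, r, turan_graph_edges(n, r), bound)
```

When 𝑟 divides 𝑛 the two quantities coincide, and otherwise the Turán graph falls slightly short of the bound.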

Remark 2.3.7. By a careful refinement of the above argument, we can deduce Turán's
theorem that $T_{n,r}$ maximizes the number of edges in an $n$-vertex $K_{r+1}$-free graph, by
noting that $\sum_{v \in V} \frac{1}{n - d_v}$ is minimized over fixed $\sum_v d_v$ when the degrees are nearly equal.

Also, Theorem 2.3.6 is asymptotically tight in the sense that the Turán graph $T_{n,r}$, for
fixed $r$ and $n \to \infty$, has $(1 - 1/r - o(1))n^2/2$ edges.

For more on this topic, see Chapter 1 of my textbook Graph Theory and Additive
Combinatorics and the class with the same title.

2.4 Sampling
By Turán's theorem (actually Mantel's theorem, in this case for triangles), the maximum
number of edges in an $n$-vertex triangle-free graph is $\lfloor n^2/4 \rfloor$.

How about the problem for hypergraphs? A tetrahedron, denoted $K_4^{(3)}$, is a complete 3-uniform
hypergraph (3-graph) on 4 vertices (think of the faces of a usual 3-dimensional
tetrahedron).

Question 2.4.1 (Hypergraph Turán problem for tetrahedra)


What is the maximum number of edges in an 𝑛-vertex 3-uniform hypergraph not
containing any tetrahedra?

This turns out to be a notorious open problem. Turán conjectured that the answer is
\[ \left(\frac{5}{9} + o(1)\right) \binom{n}{3}, \]


which can be achieved using the 3-graph illustrated below:

[Figure: the vertex set split into three parts 𝑉1 , 𝑉2 , 𝑉3]

Above, the vertices are partitioned into three nearly equal sets 𝑉1 , 𝑉2 , 𝑉3 , and all the
edges come in two types: (i) with one vertex in each of the three parts, and (ii) two
vertices in 𝑉𝑖 and one vertex in 𝑉𝑖+1 , with the indices considered mod 3.
Let us give some easy upper bounds, in order to illustrate a simple yet important
technique of bounding by sampling.

Proposition 2.4.2 (A cheap sampling bound)


Every tetrahedron-free 3-graph on $n \ge 4$ vertices has at most $\frac{3}{4}\binom{n}{3}$ edges.

Proof. Let $S$ be a 4-vertex subset chosen uniformly at random. If the 3-graph has
$p\binom{n}{3}$ edges, then the expected number of edges induced by $S$ is $4p$ by linearity of
expectations (why?).
Since the 3-graph is tetrahedron-free, $S$ induces at most 3 edges. Therefore, $4p \le 3$.
Thus the total number of edges is $p\binom{n}{3} \le \frac{3}{4}\binom{n}{3}$. □

Why stop at sampling four vertices? Can we do better by sampling five vertices? To
run the above argument, we need to know how many edges there can be in a 5-vertex
tetrahedron-free 3-graph.

Lemma 2.4.3
A 5-vertex tetrahedron-free 3-graph has at most 7 edges.

Proof. We can convert a 5-vertex 3-graph 𝐻 to a 5-vertex graph 𝐺, by replacing each


triple by its complement. Then 𝐻 being tetrahedron-free is equivalent to 𝐺 not having a
vertex of degree 4. The maximum number of edges in a 5-vertex graph with maximum
degree at most 3 is ⌊3 · 5/2⌋ = 7 (check this can be achieved). □
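The lemma is small enough to confirm by exhaustive search. The sketch below (an illustration of ours, not part of the original argument) runs over all $2^{10}$ possible 3-graphs on 5 vertices and records the largest tetrahedron-free one.

```python
from itertools import combinations

vertices = range(5)
triples = list(combinations(vertices, 3))     # the 10 possible edges
quadruples = list(combinations(vertices, 4))  # the 5 possible tetrahedra


def is_tetrahedron_free(edge_set):
    # a tetrahedron on q is present when all four triples inside q are edges
    return not any(
        all(t in edge_set for t in combinations(q, 3)) for q in quadruples
    )


best = 0
for mask in range(1 << len(triples)):
    edge_set = {t for i, t in enumerate(triples) if mask >> i & 1}
    if len(edge_set) > best and is_tetrahedron_free(edge_set):
        best = len(edge_set)
print(best)
```

This is the brute-force step that the sampling argument needs for 𝑠 = 5; the same search quickly becomes infeasible as 𝑠 grows.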

We can improve Proposition 2.4.2 by sampling 5 vertices instead of 4 in its proof. This
yields (check):


Proposition 2.4.4
Every tetrahedron-free 3-graph on $n \ge 5$ vertices has at most $\frac{7}{10}\binom{n}{3}$ edges.

By sampling $s$ vertices and using brute-force search to solve the $s$-vertex problem,
we can improve the upper bound by taking larger values of $s$. In fact, in principle, if
we had unlimited computational power, we could get arbitrarily close to the optimum by taking
sufficiently large $s$ (why?). However, this is not a practical method due to the cost of
the brute-force search. There are more clever ways to get better bounds (also with the
help of a computer). The best known upper bound is obtained via a method known as flag
algebras (using sums of squares), due to Razborov, which gives $\le (0.561\cdots)\binom{n}{3}$.
For more on the hypergraph Turán problem, see the survey by Keevash (2011).

2.5 Unbalancing lights


Consider an 𝑛 × 𝑛 array of light bulbs. Initially some arbitrary subset of the light bulbs
are turned on. We are allowed to toggle the lights (on/off) for an entire row or column
at a time. How many lights can we guarantee to turn on?
If we flip each row/column independently with probability 1/2, then in expectation,
we get exactly half of the lights to turn on. Can we do better?
In the probabilistic method, not every step has to be random. A better strategy is to
first flip all the columns randomly, and then decide what to do with each row greedily
based on what has happened so far. This is captured in the following theorem, where
the left-hand side represents

# {bulbs on} − # {bulbs off} .

Theorem 2.5.1
Let $a_{ij} \in \{-1, 1\}$ for all $i, j \in [n]$. There exist $x_i, y_j \in \{-1, 1\}$ for all $i, j \in [n]$ such
that
\[ \sum_{i,j=1}^{n} a_{ij} x_i y_j \ge \left(\sqrt{\frac{2}{\pi}} + o(1)\right) n^{3/2}. \]

Proof. Choose $y_1, \ldots, y_n \in \{-1, 1\}$ independently and uniformly at random. For each
$i$, let
\[ R_i = \sum_{j=1}^{n} a_{ij} y_j \]


and set $x_i \in \{-1, 1\}$ to be the sign of $R_i$ (arbitrarily choose $x_i$ if $R_i = 0$). Then the LHS
sum is
\[ \sum_{i=1}^{n} R_i x_i = \sum_{i=1}^{n} |R_i|. \]

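For each $i$, $R_i$ has the same distribution as a sum of $n$ i.i.d. uniform $\{-1, 1\}$ random variables: $S_n =
\varepsilon_1 + \cdots + \varepsilon_n$ (note that the $R_i$'s are not independent for different $i$'s). Thus, for each $i$,
\[ \mathbb{E}[|R_i|] = \mathbb{E}[|S_n|] = \left(\sqrt{\frac{2}{\pi}} + o(1)\right) \sqrt{n}, \]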

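since by the central limit theorem
\[ \lim_{n \to \infty} \mathbb{E}\left[\frac{|S_n|}{\sqrt{n}}\right] = \mathbb{E}[|X|] \quad \text{where } X \sim \mathrm{Normal}(0, 1) \]
\[ = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} |x| e^{-x^2/2} \, dx = \sqrt{\frac{2}{\pi}} \]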

(one can also use binomial sum identities to compute exactly: $\mathbb{E}[|S_n|] = n 2^{1-n} \binom{n-1}{\lfloor (n-1)/2 \rfloor}$,
though it is rather unnecessary to do so). Thus
\[ \mathbb{E} \sum_{i=1}^{n} |R_i| = \left(\sqrt{\frac{2}{\pi}} + o(1)\right) n^{3/2}. \]
Thus with positive probability, the sum is $\ge \left(\sqrt{2/\pi} + o(1)\right) n^{3/2}$. □
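The two-stage strategy in the proof (random columns, then greedy rows) is easy to simulate. The sketch below, with hypothetical names `unbalance` and `expected_abs_walk`, runs the strategy on a random sign matrix and also computes $\mathbb{E}|S_n|$ exactly by summing over the binomial distribution.

```python
import random
from fractions import Fraction
from math import comb, pi, sqrt


def unbalance(a, rng):
    """Random column signs y, then greedy row signs x; returns (x, y, total)."""
    n = len(a)
    y = [rng.choice((-1, 1)) for _ in range(n)]
    row_sums = [sum(a[i][j] * y[j] for j in range(n)) for i in range(n)]
    x = [1 if r >= 0 else -1 for r in row_sums]
    total = sum(a[i][j] * x[i] * y[j] for i in range(n) for j in range(n))
    return x, y, total


def expected_abs_walk(n):
    """Exact E|S_n| for S_n a sum of n independent uniform +-1 signs."""
    # S_n = n - 2j when exactly j of the signs equal -1
    return sum(Fraction(comb(n, j) * abs(n - 2 * j), 2 ** n) for j in range(n + 1))


rng = random.Random(0)
n = 20
a = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
x, y, total = unbalance(a, rng)
print(total, float(n * expected_abs_walk(n)), sqrt(2 / pi) * n ** 1.5)
```

The first printed number is one random outcome, the second is the exact expectation of the strategy, and the third is the asymptotic main term.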

The next example is tricky. The proof will set up a probabilistic process where the
parameters are not given explicitly. A compactness argument will show that a good
choice of parameters exists.

Theorem 2.5.2
Let 𝑘 ≥ 2. Let 𝑉 = 𝑉1 ∪ · · · ∪ 𝑉𝑘 , where 𝑉1 , . . . , 𝑉𝑘 are disjoint sets of size 𝑛. The
edges of the complete 𝑘-uniform hypergraph on 𝑉 are colored with red/blue. Suppose
that every edge formed by taking one vertex from each 𝑉1 , . . . , 𝑉𝑘 is colored blue.
Then there exists $S \subseteq V$ such that the number of red edges and blue edges in $S$ differ
by more than $c_k n^k$, where $c_k > 0$ is a constant.

Proof. We will write this proof for 𝑘 = 3 for notational simplicity. The same proof
works for any 𝑘.
Let 𝑝 1 , 𝑝 2 , 𝑝 3 be real numbers to be decided. We are going to pick 𝑆 randomly by


including each vertex in 𝑉𝑖 with probability 𝑝𝑖 , independently. Let

𝑎𝑖, 𝑗,𝑘 = #{blue edges in 𝑉𝑖 × 𝑉 𝑗 × 𝑉𝑘 } − #{red edges in 𝑉𝑖 × 𝑉 𝑗 × 𝑉𝑘 }.

Then
E[#{blue edges in 𝑆} − #{red edges in 𝑆}]
equals the polynomial
\[ f(p_1, p_2, p_3) = \sum_{i \le j \le k} a_{i,j,k} \, p_i p_j p_k = n^3 p_1 p_2 p_3 + a_{1,1,1} p_1^3 + a_{1,1,2} p_1^2 p_2 + \cdots \]
(note that $a_{1,2,3} = n^3$ by hypothesis). We would be done if we can find $p_1, p_2, p_3 \in
[0, 1]$ such that $|f(p_1, p_2, p_3)| > c n^3$ for some constant $c > 0$ (not depending on the
$a_{i,j,k}$'s). Note that $|a_{i,j,k}| \le n^3$, so $f/n^3$ has coefficients of absolute value at most 1
and its coefficient of $p_1 p_2 p_3$ is 1. We are done after applying the following lemma to $f/n^3$.

Lemma 2.5.3
Let $P_k$ denote the set of polynomials $g(p_1, \ldots, p_k)$ of degree $k$, whose coefficients
have absolute value $\le 1$, and the coefficient of $p_1 p_2 \cdots p_k$ is 1. Then there is a
constant $c_k > 0$ such that for all $g \in P_k$, there is some $p_1, \ldots, p_k \in [0, 1]$ with
$|g(p_1, \ldots, p_k)| \ge c_k$.

Proof of Lemma. Set $M(g) = \sup_{p_1, \ldots, p_k \in [0,1]} |g(p_1, \ldots, p_k)|$ (note that sup is achieved
as max due to compactness). For $g \in P_k$, since $g$ is nonzero (its coefficient of
$p_1 p_2 \cdots p_k$ is 1), we have $M(g) > 0$. As $P_k$ is compact and $M : P_k \to \mathbb{R}$ is continuous,
$M$ attains a minimum value $c_k = M(g) > 0$ for some $g \in P_k$. ■ □

2.6 Crossing number inequality


Consider drawings of graphs on a plane using continuous curves as edges.
The crossing number cr(𝑮) is the minimum number of crossings in a drawing of 𝐺.
A graph is planar if cr(𝐺) = 0.
The graphs 𝐾3,3 and 𝐾5 are non-planar. Furthermore, the following theorem charac-
terizes these two graphs as the only obstructions to planarity:
Kuratowski’s theorem (1930). Every non-planar graph contains a subgraph that is
topologically homeomorphic to 𝐾3,3 or 𝐾5 .
Wagner’s theorem (1937). A graph is planar if and only if it does not have 𝐾3,3 or 𝐾5
as a minor.


(It is not too hard to show that Wagner’s theorem and Kuratowski’s theorem are
equivalent.)
If a graph has a lot of edges, is it guaranteed to have a lot of crossings no matter how
it is drawn in the plane?

Question 2.6.1
What is the minimum possible number of crossings in a drawing of:
• 𝐾𝑛 ? (Hill’s conjecture)
• 𝐾𝑛,𝑛 ? (Zarankiewicz conjecture; Turán’s brick factory problem)
• a graph on 𝑛 vertices and 𝑛2 /100 edges?

The following result, due to Ajtai–Chvátal–Newborn–Szemerédi (1982) and Leighton


(1984), lower bounds the number of crossings for graphs with many edges.

Theorem 2.6.2 (Crossing number inequality)


In a graph $G = (V, E)$, if $|E| \ge 4|V|$, then
\[ \operatorname{cr}(G) \gtrsim \frac{|E|^3}{|V|^2}. \]

Remark 2.6.3. The constant 4 in |𝐸 | ≥ 4 |𝑉 | can be replaced by any constant greater


than 3 (at the cost of changing the constant in the conclusion). On the other hand,
by considering a large triangular grid, we get a planar graph with average degree
arbitrarily close to 6.

Corollary 2.6.4
In a graph $G = (V, E)$, if $|E| \gtrsim |V|^2$, then $\operatorname{cr}(G) \gtrsim |V|^4$.

Proof of Theorem 2.6.2. The proof has three steps, starting with some basic facts on planar graphs.

Step 1: From zero to one.


Recall Euler’s formula: 𝑣 − 𝑒 + 𝑓 = 2 for every connected planar drawing of a graph.
Here 𝑣 is the number of vertices, 𝑒 the number of edges, and 𝑓 the number of faces
(connected components of the complement of the drawing, including the outer infinite
region).
For every connected planar graph with at least one cycle, 3|𝐹 | ≤ 2|𝐸 | since every face
is adjacent to ≥ 3 edges, whereas every edge is adjacent to exactly 2 faces. Plugging
into Euler’s formula, |𝐸 | ≤ 3|𝑉 | − 6.


Thus |𝐸 | ≤ 3|𝑉 | for all planar graphs. Hence cr(𝐺) > 0 whenever |𝐸 | > 3|𝑉 |.
Step 2: From one to many.
The above argument gives us one crossing. Next, we will use it to obtain many
crossings.
By deleting one edge for each crossing, we get a planar graph, so |𝐸 | − cr(𝐺) ≤ 3|𝑉 |,
that is
cr(𝐺) ≥ |𝐸 | − 3|𝑉 |.
This is a “cheap bound.” For graphs with $|E| = \Theta(|V|^2)$, this gives $\operatorname{cr}(G) \gtrsim |V|^2$. This is
not a great bound. We next will use the probabilistic method to boost this bound.
Step 3: Bootstrapping.
Let 𝑝 ∈ [0, 1] to be decided. Let 𝐺 ′ = (𝑉 ′, 𝐸 ′) be obtained from 𝐺 by randomly
keeping each vertex with probability 𝑝. Then

cr(𝐺 ′) ≥ |𝐸 ′ | − 3|𝑉 ′ |.

So
E cr(𝐺 ′) ≥ E|𝐸 ′ | − 3E|𝑉 ′ |
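We have $\mathbb{E}|E'| = p^2 |E|$ and $\mathbb{E}|V'| = p|V|$. Also $\mathbb{E}\operatorname{cr}(G') \le p^4 \operatorname{cr}(G)$: fix a drawing of $G$
with $\operatorname{cr}(G)$ crossings. In such a drawing two crossing edges share no endpoint, so each
crossing involves four distinct vertices and survives in the induced drawing of $G'$ with
probability $p^4$. So

\[ p^4 \operatorname{cr}(G) \ge p^2 |E| - 3p|V|. \]

Thus
\[ \operatorname{cr}(G) \ge p^{-2} |E| - 3p^{-3} |V|. \]
Setting $p = 4|V|/|E| \in [0, 1]$ (here is where we use the hypothesis that $|E| \ge 4|V|$)
so that $4p^{-3}|V| = p^{-2}|E|$, we obtain $\operatorname{cr}(G) \gtrsim |E|^3/|V|^2$. □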
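For readers who want the constant, here is the arithmetic behind the last step written out. It uses only the inequality $\operatorname{cr}(G) \ge p^{-2}|E| - 3p^{-3}|V|$ and the choice $p = 4|V|/|E|$ from the proof; the constant $1/64$ is simply what this choice of $p$ gives and is not claimed to be optimal.

```latex
\begin{align*}
p^{-2}|E| &= \frac{|E|^2}{16|V|^2}\,|E| = \frac{|E|^3}{16|V|^2}, \\
3p^{-3}|V| &= \frac{3|E|^3}{64|V|^3}\,|V| = \frac{3|E|^3}{64|V|^2}, \\
\operatorname{cr}(G) &\ge \frac{4|E|^3}{64|V|^2} - \frac{3|E|^3}{64|V|^2} = \frac{|E|^3}{64|V|^2}.
\end{align*}
```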

Remark 2.6.5. The above idea of boosting a cheap bound to a better bound is an
important one. We saw a version of this idea in Section 2.4 where we sampled a
constant number of vertices to deduce upper bounds on the hypergraph Turán num-
ber. In the above crossing number inequality application, we are also applying some
preliminary cheap bound to some sampled induced subgraph, though this time the
sampled subgraph has super-constant size.
It is tempting to modify the proof by sampling edges instead of vertices, but this does
not work.


Exercises
1. Let 𝐴 be a measurable subset of the unit sphere in R3 (centered at the origin)
containing no pair of orthogonal points.
a) Prove that 𝐴 occupies at most 1/3 of the sphere in terms of surface area.
b) ★ Prove an upper bound smaller than 1/3 (give your best bound).
2. ★ Prove that every set of 10 points in the plane can be covered by a union of
disjoint unit disks.
3. Let r = (𝑟 1 , . . . , 𝑟 𝑘 ) be a vector of nonzero integers whose sum is nonzero.
Prove that there exists a real 𝑐 > 0 (depending on r only) such that the following
holds: for every finite set 𝐴 of nonzero reals, there exists a subset 𝐵 ⊆ 𝐴 with
|𝐵| ≥ 𝑐| 𝐴| such that there do not exist 𝑏 1 , . . . , 𝑏 𝑘 ∈ 𝐵 with 𝑟 1 𝑏 1 + · · · + 𝑟 𝑘 𝑏 𝑘 = 0.
4. Prove that every set 𝐴 of 𝑛 nonzero integers contains two disjoint subsets 𝐵1 and
𝐵2 , such that both 𝐵1 and 𝐵2 are sum-free, and |𝐵1 | + |𝐵2 | > 2𝑛/3.
5. Let 𝐺 be an 𝑛-vertex graph with 𝑝𝑛2 edges, with 𝑛 ≥ 10 and 𝑝 ≥ 10/𝑛.
Prove that 𝐺 contains a pair of vertex-disjoint and isomorphic subgraphs (not
necessarily induced) each with at least 𝑐 𝑝 2 𝑛2 edges, where 𝑐 > 0 is a constant.
6. ★ Prove that for every positive integer 𝑟, there exists an integer 𝐾 such that the
following holds. Let 𝑆 be a set of 𝑟 𝑘 points evenly spaced on a circle. If we
partition 𝑆 = 𝑆1 ∪ · · · ∪ 𝑆𝑟 so that |𝑆𝑖 | = 𝑘 for each 𝑖, then, provided 𝑘 ≥ 𝐾,
there exist 𝑟 congruent triangles where the vertices of the 𝑖-th triangle lie in 𝑆𝑖 ,
for each 1 ≤ 𝑖 ≤ 𝑟.
7. ★ Prove that $[n]^d$ cannot be partitioned into fewer than $2^d$ sets each of the form
$A_1 \times \cdots \times A_d$ where $A_i \subsetneq [n]$.


3 Alterations

We saw the alterations method in Section 1.1 to give lower bounds to Ramsey numbers.
The basic idea is to first make a random construction, and then fix the blemishes.

3.1 Dominating set in graphs


In a graph 𝐺 = (𝑉, 𝐸), we say that 𝑈 ⊆ 𝑉 is dominating if every vertex in 𝑉 \ 𝑈 has a
neighbor in 𝑈.

Theorem 3.1.1
Every graph on $n$ vertices with minimum degree $\delta > 1$ has a dominating set of size
\[ \le \left(\frac{\log(\delta + 1) + 1}{\delta + 1}\right) n. \]

Naive attempt: take out vertices greedily. The first vertex eliminates 1 + 𝛿 vertices, but
subsequent vertices eliminate possibly fewer vertices.

Proof. Two-step process (alteration method):


1. Choose a random subset
2. Add enough vertices to make it dominating
Let 𝑝 ∈ [0, 1] to be decided later. Let 𝑋 be a random subset of 𝑉 where every vertex
is included with probability 𝑝 independently.
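Let $Y = V \setminus (X \cup N(X))$. Each $v \in V$ lies in $Y$ with probability $\le (1 - p)^{1+\delta}$.
Then $X \cup Y$ is dominating, and
\[ \mathbb{E}[|X \cup Y|] = \mathbb{E}[|X|] + \mathbb{E}[|Y|] \le pn + (1 - p)^{1+\delta} n \le \left(p + e^{-p(1+\delta)}\right) n \]
using $1 + x \le e^x$ for all $x \in \mathbb{R}$. Finally, setting $p = \frac{\log(\delta+1)}{\delta+1}$ to minimize $p + e^{-p(1+\delta)}$,
we bound the above expression by
\[ \le \left(\frac{1 + \log(\delta + 1)}{\delta + 1}\right) n. \]
□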
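The two-step alteration in the proof can be sketched directly. The code below is an illustration under our own assumptions: the graph is a dictionary of neighbor sets, the function names are hypothetical, and the example graph (a circulant graph with minimum degree 4) is chosen only for convenience.

```python
import math
import random


def dominating_set_by_alteration(adj, rng):
    """Step 1: random subset X. Step 2: add every vertex not dominated by X."""
    delta = min(len(nbrs) for nbrs in adj.values())
    p = math.log(delta + 1) / (delta + 1)
    X = {v for v in adj if rng.random() < p}
    Y = {v for v in adj if v not in X and not (adj[v] & X)}
    return X | Y


def is_dominating(adj, U):
    return all(v in U or adj[v] & U for v in adj)


# circulant graph on 12 vertices: i is adjacent to i +- 1 and i +- 2 (mod 12)
circulant = {i: {(i + d) % 12 for d in (-2, -1, 1, 2)} for i in range(12)}
U = dominating_set_by_alteration(circulant, random.Random(0))
print(sorted(U), is_dominating(circulant, U))
```

Every run returns a dominating set; the size bound of the theorem holds in expectation, so a single run may be larger or smaller than the bound.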


3.2 Heilbronn triangle problem

Question 3.2.1
How can one place 𝑛 points in the unit square so that no three points form a triangle
with small area?

Let
\[ \Delta(n) = \sup_{\substack{S \subseteq [0,1]^2 \\ |S| = n}} \; \min_{\substack{p, q, r \in S \\ \text{distinct}}} \operatorname{area}(pqr). \]

Naive constructions fare poorly. E.g., $n$ equally spaced points on a circle have a triangle of area
$\Theta(1/n^3)$ (the triangle formed by three consecutive points has side lengths $\asymp 1/n$ and
angle $\theta = (1 - 2/n)\pi$). Even worse is arranging points on a grid, as you would get
triangles of zero area.
Heilbronn conjectured that $\Delta(n) = O(n^{-2})$.
Komlós, Pintz, and Szemerédi (1982) disproved the conjecture, showing $\Delta(n) \gtrsim
n^{-2} \log n$. They used an elaborate probabilistic construction. Here we show a much
simpler probabilistic construction that gives a weaker bound $\Delta(n) \gtrsim n^{-2}$.

Remark 3.2.2 (Upper bounds). For a long time, the best upper bound known was
$\Delta(n) \le n^{-8/7+o(1)}$ due to Komlós, Pintz, and Szemerédi (1981). This was recently
improved to $\Delta(n) \le n^{-8/7-c}$ by Cohen, Pohoata, and Zakharov (2023+).

Theorem 3.2.3 (Many points without small area triangles)


For every positive integer $n$, there exists a set of $n$ points in $[0, 1]^2$ such that every
triple spans a triangle of area $\ge c n^{-2}$, for some absolute constant $c > 0$.

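Proof. Choose $2n$ points in $[0, 1]^2$ independently and uniformly at random. For every three random points $p, q, r$, let us
estimate
\[ \mathbb{P}_{p,q,r}(\operatorname{area}(pqr) \le \varepsilon). \]
By considering the area of a circular annulus around $p$, with inner and outer radii $x$
and $x + \Delta x$, we find

\[ \mathbb{P}_{p,q}(|pq| \in [x, x + \Delta x]) \le \pi\left((x + \Delta x)^2 - x^2\right). \]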


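So the probability density function satisfies
\[ \mathbb{P}_{p,q}(|pq| \in [x, x + dx]) \le 2\pi x \, dx. \]
For fixed $p, q$,
\[ \mathbb{P}_r(\operatorname{area}(pqr) \le \varepsilon) = \mathbb{P}_r\left(\operatorname{dist}(pq, r) \le \frac{2\varepsilon}{|pq|}\right) \lesssim \frac{\varepsilon}{|pq|}. \]
Thus, with $p, q, r$ at random,
\[ \mathbb{P}_{p,q,r}(\operatorname{area}(pqr) \le \varepsilon) \lesssim \int_0^{\sqrt{2}} \frac{\varepsilon}{x} \, 2\pi x \, dx \asymp \varepsilon. \]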

Given these $2n$ random points, let $X$ be the number of triangles with area $\le \varepsilon$. Then
$\mathbb{E}X = O(\varepsilon n^3)$.
Choose $\varepsilon = c/n^2$ with $c > 0$ small enough so that $\mathbb{E}X \le n$.
Delete a point from each triangle with area $\le \varepsilon$.
The expected number of remaining points is at least $\mathbb{E}[2n - X] \ge n$, and the remaining
points span no triangle with area $\le \varepsilon = c/n^2$.
Thus with positive probability, we end up with $\ge n$ points and no triangle with area
$\le c/n^2$. □
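The random-then-delete procedure can be sketched as follows. The constant `c = 0.1` and the function names are our own assumptions for illustration; the proof only guarantees at least 𝑛 surviving points in expectation for a suitably small constant, so a single run may keep fewer.

```python
import random
from itertools import combinations


def triangle_area(p, q, r):
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2


def heilbronn_by_alteration(n, c, rng):
    """Sample 2n uniform points, then delete one point from each small triangle."""
    eps = c / n ** 2
    points = [(rng.random(), rng.random()) for _ in range(2 * n)]
    alive = set(range(2 * n))
    for i, j, k in combinations(range(2 * n), 3):
        # only triangles whose three points are all still present need fixing
        if i in alive and j in alive and k in alive:
            if triangle_area(points[i], points[j], points[k]) <= eps:
                alive.discard(k)
    return [points[i] for i in sorted(alive)], eps


survivors, eps = heilbronn_by_alteration(15, 0.1, random.Random(1))
print(len(survivors), eps)
```

By construction no three surviving points span a triangle of area at most `eps`.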

Algebraic construction. Here is another construction due to Erdős (in the appendix of
Roth (1951)) also giving $\Delta(n) \gtrsim n^{-2}$:
Let $p$ be a prime. The set $\{(x, x^2) \in \mathbb{F}_p^2 : x \in \mathbb{F}_p\}$ has no 3 points collinear (a parabola
meets every line in $\le 2$ points). Take the corresponding set of $p$ points in $[p]^2 \subseteq \mathbb{Z}^2$.
Then every triangle has area $\ge 1/2$ due to Pick's theorem. Scale back down to a unit
square. (If $n$ is not a prime, then use that there is a prime between $n$ and $2n$.)
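The algebraic construction is easy to verify for a specific prime. The sketch below (our own illustration, using residues 0, ..., 𝑝 − 1 as the integer lift) builds the points for 𝑝 = 11 and checks that no lattice triangle is degenerate.

```python
from itertools import combinations


def parabola_points(p):
    """Integer lifts of the points (x, x^2) over the field with p elements."""
    return [(x, x * x % p) for x in range(p)]


def twice_area(a, b, c):
    # absolute value of the determinant; an integer for lattice points
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


p = 11
pts = parabola_points(p)
min_twice_area = min(twice_area(a, b, c) for a, b, c in combinations(pts, 3))
print(min_twice_area)
```

A nonzero integer determinant means area at least 1/2 before scaling, so after scaling the grid into the unit square every triangle has area at least $1/(2p^2)$.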

3.3 Markov’s inequality


We note an important tool that will be used next.

Theorem 3.3.1 (Markov’s inequality)


Let $X \ge 0$ be a random variable. Then for every $a > 0$,
\[ \mathbb{P}(X \ge a) \le \frac{\mathbb{E}[X]}{a}. \]


Proof. $\mathbb{E}[X] \ge \mathbb{E}[X 1_{X \ge a}] \ge \mathbb{E}[a 1_{X \ge a}] = a \mathbb{P}(X \ge a)$. □

Take-home message: for r.v. 𝑋 ≥ 0, if E𝑋 is very small, then typically 𝑋 is small.

3.4 High girth and high chromatic number


If a graph has a 𝑘-clique, then you know that its chromatic number is at least 𝑘.
Conversely, if a graph has high chromatic number, is it always possible to certify this
fact from some “local information”?
Surprisingly, the answer is no. The following ingenious construction shows that a
graph can be “locally tree-like” while still having high chromatic number.
The girth of a graph is the length of its shortest cycle.

Theorem 3.4.1 (Erdős 1959)


For all 𝑘, ℓ, there exists a graph with girth > ℓ and chromatic number > 𝑘.

Proof. Let $G \sim G(n, p)$ with $p = (\log n)^2/n$ (the proof works whenever $\log n/n \ll
p \ll n^{-1+1/\ell}$). Here $G(n, p)$ is the Erdős–Rényi random graph ($n$ vertices, every edge
appearing with probability $p$ independently).
Let $X$ be the number of cycles of length at most $\ell$ in $G$. By linearity of expectations,
as there are exactly $\binom{n}{i}(i - 1)!/2$ cycles of length $i$ in $K_n$ for each $3 \le i \le n$, we have
(recall that $\ell$ is a constant)
\[ \mathbb{E}X = \sum_{i=3}^{\ell} \binom{n}{i} \frac{(i - 1)!}{2} p^i \le \sum_{i=3}^{\ell} n^i p^i = \sum_{i=3}^{\ell} (\log n)^{2i} \le \ell (\log n)^{2\ell} = o(n). \]

By Markov's inequality
\[ \mathbb{P}(X \ge n/2) \le \frac{\mathbb{E}X}{n/2} = o(1). \]
(This allows us to get rid of all short cycles.)
How can we lower bound the chromatic number $\chi(\cdot)$? Note that $\chi(G) \ge |V(G)|/\alpha(G)$,
where $\alpha(G)$ is the independence number (the size of the largest independent set).
With $x = (3/p) \log n = 3n/\log n$,
\[ \mathbb{P}(\alpha(G) \ge x) \le \binom{n}{x} (1 - p)^{\binom{x}{2}} < n^x e^{-px(x-1)/2} = \left(n e^{-p(x-1)/2}\right)^x = n^{-\Theta(x)} = o(1). \]


Let $n$ be large enough so that $\mathbb{P}(X \ge n/2) < 1/2$ and $\mathbb{P}(\alpha(G) \ge x) < 1/2$. Then there
is some $G$ with fewer than $n/2$ cycles of length $\le \ell$ and with $\alpha(G) \le 3n/\log n$.
Remove a vertex from each cycle of length $\le \ell$ to get $G'$. Then $|V(G')| \ge n/2$, $G'$ has girth $> \ell$, and
$\alpha(G') \le \alpha(G) \le 3n/\log n$, so
\[ \chi(G') \ge \frac{|V(G')|}{\alpha(G')} \ge \frac{n/2}{3n/\log n} = \frac{\log n}{6} > k \]
if $n$ is sufficiently large. □

Remark 3.4.2. Erdős (1962) also showed that in fact one needs to see at least a linear
number of vertices to deduce high chromatic number: for all 𝑘, there exists 𝜀 = 𝜀 𝑘
such that for all sufficiently large 𝑛 there exists an 𝑛-vertex graph with chromatic
number > 𝑘 but every subgraph on ⌊𝜀𝑛⌋ vertices is 3-colorable. (In fact, one can take
𝐺 ∼ 𝐺 (𝑛, 𝐶/𝑛); see "Probabilistic Lens: Local coloring" in Alon–Spencer)

3.5 Random greedy coloring


In Section 1.3, we saw a simple argument showing that every $k$-uniform hypergraph
with fewer than $2^{k-1}$ edges is 2-colorable (meaning that we can color the vertices red/blue
with no monochromatic edge). Take a moment to remember the proof.
In this section, we improve this result. The next result gives the current best known
bound.

Theorem 3.5.1 (Radhakrishnan and Srinivasan (2000))


There is some constant $c > 0$ so that every $k$-uniform hypergraph with at most
$c \sqrt{\frac{k}{\log k}} \, 2^k$ edges is 2-colorable.

Recall from Section 1.3 that there exists a non-2-colorable $k$-uniform hypergraph on
$k^2$ vertices and $O(k^2 2^k)$ edges, via a random construction.
Here we present a simpler proof, based on a random greedy coloring, due to Cherkashin
and Kozik (2015), following an approach of Pluhár (2009).

Proof. Consider a 𝑘-graph with 𝑚 edges.

Let us order the vertices using a permutation chosen uniformly at random.

Color vertices greedily from left to right: color a vertex blue unless it would create a
monochromatic edge, in which case color it red (i.e., every red vertex is the final vertex
in an edge whose earlier 𝑘 − 1 vertices have all been colored blue).


The resulting coloring has no blue edges. The greedy coloring succeeds if it does not
create a red edge.
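The greedy rule can be sketched in a few lines. The hypergraph below (three 3-element edges arranged in a triangle) and the function names are our own illustrative choices; the code tries every vertex ordering and counts how often the greedy coloring ends with a red edge.

```python
from itertools import permutations


def greedy_two_coloring(edges, order):
    """Color vertices in order: blue unless that would complete an all-blue edge."""
    color = {}
    for v in order:
        completes_blue_edge = any(
            v in e and all(color.get(u) == "blue" for u in e if u != v)
            for e in edges
        )
        color[v] = "red" if completes_blue_edge else "blue"
    return color


def monochromatic_edges(edges, color):
    return [e for e in edges if len({color[u] for u in e}) == 1]


edges = [(0, 1, 2), (2, 3, 4), (4, 5, 0)]
colorings = [greedy_two_coloring(edges, order) for order in permutations(range(6))]
failures = sum(1 for c in colorings if monochromatic_edges(edges, c))
print(failures, "of", len(colorings), "orderings end with a red edge")
```

Whatever the ordering, an all-blue edge never appears, so a failure can only come from a red edge.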
Analyzing a greedy coloring is tricky, since the color of a single vertex may depend on
the entire history. Instead, we identify a specific feature that necessarily results from
an unsuccessful coloring.
If there is a red edge, then there must be two edges 𝑒, 𝑓 so that the last vertex of 𝑒
is the first vertex of 𝑓 . Call such a pair (𝑒, 𝑓 ) conflicting (note that whether (𝑒, 𝑓 ) is
conflicting depends on the random ordering of the vertices, but not on how we assigned
colors).
What is the probability of seeing a conflicting pair? Here the randomness comes
from the random ordering of vertices.
Each pair of edges with exactly one vertex in common conflicts with probability
$\frac{(k-1)!^2}{(2k-1)!} = \frac{1}{2k-1} \binom{2k-2}{k-1}^{-1} \asymp k^{-1/2} 2^{-2k}$. Summing over all $\le m^2$ pairs of edges that
share a unique vertex, we find that the expected number of conflicting pairs is
$\lesssim m^2 k^{-1/2} 2^{-2k}$, which is $< 1$ for some $m \asymp k^{1/4} 2^k$. In this case, there is some
ordering of vertices creating no conflicting pairs, in which case the greedy coloring
always succeeds.
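The conflict probability can be confirmed by enumeration for small 𝑘. In the sketch below (our own illustration), vertex 0 is the shared vertex of two 𝑘-element edges, and we count the orderings in which it is last in 𝑒 and first in 𝑓.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def conflict_probability(k):
    """Exact probability that (e, f) conflicts, for k-edges e, f sharing one vertex."""
    # vertex 0 is shared; e = {0, 1, ..., k-1} and f = {0, k, ..., 2k-2}
    e_rest = range(1, k)
    f_rest = range(k, 2 * k - 1)
    hits = total = 0
    for order in permutations(range(2 * k - 1)):
        position = {v: i for i, v in enumerate(order)}
        last_of_e = all(position[u] < position[0] for u in e_rest)
        first_of_f = all(position[0] < position[u] for u in f_rest)
        hits += last_of_e and first_of_f
        total += 1
    return Fraction(hits, total)


for k in (2, 3, 4):
    print(k, conflict_probability(k), Fraction(factorial(k - 1) ** 2, factorial(2 * k - 1)))
```

Only the relative order of the $2k - 1$ vertices of $e \cup f$ matters, which is why the enumeration over their orderings gives the exact value.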
The above argument, due to Pluhár (2009), yields $m \lesssim k^{1/4} 2^k$. Next we will refine
the argument to obtain a better bound of $\sqrt{\frac{k}{\log k}} \, 2^k$ as claimed.

Instead of just considering a random permutation, let us map each vertex to [0, 1]
independently and uniformly at random. This map induces an ordering of the vertices,
but it comes with further information that we will use.
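Write $[0, 1] = L \cup M \cup R$ where ($p$ to be decided)
\[ L := \left[0, \frac{1 - p}{2}\right), \quad M := \left[\frac{1 - p}{2}, \frac{1 + p}{2}\right], \quad R := \left(\frac{1 + p}{2}, 1\right]. \]

The probability that a given edge lands entirely in $L$ is $\left(\frac{1-p}{2}\right)^k$, and likewise with $R$.

Taking a union bound over all edges,
\[ \mathbb{P}(\text{some edge lies in } L \text{ or } R) \le 2m \left(\frac{1 - p}{2}\right)^k. \]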

Suppose that no edge of $H$ lies entirely in $L$ or entirely in $R$. If $(e, f)$ conflicts, then
the label $x_v$ of their unique common vertex $v \in e \cap f$ must lie in $M$ (otherwise $e$ would lie
entirely in $L$ or $f$ entirely in $R$). So the probability that $(e, f)$


conflicts is (here we use $x(1 - x) \le 1/4$)
\[ \int_{(1-p)/2}^{(1+p)/2} x^{k-1} (1 - x)^{k-1} \, dx \le p \, 4^{-k+1}. \]

Taking a union bound over all $\le m^2$ pairs of edges, we find that

\[ \mathbb{P}(\text{some conflicting pair has the common vertex in } M) \le m^2 p \, 4^{-k+1}. \]

Thus

P(there is a conflicting pair)


≤ P(some edge lies in 𝐿 or 𝑅) + P(some conflicting pair has the common vertex in 𝑀)
 𝑘
1− 𝑝
≤ 2𝑚 + 𝑚 2 𝑝4−𝑘+1
2
< 2−𝑘+1 𝑚𝑒 −𝑝𝑘 + (2−𝑘+1 𝑚) 2 𝑝

Set $p = \log(2^{k-1} k/m)/k$ to minimize the right-hand side, which then equals
\[ \frac{m^2}{4^{k-1} k} + \frac{m^2}{4^{k-1} k} \log\left(\frac{2^{k-1} k}{m}\right), \]
which is $< 1$ for $m = c\, 2^k \sqrt{k/\log k}$ with $c > 0$ being a sufficiently small constant (we
should assume that $k$ is large enough to ensure $p \in [0, 1]$; smaller values of $k$ can be
handled in the theorem exceptionally by later reducing the constant $c$). □
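The optimization over $p$ can be checked numerically. The sketch below evaluates the right-hand side $2^{-k+1} m e^{-pk} + (2^{-k+1} m)^2 p$ at $p = \log(2^{k-1} k/m)/k$ for $m = c\, 2^k \sqrt{k/\log k}$. The constant $c = 0.25$ and the helper names are assumptions chosen for illustration only.

```python
from math import exp, log, sqrt


def edge_count(k: int, c: float) -> float:
    """m = c * 2^k * sqrt(k / log k), the number of edges considered in the text."""
    return c * 2.0**k * sqrt(k / log(k))


def conflict_bound(k: int, m: float, p: float) -> float:
    """Right-hand side 2^{-k+1} m e^{-pk} + (2^{-k+1} m)^2 p."""
    a = m * 2.0 ** (-k + 1)
    return a * exp(-p * k) + a * a * p


def optimal_p(k: int, m: float) -> float:
    """Minimizer p = log(2^{k-1} k / m) / k of the convex function p -> conflict_bound."""
    return log(2.0 ** (k - 1) * k / m) / k


def bound_at_optimum(k: int, c: float):
    m = edge_count(k, c)
    p = optimal_p(k, m)
    return p, conflict_bound(k, m, p)


for k in (10, 20, 50, 100):
    p, value = bound_at_optimum(k, 0.25)  # c = 0.25 is an illustrative choice
    print(k, round(p, 4), round(value, 4))
```

For these values of $k$ the chosen $p$ lies in $[0, 1]$ and the bound is below 1, as the proof requires.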

Food for thought: what is the source of the gain in the 𝐿 ∪ 𝑀 ∪ 𝑅 argument? The
expected number of conflicting pairs is unchanged. It must be that we are somehow
clustering the bad events by considering the event when some edge lies in 𝐿 or 𝑅.
It remains an intriguing open problem to improve this bound further.

Exercises
1. Using the alteration method, prove the Ramsey number bound

\[ R(4, k) \ge c\, (k/\log k)^2 \]

for some constant 𝑐 > 0.


2. Prove that every 3-uniform hypergraph with 𝑛 vertices and 𝑚 ≥ 𝑛 edges contains
an independent set (i.e., a set of vertices containing no edges) of size at least



$c n^{3/2} / \sqrt{m}$, where $c > 0$ is a constant.
3. Prove that every 𝑘-uniform hypergraph with 𝑛 vertices and 𝑚 edges has a transver-
sal (i.e., a set of vertices intersecting every edge) of size at most 𝑛(log 𝑘)/𝑘 +𝑚/𝑘.
4. Zarankiewicz problem. Prove that for all positive integers $n \ge k \ge 2$, there
exists an $n \times n$ matrix with $\{0, 1\}$ entries, with at least $\frac{1}{2} n^{2 - 2/(k+1)}$ 1’s, such that
there is no $k \times k$ submatrix consisting of all 1’s.
5. Fix $k$. Prove that there exists a constant $c_k > 1$ so that for every sufficiently
large $n > n_0(k)$, there exists a collection $\mathcal{F}$ of at least $c_k^n$ subsets of $[n]$ such that
for every $k$ distinct $F_1, \dots, F_k \in \mathcal{F}$, all $2^k$ intersections $\bigcap_{i=1}^k G_i$ are nonempty,
where each $G_i$ is either $F_i$ or $[n] \setminus F_i$.
6. Acute sets in $\mathbb{R}^n$. Prove that, for some constant $c > 0$, for every $n$, there exists a
family of at least $c (2/\sqrt{3})^n$ subsets of $[n]$ containing no three distinct members
$A, B, C$ satisfying $A \cap B \subseteq C \subseteq A \cup B$.
Deduce that there exists a set of at least $c (2/\sqrt{3})^n$ points in $\mathbb{R}^n$ so that all angles
determined by three points from the set are acute.
Remark. The current best lower and upper bounds for the maximum size of
an “acute set” in $\mathbb{R}^n$ (i.e., spanning only acute angles) are $2^{n-1} + 1$ and $2^n - 1$
respectively.
7. ★ Covering complements of sparse graphs by cliques
a) Prove that every graph with 𝑛 vertices and minimum degree 𝑛 − 𝑑 can be
written as a union of $O(d^2 \log n)$ cliques.
b) Prove that every bipartite graph with 𝑛 vertices on each side of the ver-
tex bipartition and minimum degree 𝑛 − 𝑑 can be written as a union of
𝑂 (𝑑 log 𝑛) complete bipartite graphs (assume 𝑑 ≥ 1).
8. ★ Let 𝐺 = (𝑉, 𝐸) be a graph with 𝑛 vertices and minimum degree 𝛿 ≥ 2. Prove
that there exists 𝐴 ⊆ 𝑉 with | 𝐴| = 𝑂 (𝑛(log 𝛿)/𝛿) so that every vertex in 𝑉 \ 𝐴
has at least one neighbor in 𝐴 and at least one neighbor not in 𝐴.
9. ★ Prove that every graph 𝐺 without isolated vertices has an induced subgraph
𝐻 on at least 𝛼(𝐺)/2 vertices such that all vertices of 𝐻 have odd degree. Here
𝛼(𝐺) is the size of the largest independent set in 𝐺.


4 Second Moment

4.1 Does a typical random graph contain a triangle?


We begin with the following motivating question. Recall that the Erdős–Rényi random
graph 𝐺 (𝑛, 𝑝) is the 𝑛-vertex graph with edge probability 𝑝.

Question 4.1.1
For which 𝑝 = 𝑝 𝑛 does 𝐺 (𝑛, 𝑝) contain a triangle with probability 1 − 𝑜(1)?

(We sometimes abbreviate “with probability 1 − 𝑜(1)” as “with high probability” or
simply “whp”. In some literature, this is also called “asymptotically almost surely” or
“a.a.s.”)
By computing E𝑋 (also known as the first moment), we deduce the following.

Proposition 4.1.2
If 𝑛𝑝 → 0, then 𝐺 (𝑛, 𝑝) is triangle-free with probability 1 − 𝑜(1).

Proof. Let $X$ be the number of triangles in $G(n, p)$. We know from linearity of
expectations that
\[ \mathbb{E} X = \binom{n}{3} p^3 \asymp n^3 p^3 = o(1). \]
Thus, by Markov’s inequality,

P(𝑋 ≥ 1) ≤ E𝑋 = 𝑜(1).

In other words, 𝑋 = 0 with probability 1 − 𝑜(1). □

In other words, when 𝑝 ≪ 1/𝑛, 𝐺 (𝑛, 𝑝) is triangle-free with high probability (recall
that 𝑝 ≪ 1/𝑛 means 𝑝 = 𝑜(1/𝑛); see asymptotic notation guide at the beginning of
these notes).
What about when 𝑝 ≫ 1/𝑛? Can we conclude that 𝐺 (𝑛, 𝑝) contains a triangle with
high probability? In this case E𝑋 → ∞, but this is not enough to conclude that


P(𝑋 ≥ 1) = 1 − 𝑜(1), since we have not ruled out the possibility that 𝑋 is almost
always zero but extremely large with some tiny probability.
An important technique in probabilistic combinatorics is to show that some random
variable is concentrated around its mean. This would then imply that outliers are
unlikely.
We will see many methods in this course for proving concentration of random variables.
In this chapter, we begin with the simplest method. It is usually the easiest to
execute and it requires few hypotheses. The downside is that it only produces
relatively weak (though still useful) concentration bounds.
Second moment method: show that a random variable is concentrated near its mean
by bounding its variance.

Definition 4.1.3 (Variance)


The variance of a random variable 𝑋 is

\[ \operatorname{Var}[X] := \mathbb{E}[(X - \mathbb{E}X)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2. \]

The covariance of two random variables 𝑋 and 𝑌 (jointly distributed) is

Cov[𝑋, 𝑌 ] := E[(𝑋 − E𝑋)(𝑌 − E𝑌 )] = E[𝑋𝑌 ] − E[𝑋]E[𝑌 ].

(Exercise: check the second equality in the definitions of variance and covariance
above).

Remark 4.1.4 (Notation convention). It is standard to use the Greek letter $\mu$ for the
mean, and $\sigma^2$ for the variance. Here $\sigma \ge 0$ is the standard deviation.

The following basic result provides a concentration bound based on the variance.

Theorem 4.1.5 (Chebyshev’s inequality)


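Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. For any $\lambda > 0$,

\[ \mathbb{P}(|X - \mu| \ge \lambda\sigma) \le \lambda^{-2}. \]

Proof. By Markov’s inequality,

\[ \mathrm{LHS} = \mathbb{P}(|X - \mu|^2 \ge \lambda^2 \sigma^2) \le \frac{\mathbb{E}[(X - \mu)^2]}{\lambda^2 \sigma^2} = \frac{1}{\lambda^2}. \] □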

Remark 4.1.6. Concentration bounds that show small probability of deviating from
the mean are called tail bounds (more precisely: upper tail for P(𝑋 ≥ 𝜇 + 𝑎) and lower tail


for P(𝑋 ≤ 𝜇 − 𝑎)). Chebyshev’s inequality gives tail bounds that decay quadratically.
Later on we will see tools that give much better decay (usually exponential) given
additional assumptions on the random variable (e.g., independence).

We are often interested in upper bounding the probability of non-existence, i.e., P(𝑋 =
0). Chebyshev’s inequality yields the following bound.

Corollary 4.1.7 (Chebyshev bound on the probability of non-existence)


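For any random variable $X$,

\[ \mathbb{P}(X = 0) \le \frac{\operatorname{Var} X}{(\mathbb{E} X)^2}. \]

Proof. By Chebyshev’s inequality, writing $\mu = \mathbb{E} X$,

\[ \mathbb{P}(X = 0) \le \mathbb{P}(|X - \mu| \ge |\mu|) \le \frac{\operatorname{Var} X}{\mu^2}. \] □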

Corollary 4.1.8
If $\mathbb{E}X > 0$ and $\operatorname{Var} X = o(\mathbb{E}X)^2$, then $X > 0$ and $X \sim \mathbb{E}X$ with probability $1 - o(1)$.

Remark 4.1.9 (Asymptotic statements). The above statement is really referring to
not a single random variable, but a sequence of random variables $X_n$. It is saying that
if $\mathbb{E}X_n > 0$ and $\operatorname{Var} X_n = o(\mathbb{E}X_n)^2$, then $\mathbb{P}(X_n > 0) \to 1$ as $n \to \infty$, and for any fixed
$\delta > 0$, $\mathbb{P}(|X_n - \mathbb{E}X_n| > \delta\, \mathbb{E}X_n) \to 0$ as $n \to \infty$.

In many situations, it is not too hard to compute the second moment. We have
$\operatorname{Var}[X] = \operatorname{Cov}[X, X]$. Also, covariance is bilinear, i.e., for random variables $X_1, \dots$
and $Y_1, \dots$ (no assumptions needed on their independence, etc.) and constants $a_1, \dots$
and $b_1, \dots$, one has
\[ \operatorname{Cov}\Big[ \sum_i a_i X_i, \sum_j b_j Y_j \Big] = \sum_{i,j} a_i b_j \operatorname{Cov}[X_i, Y_j]. \]

We are often dealing with $X$ being the cardinality of some random set. We can usually
write this as a sum of indicator functions, such as $X = X_1 + \cdots + X_n$, so that
\[ \operatorname{Var} X = \operatorname{Cov}[X, X] = \sum_{i,j=1}^n \operatorname{Cov}[X_i, X_j] = \sum_{i=1}^n \operatorname{Var} X_i + 2 \sum_{i<j} \operatorname{Cov}[X_i, X_j]. \]


We have Cov[𝑋, 𝑌 ] = 0 if 𝑋 and 𝑌 are independent. Thus in the sum we only need to
consider dependent pairs (𝑖, 𝑗).

Example 4.1.10 (Sum of independent Bernoulli). Suppose $X = X_1 + \cdots + X_n$ with
the $X_i$ being independent Bernoulli random variables with $\mathbb{P}(X_i = 1) = p$ and
$\mathbb{P}(X_i = 0) = 1 - p$. Then $\mu = np$ and $\sigma^2 = np(1 - p)$ (note that $\operatorname{Var}[X_i] = p - p^2$ and
$\operatorname{Cov}[X_i, X_j] = 0$ if $i \ne j$). If $np \to \infty$, then $\sigma = o(\mu)$, and thus $X = \mu + o(\mu)$ whp.
Note that the above computation remains identical even if we only knew that the 𝑋𝑖 ’s
are pairwise uncorrelated (much weaker than assuming full independence).
Here the “tail probability” (the bound hidden in “whp”) decays polynomially in the
deviation. Later on we will derive much sharper rates of decay (exponential) using
more powerful tools such as the Chernoff bound when the r.v.’s are independent.
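To see how loose the polynomial tail bound is in this example, the following sketch compares the exact binomial deviation probability with the Chebyshev bound $\sigma^2/(\delta\mu)^2 = (1-p)/(\delta^2 np)$. The function names and the parameter choices ($p = 0.1$, $\delta = 0.5$) are illustrative assumptions.

```python
from math import comb


def binomial_deviation_probability(n: int, p: float, delta: float) -> float:
    """Exact P(|X - np| >= delta * np) for X ~ Binomial(n, p)."""
    mu = n * p
    return sum(
        comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(n + 1)
        if abs(j - mu) >= delta * mu
    )


def chebyshev_bound(n: int, p: float, delta: float) -> float:
    """Chebyshev: Var X / (delta * mu)^2 = (1 - p) / (delta^2 * n * p)."""
    return (1 - p) / (delta**2 * n * p)


for n in (50, 200, 800):
    exact = binomial_deviation_probability(n, 0.1, 0.5)
    bound = chebyshev_bound(n, 0.1, 0.5)
    print(n, exact, bound)
```

Both quantities tend to 0 as $np$ grows, but the exact probability does so much faster than the Chebyshev bound.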

Let us now return to the problem of determining when 𝐺 (𝑛, 𝑝) contains a triangle
whp.

Theorem 4.1.11
If 𝑛𝑝 → ∞, then 𝐺 (𝑛, 𝑝) contains a triangle with probability 1 − 𝑜(1).

Proof. Label the vertices by $[n]$. Let $X_{ij}$ be the indicator random variable of the edge
$ij$, so that $X_{ij} = 1$ if the edge is present, and $X_{ij} = 0$ if the edge is not present in the
random graph. Let us write
\[ X_{ijk} := X_{ij} X_{ik} X_{jk}. \]
Then the number of triangles in $G(n, p)$ is given by
\[ X = \sum_{i<j<k} X_{ij} X_{ik} X_{jk}. \]

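Now we compute $\operatorname{Var} X$. Note that the summands of $X$ are not all independent.
If $T_1$ and $T_2$ are each 3-vertex subsets, then
\begin{align*}
\operatorname{Cov}[X_{T_1}, X_{T_2}] &= \mathbb{E}[X_{T_1} X_{T_2}] - \mathbb{E}[X_{T_1}]\mathbb{E}[X_{T_2}] = p^{e(T_1 \cup T_2)} - p^{e(T_1) + e(T_2)} \\
&= \begin{cases}
0 & \text{if } |T_1 \cap T_2| \le 1, \\
p^5 - p^6 & \text{if } |T_1 \cap T_2| = 2, \\
p^3 - p^6 & \text{if } T_1 = T_2.
\end{cases}
\end{align*}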


The number of pairs $(T_1, T_2)$ of triangles sharing exactly one edge is $O(n^4)$. Thus
\begin{align*}
\operatorname{Var} X = \sum_{T_1, T_2} \operatorname{Cov}[X_{T_1}, X_{T_2}] &= O(n^3)(p^3 - p^6) + O(n^4)(p^5 - p^6) \\
&\lesssim n^3 p^3 + n^4 p^5 = o(n^6 p^6) \quad \text{as } np \to \infty.
\end{align*}

Thus $\operatorname{Var} X = o(\mathbb{E}X)^2$, and hence $X > 0$ whp by Corollary 4.1.8. □
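The variance computation can be made exact: there are $\binom{n}{3}$ pairs with $T_1 = T_2$ and $12\binom{n}{4}$ ordered pairs sharing exactly one edge. The sketch below (function names are illustrative, not from the notes) evaluates the resulting formula, checks it against brute-force enumeration of all graphs on a few vertices, and reports the Chebyshev bound $\operatorname{Var} X/(\mathbb{E}X)^2$ on the probability of having no triangle.

```python
from itertools import combinations
from math import comb


def triangle_mean(n: int, p: float) -> float:
    return comb(n, 3) * p**3


def triangle_variance(n: int, p: float) -> float:
    same = comb(n, 3) * (p**3 - p**6)              # pairs with T1 = T2
    share_edge = 12 * comb(n, 4) * (p**5 - p**6)   # ordered pairs sharing one edge
    return same + share_edge


def nonexistence_bound(n: int, p: float) -> float:
    """Chebyshev bound Var X / (E X)^2 on P(X = 0)."""
    return triangle_variance(n, p) / triangle_mean(n, p) ** 2


def triangle_moments_brute_force(n: int, p: float):
    """Mean and variance of the triangle count by enumerating all 2^C(n,2) graphs."""
    edges = list(combinations(range(n), 2))
    triangles = [tuple(combinations(t, 2)) for t in combinations(range(n), 3)]
    first = second = 0.0
    for mask in range(1 << len(edges)):
        present = {e for i, e in enumerate(edges) if mask >> i & 1}
        weight = p ** len(present) * (1 - p) ** (len(edges) - len(present))
        x = sum(all(e in present for e in tri) for tri in triangles)
        first += weight * x
        second += weight * x * x
    return first, second - first**2


print(triangle_moments_brute_force(4, 0.3), (triangle_mean(4, 0.3), triangle_variance(4, 0.3)))
for n, p in ((10**2, 10**-1), (10**4, 10**-2)):
    print(n, p, nonexistence_bound(n, p))
```

The brute-force enumeration is exponential in $\binom{n}{2}$ and is only meant as a check for very small $n$.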

Here is what we have learned so far: for $p = p_n$ and as $n \to \infty$,

\[ \mathbb{P}(G(n, p) \text{ contains a triangle}) \to \begin{cases} 0 & \text{if } np \to 0, \\ 1 & \text{if } np \to \infty. \end{cases} \]

We say that 1/𝑛 is a threshold for containing a triangle, in the sense that if 𝑝 grows
asymptotically faster than this threshold, i.e., 𝑝 ≫ 1/𝑛, then the event occurs with
probability 1 − 𝑜(1), while if 𝑝 ≪ 1/𝑛, then the event occurs with probability 𝑜(1).
Note that the definition of a threshold ignores leading constant factors (so that it is
equally correct to say that 2/𝑛 is a threshold for containing a triangle). Determining the
thresholds of various properties in random graphs (as well as other random settings)
is a central topic in probabilistic combinatorics. We will discuss thresholds in more
depth later in this chapter.
What else might you want to know about the probability that 𝐺 (𝑛, 𝑝) contains a
triangle?

Remark 4.1.12 (Poisson limit). What if $np \to c$ for some constant $c > 0$? It
turns out in this case that the number of triangles of $G(n, p)$ approaches a Poisson
distribution with constant mean. You will show this in the homework. It will be
done via the method of moments: if $Z$ is some random variable with sufficiently
nice properties (known as “determined by moments”, which holds for many common
distributions such as the Poisson distribution and the normal distribution), and $X_n$ is
a sequence of random variables such that $\mathbb{E}X_n^k \to \mathbb{E}Z^k$ for all nonnegative integers $k$,
then $X_n$ converges in distribution to $Z$.

Remark 4.1.13 (Asymptotic normality). Suppose $np \to \infty$. From the above proof,
we also deduce that $X \sim \mathbb{E}X$, i.e., the number of triangles is concentrated around its
mean. In fact, we know much more. It turns out that the number $X$ of triangles in
$G(n, p)$ is asymptotically normal, meaning that it satisfies a central limit theorem:
$(X - \mathbb{E}X)/\sqrt{\operatorname{Var} X}$ converges in distribution to the standard normal $N(0, 1)$.
This was shown by Ruciński (1988) via the method of moments, by computing the
$k$-th moment of $(X - \mathbb{E}X)/\sqrt{\operatorname{Var} X}$ in the limit, and showing that it agrees with the
$k$-th moment of the standard normal.


In the homework, you will prove the asymptotic normality of $X$ using a later-found
method of projections. The idea is to show that $X$ is close to another random variable
that is already known to be asymptotically normal by checking that their difference
has negligible variance. For triangle counts, when $p \gg n^{-1/2}$, we can compare the
number of triangles to the number of edges after a normalization. The method can be
further modified for greater generality. See §6.4 in the book Random Graphs by
Janson, Łuczak, and Ruciński (2000).

Remark 4.1.14 (Better tail bounds). Later on we will use more powerful tools (in-
cluding martingale methods and Azuma-Hoeffding inequalities, and also Janson in-
equalities) to prove better tail bounds on triangle (and other subgraph) counts.

4.2 Thresholds for fixed subgraphs


In the last section, we determined the threshold for 𝐺 (𝑛, 𝑝) to contain a triangle. What
about other subgraphs instead of a triangle? In this section, we give a complete answer
to this question for any fixed subgraph.

Question 4.2.1
What is the threshold for containing a fixed 𝐻 as a subgraph?

In other words, we wish to find some sequence 𝑞 𝑛 so that:


• (0-statement) if $p_n/q_n \to 0$ (i.e., $p_n \ll q_n$), then $G(n, p_n)$ contains $H$ with
probability $o(1)$;
• (1-statement) if $p_n/q_n \to \infty$ (i.e., $p_n \gg q_n$), then $G(n, p_n)$ contains $H$ with
probability $1 - o(1)$.
(It is not a priori clear why such a threshold exists in the first place. In fact, thresholds
always exist for monotone properties, as we will see in the next section.)
Building on our calculations for triangles from the previous section, let us consider a more
general setup for estimating the variance so that we can be more organized in our
calculations.

Setup 4.2.2 (for variance bound with few dependencies)


Suppose $X = X_1 + \cdots + X_m$ where $X_i$ is the indicator random variable for event $A_i$.
Write $i \sim j$ if $i \ne j$ and the pair of events $(A_i, A_j)$ are not independent. Define
\[ \Delta^* := \max_i \sum_{j : j \sim i} \mathbb{P}(A_j \mid A_i). \]


Remark 4.2.3. (a) For many applications with an underlying symmetry between
the events, the sum in the definition of Δ∗ does not actually depend on 𝑖.
(b) In the definition of the dependency graph (𝑖 ∼ 𝑗) above, we are only considering
pairwise dependence. Later on when we study the Lovász Local Lemma, we
will need a stronger notion of a dependency graph.
(c) This method is appropriate for a collection of events with few dependencies. It
is not appropriate when there are many weak dependencies (e.g., Section 4.5
on the Hardy–Ramanujan theorem on the number of distinct prime divisors).

We have the bound

Cov[𝑋𝑖 , 𝑋 𝑗 ] = E[𝑋𝑖 𝑋 𝑗 ] − E[𝑋𝑖 ]E[𝑋 𝑗 ] ≤ E[𝑋𝑖 𝑋 𝑗 ] = P[ 𝐴𝑖 𝐴 𝑗 ] = P( 𝐴𝑖 )P( 𝐴 𝑗 | 𝐴𝑖 ).

(Here 𝐴𝑖 𝐴 𝑗 is the shorthand for 𝐴𝑖 ∧ 𝐴 𝑗 , meaning that both events occur.) Also

Cov[𝑋𝑖 , 𝑋 𝑗 ] = 0 if 𝑖 ≠ 𝑗 and 𝑖 ≁ 𝑗 .

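Thus
\begin{align*}
\operatorname{Var} X = \sum_{i,j=1}^m \operatorname{Cov}[X_i, X_j] &\le \sum_{i=1}^m \mathbb{P}(A_i) + \sum_{i=1}^m \mathbb{P}(A_i) \sum_{j : j \sim i} \mathbb{P}(A_j \mid A_i) \\
&\le \mathbb{E}X + (\mathbb{E}X) \Delta^*.
\end{align*}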

Recall from Corollary 4.1.8 that $\mathbb{E}X > 0$ and $\operatorname{Var} X = o(\mathbb{E}X)^2$ imply $X > 0$ and
$X \sim \mathbb{E}X$ whp. So we have the following.

Lemma 4.2.4
In the above setup, if E𝑋 → ∞ and Δ∗ = 𝑜(E𝑋), then 𝑋 > 0 and 𝑋 ∼ E𝑋 whp.

Let us now determine the threshold for containing 𝐾4 .

Theorem 4.2.5
The threshold for containing $K_4$ is $n^{-2/3}$.

Proof. Let $X$ denote the number of copies of $K_4$ in $G(n, p)$. Then

\[ \mathbb{E}X = \binom{n}{4} p^6 \asymp n^4 p^6. \]

If $p \ll n^{-2/3}$ then $\mathbb{E}X = o(1)$, and thus $X = 0$ whp by Markov’s inequality.


Now suppose $p \gg n^{-2/3}$, so $\mathbb{E}X \to \infty$. For each 4-vertex subset $S$, let $A_S$ be the event
that $S$ is a clique in $G(n, p)$.
For each fixed $S$, one has $A_S \sim A_{S'}$ if and only if $S' \ne S$ and $|S \cap S'| \ge 2$.
• The number of $S'$ that share exactly 2 vertices with $S$ is $6 \binom{n-4}{2} = O(n^2)$, and for
each such $S'$ one has $\mathbb{P}(A_{S'} \mid A_S) = p^5$ (as there are 5 additional edges not in the
$S$-clique that need to appear to form the $S'$-clique).
• The number of $S'$ that share exactly 3 vertices with $S$ is $4(n - 4) = O(n)$, and
for each such $S'$ one has $\mathbb{P}(A_{S'} \mid A_S) = p^3$.
Summing over all above $S'$, we find
\[ \Delta^* = \sum_{S' : |S' \cap S| \in \{2, 3\}} \mathbb{P}(A_{S'} \mid A_S) \lesssim n^2 p^5 + n p^3 \ll n^4 p^6 \asymp \mathbb{E}X. \]

Thus $X > 0$ whp by Lemma 4.2.4. □

For both $K_3$ and $K_4$, we saw that for any choice of $p = p_n$ with $\mathbb{E}X \to \infty$ one has $X > 0$
whp. Is this generally true?

Example 4.2.6 (First moment is not enough). Let $H$ be $K_4$ with a pendant edge attached
(5 vertices and 7 edges; the figure is omitted here). We have $\mathbb{E}X_H \asymp n^5 p^7$.
If $\mathbb{E}X_H = o(1)$ then $X_H = 0$ whp. But what if $\mathbb{E}X_H \to \infty$, i.e., $p \gg n^{-5/7}$?
We know that if $n^{-5/7} \ll p \ll n^{-2/3}$, then $X_{K_4} = 0$ whp, so $X_H = 0$ whp since $K_4 \subseteq H$.
On the other hand, if $p \gg n^{-2/3}$, then whp we can find a copy of $K_4$, and pick an arbitrary
edge to extend it to $H$ (we’ll prove this).

Thus the threshold for $H$ is actually $n^{-2/3}$, and not $n^{-5/7}$ as one might have
naively predicted from the first moment alone.
Why didn’t $\mathbb{E}X_H \to \infty$ give $X_H > 0$ whp in our proof strategy? In the calculation
of $\Delta^*$, one of the terms is $\asymp np$ (from two copies of $H$ with a $K_4$-overlap), and
$np \gg n^5 p^7 \asymp \mathbb{E}X_H$ if $p \ll n^{-2/3}$.

The above example shows that the threshold is not always determined by
the expectation. For the property of containing 𝐻, the example suggests that we should
look at the “densest” subgraph of 𝐻 rather than 𝐻 itself.


Definition 4.2.7
Define the edge-vertex ratio of a graph $H$ by
\[ \rho(H) := \frac{e_H}{v_H}. \]

(This is the same as half the average degree.)

Define the maximum edge-vertex ratio of a subgraph of $H$:

\[ m(H) := \max_{H' \subseteq H} \rho(H'). \]

Example 4.2.8. Let $H$ be $K_4$ with a pendant edge attached (figure omitted here). We have $\rho(H) = 7/5$ whereas $\rho(K_4) = 3/2 > 7/5$.
It is not hard to check that 𝑚(𝐻) = 𝜌(𝐾4 ) = 3/2 as 𝐾4 is the subgraph of 𝐻 with the
maximum edge-vertex ratio.
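For small graphs, $m(H)$ can be computed by brute force. The sketch below enumerates vertex subsets and uses induced subgraphs, which suffices because adding edges on a fixed vertex set only increases the ratio. It runs in exponential time and is not a polynomial-time method; the function name and the encoding of $H$ as $K_4$ on vertices 0 to 3 plus the pendant edge (3, 4) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def max_edge_vertex_ratio(vertices, edges):
    """Brute-force m(H): maximize e/v over induced subgraphs on nonempty vertex subsets."""
    best = Fraction(0)
    best_subset = ()
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            induced = sum(1 for u, v in edges if u in chosen and v in chosen)
            ratio = Fraction(induced, size)
            if ratio > best:
                best, best_subset = ratio, subset
    return best, best_subset


# Assumed encoding of H: K4 on {0, 1, 2, 3} plus the pendant edge (3, 4).
k4_edges = list(combinations(range(4), 2))
h_edges = k4_edges + [(3, 4)]

ratio, subset = max_edge_vertex_ratio(range(5), h_edges)
print(ratio, subset)             # densest subgraph and its edge-vertex ratio
print(Fraction(len(h_edges), 5)) # rho(H) itself
```

Under this encoding the maximizer is the $K_4$ on vertices 0 to 3, matching the example.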

Remark 4.2.9 (Algorithm). Goldberg (1984) found a polynomial time algorithm for
computing 𝑚(𝐻) via network flow algorithms.

The next theorem completely determines the threshold for containing some fixed graph
$H$. Basically, it is determined by the expected number of copies of $H'$, where $H'$ is the
“densest” subgraph of $H$ (i.e., with the maximum edge-vertex ratio).

Theorem 4.2.10 (Threshold for containing a fixed graph: Bollobás 1981)

Fix a graph $H$. Then $p = n^{-1/m(H)}$ is a threshold for containing $H$ as a subgraph.

Proof. Let $H'$ be a subgraph of $H$ achieving the maximum edge-vertex ratio, i.e.,
$\rho(H') = m(H)$. Let $X_H$ denote the number of copies of $H$ in $G(n, p)$.
If $p \ll n^{-1/m(H)}$, then $\mathbb{E}X_{H'} \asymp n^{v_{H'}} p^{e_{H'}} = o(1)$, so $X_{H'} = 0$ whp, hence $X_H = 0$ whp.
Now suppose $p \gg n^{-1/m(H)}$. Let us count labeled copies of the subgraph $H$ in
$G(n, p)$. Let $J$ be a labeled copy of $H$ in $K_n$, and let $A_J$ denote the event that $J$ appears
in $G(n, p)$. We have, for fixed $J$,
\[ \Delta^* = \sum_{J' \sim J} \mathbb{P}(A_{J'} \mid A_J) = \sum_{J' \sim J} p^{|E(J') \setminus E(J)|}. \]

For any $J' \sim J$, we have

\[ n^{|V(J') \setminus V(J)|} p^{|E(J') \setminus E(J)|} \ll n^{|V(J)|} p^{|E(J)|} \]

since
\[ p \gg n^{-1/m(H)} \ge n^{-1/\rho(J \cap J')} = n^{-|V(J) \cap V(J')| / |E(J) \cap E(J')|}. \]
It then follows, after considering all the possible ways that $J'$ can overlap with $J$, that
$\Delta^* \ll n^{|V(J)|} p^{|E(J)|} \asymp \mathbb{E}X_H$. So Lemma 4.2.4 yields the result. □

Remark 4.2.11. The proof also gives that if $p \gg n^{-1/m(H)}$, then the number $X_H$ of
copies of $H$ is concentrated near its mean, i.e., with probability $1 - o(1)$,
\[ X_H \sim \mathbb{E}X_H = \binom{n}{v_H} \frac{v_H!}{\operatorname{aut}(H)} p^{e_H} \sim \frac{n^{v_H} p^{e_H}}{\operatorname{aut}(H)}. \]

4.3 Thresholds
Previously, we computed the threshold for containing a fixed 𝐻 as a subgraph. In
this section, we take a detour from the discussion of the second moment method and
discuss thresholds in more detail.
We begin by discussing the concept more abstractly by first defining the threshold of
any monotone property on subsets. Then we show that thresholds always exist.
Thresholds form a central topic in probabilistic combinatorics. For any given property,
it is natural to ask the following questions:
1. Where is the threshold?
2. Is the transition sharp? (And more precisely, what is the width of the transition
window?)
We understand thresholds well for many basic graph properties, but for many others,
it can be a difficult problem. Also, one might think that one must first understand
the location of the threshold before determining the nature of the phase transition, but
surprisingly this is actually not always the case. There are powerful results that can
sometimes show a sharp threshold without identifying the location of the threshold.

Here is some general setup, before specializing to graphs.


Let Ω be some finite set (ground set). Let Ω 𝑝 be a random subset of Ω where each
element is included with probability 𝑝 independently.
An increasing property, also called monotone property, on subsets of Ω is some
binary property so that if 𝐴 ⊆ Ω satisfies the property, any superset of 𝐴 automatically
satisfies the property.
A property is trivial if all subsets of Ω satisfy the property, or if all subsets of Ω do not
satisfy the property. From now on, we only consider non-trivial monotone properties.


A graph property is a property that only depends on isomorphism classes of graphs.


Whether the random graph $G(n, p)$ satisfies a given property can be cast in our setup
by viewing $G(n, p)$ as $\Omega_p$ with $\Omega = \binom{[n]}{2}$.

Here are some examples of increasing properties for subgraphs of a given set of
vertices:
• Contains some given subgraph
• Connected
• Has perfect matching
• Hamiltonian
• non-3-colorable
A family $\mathcal{F} \subseteq \mathcal{P}(\Omega)$ of subsets of $\Omega$ is called an up-set if whenever $A \in \mathcal{F}$ and $A \subseteq B$,
then $B \in \mathcal{F}$. Satisfying an increasing property is the same as being an element of an up-set. We
will use these two terms interchangeably.

Definition 4.3.1 (Threshold)


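Let $\Omega = \Omega^{(n)}$ be a finite set and $\mathcal{F} = \mathcal{F}^{(n)}$ a monotone property of subsets of $\Omega$. We
say that $q_n$ is a threshold for $\mathcal{F}$ if
\[ \mathbb{P}(\Omega_{p_n} \in \mathcal{F}) \to \begin{cases} 0 & \text{if } p_n/q_n \to 0, \\ 1 & \text{if } p_n/q_n \to \infty. \end{cases} \]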

Remark 4.3.2. The above definition is only for increasing properties. We can similarly
define the threshold for decreasing properties by an obvious modification. An example
of a non-monotone property is containing some 𝐻 as an induced subgraph. Some (but
not all) non-monotone properties also have thresholds, though we will not discuss them
here.

Remark 4.3.3. From the definition, we see that if 𝑟 𝑛 and 𝑟 𝑛′ are both thresholds of
the same property, then they must be within a constant factor of each other (exercise:
check this). Thus it makes sense to say “the threshold” rather than “a threshold.”

Existence of threshold

Question 4.3.4 (Existence of threshold)


Does every non-trivial monotone property have a threshold?


How would a monotone property not have a threshold? Perhaps one could have both
$\mathbb{P}(\Omega_{1/n} \in \mathcal{F})$ and $\mathbb{P}(\Omega_{(\log n)/n} \in \mathcal{F})$ in $[1/10, 9/10]$ for all sufficiently large $n$?
Before answering this question, let us consider an even more elementary claim.

Theorem 4.3.5 (Monotonicity of satisfying probability)


Let Ω be a finite set and F a non-trivial monotone property of subsets of Ω. Then
𝑝 ↦→ P(Ω 𝑝 ∈ F ) is a strictly increasing function of 𝑝 ∈ [0, 1].

Let us give two related proofs of this basic fact. Both are quite instructive. Both are
based on coupling of random processes.

Proof 1. Let 0 ≤ 𝑝 < 𝑞 ≤ 1. Consider the following process to generate two random
subsets of Ω. For each 𝑥, generate uniform 𝑡𝑥 ∈ [0, 1] independently at random. Let

𝐴 = {𝑥 ∈ Ω : 𝑡𝑥 ≤ 𝑝} and 𝐵 = {𝑥 ∈ Ω : 𝑡𝑥 ≤ 𝑞} .

Then 𝐴 has the same distribution as Ω 𝑝 and 𝐵 has the same distribution as Ω𝑞 .
Furthermore, since 𝑝 < 𝑞, we always have 𝐴 ⊆ 𝐵. Since F is monotone, 𝐴 ∈ F
implies 𝐵 ∈ F . Thus

P(Ω 𝑝 ∈ F ) = P( 𝐴 ∈ F ) ≤ P(𝐵 ∈ F ) = P(Ω𝑞 ∈ F ).

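To see that the inequality is strict, we simply have to observe that with positive probability,
one has $A \notin \mathcal{F}$ and $B \in \mathcal{F}$ (e.g., if all $t_x \in (p, q]$, then $A = \emptyset$ and $B = \Omega$, and $\emptyset \notin \mathcal{F}$ while $\Omega \in \mathcal{F}$ since $\mathcal{F}$ is non-trivial and monotone). □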
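The coupling in Proof 1 is easy to simulate. The sketch below draws one uniform label per element and thresholds it at $p$ and at $q$; the function name and the use of a seeded `random.Random` instance are illustrative choices, not part of the notes.

```python
import random


def coupled_subsets(ground_set, p: float, q: float, rng: random.Random):
    """Sample (A, B) with A ~ Omega_p, B ~ Omega_q and A a subset of B whenever p <= q."""
    labels = {x: rng.random() for x in ground_set}  # one uniform label t_x per element
    a = {x for x, t in labels.items() if t <= p}
    b = {x for x, t in labels.items() if t <= q}
    return a, b


rng = random.Random(0)
a, b = coupled_subsets(range(20), 0.3, 0.6, rng)
print(sorted(a))
print(sorted(b))
print(a <= b)  # containment holds for every sample, not just on average
```

Each of $A$ and $B$ has the correct marginal distribution, while the shared labels force $A \subseteq B$ in every outcome.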

In the second proof, the idea is to reveal a random subset of Ω in independent random
stages.

Proof 2. (By two-round exposure) Let 0 ≤ 𝑝 < 𝑞 ≤ 1. Note that 𝐵 = Ω𝑞 has the same
distribution as the union of two independent 𝐴 = Ω 𝑝 and 𝐴′ = Ω 𝑝 ′ , where 𝑝′ is chosen
to satisfy 1 − 𝑞 = (1 − 𝑝) (1 − 𝑝′) (check that the probability that each element occurs
is the same in the two processes). Thus

P( 𝐴 ∈ F ) ≤ P( 𝐴 ∪ 𝐴′ ∈ F ) = P(𝐵 ∈ F ).

Like earlier, to observe that the inequality is strict, one observes that with positive
probability, one has 𝐴 ∉ F and 𝐴 ∪ 𝐴′ ∈ F . □
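The parenthetical check in Proof 2 can be written out as follows. This is only a worked verification of the stated identity, with $x$ denoting an arbitrary element of $\Omega$ (notation introduced here), and it uses the independence of $A$ and $A'$.

```latex
\begin{align*}
\mathbb{P}(x \in A \cup A') &= 1 - \mathbb{P}(x \notin A)\,\mathbb{P}(x \notin A')
  = 1 - (1-p)(1-p') = q, \\
p' &= 1 - \frac{1-q}{1-p} = \frac{q-p}{1-p} \in (0, 1]
  \qquad \text{since } 0 \le p < q \le 1.
\end{align*}
```

Since elements are included independently in both $A$ and $A'$, matching the inclusion probability of each element is enough to match the distribution of $A \cup A'$ with that of $\Omega_q$.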

The above technique (generalized from two round exposure to multiple round ex-
posures) gives a nice proof of the following theorem (originally proved using the
Kruskal–Katona theorem).1
1 (Thresholds for random subspaces of $\mathbb{F}_q^n$) The proof of the Bollobás–Thomason paper using the
Kruskal–Katona theorem is still relevant. For example, there is an interesting analog of this concept
for properties of subspaces of $\mathbb{F}_q^n$, i.e., random linear codes instead of random graphs. The analogue
of the Bollobás–Thomason theorem was proved by Rossman (2020) via the Kruskal–Katona
approach. The multiple round exposure proof does not seem to work in the random subspace setting,
as one cannot write a subspace as a union of independent copies of smaller subspaces.
As an aside, I disagree with the use of the term “sharp threshold” in Rossman’s paper for describing
all thresholds for subspaces—one really should be looking at the cardinality of the subspaces rather
than their dimensions. In a related work by Guruswami, Mosheiff, Resch, Silas, and Wootters
(2022), they determine thresholds for random linear codes for properties that seem to be analogous
to the property that a random graph contains a given fixed subgraph. Here again I disagree with
them calling it a “sharp threshold.” It is much more like a coarse threshold once you parameterize
by the cardinality of the subspace, which gives you a much better analogy to the random graph
setting.
Thresholds for random linear codes seem to be an interesting topic that has only recently been
studied. I think there is more to be done here.

Theorem 4.3.6 (Existence of thresholds: Bollobás and Thomason 1987)

Every sequence of nontrivial monotone properties has a threshold.

The theorem follows from the next non-asymptotic claim.

Lemma 4.3.7 (Multiple round exposure)

Let $\Omega$ be a finite set and $\mathcal{F}$ some non-trivial monotone property. If $p \in [0, 1]$ and $m$
is a positive integer, then

\[ \mathbb{P}(\Omega_p \notin \mathcal{F}) \le \mathbb{P}(\Omega_{p/m} \notin \mathcal{F})^m. \]

Proof. Consider $m$ independent copies of $\Omega_{p/m}$, and let $Y$ be their union. Since $\mathcal{F}$ is
monotone increasing, if $Y \notin \mathcal{F}$, then none of the $m$ copies lie in $\mathcal{F}$. Hence

\[ \mathbb{P}(Y \notin \mathcal{F}) \le \mathbb{P}(\Omega_{p/m} \notin \mathcal{F})^m. \]

Note that $Y$ has the same distribution as $\Omega_q$ for some $q \le p$. So $\mathbb{P}(\Omega_p \notin \mathcal{F}) \le
\mathbb{P}(\Omega_q \notin \mathcal{F}) = \mathbb{P}(Y \notin \mathcal{F})$ by Theorem 4.3.5. Combining the two inequalities gives the
result. □

Proof of Theorem 4.3.6. Since $p \mapsto \mathbb{P}(\Omega_p \in \mathcal{F})$ is a continuous strictly increasing
function from 0 to 1 as $p$ goes from 0 to 1 (in fact it is a polynomial in $p$), there is
some unique “critical probability” $p_c$ so that $\mathbb{P}(\Omega_{p_c} \in \mathcal{F}) = 1/2$.
It remains to check for every $\varepsilon > 0$, there is some $m = m(\varepsilon)$ (not depending on the
property) so that

\[ \mathbb{P}(\Omega_{p_c/m} \notin \mathcal{F}) \ge 1 - \varepsilon \quad \text{and} \quad \mathbb{P}(\Omega_{m p_c} \notin \mathcal{F}) \le \varepsilon \]


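(here we write $\Omega_t = \Omega$ if $t > 1$.) Indeed, applying Lemma 4.3.7 with $p = p_c$, we have

\[ \mathbb{P}(\Omega_{p_c/m} \notin \mathcal{F}) \ge \mathbb{P}(\Omega_{p_c} \notin \mathcal{F})^{1/m} = 2^{-1/m} \ge 1 - \varepsilon \quad \text{if } m \ge (\log 2)/\varepsilon. \]

Applying Lemma 4.3.7 again with $p = m p_c$, we have

\[ \mathbb{P}(\Omega_{m p_c} \notin \mathcal{F}) \le \mathbb{P}(\Omega_{p_c} \notin \mathcal{F})^m = 2^{-m} \le \varepsilon \quad \text{if } m \ge \log_2(1/\varepsilon). \]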

Thus 𝑝 𝑐 is a threshold for F . □
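For a small ground set, the polynomial $p \mapsto \mathbb{P}(\Omega_p \in \mathcal{F})$ can be evaluated exactly, which makes the multiple round exposure inequality and the critical probability concrete. The sketch below uses "contains a triangle" on 4 labeled vertices as the example property; subsets are encoded as bitmasks, and all names are illustrative assumptions.

```python
from itertools import combinations


def satisfying_probability(ground_size: int, in_family, p: float) -> float:
    """Exact P(Omega_p in F), summing over all subsets encoded as bitmasks."""
    total = 0.0
    for mask in range(1 << ground_size):
        if in_family(mask):
            size = bin(mask).count("1")
            total += p**size * (1 - p) ** (ground_size - size)
    return total


def critical_probability(ground_size: int, in_family, tol: float = 1e-12) -> float:
    """Bisection for p_c with P(Omega_{p_c} in F) = 1/2 (valid since the map is increasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfying_probability(ground_size, in_family, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Example property: the edge set on 4 labeled vertices contains a triangle.
EDGES = list(combinations(range(4), 2))
TRIANGLE_MASKS = [
    sum(1 << EDGES.index(e) for e in combinations(t, 2))
    for t in combinations(range(4), 3)
]


def contains_triangle(mask: int) -> bool:
    return any(mask & t == t for t in TRIANGLE_MASKS)


pc = critical_probability(len(EDGES), contains_triangle)
print("critical probability:", pc)
for m in (2, 3, 5):
    miss = 1 - satisfying_probability(len(EDGES), contains_triangle, pc / m)
    print(m, miss, 2 ** (-1 / m))  # the proof predicts miss >= 2^{-1/m}
```

The enumeration is exponential in the size of the ground set, so this is only a check on toy examples.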

Examples
We will primarily be studying monotone graph properties. In the previous notation,
$\Omega = \binom{[n]}{2}$, and we are only considering properties that depend on the isomorphism
class of the graph.

Example 4.3.8 (Containing a triangle). We saw earlier in the chapter that the threshold
for containing a triangle is $1/n$:

\[ \mathbb{P}(G(n, p) \text{ contains a triangle}) \to \begin{cases} 0 & \text{if } np \to 0, \\ 1 - e^{-c^3/6} & \text{if } np \to c \in (0, \infty), \\ 1 & \text{if } np \to \infty. \end{cases} \]

In this case, the threshold is determined by the expected number of triangles Θ(𝑛3 𝑝 3 ),
and whether this quantity goes to zero or infinity (in the latter case, we used a second
moment method to show that the number of triangles is positive with high probability).
What if 𝑝 = Θ(1/𝑛)? If 𝑛𝑝 → 𝑐 for some constant 𝑐 > 0, then (as you will show in the
homework via the method of moments) the number of triangles is asymptotically
Poisson distributed, and in particular the limit probability of containing a triangle
increases from 0 to 1 as a continuous function of 𝑐 ∈ (0, ∞). So, in particular, as
𝑝 increases, it goes through a “window of transition” of width Θ(1/𝑛) in order for
P(𝐺 (𝑛, 𝑝) contains a triangle) to increase from 0.01 to 0.99. The width of this window
is on the same order as the threshold. In this case, we call it a coarse transition.
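The coarse transition can be made quantitative with the limiting formula $1 - e^{-c^3/6}$ for $np \to c$. The sketch below checks that the expected number of triangles at $p = c/n$ is close to $c^3/6$ and solves for the values of $c$ at which the limiting probability equals 0.01 and 0.99. The function names are illustrative.

```python
from math import comb, exp, log


def expected_triangles(n: int, c: float) -> float:
    """E X = C(n, 3) p^3 at p = c / n, which tends to c^3 / 6."""
    return comb(n, 3) * (c / n) ** 3


def limiting_triangle_probability(c: float) -> float:
    """Limit of P(G(n, c/n) contains a triangle), from the Poisson limit with mean c^3 / 6."""
    return 1 - exp(-(c**3) / 6)


def window_endpoint(q: float) -> float:
    """The value of c at which the limiting probability equals q."""
    return (-6 * log(1 - q)) ** (1 / 3)


print(expected_triangles(10**6, 2.0), 2.0**3 / 6)
print(window_endpoint(0.01), window_endpoint(0.99))
```

Both endpoints are constants, so $p$ moves from about $0.39/n$ to about $3.02/n$ across the window, which is of the same order as the threshold $1/n$ itself.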

Example 4.3.9 (Containing a subgraph). Theorem 4.2.10 tells us that the threshold
for containing a fixed subgraph 𝐻 is 𝑛−1/𝑚(𝐻) . Here the threshold is not always
determined by the expected number of copies of 𝐻. Instead, we need to look at
the “densest subgraph” 𝐻 ′ ⊆ 𝐻 with the largest edge-vertex ratio (i.e., equivalent to
largest average degree). The threshold is determined by whether the expected number
of copies of 𝐻 ′ goes to zero or infinity.
Similar to the triangle case, we have a coarse threshold.


The analysis can also be generalized to containing one of several fixed subgraphs
𝐻1 , . . . , 𝐻 𝑘 .

Remark 4.3.10 (Monotone graph properties are characterized by subgraph containment).
Every monotone graph property can be characterized as containing some
element of $\mathcal{H}$ for some family $\mathcal{H}$ that could depend on the number of vertices $n$. For example, the
property of connectivity corresponds to taking $\mathcal{H}$ to be all spanning trees. More
generally, one can take $\mathcal{H}$ to be the set of all minimal graphs satisfying the property.
When elements of $\mathcal{H}$ are unbounded in size, the problem of thresholds becomes quite
interesting and sometimes difficult.

The original Erdős–Rényi (1959) paper on random graphs already studied several
thresholds, such as the next two examples.

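Example 4.3.11 (No isolated vertices). With $p = \frac{\log n + c_n}{n}$,

\[ \mathbb{P}(G(n, p) \text{ has no isolated vertices}) \to \begin{cases} 0 & \text{if } c_n \to -\infty, \\ e^{-e^{-c}} & \text{if } c_n \to c, \\ 1 & \text{if } c_n \to \infty. \end{cases} \]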

It is a good exercise (and included in the problem set) to check the above claims. The
cases 𝑐 𝑛 → −∞ and 𝑐 𝑛 → ∞ can be shown using the second moment method. More
precisely, when 𝑐 𝑛 → 𝑐, by comparing moments one can show that the number of
isolated vertices is asymptotically Poisson.
In this example, the threshold is at (log 𝑛)/𝑛. As we see above, the transition window
is Θ(1/𝑛), much narrower than the magnitude of the threshold. In particular, the event
probability goes from 𝑜(1) to 1 − 𝑜(1) when 𝑝 increases from (1 − 𝛿)(log 𝑛)/𝑛 to
(1 + 𝛿)(log 𝑛)/𝑛 for any fixed 𝛿 > 0. In this case, we say that the property has a sharp
threshold at (log 𝑛)/𝑛 (here the leading constant factor is relevant, unlike the earlier
example of a coarse threshold).
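A first-moment computation already explains the location and the width of this window. At $p = (\log n + c)/n$, the expected number of isolated vertices is $n(1-p)^{n-1}$, which tends to $e^{-c}$, and a Poisson limit with this mean would give probability $e^{-e^{-c}}$ of having no isolated vertex. The sketch below checks the first claim numerically; the function names are illustrative.

```python
from math import exp, log


def expected_isolated_vertices(n: int, c: float) -> float:
    """n (1 - p)^(n - 1) at p = (log n + c) / n."""
    p = (log(n) + c) / n
    return n * (1 - p) ** (n - 1)


def no_isolated_vertex_limit(c: float) -> float:
    """P(Poisson(e^{-c}) = 0) = exp(-exp(-c))."""
    return exp(-exp(-c))


for c in (-1.0, 0.0, 2.0):
    print(c, expected_isolated_vertices(10**6, c), exp(-c), no_isolated_vertex_limit(c))
```

Changing $c$ by a constant shifts $p$ by $\Theta(1/n)$ and moves the limiting probability between values bounded away from 0 and 1, which is the narrow window described above.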

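Example 4.3.12 (Connectivity). With $p = \frac{\log n + c_n}{n}$,

\[ \mathbb{P}(G(n, p) \text{ is connected}) \to \begin{cases} 0 & \text{if } c_n \to -\infty, \\ e^{-e^{-c}} & \text{if } c_n \to c, \\ 1 & \text{if } c_n \to \infty. \end{cases} \]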

In fact, a much stronger statement is true, connecting the above two examples: consider
a process where one adds random edges one at a time, then with probability 1 − 𝑜(1),


the graph becomes connected as soon as there are no more isolated vertices. Such a
stronger characterization is called a hitting time result.
A similar statement is true if we replace “is connected” by “has a perfect matching”
(assuming 𝑛 even).
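The random edge process can be simulated directly. The sketch below is an illustration under its own assumptions (the name `hitting_times` and the union-find bookkeeping are not from the notes): it adds the edges of the complete graph in uniformly random order and records the first time there are no isolated vertices and the first time the graph is connected. The first time can never exceed the second; the hitting time result says that the two coincide with high probability.

```python
import random


def hitting_times(n: int, seed: int = 0):
    """Return (t_iso, t_conn) for the random edge process on n >= 2 vertices.

    t_iso  = number of edges added when the last isolated vertex disappears,
    t_conn = number of edges added when the graph first becomes connected.
    """
    rng = random.Random(seed)
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)]
    rng.shuffle(edges)  # uniformly random order of all possible edges

    parent = list(range(n))  # union-find forest tracking components

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = [0] * n
    isolated, components = n, n
    t_iso = None
    for t, (u, v) in enumerate(edges, start=1):
        for w in (u, v):
            if degree[w] == 0:
                isolated -= 1
            degree[w] += 1
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
        if t_iso is None and isolated == 0:
            t_iso = t
        if components == 1:
            return t_iso, t
    raise ValueError("n must be at least 2")


if __name__ == "__main__":
    for seed in range(5):
        print(hitting_times(300, seed))
```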

Example 4.3.13 (Perfect matching in a random hypergraph: Shamir’s problem).


Let 𝐺^{(3)}(𝑛, 𝑝) be a random hypergraph on 𝑛 vertices, where each triple of vertices
appears as an edge with probability 𝑝. Assume that 𝑛 is divisible by 3. What is the
threshold for the existence of a perfect matching (i.e., a set of 𝑛/3 edges covering all
vertices)?
It is easy to check that the property of having no isolated vertices has a sharp threshold
at 𝑝 = 2𝑛^{−2} log 𝑛. Is this also a threshold for having a perfect matching? For smaller
𝑝, one cannot have a perfect matching due to having an isolated vertex. What about
larger 𝑝? This turns out to be a difficult problem known as “Shamir’s problem”.
A difficult result by Johansson, Kahn, and Vu (2008) (this paper won a Fulkerson Prize)
showed that there is some constant 𝐶 > 0 so that if 𝑝 ≥ 𝐶𝑛^{−2} log 𝑛 then 𝐺^{(3)}(𝑛, 𝑝)
contains a perfect matching with high probability. They also solved the problem much
more generally for 𝐻-factors in random 𝑘-uniform hypergraphs.
Recent exciting breakthroughs on the Kahn–Kalai conjecture (2007) by Frankston,
Kahn, Narayanan, and Park (2021) and Park and Pham (2024) provide new and much
shorter proofs of this threshold for Shamir’s problem.
Recently, Kahn (2022) proved a sharp threshold result, and actually an even stronger
hitting time version, of Shamir’s problem, showing that with high probability, one has
a perfect matching as soon as there are no isolated vertices.

Sharp transition
In some of the examples, the probability that 𝐺 (𝑛, 𝑝) satisfies the property changes
quickly and dramatically as 𝑝 crosses the threshold (physical analogy: similar to how
the structure of water changes dramatically as the temperature drops below freezing).
For example, for connectivity, while 𝑝 = log 𝑛/𝑛 is a threshold, we see that
𝐺 (𝑛, 0.99 log 𝑛/𝑛) is whp not connected and 𝐺 (𝑛, 1.01 log 𝑛/𝑛) is whp connected,
unlike the situation for containing a triangle earlier. We call this the sharp threshold
phenomenon.


[Figure: two panels, “Coarse threshold” and “Sharp threshold”.]
Figure 4.1: Examples of coarse and sharp thresholds. The vertical axis is the
probability that 𝐺(𝑛, 𝑝) satisfies the property.

Definition 4.3.14 (Sharp thresholds)


We say that 𝑟_𝑛 is a sharp threshold for some property F on subsets of Ω if, for every
𝛿 > 0,
\[
\mathbb{P}(\Omega_{p_n} \in \mathcal{F}) \to
\begin{cases}
0 & \text{if } p_n / r_n \le 1 - \delta, \\
1 & \text{if } p_n / r_n \ge 1 + \delta.
\end{cases}
\]
On the other hand, if there is some fixed 𝜀 > 0 and 0 < 𝑐 < 𝐶 so that P(Ω_{𝑝_𝑛} ∈ F) ∈
[𝜀, 1 − 𝜀] whenever 𝑐 ≤ 𝑝_𝑛/𝑟_𝑛 ≤ 𝐶, then we say that 𝑟_𝑛 is a coarse threshold.

As in Figure 4.1, the sharpness or coarseness of a threshold is about how quickly P(Ω_𝑝 ∈ F)
transitions from 𝜀 to 1 − 𝜀 as 𝑝 increases. How wide is the transition window for 𝑝? By
the Bollobás–Thomason theorem (Theorem 4.3.6) on the existence of thresholds, this
transition window always has width 𝑂 (𝑟 𝑛 ). If the transition window has width Θ(𝑟 𝑛 )
for some 𝜀 > 0, then we have a coarse threshold. On the other hand, if the transition
window has width 𝑜(𝑟 𝑛 ) for every 𝜀 > 0, then we have a sharp threshold.
From earlier examples, we saw coarse thresholds for the “local” property of containing
some given subgraph, as well as sharp thresholds for “global” properties such as
connectivity. It turns out that this is a general phenomenon.
Friedgut’s sharp threshold theorem (1999), a deep and important result, completely
characterizes when a threshold is coarse versus sharp. We will not state Friedgut’s
theorem precisely here since it is rather technical (and actually not always easy to
apply). Let us just give a flavor. Roughly speaking, the theorem says that:
All monotone graph properties with a coarse threshold may be approxi-
mated by a local property.
In other words, informally, if a monotone graph property P has a coarse threshold,
then there is a finite list of graphs 𝐺_1, . . . , 𝐺_𝑚 such that P is “close to” the property of
containing one of 𝐺_1, . . . , 𝐺_𝑚 as a subgraph.


We need “close to” since the property could be “contains a triangle and has at least
log 𝑛 edges”, which is not exactly local but it is basically the same as “contains a
triangle.”
There is some subtlety here since we can allow very different properties depending on
the value of 𝑛. E.g., P could be the set of all 𝑛-vertex graphs that contain a 𝐾3 if 𝑛 is
odd and 𝐾4 if 𝑛 is even. Friedgut’s theorem tells us that if there is a threshold, then
there is a partition N = N_1 ∪ · · · ∪ N_𝑘 such that on each N_𝑖, P is approximately of the
form described in the previous paragraph.
In the last section, we derived that the property of containing some fixed 𝐻 has
threshold 𝑛^{−1/𝑚(𝐻)} for some rational number 𝑚(𝐻). It follows as a corollary of
Friedgut’s theorem that every coarse threshold must have this form.

Corollary 4.3.15 (of Friedgut’s sharp threshold theorem)


Suppose 𝑟(𝑛) is a coarse threshold of some graph property. Then there is a partition
of N = N_1 ∪ · · · ∪ N_𝑘 and rationals 𝛼_1, . . . , 𝛼_𝑘 > 0 such that 𝑟(𝑛) ≍ 𝑛^{−𝛼_𝑗} for every
𝑛 ∈ N_𝑗.

In particular, if (log 𝑛)/𝑛 is a threshold of some monotone graph property (e.g., this is
the case for connectivity), then we automatically know that it must be a sharp threshold,
even without knowing anything else about the property. Likewise if the threshold has
the form 𝑛^{−𝛼} for some irrational 𝛼.
The exact statement of Friedgut’s theorem is more cumbersome. We refer those who
are interested to Friedgut’s original 1999 paper and his later survey for details and
applications. This topic is connected more generally to an area known as the analysis
of boolean functions.
Also, it is known that the transition window of every monotone graph property is
(log 𝑛)^{−2+𝑜(1)} (Friedgut–Kalai (1996), Bourgain–Kalai (1997)).
Curiously, tools such as Friedgut’s theorem sometimes allow us to prove the existence
of a sharp threshold without being able to identify its exact location. For example, it is
an important open problem to understand where exactly is the transition for a random
graph to be 𝑘-colorable.

Conjecture 4.3.16 ( 𝑘 -colorability threshold)


For every 𝑘 ≥ 3 there is some real constant 𝑑_𝑘 > 0 such that for any constant 𝑑 > 0,
\[
\mathbb{P}(G(n, d/n) \text{ is $k$-colorable}) \to
\begin{cases}
1 & \text{if } d < d_k, \\
0 & \text{if } d > d_k.
\end{cases}
\]

We do know that there exists a sharp threshold for 𝑘-colorability.


Theorem 4.3.17 (Achlioptas and Friedgut 2000)


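For every 𝑘 ≥ 3, there exists a function 𝑑_𝑘(𝑛) such that for every 𝜀 > 0, and sequence
𝑑(𝑛) > 0,
\[
\mathbb{P}\left( G\left(n, \frac{d(n)}{n}\right) \text{ is $k$-colorable} \right) \to
\begin{cases}
1 & \text{if } d(n) < d_k(n) - \varepsilon, \\
0 & \text{if } d(n) > d_k(n) + \varepsilon.
\end{cases}
\]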

On the other hand, it is not known whether lim𝑛→∞ 𝑑 𝑘 (𝑛) exists, which would imply
Conjecture 4.3.16. Further bounds on 𝑑 𝑘 (𝑛) are known, e.g. the landmark paper of
Achlioptas and Naor (2006) showing that for each fixed 𝑑 > 0, whp 𝜒(𝐺(𝑛, 𝑑/𝑛)) ∈
{𝑘_𝑑, 𝑘_𝑑 + 1} where 𝑘_𝑑 = min{𝑘 ∈ N : 2𝑘 log 𝑘 > 𝑑}. Also see the later work of
Coja-Oghlan and Vilenchik (2013).

4.4 Clique number of a random graph


The clique number 𝝎(𝑮) of a graph is the maximum number of vertices in a clique
of 𝐺.

Question 4.4.1
What is the clique number of 𝐺 (𝑛, 1/2)?

Let 𝑋 be the number of 𝑘-cliques of 𝐺(𝑛, 1/2). Define
\[
f(n,k) := \mathbb{E}X = \binom{n}{k} 2^{-\binom{k}{2}}.
\]

Let us first do a rough estimate to see what is the critical 𝑘 to get 𝑓(𝑛, 𝑘) large or small.
Recall that
\[
\left(\frac{n}{k}\right)^k \le \binom{n}{k} \le \left(\frac{en}{k}\right)^k.
\]
We have
\[
\log_2 f(n,k) = k\left( \log_2 n - \log_2 k - \frac{k}{2} + O(1) \right).
\]
And so the transition point is at some 𝑘 ∼ 2 log_2 𝑛 in the sense that if 𝑘 ≥ (2 + 𝛿) log_2 𝑛,
then 𝑓(𝑛, 𝑘) → 0 while if 𝑘 ≤ (2 − 𝛿) log_2 𝑛, then 𝑓(𝑛, 𝑘) → ∞.
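To see the transition concretely, one can tabulate log_2 𝑓(𝑛, 𝑘) exactly. The sketch below is only illustrative (the helper name `log2_f` is an assumption); it relies on Python's arbitrary-precision integers, so no asymptotic approximation is involved. For moderate 𝑛 the sign change occurs noticeably below 2 log_2 𝑛, which reflects the lower-order term −𝑘 log_2 𝑘 in the estimate above.

```python
import math


def log2_f(n: int, k: int) -> float:
    """log2 of f(n, k) = C(n, k) * 2^(-C(k, 2)), for 1 <= k <= n."""
    return math.log2(math.comb(n, k)) - math.comb(k, 2)


if __name__ == "__main__":
    n = 10**6
    print("2*log2(n) =", 2 * math.log2(n))
    for k in range(30, 42):
        print(k, round(log2_f(n, k), 2))
```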
The next result gives us a lower bound on the typical clique number.


Theorem 4.4.2 (Second moment bound for clique number)


Let 𝑘 = 𝑘 (𝑛) be some sequence of positive integers.
(a) If 𝑓 (𝑛, 𝑘) → 0, then 𝜔(𝐺 (𝑛, 1/2)) < 𝑘 whp.
(b) If 𝑓 (𝑛, 𝑘) → ∞, then 𝜔(𝐺 (𝑛, 1/2)) ≥ 𝑘 whp.

Proof sketch. The first claim follows from Markov’s inequality as P(𝑋 ≥ 1) ≤ E𝑋.

For the second claim, we bound the variance using Setup 4.2.2. For each 𝑘-element
subset 𝑆 of vertices, let 𝐴_𝑆 be the event that 𝑆 is a clique. Let 𝑋_𝑆 be the indicator
random variable for 𝐴_𝑆. Recall
\[
\Delta^* := \max_i \sum_{j : j \sim i} \mathbb{P}(A_j \mid A_i).
\]
For a fixed 𝑘-set 𝑆, consider all 𝑘-sets 𝑇 with 2 ≤ |𝑆 ∩ 𝑇| ≤ 𝑘 − 1:
\[
\Delta^* = \sum_{\substack{T \in \binom{[n]}{k} \\ 2 \le |S \cap T| \le k-1}} \mathbb{P}(A_T \mid A_S)
= \sum_{i=2}^{k-1} \binom{k}{i} \binom{n-k}{k-i} 2^{\binom{i}{2} - \binom{k}{2}}
\overset{\text{omitted}}{\ll} \mathbb{E}X = \binom{n}{k} 2^{-\binom{k}{2}}.
\]
It then follows from Lemma 4.2.4 that 𝑋 > 0 (i.e., 𝜔(𝐺) ≥ 𝑘) whp. □

We can now prove a two-point concentration result for the clique number of 𝐺(𝑛, 1/2). This
result is due to Bollobás–Erdős 1976 and Matula 1976.

Theorem 4.4.3 (Two-point concentration for clique number)


There exists a 𝑘 = 𝑘 (𝑛) ∼ 2 log2 𝑛 such that 𝜔(𝐺 (𝑛, 1/2)) ∈ {𝑘, 𝑘 + 1} whp.

Proof. For 𝑘 ∼ 2 log_2 𝑛,
\[
\frac{f(n,k+1)}{f(n,k)} = \frac{n-k}{k+1} 2^{-k} = n^{-1+o(1)}.
\]
Let 𝑘_0 = 𝑘_0(𝑛) ∼ 2 log_2 𝑛 be the value with
\[
f(n,k_0) \ge n^{-1/2} > f(n,k_0+1).
\]
Then 𝑓(𝑛, 𝑘_0 − 1) → ∞ and 𝑓(𝑛, 𝑘_0 + 1) = 𝑜(1). By Theorem 4.4.2, the clique number
of 𝐺(𝑛, 1/2) is whp in {𝑘_0 − 1, 𝑘_0}, so the theorem holds with 𝑘 = 𝑘_0 − 1. □
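The value 𝑘_0 in the proof can be computed exactly for any given 𝑛. The following is a minimal sketch, assuming the hypothetical helper names `log2_f` and `k0`; it scans 𝑘 up to 2 log_2 𝑛 + 2, beyond which 𝑓(𝑛, 𝑘) ≤ 𝑛^𝑘 2^{−𝑘(𝑘−1)/2} is already below 𝑛^{−1/2}.

```python
import math


def log2_f(n: int, k: int) -> float:
    """log2 of f(n, k) = C(n, k) * 2^(-C(k, 2)), for 1 <= k <= n."""
    return math.log2(math.comb(n, k)) - math.comb(k, 2)


def k0(n: int) -> int:
    """Largest k with f(n, k) >= n^(-1/2), as in the proof of Theorem 4.4.3."""
    threshold = -0.5 * math.log2(n)
    # For k >= 2*log2(n) + 2 we have f(n, k) < n^(-1/2), so a short scan suffices.
    k_max = min(n, int(2 * math.log2(n)) + 2)
    return max(k for k in range(1, k_max + 1) if log2_f(n, k) >= threshold)


if __name__ == "__main__":
    for n in (10**2, 10**3, 10**4, 10**6):
        print(n, k0(n), round(2 * math.log2(n), 2))
```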

Remark 4.4.4. By a more careful analysis, one can show that outside a very sparse
subset of integers, one has 𝑓(𝑛, 𝑘_0) → ∞, in which case one has one-point concentration
𝜔(𝐺(𝑛, 1/2)) = 𝑘_0 whp.
By taking the complement of the graph, we also get a two-point concentration result
about the independence number of 𝐺(𝑛, 1/2). Bohman and Hofstad (2024) extended
the two-point concentration result for the independence number of 𝐺(𝑛, 𝑝) to all
𝑝 ≥ 𝑛^{−2/3+𝜀}.

Remark 4.4.5 (Chromatic number). Since the chromatic number satisfies 𝜒(𝐺) ≥
𝑛/𝛼(𝐺), we have
\[
\chi(G(n,1/2)) \ge (1 + o(1)) \frac{n}{2 \log_2 n} \quad \text{whp.}
\]

In Theorem 8.3.2, using more advanced methods, we will prove 𝜒(𝐺 (𝑛, 1/2)) ∼
𝑛/(2 log2 𝑛) whp, a theorem due to Bollobás (1987).
In Section 9.3, using martingale concentration, we will show that 𝜒(𝐺 (𝑛, 𝑝)) is tightly
concentrated around its mean without a priori needing to know where the mean is
located.

4.5 Hardy–Ramanujan theorem on the number of prime divisors

Let 𝝂(𝒏) denote the number of distinct primes dividing 𝑛 (not counting multiplicities).
The next theorem says that “almost all” 𝑛 have (1 + 𝑜(1)) log log 𝑛 prime factors.

Theorem 4.5.1 (Hardy and Ramanujan 1917)


For every 𝜀 > 0, there exists 𝐶 such that for all sufficiently large 𝑛, all but an 𝜀-fraction
of 𝑥 ∈ [𝑛] satisfy
\[
|\nu(x) - \log\log n| \le C \sqrt{\log\log n}.
\]

The original proof of Hardy and Ramanujan was quite involved. Here we show a
“probabilistic” proof due to Turán (1934), which played a key role in the development
of probabilistic methods in number theory.

Proof. Choose 𝑥 ∈ [𝑛] uniformly at random. For prime 𝑝, let
\[
X_p =
\begin{cases}
1 & \text{if } p \mid x, \\
0 & \text{otherwise.}
\end{cases}
\]
Set 𝑀 = 𝑛^{1/10} and (the sum is taken over primes 𝑝)
\[
X = \sum_{p \le M} X_p.
\]

We have
\[
\nu(x) - 10 \le X(x) \le \nu(x)
\]
since 𝑥 cannot have more than 10 prime factors > 𝑛^{1/10}. So it suffices to analyze 𝑋.
Since exactly ⌊𝑛/𝑝⌋ positive integers ≤ 𝑛 are divisible by 𝑝, we have
\[
\mathbb{E}X_p = \frac{\lfloor n/p \rfloor}{n} = \frac{1}{p} + O\left(\frac{1}{n}\right).
\]
We quote Mertens’ theorem from analytic number theory:
\[
\sum_{p \le n} \frac{1}{p} = \log\log n + O(1).
\]
(A more precise result says that the 𝑂(1) error term converges to the Meissel–Mertens
constant.) So
\[
\mathbb{E}X = \sum_{p \le M} \left( \frac{1}{p} + O\left(\frac{1}{n}\right) \right) = \log\log n + O(1).
\]

Next we compute the variance. The intuition is that divisibility by distinct primes
should behave somewhat independently. Indeed, if 𝑝𝑞 divides 𝑛, then 𝑋 𝑝 and 𝑋𝑞 are
independent (e.g., by the Chinese Remainder Theorem). If 𝑝𝑞 does not divide 𝑛, but
𝑛 is large enough, then there is some small covariance contribution. (In contrast to the
earlier variance calculations in random graphs, here we have many weak dependencies.)
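If 𝑝 ≠ 𝑞, then 𝑋_𝑝𝑋_𝑞 = 1 if and only if 𝑝𝑞 | 𝑥. Thus
\[
\begin{aligned}
\operatorname{Cov}[X_p, X_q] &= \mathbb{E}[X_p X_q] - \mathbb{E}[X_p]\,\mathbb{E}[X_q] \\
&= \frac{\lfloor n/pq \rfloor}{n} - \frac{\lfloor n/p \rfloor}{n} \cdot \frac{\lfloor n/q \rfloor}{n} \\
&= \left( \frac{1}{pq} + O\left(\frac{1}{n}\right) \right) - \left( \frac{1}{p} + O\left(\frac{1}{n}\right) \right) \left( \frac{1}{q} + O\left(\frac{1}{n}\right) \right) \\
&= O\left(\frac{1}{n}\right).
\end{aligned}
\]
Thus
\[
\sum_{p \ne q} \operatorname{Cov}[X_p, X_q] \lesssim \frac{M^2}{n} \lesssim n^{-4/5}.
\]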
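The covariance formula can be checked with exact rational arithmetic. In the sketch below, `cov_divisibility` is an illustrative name, not something from the notes; it evaluates ⌊𝑛/𝑝𝑞⌋/𝑛 − (⌊𝑛/𝑝⌋/𝑛)(⌊𝑛/𝑞⌋/𝑛) for distinct primes 𝑝, 𝑞. The covariance vanishes when 𝑝𝑞 divides 𝑛 and is at most 1/𝑛 in absolute value otherwise.

```python
from fractions import Fraction


def cov_divisibility(n: int, p: int, q: int) -> Fraction:
    """Exact Cov[X_p, X_q] for x uniform in {1, ..., n} and distinct primes p, q."""
    e_pq = Fraction(n // (p * q), n)  # P(pq | x), since p and q are distinct primes
    e_p = Fraction(n // p, n)         # P(p | x)
    e_q = Fraction(n // q, n)         # P(q | x)
    return e_pq - e_p * e_q


if __name__ == "__main__":
    print(cov_divisibility(30, 2, 3))    # pq divides n: exact independence
    print(cov_divisibility(1000, 3, 7))  # small but nonzero covariance
```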


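Also, Var 𝑋_𝑝 = E[𝑋_𝑝] − (E𝑋_𝑝)^2 = (1/𝑝)(1 − 1/𝑝) + 𝑂(1/𝑛). Combining, we have
\[
\operatorname{Var} X = \sum_{p \le M} \operatorname{Var} X_p + \sum_{p \ne q} \operatorname{Cov}[X_p, X_q]
= \sum_{p \le M} \frac{1}{p} + O(1) = \log\log n + O(1) \sim \mathbb{E}X.
\]
Thus by Chebyshev’s inequality, for every constant 𝜆 > 0,
\[
\mathbb{P}\left( |X - \log\log n| \ge \lambda \sqrt{\log\log n} \right) \le \frac{\operatorname{Var} X}{\lambda^2 (\log\log n)} = \frac{1}{\lambda^2} + o(1).
\]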

Finally, recall that |𝑋 − 𝜈| ≤ 10. So the same asymptotic bound holds with 𝑋 replaced
by 𝜈. □
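The statement is easy to explore numerically. The sketch below uses a sieve (the function name `distinct_prime_factor_counts` is an illustrative choice) to compute 𝜈(𝑥) for all 𝑥 ≤ 𝑛, and then reports the empirical mean and variance next to log log 𝑛. Since log log 𝑛 grows extremely slowly, the agreement is only up to the 𝑂(1) terms appearing in the proof.

```python
import math


def distinct_prime_factor_counts(n: int) -> list:
    """Return a list nu with nu[x] = number of distinct primes dividing x, 0 <= x <= n."""
    nu = [0] * (n + 1)
    for p in range(2, n + 1):
        if nu[p] == 0:  # no smaller prime divides p, so p is prime
            for multiple in range(p, n + 1, p):
                nu[multiple] += 1
    return nu


if __name__ == "__main__":
    n = 10**5
    nu = distinct_prime_factor_counts(n)
    values = nu[1:]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    print("log log n =", math.log(math.log(n)))
    print("mean =", mean, "variance =", var)
```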
The main idea from the above proof is that the number of prime divisors 𝑋 = ∑_𝑝 𝑋_𝑝
(from the previous proof) behaves like a sum of independent random variables.
A sum of independent random variables often satisfies a central limit theorem (i.e.,
asymptotic normality, convergence to Gaussian), assuming some mild regularity
hypotheses. In particular, we have the following corollary of the Lindeberg–Feller
central limit theorem (see Durrett, Theorem 3.4.10):

Theorem 4.5.2 (Central limit theorem for sums of independent Bernoullis)


If 𝑋_𝑛 is a sum of independent Bernoulli random variables, and Var 𝑋_𝑛 → ∞ as 𝑛 → ∞,
then (𝑋_𝑛 − E𝑋_𝑛)/√(Var 𝑋_𝑛) converges in distribution to the standard normal distribution.

(Note that the divergent variance hypothesis is necessary and sufficient.)


In the setting of prime divisibility, we do not have genuine independence. Nevertheless,
it is natural to expect that 𝜈(𝑥) still satisfies a central limit theorem. This is indeed the
case, and can be proved by comparing moments against a genuine sum of independent
Bernoulli random variables.

Theorem 4.5.3 (Asymptotic normality: Erdős and Kac 1940)


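With 𝑥 ∈ [𝑛] uniformly chosen at random, 𝜈(𝑥) is asymptotically normal, i.e., for
every 𝜆 ∈ R,
\[
\lim_{n \to \infty} \mathbb{P}_{x \in [n]} \left( \frac{\nu(x) - \log\log n}{\sqrt{\log\log n}} \ge \lambda \right) = \frac{1}{\sqrt{2\pi}} \int_{\lambda}^{\infty} e^{-t^2/2} \, dt.
\]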

The original proof of Erdős and Kac verifies the above intuition using some more
involved results in analytic number theory. Simpler proofs have been subsequently
given, and we outline one such proof below, which is based on computing the moments
of the distribution. The idea of computing moments for this problem was first used by
Delange (1953), who was apparently not aware of the Erdős–Kac paper. Also see a
more modern account by Granville and Soundararajan (2007).
The following tool from probability theory allows us to verify asymptotic normality
from convergence of moments.

Theorem 4.5.4 (Method of moments)


Let 𝑋_𝑛 be a sequence of real-valued random variables such that for every positive integer
𝑘, lim_{𝑛→∞} E[𝑋_𝑛^𝑘] equals the 𝑘-th moment of the standard normal distribution. Then
𝑋_𝑛 converges in distribution to the standard normal, i.e., lim_{𝑛→∞} P(𝑋_𝑛 ≤ 𝑎) = P(𝑍 ≤
𝑎) for every 𝑎 ∈ R, where 𝑍 is a standard normal.

Remark 4.5.5. The same conclusion holds for any probability distribution that is
“determined by its moments,” i.e., there are no other distributions sharing the same
moments. Many common distributions that arise in practice, e.g., the Poisson distri-
bution, satisfy this property. There are various sufficient conditions for guaranteeing
this moments property, e.g., Carleman’s condition tells us that any probability distri-
bution whose moments do not increase too quickly is determined by its moments. (See
Durrett §3.3.5).

Proof of the Erdős–Kac theorem (Theorem 4.5.3). We compare higher moments of 𝜈(𝑥) with
those of an idealized 𝑌 that treats divisibility by distinct primes as truly independent.
Set 𝑀 = 𝑛^{1/𝑠(𝑛)} where 𝑠(𝑛) → ∞ sufficiently slowly, and let 𝑋 = ∑_{𝑝≤𝑀} 𝑋_𝑝 (the sum is
over primes 𝑝, with 𝑋_𝑝 the indicator of 𝑝 | 𝑥 as before). As earlier, 𝜈(𝑥) − 𝑠(𝑛) ≤ 𝑋(𝑥) ≤
𝜈(𝑥).
We construct a “model random variable” mimicking 𝑋. Let 𝑌 = ∑_{𝑝≤𝑀} 𝑌_𝑝, where
𝑌_𝑝 ∼ Bernoulli(1/𝑝) independently for all primes 𝑝 ≤ 𝑀. We can compute:
\[
\mu := \mathbb{E}Y \sim \mathbb{E}X \sim \log\log n
\]
and
\[
\sigma^2 := \operatorname{Var} Y \sim \operatorname{Var} X \sim \log\log n.
\]
Let X̃ = (𝑋 − 𝜇)/𝜎 and Ỹ = (𝑌 − 𝜇)/𝜎.
A consequence of the Lindeberg–Feller central limit theorem is that a sum of independent
Bernoulli random variables with divergent variance satisfies the central limit
theorem. So Ỹ → 𝑁(0, 1) in distribution. In particular, E[Ỹ^𝑘] ∼ E[𝑍^𝑘] (asymptotics
as 𝑛 → ∞) where 𝑍 is a standard normal.
Let us compare X̃ and Ỹ. It suffices to show that for every fixed 𝑘, E[X̃^𝑘] ∼ E[Ỹ^𝑘].


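For every set of distinct primes 𝑝_1, . . . , 𝑝_𝑟 ≤ 𝑀,
\[
\mathbb{E}[X_{p_1} \cdots X_{p_r} - Y_{p_1} \cdots Y_{p_r}] = \frac{1}{n} \left\lfloor \frac{n}{p_1 \cdots p_r} \right\rfloor - \frac{1}{p_1 \cdots p_r} = O\left(\frac{1}{n}\right).
\]
Comparing expansions of X̃^𝑘 in terms of the 𝑋_𝑝’s (𝑛^{𝑜(1)} terms), we get
\[
\mathbb{E}[\widetilde{X}^k - \widetilde{Y}^k] = n^{-1+o(1)} = o(1).
\]
It follows that X̃ is asymptotically normal. □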

4.6 Distinct sums


What is the largest subset of [𝑛] all of whose subsets have distinct sums? Equivalently:

Question 4.6.1
For each 𝑘, what is the smallest 𝑛 so that there exists 𝑆 ⊆ [𝑛] with |𝑆| = 𝑘 and all 2^𝑘
subset sums of 𝑆 are distinct?

E.g., 𝑆 = {1, 2, 2^2, . . . , 2^{𝑘−1}} (the greedy choice).

We begin with an easy pigeonhole argument. Since all 2^𝑘 sums are distinct integers between
0 and 𝑘𝑛, we have 2^𝑘 ≤ 𝑘𝑛 + 1. Thus 𝑛 ≥ (2^𝑘 − 1)/𝑘.
Erdős offered $300 for a proof or disproof of the following. It remains open.

Conjecture 4.6.2 (Erdős)


𝑛 ≳ 2^𝑘

Let us use the second moment to give a modest improvement on the earlier pigeonhole
argument. The main idea here is that, by second moment, most of the subset sums lie
within an 𝑂 (𝜎)-interval, so that we can improve on the pigeonhole estimate ignoring
outlier subset sums.

Theorem 4.6.3

If there is a 𝑘-element subset of [𝑛] with distinct subset sums, then 𝑛 ≳ 2^𝑘/√𝑘.

Proof. Let 𝑆 = {𝑥_1, . . . , 𝑥_𝑘} be a 𝑘-element subset of [𝑛] with distinct subset sums.
Set
\[
X = \varepsilon_1 x_1 + \cdots + \varepsilon_k x_k
\]
where 𝜀_𝑖 ∈ {0, 1} are chosen uniformly at random independently. We have
\[
\mu := \mathbb{E}X = \frac{x_1 + \cdots + x_k}{2}
\]
and
\[
\sigma^2 := \operatorname{Var} X = \frac{x_1^2 + \cdots + x_k^2}{4} \le \frac{n^2 k}{4}.
\]
By Chebyshev’s inequality,
\[
\mathbb{P}(|X - \mu| \ge 2\sigma) \le \frac{1}{4},
\]
and thus, since 2𝜎 ≤ 𝑛√𝑘,
\[
\mathbb{P}(|X - \mu| < n\sqrt{k}) \ge \mathbb{P}(|X - \mu| < 2\sigma) \ge \frac{3}{4}.
\]
Since 𝑋 takes distinct values for every (𝜀_1, . . . , 𝜀_𝑘) ∈ {0, 1}^𝑘, we have P(𝑋 = 𝑥) ≤ 2^{−𝑘}
for all 𝑥. Since there are ≤ 2𝑛√𝑘 elements in the interval (𝜇 − 𝑛√𝑘, 𝜇 + 𝑛√𝑘), we have
\[
\mathbb{P}(|X - \mu| < n\sqrt{k}) \le 2n\sqrt{k}\, 2^{-k}.
\]
Putting the upper and lower bounds on P(|𝑋 − 𝜇| < 𝑛√𝑘) together, we get
\[
2n\sqrt{k}\, 2^{-k} \ge \frac{3}{4}.
\]
So 𝑛 ≳ 2^𝑘/√𝑘. □
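For small sets, the distinct subset sums property can be verified by brute force. The following sketch is illustrative only: `has_distinct_subset_sums` and `second_moment_lower_bound` are hypothetical names, and the constant 3/8 is simply what the displayed inequalities give when rounding in the count of integers in the interval is ignored.

```python
import math


def has_distinct_subset_sums(s) -> bool:
    """Check whether all 2^|s| subset sums of the positive integers in s are distinct."""
    sums = {0}  # sums of subsets of the elements processed so far
    for x in s:
        shifted = {t + x for t in sums}
        if sums & shifted:  # some subset containing x collides with one avoiding x
            return False
        sums |= shifted
    return True


def second_moment_lower_bound(k: int) -> float:
    """The bound n >= (3/8) * 2^k / sqrt(k) read off from the proof above (up to rounding)."""
    return 3 * 2**k / (8 * math.sqrt(k))


if __name__ == "__main__":
    for s in ([1, 2, 4, 8], [3, 5, 6, 7], [1, 2, 3]):
        print(s, has_distinct_subset_sums(s), second_moment_lower_bound(len(s)))
```

The set {3, 5, 6, 7} shows that the greedy choice {1, 2, 4, 8} is not optimal for 𝑘 = 4.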

Dubroff, Fox, and Xu (2021) gave another short proof of this result by applying Harper’s
vertex-isoperimetric inequality on the cube (this is an example of “concentration of
measure”, which we will explore more later in this course).
Consider the graph representing the 𝑛-dimensional boolean cube, with vertex set {0, 1}^𝑛
and an edge between every pair of 𝑛-tuples that differ in exactly one coordinate. Given
𝐴 ⊆ {0, 1}^𝑛, write 𝜕𝐴 for the set of all vertices outside 𝐴 that are adjacent to some
vertex of 𝐴.

Theorem 4.6.4 (Vertex-isoperimetric inequality on the hypercube: Harper 1966)


Every 𝐴 ⊆ {0, 1}^𝑘 with |𝐴| = 2^{𝑘−1} has
\[
|\partial A| \ge \binom{k}{\lfloor k/2 \rfloor}.
\]


Remark 4.6.5. A stronger form of Harper’s theorem gives the precise value of
\[
\min_{A \subseteq \{0,1\}^n : |A| = m} |\partial A|
\]
for every (𝑛, 𝑚). Basically, the minimum is achieved when 𝐴 is a Hamming ball, or, if
𝑚 is not exactly the size of some Hamming ball, then 𝐴 consists of the lexicographically
first 𝑚 elements of {0, 1}^𝑛.

Theorem 4.6.6 (Dubroff–Fox–Xu 2021)


If there is a 𝑘-element subset of [𝑛] with distinct subset sums, then
\[
n \ge \binom{k}{\lfloor k/2 \rfloor} = \left( \sqrt{\frac{2}{\pi}} + o(1) \right) \frac{2^k}{\sqrt{k}}.
\]

Remark 4.6.7. The above bound has the currently best known leading constant factor,
matching an earlier result by Aliev (2009).

Proof. Let 𝑆 = {𝑥_1, . . . , 𝑥_𝑘} be a subset of [𝑛] with distinct sums. Let
\[
A = \left\{ (\varepsilon_1, \ldots, \varepsilon_k) \in \{0,1\}^k : \varepsilon_1 x_1 + \cdots + \varepsilon_k x_k < \frac{x_1 + \cdots + x_k}{2} \right\}.
\]
Note that due to the distinct sum hypothesis, one can never have 𝜀_1𝑥_1 + · · · + 𝜀_𝑘𝑥_𝑘 =
(𝑥_1 + · · · + 𝑥_𝑘)/2. It thus follows by symmetry that |𝐴| = 2^{𝑘−1}.
Note that every element of 𝜕𝐴 corresponds to some sum of the form 𝑧 + 𝑥_𝑖 > (𝑥_1 +
· · · + 𝑥_𝑘)/2 for some 𝑧 < (𝑥_1 + · · · + 𝑥_𝑘)/2, and thus 𝑧 + 𝑥_𝑖 lies in the open interval
\[
\left( \frac{x_1 + \cdots + x_k}{2},\ \frac{x_1 + \cdots + x_k}{2} + \max S \right).
\]
This interval contains at most max 𝑆 ≤ 𝑛 integers. Since all subset sums are distinct, we
must have 𝑛 ≥ |𝜕𝐴| ≥ \binom{𝑘}{⌊𝑘/2⌋} by Harper’s
theorem (Theorem 4.6.4). □
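The asymptotic form of the bound in Theorem 4.6.6 can be compared with the exact central binomial coefficient. This is a small numerical sketch with illustrative function names; it stays below 𝑘 ≈ 1000 so that 2^𝑘 still fits in a floating-point number.

```python
import math


def harper_bound(k: int) -> int:
    """Exact lower bound C(k, floor(k/2)) from Theorem 4.6.6."""
    return math.comb(k, k // 2)


def asymptotic_bound(k: int) -> float:
    """Leading-order approximation sqrt(2/pi) * 2^k / sqrt(k) (k at most about 1000)."""
    return math.sqrt(2 / math.pi) * 2**k / math.sqrt(k)


if __name__ == "__main__":
    for k in (4, 10, 100, 1000):
        print(k, harper_bound(k) / asymptotic_bound(k))
```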

4.7 Weierstrass approximation theorem


We finish off the chapter with an application to analysis.
The Weierstrass approximation theorem says that every continuous real function on an
interval can be uniformly approximated by a polynomial.


Theorem 4.7.1 (Weierstrass approximation theorem 1885)


Let 𝑓 : [0, 1] → R be a continuous function. Let 𝜀 > 0. Then there is a polynomial
𝑝(𝑥) such that | 𝑝(𝑥) − 𝑓 (𝑥)| ≤ 𝜀 for all 𝑥 ∈ [0, 1].

Proof. (Bernstein 1912) The idea is to approximate 𝑓 by a sum of polynomials that
look like “bumps”:
\[
P_n(x) = \sum_{i=0}^{n} E_i(x) f(i/n)
\]
where
\[
E_i(x) = \mathbb{P}(\mathrm{Binomial}(n,x) = i) = \binom{n}{i} x^i (1-x)^{n-i} \quad \text{for } 0 \le i \le n
\]
is a polynomial that peaks at 𝑥 = 𝑖/𝑛 and then decays as 𝑥 moves away from
𝑖/𝑛.
For each 𝑥 ∈ [0, 1], the binomial distribution Binomial(𝑛, 𝑥) has mean 𝑛𝑥 and variance
𝑛𝑥(1 − 𝑥) ≤ 𝑛. By Chebyshev’s inequality,
\[
\sum_{i : |i - nx| > n^{2/3}} E_i(x) = \mathbb{P}(|\mathrm{Binomial}(n,x) - nx| > n^{2/3}) \le n^{-1/3}.
\]

(In the next chapter, we will see a much better tail bound.)
Since [0, 1] is compact, 𝑓 is uniformly continuous and bounded. By multiplying by
a constant, we assume that | 𝑓 (𝑥)| ≤ 1 for all 𝑥 ∈ [0, 1]. Also there exists 𝛿 > 0 such
that | 𝑓 (𝑥) − 𝑓 (𝑦)| ≤ 𝜀/2 for all 𝑥, 𝑦 ∈ [0, 1] with |𝑥 − 𝑦| ≤ 𝛿.
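Take 𝑛 > max{64𝜀^{−3}, 𝛿^{−3}}, so that 𝑛^{−1/3} < 𝛿 and 2𝑛^{−1/3} < 𝜀/2. Then for every
𝑥 ∈ [0, 1] (note that ∑_{𝑗=0}^{𝑛} 𝐸_𝑗(𝑥) = 1),
\[
\begin{aligned}
|P_n(x) - f(x)| &\le \sum_{i=0}^{n} E_i(x) |f(i/n) - f(x)| \\
&\le \sum_{i : |i/n - x| \le n^{-1/3} < \delta} E_i(x) |f(i/n) - f(x)| + \sum_{i : |i - nx| > n^{2/3}} 2 E_i(x) \\
&\le \frac{\varepsilon}{2} + 2n^{-1/3} \le \varepsilon. \qquad \square
\end{aligned}
\]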
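The Bernstein polynomial 𝑃_𝑛 is simple to evaluate numerically. The sketch below is illustrative (the function name `bernstein` and the test function |𝑥 − 1/2| are choices made here, not taken from the notes); it evaluates 𝑃_𝑛 on a grid and reports the largest deviation from 𝑓, which shrinks slowly as 𝑛 grows.

```python
import math


def bernstein(f, n: int, x: float) -> float:
    """Evaluate P_n(x) = sum_i C(n, i) x^i (1 - x)^(n - i) f(i / n)."""
    return sum(
        math.comb(n, i) * x**i * (1 - x) ** (n - i) * f(i / n)
        for i in range(n + 1)
    )


def max_error(f, n: int, grid_size: int = 100) -> float:
    """Largest |P_n(x) - f(x)| over an evenly spaced grid in [0, 1]."""
    points = [j / grid_size for j in range(grid_size + 1)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in points)


if __name__ == "__main__":
    f = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2
    for n in (10, 50, 200):
        print(n, max_error(f, n))
```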

Exercises
1. Let 𝑋 be a nonnegative real-valued random variable. Suppose P(𝑋 = 0) < 1.
Prove that
\[
\mathbb{P}(X = 0) \le \frac{\operatorname{Var} X}{\mathbb{E}[X^2]}.
\]


2. Let 𝑋 be a random variable with mean 𝜇 and variance 𝜎^2. Prove that for all
𝜆 > 0,
\[
\mathbb{P}(X \ge \mu + \lambda) \le \frac{\sigma^2}{\sigma^2 + \lambda^2}.
\]

3. Threshold for 𝑘-APs. Let [𝑛]_𝑝 denote the random subset of {1, . . . , 𝑛} where
every element is included with probability 𝑝 independently. For each fixed
integer 𝑘 ≥ 3, determine the threshold for [𝑛]_𝑝 to contain a 𝑘-term arithmetic
progression.
4. What is the threshold function for 𝐺 (𝑛, 𝑝) to contain a cycle?
5. Show that, for each fixed positive integer 𝑘, there is a sequence 𝑝 𝑛 such that

P(𝐺 (𝑛, 𝑝 𝑛 ) has a connected component with exactly 𝑘 vertices) → 1 as 𝑛 → ∞.

Hint: Make the random graph contain some specific subgraph but not some others.

6. Poisson limit. Let 𝑋 be the number of triangles in 𝐺(𝑛, 𝑐/𝑛) for some fixed
𝑐 > 0.
a) For every nonnegative integer 𝑘, determine the limit of E[\binom{𝑋}{𝑘}] as 𝑛 → ∞.
b) Let 𝑌 ∼ Binomial(𝑛, 𝜆/𝑛) for some fixed 𝜆 > 0. For every nonnegative
integer 𝑘, determine the limit of E[\binom{𝑌}{𝑘}] as 𝑛 → ∞, and show that it agrees
with the limit in (a) for some 𝜆 = 𝜆(𝑐).
We know that 𝑌 converges to the Poisson distribution with mean 𝜆. Also,
the Poisson distribution is determined by its moments.
c) Compute, for every fixed nonnegative integer 𝑡, the limit of P(𝑋 = 𝑡) as
𝑛 → ∞.
(In particular, this gives the limit probability that 𝐺 (𝑛, 𝑐/𝑛) contains a
triangle, i.e., lim𝑛→∞ P(𝑋 > 0). This limit increases from 0 to 1 continu-
ously when 𝑐 ranges from 0 to +∞, thereby showing that the property of
containing a triangle has a coarse threshold.)
7. Central limit theorem for triangle counts. Find a real (non-random) sequence 𝑎 𝑛
so that, letting 𝑋 be the number of triangles and 𝑌 be the number of edges in the
random graph 𝐺 (𝑛, 1/2), one has

Var(𝑋 − 𝑎_𝑛𝑌) = 𝑜(Var 𝑋).

Deduce that 𝑋 is asymptotically normal, that is, (𝑋 − E𝑋)/√(Var 𝑋) converges to
the normal distribution.
(You can solve for the minimizing 𝑎_𝑛 directly similar to ordinary least squares linear regression,
or first write the edge indicator variables as 𝑋_{𝑖𝑗} = (1 + 𝑌_{𝑖𝑗})/2 and then expand. The latter
approach likely yields a cleaner computation.)

8. Isolated vertices. Let 𝑝_𝑛 = (log 𝑛 + 𝑐_𝑛)/𝑛.
a) Show that, as 𝑛 → ∞,
\[
\mathbb{P}(G(n, p_n) \text{ has no isolated vertices}) \to
\begin{cases}
0 & \text{if } c_n \to -\infty, \\
1 & \text{if } c_n \to \infty.
\end{cases}
\]
b) Suppose 𝑐_𝑛 → 𝑐 ∈ R. Compute, with proof, the limit of the left-hand side above as
𝑛 → ∞, by following the approach in 6.
9. ★ Is the threshold for the bipartiteness of a random graph coarse or sharp?
(You are not allowed to use any theorems that we did not prove in class/notes.)

10. Triangle packing. Prove that, with probability approaching 1 as 𝑛 → ∞,
𝐺(𝑛, 𝑛^{−1/2}) has at least 𝑐𝑛^{3/2} edge-disjoint triangles, where 𝑐 > 0 is some
constant.
Hint: rephrase as finding a large independent set.

11. Nearly perfect triangle factor. Prove that, with probability approaching 1 as
𝑛 → ∞,
a) 𝐺(𝑛, 𝑛^{−2/3}) has at least 𝑛/100 vertex-disjoint triangles.
b) Simple nibble. 𝐺(𝑛, 𝐶𝑛^{−2/3}) has at least 0.33𝑛 vertex-disjoint triangles,
for some constant 𝐶.
Hint: view a random graph as the union of several independent random graphs and
iterate (a).

12. Permuted correlation. Recall that the correlation of two non-constant random
variables 𝑋 and 𝑌 is defined to be corr(𝑋, 𝑌) := Cov[𝑋, 𝑌]/√((Var 𝑋)(Var 𝑌)).
Let 𝑓, 𝑔 : [𝑛] → R be two non-constant functions. Prove that there exist
permutations 𝜋 and 𝜏 of [𝑛] such that, with 𝑍 being a uniform random element
of [𝑛],
\[
\operatorname{corr}(f(\pi(Z)), g(Z)) - \operatorname{corr}(f(\tau(Z)), g(Z)) \ge \frac{2}{\sqrt{n-1}}.
\]
Furthermore, show that equality can be achieved for even 𝑛.
Hint: Compute the variance of the correlation for a random permutation.


13. Let 𝑣_1 = (𝑥_1, 𝑦_1), . . . , 𝑣_𝑛 = (𝑥_𝑛, 𝑦_𝑛) ∈ Z^2 with |𝑥_𝑖|, |𝑦_𝑖| ≤ 2^{𝑛/2}/(100√𝑛) for all
𝑖 ∈ [𝑛]. Show that there are two disjoint sets 𝐼, 𝐽 ⊆ [𝑛], not both empty, such
that ∑_{𝑖∈𝐼} 𝑣_𝑖 = ∑_{𝑗∈𝐽} 𝑣_𝑗.


14. ★ Prove that there is an absolute constant 𝐶 > 0 so that the following holds. For
every prime 𝑝 and every 𝐴 ⊆ Z/𝑝Z with |𝐴| = 𝑘, there exists an integer 𝑥 so
that {𝑥𝑎 : 𝑎 ∈ 𝐴} intersects every interval of length at least 𝐶𝑝/√𝑘 in Z/𝑝Z.
15. ★ Prove that there is a constant 𝑐 > 0 so that every hyperplane containing the
origin in R^𝑛 intersects at least a 𝑐-fraction of the 2^𝑛 closed unit balls centered at
{−1, 1}^𝑛.


5 Chernoff Bound

The Chernoff bound is an extremely useful bound on the tails of a sum of independent
random variables. It is proved by bounding the moment generating function. This
proof technique is interesting and important in its own right. We will see this proof
method come up again later on when we prove martingale concentration inequalities.
The method allows us to adapt the proof of the Chernoff bound to other distributions.
Let us give the proof in the most basic case for simplicity and clarity.

Theorem 5.0.1 (Chernoff bound)


Let 𝑆_𝑛 = 𝑋_1 + · · · + 𝑋_𝑛 where 𝑋_𝑖 ∈ {−1, 1} uniformly iid. Let 𝜆 > 0. Then
\[
\mathbb{P}(S_n \ge \lambda \sqrt{n}) \le e^{-\lambda^2/2}.
\]

In contrast, Chebyshev’s inequality gives a weaker bound P(𝑆_𝑛 ≥ 𝜆√𝑛) ≤ 1/𝜆^2. On the
other hand, Chebyshev’s inequality is applicable in wider settings as it only requires
pairwise independence (for the second moment) as opposed to full independence.

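Proof. Let 𝑡 ≥ 0. Consider the moment generating function
\[
\mathbb{E}\left[e^{tS_n}\right] = \mathbb{E}\left[e^{t \sum_i X_i}\right] = \mathbb{E}\left[\prod_i e^{tX_i}\right] = \prod_i \mathbb{E}\left[e^{tX_i}\right] = \left( \frac{e^{-t} + e^{t}}{2} \right)^n.
\]
By comparing Taylor series, we have
\[
\frac{e^{-t} + e^{t}}{2} = \sum_{k \ge 0} \frac{t^{2k}}{(2k)!} \le \sum_{k \ge 0} \frac{t^{2k}}{k!\, 2^k} = e^{t^2/2}.
\]
By Markov’s inequality,
\[
\mathbb{P}(S_n \ge \lambda \sqrt{n}) \le \frac{\mathbb{E}\left[e^{tS_n}\right]}{e^{t\lambda\sqrt{n}}} \le e^{-t\lambda\sqrt{n} + t^2 n/2}.
\]
Setting 𝑡 = 𝜆/√𝑛 gives the bound. □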
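To get a feel for how tight the bound is, one can compare it with the exact tail, which is a finite sum of binomial coefficients. The sketch below is illustrative (the name `exact_tail` is an assumption); it uses 𝑆_𝑛 = 2𝐵 − 𝑛 with 𝐵 ∼ Binomial(𝑛, 1/2) and exact integer arithmetic for the count.

```python
import math


def exact_tail(n: int, lam: float) -> float:
    """Exact P(S_n >= lam * sqrt(n)) for S_n a sum of n iid uniform +-1 variables."""
    threshold = lam * math.sqrt(n)
    # S_n = 2b - n where b counts the +1's, so S_n >= threshold iff b >= (n + threshold)/2.
    b_min = math.ceil((n + threshold) / 2)
    favorable = sum(math.comb(n, b) for b in range(b_min, n + 1))
    return favorable / 2**n


def chernoff_bound(lam: float) -> float:
    """The bound exp(-lam^2 / 2) from Theorem 5.0.1."""
    return math.exp(-lam**2 / 2)


if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0, 3.0):
        print(lam, exact_tail(100, lam), chernoff_bound(lam), 1 / lam**2)
```

The last column is the Chebyshev bound 1/𝜆^2, included for comparison.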

Remark 5.0.2. The technique of considering the moment generating function can
be thought of, morally, as taking an appropriately high moment. Indeed, E[𝑒^{𝑡𝑆}] =
∑_{𝑛≥0} E[𝑆^𝑛]𝑡^𝑛/𝑛! contains all the moments data of the random variable.
The second moment method (Chebyshev + Markov) can be thought of as the first
iteration of this idea. By taking fourth moments (now requiring 4-wise independence
of the summands), we can obtain tail bounds of the form ≲ 𝜆−4 . And similarly with
higher moments.
In some applications, where one cannot assume independence, but can estimate some
high moments, the above philosophy can allow us to prove good tail bounds as well.
Also by symmetry, P(𝑆_𝑛 ≤ −𝜆√𝑛) ≤ 𝑒^{−𝜆^2/2}. Thus we have the following two-sided
tail bound.

Corollary 5.0.3
With 𝑆_𝑛 as before, for any 𝜆 ≥ 0,
\[
\mathbb{P}(|S_n| \ge \lambda \sqrt{n}) \le 2e^{-\lambda^2/2}.
\]

Remark 5.0.4. It is easy to adapt the above proof so that each 𝑋_𝑖 is a mean-zero
random variable taking [−1, 1]-values, and independent (but not necessarily identical)
across all 𝑖. Indeed, by convexity, we have 𝑒^{𝑡𝑥} ≤ ((1 − 𝑥)/2)𝑒^{−𝑡} + ((1 + 𝑥)/2)𝑒^{𝑡} for all
𝑥 ∈ [−1, 1], so that E[𝑒^{𝑡𝑋}] ≤ (𝑒^{−𝑡} + 𝑒^{𝑡})/2. In particular, we obtain the following tail bounds
on the binomial distribution.

Theorem 5.0.5 (Chernoff bound with bounded variables)


Let each 𝑋_𝑖 be an independent random variable taking values in [−1, 1] and E𝑋_𝑖 = 0.
Then 𝑆_𝑛 = 𝑋_1 + · · · + 𝑋_𝑛 satisfies
\[
\mathbb{P}(S_n \ge \lambda \sqrt{n}) \le e^{-\lambda^2/2}.
\]

Corollary 5.0.6
Let 𝑋 be a sum of 𝑛 independent Bernoulli’s (with not necessarily identical
probability). Let 𝜇 = E𝑋 and 𝜆 > 0. Then
\[
\mathbb{P}(X \ge \mu + \lambda \sqrt{n}) \le e^{-\lambda^2/2} \quad \text{and} \quad \mathbb{P}(X \le \mu - \lambda \sqrt{n}) \le e^{-\lambda^2/2}.
\]

The Chernoff bound compares well to that of the normal distribution. For the standard
normal 𝑍 ∼ 𝑁(0, 1), one has E[𝑒^{𝑡𝑍}] = 𝑒^{𝑡^2/2} and so, for 𝑡 > 0,
\[
\mathbb{P}(Z \ge \lambda) = \mathbb{P}(e^{tZ} \ge e^{t\lambda}) \le e^{-t\lambda}\, \mathbb{E}[e^{tZ}] = e^{-t\lambda + t^2/2}.
\]
Set 𝑡 = 𝜆 and get
\[
\mathbb{P}(Z \ge \lambda) \le e^{-\lambda^2/2}.
\]
And this is actually pretty tight, as, for 𝜆 → ∞,
\[
\mathbb{P}(Z \ge \lambda) = \frac{1}{\sqrt{2\pi}} \int_{\lambda}^{\infty} e^{-t^2/2} \, dt \sim \frac{e^{-\lambda^2/2}}{\sqrt{2\pi}\, \lambda}.
\]
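The comparison between the exact Gaussian tail, the bound 𝑒^{−𝜆^2/2}, and the asymptotic 𝑒^{−𝜆^2/2}/(√(2𝜋)𝜆) can be tabulated using the complementary error function. This is a minimal sketch with illustrative function names; it uses the standard identity P(𝑍 ≥ 𝜆) = erfc(𝜆/√2)/2.

```python
import math


def normal_tail(lam: float) -> float:
    """Exact P(Z >= lam) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(lam / math.sqrt(2))


def mgf_bound(lam: float) -> float:
    """The bound exp(-lam^2 / 2) obtained from the moment generating function."""
    return math.exp(-lam**2 / 2)


def tail_asymptotic(lam: float) -> float:
    """The asymptotic exp(-lam^2 / 2) / (sqrt(2 pi) * lam), valid as lam -> infinity."""
    return math.exp(-lam**2 / 2) / (math.sqrt(2 * math.pi) * lam)


if __name__ == "__main__":
    for lam in (1.0, 2.0, 5.0, 10.0):
        exact = normal_tail(lam)
        print(lam, exact, mgf_bound(lam), exact / tail_asymptotic(lam))
```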

The same proof method allows you to prove bounds for other sums of random variables,
which you can adjust based on the distributions. See the Alon–Spencer textbook,
Appendix A, for examples of bounds and proofs.
For example, for a sum of independent Bernoulli random variables with small means,
we can improve on the above estimates as follows.

Theorem 5.0.7
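Let 𝑋 be the sum of independent Bernoulli random variables (not necessarily the same
probability). Let 𝜇 = E𝑋. For all 𝜀 > 0,
\[
\mathbb{P}(X \ge (1+\varepsilon)\mu) \le e^{-((1+\varepsilon)\log(1+\varepsilon) - \varepsilon)\mu} \le e^{-\frac{\varepsilon^2}{2+\varepsilon}\mu}
\]
and
\[
\mathbb{P}(X \le (1-\varepsilon)\mu) \le e^{-\varepsilon^2 \mu/2}.
\]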

Remark 5.0.8. The bounds for upper and lower tails are necessarily asymmetric when
the probabilities are small. Why? Think about what happens when 𝑋 ∼ Bin(𝑛, 𝑐/𝑛),
which, for a constant 𝑐 > 0, converges as 𝑛 → ∞ to a Poisson distribution with mean 𝑐,
whose value at 𝑘 is 𝑒^{−𝑐}𝑐^𝑘/𝑘! = 𝑒^{−Θ(𝑘 log 𝑘)} and not the sub-Gaussian decay 𝑒^{−Ω(𝑘^2)} as
one might naively predict by an incorrect application of the Chernoff bound formula.
Nonetheless, both formulas tell us that both tails decay like 𝑒^{−Ω(𝜀^2 𝜇)} for
𝜀 ∈ [0, 1].

5.1 Discrepancy
Given a hypergraph (i.e., set family), can we color the vertices red/blue so that every
edge has roughly the same number of red versus blue vertices? (Contrast this problem
to 2-coloring hypergraphs from Section 1.3.)


Theorem 5.1.1
Let F be a collection of $m$ subsets of $[n]$. Then there exists some assignment
$[n] \to \{-1, 1\}$ so that the sum on every set in F is $O(\sqrt{n \log m})$ in absolute value.

Proof. Put ±1 iid uniformly at random on each vertex. On each edge, the probability
that the sum exceeds $2\sqrt{n \log m}$ in absolute value is, by Chernoff bound, less than
$2e^{-2\log m} = 2/m^2$. By union bound over all $m$ edges, with probability greater than
$1 - 2/m \ge 0$, no edge has sum exceeding $2\sqrt{n \log m}$ in absolute value. □
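The proof is effectively a randomized algorithm: sample a coloring and retry if some set exceeds the threshold $2\sqrt{n \log m}$. A minimal sketch follows; the set family is random test data that we made up, and the helper names are ours.

```python
import math
import random

def discrepancy(sets, coloring):
    """Largest absolute value of the coloring sum over the given sets."""
    return max(abs(sum(coloring[v] for v in s)) for s in sets)

def random_coloring(n, rng):
    """Independent uniform +1/-1 value on each vertex 0..n-1."""
    return [rng.choice((-1, 1)) for _ in range(n)]

rng = random.Random(0)
n, m = 400, 50
# Hypothetical input: m random subsets of {0, ..., n-1}.
sets = [[v for v in range(n) if rng.random() < 0.5] for _ in range(m)]
threshold = 2 * math.sqrt(n * math.log(m))

coloring = random_coloring(n, rng)
disc = discrepancy(sets, coloring)
tries = 1
while disc > threshold:  # by the proof, each try fails with probability < 2/m
    coloring = random_coloring(n, rng)
    disc = discrepancy(sets, coloring)
    tries += 1
print(tries, disc, threshold)
```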

Remark 5.1.2. In a beautiful landmark paper titled Six standard deviations suffice,
Spencer (1985) showed that one can remove the logarithmic term by a more sophisti-
cated semirandom assignment algorithm.

Theorem 5.1.3 (Six standard deviations suffice: Spencer 1985)


Let F be a collection of $n$ subsets of $[n]$. Then there exists some assignment $[n] \to \{-1, 1\}$
so that the sum on every set in F is at most $6\sqrt{n}$ in absolute value.

More generally, if F is a collection of $m \ge n$ subsets of $[n]$, then we can replace $6\sqrt{n}$
by $O(\sqrt{n \log(2m/n)})$.

Remark 5.1.4. More generally, Spencer proves that the same holds if vertices have
[0, 1]-valued weights.

The idea, very roughly speaking, is to first generalize from $\{-1, 1\}$-valued assignments
to $[-1, 1]$-valued assignments. Then the all-zero vector is a trivially satisfying
assignment. We then randomly, in iterations, alter the values from 0 to other values
in $[-1, 1]$, while avoiding potential violations (e.g., edges with sum close to $6\sqrt{n}$ in
absolute value), and finalizing the color of a vertex when its value reaches either −1 or 1.
Spencer’s original proof was not algorithmic, and he suspected that it could not be
made efficiently algorithmic. In a breakthrough result, Bansal (2010) gave an efficient
algorithm for producing a coloring with small discrepancy. Lovett and Meka (2015)
provided another elegant algorithm with a beautiful proof.
Here is a famous conjecture on discrepancy.


Conjecture 5.1.5 (Komlós)


There exists some absolute constant $K$ so that for any $v_1, \dots, v_m \in \mathbb{R}^n$ all lying in the
unit ball, there exist $\varepsilon_1, \dots, \varepsilon_m \in \{-1, 1\}$ such that

\[ \varepsilon_1 v_1 + \cdots + \varepsilon_m v_m \in [-K, K]^n. \]

Banaszczyk (1998) proved the bound $K = O(\sqrt{\log n})$ in a beautiful paper using deep
ideas from convex geometry.
Spencer’s theorem implies the special case of the Komlós conjecture where all vectors
$v_i$ have the form $n^{-1/2}(\pm 1, \dots, \pm 1)$ (or more generally when all coordinates are
$O(n^{-1/2})$). The deduction is easy when $m \le n$. When $m > n$, we use the following
observation.

Lemma 5.1.6
Let $v_1, \dots, v_m \in \mathbb{R}^n$. Then there exists $(a_1, \dots, a_m) \in [-1, 1]^m$ with
$|\{i : a_i \notin \{-1, 1\}\}| \le n$ such that
\[ a_1 v_1 + \cdots + a_m v_m = 0. \]

Proof. Find $(a_1, \dots, a_m) \in [-1, 1]^m$ satisfying $a_1 v_1 + \cdots + a_m v_m = 0$ with as many
$a_i \in \{-1, 1\}$ as possible (the all-zero vector shows that such tuples exist). Let
$I = \{i : a_i \notin \{-1, 1\}\}$. If $|I| > n$, then the vectors $v_i$, $i \in I$, are linearly dependent,
so some nontrivial linear combination of them vanishes. This allows us to move the
$(a_i)_{i \in I}$ along this combination to new values, while preserving $a_1 v_1 + \cdots + a_m v_m = 0$,
and end up with at least one additional $a_i$ taking a $\{-1, 1\}$-value, a contradiction. □
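The proof can be turned into a procedure: start from the all-zero vector and repeatedly move along a linear dependency among the fractional coordinates until one more coordinate hits ±1. The sketch below does this in exact rational arithmetic. The function names `null_vector` and `round_coefficients` and the sample vectors are hypothetical, and the Gauss-Jordan elimination is just one simple way to find a dependency.

```python
from fractions import Fraction

def null_vector(columns):
    """Return a nonzero c with sum_j c[j] * columns[j] = 0.

    Assumes more columns than coordinates, so a dependency exists.
    Uses exact Gauss-Jordan elimination over the rationals.
    """
    k, n = len(columns), len(columns[0])
    rows = [[Fraction(columns[j][r]) for j in range(k)] for r in range(n)]
    pivots = []  # pivots[r] = pivot column of row r
    for col in range(k):
        r = len(pivots)
        if r == n:
            break
        piv = next((i for i in range(r, n) if rows[i][col] != 0), None)
        if piv is None:
            continue  # no pivot in this column: it is a free column
        rows[r], rows[piv] = rows[piv], rows[r]
        pivot_value = rows[r][col]
        rows[r] = [x / pivot_value for x in rows[r]]
        for i in range(n):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append(col)
    free = next(j for j in range(k) if j not in pivots)
    c = [Fraction(0)] * k
    c[free] = Fraction(1)
    for r, col in enumerate(pivots):
        c[col] = -rows[r][free]
    return c

def round_coefficients(vectors):
    """Find a in [-1,1]^m with sum a_i v_i = 0 and at most n entries not in {-1,1}."""
    m, n = len(vectors), len(vectors[0])
    a = [Fraction(0)] * m  # the all-zero vector satisfies the equation
    while True:
        frac = [i for i in range(m) if abs(a[i]) != 1]
        if len(frac) <= n:
            return a
        c = null_vector([vectors[i] for i in frac])
        # Largest step s > 0 keeping every a_i + s * c_i inside [-1, 1].
        s = min(((1 - a[i]) / ci if ci > 0 else (-1 - a[i]) / ci)
                for i, ci in zip(frac, c) if ci != 0)
        for i, ci in zip(frac, c):
            a[i] += s * ci

vectors = [(1, 0), (0, 1), (1, 1), (2, -1), (3, 5)]  # m = 5 vectors in dimension n = 2
print(round_coefficients(vectors))
```

Each pass fixes at least one more coordinate at ±1, so the loop runs at most $m$ times.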

Let us explain how to deduce the special case of the Komlós conjecture as stated earlier.
Let $a_1, \dots, a_m$ and $I = \{i : a_i \notin \{-1, 1\}\}$ be as in the Lemma. Take $\varepsilon_i = a_i$ for all $i \notin I$,
and apply a corollary of Spencer’s theorem to find $\varepsilon_i \in \{-1, 1\}$, $i \in I$, with
\[ \sum_{i \in I} (\varepsilon_i - a_i) v_i \in [-K, K]^n, \]
which would yield the desired result, since $\sum_i \varepsilon_i v_i = \sum_{i \in I} (\varepsilon_i - a_i) v_i$. The above step
can be deduced from Spencer’s theorem by first assuming that each $a_i \in [-1, 1]$ has finite
binary length (a compactness argument), and then rounding it off one digit at a time using
Spencer’s theorem, starting from the least significant bit (see Corollary 8 in Spencer’s
paper for details).

5.2 Nearly equiangular vectors


How many vectors can one place in $\mathbb{R}^d$ so that they pairwise make equal angles?


Let $S = \{v_1, \dots, v_m\}$ be a set of unit vectors in $\mathbb{R}^n$ whose pairwise inner products are
all equal to some $\alpha \in [-1, 1)$. How large can $S$ be?
The Gram matrix of $S$, defined as the matrix of pairwise inner products, has 1’s on the
diagonal and $\alpha$ off the diagonal. So

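\[
\begin{pmatrix} | & & | \\ v_1 & \cdots & v_m \\ | & & | \end{pmatrix}^{\!\intercal}
\begin{pmatrix} | & & | \\ v_1 & \cdots & v_m \\ | & & | \end{pmatrix}
= \begin{pmatrix} v_1 \cdot v_1 & \cdots & v_1 \cdot v_m \\ \vdots & \ddots & \vdots \\ v_m \cdot v_1 & \cdots & v_m \cdot v_m \end{pmatrix}
= (1 - \alpha) I_m + \alpha J_m
\]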
(here $I_m$ and $J_m$ are the $m \times m$ identity and all-ones matrix respectively). Since the
eigenvalues of $J_m$ are $m$ (once) and 0 (repeated $m - 1$ times), the eigenvalues of
$(1 - \alpha) I_m + \alpha J_m$ are $(m - 1)\alpha + 1$ (once) and $1 - \alpha$ ($m - 1$ times). Since the Gram matrix
is positive semidefinite, all its eigenvalues are nonnegative, and so $\alpha \ge -1/(m - 1)$.
• If 𝛼 ≠ −1/(𝑚 − 1), then this 𝑚 × 𝑚 matrix is non-singular, and since its rank is
at most 𝑛 (as 𝑣 𝑖 ∈ R𝑛 ), we have 𝑚 ≤ 𝑛.
• If 𝛼 = −1/(𝑚 − 1), then this matrix has rank 𝑚 − 1, and we conclude that
𝑚 ≤ 𝑛 + 1.
It is left as an exercise to check all these bounds are tight.
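The eigenvalue claim for the Gram matrix $(1-\alpha)I_m + \alpha J_m$ is easy to confirm by hand or by machine: the all-ones vector is an eigenvector with eigenvalue $(m-1)\alpha + 1$, and any vector with zero coordinate sum is an eigenvector with eigenvalue $1 - \alpha$. A small exact check, with helper names of our own choosing:

```python
from fractions import Fraction

def gram_matrix(m, alpha):
    """The m x m matrix with 1 on the diagonal and alpha off the diagonal."""
    return [[Fraction(1) if i == j else Fraction(alpha) for j in range(m)]
            for i in range(m)]

def matvec(M, x):
    """Matrix-vector product with exact arithmetic."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

m = 5
alpha = Fraction(-1, m - 1)       # the extremal case alpha = -1/(m-1)
G = gram_matrix(m, alpha)
ones = [1] * m                    # eigenvector for (m-1)*alpha + 1
diff = [1, -1] + [0] * (m - 2)    # a zero-sum vector, eigenvalue 1 - alpha
print(matvec(G, ones))            # all zeros: the matrix is singular here
print(matvec(G, diff))
```

At $\alpha = -1/(m-1)$ the top eigenvalue vanishes, which is exactly the rank $m - 1$ case in the second bullet.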
Exercise: given 𝑚 unit vectors in R𝑛 whose pairwise inner products are all ≤ −𝛽, one
has 𝑚 ≤ 1 + ⌊1/𝛽⌋. (A bit more difficult: show that for 𝛽 = 0, one has 𝑚 ≤ 2𝑛).
What if, instead of asking for exactly equal angles, we ask for approximately the same
angle? It turns out that we can get many more vectors.

Theorem 5.2.1 (Exponentially many approximately equiangular vectors)


For every $\alpha \in (0, 1)$ and $\varepsilon > 0$, there exists $c > 0$ so that for every $n$, one can find at
least $2^{cn}$ unit vectors in $\mathbb{R}^n$ whose pairwise inner products all lie in $[\alpha - \varepsilon, \alpha + \varepsilon]$.

Remark 5.2.2. Such a collection of vectors is a type of “spherical code.” Also, by
examining the volume of spherical caps, one can prove an upper bound of the form
$2^{C_{\alpha,\varepsilon} n}$.

Proof. Let $p = (1 + \sqrt{\alpha})/2$, and let $v_1, \dots, v_m \in \{-1, 1\}^n$ be independent random
vectors where each coordinate independently is $+1$ with probability $p$ and $-1$ with
probability $1 - p$. Then for $i \ne j$, the dot product $v_i \cdot v_j$ is a sum of $n$ independent
±1-valued random variables each with mean

\[ p^2 + (1 - p)^2 - 2p(1 - p) = (p - (1 - p))^2 = (2p - 1)^2 = \alpha. \]


Applying Chernoff bound in the form of Theorem 5.0.5 (after a linear transformation on
each variable so that each term takes values in $[-1, 1]$ and has mean zero),
we get
\[ \mathbb{P}\bigl(|v_i \cdot v_j - n\alpha| \ge n\varepsilon\bigr) \le 2e^{-\Omega(n\varepsilon^2)}. \]
By the union bound, the probability that $|v_i \cdot v_j - n\alpha| > n\varepsilon$ for some $i \ne j$ is $< m^2 e^{-\Omega(n\varepsilon^2)}$,
which is $< 1$ for some $m$ at least $2^{cn}$. So with positive probability, no such pair occurs,
and then $v_1/\sqrt{n}, \dots, v_m/\sqrt{n}$ is a collection of unit vectors in $\mathbb{R}^n$ whose pairwise inner
products all lie in $[\alpha - \varepsilon, \alpha + \varepsilon]$. □
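The construction is simple to simulate. The sketch below samples biased ±1 vectors and reports the largest deviation of a normalized inner product from $\alpha$. It takes the bias to be $p = (1+\sqrt{\alpha})/2$, so that each coordinate product has mean $(2p-1)^2 = \alpha$; the sizes $m = 30$, $n = 4000$, the seed, and the function names are arbitrary choices for illustration.

```python
import math
import random

def random_sign_vectors(m, n, alpha, rng):
    """m independent vectors in {-1,1}^n, each coordinate +1 with probability p."""
    p = (1 + math.sqrt(alpha)) / 2  # chosen so that (2p - 1)^2 = alpha
    return [[1 if rng.random() < p else -1 for _ in range(n)] for _ in range(m)]

def max_deviation(vectors, alpha):
    """Largest |<v_i, v_j>/n - alpha| over all pairs i < j."""
    n = len(vectors[0])
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            inner = sum(x * y for x, y in zip(vectors[i], vectors[j])) / n
            worst = max(worst, abs(inner - alpha))
    return worst

rng = random.Random(1)
alpha = 0.25
vecs = random_sign_vectors(30, 4000, alpha, rng)
print(max_deviation(vecs, alpha))
```

After dividing each vector by $\sqrt{n}$, these are unit vectors whose pairwise inner products are the quantities measured here.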

Remark 5.2.3 (Equiangular lines with a fixed angle). Given a fixed angle 𝜃, for large
𝑛, how many lines in R𝑛 through the origin can one place whose pairwise angles are all
exactly 𝜃? This problem was solved by Jiang, Tidor, Yao, Zhang, Zhao (2021). This
is the same as asking for a set of unit vectors in R𝑛 whose pairwise inner products are
±𝛼. It turns out that for fixed 𝛼, the maximum number of lines grows linearly with the
dimension 𝑛, and the rate of growth depends on properties of 𝛼 in relation to spectral
graph theory. We refer to the cited paper for details.

5.3 Hajós conjecture counterexample


We begin by reviewing some classic results from graph theory. Recall some definitions:
• $H$ is an induced subgraph of $G$ if $H$ can be obtained from $G$ by removing
vertices;
• $H$ is a subgraph of $G$ if $H$ can be obtained from $G$ by removing vertices and
edges;
• $H$ is a subdivision of $G$ if $H$ can be obtained from a subgraph of $G$ by contracting
induced paths to edges;
• $H$ is a minor of $G$ if $H$ can be obtained from a subgraph of $G$ by contracting
edges to vertices.
Kuratowski’s theorem (1930). Every graph without $K_{3,3}$ and $K_5$ as subdivisions is planar.
Wagner’s theorem (1937). Every graph free of $K_{3,3}$ and $K_5$ as minors is planar.
(There is a short argument showing that Kuratowski’s and Wagner’s theorems are equivalent.)
Four color theorem (Appel and Haken 1977) Every planar graph is 4-colorable.
Corollary: Every graph without 𝐾3,3 and 𝐾5 as minors is 4-colorable.


The condition on $K_5$ is clearly necessary, but what about $K_{3,3}$? What is the “real”
reason for 4-colorability?
Hadwiger’s conjecture, below, remains a major conjecture in graph theory.

Conjecture 5.3.1 (Hadwiger 1936)


For every 𝑡 ≥ 1, every graph without a 𝐾𝑡+1 minor is 𝑡-colorable.

• 𝑡 = 1 trivial
• 𝑡 = 2 nearly trivial (if 𝐺 is 𝐾3 -minor-free, then it’s a tree)
• 𝑡 = 3 elementary graph theoretic arguments
• 𝑡 = 4 is equivalent to the 4-color theorem (Wagner 1937)
• 𝑡 = 5 is equivalent to the 4-color theorem (Robertson–Seymour–Thomas 1994;
this work won a Fulkerson Prize)
• 𝑡 ≥ 6 remains open
Let us explore a variation of Hadwiger’s conjecture:
Hajós conjecture. (1961) Every graph without a 𝐾𝑡+1 -subdivision is 𝑡-colorable.
Hajós conjecture is true for 𝑡 ≤ 3. However, it turns out to be false in general. Catlin
(1979) constructed counterexamples for all 𝑡 ≥ 6 (𝑡 = 4, 5 are still open).
It turns out that Hajós conjecture is not just false, but very false.
Erdős–Fajtlowicz (1981) showed that almost every graph is a counterexample (it’s a
good idea to check for potential counterexamples among random graphs!)

Theorem 5.3.2
With probability $1 - o(1)$, $G(n, 1/2)$ has no $K_t$-subdivision with $t = \lceil 10\sqrt{n} \rceil$.

From Theorem 4.4.3 we know that, with high probability, $G(n, 1/2)$ has independence
number $\sim 2\log_2 n$ and hence chromatic number $\ge (1 + o(1))\frac{n}{2\log_2 n}$. Thus the above
result shows that $G(n, 1/2)$ is whp a counterexample to Hajós conjecture.

Proof. Suppose $G$ had a $K_t$-subdivision, say with $S \subseteq V$, $|S| = t$, as its set of branch
vertices. Each pair of vertices of $S$ is connected via a path, whose intermediate vertices
are outside $S$, and distinct for different pairs of vertices.

At most $n$ of the $\binom{t}{2}$ pairs of vertices in $S$ can be joined this way using a path of at
least two edges, since each such path uses up a vertex outside $S$. Thus at least $\binom{t}{2} - n$ of
the pairs of vertices of $S$ form edges.


By Chernoff bound, for a fixed $t$-vertex subset $S$,
\[ \mathbb{P}\left(e(S) \ge \binom{t}{2} - n\right) \le \mathbb{P}\left(e(S) \ge \frac{3}{4}\binom{t}{2}\right) \le e^{-t^2/10}. \]

Taking a union bound over all $t$-vertex subsets $S$, and noting that
\[ \binom{n}{t} e^{-t^2/10} < n^t e^{-t^2/10} \le e^{-10n + O(\sqrt{n}\log n)} = o(1), \]
we see that whp no such $S$ exists, so that this $G(n, 1/2)$ whp has no $K_t$-subdivision. □
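To get a feel for how small the union bound is, one can evaluate the logarithm of $\binom{n}{t} e^{-t^2/10}$ with $t = \lceil 10\sqrt{n} \rceil$ directly. This is only a numerical illustration; the function name is ours, and the computation assumes $n \ge 100$ so that $t \le n$.

```python
import math

def log_union_bound(n):
    """Natural log of C(n, t) * exp(-t^2 / 10), where t = ceil(10 * sqrt(n)).

    Assumes n >= 100, so that t <= n and the binomial coefficient is positive.
    """
    t = math.ceil(10 * math.sqrt(n))
    return math.log(math.comb(n, t)) - t * t / 10

for n in (100, 1000, 10000):
    print(n, log_union_bound(n))
```

The values are large and negative, matching the $e^{-10n + O(\sqrt{n}\log n)}$ estimate.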

Remark 5.3.3 (Quantitative question). One can ask the following quantitative question
regarding Hadwiger’s conjecture:
Can every graph without a $K_{t+1}$-minor be properly colored with a small number of
colors?
Wagner (1964) showed that every graph without a $K_{t+1}$-minor is $2^{t-1}$-colorable.
Here is the proof: assume that the graph is connected. Take a vertex $v$ and let $L_i$ be
the set of vertices at distance exactly $i$ from $v$. The subgraph induced on $L_i$ has no
$K_t$-minor, since otherwise such a $K_t$-minor would extend to a $K_{t+1}$-minor with $v$. Then
by induction $L_i$ is $2^{t-2}$-colorable (check base cases), and using two disjoint palettes of
$2^{t-2}$ colors, one for even and one for odd layers $L_i$, yields a proper coloring of $G$.
This bound has been improved over time. Delcourt and Postle (2021+) showed that
every graph with no $K_t$-minor is $O(t \log\log t)$-colorable.

For more on Hadwiger’s conjecture, see Seymour’s survey (2016).

Exercises
1. Prove that with probability $1 - o(1)$ as $n \to \infty$, every bipartite subgraph of
$G(n, 1/2)$ has at most $n^2/8 + 10n^{3/2}$ edges.
2. Unbalancing lights. Prove that there is a constant $C$ so that for every positive
integer $n$, one can find an $n \times n$ matrix $A$ with $\{-1, 1\}$ entries, so that for all
vectors $x, y \in \{-1, 1\}^n$, $|y^\intercal A x| \le Cn^{3/2}$.
3. Prove that there exists a constant $c > 1$ such that for every $n$, there are at least
$c^n$ points in $\mathbb{R}^n$ so that every triple of points forms a triangle whose angles are all
less than $61^\circ$.


4. Planted clique. Give a deterministic polynomial-time algorithm for the following
task so that it succeeds over the random input with probability approaching
1 as $n \to \infty$.
Input: some unlabeled $n$-vertex $G$ created as the union of $G(n, 1/2)$ and a clique
on $t = \lfloor 100\sqrt{n \log n} \rfloor$ vertices.
Output: a clique in $G$ of size $t$.
5. Weighing coins. You are given 𝑛 coins, each with one of two known weights,
but otherwise indistinguishable. You can use a scale that outputs the combined
weight of any subset of the coins. You must decide in advance which subsets
𝑆1 , . . . , 𝑆 𝑘 ⊆ [𝑛] of the coins to weigh. We wish to determine the minimum
number of weighings needed to identify the weight of every coin. (Below, 𝑋
and 𝑌 represent two possibilities for which coins are of the first weight.)
a) ★ Prove that if 𝑘 ≤ 1.99𝑛/log2 𝑛 and 𝑛 is sufficiently large, then for every
𝑆1 , . . . , 𝑆 𝑘 ⊆ [𝑛], there are two distinct subsets 𝑋, 𝑌 ⊆ [𝑛] such that
|𝑋 ∩ 𝑆𝑖 | = |𝑌 ∩ 𝑆𝑖 | for all 𝑖 ∈ [𝑘].
(There is a neat solution to part (a) using information theory, though here you are explicitly
asked to solve it using the Chernoff bound.)

b) ★ Show that there is some constant 𝐶 such that (a) is false if 1.99 is replaced
by 𝐶. (What is the best 𝐶 you can get?)


6 Lovász Local Lemma

The Lovász local lemma (LLL) was introduced in the paper of Erdős and Lovász
(1975). It is a powerful tool in the probabilistic method.
In many problems, we wish to avoid a certain set of “bad events.” Here are two
easy-to-handle scenarios:
• (Complete independence) All the bad events are independent and have probability
less than 1.
• (Union bound) The sum of the bad event probabilities is less than 1.
The local lemma deals with an intermediate situation where there are only a few
local dependencies.
We saw an application of the Lovász local lemma back in Section 1.1, where we used
it to lower bound Ramsey numbers. This chapter explores the local lemma and its
applications in depth.

6.1 Statement and proof

Definition 6.1.1 (Independence from a set of events)


Here we say that an event $A_0$ is independent from events $A_1, \dots, A_m$ if $A_0$ is independent
of every event of the form $B_1 \wedge \cdots \wedge B_m$ (we sometimes omit the “logical and”
symbol ∧) where each $B_i$ is either $A_i$ or its complement $\overline{A_i}$, i.e.,

\[ \mathbb{P}(A_0 B_1 \cdots B_m) = \mathbb{P}(A_0)\,\mathbb{P}(B_1 \cdots B_m), \]

or, equivalently, using Bayes’s rule (whenever $\mathbb{P}(B_1 \cdots B_m) > 0$):

\[ \mathbb{P}(A_0 \mid B_1 \cdots B_m) = \mathbb{P}(A_0). \]

Given a collection of events, we can associate to it a dependency graph. This is a


slightly subtle notion, as we will explain. Technically speaking, the graph can be a
directed graph (=digraph), but for most applications, it will be sufficient (and easier)
to use undirected graphs.


Definition 6.1.2 (Dependency (di)graph)


Let $A_1, \dots, A_n$ be events (the “bad events” we wish to avoid). Let $G$ be a (directed)
graph with vertex set $[n]$. We say that $G$ is a dependency (di)graph for the events
$A_1, \dots, A_n$ if, for every $i$, $A_i$ is independent from all $\{A_j : j \notin N(i) \cup \{i\}\}$ ($N(i)$
is the set of (out)neighbors of $i$ in $G$).

Remark 6.1.3 (Non-uniqueness). Given a collection of events, there can be more
than one valid dependency graph. For example, the complete graph is always a valid
dependency graph.

Remark 6.1.4 (Important!). Independence ≠ pairwise independence.

The dependency graph is not made by joining $i \sim j$ whenever $A_i$ and $A_j$ are not
independent (i.e., $\mathbb{P}(A_i A_j) \ne \mathbb{P}(A_i)\mathbb{P}(A_j)$).
Example: suppose one picks $x_1, x_2, x_3 \in \mathbb{Z}/2\mathbb{Z}$ uniformly and independently at random
and sets, for each $i = 1, 2, 3$ (indices taken mod 3), $A_i$ to be the event that $x_{i+1} + x_{i+2} = 0$.
Then these events are pairwise independent but not independent. So the empty graph
on three vertices is not a valid dependency graph (on the other hand, having at least
two edges makes it a valid dependency graph).
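The example is small enough to verify by enumerating all eight outcomes. The following sketch (with 0-based indices and helper names of our own) checks that the three events are pairwise independent but not mutually independent.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product((0, 1), repeat=3))  # all (x_1, x_2, x_3) in (Z/2Z)^3

def event(i):
    """A_i: the other two coordinates sum to 0 mod 2 (indices taken mod 3)."""
    return {x for x in outcomes if (x[(i + 1) % 3] + x[(i + 2) % 3]) % 2 == 0}

def prob(s):
    """Probability of a set of outcomes under the uniform distribution."""
    return Fraction(len(s), len(outcomes))

A = [event(i) for i in range(3)]
pairwise = all(prob(A[i] & A[j]) == prob(A[i]) * prob(A[j])
               for i in range(3) for j in range(i + 1, 3))
mutual = prob(A[0] & A[1] & A[2]) == prob(A[0]) * prob(A[1]) * prob(A[2])
print(pairwise, mutual)
```

Any two of the events force all three coordinates to be equal, so the third event then holds automatically; this is why the triple intersection has probability 1/4 rather than 1/8.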

In practice, it is not too hard to construct a valid dependency graph, since most
applications of the Lovász local lemma use the following setup (which we saw in
Section 1.1).

Setup 6.1.5 (Random variable model / hypergraph coloring)


Let {𝑥𝑖 : 𝑖 ∈ 𝐼} be a collection of independent random variables. Let 𝐸 1 , . . . , 𝐸 𝑛 be
events where each 𝐸𝑖 depends only on the variables indexed by some subset 𝐵𝑖 ⊆ 𝐼 of
variables. A canonical dependency graph for the events 𝐸 1 , . . . , 𝐸 𝑛 has vertex set [𝑛]
and an edge 𝑖 𝑗 whenever 𝐵𝑖 ∩ 𝐵 𝑗 ≠ ∅.

It is easy to check that the canonical dependency graph above is indeed a valid depen-
dency graph.

Example 6.1.6 (Boolean satisfiability problem (SAT)). Given a CNF formula (con-
junctive normal form, i.e., and-of-or’s), e.g., (∧ = and; ∨ = or)

(𝑥 1 ∨ 𝑥 2 ∨ 𝑥3 ) ∧ (𝑥 1 ∨ 𝑥 2 ∨ 𝑥 4 ) ∧ (𝑥 2 ∨ 𝑥 4 ∨ 𝑥 5 ) ∧ · · ·

the problem is to find a satisfying assignment of the boolean variables $x_1, x_2, \dots$. Many
problems in computer science can be modeled this way. This problem can be
viewed as in Setup 6.1.5, where $A_i$ is the event that the $i$-th clause is violated.
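In the random variable model, the canonical dependency graph is determined by which events share variables. The sketch below builds it from the variable sets of the clauses. The first three sets mirror the variables of the clauses displayed above, the fourth clause (on $x_6, x_7, x_8$) is a hypothetical addition to show a non-edge, and the function name is ours.

```python
def canonical_dependency_graph(supports):
    """Adjacency sets: i ~ j whenever events i and j share a variable.

    supports[i] is the set of variable indices that event i depends on.
    """
    n = len(supports)
    return {i: {j for j in range(n) if j != i and supports[i] & supports[j]}
            for i in range(n)}

# Variable sets of the clauses; the last one is a made-up extra clause.
clauses = [{1, 2, 3}, {1, 2, 4}, {2, 4, 5}, {6, 7, 8}]
graph = canonical_dependency_graph(clauses)
print(graph)
```

Only the supports matter here; the signs of the literals affect which assignments violate a clause but not which clauses are adjacent.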


The following formulation of the local lemma is easiest to apply and is the most
commonly used. It applies to settings where the dependency graph has small maximum
degree.

Theorem 6.1.7 (Lovász local lemma; symmetric form)


Let $A_1, \dots, A_n$ be events, with $\mathbb{P}[A_i] \le p$ for all $i$. Suppose that each $A_i$ is independent
from the set of all other $A_j$ except for at most $d$ of them. If

\[ e p (d + 1) \le 1, \]

then with positive probability, none of the events $A_i$ occur.

Remark 6.1.8. The constant 𝑒 is best possible (Shearer 1985). In most applications,
the precise value of the constant is unimportant.

Theorem 6.1.9 (Lovász local lemma; general form)


Let $A_1, \dots, A_n$ be events. For each $i \in [n]$, let $N(i)$ be such that $A_i$ is independent
from $\{A_j : j \notin \{i\} \cup N(i)\}$. If $x_1, \dots, x_n \in [0, 1)$ satisfy
\[ \mathbb{P}(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j) \quad \text{for all } i \in [n], \]
then
\[ \mathbb{P}(\text{none of the events } A_i \text{ occur}) \ge \prod_{i=1}^{n} (1 - x_i). \]

Proof that the general form implies the symmetric form. Set $x_i = 1/(d + 1) < 1$ for
all $i$. Then
\[ x_i \prod_{j \in N(i)} (1 - x_j) \ge \frac{1}{d+1}\left(1 - \frac{1}{d+1}\right)^d > \frac{1}{(d+1)e} \ge p, \]
so the hypothesis of the general local lemma holds. □

Here is another corollary of the general form local lemma, which applies if the total
probability of any neighborhood in a dependency graph is small.

Corollary 6.1.10
In the setup of Theorem 6.1.9, if $\mathbb{P}(A_i) < 1/2$ and $\sum_{j \in N(i)} \mathbb{P}(A_j) \le 1/4$ for all $i$, then
with positive probability none of the events $A_i$ occur.


Proof. In Theorem 6.1.9, set $x_i = 2\mathbb{P}(A_i)$ for each $i$. Then
\[ x_i \prod_{j \in N(i)} (1 - x_j) \ge x_i \Bigl(1 - \sum_{j \in N(i)} x_j\Bigr) = 2\mathbb{P}(A_i)\Bigl(1 - \sum_{j \in N(i)} 2\mathbb{P}(A_j)\Bigr) \ge \mathbb{P}(A_i). \]
(The first inequality is by “union bound.”) □

In some applications, one may need to apply the general form local lemma with
carefully chosen values for 𝑥𝑖 .

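Proof of Lovász local lemma (general case). We will prove that
\[ \mathbb{P}\Bigl(A_i \Bigm| \bigwedge_{j \in S} \overline{A_j}\Bigr) \le x_i \quad \text{whenever } i \notin S \subseteq [n]. \tag{6.1} \]
Once (6.1) has been established, we then deduce that
\begin{align*}
\mathbb{P}(\overline{A_1} \cdots \overline{A_n}) &= \mathbb{P}(\overline{A_1})\,\mathbb{P}(\overline{A_2} \mid \overline{A_1})\,\mathbb{P}(\overline{A_3} \mid \overline{A_1}\,\overline{A_2}) \cdots \mathbb{P}(\overline{A_n} \mid \overline{A_1} \cdots \overline{A_{n-1}}) \\
&\ge (1 - x_1)(1 - x_2) \cdots (1 - x_n),
\end{align*}
which is the conclusion of the local lemma.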


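Now we prove (6.1) by induction on $|S|$. The base case $|S| = 0$ is trivial.
Let $i \notin S$. Let $S_1 = S \cap N(i)$ and $S_2 = S \setminus S_1$. We have
\[ \mathbb{P}\Bigl(A_i \Bigm| \bigwedge_{j \in S} \overline{A_j}\Bigr) = \frac{\mathbb{P}\bigl(A_i \bigwedge_{j \in S_1} \overline{A_j} \bigm| \bigwedge_{j \in S_2} \overline{A_j}\bigr)}{\mathbb{P}\bigl(\bigwedge_{j \in S_1} \overline{A_j} \bigm| \bigwedge_{j \in S_2} \overline{A_j}\bigr)}. \tag{6.2} \]

For the RHS of (6.2), using that $A_i$ is independent of $\{A_j : j \in S_2\}$,
\[ \text{numerator} \le \mathbb{P}\Bigl(A_i \Bigm| \bigwedge_{j \in S_2} \overline{A_j}\Bigr) = \mathbb{P}(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j), \tag{6.3} \]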


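and, denoting the elements of $S_1$ by $S_1 = \{j_1, \dots, j_r\}$,
\begin{align*}
\text{denominator} &= \mathbb{P}\Bigl(\overline{A_{j_1}} \Bigm| \bigwedge_{j \in S_2} \overline{A_j}\Bigr)\, \mathbb{P}\Bigl(\overline{A_{j_2}} \Bigm| \overline{A_{j_1}} \bigwedge_{j \in S_2} \overline{A_j}\Bigr) \cdots \mathbb{P}\Bigl(\overline{A_{j_r}} \Bigm| \overline{A_{j_1}} \cdots \overline{A_{j_{r-1}}} \bigwedge_{j \in S_2} \overline{A_j}\Bigr) \\
&\ge (1 - x_{j_1}) \cdots (1 - x_{j_r}) \qquad \text{[by induction hypothesis]} \\
&\ge \prod_{j \in N(i)} (1 - x_j).
\end{align*}

Thus (6.2) $\le x_i$, thereby finishing the induction proof of (6.1). □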

Remark 6.1.11. We used the independence assumption only at step (6.3) of the proof.
Upon closer examination, we see that we only need to know correlation inequalities
of the form $\mathbb{P}\bigl(A_i \bigm| \bigwedge_{j \in S_2} \overline{A_j}\bigr) \le \mathbb{P}(A_i)$ for $S_2$ disjoint from $N(i) \cup \{i\}$, rather
than independence. This observation allows a strengthening of the local lemma, known as
a lopsided local lemma, that we will explore later in the chapter.

6.2 Coloring hypergraphs


Previously, in Theorem 1.3.1, we saw that every $k$-uniform hypergraph with fewer
than $2^{k-1}$ edges is 2-colorable. The next theorem gives a sufficient local condition for
2-colorability.

Theorem 6.2.1
A $k$-uniform hypergraph is 2-colorable if every edge intersects at most $e^{-1}2^{k-1} - 1$
other edges.

Proof. For each edge $f$, let $A_f$ be the event that $f$ is monochromatic. Then $\mathbb{P}(A_f) =
p := 2^{-k+1}$. Each $A_f$ is independent from all $A_{f'}$ where $f'$ is disjoint from $f$. Since
every edge intersects at most $d := e^{-1}2^{k-1} - 1$ other edges, and $e(d + 1)p \le 1$, the local
lemma implies that with positive probability, none of the events $A_f$ occur. □

Corollary 6.2.2
For 𝑘 ≥ 9, every 𝑘-uniform 𝑘-regular hypergraph is 2-colorable.
(Here 𝑘-regular means that every vertex lies in exactly 𝑘 edges.)

Proof. Every edge intersects $\le d = k(k - 1)$ other edges. And $e(k(k - 1) + 1)2^{-k+1} < 1$
for $k \ge 9$. □
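The threshold $k \ge 9$ comes from a direct evaluation of the symmetric local lemma condition $e\,p\,(d+1) \le 1$ with $p = 2^{-k+1}$ and $d = k(k-1)$. A short check, with helper names of our own:

```python
import math

def lll_symmetric_ok(p, d):
    """Symmetric local lemma condition e * p * (d + 1) <= 1."""
    return math.e * p * (d + 1) <= 1

def regular_uniform_ok(k):
    """Condition for k-uniform k-regular hypergraphs: p = 2^(1-k), d = k(k-1)."""
    return lll_symmetric_ok(2.0 ** (1 - k), k * (k - 1))

print([k for k in range(2, 20) if regular_uniform_ok(k)])
```

The condition first holds at $k = 9$ and keeps holding afterwards because $2^{-k+1}$ decays much faster than $k(k-1)+1$ grows.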


Remark 6.2.3. The statement is false for 𝑘 = 2 (triangle) and 𝑘 = 3 (Fano plane) but
actually true for all 𝑘 ≥ 4 (Thomassen 1992).

Here is an example where the symmetric form of the local lemma is insufficient (why?).

Theorem 6.2.4
Let $H$ be a (non-uniform) hypergraph where every edge has size $\ge 3$. Suppose
\[ \sum_{f \in E(H) \setminus \{e\} :\, e \cap f \ne \emptyset} 2^{-|f|} \le \frac{1}{8} \quad \text{for each edge } e. \]
Then $H$ is 2-colorable.

Proof. Consider a uniform random 2-coloring of the vertices. Let $A_e$ be the event that
edge $e$ is monochromatic. Then $\mathbb{P}(A_e) = 2^{-|e|+1} \le 1/4$ since $|e| \ge 3$. Also,
\[ \sum_{f \in E(H) \setminus \{e\} :\, e \cap f \ne \emptyset} \mathbb{P}(A_f) = \sum_{f \in E(H) \setminus \{e\} :\, e \cap f \ne \emptyset} 2^{-|f|+1} \le 1/4. \]

Thus by Corollary 6.1.10 one can avoid all events $A_e$, and hence $H$ is 2-colorable. □

Remark 6.2.5. A sign to look beyond the symmetric local lemma is when there are
bad events of very different nature (e.g., having very different probabilities).

Compactness argument
Now we highlight an important compactness argument that allows us to deduce the
existence of an infinite object, even though the local lemma itself is only applicable to
finite systems.

Theorem 6.2.6
Let $H$ be a (non-uniform) hypergraph on a possibly infinite vertex set, such that
each edge is finite, has at least $k$ vertices, and intersects at most $d$ other edges. If
$e2^{-k+1}(d + 1) \le 1$, then $H$ has a proper 2-coloring.

Proof. From a vanilla application of the symmetric local lemma, we deduce that for
any finite subset $X$ of vertices, there exists a 2-coloring of $X$ so that no edge contained
in $X$ is monochromatic (color each vertex iid uniformly, and consider the bad event $A_e$
that the edge $e \subseteq X$ is monochromatic).
Next we extend the coloring to the entire vertex set $V$ by a compactness argument. The
set of all colorings is $[2]^V$. By Tikhonov’s theorem (which says that a product of a possibly
infinite collection of compact topological spaces is compact), $[2]^V$ is compact under
the product topology.
For each finite subset $X$, let $C_X \subseteq [2]^V$ be the subset of colorings where no edge
contained in $X$ is monochromatic. Earlier from the local lemma we saw that $C_X \ne \emptyset$.
Each $C_X$ is closed, since membership in $C_X$ depends on only finitely many coordinates.
If $Y \subseteq X$, then $C_Y \supseteq C_X$. Thus

\[ C_{X_1} \cap \cdots \cap C_{X_\ell} \supseteq C_{X_1 \cup \cdots \cup X_\ell}, \]

so $\{C_X : |X| < \infty\}$ is a collection of closed subsets of $[2]^V$ with the finite intersection
property (i.e., the intersection of any finite subcollection is nonempty).
Recall from point-set topology the following basic fact (a defining property): a space
is compact if and only if every family of closed subsets having the finite intersection
property has non-empty intersection.
Hence by compactness of [2] 𝑉 , the intersection of 𝐶 𝑋 taken over all finite 𝑋 is
non-empty. Any element of this intersection corresponds to a valid coloring of the
hypergraph. □

More generally, the above compactness argument yields the following.

Lemma 6.2.7 (Compactness argument)


Consider a variation of the random variable model (Setup 6.1.5) where each variable
has only finitely many choices but there can be possibly infinitely many events (each
event depends on a finite subset of variables). If it is possible to avoid any finite subset
of events, then it is possible to avoid all the events. □

Remark 6.2.8. Note the conclusion may be false if we do not assume the random
variable model (why?).

The next application appears in the paper of Erdős and Lovász (1975) where the local
lemma originally appears.
Consider 𝑘-coloring the real numbers, i.e., a function 𝑐 : R → [𝑘]. We say that 𝑇 ⊆ R
is multicolored with respect to 𝑐 if all 𝑘 colors appear in 𝑇.

Question 6.2.9
For each 𝑘 is there an 𝑚 so that for every 𝑆 ⊆ R with |𝑆| = 𝑚, one can 𝑘-color R so
that every translate of 𝑆 is multicolored?

The following theorem shows that this can be done whenever 𝑚 > (3 + 𝜀)𝑘 log 𝑘 and
𝑘 > 𝑘 0 (𝜀) sufficiently large.


Theorem 6.2.10
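The answer to the above question is yes if
\[ e\,(m(m - 1) + 1)\,k \left(1 - \frac{1}{k}\right)^m \le 1. \]

Proof. Color each element of $\mathbb{R}$ independently and uniformly at random with one of the
$k$ colors. Each translate of $S$ is not multicolored with probability $p \le k(1 - 1/k)^m$ (by a
union bound over the missing color), and each translate of $S$ intersects at most $m(m - 1)$
other translates (since $S + a$ and $S + b$ intersect only if $a - b$ is a nonzero difference of
two elements of $S$). Fix a finite subset $X \subseteq \mathbb{R}$ and consider a bad event
for each translate of $S$ contained in $X$. The symmetric local lemma tells us that it is
possible to avoid any finite collection of bad events. By the compactness argument, it
is possible to avoid all the bad events. □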
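For a concrete sense of the condition in Theorem 6.2.10, the sketch below finds the smallest $m$ satisfying it for a few values of $k$. The function names are ours. For small $k$ the resulting $m$ is noticeably larger than $3k\log k$, which is consistent with the $(3+\varepsilon)k \log k$ statement being asymptotic in $k$.

```python
import math

def condition(k, m):
    """The hypothesis e * (m(m-1) + 1) * k * (1 - 1/k)^m <= 1 of Theorem 6.2.10."""
    return math.e * (m * (m - 1) + 1) * k * (1 - 1 / k) ** m <= 1

def smallest_m(k):
    """Smallest positive integer m for which the condition holds."""
    m = 1
    while not condition(k, m):
        m += 1
    return m

for k in (2, 5, 10):
    print(k, smallest_m(k), 3 * k * math.log(k))
```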

Coloring arithmetic progressions


Here is an application where we need to apply the asymmetric local lemma.

Theorem 6.2.11 (Beck 1980)


For every $\varepsilon > 0$, there exist $k_0$ and a 2-coloring of $\mathbb{Z}$ with no monochromatic $k$-term
arithmetic progressions with $k \ge k_0$ and common difference less than $2^{(1-\varepsilon)k}$.

Proof. We pick a uniform random color for each element of $\mathbb{Z}$. For each $k$-term
arithmetic progression in $\mathbb{Z}$ with $k \ge k_0$ and common difference less than $2^{(1-\varepsilon)k}$,
consider the event that this $k$-AP is monochromatic. By the compactness argument, it
suffices to check that we can avoid any finite subset of events.
The event that a particular $k$-AP is monochromatic has probability exactly $2^{-k+1}$.
(Since this number depends on $k$, we should use the asymmetric local lemma.)
Recall that in the asymmetric local lemma (Theorem 6.1.9), we need to select $x_i \in [0, 1)$
for each bad event $A_i$ so that
\[ \mathbb{P}(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j) \quad \text{for all } i \in [n]. \]

It is usually a good idea to select $x_i$ to be somewhat similar to $\mathbb{P}(A_i)$. In this case, if $A_i$
is the event corresponding to a $k$-AP, then we take
\[ x_i = 2^{-(1-\varepsilon/2)k} = \left(\frac{\mathbb{P}(A_i)}{2}\right)^{1-\varepsilon/2} \]
(with the same $\varepsilon$ as in the statement of the theorem).


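Fix a $k$-AP $P$ in $\mathbb{Z}$ with $k \ge k_0$. For each $\ell \ge k_0$, the number of $\ell$-APs with common
difference less than $2^{(1-\varepsilon)\ell}$ that intersect $P$ is at most $k\ell 2^{(1-\varepsilon)\ell}$ (a choice of the
common element in $P$, a choice of its position in the $\ell$-AP, and at most $2^{(1-\varepsilon)\ell}$ choices
for the common difference). So to apply the local lemma, it suffices to check that
\[ 2^{-\varepsilon k/2 + 1} \le \prod_{\ell \ge k_0} \left(1 - 2^{-(1-\varepsilon/2)\ell}\right)^{k\ell 2^{(1-\varepsilon)\ell}}. \]

Note that $1 - x \ge e^{-2x}$ for $x \in [0, 1/2]$. So
\[ \mathrm{RHS} \ge \exp\Bigl(-\sum_{\ell \ge k_0} 2^{1-(1-\varepsilon/2)\ell} \cdot k\ell 2^{(1-\varepsilon)\ell}\Bigr) = \exp\Bigl(-k \sum_{\ell \ge k_0} \ell 2^{1-\varepsilon\ell/2}\Bigr). \]

By making $k_0 = k_0(\varepsilon)$ large enough, we can ensure that $\sum_{\ell \ge k_0} \ell 2^{1-\varepsilon\ell/2} < \varepsilon/4$, and so,
continuing,
\[ \cdots \ge e^{-\varepsilon k/4} \ge 2^{-\varepsilon k/2 + 1} \]
provided that $k \ge k_0(\varepsilon)$. So we can apply the local lemma to conclude. □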

Decomposing coverings
We say that a collection of balls in $\mathbb{R}^d$ is a covering if their union is $\mathbb{R}^d$. We say that
it is a 𝒌-fold covering if every point of $\mathbb{R}^d$ is contained in at least $k$ balls (so a 1-fold
covering is the same as a covering).
We say that a $k$-fold covering is decomposable if it can be partitioned into two coverings.
In $\mathbb{R}^d$, is every $k$-fold covering by unit balls decomposable if $k$ is sufficiently large?
A fun exercise: in $\mathbb{R}^1$, every $k$-fold covering by intervals can be partitioned into $k$
coverings.
Mani-Levitska and Pach (1986) showed that every 33-fold covering of $\mathbb{R}^2$ is decomposable.
What about higher dimensions?
Surprisingly, they also showed that for every $k$, there exists a $k$-fold indecomposable
covering of $\mathbb{R}^3$ (and similarly for $\mathbb{R}^d$ for $d \ge 3$).
However, it turns out that indecomposable coverings must cover the space quite unevenly:


Theorem 6.2.12 (Mani-Levitska and Pach 1986)


Every $k$-fold nondecomposable covering of $\mathbb{R}^3$ by open unit balls must cover some
point $\gtrsim 2^{k/3}$ times.

Remark 6.2.13. In $\mathbb{R}^d$, the same proof gives $\ge c_d 2^{k/d}$.

We will need the following combinatorial geometric fact:

Lemma 6.2.14
A set of $n \ge 2$ spheres in $\mathbb{R}^3$ cuts $\mathbb{R}^3$ into at most $n^3$ connected components.

Proof. Let us first consider the problem in one dimension lower. Let $f(m)$ be the
maximum number of connected regions that $m$ circles on a sphere in $\mathbb{R}^3$ can cut the
sphere into.
We have $f(m + 1) \le f(m) + 2m$ for all $m \ge 1$ since adding a new circle to a set of $m$
circles creates at most $2m$ intersection points, so that the new circle is divided into at
most $2m$ arcs, and hence its addition creates at most $2m$ new regions.
Combined with $f(1) = 2$, we deduce $f(m) \le m(m - 1) + 2$ for all $m \ge 1$.
Now let $g(m)$ be the maximum number of connected regions that $m$ spheres in $\mathbb{R}^3$ can
cut $\mathbb{R}^3$ into. We have $g(1) = 2$, and $g(m + 1) \le g(m) + f(m)$ by a similar
argument as earlier. So $g(m) \le f(m - 1) + f(m - 2) + \cdots + f(1) + g(1) \le m^3$ for all $m \ge 2$. □
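The final inequality can be checked mechanically by unrolling the two recurrences, taking them with equality as worst-case upper bounds and starting the second one from $g(1) = 2$. This is only an arithmetic check of the bounds, not a computation of the true maxima; the function names are ours.

```python
def circle_regions_bound(m):
    """Upper bound on f(m) from f(1) = 2 and f(j + 1) <= f(j) + 2j."""
    f = 2
    for j in range(1, m):
        f += 2 * j
    return f

def sphere_regions_bound(m):
    """Upper bound on g(m) from g(1) = 2 and g(j + 1) <= g(j) + f(j)."""
    g = 2
    for j in range(1, m):
        g += circle_regions_bound(j)
    return g

for m in range(1, 8):
    print(m, circle_regions_bound(m), sphere_regions_bound(m), m ** 3)
```

For $m = 1$ the bound $m^3$ fails (one sphere already creates two regions), which is why the lemma assumes at least two spheres.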

Proof of Theorem 6.2.12. Let $F$ be a nondecomposable $k$-fold covering of $\mathbb{R}^3$ by open unit
balls. Suppose for contradiction that every point in $\mathbb{R}^3$ is covered by at most $t \le c2^{k/3}$
unit balls from $F$ (for some sufficiently small $c$ that we will pick later).
Construct an infinite hypergraph $H$ with vertex set being the set of balls and edges
having the form $E_x = \{\text{balls containing } x\}$ for some $x \in \mathbb{R}^3$. Note that $|E_x| \ge k$ since
we have a $k$-fold covering.
Also, note that if $x, y \in \mathbb{R}^3$ lie in the same connected component in the complement of
the union of all the unit spheres, then $E_x = E_y$ (i.e., the same edge).
Claim: every edge of $H$ intersects at most $d = O(t^3)$ other edges.
Proof of claim: Let $x \in \mathbb{R}^3$. If $E_x \cap E_y \ne \emptyset$, then $|x - y| \le 2$, so all the balls in
$E_y$ lie in the radius-4 ball centered at $x$. The volume of the radius-4 ball is $4^3$ times
that of the unit ball. Since every point lies in at most $t$ balls, there are at most $4^3 t$ balls
appearing among those $E_y$ intersecting $E_x$, and the boundary spheres of these balls cut
the radius-2 ball centered at $x$ into $O(t^3)$ connected regions by the earlier lemma, and two
different $y$’s in the same region produce the same $E_y$. So $E_x$ intersects $O(t^3)$ other edges. ■


With $t \le c2^{k/3}$ and $c$ sufficiently small, and knowing $d = O(t^3)$ from the claim, we
have $e2^{-k+1}(d + 1) \le 1$. It then follows by Theorem 6.2.6 (local lemma + compactness
argument) that this hypergraph is 2-colorable, which corresponds to a decomposition
of the covering, a contradiction. □

6.3 Independent transversal


The application of the local lemma in this section is instructive in that it is not obvious
at first what to choose as bad events (even if you are already told to apply the local
lemma). It is worth trying different possibilities.
Every graph with maximum degree $\Delta$ contains an independent set of size $\ge |V|/(\Delta + 1)$
(choose the independent set greedily). The following theorem shows that by decreasing
the desired size of the independent set by a constant factor, we can guarantee an
independent set that is also a transversal to a vertex set partition.

Theorem 6.3.1
Let 𝐺 = (𝑉, 𝐸) be a graph with maximum degree Δ and let 𝑉 = 𝑉1 ∪ · · · ∪ 𝑉𝑟 be a
partition, where each |𝑉𝑖 | ≥ 2𝑒Δ. Then there is an independent set in 𝐺 containing
one vertex from each 𝑉𝑖 .

Proof. The first step in the proof is simple yet subtle: we may assume that |𝑉𝑖 | = 𝑘 :=
⌈2𝑒Δ⌉ for each 𝑖, or else we can remove some vertices from 𝑉𝑖 (if we do not trim the
vertex sets now, we will run into difficulties later).
Pick 𝑣 𝑖 ∈ 𝑉𝑖 uniformly at random, independently for each 𝑖.
This is an instance of the random variable model (Setup 6.1.5), where 𝑣 1 , . . . , 𝑣 𝑟 are
the random variables.
We would like to design a collection of “bad events” so that if we avoid all of them,
then {𝑣_1, . . . , 𝑣_𝑟} is guaranteed to be an independent set.
What do we choose as bad events? It turns out that some choices work better than
others.
Attempt 1:
For each 1 ≤ 𝑖 < 𝑗 ≤ 𝑟 where there exists an edge between 𝑉𝑖 and 𝑉 𝑗 , let 𝐴𝑖, 𝑗 be the
event that 𝑣 𝑖 is adjacent to 𝑣 𝑗 .
We find that P( 𝐴𝑖, 𝑗 ) ≤ Δ/𝑘.
The canonical dependency graph has 𝐴_{𝑖,𝑗} ∼ 𝐴_{𝑖′,𝑗′} if and only if the two sets {𝑖, 𝑗} and
{𝑖′, 𝑗′} intersect. This dependency graph has max degree ≤ 2Δ𝑘 (starting from (𝑖, 𝑗),
look at the neighbors of all vertices in 𝑉_𝑖 ∪ 𝑉_𝑗). The max degree is too large compared
to the bad event probabilities: with these bounds, the symmetric local lemma would need
𝑒(Δ/𝑘)(2Δ𝑘 + 1) ≤ 1, but the left-hand side exceeds 2𝑒Δ^2.
Attempt 2:
For each edge 𝑒 ∈ 𝐸, let 𝐴_𝑒 be the event that both endpoints of 𝑒 are picked.
We have P(𝐴_𝑒) ≤ 1/𝑘^2 (edges inside a single 𝑉_𝑖 can never have both endpoints picked).
The canonical dependency graph has 𝐴_𝑒 ∼ 𝐴_𝑓 if some 𝑉_𝑖 intersects both 𝑒 and 𝑓.
This dependency graph has max degree ≤ 2𝑘Δ − 2 (if 𝑒 is between 𝑉_𝑖 and 𝑉_𝑗, then 𝑓 must
be incident to 𝑉_𝑖 ∪ 𝑉_𝑗; the degrees of the 2𝑘 vertices in 𝑉_𝑖 ∪ 𝑉_𝑗 sum to at most 2𝑘Δ,
and this sum counts 𝑒 itself twice).
We have 𝑒(1/𝑘^2)(2𝑘Δ − 1) < 2𝑒Δ/𝑘 ≤ 1, so the local lemma implies that with positive
probability no bad event occurs, in which case {𝑣_1, . . . , 𝑣_𝑟} is an independent set. □
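
As a numerical sanity check of the two attempts, the following sketch evaluates the symmetric local lemma condition 𝑒𝑝(𝑑 + 1) ≤ 1 for several values of Δ. The function names are ours, and for Attempt 2 we assume the degree bound 2𝑘Δ − 2 (the edge 𝑒 is counted twice among the edge-endpoints in 𝑉𝑖 ∪ 𝑉𝑗); treat it as an illustration rather than part of the proof.

```python
import math

def symmetric_lll_holds(p, d):
    """Symmetric local lemma condition: e * p * (d + 1) <= 1."""
    return math.e * p * (d + 1) <= 1

def check_attempts(max_degree):
    """Return (k, attempt 1 condition holds?, attempt 2 condition holds?)."""
    k = math.ceil(2 * math.e * max_degree)  # part size after trimming
    # Attempt 1: one bad event per pair of parts, p <= Delta/k, d <= 2*Delta*k.
    attempt1 = symmetric_lll_holds(max_degree / k, 2 * max_degree * k)
    # Attempt 2: one bad event per edge, p <= 1/k^2.
    # Assumption: degree bound 2*k*Delta - 2 in the dependency graph.
    attempt2 = symmetric_lll_holds(1 / k**2, 2 * k * max_degree - 2)
    return k, attempt1, attempt2

for delta in (1, 2, 5, 50):
    print(delta, check_attempts(delta))
```

For every Δ tried, the condition fails for Attempt 1 and holds for Attempt 2, which matches the discussion in the proof.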

Remark 6.3.2. Alon (1988) introduced the above result as a lemma in his near resolution
of the still-open linear arboricity conjecture (see the Alon–Spencer textbook §5.5).
Alon’s approach makes heavy use of the local lemma.
Haxell (1995, 2001) relaxed the hypothesis to |𝑉𝑖 | ≥ 2Δ for each 𝑖. The statement
becomes false if 2Δ is replaced by 2Δ − 1 (Szabó and Tardos 2006).

6.4 Directed cycles of length divisible by 𝑘


A directed graph is 𝒅-regular if every vertex has in-degree 𝑑 and out-degree 𝑑.

Theorem 6.4.1 (Alon and Linial 1989)


For every 𝑘 there exists 𝑑 so that every 𝑑-regular directed graph has a directed cycle
of length divisible by 𝑘.

Corollary 6.4.2
For every 𝑘 there exists 𝑑 so that every 2𝑑-regular graph has a cycle of length divisible
by 𝑘.

Proof. Every 2𝑑-regular graph can be made into a 𝑑-regular digraph by orienting its
edges according to an Eulerian tour. And then we can apply the previous theorem. □

We will prove the following more general statement.

Theorem 6.4.3 (Alon and Linial 1989)


Every directed graph with min out-degree 𝛿 and max in-degree Δ contains a cycle of
length divisible by 𝑘 ∈ N as long as

\[ k \le \frac{\delta}{1 + \log(1 + \delta\Delta)}. \]

Proof. By deleting edges, we may assume that every vertex has out-degree exactly 𝛿.

Assign every vertex 𝑣 an element 𝑥 𝑣 ∈ Z/𝑘Z iid uniformly at random.


We will look for directed cycles where the labels increase by 1 (mod 𝑘) at each step.
These cycles all have length divisible by 𝑘.
For each vertex 𝑣, let 𝐴_𝑣 be the event that there is nowhere to go from 𝑣 (i.e., no
out-neighbor of 𝑣 is labeled 𝑥_𝑣 + 1 (mod 𝑘)). We have

\[ \mathbb{P}(A_v) = (1 - 1/k)^{\delta} \le e^{-\delta/k}. \]

Since 𝐴_𝑣 depends only on {𝑥_𝑤 : 𝑤 ∈ {𝑣} ∪ 𝑁^+(𝑣)}, where 𝑁^+(𝑣) denotes the
out-neighbors of 𝑣 and 𝑁^−(𝑣) the in-neighbors of 𝑣, the canonical dependency graph
has
𝐴_𝑣 ∼ 𝐴_𝑤 if {𝑣} ∪ 𝑁^+(𝑣) intersects {𝑤} ∪ 𝑁^+(𝑤).
The maximum degree in the dependency graph is at most Δ + 𝛿Δ (starting from 𝑣, there
are
(1) at most Δ choices stepping backward
(2) 𝛿 choices stepping forward, and
(3) at most 𝛿(Δ − 1) choices stepping forward and then backward to land somewhere
other than 𝑣).
So an application of the local lemma shows that we are done as long as
𝑒^{1−𝛿/𝑘}(1 + Δ + 𝛿Δ) ≤ 1, i.e.,
𝑘 ≤ 𝛿/(1 + log(1 + Δ + 𝛿Δ)).
This is almost, but not quite the result (though, for most applications, we would be
perfectly happy with such a bound).
The final trick is to notice that we actually have an even smaller valid dependency
digraph:
𝐴𝑣 is independent of all 𝐴𝑤 where 𝑁 + (𝑣) is disjoint from 𝑁 + (𝑤) ∪ {𝑤}.

Indeed, even if we fix the labels of all vertices outside 𝑁^+(𝑣), the conditional
probability of 𝐴_𝑣 is still (1 − 1/𝑘)^𝛿.
The number of 𝑤 such that 𝑁^+(𝑣) intersects 𝑁^+(𝑤) ∪ {𝑤} is at most 𝛿Δ (we no longer
need to consider (1) in the previous count). Writing 𝑝 = P(𝐴_𝑣), we have

\[ e p (\delta\Delta + 1) \le e^{1 - \delta/k} (\delta\Delta + 1) \le 1. \]

So we are done by the local lemma. □
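
The step from "no bad event occurs" to an actual cycle can be sketched in code. In the sketch below (function and variable names are ours), labels are simply resampled from scratch until no 𝐴𝑣 occurs, which is plain rejection sampling standing in for the existence argument and not an efficient method. Once every vertex has an out-neighbor whose label is one larger (mod 𝑘), we follow such out-neighbors until a vertex repeats; the labels increase by 1 along the resulting cycle, so its length is divisible by 𝑘.

```python
import random

def find_divisible_cycle(out_neighbors, k, rng, max_tries=10_000):
    """Return a directed cycle (list of vertices) of length divisible by k, or None.

    out_neighbors maps each vertex to the list of its out-neighbors.
    max_tries is a hypothetical safeguard, not part of the proof.
    """
    vertices = list(out_neighbors)
    for _ in range(max_tries):
        label = {v: rng.randrange(k) for v in vertices}
        step = {}  # step[v] = an out-neighbor labeled label[v] + 1 (mod k)
        for v in vertices:
            target = (label[v] + 1) % k
            candidates = [w for w in out_neighbors[v] if label[w] == target]
            if not candidates:
                break  # bad event A_v occurred; resample all labels
            step[v] = candidates[0]
        else:
            # No bad event: walk along step[] until a vertex repeats.
            seen, walk, v = {}, [], vertices[0]
            while v not in seen:
                seen[v] = len(walk)
                walk.append(v)
                v = step[v]
            return walk[seen[v]:]
    return None

# Example: the complete digraph on 9 vertices, looking for length divisible by 3.
graph = {v: [w for w in range(9) if w != v] for v in range(9)}
cycle = find_divisible_cycle(graph, 3, random.Random(0))
print(cycle)
```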

6.5 Lopsided local lemma


Let us move beyond the random variable model, and consider a collection of bad
events in the general setup of the local lemma. Instead of requiring that each event is
independent of its non-neighbors (in the dependency graph), what if we assume that
avoiding some bad events makes it easier to avoid some others? Intuitively, it seems
that it would only make it easier to avoid bad events.
We can make this notion precise by re-examining the proof of the local lemma. Where
did we actually use the independence assumption in the hypothesis of the local lemma?
It was in the following step, Equation (6.3):

\[ \text{numerator} \le \mathbb{P}\Bigl(A_i \Bigm| \bigwedge_{j \in S_2} \overline{A_j}\Bigr) = \mathbb{P}(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j). \]

If we had changed the middle = to ≤, the whole proof would remain valid. This
observation allows us to weaken the independence assumption. Therefore we have the
following theorem, which was used by Erdős and Spencer (1991) to give an application
to Latin transversals that we will see shortly.

Theorem 6.5.1 (Lopsided local lemma)


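Let 𝐴_1, . . . , 𝐴_𝑛 be events. For each 𝑖, let 𝑁(𝑖) ⊆ [𝑛] be such that

\[ \mathbb{P}\Bigl(A_i \Bigm| \bigwedge_{j \in S} \overline{A_j}\Bigr) \le \mathbb{P}(A_i) \quad \text{for all } i \in [n] \text{ and } S \subseteq [n] \setminus (N(i) \cup \{i\}). \tag{6.1} \]

Suppose there exist 𝑥_1, . . . , 𝑥_𝑛 ∈ [0, 1) such that

\[ \mathbb{P}(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j) \quad \text{for all } i \in [n]. \]

Then

\[ \mathbb{P}(\text{none of the events } A_i \text{ occur}) \ge \prod_{i=1}^{n} (1 - x_i). \]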

Like earlier, by setting 𝑥𝑖 = 1/(𝑑 + 1), we deduce a symmetric version that is easier to
apply.

Corollary 6.5.2 (Lopsided local lemma; symmetric version)


In the previous theorem, if |𝑁 (𝑖)| ≤ 𝑑 and P( 𝐴𝑖 ) ≤ 𝑝 for every 𝑖 ∈ [𝑛], and 𝑒 𝑝(𝑑 +1) ≤
1, then with positive probability none of the events 𝐴𝑖 occur.

The (di)graph where 𝑁 (𝑖) is the set of (out-)neighbors of 𝑖 is called a negative depen-
dency (di)graph.

Remark 6.5.3 (Important!). Just as with the usual local lemma, the negative depen-
dency graph is not constructed by simply checking pairs of events.

The hypothesis of Theorem 6.5.1 seems annoying to check. Fortunately, many applications
of the lopsided local lemma fall within a model that we will soon describe, where
there is a canonical negative dependency graph that is straightforward to construct.
This is analogous to the random variable model for the usual local lemma, where the
canonical dependency graph has two events adjacent if they share variables.

Random injection model


We describe a random injection model where there is an easy-to-construct canonical
negative dependency graph (Lu and Székely 2007).
Recall that a matching in a graph is a subset of edges with no two sharing a vertex.
In a bipartite graph with vertex parts 𝑋 and 𝑌 , a complete matching from 𝑋 to 𝑌 is a
matching where every vertex of 𝑋 belongs to an edge of the matching.

Setup 6.5.4 (Random injection model)


Let 𝑋 and 𝑌 be finite sets with |𝑋 | ≤ |𝑌 |.
Let 𝑓 : 𝑋 → 𝑌 be an injection chosen uniformly at random. We can also represent 𝑓
by a complete matching 𝑀 from 𝑋 to 𝑌 in 𝐾 𝑋,𝑌 (the complete bipartite graph between
𝑋 and 𝑌 ). We will speak interchangeably of the injection 𝑓 and matching 𝑀.
For a given matching 𝐹 (not necessarily complete) in 𝐾 𝑋,𝑌 , let 𝐴𝐹 denote the event
that 𝐹 ⊆ 𝑀.
Let 𝐹_1, . . . , 𝐹_𝑛 be matchings in 𝐾_{𝑋,𝑌}. The canonical negative dependency graph for
the events 𝐴_{𝐹_1}, . . . , 𝐴_{𝐹_𝑛} has one vertex for each event, and an edge between the events
𝐴_{𝐹_𝑖} and 𝐴_{𝐹_𝑗} (𝑖 ≠ 𝑗) if 𝐹_𝑖 and 𝐹_𝑗 are not vertex disjoint.

The following result shows that the above canonical negative dependency graph is
valid for the lopsided local lemma (Theorem 6.5.1).

Theorem 6.5.5 (Nonnegative dependence for random injections)


In Setup 6.5.4, let 𝐹_0 be a matching in 𝐾_{𝑋,𝑌} such that 𝐹_0 is vertex disjoint from
𝐹_1 ∪ · · · ∪ 𝐹_𝑘. Then

\[ \mathbb{P}\bigl(A_{F_0} \bigm| \overline{A_{F_1}} \cdots \overline{A_{F_k}}\bigr) \le \mathbb{P}(A_{F_0}). \]

Proof. Let 𝑋0 ⊆ 𝑋 and 𝑌0 ⊆ 𝑌 be the set of endpoints of 𝐹0 .

For each matching 𝑇 in 𝐾 𝑋,𝑌 , let

M𝑇 = {complete matchings from 𝑋 to 𝑌 containing 𝑇 but not containing any of 𝐹1 , . . . , 𝐹𝑘 } .

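For the desired inequality, note that

\[ \mathrm{LHS} = \mathbb{P}\bigl(A_{F_0} \bigm| \overline{A_{F_1}} \cdots \overline{A_{F_k}}\bigr) = \frac{|\mathcal{M}_{F_0}|}{|\mathcal{M}_{\emptyset}|} = \frac{|\mathcal{M}_{F_0}|}{\sum_{T \colon X_0 \hookrightarrow Y} |\mathcal{M}_T|}, \]

where the sum is taken over all |𝑌|(|𝑌| − 1) · · · (|𝑌| − |𝑋_0| + 1) complete matchings 𝑇
from 𝑋_0 to 𝑌 (which we denote by 𝑇 : 𝑋_0 ↩→ 𝑌), and

\[ \mathrm{RHS} = \mathbb{P}(A_{F_0}) = \frac{1}{|\{T \colon X_0 \hookrightarrow Y\}|}. \]

Thus to show that LHS ≤ RHS, it suffices to prove

\[ |\mathcal{M}_{F_0}| \le |\mathcal{M}_T| \quad \text{for every } T \colon X_0 \hookrightarrow Y. \]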

It suffices to construct an injection M_{𝐹_0} ↩→ M_𝑇. Let 𝑌_1 be the set of endpoints of 𝑇
in 𝑌. Fix a permutation 𝜎 of 𝑌 such that
• 𝜎 fixes all elements of 𝑌 outside 𝑌_0 ∪ 𝑌_1; and
• 𝜎 sends 𝐹_0 to 𝑇.
Then 𝜎 induces a permutation on the set of complete matchings from 𝑋 to 𝑌. It
remains to show that if we start with a matching in M_{𝐹_0}, so that it avoids 𝐹_𝑖 for all
𝑖 ≥ 1, then it is sent to a matching that also avoids 𝐹_𝑖 for all 𝑖 ≥ 1 (and hence lies in
M_𝑇). Indeed, this follows from the hypothesis that none of the edges in 𝐹_𝑖 use any
vertex from 𝑋_0 or 𝑌_0: if the image matching contained some 𝐹_𝑖, then the edges of 𝐹_𝑖
would avoid 𝑋_0 and hence also 𝑌_1 (which the image matches to 𝑋_0), so their endpoints
in 𝑌 would be fixed by 𝜎, and 𝐹_𝑖 would already be contained in the original matching. □
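
The inequality of Theorem 6.5.5 can be checked by brute force on a small instance. The sketch below (all names and the specific matchings are our own illustrative choices) enumerates every injection from a 3-element set into a 4-element set and compares the conditional and unconditional probabilities of 𝐴_{𝐹0} using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations

def contains(f, matching):
    """Does the injection f (f[x] = image of x) contain every edge of matching?"""
    return all(f[x] == y for x, y in matching)

def conditional_and_plain(x_size, y_size, f0, others):
    """Return (P(A_F0 | no A_Fi occurs), P(A_F0)) by enumerating all injections."""
    injections = list(permutations(range(y_size), x_size))
    avoiding = [f for f in injections if not any(contains(f, m) for m in others)]
    conditional = Fraction(sum(contains(f, f0) for f in avoiding), len(avoiding))
    plain = Fraction(sum(contains(f, f0) for f in injections), len(injections))
    return conditional, plain

# F0 uses x = 0 and y = 0; the other matchings avoid both, as the theorem requires.
f0 = [(0, 0)]
others = [[(1, 1)], [(2, 2)], [(1, 3), (2, 1)]]
conditional, plain = conditional_and_plain(3, 4, f0, others)
print(conditional, plain)
```

In this instance the conditional probability is strictly smaller than the unconditional one, consistent with the theorem.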

As an example, here is a quick application.

Corollary 6.5.6 (Derangement lower bound)


The probability that a uniform random permutation of [𝑛] has no fixed points is at
least (1 − 1/𝑛) 𝑛 .

Proof. In the random injection model, let 𝑋 = 𝑌 = [𝑛]. Let 𝑓 : 𝑋 → 𝑌 be a uniform
random permutation. For each 𝑖 ∈ [𝑛], let 𝐹_𝑖 be the single edge (𝑖, 𝑖), i.e., 𝐴_{𝐹_𝑖} is the
event that 𝑓(𝑖) = 𝑖. Note that the canonical negative dependency graph is empty since
no two 𝐹_𝑖’s share a vertex. Since P(𝐴_{𝐹_𝑖}) = 1/𝑛, we can set 𝑥_𝑖 = 1/𝑛 for each 𝑖
in the lopsided local lemma to obtain the conclusion

\[ \mathbb{P}(f \text{ has no fixed points}) = \mathbb{P}\bigl(\overline{A_{F_1}} \cdots \overline{A_{F_n}}\bigr) \ge \Bigl(1 - \frac{1}{n}\Bigr)^{n}. \qquad \square \]

Remark 6.5.7. A fixed-point free permutation is called a derangement. Using
inclusion-exclusion, one can deduce an exact answer to the above question: Σ_{𝑖=0}^{𝑛} (−1)^𝑖/𝑖!.
This quantity converges to 1/𝑒 as 𝑛 → ∞, and the above lower bound (1 − 1/𝑛)^𝑛 also
converges to 1/𝑒 and so is asymptotically optimal.
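
For small 𝑛 the comparison can be verified exactly. The following sketch (function names are ours) computes the inclusion-exclusion sum, cross-checks it against a brute-force count over all permutations, and confirms the lower bound from Corollary 6.5.6.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def derangement_probability(n):
    """Exact value of sum_{i=0}^{n} (-1)^i / i!."""
    return sum(Fraction((-1) ** i, factorial(i)) for i in range(n + 1))

def derangement_probability_brute_force(n):
    """Fraction of permutations of {0, ..., n-1} with no fixed point."""
    perms = list(permutations(range(n)))
    good = sum(all(p[i] != i for i in range(n)) for p in perms)
    return Fraction(good, len(perms))

for n in range(1, 8):
    exact = derangement_probability(n)
    lower_bound = Fraction(n - 1, n) ** n
    print(n, exact, lower_bound, float(exact - lower_bound))
```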

Latin transversal
A Latin square of order 𝑛 is an 𝑛 × 𝑛 array filled with 𝑛 symbols so that every symbol
appears exactly once in every row and column. Example:
1 2 3
2 3 1
3 1 2
These objects are called Latin squares because they were studied by Euler (1707–1783)
who used Latin symbols to fill the arrays.

Given an 𝑛 × 𝑛 array, a transversal is a set of 𝑛 entries with one in every row and
column. A Latin transversal is a transversal with distinct entries. Example:
1 2 3
2 3 1
3 1 2
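
For small arrays, Latin transversals can be listed by brute force. The sketch below (the function name is ours) tries all 𝑛! transversals; it is meant only to make the definitions concrete, and it also hints at why parity matters for the conjectures that follow.

```python
from itertools import permutations

def latin_transversals(array):
    """Yield each Latin transversal as a tuple cols, meaning the entries
    (row i, column cols[i]); brute force over all n! transversals."""
    n = len(array)
    for cols in permutations(range(n)):
        entries = [array[i][cols[i]] for i in range(n)]
        if len(set(entries)) == n:  # all entries distinct
            yield cols

cyclic3 = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
cyclic2 = [[1, 2], [2, 1]]
print(list(latin_transversals(cyclic3)))
print(list(latin_transversals(cyclic2)))
```

The order-3 square above has Latin transversals, while the order-2 cyclic square has none.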
Here is a famous open conjecture about Latin transversals.¹ (Do you see why the
hypothesis on parity is necessary?)

Conjecture 6.5.8 (Ryser 1967)


Every odd order Latin square has a Latin transversal.

The conjecture should be modified for even order Latin squares.

Conjecture 6.5.9 (Ryser-Brualdi-Stein conjecture)


Every even order Latin square has a transversal containing all but at most one symbol.

Remark 6.5.10. Keevash, Pokrovskiy, Sudakov and Yepremyan (2022) proved that
every order 𝑛 Latin square contains a transversal containing all but 𝑂 (log 𝑛/log log 𝑛)
symbols, improving an earlier bound of 𝑂(log^2 𝑛) by Hatami and Shor (2008).
Recently, Montgomery announced a proof of the conjecture for all sufficiently large
even 𝑛. The proof uses sophisticated techniques combining the semi-random method
and the absorption method.

The next result is the original application of the lopsided local lemma.

Theorem 6.5.11 (Erdős and Spencer 1991)


Every 𝑛×𝑛 array where every entry appears at most 𝑛/(4𝑒) times has a Latin transversal.

Proof. Pick a transversal uniformly at random. This is the same as picking a permuta-
tion 𝑓 : [𝑛] → [𝑛] uniformly at random. In Setup 6.5.4, the random injection model,
transversals correspond to perfect matchings.
For each pair of equal entries in the array not both lying in the same row or column,
consider the bad event that the transversal contains both entries.
The canonical negative dependency graph is obtained by joining an edge between two
bad events if the four entries involved share some row or column.
¹ Not to be confused with another conjecture also known as Ryser’s conjecture concerning an inequality
between the covering number and the matching number of multipartite hypergraphs, as a generalization
of König’s theorem. See Best and Wanless (2018) for a historical commentary and a translation
of Ryser’s 1967 paper.

Let us count neighbors in this negative dependency graph. Fix a pair of equal entries
in the array. Their rows and columns span fewer than 4𝑛 entries, and for each such
entry 𝑧, there are at most 𝑛/(4𝑒) − 1 choices for another entry equal to 𝑧. Thus the
maximum degree in the canonical negative dependency graph is

\[ \le (4n - 4)\Bigl(\frac{n}{4e} - 1\Bigr) \le \frac{n(n-1)}{e} - 1. \]
We can now apply the symmetric lopsided local lemma to conclude that with positive
probability, none of the bad events occur. □
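
To spell out the final step (the value of 𝑝 is not stated explicitly in the proof, so this is our reading of it): a uniformly random permutation contains two given entries lying in different rows and different columns with probability 1/(𝑛(𝑛 − 1)). With 𝑝 and 𝑑 as in Corollary 6.5.2,

\[ p = \frac{1}{n(n-1)}, \qquad d \le \frac{n(n-1)}{e} - 1 \quad\Longrightarrow\quad e\,p\,(d+1) \le e \cdot \frac{1}{n(n-1)} \cdot \frac{n(n-1)}{e} = 1. \]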

6.6 Algorithmic local lemma


Consider an instance of a problem in the random variable setting (e.g., 𝑘-CNF) for
which the local lemma guarantees a solution. Can one find a satisfying assignment
efficiently?
The local lemma tells you that some good configuration exists, but the proof is non-
constructive. The probability that a random sample avoids all the bad events is often
very small (usually exponentially small, e.g., in the case of a set of independent bad
events). It had been an open problem for a long time whether the local lemma can be
made algorithmic.
Moser (2009), during his PhD, achieved a breakthrough by coming up with the first
efficient algorithmic version of the local lemma for finding a satisfying assignment
for 𝑘-CNF formulas. Moser and Tardos (2010) later extended the algorithm for the
general local lemma in the random variable model.

Remark 6.6.1 (Too hard in general). The Moser–Tardos algorithm works in the random
variable model (there is subsequent work concerning other models such as
the random injection model). Some assumption on the model is necessary since the
problem can be computationally hard in general.
For example, let 𝑞 = 2^𝑘, and 𝑓 : [𝑞] → [𝑞] be some fixed bijection (with an explicit
description and easy to compute). Consider the computational task of inverting 𝑓:
given 𝑦 ∈ [𝑞], find 𝑥 such that 𝑓(𝑥) = 𝑦 (we would like an algorithm with running
time polynomial in 𝑘).
If 𝑥 ∈ [𝑞] is chosen uniformly, then 𝑓(𝑥) ∈ [𝑞] is also uniform. For each 𝑖 ∈ [𝑘], let
𝐴_𝑖 be the event that 𝑓(𝑥) and 𝑦 disagree on the 𝑖-th bit. Then 𝐴_1, . . . , 𝐴_𝑘 are independent
events. Also, 𝑓 (𝑥) = 𝑦 if and only if no event 𝐴𝑖 occurs. So a trivial version of the
local lemma (with empty dependency graph) implies the existence of some 𝑥 such that
𝑓 (𝑥) = 𝑦.

On the other hand, it is believed that there exist functions 𝑓 that are easy to compute but
hard to invert. Such functions are called one-way functions, and they are a fundamental
building block in cryptography. For example, let 𝑔 be a multiplicative generator of F_𝑞,
and let 𝑓 : F_𝑞 → F_𝑞 be given by 𝑓(0) = 0 and 𝑓(𝑥) = 𝑔^𝑥 for 𝑥 ≠ 0. Then inverting
𝑓 is the discrete logarithm problem, which is believed to be computationally difficult.
The computational difficulty of this problem is the basis for the security of important
public key cryptography schemes, such as the Diffie–Hellman key exchange.

Moser–Tardos algorithm
The algorithm is surprisingly simple.

Algorithm 6.6.2 (Moser–Tardos “fix-it”)


input : a set of variables and events in the random variable model
output : an assignment of variables avoiding all bad events
Initialize by setting all variables to arbitrary values;
while there is some violated event do
    Pick an arbitrary violated event and uniformly resample its variables;

(We can make the algorithm more precise by specifying a way to pick an “arbitrary” choice, e.g., the
lexicographically first choice.)
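
Here is a minimal runnable sketch of the fix-it loop. The representation of events as (variable indices, predicate) pairs, the function names, and the cap on the number of rounds are our own assumptions; the initial assignment is drawn at random, which is one particular "arbitrary" choice. The example instance asks for a 2-coloring of 12 points on a cycle with no monochromatic window of 6 consecutive points, so 𝑝 = 2^{−5} and each window meets 10 others, and the symmetric local lemma condition holds.

```python
import random

def moser_tardos(num_vars, events, sample, rng, max_rounds=100_000):
    """Fix-it loop in the random variable model.

    events: list of (variable_indices, occurs) pairs, where occurs(values)
            says whether the bad event holds for those variables' values.
    sample: function rng -> one fresh value for a variable.
    Returns an assignment avoiding all events, or None if the (hypothetical)
    round cap is reached.
    """
    assignment = [sample(rng) for _ in range(num_vars)]
    for _ in range(max_rounds):
        # Pick the first violated event in the given order.
        violated = next((vs for vs, occurs in events
                         if occurs([assignment[v] for v in vs])), None)
        if violated is None:
            return assignment
        for v in violated:  # resample only the variables of that event
            assignment[v] = sample(rng)
    return None

windows = [[(i + j) % 12 for j in range(6)] for i in range(12)]
events = [(w, lambda values: len(set(values)) == 1) for w in windows]
coloring = moser_tardos(12, events, lambda r: r.randrange(2), random.Random(1))
print(coloring)
```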

Theorem 6.6.3 (Moser and Tardos 2010)


In Algorithm 6.6.2, letting 𝐴_1, . . . , 𝐴_𝑛 denote the bad events, suppose there are
𝑥_1, . . . , 𝑥_𝑛 ∈ [0, 1) such that

\[ \mathbb{P}(A_i) \le x_i \prod_{j \in N(i)} (1 - x_j) \quad \text{for all } i \in [n]. \]

Then for each 𝑖,

\[ \mathbb{E}[\text{number of times that } A_i \text{ is chosen for resampling}] \le \frac{x_i}{1 - x_i}. \]

We won’t prove the general theorem here. The proof in Moser and Tardos (2010) is
beautifully written and not too long. I highly recommend reading it. In the next
subsection, we will prove the correctness of the algorithm in a special case using a
neat idea known as entropy compression.

Remark 6.6.4 (Las Vegas versus Monte Carlo). Here are some important classes of
randomized algorithms:

• Monte Carlo algorithm (MC): a randomized algorithm that terminates with an
output, but there is a small probability that the output is incorrect;
• Las Vegas algorithm (LV): a randomized algorithm that always returns a correct
answer, but may run for a long time (or possibly forever).
The Moser–Tardos algorithm is an LV algorithm whose expected number of resampling steps
is bounded by Σ_𝑖 𝑥_𝑖/(1 − 𝑥_𝑖), which is usually at most polynomial in the parameters of the problem.

We are usually interested in randomized algorithms whose running time is small (e.g.,
at most a polynomial of the input size).
We can convert an efficient LV algorithm into an efficient MC algorithm as follows:
suppose the LV algorithm has expected running time 𝑇, and now we run the algorithm
but if it takes more than 𝐶𝑇 time, then halt and declare a failure. Markov’s inequality
then shows that the algorithm fails with probability ≤ 1/𝐶.
However, it is not always possible to convert an efficient MC algorithm into an efficient
LV algorithm. Starting with an MC algorithm, one might hope to repeatedly run it
until a correct answer has been found. However, there might not be an efficient way to
check the answer.
For example, consider the problem of finding a Ramsey coloring, specifically, a
2-edge-coloring of 𝐾_𝑛 without a monochromatic clique of size ≥ 100 log_2 𝑛. A uniform
random coloring works with overwhelming probability, as can be checked by a simple
union bound (see Theorem 1.1.2). However, we do not have an efficient way to check
whether the random edge-coloring indeed has the desired property. It is a major open
problem to find an LV algorithm for finding such an edge-coloring.

Entropy compression argument


We now give a simple and elegant proof for a special case of the above algorithm, due
to Moser (2009). Actually, the argument in his paper is quite a bit more complicated.
Moser presented a version of the proof below in a conference, and his ideas were
popularized by Fortnow and Tao. (Fortnow called Moser’s talk “one of the best STOC
talks ever”). Tao introduced the phrase entropy compression argument to describe
Moser’s influential idea. (We won’t use the language of entropy here, and instead use a
more elementary argument involving counting and the pigeonhole principle. We will
discuss entropy in Chapter 10.)
To keep the argument simple, we work in the setting of 𝑘-CNFs. Recall from
Example 6.1.6 that a 𝒌-CNF formula (conjunctive normal form) consists of a logical
conjunction (i.e., and, ∧) of clauses, where each clause is a disjunction (i.e., or, ∨)
of exactly 𝑘 literals. We shall require that the 𝑘 literals of each clause use distinct
variables (𝑥_1, . . . , 𝑥_𝑁), and each variable appears either in its positive form 𝑥_𝑖 or
in its negated form x̄_𝑖. For example, here is a 3-CNF with 4 clauses on 6 variables:

(𝑥 1 ∨ 𝑥 2 ∨ 𝑥 3 ) ∧ (𝑥 1 ∨ 𝑥2 ∨ 𝑥 4 ) ∧ (𝑥 2 ∨ 𝑥4 ∨ 𝑥 5 ) ∧ (𝑥 3 ∨ 𝑥5 ∨ 𝑥 6 ).

The problem is to find an assignment of boolean values to the variables so that the
expression evaluates to TRUE.

Algorithm 6.6.5 (Moser “fix-it”)


input : a 𝑘-CNF
output : a satisfying assignment
1 Initialize by setting all variables to arbitrary values;
2 while there is some violated clause 𝐶 do
3     fix(𝐶);
4 Subroutine fix(clause 𝐶):
5     Resample the variables in 𝐶 uniformly at random;
6     while there is some violated clause 𝐷 that shares a variable with 𝐶 do
7         fix(𝐷);

(We can make the algorithm well defined by specifying a way to pick an “arbitrary” choice, e.g.,
the lexicographically first choice. Also, in Line 6, we allow taking 𝐷 = 𝐶.)
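
The recursive structure can be sketched as follows. The clause representation (a list of (variable, wanted value) pairs), the function names, and the example formula are our own assumptions; the initial assignment is random, and "arbitrary" choices are resolved by taking the first violated clause in list order. Python's recursion limit makes this suitable only for small instances.

```python
import random

def moser_fix_it(num_vars, clauses, rng):
    """Recursive fix-it for a k-CNF.

    clauses: list of clauses; each clause is a list of (variable, wanted)
    pairs and is satisfied when assignment[variable] == wanted for some pair.
    """
    assignment = [rng.random() < 0.5 for _ in range(num_vars)]

    def violated(clause):
        return all(assignment[v] != wanted for v, wanted in clause)

    def shares_variable(c, d):
        return bool({v for v, _ in c} & {v for v, _ in d})

    def fix(clause):
        for v, _ in clause:  # Line 5: resample the variables of the clause
            assignment[v] = rng.random() < 0.5
        while True:  # Line 6 (the clause itself is allowed as D)
            bad = next((d for d in clauses
                        if shares_variable(clause, d) and violated(d)), None)
            if bad is None:
                return
            fix(bad)  # Line 7

    while True:  # Lines 2-3
        bad = next((c for c in clauses if violated(c)), None)
        if bad is None:
            return assignment
        fix(bad)

# Example 5-CNF: 7 clauses on 21 variables; each clause shares variables
# with 2 other clauses, within the bound 2^(5-3) = 4.
clauses = [[((3 * i + j) % 21, (i + j) % 2 == 0) for j in range(5)]
           for i in range(7)]
solution = moser_fix_it(21, clauses, random.Random(0))
print(solution)
```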

Theorem 6.6.6 (Correctness of Moser’s algorithm)


Given a 𝑘-CNF where every clause shares variables with at most 2^{𝑘−3} other clauses,
Algorithm 6.6.5 outputs a satisfying assignment with expected running time at most
polynomial in the number of variables and clauses.

Note that the Lovász local lemma guarantees the existence of a solution if each clause
shares variables with at most 2^𝑘/𝑒 − 1 other clauses (each clause is violated with probability
exactly 2^{−𝑘} in a uniform random assignment of variables). So the theorem above is
tight up to an unimportant constant factor.

Lemma 6.6.7 (Outer while loop)


Each clause of the 𝑘-CNF appears at most once as a violated clause in the outer while
loop (Line 2).

Proof. Given an assignment of variables, by calling fix(𝐶) for any clause 𝐶, any
clause that was previously satisfied remains satisfied after the completion of the execu-
tion of fix(𝐶). Furthermore, 𝐶 becomes satisfied after the function call. Thus, once
fix(𝐶) is called, 𝐶 can never show up again as a violated clause in Line 2. □

Lemma 6.6.8 (The number of recursive calls to fix)


Fix a 𝑘-CNF on 𝑛 variables where every clause shares variables with at most 2^{𝑘−3}
other clauses. Also fix a clause 𝐶_0 and some assignment of variables. Then, in an
execution of fix(𝐶_0), for any positive integer ℓ,

\[ \mathbb{P}(\text{there are at least } \ell \text{ recursive calls to fix in Line 7}) \le 2^{-\ell + n + 1}. \]

It follows that the expected number of recursive calls to fix is 𝑛 + 𝑂 (1). Thus, in
the Moser algorithm (Algorithm 6.6.5), the expected total number of calls to fix is
𝑚𝑛 + 𝑂 (𝑚), where 𝑛 is the number of variables and 𝑚 is the number of clauses. This
proves the correctness of the algorithm (Theorem 6.6.6).
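
To see where the bound 𝑛 + 𝑂(1) comes from, write 𝑁 for the number of recursive calls to fix (the symbol 𝑁 is ours). Bounding the tail probability by 1 for ℓ ≤ 𝑛 + 1 and by the lemma otherwise gives

\[ \mathbb{E}[N] = \sum_{\ell \ge 1} \mathbb{P}(N \ge \ell) \le \sum_{\ell = 1}^{n+1} 1 + \sum_{\ell \ge n+2} 2^{-\ell + n + 1} = (n + 1) + \sum_{j \ge 1} 2^{-j} = n + 2. \]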

Proof. Let us formalize the randomness in the algorithm by first initializing a random
string of bits. Specifically, let 𝑥 ∈ {0, 1}^{𝑘ℓ} be generated uniformly at random.
Whenever a clause is resampled in Line 5, one replaces the variables in the clause by
the next 𝑘 bits from 𝑥. Furthermore, if Line 7 is called for the ℓ-th time, we
halt the algorithm and declare a failure (as we would have run out of random bits to
resample had we continued).
At the same time, we keep an execution trace which records the clauses on which fix
was called, and also when the inner while loop in Line 6 ends. Note that the very first
call to fix(𝐶_0) is not included in the execution trace since 𝐶_0 is already given as fixed
and so we don’t need to include this information. Here is an example of an execution
trace, writing C7 for the 7th clause in the 𝑘-CNF:

fix(C7) called
fix(C4) called
fix(C7) called
while loop ended
fix(C2) called
while loop ended
while loop ended
...

For illustration, here is an example of how the variables of these clauses could intersect:

C2: ****
C4: ****
C7: ****

It is straightforward to deduce which while loop ended corresponds to which fix
call by reading the execution trace and keeping track of a last-in-first-out stack.

Encoding the execution trace as a bit string. We fix at the beginning some canonical
order of all clauses (e.g., lexicographic). It would be too expensive to refer to each
clause in its absolute position in this order (this is an important point!). Instead, we
note that every clause shares variables with at most 2^{𝑘−3} other clauses, and only these
≤ 2^{𝑘−3} could be called in the inner while loop in Line 6. So we can record which one
got called using a (𝑘 − 3)-bit string.
• fix(𝐷) called: suppose this was called inside an execution of fix(𝐶), and
𝐷 is the 𝑗-th clause among all clauses sharing a variable with 𝐶, then record in
the execution trace bit string 0 followed by exactly 𝑘 − 3 bits giving the binary
representation of 𝑗 (prepended by zeros to get exactly 𝑘 − 3 bits).
• while loop ended: record 1 in the execution trace bit string.
Note that one can recover the execution trace from the above bit string encoding.
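
The encoding and its inverse can be sketched as below. The trace representation (tuples ("call", j) and ("end",)) and the function names are our own; we take 𝑗 to be a 0-based index with 0 ≤ 𝑗 < 2^{𝑘−3} and assume 𝑘 ≥ 4 so that each call record has at least one index bit.

```python
def encode_trace(trace, k):
    """Encode a trace as a bit string: a call is '0' followed by k - 3 index
    bits, and the end of a while loop is the single bit '1'."""
    assert k >= 4
    bits = []
    for item in trace:
        if item[0] == "call":
            j = item[1]
            assert 0 <= j < 2 ** (k - 3)
            bits.append("0" + format(j, "b").zfill(k - 3))
        else:
            bits.append("1")
    return "".join(bits)

def decode_trace(bits, k):
    """Inverse of encode_trace: read the bit string from left to right."""
    trace, pos = [], 0
    while pos < len(bits):
        if bits[pos] == "0":
            trace.append(("call", int(bits[pos + 1:pos + k - 2], 2)))
            pos += k - 2  # one marker bit plus k - 3 index bits
        else:
            trace.append(("end",))
            pos += 1
    return trace

example = [("call", 2), ("call", 0), ("call", 3), ("end",),
           ("call", 1), ("end",), ("end",)]
encoded = encode_trace(example, 5)
print(encoded, decode_trace(encoded, 5) == example)
```

Each call contributes 𝑘 − 2 bits and each loop ending contributes 1 bit, so a trace with at most ℓ calls has length at most ℓ(𝑘 − 1).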
Now, suppose the algorithm terminates as a failure due to fix being called the ℓ-th
time. Here is the key claim.

Key claim (recovering randomness). At the moment right before the ℓ-th recur-
sive call to fix on Line 7, we can completely recover 𝑥 from the current variable
assignments and the execution trace.

Note that all ℓ𝑘 random bits in 𝑥 have been used up at this point.
To see the key claim, note that from the execution trace, we can determine which clauses
were resampled and in what order. Furthermore, if fix(𝐷) was called on Line 7, then
𝐷 must have been violated right before the call, and there is a unique possibility for
the violating assignment to 𝐷 right before the call (e.g., if 𝐷 = 𝑥_1 ∨ 𝑥_2 ∨ x̄_3, then the
only violating assignment is (𝑥_1, 𝑥_2, 𝑥_3) = (0, 0, 1)). We can then rewind history, and
put the reassigned values to 𝐷 back into the random bit string 𝑥 to completely recover 𝑥.
How long can the execution bit string be? It has length ≤ ℓ(𝑘 − 1). Indeed, each of the
≤ ℓ recursive calls to fix produces 𝑘 − 2 bits for the call to fix and 1 bit for ending
the while loop. So the total number of possible execution strings is ≤ 2^{ℓ(𝑘−1)+1} (the
+1 accounts for variable lengths, though it can be removed with a more careful analysis).
Thus, the key claim implies that each 𝑥 ∈ {0, 1}^{ℓ𝑘} that leads to a failed execution
produces a unique pair (variable assignment, execution bit string). Thus

\[ \mathbb{P}(\ge \ell \text{ recursive calls to fix}) \cdot 2^{\ell k} = |\{x \in \{0,1\}^{\ell k} \text{ leading to failure}\}| \le 2^{n} \cdot 2^{\ell(k-1)+1}. \]

Therefore, the failure probability is ≤ 2^{−ℓ+𝑛+1}. □

Remark 6.6.9 (Entropy compression). Tao uses the phrase “entropy compression” to
describe this argument. The intuition is that the recoverability of the random bit string
𝑥 means that we are somehow “compressing” an ℓ𝑘-bit random string into a shorter
length losslessly, but that would be impossible. Each call to fix uses up 𝑘 random bits
and converts them to 𝑘 − 1 bits in the execution trace (plus at most 𝑛 bits of extra information,
namely the current variable assignment, and this is viewed as a constant amount of
information), and this conversion is reversible. So we are “compressing entropy.” The
conservation of information tells us that we cannot losslessly compress 𝑘 random bits
to 𝑘 − 1 bits for very long.

Remark 6.6.10 (Relationship between the two proofs of the local lemma?). The
above proof, along with extensions of these ideas in Moser and Tardos (2010), seems
to give a completely different proof of the local lemma than the one we saw at the
beginning of the chapter. Is there some way to relate these seemingly completely
different proofs? Are they secretly the same proof? We do not know. This is an
interesting open-ended research problem.

Exercises

1. Show that it is possible to color the edges of 𝐾_𝑛 with at most 3√𝑛 colors so that
there are no monochromatic triangles.
2. Prove that it is possible to color the vertices of every 𝑘-uniform 𝑘-regular hyper-
graph using at most 𝑘/log 𝑘 colors so that every color appears at most 𝑂 (log 𝑘)
times on each edge.
3. ★ Hitting thin rectangles. Prove that there is a constant 𝐶 > 0 so that for every
sufficiently small 𝜀 > 0, one can choose exactly one point inside each grid square
[𝑛, 𝑛 + 1) × [𝑚, 𝑚 + 1) ⊂ R^2, 𝑚, 𝑛 ∈ Z, so that every rectangle of dimensions
𝜀 by (𝐶/𝜀) log(1/𝜀) in the plane (not necessarily axis-aligned) contains at least
one chosen point.
4. List coloring. Prove that there is some constant 𝑐 > 0 so that given a graph and
a set of 𝑘 acceptable colors for each vertex such that every color is acceptable
for at most 𝑐𝑘 neighbors of each vertex, there is always a proper coloring where
every vertex is assigned one of its acceptable colors.
5. Prove that, for every 𝜀 > 0, there exist ℓ_0 and some (𝑎_1, 𝑎_2, . . . ) ∈ {0, 1}^N
such that for every ℓ > ℓ_0 and every 𝑖 > 1, the vectors (𝑎_𝑖, 𝑎_{𝑖+1}, . . . , 𝑎_{𝑖+ℓ−1}) and
(𝑎_{𝑖+ℓ}, 𝑎_{𝑖+ℓ+1}, . . . , 𝑎_{𝑖+2ℓ−1}) differ in at least (1/2 − 𝜀)ℓ coordinates.
6. Avoiding periodically colored paths. Prove that for every Δ, there exists 𝑘 so
that every graph with maximum degree at most Δ has a vertex-coloring using 𝑘
colors so that there is no path of the form 𝑣_1 𝑣_2 . . . 𝑣_{2ℓ} (for any positive integer
ℓ) where 𝑣_𝑖 has the same color as 𝑣_{𝑖+ℓ} for each 𝑖 ∈ [ℓ]. (Note that vertices on a
path must be distinct.)
7. Prove that every graph with maximum degree Δ can be properly edge-colored
using 𝑂 (Δ) colors so that every cycle contains at least three colors.
(An edge-coloring is proper if it never assigns the same color to two edges sharing a vertex.)

8. ★ Prove that for every Δ, there exists 𝑔 so that every bipartite graph with maximum
degree Δ and girth at least 𝑔 can be properly edge-colored using Δ + 1 colors so
that every cycle contains at least three colors.
9. ★ Prove that for every positive integer 𝑟, there exists 𝐶_𝑟 so that every graph with
maximum degree Δ has a proper vertex coloring using at most 𝐶_𝑟 Δ^{1+1/𝑟} colors
so that every vertex has at most 𝑟 neighbors of each color.
10. Vertex-disjoint cycles in digraphs. (Recall that a directed graph is 𝑘-regular if
all vertices have in-degree and out-degree both equal to 𝑘. Also, cycles cannot
repeat vertices.)
a) Prove that every 𝑘-regular directed graph has at least 𝑐𝑘/log 𝑘 vertex-
disjoint directed cycles, where 𝑐 > 0 is some constant.
b) ★ Prove that every 𝑘-regular directed graph has at least 𝑐𝑘 vertex-disjoint
directed cycles, where 𝑐 > 0 is some constant.
Hint: split in two and iterate

11. a) Generalization of Cayley’s formula. Using Prüfer codes, prove the identity

\[ x_1 x_2 \cdots x_n (x_1 + \cdots + x_n)^{n-2} = \sum_{T} x_1^{d_T(1)} x_2^{d_T(2)} \cdots x_n^{d_T(n)}, \]

where the sum is over all trees 𝑇 on 𝑛 vertices labeled by [𝑛] and 𝑑_𝑇(𝑖) is
the degree of vertex 𝑖 in 𝑇.
b) Let 𝐹 be a forest with vertex set [𝑛], with components having 𝑓_1, . . . , 𝑓_𝑠
vertices so that 𝑓_1 + · · · + 𝑓_𝑠 = 𝑛. Prove that the number of trees on the
vertex set [𝑛] that contain 𝐹 is exactly 𝑛^{𝑛−2} ∏_{𝑖=1}^{𝑠} (𝑓_𝑖/𝑛^{𝑓_𝑖−1}).
c) Independence property for uniform spanning tree of 𝐾𝑛 . Show that if 𝐻1
and 𝐻2 are vertex-disjoint subgraphs of 𝐾𝑛 , then for a uniformly random
spanning tree 𝑇 of 𝐾𝑛 , the events 𝐻1 ⊆ 𝑇 and 𝐻2 ⊆ 𝑇 are independent.
d) ★ Packing rainbow spanning trees. Prove that there is a constant 𝑐 > 0
so that for every edge-coloring of 𝐾𝑛 where each color appears at most
𝑐𝑛 times, there exist at least 𝑐𝑛 edge-disjoint spanning trees, where each
spanning tree has all its edges colored differently.

(In your submission, you may assume previous parts without proof.)

The next two problems use the lopsided local lemma.


12. Packing two copies of a graph. Prove that there is a constant 𝑐 > 0 so that if 𝐻
is an 𝑛-vertex 𝑚-edge graph with maximum degree at most 𝑐𝑛^2/𝑚, then one can
find two edge-disjoint copies of 𝐻 in the complete graph 𝐾𝑛 .
13. ★ Packing Latin transversals. Prove that there is a constant 𝑐 > 0 so that every
𝑛 × 𝑛 matrix where no entry appears more than 𝑐𝑛 times contains 𝑐𝑛 disjoint
Latin transversals.

7 Correlation Inequalities

7.1 Harris–FKG inequality


Recall that 𝐴 ⊆ {0, 1}^𝑛 is called an increasing event (also: increasing property,
up-set) if 𝐴 is upwards-closed, meaning that whenever 𝑥 is in 𝐴, everything above 𝑥
in the boolean lattice also lies in 𝐴. In other words,

if 𝑥 ∈ 𝐴 and 𝑥 ≤ 𝑦 (coordinatewise), then 𝑦 ∈ 𝐴.

Similarly, a decreasing event is a downward-closed subset of {0, 1}^𝑛. A subset
𝐴 ⊆ {0, 1}^𝑛 is increasing if and only if its complement {0, 1}^𝑛 ∖ 𝐴 is decreasing.
The main theorem of this chapter tells us that
increasing events of independent variables are positively correlated.

Theorem 7.1.1 (Harris 1960)


If 𝐴 and 𝐵 are increasing events of independent boolean random variables, then

P( 𝐴𝐵) ≥ P( 𝐴)P(𝐵).

Equivalently, when P(𝐵) > 0, we can write P(𝐴 | 𝐵) ≥ P(𝐴).
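As a sanity check on the statement, here is a small sketch that verifies the inequality exactly on one hand-picked instance. The bias vector `probs` and the two events `A` and `B` are hypothetical choices made only for illustration; the computation enumerates all of {0, 1}^4, so it is a check on one example and not a proof.

```python
from fractions import Fraction
from itertools import product

def event_probability(event, probs):
    """Exact probability of `event` when bit i equals 1 with probability probs[i], independently."""
    total = Fraction(0)
    for x in product((0, 1), repeat=len(probs)):
        if event(x):
            weight = Fraction(1)
            for bit, p in zip(x, probs):
                weight *= p if bit else 1 - p
            total += weight
    return total

# Hypothetical example: four independent bits with different biases.
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4), Fraction(2, 3)]

def A(x):
    # Increasing event: at least two of the first three bits are 1.
    return x[0] + x[1] + x[2] >= 2

def B(x):
    # Increasing event: bit 0 is 1, or bits 2 and 3 are both 1.
    return x[0] == 1 or (x[2] == 1 and x[3] == 1)

p_a = event_probability(A, probs)
p_b = event_probability(B, probs)
p_ab = event_probability(lambda x: A(x) and B(x), probs)
print(p_ab, ">=", p_a * p_b)
```

Exact rational arithmetic is used so that the comparison is not affected by floating-point rounding.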

Remark 7.1.2 (Independence assumption). It is important that the boolean random
variables are independent, although they do not have to be identically distributed.
There are other important settings where the independence assumption can be relaxed.
This is important for certain statistical physics models, where much of this theory
originally arose. Indeed, the above inequality is often called the FKG inequality,
attributed to Fortuin, Kasteleyn, and Ginibre (1971), who proved a more general result in
the setting of distributive lattices, which we will not discuss here (see Alon–Spencer).

Remark 7.1.3 (Percolation). Many such inequalities were initially introduced for
the study of percolation. A classic setting of this problem takes place on the infinite
grid with vertex set Z^2 and with edges connecting pairs of vertices at distance 1. Suppose
we keep each edge of this infinite grid with probability 𝑝 independently. What is the
probability that the origin is part of an infinite component (in which case we say
that there is “percolation”)? This is supposed to be an idealized mathematical model
of how a fluid permeates through a medium. Harris showed that with probability 1,
percolation does not occur for 𝑝 ≤ 1/2. A later breakthrough of Kesten (1980) showed
that for all 𝑝 > 1/2 an infinite component exists with probability 1 (and the origin lies
in one with positive probability). Thus the “bond percolation
threshold” for Z^2 is exactly 1/2. Such exact results are extremely rare.

Example 7.1.4. Here is a quick application of Harris’ inequality to a random graph
𝐺(𝑛, 𝑝):
P(planar | connected) ≤ P(planar).
Indeed, being planar is a decreasing property, whereas being connected is an increasing
property.

We state and prove a more general result, which says that independent random variables
possess positive association.
Let each Ω𝑖 be a linearly ordered set (e.g., {0, 1} or R), and let 𝑥𝑖 ∈ Ω𝑖 be chosen
according to some probability distribution, independently for each 𝑖. We say that a
function 𝑓(𝑥1 , . . . , 𝑥𝑛) is monotone increasing if

𝑓 (𝑥) ≤ 𝑓 (𝑦) whenever 𝑥 ≤ 𝑦 coordinatewise.

Theorem 7.1.5 (Harris)


If 𝑓 and 𝑔 are monotone increasing functions of independent random variables, then

E[ 𝑓 𝑔] ≥ (E 𝑓 )(E𝑔).

This version of Harris’ inequality implies the earlier version by setting 𝑓 = 1_𝐴 and
𝑔 = 1_𝐵 (the indicator functions of 𝐴 and 𝐵).

Proof. We use induction on 𝑛.

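For 𝑛 = 1, let 𝑥, 𝑦 ∈ Ω1 be independent with the same distribution. Since 𝑓 and 𝑔 are
both monotone increasing and Ω1 is linearly ordered, 𝑓(𝑥) − 𝑓(𝑦) and 𝑔(𝑥) − 𝑔(𝑦)
never have opposite signs, so

0 ≤ E[(𝑓(𝑥) − 𝑓(𝑦))(𝑔(𝑥) − 𝑔(𝑦))] = 2E[𝑓𝑔] − 2(E𝑓)(E𝑔).

So E[𝑓𝑔] ≥ (E𝑓)(E𝑔). (The one-variable case is sometimes called Chebyshev’s
inequality. It can also be deduced using the rearrangement inequality.)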
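For completeness, here is the expansion behind the identity in the one-variable case, written out as a short derivation. It assumes, as in the proof, that 𝑥 and 𝑦 are independent and have the same distribution, so that E[𝑓(𝑥)𝑔(𝑦)] factors as (E𝑓)(E𝑔).

```latex
\begin{align*}
\mathbb{E}\big[(f(x)-f(y))(g(x)-g(y))\big]
&= \mathbb{E}[f(x)g(x)] - \mathbb{E}[f(x)g(y)] - \mathbb{E}[f(y)g(x)] + \mathbb{E}[f(y)g(y)] \\
&= \mathbb{E}[fg] - (\mathbb{E}f)(\mathbb{E}g) - (\mathbb{E}f)(\mathbb{E}g) + \mathbb{E}[fg] \\
&= 2\,\mathbb{E}[fg] - 2\,(\mathbb{E}f)(\mathbb{E}g).
\end{align*}
```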
Now assume 𝑛 ≥ 2. Let ℎ = 𝑓𝑔 : Ω1 × · · · × Ω𝑛 → R. Define marginals 𝑓1 , 𝑔1 , ℎ1 : Ω1 → R by

\begin{align*}
f_1(y_1) &= \mathbb{E}[f \mid x_1 = y_1] = \mathbb{E}_{(x_2,\dots,x_n) \in \Omega_2 \times \cdots \times \Omega_n}[f(y_1, x_2, \dots, x_n)], \\
g_1(y_1) &= \mathbb{E}[g \mid x_1 = y_1] = \mathbb{E}_{(x_2,\dots,x_n) \in \Omega_2 \times \cdots \times \Omega_n}[g(y_1, x_2, \dots, x_n)], \\
h_1(y_1) &= \mathbb{E}[h \mid x_1 = y_1] = \mathbb{E}_{(x_2,\dots,x_n) \in \Omega_2 \times \cdots \times \Omega_n}[h(y_1, x_2, \dots, x_n)].
\end{align*}

Note that 𝑓1 and 𝑔1 are 1-variable monotone increasing functions on Ω1 .


For every fixed 𝑦 1 ∈ Ω1 , the function (𝑥 2 , . . . , 𝑥 𝑛 ) ↦→ 𝑓 (𝑦 1 , 𝑥2 , . . . , 𝑥 𝑛 ) is monotone
increasing, and likewise with 𝑔. So applying the induction hypothesis for 𝑛 − 1, we
have
ℎ1 (𝑦 1 ) ≥ 𝑓1 (𝑦 1 )𝑔1 (𝑦 1 ). (7.1)

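Thus

\begin{align*}
\mathbb{E}[fg] = \mathbb{E}[h] = \mathbb{E}[h_1] &\ge \mathbb{E}[f_1 g_1] && \text{[by (7.1)]} \\
&\ge (\mathbb{E} f_1)(\mathbb{E} g_1) && \text{[by the $n = 1$ case]} \\
&= (\mathbb{E} f)(\mathbb{E} g). && \square
\end{align*}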

Corollary 7.1.6 (Decreasing events and multiple events)


Let 𝐴 and 𝐵 be events on independent random variables.
(a) If 𝐴 and 𝐵 are decreasing, then P( 𝐴 ∧ 𝐵) ≥ P( 𝐴)P(𝐵).
(b) If 𝐴 is increasing and 𝐵 is decreasing, then P( 𝐴 ∧ 𝐵) ≤ P( 𝐴)P(𝐵).
If 𝐴1 , . . . , 𝐴 𝑘 are all increasing (or all decreasing) events on independent random
variables, then
P( 𝐴1 · · · 𝐴 𝑘 ) ≥ P( 𝐴1 ) · · · P( 𝐴 𝑘 ).

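Proof. For the second inequality, note that the complement \bar{B} of 𝐵 is increasing, so
by Harris’ inequality,

\[ \mathbb{P}(AB) = \mathbb{P}(A) - \mathbb{P}(A\bar{B}) \le \mathbb{P}(A) - \mathbb{P}(A)\mathbb{P}(\bar{B}) = \mathbb{P}(A)\mathbb{P}(B). \]

The proof of the first inequality is similar. For the last inequality we apply the Harris
inequality repeatedly, using that an intersection of increasing events is increasing. □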

7.2 Applications to random graphs


Triangle-free probability

Question 7.2.1
What’s the probability that 𝐺 (𝑛, 𝑝) is triangle-free?

Harris inequality will allow us to prove a lower bound. In the next chapter, we will use
Janson inequalities to derive upper bounds.

Theorem 7.2.2
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \ge (1 - p^3)^{\binom{n}{3}} \]

Proof. For each triple of distinct vertices 𝑖, 𝑗, 𝑘 ∈ [𝑛], the event that 𝑖𝑗𝑘 does not form
a triangle is a decreasing event (here the ground set is the set of edges of the complete
graph on 𝑛 vertices). So by Harris’ inequality,

\begin{align*}
\mathbb{P}(G(n,p) \text{ is triangle-free})
&= \mathbb{P}\Big( \bigwedge_{i<j<k} \{ijk \text{ not a triangle}\} \Big) \\
&\ge \prod_{i<j<k} \mathbb{P}(ijk \text{ not a triangle}) = (1 - p^3)^{\binom{n}{3}}. && \square
\end{align*}
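The lower bound can be compared with the exact value for very small 𝑛 by brute force. The sketch below is only an illustration: the function name `triangle_free_probability` and the parameters 𝑛 = 5, 𝑝 = 1/3 are arbitrary choices, and the enumeration over all 2^10 graphs does not scale beyond tiny 𝑛.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def triangle_free_probability(n, p):
    """Exact P(G(n, p) is triangle-free), summing over all 2^C(n,2) labeled graphs."""
    edges = list(combinations(range(n), 2))
    index = {e: i for i, e in enumerate(edges)}
    # Each triangle is stored as a bitmask of its three edges.
    triangles = [
        (1 << index[(a, b)]) | (1 << index[(a, c)]) | (1 << index[(b, c)])
        for a, b, c in combinations(range(n), 3)
    ]
    total = Fraction(0)
    for graph in range(1 << len(edges)):
        # The graph is triangle-free if no triangle mask is fully contained in it.
        if all(graph & t != t for t in triangles):
            m = bin(graph).count("1")
            total += p**m * (1 - p) ** (len(edges) - m)
    return total

n, p = 5, Fraction(1, 3)
exact = triangle_free_probability(n, p)
harris_bound = (1 - p**3) ** comb(n, 3)
print(float(exact), ">=", float(harris_bound))
```

For 𝑛 = 3 the two quantities coincide (both equal 1 − 𝑝^3), since there is only one triangle and nothing to correlate.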

Remark 7.2.3. How good is this bound? For 𝑝 ≤ 0.99, we have 1 − 𝑝^3 = 𝑒^{−Θ(𝑝^3)}, so
the above bound gives
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \ge e^{-\Theta(n^3 p^3)}. \]

Here is another lower bound:
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \ge \mathbb{P}(G(n,p) \text{ is empty}) = (1-p)^{\binom{n}{2}} = e^{-\Theta(n^2 p)}. \]

The bound from Harris is better when 𝑝 ≪ 𝑛^{−1/2}. Putting them together, we obtain
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \gtrsim
\begin{cases}
e^{-\Theta(n^3 p^3)} & \text{if } p \lesssim n^{-1/2}, \\
e^{-\Theta(n^2 p)} & \text{if } n^{-1/2} \lesssim p \le 0.99
\end{cases} \]
(note that the asymptotics agree at the boundary 𝑝 ≍ 𝑛^{−1/2}). In the next chapter, we
will prove matching upper bounds using Janson inequalities.

Maximum degree

Question 7.2.4
What’s the probability that the maximum degree of 𝐺 (𝑛, 1/2) is at most 𝑛/2?

For each vertex 𝑣, deg(𝑣) ≤ 𝑛/2 is a decreasing event with probability just slightly
over 1/2. So by Harris’ inequality, the probability that every 𝑣 has deg(𝑣) ≤ 𝑛/2 is at
least 2^{−𝑛}.
It turns out that the appearance of high degree vertices is much more correlated than
in the independent case. The truth is exponentially larger than the above bound.

Theorem 7.2.5 (Riordan and Selby 2000)

\[ \mathbb{P}(\operatorname{maxdeg} G(n, 1/2) \le n/2) = (0.6102\cdots + o(1))^{n} \]

Instead of giving a proof, we consider an easier continuous model of the problem that
motivates the numerical answer. Building on this intuition, Riordan and Selby (2000)
proved the result in the random graph setting, although this is beyond the scope of this
class.
In a random graph, we assign independent Bernoulli random variables to the edges of a
complete graph. Instead, let us assign independent standard normal random variables
to the edges of the complete graph.

Proposition 7.2.6 (Max degree with normal random edge labels)


Assign an independent standard normal random variable 𝑍𝑢𝑣 to each edge 𝑢𝑣 of 𝐾𝑛 . Let
𝑊𝑣 = \sum_{u \ne v} Z_{uv} be the sum of the labels of the edges incident to a vertex 𝑣. Then

\[ \mathbb{P}(W_v \le 0 \ \forall v) = (0.6102\cdots + o(1))^{n}. \]

The event 𝑊𝑣 ≤ 0 is supposed to model the event that the degree at vertex 𝑣 is at
most 𝑛/2. Of course, other than intuition, there is no justification here that these two
models should behave similarly.
We have P(𝑊𝑣 ≤ 0) = 1/2. Since each {𝑊𝑣 ≤ 0} is a decreasing event of the
independent edge labels, Harris’ inequality tells us that

P(𝑊𝑣 ≤ 0 ∀𝑣) ≥ 2^{−𝑛}.

The truth turns out to be significantly greater.

Proof sketch of Proposition 7.2.6. The tuple (𝑊𝑣)_{𝑣∈[𝑛]} has a joint normal distribution,
with each coordinate having variance 𝑛 − 1 and each pair of coordinates having
covariance 1. So (𝑊𝑣)_{𝑣∈[𝑛]} has the same distribution as

\[ \sqrt{n-2}\,(Z_1', Z_2', \dots, Z_n') + Z_0'\,(1, 1, \dots, 1), \]

where 𝑍0′ , . . . , 𝑍𝑛′ are iid standard normals.
Let Φ be the cdf of the standard normal 𝑁(0, 1).
Thus

\[ \mathbb{P}(W_v \le 0 \ \forall v)
= \mathbb{P}\Big( Z_i' \le -\frac{Z_0'}{\sqrt{n-2}} \ \forall i \in [n] \Big)
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\, \Phi\Big( \frac{-z}{\sqrt{n-2}} \Big)^{n} \, dz, \]

where the final step is obtained by conditioning on 𝑍0′ . Substituting 𝑧 = −𝑦√𝑛, the above
quantity equals

\[ \sqrt{\frac{n}{2\pi}} \int_{-\infty}^{\infty} e^{n f(y)} \, dy
\quad \text{where} \quad
f(y) = -\frac{y^2}{2} + \log \Phi\Big( y \sqrt{\frac{n}{n-2}} \Big). \]

We can estimate the above integral for large 𝑛 using the Laplace method (which can be
justified rigorously by considering Taylor expansion around the maximum of 𝑓 ). We
have
\[ f(y) \approx g(y) := -\frac{y^2}{2} + \log \Phi(y) \]
and we can deduce that

\[ \lim_{n\to\infty} \frac{1}{n} \log \mathbb{P}\Big( \max_{v \in [n]} W_v \le 0 \Big)
= \lim_{n\to\infty} \frac{1}{n} \log \int_{-\infty}^{\infty} e^{n f(y)} \, dy
= \max g = \log 0.6102\cdots. \qquad \square \]
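The constant can be reproduced numerically by maximizing 𝑔(𝑦) = −𝑦^2/2 + log Φ(𝑦). The sketch below uses a plain ternary search, which is adequate here because 𝑔 is concave (the normal cdf is log-concave); the search interval [−5, 5] and the helper names are assumptions made for this illustration.

```python
from math import erf, exp, log, sqrt

def normal_cdf(y):
    """Standard normal cdf, expressed through the error function."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

def g(y):
    """The limiting exponent g(y) = -y^2/2 + log Phi(y)."""
    return -y * y / 2.0 + log(normal_cdf(y))

def maximize_concave(func, lo, hi, iterations=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if func(a) < func(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2.0

y_star = maximize_concave(g, -5.0, 5.0)
base = exp(g(y_star))
print(y_star, base)
```

The maximizer satisfies 𝑦 = Φ′(𝑦)/Φ(𝑦), and exp(max 𝑔) agrees with the constant 0.6102··· in the proposition.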

Exercises
1. Let 𝐺 = (𝑉, 𝐸) be a graph. Color every edge with red or blue independently
and uniformly at random. Let 𝐸 0 be the set of red edges and 𝐸 1 the set of blue
edges. Let 𝐺 𝑖 = (𝑉, 𝐸𝑖 ) for each 𝑖 = 0, 1. Prove that

P(𝐺0 and 𝐺1 are both connected) ≤ P(𝐺0 is connected)^2.

2. A set family F is intersecting if 𝐴 ∩ 𝐵 ≠ ∅ for all 𝐴, 𝐵 ∈ F . Let F1 , . . . , F𝑘
each be a collection of subsets of [𝑛] and suppose that each F𝑖 is intersecting.
Prove that |F1 ∪ · · · ∪ F𝑘| ≤ 2^𝑛 − 2^{𝑛−𝑘}.
3. Percolation. Let 𝐺 𝑚,𝑛 be the grid graph on vertex set [𝑚] × [𝑛] (𝑚 vertices wide
and 𝑛 vertices tall). A horizontal crossing is a path that connects some left-most
vertex to some right-most vertex. See below for an example of a horizontal
crossing in 𝐺 7,5 .

Let 𝐻𝑚,𝑛 denote the random subgraph of 𝐺 𝑚,𝑛 obtained by keeping every edge
with probability 1/2 independently.
Let RSW (𝑘) denote the following statement: there exists a constant 𝑐 𝑘 > 0 such
that for all positive integers 𝑛, P(𝐻 𝑘𝑛,𝑛 has a horizontal crossing) ≥ 𝑐 𝑘 .
a) Prove RSW (1).
b) Prove that RSW (2) implies RSW (100).
c) ★★ (Very challenging) Prove RSW (2).
4. Let 𝐴 and 𝐵 be two independent increasing events of independent random
variables. Prove that there are two disjoint subsets 𝑆 and 𝑇 of these random
variables so that 𝐴 depends only on 𝑆 and 𝐵 depends only on 𝑇.
5. Let 𝑈1 and 𝑈2 be increasing events and 𝐷 a decreasing event of independent
Boolean random variables. Suppose 𝑈1 and 𝑈2 are independent. Prove that
P(𝑈1 |𝑈2 ∩ 𝐷) ≤ P(𝑈1 |𝑈2 ).
6. Coupon collector. Let 𝑠1 , . . . , 𝑠𝑚 be independent random elements in [𝑛] (not
necessarily uniform or identically distributed; chosen with replacement) and
𝑆 = {𝑠1 , . . . , 𝑠𝑚 }. Let 𝐼 and 𝐽 be disjoint subsets of [𝑛]. Prove that P(𝐼 ∪ 𝐽 ⊆
𝑆) ≤ P(𝐼 ⊆ 𝑆)P(𝐽 ⊆ 𝑆).
7. ★ Prove that there exist 𝑐 < 1 and 𝜀 > 0 such that if 𝐴1 , . . . , 𝐴 𝑘 are increasing
events of independent Boolean random variables with P( 𝐴𝑖 ) ≤ 𝜀 for all 𝑖, then,
letting 𝑋 denote the number of events 𝐴𝑖 that occur, one has P(𝑋 = 1) ≤ 𝑐.
(Give your smallest 𝑐. It is conjectured that any 𝑐 > 1/𝑒 works.)
8. ★ Disjoint containment. Let S and T each be a collection of subsets of [𝑛].
Let 𝑅 ⊆ [𝑛] be a random subset where each element is included independently
(not necessarily with the same probability). Let 𝐴 be the event that 𝑆 ⊆ 𝑅 for
some 𝑆 ∈ S. Let 𝐵 be the event that 𝑇 ⊆ 𝑅 for some 𝑇 ∈ T . Let 𝐶 denote
the event there exist disjoint 𝑆, 𝑇 ⊆ 𝑅 with 𝑆 ∈ S and 𝑇 ∈ T . Prove that
P(𝐶) ≤ P( 𝐴)P(𝐵).

8 Janson Inequalities

We present a collection of inequalities, known collectively as Janson inequalities
(Janson 1990). These tools allow us to estimate lower tail large deviation probabilities.
A typical application of Janson’s inequality allows us to upper bound the probability
that a random graph 𝐺(𝑛, 𝑝) does not contain any copy of some subgraph. Compared
to the second moment method from Chapter 4, Janson inequalities (which are applicable
in more limited setups) give much better bounds, usually with exponential decay.

8.1 Probability of non-existence


The following setup should be reminiscent of both the second moment method and
the Lovász local lemma (the random variable model).

Setup 8.1.1 (for Janson’s inequality: counting containments)


Let 𝑅 be a random subset of [𝑁] with each element included independently (possibly
with different probabilities).
Let 𝑆1 , . . . , 𝑆𝑘 ⊆ [𝑁]. Let 𝐴𝑖 be the event that 𝑆𝑖 ⊆ 𝑅. Let

\[ X = \sum_{i} 1_{A_i} \]

be the number of sets 𝑆𝑖 contained in the random set 𝑅. Let

\[ \mu = \mathbb{E}[X] = \sum_{i} \mathbb{P}(A_i). \]

Write 𝑖 ∼ 𝑗 if 𝑖 ≠ 𝑗 and 𝑆𝑖 ∩ 𝑆𝑗 ≠ ∅. Let (as in the second moment method)

\[ \Delta = \sum_{(i,j):\, i \sim j} \mathbb{P}(A_i A_j) = \sum_{(i,j):\, i \sim j} \mathbb{P}(S_i \cup S_j \subseteq R) \]

(note that the sum is over ordered pairs, so (𝑖, 𝑗) and (𝑗, 𝑖) are each counted once).

The following inequality appeared in Janson, Łuczak, and Ruciński (1990).

Theorem 8.1.2 (Janson inequality I)


Assuming Setup 8.1.1,
\[ \mathbb{P}(X = 0) \le e^{-\mu + \Delta/2}. \]

This inequality is most useful when Δ = 𝑜(𝜇).

Remark 8.1.3. When P(𝐴𝑖) = 𝑜(1) (which is the case in a typical application), Harris’
inequality gives us

\begin{align*}
\mathbb{P}(X = 0) = \mathbb{P}(\bar{A}_1 \cdots \bar{A}_k) &\ge \prod_{i=1}^{k} \mathbb{P}(\bar{A}_i) \\
&= \prod_{i=1}^{k} (1 - \mathbb{P}(A_i)) = \exp\Big( -(1+o(1)) \sum_{i=1}^{k} \mathbb{P}(A_i) \Big) = e^{-(1+o(1))\mu}.
\end{align*}

In the setting where Δ = 𝑜(𝜇), the two bounds match to give P(𝑋 = 0) = 𝑒^{−(1+𝑜(1))𝜇}.
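To see the two bounds side by side on a concrete instance, the sketch below computes 𝜇 and Δ as in Setup 8.1.1 and compares the exact value of P(𝑋 = 0) with the Harris lower bound and the Janson upper bound. The instance (3-term arithmetic progressions in [9], every element kept with probability 1/5) and the helper names are hypothetical choices for illustration; the exact probability is obtained by enumerating all 2^9 subsets.

```python
from fractions import Fraction
from itertools import combinations, product
from math import exp

def janson_parameters(sets, prob):
    """Return (mu, Delta) from Setup 8.1.1 when every element has probability `prob`."""
    mu = sum(prob ** len(s) for s in sets)
    delta = sum(
        prob ** len(sets[i] | sets[j])
        for i, j in product(range(len(sets)), repeat=2)  # ordered pairs (i, j)
        if i != j and sets[i] & sets[j]                   # i ~ j: distinct and overlapping
    )
    return mu, delta

def prob_no_set_contained(sets, ground, prob):
    """Exact P(X = 0): no S_i is contained in the random subset R."""
    total = Fraction(0)
    for size in range(len(ground) + 1):
        for chosen in combinations(ground, size):
            r = frozenset(chosen)
            if not any(s <= r for s in sets):
                total += prob**size * (1 - prob) ** (len(ground) - size)
    return total

# Hypothetical instance: 3-term arithmetic progressions inside [9], p = 1/5.
N, p = 9, Fraction(1, 5)
ground = range(1, N + 1)
aps = [frozenset((a, a + d, a + 2 * d))
       for a in ground for d in range(1, N) if a + 2 * d <= N]

mu, delta = janson_parameters(aps, p)
exact = prob_no_set_contained(aps, ground, p)
harris_lower = Fraction(1)
for s in aps:
    harris_lower *= 1 - p ** len(s)
janson_upper = exp(float(-mu + delta / 2))
print(float(harris_lower), "<=", float(exact), "<=", janson_upper)
```

On this instance Δ is small compared to 𝜇, so the two bounds are close to each other, in line with the remark.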

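Proof. Let
\[ r_i = \mathbb{P}(A_i \mid \bar{A}_1 \cdots \bar{A}_{i-1}). \]
We have

\begin{align*}
\mathbb{P}(X = 0) &= \mathbb{P}(\bar{A}_1 \cdots \bar{A}_k) \\
&= \mathbb{P}(\bar{A}_1)\, \mathbb{P}(\bar{A}_2 \mid \bar{A}_1) \cdots \mathbb{P}(\bar{A}_k \mid \bar{A}_1 \cdots \bar{A}_{k-1}) \\
&= (1 - r_1) \cdots (1 - r_k) \\
&\le e^{-r_1 - \cdots - r_k}.
\end{align*}

It suffices now to prove the following.

Claim. For each 𝑖 ∈ [𝑘],
\[ r_i \ge \mathbb{P}(A_i) - \sum_{j<i:\, j \sim i} \mathbb{P}(A_i A_j). \]

Summing the claim over 𝑖 ∈ [𝑘] would then yield

\[ \sum_{i=1}^{k} r_i \ge \sum_{i} \mathbb{P}(A_i) - \frac{1}{2} \sum_{i} \sum_{j \sim i} \mathbb{P}(A_i A_j) = \mu - \frac{\Delta}{2} \]

and thus
\[ \mathbb{P}(X = 0) \le \exp\Big( -\sum_{i} r_i \Big) \le \exp\Big( -\mu + \frac{\Delta}{2} \Big). \]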

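Proof of claim. Recall that 𝑖 is given and fixed. Let

\[ D_0 = \bigwedge_{j<i:\, j \nsim i} \bar{A}_j \quad \text{and} \quad D_1 = \bigwedge_{j<i:\, j \sim i} \bar{A}_j. \]

Then
\begin{align*}
r_i = \mathbb{P}(A_i \mid \bar{A}_1 \cdots \bar{A}_{i-1}) = \mathbb{P}(A_i \mid D_0 D_1)
&= \frac{\mathbb{P}(A_i D_0 D_1)}{\mathbb{P}(D_0 D_1)} \ge \frac{\mathbb{P}(A_i D_0 D_1)}{\mathbb{P}(D_0)} \\
&= \mathbb{P}(A_i D_1 \mid D_0) = \mathbb{P}(A_i \mid D_0) - \mathbb{P}(A_i \bar{D}_1 \mid D_0) \\
&= \mathbb{P}(A_i) - \mathbb{P}(A_i \bar{D}_1 \mid D_0) && \text{[by independence]}.
\end{align*}

Since 𝐴𝑖 and the complement \bar{D}_1 of 𝐷1 are both increasing events, and 𝐷0 is a
decreasing event, by Harris’ inequality (Corollary 7.1.6),

\[ \mathbb{P}(A_i \bar{D}_1 \mid D_0) \le \mathbb{P}(A_i \bar{D}_1)
= \mathbb{P}\Big( A_i \wedge \bigvee_{j<i:\, j \sim i} A_j \Big)
\le \sum_{j<i:\, j \sim i} \mathbb{P}(A_i A_j). \]

This concludes the proof of the claim, and thus the proof of the theorem. □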

Remark 8.1.4 (History). Janson’s original proof was via analytic interpolation. The
above proof is based on Boppana and Spencer (1989) with a modification by Warnke
(personal communication). It has some similarities to the proof of the Lovász local lemma
from Section 6.1. The above proof incorporates ideas from Riordan and Warnke
(2015), who extended Janson’s inequality from principal up-sets to general up-sets.
Indeed, the above proof only requires that the events 𝐴𝑖 are increasing, whereas earlier
proofs of the result (e.g., the proof in Alon–Spencer) require the full assumption of
Setup 8.1.1, namely that each 𝐴𝑖 is an event of the form 𝑆𝑖 ⊆ 𝑅 (i.e., a principal
up-set).

Question 8.1.5
What is the probability that 𝐺 (𝑛, 𝑝) is triangle-free?

In Setup 8.1.1, let [𝑁] with 𝑁 = \binom{n}{2} be the set of edges of 𝐾𝑛 , and let
𝑆1 , . . . , 𝑆_{\binom{n}{3}} be 3-element sets where each 𝑆𝑖 is the edge-set of a triangle.
As in the second moment calculation in Section 4.2, we have

\[ \mu = \binom{n}{3} p^3 \asymp n^3 p^3 \quad \text{and} \quad \Delta \asymp n^4 p^5 \]

(where Δ is obtained by considering all appearances of a pair of triangles glued along
an edge).

If 𝑝 ≪ 𝑛−1/2 , then Δ = 𝑜(𝜇), in which case Janson inequality I (Theorem 8.1.2 and
Remark 8.1.3) gives the following.

Theorem 8.1.6
If 𝑝 = 𝑜(𝑛^{−1/2}), then
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) = e^{-(1+o(1))\mu} = e^{-(1+o(1)) n^3 p^3 / 6}. \]

Corollary 8.1.7
For a constant 𝑐 > 0,
\[ \lim_{n \to \infty} \mathbb{P}(G(n, c/n) \text{ is triangle-free}) = e^{-c^3/6}. \]

In fact, the number of triangles in 𝐺(𝑛, 𝑐/𝑛) converges to a Poisson distribution
with mean 𝑐^3/6. On the other hand, when 𝑝 ≫ 1/𝑛, the number of triangles is
asymptotically normal.
What if 𝑝 ≫ 𝑛^{−1/2}, so that Δ ≫ 𝜇? Then Janson inequality I does not tell us anything
nontrivial. Do we still expect the triangle-free probability to be 𝑒^{−(1+𝑜(1))𝜇}, or even
≤ 𝑒^{−𝑐𝜇}?
As noted earlier in Remark 7.2.3, another way to obtain a lower bound on the probability
of triangle-freeness is to consider the probability that 𝐺(𝑛, 𝑝) is empty (or contained in
some fixed complete bipartite graph), in which case we obtain
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \ge (1-p)^{\Theta(n^2)} = e^{-\Theta(n^2 p)} \]
(the second step assumes that 𝑝 is bounded away from 1). If 𝑝 ≫ 𝑛^{−1/2}, then the above
lower bound is better than the previous one: 𝑒^{−Θ(𝑛^2 𝑝)} ≫ 𝑒^{−(1+𝑜(1))𝜇}.
Nevertheless, we’ll still use Janson to bootstrap an upper bound on the triangle-free
probability. More generally, the next theorem works in the complement region of the
Janson inequality I, where now Δ ≥ 𝜇.

Theorem 8.1.8 (Janson inequality II)


Assuming Setup 8.1.1, if Δ ≥ 𝜇, then
\[ \mathbb{P}(X = 0) \le e^{-\mu^2/(2\Delta)}. \]

The proof idea is to apply the first Janson inequality to a randomly sampled subset
of events. This sampling technique might remind you of some earlier proofs, e.g.,
the proof of the crossing number inequality (Theorem 2.6.2), where we first proved a
“cheap bound” that worked in a more limited range, and then used sampling to obtain
a better bound.
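Proof. For each 𝑇 ⊆ [𝑘], let 𝑋𝑇 := \sum_{i \in T} 1_{A_i} denote the number of occurring events in
𝑇. Since 𝑋𝑇 ≤ 𝑋, Janson inequality I applied to the events indexed by 𝑇 gives
\[ \mathbb{P}(X = 0) \le \mathbb{P}(X_T = 0) \le e^{-\mu_T + \Delta_T/2}, \]
where
\[ \mu_T = \sum_{i \in T} \mathbb{P}(A_i) \quad \text{and} \quad \Delta_T = \sum_{(i,j) \in T^2:\, i \sim j} \mathbb{P}(A_i A_j). \]

Choose 𝑇 ⊆ [𝑘] randomly by including every element with probability 𝑞 ∈ [0, 1]
independently. Each 𝑖 lies in 𝑇 with probability 𝑞, and each pair 𝑖 ∼ 𝑗 lies in 𝑇 with
probability 𝑞^2, so

\[ \mathbb{E}\mu_T = q\mu \quad \text{and} \quad \mathbb{E}\Delta_T = q^2 \Delta, \]

and so
\[ \mathbb{E}(-\mu_T + \Delta_T/2) = -q\mu + q^2\Delta/2. \]
Thus there is some choice of 𝑇 ⊆ [𝑘] so that

\[ -\mu_T + \Delta_T/2 \le -q\mu + q^2\Delta/2, \]

so that
\[ \mathbb{P}(X = 0) \le e^{-q\mu + q^2\Delta/2} \]
for every 𝑞 ∈ [0, 1]. Since Δ ≥ 𝜇, we can set 𝑞 = 𝜇/Δ ∈ [0, 1] to get the result. □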
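The choice 𝑞 = 𝜇/Δ is the minimizer of the quadratic exponent, as the following short calculation spells out (a routine check, included only to make the last step explicit):

```latex
\[
\frac{d}{dq}\Big( -q\mu + \frac{q^2 \Delta}{2} \Big) = -\mu + q\Delta = 0
\iff q = \frac{\mu}{\Delta},
\]
\[
\Big( -q\mu + \frac{q^2 \Delta}{2} \Big)\Big|_{q = \mu/\Delta}
= -\frac{\mu^2}{\Delta} + \frac{\mu^2}{2\Delta}
= -\frac{\mu^2}{2\Delta}.
\]
```

The hypothesis Δ ≥ 𝜇 is exactly what guarantees that this minimizer is a valid sampling probability.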

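To summarize, the first two Janson inequalities tell us that

\[ \mathbb{P}(X = 0) \le
\begin{cases}
e^{-\mu + \Delta/2} & \text{if } \Delta < \mu, \\
e^{-\mu^2/(2\Delta)} & \text{if } \Delta \ge \mu.
\end{cases} \]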

Remark 8.1.9. If 𝜇 → ∞ and Δ ≪ 𝜇^2, then Janson inequality II implies P(𝑋 = 0) =
𝑜(1), which we knew from the second moment method. However, Janson’s inequality
gives an exponentially decaying tail bound, compared to only a polynomially decaying
tail via the second moment method. The exponential tail will be important in an
application below to determining the chromatic number of 𝐺(𝑛, 1/2).

Let us revisit the example of estimating the probability that 𝐺(𝑛, 𝑝) is triangle-free,
now in the regime 𝑝 ≫ 𝑛^{−1/2}. We have

\[ n^3 p^3 \asymp \mu \ll \Delta \asymp n^4 p^5. \]

So for large enough 𝑛, Janson inequality II tells us

\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \le e^{-\mu^2/(2\Delta)} = e^{-\Theta(n^2 p)}. \]

Since
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) \ge \mathbb{P}(G(n,p) \text{ is empty}) = (1-p)^{\binom{n}{2}} = e^{-\Theta(n^2 p)}, \]
where the final step assumes that 𝑝 is bounded away from 1, we conclude that
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) = e^{-\Theta(n^2 p)}. \]

We summarize the results below (strictly speaking we have not yet checked the case
𝑝 ≍ 𝑛−1/2 , which we can verify by applying Janson inequalities; note that the two
regimes below match at the boundary).

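Theorem 8.1.10
Suppose 𝑝 = 𝑝𝑛 ≤ 0.99. Then
\[ \mathbb{P}(G(n,p) \text{ is triangle-free}) =
\begin{cases}
\exp\big( -\Theta(n^2 p) \big) & \text{if } p \gtrsim n^{-1/2}, \\
\exp\big( -\Theta(n^3 p^3) \big) & \text{if } p \lesssim n^{-1/2}.
\end{cases} \]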


Remark 8.1.11. Sharper results are known. Here are some highlights.
1. The number of triangle-free graphs on 𝑛 vertices is 2^{(1+𝑜(1))𝑛^2/4}. In fact, an even
stronger statement is true: almost all (i.e., a 1 − 𝑜(1) fraction of) 𝑛-vertex triangle-free
graphs are bipartite (Erdős, Kleitman, and Rothschild 1976).
2. If 𝑚 ≥ 𝐶𝑛^{3/2} √(log 𝑛) for any constant 𝐶 > √3/4 (and this is best possible), then
almost all 𝑛-vertex 𝑚-edge triangle-free graphs are bipartite (Osthus, Prömel,
and Taraz 2003). This result has been extended to 𝐾𝑟-free graphs for every fixed
𝑟 (Balogh, Morris, Samotij, and Warnke 2016).
3. For 𝑛^{−1/2} ≪ 𝑝 ≪ 1, (Łuczak 2000)

− log P(𝐺(𝑛, 𝑝) is triangle-free) ∼ − log P(𝐺(𝑛, 𝑝) is bipartite) ∼ 𝑛^2 𝑝/4.

This result was generalized to general 𝐻-free graphs using the powerful recent
method of hypergraph containers (Balogh, Morris, and Samotij 2015).

8.2 Lower tails


Previously we looked at the probability of non-existence. Now we would like to
estimate lower tail probabilities. Here is a model problem.

Question 8.2.1
Fix a constant 0 < 𝛿 ≤ 1. Let 𝑋 be the number of triangles of 𝐺 (𝑛, 𝑝). Estimate

P(𝑋 ≤ (1 − 𝛿)E𝑋).

We will bootstrap Janson inequality I, P(𝑋 = 0) ≤ exp(−𝜇 + Δ/2), to an upper bound
on lower tail probabilities.

Theorem 8.2.2 (Janson inequality III)


Assume Setup 8.1.1. For any 0 ≤ 𝑡 ≤ 𝜇,

\[ \mathbb{P}(X \le \mu - t) \le \exp\Big( \frac{-t^2}{2(\mu + \Delta)} \Big). \]

Note that setting 𝑡 = 𝜇 we basically recover the first two Janson inequalities (up to an
unimportant constant factor in the exponent):

\[ \mathbb{P}(X = 0) \le \exp\Big( \frac{-\mu^2}{2(\mu + \Delta)} \Big). \tag{8.1} \]

(Note that this form of the inequality conveniently captures Janson inequalities I & II.)

Proof. (by Lutz Warnke, personal communication) We start the proof similarly to the proof
of the Chernoff bound, by applying Markov’s inequality to the moment generating function.
To that end, let 𝜆 ≥ 0, to be optimized later. Let

\[ q = 1 - e^{-\lambda}. \]

By Markov’s inequality,
\begin{align*}
\mathbb{P}(X \le \mu - t) &= \mathbb{P}\big( e^{-\lambda X} \ge e^{-\lambda(\mu - t)} \big) \\
&\le e^{\lambda(\mu - t)}\, \mathbb{E}\, e^{-\lambda X} \\
&= e^{\lambda(\mu - t)}\, \mathbb{E}[(1 - q)^{X}].
\end{align*}

For each 𝑖 ∈ [𝑘], let 𝑊𝑖 ∼ Bernoulli(𝑞) independently. Consider the random variable
\[ Y = \sum_{i=1}^{k} 1_{A_i} W_i. \]

Conditioned on the value of 𝑋, the probability that 𝑌 = 0 is (1 − 𝑞)^𝑋 (i.e., the probability
that 𝑊𝑖 = 0 for each of the 𝑋 events 𝐴𝑖 that occurred). Taking expectation over 𝑋, we
have
\[ \mathbb{P}(Y = 0) = \mathbb{E}[\mathbb{P}(Y = 0 \mid X)] = \mathbb{E}[(1 - q)^{X}]. \]

Note that 𝑌 fits within Setup 8.1.1 by introducing 𝑘 new elements to the ground set
[𝑁], where each new element is included according to 𝑊𝑖 , and enlarging each 𝑆𝑖 to
include this new element. The relevant parameters of 𝑌 are

\[ \mu_Y := \mathbb{E} Y = q\mu \]

and
\[ \Delta_Y := \sum_{(i,j):\, i \sim j} \mathbb{E}[1_{A_i} W_i 1_{A_j} W_j] = q^2 \Delta. \]

Then Janson inequality I applied to 𝑌 gives

\[ \mathbb{P}(Y = 0) \le e^{-\mu_Y + \Delta_Y/2} = e^{-q\mu + q^2\Delta/2}. \]

Therefore,
\[ \mathbb{E}[(1 - q)^{X}] = \mathbb{P}(Y = 0) \le e^{-q\mu + q^2\Delta/2}. \]
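Continuing the moment calculation at the beginning of the proof, and using that

\[ \lambda - \frac{\lambda^2}{2} \le q \le \lambda, \]
we have

\begin{align*}
\mathbb{P}(X \le \mu - t) &\le e^{\lambda(\mu - t)}\, \mathbb{E}[(1 - q)^{X}] \\
&\le \exp\Big( \lambda(\mu - t) - q\mu + q^2\Delta/2 \Big) \\
&\le \exp\Big( \lambda(\mu - t) - \Big( \lambda - \frac{\lambda^2}{2} \Big)\mu + \frac{\lambda^2}{2}\Delta \Big) \\
&= \exp\Big( -\lambda t + \frac{\lambda^2}{2}(\mu + \Delta) \Big).
\end{align*}

We optimize by setting 𝜆 = 𝑡/(𝜇 + Δ) to obtain the bound exp(−𝑡^2/(2(𝜇 + Δ))). □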
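The two elementary facts used in the last steps can be checked directly. The following lines record the bounds on 𝑞 = 1 − 𝑒^{−𝜆} for 𝜆 ≥ 0 and the value of the exponent at the optimizing 𝜆 (a routine verification, written out for convenience):

```latex
\[
1 - \lambda \le e^{-\lambda} \le 1 - \lambda + \frac{\lambda^2}{2}
\quad (\lambda \ge 0)
\qquad \Longrightarrow \qquad
\lambda - \frac{\lambda^2}{2} \le 1 - e^{-\lambda} \le \lambda,
\]
\[
\frac{d}{d\lambda}\Big( -\lambda t + \frac{\lambda^2}{2}(\mu + \Delta) \Big) = -t + \lambda(\mu + \Delta) = 0
\iff \lambda = \frac{t}{\mu + \Delta},
\]
\[
\Big( -\lambda t + \frac{\lambda^2}{2}(\mu + \Delta) \Big)\Big|_{\lambda = t/(\mu + \Delta)}
= -\frac{t^2}{\mu + \Delta} + \frac{t^2}{2(\mu + \Delta)}
= -\frac{t^2}{2(\mu + \Delta)}.
\]
```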

Example 8.2.3 (Lower tails for triangle counts). Let 𝑋 be the number of triangles in
𝐺(𝑛, 𝑝). We have 𝜇 ≍ 𝑛^3 𝑝^3 and Δ ≍ 𝑛^4 𝑝^5. Fix a constant 𝛿 ∈ (0, 1]. Let 𝑡 = 𝛿E𝑋.
We have

\[ \mathbb{P}(X \le (1 - \delta)\mathbb{E}X)
\le \exp\Big( -\Theta\Big( \frac{\delta^2 n^6 p^6}{n^3 p^3 + n^4 p^5} \Big) \Big)
= \begin{cases}
\exp\big( -\Theta_\delta(n^2 p) \big) & \text{if } p \gtrsim n^{-1/2}, \\
\exp\big( -\Theta_\delta(n^3 p^3) \big) & \text{if } p \lesssim n^{-1/2}.
\end{cases} \]

The bounds are tight up to a constant in the exponent, since

\[ \mathbb{P}(X \le (1 - \delta)\mathbb{E}X) \ge \mathbb{P}(X = 0)
= \begin{cases}
\exp\big( -\Theta(n^2 p) \big) & \text{if } p \gtrsim n^{-1/2}, \\
\exp\big( -\Theta(n^3 p^3) \big) & \text{if } p \lesssim n^{-1/2}.
\end{cases} \]
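The case split in the upper bound comes from comparing the two terms of 𝜇 + Δ ≍ 𝑛^3 𝑝^3 + 𝑛^4 𝑝^5. A brief calculation, with constants suppressed, makes the threshold 𝑝 ≍ 𝑛^{−1/2} visible:

```latex
\[
\frac{n^4 p^5}{n^3 p^3} = n p^2 \lesssim 1 \iff p \lesssim n^{-1/2},
\]
\[
\frac{\delta^2 n^6 p^6}{n^3 p^3 + n^4 p^5} \asymp
\begin{cases}
\dfrac{\delta^2 n^6 p^6}{n^3 p^3} = \delta^2 n^3 p^3 & \text{if } p \lesssim n^{-1/2}, \\[2ex]
\dfrac{\delta^2 n^6 p^6}{n^4 p^5} = \delta^2 n^2 p & \text{if } p \gtrsim n^{-1/2}.
\end{cases}
\]
```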


Example 8.2.4 (No corresponding Janson inequality for upper tails). Continuing
with 𝑋 being the number of triangles of 𝐺(𝑛, 𝑝), based on the above lower tail results,
we might expect P(𝑋 ≥ (1 + 𝛿)E𝑋) ≤ exp(−Θ𝛿(𝑛^2 𝑝)), but actually this is false!
By planting a clique of size Θ(𝑛𝑝), we can force 𝑋 ≥ (1 + 𝛿)E𝑋. Thus
\[ \mathbb{P}(X \ge (1 + \delta)\mathbb{E}X) \ge p^{\Theta_\delta(n^2 p^2)}, \]
which is much bigger than exp(−Θ(𝑛^2 𝑝)). The above is actually the truth (Kahn–DeMarco
2012 and Chatterjee 2012):

\[ \mathbb{P}(X \ge (1 + \delta)\mathbb{E}X) = p^{\Theta_\delta(n^2 p^2)} \quad \text{if } p \gtrsim \frac{\log n}{n}, \]
but the proof is much more intricate. Recent results allow us to understand the exact
constant in the exponent through new developments in large deviation theory. The
current state of knowledge is summarized below.

Theorem 8.2.5 (Harel, Mousset, Samotij 2022)


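Let 𝑋 be the number of triangles in 𝐺(𝑛, 𝑝) with 𝑝 = 𝑝𝑛 . If 𝑛^{−1/2} ≪ 𝑝 ≪ 1, then

\[ -\log \mathbb{P}(X \ge (1 + \delta)\mathbb{E}X) \sim \min\Big\{ \frac{\delta}{3}, \frac{\delta^{2/3}}{2} \Big\}\, n^2 p^2 \log(1/p), \]

and if 𝑛^{−1} log 𝑛 ≪ 𝑝 ≪ 𝑛^{−1/2}, then

\[ -\log \mathbb{P}(X \ge (1 + \delta)\mathbb{E}X) \sim \frac{\delta^{2/3}}{2}\, n^2 p^2 \log(1/p). \]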

Remark 8.2.6. The leading constants were determined by Lubetzky and Zhao (2017)
by solving an associated variational problem. Earlier results, starting with Chatterjee
and Varadhan (2011) and Chatterjee and Dembo (2016), developed large deviation
frameworks that gave the above theorem for sufficiently slowly decaying 𝑝 ≥ 𝑛^{−𝑐}.
For the corresponding problem for lower tails, see Kozma and Samotij (2023) for an
approach using relative entropy that reduces the rate problem to a variational problem.
The exact leading constant is known only for sufficiently small 𝛿 > 0, where the answer
is given by “replica symmetry”, meaning that the exponential rate is given by a uniform
decrement in edge densities for the random graph. In contrast, for 𝛿 close to 1, we
expect (though cannot prove) that the typical structure of a conditioned random graph
is close to a two-block model (Zhao 2017).

8.3 Chromatic number of a random graph

Question 8.3.1
What is the chromatic number of 𝐺 (𝑛, 1/2)?

In Section 4.4, we used the second moment method to find the clique number 𝜔 of
𝐺 (𝑛, 1/2). We saw that, with probability 1 − 𝑜(1), the clique number is concentrated
on two values, and in particular,

𝜔(𝐺 (𝑛, 1/2)) ∼ 2 log2 𝑛 whp.

The independence number 𝛼(𝐺) is the size of the largest independent set in 𝐺. It is
equal to the clique number of the complement of
𝐺. Since 𝐺(𝑛, 1/2) and its graph complement have the same distribution, we have
𝛼(𝐺(𝑛, 1/2)) ∼ 2 log2 𝑛 whp as well.
Using the following lower bound on the chromatic number 𝜒(𝐺):

\[ \chi(G) \ge \frac{|V(G)|}{\alpha(G)} \]

(since each color class is an independent set), we obtain that

\[ \chi(G(n, 1/2)) \ge \frac{(1 + o(1))\, n}{2 \log_2 n} \quad \text{whp.} \]

The following landmark theorem shows that the above lower bound on 𝜒(𝐺 (𝑛, 1/2))
is asymptotically tight.

Theorem 8.3.2 (Chromatic number of a random graph — Bollobás 1988)


With probability 1 − 𝑜(1),
\[ \chi(G(n, 1/2)) \sim \frac{n}{2 \log_2 n}. \]

Recall that 𝜔(𝐺(𝑛, 1/2)) is typically concentrated around the point 𝑘 where the expected
number of 𝑘-cliques \binom{n}{k} 2^{-\binom{k}{2}} is neither too large nor too close to zero. The
next lemma shows that the probability of having no 𝑘-clique drops very quickly when we
decrease 𝑘 even by a constant.

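Lemma 8.3.3
Let 𝑘0 = 𝑘0(𝑛) be the largest possible integer 𝑘 so that \binom{n}{k} 2^{-\binom{k}{2}} ≥ 1. Then
\[ \mathbb{P}(\alpha(G(n, 1/2)) < k_0 - 3) \le e^{-n^{2 - o(1)}}. \]

Note that there is a trivial lower bound of 2^{-\binom{n}{2}} coming from the complete graph
(equivalently, from the empty graph in the clique formulation used in the proof below).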

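Proof. Let us prove the equivalent claim

\[ \mathbb{P}(\omega(G(n, 1/2)) < k_0 - 3) \le e^{-n^{2 - o(1)}}. \]

Let 𝜇𝑘 := \binom{n}{k} 2^{-\binom{k}{2}}. For 𝑘 ∼ 𝑘0(𝑛) ∼ 2 log2 𝑛, we have

\[ \frac{\mu_{k+1}}{\mu_k} = \frac{\binom{n}{k+1}}{\binom{n}{k}}\, 2^{-k} \sim \frac{n}{k}\, 2^{-(2 + o(1)) \log_2 n} = \frac{1}{n^{1 - o(1)}}. \]

Let 𝑘 = 𝑘0 − 3 and apply Setup 8.1.1 for Janson inequality with 𝑋 being the number
of 𝑘-cliques. We have
\[ \mu = \mu_k > n^{3 - o(1)} \]
and (details of the computation omitted)

\[ \Delta \sim \mu^2 \frac{k^4}{n^2} \ge n^{4 - o(1)}. \]
So Δ > 𝜇 for sufficiently large 𝑛, and we can apply Janson inequality II, noting that
𝜇^2/(2Δ) ∼ 𝑛^2/(2𝑘^4) = 𝑛^{2−𝑜(1)}:
\[ \mathbb{P}(\omega(G(n, 1/2)) < k) = \mathbb{P}(X = 0) \le e^{-\mu^2/(2\Delta)} = e^{-n^{2 - o(1)}}. \qquad \square \]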
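To get a feel for the size of 𝑘0(𝑛), one can compute it exactly with integer arithmetic. The sketch below is an illustration only; the function name `k0` and the sample values of 𝑛 are choices made here. It relies on the ratio 𝜇_{𝑘+1}/𝜇_𝑘 = (𝑛 − 𝑘)/((𝑘 + 1) 2^𝑘) being decreasing in 𝑘, so that the set of 𝑘 with 𝜇𝑘 ≥ 1 is an initial interval.

```python
from math import comb, log2

def k0(n):
    """Largest k with C(n, k) * 2^(-C(k, 2)) >= 1, using exact integer arithmetic."""
    k = 1  # mu_1 = n >= 1 for every n >= 1
    # mu_k is unimodal in k, so {k : mu_k >= 1} is an interval starting at k = 1;
    # scanning upward until the condition fails finds its right endpoint.
    while comb(n, k + 1) >= 2 ** comb(k + 1, 2):
        k += 1
    return k

for n in (10**3, 10**6, 10**9):
    print(n, k0(n), 2 * log2(n))
```

The printed values of 𝑘0(𝑛) lie noticeably below 2 log2 𝑛 for these 𝑛, reflecting the lower-order terms hidden in 𝑘0(𝑛) ∼ 2 log2 𝑛.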

Proof of Theorem 8.3.2. The lower bound proof was discussed before the theorem
statement. For the upper bound we will give a strategy to properly color the random
graph with (1 + 𝑜(1))𝑛/(2 log2 𝑛) colors. We will proceed by taking out independent sets of
size ∼ 2 log2 𝑛 iteratively until 𝑜(𝑛/log 𝑛) vertices remain, at which point we can use
a different color for each remaining vertex.
Note that after taking out the first independent set of size ∼ 2 log2 𝑛, we cannot claim
that the remaining graph is still distributed as 𝐺(𝑛, 1/2). It is not. Our selection of the
vertices was dependent on the random graph. We are not allowed to “resample” the
edges after the first selection.
The strategy is to apply the previous lemma to see that every large enough subset of
vertices has an independent set of size ∼ 2 log2 𝑛.
Let 𝐺 ∼ 𝐺(𝑛, 1/2). Let 𝑚 = ⌊𝑛/(log 𝑛)^2⌋, say. For any set 𝑆 of 𝑚 vertices, the
induced subgraph 𝐺[𝑆] has the distribution of 𝐺(𝑚, 1/2). By Lemma 8.3.3, for

𝑘 = 𝑘0(𝑚) − 3 ∼ 2 log2 𝑚 ∼ 2 log2 𝑛,

we have
\[ \mathbb{P}(\alpha(G[S]) < k) \le e^{-m^{2 - o(1)}} = e^{-n^{2 - o(1)}}. \]

Taking a union bound over all \binom{n}{m} < 2^𝑛 such sets 𝑆,
\[ \mathbb{P}(\text{there is an $m$-vertex subset $S$ with } \alpha(G[S]) < k) < 2^n e^{-n^{2 - o(1)}} = o(1). \]

So the following statement is true in 𝐺 (𝑛, 1/2) with probability 1 − 𝑜(1):


(*) Every 𝑚-vertex subset contains a 𝑘-vertex independent set.
Assume that 𝐺 has property (*). Now we execute our strategy at the beginning of the
proof:
1. While ≥ 𝑚 vertices remain:
i. Find an independent set of size 𝑘, and let it form its own color class
ii. Remove these 𝑘 vertices
2. Color the remaining < 𝑚 vertices each with a new color.
The result is a proper coloring. The number of colors used is at most
\[ \frac{n}{k} + m \sim \frac{n}{2 \log_2 n}. \qquad \square \]
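The two-step strategy in the proof can be sketched in code. One caveat: the proof only uses the existence of a 𝑘-vertex independent set in every 𝑚-vertex subset (property (*)), and does not say how to find one efficiently. The sketch below therefore swaps in a simple greedy search, which typically finds independent sets of size only about log2 𝑛 in 𝐺(𝑛, 1/2), so it illustrates the peeling structure and not the optimal color count. The function names and the parameters `k=6`, `m=20` are hypothetical.

```python
import random

def greedy_independent_set(adj, candidates, k):
    """Greedily pick up to k pairwise non-adjacent vertices from `candidates`."""
    chosen = []
    for v in sorted(candidates):
        if all(u not in adj[v] for u in chosen):
            chosen.append(v)
            if len(chosen) == k:
                break
    return chosen

def color_by_peeling(adj, k, m):
    """Step 1: while at least m vertices remain, remove an independent set of size <= k
    and give it a new color. Step 2: give every leftover vertex its own color."""
    remaining = set(adj)
    coloring, color = {}, 0
    while remaining and len(remaining) >= m:
        for v in greedy_independent_set(adj, remaining, k):
            coloring[v] = color
            remaining.discard(v)
        color += 1
    for v in sorted(remaining):
        coloring[v] = color
        color += 1
    return coloring

def random_graph(n, p, seed=0):
    """Adjacency sets of a sample of G(n, p), using a seeded generator."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

adj = random_graph(200, 0.5)
coloring = color_by_peeling(adj, k=6, m=20)
print("colors used:", len(set(coloring.values())))
```

Each color class is independent by construction, so the output is always a proper coloring regardless of how good the independent-set search is.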

Exercises
1. 3-AP-free probability. Determine, for all 0 < 𝑝 ≤ 0.99 (𝑝 is allowed to depend
on 𝑛), the probability that [𝑛]_𝑝 does not contain a 3-term arithmetic progression,
up to a constant factor in the exponent. (The form of the answer should be similar
to the conclusion in class about the probability that 𝐺(𝑛, 𝑝) is triangle-free. See
3 for notation.)
2. Prove that with probability 1 − 𝑜(1), the size of the largest subset of vertices of
𝐺 (𝑛, 1/2) inducing a triangle-free subgraph is Θ(log 𝑛).
3. Nearly perfect triangle factor, again. Using Janson inequalities this time, give
another solution to Problem 11 in the following generality.
a) Prove that for every 𝜀 > 0, there exists 𝐶𝜀 > 0 such that with
probability 1 − 𝑜(1), 𝐺(𝑛, 𝐶𝜀 𝑛^{−2/3}) contains at least (1/3 − 𝜀)𝑛 vertex-
disjoint triangles.
b) (Optional) Compare the dependence of the optimal 𝐶𝜀 on 𝜀 you obtain
using the method in Problem 11 versus this problem (don’t worry about
leading constant factors).
4. ★ Threshold for extensions. Show that for every constant $C > 16/5$, if $n^2 p^5 >
C \log n$, then with probability $1 - o(1)$, every edge of $G(n, p)$ is contained in a
$K_4$.
(Be careful, this event is not increasing, and so it is insufficient to just prove the result for one
specific 𝑝.)

5. Lower tails of small subgraph counts. Fix a graph $H$ and $\delta \in (0, 1]$. Let $X_H$ denote
the number of copies of $H$ in $G(n, p)$. Prove that for all $n$ and $0 < p < 0.99$,

$$\mathbb{P}(X_H \le (1 - \delta)\mathbb{E}X_H) = e^{-\Theta_{H,\delta}(\Phi_H)} \quad \text{where} \quad \Phi_H := \min_{H' \subseteq H :\, e(H') > 0} n^{v(H')} p^{e(H')}.$$

Here the hidden constants in Θ𝐻,𝛿 may depend on 𝐻 and 𝛿 (but not on 𝑛 and 𝑝).
6. ★ List chromatic number of a random graph. Show that the list chromatic number
of $G(n, 1/2)$ is $(1 + o(1)) \frac{n}{2\log_2 n}$ with probability $1 - o(1)$.

The list-chromatic number (also called choosability) of a graph 𝐺 is defined to be the minimum 𝑘
such that if every vertex of 𝐺 is assigned a list of 𝑘 acceptable colors, then there exists a proper
coloring of 𝐺 where every vertex is colored by one of its acceptable colors.


9 Concentration of Measure

9.1 Bounded differences inequality


Recall that the Chernoff bound allows us to prove exponential tail bounds for sums of
independent random variables. For example, if $Z$ is a sum of $n$ independent Bernoulli
random variables, then

$$\mathbb{P}(|Z - \mathbb{E}Z| \ge t) \le 2e^{-2t^2/n}.$$

In this chapter, we develop tools for proving similar tail bounds for other random
variables that do not necessarily arise as a sum of independent random variables.
The next theorem says:
A Lipschitz function of many independent random variables is concentrated.
We will prove the following important and useful result, known by several names:
McDiarmid’s inequality, Azuma–Hoeffding inequality, and bounded differences
inequality.

Theorem 9.1.1 (Bounded differences inequality)


Let $X_1 \in \Omega_1, \dots, X_n \in \Omega_n$ be independent random variables. Suppose $f \colon \Omega_1 \times \cdots \times
\Omega_n \to \mathbb{R}$ satisfies

$$|f(x_1, \dots, x_n) - f(x_1', \dots, x_n')| \le 1 \qquad (9.1)$$

whenever $(x_1, \dots, x_n)$ and $(x_1', \dots, x_n')$ differ on exactly one coordinate. Then the
random variable $Z = f(X_1, \dots, X_n)$ satisfies, for every $\lambda \ge 0$,

$$\mathbb{P}(Z - \mathbb{E}Z \ge \lambda) \le e^{-2\lambda^2/n} \quad \text{and} \quad \mathbb{P}(Z - \mathbb{E}Z \le -\lambda) \le e^{-2\lambda^2/n}.$$

In particular, we can apply the above inequality to $f(x_1, \dots, x_n) = x_1 + \cdots + x_n$ to
recover the Chernoff bound. The theorem tells us that the window of fluctuation of $Z$
has length $O(\sqrt{n})$.

Example 9.1.2 (Coupon collector). Let $s_1, \dots, s_n \in [n]$ be chosen uniformly and
independently at random. Denote the number of “missing” elements by

$$Z = |[n] \setminus \{s_1, \dots, s_n\}|.$$

Note that changing one of the $s_1, \dots, s_n$ changes $Z$ by at most 1, so we have

$$\mathbb{P}(|Z - \mathbb{E}Z| \ge \lambda) \le 2e^{-2\lambda^2/n},$$

with

$$\mathbb{E}Z = n\left(1 - \frac{1}{n}\right)^n \in \left[\frac{n-1}{e}, \frac{n}{e}\right].$$
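
As a sanity check of the coupon collector example, the following Python sketch simulates $Z$ and compares the empirical tail with the bound $2e^{-2\lambda^2/n}$. The parameter choices (`n = 400`, `lam = 20`, the seed, and the number of trials) are arbitrary assumptions made for illustration, and the function name `missing_count` is hypothetical.

```python
import math
import random

def missing_count(n, rng):
    """Z = number of elements of {0, ..., n-1} not hit by n uniform draws."""
    seen = {rng.randrange(n) for _ in range(n)}
    return n - len(seen)

n, trials, lam = 400, 2000, 20
rng = random.Random(1)
samples = [missing_count(n, rng) for _ in range(trials)]

exact_mean = n * (1 - 1 / n) ** n            # E Z = n (1 - 1/n)^n
empirical_mean = sum(samples) / trials
empirical_tail = sum(abs(z - exact_mean) >= lam for z in samples) / trials
tail_bound = 2 * math.exp(-2 * lam ** 2 / n)  # bounded differences inequality
```

In such experiments the empirical tail is typically far below the bound, which reflects that the bounded differences inequality uses only the worst-case effect of each coordinate.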

Theorem 9.1.1 holds more generally allowing the bounded difference to depend on the
coordinate.

Theorem 9.1.3 (Bounded differences inequality)


Let $X_1 \in \Omega_1, \dots, X_n \in \Omega_n$ be independent random variables. Suppose $f \colon \Omega_1 \times \cdots \times
\Omega_n \to \mathbb{R}$ satisfies

$$|f(x_1, \dots, x_n) - f(x_1', \dots, x_n')| \le c_i \qquad (9.2)$$

whenever $(x_1, \dots, x_n)$ and $(x_1', \dots, x_n')$ differ only on the $i$-th coordinate. Here
$c_1, \dots, c_n$ are constants. Then the random variable $Z = f(X_1, \dots, X_n)$ satisfies,
for every $\lambda \ge 0$,

$$\mathbb{P}(Z - \mathbb{E}Z \ge \lambda) \le \exp\left(\frac{-2\lambda^2}{c_1^2 + \cdots + c_n^2}\right)$$

and

$$\mathbb{P}(Z - \mathbb{E}Z \le -\lambda) \le \exp\left(\frac{-2\lambda^2}{c_1^2 + \cdots + c_n^2}\right).$$

We will prove these inequalities using martingales.

9.2 Martingale concentration inequalities

Definition 9.2.1
A martingale is a random real sequence 𝑍0 , 𝑍1 , . . . such that for every 𝑍𝑛 , E|𝑍𝑛 | < ∞
and
E[𝑍𝑛+1 |𝑍0 , . . . , 𝑍𝑛 ] = 𝑍𝑛 .

(To be more formal, we should talk about filtrations of a probability space . . . )

Example 9.2.2 (Random walks with independent steps). If $(X_i)_{i \ge 0}$ is a sequence of
independent random variables with $\mathbb{E}X_i = 0$ for all $i$, then the partial sums $Z_n = \sum_{i \le n} X_i$
form a martingale.

Example 9.2.3 (Betting strategy). Consider betting on a sequence of fair coin tosses. After
each round, you are allowed to change your bet. Let 𝑍𝑛 be your balance after the 𝑛-th round.
Then 𝑍𝑛 is always a martingale regardless of your strategy.
Originally, the term “martingale” referred to the betting strategy where one doubles
the bet each time until the first win and then stops betting. Then, with probability 1,
𝑍𝑛 = 1 for all sufficiently large 𝑛. (Why does this “free money” strategy not actually
work?)

The next example is especially important to us.

Example 9.2.4 (Doob martingale). Let $X_1, \dots, X_n$ be a random sequence (not
necessarily independent, though they often are independent in practice). Consider a
function $f(X_1, \dots, X_n)$. Let $Z_i$ be the expected value of $f$ after “revealing” (exposing)
$X_1, \dots, X_i$, i.e.,

$$Z_i = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_i].$$

So $Z_i$ is the expected value of the random variable $Z = f(X_1, \dots, X_n)$ after seeing the
first $i$ arguments, and letting the remaining arguments be random. Then $Z_0, \dots, Z_n$ is
a martingale (why?). It satisfies $Z_0 = \mathbb{E}Z$ (a non-random quantity) and $Z_n = Z$ (the
random variable that we care about), thereby offering a way to interpolate between
the two.

Example 9.2.5 (Edge-exposure martingale). We can reveal the random graph $G(n, p)$
by first fixing an order on all unordered pairs of $[n]$ and then revealing in order whether
each pair is an edge. For any graph parameter $f(G)$ we can produce a martingale
$Z_0, Z_1, \dots, Z_{\binom{n}{2}}$ where $Z_i$ is the conditional expectation of $f(G(n, p))$ after revealing
whether there are edges for the first $i$ pairs of vertices. See Figure 9.1 for an example.

Example 9.2.6 (Vertex-exposure martingale). Similar to the previous example,
except that we now first fix an order on the vertex set, and, at the 𝑖-th step, with 0 ≤ 𝑖 ≤ 𝑛,
we reveal all edges whose endpoints are contained in the first 𝑖 vertices. See Figure 9.1
for an example.
Sometimes it is better to use the edge-exposure martingale and sometimes it is better to
use the vertex-exposure martingale. It depends on the application. There is a trade-off
between the length of the martingale and the control on the bounded differences.

The main result is that a martingale with bounded differences must be concen-
trated. The following fundamental result is called Azuma’s inequality or the Azuma–
Hoeffding inequality.

Figure 9.1: The edge-exposure martingale (left) and vertex-exposure martingale
(right) for the chromatic number of $G(n, 1/2)$ with $n = 3$. The
martingale is obtained by starting at the leftmost point, and splitting
at each branch with equal probability.
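
The values in Figure 9.1 can be recomputed directly from the definition of the Doob martingale. The following Python sketch is an illustration only; the function names and the particular ordering of pairs in `PAIRS` are assumptions made here. It enumerates all graphs on 3 vertices, using that a graph on 3 vertices has chromatic number 1 with no edges, 3 with all three edges, and 2 otherwise.

```python
from fractions import Fraction
from itertools import product

PAIRS = [(0, 1), (0, 2), (1, 2)]  # a fixed order on the pairs of 3 vertices

def chromatic_number(edge_bits):
    """Chromatic number of a graph on 3 vertices with the given edge indicators."""
    m = sum(edge_bits)
    return 1 if m == 0 else (3 if m == 3 else 2)

def conditional_expectation(revealed):
    """E[chi(G(3, 1/2)) | the first len(revealed) pairs of PAIRS are as given]."""
    rest = len(PAIRS) - len(revealed)
    total = sum(chromatic_number(revealed + tail)
                for tail in product((0, 1), repeat=rest))
    return Fraction(total, 2 ** rest)

# Edge exposure: possible values of Z_i after revealing i pairs, i = 0..3.
edge_levels = [sorted({conditional_expectation(bits)
                       for bits in product((0, 1), repeat=i)})
               for i in range(len(PAIRS) + 1)]

# Vertex exposure: after i vertices, the pairs inside {0, ..., i-1} are known,
# i.e. 0, 0, 1, 3 pairs for i = 0, 1, 2, 3 (PAIRS is ordered accordingly).
revealed_counts = [0, 0, 1, 3]
vertex_levels = [edge_levels[c] for c in revealed_counts]
```

Here `edge_levels` lists the values 2, then 7/4 and 9/4, then 3/2, 2, 5/2, and finally 1, 2, 3, matching the left panel; the vertex-exposure martingale skips the level with two revealed pairs.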

Theorem 9.2.7 (Azuma’s inequality)


Let 𝑍0 , 𝑍1 , . . . , 𝑍𝑛 be a martingale satisfying

|𝑍𝑖 − 𝑍𝑖−1 | ≤ 1 for each 𝑖 ∈ [𝑛].

Then for every $\lambda > 0$,

$$\mathbb{P}(Z_n - Z_0 \ge \lambda\sqrt{n}) \le e^{-\lambda^2/2}.$$

Note that this is the same bound that we derived in Chapter 5 for $Z_n = X_1 + \cdots + X_n$
where the $X_i \in \{-1, 1\}$ are uniform and iid.
More generally, allowing different bounds on different steps of the martingale, we have
the following.

Theorem 9.2.8 (Azuma’s inequality)


Let 𝑍0 , 𝑍1 , . . . , 𝑍𝑛 be a martingale satisfying

|𝑍𝑖 − 𝑍𝑖−1 | ≤ 𝑐𝑖 for each 𝑖 ∈ [𝑛].

For any $\lambda > 0$,

$$\mathbb{P}(Z_n - Z_0 \ge \lambda) \le \exp\left(\frac{-\lambda^2}{2(c_1^2 + \cdots + c_n^2)}\right).$$

The above formulations of Azuma’s inequality can be used to recover the bounded
differences inequality (Theorems 9.1.1 and 9.1.3) up to a usually unimportant constant
in the exponent. To obtain the exact statement of Theorem 9.1.3, we state the following
strengthening of Azuma’s inequality. (You are welcome to ignore the next statement
if you do not care about the constant factor in the exponent, and really, you should
not care.)

Theorem 9.2.9 (Azuma’s inequality for Doob martingales)


Consider a Doob martingale $Z_i = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_i]$ as in Example 9.2.4.
Suppose, conditioned on any value of $(X_1, \dots, X_{i-1})$, the possibilities for $Z_i$ lie in
an interval of length $c_i$ (here $c_i$ is non-random, but the location of the interval may
depend on $X_1, \dots, X_{i-1}$). Then for any $\lambda > 0$,

$$\mathbb{P}(Z_n - Z_0 \ge \lambda) \le \exp\left(\frac{-2\lambda^2}{c_1^2 + \cdots + c_n^2}\right).$$

Remark 9.2.10. Applying the inequality to the martingale with terms $-Z_n$, we obtain
the following lower tail bound:

$$\mathbb{P}(Z_n - Z_0 \le -\lambda) \le \exp\left(\frac{-2\lambda^2}{c_1^2 + \cdots + c_n^2}\right).$$

And we can put them together as

$$\mathbb{P}(|Z_n - Z_0| \ge \lambda) \le 2\exp\left(\frac{-2\lambda^2}{c_1^2 + \cdots + c_n^2}\right).$$

Remark 9.2.11. Theorem 9.2.8 is a special case of Theorem 9.2.9, since we can take
$(X_1, \dots, X_n) = (Z_1, \dots, Z_n)$ and $f(X_1, \dots, X_n) = X_n$. Note that the $|Z_i - Z_{i-1}| \le c_i$
condition in Theorem 9.2.8 implies that $Z_i$ lies in an interval of length $2c_i$ if we
condition on $(X_1, \dots, X_{i-1})$, so Theorem 9.2.9 applies with $2c_i$ in place of $c_i$ and
gives exactly the bound in Theorem 9.2.8.

Lemma 9.2.12 (Hoeffding’s lemma)


Let $X$ be a real random variable contained in an interval of length $\ell$. Suppose $\mathbb{E}X = 0$.
Then

$$\mathbb{E}[e^X] \le e^{\ell^2/8}.$$

Proof. Suppose $X \in [a, b]$ with $a \le 0 \le b$ and $b - a = \ell$. Then since $e^x$ is convex,
using a linear upper bound on the interval $[a, b]$, we have (note that the RHS below is
linear in $x$)

$$e^x \le \frac{b - x}{b - a} e^a + \frac{x - a}{b - a} e^b \quad \text{for all } x \in [a, b].$$

Since $\mathbb{E}X = 0$, we obtain

$$\mathbb{E}e^X \le \frac{b}{b - a} e^a + \frac{-a}{b - a} e^b.$$

Let $p = -a/(b - a)$. Then $a = -p\ell$ and $b = (1 - p)\ell$. So

$$\log \mathbb{E}e^X \le \log\left((1 - p)e^{-p\ell} + pe^{(1-p)\ell}\right) = -p\ell + \log(1 - p + pe^\ell).$$

Fix $p \in [0, 1]$. Let

$$\varphi(\ell) := -p\ell + \log(1 - p + pe^\ell).$$

It remains to show that $\varphi(\ell) \le \ell^2/8$ for all $\ell \ge 0$, which follows from $\varphi(0) = \varphi'(0) = 0$
and $\varphi''(\ell) \le 1/4$ for all $\ell \ge 0$, as

$$\varphi''(\ell) = \frac{p}{(1 - p)e^{-\ell} + p}\left(1 - \frac{p}{(1 - p)e^{-\ell} + p}\right) \le \frac{1}{4},$$

since $t(1 - t) \le 1/4$ for all $t \in [0, 1]$. □
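
The key inequality $\varphi(\ell) \le \ell^2/8$ in the proof of Hoeffding’s lemma can be checked numerically. The following Python sketch is illustrative only; the grid of values for $p$ and $\ell$ is an arbitrary choice, and the function names are hypothetical. It also confirms that $\varphi$ is the logarithm of the moment generating function of the extremal two-point distribution on $\{-p\ell, (1-p)\ell\}$.

```python
import math

def log_mgf_two_point(p, ell):
    """log E[e^X] for X = -p*ell w.p. 1-p and X = (1-p)*ell w.p. p (so E X = 0)."""
    return math.log((1 - p) * math.exp(-p * ell) + p * math.exp((1 - p) * ell))

def phi(p, ell):
    """The function phi(ell) = -p*ell + log(1 - p + p*e^ell) from the proof."""
    return -p * ell + math.log(1 - p + p * math.exp(ell))

grid_p = [i / 20 for i in range(21)]     # p in [0, 1]
grid_ell = [j / 4 for j in range(41)]    # ell in [0, 10]
# Largest value of phi(ell) - ell^2/8 on the grid; the lemma says it is <= 0.
worst_gap = max(phi(p, ell) - ell ** 2 / 8 for p in grid_p for ell in grid_ell)
```

Up to floating-point error, `worst_gap` is zero and is attained at $\ell = 0$, consistent with $\varphi(0) = 0$.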

Proof of Theorem 9.2.9. Let $t \ge 0$ be some constant to be decided later. Conditional
on any values of $(X_1, \dots, X_{i-1})$, the random variable $Z_i - Z_{i-1}$ has mean zero and lies
in an interval of length $c_i$. So Lemma 9.2.12, applied to $t(Z_i - Z_{i-1})$, gives

$$\mathbb{E}[e^{t(Z_i - Z_{i-1})} \mid X_1, \dots, X_{i-1}] \le e^{t^2 c_i^2/8}.$$

Then the moment generating function satisfies, for each $i \in [n]$,

$$\mathbb{E}[e^{t(Z_i - Z_0)}] = \mathbb{E}\left[e^{t(Z_i - Z_{i-1})} e^{t(Z_{i-1} - Z_0)}\right]
= \mathbb{E}\left[\mathbb{E}\left[e^{t(Z_i - Z_{i-1})} \mid X_1, \dots, X_{i-1}\right] e^{t(Z_{i-1} - Z_0)}\right]
\le e^{t^2 c_i^2/8}\, \mathbb{E}\left[e^{t(Z_{i-1} - Z_0)}\right].$$

Iterating, we obtain

$$\mathbb{E}\left[e^{t(Z_n - Z_0)}\right] \le e^{t^2(c_1^2 + \cdots + c_n^2)/8}.$$

By Markov,

$$\mathbb{P}(Z_n - Z_0 \ge \lambda) \le e^{-t\lambda}\, \mathbb{E}\left[e^{t(Z_n - Z_0)}\right] \le e^{-t\lambda + \frac{t^2}{8}(c_1^2 + \cdots + c_n^2)}.$$

Setting $t = 4\lambda/(c_1^2 + \cdots + c_n^2)$ yields the theorem. □

Now we apply Azuma’s inequality to deduce the bounded differences inequality.

Proof of the bounded differences inequality (Theorem 9.1.3). Consider the Doob
martingale $Z_i = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_i]$. Since the $X_j$ are independent,
conditioned on any value of $(X_1, \dots, X_{i-1})$, changing the value of $X_i$ changes $Z_i$ by at
most $c_i$. So the hypothesis of Theorem 9.1.3 implies
that the hypothesis of Theorem 9.2.9 is satisfied. The same conclusion then follows. □

Remark 9.2.13. Azuma’s inequality (Theorem 9.2.9) is more versatile than the bounded
differences inequality (Theorem 9.1.3). For example, while changing 𝑋𝑖 might change 𝑓 (𝑋1 , . . . , 𝑋𝑛 ) by a lot in
the worst case over all possible (𝑋1 , . . . , 𝑋𝑛 ), it might not change it by much in expec-
tation over random choices of (𝑋𝑖+1 , . . . , 𝑋𝑛 ). And so the 𝑐𝑖 in Theorem 9.2.9 could
potentially be smaller than in Theorem 9.1.3. This will be useful in some applications,
including one that we will see later in the chapter.

9.3 Chromatic number of random graphs


Concentration of the chromatic number
Even before Bollobás (1988) showed that $\chi(G(n, 1/2)) \sim \frac{n}{2\log_2 n}$ whp (Theorem 8.3.2),
it was already known, using the bounded differences inequality, that the chromatic
number of a random graph must be concentrated in an $O(\sqrt{n})$ window around its mean.
The following application shows that one can prove concentration around the mean
without even knowing where the mean is!

Theorem 9.3.1 (Shamir and Spencer 1987)


For every $\lambda \ge 0$, the chromatic number of a random graph $Z = \chi(G(n, p))$ satisfies

$$\mathbb{P}(|Z - \mathbb{E}Z| \ge \lambda\sqrt{n - 1}) \le 2e^{-2\lambda^2}.$$

Proof. Let $V = [n]$, and consider each vertex-labeled graph as an element of $\Omega_2 \times
\cdots \times \Omega_n$ where $\Omega_i = \{0, 1\}^{i-1}$ and its coordinates correspond to edges whose larger
endpoint is $i$ (cf. the vertex-exposure martingale, Example 9.2.6). If two graphs $G$
and $G'$ differ only in edges incident to one vertex $v$, then $|\chi(G) - \chi(G')| \le 1$ since,
given a proper coloring of $G$ using $\chi(G)$ colors, one can obtain a proper coloring
of $G'$ using $\chi(G) + 1$ colors by using a new color for $v$. Theorem 9.1.3, applied with
$n - 1$ coordinates and all $c_i = 1$, implies the result. □

Remark 9.3.2 (Non-concentration of the chromatic number). Heckel (2021) showed
that $\chi(G(n, 1/2))$ is not concentrated on any interval of length $n^c$ for any constant
$c < 1/4$. This was the opposite of what most experts believed. It has been
conjectured that the width of the window of concentration fluctuates between $n^{1/4+o(1)}$ and
$n^{1/2+o(1)}$ depending on $n$.


Clique number, again


Previously in Section 8.3, we used Janson inequalities to prove the following ex-
ponentially small bound on the probability that 𝐺 (𝑛, 1/2) has small clique num-
ber. This was a crucial step in the proof of Bollobás’ theorem (Theorem 8.3.2) that
𝜒(𝐺 (𝑛, 1/2)) ∼ 𝑛/(2 log2 𝑛) whp. Here we give a different proof using the bounded
difference inequality instead of Janson inequalities. The proof below in fact was the
original approach of Bollobás (1988).

Lemma 9.3.3 (Same as Lemma 8.3.3)

Let $k_0 = k_0(n) \sim 2\log_2 n$ be the largest positive integer so that $\binom{n}{k_0} 2^{-\binom{k_0}{2}} \ge 1$. Then

$$\mathbb{P}(\omega(G(n, 1/2)) < k_0 - 3) = e^{-n^{2-o(1)}}.$$

A naive approach might be to estimate the number of 𝑘-cliques in 𝐺 (this is the
approach taken with Janson inequalities). The issue is that this quantity can change too
much when we modify one edge of 𝐺. We will use a more subtle function on graphs.
Note that we only care about whether there exists a 𝑘-clique or not.

Proof. Let $k = k_0 - 3$. Let $Y = Y(G)$ be the maximum size of a set of pairwise
edge-disjoint $k$-cliques in $G$. Then as a function of $G$, $Y$ changes by at most 1 if we change $G$ by
one edge. (Note that the same does not hold if we change $G$ by one vertex, e.g., when
$G$ consists of many $k$-cliques glued along a common vertex.)
So by the bounded differences inequality, for $G \sim G(n, 1/2)$,

$$\mathbb{P}(\omega(G) < k) = \mathbb{P}(Y = 0) \le \mathbb{P}(Y - \mathbb{E}Y \le -\mathbb{E}Y) \le \exp\left(-\frac{2(\mathbb{E}Y)^2}{\binom{n}{2}}\right). \qquad (9.1)$$

It remains to show that $\mathbb{E}Y \ge n^{2-o(1)}$. Create an auxiliary graph $\mathcal{H}$ whose vertices
are the $k$-cliques in $G$, with a pair of $k$-cliques adjacent if they overlap in at least 2
vertices. Then $Y = \alpha(\mathcal{H})$. We would like to lower bound the independence number
of this graph based on its average degree. Here are two ways to proceed:
1. Recall the Caro–Wei inequality (Corollary 2.3.5): for every graph $H$ with average
degree $d$, we have

$$\alpha(H) \ge \sum_{v \in V(H)} \frac{1}{1 + d_v} \ge \frac{|V(H)|}{1 + d} = \frac{|V(H)|^2}{|V(H)| + 2|E(H)|}.$$

2. Let $H'$ be the induced subgraph obtained from $H$ by keeping every vertex
independently with probability $q$. We have

$$\alpha(H) \ge \alpha(H') \ge |V(H')| - |E(H')|.$$

Taking expectations of both sides, and noting that $\mathbb{E}|V(H')| = q|V(H)|$ and
$\mathbb{E}|E(H')| = q^2|E(H)|$ by linearity of expectations, we have

$$\alpha(H) \ge q|V(H)| - q^2|E(H)| \quad \text{for every } q \in [0, 1].$$

Provided that $|E(H)| \ge |V(H)|/2$, we can take $q = |V(H)|/(2|E(H)|) \in [0, 1]$
and obtain

$$\alpha(H) \ge \frac{|V(H)|^2}{4|E(H)|} \quad \text{if } |E(H)| \ge \frac{1}{2}|V(H)|.$$

(This method allows us to recover Turán’s theorem up to a factor of 2, whereas
the Caro–Wei inequality recovers Turán’s theorem exactly. For the present
application, we do not care about these constant factors.)
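
To compare the two lower bounds on the independence number, here is a small Python sketch. It is an illustration under assumptions: the example graph (a 6-cycle with one chord) and all function names are chosen here, and the brute-force `independence_number` is only feasible for tiny graphs.

```python
from itertools import combinations

def num_edges(adj):
    return sum(len(nbrs) for nbrs in adj.values()) // 2

def caro_wei_bound(adj):
    """Sum over v of 1/(1 + deg v), a lower bound on the independence number."""
    return sum(1 / (1 + len(nbrs)) for nbrs in adj.values())

def average_degree_bound(adj):
    """|V|^2 / (|V| + 2|E|), the weaker form in terms of the average degree."""
    n = len(adj)
    return n * n / (n + 2 * num_edges(adj))

def deletion_bound(adj):
    """|V|^2 / (4|E|) from random vertex sampling; requires |E| >= |V|/2."""
    n, m = len(adj), num_edges(adj)
    if 2 * m < n:
        raise ValueError("bound requires |E| >= |V|/2")
    return n * n / (4 * m)

def independence_number(adj):
    """Brute-force independence number; only for tiny graphs."""
    verts = list(adj)
    for size in range(len(verts), 0, -1):
        for subset in combinations(verts, size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0

# Example: the 6-cycle 0-1-2-3-4-5-0 with the chord {0, 3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

alpha = independence_number(adj)
bounds = (caro_wei_bound(adj), average_degree_bound(adj), deletion_bound(adj))
```

On this example the Caro–Wei sum is the strongest of the three bounds and the sampling bound the weakest, in line with the remark about constant factors.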
By a second moment argument (details again omitted, like in the proofs of Theorem 4.4.2
and Lemma 8.3.3), we have, with probability $1 - o(1)$, that the number of
$k$-cliques in $G$ is

$$|V(\mathcal{H})| \sim \mathbb{E}|V(\mathcal{H})| = \binom{n}{k} 2^{-\binom{k}{2}} = n^{3-o(1)},$$

while the expected number of unordered pairs of edge-overlapping $k$-cliques in $G$ is

$$\mathbb{E}|E(\mathcal{H})| = n^{4-o(1)}.$$

Thus, with probability $1 - o(1)$, we can apply either of the above lower bounds on
independent sets to obtain

$$\mathbb{E}Y \gtrsim \mathbb{E}\frac{|V(\mathcal{H})|^2}{|E(\mathcal{H})|} \gtrsim \mathbb{E}\frac{n^{6-o(1)}}{|E(\mathcal{H})|} \ge \frac{n^{6-o(1)}}{\mathbb{E}|E(\mathcal{H})|} = n^{2-o(1)}.$$

Together with (9.1), this completes the proof that $\mathbb{P}(\omega(G) < k) = e^{-n^{2-o(1)}}$. □

Chromatic number of sparse random graphs


Let us show that $\chi(G(n, p))$ is concentrated on a constant-size window if $p$ is small
enough.


Theorem 9.3.4 (Shamir and Spencer 1987)


Let $\alpha > 5/6$ be fixed. Then for $p < n^{-\alpha}$, $\chi(G(n, p))$ is concentrated on four values
with probability $1 - o(1)$. That is, there exists $u = u(n, p)$ such that, as $n \to \infty$,

P(𝑢 ≤ 𝜒(𝐺 (𝑛, 𝑝)) ≤ 𝑢 + 3) = 1 − 𝑜(1).

Proof. It suffices to show that for all $\varepsilon > 0$, there exists $u = u(n, p, \varepsilon)$ so that, provided
$p < n^{-\alpha}$ and $n$ is sufficiently large,

P(𝑢 ≤ 𝜒(𝐺 (𝑛, 𝑝)) ≤ 𝑢 + 3) ≥ 1 − 3𝜀.

Let 𝑢 be the least integer so that

P( 𝜒(𝐺 (𝑛, 𝑝)) ≤ 𝑢) > 𝜀.

Now we make a clever choice of a random variable.


Let 𝐺 ∼ 𝐺 (𝑛, 𝑝). Let 𝑌 = 𝑌 (𝐺) denote the minimum size of a subset 𝑆 ⊆ 𝑉 (𝐺) such
that 𝐺 − 𝑆 is 𝑢-colorable.
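Note that $Y$ changes by at most 1 if we change the edges around one vertex of $G$. Thus,
by applying Theorem 9.1.1 with respect to vertex-exposure (Example 9.2.6), we have

$$\mathbb{P}(Y \le \mathbb{E}Y - \lambda\sqrt{n}) \le e^{-2\lambda^2} \quad \text{and} \quad \mathbb{P}(Y \ge \mathbb{E}Y + \lambda\sqrt{n}) \le e^{-2\lambda^2}.$$

We choose $\lambda = \lambda(\varepsilon) > 0$ so that $e^{-2\lambda^2} = \varepsilon$.
First, we use the lower tail bound to show that $\mathbb{E}Y$ must be small. We have

$$e^{-2\lambda^2} = \varepsilon < \mathbb{P}(\chi(G) \le u) = \mathbb{P}(Y = 0) = \mathbb{P}(Y \le \mathbb{E}Y - \mathbb{E}Y) \le \exp\left(\frac{-2(\mathbb{E}Y)^2}{n}\right).$$

Thus

$$\mathbb{E}Y \le \lambda\sqrt{n}.$$

Next, we apply the upper tail bound to show that $Y$ is rarely large. We have

$$\mathbb{P}(Y \ge 2\lambda\sqrt{n}) \le \mathbb{P}(Y \ge \mathbb{E}Y + \lambda\sqrt{n}) \le e^{-2\lambda^2} = \varepsilon.$$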

Each of the following three events occurs with probability at least $1 - \varepsilon$, for large enough
$n$:
• By the above argument, there is some $S \subseteq V(G)$ with $|S| \le 2\lambda\sqrt{n}$ such that $G - S$
may be properly $u$-colored.
• By the next lemma, one can properly 3-color $G[S]$.
• $\chi(G) \ge u$ (by the minimality of $u$ at the beginning of the proof).
Thus, with probability at least $1 - 3\varepsilon$, all three events occur, and so we have $u \le
\chi(G) \le u + 3$. □

Lemma 9.3.5
Fix $\alpha > 5/6$ and $C$. Let $p \le n^{-\alpha}$. Then with probability $1 - o(1)$ every subset of at
most $C\sqrt{n}$ vertices of $G(n, p)$ can be properly 3-colored.

Proof. Let $G \sim G(n, p)$. Suppose that some set of at most $C\sqrt{n}$ vertices induces a subgraph
that is not 3-colorable. Choose a minimum size
$T \subseteq V(G)$ so that the induced subgraph $G[T]$ is not 3-colorable; then $|T| \le C\sqrt{n}$.
We see that $G[T]$ has minimum degree at least 3, since if $\deg_{G[T]}(x) < 3$, then $G[T - x]$
cannot be 3-colorable either (if it were, then we could extend the coloring to $x$), contradicting
the minimality of $T$.
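Thus $G[T]$ has at least $3|T|/2$ edges. The probability that $G$ has some induced subgraph
on $t \le C\sqrt{n}$ vertices and $\ge 3t/2$ edges is, by a union bound (recall $\binom{n}{k} \le (ne/k)^k$),

$$\le \sum_{t=4}^{C\sqrt{n}} \binom{n}{t} \binom{\binom{t}{2}}{3t/2} p^{3t/2}
\le \sum_{t=4}^{C\sqrt{n}} \left(\frac{ne}{t}\right)^t \left(\frac{te}{3}\right)^{3t/2} n^{-3t\alpha/2}
\le \sum_{t=4}^{C\sqrt{n}} \left(O(n^{1 - 3\alpha/2}\sqrt{t})\right)^t
\le \sum_{t=4}^{C\sqrt{n}} \left(O(n^{1 - 3\alpha/2 + 1/4})\right)^t.$$

The sum is $o(1)$ provided that $\alpha > 5/6$. □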
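
To see the role of the threshold $5/6$ concretely, the following Python sketch evaluates the exponent $1 - 3\alpha/2 + 1/4$ and the union bound sum itself for one choice of parameters. This is a sketch under assumptions: the function names are hypothetical, the parameters ($n = 10^6$, $\alpha = 0.9$, $C = 1$) are arbitrary, and $3t/2$ is rounded up to an integer number of edges. Log-binomials are computed with `math.lgamma` to avoid overflow.

```python
import math

def log_binom(a, b):
    """Natural log of the binomial coefficient C(a, b), for 0 <= b <= a."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def union_bound(n, alpha, C):
    """Sum over 4 <= t <= C*sqrt(n) of C(n, t) * C(C(t, 2), k) * p^k,
    where p = n^(-alpha) and k = ceil(3t/2)."""
    log_p = -alpha * math.log(n)
    total = 0.0
    for t in range(4, int(C * math.sqrt(n)) + 1):
        k = (3 * t + 1) // 2
        log_term = log_binom(n, t) + log_binom(t * (t - 1) // 2, k) + k * log_p
        total += math.exp(log_term)
    return total

def decay_exponent(alpha):
    """Exponent of n in the final bound (O(n^(1 - 3*alpha/2 + 1/4)))^t."""
    return 1 - 1.5 * alpha + 0.25

value = union_bound(10 ** 6, 0.9, 1.0)
```

The exponent is negative exactly when $\alpha > 5/6$, which is what makes every term of the sum, and hence the sum, tend to zero.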

Remark 9.3.6. Theorem 9.3.4 was subsequently improved (by a refinement of the
above techniques) by Łuczak (1991) and Alon and Krivelevich (1997). We now know
that the chromatic number of $G(n, n^{-\alpha})$ has two-point concentration for all $\alpha > 1/2$.

9.4 Isoperimetric inequalities: a geometric perspective

We shall explore the following connection between probability and geometry, which are
two sides of the same coin:

    Probability                             Geometry
    Concentration of Lipschitz functions    Isoperimetric inequalities


Milman recognized the importance of the concentration of measure phenomenon,
which he heavily promoted in the 1970s. The subject has since been
extensively developed. It plays a central role in probability theory and the analysis of
Banach spaces, and it has also been influential in theoretical computer science.

Euclidean space
The classic isoperimetric theorem in $\mathbb{R}^n$ says that among all subsets of $\mathbb{R}^n$ of given
volume, the ball has the smallest surface area. (The word “isoperimetric” refers to
fixing the perimeter; equivalently we fix the surface area and ask to maximize volume.)
This result (at least in two dimensions) was known to the Greeks, but rigorous proofs
were found only towards the end of the nineteenth century.
Let (𝑋, 𝑑 𝑋 ) be a metric space. Let 𝐴 ⊆ 𝑋. For any 𝑥 ∈ 𝑋, write 𝑑 𝑋 (𝑥, 𝐴) :=
inf 𝑎∈𝐴 𝑑 𝑋 (𝑥, 𝑎) for the distance from 𝑥 to 𝐴. Denote the set of all points within
distance 𝑡 from 𝐴 by
𝐴𝑡 := {𝑥 ∈ 𝑋 : 𝑑 𝑋 (𝑥, 𝐴) ≤ 𝑡} (9.1)
This is also known as the radius-𝒕 neighborhood of 𝑨. One can visualize 𝐴𝑡 by
“expanding” 𝐴 by distance 𝑡.

Theorem 9.4.1 (Isoperimetric inequality in Euclidean space)


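Let $A \subseteq \mathbb{R}^n$ be a measurable set, and let $B \subseteq \mathbb{R}^n$ be a ball with $\operatorname{vol}(A) = \operatorname{vol}(B)$. Then, for
all $t \ge 0$,
$$\operatorname{vol} A_t \ge \operatorname{vol} B_t.$$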

Remark 9.4.2. A clean way to prove the above inequality is via the Brunn–Minkowski
theorem.
Classically, the isoperimetric inequality is stated as (here 𝜕 𝐴 is the boundary of 𝐴)

vol𝑛−1 𝜕 𝐴 ≥ vol𝑛−1 𝜕𝐵.

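These two formulations are equivalent. Indeed, assuming Theorem 9.4.1, we have

$$\operatorname{vol}_{n-1} \partial A = \frac{d}{dt}\bigg|_{t=0} \operatorname{vol}_n A_t = \lim_{t \to 0} \frac{\operatorname{vol} A_t - \operatorname{vol} A}{t}
\ge \lim_{t \to 0} \frac{\operatorname{vol} B_t - \operatorname{vol} B}{t} = \operatorname{vol}_{n-1} \partial B.$$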
Conversely, we can obtain the neighborhood version from the boundary version by
integrating (noting that 𝐵𝑡 is always a ball).


The cube
We have an analogous result in the cube $\{0, 1\}^n$ with respect to Hamming distance. In the
Hamming cube, Harper’s theorem gives the exact result. Below, for $A \subseteq \{0, 1\}^n$, we
write $A_t$ as in (9.1) for $X = \{0, 1\}^n$ and $d_X$ being the Hamming distance.

Theorem 9.4.3 (Isoperimetric inequality in the Hamming cube; Harper 1966)


Let 𝐴 ⊆ {0, 1}𝑛 . Let 𝐵 ⊆ {0, 1}𝑛 be a Hamming ball with | 𝐴| ≥ |𝐵|. Then for all
𝑡 ≥ 0,
| 𝐴𝑡 | ≥ |𝐵𝑡 |.

Remark 9.4.4. The above statement is tight when $A$ has the same size as a Hamming
ball, i.e., when $|A| = \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{k}$ for some integer $k$. Actually, more is true.
For any value of | 𝐴| and 𝑡, the size of 𝐴𝑡 is minimized by taking 𝐴 to be an initial
segment of {0, 1}𝑛 according to the simplicial ordering: first sort by Hamming weight,
and for ties, sort by lexicographic order. For more on this topic, particularly extremal
set theory, see the book Combinatorics by Bollobás (1986).

Combining the isoperimetric inequality on the cube with the Chernoff bound, we obtain the
following surprising consequence. Suppose we start with just half of the cube, and then expand
it by a bit (recall that the diameter of the cube is 𝑛, and we will be expanding it by
𝑜(𝑛)); then the resulting expansion occupies nearly all of the cube.

Theorem 9.4.5 (Rapid expansion from half to 1 − 𝜀 )


Let $t > 0$. For every $A \subseteq \{0, 1\}^n$ with $|A| \ge 2^{n-1}$, we have

$$|A_t| > (1 - e^{-2t^2/n})2^n.$$

Proof. Let $B = \{x \in \{0, 1\}^n : \operatorname{weight}(x) < n/2\}$, so that $|B| \le 2^{n-1} \le |A|$. Then by
Harper’s theorem (Theorem 9.4.3),

$$|A_t| \ge |B_t| = |\{x \in \{0, 1\}^n : \operatorname{weight}(x) < n/2 + t\}| > (1 - e^{-2t^2/n})2^n$$

by the Chernoff bound. □
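
The size of the expanded half-cube in this proof can be computed exactly from binomial coefficients. The following Python sketch is only an illustration: the choice $n = 100$, the integer values of $t$, and the function name are assumptions. It compares the fraction of the cube with weight less than $n/2 + t$ against the lower bound $1 - e^{-2t^2/n}$.

```python
import math

def half_cube_expansion(n, t):
    """|{x in {0,1}^n : weight(x) < n/2 + t}| for an integer t >= 0."""
    return sum(math.comb(n, w) for w in range(n + 1) if w < n / 2 + t)

n = 100
# Each row: (t, exact fraction of the cube covered, Chernoff lower bound).
rows = [(t, half_cube_expansion(n, t) / 2 ** n, 1 - math.exp(-2 * t * t / n))
        for t in (1, 5, 10, 15)]
```

Already at $t = 15$, which is small compared with the diameter $n = 100$, the expansion covers all but a tiny fraction of the cube.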

In fact, using the above, we can deduce that even if we start with a small fraction (e.g.,
1%) of the cube, and expand it slightly, then we would cover most of the cube.


Theorem 9.4.6 (Rapid expansion from 𝜀 to 1 − 𝜀)

Let $\varepsilon > 0$ and $C = \sqrt{2\log(1/\varepsilon)}$. If $A \subseteq \{0, 1\}^n$ with $|A| \ge \varepsilon 2^n$, then

$$|A_{C\sqrt{n}}| \ge (1 - \varepsilon)2^n.$$

First proof via Harper’s isoperimetric inequality. Let $t = \sqrt{\log(1/\varepsilon)n/2}$ so that $e^{-2t^2/n} =
\varepsilon$. Applying Theorem 9.4.5 to $A' = \{0, 1\}^n \setminus A_t$, we see that $|A'| < 2^{n-1}$ (or else
$|A'_t| > (1 - \varepsilon)2^n$, so $A'_t$ would intersect $A$, which is impossible since the distance
between $A$ and $A'$ is greater than $t$). Thus $|A_t| \ge 2^{n-1}$, and then applying Theorem 9.4.5
to $A_t$ yields $|A_{2t}| \ge (1 - \varepsilon)2^n$, where $2t = C\sqrt{n}$. □

Let us give another proof of Theorem 9.4.6 without using Harper’s exact isoperimetric
theorem in the Hamming cube, and instead use the bounded differences inequality that
we proved earlier.

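Second proof via the bounded differences inequality. Pick a uniform random $x \in
\{0, 1\}^n$ and let $X = \operatorname{dist}(x, A)$. Note that $X$ changes by at most 1 if a single
coordinate of $x$ is changed. Applying the bounded differences inequality, Theorem 9.1.1,
we have the lower tail

$$\mathbb{P}(X - \mathbb{E}X \le -\lambda) \le e^{-2\lambda^2/n} \quad \text{for all } \lambda \ge 0.$$

We have $X = 0$ if and only if $x \in A$, so

$$\varepsilon \le \mathbb{P}(x \in A) = \mathbb{P}(X = 0) = \mathbb{P}(X - \mathbb{E}X \le -\mathbb{E}X) \le e^{-2(\mathbb{E}X)^2/n}.$$

Thus

$$\mathbb{E}X \le \sqrt{\frac{\log(1/\varepsilon)n}{2}} = \frac{C\sqrt{n}}{2}.$$

Now we apply the upper tail of the bounded differences inequality

$$\mathbb{P}(X - \mathbb{E}X \ge \lambda) \le e^{-2\lambda^2/n} \quad \text{for all } \lambda \ge 0$$

to yield

$$\mathbb{P}(x \notin A_{C\sqrt{n}}) = \mathbb{P}(X > C\sqrt{n}) \le \mathbb{P}\left(X \ge \mathbb{E}X + \frac{C\sqrt{n}}{2}\right) \le \varepsilon. \qquad □$$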
2

Isoperimetry versus concentration


The above two proofs illustrate the link between geometric isoperimetric inequalities
and probabilistic concentration inequalities. Let us now state a simple result that
formalizes this connection.

Definition 9.4.7 (Lipschitz functions)


Given two metric spaces (𝑋, 𝑑 𝑋 ) and (𝑌 , 𝑑𝑌 ), we say that a function 𝑓 : 𝑋 → 𝑌 is
𝑪-Lipschitz if

𝑑𝑌 ( 𝑓 (𝑥), 𝑓 (𝑥 ′)) ≤ 𝐶𝑑 𝑋 (𝑥, 𝑥 ′) for all 𝑥, 𝑥 ′ ∈ 𝑋.

So the bounded differences inequality applies to Lipschitz functions with respect to
the Hamming distance. In particular, it tells us that if $f \colon \{0, 1\}^n \to \mathbb{R}$ is 1-Lipschitz
(with respect to the Hamming distance on $\{0, 1\}^n$), it must be concentrated around its
mean with respect to the uniform measure on $\{0, 1\}^n$:

$$\mathbb{P}(|f - \mathbb{E}f| \ge n\lambda) \le 2e^{-2n\lambda^2}.$$

So $f$ is almost constant almost everywhere. This is a counterintuitive high-dimensional
phenomenon.

Theorem 9.4.8 (Equivalence between notions of concentration of measure)


Let 𝑡, 𝜀 ≥ 0. In a probability space (Ω, P) equipped with a metric, the following are
equivalent:
(a) (Expansion/approximate isoperimetry) If 𝐴 ⊆ Ω with P( 𝐴) ≥ 1/2, then

P( 𝐴𝑡 ) ≥ 1 − 𝜀.

(b) (Concentration of Lipschitz functions) If 𝑓 : Ω → R is 1-Lipschitz and 𝑚 ∈ R


satisfies P( 𝑓 ≤ 𝑚) ≥ 1/2, then

P( 𝑓 > 𝑚 + 𝑡) ≤ 𝜀.

Remark 9.4.9 (Median). In (b), we often take $m$ to be a median of $f$, which is defined
to be a value such that $\mathbb{P}(f \ge m) \ge 1/2$ and $\mathbb{P}(f \le m) \ge 1/2$ (the median always exists
but is not necessarily unique). For distributions with good concentration properties,
the median and mean are usually close to each other. For example, we leave it as an
exercise to check that if there is some $m$ such that $\mathbb{P}(|f - m| \ge t) \le 2e^{-t^2/2}$ for all
$t \ge 0$, then the mean and the medians of $f$ all lie within $O(1)$ of $m$.

Proof. (a) ⟹ (b): Let $A = \{x \in \Omega : f(x) \le m\}$. So $\mathbb{P}(A) \ge 1/2$. Since $f$ is
1-Lipschitz, we have $f(x) \le m + t$ for all $x \in A_t$. Thus by (a)

$$\mathbb{P}(f > m + t) \le 1 - \mathbb{P}(A_t) \le \varepsilon.$$

(b) ⟹ (a): Let $f(x) = \operatorname{dist}(x, A)$ and $m = 0$. Then $\mathbb{P}(f \le 0) \ge \mathbb{P}(A) \ge 1/2$. Also $f$
is 1-Lipschitz. So by (b),

$$1 - \mathbb{P}(A_t) = \mathbb{P}(f > m + t) \le \varepsilon. \qquad □$$

Informally, we say that a space (or rather, a sequence of spaces) has concentration
of measure if 𝜀 decays rapidly as a function of 𝑡 in the above theorem (the notion of
“Lévy family” makes this precise). Earlier we saw that the Hamming cube exhibits
concentration of measure. Other notable spaces with concentration of measure include
the sphere, Gauss space, orthogonal and unitary groups, positively-curved manifolds,
and the symmetric group.

The sphere
We discuss analogs of the concentration of measure phenomenon in high dimensional
geometry. This is a rich and beautiful subject. An excellent introduction to this topic is
the survey An Elementary Introduction to Modern Convex Geometry by Ball (1997).
Recall that the isoperimetric inequality in $\mathbb{R}^n$ says:
If $A \subseteq \mathbb{R}^n$ has the same measure as a ball $B$, then $\operatorname{vol}(A_t) \ge \operatorname{vol}(B_t)$ for all
$t \ge 0$.
Analogous exact isoperimetric inequalities are known in several other spaces. We
already saw it for the boolean cube (Theorem 9.4.3). The cases of the sphere and Gauss
space are particularly noteworthy. The following theorem is due to Lévy (∼1919).

Theorem 9.4.10 (Lévy’s isoperimetric inequality on the sphere)


On a sphere in R𝑛 , let 𝐴 be a measurable subset and 𝐵 a spherical cap with vol𝑛−1 ( 𝐴) =
vol𝑛−1 (𝐵). Then for all 𝑡 ≥ 0,

vol𝑛−1 ( 𝐴𝑡 ) ≥ vol𝑛−1 (𝐵𝑡 ).

We have the following upper bound estimate on the size of spherical caps.

Theorem 9.4.11 (Upper bound on spherical cap size)


Let $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ be a uniform random unit vector in $\mathbb{R}^n$. Then for any $\varepsilon \ge 0$,

$$\mathbb{P}(x_1 \ge \varepsilon) \le e^{-n\varepsilon^2/2}.$$
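
The spherical cap bound can be illustrated by Monte Carlo simulation. The sketch below is an illustration only: the dimension, the sample size, the seed, and the function names are arbitrary choices made here. It samples uniform unit vectors by normalizing standard Gaussian vectors, which gives the uniform distribution on the sphere by rotational invariance of the Gaussian.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniform point on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def cap_fraction(n, eps, trials, rng):
    """Monte Carlo estimate of P(x_1 >= eps) for a uniform unit vector x in R^n."""
    hits = sum(random_unit_vector(n, rng)[0] >= eps for _ in range(trials))
    return hits / trials

rng = random.Random(2)
n, trials = 50, 4000
# eps -> (empirical cap measure, upper bound e^{-n eps^2 / 2})
estimates = {eps: (cap_fraction(n, eps, trials, rng), math.exp(-n * eps * eps / 2))
             for eps in (0.1, 0.2, 0.3)}
```

In such experiments the empirical cap measure sits well below $e^{-n\varepsilon^2/2}$, as the bound is not tight for fixed $n$ and $\varepsilon$.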

The following proof (including figures) is taken from Tkocz (2012), building on the
method by Ball (1997).


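Proof. Let $C$ denote the spherical cap consisting of unit vectors $x$ with $x_1 \ge \varepsilon$. Write
$\widetilde{C}$ for the convex hull of $C$ with the origin, i.e., the conical sector spanned by $C$. The
idea is to contain $\widetilde{C}$ in a ball of radius $r \le e^{-\varepsilon^2/2}$. Writing $B(r)$ for a ball of radius $r$
in $\mathbb{R}^n$, we have

$$\frac{\operatorname{vol}_{n-1} C}{\operatorname{vol}_{n-1} S^{n-1}} = \frac{\operatorname{vol}_n \widetilde{C}}{\operatorname{vol}_n B(1)} \le \frac{\operatorname{vol}_n B(r)}{\operatorname{vol}_n B(1)} = r^n \le e^{-\varepsilon^2 n/2}.$$

Case 1: $\varepsilon \in [0, 1/\sqrt{2}]$.

[Figure: the cone $\widetilde{C}$ inside the unit ball $B^n(0, 1)$.]

As shown in the figure, $\widetilde{C}$ is contained in a ball of radius $r = \sqrt{1 - \varepsilon^2} \le e^{-\varepsilon^2/2}$.

Case 2: $\varepsilon \in [1/\sqrt{2}, 1]$.

[Figure: the cone $\widetilde{C}$ contained in a ball of radius $r$.]

Then $\widetilde{C}$ is contained in a ball of radius $r$ as shown in the figure. Using similar triangles, we
find that $r/(1/2) = 1/\varepsilon$. So $r = 1/(2\varepsilon) \le e^{-\varepsilon^2/2}$, where the final inequality is equivalent
to $e^{x^2/2} \le 2x$ for all $x \in [1/\sqrt{2}, 1]$, which, by convexity, only needs to be checked at the
endpoints of the interval. □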

Combining the above two theorems, we deduce the following concentration of measure
results.


Corollary 9.4.12 (Concentration of measure on the sphere)


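Let $A$ be a measurable subset of the unit sphere in $\mathbb{R}^n$, equipped with the metric
inherited from $\mathbb{R}^n$. If $A \subseteq S^{n-1}$ has $\operatorname{vol}_{n-1}(A)/\operatorname{vol}_{n-1}(S^{n-1}) \ge 1/2$, then

$$\frac{\operatorname{vol}_{n-1}(A_t)}{\operatorname{vol}_{n-1}(S^{n-1})} \ge 1 - e^{-nt^2/4}.$$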

Remark 9.4.13. See §14 in Barvinok’s notes for a proof of the sharper estimate with
$e^{-nt^2/4}$ replaced by $\sqrt{\pi/8}\, e^{-nt^2/2}$, where now we are using the geodesic distance on the
sphere.

Corollary 9.4.14 (Concentration of measure on the sphere)


Let $S^{n-1}$ denote the unit sphere in $\mathbb{R}^n$. If $f \colon S^{n-1} \to \mathbb{R}$ is a 1-Lipschitz measurable
function, then there is some real $m$ so that, for the uniform measure on the sphere,

$$\mathbb{P}(|f - m| > t) \le 2e^{-nt^2/4}.$$

Informally: every Lipschitz function on a high dimensional sphere is almost constant


almost everywhere.
This is a rather counterintuitive high-dimensional phenomenon.

Gauss space
Another related setting is the Gauss space, which is $\mathbb{R}^n$ equipped with the
probability measure $\gamma_n$ induced by the Gaussian random vector whose coordinates are $n$ iid
standard normals, i.e., the normal random vector in $\mathbb{R}^n$ with covariance matrix $I_n$. The
probability density function of $\gamma_n$ at $x \in \mathbb{R}^n$ is $(2\pi)^{-n/2} e^{-|x|^2/2}$. The metric on $\mathbb{R}^n$ is the
usual Euclidean metric.
What would an isoperimetric inequality in Gauss space look like?
Although earlier examples of isoperimetric optimizers were all balls, for the Gauss
space, the answer is actually a half-space, i.e., the set of points on one side of some hyperplane.
The Gaussian isoperimetric inequality, below, was first shown independently by Borell
(1975) and Sudakov and Tsirel’son (1974).

Theorem 9.4.15 (Gaussian isoperimetric inequality)


If 𝐴, 𝐻 ⊆ R𝑛 , 𝐻 a half-space, and 𝛾( 𝐴) = 𝛾(𝐻), then 𝛾( 𝐴𝑡 ) ≥ 𝛾(𝐻𝑡 ) for all 𝑡 ≥ 0,
where 𝛾 is the Gauss measure.


If $H = \{x_1 \le 0\}$, then $H_t = \{x_1 \le t\}$, which has Gaussian measure $\ge 1 - e^{-t^2/2}$. Thus:

Corollary 9.4.16 (Concentration of measure for Gaussian vectors)


If 𝑓 : R𝑛 → R is 1-Lipschitz, and 𝑍 is a vector of i.i.d. standard normals, then 𝑋 = 𝑓 (𝑍)
satisfies, for some 𝑚,
\[ \mathbb{P}(|X - m| \ge t) \le 2e^{-t^2/2}. \]

Here is a rather handwavy explanation of why the half-space is a reasonable answer.

Consider $\{-1,1\}^{mn}$, where both $m$ and $n$ are large. Let us group the coordinates of $\{-1,1\}^{mn}$ into $n$ blocks of length $m$. The sum of entries in each block (after normalizing by $\sqrt{m}$) approximates a standard normal random variable by the central limit theorem.
In the Hamming cube, Harper's theorem tells us that Hamming balls are isoperimetric optimizers. Since a Hamming ball in $\{-1,1\}^{mn}$ is given by all points whose sum of coordinates is below a certain threshold, we should look at the analogous subset in the Gauss space, which would then consist of all points whose sum of coordinates is below a certain threshold. The set of all points whose coordinate sum is below a certain threshold is a half-space. Note also that the Gaussian measure is radially symmetric.
The sphere as approximately a sum of independent Gaussians. The Gauss space is
a nice space to work with because a standard normal vector simultaneously possesses
two useful properties (and it is essentially the only such random vector to have both
properties):
(a) Rotational invariance
(b) Independence of coordinates
The squared length of a random Gaussian vector is $Z_1^2 + \cdots + Z_n^2$ with iid $Z_1, \dots, Z_n \sim N(0,1)$. It has mean $n$ and an $O(\sqrt{n})$ window of concentration (e.g., by a straightforward adaptation of the Chernoff bound proof). Since $\sqrt{n + O(\sqrt{n})} = \sqrt{n} + O(1)$, the length of a Gaussian vector is concentrated in an $O(1)$ window around $\sqrt{n}$ (the concentration can also be deduced from the above corollary for $f(x) = |x|$). So most of the distribution in the Gauss space lies within a constant distance of a sphere of radius $\sqrt{n}$. Due to rotational invariance, we see that a Gaussian distribution approximates the uniform distribution on the sphere of radius $\sqrt{n}$ in high dimensions. In other words:

random Gaussian vector ≈ $\sqrt{n}$ · random unit vector.

Random Gaussian vectors often yield easier calculations due to coordinate independence, and so they often give an accessible way to analyze random unit vectors.
Note how a half-space in the Gauss space intersects the sphere in a spherical cap, with both the half-space and the spherical cap being isoperimetric optimizers in their respective spaces.
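
The heuristic that a Gaussian vector has length close to $\sqrt{n}$ can be checked numerically. The following is a minimal simulation sketch using only the Python standard library; the function names, sample sizes, and seed are illustrative choices and not part of the notes.

```python
import math
import random

def gaussian_norms(n, trials, seed=0):
    """Return the Euclidean lengths of `trials` standard Gaussian vectors in R^n."""
    rng = random.Random(seed)
    norms = []
    for _ in range(trials):
        sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        norms.append(math.sqrt(sq))
    return norms

def summarize(n, trials=200):
    """Return (average length, largest deviation of a length from sqrt(n))."""
    norms = gaussian_norms(n, trials)
    mean = sum(norms) / trials
    spread = max(abs(r - math.sqrt(n)) for r in norms)
    return mean, spread

if __name__ == "__main__":
    # The deviation from sqrt(n) should stay of constant order as n grows.
    for n in (10, 100, 1000):
        mean, spread = summarize(n)
        print(f"n={n:5d}  sqrt(n)={math.sqrt(n):7.3f}  mean|Z|={mean:7.3f}  max dev={spread:.3f}")
```

The printed maximum deviation should stay bounded while $\sqrt{n}$ grows, in line with the $O(1)$ window described above.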


Sub-Gaussian distributions
We introduce some terminology that captures notions we have seen so far. It will also
be convenient for later discussions.

Definition 9.4.17 (Sub-Gaussian distribution)


We say that a random variable $X$ is $K$-subGaussian about its mean if
\[ \mathbb{P}(|X - \mathbb{E}X| \ge t) \le 2e^{-t^2/K^2} \quad\text{for all } t \ge 0. \]

Remark 9.4.18. This definition is not standard. Some places say 𝜎 2 -subGaussian for
what we mean by 𝜎-subGaussian.

Usually we will not worry about constant factors. Thus, saying that a family of random variables $X_n$ is $O(K_n)$-subGaussian about its mean is the same as saying that there exist constants $C, c > 0$ such that
\[ \mathbb{P}(|X_n - \mathbb{E}X_n| \ge t) \le Ce^{-ct^2/K_n^2} \quad\text{for all } t \ge 0 \text{ and } n. \]

Also note that, up to changing the constants 𝑐, 𝐶, the definition does not change if we
replace E𝑋𝑛 by a median of 𝑋𝑛 above.

Example 9.4.19. The concentration inequalities so far can be rephrased in terms of subGaussian distributions. Below is a summary of results of the form: if $X$ is a random point drawn from the given space, and $f$ is a 1-Lipschitz function, then $f(X)$ is $K$-subGaussian.

space                         distance    $K$-subGaussian with $K =$   reference
$\{0,1\}^n$                   Hamming     $O(\sqrt{n})$                bounded diff. ineq. (Thm. 9.1.1)
$S^{n-1}$                     Euclidean   $O(1/\sqrt{n})$              Lévy concentration (Cor. 9.4.14)
Gauss space $\mathbb{R}^n$    Euclidean   $O(1)$                       Gaussian isoperimetric ineq. (Cor. 9.4.16)

The following lemma shows that for subGaussian random variables, it does not matter
much if we define the tails around its median, mean, or root-mean-square.


Lemma 9.4.20 (Median vs. mean for subGaussian distributions)


There exists a constant $C > 0$ so that the following holds for any real random variable $X$ satisfying, for some constants $m$ and $K$,
\[ \mathbb{P}(|X - m| \ge t) \le 2e^{-t^2/K^2} \quad\text{for all } t \ge 0. \]

(a) Every median $\mathbb{M}X$ of $X$ satisfies
\[ |\mathbb{M}X - m| \le CK. \]

(b) The mean of $X$ satisfies
\[ |\mathbb{E}X - m| \le CK. \]

(c) For any $p \ge 1$, writing $\|X\|_p := (\mathbb{E}|X|^p)^{1/p}$ for the $L^p$ norm of $X$,
\[ \bigl| \|X\|_p - m \bigr| \le CK\sqrt{p}. \]

(d) For every constant $A$ there exists a constant $c > 0$ so that if $|m' - m| \le AK$, then
\[ \mathbb{P}(|X - m'| \ge t) \le 2e^{-ct^2/K^2} \quad\text{for all } t \ge 0. \]

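Proof. By considering $X/K$ instead of $X$, we may assume that $K = 1$ for convenience.

(a) For any $t > \sqrt{2\log 2}$, we have $\mathbb{P}(|X - m| \ge t) \le 2e^{-t^2} < 1/2$. So every median of $X$ lies within $\sqrt{2\log 2}$ of $m$.
(b) We have
\[ |\mathbb{E}X - m| \le \mathbb{E}|X - m| = \int_0^\infty \mathbb{P}(|X - m| \ge t)\,dt \le \int_0^\infty 2e^{-t^2}\,dt = \sqrt{\pi}. \]

(c) Using the triangle inequality on the $L^p$ norm, we have
\[ \bigl| \|X\|_p - m \bigr| \le \|X - m\|_p = (\mathbb{E}|X - m|^p)^{1/p} = \Bigl( \int_0^\infty \mathbb{P}(|X - m|^p \ge t)\,dt \Bigr)^{1/p} \le \Bigl( \int_0^\infty 2e^{-t^{2/p}}\,dt \Bigr)^{1/p} = 2^{1/p}\, \Gamma\Bigl(1 + \frac{p}{2}\Bigr)^{1/p} = O(\sqrt{p}). \]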

(d) We can make $c$ small enough so that the right-hand side $2e^{-ct^2}$ is at least 1 for $t \le 2A$. For $t > 2A$, we have $|m' - m| \le A < t/2$, so
\[ \mathbb{P}(|X - m'| \ge t) \le \mathbb{P}(|X - m| \ge t/2) \le 2e^{-t^2/4}, \]
and it suffices to also take $c \le 1/4$. □

Remark 9.4.21 (Equivalent characterization of subGaussian distributions). Given a real random variable $X$, if any of the conditions below is true for some $K_i$, then the other conditions are true for some $K_j \le CK_i$, where $C$ is an absolute constant.
(a) (Tails) $\mathbb{P}(|X| \ge t) \le 2e^{-t^2/K_1^2}$ for all $t \ge 0$.
(b) (Moments) $\|X\|_{L^p} \le K_2\sqrt{p}$ for all $p \ge 1$.
(c) (MGF of $X^2$) $\mathbb{E}e^{X^2/K_3^2} \le 2$.
We leave the proofs as exercises. Also see §2.5.1 in the textbook High-Dimensional
Probability by Vershynin (2018), which gives a superb introduction to the subject.

Johnson–Lindenstrauss Lemma
Given a set of $N$ points in high-dimensional Euclidean space, the next result tells us that one can embed them in $O(\varepsilon^{-2}\log N)$ dimensions without changing any pairwise distance by more than a factor of $1 \pm \varepsilon$. This is known as dimension reduction. It is an important tool in many areas, from functional analysis to algorithms.

Theorem 9.4.22 (Johnson and Lindenstrauss 1982)


There exists a constant 𝐶 > 0 so that the following holds. Let 𝜀 > 0. Let 𝑋 be a set of
𝑁 points in R𝑚 . Then for any 𝑑 > 𝐶𝜀 −2 log 𝑁, there exists 𝑓 : 𝑋 → R𝑑 so that

(1 − 𝜀) |𝑥 − 𝑦| ≤ | 𝑓 (𝑥) − 𝑓 (𝑦)| ≤ (1 + 𝜀) |𝑥 − 𝑦| for all 𝑥, 𝑦 ∈ 𝑋.

Remark 9.4.23. Here the requirement $d > C\varepsilon^{-2}\log N$ on the dimension is optimal up to a constant factor (Larsen and Nelson 2017).

We will take $f$ to be $\sqrt{m/d}$ times an orthogonal projection onto a $d$-dimensional subspace chosen uniformly at random. The theorem then follows from the following lemma together with a union bound.


Lemma 9.4.24 (Random projection)


There exists a constant $c > 0$ so that the following holds. Let $m \ge d$ and let $P\colon \mathbb{R}^m \to \mathbb{R}^d$ denote the orthogonal projection onto the subspace spanned by the first $d$ coordinates. Let $z$ be a uniform random point on the unit sphere in $\mathbb{R}^m$. Let $y = Pz$ and $Y = |y|$. Then, for all $t \ge 0$,
\[ \mathbb{P}\Bigl( \Bigl| Y - \sqrt{\frac{d}{m}} \Bigr| \ge t \Bigr) \le 2e^{-cmt^2}. \]

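To prove Theorem 9.4.22, for each pair of distinct points $x, x' \in X$, set
\[ z = \frac{x - x'}{|x - x'|}, \quad\text{so that}\quad \sqrt{\frac{m}{d}}\, Y = \frac{|f(x) - f(x')|}{|x - x'|}, \]
where $Y$ here denotes the length of the projection of $z$ onto the random subspace. By rotational symmetry, the length of the projection of a fixed unit vector $z$ onto a uniform random $d$-dimensional subspace has the same distribution as $Y$ in the lemma. So setting $t = \varepsilon\sqrt{d/m}$, we find that
\[ \mathbb{P}\Bigl( \Bigl| \sqrt{\frac{m}{d}}\, Y - 1 \Bigr| \ge \varepsilon \Bigr) \le 2e^{-c\varepsilon^2 d} < 2N^{-cC}. \]

Provided that $C \ge 2/c$, we can take a union bound over all $\binom{N}{2} < N^2/2$ pairs of points of $X$ to show that with positive probability, the random $f$ works.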
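
As an illustration of dimension reduction, here is a small numerical sketch. Note the substitution: instead of the scaled orthogonal projection onto a uniformly random subspace used in the notes, it uses a matrix of iid Gaussian entries with variance $1/d$, a common stand-in that is easier to sample. All names, sizes, and seeds below are hypothetical choices for illustration.

```python
import math
import random

def random_gaussian_map(m, d, rng):
    """Return a d x m matrix with iid N(0, 1/d) entries.

    Assumption: this Gaussian matrix replaces the scaled random orthogonal
    projection from the text; it is not the same map.
    """
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(m)] for _ in range(d)]

def apply_map(matrix, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def max_distortion(points, images):
    """Largest value of | |f(x)-f(y)| / |x-y| - 1 | over all pairs of points."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = dist(images[i], images[j]) / dist(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst

def demo(num_points=15, m=400, d=150, seed=1):
    rng = random.Random(seed)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(num_points)]
    matrix = random_gaussian_map(m, d, rng)
    images = [apply_map(matrix, p) for p in points]
    return max_distortion(points, images)

if __name__ == "__main__":
    print("max relative distortion:", demo())
```

Increasing `d` should shrink the reported distortion roughly like $1/\sqrt{d}$, matching the $\varepsilon^{-2}$ dependence in the theorem.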

Proof of the lemma. We have $z_1^2 + \cdots + z_m^2 = 1$ and each $z_i$ has the same distribution, so $\mathbb{E}[z_i^2] = 1/m$ for each $i$. Thus
\[ \mathbb{E}[Y^2] = \mathbb{E}[z_1^2 + \cdots + z_d^2] = \frac{d}{m}. \]
Note that $P$ is 1-Lipschitz, and hence so is $z \mapsto |Pz|$ on the unit sphere. By Lévy's concentration of measure theorem on the sphere, letting $\mathbb{M}Y$ denote the median of $Y$,
\[ \mathbb{P}(|Y - \mathbb{M}Y| \ge t) \le 2e^{-mt^2/4}. \]
The result then follows by Lemma 9.4.20, using that $\|Y\|_2 = \sqrt{d/m}$. □

Here is a cute application of Johnson–Lindenstrauss (this is related to a homework problem on the Chernoff bound).

Corollary 9.4.25
There is a constant $c > 0$ so that for every positive integer $d$, there is a set of $e^{c\varepsilon^2 d}$ points in $\mathbb{R}^d$ whose pairwise distances are in $[1 - \varepsilon, 1 + \varepsilon]$.


Proof. Apply Theorem 9.4.22 to the vertices of a regular simplex with unit edge lengths and $N$ vertices in $\mathbb{R}^{N-1}$. This yields $N$ points in $\mathbb{R}^d$ for $d = O(\varepsilon^{-2}\log N)$ with pairwise distances in $[1 - \varepsilon, 1 + \varepsilon]$. □

9.5 Talagrand’s inequality


Talagrand (1995) developed a powerful concentration inequality. It is applicable to
many combinatorial optimization problems on independent random inputs. The most
general form of Talagrand's inequality can be somewhat difficult to grasp, so we start by discussing a special case with an easier geometric statement. However, to obtain the full power of Talagrand's inequality and its combinatorial consequences, we will need the full statement given later.
We omit the proof of Talagrand’s inequality (see the Alon–Spencer textbook or Tao’s
blog post) and instead focus on explaining the theorem and its applications.

Distance to a subspace
We start with a geometrically motivated question.

Problem 9.5.1
Let 𝑉 be a fixed 𝑑-dimensional subspace. Let 𝑥 ∼ Unif{−1, 1}𝑛 . How well is dist(𝑥, 𝑉)
concentrated?

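Let $P = (p_{ij}) \in \mathbb{R}^{n \times n}$ be the matrix giving the orthogonal projection onto $V^\perp$. We have $\operatorname{tr} P = \dim V^\perp = n - d$. Then
\[ \operatorname{dist}(x, V)^2 = |x \cdot Px| = \sum_{i,j} x_i x_j p_{ij}. \]

So, since $\mathbb{E}[x_i x_j] = 0$ for $i \ne j$ and $x_i^2 = 1$,
\[ \mathbb{E}[\operatorname{dist}(x, V)^2] = \sum_i p_{ii} = \operatorname{tr} P = n - d. \]

How well is $\operatorname{dist}(x, V)$ concentrated around $\sqrt{n - d}$?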
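
The identity $\mathbb{E}[\operatorname{dist}(x,V)^2] = n - d$ can be verified exactly for small $n$ by enumerating the whole cube. The sketch below does this with the standard library only; the spanning vectors in the example are arbitrary illustrative data, and the helper names are not from the notes.

```python
import itertools
import math

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors (Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coeff = sum(a * b for a, b in zip(w, u))
            w = [a - coeff * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    return basis

def dist_sq_to_subspace(x, basis):
    """Squared distance from x to span(basis), where basis is orthonormal."""
    proj_sq = sum(sum(a * b for a, b in zip(x, u)) ** 2 for u in basis)
    return sum(a * a for a in x) - proj_sq

def mean_dist_sq(spanning_vectors, n):
    """Average of dist(x, V)^2 over all x in {-1, 1}^n, with V = span(spanning_vectors)."""
    basis = gram_schmidt(spanning_vectors)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        total += dist_sq_to_subspace(x, basis)
    return total / 2 ** n

if __name__ == "__main__":
    n = 5
    V = [(1, 2, 0, -1, 3), (0, 1, 1, 4, -2)]  # arbitrary spanning vectors, so d = 2
    print(mean_dist_sq(V, n))  # expected to agree with n - d = 3 up to rounding
```

The average does not depend on which $d$-dimensional subspace is chosen, even though the distribution of $\operatorname{dist}(x,V)$ does.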
Some easier special cases (codimension-1):
• If 𝑉 is a coordinate subspace, then dist(𝑥, 𝑉) is a constant not depending on 𝑥.

• If $V = (1, 1, \dots, 1)^\perp$, then $\operatorname{dist}(x, V) = |x_1 + \cdots + x_n|/\sqrt{n}$, which converges in distribution to $|Z|$ for $Z \sim N(0, 1)$. In particular, it is $O(1)$-subGaussian.


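• More generally, for a hyperplane $V = \alpha^\perp$ with unit normal vector $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$, one has $\operatorname{dist}(x, V) = |\alpha \cdot x|$. Note that flipping $x_i$ changes $|\alpha \cdot x|$ by at most $2|\alpha_i|$. So by the bounded differences inequality (Theorem 9.1.3), for every $t \ge 0$,
\[ \mathbb{P}(|\operatorname{dist}(x, V) - \mathbb{E}\operatorname{dist}(x, V)| \ge t) \le 2\exp\Bigl( \frac{-2t^2}{4(\alpha_1^2 + \cdots + \alpha_n^2)} \Bigr) \le 2e^{-t^2/2}. \]

So again $\operatorname{dist}(x, V)$ is $O(1)$-subGaussian.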


What about higher codimensional subspaces $V$? Then
\[ \operatorname{dist}(x, V) = \sup_{\alpha \in V^\perp,\ |\alpha| = 1} |\alpha \cdot x|. \]

It is not clear how to apply the bounded difference inequality to all such 𝛼 in the above
supremum simultaneously.
The bounded difference inequality applied to the function $x \in \{-1, 1\}^n \mapsto \operatorname{dist}(x, V)$, which is 2-Lipschitz (with respect to Hamming distance), gives
\[ \mathbb{P}(|\operatorname{dist}(x, V) - \mathbb{E}\operatorname{dist}(x, V)| \ge t) \le 2e^{-t^2/(2n)}, \]
showing that $\operatorname{dist}(x, V)$ is $O(\sqrt{n})$-subGaussian. But this is a pretty bad result, as $|\operatorname{dist}(x, V)| \le \sqrt{n}$ (half the length of the longest diagonal of the cube).
Perhaps the reason why the above bound is so poor is that the bounded difference
inequality is measuring distance in {−1, 1}𝑛 using the Hamming distance (ℓ1 ) whereas
we really care about the Euclidean distance (ℓ2 ).
If, instead of sampling $x \in \{-1, 1\}^n$, we took $x$ to be a uniformly random point on the sphere of radius $\sqrt{n}$ in $\mathbb{R}^n$ (which contains $\{-1, 1\}^n$), then Lévy concentration on the sphere (Corollary 9.4.14) implies that $\operatorname{dist}(x, V)$ is $O(1)$-subGaussian. Perhaps a similar bound holds when $x$ is chosen from $\{-1, 1\}^n$?
Here is a corollary of Talagrand’s inequality, which we will state in its general form
later.

Theorem 9.5.2
Let $V$ be a fixed $d$-dimensional subspace in $\mathbb{R}^n$. For uniformly random $x \in \{-1, 1\}^n$, one has
\[ \mathbb{P}\bigl( |\operatorname{dist}(x, V) - \sqrt{n - d}| \ge t \bigr) \le Ce^{-ct^2}, \]
where $C, c > 0$ are some constants.


Convex Lipschitz functions of independent random variables


Let us now state Talagrand’s inequality, first in a special case for convex functions, and
then more generally. Below dist(·, ·) means Euclidean distance.

Theorem 9.5.3 (Talagrand)


Let 𝐴 ⊆ R𝑛 be convex. Let 𝑥 ∼ Unif{0, 1}𝑛 . Then for any 𝑡 ≥ 0,
\[ \mathbb{P}(x \in A)\,\mathbb{P}(\operatorname{dist}(x, A) \ge t) \le e^{-t^2/4}. \]

Remark 9.5.4. (1) Note that $A$ is a convex body in $\mathbb{R}^n$ and not simply a set of points in $\{0, 1\}^n$.
(2) The bounded differences inequality gives us an upper bound of the form $e^{-ct^2/n}$, which is much worse than Talagrand's bound.

Example 9.5.5 (Talagrand's inequality fails for nonconvex sets). Let
\[ A = \Bigl\{ x \in \{0, 1\}^n : \operatorname{wt}(x) \le \frac{n}{2} - \sqrt{n} \Bigr\} \]
(here $A$ is a discrete set of points and not their convex hull). Then for every $y \in \{0, 1\}^n$ with $\operatorname{wt}(y) \ge n/2$, one has $\operatorname{dist}(y, A) \ge n^{1/4}$ (note that this is Euclidean distance and not Hamming distance). Using the central limit theorem, we have, for some constant $c > 0$ and sufficiently large $n$, for $x \sim \operatorname{Unif}\{0, 1\}^n$, $\mathbb{P}(x \in A) \ge c$ and $\mathbb{P}(\operatorname{wt}(x) \ge n/2) \ge 1/2$, so the conclusion of Talagrand's inequality is false for $t = n^{1/4}$, in the case of this nonconvex $A$.

By an argument similar to our proof of Theorem 9.4.8 (the equivalence of notions of concentration of measure), one can deduce the following consequence.

Corollary 9.5.6 (Talagrand)


Let 𝑓 : R𝑛 → R be a convex and 1-Lipschitz function (with respect to Euclidean
distance on R𝑛 ). Let 𝑥 ∼ Unif{0, 1}𝑛 . Then for any 𝑟 ∈ R and 𝑡 ≥ 0,
\[ \mathbb{P}(f(x) \le r)\,\mathbb{P}(f(x) \ge r + t) \le e^{-t^2/4}. \]

Remark 9.5.7. The proof below shows that the assumption that 𝑓 is convex can be
weakened to 𝑓 being quasiconvex, i.e., { 𝑓 ≤ 𝑎} is convex for every 𝑎 ∈ R.

Proof that Theorem 9.5.3 and Corollary 9.5.6 are equivalent. Theorem 9.5.3 implies
Corollary 9.5.6: take 𝐴 = {𝑥 : 𝑓 (𝑥) ≤ 𝑟 }. We have 𝑓 (𝑥) ≤ 𝑟 + 𝑡 whenever


$\operatorname{dist}(x, A) \le t$ since $f$ is 1-Lipschitz. So $\mathbb{P}(f(x) \le r) = \mathbb{P}(x \in A)$ and $\mathbb{P}(f(x) \ge r + t) \le \mathbb{P}(\operatorname{dist}(x, A) \ge t)$.
Corollary 9.5.6 implies Theorem 9.5.3: take $r = 0$ and $f(x) = \operatorname{dist}(x, A)$, which is a convex function since $A$ is convex. □

Let us write M𝑋 to be a median for the random variable 𝑋, i.e., a non-random real so
that P(𝑋 ≥ M𝑋) ≥ 1/2 and P(𝑋 ≤ M𝑋) ≥ 1/2.

Corollary 9.5.8 (Talagrand)


Let 𝑓 : R𝑛 → R be a convex and 1-Lipschitz function (with respect to Euclidean
distance on R𝑛 ). Let 𝑥 ∼ Unif ({0, 1}𝑛 ). Then
\[ \mathbb{P}(|f(x) - \mathbb{M}f(x)| \ge t) \le 4e^{-t^2/4}. \]

Proof. Setting $r = \mathbb{M}f(x)$ in Corollary 9.5.6 yields
\[ \mathbb{P}(f(x) \ge \mathbb{M}f(x) + t) \le 2e^{-t^2/4}. \]

Setting $r = \mathbb{M}f(x) - t$ in Corollary 9.5.6 yields
\[ \mathbb{P}(f(x) \le \mathbb{M}f(x) - t) \le 2e^{-t^2/4}. \]

Combining the two tail bounds yields the corollary. □

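Theorem 9.5.2 then follows. Indeed, after rescaling the cube $\{-1, 1\}^n$ to $\{0, 1\}^n$, Corollary 9.5.8 shows that $\operatorname{dist}(x, V)$ (which is a convex 1-Lipschitz function of $x \in \mathbb{R}^n$) is $O(1)$-subGaussian about its median. Since $\|\operatorname{dist}(x, V)\|_2 = \sqrt{n - d}$, Lemma 9.4.20 lets us replace the median by $\sqrt{n - d}$, which gives Theorem 9.5.2.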

Example 9.5.9 (Operator norm of a random matrix). Let $A$ be a random $n \times n$ matrix whose entries are uniform iid from $\{-1, 1\}$. Viewing $A \mapsto \|A\|_{\mathrm{op}}$ as a function $\mathbb{R}^{n^2} \to \mathbb{R}$, we see that it is convex (since the operator norm is a norm) and 1-Lipschitz (using that $\|\cdot\|_{\mathrm{op}} \le \|\cdot\|_{\mathrm{HS}}$, where the latter is the Hilbert–Schmidt norm, also known as the Frobenius norm, i.e., the $\ell_2$-norm of the matrix entries). It follows by Talagrand's inequality (Corollary 9.5.8) that $\|A\|_{\mathrm{op}}$ is $O(1)$-subGaussian about its mean.

Convex distance
Talagrand's inequality has a much more general form, which has far-reaching combinatorial applications. We need to define a more subtle notion of distance.
We consider Ω = Ω1 × · · · × Ω𝑛 with product probability measure (i.e., independent
random variables).


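Weighted Hamming distance: given $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n_{\ge 0}$ and $x, y \in \Omega$, we set
\[ d_\alpha(x, y) := \sum_{i : x_i \ne y_i} \alpha_i. \]

For $A \subseteq \Omega$,
\[ d_\alpha(x, A) := \inf_{y \in A} d_\alpha(x, y). \]

Talagrand's convex distance between $x \in \Omega$ and $A \subseteq \Omega$ is defined by
\[ d_T(x, A) := \sup_{\alpha \in \mathbb{R}^n_{\ge 0},\ |\alpha| = 1} d_\alpha(x, A). \]

Here $|\alpha|$ denotes Euclidean length:
\[ |\alpha|^2 := \alpha_1^2 + \cdots + \alpha_n^2. \]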

Example 9.5.10 (Euclidean distance to convex hull). If $A \subseteq \{0, 1\}^n$ and $x \in \{0, 1\}^n$, then $d_T(x, A)$ is the Euclidean distance from $x$ to the convex hull of $A$.

Let us give another interpretation of convex distance. For $x, y \in \Omega$, let
\[ \phi_x(y) = (1_{x_1 \ne y_1}, 1_{x_2 \ne y_2}, \dots, 1_{x_n \ne y_n}) \in \{0, 1\}^n \]
be the vector of coordinatewise disagreements between $x$ and $y$. Write
\[ \phi_x(A) = \{\phi_x(y) : y \in A\} \subseteq \{0, 1\}^n. \]

Then for any $\alpha \in \mathbb{R}^n_{\ge 0}$,
\[ d_\alpha(x, A) = d_\alpha(\vec{0}, \phi_x(A)), \]
where the LHS is the weighted Hamming distance in $\Omega$ whereas the RHS takes place in $\{0, 1\}^n$. Taking the supremum over $\alpha \in \mathbb{R}^n_{\ge 0}$ with $|\alpha| = 1$, and using Example 9.5.10, we deduce
\[ d_T(x, A) = \operatorname{dist}(\vec{0}, \operatorname{ConvexHull} \phi_x(A)). \]
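
To make the definitions concrete, here is a small sketch that computes the weighted Hamming distance and approximates Talagrand's convex distance in the special case $n = 2$ by a grid search over unit vectors $\alpha = (\cos\theta, \sin\theta)$ with $\theta \in [0, \pi/2]$. The grid search is only an illustrative approximation chosen here (it is not a method from the notes), and all function names are hypothetical.

```python
import math

def weighted_hamming(alpha, x, y):
    """d_alpha(x, y): total weight of the coordinates where x and y differ."""
    return sum(a for a, xi, yi in zip(alpha, x, y) if xi != yi)

def d_alpha_to_set(alpha, x, A):
    """d_alpha(x, A): minimum of d_alpha(x, y) over y in the finite set A."""
    return min(weighted_hamming(alpha, x, y) for y in A)

def convex_distance_2d(x, A, steps=10000):
    """Approximate d_T(x, A) for n = 2 by scanning unit vectors alpha >= 0."""
    best = 0.0
    for k in range(steps + 1):
        theta = (math.pi / 2) * k / steps
        alpha = (math.cos(theta), math.sin(theta))
        best = max(best, d_alpha_to_set(alpha, x, A))
    return best

if __name__ == "__main__":
    x = (0, 0)
    A = [(1, 0), (0, 1)]
    # The convex hull of A is the segment from (1,0) to (0,1),
    # whose Euclidean distance to the origin is 1/sqrt(2).
    print(convex_distance_2d(x, A), 1 / math.sqrt(2))
```

In this example the two printed numbers should agree to several decimal places, consistent with Example 9.5.10.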

The general form of Talagrand’s inequality says the following. Note that it reduces to
the earlier special case Theorem 9.5.3 if Ω = {0, 1}𝑛 .


Theorem 9.5.11 (Talagrand’s inequality: general form)


Let 𝐴 ⊆ Ω = Ω1 × · · · × Ω𝑛 , with Ω equipped with a product probability measure. Let
𝑥 ∈ Ω be chosen randomly with independent coordinates. Let 𝑡 ≥ 0. Then
\[ \mathbb{P}(x \in A)\,\mathbb{P}(d_T(x, A) \ge t) \le e^{-t^2/4}. \]

Let us see how Talagrand’s inequality recovers a more general form of our geometric
inequalities from earlier, extending from independent boolean random variables to
independent bounded random variables.

Lemma 9.5.12 (Convex distance upper bounds Euclidean distance)


Let 𝐴 ⊆ [0, 1] 𝑛 and 𝑥 ∈ [0, 1] 𝑛 . Then dist(𝑥, ConvexHull 𝐴) ≤ 𝑑𝑇 (𝑥, 𝐴).

Proof. For any $\alpha \in \mathbb{R}^n$ and any $y \in [0, 1]^n$, we have
\[ |(x - y) \cdot \alpha| \le \sum_{i=1}^n |\alpha_i|\,|x_i - y_i| \le \sum_{i : x_i \ne y_i} |\alpha_i|. \]

First taking the infimum over all 𝑦 ∈ 𝐴, and then taking the supremum over unit vectors
𝛼, the LHS becomes dist(𝑥, ConvexHull 𝐴) and the RHS becomes 𝑑𝑇 (𝑥, 𝐴). □

Corollary 9.5.13 (Talagrand's inequality: convex sets and convex Lipschitz functions)
Let $x = (x_1, \dots, x_n) \in [0, 1]^n$ be independent random variables (not necessarily identical). Let $t \ge 0$. Let $A \subseteq [0, 1]^n$ be a convex set. Then
\[ \mathbb{P}(x \in A)\,\mathbb{P}(\operatorname{dist}(x, A) \ge t) \le e^{-t^2/4}, \]
where dist is Euclidean distance. Also, if $f\colon [0, 1]^n \to \mathbb{R}$ is a convex 1-Lipschitz function, then
\[ \mathbb{P}(|f - \mathbb{M}f| \ge t) \le 4e^{-t^2/4}. \]

Here is a form of Talagrand's inequality that is useful for combinatorial applications. Below, one should think of $f(x)$ as the value of some optimization problem on some random input $x$. There is a hypothesis on how much $f(x)$ can change if we alter $x$. An example that we will examine in the next section is the length of the shortest tour through $n$ random points in the unit square (the Euclidean traveling salesman problem).


Theorem 9.5.14 (Talagrand’s inequality — functions with weighted certificates)


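Let $\Omega = \Omega_1 \times \cdots \times \Omega_n$ be equipped with the product measure. Let $f\colon \Omega \to \mathbb{R}$ be a function. Suppose for every $x \in \Omega$, there is some $\alpha(x) = (\alpha_1(x), \dots, \alpha_n(x)) \in \mathbb{R}^n_{\ge 0}$ such that
\[ f(y) \ge f(x) - \sum_{i : x_i \ne y_i} \alpha_i(x) \quad\text{for all } y \in \Omega. \]

Then, for every $t \ge 0$ (recall $|\alpha(x)|^2 = \sum_{i=1}^n \alpha_i(x)^2$),
\[ \mathbb{P}(|f - \mathbb{M}f| \ge t) \le 4e^{-t^2/K^2} \quad\text{where } K = 2\sup_{x \in \Omega} |\alpha(x)|. \]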

Remark 9.5.15. By considering $-f$ instead of $f$, we can change the hypothesis on $f$ to
\[ f(y) \le f(x) + \sum_{i : x_i \ne y_i} \alpha_i(x) \quad\text{for all } y \in \Omega. \]

Note that $x$ and $y$ play asymmetric roles.

Remark 9.5.16. The vector 𝛼(𝑥) measures the resilience of 𝑓 (𝑥) under changing
some coordinates of 𝑥. It is important that we can choose a different weight 𝛼(𝑥) for
each 𝑥. In fact, if we do not let 𝛼(𝑥) change with 𝑥, then Theorem 9.5.14 recovers the
bounded differences inequality Theorem 9.1.3 up to an unimportant constant factor in
the exponent of the bound.

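Proof. Let $r \in \mathbb{R}$. Let $A = \{y \in \Omega : f(y) \le r - t\}$. Consider an $x \in \Omega$ with $f(x) \ge r$. By hypothesis, there is some $\alpha(x) \in \mathbb{R}^n_{\ge 0}$ such that
\[ d_{\alpha(x)}(x, y) \ge f(x) - f(y) \ge t \quad\text{for all } y \in A. \]

Taking the infimum over $y \in A$, and normalizing $\alpha(x)$ to a unit vector, we find
\[ |\alpha(x)|\, d_T(x, A) \ge d_{\alpha(x)}(x, A) \ge t. \]

So
\[ d_T(x, A) \ge \frac{t}{|\alpha(x)|} \ge \frac{2t}{K}. \]
And hence by Talagrand's inequality (Theorem 9.5.11),
\[ \mathbb{P}(f \le r - t)\,\mathbb{P}(f \ge r) \le \mathbb{P}(x \in A)\,\mathbb{P}\Bigl( d_T(x, A) \ge \frac{2t}{K} \Bigr) \le e^{-t^2/K^2}. \]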


Taking $r = \mathbb{M}f + t$ yields
\[ \mathbb{P}(f \ge \mathbb{M}f + t) \le 2e^{-t^2/K^2}, \]
and taking $r = \mathbb{M}f$ yields
\[ \mathbb{P}(f \le \mathbb{M}f - t) \le 2e^{-t^2/K^2}. \]

Putting them together yields the final result. □

Largest eigenvalue of a random matrix

Theorem 9.5.17
Let $A = (a_{ij})$ be an $n \times n$ symmetric random matrix with independent entries in $[-1, 1]$. Let $\lambda_1(A)$ denote the largest eigenvalue of $A$. Then
\[ \mathbb{P}(|\lambda_1(A) - \mathbb{M}\lambda_1(A)| \ge t) \le 4e^{-t^2/32}. \]

Proof. We shall verify the hypotheses of Theorem 9.5.14. We would like to come up with a good choice of a weight vector $\alpha(A)$ for each matrix $A$ so that for any other symmetric matrix $B$ with $[-1, 1]$ entries,
\[ \lambda_1(B) \ge \lambda_1(A) - \sum_{i \le j : a_{ij} \ne b_{ij}} \alpha_{ij}. \tag{9.1} \]

Note that in a random symmetric matrix we only have 𝑛(𝑛 + 1)/2 independent random
entries: the entries below the diagonal are obtained by reflecting the upper diagonal
entries.
Let $v = v(A)$ be a unit eigenvector of $A$ corresponding to the eigenvalue $\lambda_1(A)$. Then, by the Courant–Fischer characterization of eigenvalues,
\[ v^\intercal A v = \lambda_1(A) \quad\text{and}\quad v^\intercal B v \le \lambda_1(B). \]

Thus
\[ \lambda_1(A) - \lambda_1(B) \le v^\intercal (A - B) v \le \sum_{i,j : a_{ij} \ne b_{ij}} |v_i|\,|v_j|\,|a_{ij} - b_{ij}| \le \sum_{i,j : a_{ij} \ne b_{ij}} 2|v_i|\,|v_j|. \]

Thus (9.1) holds for the vector $\alpha(A) = (\alpha_{ij})_{i \le j}$ defined by
\[ \alpha_{ij} = \begin{cases} 4|v_i|\,|v_j| & \text{if } i < j, \\ 2|v_i|^2 & \text{if } i = j. \end{cases} \]


We have
\[ \sum_{i \le j} \alpha_{ij}^2 \le 8 \sum_{i,j} |v_i|^2 |v_j|^2 = 8 \Bigl( \sum_i |v_i|^2 \Bigr)^2 = 8. \]

So Theorem 9.5.14, with $K \le 2\sqrt{8}$ and hence $K^2 \le 32$, yields the result. □

Remark 9.5.18. If $A$ has mean zero entries, then a moments computation shows that $\mathbb{E}\lambda_1(A) = O(\sqrt{n})$ (the constant can be computed as well). A much more advanced fact is that, say for uniform $\{-1, 1\}$ entries, the true scale of fluctuation is $n^{-1/6}$, and when normalized, the distribution converges to something known as the Tracy–Widom distribution. This limiting distribution is "universal" in the sense that it occurs in many naturally occurring problems, including the next example.

Certifiable functions and longest increasing subsequence


An increasing subsequence of a permutation $\sigma = (\sigma_1, \dots, \sigma_n)$ is defined to be some $(\sigma_{i_1}, \dots, \sigma_{i_\ell})$ with $i_1 < \cdots < i_\ell$ and $\sigma_{i_1} < \cdots < \sigma_{i_\ell}$.

Question 9.5.19
How well is the length $X$ of the longest increasing subsequence of a uniform random permutation concentrated?

While the entries of $\sigma$ are not independent, we can generate a uniform random permutation by taking iid uniform $x_1, \dots, x_n \sim \operatorname{Unif}[0, 1]$ and letting $\sigma$ record the ordering of the $x_i$'s. This trick converts the problem into one about independent random variables.
We leave it as an exercise to deduce that $X$ is $\Theta(\sqrt{n})$ whp.
Changing one of the $x_i$'s changes the length of the longest increasing subsequence by at most 1, so the bounded differences inequality tells us that $X$ is $O(\sqrt{n})$-subGaussian. Can we do better?
The assertion that a permutation has an increasing subsequence of length $s$ can be checked by verifying $s$ coordinates of the permutation. Talagrand's inequality tells us that in such situations the typical fluctuation should be on the order $O(\sqrt{\mathbb{M}X})$, or $O(n^{1/4})$ in this case.
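
For experimentation, the length of the longest increasing subsequence can be computed in $O(n \log n)$ time by patience sorting. The sketch below samples the permutation through iid uniform values as described above; the function names, sample sizes, and seed are illustrative choices.

```python
import bisect
import math
import random

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    # tails[k] is the smallest possible last element of an
    # increasing subsequence of length k + 1 seen so far.
    tails = []
    for value in seq:
        pos = bisect.bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)

def sample_lis(n, rng):
    """LIS length of the ordering induced by n iid Unif[0, 1] values."""
    xs = [rng.random() for _ in range(n)]
    return lis_length(xs)

if __name__ == "__main__":
    rng = random.Random(0)
    n = 10000
    samples = [sample_lis(n, rng) for _ in range(20)]
    print("2*sqrt(n) =", 2 * math.sqrt(n))
    print("samples   =", samples)
```

The sampled values should cluster in a narrow band somewhat below $2\sqrt{n}$, far narrower than the $O(\sqrt{n})$ window given by the bounded differences inequality.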

Definition 9.5.20
Let $\Omega = \Omega_1 \times \cdots \times \Omega_n$. Let $A \subseteq \Omega$. We say that $A$ is $s$-certifiable if for every $x \in A$, there exists a set $I(x) \subseteq [n]$ with $|I(x)| \le s$ such that for every $y \in \Omega$ with $x_i = y_i$ for all $i \in I(x)$, one has $y \in A$.

For example, for a random permutation as earlier, having an increasing subsequence of length $\ge s$ is $s$-certifiable (namely by the indices of the length $s$ increasing subsequence).


Theorem 9.5.21 (Talagrand’s inequality for certifiable functions)


Let $\Omega = \Omega_1 \times \cdots \times \Omega_n$ be equipped with a product measure. Let $f\colon \Omega \to \mathbb{R}$ be 1-Lipschitz with respect to Hamming distance on $\Omega$. Suppose that $\{f \ge r\}$ is $s$-certifiable. Then, for every $t \ge 0$,
\[ \mathbb{P}(f \le r - t)\,\mathbb{P}(f \ge r) \le e^{-t^2/(4s)}. \]

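Proof. Let $A, B \subseteq \Omega$ be given by $A = \{x : f(x) \le r - t\}$ and $B = \{y : f(y) \ge r\}$. For every $y \in B$, let $I(y) \subseteq [n]$ denote a set of at most $s$ coordinates that certify $f \ge r$. Due to $f$ being 1-Lipschitz, we see that every $x \in A$ disagrees with $y$ on at least $t$ coordinates of $I(y)$: otherwise, changing fewer than $t$ coordinates of $x$ to agree with $y$ on $I(y)$ would produce a point $x'$ with $f(x') \ge r$ and $f(x') < f(x) + t \le r$, a contradiction.
For every $y \in B$, let $\alpha(y)$ be the indicator vector of $I(y)$ normalized in length to a unit vector. Then for any $x \in A$,
\[ d_{\alpha(y)}(x, y) = \frac{|\{i \in I(y) : x_i \ne y_i\}|}{\sqrt{|I(y)|}} \ge \frac{t}{\sqrt{s}}. \]

Thus $d_T(y, A) \ge t/\sqrt{s}$ for every $y \in B$. Thus
\[ \mathbb{P}(f \le r - t)\,\mathbb{P}(f \ge r) = \mathbb{P}(A)\,\mathbb{P}(B) \le \mathbb{P}(x \in A)\,\mathbb{P}(d_T(x, A) \ge t/\sqrt{s}) \le e^{-t^2/(4s)} \]
by Talagrand's inequality (Theorem 9.5.11). □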

Corollary 9.5.22 (Talagrand’s inequality for certifiable functions)


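Let $\Omega = \Omega_1 \times \cdots \times \Omega_n$ be equipped with a product measure. Let $f\colon \Omega \to \mathbb{R}$ be 1-Lipschitz with respect to Hamming distance on $\Omega$. Suppose $\{f \ge r\}$ is $r$-certifiable for every $r$. Then for every $t \ge 0$,
\[ \mathbb{P}(f \le \mathbb{M}f - t) \le 2\exp\Bigl( \frac{-t^2}{4\,\mathbb{M}f} \Bigr) \]
and
\[ \mathbb{P}(f \ge \mathbb{M}f + t) \le 2\exp\Bigl( \frac{-t^2}{4(\mathbb{M}f + t)} \Bigr). \]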

Proof. Applying the previous theorem with $s = r$, we have, for every $r$ and every $t \ge 0$,
\[ \mathbb{P}(f \le r - t)\,\mathbb{P}(f \ge r) \le \exp\Bigl( \frac{-t^2}{4r} \Bigr). \]


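Setting $r = \mathbb{M}f$ and using $\mathbb{P}(f \ge \mathbb{M}f) \ge 1/2$, we obtain the lower tail
\[ \mathbb{P}(f \le \mathbb{M}f - t) \le 2\exp\Bigl( \frac{-t^2}{4\,\mathbb{M}f} \Bigr). \]

Setting $r = \mathbb{M}f + t$ and using $\mathbb{P}(f \le \mathbb{M}f) \ge 1/2$, we obtain the upper tail
\[ \mathbb{P}(f \ge \mathbb{M}f + t) \le 2\exp\Bigl( \frac{-t^2}{4(\mathbb{M}f + t)} \Bigr). \]
□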

We can apply the above corollary to $[0, 1]^n$ with $f$ being the length of the longest increasing subsequence. Then $\{f \ge r\}$ is $r$-certifiable. It is also easy to deduce that $\mathbb{M}f = O(\sqrt{n})$. The above tail bounds give us a concentration window of width $O(n^{1/4})$.
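
To spell out where the $n^{1/4}$ window comes from, here is the short calculation, as a sketch under the assumption $\mathbb{M}f \le c_0\sqrt{n}$ for some constant $c_0$ (a name introduced here for illustration), taking $t = Cn^{1/4}$ with $1 \le C \le n^{1/4}$ so that $t \le \sqrt{n}$:

```latex
\begin{align*}
\mathbb{P}(f \le \mathbb{M}f - t)
  &\le 2\exp\Bigl( -\frac{C^2\sqrt{n}}{4c_0\sqrt{n}} \Bigr)
   = 2e^{-C^2/(4c_0)}, \\
\mathbb{P}(f \ge \mathbb{M}f + t)
  &\le 2\exp\Bigl( -\frac{C^2\sqrt{n}}{4(c_0\sqrt{n} + Cn^{1/4})} \Bigr)
   \le 2e^{-C^2/(4(c_0 + 1))}.
\end{align*}
```

Both right-hand sides can be made smaller than any fixed $\varepsilon/2$ by choosing $C$ large enough, independently of $n$.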

Corollary 9.5.23 (Longest increasing subsequence)


Let 𝑋 be the length of the longest increasing subsequence of a random permutation of
[𝑛]. Then for every 𝜀 > 0 there exists 𝐶 > 0 so that

P(|𝑋 − M𝑋 | ≤ 𝐶𝑛1/4 ) ≥ 1 − 𝜀.

Remark 9.5.24. The distribution of the length $X$ of the longest increasing subsequence of a uniform random permutation is now well understood through some deep results.
Vershik and Kerov (1977) showed that $\mathbb{E}X \sim 2\sqrt{n}$.
Baik, Deift, and Johansson (1999) showed that the correct scaling factor is $n^{1/6}$, and furthermore, $n^{-1/6}(X - 2\sqrt{n})$ converges to the Tracy–Widom distribution, the same limiting distribution as for the top eigenvalue of a random matrix.

9.6 Euclidean traveling salesman problem


Given points $x_1, \dots, x_n \in [0, 1]^2$, let $L(x_1, \dots, x_n) = L(\{x_1, \dots, x_n\})$ denote the length of the shortest tour that visits all the given points and returns to its starting point. Equivalently, $L(x_1, \dots, x_n)$ is the minimum of
\[ |x_{\sigma(1)} - x_{\sigma(2)}| + |x_{\sigma(2)} - x_{\sigma(3)}| + \cdots + |x_{\sigma(n)} - x_{\sigma(1)}| \]
as $\sigma$ ranges over all permutations of $[n]$. This Euclidean traveling salesman problem is NP-hard to solve exactly, although there is a $(1 + \varepsilon)$-factor approximation algorithm with polynomial running time for any constant $\varepsilon > 0$ (Arora 1998).
Let
\[ L_n = L(x_1, \dots, x_n) \quad\text{with i.i.d. } x_1, \dots, x_n \sim \operatorname{Unif}([0, 1]^2). \]


[Figures: the Mona Lisa TSP challenge; a tour of 1000 random points. © Robert Bosch. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use.]


Exercise: $\mathbb{E}L_n = \Theta(\sqrt{n})$.

Beardwood, Halton, and Hammersley (1959) showed that whp $L_n/\sqrt{n}$ converges to some constant as $n \to \infty$ (the exact value of the constant is unknown).
We shall focus on the concentration of 𝐿 𝑛 .
We will present two methods that illustrate different techniques from this chapter.

Martingale methods
The following simple monotonicity property will be important for us: for any 𝑆 and
𝑥 ∈ [0, 1] 2 ,
𝐿 (𝑆) ≤ 𝐿 (𝑆 ∪ {𝑥}) ≤ 𝐿(𝑆) + 2 dist(𝑥, 𝑆).
Here is the justification for the second inequality. Let 𝑦 be the closest point in 𝑆 to
𝑥. Consider a shortest tour through 𝑆 of length 𝐿(𝑆). Let us modify this tour by first
traversing through it, and when we hit 𝑦, we take a detour excursion from 𝑦 to 𝑥 and
then back to 𝑦. The length of this tour, which contains 𝑆 ∪ {𝑥}, is 𝐿(𝑆) + 2 dist(𝑥, 𝑆),
and thus the shortest tour through 𝑆 ∪ {𝑥} has length at most 𝐿 (𝑆) + 2 dist(𝑥, 𝑆).
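
The monotonicity property can be sanity-checked by brute force on small point sets. The sketch below computes $L(S)$ by trying all orderings, which is only feasible for a handful of points; the function names and the tolerance are illustrative assumptions.

```python
import itertools
import math
import random

def tour_length(points, order):
    """Length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[order[k]], points[order[(k + 1) % len(order)]])
        for k in range(len(order))
    )

def shortest_tour(points):
    """Brute-force L(points): fix the first point and try every ordering of the rest."""
    n = len(points)
    if n <= 1:
        return 0.0
    best = math.inf
    for perm in itertools.permutations(range(1, n)):
        best = min(best, tour_length(points, (0,) + perm))
    return best

def check_monotonicity(S, x, tol=1e-9):
    """Check L(S) <= L(S + [x]) <= L(S) + 2 dist(x, S), up to rounding error."""
    L_S = shortest_tour(S)
    L_Sx = shortest_tour(S + [x])
    nearest = min(math.dist(x, y) for y in S)
    return L_S - tol <= L_Sx <= L_S + 2 * nearest + tol

if __name__ == "__main__":
    rng = random.Random(0)
    S = [(rng.random(), rng.random()) for _ in range(6)]
    x = (rng.random(), rng.random())
    print(check_monotonicity(S, x))
```

Since the brute-force search grows factorially, this is meant only as a check on tiny instances, not as a way to study $L_n$ for large $n$.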
If we simply apply the bounded difference inequality, we find that changing a single $x_i$ might change $L(x_1, \dots, x_n)$ by $O(1)$ in the worst case, and so $L_n$ is $O(\sqrt{n})$-subGaussian about its mean. This is a trivial result since $L_n$ is typically $\Theta(\sqrt{n})$.
To do better, we apply Azuma's inequality to the Doob martingale. The key observation is that the initially revealed points do not affect the conditional expectations by much even in the worst case.


Theorem 9.6.1 (Rhee and Talagrand 1987)

$L_n$ is $O(\sqrt{\log n})$-subGaussian about its mean. That is,
\[ \mathbb{P}(|L_n - \mathbb{E}L_n| \ge t) \le \exp\Bigl( \frac{-ct^2}{\log n} \Bigr) \quad\text{for all } t > 0, \]
where $c > 0$ is some constant.

We need the following estimate.

Lemma 9.6.2
Let $S$ be a set of $k$ random points chosen independently and uniformly in $[0, 1]^2$. For any (non-random) point $y \in [0, 1]^2$, one has
\[ \mathbb{E}\operatorname{dist}(y, S) \lesssim \frac{1}{\sqrt{k}}. \]

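Proof. We have
\[ \begin{aligned}
\mathbb{E}\operatorname{dist}(y, S) &= \int_0^{\sqrt{2}} \mathbb{P}(\operatorname{dist}(y, S) \ge t)\,dt \\
&= \int_0^{\sqrt{2}} \bigl( 1 - \operatorname{area}(B(y, t) \cap [0, 1]^2) \bigr)^k\,dt \\
&\le \int_0^{\sqrt{2}} \exp\bigl( -k \operatorname{area}(B(y, t) \cap [0, 1]^2) \bigr)\,dt \\
&\le \int_0^\infty \exp\bigl( -\Omega(kt^2) \bigr)\,dt \lesssim \frac{1}{\sqrt{k}}.
\end{aligned} \]
□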
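
The $1/\sqrt{k}$ scaling in the lemma can be observed with a quick Monte Carlo estimate. This is a sketch with illustrative parameters (the query point, number of trials, and seed are arbitrary choices made here).

```python
import math
import random

def mean_nearest_distance(k, y=(0.5, 0.5), trials=2000, seed=0):
    """Monte Carlo estimate of E dist(y, S) for S = k iid uniform points in [0,1]^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(math.dist(y, (rng.random(), rng.random())) for _ in range(k))
    return total / trials

if __name__ == "__main__":
    # If the lemma's scaling is right, estimate * sqrt(k) should stay roughly constant.
    for k in (25, 100, 400):
        est = mean_nearest_distance(k)
        print(f"k={k:4d}  estimate={est:.4f}  estimate*sqrt(k)={est * math.sqrt(k):.3f}")
```

Moving `y` to a corner of the square reduces the area of $B(y,t) \cap [0,1]^2$ and increases the estimate by a constant factor, which is why the lemma is stated with $\lesssim$.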
Proof of Theorem 9.6.1. Let
\[ L_{n,i}(x_1, \dots, x_i) = \mathbb{E}[L_n(x_1, \dots, x_n) \mid x_1, \dots, x_i] \]
be the expectation of $L_n$ conditional on the first $i$ points (and averaging over the remaining $n - i$ points).
Claim. $L_{n,i}$ is $O\bigl( 1/\sqrt{n - i + 1} \bigr)$-Lipschitz with respect to Hamming distance.
We have
\[ \begin{aligned}
L(x_1, \dots, x_i, \dots, x_n) &\le L(x_1, \dots, x_i', \dots, x_n) + 2\operatorname{dist}(x_i, \{x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n\}) \\
&\le L(x_1, \dots, x_i', \dots, x_n) + \begin{cases} 2\operatorname{dist}(x_i, \{x_{i+1}, \dots, x_n\}) & \text{if } i < n, \\ O(1) & \text{if } i = n. \end{cases}
\end{aligned} \]


Taking expectation over $x_{i+1}, \dots, x_n$, and applying the previous lemma, we find that
\[ L_{n,i}(x_1, \dots, x_i) \le L_{n,i}(x_1, \dots, x_{i-1}, x_i') + O\Bigl( \frac{1}{\sqrt{n - i + 1}} \Bigr). \]
This proves the claim. Thus the Doob martingale
\[ Z_i = \mathbb{E}[L_n(x_1, \dots, x_n) \mid x_1, \dots, x_i] = L_{n,i}(x_1, \dots, x_i) \]
satisfies
\[ |Z_i - Z_{i-1}| \lesssim \frac{1}{\sqrt{n - i + 1}} \quad\text{for each } 1 \le i \le n. \]
Now we apply Azuma's inequality (Theorem 9.2.8). Since
\[ \sum_{i=1}^n \Bigl( \frac{1}{\sqrt{n - i + 1}} \Bigr)^2 = O(\log n), \]
we deduce that $Z_n = L_n$ is $O(\sqrt{\log n})$-subGaussian about its mean. □
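
The sum in the last step is a harmonic sum. Spelled out, and assuming Azuma's inequality in the common form $\mathbb{P}(|Z_n - Z_0| \ge t) \le 2\exp(-t^2/(2\sum_i c_i^2))$ for increments bounded by $c_i$ (the exact constants in Theorem 9.2.8 may differ), with $c_i = C_0/\sqrt{n - i + 1}$ for a constant $C_0$ introduced here for illustration:

```latex
\begin{align*}
\sum_{i=1}^n c_i^2
  &= C_0^2 \sum_{i=1}^n \frac{1}{n - i + 1}
   = C_0^2 \sum_{k=1}^n \frac{1}{k}
   \le C_0^2 (1 + \log n), \\
\mathbb{P}(|L_n - \mathbb{E}L_n| \ge t)
  &\le 2\exp\Bigl( -\frac{t^2}{2C_0^2 (1 + \log n)} \Bigr).
\end{align*}
```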

Talagrand’s inequality
Using Talagrand’s inequality, we will prove the following stronger estimate.

Theorem 9.6.3 (Rhee and Talagrand 1989)

$L_n$ is $O(1)$-subGaussian about its mean. That is,
\[ \mathbb{P}(|L_n - \mathbb{E}L_n| \ge t) \le e^{-ct^2} \quad\text{for all } t > 0, \]

where 𝑐 > 0 is some constant.

Remark 9.6.4. Rhee (1991) showed that this tail bound is sharp.

The proof below, following Steele (1997), applies the “space-filling curve heuristic.”
A space-filling curve is a continuous surjection from [0, 1] to [0, 1] 2 . Peano (1890)
constructed the first space-filling curve. Hilbert (1891) constructed another space-
filling curve known as the Hilbert curve. We will not give a precise description of the
Hilbert curve here. Intuitively, the Hilbert curve is the pointwise limit of a sequence of
piecewise linear curves illustrated in Figure 9.2. I recommend this 3Blue1Brown video
on YouTube for a beautiful animation of the Hilbert curve along with applications.
We will only need the following property of the Hilbert space filling curve.


Figure 9.2: The Hilbert space-filling curve is the limit of discrete curves illustrated.

Definition 9.6.5 (Hölder continuity)


Given two metric spaces (𝑋, 𝑑 𝑋 ) and (𝑌 , 𝑑𝑌 ), we say that a map 𝑓 : 𝑋 → 𝑌 is Hölder
continuous with exponent 𝜶 if there is some constant 𝐶 (depending on 𝑓 ) so that

\[ d_Y(f(x), f(x')) \le C\, d_X(x, x')^{\alpha} \quad \text{for all } x, x' \in X. \]

Remark 9.6.6. Hölder continuity with exponent 1 is the same as Lipschitz continuity.
Often 𝑋 has bounded diameter, in which case if 𝑓 is Hölder continuous with exponent
𝛼, then it is so with any exponent 𝛼′ < 𝛼.

Theorem 9.6.7
The Hilbert curve 𝐻 : [0, 1] → [0, 1]² is Hölder continuous with exponent 1/2.

Proof sketch. The Hilbert space-filling curve 𝐻 sends every interval of the form
[(𝑖 − 1)/4^𝑛, 𝑖/4^𝑛] to a square of the form [(𝑗 − 1)/2^𝑛, 𝑗/2^𝑛] × [(𝑘 − 1)/2^𝑛, 𝑘/2^𝑛].
Indeed, for each fixed 𝑛, the discrete curves eventually all have this property.
Let 𝑥, 𝑦 ∈ [0, 1], and let 𝑛 be the largest integer so that 𝑥, 𝑦 ∈ [(𝑖 − 1)/4^𝑛, (𝑖 + 1)/4^𝑛]
for some integer 𝑖. Then |𝑥 − 𝑦| = Θ(4^{−𝑛}), and |𝐻(𝑥) − 𝐻(𝑦)| ≲ 2^{−𝑛}. Thus
|𝐻(𝑥) − 𝐻(𝑦)| ≲ |𝑥 − 𝑦|^{1/2}. □


Remark 9.6.8. If a space-filling curve is Hölder continuous with exponent 𝛼, then
𝛼 ≤ 1/2. Indeed, the images of the intervals [(𝑖 − 1)/𝑘, 𝑖/𝑘], 𝑖 = 1, …, 𝑘, cover the
unit square, and thus at least one of these intervals must have image of diameter ≳ 1/√𝑘.

Lemma 9.6.9 (Space-filling curve heuristic)


Let 𝑥₁, …, 𝑥ₙ ∈ [0, 1]². There is a permutation 𝜎 of [𝑛] with (indices taken mod 𝑛)
\[ \sum_{i=1}^{n} \left| x_{\sigma(i)} - x_{\sigma(i+1)} \right|^{2} = O(1). \]

Proof. Order the points as they appear on the Hilbert space-filling curve 𝐻 : [0, 1] →
[0, 1]² (since 𝐻 is not injective, there is more than one possible order). Then, there
exist 0 ≤ 𝑡₁ ≤ 𝑡₂ ≤ ⋯ ≤ 𝑡ₙ ≤ 1 so that 𝐻(𝑡ᵢ) = 𝑥_{𝜎(𝑖)} for each 𝑖. Using that 𝐻 is
Hölder continuous with exponent 1/2, we have
\[
\sum_{i=1}^{n} \left| x_{\sigma(i)} - x_{\sigma(i+1)} \right|^{2}
= \sum_{i=1}^{n} |H(t_i) - H(t_{i+1})|^{2}
\lesssim \sum_{i=1}^{n} |t_i - t_{i+1}| \le 2. \qquad \square
\]
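
The ordering step in this proof can be sketched in Python with a finite approximation of the Hilbert curve. This is only an illustration under stated assumptions: points are snapped to a 2^16 × 2^16 grid, the cell index is computed with the standard bit-manipulation conversion from grid coordinates to curve position (not taken from the notes), and the names `hilbert_index`, `hilbert_tour`, and `sum_squared_edges` are hypothetical.

```python
import random

def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along the discrete Hilbert curve on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so that the sub-curve inside this quadrant has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_tour(points, order=16):
    """Return the indices of `points` (in [0,1]^2) sorted by position along the curve."""
    n = 1 << order
    def key(i):
        px, py = points[i]
        cx = min(int(px * n), n - 1)  # snap to a grid cell
        cy = min(int(py * n), n - 1)
        return hilbert_index(order, cx, cy)
    return sorted(range(len(points)), key=key)

def sum_squared_edges(points, tour):
    """Sum of squared edge lengths of the closed tour (indices taken cyclically)."""
    total = 0.0
    m = len(tour)
    for i in range(m):
        ax, ay = points[tour[i]]
        bx, by = points[tour[(i + 1) % m]]
        total += (ax - bx) ** 2 + (ay - by) ** 2
    return total

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(2000)]
tour = hilbert_tour(pts)
print(sum_squared_edges(pts, tour))  # stays bounded as the number of points grows
```

Sorting by curve position plays the role of choosing 𝑡₁ ≤ ⋯ ≤ 𝑡ₙ; the printed quantity is the left-hand side of the lemma for this tour.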

Remark 9.6.10. We leave it as an exercise to find an elementary proof of the lemma
without invoking the existence of a space-filling curve. Hint: consider a finite
approximation of the Hilbert curve.

Using Talagrand’s inequality in the form of Theorem 9.5.14, to prove Theorem 9.6.3
that 𝐿 𝑛 is 𝑂 (1)-subGaussian, it suffices to prove the following lemma.

Lemma 9.6.11
Let Ω = ([0, 1]²)^𝑛 be the space of 𝑛-tuples of points in [0, 1]². There exists a map
𝛼 : Ω → ℝ^𝑛_{≥0} so that 𝛼(𝑥) = (𝛼₁(𝑥), …, 𝛼ₙ(𝑥)) satisfies
\[ L(x) \le L(y) + \sum_{i : x_i \ne y_i} \alpha_i(x) \quad \text{for all } x, y \in \Omega \tag{9.1} \]
and
\[ \sup_{x \in \Omega} \sum_{i=1}^{n} \alpha_i(x)^{2} = O(1). \tag{9.2} \]

Proof. Let 𝑥 = (𝑥₁, …, 𝑥ₙ) ∈ Ω, and let 𝜎 be the permutation of [𝑛] given by
Lemma 9.6.9, the space-filling curve heuristic. Then 𝜎 induces a tour of 𝑥₁, …, 𝑥ₙ.
Let 𝛼ᵢ(𝑥) equal twice the sum of the lengths of the two edges incident to 𝑥ᵢ in this tour


(indices taken mod 𝑛):
\[ \alpha_i(x) = 2\left( \left| x_i - x_{\sigma(\sigma^{-1}(i)+1)} \right| + \left| x_i - x_{\sigma(\sigma^{-1}(i)-1)} \right| \right). \]

Intuitively, this quantity captures “difficulty to serve” 𝑥𝑖 .


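Now we prove (9.1). First we handle the case when 𝑥ᵢ ≠ 𝑦ᵢ for all 𝑖. Each edge of the
𝜎-induced tour is counted once at each of its two endpoints, with a factor of 2 each
time, so (9.1) follows from
\[ L(x) \le \sum_{i=1}^{n} \left| x_{\sigma(i)} - x_{\sigma(i+1)} \right| = \frac{1}{4} \sum_{i=1}^{n} \alpha_i(x) \le \sum_{i=1}^{n} \alpha_i(x). \]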
Now suppose that 𝑥𝑖 = 𝑦𝑖 for at least one 𝑖. Suppose we have a tour through 𝑦 of length
𝐿 (𝑦). Consider, for each 𝑖 with 𝑥𝑖 ≠ 𝑦𝑖 , the point 𝑥𝑖 along with the two segments
incident to 𝑥𝑖 in the 𝜎-induced tour through 𝑥 (these are the “new edges”). Starting
with an optimal tour through 𝑦, and by making various detours/excursions on the new
edges, we can reach all the points of 𝑥, traversing each new edge at most twice. The
length of the new tour is at most 𝐿(𝑦) + ∑_{𝑖 : 𝑥ᵢ ≠ 𝑦ᵢ} 𝛼ᵢ(𝑥). This proves (9.1).
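Finally, it remains to prove (9.2). By Lemma 9.6.9,
\[
\sum_{i=1}^{n} \alpha_i(x)^{2}
\le 4 \sum_{j=1}^{n} \left( \left| x_{\sigma(j)} - x_{\sigma(j+1)} \right| + \left| x_{\sigma(j)} - x_{\sigma(j-1)} \right| \right)^{2}
\lesssim \sum_{j=1}^{n} \left| x_{\sigma(j)} - x_{\sigma(j+1)} \right|^{2} = O(1). \qquad \square
\]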

Exercises
1. Sub-Gaussian tails. For each part, prove there is some constant 𝑐 > 0 so that,
for all 𝜆 > 0,
\[ \mathbb{P}\bigl(|X - \mathbb{E}X| \ge \lambda \sqrt{\operatorname{Var} X}\bigr) \le 2e^{-c\lambda^{2}}. \]
a) 𝑋 is the number of triangles in 𝐺 (𝑛, 1/2).
b) 𝑋 is the number of inversions of a uniform random permutation of [𝑛] (an
inversion of 𝜎 ∈ 𝑆𝑛 is a pair (𝑖, 𝑗) with 𝑖 < 𝑗 and 𝜎(𝑖) > 𝜎( 𝑗)).
2. Prove that for every 𝜀 > 0 there exist 𝛿 > 0 and 𝑛₀ such that for all 𝑛 ≥ 𝑛₀
and 𝑆₁, …, 𝑆ₘ ⊂ [2𝑛] with 𝑚 ≤ 2^{𝛿𝑛} and |𝑆ᵢ| = 𝑛 for all 𝑖 ∈ [𝑚], there exists a
function 𝑓 : [2𝑛] → [𝑛] so that (1 − 𝑒^{−1} − 𝜀)𝑛 ≤ |𝑓(𝑆ᵢ)| ≤ (1 − 𝑒^{−1} + 𝜀)𝑛 for
all 𝑖 ∈ [𝑚].
3. Simultaneous bisections. Fix Δ. Let 𝐺₁, …, 𝐺ₘ with 𝑚 = 2^{𝑜(𝑛)} be connected
graphs of maximum degree at most Δ on the same vertex set 𝑉 with |𝑉| = 𝑛.
Prove that there exists a partition 𝑉 = 𝐴 ∪ 𝐵 so that every 𝐺ᵢ has (1 + 𝑜(1))𝑒(𝐺ᵢ)/2


edges between 𝐴 and 𝐵.


4. ★ Prove that there is some constant 𝑐 > 0 so that for every graph 𝐺 with chromatic
number 𝑘, letting 𝑆 be a uniform random subset of 𝑉 and 𝐺[𝑆] the subgraph
induced by 𝑆, one has, for every 𝑡 ≥ 0,
\[ \mathbb{P}(\chi(G[S]) \le k/2 - t) \le e^{-ct^{2}/k}. \]

5. ★ Prove that there is some constant 𝑐 > 0 so that, with probability 1 − 𝑜(1),
𝐺(𝑛, 1/2) has a bipartite subgraph with at least 𝑛²/8 + 𝑐𝑛^{3/2} edges.
6. Let 𝑘 ≤ 𝑛/2 be positive integers and 𝐺 an 𝑛-vertex graph with average degree
at most 𝑛/𝑘. Prove that a uniform random 𝑘-element subset of the vertices of 𝐺
contains an independent set of size at least 𝑐𝑘 with probability at least 1 − 𝑒^{−𝑐𝑘},
where 𝑐 > 0 is a constant.
7. ★ Prove that there exists a constant 𝑐 > 0 so that the following holds. Let 𝐺 be a
𝑑-regular graph and 𝑣₀ ∈ 𝑉(𝐺). Let 𝑚 ∈ ℕ and consider a simple random walk
𝑣₀, 𝑣₁, …, 𝑣ₘ where each 𝑣_{𝑖+1} is a uniform random neighbor of 𝑣ᵢ. For each
𝑣 ∈ 𝑉(𝐺), let 𝑋ᵥ be the number of times that 𝑣 appears among 𝑣₀, …, 𝑣ₘ. Prove that
for every 𝑣 ∈ 𝑉(𝐺) and 𝜆 > 0,
\[ \mathbb{P}\left( \left| X_v - \frac{1}{d} \sum_{w \in N(v)} X_w \right| \ge \lambda + 1 \right) \le 2e^{-c\lambda^{2}/m}. \]
Here 𝑁(𝑣) is the neighborhood of 𝑣.
8. Prove that for every 𝑘 there exists a 2^{(1+𝑜(1))𝑘/2}-vertex graph that contains every
𝑘-vertex graph as an induced subgraph.
9. ★ Tighter concentration of chromatic number
a) Prove that with probability 1 − 𝑜(1), every vertex subset of 𝐺(𝑛, 1/2) with
at least 𝑛^{1/3} vertices contains an independent set of size at least 𝑐 log 𝑛,
where 𝑐 > 0 is some constant.
b) Prove that there exists some function 𝑓(𝑛) and constant 𝐶 such that for all
𝑛 ≥ 2,
\[ \mathbb{P}\bigl( f(n) \le \chi(G(n, 1/2)) \le f(n) + C\sqrt{n}/\log n \bigr) \ge 0.99. \]

10. Show that for every 𝜀 > 0 there exists 𝐶 > 0 so that every 𝑆 ⊆ [4]^𝑛 with
|𝑆| ≥ 𝜀4^𝑛 contains four elements with pairwise Hamming distance at least
𝑛 − 𝐶√𝑛 apart.


11. Concentration of measure in the symmetric group. Let 𝑈 ⊆ 𝑆ₙ be a set of
at least 𝑛!/2 permutations of [𝑛]. Let 𝑈ₜ denote the set of permutations that
can be obtained starting from some element of 𝑈 and then applying at most 𝑡
transpositions. Prove that
\[ |U_t| \ge \bigl(1 - e^{-ct^{2}/n}\bigr)\, n! \]
for every 𝑡 ≥ 0, where 𝑐 > 0 is some constant.

Hint: Apply Azuma to a Doob martingale that reveals a random permutation.

For the remaining exercises in this section, use Talagrand’s inequality


12. Let 𝑄 be a subset of the unit sphere in ℝ^𝑛. Let x ∈ [−1, 1]^𝑛 be a random vector
with independent random coordinates. Let 𝑋 = sup_{q ∈ 𝑄} ⟨x, q⟩. Let 𝑡 > 0. Prove
that
\[ \mathbb{P}(|X - \mathbb{M}X| \ge t) \le 4e^{-ct^{2}}, \]
where 𝑐 > 0 is some constant.
13. First passage percolation. Prove that there are constants 𝑐, 𝐶 > 0 so that the
following holds. Let 𝐺 be a graph, and let 𝑢 and 𝑤 be two distinct vertices with
distance at most ℓ between them. Every edge of 𝐺 is independently assigned
some random weight in [0, 1] (not necessarily uniform or identically distributed).
The weight of a path is defined to be the sum of the weights of its edges. Let 𝑋
be the minimum weight of a path from 𝑢 to 𝑤 using at most ℓ edges. Prove that
there is some 𝑚 ∈ ℝ so that
\[ \mathbb{P}(|X - m| \ge t) \le Ce^{-ct^{2}/\ell}. \]

14. ★ Second largest eigenvalue of a random matrix. Let 𝐴 be an 𝑛 × 𝑛 random
symmetric matrix whose entries on and above the diagonal are independent and
in [−1, 1]. Show that the second largest eigenvalue 𝜆₂(𝐴) satisfies
\[ \mathbb{P}(|\lambda_2(A) - \mathbb{E}\lambda_2(A)| \ge t) \le Ce^{-ct^{2}}, \]
for every 𝑡 ≥ 0, where 𝐶, 𝑐 > 0 are constants.

Hint: use the Courant–Fischer characterization of the second eigenvalue

15. Longest common subsequence. Let (𝑎₁, …, 𝑎ₙ) and (𝑏₁, …, 𝑏ₘ) be two random
sequences with independent entries (not necessarily identically distributed). Let
𝑋 denote the length of the longest common subsequence, i.e., the largest 𝑘 such
that there exist 𝑖₁ < ⋯ < 𝑖ₖ and 𝑗₁ < ⋯ < 𝑗ₖ with 𝑎_{𝑖₁} = 𝑏_{𝑗₁}, …, 𝑎_{𝑖ₖ} = 𝑏_{𝑗ₖ}.


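Show that, for all 𝑡 ≥ 0,
\[
\mathbb{P}(X \ge \mathbb{M}X + t) \le 2\exp\left( \frac{-ct^{2}}{\mathbb{M}X + t} \right)
\quad \text{and} \quad
\mathbb{P}(X \le \mathbb{M}X - t) \le 2\exp\left( \frac{-ct^{2}}{\mathbb{M}X} \right),
\]
where 𝑐 > 0 is some constant.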


10 Entropy

My greatest concern was what to call it. I thought of calling it “information,”
but the word was overly used, so I decided to call it “uncertainty.”
When I discussed it with John von Neumann, he had a better idea. Von
Neumann told me, “You should call it entropy, for two reasons. In the
first place your uncertainty function has been used in statistical mechanics
under that name, so it already has a name. In the second place, and more
important, nobody knows what entropy really is, so in a debate you will
always have the advantage.”
Claude Shannon, 1971
In this chapter, we look at some neat and powerful applications of entropy to
combinatorics. For a standard introduction to information theory, see the textbook
by Cover and Thomas.

10.1 Basic properties


We define the (binary) entropy of a discrete random variable as follows.

Definition 10.1.1
Given a discrete random variable 𝑋 taking values in 𝑆, with 𝑝ₛ := P(𝑋 = 𝑠), its entropy
(or binary entropy to emphasize the base-2 logarithm) is defined to be
\[ H(X) := \sum_{s \in S} -p_s \log_2 p_s \]

(by convention if 𝑝 𝑠 = 0 then the corresponding summand is set to zero).
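
The definition translates directly into a few lines of Python. This is a minimal sketch; the function name `entropy` and the representation of a distribution as a list of probabilities are choices made for this illustration only.

```python
from math import log2

def entropy(probs):
    """Binary entropy: sum of -p * log2(p), with zero-probability terms contributing 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin
print(entropy([0.25] * 4))   # uniform on four outcomes
print(entropy([1.0, 0.0]))   # a deterministic outcome
```

Skipping the terms with 𝑝 = 0 implements the convention stated in the definition.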

Remark 10.1.2 (Base of the logarithm). It is also fine to use another base for the
logarithm, e.g., the natural log, as long as we are consistent throughout. There is some
combinatorial preference for base 2 due to its interpretation as counting bits. For certain
results, such as Pinsker’s inequality (which we will unfortunately not cover here), the
choice of the base does matter.


Remark 10.1.3 (Information theoretic interpretation). Intuitively, 𝐻(𝑋) measures
the amount of “surprise” in the randomness of 𝑋. It can also be interpreted as the
amount of information learned by seeing the random variable 𝑋. A more rigorous
interpretation of this intuition is given by the Shannon source coding theorem, which,
informally, says that the minimum number of bits needed to encode 𝑛 iid copies of 𝑋
is 𝑛𝐻(𝑋) + 𝑜(𝑛).

Here are some basic properties. Throughout we only consider discrete random variables.
The proofs are all routine calculations. It will be useful to understand the information
theoretic interpretations of these properties.

Lemma 10.1.4 (Uniform bound)

𝐻 (𝑋) ≤ log2 | support(𝑋)|,


with equality if and only if 𝑋 is uniformly distributed.

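Proof. The function 𝑓(𝑥) = −𝑥 log₂ 𝑥 is concave for 𝑥 ∈ [0, 1]. Let 𝑆 = support(𝑋).
Then, by Jensen’s inequality,
\[
H(X) = \sum_{s \in S} f(p_s) \le |S|\, f\!\left( \frac{1}{|S|} \sum_{s \in S} p_s \right) = |S|\, f\!\left( \frac{1}{|S|} \right) = \log_2 |S|. \qquad \square
\]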

We write 𝐻(𝑋, 𝑌) for the entropy of the joint random variables (𝑋, 𝑌). In other words,
letting 𝑍 = (𝑋, 𝑌),
\[ H(X, Y) := H(Z) = \sum_{(x,y)} -\mathbb{P}(X = x, Y = y) \log_2 \mathbb{P}(X = x, Y = y). \]

We can similarly write 𝐻 (𝑋1 , . . . , 𝑋𝑛 ) for joint entropy.

Lemma 10.1.5 (Independence)


If 𝑋 and 𝑌 are independent random variables, then

𝐻 (𝑋, 𝑌 ) = 𝐻 (𝑋) + 𝐻 (𝑌 ).


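Proof. Writing 𝑝ₓ = P(𝑋 = 𝑥) and 𝑝_𝑦 = P(𝑌 = 𝑦), we have
\[
\begin{aligned}
H(X, Y) &= \sum_{(x,y)} -\mathbb{P}(X = x, Y = y) \log_2 \mathbb{P}(X = x, Y = y) \\
&= \sum_{(x,y)} -p_x p_y \log_2 (p_x p_y) \\
&= \sum_{(x,y)} -p_x p_y (\log_2 p_x + \log_2 p_y) \\
&= \sum_{x} -p_x \log_2 p_x + \sum_{y} -p_y \log_2 p_y = H(X) + H(Y). \qquad \square
\end{aligned}
\]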

Definition 10.1.6 (Conditional entropy)


Given jointly distributed random variables 𝑋 and 𝑌, define
\[
\begin{aligned}
H(X \mid Y) &:= \mathbb{E}_y [H(X \mid Y = y)] \\
&= \sum_{y} \mathbb{P}(Y = y)\, H(X \mid Y = y) \\
&= \sum_{y} \mathbb{P}(Y = y) \sum_{x} -\mathbb{P}(X = x \mid Y = y) \log_2 \mathbb{P}(X = x \mid Y = y)
\end{aligned}
\]
(each line unpacks the previous line; in the summations, 𝑥 and 𝑦 range over the
supports of 𝑋 and 𝑌 respectively).

Intuitively, the conditional entropy 𝐻(𝑋|𝑌) measures the amount of additional
information in 𝑋 not contained in 𝑌. This intuition is also captured by the next
lemma.
Some important special cases:
• If 𝑋 = 𝑌, or 𝑋 = 𝑓(𝑌), then 𝐻(𝑋|𝑌) = 0.
• If 𝑋 and 𝑌 are independent, then 𝐻(𝑋|𝑌) = 𝐻(𝑋).
• If 𝑋 and 𝑌 are conditionally independent given 𝑍, then 𝐻(𝑋, 𝑌|𝑍) = 𝐻(𝑋|𝑍) +
𝐻(𝑌|𝑍) and 𝐻(𝑋|𝑌, 𝑍) = 𝐻(𝑋|𝑍).

Lemma 10.1.7 (Chain rule)

𝐻 (𝑋, 𝑌 ) = 𝐻 (𝑋) + 𝐻 (𝑌 |𝑋)

Proof. Writing 𝑝(𝑥, 𝑦) = P(𝑋 = 𝑥, 𝑌 = 𝑦), etc., we have by Bayes’s rule

𝑝(𝑥|𝑦) 𝑝(𝑦) = 𝑝(𝑥, 𝑦),


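and so
\[
\begin{aligned}
H(X \mid Y) := \mathbb{E}_y [H(X \mid Y = y)]
&= \sum_{y} -p(y) \sum_{x} p(x \mid y) \log_2 p(x \mid y) \\
&= \sum_{x,y} -p(x, y) \log_2 \frac{p(x, y)}{p(y)} \\
&= \sum_{x,y} -p(x, y) \log_2 p(x, y) + \sum_{y} p(y) \log_2 p(y) \\
&= H(X, Y) - H(Y).
\end{aligned}
\]
This is the lemma with the roles of 𝑋 and 𝑌 swapped. □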

Lemma 10.1.8 (Subadditivity)


𝐻 (𝑋, 𝑌 ) ≤ 𝐻 (𝑋) + 𝐻 (𝑌 ), and more generally,

𝐻 (𝑋1 , . . . , 𝑋𝑛 ) ≤ 𝐻 (𝑋1 ) + · · · + 𝐻 (𝑋𝑛 ).

Proof. Let 𝑓(𝑡) = log₂(1/𝑡), which is convex. Then
\[
\begin{aligned}
H(X) + H(Y) - H(X, Y)
&= \sum_{x,y} \bigl( -p(x, y) \log_2 p(x) - p(x, y) \log_2 p(y) + p(x, y) \log_2 p(x, y) \bigr) \\
&= \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)} \\
&= \sum_{x,y} p(x, y)\, f\!\left( \frac{p(x)\, p(y)}{p(x, y)} \right) \\
&\ge f\!\left( \sum_{x,y} p(x, y)\, \frac{p(x)\, p(y)}{p(x, y)} \right) = f(1) = 0.
\end{aligned}
\]
More generally, by iterating the above inequality for two random variables, we have
\[
\begin{aligned}
H(X_1, \dots, X_n) &\le H(X_1, \dots, X_{n-1}) + H(X_n) \\
&\le H(X_1, \dots, X_{n-2}) + H(X_{n-1}) + H(X_n) \\
&\le \cdots \le H(X_1) + \cdots + H(X_n). \qquad \square
\end{aligned}
\]

Remark 10.1.9 (Mutual information). The nonnegative quantity

𝐼(𝑋; 𝑌) := 𝐻(𝑋) + 𝐻(𝑌) − 𝐻(𝑋, 𝑌)

is called mutual information. Intuitively, it measures the amount of common
information between 𝑋 and 𝑌.
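
As a sanity check on the chain rule, subadditivity, and the nonnegativity of mutual information, one can evaluate these quantities on a small joint distribution. The sketch below is illustrative only: the joint distribution and the helper names `H`, `marginal`, and `cond_entropy` are made up for this example, and conditional entropy is computed directly from Definition 10.1.6.

```python
from math import log2

def H(dist):
    """Entropy of a distribution given as a dict {outcome: probability}."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal of coordinate `axis` (0 for X, 1 for Y) of a joint distribution on pairs."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def cond_entropy(joint, given_axis):
    """Entropy of the other coordinate conditioned on coordinate `given_axis`."""
    total = 0.0
    for g, p_g in marginal(joint, given_axis).items():
        if p_g == 0:
            continue
        conditional = {pair: p / p_g for pair, p in joint.items() if pair[given_axis] == g}
        total += p_g * H(conditional)
    return total

# A hypothetical joint distribution of (X, Y) on {0,1} x {0,1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
HX, HY, HXY = H(marginal(joint, 0)), H(marginal(joint, 1)), H(joint)
H_Y_given_X = cond_entropy(joint, 0)
H_X_given_Y = cond_entropy(joint, 1)
mutual_information = HX + HY - HXY

print(HXY, HX + H_Y_given_X)   # chain rule: the two values agree
print(HXY, HX + HY)            # subadditivity: first value is at most the second
print(mutual_information)      # nonnegative
```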


Lemma 10.1.10 (Dropping conditioning)


𝐻 (𝑋 |𝑌 ) ≤ 𝐻 (𝑋) and more generally,

𝐻 (𝑋 |𝑌 , 𝑍) ≤ 𝐻 (𝑋 |𝑍).

Proof. By the chain rule and subadditivity, we have

𝐻(𝑋|𝑌) = 𝐻(𝑋, 𝑌) − 𝐻(𝑌) ≤ 𝐻(𝑋).

The inequality conditioning on 𝑍 follows since the above implies that

𝐻(𝑋|𝑌, 𝑍 = 𝑧) ≤ 𝐻(𝑋|𝑍 = 𝑧)

holds for every 𝑧, and taking expectation over 𝑧 yields 𝐻(𝑋|𝑌, 𝑍) ≤ 𝐻(𝑋|𝑍). □

Remark 10.1.11. A related theorem is the data processing inequality: 𝐻(𝑋|𝑓(𝑌)) ≥
𝐻(𝑋|𝑌) for any function 𝑓. More generally, 𝑓 can be random. In other words, if
𝑋 → 𝑌 → 𝑍 is a Markov chain, then 𝐻(𝑋|𝑍) ≥ 𝐻(𝑋|𝑌) (exercise: prove this).

Here are some simple applications of entropy to tail bounds.


Let us denote the entropy of a Bernoulli random variable by

𝐻 ( 𝑝) := 𝐻 (Bernoulli( 𝑝)) = −𝑝 log2 𝑝 − (1 − 𝑝) log2 (1 − 𝑝).

[Figure: graph of the binary entropy function 𝐻(𝑝) for 𝑝 ∈ [0, 1].]

(This notation 𝐻 (·) is standard but unfortunately ambiguous: 𝐻 (𝑋) versus 𝐻 ( 𝑝). It
is usually clear from context which is meant.)

Theorem 10.1.12
If 0 < 𝑘 ≤ 𝑛/2, then
\[ \sum_{0 \le i \le k} \binom{n}{i} \le 2^{H(k/n)\, n} = \left( \frac{n}{k} \right)^{k} \left( \frac{n}{n-k} \right)^{n-k}. \]


This bound can be established using our proof technique for the Chernoff bound, by
applying Markov’s inequality to the moment generating function:
\[ \sum_{0 \le i \le k} \binom{n}{i} \le \frac{(1 + x)^{n}}{x^{k}} \quad \text{for all } x \in [0, 1]. \]
The infimum of the RHS over 𝑥 ∈ [0, 1] is precisely 2^{𝐻(𝑘/𝑛)𝑛}.


Now let us give a purely information theoretic proof to get some practice with entropy.

Proof. Let (𝑋₁, …, 𝑋ₙ) ∈ {0, 1}^𝑛 be chosen uniformly conditioned on 𝑋₁ + ⋯ + 𝑋ₙ ≤ 𝑘.
Then
\[ \log_2 \sum_{0 \le i \le k} \binom{n}{i} = H(X_1, \dots, X_n) \le H(X_1) + \cdots + H(X_n). \]
Each 𝑋ᵢ is a Bernoulli with probability P(𝑋ᵢ = 1). Note that conditioned on
𝑋₁ + ⋯ + 𝑋ₙ = 𝑚, one has P(𝑋ᵢ = 1) = 𝑚/𝑛. Varying over 𝑚 ≤ 𝑘 ≤ 𝑛/2, we find
P(𝑋ᵢ = 1) ≤ 𝑘/𝑛 ≤ 1/2, so 𝐻(𝑋ᵢ) ≤ 𝐻(𝑘/𝑛) since 𝐻(𝑝) is increasing on [0, 1/2]. Hence
\[ \log_2 \sum_{0 \le i \le k} \binom{n}{i} \le H(k/n)\, n. \qquad \square \]
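
The bound of Theorem 10.1.12 is easy to check numerically for small parameters. A minimal sketch, with hypothetical helper names `binary_entropy` and `binomial_tail_vs_bound`:

```python
from math import comb, log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def binomial_tail_vs_bound(n, k):
    """Return (sum of C(n, i) for 0 <= i <= k, 2^(H(k/n) n))."""
    lhs = sum(comb(n, i) for i in range(k + 1))
    rhs = 2.0 ** (binary_entropy(k / n) * n)
    return lhs, rhs

for n, k in [(10, 3), (20, 10), (100, 25)]:
    lhs, rhs = binomial_tail_vs_bound(n, k)
    print(n, k, lhs, rhs, lhs <= rhs)
```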

Remark 10.1.13. One can extend the above proof to bound the tail of Binomial(𝑛, 𝑝)
for any 𝑝. The result can be expressed in terms of the relative entropy (also known
as the Kullback–Leibler divergence) between two Bernoulli random variables. More
concretely, for 𝑋 ∼ Binomial(𝑛, 𝑝), one has
\[ \frac{\log \mathbb{P}(X \le nq)}{n} \le -q \log \frac{q}{p} - (1 - q) \log \frac{1 - q}{1 - p} \quad \text{for all } 0 \le q \le p, \]
and
\[ \frac{\log \mathbb{P}(X \ge nq)}{n} \le -q \log \frac{q}{p} - (1 - q) \log \frac{1 - q}{1 - p} \quad \text{for all } p \le q \le 1. \]

10.2 Permanent, perfect matchings, and Steiner triple systems

Permanent
We define the permanent of an 𝑛 × 𝑛 matrix 𝐴 by
\[ \operatorname{per} A := \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}. \]


The formula for the permanent is simply that of the determinant without the sign factor:
\[ \det A := \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma(i)}. \]
We’ll consider {0, 1}-valued matrices. If 𝐴 is the bipartite adjacency matrix of a
bipartite graph, then

per 𝐴 = the number of perfect matchings.

The following theorem gives an upper bound on the number of perfect matchings of
a bipartite graph with a given degree distribution. It was conjectured by Minc (1963)
and proved by Brégman (1973).

Theorem 10.2.1 (Brégman–Minc inequality)


Let 𝐴 = (𝑎ᵢⱼ) ∈ {0, 1}^{𝑛×𝑛}, whose 𝑖-th row has sum 𝑑ᵢ. Then
\[ \operatorname{per} A \le \prod_{i=1}^{n} (d_i!)^{1/d_i}. \]

Note that equality is attained when 𝐴 consists of diagonal blocks of 1’s (corresponding
to perfect matchings in a bipartite graph of the form 𝐾_{𝑑₁,𝑑₁} ⊔ ⋯ ⊔ 𝐾_{𝑑ₜ,𝑑ₜ}).
Let 𝜎 be a uniform random permutation of [𝑛] conditioned on 𝑎𝑖𝜎𝑖 = 1 for all 𝑖 ∈ [𝑛].
Then

log2 per 𝐴 = 𝐻 (𝜎) = 𝐻 (𝜎1 , . . . , 𝜎𝑛 ) = 𝐻 (𝜎1 ) + 𝐻 (𝜎2 |𝜎1 ) + · · · + 𝐻 (𝜎𝑛 |𝜎1 , . . . , 𝜎𝑛−1 ).

One could bound each term by
𝐻(𝜎ᵢ | 𝜎₁, …, 𝜎_{𝑖−1}) ≤ 𝐻(𝜎ᵢ) ≤ log₂ |support 𝜎ᵢ| ≤ log₂ 𝑑ᵢ,
but this step would be too lossy. In fact, what we just did amounts to a naive worst
case counting argument.
The key new idea is to reveal the chosen entries in a uniform random order.

Proof. (Radhakrishnan 1997) Let 𝜎 be as earlier. Consider a permutation 𝜏 of [𝑛]
representing an ordering of the rows of the matrix. Say that 𝑖 appears before 𝑗 if 𝜏ᵢ < 𝜏ⱼ.
Let 𝑁ᵢ = 𝑁ᵢ(𝜎, 𝜏) be the number of ones in row 𝑖 that do not lie in the same column
as some entry (𝑗, 𝜎ⱼ) with 𝑗 coming before 𝑖. (Intuitively, 𝑁ᵢ is the number of “greedily
available” choices for 𝜎ᵢ before it is revealed.)


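For any 𝜏, the chain rule gives
\[ H(\sigma) = \sum_{i=1}^{n} H\bigl(\sigma_i \mid \sigma_j : j \text{ comes before } i\bigr), \]
and the uniform bound gives
\[ H\bigl(\sigma_i \mid \sigma_j : j \text{ comes before } i\bigr) \le \mathbb{E}_{\sigma} \log_2 N_i. \]
Let 𝜏 vary uniformly over all permutations. Then,
\[ H(\sigma) \le \mathbb{E}_{\sigma,\tau} \sum_{i=1}^{n} \log_2 N_i. \]
For any fixed 𝜎, as 𝜏 varies uniformly over all permutations of [𝑛], 𝑁ᵢ varies uniformly
over [𝑑ᵢ]. (Why?) Thus
\[ \mathbb{E}_{\tau} \log_2 N_i = \frac{\log_2 1 + \cdots + \log_2 d_i}{d_i} = \frac{\log_2 (d_i!)}{d_i}. \]
Taking expectation over 𝜎 and summing over 𝑖 yields
\[ \log_2 \operatorname{per} A = H(\sigma) \le \mathbb{E}_{\sigma,\tau} \sum_{i=1}^{n} \log_2 N_i \le \sum_{i=1}^{n} \frac{\log_2 (d_i!)}{d_i}. \qquad \square \]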
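
For small matrices, the Brégman–Minc inequality can be verified by brute force. The sketch below is an illustration under stated assumptions: `permanent` and `bregman_bound` are hypothetical helper names, the permanent is computed straight from its definition (exponential time), and rows with sum zero are handled separately since then per 𝐴 = 0.

```python
from itertools import permutations, product
from math import factorial, prod

def permanent(A):
    """Permanent of a square matrix, summing over all permutations (small n only)."""
    n = len(A)
    return sum(prod(A[i][s[i]] for i in range(n)) for s in permutations(range(n)))

def bregman_bound(A):
    """Product over rows of (d_i!)^(1/d_i), where d_i is the i-th row sum."""
    row_sums = [sum(row) for row in A]
    if any(d == 0 for d in row_sums):
        return 0.0  # an all-zero row forces the permanent to be zero
    return prod(factorial(d) ** (1.0 / d) for d in row_sums)

# Equality case: diagonal blocks of ones of sizes 2 and 3.
blocks = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
print(permanent(blocks), bregman_bound(blocks))

# Exhaustive check over all 3 x 3 matrices with entries in {0, 1}.
all_ok = all(
    permanent([list(bits[0:3]), list(bits[3:6]), list(bits[6:9])])
    <= bregman_bound([list(bits[0:3]), list(bits[3:6]), list(bits[6:9])]) + 1e-9
    for bits in product([0, 1], repeat=9)
)
print(all_ok)
```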

Corollary 10.2.2 (Kahn and Lovász)


Let 𝐺 be a graph. Let 𝑑ᵥ denote the degree of 𝑣. Then the number pm(𝐺) of perfect
matchings of 𝐺 satisfies
\[ \operatorname{pm}(G) \le \prod_{v \in V(G)} (d_v!)^{1/(2d_v)} = \prod_{v \in V(G)} \operatorname{pm}(K_{d_v,d_v})^{1/(2d_v)}. \]

Proof. (Alon and Friedland 2008) Brégman’s theorem implies the statement for bipartite
graphs 𝐺 (by considering a bipartition on 𝐺 ⊔ 𝐺). For the extension to non-bipartite
𝐺, one can proceed via a combinatorial argument that pm(𝐺 ⊔ 𝐺) ≤ pm(𝐺 × 𝐾₂),
which is left as an exercise. □


The maximum number of Hamilton paths in a tournament

Question 10.2.3
What is the maximum possible number of directed Hamilton paths in an 𝑛-vertex
tournament?

Earlier we saw that a uniformly random tournament has 𝑛!/2^{𝑛−1} Hamilton paths in
expectation, and hence there is some tournament with at least this many Hamilton
paths. This result, due to Szele, is the earliest application of the probabilistic method.
Using Brégman’s theorem, Alon proved a nearly matching upper bound.
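
The expectation 𝑛!/2^{𝑛−1} can be confirmed by exhaustive enumeration for very small 𝑛. The following sketch enumerates all tournaments on 4 vertices; the helper names `hamilton_paths` and `all_tournaments` are illustrative and not from the notes.

```python
from itertools import combinations, permutations, product
from math import factorial

def hamilton_paths(n, beats):
    """Count directed Hamilton paths; beats[(i, j)] is True when the edge is oriented i -> j."""
    return sum(
        all(beats[(p[t], p[t + 1])] for t in range(n - 1))
        for p in permutations(range(n))
    )

def all_tournaments(n):
    """Yield every orientation of the complete graph on n vertices."""
    pairs = list(combinations(range(n), 2))
    for bits in product([False, True], repeat=len(pairs)):
        beats = {}
        for (i, j), b in zip(pairs, bits):
            beats[(i, j)] = b
            beats[(j, i)] = not b
        yield beats

n = 4
counts = [hamilton_paths(n, t) for t in all_tournaments(n)]
average = sum(counts) / len(counts)
print(average, factorial(n) / 2 ** (n - 1), max(counts))
```

Since the maximum is at least the average, some tournament attains at least 𝑛!/2^{𝑛−1} Hamilton paths, which is the averaging argument recalled above.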

Theorem 10.2.4 (Alon 1990)


Every 𝑛-vertex tournament has at most 𝑂(𝑛^{3/2} · 𝑛!/2^𝑛) Hamilton paths.

Remark 10.2.5. The upper bound has been improved to 𝑂(𝑛^{3/2−𝛾} 𝑛!/2^𝑛) for some
small constant 𝛾 > 0 (Friedgut and Kahn 2005), while the lower bound 𝑛!/2^{𝑛−1} has
been improved by a constant factor (Adler, Alon, and Ross 2001, Wormald 2004). It
remains open to close this 𝑛^{𝑂(1)} factor gap.

We first prove an upper bound on the number of Hamilton cycles.

Theorem 10.2.6 (Alon 1990)



Every 𝑛-vertex tournament has at most 𝑂(√𝑛 · 𝑛!/2^𝑛) Hamilton cycles.

Proof. Let 𝐴 be an 𝑛 × 𝑛 matrix whose (𝑖, 𝑗) entry is 1 if 𝑖 → 𝑗 is an edge of the
tournament and 0 otherwise. Let 𝑑ᵢ be the sum of the 𝑖-th row. Then per 𝐴 counts the
number of 1-factors (spanning disjoint unions of directed cycles) of the tournament.
So by Brégman’s theorem, we have
\[ \text{number of Hamilton cycles} \le \operatorname{per} A \le \prod_{i=1}^{n} (d_i!)^{1/d_i}. \]
One can check (omitted) that the function 𝑔(𝑥) = (𝑥!)^{1/𝑥} is log-concave, i.e.,
𝑔(𝑛)𝑔(𝑛 + 2) ≤ 𝑔(𝑛 + 1)² for all 𝑛 ≥ 1. Thus, by a smoothing argument, among sequences
(𝑑₁, …, 𝑑ₙ) with sum (𝑛 choose 2), the RHS above is maximized when all the 𝑑ᵢ’s are
within 1 of each other, which, by Stirling’s formula, gives 𝑂(√𝑛 · 𝑛!/2^𝑛). □

Theorem 10.2.4 then follows by applying the above bound with the following lemma.


Lemma 10.2.7
Given an 𝑛-vertex tournament with 𝑃 Hamilton paths, one can add a new vertex to
obtain an (𝑛 + 1)-vertex tournament with at least 𝑃/4 Hamilton cycles.

Proof. Add a new vertex and orient its incident edges uniformly at random. For every
Hamilton path in the 𝑛-vertex tournament, there is probability 1/4 that it can be closed
up into a Hamilton cycle through the new vertex. The claim then follows by linearity
of expectation. □

Steiner triple systems

Definition 10.2.8 (Steiner triple system)


A Steiner triple system (STS) of order 𝑛 is a 3-uniform hypergraph on 𝑛 vertices where
every pair of vertices is contained in exactly one triple.

Equivalently: an STS is a decomposition of a complete graph 𝐾ₙ into edge-disjoint
triangles.
Example: the Fano plane is an STS of order 7.
It is a classic result that an STS of order 𝑛 exists if and only if 𝑛 ≡ 1 or 3 mod 6. It is
not hard to see that this is necessary, since if an STS of order 𝑛 exists, then (𝑛 choose 2)
should be divisible by 3, and 𝑛 − 1 should be divisible by 2. Keevash (2014+) obtained a
significant breakthrough proving the existence of more general designs.
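
Both the Fano plane example and the divisibility conditions can be checked mechanically. In this sketch the vertex labelling of the Fano plane (points 0 through 6) and the names `is_steiner_triple_system` and `admissible_orders` are choices made for illustration.

```python
from itertools import combinations
from math import comb

def is_steiner_triple_system(n, triples):
    """Check that every pair of vertices in range(n) lies in exactly one triple."""
    seen = {}
    for t in triples:
        if len(set(t)) != 3 or not all(0 <= v < n for v in t):
            return False
        for pair in combinations(sorted(t), 2):
            seen[pair] = seen.get(pair, 0) + 1
    return all(seen.get(pair, 0) == 1 for pair in combinations(range(n), 2))

# One standard labelling of the seven lines of the Fano plane.
fano = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
print(is_steiner_triple_system(7, fano))

# Orders passing the two necessary divisibility conditions from the text.
admissible_orders = [n for n in range(1, 40) if comb(n, 2) % 3 == 0 and (n - 1) % 2 == 0]
print(admissible_orders)
```

The second list consists exactly of the integers congruent to 1 or 3 mod 6 in that range, matching the stated characterization.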

Question 10.2.9
How many STS are there on 𝑛 labeled vertices?

We shall prove the following result.

Theorem 10.2.10 (Upper bound on the number of STS — Linial and Luria 2013)
The number of Steiner triple systems on 𝑛 labeled vertices is at most
\[ \left( \frac{n}{e^{2} + o(1)} \right)^{n^{2}/6}. \]

Remark 10.2.11. Keevash (2018) proved a matching lower bound when 𝑛 ≡ 1, 3
(mod 6).

Proof. As in the earlier proof, the idea is to reveal the triples in a random order.


Let 𝑋 denote a uniformly chosen STS on 𝑛 vertices. We wish to upper bound 𝐻(𝑋).
We encode 𝑋 as a tuple (𝑋ᵢⱼ)_{𝑖<𝑗} ∈ [𝑛]^{(𝑛 choose 2)}, where 𝑋ᵢⱼ is the label of the
unique vertex that forms a triple with 𝑖 and 𝑗 in the STS. Here whenever we write 𝑖𝑗 we
mean the unordered pair {𝑖, 𝑗}, i.e., an edge of 𝐾ₙ.
Let 𝑦 = (𝑦ᵢⱼ)_{𝑖<𝑗} ∈ [0, 1]^{(𝑛 choose 2)}, and we order the edges of 𝐾ₙ in decreasing 𝑦ᵢⱼ:

𝑘𝑙 ≺ 𝑖𝑗 if 𝑦ₖₗ > 𝑦ᵢⱼ.

By the chain rule,
\[ H(X) = \sum_{ij} H\bigl(X_{ij} \mid X_{kl} : kl \prec ij\bigr). \]
Let

𝑁ᵢⱼ = 𝑁ᵢⱼ(𝑋, 𝑦) = the number of possibilities for 𝑋ᵢⱼ after revealing 𝑋ₖₗ for all 𝑘𝑙 ≺ 𝑖𝑗.

By the uniform bound, we have
\[ H(X) \le \mathbb{E}_X \sum_{ij} \log_2 N_{ij}. \]
Now let 𝑦 = (𝑦ᵢⱼ)_{𝑖<𝑗} ∈ [0, 1]^{(𝑛 choose 2)} be chosen uniformly at random. We have
\[ H(X) \le \mathbb{E}_X \mathbb{E}_y \sum_{ij} \log_2 N_{ij}. \]

Write 𝑦_{−𝑖𝑗} ∈ [0, 1]^{(𝑛 choose 2) − 1} to mean 𝑦 with the 𝑖𝑗-coordinate removed. Let us
bound E_{𝑦_{−𝑖𝑗}} log₂ 𝑁ᵢⱼ as a function of 𝑦ᵢⱼ.
We define “𝑖𝑗 shows up first in its triple” to be the event that 𝑖𝑗 ≺ 𝑖𝑘, 𝑗𝑘 where 𝑘 = 𝑋ᵢⱼ.
We have, for any fixed 𝑋,
\[
\mathbb{P}_{y_{-ij}}(ij \text{ shows up first in its triple})
= \mathbb{P}_{y_{-ij}}(ij \prec ik, jk)
= \mathbb{P}_{y_{-ij}}(y_{ij} > y_{ik}, y_{jk}) = y_{ij}^{2}.
\]
If 𝑖𝑗 does not show up first in its triple, then 𝑋ᵢⱼ has exactly one possibility (namely 𝑘)
by the time it gets revealed, and so 𝑁ᵢⱼ = 1 and log₂ 𝑁ᵢⱼ = 0. Thus
\[
\begin{aligned}
\mathbb{E}_{y_{-ij}} \log_2 N_{ij}
&= y_{ij}^{2}\, \mathbb{E}_{y_{-ij}} \bigl[ \log_2 N_{ij} \mid ij \text{ shows up first in its triple} \bigr] \\
&\le y_{ij}^{2} \log_2 \mathbb{E}_{y_{-ij}} \bigl[ N_{ij} \mid ij \text{ shows up first in its triple} \bigr].
\end{aligned}
\]


 

Now we use linearity of expectations (over 𝑦_{−𝑖𝑗} with fixed 𝑋). For each 𝑠 ∈ [𝑛] \
{𝑖, 𝑗, 𝑘}, if 𝑠 is available as a possibility for 𝑋ᵢⱼ by the time 𝑋ᵢⱼ is revealed, then none
of the six edges of 𝐾ₙ making up the two triangles 𝑖𝑠𝑋_{𝑖𝑠} and 𝑗𝑠𝑋_{𝑗𝑠} may occur before


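𝑋ᵢⱼ; the latter event occurs with probability 𝑦ᵢⱼ⁶. So
\[ \mathbb{E}_{y_{-ij}} \bigl[ N_{ij} \mid ij \text{ shows up first in its triple} \bigr] \le 1 + (n - 3)\, y_{ij}^{6}. \]
Thus, substituting 𝑡 = 𝑦ᵢⱼ³ in the last step,
\[
\mathbb{E}_{y} \log_2 N_{ij}
\le \int_0^1 y_{ij}^{2} \log_2 \bigl( 1 + (n - 3)\, y_{ij}^{6} \bigr)\, dy_{ij}
= \frac{1}{3} \int_0^1 \log_2 \bigl( 1 + (n - 3)\, t^{2} \bigr)\, dt.
\]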

This integral actually has a closed-form antiderivative (e.g., check Mathematica/Wolfram
Alpha), but it suffices for us to obtain the asymptotics. We have
\[ \int_0^1 \log_2 \left( \frac{1}{n - 3} + t^{2} \right) dt \to \int_0^1 \log_2 (t^{2})\, dt = -2 \log_2 e \]
as 𝑛 → ∞ by the monotone convergence theorem. Thus
\[ \mathbb{E}_{y} \log_2 N_{ij} \le \frac{\log_2 (n/e^{2}) + o(1)}{3}. \]
It follows therefore that the log-number of STS on 𝑛 vertices is
\[
H(X) \le \mathbb{E}_X \mathbb{E}_y \sum_{ij} \log_2 N_{ij}
\le \binom{n}{2} \frac{\log_2 (n/e^{2}) + o(1)}{3}
= \frac{n^{2}}{6} \log_2 \left( \frac{n}{e^{2} + o(1)} \right). \qquad \square
\]

Remark 10.2.12 (Guessing the formula). Here is perhaps how we might have guessed
the formula for the number of STSs. Suppose we select (1/3)(𝑛 choose 2) triangles in 𝐾ₙ
independently at random. What is the probability that every edge is contained in exactly
one triangle? Each edge is contained in one triangle in expectation, and so by the Poisson
approximation, the probability that a single fixed edge is contained in exactly one
triangle is 1/𝑒 + 𝑜(1). Now let us pretend as if all the edges behave independently (!):
the probability that every edge is contained in exactly one triangle is then
(1/𝑒 + 𝑜(1))^{(𝑛 choose 2)}. This would then lead us to guess that the number of STSs is
\[
\frac{1}{\bigl( \frac{1}{3} \binom{n}{2} \bigr)!} \binom{n}{3}^{\frac{1}{3} \binom{n}{2}} \left( \frac{1}{e} + o(1) \right)^{\binom{n}{2}}
= \left( \left( \frac{n^{2}}{6e} \right)^{-n^{2}/6} \left( \frac{n^{3}}{6} \right)^{n^{2}/6} \left( \frac{1}{e} \right)^{n^{2}/2} \right)^{1 + o(1)}
= \left( \frac{n}{e^{2} + o(1)} \right)^{n^{2}/6}.
\]

Here is another heuristic for getting the formula, and this time this method can actually
be turned into a proof of a matching lower bound on the number of STSs, though with
a lot of work (Keevash 2018). Suppose we remove triangles from 𝐾ₙ one at a time.
After 𝑘 triangles have been removed, the number of edges remaining is (𝑛 choose 2) − 3𝑘. Let
us pretend that the remaining edges were randomly distributed. Then the number of


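triangles should be about
\[ \binom{n}{3} \left( 1 - \frac{3k}{\binom{n}{2}} \right)^{3} \sim \frac{36}{n^{3}} \left( \frac{1}{3} \binom{n}{2} - k \right)^{3}. \]
If we multiply the above quantity over 0 ≤ 𝑘 < (1/3)(𝑛 choose 2), and then divide by
((1/3)(𝑛 choose 2))! to account for the ordering of the triangles, we get
\[
\left( \frac{36}{n^{3}} \right)^{n^{2}/6} \frac{\bigl( \frac{1}{3} \binom{n}{2} \bigr)!^{\,3}}{\bigl( \frac{1}{3} \binom{n}{2} \bigr)!}
\approx \left( \frac{n}{e^{2} + o(1)} \right)^{n^{2}/6}.
\]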

10.3 Sidorenko’s inequality


Given graphs 𝐹 and 𝐺, a graph homomorphism from 𝐹 to 𝐺 is a map 𝜙 : 𝑉 (𝐹) →
𝑉 (𝐺) of vertices that sends edges to edges, i.e., 𝜙(𝑢)𝜙(𝑣) ∈ 𝐸 (𝐺) for all 𝑢𝑣 ∈ 𝐸 (𝐹).
Let
hom(𝐹, 𝐺) = the number of graph homomorphisms from 𝐹 to 𝐺.
Define the homomorphism density (the 𝑭-density in 𝑮) by
\[
t(F, G) = \frac{\hom(F, G)}{v(G)^{v(F)}}
= \mathbb{P}(\text{a uniform random map } V(F) \to V(G) \text{ is a graph homomorphism } F \to G).
\]

In this section, we are interested in the regime of fixed 𝐹 and large 𝐺, in which case
almost all maps 𝑉 (𝐹) → 𝑉 (𝐺) are injective, so that there is not much difference
between homomorphisms and subgraphs. More precisely,

hom(𝐹, 𝐺) = aut(𝐹) · (#copies of 𝐹 in 𝐺 as a subgraph) + 𝑂_𝐹(𝑣(𝐺)^{𝑣(𝐹)−1}),

where aut(𝐹) is the number of automorphisms of 𝐹.
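
For small graphs, homomorphism counts and densities can be computed by enumerating all vertex maps. A minimal brute-force sketch, in which the function names `hom` and `density` and the edge-list and adjacency-matrix encodings are assumptions of this illustration:

```python
from itertools import product

def hom(F_edges, F_n, G_adj):
    """Number of maps V(F) -> V(G) that send every edge of F to an edge of G."""
    G_n = len(G_adj)
    return sum(
        all(G_adj[phi[u]][phi[v]] for u, v in F_edges)
        for phi in product(range(G_n), repeat=F_n)
    )

def density(F_edges, F_n, G_adj):
    """Homomorphism density hom(F, G) / v(G)^v(F)."""
    return hom(F_edges, F_n, G_adj) / len(G_adj) ** F_n

# G is a triangle; F ranges over an edge, a 2-edge path, and a triangle.
K3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
edge = [(0, 1)]
path2 = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]

print(hom(edge, 2, K3), hom(path2, 3, K3), hom(triangle, 3, K3))
print(density(edge, 2, K3))
```

Here hom(𝐾₂, 𝐺) equals 2𝑒(𝐺), and hom(𝐾₃, 𝐾₃) equals aut(𝐾₃) since the triangle contains exactly one copy of itself.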


Inequalities between graph homomorphism densities are a central topic in extremal
graph theory. For example, see Chapter 5 of my book Graph Theory and Additive
Combinatorics. Much of the rest of this chapter is adapted from §5.5 of the book.

Question 10.3.1
Given a fixed graph 𝐹 and constant 𝑝 ∈ [0, 1], what is the minimum possible 𝐹-density
in a graph with edge density at least 𝑝?


The 𝐹-density in the random graph 𝐺(𝑛, 𝑝) is 𝑝^{𝑒(𝐹)} + 𝑜(1). Here 𝑝 is fixed and 𝑛 → ∞.
Can one do better?
If 𝐹 is non-bipartite, then the complete bipartite graph 𝐾𝑛/2,𝑛/2 has 𝐹-density zero.
(The problem of minimizing 𝐹-density is still interesting and not easy; it has been
solved for cliques.)
Sidorenko’s conjecture (1993) (also proposed by Erdős and Simonovits (1983)) says
for any fixed bipartite 𝐹, the random graph asymptotically minimizes 𝐹-density. This
is an important and well-known conjecture in extremal graph theory.

Conjecture 10.3.2 (Sidorenko)


For every bipartite graph 𝐹, and any graph 𝐺,

𝑡(𝐹, 𝐺) ≥ 𝑡(𝐾₂, 𝐺)^{𝑒(𝐹)}.

The conjecture is known to hold for a large family of graphs 𝐹.


The entropy approach to Sidorenko’s conjecture was first introduced by Li and Szegedy
(2011) and later further developed in subsequent works. Here we illustrate the entropy
approach to Sidorenko’s conjecture with several examples.
We will construct a probability distribution 𝜇 on Hom(𝐹, 𝐺), the set of all graph
homomorphisms 𝐹 → 𝐺. Unlike earlier applications of entropy, here we are trying to
prove a lower bound on hom(𝐹, 𝐺) instead of an upper bound. So instead of taking
𝜇 to be a uniform distribution (which automatically has entropy log₂ hom(𝐹, 𝐺)), we
actually take 𝜇 to be a carefully constructed distribution, and apply the upper bound

𝐻 (𝜇) ≤ log2 |support 𝜇| = log2 hom(𝐹, 𝐺).

We are trying to show that
\[ \frac{\hom(F, G)}{v(G)^{v(F)}} \ge \left( \frac{2e(G)}{v(G)^{2}} \right)^{e(F)}. \]

So we would like to find a probability distribution 𝜇 on Hom(𝐹, 𝐺) satisfying

𝐻 (𝜇) ≥ 𝑒(𝐹) log2 (2𝑒(𝐺)) − (2𝑒(𝐹) − 𝑣(𝐹)) log2 𝑣(𝐺). (10.1)

Theorem 10.3.3 (Blakley and Roy 1965)


Sidorenko’s conjecture holds if 𝐹 is a three-edge path.

Proof. We choose randomly a walk 𝑋𝑌 𝑍𝑊 in 𝐺 as follows:


• 𝑋𝑌 is a uniform random edge of 𝐺 (by this we mean first choosing an edge of
𝐺 uniformly at random, and then letting 𝑋 be a uniformly chosen endpoint of this
edge, and 𝑌 the other endpoint);
• 𝑍 is a uniform random neighbor of 𝑌;
• 𝑊 is a uniform random neighbor of 𝑍.
Key observation: 𝑌𝑍 is distributed as a uniform random edge of 𝐺, and likewise with
𝑍𝑊.
Indeed, conditioned on the choice of 𝑌, the vertices 𝑋 and 𝑍 are independent
uniform neighbors of 𝑌, so 𝑋𝑌 and 𝑌𝑍 have the same distribution, namely that of a
uniform random edge.
Also, the conditional independence observation implies that

𝐻 (𝑍 |𝑋, 𝑌 ) = 𝐻 (𝑍 |𝑌 ) and 𝐻 (𝑊 |𝑋, 𝑌 , 𝑍) = 𝐻 (𝑊 |𝑍)

and furthermore both quantities are equal to 𝐻(𝑌|𝑋), since 𝑋𝑌, 𝑌𝑍, 𝑍𝑊 are each
distributed as a uniform random edge.
Thus

𝐻 (𝑋, 𝑌 , 𝑍, 𝑊) = 𝐻 (𝑋) + 𝐻 (𝑌 |𝑋) + 𝐻 (𝑍 |𝑋, 𝑌 ) + 𝐻 (𝑊 |𝑋, 𝑌 , 𝑍) [chain rule]


= 𝐻 (𝑋) + 𝐻 (𝑌 |𝑋) + 𝐻 (𝑍 |𝑌 ) + 𝐻 (𝑊 |𝑍) [cond indep]
= 𝐻 (𝑋) + 3𝐻 (𝑌 |𝑋)
= 3𝐻 (𝑋, 𝑌 ) − 2𝐻 (𝑋) [chain rule]
≥ 3 log2 (2𝑒(𝐺)) − 2 log2 𝑣(𝐺)

In the final step we used 𝐻 (𝑋, 𝑌 ) = log2 (2𝑒(𝐺)) since 𝑋𝑌 is uniformly distributed
among edges, and 𝐻 (𝑋) ≤ log2 |support(𝑋)| = log2 𝑣(𝐺). This proves (10.1) and
hence the theorem for a path of 4 vertices. (As long as the final expression has the
“right form” and none of the steps are lossy, the proof should work out.) □
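As a sanity check on the argument above, here is a minimal Python sketch that builds the walk distribution 𝜇 exactly on a small graph and compares 𝐻(𝜇) with the lower bound in (10.1) and with log2 hom(𝐹, 𝐺). The function name and the adjacency-list input format are choices made for this illustration and are not part of the notes.

```python
import math

def walk_entropy_bounds(adj):
    """Return (H(mu), lower bound from (10.1), log2 hom(P4, G)) for the walk XYZW.

    adj maps each vertex to the list of its neighbours; the graph is assumed
    to be simple with no isolated vertices (an assumption of this sketch).
    """
    two_e = sum(len(nbrs) for nbrs in adj.values())  # 2e(G) = number of ordered edges
    v = len(adj)
    entropy = 0.0
    hom = 0
    for x in adj:
        for y in adj[x]:
            for z in adj[y]:
                for w in adj[z]:
                    hom += 1  # every walk x-y-z-w is a homomorphism of the 3-edge path
                    # XY is a uniform ordered edge, Z a uniform neighbour of Y,
                    # W a uniform neighbour of Z.
                    p = 1.0 / (two_e * len(adj[y]) * len(adj[z]))
                    entropy -= p * math.log2(p)
    lower = 3 * math.log2(two_e) - 2 * math.log2(v)
    return entropy, lower, math.log2(hom)

# Hypothetical example: the path 0-1-2-3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
h, lower, log_hom = walk_entropy_bounds(path)
print(lower <= h <= log_hom)
```

On a star the distribution 𝜇 happens to be uniform on all walks, so the upper bound 𝐻(𝜇) ≤ log2 hom(𝐹, 𝐺) is tight there.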

Remark 10.3.4. See this MathOverflow discussion for the history as well as alternate
proofs.

The above proof easily generalizes to all trees. We omit the details.

Theorem 10.3.5
Sidorenko’s conjecture holds if 𝐹 is a tree.


Theorem 10.3.6
Sidorenko’s conjecture holds for all complete bipartite graphs.

Proof. Following the same framework as earlier, let us demonstrate the result for
𝐹 = 𝐾2,2 . The same proof extends to all 𝐾 𝑠,𝑡 .

[Figure: 𝐾2,2 with vertices 𝑥1, 𝑥2 on one side and 𝑦1, 𝑦2 on the other.]

We will pick a random tuple (𝑋1, 𝑋2, 𝑌1, 𝑌2) ∈ 𝑉(𝐺)^4 with 𝑋𝑖𝑌𝑗 ∈ 𝐸(𝐺) for all 𝑖, 𝑗 as
follows.
• 𝑋1𝑌1 is a uniform random edge;
• 𝑌2 is a uniform random neighbor of 𝑋1 ;
• 𝑋2 is a conditionally independent copy of 𝑋1 given (𝑌1 , 𝑌2 ).
The last point deserves more attention. Note that we are not simply uniformly randomly
choosing a common neighbor of 𝑌1 and 𝑌2 as one might naively attempt. Instead, one
can think of the first two steps as generating a distribution for (𝑋1 , 𝑌1 , 𝑌2 )—according to
this distribution, we first generate (𝑌1 , 𝑌2 ) according to its marginal, and then produce
two conditionally independent copies of 𝑋1 (the second copy is 𝑋2 ).
As in the previous proof (applied to a 2-edge path), we see that

𝐻 (𝑋1 , 𝑌1 , 𝑌2 ) = 2𝐻 (𝑋1 , 𝑌1 ) − 𝐻 (𝑋1 ) ≥ 2 log2 (2𝑒(𝐺)) − log2 𝑣(𝐺).

So we have

𝐻 (𝑋1 , 𝑋2 , 𝑌1 , 𝑌2 )
= 𝐻 (𝑌1 , 𝑌2 ) + 𝐻 (𝑋1 , 𝑋2 |𝑌1 , 𝑌2 ) [chain rule]
= 𝐻 (𝑌1 , 𝑌2 ) + 2𝐻 (𝑋1 |𝑌1 , 𝑌2 ) [conditional independence]
= 2𝐻 (𝑋1 , 𝑌1 , 𝑌2 ) − 𝐻 (𝑌1 , 𝑌2 ) [chain rule]
≥ 2(2 log2 (2𝑒(𝐺)) − log2 𝑣(𝐺)) − 2 log2 𝑣(𝐺) [prev. ineq. and uniform bound]
= 4 log2 (2𝑒(𝐺)) − 4 log2 𝑣(𝐺).

So we have verified (10.1) for 𝐾2,2 . □
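The 𝐾2,2 case can be checked numerically on small graphs. The sketch below counts hom(𝐾2,2, 𝐺) as the sum of squared codegrees and tests the inequality hom(𝐾2,2, 𝐺) 𝑣(𝐺)^4 ≥ (2𝑒(𝐺))^4, which is Sidorenko's inequality for the 4-cycle with denominators cleared. The function names and the adjacency-matrix input are assumptions of this illustration.

```python
def hom_c4(adj):
    """Count homomorphisms from K_{2,2} (the 4-cycle) to G.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    A homomorphism picks (x1, x2) and then two common neighbours y1, y2.
    """
    n = len(adj)
    total = 0
    for u in range(n):
        for w in range(n):
            codegree = sum(adj[u][k] * adj[k][w] for k in range(n))
            total += codegree ** 2
    return total

def sidorenko_c4_holds(adj):
    """Integer form of t(K_{2,2}, G) >= t(K_2, G)^4."""
    v = len(adj)
    two_e = sum(map(sum, adj))
    return hom_c4(adj) * v ** 4 >= two_e ** 4

# Hypothetical examples: a 2-edge path and the complete graph K4.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k4 = [[int(i != j) for j in range(4)] for i in range(4)]
print(hom_c4(path3), sidorenko_c4_holds(path3))
print(hom_c4(k4), sidorenko_c4_holds(k4))
```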


Theorem 10.3.7 (Conlon, Fox, Sudakov 2010)


Sidorenko’s conjecture holds for a bipartite graph that has a vertex adjacent to all
vertices in the other part.

Proof. Let us illustrate the proof for the following graph. The proof extends to the
general case.
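[Figure: bipartite graph with parts {𝑥0, 𝑥1, 𝑥2} and {𝑦1, 𝑦2, 𝑦3}, in which 𝑥0 is adjacent
to 𝑦1, 𝑦2, 𝑦3, the vertex 𝑥1 is adjacent to 𝑦1, 𝑦2, and 𝑥2 is adjacent to 𝑦2, 𝑦3.]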

Let us choose a random tuple (𝑋0, 𝑋1, 𝑋2, 𝑌1, 𝑌2, 𝑌3) ∈ 𝑉(𝐺)^6 as follows:


• 𝑋0𝑌1 is a uniform random edge;
• 𝑌2 and 𝑌3 are independent uniform random neighbors of 𝑋0 ;
• 𝑋1 is a conditionally independent copy of 𝑋0 given (𝑌1 , 𝑌2 );
• 𝑋2 is a conditionally independent copy of 𝑋0 given (𝑌2 , 𝑌3 ).
(as well as other symmetric versions.) Some important properties of this distribution:
• 𝑋0 , 𝑋1 , 𝑋2 are conditionally independent given (𝑌1 , 𝑌2 , 𝑌3 );
• 𝑋1 and (𝑋0 , 𝑌3 , 𝑋2 ) are conditionally independent given (𝑌1 , 𝑌2 );
• The distribution of (𝑋0 , 𝑌1 , 𝑌2 ) is identical to the distribution of (𝑋1 , 𝑌1 , 𝑌2 ).
We have

𝐻 (𝑋0 , 𝑋1 , 𝑋2 , 𝑌1 , 𝑌2 , 𝑌3 )
= 𝐻 (𝑋0 , 𝑋1 , 𝑋2 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑌1 , 𝑌2 , 𝑌3 ) [chain rule]
= 𝐻 (𝑋0 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋1 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋2 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑌1 , 𝑌2 , 𝑌3 ) [cond indep]
= 𝐻 (𝑋0 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋1 |𝑌1 , 𝑌2 ) + 𝐻 (𝑋2 |𝑌2 , 𝑌3 ) + 𝐻 (𝑌1 , 𝑌2 , 𝑌3 ) [cond indep]
= 𝐻 (𝑋0 , 𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋1 , 𝑌1 , 𝑌2 ) + 𝐻 (𝑋2 , 𝑌2 , 𝑌3 ) − 𝐻 (𝑌1 , 𝑌2 ) − 𝐻 (𝑌2 , 𝑌3 ). [chain rule]

The proof of Theorem 10.3.3 actually lower bounds the first three terms:

𝐻 (𝑋0 , 𝑌1 , 𝑌2 , 𝑌3 ) ≥ 3 log2 (2𝑒(𝐺)) − 2 log2 𝑣(𝐺)


𝐻 (𝑋1 , 𝑌1 , 𝑌2 ) ≥ 2 log2 (2𝑒(𝐺)) − log2 𝑣(𝐺)
𝐻 (𝑋2 , 𝑌2 , 𝑌3 ) ≥ 2 log2 (2𝑒(𝐺)) − log2 𝑣(𝐺).


We can apply the uniform support bound on the remaining terms.

𝐻 (𝑌1 , 𝑌2 ) = 𝐻 (𝑌2 , 𝑌3 ) ≤ 2 log2 𝑣(𝐺).

Putting everything together, we have

𝐻 (𝑋0 , 𝑋1 , 𝑋2 , 𝑌1 , 𝑌2 , 𝑌3 ) ≥ 7 log2 (2𝑒(𝐺)) − 8 log2 𝑣(𝐺),

thereby verifying (10.1). □

To check that you understand the above proof, where did we use the assumption that
𝐹 has a vertex complete to the other part?
Sidorenko’s conjecture can be proved for many other graphs 𝐹 by extending this method.

Remark 10.3.8 (Möbius graph). An important open case (and the smallest in some
sense) of Sidorenko’s conjecture is when 𝐹 is the following graph, known as the Möbius
graph. It is 𝐾5,5 with a 10-cycle removed. The name comes from it being the face-vertex
incidence graph of the simplicial complex structure of the Möbius strip, built
by gluing a strip of five triangles.

Möbius graph = 𝐾5,5 \ 𝐶10 [figure omitted]

10.4 Shearer’s lemma


Shearer’s entropy lemma extends the subadditivity property of entropy. Before stating
it in full generality, let us first see the simplest instance of Shearer’s lemma.

Theorem 10.4.1 (Shearer’s lemma, special case)

2𝐻 (𝑋, 𝑌 , 𝑍) ≤ 𝐻 (𝑋, 𝑌 ) + 𝐻 (𝑋, 𝑍) + 𝐻 (𝑌 , 𝑍)

Proof. Using the chain rule and conditioning dropping, we have

𝐻(𝑋, 𝑌) = 𝐻(𝑋) + 𝐻(𝑌|𝑋)
𝐻(𝑋, 𝑍) = 𝐻(𝑋) + 𝐻(𝑍|𝑋) ≥ 𝐻(𝑋) + 𝐻(𝑍|𝑋, 𝑌)
𝐻(𝑌, 𝑍) = 𝐻(𝑌) + 𝐻(𝑍|𝑌) ≥ 𝐻(𝑌|𝑋) + 𝐻(𝑍|𝑋, 𝑌)

Adding the three lines, we see that their sum is at least

2𝐻(𝑋) + 2𝐻(𝑌|𝑋) + 2𝐻(𝑍|𝑋, 𝑌) = 2𝐻(𝑋, 𝑌, 𝑍). □
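The special case of Shearer's lemma is easy to test numerically. The following sketch computes 𝐻(𝑋, 𝑌) + 𝐻(𝑋, 𝑍) + 𝐻(𝑌, 𝑍) − 2𝐻(𝑋, 𝑌, 𝑍) for a joint distribution given as a dictionary from triples to probabilities; the helper names are hypothetical, and the random distribution below is only an example.

```python
import math
import random
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (base 2) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, coords):
    """Marginal distribution of the given coordinates of a joint distribution."""
    out = defaultdict(float)
    for point, p in joint.items():
        out[tuple(point[i] for i in coords)] += p
    return out

def shearer_gap(joint):
    """H(X,Y) + H(X,Z) + H(Y,Z) - 2 H(X,Y,Z); Theorem 10.4.1 says this is >= 0."""
    pair_sum = sum(entropy(marginal(joint, c)) for c in [(0, 1), (0, 2), (1, 2)])
    return pair_sum - 2 * entropy(joint)

# A random joint distribution on {0,1,2}^3 (seeded for reproducibility).
rng = random.Random(0)
weights = {(x, y, z): rng.random() for x in range(3) for y in range(3) for z in range(3)}
total = sum(weights.values())
joint = {pt: w / total for pt, w in weights.items()}
print(shearer_gap(joint) >= -1e-9)
```

The gap is zero when 𝑋, 𝑌, 𝑍 are independent and uniform, and it equals 1 bit when 𝑋 = 𝑌 = 𝑍 is a single fair coin.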


Question 10.4.2
What is the maximum volume of a body in R3 that has area at most 1 when projected
to each of the three coordinate planes?

The cube [0, 1]^3 satisfies the above property and has volume 1. It turns out that this is the
maximum.
To prove this claim, first let us use Shearer’s inequality to prove a discrete version.

Theorem 10.4.3
Let 𝑆 ⊆ R^3 be a finite set, and let 𝜋𝑥𝑦(𝑆) be its projection onto the 𝑥𝑦-plane, etc. Then

|𝑆|^2 ≤ |𝜋𝑥𝑦(𝑆)| |𝜋𝑥𝑧(𝑆)| |𝜋𝑦𝑧(𝑆)|.

Proof. Let (𝑋, 𝑌 , 𝑍) be a uniform random point of 𝑆. Then

2 log2 |𝑆| = 2𝐻(𝑋, 𝑌, 𝑍) ≤ 𝐻(𝑋, 𝑌) + 𝐻(𝑋, 𝑍) + 𝐻(𝑌, 𝑍)
≤ log2 |𝜋𝑥𝑦(𝑆)| + log2 |𝜋𝑥𝑧(𝑆)| + log2 |𝜋𝑦𝑧(𝑆)|. □
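The discrete projection inequality can be checked directly on small point sets. This is a minimal sketch with hypothetical helper names; points are given as integer triples.

```python
def projection_sizes(points):
    """Return (|S|, |pi_xy(S)|, |pi_xz(S)|, |pi_yz(S)|) for a finite set of triples."""
    pts = set(points)
    xy = {(x, y) for x, y, z in pts}
    xz = {(x, z) for x, y, z in pts}
    yz = {(y, z) for x, y, z in pts}
    return len(pts), len(xy), len(xz), len(yz)

def satisfies_projection_bound(points):
    """Check |S|^2 <= |pi_xy(S)| |pi_xz(S)| |pi_yz(S)| (Theorem 10.4.3)."""
    s, a, b, c = projection_sizes(points)
    return s * s <= a * b * c

# The 2x2x2 grid attains equality; a "corner" of four points does not.
cube = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
corner = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(projection_sizes(cube), satisfies_projection_bound(cube))
print(projection_sizes(corner), satisfies_projection_bound(corner))
```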

By approximating a body using cubes, we can deduce the following corollary.

Corollary 10.4.4
Let 𝑆 be a body in R^3. Then

vol(𝑆)^2 ≤ area(𝜋𝑥𝑦(𝑆)) area(𝜋𝑥𝑧(𝑆)) area(𝜋𝑦𝑧(𝑆)).

Let us now state the general form of Shearer’s lemma. (Chung, Graham, Frankl, and
Shearer 1986)

Theorem 10.4.5 (Shearer’s lemma)


Let 𝐴1, . . . , 𝐴𝑠 ⊆ [𝑛] where each 𝑖 ∈ [𝑛] appears in at least 𝑘 of the sets 𝐴𝑗. Writing
𝑋𝐴 := (𝑋𝑖)𝑖∈𝐴, we have

𝑘 𝐻(𝑋1, . . . , 𝑋𝑛) ≤ ∑_{𝑗∈[𝑠]} 𝐻(𝑋_{𝐴𝑗}).

The proof of the general form of Shearer’s lemma is a straightforward adaptation of
the proof of the special case earlier.
Like earlier, we can deduce an inequality about sizes of projections. (Loomis and
Whitney 1949)


Corollary 10.4.6 (Loomis–Whitney inequality)


Writing 𝜋𝑖 for the projection from R^𝑛 onto the hyperplane 𝑥𝑖 = 0, we have for every
finite 𝑆 ⊆ R^𝑛,

|𝑆|^{𝑛−1} ≤ ∏_{𝑖=1}^{𝑛} |𝜋𝑖(𝑆)|.

Corollary 10.4.7
Let 𝐴1, . . . , 𝐴𝑠 ⊆ Ω where each 𝑖 ∈ Ω appears in at least 𝑘 sets 𝐴𝑗. Then for every
family F of subsets of Ω,

|F|^𝑘 ≤ ∏_{𝑗∈[𝑠]} |F|_{𝐴𝑗}|

where F|_𝐴 := {𝐹 ∩ 𝐴 : 𝐹 ∈ F}.

Proof. Identify Ω with [𝑛], so that each subset of Ω corresponds to a vector
(𝑋1, . . . , 𝑋𝑛) ∈ {0, 1}^𝑛. Let (𝑋1, . . . , 𝑋𝑛) be the random vector corresponding to a
uniform random element of F. Then

𝑘 log2 |F| = 𝑘 𝐻(𝑋1, . . . , 𝑋𝑛) ≤ ∑_{𝑗∈[𝑠]} 𝐻(𝑋_{𝐴𝑗}) ≤ ∑_{𝑗∈[𝑠]} log2 |F|_{𝐴𝑗}|. □

Triangle-intersecting families
We say that a set G of labeled graphs on the same vertex set is triangle-intersecting if
𝐺 ∩ 𝐺 ′ contains a triangle for every 𝐺, 𝐺 ′ ∈ G.

Question 10.4.8
What is the largest triangle-intersecting family of graphs on 𝑛 labeled vertices?

The set of all graphs that contain a fixed triangle is triangle-intersecting, and they form
a 1/8 fraction of all graphs.
An easy upper bound: the edge sets of the graphs in a triangle-intersecting family form
an intersecting family of subsets of 𝐸(𝐾𝑛), so such a family contains at most a 1/2
fraction of all graphs.
The next theorem improves this upper bound to < 1/4. It is also in this paper that
Shearer’s lemma was introduced.

Theorem 10.4.9 (Chung, Graham, Frankl, and Shearer 1986)


Every triangle-intersecting family of graphs on 𝑛 labeled vertices has size < 2^{\binom{𝑛}{2} − 2}.


Proof. Let G be a triangle-intersecting family of graphs on vertex set [𝑛] (viewed as
a collection of subsets of edges of 𝐾𝑛).
For 𝑆 ⊆ [𝑛] with |𝑆| = ⌊𝑛/2⌋, let 𝐴𝑆 = \binom{𝑆}{2} ∪ \binom{[𝑛]\𝑆}{2} (i.e., 𝐴𝑆 is the
union of the clique on 𝑆 and the clique on the complement of 𝑆). Let

𝑟 = |𝐴𝑆| = \binom{⌊𝑛/2⌋}{2} + \binom{⌈𝑛/2⌉}{2} ≤ (1/2) \binom{𝑛}{2}.

For every 𝑆, every triangle has an edge in 𝐴𝑆, and thus G restricted to 𝐴𝑆 must be an
intersecting family. Hence

|G|_{𝐴𝑆}| ≤ 2^{|𝐴𝑆|−1} = 2^{𝑟−1}.

Each edge of 𝐾𝑛 appears in at least

𝑘 = (𝑟 / \binom{𝑛}{2}) \binom{𝑛}{⌊𝑛/2⌋}

different 𝐴𝑆 with |𝑆| = ⌊𝑛/2⌋ (by symmetry and averaging). Applying Corollary 10.4.7,
we find that

|G|^𝑘 ≤ (2^{𝑟−1})^{\binom{𝑛}{⌊𝑛/2⌋}}.

Therefore

|G| ≤ 2^{\binom{𝑛}{2} − \binom{𝑛}{2}/𝑟} < 2^{\binom{𝑛}{2} − 2}. □
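The parameters 𝑟 and 𝑘 in the proof can be recomputed by direct enumeration for small 𝑛. The sketch below (function name hypothetical) builds every set 𝐴𝑆, counts how many of them contain each edge, and compares the result with the formula 𝑘 = 𝑟 \binom{𝑛}{⌊𝑛/2⌋} / \binom{𝑛}{2}.

```python
from itertools import combinations
from math import comb

def shearer_parameters(n):
    """Return (r, k, m): r = |A_S|, k = least number of sets A_S containing
    a given edge of K_n, and m = C(n, 2). Intended for n >= 2."""
    edges = list(combinations(range(n), 2))
    cover = dict.fromkeys(edges, 0)
    r = None
    for s in combinations(range(n), n // 2):
        inside = set(s)
        # A_S: edges with both endpoints in S or both endpoints outside S
        a_s = [(u, v) for u, v in edges if (u in inside) == (v in inside)]
        r = len(a_s)
        for e in a_s:
            cover[e] += 1
    return r, min(cover.values()), len(edges)

for n in range(4, 8):
    r, k, m = shearer_parameters(n)
    # k * C(n,2) should equal r * C(n, floor(n/2)); the bound is 2^(m - m/r).
    print(n, r, k, k * m == r * comb(n, n // 2), m - m / r)
```

In each of these cases 𝑟 is strictly less than half of \binom{𝑛}{2}, which is what makes the final inequality strict.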

Remark 10.4.10. A tight upper bound of 2^{\binom{𝑛}{2} − 3} (matching the construction of
taking all graphs containing a fixed triangle) was conjectured by Simonovits and Sós (1976)
and proved by Ellis, Filmus, and Friedgut (2012) using Fourier analytic methods.
Berger and Zhao (2023) gave a tight solution for 𝐾4-intersecting families. The general
conjecture for 𝐾𝑟-intersecting families is open.

The number of independent sets in a regular bipartite graph

Question 10.4.11
Fix 𝑑. Which 𝑑-regular graph on a given number of vertices has the greatest number of
independent sets? Alternatively, which graph 𝐺 maximizes 𝑖(𝐺)^{1/𝑣(𝐺)}?

(Note that the number of independent sets is multiplicative: 𝑖(𝐺1 ⊔ 𝐺2) = 𝑖(𝐺1) 𝑖(𝐺2).)
Alon and Kahn conjectured that for graphs on 𝑛 vertices, when 𝑛 is a multiple of 2𝑑,
a disjoint union of 𝐾 𝑑,𝑑 ’s maximizes the number of independent sets.


Alon (1991) proved an approximate version of this conjecture. Kahn (2001) proved it
assuming the graph is bipartite. Zhao (2010) proved it in general.

Theorem 10.4.12 (Kahn, Zhao)


Let 𝐺 be an 𝑛-vertex 𝑑-regular graph. Then

𝑖(𝐺) ≤ 𝑖(𝐾𝑑,𝑑)^{𝑛/(2𝑑)} = (2^{𝑑+1} − 1)^{𝑛/(2𝑑)}

where 𝑖(𝐺) is the number of independent sets of 𝐺.
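For small regular graphs the bound can be confirmed by brute force. The sketch below counts independent sets by enumerating all vertex subsets and checks the integer form 𝑖(𝐺)^{2𝑑} ≤ (2^{𝑑+1} − 1)^𝑛 of the theorem; the function names and the three example graphs are choices made for this illustration.

```python
def count_independent_sets(n, edges):
    """Brute-force i(G) for a graph on vertices 0..n-1 (the empty set counts)."""
    count = 0
    for mask in range(1 << n):
        if all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges):
            count += 1
    return count

def kahn_zhao_holds(n, d, edges):
    """Check i(G)^(2d) <= (2^(d+1) - 1)^n for a d-regular graph on n vertices."""
    return count_independent_sets(n, edges) ** (2 * d) <= (2 ** (d + 1) - 1) ** n

cycle6 = [(i, (i + 1) % 6) for i in range(6)]            # 2-regular, bipartite
k33 = [(a, b) for a in range(3) for b in range(3, 6)]    # 3-regular, the extremal case
petersen = ([(i, (i + 1) % 5) for i in range(5)]         # outer 5-cycle
            + [(i, i + 5) for i in range(5)]             # spokes
            + [(5 + i, 5 + (i + 2) % 5) for i in range(5)])  # inner pentagram

print(count_independent_sets(6, cycle6), kahn_zhao_holds(6, 2, cycle6))
print(count_independent_sets(6, k33), kahn_zhao_holds(6, 3, k33))
print(kahn_zhao_holds(10, 3, petersen))
```

The graph 𝐾3,3 attains equality, as expected from the statement.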

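Proof assuming 𝐺 is bipartite. (Kahn) Let us first illustrate the proof for the 6-cycle

[Figure: 𝐺 = 6-cycle with parts {𝑥1, 𝑥2, 𝑥3} and {𝑦1, 𝑦2, 𝑦3}, where 𝑦1 is adjacent to
𝑥1, 𝑥2; 𝑦2 is adjacent to 𝑥1, 𝑥3; and 𝑦3 is adjacent to 𝑥2, 𝑥3.]

Among all independent sets of 𝐺, choose one uniformly at random, and let
(𝑋1, 𝑋2, 𝑋3, 𝑌1, 𝑌2, 𝑌3) ∈ {0, 1}^6 be its indicator vector. Then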

2 log2 𝑖(𝐺) = 2𝐻 (𝑋1 , 𝑋2 , 𝑋3 , 𝑌1 , 𝑌2 , 𝑌3 )


= 2𝐻 (𝑋1 , 𝑋2 , 𝑋3 ) + 2𝐻 (𝑌1 , 𝑌2 , 𝑌3 |𝑋1 , 𝑋2 , 𝑋3 ) [chain rule]
≤ 𝐻 (𝑋1 , 𝑋2 ) + 𝐻 (𝑋1 , 𝑋3 ) + 𝐻 (𝑋2 , 𝑋3 )
+ 2𝐻 (𝑌1 |𝑋1 , 𝑋2 , 𝑋3 ) + 2𝐻 (𝑌2 |𝑋1 , 𝑋2 , 𝑋3 ) + 2𝐻 (𝑌3 |𝑋1 , 𝑋2 , 𝑋3 ) [Shearer]
= 𝐻 (𝑋1 , 𝑋2 ) + 𝐻 (𝑋1 , 𝑋3 ) + 𝐻 (𝑋2 , 𝑋3 )
+ 2𝐻 (𝑌1 |𝑋1 , 𝑋2 ) + 2𝐻 (𝑌2 |𝑋1 , 𝑋3 ) + 2𝐻 (𝑌3 |𝑋2 , 𝑋3 ) [cond indep]

Here we are using that (a) 𝑌1 , 𝑌2 , 𝑌3 are conditionally independent given (𝑋1 , 𝑋2 , 𝑋3 )
and (b) 𝑌1 and (𝑋3 , 𝑌2 , 𝑌3 ) are conditionally independent given (𝑋1 , 𝑋2 ). A more
general statement is that if 𝑆 ⊆ 𝑉 (𝐺), then the restrictions to the different connected
components of 𝐺 − 𝑆 are conditionally independent given 𝑋𝑆 .
It remains to prove that

𝐻 (𝑋1 , 𝑋2 ) + 2𝐻 (𝑌1 |𝑋1 , 𝑋2 ) ≤ log2 𝑖(𝐾2,2 )

and two other analogous inequalities. Let 𝑌1′ be a conditionally independent copy of 𝑌1
given (𝑋1, 𝑋2). Then (𝑋1, 𝑋2, 𝑌1, 𝑌1′) is the indicator vector of an independent set of
𝐾2,2 (though not necessarily chosen uniformly).

[Figure: 𝐾2,2 with vertices 𝑥1, 𝑥2 on one side and 𝑦1, 𝑦1′ on the other.]


Thus we have

𝐻 (𝑋1 , 𝑋2 ) + 2𝐻 (𝑌1 |𝑋1 , 𝑋2 ) = 𝐻 (𝑋1 , 𝑋2 ) + 𝐻 (𝑌1 |𝑋1 , 𝑋2 ) + 𝐻 (𝑌1′ |𝑋1 , 𝑋2 )


= 𝐻 (𝑋1 , 𝑋2 , 𝑌1 , 𝑌1′ ) [chain rule]
≤ log2 𝑖(𝐾2,2) [uniform bound]

This concludes the proof for the 6-cycle 𝐺 above; the same argument works for every
𝑑-regular bipartite 𝐺. Here are the details.
Let 𝑉 = 𝐴 ∪ 𝐵 be the vertex bipartition of 𝐺. Let 𝑋 = (𝑋𝑣 )𝑣∈𝑉 be the indicator function
of an independent set chosen uniformly at random. Write 𝑋𝑆 := (𝑋𝑣 )𝑣∈𝑆 . We have

𝑑 log2 𝑖(𝐺) = 𝑑𝐻(𝑋) = 𝑑𝐻(𝑋𝐴) + 𝑑𝐻(𝑋𝐵|𝑋𝐴) [chain rule]
≤ ∑_{𝑏∈𝐵} 𝐻(𝑋_{𝑁(𝑏)}) + 𝑑 ∑_{𝑏∈𝐵} 𝐻(𝑋𝑏|𝑋𝐴) [Shearer]
≤ ∑_{𝑏∈𝐵} 𝐻(𝑋_{𝑁(𝑏)}) + 𝑑 ∑_{𝑏∈𝐵} 𝐻(𝑋𝑏|𝑋_{𝑁(𝑏)}) [drop conditioning]

For each 𝑏 ∈ 𝐵, we have

𝐻(𝑋_{𝑁(𝑏)}) + 𝑑𝐻(𝑋𝑏|𝑋_{𝑁(𝑏)}) = 𝐻(𝑋_{𝑁(𝑏)}) + 𝐻(𝑋𝑏^{(1)}, . . . , 𝑋𝑏^{(𝑑)} | 𝑋_{𝑁(𝑏)})
= 𝐻(𝑋𝑏^{(1)}, . . . , 𝑋𝑏^{(𝑑)}, 𝑋_{𝑁(𝑏)})
≤ log2 𝑖(𝐾𝑑,𝑑)

where 𝑋𝑏^{(1)}, . . . , 𝑋𝑏^{(𝑑)} are conditionally independent copies of 𝑋𝑏 given 𝑋_{𝑁(𝑏)}.
Summing over all 𝑏 ∈ 𝐵 (there are 𝑛/2 such vertices) yields the result. □

Now we give the argument from Zhao (2010) that removes the bipartite hypothesis.
The following combinatorial argument reduces the problem for non-bipartite 𝐺 to that
of bipartite 𝐺.
Starting from a graph 𝐺, we construct its bipartite double cover 𝐺 × 𝐾2 (see
Figure 10.1), which has vertex set 𝑉(𝐺) × {0, 1}. The vertices of 𝐺 × 𝐾2 are labeled 𝑣_𝑖
for 𝑣 ∈ 𝑉(𝐺) and 𝑖 ∈ {0, 1}. Its edges are 𝑢_0 𝑣_1 for all 𝑢𝑣 ∈ 𝐸(𝐺). Note that 𝐺 × 𝐾2
is always a bipartite graph.

Lemma 10.4.13
Let 𝐺 be any graph (not necessarily regular). Then

𝑖(𝐺)^2 ≤ 𝑖(𝐺 × 𝐾2).

Once we have the lemma, Theorem 10.4.12 then reduces to the bipartite case, which
we already proved. Indeed, for a 𝑑-regular 𝐺, since 𝐺 × 𝐾2 is a 𝑑-regular bipartite
graph on 2𝑛 vertices, the bipartite case of the theorem gives

𝑖(𝐺)^2 ≤ 𝑖(𝐺 × 𝐾2) ≤ 𝑖(𝐾𝑑,𝑑)^{𝑛/𝑑}.

[Figure 10.1 panels: 2𝐺, 𝐺 × 𝐾2, 𝐺 × 𝐾2]

Figure 10.1: The bipartite swapping trick in the proof of Lemma 10.4.13: swapping
the circled pairs of vertices (denoted 𝐴 in the proof) fixes the
bad edges (red and bolded), transforming an independent set of 2𝐺
into an independent set of 𝐺 × 𝐾2.

Proof of Lemma 10.4.13. Let 2𝐺 denote a disjoint union of two copies of 𝐺. Label
its vertices by 𝑣_𝑖 with 𝑣 ∈ 𝑉(𝐺) and 𝑖 ∈ {0, 1} so that its edges are 𝑢_𝑖 𝑣_𝑖 with
𝑢𝑣 ∈ 𝐸(𝐺) and 𝑖 ∈ {0, 1}. We will give an injection 𝜙 : 𝐼(2𝐺) → 𝐼(𝐺 × 𝐾2). Recall that
𝐼(𝐺) is the set of independent sets of 𝐺. The injection would imply
𝑖(𝐺)^2 = 𝑖(2𝐺) ≤ 𝑖(𝐺 × 𝐾2) as desired.
Fix an arbitrary order on all subsets of 𝑉(𝐺). Let 𝑆 be an independent set of 2𝐺. Let

𝐸_bad(𝑆) := {𝑢𝑣 ∈ 𝐸(𝐺) : 𝑢_0, 𝑣_1 ∈ 𝑆}.

Note that 𝐸_bad(𝑆) is a bipartite subgraph of 𝐺, since each edge of 𝐸_bad(𝑆) has exactly
one endpoint in {𝑣 ∈ 𝑉(𝐺) : 𝑣_0 ∈ 𝑆} (if both endpoints were in this set, then 𝑆 would
not be independent in 2𝐺).
Let 𝐴 denote the first subset (in the previously fixed ordering) of 𝑉(𝐺) such that all
edges in 𝐸_bad(𝑆) have one vertex in 𝐴 and the other outside 𝐴. Define 𝜙(𝑆) to be the
subset of 𝑉(𝐺) × {0, 1} obtained by “swapping” the pairs in 𝐴, i.e., for all 𝑣 ∈ 𝐴,
𝑣_𝑖 ∈ 𝜙(𝑆) if and only if 𝑣_{1−𝑖} ∈ 𝑆 for each 𝑖 ∈ {0, 1}, and for all 𝑣 ∉ 𝐴, 𝑣_𝑖 ∈ 𝜙(𝑆) if
and only if 𝑣_𝑖 ∈ 𝑆 for each 𝑖 ∈ {0, 1}. It is not hard to verify that 𝜙(𝑆) is an
independent set in 𝐺 × 𝐾2. The swapping procedure fixes the “bad” edges.
It remains to verify that 𝜙 is an injection. For every 𝑆 ∈ 𝐼(2𝐺), once we know
𝑇 = 𝜙(𝑆), we can recover 𝑆 by first setting

𝐸′_bad(𝑇) = {𝑢𝑣 ∈ 𝐸(𝐺) : 𝑢_𝑖, 𝑣_𝑖 ∈ 𝑇 for some 𝑖 ∈ {0, 1}},

so that 𝐸_bad(𝑆) = 𝐸′_bad(𝑇), and then finding 𝐴 as earlier and swapping the pairs of 𝐴
back. (Remark: it follows that 𝑇 ∈ 𝐼(𝐺 × 𝐾2) lies in the image of 𝜙 if and only if
𝐸′_bad(𝑇) is bipartite.) □
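The inequality of Lemma 10.4.13 can be confirmed by brute force on small graphs. The sketch below builds the bipartite double cover with the (hypothetical) convention that 𝑣_0 is vertex 𝑣 and 𝑣_1 is vertex 𝑣 + 𝑛, then compares 𝑖(𝐺)^2 with 𝑖(𝐺 × 𝐾2). It checks the counts only and does not implement the injection 𝜙.

```python
def count_independent_sets(n, edges):
    """Brute-force i(G) for a graph on vertices 0..n-1 (the empty set counts)."""
    count = 0
    for mask in range(1 << n):
        if all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges):
            count += 1
    return count

def double_cover(n, edges):
    """Edges of G x K2 on vertices 0..2n-1: v_0 is v and v_1 is v + n."""
    cover = []
    for u, v in edges:
        cover.append((u, v + n))  # the edge u_0 v_1
        cover.append((v, u + n))  # the edge v_0 u_1
    return cover

def lemma_holds(n, edges):
    """Check i(G)^2 <= i(G x K2)."""
    return count_independent_sets(n, edges) ** 2 <= count_independent_sets(
        2 * n, double_cover(n, edges))

triangle = [(0, 1), (0, 2), (1, 2)]
cycle5 = [(i, (i + 1) % 5) for i in range(5)]
print(lemma_holds(3, triangle), lemma_holds(5, cycle5))
```

For the triangle the double cover is a 6-cycle, and for a bipartite 𝐺 the double cover is just 2𝐺, so equality holds in that case.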


The entropy proof of the bipartite case of Theorem 10.4.12 extends to graph
homomorphisms, yielding the following result.

Theorem 10.4.14 (Galvin and Tetali 2004)


Let 𝐺 be an 𝑛-vertex 𝑑-regular bipartite graph. Let 𝐻 be any graph allowing loops.
Then
hom(𝐺, 𝐻) ≤ hom(𝐾𝑑,𝑑, 𝐻)^{𝑛/(2𝑑)}.

Some important special cases:


• hom(𝐺, 𝐻) = 𝑖(𝐺), the number of independent sets of 𝐺, when 𝐻 is a single edge with
a loop added at one endpoint;
• hom(𝐺, 𝐾𝑞 ) = the number of proper 𝑞-colorings of 𝐺.
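The special cases above can be checked with a brute-force homomorphism counter. In this sketch the target graph 𝐻 is given as a set of ordered pairs (a hypothetical encoding chosen for the illustration): both (𝑎, 𝑏) and (𝑏, 𝑎) for an edge, and (𝑎, 𝑎) for a loop.

```python
from itertools import product

def hom_count(n, edges_g, m, adj_h):
    """Number of maps f: {0..n-1} -> {0..m-1} sending every edge of G to an
    edge or loop of H, where adj_h is a symmetric set of ordered pairs."""
    return sum(
        all((f[u], f[v]) in adj_h for u, v in edges_g)
        for f in product(range(m), repeat=n)
    )

cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Edge 0-1 with a loop at 0: vertices mapped to 1 form an independent set.
h_ind = {(0, 1), (1, 0), (0, 0)}
# K_3: homomorphisms are proper 3-colorings.
k3 = {(a, b) for a in range(3) for b in range(3) if a != b}

print(hom_count(4, cycle4, 2, h_ind))  # number of independent sets of the 4-cycle
print(hom_count(4, cycle4, 3, k3))     # number of proper 3-colorings of the 4-cycle
```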
The bipartite hypothesis in Theorem 10.4.14 cannot always be removed. For
example, if 𝐻 consists of two isolated vertices, each with a loop, then log2 hom(𝐺, 𝐻)
is the number of connected components of 𝐺, so that the maximizers of
log2 hom(𝐺, 𝐻)/𝑣(𝐺) are disjoint unions of 𝐾𝑑+1’s.
For 𝐻 = 𝐾𝑞, corresponding to the proper 𝑞-colorings, the bipartite hypothesis was
recently removed.

Theorem 10.4.15 (Sah, Sawhney, Stoner, and Zhao 2020)


Let 𝐺 be an 𝑛-vertex 𝑑-regular graph. Then

𝑐𝑞(𝐺) ≤ 𝑐𝑞(𝐾𝑑,𝑑)^{𝑛/(2𝑑)}

where 𝑐𝑞(𝐺) is the number of proper 𝑞-colorings of 𝐺.

It was also shown in the same paper that in Theorem 10.4.14, the bipartite
hypothesis on 𝐺 can be weakened to triangle-free. Moreover, triangle-free is the
weakest possible hypothesis on 𝐺 so that the claim is true for all 𝐻.
For more discussion and open problems on this topic, see the survey by Zhao (2017).

Exercises
The problems in this section should be solved using entropy arguments or results
derived from entropy arguments.
1. Submodularity. Prove that 𝐻 (𝑋, 𝑌 , 𝑍) + 𝐻 (𝑋) ≤ 𝐻 (𝑋, 𝑌 ) + 𝐻 (𝑋, 𝑍).
2. Let F be a collection of subsets of [𝑛]. Let 𝑝𝑖 denote the fraction of sets in F that
contain 𝑖. Prove that

|F| ≤ ∏_{𝑖=1}^{𝑛} 𝑝𝑖^{−𝑝𝑖} (1 − 𝑝𝑖)^{−(1−𝑝𝑖)}.

3. ★ Uniquely decodable codes. Let [𝑟]^∗ denote the set of all finite strings of
elements in [𝑟]. Let 𝐴 be a finite subset of [𝑟]^∗ and suppose no two distinct
concatenations of sequences in 𝐴 can produce the same string. Let |𝑎| denote
the length of 𝑎 ∈ 𝐴. Prove that

∑_{𝑎∈𝐴} 𝑟^{−|𝑎|} ≤ 1.

4. Sudoku. An 𝑛^2 × 𝑛^2 Sudoku square (the usual Sudoku corresponds to 𝑛 = 3) is
an 𝑛^2 × 𝑛^2 array with entries from [𝑛^2] so that each row, each column, and, after
partitioning the square into 𝑛 × 𝑛 blocks, each of these 𝑛^2 blocks consists of a
permutation of [𝑛^2]. Prove that the number of 𝑛^2 × 𝑛^2 Sudoku squares is at most

(𝑛^2 / (𝑒^3 + 𝑜(1)))^{𝑛^4}.

5. Prove Sidorenko’s conjecture for the following graph.

6. ★ Triangles versus vees in a directed graph. Let 𝑉 be a finite set, 𝐸 ⊆ 𝑉 × 𝑉, and

△ = |{(𝑥, 𝑦, 𝑧) ∈ 𝑉^3 : (𝑥, 𝑦), (𝑦, 𝑧), (𝑧, 𝑥) ∈ 𝐸}|

(i.e., cyclic triangles; note the direction of edges) and

∧ = |{(𝑥, 𝑦, 𝑧) ∈ 𝑉^3 : (𝑥, 𝑦), (𝑥, 𝑧) ∈ 𝐸}|.

Prove that △ ≤ ∧.
7. ★ Box theorem. Prove that for every compact set 𝐴 ⊆ R^𝑑, there exists an
axis-aligned box 𝐵 ⊆ R^𝑑 with

vol 𝐴 = vol 𝐵 and vol 𝜋𝐼(𝐴) ≥ vol 𝜋𝐼(𝐵) for all 𝐼 ⊆ [𝑑].

Here 𝜋𝐼 denotes the orthogonal projection onto the 𝐼-coordinate subspace.
(For the purpose of the homework, you only need to establish the case when 𝐴 is a union of grid
cubes. It is optional to give the limiting argument for compact 𝐴.)

8. Let G be a family of graphs on vertices labeled by [2𝑛] such that the intersection
of every pair of graphs in G contains a perfect matching. Prove that
|G| ≤ 2^{\binom{2𝑛}{2} − 𝑛}.
9. Loomis–Whitney for sumsets. Let 𝐴, 𝐵, 𝐶 be finite subsets of some abelian
group. Writing 𝐴 + 𝐵 := {𝑎 + 𝑏 : 𝑎 ∈ 𝐴, 𝑏 ∈ 𝐵}, etc., prove that

|𝐴 + 𝐵 + 𝐶|^2 ≤ |𝐴 + 𝐵| |𝐴 + 𝐶| |𝐵 + 𝐶|.

10. ★ Shearer for sums. Let 𝑋, 𝑌 , 𝑍 be independent random integers. Prove that

2𝐻 (𝑋 + 𝑌 + 𝑍) ≤ 𝐻 (𝑋 + 𝑌 ) + 𝐻 (𝑋 + 𝑍) + 𝐻 (𝑌 + 𝑍).


11 Containers

Many problems in combinatorics can be phrased in terms of independent sets in
hypergraphs.
For example, here is a model question:

Question 11.0.1
How many triangle-free graphs are there on 𝑛 vertices?

By taking all subgraphs of 𝐾𝑛/2,𝑛/2, we obtain 2^{𝑛^2/4} such graphs. It turns out this gives
the correct exponential asymptotic.

Theorem 11.0.2 (Erdős, Kleitman, and Rothschild 1973)


The number of triangle-free graphs on 𝑛 vertices is 2^{𝑛^2/4+𝑜(𝑛^2)}.

Remark 11.0.3. It does not matter here whether we consider vertices to be labeled: this
affects the answer by a factor of at most 𝑛! = 𝑒^{𝑂(𝑛 log 𝑛)}.

Remark 11.0.4. Actually the original Erdős–Kleitman–Rothschild paper showed an
even stronger result: a 1 − 𝑜(1) fraction of all 𝑛-vertex triangle-free graphs are bipartite.
The above asymptotic can then be easily deduced by counting subgraphs of complete
bipartite graphs. The container methods discussed in this section are not strong enough
to prove this finer claim.

We can convert this asymptotic enumeration problem into a problem about independent
sets in a 3-uniform hypergraph 𝐻:
• 𝑉(𝐻) = \binom{[𝑛]}{2}, the set of edges of 𝐾𝑛
• The edges of 𝐻 are triples of the form {𝑥𝑦, 𝑥𝑧, 𝑦𝑧}, i.e., triangles
We then have the correspondence:
• A subset of 𝑉(𝐻) = a graph on vertex set [𝑛]
• An independent set of 𝐻 = a triangle-free graph
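The correspondence can be made concrete for very small 𝑛. The sketch below builds the triangle hypergraph and counts its independent sets by brute force, which is the same as counting labeled triangle-free graphs on 𝑛 vertices. Function names are hypothetical, and the enumeration over all 2^{\binom{𝑛}{2}} subsets is only feasible for tiny 𝑛.

```python
from itertools import combinations

def triangle_hypergraph(n):
    """Vertices are the edges of K_n; hyperedges are the edge sets of triangles."""
    vertices = list(combinations(range(n), 2))
    hyperedges = [
        frozenset({(x, y), (x, z), (y, z)})
        for x, y, z in combinations(range(n), 3)
    ]
    return vertices, hyperedges

def count_independent_sets(vertices, hyperedges):
    """Count vertex subsets containing no hyperedge (brute force)."""
    count = 0
    for mask in range(1 << len(vertices)):
        chosen = {v for i, v in enumerate(vertices) if (mask >> i) & 1}
        if not any(e <= chosen for e in hyperedges):
            count += 1
    return count

for n in (3, 4):
    print(n, count_independent_sets(*triangle_hypergraph(n)))
```

For 𝑛 = 3 the only graph excluded is the triangle itself, giving 7 of the 8 graphs.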


(Here an independent set in a hypergraph is a subset of vertices containing no edges.)


Naively applying first moment/union bound does not work—there are too many events
to union bound over.
For example, Mantel’s theorem tells us that the maximum number of edges in an 𝑛-vertex
triangle-free graph is ⌊𝑛^2/4⌋, attained by 𝐾⌊𝑛/2⌋,⌈𝑛/2⌉. With a fixed triangle-free graph
𝐺, the number of subgraphs of 𝐺 is 2^{𝑒(𝐺)}, and each of them is triangle-free. Perhaps
we could union bound over all maximal triangle-free graphs? It turns out that there are
2^{𝑛^2/8+𝑜(𝑛^2)} such maximal triangle-free graphs, so a union bound would be too wasteful.
In many applications, independent sets are clustered into relatively few highly corre-
lated sets. In the case of triangle-free graphs, each maximal triangle-free graph is very
“close” to many other maximal triangle-free graphs.
Is there a more efficient union bound that takes account of the clustering of independent
sets?
The container method does exactly that. Given some hypergraph with controlled
degrees, one can find a collection of containers satisfying the following properties:
• Each container is a subset of vertices of the hypergraph.
• Every independent set of the hypergraph is a subset of some container.
• The total number of containers in the collection is relatively small.
• Each container is not too large (in fact, not too much larger than the maximum
size of an independent set)
We can then union bound over all such containers. If the number of containers is not
too large, then the union bound is not too lossy.
Here are some of the most typical and important applications of the container method:
• Asymptotic enumerations:
– Counting 𝐻-free graphs on 𝑛 vertices
– Counting 𝐻-free graphs on 𝑛 vertices and 𝑚 edges
– Counting 𝑘-AP-free subsets of [𝑛] of size 𝑚
• Extremal and Ramsey results in random structures:
– The maximum number of edges in an 𝐻-free subgraph of 𝐺 (𝑛, 𝑝)
– Szemerédi’s theorem in a 𝑝-random subset of [𝑛]
• List coloring in graphs/hypergraphs


The method of hypergraph containers is one of the most exciting developments in this
past decade. Some references and further reading:
• The graph container method was developed by Kleitman and Winston (1982) (for
counting 𝐶4 -free graphs) and Sapozhenko (2001) (for bounding the number of
independent sets in a regular graph, giving an earlier version of Theorem 10.4.12)
• The hypergraph container theorem was proved independently by Balogh, Morris,
and Samotij (2015), and Saxton and Thomason (2015).
• See the 2018 ICM survey of Balogh, Morris, and Samotij for an introduction to
the topic along with many applications
• See Samotij’s survey article (2015) for an introduction to the graph container
method
• See Morris’ lecture notes (2016) for a gentle introduction to the proof and
applications of hypergraph containers.

11.1 Containers for triangle-free graphs


The number of triangle-free graphs
We will prove Theorem 11.0.2 that the number of triangle-free graphs on 𝑛 vertices is
2^{𝑛^2/4+𝑜(𝑛^2)}.

Theorem 11.1.1 (Containers for triangle-free graphs)


For every 𝜀 > 0, there exists 𝐶 > 0 such that the following holds.
For every 𝑛, there is a collection C of graphs on 𝑛 vertices, with

|C| ≤ 𝑛^{𝐶𝑛^{3/2}}

such that
(a) every 𝐺 ∈ C has at most (1/4 + 𝜀)𝑛^2 edges, and
(b) every triangle-free graph is contained in some 𝐺 ∈ C.

Proof of upper bound of Theorem 11.0.2. We want to show that the number of 𝑛-vertex
triangle-free graphs is at most 2^{𝑛^2/4+𝑜(𝑛^2)}. Let 𝜀 > 0 be any real number
(arbitrarily small). Let C be produced by Theorem 11.1.1.
Then every 𝐺 ∈ C has at most (1/4 + 𝜀)𝑛^2 edges, and every triangle-free graph is
contained in some 𝐺 ∈ C. Hence the number of triangle-free graphs is at most

|C| 2^{(1/4+𝜀)𝑛^2} ≤ 2^{(1/4+𝜀)𝑛^2 + 𝑂𝜀(𝑛^{3/2} log 𝑛)}.

Since 𝜀 > 0 can be made arbitrarily small, the number of triangle-free graphs on 𝑛
vertices is at most 2^{(1/4+𝑜(1))𝑛^2}. □

The same proof technique, with an appropriate container theorem, can be used to count
𝐻-free graphs.
We write ex(𝑛, 𝐻) for the maximum number of edges in an 𝑛-vertex graph without 𝐻
as a subgraph.

Theorem 11.1.2 (Erdős–Stone–Simonovits)


Fix a non-bipartite graph 𝐻. Then

ex(𝑛, 𝐻) = (1 − 1/(𝜒(𝐻) − 1) + 𝑜(1)) \binom{𝑛}{2}.

Note that for bipartite graphs 𝐻, the above formula just says ex(𝑛, 𝐻) = 𝑜(𝑛^2), though
more precise estimates are available. However, we do not know the asymptotics of
ex(𝑛, 𝐻) for all bipartite 𝐻; e.g., it is still open for 𝐻 = 𝐾4,4 and 𝐻 = 𝐶8.

Theorem 11.1.3
Fix a non-bipartite graph 𝐻. Then the number of 𝐻-free graphs on 𝑛 vertices is
2^{(1+𝑜(1)) ex(𝑛,𝐻)}.

The analogous statement for bipartite graphs is false. The following conjecture remains
of great interest, and it is known for certain graphs, e.g., 𝐻 = 𝐶4 .

Conjecture 11.1.4
Fix a bipartite graph 𝐻 with a cycle. The number of 𝐻-free graphs on 𝑛 vertices is
2^{𝑂(ex(𝑛,𝐻))}.

Mantel’s theorem in random graphs

Theorem 11.1.5

If 𝑝 ≫ 1/√𝑛, then with probability 1 − 𝑜(1), every triangle-free subgraph of 𝐺(𝑛, 𝑝)
has at most (1/4 + 𝑜(1)) 𝑝𝑛^2 edges.


Remark 11.1.6. In fact, a much stronger result is true: the triangle-free subgraph
of 𝐺 (𝑛, 𝑝) with the maximum number of edges is whp bipartite (DeMarco and Kahn
2015).

Remark 11.1.7. The statement is false for 𝑝 ≪ 1/√𝑛. Indeed, in this case the
expected number of triangles is 𝑂(𝑛^3 𝑝^3), whereas there are whp about 𝑛^2 𝑝/2 edges, and
𝑛^3 𝑝^3 ≪ 𝑛^2 𝑝, so we can remove 𝑜(𝑛^2 𝑝) edges to make the graph triangle-free.

Proof. We prove a slightly weaker result, namely that the result is true if 𝑝 ≫
𝑛^{−1/2} log 𝑛. The version with 𝑝 ≫ 𝑛^{−1/2} can be proved via a stronger formulation
of the container lemma (using “fingerprints” as discussed later).
Let 𝜀 > 0 be arbitrarily small. Let C be a set of containers for 𝑛-vertex triangle-free
graphs in Theorem 11.1.1. For every 𝐺 ∈ C, 𝑒(𝐺) ≤ (1/4 + 𝜀)𝑛^2, so by an application
of the Chernoff bound,

P(𝑒(𝐺 ∩ 𝐺(𝑛, 𝑝)) > (1/4 + 2𝜀) 𝑛^2 𝑝) ≤ 𝑒^{−Ω𝜀(𝑛^2 𝑝)}.

Since every triangle-free graph is contained in some 𝐺 ∈ C, by taking a union bound
over C, we see that

P(𝐺(𝑛, 𝑝) has a triangle-free subgraph with > (1/4 + 2𝜀) 𝑛^2 𝑝 edges)
≤ ∑_{𝐺∈C} P(𝑒(𝐺 ∩ 𝐺(𝑛, 𝑝)) > (1/4 + 2𝜀) 𝑛^2 𝑝)
≤ |C| 𝑒^{−Ω𝜀(𝑛^2 𝑝)}
≤ 𝑒^{𝑂𝜀(𝑛^{3/2} log 𝑛) − Ω𝜀(𝑛^2 𝑝)}
= 𝑜(1)

provided that 𝑝 ≫ 𝑛^{−1/2} log 𝑛. □


11.2 Graph containers

Theorem 11.2.1 (Container theorem for independent sets in graphs)


For every 𝑐 > 0, there exists 𝛿 > 0 such that the following holds.
Let 𝐺 = (𝑉, 𝐸) be a graph with average degree 𝑑 and maximum degree at most 𝑐𝑑.
There exists a collection C of subsets of 𝑉, with

|C| ≤ \binom{|𝑉|}{≤ 2𝛿|𝑉|/𝑑}

such that
1. Every independent set 𝐼 of 𝐺 is contained in some 𝐶 ∈ C.
2. |𝐶| ≤ (1 − 𝛿)|𝑉| for every 𝐶 ∈ C.

Each 𝐶 ∈ C is called a “container.” Every independent set of 𝐺 is contained in some
container.

Remark 11.2.2. The requirement |𝐶| ≤ (1 − 𝛿)|𝑉| looks quite a bit weaker than
in Theorem 11.1.1, where each container is only slightly larger than the maximum
independent set. In a typical application of the container method, one iteratively applies
the (hyper)graph container theorem (e.g., Theorem 11.2.1 and later Theorem 11.3.1)
to the subgraphs induced by the slightly smaller containers in the previous iteration.
One iterates until the containers are close to their minimum possible size.
For this iterative application of the container theorem to work, one usually needs a
supersaturation result, which, roughly speaking, says that every subset of vertices that is
slightly larger than the independence number necessarily induces a lot of edges. This
property is common to all standard applications of the container method.

The container theorem is proved using the following algorithm.

The graph container algorithm (for a fixed given graph 𝐺)
Input: a maximal independent set 𝐼 ⊆ 𝑉.
Output: a “fingerprint” 𝑆 ⊆ 𝐼 of size ≤ 2𝛿 |𝑉 | /𝑑, and a container 𝐶 ⊇ 𝐼 which depends
only on 𝑆.
Throughout the algorithm, we will maintain a partition 𝑉 = 𝐴 ∪ 𝑆 ∪ 𝑋, where
• 𝐴, the “available” vertices, initially 𝐴 = 𝑉
• 𝑆, the current fingerprint, initially 𝑆 = ∅
• 𝑋, the “excluded” vertices, initially 𝑋 = ∅.


The max-degree order of 𝐺[𝐴] is an ordering of 𝐴 by the degree of the vertices
in 𝐺[𝐴], with the largest first, and breaking ties according to some arbitrary
predetermined ordering of 𝑉.
While |𝑋 | < 𝛿 |𝑉 |:
1. Let 𝑣 be the first vertex of 𝐼 ∩ 𝐴 in the max-degree order on 𝐺 [ 𝐴].
2. Add 𝑣 to 𝑆.
3. Add the neighbors of 𝑣 to 𝑋.
4. Add vertices preceding 𝑣 in the max-degree order on 𝐺 [ 𝐴] to 𝑋.
5. Remove from 𝐴 all the new vertices added to 𝑆 ∪ 𝑋.
Claim: when the algorithm terminates, we obtain a partition 𝑉 = 𝐴 ∪ 𝑆 ∪ 𝑋 such that
|𝑋| ≥ 𝛿|𝑉| and |𝑆| ≤ 2𝛿|𝑉|/𝑑.
Proof idea: due to the degree hypotheses, in every iteration, at least 𝑑/2 new vertices
are added to 𝑋 (provided that 𝑑 ≤ 2𝛿|𝑉|). See Morris’ lecture notes for details.
Key facts:
• Two different maximal independent sets 𝐼, 𝐼′ ⊆ 𝑉 that produce the same fingerprint
𝑆 in the algorithm necessarily produce the same partition 𝑉 = 𝐴 ∪ 𝑆 ∪ 𝑋
• The final set 𝑆 ∪ 𝐴 contains 𝐼 (since only vertices not in 𝐼 are ever moved to 𝑋)
Therefore, the total number of possibilities for containers 𝑆 ∪ 𝐴 is at most the number of
possible fingerprints 𝑆 ⊆ 𝑉. Since |𝑆| ≤ 2𝛿|𝑉|/𝑑 and |𝐴 ∪ 𝑆| ≤ (1 − 𝛿)|𝑉|, this concludes
the proof of the graph container lemma.
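The algorithm above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation from any reference: the graph is a dictionary from integer vertices to neighbor sets, ties in the max-degree order are broken by vertex label, and the loop also stops early if no vertex of the guiding set remains available (a convention added here so that the function accepts any independent set). Running the same procedure guided by the fingerprint 𝑆 instead of 𝐼 reproduces the partition, which illustrates that the container depends only on 𝑆.

```python
def max_degree_order(adj, available):
    """Vertices of `available` sorted by degree in G[available], largest first;
    ties are broken by vertex label (an arbitrary fixed order)."""
    return sorted(available, key=lambda v: (-len(adj[v] & available), v))

def run_container_algorithm(adj, guide, delta):
    """Run the loop, picking at each step the first vertex of guide ∩ A.

    `guide` is the independent set I, or a fingerprint S when reconstructing.
    Returns the final (S, A, X).
    """
    vertices = set(adj)
    available, fingerprint, excluded = set(vertices), set(), set()
    while len(excluded) < delta * len(vertices):
        order = max_degree_order(adj, available)
        candidates = [v for v in order if v in guide]
        if not candidates:  # sketch-only convention: nothing left to pick
            break
        v = candidates[0]
        preceding = set(order[:order.index(v)])
        newly_excluded = (adj[v] & available) | preceding
        fingerprint.add(v)
        excluded |= newly_excluded
        available -= newly_excluded | {v}
    return fingerprint, available, excluded

def container_of(adj, independent_set, delta):
    """Return (fingerprint S, container S ∪ A) for the given independent set."""
    s, a, _ = run_container_algorithm(adj, independent_set, delta)
    return s, s | a

# Hypothetical example: the 8-cycle with I = {0, 2, 4, 6} and delta = 0.3.
cycle8 = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
print(container_of(cycle8, {0, 2, 4, 6}, 0.3))
```

The value of 𝛿 here is an input chosen by hand; the theorem instead asserts that a suitable 𝛿 exists for each 𝑐.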
The fingerprint obtained by the proof actually gives us a stronger consequence that
will be important for some applications.


Theorem 11.2.3 (Graph container theorem, with fingerprints)


For every 𝑐 > 0, there exists 𝛿 > 0 such that the following holds.
Let 𝐺 = (𝑉, 𝐸) be a graph with average degree 𝑑 and maximum degree at most 𝑐𝑑.
Writing I for the collection of independent sets of 𝐺, there exist functions

𝑆 : I → 2^𝑉 and 𝐴 : 2^𝑉 → 2^𝑉

(one only needs to define 𝐴(·) on sets in the image of 𝑆)


such that, for every 𝐼 ∈ I,
• 𝑆(𝐼) ⊆ 𝐼 ⊆ 𝑆(𝐼) ∪ 𝐴(𝑆(𝐼))
• |𝑆(𝐼)| ≤ 2𝛿 |𝑉 | /𝑑
• |𝑆(𝐼) ∪ 𝐴(𝑆(𝐼))| ≤ (1 − 𝛿) |𝑉 |

11.3 Hypergraph container theorem


An independent set in a hypergraph is a subset of vertices containing no edges.
Given an 𝑟-uniform hypergraph 𝐻 and 1 ≤ ℓ < 𝑟, we write

Δℓ(𝐻) = max_{𝐴⊆𝑉(𝐻) : |𝐴|=ℓ} (the number of edges containing 𝐴).

Theorem 11.3.1 (Container theorem for 3-uniform hypergraphs)

For every 𝑐 > 0 there exists 𝛿 > 0 such that the following holds.
Let 𝐻 be a 3-uniform hypergraph with average degree 𝑑 ≥ 𝛿^{−1} and

Δ1(𝐻) ≤ 𝑐𝑑 and Δ2(𝐻) ≤ 𝑐√𝑑.

Then there exists a collection C of subsets of 𝑉(𝐻) with

|C| ≤ \binom{𝑣(𝐻)}{≤ 𝑣(𝐻)/√𝑑}

such that
• Every independent set of 𝐻 is contained in some 𝐶 ∈ C, and
• |𝐶 | ≤ (1 − 𝛿)𝑣(𝐻) for every 𝐶 ∈ C.


Like the graph container theorem, the hypergraph container theorem is proved by
designing an algorithm to produce, from an independent set 𝐼 ⊆ 𝑉 (𝐻), a fingerprint
𝑆 ⊆ 𝐼 and a container 𝐶 ⊃ 𝐼.
The hypergraph container algorithm is more involved compared to the graph container
algorithm. In fact, the 3-uniform hypergraph container algorithm calls the graph
container algorithm.
Container algorithm for 3-uniform hypergraphs (a very rough sketch):
Throughout the algorithm, we will maintain
• A fingerprint 𝑆, initially 𝑆 = ∅
• A 3-uniform hypergraph 𝐴, initially 𝐴 = 𝐻
• A graph 𝐺 of “forbidden” pairs on 𝑉 (𝐻), initially 𝐺 = ∅

While |𝑆| ≤ 𝑣(𝐻)/√𝑑 − 1:
• Let 𝑢 be the first vertex in 𝐼 in the max-degree order on 𝐴
• Add 𝑢 to 𝑆
• Add 𝑥𝑦 to 𝐸(𝐺) whenever 𝑢𝑥𝑦 ∈ 𝐸(𝐻)
• Remove from 𝑉(𝐴) the vertex 𝑢 as well as all vertices preceding 𝑢 in the
max-degree order on 𝐴
• Remove from 𝑉(𝐴) every vertex whose degree in 𝐺 is larger than 𝑐√𝑑.
• Remove from 𝐸(𝐴) every edge that contains an edge of 𝐺.
Finally, it will be the case that either
• We have removed many vertices from 𝑉(𝐴)
• Or the final graph 𝐺 has at least Ω(√𝑑 𝑛) edges and has maximum degree 𝑂(√𝑑),
so that we can apply the graph container lemma to 𝐺.
In either case, the algorithm produces a container with the desired properties. Again
see Morris’ lecture notes for details.

MIT OpenCourseWare
https://ocw.mit.edu

18.226 Probabilistic Methods in Combinatorics


Fall 2022

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
