TASI Lecture On Physics For ML
Abstract
These notes are based on lectures I gave at TASI 2024 on Physics for Machine Learning. The focus
is on neural network theory, organized according to network expressivity, statistics, and dynamics. I
present classic results such as the universal approximation theorem and neural network / Gaussian
process correspondence, and also more recent results such as the neural tangent kernel, feature learning
with the maximal update parameterization, and Kolmogorov-Arnold networks. The exposition on neural
network theory emphasizes a field theoretic perspective familiar to theoretical physicists. I elaborate on
connections between the two, including a neural network approach to field theory.
Physics for Machine Learning Lecture Notes 2
Contents

1 Introduction 3

5 NN-FT Correspondence 17
5.1 Quantum Field Theory 18
5.2 ϕ^4 Theory 19
5.3 Open Questions 19
The Setup. Understanding ML at the very least means understanding neural networks. A neural network is a function

ϕ_θ : R^d → R (1)

with parameters θ. We've chosen outputs in R because, channeling Coleman, scalars already exhibit the essentials. We'll use the lingo

Input: x ∈ R^d (2)
Output: ϕ_θ(x) ∈ R (3)
Network: ϕ_θ ∈ Maps(R^d, R) (4)
Data: D, (5)

where the data D depends on the problem, but involves at least a subset of R^d, potentially paired with labels y ∈ R.

With this minimal background, let's ask our central question:

Question: What does a NN predict?

For any fixed value of θ, the answer is clear: ϕ_θ(x). However, the answer is complicated by issues of both dynamics and statistics.

First, dynamics. In ML, parameters are updated to solve problems, and we really have trajectories in

Parameter Space: θ(t) ∈ R^|θ| (6)
Output Space: ϕ_θ(t)(x) ∈ R (7)
Function Space: ϕ_θ(t) ∈ Maps(R^d, R), (8)

governed by some learning dynamics determined by the optimization algorithm and the nature of the learning problem. For instance, in supervised learning we have data

D = {(x_α, y_α) ∈ R^d × R}_{α=1}^{|D|}, (9)

and a loss function

L[ϕ_θ] = Σ_{α=1}^{|D|} ℓ(ϕ_θ(x_α), y_α), (10)

where ℓ is a loss function such as ℓ_MSE = (ϕ_θ(x_α) − y_α)^2. One may optimize θ by gradient descent

dθ_i/dt = −∇_{θ_i} L[ϕ_θ], (11)

or other algorithms, e.g., classics like stochastic gradient descent (SGD) [23, 24] or Adam [25], or a more recent technique such as Energy Conserving Descent [26, 27]. Throughout, t is training time of the learning algorithm unless otherwise noted.

Second, statistics. When a NN is initialized on your computer, the parameters θ are initialized as draws

θ ∼ P(θ) (12)

from a distribution P(θ), where ∼ means "drawn from" in this context. Different draws of θ will give different functions ϕ_θ, and a priori we have no reason to prefer one over another. The prediction ϕ_θ(x) therefore can't be fundamental! Instead, what is fundamental is the average prediction and second moment or variance:

E[ϕ_θ(x)] = ∫ dθ P(θ) ϕ_θ(x) (13)

E[ϕ_θ(x) ϕ_θ(y)] = ∫ dθ P(θ) ϕ_θ(x) ϕ_θ(y), (14)

as well as the higher moments. Expectations are across different initializations. Since we're physicists, we henceforth replace E[·] = ⟨·⟩ and remember this is a statistical expectation value. It's useful to put this in our language:

G^(1)(x) = ⟨ϕ_θ(x)⟩ (15)
G^(2)(x, y) = ⟨ϕ_θ(x) ϕ_θ(y)⟩, (16)

i.e., the mean prediction and second moment are just the one-point and two-point correlation functions of the statistical ensemble of neural networks. Apparently ML has something to do with field theory.

Putting the dynamics and statistics together, we have an ensemble of initial θ-values, each of which is the starting point of a trajectory θ(t), and therefore we have an ensemble of trajectories.
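The supervised setup (9)-(11) is easy to realize concretely. Here is a minimal numpy sketch (my own illustration, with arbitrary choices of data, width, and learning rate) of discretized gradient flow on an MSE loss for a single-hidden-layer network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised data D = {(x_a, y_a)}: fit y = sin(2x), with d = 1.
X = np.linspace(-1, 1, 32).reshape(-1, 1)
Y = np.sin(2 * X[:, 0])

# Network phi(x) = (1/sqrt(N)) sum_i w1_i tanh(w0_i . x).
N, d = 64, 1
w0 = rng.normal(size=(N, d))
w1 = rng.normal(size=N)

def phi(X, w0, w1):
    return np.tanh(X @ w0.T) @ w1 / np.sqrt(len(w1))

def loss(w0, w1):
    # L = (1/|D|) sum_a (phi(x_a) - y_a)^2, an MSE version of (10).
    return np.mean((phi(X, w0, w1) - Y) ** 2)

loss_before = loss(w0, w1)

eta = 0.1
for step in range(500):  # Euler steps of the gradient flow (11)
    act = np.tanh(X @ w0.T)                       # (|D|, N)
    resid = act @ w1 / np.sqrt(N) - Y             # phi(x_a) - y_a
    g1 = 2 * resid @ act / (np.sqrt(N) * len(X))  # dL/dw1, by chain rule
    g0 = 2 * ((resid[:, None] * (1 - act**2)) * w1 / np.sqrt(N)).T @ X / len(X)  # dL/dw0
    w1 -= eta * g1
    w0 -= eta * g0

loss_after = loss(w0, w1)
```

Running this, the loss decreases monotonically for small enough step size, a discrete shadow of (11).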
We choose to think of θ(t) drawn as

θ(t) ∼ P(θ(t)), (17)

a density on parameters that depends on the training time and yields time-dependent correlators

G_t^(1)(x) = ⟨ϕ_θ(x)⟩_t (18)
G_t^(2)(x, y) = ⟨ϕ_θ(x) ϕ_θ(y)⟩_t, (19)

where the subscript t indicates time-dependence and the expectation is with respect to P(θ(t)). Of course, assuming that learning is helping, we wish to take t → ∞ and are interested in

G_∞^(1)(x) = mean prediction of ∞-number of NNs as t → ∞.

Remarkably, we will see that in a certain supervised setting there is an exact analytic solution for this quantity.

There is one more pillar beyond dynamics and statistics that I need to introduce: expressivity. Let's make the point by seeing a failure mode. Consider neural networks of the following functional form, or architecture:

ϕ_θ(x) = θ · x. (20)

The one-point and two-point functions of these are analytically solvable, but going that far defeats the purpose: this is a linear model, and learning schemes involving it are linear regression, which will fail on general problems. We say that the model is not expressive enough to account for the data. Conversely, we wish instead to choose expressive architectures that can model the data, which essentially requires that the architecture is complex enough to approximate anything. Of course, such architectures must be non-linear.

We now have the three pillars on which we'll build our understanding of neural networks, and associated questions:

• Expressivity. How powerful is the NN?

• Statistics. What is the NN ensemble?

• Dynamics. How does it evolve?

We will approach the topics in this order, building up theory that takes us through the developments of the last few years through a physicist's lens.

A physicist's lens means a few things. First, it means a physicist's tools, including:

• Field Theory, as alluded to above.

• Landscape Dynamics from loss functions on θ-space.

• Symmetries of individual NNs and their ensembles.

A physicist's lens also means we will try to find the appropriate balance of rigor and intuition. We'd like mechanisms and toy models, but we won't try to prove everything at a level that would satisfy a mathematician. This befits the current situation in ML, where there are an enormous number of empirical results that need an O(1) theoretical understanding, not rigorous proof.

Henceforth, we drop the subscript θ in ϕ_θ(x), and the reader should recall that the network depends on parameters.

2 Expressivity of Neural Networks

Neural networks are big functions composed out of many simpler functions according to the choice of architecture. The compositions give more flexibility, prompting

Question: How powerful is a NN?

Ultimately this is a question for mathematics, as it's a question about functions. Less colloquially, what we mean by the power of a NN is its ability to approximate any function.
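The failure mode of the linear model (20) is easy to see numerically. The following sketch (my own toy comparison; the random-feature model is just an illustrative nonlinear stand-in) fits a nonlinear target with both:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target the linear model (20) cannot express: y = sin(3x) on [-pi, pi].
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(3 * X[:, 0])

# Linear model phi_theta(x) = theta . x (with a bias column): linear regression.
A_lin = np.hstack([X, np.ones((len(X), 1))])
theta, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
mse_linear = np.mean((A_lin @ theta - y) ** 2)

# A nonlinear architecture: fixed random tanh features with a linear readout.
N = 300
w = 3.0 * rng.normal(size=(1, N))
b = rng.uniform(-np.pi, np.pi, size=N)
A_feat = np.tanh(X @ w + b)
coef, *_ = np.linalg.lstsq(A_feat, y, rcond=None)
mse_features = np.mean((A_feat @ coef - y) ** 2)
```

The linear fit is stuck near the variance of the target, while the nonlinear model drives the error far lower: expressivity requires non-linearity.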
Figure 2: A comparison of MLP and KAN, including their functional form and the mathematical theorem motivating the architecture. The content of the figure:

Theorem. MLP: Universal Approximation Theorem. KAN: Kolmogorov-Arnold Representation Theorem.
Formula (Shallow). MLP: f(x) ≈ Σ_{i=1}^{N(ε)} a_i σ(w_i · x + b_i). KAN: f(x) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^{n} ϕ_{q,p}(x_p) ).
Model (Shallow). MLP: fixed activation functions on nodes, learnable weights on edges. KAN: learnable activation functions on edges, sum operation on nodes.
Formula (Deep). MLP(x) = (W_3 ∘ σ_2 ∘ W_2 ∘ σ_1 ∘ W_1)(x). KAN(x) = (Φ_3 ∘ Φ_2 ∘ Φ_1)(x).
Model (Deep). MLP: the W_ℓ are linear and learnable, the σ_ℓ nonlinear and fixed. KAN: the Φ_ℓ are nonlinear and learnable.

However, one aspect that is a little unusual is that ϕ_{q,p} depends on both p and q, and therefore lives on the connection between x_p and x_q := Σ_{p=1}^{n} ϕ_{q,p}(x_p). If this is a neural network, it's one with activations ϕ_{q,p} on the edges, rather than on the nodes. Furthermore, for distinct values of p, q, the functions ϕ_{q,p} are independent in general.

In fact, this idea led to the recent proposal of Kolmogorov-Arnold networks (KAN) [32], a new architecture in which activation functions are on the edges, as motivated by KART, and are learned. Much like how MLPs may be formed out of layers that repeat the structure in the single-layer UAT, KANs may be formed into layers to repeat the structure of KART. See Figure 2 for a detailed comparison with MLPs, including a depiction of the architectures and how they are formed out of layers motivated by their respective mathematical theorems. The KAN implementation provided in [32] includes visualizations of the learned activation functions, which can lead to interpretable architectures that can be mapped directly onto symbolic formulae.
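To make the edges-versus-nodes distinction concrete, here is a minimal numpy sketch of a single KAN-style layer (my own toy illustration, not the implementation of [32]; real KANs use learnable splines, which I replace here by a small Fourier basis for each edge):

```python
import numpy as np

rng = np.random.default_rng(2)

def edge_activation(x, c, s, freqs):
    """One edge's own function phi_{q,p}(x) = sum_k c_k sin(f_k x + s_k)."""
    return np.sum(c * np.sin(freqs * x + s))

def kan_layer(x, C, S, freqs):
    """Map x in R^{n_in} to R^{n_out}: out_q = sum_p phi_{q,p}(x_p).

    Every (q, p) edge carries its own learnable function (its own
    coefficients C[q, p], S[q, p]); nodes only sum. In an MLP it is the
    reverse: edges carry learnable scalars, nodes carry one fixed sigma.
    """
    n_out, n_in, _ = C.shape
    out = np.zeros(n_out)
    for q in range(n_out):
        for p in range(n_in):
            out[q] += edge_activation(x[p], C[q, p], S[q, p], freqs)
    return out

n_in, n_out, K = 3, 5, 4
freqs = np.arange(1, K + 1, dtype=float)
C = rng.normal(size=(n_out, n_in, K))  # learnable amplitudes, one set per edge
S = rng.normal(size=(n_out, n_in, K))  # learnable phases, one set per edge

out = kan_layer(rng.normal(size=n_in), C, S, freqs)
```

Stacking such layers gives the deep KAN(x) = (Φ_3 ∘ Φ_2 ∘ Φ_1)(x) of Figure 2.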
where J(x) is a source. This expectation ⟨·⟩ is intentionally not specified here to allow for flexibility. For instance, using the expectation in the introduction we have

Z[J] = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ(x)}, (30)

reminding the reader that the NN ϕ(x) depends on θ. The partition function integrates over the density of network parameters. But as physicists we're much more familiar with function space densities, according to

Z[J] = ∫ Dϕ e^{−S[ϕ]} e^{∫ J(x) ϕ(x)}, (31)

the Feynman path integral that determines the correlators from an action S[ϕ] that defines a density on functions.

Since starting a neural network requires specifying the data (ϕ, P(θ)), the parameter space partition function (30) and associated parameter space calculation of correlators is always available to us. Given that mathematical data, one might ask

Question: What is the action S[ϕ] associated to (ϕ, P(θ))?

When this question can be answered, it opens a second way of studying or understanding the theory. The parameter-space and function-space descriptions should be thought of as a duality.

3.1 NNGP Correspondence

Having raised the question of the action S[ϕ] associated to the network data (ϕ, P(θ)), we can turn to a classic result of Neal [33]. For simplicity, we again consider a single-layer fully connected network of width N, with the so-called biases turned off for simplicity:

ϕ(x) = (1/√N) Σ_{i=1}^{N} Σ_{j=1}^{d} w_i^(1) σ(w_ij^(0) x_j), (32)

where the set of network parameters is θ = {w_ij^(0), w_i^(1)}, independently and identically distributed (i.i.d.),

w_ij^(0) ∼ P(w^(0)), w_i^(1) ∼ P(w^(1)). (33)

Under this assumption, we see

Observation: The network is a sum of N i.i.d. functions.

This is a function version of the Central Limit Theorem, generalizing the review in Appendix A, and gives us the Neural Network / Gaussian Process (NNGP) correspondence,

NNGP Correspondence: in the N → ∞ limit, ϕ is drawn from a Gaussian Process (GP),

lim_{N→∞} ϕ(x) ∼ N(μ(x), K(x, y)), (34)

with mean and covariance (or kernel) μ(x) and K(x, y).

By the CLT, exp(−S[ϕ]) is Gaussian and therefore S[ϕ] is quadratic in networks. Now this really feels like physics, since the infinite neural network is drawn from a Gaussian density on functions, which defines a generalized free field theory.

We will address the generality of the NNGP correspondence momentarily, but let's first get a feel for how to do computations. To facilitate them, we take P(w^(1)) to have zero mean and finite variance,

⟨w^(1)⟩ = 0, ⟨w^(1) w^(1)⟩ = μ_2, (35)

which causes the one-point function to vanish, G^(1)(x) = 0. Following Williams [34], we compute the two-point function in parameter space (with Einstein summation)

G^(2)(x, y) = (1/N) ⟨w_i^(1) σ(w_ij^(0) x_j) w_k^(1) σ(w_kl^(0) y_l)⟩ (36)
            = (1/N) ⟨w_i^(1) w_k^(1)⟩ ⟨σ(w_ij^(0) x_j) σ(w_kl^(0) y_l)⟩ (37)
            = (μ_2/N) ⟨σ(w_ij^(0) x_j) σ(w_il^(0) y_l)⟩, (38)
where the last equality follows from the w^(1)'s being i.i.d., ⟨w_i^(1) w_k^(1)⟩ = μ_2 δ_ik. The sum over i gives us N copies of the same function, leaving us with

G^(2)(x, y) = μ_2 ⟨σ(w_ij^(0) x_j) σ(w_il^(0) y_l)⟩, (39)

where we emphasize there is now no summation on i. This is an exact-in-N two-point function that now requires only the computation of the quantity in brackets. One may try to evaluate it exactly by doing the integral over w^(0). If it can't be done, Monte Carlo estimates may be obtained from M samples of w^(0) ∼ P(w^(0)) as

G^(2)(x, y) ≃ (μ_2/M) Σ_samples σ(w_ij^(0) x_j) σ(w_il^(0) y_l). (40)

In typical NN settings, parameter densities are easy to sample for convenience, allowing for easy computation of the estimate. If the density is more complicated, one may always resort to Markov chains, e.g. as in lattice field theory.

With this computation in hand, we have the defining data of this NNGP,

lim_{N→∞} ϕ(x) ∼ N(0, G^(2)(x, y)). (41)

The associated action is

S[ϕ] = ∫ d^d x d^d y ϕ(x) G^(2)(x, y)^{−1} ϕ(y), (42)

where

∫ d^d y G^(2)(x, y)^{−1} G^(2)(y, z) = δ^(d)(x − z) (43)

defines the inverse two-point function. In fact, this allows us to determine the action of any NNGP with μ(x) = G^(1)(x) = 0, by computing G^(2) in parameter space and inverting it.

So certain large neural networks are function draws from generalized free field theories. But at this point you might be asking yourself

Question: How general is the NNGP correspondence?

Neal's result, that infinite-width single-layer feedforward NNs are drawn from GPs, stood on its own for many years, perhaps (I am guessing) due to focus on non-NN ML techniques in the 90's and early 2000's during a so-called AI Winter. As NNs succeeded on many tasks in the 2010's after AlexNet [35], however, many asked whether architecture X has a hyperparameter N such that the network is drawn from a Gaussian Process as N → ∞. Before listing such X's, let's rhetorically ask

Question: Didn't Neal's result essentially follow from summing N i.i.d. random functions? Maybe NNs do this all the time?

In fact, that is the case. Architectures admitting an NNGP limit include

• Deep Fully Connected Networks, N = width,

• Convolutional Neural Networks, N = channels,

• Attention Networks, N = heads,

and many more. See, e.g., [36] and references therein.

3.2 Non-Gaussian Processes

If the GP limit exists due to the CLT, then violating any of the assumptions of the CLT should introduce non-Gaussianities, which are interactions in field theory. From Appendix A, we see that the CLT is violated by finite-N corrections and breaking statistical independence. See [37] for a systematic treatment of independence breaking, and derivation of NN actions from correlators.

We wish to see the N-dependence of the connected 4-pt function. Williams's technique for computing G^(2) extends to any correlator. To avoid a proliferation of indices, we will compute it using the notation

ϕ(x) = Σ_i w_i φ_i(x), (44)
where w_i is distributed as w^(1) was in the single layer case, and φ_i(x) are i.i.d. neurons of any architecture. The four-point function is

G^(4) = ⟨ϕ(x) ϕ(y) ϕ(z) ϕ(w)⟩ (45)
      = Σ_{i,j,k,l} ⟨w_i w_j w_k w_l⟩ ⟨φ_i(x) φ_j(y) φ_k(z) φ_l(w)⟩ (46)
      = Σ_i ⟨w_i^4⟩ ⟨φ_i(x) φ_i(y) φ_i(z) φ_i(w)⟩ (47)
      + Σ_{i≠j} ⟨w_i^2⟩ ⟨w_j^2⟩ ⟨φ_i(x) φ_i(y) φ_j(z) φ_j(w) + perms⟩. (48)

One can see that you have to be careful with indices. The connected 4-pt function is [38]

G_c^(4)(x, y, z, w) = G^(4)(x, y, z, w) − [G^(2)(x, y) G^(2)(z, w) + perms], (49)

and watching indices carefully we obtain

G_c^(4)(x, y, z, w) = (1/N) [ μ_4 ⟨φ_i(x) φ_i(y) φ_i(z) φ_i(w)⟩ (50)
                    − μ_2^2 ( ⟨φ_i(x) φ_i(y)⟩ ⟨φ_i(z) φ_i(w)⟩ + perms ) ], (51)

with no Einstein summation on i. We see that the connected 4-pt function is non-zero at finite N, signalling interactions. We will see that in some examples G_c^(4) can be computed exactly.

in Section 5, but for now we will focus on one type of structure: symmetries.

To allow for symmetries at both input and output, in this section we consider networks

ϕ : R^d → R^D, (52)

with D-dimensional output. Sometimes the indices will be implicit. A classic example is equivariant neural networks [39]. We say that ϕ is G-equivariant with respect to a group G if

ρ_D(g) ϕ(x) = ϕ(ρ_d(g) x) ∀ g ∈ G, (53)

where

ρ_d ∈ Mat(R^d), ρ_D ∈ Mat(R^D) (54)

are matrix representations of G on R^d and R^D, respectively. The network is invariant if ρ_D = 1, the trivial representation. Equivariance is a powerful constraint on the network that may be implemented in a variety of ways. For problems that have appropriate symmetries, building them into the network can improve the speed and performance of learning [40], e.g., at the level of scaling laws [41, 42, 43]. For instance, in Phiala Shanahan's lectures you'll learn about SU(N)-equivariant neural networks from her work [44], which are natural in lattice field theory due to invariance of the action under gauge transformations.

But this is the statistics section, so we're interested in symmetries that arise in ensembles of neural networks, which leave the statistical ensemble invariant. In field theory, we call them global symmetries. Let the network transform under a group action as
where one can put indices on ϕ and J as required by D. By a network redefinition on the LHS, this may be cast as having a symmetry if ⟨·⟩ is invariant. In the usual path integral this is the statement of invariant action S[ϕ] and measure Dϕ. In parameter space, the redefinition may be instituted [45] by absorbing g into a redefinition of parameters as θ ↦ θ_g, with symmetry arising when

∫ dθ_g P(θ_g) e^{∫ d^d x J(x) ϕ_θ(x)} = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ_θ(x)}, (58)

i.e. the parameter density and measure must be invariant. We will give a simple example realizing this mechanism in a moment.

It is most natural to transform the input or output of the network. Our mechanism allows for symmetries of both types, which are analogs of spacetime and internal symmetries, respectively. It may also be interesting to study symmetries of intermediate layers, if one wishes to impose symmetries on learned representations. Equivariance fits into this picture because it turns a transformation at input into a transformation at output. The ensemble of equivariant NNs is invariant under the ρ_d action on the input if the partition function is invariant under the induced ρ_D action on the output.

Example. Consider any architecture of the form

ϕ(x) = Σ_i w_i φ_i(x), w_i ∼ P(w) even, (59)

for any neuron φ, which could itself be a deep network. The Z_2 action ϕ ↦ −ϕ may be absorbed into parameters as w_g = −w_i, with dw P(w) invariant by evenness, which is a global symmetry provided that the domain is invariant. This theory has only even correlators.

Less trivial examples will be presented below, when we present specific networks and compute correlators.

3.4 Examples

In the statistics of neural networks, we have covered three topics: Gaussian limits, non-Gaussian corrections from violating CLT assumptions, and symmetries. We will present this essential data in some examples and then discuss similarities and differences. For canonical cases in ML, like networks with ReLU activation or similar, see [46] and references therein.

Gauss-net. The architecture and parameter densities are

ϕ(x) = w_i^(1) exp(w_ij^(0) x_j + b_i^(0)) / √( exp[ 2( σ_{b0}^2 + σ_{w0}^2 x^2 / d ) ] ), (60)

for parameters drawn i.i.d. from

w^(0) ∼ N(0, σ_{w0}^2 / d), w^(1) ∼ N(0, σ_{w1}^2 / 2N), b^(0) ∼ N(0, σ_{b0}^2). (61)

The two-point function is

G^(2)(x_1, x_2) = (σ_{w1}^2 / 2) e^{−σ_{w0}^2 (x_1 − x_2)^2 / 2d}, (62)

and we see that the theory has a correlation length

ξ = √( d / σ_{w0}^2 ). (63)

The connected four-point function is

G_c^(4) = (σ_{w1}^4 / 4N) [ 3 e^{4 σ_{b0}^2} e^{−(σ_{w0}^2 / 2d)( Σ_i x_i^2 − 2(x_1 x_2 + 6 perms) )}
        − ( e^{−(σ_{w0}^2 / 2d)( (x_1 − x_4)^2 + (x_2 − x_3)^2 )} + 2 perms ) ]. (64)

We see that the 2-pt function is translation invariant, but not the 4-pt function.

Euclidean-Invariant Nets. We are interested in constructing input layers that are sufficient to ensure invariance under SO(d) and translations, i.e., under the Euclidean group. Consider an input layer of the form

ℓ_i(x) = F(w_i^(0)) cos( Σ_j w_ij^(0) x_j + b_i^(0) ), i ∈ 1, . . . , N, (65)
where the sum has been made explicit since the i's are not summed over, with

w_ij^(0) ∼ P(w_ij^(0)), b_i^(0) ∼ Unif[−π, π]. (66)

It is easy to see that x_i → R_ij x_j for R ∈ SO(d) can be absorbed into a redefinition of the w_ij^(0)'s, and the partition function is invariant under this action provided that F(w_i^(0)) and P(w_ij^(0)) are invariant. Furthermore, arbitrary translations x_j → x_j + ε_j give a term w_ij^(0) ε_j that can be absorbed into a redefinition of the b_i^(0) that leaves the theory invariant. See [47] for the 2-pt and 4-pt functions.

We emphasize two nice features:

• Larger Euclidean Nets. Any network that builds on ℓ without reusing its parameters is Euclidean-invariant.

• Spectrum Shaping. In computing G^(2)(p), b_i gets evaluated on p, and F may be chosen to shape the power spectrum (momentum space propagator) arbitrarily.

We refer the reader to [47] for general calculations of ℓ-correlators and to the specializations below for simple single-layer theories. We turn the ℓ_i(x) input layer into a scalar NN by

ϕ(x) = Σ_i w_i^(1) ℓ_i(x). (67)

Cos-net. Specializing (67), the two-point function is

G^(2)(x_1, x_2) = (σ_{w1}^2 / 2) e^{−σ_{w0}^2 (x_1 − x_2)^2 / 2d}. (70)

The connected four-point function is

G^(4)|_c = (σ_{w1}^4 / 8N) [ 3 e^{−(σ_{w0}^2 / 2d)(x_1 + x_2 − x_3 − x_4)^2} + 2 perms
         − 2 ( e^{−(σ_{w0}^2 / 2d)( (x_1 − x_4)^2 + (x_2 − x_3)^2 )} + 2 perms ) ]. (71)

The 2-pt function and 4-pt function are fully Euclidean invariant. Equipped with both Cos-net and Gauss-net, we'd like to compare the two:

• Symmetry. Cos-net is Euclidean-invariant ∀ N, by construction, while Gauss-net enjoys this symmetry only when N → ∞; in that limit it only depends on G^(2)(x, y).

• Large-N Duality. As N → ∞, both theories are drawn from the same Gaussian Process, i.e., with the same G^(2)(x, y).

Scalar-net. Here is one final example you might like, specializing (67):

ϕ(x) = √( 2 vol(B_Λ^d) / ( (2π)^d σ_{w1}^2 ) ) · ( w_i^(1) / √( w_i^{(0) 2} + m^2 ) ) cos( w_ij^(0) x_j + b_i^(0) ). (72)

We see that we have a realization of the free scalar field theory in d Euclidean dimensions. See Sec. 5.2 for an extension to ϕ^4 theory.
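These closed forms can be checked with the Monte Carlo strategy of (40). For a cos-type layer with w^(0) ∼ N(0, σ_{w0}^2 1/d) and b^(0) ∼ Unif[−π, π], the single-neuron expectation is exactly ⟨cos(w·x + b) cos(w·y + b)⟩ = (1/2) e^{−σ_{w0}^2 (x−y)^2 / 2d}, the Gaussian falloff of (70). A numpy sketch (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

d, sigma_w = 2, 1.5                      # input dimension and sigma_{w0}
x = np.array([0.3, -0.7])
y = np.array([1.0, 0.4])

M = 400_000                              # Monte Carlo samples, as in (40)
w = rng.normal(0.0, sigma_w / np.sqrt(d), size=(M, d))
b = rng.uniform(-np.pi, np.pi, size=M)

# MC estimate of <cos(w.x + b) cos(w.y + b)> over one neuron's parameters.
mc = np.mean(np.cos(w @ x + b) * np.cos(w @ y + b))

# Closed form: averaging over b kills the cos(w.(x+y) + 2b) cross term,
# and <cos(w.(x-y))> = exp(-Var[w.(x-y)]/2) for Gaussian w.
exact = 0.5 * np.exp(-sigma_w**2 * np.sum((x - y) ** 2) / (2 * d))
```

The two numbers agree up to the expected O(1/√M) sampling error, as the estimator (40) promises.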
Δ(x) = −δℓ(ϕ(x), y)/δϕ(x), (78)

where y is to be understood as the label associated to x, which yields

dθ_i/dt = (η/|D|) Σ_{α=1}^{|D|} Δ(x_α) ∂ϕ(x_α)/∂θ_i (79)

as another form of the gradient descent equation, by the chain rule. Δ(x) is the natural object of gradient descent in function space.

We use Einstein summation throughout this section unless stated otherwise (which will happen).

It's also non-local, communicating information about the loss at train points x_α to the test point x. In summary, the NTK is unwieldy. The reason it is a classic result, however, is that it simplifies in the N → ∞ limit. In this limit, neural network training is in the so-called

Lazy regime: |θ(t) − θ(0)| ≪ 1, (83)

i.e. the number of parameters is large but the evolution keeps them in a local neighborhood. In such a regime the network is approximately a linear-in-parameters model [48, 49]

lim_{N→∞} ϕ(x) ≃ ϕ_lin(x) := ϕ_{θ_0}(x) + (θ − θ_0)_i ∂ϕ(x)/∂θ_i |_{θ_0}, (84)
and we have

lim_{N→∞} Θ(x, x′) ≃ Θ(x, x′)|_{θ_0}. (85)

That is, the infinite-width NTK is the NTK at initialization, provided that the network evolves as a linear model. Furthermore, in the same limit the law of large numbers often allows a sum to be replaced by an expectation value, e.g.,

lim_{N→∞} Θ(x, x′)|_{θ_0} = ⟨β_θ(x, x′)⟩ =: Θ̄(x, x′), (86)

for computable β(x, x′), yielding network dynamics governed by

dϕ(x)/dt = −(η/|D|) Σ_{α=1}^{|D|} [ δℓ(ϕ(x_α), y_α)/δϕ(x_α) ] Θ̄(x, x_α), (87)

where Θ̄ is the so-called frozen NTK, a kernel that may be computed at initialization and fixed once-and-for-all. This is a dramatic simplification of the dynamics.

However, you should also complain.

Complaint: The dynamics in (87) simply interpolates between information at train points x_α and test point x, according to a fixed function Θ̄.

This isn't "learning" in the usual NN sense, and there are zero parameters. In particular, since the NN only affects (87) through Θ̄, which is fixed, nothing happening dynamically in the NN is affecting the evolution. We say that in this limit the NN does not learn features in the hidden dimensions (intermediate layers), since their non-trivial evolution would cause the NTK to evolve.

Example. Let's compute the frozen NTK for a single-layer network, to get the idea. The architecture is

ϕ(x) = (1/√N) Σ_{i=1}^{N} Σ_{j=1}^{d} w_i^(1) σ(w_ij^(0) x_j). (88)

We make the sums explicit here because one is very important. The NTK is

Θ(x, x′) = Σ_i ∂ϕ(x)/∂w_i^(1) ∂ϕ(x′)/∂w_i^(1) + Σ_{ij} ∂ϕ(x)/∂w_ij^(0) ∂ϕ(x′)/∂w_ij^(0) (89)
         = (1/N) Σ_{i=1}^{N} [ Σ_{j,l=1}^{d} σ(w_ij^(0) x_j) σ(w_il^(0) x′_l) (90)
         + Σ_{j=1}^{d} x_j x′_j w_i^(1) w_i^(1) σ′(w_ij^(0) x_j) σ′(w_ij^(0) x′_j) ] (91)
         =: (1/N) Σ_i β_i(x, x′). (92)

If you squint a little, you'll see that the i-sum is a sum over the same type of object, β_i(x, x′), whose i dependence comes from the i.i.d. parameter draws in the i-direction. By the law of large numbers, we have that in the N → ∞ limit

Θ̄(x, x′) = ⟨β_i(x, x′)⟩, (93)

with no sum on i. We emphasize

Observation: The NTK in the N → ∞ limit is deterministic (parameter-independent), depending only on P(θ).

Sometimes, the expectation may be computed exactly, and one knows the NTK that governs the dynamics once and for all.
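This concentration is easy to see numerically. The sketch below (my own check, with σ = tanh and standard normal parameters as arbitrary choices) draws the empirical NTK (89)-(92) of the network (88) at two widths, for fixed inputs, across many initializations:

```python
import numpy as np

rng = np.random.default_rng(4)

x = np.array([0.5, -0.2])
xp = np.array([0.1, 0.9])
d = len(x)

def empirical_ntk(w0, w1):
    """Theta(x, x') of phi = (1/sqrt N) sum_i w1_i tanh(w0_i . x), eqs. (89)-(92)."""
    a, ap = np.tanh(w0 @ x), np.tanh(w0 @ xp)
    da, dap = 1 - a**2, 1 - ap**2                         # tanh'(u) = 1 - tanh(u)^2
    return np.mean(a * ap + (x @ xp) * w1**2 * da * dap)  # (1/N) sum_i beta_i

def ntk_std(N, draws=200):
    """Std of the empirical NTK across initializations at width N."""
    vals = [empirical_ntk(rng.normal(size=(N, d)), rng.normal(size=N))
            for _ in range(draws)]
    return np.std(vals)

std_small, std_large = ntk_std(10), ntk_std(4000)
# std_large is far smaller: the NTK concentrates on its mean as N grows,
# consistent with the law-of-large-numbers argument around (93).
```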
4.2 An Exactly Solvable Model

Let us consider a special case of frozen-NTK dynamics with MSE loss,

ℓ(ϕ(x), y) = (1/2)(ϕ(x) − y)^2. (94)

Then the dynamics (87) becomes

dϕ(x)/dt = −(η/|D|) Σ_{α=1}^{|D|} (ϕ(x_α) − y_α) Θ̄(x, x_α). (95)

The solution to this ODE is

ϕ_t(x) = ϕ_0(x) + Θ̄(x, x_α) Θ̄(x_α, x_β)^{−1} [ 1 − e^{−ηΘ̄t/|D|} ]_{βγ} (y_γ − ϕ_0(x_γ)), (96)

where a computational difficulty is that Θ̄(x_α, x_β) is a |D| × |D| matrix and takes O(|D|^3) time to invert. The solution defines a trajectory through function space from ϕ_0 to ϕ_∞. The converged network is

ϕ_∞(x) = ϕ_0(x) + Θ̄(x, x_α) Θ̄(x_α, x_β)^{−1} (y_β − ϕ_0(x_β)). (97)

This is known as kernel regression, a classic technique in ML. In general kernel regression, one chooses the kernel. In our case, gradient descent training in the N → ∞ limit is kernel regression, with respect to a specific kernel determined by the NN, the NTK Θ̄.

On train points we have memorization,

ϕ_∞(x_α) = y_α ∀ α. (98)

On test points x, the converged network is performing an interpolation, communicating residuals R_β on train points β through a fixed kernel Θ̄ to test points x. The prediction depends on ϕ_0, but may be averaged over to obtain

μ_∞(x) := ⟨ϕ_∞(x)⟩ = Θ̄(x, x_α) Θ̄(x_α, x_β)^{−1} y_β, (99)

where

• μ_∞(x) is the mean prediction of an ∞ number of ∞-wide NNs trained to ∞ time.

• If ϕ_0 is drawn from a GP, then ϕ_∞ is as well. The mean is precisely μ_∞(x); see [49] for the two-point function and covariance.
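Equations (97)-(99) are easy to implement once a kernel is chosen. The sketch below (my own illustration) uses a Gaussian/RBF kernel as a stand-in for the frozen NTK Θ̄; any positive-definite kernel works the same way:

```python
import numpy as np

rng = np.random.default_rng(5)

def kernel(A, B, ell=0.3):
    """Stand-in for the frozen kernel Theta_bar(x, x'): an RBF kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

# Train data (x_alpha, y_alpha) and an arbitrary "initial network" phi_0.
Xtr = np.linspace(-2, 2, 12).reshape(-1, 1)
ytr = np.cos(Xtr[:, 0])
phi0 = lambda X: 0.3 * np.sin(5 * X[:, 0])

K = kernel(Xtr, Xtr) + 1e-10 * np.eye(len(Xtr))   # Theta_bar(x_a, x_b), jittered
alpha = np.linalg.solve(K, ytr - phi0(Xtr))       # Theta_bar^{-1} (y - phi_0)

def phi_inf(X):
    """Converged network, eq. (97): phi_0 + Theta_bar(x, x_a) alpha_a."""
    return phi0(X) + kernel(X, Xtr) @ alpha

# Memorization (98): the converged network interpolates the train labels.
train_err = np.max(np.abs(phi_inf(Xtr) - ytr))
```

Averaging over ϕ_0 drops the phi0 terms and leaves the mean prediction μ_∞ of (99).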
4.3 Feature Learning

The frozen NTK is a tractable toy model, but it has a major drawback: it does not learn features. In this section we perform a more general study of learning dynamics, with a focus on choosing wise N-scaling such that features are non-trivially learned during gradient descent. To do this we will engineer three properties: that features (pre-activations) are finite at initialization, predictions evolve in finite time, and features evolve in finite time.

This section is denser than the others, so I encourage the reader to remember that we are aiming to achieve these three principles if they become overwhelmed with details. Throughout, I am following lecture notes of Pehlevan and Bordelon [50] with some changes in notation to match the rest of the lectures; see those lectures for more details, [51, 52] for original literature from those lectures that I utilize, and [53, 54] for related work on feature learning.

We study a deep feedforward network with L layers and width N, which in all is a map

ϕ : R^D → R (100)

(note the input dimension D; d is reserved for below) defined recursively as

ϕ(x) = (1/(γ_0 N^d)) z^(L)(x) (101)
z^(L)(x) = (1/N^{a_L}) w_i^(L) σ(z_i^(L−1)(x)) (102)
z_i^(ℓ)(x) = (1/N^{a_ℓ}) W_ij^(ℓ) σ(z_j^(ℓ−1)(x)), (103)

where Einstein summation is implied throughout this section (unless stated otherwise) and all Latin indices run over {1, . . . , N}, except the j-index in the first layer, where they run over {1, . . . , D}. The parameters are drawn

w_i^(L) ∼ N(0, 1/N^{b_L}), W_ij^(ℓ) ∼ N(0, 1/N^{b_ℓ}). (105)

We scale the learning rate as

η = η_0 γ_0^2 N^{2d−c}, (106)

with γ_0, η_0 O(1) constants. We use a parameterization that will be convenient, where d has already been introduced but c is a new parameter. For notational brevity we will sometimes use a Greek index subscript in place of inputs, e.g. z_α^(ℓ) := z^(ℓ)(x_α). The z's are known as the pre-activations, as they are the inputs to the activation functions σ.

We have a standard MLP but have parameterized our ignorance of N-scaling, governed by parameters (a_ℓ, b_ℓ, c, d). We will use this freedom to set some reasonable goals:

• Finite Initialization Pre-activations. z^(ℓ) ∼ O_N(1) ∀ ℓ.

• Learning in Finite Time. dϕ(x)/dt ∼ O_N(1).

• Feature Learning in Finite Time. dz^(ℓ)/dt ∼ O_N(1) ∀ ℓ.

These constraints have a one-parameter family of solutions, which becomes completely fixed under an additional learning rate assumption. We take each constraint in turn.

Finite Pre-activations. Since the weights are zero mean, trivially

⟨z_α^(ℓ)⟩ = 0 ∀ ℓ. (107)

We must also compute the covariance. For the first pre-activation, we have

⟨z_iα^(1) z_jβ^(1)⟩ = (1/(D N^{2a_1})) ⟨W_im^(1) W_jn^(1)⟩ x_mα x_nβ = (δ_ij/(D N^{2a_1+b_1})) x_mα x_mβ. (108)

A similar calculation for higher layers ℓ gives

⟨z_iα^(ℓ) z_jβ^(ℓ)⟩ = (1/N^{2a_ℓ}) ⟨W_im^(ℓ) W_jn^(ℓ)⟩ ⟨σ(z_mα^(ℓ−1)) σ(z_nβ^(ℓ−1))⟩ (109)
               = δ_ij (1/N^{2a_ℓ+b_ℓ−1}) (1/N) ⟨σ(z_mα^(ℓ−1)) σ(z_mβ^(ℓ−1))⟩ (110)
               = δ_ij (1/N^{2a_ℓ+b_ℓ−1}) ⟨Φ_αβ^(ℓ−1)⟩, (111)

where

Φ_αβ^(ℓ) := (1/N) σ(z_mα^(ℓ)) σ(z_mβ^(ℓ)) (112)

is a feature kernel. We are constructing a proof-by-induction that the pre-activations are O_N(1), so at this stage we may assume that the pre-activations z^(ℓ−1) ∼ O_N(1) and therefore Φ^(ℓ−1) ∼ O_N(1), since it is the average of N O_N(1) quantities σ(z_mα^(ℓ−1)). With this in hand,

2a_1 + b_1 = 0, 2a_ℓ + b_ℓ = 1 ∀ ℓ > 1, (113)

ensures that the pre-activations z^(ℓ) are O_N(1), as is empirically required for well-behaved training.

As an aside: in the N → ∞ limit feature kernels asymptote to deterministic objects, akin to the frozen NTK behavior we have already seen, and intermediate layer pre-activations z_α^(ℓ) are also Gaussian distributed. Therefore in that limit, the statistics of a randomly initialized neural network is described by a sequence of generalized free field theories where correlations are propagated down the network according to a recursion relation; see, e.g., [55, 49].

Learning in Finite Time. Similar to our NTK derivation, we have

dϕ(x)/dt = (η/|D|) Σ_{α=1}^{|D|} Δ(x_α) ∂ϕ(x_α)/∂θ_i ∂ϕ(x)/∂θ_i. (114)

We have that

dϕ(x)/dt ∼ O_N(1) ↔ (γ_0^2 N^{2d}/N^c) ∂ϕ(x_α)/∂θ_i ∂ϕ(x)/∂θ_i ∼ O_N(1). (115)

A short calculation involving the chain rule, keeping track of the different types of layers, and tedium yields

(η_0 γ_0^2 N^{2d}/N^c) ∂ϕ(x_α)/∂θ_i ∂ϕ(x)/∂θ_i = (1/N^c) [ (1/N^{2a_L−1}) Φ_αx^(L−1) + Σ_{ℓ=2}^{L−1} (1/N^{2a_ℓ−1}) G_αx^(ℓ) Φ_αx^(ℓ−1) + (1/N^{2a_1}) G_αx^(1) Φ_αx^(0) ], (116)

where

G_μν^(ℓ) = (1/N) g_iμ^(ℓ) g_iν^(ℓ), g_iμ^(ℓ) = √N ∂z_μ^(L)/∂z_iμ^(ℓ) (117)

arises naturally when doing the chain rule through the layers, and x in the subscript is a stand-in for an arbitrary input x.
Some additional work [50] shows that g, G ∼ O_N(1). Since we've shown that the feature kernels are also O_N(1), then if

2a_1 + c = 0, 2a_ℓ + c = 1 ∀ ℓ > 1, (118)

we have that dϕ(x)/dt ∼ O_N(1), i.e., predictions evolve in finite time, and therefore learning happens in finite time.

Features Evolve in Finite Time. Finally, we need the features to evolve. For the first layer we compute

dz_iμ^(1)/dt = (1/(N^{a_1} √D)) (dW_ij^(1)/dt) x_jμ = (η_0 γ_0 / N^{2a_1+c−d+1/2}) (1/|D|) Σ_{ν=1}^{|D|} Δ_ν g_iν^(1) Φ_μν^(0), (119)

where the only N-dependence arises from the exponent. Given (118),

d = 1/2 (120)

is required to have dz^(1)/dt ∼ O_N(1), a constraint that in fact persists for all dz^(ℓ)/dt. We note that this stands in contrast to the NTK regime, which has d = 0.

Summarizing the Constraints. Putting it all together, by enforcing finite pre-activations and that predictions and features evolve in finite time, we have constraints given by

2a_1 + b_1 = 0 (121)
2a_ℓ + b_ℓ = 1 ∀ ℓ > 1 (122)
2a_1 + c = 0 (123)
2a_ℓ + c = 1 ∀ ℓ > 1 (124)
d = 1/2. (125)

These constraints are solved by a one-parameter family depending on a ∈ R,

(a_ℓ, b_ℓ, c_ℓ) = (a, 1 − 2a, 1 − 2a) ∀ ℓ > 1 (126)
(a_1, b_1, c_1) = (a − 1/2, 1 − 2a, 1 − 2a). (127)

If one makes the additional demand that η ∼ O_N(N^{2d−c}) is O_N(1), so that you don't have to change the learning rate as you scale up, then c = 1 is fixed and there is a unique solution. This scaling is known as the maximal update parameterization, or μP [51].
evolve. For the first layer we compute as the maximal update parameterization, or µP [51].
(1) |D|
dziµ 1 dWij 1 η0 γ0 X (1)
= √ xjµ = 2a1 +c−d+1/2 ∆ν giν Φ(0)
µν , 5 NN-FT Correspondence
dt N 1 D dt
a N |D| ν=1
(119) Understanding the statistics and dynamics of NNs has led us natu-
where the only N -dependence arises from the exponent. Given (118), rally to objects that we are used to from field theory. The idea has
been to understand ML theory, but one can also ask the converse,
1
d= (120) whether ML theory gives new insights into field theory. With that
2 in mind, we ask
is required to have dz (1) /dt ∼ ON (1), a constraint that in fact persists
for all dz (ℓ) /dt. We note that we have Question: What is a field theory?
which applies in particular to the NTK regime d = 0. • Fields, functions from an appropriate function space, or sections
of an appropriate bundle, more generally.
Summarizing the Constraints. Putting it all together, by enforcing
finite pre-activations and that predictions and features evolve in finite • Correlation Functions of fields, here expressed as scalars
time, we have constraints given by
G(n) (x1 , . . . , xn ) = ⟨ϕ(x1 ) . . . ϕ(xn )⟩. (128)
2a1 + b1 = 0 (121)
2aℓ + bℓ = 1 ∀l > 1 (122) You might already be wanting to add more beyond these minimal
2a1 + c = 0 (123) requirements – we’ll discuss that in a second. For now, we have
2aℓ + c = 1 ∀l > 1 (124)
1 Answer: a FT is an ensemble of functions with a way to com-
d= . (125) pute their correlators.
2
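This answer is directly computable: sample parameters θ from P(θ), evaluate the resulting functions, and average. A minimal sketch in Python (the single-hidden-layer tanh architecture, the i.i.d. Gaussian P(θ), and the sample counts are illustrative choices, not the specific architectures used in these lectures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ensemble (phi_theta, P(theta)): width-N single-hidden-layer networks
#   phi_theta(x) = (1/sqrt(N)) sum_i a_i tanh(w_i x + b_i),
# with theta = (a, w, b) drawn i.i.d. from a standard Gaussian P(theta).
n_draws, N = 10000, 200
a = rng.normal(size=(n_draws, N))
w = rng.normal(size=(n_draws, N))
b = rng.normal(size=(n_draws, N))

def phi(x):
    """Evaluate all n_draws ensemble members at a scalar input x."""
    return (a * np.tanh(w * x + b)).sum(axis=1) / np.sqrt(N)

# Correlators as in (128): parameter-space expectation values,
# here estimated by Monte Carlo over draws of theta.
f1, f2 = phi(0.3), phi(-0.7)
G1 = f1.mean()          # one-point function, ~0 since a is zero-mean
G2 = (f1 * f2).mean()   # two-point function G^(2)(x1, x2)
```

In the language of (131), G^(n) is the n-th J-derivative of Z[J] at J = 0, estimated here by sampling θ rather than by a path integral over ϕ.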
In the Euclidean case, when the expectation is a statistical expectation, one may say

    Euclidean Answer: a FT is a statistical ensemble of functions.

Our minimal requirements get us a partition function

    Z[J] = ⟨e^{∫ d^d x J(x) ϕ(x)}⟩    (129)

that we can use to compute correlators, where at this stage we are agnostic about the definition of ⟨·⟩. In normal field theory, the ⟨·⟩ is defined by the Feynman path integral

    Z[J] = ∫ Dϕ e^{−S[ϕ] + ∫ d^d x J(x) ϕ(x)},    (130)

which requires specifying an action S[ϕ] that determines a density on functions exp(−S[ϕ]). But that's not the data we specify when we specify a NN. The NN data (ϕ_θ, P(θ)) instead defines

    Z[J] = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ_θ(x)}.    (131)

These are two different ways of defining a field theory, and indeed given (ϕ_θ, P(θ)) one can try to work out the associated action, in which case we have a dual description of the same field theory, as in the NNGP correspondence. The parameter space description is already quite useful, though, as it enables the computation of correlation functions even if the action isn't known. In certain cases it enables the computation of exact correlators in interacting theories.

Okay, you get it, this is a different way to do field theory. Now I'll let you complain about my definition. You're asking

    Question: Shouldn't my definition of field theory include X?

I'm writing this before I give the lecture, and my guess is you already asked about a set of X's, e.g.

    X ∈ {Quantum, Lagrangian, Symmetries, Locality, …}.    (132)

The problem is that with any such X, there's usually some community of physicists that doesn't care. For instance, not all statistical field theories are Wick rotations of quantum theories; not all field theories have a known Lagrangian; not all field theories have symmetry; not all field theories are local. So I'm going to stick with my definition, because at a minimum I want fields and correlators.

Instead, if your X isn't included in the definition of field theory, it becomes an engineering problem. Whether you're defining your specific theory by S[ϕ], (ϕ_θ, P(θ)), or something else, you can ask

    Question: Can I engineer my defining data to get FT + X?

For X = Symmetries you've already seen ways to do this at the level of actions in QFT1 and at the level of (ϕ_θ, P(θ)) in these lectures. For a current account of recent progress in NN-FT, see the rather long Introduction of [37] and associated references, as well as results.

5.1 Quantum Field Theory

We've been on Euclidean space the whole time¹, so it's natural to wonder in what sense these field theories are quantum. In a course on field theory, we first learn to canonically quantize and then at some later point learn about Wick rotation, and how it can define the Euclidean correlators. The theory is manifestly quantum.

But given a Euclidean theory, can it be continued to a well-behaved quantum theory in Lorentzian signature, e.g. with unitary time evolution and a Hilbert space without negative norm states? If we have a nice-enough local action, it's possible, but what if we don't have an action? We ask:

    Question: Given Euclidean correlators, can the theory be continued to a well-behaved Lorentzian quantum theory?

This is a central question in axiomatic quantum field theory, and the answer is that it depends on the properties of the correlators. The

¹This can be relaxed; see, e.g., [56] for a recent paper defining an equivariant network in Lorentzian signature.
Osterwalder-Schrader (OS) theorem [57] gives a set of conditions on the Euclidean correlators that ensure that the theory can be continued to a unitary Lorentzian theory that satisfies the Wightman axioms. The conditions of the theorem include

• Euclidean Invariance. The correlators are invariant under the Euclidean group, which after continuation to Lorentzian signature becomes the Poincaré group.

• Permutation Invariance of the correlators G^(n)(x_1, …, x_n) under any permutation of the x_1, …, x_n.

• Reflection Positivity. Having time in Lorentzian signature requires picking a Euclidean time direction τ. Let R(x) be the reflection of x in the τ = 0 plane. Then reflection positivity requires that

    G^(2n)(x_1, …, x_n, R(x_1), …, R(x_n)) ≥ 0.    (133)

Technically, this is necessary but not sufficient. An accessible elaboration can be found in notes [58] from a previous TASI.

• Cluster Decomposition occurs when the connected correlators vanish when any points are infinitely far apart.

If all of these are satisfied, then the pair (ϕ_θ, P(θ)) that defines the NN-FT actually defines a neural network quantum field theory [47]. In NN-FT, permutation invariance is essentially automatic and Euclidean invariance may be engineered as described in Section 3.3. Cluster decomposition and reflection positivity hold in some examples [47, 37], but systematizing their construction is an important direction for future work.

There is at least one straightforward way to obtain an interacting NN-QFT. Notably, if (ϕ_θ, P(θ)) is a NNGP that satisfies the OS axioms (this is much easier [47]) with Gaussian partition function

    Z_G[J] = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ_θ(x)},    (134)

then one may insert an operator associated to any local potential V(ϕ), which deforms the action in the expected way and the NN-FT to

    Z[J] = ∫ dθ P(θ) e^{∫ d^d x V(ϕ_θ(x))} e^{∫ d^d x J(x) ϕ_θ(x)}    (135)
         =: ∫ dθ P̃(θ) e^{∫ d^d x J(x) ϕ_θ(x)},    (136)

where the architecture equation ϕ_θ(x) lets us sub out the abstract expression for a concrete function of parameters, defining a new density on parameters P̃(θ) in the process. The interactions in V(ϕ) break the Gaussianity of the NNGP that was ensured by a CLT. This means a CLT assumption must be violated: it is the breaking of statistical independence in P̃(θ). The theory Z[J] defined by (ϕ_θ, P̃(θ)) is an interacting NN-QFT, since Gaussian QFTs deformed by local potentials still satisfy reflection positivity and cluster decomposition.

5.2 ϕ⁴ Theory

With this discussion of operator insertions, it's clear now how to get ϕ⁴ theory. We just insert the operator

    e^{∫ d^d x ϕ_θ(x)⁴}    (137)

into the partition function associated to the free scalar of Section 3.3; this operator with the architecture (72) technically requires an IR cutoff, though other architectures realizing the free scalar may not. The operator insertion deforms the parameter densities in (73) and breaks their statistical independence, explaining the origin of interactions in the NN-QFT. See [37] for a thorough presentation.

5.3 Open Questions

In this section we've discussed a neural network approach to field theory in which the partition function is an integral over probability densities of parameters, and the fields (networks) are functions of these parameters. We have summarized some essential results, but there are many outstanding questions:
• Reflection Positivity. Is there a systematic way to engineer NN-FTs (ϕ_θ, P(θ)) that satisfy reflection positivity?

• Cluster Decomposition. Can we systematically understand the conditions under which cluster decomposition holds in NN-FT?

• Engineering Actions. How generally can we define an NN-FT that is equivalent to a field theory with a fixed action?

• Locality in the action can be realized at infinite-N, but is there a fundamental obstruction at finite-N?

On the other hand, an NN-FT approach to CFT and to Grassmann fields is well underway and should appear in 2024.

6 Recap and Outlook

Neural networks are the backbone of recent progress in ML. In these lectures we took the perspective that to understand ML, we must understand neural networks, and built our study around three pillars: the expressivity, statistics, and dynamics of neural networks.

Let's recap the essentials and then provide some outlook. In Section 2 on expressivity, we asked "How powerful is a NN?" We presented the Universal Approximation Theorem (UAT) and gave a picture demonstrating how it works. We also presented the Kolmogorov-Arnold Representation Theorem (KART), which recently motivated a new architecture for deep learning where activations are learned and live on the edges. We wonder:

    Outlook: Cybenko's UAT was around for over 20 years before the empirical breakthroughs of the 2010's. Existence of good approximators wasn't enough; we needed better compute and optimization schemes to find them dynamically. However, given tremendous progress in dynamics, should we systematically return to the approximation theory literature, to help motivate new architectures, as e.g., in KART and KAN?

In Section 3 on statistics, I reminded the reader that a randomly initialized NN is no more fundamental than a single roll of the dice. It is a random function with parameters drawn from a statistical ensemble of NNs, and we asked "What characterizes the statistics of the NN ensemble?" I reviewed a classic result of Neal from the 90's that demonstrated that under common assumptions a width-N feedforward network is drawn from a Gaussian Process — a Gaussian density on functions — in the N → ∞ limit. I explained that work from the last decade has demonstrated that this result generalizes to many different architectures, as a result of the Central Limit Theorem. Non-Gaussianities therefore arise by violating an assumption of the Central Limit Theorem, such as N → ∞ or statistical independence, and by violating them weakly the non-Gaussianities can be made parametrically small. In field theory language, NNGPs are generalized free field theories and NN non-Gaussian processes are interacting field theories. I also introduced a mechanism for global symmetries and exemplified many of the phenomena in the section.

In Section 4 I presented some essential results from ML theory on NN dynamics, asking "How does a NN evolve under gradient descent?" When trained with full-batch gradient descent and a particular normalization, the network dynamics are governed by the Neural Tangent Kernel (NTK), which in general is an intractable object. However, in N → ∞ limits it becomes a deterministic function and the NTK at initialization governs the dynamics for all time. An exactly solvable model with mean-squared error loss was presented that includes the mean prediction of an infinite number of infinitely wide neural networks trained to infinite time. The mean network prediction is equivalent to kernel regression, and the kernel communicates information from train points to test points in order to make predictions. However, from this description of the NTK a problem is already clear: nothing is being learned, and in particular late-time features in the hidden dimensions are in a local neighborhood of their initial values. I showed how a detailed N-scaling analysis allows one to demand that network features and predictions update non-trivially, leading to richer learning regimes known as dynamical mean field theory or the maximal update parameterization.
    Outlook: These recent theories of NN statistics and dynamics sometimes realize interesting toy models, but with known shortcomings from a learning perspective, though the N-scaling analysis in the feature learning section is promising. Surely there is still much to learn overall, e.g., we have said very little in these lectures about architecture design principles. For instance, some architectures, such as the Transformers [59] central to LLMs, are motivated by structural principles, whereas others are motivated by theoretical guarantees.

    What are the missing principles that combine statistics, dynamics, and architecture design to achieve optimal learning?

Finally, in Section 5 I explained how neural networks provide a new way to define a field theory, so that one might use ML theory for physics and not just physics for ML theory. This NN-FT correspondence is a rethink of field theory from an ML perspective, and I tried to state clearly what is and isn't known about obtaining cherished field theory properties in that context. For example, I explained how, by engineering a desired NNGP such as the free scalar, one may do an operator insertion in the path integral and interpret the associated interactions as breaking the statistical independence necessary for the CLT. By such a mechanism, one may engineer ϕ⁴ theory in an ensemble of infinite width neural networks.

    Outlook: Can NN-FTs motivate new interesting physics theories or provide useful tools for studying known theories?

Our discussion in this Section is directly correlated with the topics of these lectures. It reviews recent theoretical progress, but the careful reader surely has a sense that theory is significantly behind experiment and there is still much to do. It is a great time to enter the field, and to that end I recommend carefully reading the literature that I've cited, as well as the book Geometric Deep Learning [60], which covers many current topics that will be natural to a physicist, including grids, groups, graphs, geodesics, and gauges. Many aspects of that book are related to principled architecture design, which complements the statistical and dynamical approach that I've taken in these lectures. I also recommend the book Deep Learning [61] for a more comprehensive introduction to the field.

I end with a final analogy between history and the current situation in ML. We are living in an ML era akin to the pre-PC era in classical computing. Current large language models (LLMs) and other deep networks are by now very powerful, but are the analogs of room-sized computers that can only be fully utilized by the privileged. Whereas harnessing the power of those computers in the 60's required access to government facilities or large laboratories, only billion-dollar companies have sufficient funds and compute to train today's LLMs from scratch. Just as the PC revolution brought computing power to the masses in the 80's, a central question now is whether we can do the analog in ML by learning how to train equally powerful small models with limited resources. Doing so likely requires a deeper understanding of theory, especially with respect to sparse networks.

Acknowledgements: I would like to thank the students and organizers of TASI for an inspiring scientific environment and many excellent questions. I am grateful for the papers of and discussions with friends whose works have contributed significantly to these lectures, especially Yasaman Bahri, Cengiz Pehlevan, and Greg Yang. I'd also like to thank my collaborators on related matters, including Mehmet Demirtas, Anindita Maiti, Fabian Ruehle, Matt Schwartz, and Keegan Stoner. I would like to thank Sam Frank, Yikun Jiang, and Sneh Pandya for correcting typos in a draft. Finally, thanks to GitHub Copilot for its help in writing these notes, including the correct auto-completion of equations and creation of tikz figures! Two-column landscape mode was chosen to facilitate boardwork, but I think it also makes a pleasant read, HT @ [62]. I am supported by the National Science Foundation under CAREER grant PHY-1848089 and Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions).

Disclaimers: In an effort to post these notes shortly after TASI 2024, there are more typos and fewer references than is ideal. Other
references and topics may be added in the future to round out the content that was presented in real-time. I also struck a playful tone throughout, because lectures are supposed to be fun, but may have overdone it at times. I welcome suggestions on the above, but updates will still aim for clarity and brevity, targeted at HET students.

A Central Limit Theorem

Let us recall a simple derivation of the Central Limit Theorem (CLT), in order to better understand the statistics of neural networks. Consider a sum of random variables

    ϕ = (1/√N) Σ_{i=1}^N X_i,    (138)

with ⟨X_i⟩ = 0. The moments µ_r and cumulants κ_r are determined by the moment generating function (partition function) Z[J] = ⟨e^{Jϕ}⟩ and the cumulant generating function W[J] = log Z[J], respectively, as

    µ_r = (d^r/dJ^r) Z[J] |_{J=0}    (139)
    κ_r = (d^r/dJ^r) W[J] |_{J=0}.    (140)

If the X_i are independent random variables, then the partition function factorizes, Z_{Σ_i X_i}[J] = Π_i Z_{X_i}[J], and the cumulant generating function of the sum is the sum of the cumulant generating functions, yielding

    W_{Σ_i X_i}[J] = Σ_i W_{X_i}[J]    (141)
    κ_r^{Σ X_i} = Σ_i κ_r^{X_i}.    (142)

If the X_i are identically distributed, then the cumulants κ_r^{X_i} are the same for all i and, accounting for the 1/√N appropriately, we obtain

    κ_r^ϕ = κ_r^X / N^{r/2−1}.    (143)

This yields

    lim_{N→∞} κ_{r>2}^ϕ = 0,    (144)

which is sufficient to show that ϕ is Gaussian in the large-N limit. In physics language, cumulants are connected correlators, and (144) means that Gaussian (free) theories have no connected correlators.

In neural networks we will be interested in studying certain Gaussian limits. From this CLT derivation, we see two potential origins of non-Gaussianity:

• 1/N-corrections, from the N-dependence appearing in κ_r^ϕ.

• Independence breaking, since the proof relied on (141).

References

[1] W. Isaacson, The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution. Simon & Schuster, New York, first Simon & Schuster hardcover edition, 2014.

[2] D. Silver, J. Schrittwieser, et al., "Mastering the game of go without human knowledge," Nature 550 no. 7676, (Oct, 2017) 354–359. https://doi.org/10.1038/nature24270.

[3] D. Silver, T. Hubert, et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815 (2017).

[4] B. LeMoine, "How the artificial intelligence program alphazero mastered its games," The New Yorker (Jan, 2023). https://www.newyorker.com/science/elements/how-the-artificial-intelligence-program-alphazero-mastered-its-games.

[5] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning, pp. 2256–2265, PMLR. 2015.
[6] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," arXiv preprint arXiv:2011.13456 (2020).

[7] A. Ananthaswamy, "The physics principle that inspired modern ai art," Quanta Magazine (Jan, 2023). https://www.quantamagazine.org/the-physics-principle-that-inspired-modern-ai-art-20230105/.

[8] T. Brown, B. Mann, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds., vol. 33, pp. 1877–1901. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

[9] K. Roose, "The brilliance and weirdness of chatgpt," The New York Times (Dec, 2022). https://www.nytimes.com/2022/12/05/technology/chatgpt-ai-twitter.html.

[10] A. Bhatia, "Watch an a.i. learn to write by reading nothing but shakespeare," The New York Times (Apr, 2023). https://www.nytimes.com/interactive/2023/04/26/upshot/gpt-from-scratch.html.

[11] S. Bubeck, V. Chandrasekaran, et al., "Sparks of artificial general intelligence: Early experiments with gpt-4," arXiv preprint arXiv:2303.12712 (2023).

[12] J. Jumper, R. Evans, et al., "Highly accurate protein structure prediction with alphafold," Nature 596 no. 7873, (Aug, 2021) 583–589. https://doi.org/10.1038/s41586-021-03819-2.

[13] G. Carleo and M. Troyer, "Solving the quantum many-body problem with artificial neural networks," Science 355 no. 6325, (2017) 602–606.

[14] J. Carrasquilla and R. G. Melko, "Machine learning phases of matter," Nature Physics 13 no. 5, (2017) 431–434.

[15] L. B. Anderson, M. Gerdes, J. Gray, S. Krippendorf, N. Raghuram, and F. Ruehle, "Moduli-dependent Calabi-Yau and SU(3)-structure metrics from Machine Learning," JHEP 05 (2021) 013, arXiv:2012.04656 [hep-th].

[16] M. R. Douglas, S. Lakshminarasimhan, and Y. Qi, "Numerical Calabi-Yau metrics from holomorphic networks," arXiv:2012.04797 [hep-th].

[17] V. Jejjala, D. K. Mayorga Pena, and C. Mishra, "Neural network approximations for Calabi-Yau metrics," JHEP 08 (2022) 105, arXiv:2012.15821 [hep-th].

[18] S. Gukov, J. Halverson, and F. Ruehle, "Rigor with machine learning from field theory to the poincaré conjecture," Nature Reviews Physics (2024) 1–10.

[19] "Iaifi summer workshop 2023," IAIFI - Institute for Artificial Intelligence and Fundamental Interactions. 2023. https://iaifi.org/events/summer_workshop_2023.html.

[20] "Machine learning and the physical sciences," in Proceedings of the Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning and the Physical Sciences. Neural Information Processing Systems Foundation, Inc., 2023. https://ml4physicalsciences.github.io/2023/.

[21] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, "Machine learning and the physical sciences," Rev. Mod. Phys. 91 no. 4, (2019) 045002, arXiv:1903.10563 [physics.comp-ph].

[22] "Tasi 2024: Frontiers in particle theory." 2024. https://www.colorado.edu/physics/events/summer-intensive-programs/theoretical-advanced-study-institute-elementary-particle-physics-current. Summer Intensive Programs, University of Colorado Boulder.

[23] H. Robbins and S. Monro, "A Stochastic Approximation Method," The Annals of Mathematical Statistics 22 no. 3, (1951) 400–407. https://doi.org/10.1214/aoms/1177729586.

[24] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review 65 no. 6, (1958) 386–408. https://api.semanticscholar.org/CorpusID:12781225.

[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization." 2017. https://arxiv.org/abs/1412.6980.

[26] G. B. De Luca and E. Silverstein, "Born-infeld (bi) for ai: Energy-conserving descent (ecd) for optimization," in International Conference on Machine Learning, pp. 4918–4936, PMLR. 2022.

[27] G. B. De Luca, A. Gatti, and E. Silverstein, "Improving energy conserving descent for machine learning: Theory and practice," arXiv preprint arXiv:2306.00352 (2023).

[28] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems 2 no. 4, (1989) 303–314.

[29] K. Hornik, M. Stinchcombe, and H. White, "Approximation capabilities of multilayer feedforward networks," Neural Networks 4 no. 2, (1991) 251–257.

[30] A. N. Kolmogorov, "On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition," Doklady Akademii Nauk SSSR 114 (1957) 953–956.

[31] V. I. Arnold, "On functions of three variables," Doklady Akademii Nauk SSSR 114 (1957) 679–681.

[32] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, "KAN: Kolmogorov-Arnold Networks," arXiv:2404.19756 [cs.LG].

[33] R. M. Neal, "Bayesian learning for neural networks," Lecture Notes in Statistics 118 (1996).

[34] C. K. Williams, "Computing with infinite networks," Advances in Neural Information Processing Systems (1997).

[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds., vol. 25. Curran Associates, Inc., 2012. https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.

[36] G. Yang, "Wide feedforward or recurrent neural networks of any architecture are gaussian processes," Advances in Neural Information Processing Systems 32 (2019).

[37] M. Demirtas, J. Halverson, A. Maiti, M. D. Schwartz, and K. Stoner, "Neural network field theories: non-Gaussianity, actions, and locality," Mach. Learn. Sci. Tech. 5 no. 1, (2024) 015002, arXiv:2307.03223 [hep-th].

[38] S. Yaida, "Non-gaussian processes and neural networks at finite widths," in Mathematical and Scientific Machine Learning, pp. 165–192, PMLR. 2020.

[39] T. Cohen and M. Welling, "Group equivariant convolutional networks," in Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., vol. 48 of Proceedings of Machine Learning Research, pp. 2990–2999. PMLR, New York, New York, USA, 20–22 Jun, 2016. https://proceedings.mlr.press/v48/cohenc16.html.

[40] M. Winkels and T. S. Cohen, "3d g-cnns for pulmonary nodule detection," arXiv preprint arXiv:1804.04656 (2018).

[41] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361 (2020).

[42] S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, "E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials," Nature Communications 13 no. 1, (May, 2022) 2453. https://doi.org/10.1038/s41467-022-29939-5.

[43] N. Frey, R. Soklaski, S. Axelrod, S. Samsi, R. Gomez-Bombarelli, C. Coley, et al., "Neural scaling of deep chemical models," ChemRxiv (2022). This content is a preprint and has not been peer-reviewed.

[44] D. Boyda, G. Kanwar, S. Racanière, D. J. Rezende, M. S. Albergo, K. Cranmer, D. C. Hackett, and P. E. Shanahan, "Sampling using su(n) gauge equivariant flows," Physical Review D 103 no. 7, (2021) 074504.

[45] A. Maiti, K. Stoner, and J. Halverson, "Symmetry-via-Duality: Invariant Neural Network Densities from Parameter-Space Correlators," arXiv:2106.00694 [cs.LG].

[46] J. Halverson, A. Maiti, and K. Stoner, "Neural Networks and Quantum Field Theory," Mach. Learn. Sci. Tech. 2 no. 3, (2021) 035002, arXiv:2008.08601 [cs.LG].

[47] J. Halverson, "Building Quantum Field Theories Out of Neurons," arXiv:2112.04527 [hep-th].

[48] A. Jacot, F. Gabriel, and C. Hongler, "Neural tangent kernel: Convergence and generalization in neural networks," in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., vol. 31. Curran Associates, Inc., 2018. https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.

[49] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, "Wide neural networks of any depth evolve as linear models under gradient descent," Advances in Neural Information Processing Systems 32 (2019).

[50] C. Pehlevan and B. Bordelon, "Lecture notes on infinite-width limits of neural networks." August, 2024. https://mlschool.princeton.edu/events/2023/pehlevan. Princeton Machine Learning Theory Summer School, August 6–15, 2024.

[51] G. Yang and E. J. Hu, "Feature learning in infinite-width neural networks," arXiv preprint arXiv:2011.14522 (2020).

[52] B. Bordelon and C. Pehlevan, "Self-consistent dynamical field theory of kernel evolution in wide neural networks," Advances in Neural Information Processing Systems 35 (2022) 32240–32256.

[53] D. A. Roberts, S. Yaida, and B. Hanin, The Principles of Deep Learning Theory, vol. 46. Cambridge University Press, Cambridge, MA, USA, 2022.

[54] S. Yaida, "Meta-principled family of hyperparameter scaling strategies," arXiv preprint arXiv:2210.04909 (2022).

[55] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, "Deep information propagation," arXiv preprint arXiv:1611.01232 (2016).