
arXiv:2408.00082v1 [hep-th] 31 Jul 2024

TASI Lectures on Physics for Machine Learning


Jim Halverson
Department of Physics, Northeastern University, Boston, MA 02115, USA
The NSF Institute for Artificial Intelligence
and Fundamental Interactions
jhh@neu.edu

Abstract
These notes are based on lectures I gave at TASI 2024 on Physics for Machine Learning. The focus
is on neural network theory, organized according to network expressivity, statistics, and dynamics. I
present classic results such as the universal approximation theorem and neural network / Gaussian
process correspondence, and also more recent results such as the neural tangent kernel, feature learning
with the maximal update parameterization, and Kolmogorov-Arnold networks. The exposition on neural
network theory emphasizes a field theoretic perspective familiar to theoretical physicists. I elaborate on
connections between the two, including a neural network approach to field theory.


Contents

1 Introduction

2 Expressivity of Neural Networks
  2.1 Universal Approximation Theorem
  2.2 Kolmogorov-Arnold Theorem

3 Statistics of Neural Networks
  3.1 NNGP Correspondence
  3.2 Non-Gaussian Processes
  3.3 Symmetries
  3.4 Examples

4 Dynamics of Neural Networks
  4.1 Neural Tangent Kernel
  4.2 An Exactly Solvable Model
  4.3 Feature Learning

5 NN-FT Correspondence
  5.1 Quantum Field Theory
  5.2 ϕ4 Theory
  5.3 Open Questions

6 Recap and Outlook

A Central Limit Theorem


1 Introduction

Computer science (CS) is still in its infancy. "What!?" you incredulously exclaim, thinking back to the earliest computer scientists you can think of, perhaps Lovelace or Babbage or Turing. Since its early days, CS has steadily progressed, from cracking the Enigma code in the 40's, to personal computers in the 70's and 80's, the internet in the 90's, powerful computers-in-your-pocket in the 2000's, and finally to some version of artificial intelligence in the last decade; see [1] for a thorough account. It shows no signs of stopping, and one might wager that future historians will not speak of decades of progress in CS, but centuries. Computer science is still in its infancy.

It is 2024; what have we seen recently? A plethora of "experimental" breakthroughs in machine learning (ML) that constitute some of the most exciting developments in CS to date. Reinforcement learning algorithms have become world-class at Go [2] and Chess [3, 4], merely by playing against themselves. Diffusion models [5, 6, 7] can generate images of people that don't exist, and yet you might never know. Large language models (see, e.g., the GPT-3 literature [8], or popular accounts [9, 10]) can code better than you — or me, at least — and they can even write compelling bedtime stories for your children. These forms of artificial intelligence have already started a trillion dollar industry, and some have speculated that we've already seen sparks [11] of AGI, or human-level intelligence.

We've also started to see significant advances in the sciences. For instance, AlphaFold [12] has achieved state-of-the-art performance for predicting the structure of folded proteins, and neural networks (NNs) give state-of-the-art models for complex objects, from quantum many-body ground states [13, 14] to metrics [15, 16, 17] on the Calabi-Yau manifolds of string theory. Lack of rigor is a natural concern, but in certain cases ML techniques can be made rigorous or interpretable at a level that would satisfy a formal theoretical physicist or pure mathematician; see, e.g., [18] and references therein for examples in string theory, algebraic geometry, differential geometry, and low-dimensional topology, including uses of ML theory for science. Of course, this is just scratching the surface; see, e.g., [19, 20, 21] for a smattering of results across physics and the physical sciences.

With this in mind, we have some explaining to do. There are a large number of impressive ML results in both pop culture and the sciences, but experiments have significantly outpaced theory over the last decade. An analogy from physics might be useful, given the original audience. The current situation in ML might be reminiscent of the 1960's in particle physics: some essentials of the theory were known, e.g. quantum electrodynamics (QED), but a deluge of newly discovered hadrons left theorists scratching their heads and demanded an organizing principle to explain the new phenomena. This was one of the motivating factors for the development of string theory, but the right theory in the end is the quark model, as described by quantum chromodynamics (QCD). It's tempting to speculate that recent results in ML are waiting for someone to discover a foundational theory, akin to QCD for the strong interactions. As theoretical physicists, we hope for this sort of mechanistic understanding. However, it might be that ML and intelligence are so complex that sociology is a better analogy, and it would be a fool's errand to expect such a precise theory. My personal guess is that detailed theory will pay off in certain areas of ML — perhaps related to sparsity in NNs, architecture design, or the topics of these lectures — but not in all. Nevertheless, given the experimental progress, one must try.

These lectures were given at TASI 2024 [22] to Ph.D. students in theoretical high energy physics. I was given the title "Physics4ML", which means that I'm trying to keep ML-for-Physics to a minimum. I take a physics perspective, but acknowledge that this is an enormous field to which many communities have contributed.

The central idea of the lectures is that understanding NNs is essential to understanding ML, a story I built on three pillars: expressivity, statistics, and dynamics of NNs, one for each of the lectures I gave. The content builds on many ideas explained to me over the last five years by friends, acknowledged below, who pioneered many of the developments, as well as a few of my own results. Throughout, I take a decidedly field-theoretic lens to suit the audience, and also because I think it's a useful way to understand NNs.
The Setup.

Understanding ML at the very least means understanding neural networks. A neural network is a function

ϕ_θ : R^d → R   (1)

with parameters θ. We've chosen outputs in R because, channeling Coleman, scalars already exhibit the essentials. We'll use the lingo

Input: x ∈ R^d   (2)
Output: ϕ_θ(x) ∈ R   (3)
Network: ϕ_θ ∈ Maps(R^d, R)   (4)
Data: D,   (5)

where the data D depends on the problem, but involves at least a subset of R^d, potentially paired with labels y ∈ R. With this minimal background, let's ask our central question:

Question: What does a NN predict?

For any fixed value of θ, the answer is clear: ϕ_θ(x). However, the answer is complicated by issues of both dynamics and statistics.

First, dynamics. In ML, parameters are updated to solve problems, and we really have trajectories in

Parameter Space: θ(t) ∈ R^|θ|   (6)
Output Space: ϕ_θ(t)(x) ∈ R   (7)
Function Space: ϕ_θ(t) ∈ Maps(R^d, R),   (8)

governed by some learning dynamics determined by the optimization algorithm and the nature of the learning problem. For instance, in supervised learning we have data

D = {(x_α, y_α) ∈ R^d × R}_{α=1}^{|D|},   (9)

and a loss function

L[ϕ_θ] = Σ_{α=1}^{|D|} ℓ(ϕ_θ(x_α), y_α),   (10)

where ℓ is a loss function such as ℓ_MSE = (ϕ_θ(x_α) − y_α)^2. One may optimize θ by gradient descent

dθ_i/dt = −∇_{θ_i} L[ϕ_θ],   (11)

or other algorithms, e.g., classics like stochastic gradient descent (SGD) [23, 24] or Adam [25], or a more recent technique such as Energy Conserving Descent [26, 27]. Throughout, t is the training time of the learning algorithm unless otherwise noted.

Second, statistics. When a NN is initialized on your computer, the parameters θ are initialized as draws

θ ∼ P(θ)   (12)

from a distribution P(θ), where ∼ means "drawn from" in this context. Different draws of θ will give different functions ϕ_θ, and a priori we have no reason to prefer one over another. The prediction ϕ_θ(x) therefore can't be fundamental! Instead, what is fundamental is the average prediction and second moment or variance:

E[ϕ_θ(x)] = ∫ dθ P(θ) ϕ_θ(x)   (13)
E[ϕ_θ(x) ϕ_θ(y)] = ∫ dθ P(θ) ϕ_θ(x) ϕ_θ(y),   (14)

as well as the higher moments. Expectations are across different initializations. Since we're physicists, we henceforth replace E[·] = ⟨·⟩ and we remember this is a statistical expectation value. It's useful to put this in our language:

G^(1)(x) = ⟨ϕ_θ(x)⟩   (15)
G^(2)(x, y) = ⟨ϕ_θ(x) ϕ_θ(y)⟩,   (16)

i.e., the mean prediction and second moment are just the one-point and two-point correlation functions of the statistical ensemble of neural networks. Apparently ML has something to do with field theory.
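To make the setup concrete, here is a minimal numerical sketch (my own toy example, not taken from the lectures): a width-N single-layer tanh network is trained by full-batch gradient descent on an MSE loss, and the one- and two-point functions (15)-(16) are estimated by averaging over an ensemble of random initializations.

```python
import numpy as np

# Toy setup (assumption: my own example). Network phi_theta(x) = sum_i w1[i] tanh(w0[i] x),
# trained by full-batch gradient descent; ensemble statistics estimated over initializations.

rng = np.random.default_rng(0)
N = 64                                   # width
xs = np.linspace(-1, 1, 10)              # train inputs x_alpha
ys = np.sin(np.pi * xs)                  # labels y_alpha
eta, steps = 1e-2, 2000                  # learning rate, gradient descent steps

def phi(w0, w1, x):
    # network output for scalar input(s) x
    return np.tanh(np.outer(np.atleast_1d(x), w0)) @ w1

def train_one(rng):
    w0 = rng.normal(0, 1, N)
    w1 = rng.normal(0, 1 / np.sqrt(N), N)
    for _ in range(steps):
        pre = np.tanh(np.outer(xs, w0))          # hidden activations, shape (|D|, N)
        resid = pre @ w1 - ys                    # phi(x_alpha) - y_alpha
        grad_w1 = pre.T @ resid / len(xs)
        grad_w0 = ((resid[:, None] * (1 - pre**2) * w1[None, :]) * xs[:, None]).sum(0) / len(xs)
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1

# ensemble of trained networks: one- and two-point functions at test points, Eqs. (15)-(16)
x_test = np.array([0.3, 0.7])
outs = np.array([phi(*train_one(rng), x_test) for _ in range(100)])
G1 = outs.mean(axis=0)                               # G^(1)(x)
G2 = (outs[:, :, None] * outs[:, None, :]).mean(0)   # G^(2)(x, y)
print(G1, G2)
```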
Putting the dynamics and statistics together, we have an ensemble of initial θ-values, each of which is the starting point of a trajectory θ(t), and therefore we have an ensemble of trajectories. We choose to think of θ(t) as drawn as

θ(t) ∼ P(θ(t)),   (17)

a density on parameters that depends on the training time and yields time-dependent correlators

G_t^(1)(x) = ⟨ϕ_θ(x)⟩_t   (18)
G_t^(2)(x, y) = ⟨ϕ_θ(x) ϕ_θ(y)⟩_t,   (19)

where the subscript t indicates time-dependence and the expectation is with respect to P(θ(t)). Of course, assuming that learning is helping, we wish to take t → ∞ and are interested in

G_∞^(1)(x) = mean prediction of an ∞-number of NNs as t → ∞.

Remarkably, we will see that in a certain supervised setting there is an exact analytic solution for this quantity.

There is one more pillar beyond dynamics and statistics that I need to introduce: expressivity. Let's make the point by seeing a failure mode. Consider neural networks of the following functional form, or architecture:

ϕ_θ(x) = θ · x.   (20)

The one-point and two-point functions of these are analytically solvable, but going that far defeats the purpose: this is a linear model, and learning schemes involving it are linear regression, which will fail on general problems. We say that the model is not expressive enough to account for the data. Conversely, we wish instead to choose expressive architectures that can model the data, which essentially requires that the architecture is complex enough to approximate anything. Of course, such architectures must be non-linear.

We now have the three pillars on which we'll build our understanding of neural networks, and associated questions:

• Expressivity. How powerful is the NN?
• Statistics. What is the NN ensemble?
• Dynamics. How does it evolve?

We will approach the topics in this order, building up theory that takes us through the last few years through a physicist's lens.

A physicist's lens means a few things. First, it means a physicist's tools, including:

• Field Theory, as alluded to above.
• Landscape Dynamics from loss functions on θ-space.
• Symmetries of individual NNs and their ensembles.

A physicist's lens also means we will try to find the appropriate balance of rigor and intuition. We'd like mechanisms and toy models, but we won't try to prove everything at a level that would satisfy a mathematician. This befits the current situation in ML, where there are an enormous number of empirical results that need an O(1) theoretical understanding, not rigorous proof.

Henceforth, we drop the subscript θ in ϕ_θ(x), and the reader should recall that the network depends on parameters.

2 Expressivity of Neural Networks

Neural networks are big functions composed out of many simpler functions according to the choice of architecture. The compositions give more flexibility, prompting

Question: How powerful is a NN?

Ultimately this is a question for mathematics, as it's a question about functions. Less colloquially, what we mean by the power of a NN is its ability to approximate any function.
2.1 Universal Approximation Theorem

The Universal Approximation Theorem (UAT) is the first result in this direction. It states that a neural network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy. More precisely, the original version of the theorem due to Cybenko is

Theorem 2.1 (Cybenko). Let f : R^d → R be a continuous function on a compact set K ⊂ R^d. Then for any ϵ > 0 there exists a neural network with a single hidden layer of the form

ϕ(x) = Σ_{i=1}^{N} w_i^(1) σ( Σ_{j=1}^{d} w_{ij}^(0) x_j + b_i^(0) ) + b^(1),   (21)

θ = {w_{ij}^(0), w_i^(1), b_i^(0), b^(1)}, where σ : R → R is a non-polynomial non-linear activation function, such that

sup_{x∈K} |f(x) − ϕ(x)| < ϵ.   (22)

The architecture in Eq. (21) is known by many names, including the perceptron, a fully-connected network, or a feedforward network. The parameter N is known as the width. It may be generalized to include a depth dimension L encoding the number of compositions of affine transformations and non-linearities. Such an architecture is known as a multi-layer perceptron (MLP) or a deep feedforward network.

The UAT is a powerful result, but it has some limitations. First, though the error ϵ does get better with N, it doesn't say how many neurons are needed. Second, it doesn't say how to train the network: though there is a point θ∗ in parameter space that is a good approximation to any f, existence doesn't imply that we can find it, which is a question of learning dynamics.

To ask the obvious,

Question: Why does this UAT work?

In Cybenko's original work, he focused on the case that σ was the sigmoid function

σ(x) = 1/(1 + e^{−x}),   (23)

which allows us to get a picture of what's happening. The sigmoids appear in the network function in the form

σ( w_i^(0) · x + b_i^(0) ),   (24)

which approximates a shifted step function as w^(0) → ∞. A linear combination can turn it into a bump with approximately compact support that gets scaled by w^(1). These bumps can then be put together to approximate any function; see, e.g., Fig. 1.

Figure 1: The Universal Approximation Theorem can be understood by approximating a function like sin(x) with a series of bumps.

Cybenko's version of the UAT was just the first, and there are many generalizations, including to deeper networks, to other activation functions, and to other domains. See [28] for the original paper and [29] for a generalization to the case of multiple hidden layers.
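The bump construction just described is easy to play with numerically. The following sketch (my own illustration, not from the lectures) builds bumps as differences of steep sigmoids and sums them to approximate sin(x), i.e., a hand-made instance of the single-hidden-layer form (21).

```python
import numpy as np

# Hand-built UAT bumps: a difference of two steep sigmoids gives a bump of
# approximately compact support; a weighted sum of bumps on a grid
# approximates f(x) = sin(x) on [0, 2*pi]. (Toy illustration of Eq. (21).)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, steep=50.0):
    # approximate indicator of [left, right] from two shifted sigmoids
    return sigmoid(steep * (x - left)) - sigmoid(steep * (x - right))

x = np.linspace(0, 2 * np.pi, 1000)
edges = np.linspace(0, 2 * np.pi, 41)          # 40 bumps
centers = 0.5 * (edges[:-1] + edges[1:])

# phi(x) = sum over bumps, each scaled by the target value at its center
phi = sum(np.sin(c) * bump(x, l, r) for c, l, r in zip(centers, edges[:-1], edges[1:]))

print("max |sin(x) - phi(x)| on the grid:", np.max(np.abs(np.sin(x) - phi)))
```

Increasing the number of bumps (the width N) and the steepness systematically reduces the error, in line with the theorem.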
2.2 Kolmogorov-Arnold Theorem

The UAT essentially says that any function can be approximated by a neural network, i.e. a complicated function comprised out of many building blocks. It is natural to wonder whether there are other theorems to this effect, and whether they inspire other neural network constructions.

One such result is due to Kolmogorov [30] and Arnold [31], who showed that any multivariate continuous function can be represented (exactly, not approximated) as a sum of continuous one-dimensional functions:

Theorem 2.2 (Kolmogorov-Arnold Representation Theorem). Let f : [0, 1]^n → R be an arbitrary multivariate continuous function. Then it has the representation

f(x_1, . . . , x_n) = Σ_{q=0}^{2n} Φ_q ( Σ_{p=1}^{n} ϕ_{q,p}(x_p) )   (25)

with continuous one-dimensional functions Φ_q and ϕ_{q,p}.

This result is perhaps less intuitive than the UAT. If every multivariate function is to be represented by univariate functions plus addition, what then of f(x, y) = xy? This is just a pedagogical example, since

f(x, y) = xy = e^{log x + log y},   (26)

but of course the general theorem is very non-trivial.

A notable feature of (25) is that it is a sum of functions of functions, which one might take to inspire a neural network definition. However, one aspect that is a little unusual is that ϕ_{q,p} depends on both p and q, and therefore lives on the connection between x_p and x_q^(1) := Σ_{p=1}^{n} ϕ_{q,p}(x_p). If this is a neural network, it's one with activations ϕ_{q,p} on the edges, rather than on the nodes. Furthermore, for distinct values of p, q, the functions ϕ_{q,p} are independent in general.

In fact, this idea led to the recent proposal of Kolmogorov-Arnold networks (KAN) [32], a new architecture in which activation functions are on the edges, as motivated by KART, and are learned. Much like how MLPs may be formed out of layers that repeat the structure in the single-layer UAT, KANs may be formed into layers to repeat the structure of KART. See Figure 2 for a detailed comparison with MLPs, including a depiction of the architectures and how they are formed out of layers motivated by their respective mathematical theorems. The KAN implementation provided in [32] includes visualizations of the learned activation functions, which can lead to interpretable architectures that can be mapped directly onto symbolic formulae.

Figure 2: A comparison of MLP and KAN, including their functional form and the mathematical theorem motivating the architecture. In brief:
• MLP — motivated by the Universal Approximation Theorem; shallow form f(x) ≈ Σ_{i=1}^{N(ϵ)} a_i σ(w_i · x + b_i), with fixed activation functions on nodes and learnable weights on edges; deep form MLP(x) = (W_3 ∘ σ_2 ∘ W_2 ∘ σ_1 ∘ W_1)(x), with W linear and learnable, σ nonlinear and fixed.
• KAN — motivated by the Kolmogorov-Arnold Representation Theorem; shallow form f(x) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^{n} ϕ_{q,p}(x_p) ), with learnable activation functions on edges and a sum operation on nodes; deep form KAN(x) = (Φ_3 ∘ Φ_2 ∘ Φ_1)(x), with Φ nonlinear and learnable.
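To make the edge-activation idea concrete, here is a toy KAN-style layer of my own (a sketch only; the actual implementation of [32] parameterizes edge functions with B-splines plus a residual base function). Each edge (q, p) carries its own learnable one-dimensional function, here a linear combination of a few fixed Gaussian basis functions, and nodes only sum.

```python
import numpy as np

# Toy KAN-style layer: learnable 1D functions phi_{q,p} on edges, sums on nodes.

class ToyKANLayer:
    def __init__(self, n_in, n_out, n_basis=5, rng=None):
        rng = rng or np.random.default_rng(0)
        # coefficients c[q, p, k] of the edge function phi_{q,p}
        self.c = rng.normal(0, 0.1, size=(n_out, n_in, n_basis))
        self.centers = np.linspace(-1, 1, n_basis)   # basis: Gaussian bumps

    def basis(self, x):
        # x: (batch, n_in) -> (batch, n_in, n_basis)
        return np.exp(-((x[..., None] - self.centers) ** 2) / 0.5)

    def __call__(self, x):
        # phi_{q,p}(x_p) = sum_k c[q,p,k] B_k(x_p); node q sums over p
        B = self.basis(x)                                 # (batch, n_in, n_basis)
        edge_vals = np.einsum('qpk,bpk->bqp', self.c, B)  # (batch, n_out, n_in)
        return edge_vals.sum(axis=-1)                     # (batch, n_out)

# stacking layers mirrors KAN(x) = (Phi_3 o Phi_2 o Phi_1)(x)
x = np.random.default_rng(1).uniform(-1, 1, size=(4, 3))
out = x
for layer in (ToyKANLayer(3, 7), ToyKANLayer(7, 1)):
    out = layer(out)
print(out.shape)  # (4, 1)
```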
3 Statistics of Neural Networks

Let's try to understand neural networks at initialization. For this, firing up your computer once is not enough, since one initialization gives you one draw θ ∼ P(θ) and therefore one random neural network. Instead, we want to understand the statistics of the networks:

Question: What characterizes the stats of the NN ensemble?

One aspect of this is encoded in the moments, or n-pt functions,

G^(n)(x_1, . . . , x_n) = ⟨ϕ(x_1) . . . ϕ(x_n)⟩,   (27)

which may be obtained from a partition function as

Z[J] = ⟨e^{∫ d^d x J(x) ϕ(x)}⟩   (28)
G^(n)(x_1, . . . , x_n) = [ δ/δJ(x_1) · · · δ/δJ(x_n) Z[J] ]_{J=0},   (29)
where J(x) is a source. This expectation ⟨·⟩ is intentionally not specified here to allow for flexibility. For instance, using the expectation in the introduction we have

Z[J] = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ(x)},   (30)

reminding the reader that the NN ϕ(x) depends on θ. The partition function integrates over the density of network parameters. But as physicists we're much more familiar with function space densities according to

Z[J] = ∫ Dϕ e^{−S[ϕ]} e^{∫ J(x) ϕ(x)},   (31)

the Feynman path integral that determines the correlators from an action S[ϕ] that defines a density on functions.

Since starting a neural network requires specifying the data (ϕ, P(θ)), the parameter space partition function (30) and associated parameter space calculation of correlators is always available to us. Given that mathematical data, one might ask

Question: What is the action S[ϕ] associated to (ϕ, P(θ))?

When this question can be answered, it opens a second way of studying or understanding the theory. The parameter-space and function-space descriptions should be thought of as a duality.

3.1 NNGP Correspondence

Having raised the question of the action S[ϕ] associated to the network data (ϕ, P(θ)), we can turn to a classic result of Neal [33]. For simplicity, we again consider a single-layer fully connected network of width N, with the so-called biases turned off for simplicity:

ϕ(x) = (1/√N) Σ_{i=1}^{N} w_i^(1) σ( Σ_{j=1}^{d} w_{ij}^(0) x_j ),   (32)

where the set of network parameters θ = {w_{ij}^(0), w_i^(1)} is drawn independently and identically distributed (i.i.d.),

w_{ij}^(0) ∼ P(w^(0)),   w_i^(1) ∼ P(w^(1)).   (33)

Under this assumption, we see

Observation: The network is a sum of N i.i.d. functions.

This is a function version of the Central Limit Theorem, generalizing the review in Appendix A, and gives us the Neural Network / Gaussian Process (NNGP) correspondence,

NNGP Correspondence: in the N → ∞ limit, ϕ is drawn from a Gaussian Process (GP),

lim_{N→∞} ϕ(x) ∼ N( µ(x), K(x, y) ),   (34)

with mean and covariance (or kernel) µ(x) and K(x, y).

By the CLT, exp(−S[ϕ]) is Gaussian and therefore S[ϕ] is quadratic in networks. Now this really feels like physics, since the infinite neural network is drawn from a Gaussian density on functions, which defines a generalized free field theory.

We will address generality of the NNGP correspondence momentarily, but let's first get a feel for how to do computations. To facilitate them, we take P(w^(1)) to have zero mean and finite variance,

⟨w^(1)⟩ = 0,   ⟨w^(1) w^(1)⟩ = µ_2,   (35)

which causes the one-point function to vanish, G^(1)(x) = 0. Following Williams [34], we compute the two-point function in parameter space (with Einstein summation)

G^(2)(x, y) = (1/N) ⟨w_i^(1) σ(w_{ij}^(0) x_j) w_k^(1) σ(w_{kl}^(0) y_l)⟩   (36)
            = (1/N) ⟨w_i^(1) w_k^(1)⟩ ⟨σ(w_{ij}^(0) x_j) σ(w_{kl}^(0) y_l)⟩   (37)
            = (µ_2/N) ⟨σ(w_{ij}^(0) x_j) σ(w_{il}^(0) y_l)⟩,   (38)
where the last equality follows from the ones being i.i.d., ⟨w_i^(1) w_k^(1)⟩ = µ_2 δ_{ik}. The sum over i gives us N copies of the same function, leaving us with

G^(2)(x, y) = µ_2 ⟨σ(w_{ij}^(0) x_j) σ(w_{il}^(0) y_l)⟩,   (39)

where we emphasize there is now no summation on i. This is an exact-in-N two-point function that now requires only the computation of the quantity in bra-kets. One may try to evaluate it exactly by doing the integral over w^(0). If it can't be done, Monte Carlo estimates may be obtained from M samples of w^(0) ∼ P(w^(0)) as

G^(2)(x, y) ≃ (µ_2/M) Σ_samples σ(w_{ij}^(0) x_j) σ(w_{il}^(0) y_l).   (40)

In typical NN settings, parameter densities are easy to sample for convenience, allowing for easy computation of the estimate. If the density is more complicated, one may always resort to Markov chains, e.g. as in lattice field theory.

With this computation in hand, we have the defining data of this NNGP,

lim_{N→∞} ϕ(x) ∼ N( 0, G^(2)(x, y) ).   (41)

The associated action is

S[ϕ] = ∫ d^d x d^d y ϕ(x) G^(2)(x, y)^{−1} ϕ(y),   (42)

where

∫ d^d y G^(2)(x, y)^{−1} G^(2)(y, z) = δ^(d)(x − z)   (43)

defines the inverse two-point function. In fact, this allows us to determine the action of any NNGP with µ(x) = G^(1)(x) = 0, by computing G^(2) in parameter space and inverting it.
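The Monte Carlo estimate (40) is easy to carry out on a laptop. The following sketch (my own choices: σ = tanh, Gaussian parameter densities with ⟨w^(1) w^(1)⟩ = µ_2 and w^(0) ∼ N(0, 1/d)) estimates G^(2) via Eq. (40) and cross-checks it against a direct ensemble of finite-N networks, Eq. (32).

```python
import numpy as np

# Monte Carlo estimate of the exact-in-N two-point function, Eq. (40).

rng = np.random.default_rng(0)
d, M, mu2 = 2, 500_000, 1.0
x = np.array([0.3, -0.7])
y = np.array([0.1, 0.5])

w0 = rng.normal(0, 1/np.sqrt(d), size=(M, d))                # samples of w0 ~ P(w0)
G2_est = mu2 * np.mean(np.tanh(w0 @ x) * np.tanh(w0 @ y))    # Eq. (40)

# cross-check against a direct ensemble of finite-N networks, Eq. (32)
def phi(x, W0, W1):
    N = W0.shape[1]
    return (W1 * np.tanh(W0 @ x)).sum(axis=1) / np.sqrt(N)

N, ens = 256, 20_000
W0 = rng.normal(0, 1/np.sqrt(d), size=(ens, N, d))
W1 = rng.normal(0, np.sqrt(mu2), size=(ens, N))
print(G2_est, np.mean(phi(x, W0, W1) * phi(y, W0, W1)))      # agree up to MC error
```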
So certain large neural networks are function draws from generalized free field theories. But at this point you might be asking yourself

Question: How general is the NNGP correspondence?

Neal's result — that infinite-width single-layer feedforward NNs are drawn from GPs — stood on its own for many years, perhaps (I am guessing) due to focus on non-NN ML techniques in the 90's and early 2000's during a so-called AI Winter. As NNs succeeded on many tasks in the 2010's after AlexNet [35], however, many asked whether architecture X has a hyperparameter N such that the network is drawn from a Gaussian Process as N → ∞. Before listing such X's, let's rhetorically ask

Question: Didn't Neal's result essentially follow from summing N i.i.d. random functions? Maybe NNs do this all the time?

In fact, that is the case. Architectures admitting an NNGP limit include

• Deep Fully Connected Networks, N = width,
• Convolutional Neural Networks, N = channels,
• Attention Networks, N = heads,

and many more. See, e.g., [36] and references therein.

3.2 Non-Gaussian Processes

If the GP limit exists due to the CLT, then violating any of the assumptions of the CLT should introduce non-Gaussianities, which are interactions in field theory. From Appendix A, we see that the CLT is violated by finite-N corrections and breaking statistical independence. See [37] for a systematic treatment of independence breaking, and derivation of NN actions from correlators.

We wish to see the N-dependence of the connected 4-pt function. Williams' technique for computing G^(2) extends to any correlator. To avoid a proliferation of indices, we will compute it using the notation

ϕ(x) = Σ_i w_i φ_i(x),   (44)
where w_i is distributed as w^(1) was in the single-layer case, and the φ_i(x) are i.i.d. neurons of any architecture. The four-point function is

G^(4) = ⟨ϕ(x) ϕ(y) ϕ(z) ϕ(w)⟩   (45)
     = Σ_{i,j,k,l} ⟨w_i w_j w_k w_l⟩ ⟨φ_i(x) φ_j(y) φ_k(z) φ_l(w)⟩   (46)
     = Σ_i ⟨w_i^4⟩ ⟨φ_i(x) φ_i(y) φ_i(z) φ_i(w)⟩   (47)
       + Σ_{i≠j} ⟨w_i^2⟩ ⟨w_j^2⟩ ⟨φ_i(x) φ_i(y) φ_j(z) φ_j(w) + perms⟩.   (48)

One can see that you have to be careful with indices. The connected 4-pt function is [38]

G_c^(4)(x, y, z, w) = G^(4)(x, y, z, w) − ( G^(2)(x, y) G^(2)(z, w) + perms ),   (49)

and watching indices carefully we obtain

G_c^(4)(x, y, z, w) = (1/N) [ µ_4 ⟨φ_i(x) φ_i(y) φ_i(z) φ_i(w)⟩   (50)
                      − µ_2^2 ( ⟨φ_i(x) φ_i(y)⟩ ⟨φ_i(z) φ_i(w)⟩ + perms ) ],   (51)

with no Einstein summation on i. We see that the connected 4-pt function is non-zero at finite N, signalling interactions. We will see that in some examples G_c^(4) can be computed exactly.
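The 1/N falloff of the connected 4-pt function is easy to check numerically. The sketch below (my own, with σ = tanh, Gaussian weights, and the connected cumulant evaluated at coincident points for simplicity) shows that N · G_c^(4) is roughly N-independent; the Monte Carlo estimate gets noisy as N grows, so modest widths are used.

```python
import numpy as np

# Check that the connected 4-pt function of the single-layer network ~ 1/N,
# Eqs. (50)-(51), at coincident points x = y = z = w.

rng = np.random.default_rng(0)
d, ens = 3, 200_000
x = np.array([0.2, -0.4, 0.9])

def sample_phi(N):
    W0 = rng.normal(0, 1/np.sqrt(d), size=(ens, N, d))
    W1 = rng.normal(0, 1, size=(ens, N))
    return (W1 * np.tanh(W0 @ x)).sum(axis=1) / np.sqrt(N)

for N in (2, 4, 8, 16):
    f = sample_phi(N)
    G2 = np.mean(f**2)
    G4c = np.mean(f**4) - 3 * G2**2      # connected 4-pt at coincident points
    print(N, N * G4c)                    # roughly N-independent => G4c ~ 1/N
```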
3.3 Symmetries

We have gotten some control over the statistics of the ensemble of neural networks. It is natural at this point to ask

Question: Is there structure in the ensemble?

By this I mean properties that the ensemble realizes that an individual network might not see. We will return to this question broadly in Section 5, but for now we will focus on one type of structure: symmetries.

To allow for symmetries at both input and output, in this section we consider networks

ϕ : R^d → R^D,   (52)

with D-dimensional output. Sometimes the indices will be implicit. A classic example is equivariant neural networks [39]. We say that ϕ is G-equivariant with respect to a group G if

ρ_D(g) ϕ(x) = ϕ(ρ_d(g) x)   ∀g ∈ G,   (53)

where

ρ_d ∈ Mat(R^d),   ρ_D ∈ Mat(R^D)   (54)

are matrix representations of G on R^d and R^D, respectively. The network is invariant if ρ_D = 1, the trivial representation. Equivariance is a powerful constraint on the network that may be implemented in a variety of ways. For problems that have appropriate symmetries, building them into the network can improve the speed and performance of learning [40], e.g., at the level of scaling laws [41, 42, 43]. For instance, in Phiala Shanahan's lectures you'll learn about SU(N)-equivariant neural networks from her work [44], which are natural in lattice field theory due to invariance of the action under gauge transformations.

But this is the statistics section, so we're interested in symmetries that arise in ensembles of neural networks, which leave the statistical ensemble invariant. In field theory, we call them global symmetries. Let the network transform under a group action as

ϕ 7→ ϕ_g,   g ∈ G.   (55)

We say that the ensemble of networks has a global symmetry group G if the partition function is invariant,

Z_g[J] = Z[J],   ∀g ∈ G.   (56)

At the level of expectations, this is

⟨e^{∫ d^d x J(x) ϕ_g(x)}⟩ = ⟨e^{∫ d^d x J(x) ϕ(x)}⟩   ∀g ∈ G,   (57)
where one can put indices on ϕ and J as required by D. By a network redefinition on the LHS, this may be cast as having a symmetry if ⟨·⟩ is invariant. In the usual path integral this is the statement of invariant action S[ϕ] and measure Dϕ. In parameter space, the redefinition may be instituted [45] by absorbing g into a redefinition of parameters as θ 7→ θ_g, with symmetry arising when

∫ dθ_g P(θ_g) e^{∫ d^d x J(x) ϕ_θ(x)} = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ_θ(x)},   (58)

i.e. the parameter density and measure must be invariant. We will give a simple example realizing this mechanism in a moment.

It is most natural to transform the input or output of the network. Our mechanism allows for symmetries of both types, which are analogs of spacetime and internal symmetries, respectively. It may also be interesting to study symmetries of intermediate layers, if one wishes to impose symmetries on learned representations. Equivariance fits into this picture because it turns a transformation at input into a transformation at output. The ensemble of equivariant NNs is invariant under the ρ_d action on the input if the partition function is invariant under the induced ρ_D action on the output.

Example. Consider any architecture of the form

ϕ(x) = Σ_i w_i φ_i(x),   w_i ∼ P(w) even,   (59)

for any neuron φ, which could itself be a deep network. The Z_2 action ϕ 7→ −ϕ may be absorbed into parameters w_g = −w_i with dw P(w) invariant by evenness, which is a global symmetry provided that the domain is invariant. This theory has only even correlators. Less trivial examples will be presented below, when we present specific networks and compute correlators.

3.4 Examples

In the statistics of neural networks, we have covered three topics: Gaussian limits, non-Gaussian corrections from violating CLT assumptions, and symmetries. We will present this essential data in some examples and then discuss similarities and differences. For canonical cases in ML, like networks with ReLU activation or similar, see [46] and references therein.

Gauss-net. The architecture and parameter densities are

ϕ(x) = w_i^(1) exp( w_{ij}^(0) x_j + b_i^(0) ) / sqrt( exp[ 2( σ_{b0}^2 + σ_{w0}^2 x^2/d ) ] ),   (60)

for parameters drawn i.i.d. from

w^(0) ∼ N(0, σ_{w0}^2/d),   w^(1) ∼ N(0, σ_{w1}^2/(2N)),   b^(0) ∼ N(0, σ_{b0}^2).   (61)

The two-point function is

G^(2)(x_1, x_2) = (σ_{w1}^2/2) e^{−σ_{w0}^2 (x_1 − x_2)^2/(2d)}   (62)

and we see that the theory has a correlation length

ξ = sqrt( d/σ_{w0}^2 ).   (63)

The connected four-point function is

G_c^(4) = (σ_{w1}^4/(4N)) [ 3 e^{4σ_{b0}^2} e^{−(σ_{w0}^2/(2d)) ( Σ_i x_i^2 − 2(x_1·x_2 + 6 perms) )} − ( e^{−(σ_{w0}^2/(2d)) ((x_1−x_4)^2 + (x_2−x_3)^2)} + 2 perms ) ].   (64)

We see that the 2-pt function is translation invariant, but not the 4-pt function.

Euclidean-Invariant Nets. We are interested in constructing input layers that are sufficient to ensure invariance under SO(d) and translations, i.e., under the Euclidean group. Consider an input layer of the form

ℓ_i(x) = F(w_i^(0)) cos( Σ_j w_{ij}^(0) x_j + b_i^(0) ),   i ∈ 1, . . . , N,   (65)
where the sum has been made explicit since the i's are not summed over, with

w_{ij}^(0) ∼ P(w_{ij}^(0)),   b_i^(0) ∼ Unif[−π, π].   (66)

It is easy to see that x_i → R_{ij} x_j for R ∈ SO(d) can be absorbed into a redefinition of the w_{ij}^(0)'s, and the partition function is invariant under this action provided that F(w_i^(0)) and P(w_{ij}^(0)) are invariant. Furthermore, arbitrary translations x_j → x_j + ϵ_j give a term w_{ij}^(0) ϵ_j that can be absorbed into a redefinition of the b_i^(0) that leaves the theory invariant. See [47] for the 2-pt and 4-pt functions.

We emphasize two nice features:

• Larger Euclidean Nets. Any network that builds on ℓ without reusing its parameters is Euclidean-invariant.
• Spectrum Shaping. In computing G^(2)(p), F(w^(0)) gets evaluated on p, and F may be chosen to shape the power spectrum (momentum space propagator) arbitrarily.

We refer the reader to [47] for general calculations of ℓ-correlators and to the specializations below for simple single-layer theories. We turn the ℓ_i(x) input layer into a scalar NN by

ϕ(x) = Σ_i w_i^(1) ℓ_i(x),   (67)

with w_i^(1) drawn i.i.d., w_i^(1) ∼ P(w^(1)).

Cos-net. The architecture is defined by specializing (67) to F = 1,

ϕ(x) = w_i^(1) cos( w_{ij}^(0) x_j + b_i^(0) ),   (68)

and specific parameter densities

w^(1) ∼ N(0, σ_{w1}^2/N),   w^(0) ∼ N(0, σ_{w0}^2/d),   b^(0) ∼ Unif[−π, π].   (69)

The two-point function is equivalent to that of Gauss-net,

G^(2)(x, y) = (σ_{w1}^2/2) e^{−σ_{w0}^2 (x_1 − x_2)^2/(2d)}.   (70)

The connected four-point function is

G^(4)|_c = (σ_{w1}^4/(8N)) [ 3 ( e^{−(σ_{w0}^2/(2d)) (x_1 + x_2 − x_3 − x_4)^2} + 2 perms ) − 2 ( e^{−(σ_{w0}^2/(2d)) ((x_1−x_4)^2 + (x_2−x_3)^2)} + 2 perms ) ].   (71)

The 2-pt function and 4-pt function are fully Euclidean invariant.

Equipped with both Cos-net and Gauss-net, we'd like to compare the two:

• Symmetry. Cos-net is Euclidean-invariant ∀N, by construction, while Gauss-net enjoys this symmetry only when N → ∞; in that limit it only depends on G^(2)(x, y).
• Large-N Duality. As N → ∞, both theories are drawn from the same Gaussian Process, i.e., with the same G^(2)(x, y).

Scalar-net. Here is one final example you might like, specializing (67) to

ϕ(x) = sqrt( 2 vol(B_Λ^d) / ((2π)^d σ_{w1}^2) ) · ( w_i^(1) / sqrt( (w_i^(0))^2 + m^2 ) ) cos( w_{ij}^(0) x_j + b_i^(0) ),   (72)

and specific parameter densities

w^(1) ∼ N(0, σ_{w1}^2/N),   w^(0) ∼ Unif(B_Λ^d),   b^(0) ∼ Unif[−π, π],   (73)

where B_Λ^d is a d-ball of radius Λ. The theory is translation invariant by construction, and so we compute the power spectrum of the two-point function G^(2)(x − y) to be

G^(2)(p) = 1/(p^2 + m^2).   (74)

We see that we have a realization of the free scalar field theory in d Euclidean dimensions. See Sec. 5.2 for an extension to ϕ4 theory.
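As a numerical cross-check of the examples above (my own sketch), the Cos-net two-point function can be estimated by sampling the architecture (68) with the densities (69) and compared against the analytic form (70).

```python
import numpy as np

# Monte Carlo estimate of the Cos-net two-point function vs. Eq. (70).

rng = np.random.default_rng(0)
d, N, ens = 2, 100, 50_000
sw1, sw0 = 1.0, 1.0
x1 = np.array([0.0, 0.0])
x2 = np.array([0.6, -0.3])

def cos_net(x, W0, W1, b0):
    # Eq. (68): phi(x) = w1_i cos(w0_ij x_j + b0_i), summed over i
    return (W1 * np.cos(W0 @ x + b0)).sum(axis=1)

W0 = rng.normal(0, sw0 / np.sqrt(d), size=(ens, N, d))
W1 = rng.normal(0, sw1 / np.sqrt(N), size=(ens, N))
b0 = rng.uniform(-np.pi, np.pi, size=(ens, N))

G2_mc = np.mean(cos_net(x1, W0, W1, b0) * cos_net(x2, W0, W1, b0))
G2_exact = 0.5 * sw1**2 * np.exp(-sw0**2 * np.sum((x1 - x2)**2) / (2 * d))
print(G2_mc, G2_exact)   # agree up to Monte Carlo error
```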
4 Dynamics of Neural Networks

Having covered expressivity and statistics, we turn to dynamics. Focusing on the most elementary NN dynamics, we ask

Question: How does a NN evolve under gradient descent?

First we will study a simplification known as the neural tangent kernel (NTK), and then will use it in the case of MSE loss to solve a model exactly. We'll discuss drawbacks of the NTK, and then improve upon them with a scaling analysis that ensures feature learning.

We will study the dynamics of supervised learning with gradient descent, with data

D = {(x_α, y_α)}_{α=1}^{|D|},   (75)

and loss function

L[ϕ] = (1/|D|) Σ_{α=1}^{|D|} ℓ(ϕ(x_α), y_α).   (76)

We optimize the network parameters θ by gradient descent,

dθ_i/dt = −η ∇_{θ_i} L[ϕ].   (77)

It is also convenient to define

∆(x) = −δℓ(ϕ(x), y)/δϕ(x),   (78)

where y is to be understood as the label associated to x, which yields

dθ_i/dt = (η/|D|) Σ_{α=1}^{|D|} ∆(x_α) ∂ϕ(x_α)/∂θ_i   (79)

as another form of the gradient descent equation, by the chain rule. ∆(x) is the natural object of gradient descent in function space.

We use Einstein summation throughout this section unless stated otherwise (which will happen).

4.1 Neural Tangent Kernel

We now arrive at a classic dynamical result in ML theory, the neural tangent kernel [48]. We train the network by gradient descent

dϕ(x)/dt = (∂ϕ(x)/∂θ_i) (dθ_i/dt)   (80)
         = (η/|D|) Σ_{α=1}^{|D|} ∆(x_α) Θ(x, x_α),   (81)

where

Θ(x, x_α) = (∂ϕ(x)/∂θ_i)(∂ϕ(x_α)/∂θ_i)   (82)

is the neural tangent kernel (NTK).

Given the short derivation, this is clearly a fundamental object, but it seems terrible to work with, because it is:

• Parameter-dependent. For modern NNs, there are billions of parameters to sum over.
• Time-dependent. Of course, the learning trajectory is θ(t), and therefore the NTK time evolves.
• Stochastic. Since θ(t) begins at θ(0) sampled at initialization, the NTK inherits the randomness.

It's also non-local, communicating information about the loss at train points x_α to the test point x. In summary, the NTK is unwieldy.

The reason it is a classic result, however, is that it simplifies in the N → ∞ limit. In this limit, neural network training is in the so-called

Lazy regime: |θ(t) − θ(0)| ≪ 1,   (83)

i.e. the number of parameters is large but the evolution keeps them in a local neighborhood. In such a regime the network is approximately a linear-in-parameters model [48, 49],

lim_{N→∞} ϕ(x) ≃ ϕ_lin(x) := ϕ_{θ_0}(x) + (θ − θ_0)_i ∂ϕ(x)/∂θ_i |_{θ_0},   (84)
and we have

lim_{N→∞} Θ(x, x′) ≃ Θ(x, x′)|_{θ_0}.   (85)

That is, the infinite-width NTK is the NTK at initialization, provided that the network evolves as a linear model. Furthermore, in the same limit the law of large numbers often allows a sum to be replaced by an expectation value, e.g.,

lim_{N→∞} Θ(x, x′)|_{θ_0} = ⟨β_θ(x, x′)⟩ =: Θ̄(x, x′),   (86)

for computable β(x, x′), yielding network dynamics governed by

dϕ(x)/dt = −(η/|D|) Σ_{α=1}^{|D|} [ δℓ(ϕ(x_α), y_α)/δϕ(x_α) ] Θ̄(x, x_α),   (87)

where Θ̄ is the so-called frozen NTK, a kernel that may be computed at initialization and fixed once-and-for-all. This is a dramatic simplification of the dynamics.

However, you should also complain.

Complaint: The dynamics in (87) simply interpolates between information at train points x_α and test point x,

according to a fixed function Θ̄. This isn't "learning" in the usual NN sense, and there are zero parameters. In particular, since the NN only affects (87) through Θ̄, which is fixed, nothing happening dynamically in the NN is affecting the evolution. We say that in this limit the NN does not learn features in the hidden dimensions (intermediate layers), since their non-trivial evolution would cause the NTK to evolve.

Example. Let's compute the frozen NTK for a single-layer network, to get the idea. The architecture is

ϕ(x) = (1/√N) Σ_{i=1}^{N} w_i^(1) σ( Σ_{j=1}^{d} w_{ij}^(0) x_j ).   (88)

We make the sums explicit here because one is very important. The NTK is

Θ(x, x′) = Σ_i ∂ϕ(x)/∂w_i^(1) ∂ϕ(x′)/∂w_i^(1) + Σ_{ij} ∂ϕ(x)/∂w_{ij}^(0) ∂ϕ(x′)/∂w_{ij}^(0)   (89)
         = (1/N) Σ_{i=1}^{N} [ σ( Σ_j w_{ij}^(0) x_j ) σ( Σ_l w_{il}^(0) x′_l )   (90)
           + Σ_{j=1}^{d} x_j x′_j w_i^(1) w_i^(1) σ′( Σ_k w_{ik}^(0) x_k ) σ′( Σ_l w_{il}^(0) x′_l ) ]   (91)
         =: (1/N) Σ_i β_i(x, x′).   (92)

If you squint a little, you'll see that the i-sum is a sum over the same type of object, β_i(x, x′), whose i dependence comes from all these i.i.d. parameter draws in the i-direction. By the law of large numbers, we have that in the N → ∞ limit

Θ̄(x, x′) = ⟨β_i(x, x′)⟩,   (93)

with no sum on i. We emphasize

Observation: The NTK in the N → ∞ limit is deterministic (parameter-independent), depending only on P(θ).

Sometimes, the expectation may be computed exactly, and one knows the NTK that governs the dynamics once and for all.
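The concentration of the NTK at large N is easy to see empirically. The sketch below (my own, with σ = tanh and Gaussian initialization) evaluates the empirical NTK of Eqs. (89)-(92) at initialization using the analytic parameter gradients, and shows its standard deviation over initializations shrinking with width.

```python
import numpy as np

# Empirical NTK of the single-layer network, Eq. (88), at initialization.

rng = np.random.default_rng(0)
d = 3
x  = np.array([0.5, -0.2, 0.1])
xp = np.array([-0.3, 0.4, 0.8])

def empirical_ntk(N, rng):
    W0 = rng.normal(0, 1/np.sqrt(d), size=(N, d))
    w1 = rng.normal(0, 1, size=N)
    pre_x, pre_xp = W0 @ x, W0 @ xp
    # d phi/d w1_i = sigma(w0_i.x)/sqrt(N); d phi/d w0_ij = w1_i sigma'(w0_i.x) x_j / sqrt(N)
    term_w1 = np.tanh(pre_x) * np.tanh(pre_xp)
    term_w0 = (x @ xp) * w1**2 * (1 - np.tanh(pre_x)**2) * (1 - np.tanh(pre_xp)**2)
    return (term_w1 + term_w0).sum() / N          # Eqs. (89)-(92)

for N in (10, 100, 1000, 10000):
    vals = [empirical_ntk(N, rng) for _ in range(200)]
    print(N, np.mean(vals), np.std(vals))         # std shrinks roughly like 1/sqrt(N)
```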
4.2 An Exactly Solvable Model

Let us consider a special case of frozen-NTK dynamics with MSE loss,

ℓ(ϕ(x), y) = (1/2)(ϕ(x) − y)^2.   (94)

Then the dynamics (87) becomes

dϕ(x)/dt = −(η/|D|) Σ_{α=1}^{|D|} (ϕ(x_α) − y_α) Θ̄(x, x_α).   (95)
The solution to this ODE is

ϕ_t(x) = ϕ_0(x) + Θ̄(x, x_α) Θ̄(x_α, x_β)^{−1} ( 1 − e^{−ηΘ̄t/|D|} )_{βγ} ( y_γ − ϕ_0(x_γ) ),   (96)

where the computational difficulty is that Θ̄(x_α, x_β) is a |D| × |D| matrix and takes O(|D|^3) time to invert. The solution defines a trajectory through function space from ϕ_0 to ϕ_∞. The converged network is

ϕ_∞(x) = ϕ_0(x) + Θ̄(x, x_α) Θ̄(x_α, x_β)^{−1} ( y_β − ϕ_0(x_β) ).   (97)

This is known as kernel regression, a classic technique in ML. In general kernel regression, one chooses the kernel. In our case, gradient descent training in the N → ∞ limit is kernel regression, with respect to a specific kernel determined by the NN, the NTK Θ̄. On train points we have memorization,

ϕ_∞(x_α) = y_α   ∀α.   (98)

On test points x, the converged network is performing an interpolation, communicating residuals R_β on train points β through a fixed kernel Θ̄ to test points x. The prediction depends on ϕ_0, but may be averaged over to obtain

µ_∞(x) := ⟨ϕ_∞(x)⟩ = Θ̄(x, x_α) Θ̄(x_α, x_β)^{−1} y_β,   (99)

provided that ⟨ϕ_0⟩ = 0, as in many initializations for the parameters.

Let's put some English on the remarkable facts:

• µ_∞(x) is the mean prediction of an ∞ number of ∞-wide NNs trained to ∞ time.
• If ϕ_0 is drawn from a GP, then ϕ_∞ is as well. The mean is precisely µ_∞(x); see [49] for the two-point function and covariance.
4.3 Feature Learning

The frozen NTK is a tractable toy model, but it has a major drawback: it does not learn features. In this section we perform a more general study of learning dynamics, with a focus on choosing wise N-scaling such that features are non-trivially learned during gradient descent. To do this we will engineer three properties: that features (pre-activations) are finite at initialization, predictions evolve in finite time, and features evolve in finite time.

This section is denser than the others, so I encourage the reader to remember that we are aiming to achieve these three principles if they become overwhelmed with details. Throughout, I am following lecture notes of Pehlevan and Bordelon [50] with some changes in notation to match the rest of the lectures; see those lectures for more details, [51, 52] for original literature from those lectures that I utilize, and [53, 54] for related work on feature learning.

We study a deep feedforward network with L layers and width N, which in all is a map

ϕ : R^D → R   (100)

(note the input dimension D; d is reserved for below) defined recursively as

ϕ(x) = (1/(γ_0 N^d)) z^(L)(x)   (101)
z^(L)(x) = (1/N^{a_L}) w_i^(L) σ(z_i^(L−1)(x))   (102)
z_i^(ℓ)(x) = (1/N^{a_ℓ}) W_{ij}^(ℓ) σ(z_j^(ℓ−1)(x))   (103)
z_i^(1)(x) = (1/(N^{a_1} √D)) W_{ij}^(1) x_j,   (104)

where Einstein summation is implied throughout this section (unless stated otherwise) and all Latin indices run over {1, . . . , N}, except the j-index in the first layer, which runs over {1, . . . , D}. The parameters are drawn

w_i^(L) ∼ N(0, 1/N^{b_L}),   W_{ij}^(ℓ) ∼ N(0, 1/N^{b_ℓ}).   (105)

We scale the learning rate as

η = η_0 γ_0^2 N^{2d−c},   (106)
with γ_0, η_0 O(1) constants. We use a parameterization that will be convenient, where d has already been introduced but c is a new parameter. For notational brevity we will sometimes use a Greek index subscript in place of inputs, e.g. z_α^(ℓ) := z^(ℓ)(x_α). The z's are known as the pre-activations, as they are the inputs to the activation functions σ.

We have a standard MLP but have parameterized our ignorance of N-scaling, governed by parameters (a_ℓ, b_ℓ, c, d). We will use this freedom to set some reasonable goals:

• Finite Initialization Pre-activations. z^(ℓ) ∼ O_N(1) ∀ℓ.
• Learning in Finite Time. dϕ(x)/dt ∼ O_N(1).
• Feature Learning in Finite Time. dz^(ℓ)/dt ∼ O_N(1) ∀ℓ.

These constraints have a one-parameter family of solutions, which becomes completely fixed under an additional learning rate assumption. We take each constraint in turn.

Finite Pre-activations. Since the weights are zero mean, trivially

⟨z_α^(ℓ)⟩ = 0   ∀ℓ.   (107)

We must also compute the covariance. For the first pre-activation, we have

⟨z_{iα}^(1) z_{jβ}^(1)⟩ = (1/(D N^{2a_1})) ⟨W_{im}^(1) W_{jn}^(1)⟩ x_{mα} x_{nβ} = (1/(D N^{2a_1+b_1})) δ_{ij} x_{mα} x_{mβ}.   (108)

A similar calculation for higher layers ℓ gives

⟨z_{iα}^(ℓ) z_{jβ}^(ℓ)⟩ = (1/N^{2a_ℓ}) ⟨W_{im}^(ℓ) W_{jn}^(ℓ)⟩ ⟨σ(z_{mα}^(ℓ−1)) σ(z_{nβ}^(ℓ−1))⟩   (109)
                     = δ_{ij} (1/N^{2a_ℓ+b_ℓ−1}) (1/N) Σ_m ⟨σ(z_{mα}^(ℓ−1)) σ(z_{mβ}^(ℓ−1))⟩   (110)
                     = δ_{ij} (1/N^{2a_ℓ+b_ℓ−1}) ⟨Φ_{αβ}^(ℓ−1)⟩,   (111)

where

Φ_{αβ}^(ℓ) := (1/N) σ(z_{mα}^(ℓ)) σ(z_{mβ}^(ℓ))   (112)

is a feature kernel. We are constructing a proof-by-induction that the pre-activations are O_N(1), so at this stage we may assume that the pre-activations z^(ℓ−1) ∼ O_N(1) and therefore Φ^(ℓ−1) ∼ O_N(1), since it is the average of N O_N(1) quantities σ(z_{mα}^(ℓ−1)). With this in hand,

2a_1 + b_1 = 0,   2a_ℓ + b_ℓ = 1   ∀ℓ > 1,   (113)

ensures that the pre-activations z^(ℓ) are O_N(1), as is empirically required for well-behaved training.

As an aside: in the N → ∞ limit feature kernels asymptote to deterministic objects, akin to the frozen NTK behavior we have already seen, and intermediate layer pre-activations z_α^(ℓ) are also Gaussian distributed. Therefore in that limit, the statistics of a randomly initialized neural network is described by a sequence of generalized free field theories where correlations are propagated down the network according to a recursion relation; see, e.g., [55, 49].

Learning in Finite Time. Similar to our NTK derivation, we have

dϕ(x)/dt = (η/|D|) Σ_{α=1}^{|D|} ∆(x_α) ∂ϕ(x_α)/∂θ_i ∂ϕ(x)/∂θ_i.   (114)

We have that

dϕ(x)/dt ∼ O_N(1)   ↔   (γ_0^2 N^{2d}/N^c) ∂ϕ(x_α)/∂θ_i ∂ϕ(x)/∂θ_i ∼ O_N(1).   (115)

A short calculation involving a chain rule, keeping track of the different types of layers, and tedium yields

(η_0 γ_0^2 N^{2d}/N^c) ∂ϕ(x_α)/∂θ_i ∂ϕ(x)/∂θ_i = (1/N^c) [ (1/N^{2a_L−1}) Φ_{αx}^(L−1) + Σ_{ℓ=2}^{L−1} (1/N^{2a_ℓ−1}) G_{αx}^(ℓ) Φ_{αx}^(ℓ−1) + (1/N^{2a_1}) G_{αx}^(1) Φ_{αx}^(0) ],   (116)

where

G_{µν}^(ℓ) = (1/N) g_{iµ}^(ℓ) g_{iν}^(ℓ),   g_{iµ}^(ℓ) = √N ∂z_µ^(L)/∂z_{iµ}^(ℓ)   (117)
arises naturally when doing a chain rule through the layers, and x in the subscript is a stand-in for an arbitrary input x. Some additional work [50] shows that g, G ∼ O_N(1). Since we've shown that the feature kernels are also O_N(1), then if

2a_1 + c = 0,   2a_ℓ + c = 1   ∀ℓ > 1,   (118)

we have that dϕ(x)/dt ∼ O_N(1), i.e., predictions evolve in finite time, and therefore learning happens in finite time.

Features Evolve in Finite Time. Finally, we need the features to evolve. For the first layer we compute

dz_{iµ}^(1)/dt = (1/(N^{a_1} √D)) (dW_{ij}^(1)/dt) x_{jµ} = (1/N^{2a_1+c−d+1/2}) (η_0 γ_0/|D|) Σ_{ν=1}^{|D|} ∆_ν g_{iν}^(1) Φ_{µν}^(0),   (119)

where the only N-dependence arises from the exponent. Given (118),

d = 1/2   (120)

is required to have dz^(1)/dt ∼ O_N(1), a constraint that in fact persists for all dz^(ℓ)/dt. We note that we have

No feature learning if d < 1/2,

which applies in particular to the NTK regime d = 0.

Summarizing the Constraints. Putting it all together, by enforcing finite pre-activations and that predictions and features evolve in finite time, we have constraints given by

2a_1 + b_1 = 0   (121)
2a_ℓ + b_ℓ = 1   ∀ℓ > 1   (122)
2a_1 + c = 0   (123)
2a_ℓ + c = 1   ∀ℓ > 1   (124)
d = 1/2.   (125)

These constraints are solved by a one-parameter family depending on a ∈ R,

(a_ℓ, b_ℓ, c_ℓ) = (a, 1 − 2a, 1 − 2a)   ∀ℓ > 1   (126)
(a_1, b_1, c_1) = (a − 1/2, 1 − 2a, 1 − 2a).   (127)

If one makes the additional demand that η ∼ O_N(N^{2d−c}) is O_N(1), so that you don't have to change the learning rate as you scale up, then c = 1 is fixed and there is a unique solution. This scaling is known as the maximal update parameterization, or µP [51].
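For bookkeeping, here is a small helper (my own sketch) that tabulates the one-parameter family (126)-(127) and checks the constraints (121)-(125); the additional demand that η in Eq. (106) is O(1) fixes c = 1, i.e. a = 0 in this family.

```python
# Tabulate the (a_l, b_l, c_l) family of Eqs. (126)-(127) and verify (121)-(125).

def scaling_exponents(a, L):
    # returns {layer: (a_l, b_l, c_l)} and the output exponent d
    exps = {1: (a - 0.5, 1 - 2 * a, 1 - 2 * a)}
    for l in range(2, L + 1):
        exps[l] = (a, 1 - 2 * a, 1 - 2 * a)
    return exps, 0.5                                     # d = 1/2, Eq. (125)

def check_constraints(exps, d):
    a1, b1, c = exps[1]
    ok = [2 * a1 + b1 == 0, 2 * a1 + c == 0, d == 0.5]   # (121), (123), (125)
    for l, (al, bl, cl) in exps.items():
        if l > 1:
            ok += [2 * al + bl == 1, 2 * al + cl == 1]   # (122), (124)
    return all(ok)

for a in (0.0, 0.5, 1.0):
    exps, d = scaling_exponents(a, L=4)
    print(a, check_constraints(exps, d), exps[1], exps[2])
```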
5 NN-FT Correspondence

Understanding the statistics and dynamics of NNs has led us naturally to objects that we are used to from field theory. The idea has been to understand ML theory, but one can also ask the converse, whether ML theory gives new insights into field theory. With that in mind, we ask

Question: What is a field theory?

At the very least, a field theory needs

• Fields, functions from an appropriate function space, or sections of an appropriate bundle, more generally.
• Correlation Functions of fields, here expressed as scalars

G^(n)(x_1, . . . , x_n) = ⟨ϕ(x_1) . . . ϕ(x_n)⟩.   (128)

You might already be wanting to add more beyond these minimal requirements – we'll discuss that in a second. For now, we have

Answer: a FT is an ensemble of functions with a way to compute their correlators.
In the Euclidean case, when the expectation is a statistical expectation, one may say

Euclidean Answer: a FT is a statistical ensemble of functions.

Our minimal requirements get us a partition function

Z[J] = ⟨e^{∫ d^d x J(x) ϕ(x)}⟩   (129)

that we can use to compute correlators, where at this stage we are agnostic about the definition of ⟨·⟩. In normal field theory, the ⟨·⟩ is defined by the Feynman path integral

Z[J] = ∫ Dϕ e^{−S[ϕ] + ∫ d^d x J(x) ϕ(x)},   (130)

which requires specifying an action S[ϕ] that determines a density on functions exp(−S[ϕ]). But that's not the data we specify when we specify a NN. The NN data (ϕ_θ, P(θ)) instead defines

Z[J] = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ_θ(x)}.   (131)

These are two different ways of defining a field theory, and indeed given (ϕ_θ, P(θ)) one can try to work out the associated action, in which case we have a dual description of the same field theory, as in the NNGP correspondence. The parameter space description is already quite useful, though, as it enables the computation of correlation functions even if the action isn't known. In certain cases it enables the computation of exact correlators in interacting theories.

Okay, you get it, this is a different way to do field theory. Now I'll let you complain about my definition. You're asking

Question: Shouldn't my definition of field theory include X?

I'm writing this before I give the lecture, and my guess is you already asked about a set of X's, e.g.

X ∈ {Quantum, Lagrangian, Symmetries, Locality, . . . }.   (132)

The problem is that with any such X, there's usually some community of physicists that doesn't care. For instance, not all statistical field theories are Wick rotations of quantum theories; not all field theories have a known Lagrangian; not all field theories have symmetry; not all field theories are local. So I'm going to stick with my definition, because at a minimum I want fields and correlators.

Instead, if your X isn't included in the definition of field theory, it becomes an engineering problem. Whether you're defining your specific theory by S[ϕ], (ϕ_θ, P(θ)), or something else, you can ask

Question: Can I engineer my defining data to get FT + X?

For X = Symmetries you've already seen ways to do this at the level of actions in QFT1 and at the level of (ϕ_θ, P(θ)) in these lectures. For a current account of recent progress in NN-FT, see the rather long Introduction of [37] and associated references, as well as results.

5.1 Quantum Field Theory

We've been on Euclidean space the whole time¹, so it's natural to wonder in what sense these field theories are quantum. In a course on field theory, we first learn to canonically quantize and then at some later point learn about Wick rotation, and how it can define the Euclidean correlators. The theory is manifestly quantum.

But given a Euclidean theory, can it be continued to a well-behaved quantum theory in Lorentzian signature, e.g. with unitary time evolution and a Hilbert space without negative norm states? If we have a nice-enough local action, it's possible, but what if we don't have an action? We ask:

Question: Given Euclidean correlators, can the theory be continued to a well-behaved Lorentzian quantum theory?

This is a central question in axiomatic quantum field theory, and the answer is that it depends on the properties of the correlators.

¹ This can be relaxed; see, e.g., [56] for a recent paper defining an equivariant network in Lorentzian signature.
The Osterwalder-Schrader (OS) theorem [57] gives a set of conditions on the Euclidean correlators that ensure that the theory can be continued to a unitary Lorentzian theory that satisfies the Wightman axioms. The conditions of the theorem include

• Euclidean Invariance. The correlators are invariant under the Euclidean group, which after continuation to Lorentzian signature becomes the Poincaré group.
• Permutation Invariance of the correlators G^(n)(x_1, . . . , x_n) under any permutation of the x_1, . . . , x_n.
• Reflection Positivity. Having time in Lorentzian signature requires picking a Euclidean time direction τ. Let R(x) be the reflection of x in the τ = 0 plane. Then reflection positivity requires that

G^(2n)(x_1, . . . , x_n, R(x_1), . . . , R(x_n)) ≥ 0.   (133)

Technically, this is necessary but not sufficient. An accessible elaboration can be found in notes [58] from a previous TASI.
• Cluster Decomposition occurs when the connected correlators vanish when any points are infinitely far apart.

If all of these are satisfied, then the pair (ϕ_θ, P(θ)) that defines the NN-FT actually defines a neural network quantum field theory [47]. In NN-FT, permutation invariance is essentially automatic and Euclidean invariance may be engineered as described in Section 3.3. Cluster decomposition and reflection positivity hold in some examples [47, 37], but systematizing their construction is an important direction for future work.

There is at least one straightforward way to obtain an interacting NN-QFT. Notably, if (ϕ_θ, P(θ)) is a NNGP that satisfies the OS axioms (this is much easier [47]) with Gaussian partition function

Z_G[J] = ∫ dθ P(θ) e^{∫ d^d x J(x) ϕ_θ(x)},   (134)

then one may insert an operator associated to any local potential V(ϕ), which deforms the action in the expected way and the NN-FT to

Z[J] = ∫ dθ P(θ) e^{−∫ d^d x V(ϕ_θ(x))} e^{∫ d^d x J(x) ϕ_θ(x)}   (135)
     =: ∫ dθ P̃(θ) e^{∫ d^d x J(x) ϕ_θ(x)},   (136)

where the architecture equation ϕ_θ(x) lets us sub out the abstract expression for a concrete function of parameters, defining a new density on parameters P̃(θ) in the process. The interactions in V(ϕ) break Gaussianity of the NNGP that was ensured by a CLT. This means a CLT assumption must be violated: it is the breaking of statistical independence in P̃(θ). The theory Z[J] defined by (ϕ_θ, P̃(θ)) is an interacting NN-QFT, since local potentials that deform Gaussian QFTs still satisfy reflection positivity and cluster decomposition.

5.2 ϕ4 Theory

With this discussion of operator insertions, it's clear now how to get ϕ4 theory. We just insert the operator

e^{−∫ d^d x ϕ_θ(x)^4}   (137)

into the partition function associated to the free scalar of Section 3.4; this operator with the architecture (72) technically requires an IR cutoff, though other architectures realizing the free scalar may not. The operator insertion deforms the parameter densities in (73) and breaks their statistical independence, explaining the origin of interactions in the NN-QFT. See [37] for a thorough presentation.
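The mechanics of the operator insertion (135)-(136) can be illustrated in a drastically simplified setting (my own sketch, reduced to a single "spacetime" point so that ∫ d^d x V(ϕ) degenerates to a quartic potential V(ϕ) = λϕ^4): correlators of the deformed theory are computed by reweighting parameter-space samples with exp(−V(ϕ_θ)).

```python
import numpy as np

# Operator insertion as a reweighting of the parameter density, cf. Eqs. (135)-(136).

rng = np.random.default_rng(0)
N, ens, lam = 50, 200_000, 0.5

# toy "architecture": phi_theta = (1/sqrt(N)) sum_i theta_i, so the undeformed
# ensemble is Gaussian (a zero-dimensional NNGP)
theta = rng.normal(0, 1, size=(ens, N))
phi = theta.sum(axis=1) / np.sqrt(N)

w = np.exp(-lam * phi**4)                  # insertion of exp(-V(phi_theta))
w /= w.mean()                              # normalize the deformed density P-tilde

G2_free = np.mean(phi**2)                  # Gaussian: <phi^2> = 1, <phi^4> = 3
G2_int  = np.mean(w * phi**2)              # deformed (interacting) correlators
G4c_int = np.mean(w * phi**4) - 3 * G2_int**2
print(G2_free, G2_int, G4c_int)            # the quartic term suppresses <phi^2>
                                           # and generates a nonzero connected 4-pt
```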
5.3 Open Questions

In this section we've discussed a neural network approach to field theory in which the partition function is an integral over probability densities of parameters, and the fields (networks) are functions of these parameters. We have summarized some essential results, but there are many open questions:
Physics for Machine Learning Lecture Notes 20

• Reflection Positivity. Is there a systematic way to engineer In Section 3 on statistics, I reminded the reader that a randomly
NN-FTs (ϕθ , P (θ)) that satisfy reflection positivity? initialized NN is no more fundamental than a single role of the dice.
It is a random function with parameters that is drawn from a statisti-
• Cluster Decomposition. Can systematically understand the cal ensemble of NNs, and we asked “What characterizes the statistics
conditions under which cluster decomposition holds in NN-FT? of the NN ensemble?” I reviewed a classic result of Neal from the
90’s that demonstrated that under common assumptions a width-N
• Engineering Actions. How generally can we define an NN-FT feedforward network is drawn from a Gaussian Process — a Gaussian
that is equivalent to a field theory with a fixed action? density on functions — in the N → ∞ limit. I explained that work
• Locality in the action can be realized at infinite-N , but is there from the last decade has demonstrated that this result generalizes to
a fundamental obstruction at finite-N ? many different architectures, as a result of the Central Limit The-
orem. Non-Gaussianities therefore arise by violating an assumption
On the other hand, an NN-FT approach to CFT and to Grassmann of the Central Limit Theorem, such as N → ∞ or statistical inde-
fields is well underway and should appear in 2024. pendence, and by violating them weakly the non-Gaussianities can
be made parametrically small. In field theory language, NNGPs are
generalized free field theories and NN Non-Gaussian processes are
6 Recap and Outlook interacting field theories. I also introduced a mechanism for global
symmetries and exemplified many of the phenomena in the section.
In Section 4 I presented some essential results from ML theory on NN dynamics, asking “How does a NN evolve under gradient descent?” When trained with full-batch gradient descent and a particular normalization, the network dynamics are governed by the Neural Tangent Kernel (NTK), which in general is an intractable object. However, in N → ∞ limits it becomes a deterministic function and the NTK at initialization governs the dynamics for all time. I presented an exactly solvable model with mean-squared error loss, which yields the mean prediction of an infinite number of infinitely wide neural networks trained to infinite time. The mean network prediction is equivalent to kernel regression, and the kernel communicates information from train points to test points in order to make predictions. However, from this description of the NTK a problem is already clear: nothing is being learned, and in particular late-time features in the hidden dimensions remain in a local neighborhood of their initial values. I showed how a detailed N-scaling analysis allows one to demand that network features and predictions update non-trivially, leading to richer learning regimes known as dynamical mean field theory or the maximal update parameterization.
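In the frozen-kernel regime just described, the mean prediction of the trained ensemble is kernel regression, μ(x*) = K(x*, X) K(X, X)^{-1} Y. The minimal sketch below shows the mechanics, with a generic RBF kernel standing in for the NTK (an assumption made purely for brevity; the NTK of a given architecture would replace it).

import numpy as np

rng = np.random.default_rng(2)

def kernel(x1, x2, ell=0.5):
    # RBF kernel as a stand-in for the deterministic infinite-width NTK
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

X = np.linspace(-1.0, 1.0, 10)                          # train inputs
Y = np.sin(3 * X) + 0.05 * rng.normal(size=X.shape)     # train targets
x_star = np.linspace(-1.0, 1.0, 200)                    # test inputs

K_xx = kernel(X, X) + 1e-8 * np.eye(len(X))             # jitter for numerical stability
mu = kernel(x_star, X) @ np.linalg.solve(K_xx, Y)       # mu(x*) = K(x*,X) K(X,X)^{-1} Y
print(mu[:5])

Nothing in this prediction depends on features learned during training — the kernel is fixed at initialization — which is exactly the shortcoming that the N-scaling analysis behind the maximal update parameterization is designed to cure.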
Outlook: These recent theories of NN statistics and dynamics sometimes realize interesting toy models, but with known shortcomings from a learning perspective, though the N-scaling analysis in the feature learning section is promising. Surely there is still much to learn overall; e.g., we have said very little in these lectures about architecture design principles. For instance, some architectures, such as the Transformers [59] central to LLMs, are motivated by structural principles, whereas others are motivated by theoretical guarantees.

What are the missing principles that combine statistics, dynamics, and architecture design to achieve optimal learning?

Finally, in Section 5 I explained how neural networks provide a new way to define a field theory, so that one might use ML theory for physics and not just physics for ML theory. This NN-FT correspondence is a rethink of field theory from an ML perspective, and I tried to state clearly what is and isn’t known about obtaining cherished field theory properties in that context. For example, I explained how, by engineering a desired NNGP such as the free scalar, one may do an operator insertion in the path integral and interpret the associated interactions as breaking the statistical independence necessary for the CLT. By such a mechanism, one may engineer ϕ4 theory in an ensemble of infinite width neural networks.

Outlook: Can NN-FTs motivate new interesting physics theories or provide useful tools for studying known theories?

Our discussion in this Section is directly correlated with the topics of these lectures. It reviews recent theoretical progress, but the careful reader surely has a sense that theory is significantly behind experiment and there is still much to do. It is a great time to enter the field, and to that end I recommend carefully reading the literature that I’ve cited, as well as the book Geometric Deep Learning [60], which covers many current topics that will be natural to a physicist, including grids, groups, graphs, geodesics, and gauges. Many aspects of that book are related to principled architecture design, which complements the statistical and dynamical approach that I’ve taken in these lectures. I also recommend the book Deep Learning [61] for a more comprehensive introduction to the field.

I end with a final analogy between history and the current situation in ML. We are living in an ML era akin to the pre-PC era in classical computing. Current large language models (LLMs) and other deep networks are by now very powerful, but they are the analogs of room-sized computers that can only be fully utilized by the privileged. Whereas harnessing the power of those computers in the 60’s required access to government facilities or large laboratories, only billion dollar companies have sufficient funds and compute to train today’s LLMs from scratch. Just as the PC revolution brought computing power to the masses in the 80’s, a central question now is whether we can do the analog in ML by learning how to train equally powerful small models with limited resources. Doing so likely requires a deeper understanding of theory, especially with respect to sparse networks.

Acknowledgements: I would like to thank the students and organizers of TASI for an inspiring scientific environment and many excellent questions. I am grateful for the papers of and discussions with friends whose works have contributed significantly to these lectures, especially Yasaman Bahri, Cengiz Pehlevan, and Greg Yang. I’d also like to thank my collaborators on related matters, including Mehmet Demirtas, Anindita Maiti, Fabian Ruehle, Matt Schwartz, and Keegan Stoner. I would like to thank Sam Frank, Yikun Jiang, and Sneh Pandya for correcting typos in a draft. Finally, thanks to GitHub Copilot for its help in writing these notes, including the correct auto-completion of equations and creation of tikz figures! Two-column landscape mode was chosen to facilitate boardwork, but I think it also makes a pleasant read, HT @ [62]. I am supported by the National Science Foundation under CAREER grant PHY-1848089 and Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions).

Disclaimers: In an effort to post these notes shortly after TASI 2024, there are more typos and fewer references than is ideal.
Other references and topics may be added in the future to round out the content that was presented in real-time. I also struck a playful tone throughout, because lectures are supposed to be fun, but may have overdone it at times. I welcome suggestions on the above, but updates will still aim for clarity and brevity, targeted at HET students.

A Central Limit Theorem

Let us recall a simple derivation of the Central Limit Theorem (CLT), in order to better understand the statistics of neural networks. Consider a sum of random variables

    \phi = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} X_i,   (138)

with ⟨Xi⟩ = 0. The moments µr and cumulants κr are determined by the moment generating function (partition function) Z[J] = ⟨e^{Jϕ}⟩ and the cumulant generating function W[J] = log Z[J], respectively, as

    \mu_r = \left(\frac{d}{dJ}\right)^{r} Z[J] \Big|_{J=0},   (139)

    \kappa_r = \left(\frac{d}{dJ}\right)^{r} W[J] \Big|_{J=0}.   (140)

If the Xi are independent random variables, then the partition function factorizes, Z_{\sum_i X_i}[J] = \prod_i Z_{X_i}[J], and the cumulant generating function of the sum is the sum of the cumulant generating functions, yielding

    W_{\sum_i X_i}[J] = \sum_i W_{X_i}[J],   (141)

    \kappa_r^{\sum_i X_i} = \sum_i \kappa_r^{X_i}.   (142)

If the Xi are also identically distributed, then the cumulants \kappa_r^{X_i} are the same for all i and, accounting for the factor of 1/\sqrt{N} in (138), we obtain

    \kappa_r^{\phi} = \frac{\kappa_r^{X_i}}{N^{r/2-1}}.   (143)

This yields

    \lim_{N \to \infty} \kappa_{r>2}^{\phi} = 0,   (144)

which is sufficient to show that ϕ is Gaussian in the large-N limit. In physics language, cumulants are connected correlators, and (144) means that Gaussian (free) theories have no connected correlators. In neural networks we will be interested in studying certain Gaussian limits. From this CLT derivation, we see two potential origins of non-Gaussianity (the first of which is checked numerically in the sketch after this list):

• 1/N-corrections, from the explicit N-dependence in \kappa_r^{\phi}.

• Independence breaking, since the proof relied on (141).
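A minimal Monte Carlo check of the first origin, under the illustrative assumption that the Xi are iid and uniform on [−1, 1]: the fourth cumulant of ϕ falls off as 1/N, in line with (143).

import numpy as np

rng = np.random.default_rng(3)
draws = 1_000_000

for N in (1, 4, 16):
    X = rng.uniform(-1.0, 1.0, size=(draws, N))             # iid, mean zero
    phi = X.sum(axis=1) / np.sqrt(N)                        # eq. (138)
    kappa4 = np.mean(phi**4) - 3.0 * np.mean(phi**2) ** 2   # fourth cumulant = connected 4-pt
    print(N, round(N * kappa4, 4))                          # N * kappa4 is roughly constant

For this distribution κ4 of a single Xi is −2/15, so N·κ4 of ϕ hovers near −0.133 for every N, exhibiting the 1/N falloff of the connected four-point function.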
References

[1] W. Isaacson, The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution. Simon & Schuster, New York, first Simon & Schuster hardcover edition, 2014.

[2] D. Silver, J. Schrittwieser, et al., “Mastering the game of go without human knowledge,” Nature 550 no. 7676, (Oct, 2017) 354–359. https://doi.org/10.1038/nature24270.

[3] D. Silver, T. Hubert, et al., “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” arXiv preprint arXiv:1712.01815 (2017).

[4] B. LeMoine, “How the artificial intelligence program alphazero mastered its games,” The New Yorker (Jan, 2023). https://www.newyorker.com/science/elements/how-the-artificial-intelligence-program-alphazero-mastered-its-games.

[5] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
[6] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456 (2020).

[7] A. Ananthaswamy, “The physics principle that inspired modern ai art,” Quanta Magazine (Jan, 2023). https://www.quantamagazine.org/the-physics-principle-that-inspired-modern-ai-art-20230105/.

[8] T. Brown, B. Mann, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds., vol. 33, pp. 1877–1901. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

[9] K. Roose, “The brilliance and weirdness of chatgpt,” The New York Times (Dec, 2022). https://www.nytimes.com/2022/12/05/technology/chatgpt-ai-twitter.html.

[10] A. Bhatia, “Watch an a.i. learn to write by reading nothing but shakespeare,” The New York Times (Apr, 2023). https://www.nytimes.com/interactive/2023/04/26/upshot/gpt-from-scratch.html.

[11] S. Bubeck, V. Chandrasekaran, et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712 (2023).

[12] J. Jumper, R. Evans, et al., “Highly accurate protein structure prediction with alphafold,” Nature 596 no. 7873, (Aug, 2021) 583–589. https://doi.org/10.1038/s41586-021-03819-2.

[13] G. Carleo and M. Troyer, “Solving the quantum many-body problem with artificial neural networks,” Science 355 no. 6325, (2017) 602–606.

[14] J. Carrasquilla and R. G. Melko, “Machine learning phases of matter,” Nature Physics 13 no. 5, (2017) 431–434.

[15] L. B. Anderson, M. Gerdes, J. Gray, S. Krippendorf, N. Raghuram, and F. Ruehle, “Moduli-dependent Calabi-Yau and SU(3)-structure metrics from Machine Learning,” JHEP 05 (2021) 013, arXiv:2012.04656 [hep-th].

[16] M. R. Douglas, S. Lakshminarasimhan, and Y. Qi, “Numerical Calabi-Yau metrics from holomorphic networks,” arXiv:2012.04797 [hep-th].

[17] V. Jejjala, D. K. Mayorga Pena, and C. Mishra, “Neural network approximations for Calabi-Yau metrics,” JHEP 08 (2022) 105, arXiv:2012.15821 [hep-th].

[18] S. Gukov, J. Halverson, and F. Ruehle, “Rigor with machine learning from field theory to the poincaré conjecture,” Nature Reviews Physics (2024) 1–10.

[19] “Iaifi summer workshop 2023,” IAIFI - Institute for Artificial Intelligence and Fundamental Interactions, 2023. https://iaifi.org/events/summer_workshop_2023.html.

[20] “Machine learning and the physical sciences,” in Proceedings of the Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning and the Physical Sciences. Neural Information Processing Systems Foundation, Inc., 2023. https://ml4physicalsciences.github.io/2023/.

[21] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, “Machine learning and the physical sciences,” Rev. Mod. Phys. 91 no. 4, (2019) 045002, arXiv:1903.10563 [physics.comp-ph].

[22] “Tasi 2024: Frontiers in particle theory,” 2024. https://www.colorado.edu/physics/events/summer-intensive-programs/theoretical-advanced-study-institute-elementary-particle-physics-current. Summer Intensive Programs, University of Colorado Boulder.
[23] H. Robbins and S. Monro, “A Stochastic Approximation Method,” The Annals of Mathematical Statistics 22 no. 3, (1951) 400–407. https://doi.org/10.1214/aoms/1177729586.

[24] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review 65 no. 6, (1958) 386–408. https://api.semanticscholar.org/CorpusID:12781225.

[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017. https://arxiv.org/abs/1412.6980.

[26] G. B. De Luca and E. Silverstein, “Born-infeld (bi) for ai: Energy-conserving descent (ecd) for optimization,” in International Conference on Machine Learning, pp. 4918–4936. PMLR, 2022.

[27] G. B. De Luca, A. Gatti, and E. Silverstein, “Improving energy conserving descent for machine learning: Theory and practice,” arXiv preprint arXiv:2306.00352 (2023).

[28] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems 2 no. 4, (1989) 303–314.

[29] K. Hornik, M. Stinchcombe, and H. White, “Approximation capabilities of multilayer feedforward networks,” Neural Networks 4 no. 2, (1991) 251–257.

[30] A. N. Kolmogorov, “On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition,” Doklady Akademii Nauk SSSR 114 (1957) 953–956.

[31] V. I. Arnold, “On functions of three variables,” Doklady Akademii Nauk SSSR 114 (1957) 679–681.

[32] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, “KAN: Kolmogorov-Arnold Networks,” arXiv:2404.19756 [cs.LG].

[33] R. M. Neal, “Bayesian learning for neural networks,” Lecture Notes in Statistics 118 (1996).

[34] C. K. Williams, “Computing with infinite networks,” Advances in Neural Information Processing Systems (1997).

[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds., vol. 25. Curran Associates, Inc., 2012. https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.

[36] G. Yang, “Wide feedforward or recurrent neural networks of any architecture are gaussian processes,” Advances in Neural Information Processing Systems 32 (2019).

[37] M. Demirtas, J. Halverson, A. Maiti, M. D. Schwartz, and K. Stoner, “Neural network field theories: non-Gaussianity, actions, and locality,” Mach. Learn. Sci. Tech. 5 no. 1, (2024) 015002, arXiv:2307.03223 [hep-th].

[38] S. Yaida, “Non-gaussian processes and neural networks at finite widths,” in Mathematical and Scientific Machine Learning, pp. 165–192. PMLR, 2020.

[39] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., vol. 48 of Proceedings of Machine Learning Research, pp. 2990–2999. PMLR, New York, New York, USA, 20–22 Jun, 2016. https://proceedings.mlr.press/v48/cohenc16.html.
[40] M. Winkels and T. S. Cohen, “3d g-cnns for pulmonary nodule detection,” arXiv preprint arXiv:1804.04656 (2018).

[41] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361 (2020).

[42] S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, “E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials,” Nature Communications 13 no. 1, (May, 2022) 2453. https://doi.org/10.1038/s41467-022-29939-5.

[43] N. Frey, R. Soklaski, S. Axelrod, S. Samsi, R. Gomez-Bombarelli, C. Coley, et al., “Neural scaling of deep chemical models,” ChemRxiv (2022). This content is a preprint and has not been peer-reviewed.

[44] D. Boyda, G. Kanwar, S. Racanière, D. J. Rezende, M. S. Albergo, K. Cranmer, D. C. Hackett, and P. E. Shanahan, “Sampling using SU(N) gauge equivariant flows,” Physical Review D 103 no. 7, (2021) 074504.

[45] A. Maiti, K. Stoner, and J. Halverson, “Symmetry-via-Duality: Invariant Neural Network Densities from Parameter-Space Correlators,” arXiv:2106.00694 [cs.LG].

[46] J. Halverson, A. Maiti, and K. Stoner, “Neural Networks and Quantum Field Theory,” Mach. Learn. Sci. Tech. 2 no. 3, (2021) 035002, arXiv:2008.08601 [cs.LG].

[47] J. Halverson, “Building Quantum Field Theories Out of Neurons,” arXiv:2112.04527 [hep-th].

[48] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., vol. 31. Curran Associates, Inc., 2018. https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.

[49] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” Advances in Neural Information Processing Systems 32 (2019).

[50] C. Pehlevan and B. Bordelon, “Lecture notes on infinite-width limits of neural networks,” August, 2024. https://mlschool.princeton.edu/events/2023/pehlevan. Princeton Machine Learning Theory Summer School, August 6–15, 2024.

[51] G. Yang and E. J. Hu, “Feature learning in infinite-width neural networks,” arXiv preprint arXiv:2011.14522 (2020).

[52] B. Bordelon and C. Pehlevan, “Self-consistent dynamical field theory of kernel evolution in wide neural networks,” Advances in Neural Information Processing Systems 35 (2022) 32240–32256.

[53] D. A. Roberts, S. Yaida, and B. Hanin, The principles of deep learning theory, vol. 46. Cambridge University Press, Cambridge, MA, USA, 2022.

[54] S. Yaida, “Meta-principled family of hyperparameter scaling strategies,” arXiv preprint arXiv:2210.04909 (2022).

[55] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, “Deep information propagation,” arXiv preprint arXiv:1611.01232 (2016).
[56] M. Zhdanov, D. Ruhe, M. Weiler, A. Lucic, J. Brandstetter, and P. Forré, “Clifford-steerable convolutional neural networks,” arXiv preprint arXiv:2402.14730 (2024).

[57] K. Osterwalder and R. Schrader, “Axioms for Euclidean Green’s functions,” Commun. Math. Phys. 31 (1973) 83–112.

[58] D. Simmons-Duffin, “The Conformal Bootstrap,” in Theoretical Advanced Study Institute in Elementary Particle Physics: New Frontiers in Fields and Strings, pp. 1–74, 2017. arXiv:1602.07982 [hep-th].

[59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems 30 (2017).

[60] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv preprint arXiv:2104.13478 (2021).

[61] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, MA, USA, 2016. http://www.deeplearningbook.org.

[62] P. Ginsparg, “Applied conformal field theory,” arXiv preprint hep-th/9108028 (1988).
