CSC 2541: Neural Net Training Dynamics
Lecture 1 - A Toy Model: Linear Regression
                        Roger Grosse
                University of Toronto, Winter 2021
NNTD (UofT)               CSC2541-Lec1               1 / 62
Introduction
Neural nets are everywhere:
     NNTD (UofT)              CSC2541-Lec1   2 / 62
Introduction
   But how do they work?
   Some things (most) machine learning researchers believed 10 years
   ago:
        It’s hard to optimize nonconvex functions.
        It’s hard to train neural nets with more than 2 layers.
        If you have way more parameters than data points, you’ll overfit.
        Regularization and optimization can be studied separately.
        Your learning algorithm and feature representation need to be
        carefully designed for a particular application.
   Our experience from the last 10 years has turned each of these
   claims on its head — and we are just beginning to understand why!
    NNTD (UofT)                CSC2541-Lec1                                 3 / 62
Introduction
    NNTD (UofT)   CSC2541-Lec1   4 / 62
Introduction
    NNTD (UofT)   CSC2541-Lec1   5 / 62
Introduction
                                 Hinton and Salakhutdinov, 2006
    NNTD (UofT)   CSC2541-Lec1                              6 / 62
Introduction
   But here are the training curves for an image classifier with
   millions of parameters:
                                                               He et al., 2016
   I could have picked basically any modern example!
   Generic optimization routines (only a few lines of code!)
    NNTD (UofT)              CSC2541-Lec1                                 7 / 62
Introduction
    NNTD (UofT)   CSC2541-Lec1   8 / 62
Introduction
   But this image classification network is able to generalize well even
   though it can fit random labels:
                                                           Zhang et al., 2017
   So capacity constraints are not necessary for generalization.
    NNTD (UofT)               CSC2541-Lec1                               9 / 62
Introduction
   E.g., weight decay was understood as implementing Tikhonov
   regularization, a well-understood concept from statistics.                                                  
                                  ∂        λ     2
                     w ←w−α           J + λkwk
                                 ∂w        2
                                        ∂J
                       = (1 − αλ)w − α
                                        ∂w
    NNTD (UofT)            CSC2541-Lec1                         10 / 62
Introduction
   This analysis predicts that if you change the optimization
   algorithm, you should keep the same objective function (with the
   kwk2 penalty).
   But for various optimization algorithms, it works much better to
   literally apply weight decay, even though this appears to be
   optimizing a different function!
                                                   Loschilov and Hutter, 2019
    NNTD (UofT)             CSC2541-Lec1                                11 / 62
Introduction
   Also, for overparameterized models, different optimization
   algorithms can converge to different optimal solutions.
   This is a type of implicit regularization.
                                                         Amari et al., 2020
    NNTD (UofT)               CSC2541-Lec1                            12 / 62
Introduction
   Common theme: need to understand the training dynamics
        If you’re minimizing a (strongly) convex function, you only have to
        understand the properties of the unique global optimum.
        Neural net training is nonconvex, so we could wind up at various
        local optima depending on the path the optimizer takes
        Also, most modern nets are overparameterized, so there are
        (infinitely) many different global optimal we could wind up at.
    NNTD (UofT)                CSC2541-Lec1                             13 / 62
              This Course
NNTD (UofT)    CSC2541-Lec1   14 / 62
This Course
Course staff
    Instructor: Roger Grosse
    TAs: Cem Anil, Guodong Zhang
Course web page:
 https://www.cs.toronto.edu/~rgrosse/courses/csc2541_2021/
     NNTD (UofT)               CSC2541-Lec1             15 / 62
This Course
Topics
    Weeks 1–4: Foundational concepts
        Today: Linear regression as a toy model
        Week 2: Taylor approximations (Jacobian-vector products,
        Hessians, etc.)
        Week 3: Metrics (function space distance, natural gradient)
        Week 4: Second-order optimization
    Weeks 5–9: Understanding neural net training
    Weeks 10–12: Game dynamics and bilevel optimization
    NNTD (UofT)                CSC2541-Lec1                           16 / 62
This Course
Topics
    Weeks 1–4: Foundational concepts
    Weeks 5–9: Understanding neural net training
         Week     5:   Adaptive gradient methods, normalization, weight decay
         Week     6:   Infinite limits (neural tangent kernel, infinite depth)
         Week     7:   Stochasticity
         Week     8:   Implicit regularization
         Week     9:   Momentum
    Weeks 10–12: Game dynamics and bilevel optimization
    NNTD (UofT)                     CSC2541-Lec1                            17 / 62
This Course
Topics
    Weeks 1–4: Foundational concepts
    Weeks 5–9: Understanding neural net training
    Weeks 10–12: Game dynamics and bilevel optimization
         Week 10: Differential games and minmax optimization
         (e.g. GANs)
         Weeks 11–12: Bilevel optimization (e.g. MAML, hyperparameter
         adaptation)
    NNTD (UofT)               CSC2541-Lec1                         18 / 62
This Course
Prerequisites
    Linear algebra
    Probability theory
    Multivariable calculus
    A course in machine learning (enforced!)
         You will need to fill out a Google Form saying what course you’re
         using to meet this prerequisite. Due Wednesday 1/20.
    A course in neural nets (e.g. CSC2516) is useful but not required.
         You only need to know the very basics of neural nets to follow the
         lectures.
         You may want to read the CSC2516 materials for
         tasks/architectures you’re using for your final project.
     NNTD (UofT)                CSC2541-Lec1                             19 / 62
This Course: Coursework
   Marking scheme:
        15%   Problem set (due 2/10)
        25%   Colab notebook and paper presentation
        10%   Final project proposal (due 2/17)
        50%   Final project report (due 4/7)
   The problem set is already posted.
        Covers Lectures 1–3.
        The longest/hardest problem only depends on this lecture, so you
        can get started now.
    NNTD (UofT)                CSC2541-Lec1                           20 / 62
This Course: Coursework
Colab notebook and paper presentation
    Form groups of 2–3
    Pick a paper from the spreadsheet (posted to Quercus)
    Create a Colab notebook that illustrates at least one of the key
    ideas from the paper.
    Give a 20-minute presentation to the class that summarizes the
    paper and presents your notebook.
    This assignment is due on the class meeting that the paper is
    discussed.
The sign-up sheet will be posted once the enrollment list is finalized.
     NNTD (UofT)               CSC2541-Lec1                            21 / 62
This Course: Coursework
Final project
    Form groups of 2–3
    Carry out a small research project related to the course content,
    e.g.
         invent a new algorithm/architecture
         explain a phenomenon
         apply the techniques to make a DL system work better
         test the course ideas in new settings (e.g. transformers, graph neural
         nets, generative models, etc.)
    Project proposal (due 2/17): main purpose is for us to give you
    feedback on your project
    Final report (due 4/7): conference paper style, about 8 pages
    See website for more details
     NNTD (UofT)                 CSC2541-Lec1                              22 / 62
This Course: Software
   For software, we’ll use JAX, a deep learning framework for Python
   Newer than TensorFlow/PyTorch, so maybe still some rough edges
   Advantages
        Clean, NumPy-like API
        Excellent support for forward derivatives and higher-order
        derivatives
        Functional style, user maintains state explicitly. Avoids lots of
        potential bugs (especially random number generation).
        Easily target TPU architectures.
        Parallelize/vectorize your code with pmap/vmap
        Fun to program in
   JAX tutorial today, right after this lecture, same Zoom meeting
   You’re welcome to use whatever framework you like for the final
   project.
    NNTD (UofT)                 CSC2541-Lec1                                23 / 62
          Gradient Descent for Linear Regression
NNTD (UofT)             CSC2541-Lec1               24 / 62
Recap: Linear Regression
   Linear regression assumes a linear model with parameters w, b.
                             y = f (x, w) = w> φ(x) + b
   Loss function penalizes the squared error from the true label:
                                 L(y, t) = 21 ky − tk2
   Given a finite training set {(x(i) , t(i) )}N
                                               i=1
   The cost function is the mean of the losses over all training examples:
                                       N
                                    1 X
                          J (w) =         L(f (x(i) , w), t(i) )
                                    N i=1
   Simplifying the notation with a homogeneous coordinate:
                                                                                Φ(x)            w
                       Φ̆(x) =             w̆ =
                                   1              b
   In vectorized form, this is a quadratic cost function:
                               J (w) =     1
                                          2N kΦ̆w̆   − tk2
    NNTD (UofT)                     CSC2541-Lec1                         25 / 62
Recap: Gradient Descent
   Gradient descent update rule:
                      w(k+1) ← w(k) − α∇w J (w(k) )                   (1)
   It’s a local search algorithm, so in general it can get stuck in local
   optima. But linear regression is a convex optimization problem, so
   for a small enough learning rate, it will converge to a global
   optimum.
   We can exactly analyze the dynamics of gradient descent for
   convex quadratics, gaining a lot of insight:
                      J (w) = 21 w> Aw + b> w + c,
   where A is symmetric
   Gradient descent update:
                      w(k+1) ← w(k) − α(Aw + b)
    NNTD (UofT)               CSC2541-Lec1                           26 / 62
Gradient Descent: Some Observations
   Perhaps the first question to ask about an iterative algorithm:
   what are its fixed points? I.e., for what values of w(k) does
   w(k+1) = w(k) ?
   For gradient descent on a differentiable function, the fixed points
   are the stationary points of J :
              w = w − α∇J (w)            ⇐⇒        ∇J (w) = 0
   In general, fixed points may be stable (e.g. local optima) or
   unstable (e.g. saddle points). Linear regression is convex, so all
   fixed points are stable.
   For a convex quadratic:
             ∇J (w) = Aw + b = 0              ⇐⇒      Aw = −b
        When does this have a solution? A unique solution?
        If the solution isn’t unique, where we end up depends on the
        initialization!
    NNTD (UofT)                CSC2541-Lec1                             27 / 62
Gradient Descent: Some Observations
   Another important question: what are the algorithm’s invariances?
   Claim: gradient descent is invariant to rigid transformations
   (rotation, reflection, translation)
   We often informally call this rotation invariance
   Can you give me an example of an optimization algorithm which
   isn’t rotation invariant?
    NNTD (UofT)              CSC2541-Lec1                          28 / 62
Gradient Descent: Invariance
   Rigid transformations can be written as:
                       w̃ = T (w) = Q> (w(0) − t)
   for orthogonal Q
   Inverse transformation:
                        w = T −1 (w̃) = Qw̃ + t
   This is a reparameterization, or change-of-basis. The cost function
   can be re-expressed in the new coordinate system:
                  J˜(w̃) = J (T −1 (w̃)) = J (Qw̃ + t).
   Want to show that gradient descent in both coordinate systems
   results in equivalent trajectories.
   Mathematically: initialize w̃(0) = T (w(0) ). Want to show that
                      w̃(k) = T (w(k) )     for all k.
    NNTD (UofT)              CSC2541-Lec1                            29 / 62
Gradient Descent: Invariance
Proof by Induction:
    Base case: covered by initialization,
                                w̃(0) = T (w(0) )
    Inductive step: assuming w̃(k) = T (w(k) ),
                   w̃(k+1) = w̃(k) − α∇J˜(w̃(k) )
                          = w̃(k) − αQ> ∇J (w(k) )
                          = Q> (w(k) − t) − αQ> ∇J (w(k) )
                          = T (w(k+1) )
     NNTD (UofT)                 CSC2541-Lec1                30 / 62
Gradient Descent: Invariance
   Because of rotation invariance, we are free to rotate to another
   coordinate system where the dynamics are easier to analyze.
   Recall the Spectral Decomposition of symmetric matrices:
                              A = QDQ> ,
   where Q is an orthogonal matrix whose columns are the
   eigenvectors of A, and D is a diagonal matrix whose diagonal
   entries are the eigenvalues.
   w̃ = T (w) = Q> w is a very convenient coordinate system to
   rotate to because à = Q> AQ is diagonal!
    NNTD (UofT)              CSC2541-Lec1                             31 / 62
Gradient Descent: Invariance
   A shorthand: we may assume without loss of generality
   (WLOG) that [property P].
   Translation: The problem can be transformed to an equivalent
   one where P holds.
   E.g.,
        When analyzing gradient descent, we may assume WLOG that A is
        diagonal, because the algorithm is rotation invariant.
        When analyzing Adam or coordinate descent, we can’t assume this,
        since the algorithims aren’t rotation invariant.
        We can’t assume WLOG that matrices A and B are both diagonal,
        since diagonalizing A fixes a rotation.
    NNTD (UofT)               CSC2541-Lec1                          32 / 62
Gradient Descent: Coordinatewise Dynamics
   If à is diagonal, then each coordinate evolves independently as:
                         (k+1)            (k)             (k)
                       w̃j          ← w̃j       − α(ãj w̃j     + b̃j ).
   We can analyze this into different cases.
   Case 1: ãj > 0
        Unique fixed point given by
                                          w̃?j = −b̃j /ãj
        Solving the recurrence:
                             (k)                                (0)
                        w̃j        = w̃?j + (1 − αãj )k (w̃j − w̃?j ).
        Case 1(a): 0 < αãj < 2. Iterates converge exponentially to w̃?j .
             Converges monotonically if 0 < αãj < 1, oscillates if 1 < αãj < 2.
        Case 1(b): αãj = 2. Iterates oscillate and never converge.
        Case 1(c): αãj > 2. Iterates diverge exponentially.
    NNTD (UofT)                       CSC2541-Lec1                             33 / 62
Gradient Descent: Coordinatewise Dynamics
   Case 2: ãj = 0 and b̃j 6= 0. Recurrence is solved by
                                  (k)           (0)
                                w̃j     = w̃j − αk b̃j ,
   so iterates diverge linearly.
   Case 3: ãj = 0 and b̃j = 0. Then w̃j is never updated, so
                            (k)           (0)
                          w̃j     = w̃j               for all k.
    NNTD (UofT)                       CSC2541-Lec1                 34 / 62
Gradient Descent: Coordinatewise Dynamics
    NNTD (UofT)       CSC2541-Lec1          35 / 62
Gradient Descent
   Summarizing all the above analysis into an equation:
                  w(k) = w(∞) + (I − αA)k (w(0) − w(∞) )
   The stationary solution w(∞) is, among all min-cost solutions, the
   one closest to the initialization.
   I.e., it is the projection of w(0) onto the min-cost subspace:
         w(∞) = arg min kw − w(0) k2         s.t. w ∈ arg min J (w0 ),
                     w                                   w0
    NNTD (UofT)               CSC2541-Lec1                               36 / 62
Gradient Descent: Stationary Solution
   A† denotes the pseudo-inverse of A:
                               A† = QD† Q> ,
   where QD† Q> is the spectral decomposition, and D† is a diagonal
   matrix that inverts the nonzero diagonal entries of D.
        A† = A−1 if A is invertible.
    NNTD (UofT)                CSC2541-Lec1                     37 / 62
Gradient Descent: Stationary Solution
   Closed form for the stationary solution starting from w(0) = 0:
                                 w(∞) = −A† b
   For a matrix B, the pseudoinverse is defined as:
                                 B† = VS† U> ,
   where USV> is the SVD of B, and S† is defined like D† .
   This can be written as
                                B† = (B> B)−1 B>
   if (B> B)−1 is invertible.
   If B is square and invertible, then B† = B−1 .
   If A is symmetric, then:
                                 A† = QD† Q> ,
   where QD† Q> is the spectral decomposition, and D† is a diagonal
   matrix that inverts the nonzero diagonal entries of D.
    NNTD (UofT)                  CSC2541-Lec1                         38 / 62
Gradient Descent: Convergence
What happens to the cost function?
    Keeping the transformed coordinate system, the loss decomposes
    into an independent term for each coordinate:
                                     X ãj
                        J˜(w̃) =           (w̃j − w̃?j )2 + c
                                        2
                                    j:ãj >0
    for a constant c.
    Recall:
                        (k)                           (0)
                    w̃j       − w̃?k = (1 − αãj )k (w̃j − w̃?k )
    So its contribution to the loss decreases exponentially, by a factor
    of (1 − αãj )2 in each iteration.
    Speed of convergence
                   − ln(1 − αãj )2 ≈ 2αãj          if αãj  1
    NNTD (UofT)                      CSC2541-Lec1                    39 / 62
Gradient Descent: Convergence
   Speed of convergence:
                   − ln(1 − αãj )2 ≈ 2αãj     if αãj  1
   Set α ≤ ã−1
             max for stability
   Lowest (nonzero) curvature direction converges the slowest, at rate
   αãmin < ã−1
              max ãmin
   These dimensions will eventually dominate the loss, so the
   convergence rate is bounded by κ−1 , where κ is the condition
   number:
                            κ = ãmax /ãmin
        κ ≈ 1: well-conditioned, fast convergence
        κ  1: ill-conditioned, slow convergence
    NNTD (UofT)                  CSC2541-Lec1                      40 / 62
Gradient Descent: Convergence
E.g.,
Blue = total cost
Red = cost from min curvature direction
        NNTD (UofT)          CSC2541-Lec1   41 / 62
              Back to linear regression...
NNTD (UofT)           CSC2541-Lec1           42 / 62
Gradient Descent for Linear Regression
   Cost function:
                         J (w) =    1
                                   2N kΦ̆w̆   − tk2
   Convex quadratic cost:
                      J (w) = 21 w> Aw + b> w + c,
   So
                                   1 >
                              A=   N Φ̆ Φ̆
                                          >
                              b=   − N1 Φ̆ t
   Stationary solution (starting from w̆(0) = 0):
                                                †
                          w̆(∞) = −A† b = Φ̆ t
    NNTD (UofT)              CSC2541-Lec1             43 / 62
Gradient Descent for Linear Regression
   Compare with ridge regression, or L2 -regularized linear regression:
                                                       λ
                      Jλ (w̆) =    1
                                  2N kΦ̆w̆   − tk2 +     kw̆k2 ,
                                                       2
   For λ > 0, there’s a unique optimal solution:
                                                 >                 >
                  w̆λ = arg min Jλ (w̆) = (Φ̆ Φ̆ + λI)−1 Φ̆ t.
                           w̆
   Can show that
                                                 †
                                  lim w̆λ = Φ̆ t,
                                  λ→0
   which agrees with w̆(∞) for gradient descent on an unregularized
   model. This is an example of implicit regularization.
    NNTD (UofT)                   CSC2541-Lec1                         44 / 62
              Why do we normalize the features?
NNTD (UofT)              CSC2541-Lec1             45 / 62
Why Normalize?
   A common trick when training machine learning models is to
   normalize, or standardize, the inputs to zero mean and unit
   variance:
                                φj (x) − µj
                    φ̃j (x) =
                                     σj
                                  N
                                1 X
                       µj =         φj (x(i) )
                                N
                                    i=1
                              N
                        2   1 X
                       σj =     (φj (x(i) ) − µj )2
                            N
                                    i=1
   Why is this a good idea?
   NNTD (UofT)                  CSC2541-Lec1                     46 / 62
Why Normalize?
   Recall: the convergence rate of gradient descent depends on the
   condition number κ, the ratio of the largest and smallest
   eigenvalues.
   Can show that
                             Σ + µµ> µ                                      
                         A=              ,
                               µ>    1
   where µ = N1 N         (i)
               P
                 i=1 φ(x ) is the empirical mean and
             N
   Σ = N1 i=1 (φ(x(i) ) − µ)(φ(x(i) ) − µ)> is the empirical
          P
   covariance.
   Example 1: Suppose µ = 0 and Σ = I. (The data are said to be
   white, as in “white noise”.) Then A = I, so κ = 1, and the
   problem is perfectly well-conditioned. Gradient descent converges
   in one step.
   NNTD (UofT)               CSC2541-Lec1                        47 / 62
Why Normalize?
                                 Σ + µµ> µ
                                          
                        A=
                                   µ>    1
   Example 2: Suppose µ = 0 and Σ            is diagonal. Then
                                                                             Σ             0
                         A=                      ,
                               0             1
   and κ depends on the eigenvalues of Σ. Convergence is slower.
   Example 3: Suppose the data are uncentered, i.e. µ 6= 0, and Σ is
   diagonal with bounded entries. It turns out that A has an outlier
   eigenvalue of roughly kµk2 + 1, in roughly the direction (µ> 1)> .
       Note that kµk2 grows linearly in the dimension D, while the
       remaining eigenvalues are bounded.
       This can be really badly conditioned in high-dimensional spaces!
   NNTD (UofT)                CSC2541-Lec1                                48 / 62
Why Normalize?
Intuition
                      x1           x2         t
                  114.8      0.00323      5.1
                  338.1      0.00183      3.2
                   98.8      0.00279      4.1
                    ..          ..         ..
                     .           .          .
    NNTD (UofT)            CSC2541-Lec1           49 / 62
Why Normalize?
Intuition
                       x1        x2        t
                  1003.2    1005.1     3.3
                  1001.1    1008.2     4.8
                   998.3    1003.4     2.9
                     ..       ..        ..
                      .        .         .
    NNTD (UofT)         CSC2541-Lec1           50 / 62
Why Normalize?
   So convergence is faster when µ is closer to 0 and Σ is closer to I.
       Centering sets µ = 0, which eliminates the outlier eigenvalue.
       Normalization corrects for the variances of different features, but
       not for correlations between features.
   It’s possible to go a step further and whiten the data.
       Map x → S−1 x, where S is a matrix such that SS> = Σ. (For
       instance, the matrix square root Σ1/2 .)
       Then Σ̃ = I, so the problem is perfectly well conditioned.
   Is whitening a good idea?
   NNTD (UofT)                 CSC2541-Lec1                              51 / 62
Pitfalls of Full Whitening
    Whitening always improves convergence of gradient descent, for
    linear regression, on the training set.
    But we also care about generalization!
    The highest variance directions are the principal components,
    which we believe contain a lot of the signal.
        Converging faster in these directions can be a useful inductive bias,
        which is removed by whitening.
        By contrast, the means and variances contain less useful
        information, since they depend on on arbitrary choice of units. So
        normalization is generally OK.
    The lesson: when making a change to speed up convergence, ask
    if it’s doing it in a way that’s useful for generalization!
    NNTD (UofT)                 CSC2541-Lec1                              52 / 62
Normalization: Neural Nets
Does this apply to neural nets?
    Centering and normalizing the inputs are a standard preprocessing
    step, even for neural nets
    Uncentered hidden activations create ill-conditioning too, and this
    is not straightforward to prevent.
         Classic trick: use tanh rather than logistic activation function
         Activations can become uncentered due to internal covariate shift,
         and batch normalization was designed partly to counteract this
         We can show that uncentered hidden activations create large
         eigenvalues in the Hessian (but this requires more machinery)
     NNTD (UofT)                CSC2541-Lec1                                53 / 62
              Double Descent
NNTD (UofT)     CSC2541-Lec1   54 / 62
Double Descent
How does generalization depend on dimensionality?
                                            What often happens
       The cartoon picture
                                                       Belkin et al. (2019)
This phenomenon is known as double descent
     NNTD (UofT)             CSC2541-Lec1                               55 / 62
Double Descent
   Can we build a linear regression model of this phenomenon? If so,
   then maybe we can analyze it.
   No straightforward way to vary the complexity of the model.
   Instead, we’ll fix the dimension D = 50 and vary N , the number of
   training examples.
   Double descent still happens!
    NNTD (UofT)              CSC2541-Lec1                        56 / 62
Double Descent
Intuition:
    Case 1: N  D. There’s way more than enough data to pin
    down the optimal parameters, so it generalizes well.
    Case 2: N ≈ D. It can memorize the training set, but just barely.
    It might need a large kwk to do so.
    Case 3: N  D. It can fit the training set easily. The implicit
    regularization of gradient descent makes it do so with a small kwk.
     NNTD (UofT)              CSC2541-Lec1                         57 / 62
Double Descent
   Recall the stationary solution
                                            †
                              w̆(∞) = Φ̆ t
                                                †
   Roughly speaking, w̆(∞) is large when Φ̆ is large. This happens
   when Φ̆ has small singular values.
   The minimum singular value is small exactly at the double descent
   point! (Basic result from random matrix theory, but beyond the
   scope of this course.)
    NNTD (UofT)              CSC2541-Lec1                       58 / 62
Double Descent
   Adding explicit regularization removes any trace of the double
   descent phenomenon:
   Double descent is best regarded as a pathology, but it’s one that
   still applies to a lot of state-of-the-art neural nets.
    NNTD (UofT)              CSC2541-Lec1                           59 / 62
              Discussion
NNTD (UofT)   CSC2541-Lec1   60 / 62
Discussion
   When we prove something about linear regression, what does that
   tell us about neural nets?
        In principle, nothing. Neural nets are much more complicated and
        can behave completely differently.
        We have no guarantees.
        But it’s hard to prove any nontrivial guarantees about practical
        neural nets anyway. Proofs about neural nets usually require very
        idealized conditions.
   Instead, the goal of mathematical analysis is to provide insight
   into what happens during training.
        Simpler model systems can provide more insight precisely because
        they yield more detailed predictions.
        Like an empirical science, we need to validate our model by seeing if
        it makes surprising predictions that we can confirm experimentally.
    NNTD (UofT)                 CSC2541-Lec1                             61 / 62
Discussion
Why spend so much time on linear regression?
   It’s an important model system that we can analyze in detail, and
   often yields good predictions about neural nets.
         Part of a toolbox that also includes noisy quadratic objectives,
         linear neural networks, Gaussian processes, matrix completion,
         bilinear games, ...
    We can approximate local convergence behavior by taking the
    second-order Taylor approximation around the optimum, reducing
    it to the convex quadratic case. (Topic of next lecture.)
    Amazingly, neural nets in certain settings are approximated well
    by linear regression with random features! (Topic of Lecture 6.)
     NNTD (UofT)                 CSC2541-Lec1                               62 / 62