

CS 229, Summer 2025


Problem Set #1
YOUR NAME HERE (YOUR SUNET HERE) COLLABORATORS (COLLABORATORS)

Due Friday, July 11, 2025 at 11:59 pm on Gradescope.


Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible.
(2) If you have a question about this homework, we encourage you to post your question on our
Ed forum, at https://edstem.org/us/courses/79825/discussion.
(3) If you missed the first lecture or are unfamiliar with the collaboration or honor code policy,
please read the policy on the course website before starting work.
(4) For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) The due date is Friday, July 11, 2025 at 11:59 pm. If you submit after Friday, July 11, 2025
at 11:59 pm, you will begin consuming your late days. The late day policy can be found on the
course website: Course Logistics and FAQ.
All students must submit an electronic PDF version of the written questions, including plots
generated from the code. We require you to typeset your solutions via LaTeX. All
students must also submit a zip file of their source code to Gradescope, which should be created
using the make_zip.py script. You should make sure to (1) restrict yourself to only using libraries
included in the environment.yml file, and (2) make sure your code runs without errors. Your
submission may be evaluated by the auto-grader using a private test set, or used for verifying the
outputs reported in the writeup. Please make sure that your PDF file and zip file are submitted
to the corresponding Gradescope assignments respectively. We reserve the right to not give any
points to the written solutions if the associated code is not submitted.

Honor code: We strongly encourage students to form study groups. Students may discuss and
work on homework problems in groups. However, each student must write down the solutions
independently, and without referring to written notes from the joint session. In other words,
each student must understand the solution well enough to reconstruct it by themselves.
In addition, each student should write on the problem set the set of people with whom
they collaborated. Further, because we occasionally reuse problem set questions from previous years,
we expect students not to copy, refer to, or look at the solutions in preparing their answers. It
is an honor code violation to intentionally refer to a previous year’s solutions.
Regarding Notation: The notation used in this problem set matches the notation used in the
lecture notes. Some notable differences from lecture notation:
• The superscript “(i)” represents an index into the training set – for example, x^(i) is the
i-th feature vector and y^(i) is the i-th output variable. In lecture notation, these would
instead be expressed as x_i and y_i.
• The subscript j represents an index in a vector – in particular, x_j^(i) represents the j-th
feature in the feature vector x^(i). In lecture notation, x_j^(i) would be h_j(x_i) or x_i[j].
• The vector that contains the weights parameterizing a linear regression is expressed by the
variable θ, whereas lectures use the variable w. As such, θ_0 = w_0, θ_1 = w_1, and so on.

An overview of this notation is also given at the beginning of the lecture notes.

1. [30 points] Convergence and Learning Rate for Gradient Descent


When running an optimization algorithm, finding a good learning rate may require some tuning.
If the learning rate is too high, the gradient descent (GD) algorithm might not converge. If the
learning rate is too low, GD may take too long to converge. In this question, we will investigate
some theoretical guarantees for the convergence of GD on convex objective functions, and their
implications for choosing the learning rate.
Recap and notation. Suppose we have a convex objective function J(θ). At each iteration,
GD updates the iterate θ[t] as follows:

θ[t] = θ[t−1] − α∇J(θ[t−1] )

where α is the learning rate and ∇J(θ[t−1] ) is the gradient of J(θ) evaluated at θ[t−1] .
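For concreteness, here is a minimal Python sketch of this update rule; the function name grad_J and the quadratic example are illustrative, not part of the starter code.

    import numpy as np

    def gd_step(theta, grad_J, alpha):
        # One GD update: theta[t] = theta[t-1] - alpha * grad J(theta[t-1]).
        return theta - alpha * grad_J(theta)

    # Example: J(theta) = 0.5 * beta * theta**2 has gradient beta * theta.
    beta, alpha = 2.0, 0.1
    theta = np.array([-1.0])
    for _ in range(100):
        theta = gd_step(theta, lambda th: beta * th, alpha)
    print(theta)  # close to 0 for this (alpha, beta) pair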

(a) [8 points] Quadratic Objective, Scalar Variable


In this part, consider the simple objective function with one-dimensional variable θ:
J(θ) = (1/2) β θ^2
where θ ∈ R is our parameter and β > 0 is a positive, constant scalar. We initialize GD at
some θ[0] ≠ 0 and run it with learning rate α.
i. Derive the range of α such that the iterate of GD converges in the sense that there
exists a scalar θ† such that lim_{t→∞} |θ[t] − θ†| = 0. Provide the value of θ† when GD
converges and show that it's equal to the global minimum θ∗ = arg min_θ J(θ). The
range of α can potentially depend on β.
ii. Given a desired accuracy ϵ > 0, for all α in the range that you derived, derive the
minimum number of iterations T required to reach a point θ[T] such that |θ[T] − θ∗| ≤ ϵ.
T can potentially depend on ϵ, θ[0], α, and β. Investigate the behavior of T, and whether
T increases or decreases, for different values of α. Briefly discuss why choosing an
inappropriate α can cause the algorithm to take longer to converge.
Hint: Express θ[t] in terms of θ[0], α, t, and β.
Hint: lim_{t→∞} c^t = 0 for all c ∈ R with |c| < 1.
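As an optional empirical sanity check (illustrative only, not a substitute for the derivation), you can run GD on this objective for a few candidate learning rates and observe whether |θ[t]| shrinks or grows:

    beta, theta0 = 2.0, 1.0
    for alpha in [0.1, 0.4, 0.9, 1.1]:  # illustrative learning rates
        theta = theta0
        for _ in range(50):
            theta -= alpha * beta * theta  # gradient of (1/2)*beta*theta^2
        print(f"alpha={alpha}: |theta[50]| = {abs(theta):.3e}")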
Answer:
(b) [4 points] Quadratic Objective, d-dimensional Variable
In this part of the question, consider the objective function with d-dimensional variable θ:
J(θ) = (1/2) Σ_{i=1}^d β_i θ_i^2

where the θ_i's are our parameters and the β_i ∈ R, β_i > 0, are positive, constant scalars. Assume
that we start from some θ[0] where θ_i[0] ≠ 0 for all i and run GD with learning rate α. Derive
the range of learning rate α such that GD converges in the sense that there exists a vector
θ† ∈ R^d such that lim_{t→∞} ∥θ[t] − θ†∥_2 = 0. Provide the value of θ† when GD converges.
The range of α can potentially depend on the β_i, i ∈ {1, . . . , d}.
Answer:

(c) [4 points] Coding Question: Quadratic Multivariate Objective


Let the objective function be

J(θ) = θ⊤Aθ

where A ∈ R^{2×2} is a 2 × 2 real, positive semi-definite matrix. Do the following exercises:
i. Implement the update_theta and gradient_descend functions in src/gd_convergence/
experiment.py. You can stop the GD algorithm when either of the following conditions
is satisfied:
A. |J(θ[t]) − J(θ[t−1])| < ϵ, where ϵ = 10^{−50} is given as the function parameter. This
is when we assume the algorithm has converged.
B. J(θ[t]) > 10^{20}. This is to prevent an infinite loop when the algorithm does not
converge.
To test your implementation, run python src/gd_convergence/experiment.py, which
checks that your θ (approximately) converges to the optimal value. (A minimal sketch
of this loop appears at the end of this part.)
 
Note that we have provided you a matrix A = [[1, 0], [0, 2]] and θ[0] = [−1, 0.5] at the
beginning of the file for you to experiment with. Therefore, the objective function is
a special case of the objective function in part (b) with dimension d = 2. Check whether
your theoretical derivation of the feasible range of the learning rate matches your
empirical observations.
ii. Now, suppose we rotate the matrix A; what do we observe? Plot the trajectories of
the GD algorithm using the following learning rates: 0.05, 0.1, 0.2, 0.3, 0.4, 0.45, 0.5,
and 1. We have provided the rotation and plotting functions for you. You simply need
to run python src/gd_convergence/plotting.py lr1 lr2 lr3 ... with lr*
replaced by the desired learning rates. Include the output files trajectories.png and
trajectories_rotated.png, as well as a brief discussion of the printed output of
plotting.py, in the write-up.
Remark: Setting the learning rate too high will cause the objective to diverge. The
convergence properties of these learning rates are also rotationally invariant. We will
show this formally in the next part of the problem.
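For reference, a minimal sketch of the loop from part i with both stopping conditions (function signatures here are illustrative and may differ from the starter code):

    import numpy as np

    def gradient_descend(theta, A, alpha, eps=1e-50, max_obj=1e20):
        # GD on J(theta) = theta^T A theta; for symmetric A the gradient
        # is 2 * A @ theta. Sketch only; the starter code's interface
        # and return values may differ.
        J = lambda th: th @ A @ th
        prev = J(theta)
        while True:
            theta = theta - alpha * 2 * (A @ theta)
            cur = J(theta)
            if abs(cur - prev) < eps:   # condition A: assume converged
                return theta, True
            if cur > max_obj:           # condition B: stop, diverging
                return theta, False
            prev = cur

    A = np.array([[1.0, 0.0], [0.0, 2.0]])
    theta, converged = gradient_descend(np.array([-1.0, 0.5]), A, alpha=0.2)
    print(converged, theta)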
Answer:
(d) [12 points] Convergence of Gradient Descent for a General Convex Objective
Now let’s consider any convex, twice continuously differentiable objective function J(θ)
that is bounded from below. Suppose that the largest eigenvalue of the Hessian matrix
H = ∇²J(θ) is less than or equal to β_max at all points θ.
Show that when running gradient descent, choosing a step size 0 < α < 1/β_max guarantees that
J(θ[t]) will converge to some finite value as t approaches infinity.¹ Specifically, you should
derive an inequality for J(θ[t]) in terms of J(θ[t−1]), ∇J(θ[t−1]), α, and β_max, and show that
the objective function is strictly decreasing during each iteration, i.e. J(θ[t]) < J(θ[t−1]).
Hint: You can assume that, by Taylor’s theorem, the following statement is true:
J(y) = J(x) + ∇J(x)⊤(y − x) + (1/2)(y − x)⊤∇²J(x + c(y − x))(y − x) for some 0 ≤ c ≤ 1
¹ This definition of convergence is different from the definition in previous questions, where we explicitly prove
that the parameter θ converges to an optimal parameter θ∗. In this case, we do not know the optimal value θ∗,
and it would be difficult to prove convergence using the previous definition.

Hint: The Hessian of a convex function is symmetric and positive semi-definite at any point
θ, which means all of its eigenvalues are real and non-negative.
Hint: The Hessian matrix is symmetric. Consider using the spectral theorem introduced
in problem 3 in homework 0.
Optional (no credit): Using the gradient descent inequality that you derived, show that
the GD algorithm converges in the sense that lim_{t→∞} ∥∇J(θ[t])∥_2^2 = 0.²
Remark: This question suggests that a smaller learning rate should be used if we believe
the curvature of the objective function is large.
Answer:
(e) [2 points] Learning Rate for Linear Regression
Consider using GD on the LMS objective introduced in lecture:
J(θ) = (1/2) ∥Xθ − ⃗y∥_2^2
where X ∈ R^{n×d} is the design matrix of our data, θ ∈ R^d is our parameter, and ⃗y ∈ R^n is
the response vector.
Let β_max be the largest eigenvalue of X⊤X. Prove that for all α ∈ (0, 1/β_max), GD with
learning rate α satisfies that J(θ[t]) converges as t → ∞.
Hint: You can invoke the statements in part (d) (including the optional question in
part (d)) to solve this part, even if you were not able to prove part (d).
Remark: The conclusion here suggests that, for linear regression, you should roughly speaking
use smaller learning rates if the scale of your data X is large or if the data points are
correlated with each other, both of which lead to a large top eigenvalue of X⊤X.
Remark: Even though in many cases the LMS objective can be solved exactly via the
normal equation, as shown in lecture, it is still useful to be able to guarantee convergence
of the Gradient Descent algorithm in situations where it is difficult to solve for
the optimal θ∗ directly (such as when a large amount of data makes inverting X⊤X very
expensive).
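For instance, the bound can be computed directly from a design matrix (a sketch with random data standing in for X):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))  # placeholder design matrix

    beta_max = np.linalg.eigvalsh(X.T @ X).max()  # top eigenvalue of X^T X
    print(f"beta_max = {beta_max:.1f}; safe range for alpha: (0, {1 / beta_max:.4f})")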
Answer:

² Note that lim_{t→∞} ∥∇J(θ[t])∥_2^2 = 0 does not necessarily mean that there exists a vector θ̂ such that θ[t] → θ̂.
(Even a 1-dimensional counterexample exists.) However, for convex functions, when θ[t] stays in a bounded
set, lim_{t→∞} ∥∇J(θ[t])∥_2^2 = 0 does imply that J(θ[t]) → inf_θ J(θ) as t → ∞. Therefore, lim_{t→∞} ∥∇J(θ[t])∥_2^2 = 0
is typically considered a reasonable definition of convergence for an optimization algorithm.

2. [25 points] Locally weighted linear regression

(a) [10 points] Consider a linear regression problem in which we want to “weight” different
training examples differently. Specifically, suppose we want to minimize
J(θ) = (1/2) Σ_{i=1}^n w^(i) (θ⊤x^(i) − y^(i))^2.

In class, we worked out what happens for the case where all the weights (the w(i) ’s) are the
same. In this problem, we will generalize some of those ideas to the weighted setting. We
will assume w(i) > 0 for all i.
i. [2 points] Show that J(θ) can also be written

J(θ) = (Xθ − y)⊤W(Xθ − y)

for an appropriate matrix W, and where X and y are as defined in class. Clearly specify
the value of each element of the matrix W.
ii. [4 points] If all the w^(i)'s equal 1, then we saw in class that the normal equation is

X⊤Xθ = X⊤y,

and that the value of θ that minimizes J(θ) is given by (X⊤X)^{−1}X⊤y. By finding
the derivative ∇_θ J(θ) and setting that to zero, generalize the normal equation to this
weighted setting, and give the new value of θ that minimizes J(θ) in closed form as a
function of X, W, and y.
iii. [4 points] Suppose we have a dataset {(x^(i), y^(i)); i = 1, . . . , n} of n independent
examples, but we model the y^(i)'s as drawn from conditional distributions with different
levels of variance (σ^(i))^2. Specifically, assume the model

p(y^(i) | x^(i); θ) = (1/(√(2π) σ^(i))) exp(−(y^(i) − θ⊤x^(i))^2 / (2(σ^(i))^2))

That is, each y^(i) is drawn from a Gaussian distribution with mean θ⊤x^(i) and variance
(σ^(i))^2 (where the σ^(i)'s are fixed, known constants). Show that finding the maximum
likelihood estimate of θ reduces to solving a weighted linear regression problem. State
clearly what the w^(i)'s are in terms of the σ^(i)'s.
In other words, this suggests that if we have prior knowledge on the noise levels (the
variance of the label y (i) conditioned on x(i) ) of all the examples, then we should use
weighted least squares with weights depending on the variances.
Answer:
(b) [10 points] Coding problem.
We will now consider the following dataset (the formatting matches that of Datasets 1-4,
except x(i) is 1-dimensional):
src/lwr/{train,valid,test}.csv
In src/lwr/lwr.py, implement locally weighted linear regression using the normal equa-
tions you derived in Part (a) and using

w^(i) = exp( −∥x^(i) − x∥_2^2 / (2τ^2) ).

This is a fairly standard choice for weights, where the weight w(i) depends on the particular
point x at which we’re trying to evaluate y: if |x(i) − x| is small, then w(i) is close to 1;
if |x(i) − x| is large, then w(i) is close to 0. Here, τ is the bandwidth parameter, and it
controls how quickly the weight of a training example falls off with distance of its x(i) from
the query point x.
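To make this concrete, here is a minimal sketch of a single LWR prediction, under the assumption that the weighted normal equations from part (a) take the standard form X⊤WXθ = X⊤Wy; the actual interface expected by src/lwr/lwr.py may differ.

    import numpy as np

    def lwr_predict(X_train, y_train, x_query, tau):
        # Weight each training example by its distance to the query point.
        w = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / (2 * tau ** 2))
        W = np.diag(w)
        # Solve the weighted normal equations (assumed form; see part (a)).
        theta = np.linalg.solve(X_train.T @ W @ X_train, X_train.T @ W @ y_train)
        return x_query @ theta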
Train your model on the train split using τ = 0.5, then run your model on the valid
split and report the mean squared error (MSE). Finally, plot your model's predictions on
the validation set (plot the training set with blue ‘x’ markers and the predictions on the
validation set with red ‘o’ markers). Does the model seem to be under- or overfitting?
Answer:
(c) [5 points] Coding problem.
We will now tune the hyperparameter τ . In src/lwr/tau.py, find the MSE value of your
model on the validation set for each of the values of τ specified in the code. For each τ , plot
your model's predictions on the validation set in the format described in part (b). Report
the value of τ which achieves the lowest MSE on the valid split, and finally report the
MSE on the test split using this value of τ.
Answer:

3. [25 points] Linear regression: linear in what?


In this question, we will see how linear regression can be used to fit non-linear functions of the
data using feature maps, and we will explore some of its limitations.

(a) [5 points] Learning degree-3 polynomials of the input


Suppose we have a dataset {(x^(i), y^(i))}_{i=1}^n where x^(i), y^(i) ∈ R. We would like to fit a third-degree
polynomial h_θ(x) = θ_3 x^3 + θ_2 x^2 + θ_1 x + θ_0 to the dataset. The key observation
here is that the function h_θ(x) is still linear in the unknown parameter θ, even though it's
not linear in the input x. This allows us to convert the problem into a linear regression
problem as follows.
Let ϕ : R → R^4 be a function that transforms the original input x to a 4-dimensional vector
defined as

ϕ(x) = [1, x, x^2, x^3]⊤ ∈ R^4    (1)

Let x̂ ∈ R^4 be a shorthand for ϕ(x), and let x̂^(i) ≜ ϕ(x^(i)) be the transformed input in
the training dataset. We construct a new dataset {(ϕ(x^(i)), y^(i))}_{i=1}^n = {(x̂^(i), y^(i))}_{i=1}^n by
replacing the original inputs x^(i) with x̂^(i). We see that fitting h_θ(x) = θ_3 x^3 + θ_2 x^2 + θ_1 x + θ_0
to the old dataset is equivalent to fitting a linear function h_θ(x̂) = θ_3 x̂_3 + θ_2 x̂_2 + θ_1 x̂_1 + θ_0
to the new dataset because

h_θ(x) = θ_3 x^3 + θ_2 x^2 + θ_1 x + θ_0 = θ_3 ϕ(x)_3 + θ_2 ϕ(x)_2 + θ_1 ϕ(x)_1 + θ_0 = θ⊤x̂    (2)

In other words, we can use linear regression on the new dataset to find parameters θ0 , . . . , θ3 .
Please write down 1) the objective function J(θ) of the linear regression problem on the
new dataset {(x̂(i) , y (i) )}ni=1 and 2) the update rule of the batch gradient descent algorithm
for linear regression on the dataset {(x̂(i) , y (i) )}ni=1 .
Terminology: In machine learning, ϕ is often called the feature map which maps the original
input x to a new set of variables. To distinguish between these two sets of variables, we will
call x the input attributes, and call ϕ(x) the features. (Unfortunately, different authors
use different terms to describe these two things. In this course, we will do our best to follow
the above convention consistently.)
Answer:
(b) [5 points] Coding question: degree-3 polynomial regression
For this sub-question, we will use the dataset provided in the following files:
src/featuremaps/{train,valid,test}.csv
Each file contains two columns: x and y. In the terminology described in the introduction,
x is the attribute (in this case one dimensional) and y is the output label.
Using the formulation of the previous sub-question, implement linear regression with nor-
mal equations using the feature map of degree-3 polynomials. Use the starter code pro-
vided in src/featuremaps/featuremap.py to implement the algorithm.
Create a scatter plot of the training data, and plot the learnt hypothesis as a smooth curve
over it. Submit the plot in the writeup as the solution for this problem.

Remark: Suppose X̂ is the design matrix of the transformed dataset. You may sometimes
encounter a non-invertible matrix X̂⊤X̂. For a numerically stable code implementation,
always use np.linalg.solve to obtain the parameters directly, rather than explicitly
calculating the inverse and then multiplying it with X̂⊤y.
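For instance, a degree-3 fit following this remark might look like the sketch below (the data and function names are illustrative; the starter code's interface may differ):

    import numpy as np

    def poly_features(x, k=3):
        # Map 1-D attributes to degree-k polynomial features [1, x, ..., x^k].
        return np.stack([x ** j for j in range(k + 1)], axis=1)

    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])  # placeholder attributes
    y = np.sin(x)                             # placeholder labels

    X_hat = poly_features(x, k=3)
    # Normal equations via np.linalg.solve, with no explicit inverse.
    theta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
    print(theta)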
Answer:
(c) [5 points] Coding question: degree-k polynomial regression
Now we extend the idea above to degree-k polynomials by considering ϕ : R → R^{k+1} to be

ϕ(x) = [1, x, x^2, . . . , x^k]⊤ ∈ R^{k+1}    (3)

Follow the same procedure as the previous sub-question, and implement the algorithm with
k = 3, 5, 10, 20. Create a similar plot as in the previous sub-question, and include the
hypothesis curves for each value of k with a different color. Include a legend in the plot to
indicate which color is for which value of k.
Submit the plot in the writeup as the solution for this sub-problem. Observe how the fitting
of the training dataset changes as k increases. Briefly comment on your observations in the
plot. Answer:
(d) [5 points] Coding question: other feature maps
You may have observed that it requires a relatively high degree k to fit the given training
data, and this is because the dataset cannot be explained (i.e., approximated) very well
by low-degree polynomials. By visualizing the data, you may have realized that y can
be approximated well by a sine wave. In fact, we generated the data by sampling from
y = sin(x) + ξ, where ξ is Gaussian noise. Please update the feature map
ϕ to include a sine transformation as follows:

ϕ(x) = [1, x, x^2, . . . , x^k, sin(x)]⊤ ∈ R^{k+2}    (4)

With the updated feature map, train different models for values of k = 0, 1, 2, 3, 5, 10, 20,
and plot the resulting hypothesis curves over the data as before.
Submit the plot as a solution to this sub-problem. Compare the fitted models with the
previous sub-question, and briefly comment on noticeable differences in the fit with this
feature map. Answer:
(e) [5 points] Overfitting with expressive models and small data
For the rest of the problem, we will consider a small dataset (a random subset of the dataset
you have been using so far) with much fewer examples, provided in the following file:
src/featuremaps/small.csv

We will be exploring what happens when the number of features starts to become larger
than the number of examples in the training set. Run your algorithm on this small dataset
using the following feature map

ϕ(x) = [1, x, x^2, . . . , x^k]⊤ ∈ R^{k+1}    (5)

with k = 1, 2, 5, 10, 20.


Create a plot of the various hypothesis curves (just like previous sub-questions). Observe
how the fitting of the training dataset changes as k increases. Submit the plot in the writeup
and comment on what you observe.
Remark: The phenomenon you observe, where the models start to fit the training dataset
very well but suddenly “go wild”, is due to what is called overfitting. The intuition to
have for now is that, when the amount of data you have is small relative to the expressive
capacity of the family of possible models (that is, the hypothesis class, which, in this case,
is the family of all degree-k polynomials), overfitting results.
Loosely speaking, the set of hypothesis functions is “very flexible” and can easily be forced
to pass through all your data points, especially in unnatural ways. In other words, the model
explains the noise in the training dataset, which shouldn't be explained in the first place.
This hurts the predictive power of the model on test examples.
Answer:

4. [35 points] Implicit Regularization


Recall that in the overparameterized regime (where the number of parameters is larger than the
number of samples), typically there are infinitely many solutions that can fit the training dataset
perfectly, and many of them cannot generalize well (that is, they have large validation errors).
However, in many cases, the particular optimizer we use (e.g., GD, SGD with particular learning
rates, batch sizes, noise, etc.) tends to find solutions that generalize well. This phenomenon is
called the implicit regularization effect (also known as algorithmic regularization or implicit bias).
In this problem, we will look at the implicit regularization effect on two toy examples in the
overparameterized regime: linear regression and a quadratically parameterized model. For lin-
ear regression, we will show that gradient descent with zero initialization will always find the
minimum norm solution (instead of an arbitrary solution that fits the training data), and in prac-
tice, the minimum norm solution tends to generalize well. For a quadratically parameterized
model, we will show that initialization and batch size also affect generalization.

(a) [3 points] Suppose we have a dataset {(x^(i), y^(i)); i = 1, · · · , n} where x^(i) ∈ R^d and y^(i) ∈ R
for all 1 ≤ i ≤ n. We assume the dataset is generated by a linear model without noise. That
is, there is a vector β⋆ ∈ R^d such that y^(i) = (β⋆)⊤x^(i) for all 1 ≤ i ≤ n. Let X ∈ R^{n×d} be
the matrix representing the inputs (i.e., the i-th row of X corresponds to x^(i)) and ⃗y ∈ R^n
the vector representing the labels (i.e., the i-th row of ⃗y corresponds to y^(i)):
X = [x^(1)⊤; x^(2)⊤; · · · ; x^(n)⊤] ∈ R^{n×d},   ⃗y = (y^(1), y^(2), . . . , y^(n))⊤ ∈ R^n.
Then in matrix form, we can write ⃗y = Xβ ⋆ . We assume that the number of examples is
less than the number of parameters (that is, n < d).
We use the least-squares cost function to train a linear model:
J(β) = (1/(2n)) ∥Xβ − ⃗y∥_2^2.    (6)
In this sub-question, we characterize the family of global minimizers of Eq. (6). We assume
that XX⊤ ∈ R^{n×n} is an invertible matrix. Prove that β achieves zero cost in Eq. (6) if
and only if

β = X⊤(XX⊤)^{−1} ⃗y + ζ    (7)

for some ζ in the subspace orthogonal to all the data (that is, for some ζ such that ζ⊤x^(i) = 0
for all 1 ≤ i ≤ n).
Note that this implies that there are infinitely many β's such that Eq. (6) is minimized.
We also note that X⊤(XX⊤)^{−1} is the pseudo-inverse of X, but you don't necessarily need
this fact for the proof.
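As an optional numerical illustration of Eq. (7) on random data (not part of the required proof):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 40, 200                    # overparameterized: n < d
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d)        # noiseless linear labels

    rho = X.T @ np.linalg.solve(X @ X.T, y)           # X^T (X X^T)^{-1} y
    z = rng.normal(size=d)
    zeta = z - X.T @ np.linalg.solve(X @ X.T, X @ z)  # orthogonal to every row of X
    beta = rho + zeta

    print(np.linalg.norm(X @ beta - y))  # ~0: beta attains zero cost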
Answer:
(b) [3 points] We still work with the setup of part (a). Among the infinitely many optimal
solutions of Eq. (6), we consider the minimum norm solution. Let ρ = X⊤(XX⊤)^{−1} ⃗y. In
the setting of (a), prove that for any β such that J(β) = 0, ∥ρ∥_2 ≤ ∥β∥_2. In other words,
ρ is the minimum norm solution.
Hint: As an intermediate step, you can prove that for any β in the form of Eq. (7),
∥β∥_2^2 = ∥ρ∥_2^2 + ∥ζ∥_2^2.

Answer:
(c) [5 points] Coding question: minimum norm solution generalizes well
For this sub-question, we still work with the setup of parts (a) and (b). We use the following
datasets:
src/implicitreg/ds1_train.csv,ds1_valid.csv
Each file contains d + 1 columns. The first d columns of the i-th row represent x^(i), and
the last column represents y^(i). In this sub-question, we use d = 200 and n = 40.
Using the formula in sub-question (b), compute the minimum norm solution on the
training dataset. Then, generate three other solutions with zero cost and different
norms using the formula in sub-question (a). The starter code is in src/implicitreg/
linear.py. Plot the validation error of these solutions (including the minimum norm
solution) in a scatter plot. Use the norm of the solutions as x-axis, and the validation
error as y-axis. For your convenience, the plotting function is provided as the method
generate_plot in the starter code. Your plot is expected to demonstrate that the mini-
mum norm solution generalizes well.
Answer:
(d) [5 points] For this sub-question, we work with the setup of parts (a) and (b). In this sub-
question, you will prove that the gradient descent algorithm with zero initialization always
converges to the minimum norm solution. Let β^(t) be the parameters found by the GD
algorithm at time step t. Recall that at step t, the gradient descent algorithm updates the
parameters as follows:
β^(t) = β^(t−1) − η∇J(β^(t−1)) = β^(t−1) − (η/n) X⊤(Xβ^(t−1) − ⃗y).    (8)
As in sub-question (a), we also assume XX⊤ is an invertible matrix. Prove that if the
GD algorithm with zero initialization converges to a solution β̂ satisfying J(β̂) = 0, then
β̂ = X⊤(XX⊤)^{−1}⃗y = ρ, that is, β̂ is the minimum norm solution.
Hint: As a first step, you can prove by induction that if we start with zero initialization,
β^(t) will always be a linear combination of {x^(1), x^(2), · · · , x^(n)} for any t ≥ 0. Then, for
any t ≥ 0, you can write β^(t) = X⊤v^(t) for some v^(t) ∈ R^n. As a second step, you can prove
that if β̂ = X⊤v^(t) for some v^(t) and J(β̂) = 0, then we have β̂ = ρ.
You don't necessarily have to follow the steps in this hint. But if you use the hint, you need
to prove the statements in the hint.
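Again as an optional numerical illustration (not the proof): run the update in Eq. (8) from zero initialization and compare against ρ.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 40, 200
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d)

    rho = X.T @ np.linalg.solve(X @ X.T, y)  # minimum norm solution

    beta, eta = np.zeros(d), 0.01            # zero initialization
    for _ in range(20000):
        beta -= (eta / n) * X.T @ (X @ beta - y)  # Eq. (8)

    print(np.linalg.norm(beta - rho))        # small: GD reached the min norm solution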
Answer:
(e) [3 points] In the following sub-questions, we consider a slightly more complicated model
called a quadratically parameterized model. A quadratically parameterized model has two
sets of parameters θ, ϕ ∈ R^d. Given a d-dimensional input x ∈ R^d, the output of the model
is
f_{θ,ϕ}(x) = Σ_{k=1}^d θ_k^2 x_k − Σ_{k=1}^d ϕ_k^2 x_k.    (9)

Note that f_{θ,ϕ}(x) is linear in its input x, but non-linear in its parameters θ, ϕ. Thus, if the
goal were to learn the function, one could simply re-parameterize it with a linear model
and use linear regression. However, here we insist on using the parameterization above in
Eq. (9) in order to study the implicit regularization effect in models that are nonlinear in
the parameters.
Notation: To simplify the equations, we define the following notation. For a vector v ∈ R^d,
let v^⊙2 be its element-wise square (that is, v^⊙2 is the vector [v_1^2, v_2^2, · · · , v_d^2] ∈ R^d). For two
vectors v, w ∈ R^d, let v ⊙ w be their element-wise product (that is, v ⊙ w is the vector
[v_1w_1, v_2w_2, · · · , v_dw_d] ∈ R^d). Then our model can be written as

f_{θ,ϕ}(x) = x⊤(θ^⊙2 − ϕ^⊙2).    (10)

Suppose we have a dataset {(x^(i), y^(i)); i = 1, · · · , n} where x^(i) ∈ R^d and y^(i) ∈ R for all
1 ≤ i ≤ n, and

y^(i) = (x^(i))⊤((θ⋆)^⊙2 − (ϕ⋆)^⊙2)
for some θ⋆, ϕ⋆ ∈ R^d. Similarly, we use X ∈ R^{n×d} and ⃗y ∈ R^n to denote the matrix/vector
representing the inputs/labels respectively:

X = [x^(1)⊤; x^(2)⊤; · · · ; x^(n)⊤] ∈ R^{n×d},   ⃗y = (y^(1), y^(2), . . . , y^(n))⊤ ∈ R^n.
Let J(θ, ϕ) = (1/(4n)) Σ_{i=1}^n (f_{θ,ϕ}(x^(i)) − y^(i))^2 be the cost function.

First, when n < d and XX⊤ is invertible, prove that there exist infinitely many optimal
solutions with zero cost.
Hint: Find a mapping between the parameter β in the linear model and the parameters θ, ϕ in
the quadratically parameterized model. Then use the conclusion in sub-question (a).
Answer:
(f) [10 points] Coding question: implicit regularization of initialization
We still work with the setup in part (e). For this sub-question, we use the following datasets:
src/implicitreg/ds2_train.csv,ds2_valid.csv
Each file contains d + 1 columns. The first d columns of the i-th row represent x^(i), and
the last column represents y^(i). In this sub-question, we use d = 200 and n = 40.
First of all, the gradient of the loss has the following form:
∇_θ J(θ, ϕ) = (1/n) Σ_{i=1}^n ((x^(i))⊤(θ^⊙2 − ϕ^⊙2) − y^(i)) (θ ⊙ x^(i)),    (11)

∇_ϕ J(θ, ϕ) = −(1/n) Σ_{i=1}^n ((x^(i))⊤(θ^⊙2 − ϕ^⊙2) − y^(i)) (ϕ ⊙ x^(i)).    (12)

You don’t need to prove these two equations. They can be verified directly using the chain
rule.
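A vectorized sketch of one full-batch gradient step implementing Eqs. (11) and (12) follows; the starter code in qp.py may organize this differently. For SGD in part (g), the same step would be applied to a sampled mini-batch instead of the full X.

    import numpy as np

    def qp_gd_step(theta, phi, X, y, lr):
        # Residuals of f(x) = x^T (theta^{.2} - phi^{.2}) on the batch.
        resid = X @ (theta ** 2 - phi ** 2) - y
        g = X.T @ resid / X.shape[0]  # (1/n) * sum_i resid_i * x^(i)
        grad_theta = theta * g        # Eq. (11): elementwise theta ⊙ g
        grad_phi = -phi * g           # Eq. (12): elementwise -phi ⊙ g
        return theta - lr * grad_theta, phi - lr * grad_phi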
Using the formulas above, run gradient descent with initialization θ = α1, ϕ = α1 for
α ∈ {0.1, 0.03, 0.01} (where 1 = [1, 1, · · · , 1] ∈ R^d is the all-ones vector) and learning rate
0.08. We provide the starter code in src/implicitreg/qp.py. Plot the curves of training
error and validation error for the different α. Use the number of gradient steps as the x-axis,
and training/validation error as the y-axis. Include your plot in the writeup and answer the
following two questions based on your plot: which models can fit the training set? Which
initialization achieves the best validation error?
Remark: Your plot is expected to demonstrate that the initialization plays an important
role in the generalization performance: different initializations can lead to different global
minimizers with different generalization performance. In other words, the initialization has
an implicit regularization effect.
Answer:
(g) [6 points] Coding question: implicit regularization of batch size
We still work with the setup in part (e). For this sub-question, we use the same dataset and
starter code as in sub-question (f). We will show that the noise in the training process also
induces implicit regularization. In particular, the noise introduced by stochastic gradient
descent in this case helps generalization. Implement the SGD algorithm, and plot the
training and validation errors for batch sizes {1, 5, 40}, learning rate 0.08, and initialization
α = 0.1. Similarly, use the number of gradient steps as the x-axis, and training/validation error
as the y-axis. For simplicity, the code for selecting a batch of examples is already provided in
the starter code. Compare the results with those in sub-question (f) with the same
initialization. Does SGD find a better solution?
Your plot is expected to show that the stochasticity in the training process is also an
important factor in the generalization performance — in our setting, SGD finds a solution
that generalizes better. In fact, a conjecture is that stochasticity in the optimization process
(such as the noise introduced by a small batch size) helps the optimizer to find a solution
that generalizes better. This conjecture can be proved in some simplified cases, such as the
quadratically parameterized model in this sub-question (adapted from the paper HaoChen
et al., 2020), and can be observed empirically in many other cases.
Answer:

5. [12 points] Double Descent on Linear Models


Background: In this question, you will empirically observe the sample-wise double descent
phenomenon. That is, the validation losses of some learning algorithms or estimators do not
monotonically decrease as we have more training examples, but instead have a curve with two
U-shaped parts. The double descent phenomenon can be observed even for simple linear models.
In this question, we consider the following setup. Let {(x^(i), y^(i))}_{i=1}^n be the training dataset.
Let X ∈ R^{n×d} be the matrix representing the inputs (i.e., the i-th row of X corresponds to
x^(i)), and ⃗y ∈ R^n the vector representing the labels (i.e., the i-th row of ⃗y corresponds to y^(i)):

X = [x^(1)⊤; x^(2)⊤; · · · ; x^(n)⊤] ∈ R^{n×d},   ⃗y = (y^(1), y^(2), . . . , y^(n))⊤ ∈ R^n.

Similarly, we use X_v ∈ R^{m×d}, ⃗y_v ∈ R^m to represent the validation dataset, where m is the size
of the validation dataset. We assume that the data are generated with d = 500.
In this question, we consider regularized linear regression. For a regularization level λ ≥ 0, define
the regularized cost function

J_λ(β) = (1/2)∥Xβ − ⃗y∥_2^2 + (λ/2)∥β∥_2^2,

and its minimizer β̂_λ = arg min_{β∈R^d} J_λ(β).

(a) [2 points] In this sub-question, we derive the closed-form solution of β̂_λ. Prove that when
λ > 0,

β̂_λ = (X⊤X + λI_{d×d})^{−1} X⊤⃗y    (13)
(recall that I_{d×d} ∈ R^{d×d} is the identity matrix).
Note: λ = 0 is a special case here. When λ = 0, (X⊤X + λI_{d×d}) could be singular.
Therefore, there might be more than one solution that minimizes J_0(β). In this case, we
define β̂_0 in the following way:

β̂_0 = (X⊤X)^+ X⊤⃗y.    (14)

where (X⊤X)^+ denotes the Moore-Penrose pseudo-inverse of X⊤X. You don't need to
prove the case when λ = 0, but this definition is useful in the following sub-questions.
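Both estimators can be computed in a few lines; a sketch is below (the regression and ridge regression methods in doubledescent.py are expected to wrap logic of this kind, though their exact interfaces may differ):

    import numpy as np

    def beta_hat(X, y, lam):
        # Eq. (13) for lam > 0; Eq. (14) via the pseudo-inverse for lam = 0.
        d = X.shape[1]
        if lam > 0:
            return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        return np.linalg.pinv(X.T @ X) @ (X.T @ y)

    def validation_mse(beta, X_v, y_v):
        # MSE as defined in sub-question (b) below.
        return np.sum((X_v @ beta - y_v) ** 2) / (2 * X_v.shape[0])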
Answer:
(b) [5 points] Coding question: the double descent phenomenon for unregularized
models
In this sub-question, you will empirically observe the double descent phenomenon. You are
given 13 training datasets of sample sizes n = 200, 250, . . . , 750, and 800, and a validation
dataset, located at
• src/doubledescent/train200.csv, train250.csv, etc.
• src/doubledescent/validation.csv

For each training dataset (X, ⃗y), compute the corresponding β̂_0, and evaluate the mean
squared error (MSE) of β̂_0 on the validation dataset. The MSE of an estimator β̂ on a
validation dataset (X_v, ⃗y_v) of size m is defined as:

MSE(β̂) = (1/(2m)) ∥X_v β̂ − ⃗y_v∥_2^2.
Complete the regression method of src/doubledescent/doubledescent.py which takes
in a training file and a validation file, and computes β̂0 . You can use numpy.linalg.pinv
to compute the pseudo-inverse.
In your writeup, include a line plot of the validation losses. The x-axis is the size of the
training dataset (from 200 to 800); the y-axis is the MSE on the validation dataset. You
should observe that the validation error increases and then decreases as we increase the
sample size.
Note: When n ≈ d, the test MSE could be very large. For better visualization, it is okay
if the test MSE goes out of scope in the plot for some points.
Answer:
(c) [5 points] Coding question: double descent phenomenon and the effect of regu-
larization.
In this sub-question, we will show that regularization mitigates the double descent phe-
nomenon for linear regression. We will use the same datasets as specified in sub-question (b).
Now consider using various regularization strengths. For λ ∈ {0, 1, 5, 10, 50, 250, 500, 1000},
you will compute the minimizer of Jλ (β).
Complete the ridge regression method of src/doubledescent/doubledescent.py which
takes in a training file and a validation file, computes the β̂λ that minimizes the training
objective under different regularization strengths, and returns a list of validation errors (one
for each choice of λ).
In your writeup, include a plot of the validation losses of these models. The x-axis is the size
of the training dataset (from 200 to 800); the y-axis is the MSE on the validation dataset.
Draw one line for each choice of λ connecting the validation errors across different training
dataset sizes. Therefore, the plot should contain 8×13 points and 8 lines connecting them.
You should observe that for some small λ’s, the validation error may increase and then
decrease as we increase the sample size. However, double descent does not occur for a
relatively large λ.
Remark: If you want to learn more about the double descent phenomenon and the effect
of regularization, you can start with the paper Nakkiran et al., 2020.
Answer:
