CS229 Problem Set #1
Honor code: We strongly encourage students to form study groups. Students may discuss and
work on homework problems in groups. However, each student must write down the solutions
independently, and without referring to written notes from the joint session. In other words,
each student must understand the solution well enough in order to reconstruct it by themself.
In addition, each student should write on the problem set the set of people with whom s/he
collaborated. Further, because we occasionally reuse problem set questions from previous years,
we expect students not to copy, refer to, or look at the solutions in preparing their answers. It
is an honor code violation to intentionally refer to a previous year’s solutions.
Regarding Notation: The notation used in this problem set matches the notation used in the
lecture notes. Some notable differences from lecture notation:
• The superscript "(i)" represents an index into the training set – for example, x^{(i)} is the i-th feature vector and y^{(i)} is the i-th output variable. In lecture notation, these would instead be expressed as x_i and y_i.
• The subscript j represents an index into a vector – in particular, x_j^{(i)} represents the j-th feature in the feature vector x^{(i)}. In lecture notation, x_j^{(i)} would be h_j(x_i) or x_i[j].
• The vector that contains the weights parameterizing a linear regression is expressed by the
variable θ, whereas lectures use the variable w. As such, θ0 = w0 , θ1 = w1 , and so on.
An overview of this notation is also given at the beginning of the lecture notes.
θ[t] = θ[t−1] − α∇J(θ[t−1]),
where α is the learning rate and ∇J(θ[t−1]) is the gradient of J(θ) evaluated at θ[t−1].
where the θi's are our parameters and the βi > 0 are positive, constant scalars. Assume
that we start from some θ[0] with θi[0] ≠ 0 for all i and run GD with learning rate α. Derive
the range of learning rates α such that GD converges in the sense that there exists a vector
θ† ∈ Rd such that limt→∞ ∥θ[t] − θ†∥2 = 0. Provide the value of θ† when GD converges.
The range of α can potentially depend on the βi, where i ∈ {1, . . . , d}.
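For concreteness, the generic GD iteration can be written as a short numpy loop. This is only an illustrative sketch; grad_J, the initial point, and the fixed step count are placeholders, not part of the problem.

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha, num_steps=1000):
    """Repeat the update theta <- theta - alpha * grad_J(theta) for num_steps steps."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - alpha * grad_J(theta)
    return theta

# Example: J(theta) = 0.5 * ||theta||^2, so grad_J(theta) = theta.
theta_final = gradient_descent(lambda th: th, theta0=[1.0, -2.0], alpha=0.1)
```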
Answer:
Hint: The Hessian of a convex function is symmetric and positive semi-definite at any point
θ, which means all of its eigenvalues are real and non-negative.
Hint: The Hessian matrix is symmetric. Consider using the spectral theorem introduced
in problem 3 in homework 0.
Optional (no credit): Using the gradient descent inequality that you derived, show that
the GD algorithm converges in the sense that limt→∞ ∥∇J(θ[t])∥_2^2 = 0.²
Remark: This question suggests that a smaller learning rate should be used if we believe
the curvature of the objective function is large.
Answer:
(e) [2 points] Learning Rate for Linear Regression
Consider using GD on the LMS objective introduced in lecture:
J(θ) = (1/2) ∥Xθ − ⃗y∥_2^2
where X ∈ Rn×d is the design matrix of our data, θ ∈ Rd is our parameter, and ⃗y ∈ Rn is
the response variable.
Let βmax be the largest eigenvalue of X⊤X. Prove that for all α ∈ (0, 1/βmax), GD with
learning rate α satisfies that J(θ[t]) converges as t → ∞.
Hint: You can invoke the statements in part (d) (including the optional question in
part (d)) to solve this part, even if you were not able to prove part (d).
Remark: The conclusion here suggests that for linear regression, roughly speaking, you
should use smaller learning rates if the scale of your data X is large or if the data points are
correlated with each other, both of which lead to a large top eigenvalue of X⊤X.
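As an illustration of this remark, a safe step size can be estimated directly from the data. Below is a minimal numpy sketch; the back-off factor 0.9 is an arbitrary illustrative choice, not something prescribed by the problem.

```python
import numpy as np

def safe_learning_rate(X, margin=0.9):
    # beta_max is the largest eigenvalue of X^T X (symmetric PSD, so
    # eigvalsh returns real eigenvalues in ascending order).
    beta_max = np.linalg.eigvalsh(X.T @ X)[-1]
    # Any alpha in (0, 1/beta_max) satisfies the condition in part (e);
    # we back off slightly from the endpoint of the interval.
    return margin / beta_max
```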
Remark: Even though in many cases the LMS objective can be solved exactly by solving
the normal equation as shown in lecture, it is still useful to be able to guarantee convergence
when using the gradient descent algorithm in situations where it is difficult to solve for
the optimal θ∗ directly (such as when a large amount of data makes inverting X⊤X very
expensive).
Answer:
² Note that limt→∞ ∥∇J(θ[t])∥_2^2 = 0 does not necessarily mean that there exists a vector θ̂ such that θ[t] → θ̂.
(Even a 1-dimensional counterexample exists.) However, for convex functions, when θ[t] stays in a bounded
set, limt→∞ ∥∇J(θ[t])∥_2^2 = 0 does imply that J(θ[t]) → inf_θ J(θ) as t → ∞. Therefore, limt→∞ ∥∇J(θ[t])∥_2^2 = 0
is typically considered a reasonable definition for an optimization algorithm's convergence.
(a) [10 points] Consider a linear regression problem in which we want to “weight” different
training examples differently. Specifically, suppose we want to minimize
J(θ) = (1/2) Σ_{i=1}^{n} w(i) (θ⊤x(i) − y(i))^2 .
In class, we worked out what happens for the case where all the weights (the w(i) ’s) are the
same. In this problem, we will generalize some of those ideas to the weighted setting. We
will assume w(i) > 0 for all i.
i. [2 points] Show that J(θ) can also be written
J(θ) = (Xθ − y)⊤ W (Xθ − y)
for an appropriate matrix W, and where X and y are as defined in class. Clearly specify
the value of each element of the matrix W.
ii. [4 points] If all the w(i)'s equal 1, then we saw in class that the normal equation is
X⊤Xθ = X⊤y,
and that the value of θ that minimizes J(θ) is given by (X⊤X)^{−1}X⊤y. By finding
the derivative ∇θ J(θ) and setting that to zero, generalize the normal equation to this
weighted setting, and give the new value of θ that minimizes J(θ) in closed form as a
function of X, W and y.
iii. [4 points] Suppose we have a dataset {(x(i), y(i)); i = 1, . . . , n} of n independent
examples, but we model the y(i)'s as drawn from conditional distributions with different
levels of variance (σ(i))^2. Specifically, assume the model
p(y(i) | x(i); θ) = (1 / (√(2π) σ(i))) exp( −(y(i) − θ⊤x(i))^2 / (2(σ(i))^2) )
That is, each y (i) is drawn from a Gaussian distribution with mean θT x(i) and variance
(σ (i) )2 (where the σ (i) ’s are fixed, known, constants). Show that finding the maximum
likelihood estimate of θ reduces to solving a weighted linear regression problem. State
clearly what the w(i) ’s are in terms of the σ (i) ’s.
In other words, this suggests that if we have prior knowledge on the noise levels (the
variance of the label y (i) conditioned on x(i) ) of all the examples, then we should use
weighted least squares with weights depending on the variances.
Answer:
(b) [10 points] Coding problem.
We will now consider the following dataset (the formatting matches that of Datasets 1-4,
except x(i) is 1-dimensional):
src/lwr/{train,valid,test}.csv
In src/lwr/lwr.py, implement locally weighted linear regression using the normal equations
you derived in Part (a) and using
w(i) = exp( −∥x(i) − x∥_2^2 / (2τ^2) ).
This is a fairly standard choice for weights, where the weight w(i) depends on the particular
point x at which we’re trying to evaluate y: if |x(i) − x| is small, then w(i) is close to 1;
if |x(i) − x| is large, then w(i) is close to 0. Here, τ is the bandwidth parameter, and it
controls how quickly the weight of a training example falls off with distance of its x(i) from
the query point x.
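For illustration, a single LWR prediction could be sketched as follows in numpy, assuming the weighted normal equation from part (a) takes the familiar form X⊤WXθ = X⊤Wy. This is only a sketch under that assumption, not the starter code.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict y at one query point with locally weighted linear regression.

    X: (n, d) training inputs (with an intercept column if desired),
    y: (n,) training targets, x_query: (d,) query point, tau: bandwidth.
    """
    # w^(i) = exp(-||x^(i) - x||_2^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Solve X^T W X theta = X^T W y (weighted normal equation).
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```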
Train your model on the train split using τ = 0.5, then run your model on the valid
split and report the mean squared error (MSE). Finally, plot your model's predictions on
the validation set (plot the training set with blue 'x' markers and the predictions on the
validation set with red 'o' markers). Does the model seem to be under- or overfitting?
Answer:
(c) [5 points] Coding problem.
We will now tune the hyperparameter τ . In src/lwr/tau.py, find the MSE value of your
model on the validation set for each of the values of τ specified in the code. For each τ , plot
your model’s predictions on the validation set in the format described in part (b). Report
the value of τ which achieves the lowest MSE on the valid split, and finally report the
MSE on the test split using this τ -value.
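The selection loop itself is short; below is an illustrative sketch that assumes a callable mse_for_tau(τ) wrapping your model's fit-and-evaluate step. The helper name is hypothetical, not the starter code's interface.

```python
def select_tau(mse_for_tau, taus):
    """Return (best_tau, mses), where best_tau achieves the lowest validation MSE.

    mse_for_tau: callable mapping a bandwidth tau to the validation MSE
                 (assumed to wrap the LWR fit/evaluate step).
    taus: iterable of candidate bandwidths.
    """
    mses = {tau: mse_for_tau(tau) for tau in taus}
    best_tau = min(mses, key=mses.get)
    return best_tau, mses
```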
Answer:
Let x̂ ∈ R4 be a shorthand for ϕ(x), and let x̂(i) ≜ ϕ(x(i)) be the transformed input in
the training dataset. We construct a new dataset {(ϕ(x(i)), y(i))}_{i=1}^{n} = {(x̂(i), y(i))}_{i=1}^{n} by
replacing the original inputs x(i)'s by x̂(i)'s. We see that fitting hθ(x) = θ3 x^3 + θ2 x^2 + θ1 x +
θ0 to the old dataset is equivalent to fitting a linear function hθ(x̂) = θ3 x̂3 + θ2 x̂2 + θ1 x̂1 + θ0
to the new dataset because
θ3 x^3 + θ2 x^2 + θ1 x + θ0 = θ3 x̂3 + θ2 x̂2 + θ1 x̂1 + θ0   whenever x̂ = ϕ(x).
In other words, we can use linear regression on the new dataset to find parameters θ0 , . . . , θ3 .
Please write down 1) the objective function J(θ) of the linear regression problem on the
new dataset {(x̂(i) , y (i) )}ni=1 and 2) the update rule of the batch gradient descent algorithm
for linear regression on the dataset {(x̂(i) , y (i) )}ni=1 .
Terminology: In machine learning, ϕ is often called the feature map which maps the original
input x to a new set of variables. To distinguish between these two sets of variables, we will
call x the input attributes, and call ϕ(x) the features. (Unfortunately, different authors
use different terms to describe these two things. In this course, we will do our best to follow
the above convention consistently.)
Answer:
(b) [5 points] Coding question: degree-3 polynomial regression
For this sub-question, we will use the dataset provided in the following files:
src/featuremaps/{train,valid,test}.csv
Each file contains two columns: x and y. In the terminology described in the introduction,
x is the attribute (in this case one dimensional) and y is the output label.
Using the formulation of the previous sub-question, implement linear regression with
normal equations using the feature map of degree-3 polynomials. Use the starter code
provided in src/featuremaps/featuremap.py to implement the algorithm.
Create a scatter plot of the training data, and plot the learnt hypothesis as a smooth curve
over it. Submit the plot in the writeup as the solution for this problem.
Remark: Suppose X̂ is the design matrix of the transformed dataset. You may sometimes
encounter a non-invertible matrix X̂⊤X̂. For a numerically stable code implementation,
always use np.linalg.solve to obtain the parameters directly, rather than explicitly cal-
culating the inverse and then multiplying it with X̂⊤y.
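For illustration, the degree-3 fit via the normal equations might be sketched as follows, assuming 1-D input and label arrays x and y; the function names are illustrative, not the starter code's interface.

```python
import numpy as np

def poly_features(x, k=3):
    # Map each scalar x to [1, x, x^2, ..., x^k] (degree-k feature map).
    return np.stack([x ** j for j in range(k + 1)], axis=1)

def fit_normal_eq(x, y, k=3):
    # Solve X_hat^T X_hat theta = X_hat^T y with np.linalg.solve,
    # rather than forming an explicit inverse.
    X_hat = poly_features(x, k)
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```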
Answer:
(c) [5 points] Coding question: degree-k polynomial regression
Now we extend the idea above to degree-k polynomials by considering ϕ : R → Rk+1 to be
ϕ(x) = [1, x, x^2, . . . , x^k]⊤ ∈ Rk+1.    (3)
Follow the same procedure as the previous sub-question, and implement the algorithm with
k = 3, 5, 10, 20. Create a similar plot as in the previous sub-question, and include the
hypothesis curves for each value of k with a different color. Include a legend in the plot to
indicate which color is for which value of k.
Submit the plot in the writeup as the solution for this sub-problem. Observe how the fitting
of the training dataset changes as k increases. Briefly comment on your observations in the
plot. Answer:
(d) [5 points] Coding question: other feature maps
You may have observed that it requires a relatively high degree k to fit the given training
data, and this is because the dataset cannot be explained (i.e., approximated) very well
by low-degree polynomials. By visualizing the data, you may have realized that y can
be approximated well by a sine wave. In fact, we generated the data by sampling from
y = sin(x) + ξ, where ξ is noise with Gaussian distribution. Please update the feature map
ϕ to include a sine transformation as follows:
ϕ(x) = [1, x, x^2, . . . , x^k, sin(x)]⊤ ∈ Rk+2.    (4)
With the updated feature map, train different models for values of k = 0, 1, 2, 3, 5, 10, 20,
and plot the resulting hypothesis curves over the data as before.
Submit the plot as a solution to this sub-problem. Compare the fitted models with the
previous sub-question, and briefly comment about noticeable differences in the fit with this
feature map. Answer:
(e) [5 points] Overfitting with expressive models and small data
For the rest of the problem, we will consider a small dataset (a random subset of the dataset
you have been using so far) with much fewer examples, provided in the following file:
src/featuremaps/small.csv
We will be exploring what happens when the number of features starts becoming larger
than the number of examples in the training set. Run your algorithm on this small dataset
using the following feature map
ϕ(x) = [1, x, x^2, . . . , x^k]⊤ ∈ Rk+1.    (5)
(a) [3 points] Suppose we have a dataset {(x(i) , y (i) ); i = 1, · · · , n} where x(i) ∈ Rd and y (i) ∈ R
for all 1 ≤ i ≤ n. We assume the dataset is generated by a linear model without noise. That
is, there is a vector β ⋆ ∈ Rd such that y (i) = (β ⋆ )⊤ x(i) for all 1 ≤ i ≤ n. Let X ∈ Rn×d be
the matrix representing the inputs (i.e., the i-th row of X corresponds to x(i)) and ⃗y ∈ Rn
the vector representing the labels (i.e., the i-th entry of ⃗y corresponds to y(i)):
X = [ x(1)  x(2)  · · ·  x(n) ]⊤ ∈ Rn×d,    ⃗y = [ y(1)  y(2)  · · ·  y(n) ]⊤ ∈ Rn.
Then in matrix form, we can write ⃗y = Xβ ⋆ . We assume that the number of examples is
less than the number of parameters (that is, n < d).
We use the least-squares cost function to train a linear model:
J(β) = (1/(2n)) ∥Xβ − ⃗y∥_2^2.    (6)
In this sub-question, we characterize the family of global minimizers to Eq. (6). We assume
that XX ⊤ ∈ Rn×n is an invertible matrix. Prove that β achieves zero cost in Eq. (6) if
and only if
β = X ⊤ (XX ⊤ )−1 ⃗y + ζ (7)
for some ζ in the subspace orthogonal to all the data (that is, for some ζ such that ζ ⊤ x(i) =
0, ∀1 ≤ i ≤ n.)
Note that this implies that there is an infinite number of β’s such that Eq. (6) is minimized.
We also note that X ⊤ (XX ⊤ )−1 is the pseudo-inverse of X, but you don’t necessarily need
this fact for the proof.
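The particular solution X⊤(XX⊤)^{−1}⃗y can be computed without forming an explicit inverse. Below is a minimal numpy sketch; the function name is illustrative.

```python
import numpy as np

def min_norm_solution(X, y):
    # rho = X^T (X X^T)^{-1} y, computed by solving (X X^T) v = y
    # rather than inverting X X^T explicitly (XX^T is assumed invertible).
    v = np.linalg.solve(X @ X.T, y)
    return X.T @ v
```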
Answer:
(b) [3 points] We still work with the setup of part (a). Among the infinitely many optimal
solutions of Eq. (6), we consider the minimum norm solution. Let ρ = X ⊤ (XX ⊤ )−1 ⃗y . In
the setting of (a), prove that for any β such that J(β) = 0, ∥ρ∥2 ≤ ∥β∥2 . In other words,
ρ is the minimum norm solution.
Hint: As an intermediate step, you can prove that for any β in the form of Eq. (7),
∥β∥_2^2 = ∥ρ∥_2^2 + ∥ζ∥_2^2.
Answer:
(c) [5 points] Coding question: minimum norm solution generalizes well
For this sub-question, we still work with the setup of parts (a) and (b). We use the following
datasets:
src/implicitreg/ds1_train.csv,ds1_valid.csv
Each file contains d + 1 columns. The first d columns in the i-th row represent x(i), and
the last column represents y(i). In this sub-question, we use d = 200 and n = 40.
Using the formula in sub-question (b), compute the minimum norm solution using the
training dataset. Then, generate three other solutions with zero cost and different
norms using the formula in sub-question (a). The starter code is in src/implicitreg/
linear.py. Plot the validation error of these solutions (including the minimum norm
solution) in a scatter plot. Use the norm of the solutions as x-axis, and the validation
error as y-axis. For your convenience, the plotting function is provided as the method
generate_plot in the starter code. Your plot is expected to demonstrate that the mini-
mum norm solution generalizes well.
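One possible way to build additional zero-cost solutions of the form in Eq. (7) is to add a component orthogonal to every training input, e.g. by projecting a random vector onto the orthogonal complement of the row space of X. The sketch below is illustrative only; the helper name and the random construction of ζ are not required by the problem.

```python
import numpy as np

def zero_cost_solution(X, y, scale=1.0, seed=0):
    """Return rho + scale * zeta, where zeta is orthogonal to all rows of X (cf. Eq. (7))."""
    rng = np.random.default_rng(seed)
    rho = X.T @ np.linalg.solve(X @ X.T, y)          # minimum norm solution
    z = rng.standard_normal(X.shape[1])
    # Project z onto the subspace orthogonal to the rows of X:
    # zeta = z - X^T (X X^T)^{-1} X z, so that X @ zeta = 0.
    zeta = z - X.T @ np.linalg.solve(X @ X.T, X @ z)
    return rho + scale * zeta
```

Varying scale (and seed) gives solutions with different norms but the same zero training cost.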
Answer:
(d) [5 points] For this sub-question, we work with the setup of parts (a) and (b). In this
sub-question, you will prove that the gradient descent algorithm with zero initialization always
converges to the minimum norm solution. Let β (t) be the parameters found by the GD
algorithm at time step t. Recall that at step t, the gradient descent algorithm updates the
parameters in the following way:
β(t) = β(t−1) − η∇J(β(t−1)) = β(t−1) − (η/n) X⊤(Xβ(t−1) − ⃗y).    (8)
As in sub-question (a), we also assume XX⊤ is an invertible matrix. Prove that if the
GD algorithm with zero initialization converges to a solution β̂ satisfying J(β̂) = 0, then
β̂ = X⊤(XX⊤)^{−1}⃗y = ρ, that is, β̂ is the minimum norm solution.
Hint: As a first step, you can prove by induction that if we start with zero initialization,
β (t) will always be a linear combination of {x(1) , x(2) , · · · , x(n) } for any t ≥ 0. Then, for
any t ≥ 0, you can write β (t) = X ⊤ v (t) for some v (t) ∈ Rn . As a second step, you can prove
that if β̂ = X ⊤ v (t) for some v (t) and J(β̂) = 0, then we have β̂ = ρ.
You don’t necessarily have to follow the steps in this hint. But if you use the hint, you need
to prove the statements in the hint.
Answer:
(e) [3 points] In the following sub-questions, we consider a slightly more complicated model
called quadratically parameterized model. A quadratically parameterized model has two
sets of parameters θ, ϕ ∈ Rd . Given a d-dimensional input x ∈ Rd , the output of the model
is
fθ,ϕ(x) = Σ_{k=1}^{d} θ_k^2 x_k − Σ_{k=1}^{d} ϕ_k^2 x_k.    (9)
Note that fθ,ϕ(x) is linear in its input x, but non-linear in its parameters θ, ϕ. Thus, if the
goal were only to learn the function, one could simply re-parameterize it with a linear model
and use linear regression. However, here we insist on using the parameterization above in
Eq. (9) in order to study the implicit regularization effect in models that are nonlinear in
the parameters.
Notation: To simplify the equations, we define the following notation. For a vector v ∈ Rd,
let v⊙2 be its element-wise square (that is, v⊙2 is the vector [v_1^2, v_2^2, · · · , v_d^2] ∈ Rd). For two
vectors v, w ∈ Rd, let v ⊙ w be their element-wise product (that is, v ⊙ w is the vector
[v_1 w_1, v_2 w_2, · · · , v_d w_d] ∈ Rd). Then our model can be written as
fθ,ϕ(x) = x⊤(θ⊙2 − ϕ⊙2).    (10)
Suppose we have a dataset {(x(i) , y (i) ); i = 1, · · · , n} where x(i) ∈ Rd and y (i) ∈ R for all
1 ≤ i ≤ n, and
y (i) = (x(i) )⊤ ((θ⋆ )⊙2 − (ϕ⋆ )⊙2 )
for some θ⋆ , ϕ⋆ ∈ Rd . Similarly, we use X ∈ Rn×d and ⃗y ∈ Rn to denote the matrix/vector
representing the inputs/labels respectively:
X = [ x(1)  x(2)  · · ·  x(n) ]⊤ ∈ Rn×d,    ⃗y = [ y(1)  y(2)  · · ·  y(n) ]⊤ ∈ Rn.
Let J(θ, ϕ) = (1/(4n)) Σ_{i=1}^{n} (fθ,ϕ(x(i)) − y(i))^2 be the cost function.
First, when n < d and XX⊤ is invertible, prove that there exist infinitely many optimal
solutions with zero cost.
Hint: Find a mapping between the parameter β in the linear model and the parameters θ, ϕ in
the quadratically parameterized model. Then use the conclusion in sub-question (a).
Answer:
(f) [10 points] Coding question: implicit regularization of initialization
We still work with the setup in part (e). For this sub-question, we use the following datasets:
src/implicitreg/ds2_train.csv,ds2_valid.csv
Each file contains d + 1 columns. The first d columns in the i-th row represent x(i), and
the last column represents y(i). In this sub-question, we use d = 200 and n = 40.
First of all, the gradient of the loss has the following form:
∇θ J(θ, ϕ) = (1/n) Σ_{i=1}^{n} ((x(i))⊤(θ⊙2 − ϕ⊙2) − y(i)) (θ ⊙ x(i)),    (11)
∇ϕ J(θ, ϕ) = −(1/n) Σ_{i=1}^{n} ((x(i))⊤(θ⊙2 − ϕ⊙2) − y(i)) (ϕ ⊙ x(i)).    (12)
You don’t need to prove these two equations. They can be verified directly using the chain
rule.
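Translated directly into numpy, Eqs. (11) and (12), together with one GD step, might look like the following sketch (array shapes assumed: X is n×d, y has length n, θ and ϕ have length d; this is illustrative, not the starter code).

```python
import numpy as np

def qp_gradients(theta, phi, X, y):
    """Gradients of J(theta, phi) as given in Eqs. (11) and (12)."""
    n = X.shape[0]
    residual = X @ (theta ** 2 - phi ** 2) - y      # (x^(i))^T(theta^2 - phi^2) - y^(i)
    grad_theta = (X * theta).T @ residual / n       # (1/n) sum_i r_i * (theta ⊙ x^(i))
    grad_phi = -(X * phi).T @ residual / n          # -(1/n) sum_i r_i * (phi ⊙ x^(i))
    return grad_theta, grad_phi

def gd_step(theta, phi, X, y, lr=0.08):
    g_theta, g_phi = qp_gradients(theta, phi, X, y)
    return theta - lr * g_theta, phi - lr * g_phi
```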
Using the formula above, run gradient descent with initialization θ = α1, ϕ = α1 with
α ∈ {0.1, 0.03, 0.01} (where 1 = [1, 1, · · · , 1] ∈ Rd is the all-1’s vector) and learning rate
0.08. We provide the starter code in src/implicitreg/qp.py. Plot the curve of training
error and validation error with different α. Use the number of gradient steps as x-axis,
and training/validation error as y-axis. Include your plot in the writeup and answer the
following two questions based on your plot: which models can fit the training set? Which
initialization achieves the best validation error?
Remark: Your plot is expected to demonstrate that the initialization plays an important
role in the generalization performance—different initialization can lead to different global
minimizers with different generalization performance. In other words, the initialization has
an implicit regularization effect.
Answer:
(g) [6 points] Coding question: implicit regularization of batch size
We still work with the setup in part (e). For this sub-question, we use the same dataset and
starter code as in sub-question (f). We will show that the noise in the training process also
induces implicit regularization. In particular, the noise introduced by stochastic gradient
descent in this case helps generalization. Implement the SGD algorithm, and plot the
training and validation errors with batch size {1, 5, 40}, learning rate 0.08, and initialization
α = 0.1. Similarly, use the number of gradient steps as x-axis, and training/validation error
as y-axis. For simplicity, the code for selecting a batch of examples is already provided in
the starter code. Compare the results with those in sub-question (f) with the same
initialization. Does SGD find a better solution?
Your plot is expected to show that the stochasticity in the training process is also an
important factor in the generalization performance — in our setting, SGD finds a solution
that generalizes better. In fact, a conjecture is that stochasticity in the optimization process
(such as the noise introduced by a small batch size) helps the optimizer to find a solution
that generalizes better. This conjecture can be proved in some simplified cases, such as the
quadratically parameterized model in this sub-question (adapted from the paper HaoChen
et al., 2020), and can be observed empirically in many other cases.
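For reference, a minimal SGD loop consistent with Eqs. (11)–(12) might look like the following sketch; the batch sampling and step count here are illustrative choices, not the starter code's batch-selection routine.

```python
import numpy as np

def sgd(X, y, alpha=0.1, lr=0.08, batch_size=5, num_steps=10000, seed=0):
    """Minimal SGD sketch for the quadratically parameterized model."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta, phi = alpha * np.ones(d), alpha * np.ones(d)   # theta = phi = alpha * 1
    for _ in range(num_steps):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        residual = Xb @ (theta ** 2 - phi ** 2) - yb
        grad_theta = (Xb * theta).T @ residual / batch_size
        grad_phi = -(Xb * phi).T @ residual / batch_size
        theta, phi = theta - lr * grad_theta, phi - lr * grad_phi
    return theta, phi
```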
Answer:
X = [ x(1)  x(2)  · · ·  x(n) ]⊤ ∈ Rn×d,    ⃗y = [ y(1)  y(2)  · · ·  y(n) ]⊤ ∈ Rn.
Similarly, we use Xv ∈ Rm×d , ⃗yv ∈ Rm to represent the validation dataset, where m is the size
of the validation dataset. We assume that the data are generated with d = 500.
In this question, we consider regularized linear regression. For a regularization level λ ≥ 0, define
the regularized cost function
Jλ(β) = (1/2) ∥Xβ − ⃗y∥_2^2 + (λ/2) ∥β∥_2^2,
and let β̂λ denote a minimizer of Jλ(β).
(a) [2 points] In this sub-question, we derive the closed-form solution of β̂λ . Prove that when
λ > 0,
β̂λ = (X ⊤ X + λId×d )−1 X ⊤ ⃗y (13)
(recall that Id×d ∈ Rd×d is the identity matrix.)
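For illustration, Eq. (13) translates directly into a single linear solve; a minimal numpy sketch follows (not the starter code's interface).

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Return beta_hat_lambda = (X^T X + lambda * I)^{-1} X^T y for lambda > 0."""
    d = X.shape[1]
    # Solve the regularized normal equation instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```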
Note: λ = 0 is a special case here. When λ = 0, (X⊤X + λId×d) could be singular.
Therefore, there might be more than one solution that minimizes J0(β). In this case, we
define β̂0 = X^+⃗y, where X^+ denotes the pseudo-inverse of X.
For each training dataset (X, ⃗y ), compute the corresponding β̂0 , and evaluate the mean
squared error (MSE) of β̂0 on the validation dataset. The MSE for your estimators β̂ on a
validation dataset (Xv , ⃗yv ) of size m is defined as:
MSE(β̂) = (1/(2m)) ∥Xv β̂ − ⃗yv∥_2^2.
Complete the regression method of src/doubledescent/doubledescent.py which takes
in a training file and a validation file, and computes β̂0 . You can use numpy.linalg.pinv
to compute the pseudo-inverse.
In your writeup, include a line plot of the validation losses. The x-axis is the size of the
training dataset (from 200 to 800); the y-axis is the MSE on the validation dataset. You
should observe that the validation error increases and then decreases as we increase the
sample size.
Note: When n ≈ d, the test MSE could be very large. For better visualization, it is okay
if the test MSE goes outside the plotted range for some points.
Answer:
(c) [5 points] Coding question: double descent phenomenon and the effect of regularization.
In this sub-question, we will show that regularization mitigates the double descent
phenomenon for linear regression. We will use the same datasets as specified in sub-question (b).
Now consider using various regularization strengths. For λ ∈ {0, 1, 5, 10, 50, 250, 500, 1000},
you will compute the minimizer of Jλ (β).
Complete the ridge regression method of src/doubledescent/doubledescent.py which
takes in a training file and a validation file, computes the β̂λ that minimizes the training
objective under different regularization strengths, and returns a list of validation errors (one
for each choice of λ).
In your writeup, include a plot of the validation losses of these models. The x-axis is the size
of the training dataset (from 200 to 800); the y-axis is the MSE on the validation dataset.
Draw one line for each choice of λ connecting the validation errors across different training
dataset sizes. Therefore, the plot should contain 8×13 points and 8 lines connecting them.
You should observe that for some small λ’s, the validation error may increase and then
decrease as we increase the sample size. However, double descent does not occur for a
relatively large λ.
Remark: If you want to learn more about the double descent phenomenon and the effect
of regularization, you can start with the paper Nakkiran et al., 2020.
Answer: