Data Analysis: Statistics & Linear Algebra
Data Analysis: Statistics & Linear Algebra
1 (x − µ)2
f (x) = √ exp( )
2πσ2 2σ2
0.1.3 Regression.1
Definition 11. Regression
Regression is a statistical model to determine the relationship between a dependent variable and one or more independent
variables.
Definition 12. Linear Regression
Linear Regression is a statistical model that estimate the linear relationship between dependent variable and one or more
independent variables.
Definition 13. Data Set
Data set is a collection of data defined as a set D := {(xi , yi )}ni=1 where inputs xi ∈ Rd and target variable yi ∈ R.
Linear Regression aims to design a cost function (L) to find the best parameter w∗ such that w∗ = argw minL(w).
∗
We will use calculus as the first approach to find optimal parameter WOLS .
d
Let cost function be L : R → R which is continuously differentiable,
then any local optimum w∗ satisfies ∇L(w∗ ) = 0, also note that (Xw)T = yT and Xw = y such that
L(w) = ||Xw − y||22 = (Xw − y)T (Xw − y)
= (Xw)T (Xw) − (Xw)T (y) − yT (Xw) + yT y
= wT X T Xw − 2wT X T y + yT y
Using the following results from matrix calculus:
∇ x (aT x) = a
∇ x (xT Ax) = (A + AT )x
Now we will do the partial differentiation as follows
∇L(w) = ∇w (wT X T Xw − 2wT X T y + yT y)
= ∇w (wT X T Xw) − 2∇w (wT X T y)
= 2X T Xw − 2X T y
=0
∗
Since X is full rank then (X T X) is invertible such that WOLS = (X T X)−1 X T y.
∗
Now we will use orthogonal projection as the second approach to find optimal parameter WOLS
Let V be an inner product space and S be a subspace of V, then any v ∈ V can be decomposed uniquely in the form v = vS + v⊥
where vS ∈ S and v⊥ ∈ S ⊥ . Here S ⊥ is the orthogonal complement of S , so S ⊥ 1 S ,
the vector of S ⊥ are all perpendicular to every vector in S .
The orthogonal projection onto S denoted PS , is the linear operator that maps v to vS in the decomposition above.
Note that ||v − P s v|| ≥ ||v − s|| for all s ∈ S ,with eqality if and only if s = P s v. That is P s v = arg s∈S min||v − s||
Now let us considering the case of O.L.S, w∗OLS = argw min||Xw − y||22
Note that the set of vector Xw for some w ∈ Rd is precisely the range of X, which is the subspace of Rn as Rd ⊂ Rn ,so
minz∈ran(X) ||z − y||22 = minw ∈ Rd ||Xw − y||22 . Note that Pran(X) y = Xw∗O LS where w∗ is any optimum for the right-hand side.
The projected point Xw∗OLS is always unique and w∗ is unique too if X is full rank.We now want to solve for w∗OLS ,we need the
following fact: null(X T ) = range(X)⊥
Since we are projecting onto range X, the orthogonality condition for optimality is that
y − Py⊥range(X) and y − Xw ∈ null(X T ).Therefore X T (y − Xw∗OLS ) = 0 such that WOLS ∗
= (X T X)−1 X T y.. ■
6
Remark 1. The calculus approach is suck as we might not know the matrix calculus.
Remark 2. ∇L(w) = 0 only means us find a critical point, it maybe local maximum or local minimum, or a saddle point, so
why we can confirm it is global minimum point???
Due to the convex cost function L.If we want to show its convex then we need to compute the Hessian of L which is the
second partial order matrix. Note that ∇2 L(w) = 2X T X
It shows that this is positive semi-definite as symmetric matrix wT (X T X)w eigenvalues are all nonnegative :
wT (2X T X)w = 2(Xw)T Xw = 2||Xw||22 ≥ 0
Remark 3. Prove P s v = arg s∈S min||v − s||. Then by the Pythagorean Theorem,
||v − s||2 = ||v − P s v + P s v − s||2 = ||v − P s v||2 + ||P s v − s||2 ≥ ||v − P s v||2
T
1 0 0 1 x1 1 0 0
Example 2. For any x1 = 0 x2 = 2 x3 = 0 y = 1 we have X = x2T = 0 2 0
T
0 0 3 1 x3 0 0 3
1
By O.L.S we find out WOLS ∗
= (X T X)−1 X T y = 31
1
3
Remark 4. I’m too lazy to calculate it and I ask AI to do it 3 time still wrong as Xw , y.
Code 1.
Ridge Regression penalize the entries of w from becoming too large. We can do this by adding a penalty term constraining
the norm of w. For a fixed, small scalar λ >0, we now have: min||Xw − y||22 + λ||w||22 then WRIDGE
∗
= (X T X + λI)−1 X T y.
We will chose a large enough hyperparameter λ to prevent the singular value of (X X) being close to zero.
T −1
Proof. ■
Example 3.
1 2 0 3
0 1 1 and y = 2 .In the First Iteration, w0 = 0 and r0 = y
2 0 1 2
Note that < r0 , x1 >= 7 and < r0 , x2 >= 8 as well as < r0 , x3 >= 4. Since = 8 is the maximum projection so we select i = 2 and
device ||x2 ||2 = 22 + 12 + 02 = 5 so that <r||x0 ,x2 ||2 > = 58 . The weight will be update as W21 = W00 + 85 = 85
−1
5
Then we will update the residual r1 = r0 − W2 x2 = r0 − <r||x0 2,x||22> x2 =⇒ r1 = 25
2
Code 2.
We can devise some new function ϕ : Rl → Rd , called a feature map, that maps each raw data point x ∈ Rl into a vector
of feature ϕ(x).The hypothesis function is hw (x) = dj=1 w j ϕ j (x) = wT ϕ(x).Then the model is still linear with respect to the
P
feature.The component function ϕ j are called basis function, we can still use least-square t estimate the weight w. Therefore
we replace the original data matrix X ∈ Rn×l by ϕ ∈ Rn×d , which has ϕ(xi )T as its ith row:
minw ||ϕw − y||22 .
Example 4.
Ideas 6. Hyperprametal Finetuning
We should do a lot of trial-and-error to find the best hyperparameter λ. When we do validation training , we will try a lot of
hyerparameter λ and then training, return the parameter w∗ . If the validation is bad and get a lot of errors, then do it again. If
not ,then return the best hyperparameter and parameter to do the test. Therefore we need to do a lot of validation.
Definition 16. K-Fold Cross Validation
We need to separate the validation data and training data.Cross-Validation is an alternative to having a dedicated validation
set.
(a) Shuffle the data and partition it into k equally-sized blocks.
(b) For i = 1, ..., k, train the model on all data except block i and evulate the model using block i
(c) Average the k validation errors, this is our final estimate of the true error.
8
0.1.4 Regression.2
Ideas 7. Is l2 norm a good choice to estimate the model
We use l2 norm to measure the error of our predictions and to penalize the model parameter, as it is a great design chice in
statistical interpretations of regression, such as Gaussians, M.L.E and M.A.P.
Definition 17. Maximum Likelihood Estimation (M.L.E)
M.L.E aims to find the hypothesis model that maximizes the probability of the data.
Theorem 5. O.L.S satisfy M.L.E
The Maximum Likelihood Estimation is exactly same as Ordinary Least Square.
Proof. If we parameterize the set of hypothesis models with θ, we can express the problem as θ MLE = argθ maxL(θ; D) =
p(data=D|true model = hθ ).The quantity L(θ) that we are maximizing is also known as the likelihood. Considering log we still
working the same problem , because logarithms are monotonic functions. We have P(A) < P(B) ⇐⇒ logP(A) < logP(B). Let
decompose the log likelihood :
l(θ; X, y) = log p(y1 , ..., yn |x1 , ..., xn , θ) = log( ni=1 p(yi |xi , θ)) = ni=1 log[p(yi |xi , θ)]
Q P
Let p(yi |xi , θ) from a Gaussian Yi |θ ≈ N(hθ (xi ), σ2 ).
2 √
Then we will have θ MLE = argθ maxl(θ; X, y) = argmaxθ − ( ni=1 (yi −h2σθ (x2 i )) ) − n log( 2πσ) = argθ min ni=1 (yi − xiT θ)2
P P
Therefore the M.L.S is just O.L.S problem! ■
Example 5.
Code 3.
Proof. Using the Bayes Rules, we have θ MAP = argθ maxP(true model= hθ |data=D)
= argθ max P(data=D|truemodel=h θ )P(truemodel=hθ )
P(data=D) = argθ max log P(data = D|truemodel = htheta ) + log P(truemodel = hθ )
= argθ min − logP(data = P|truemodel = hθ ) − logP(truemodel = hθ )
then θ ≈ N(θ, σ2 ) such that θ MAX = argθ min( ni=1 (yi − hθ (xi )))2 + σσ2 2j=1 θ2
P 2 P
h
This is just the Ridge Regression! ■
Code 4.
Example 6.
Considering M.L.E, it will choose the parameter θ containing maximum likelihood but it might have high variance.A slightly
different data set will alter the predicted model
M.A.P aim to balance the variance and likelihood. If the model weight should be small then the prior will centered at zero
with small variance, M.A.P will choose the parameter maximizes the posterior probability
0.1. DATA ANALYSIS 9
Working Principal:
(a)Initialization: Start with w0 = 0 and compute the initial residual r0 = y
(b)At each iteration, select a feature that best reduces the residual error.
Update the weight vector for that feature.
Update the residual to reflect this change.
(c)Stopping Criterion: Continue until the number of non-zero elements in w reaches the specified limit k.
Example 7. Let us consider a dataset with three features (columns) and a target variable y. Your goal is to approximate y
using a linear combination of these features while keeping the model sparse (using at most one feature).
1 2 0 3
Let X = 0 1 1 and y = 2 .In the First Iteration, w0 = 0 and r0 = y
2 0 1 2
Note that < r0 , x1 >= 7 and < r0 , x2 >= 8 as well as < r0 , x3 >= 4. Since = 8 is the maximum projection so we select i = 2 and
device ||x2 ||2 = 22 + 12 + 02 = 5 so that <r||x0 ,x2 ||2 > = 58 . The weight will be update as W21 = W00 + 85 = 58
−1
5
Then we will update the residual r1 = r0 − W2 X = r0 − <r||x0 2,x||22> X =⇒ r1 = 25 Now we have w = (0, 85 , 0) and r1 = ( −1
5 , 5 , 2)
2
2
only.
You would repeat the process:
Compute the new inner products with the updated residual r1
Select the feature that maximizes the inner product.
Update the corresponding weight and the residual again.
Continue this process until the sparsity condition is no longer met (i.e., you have selected k features).
10
Let X ∈ Rn×d be a high dimensional data set and v as a projection vector in low dimension, such that the distance between
X and v is:
Note that Pv (xi ) = (xiT v)v
min ni=1 ||xi − Pv (xi )||2
P
=min ni=1 (||xi ||2 − ||xi v||2 )
P
=max ni=1 ||xi v||2
P
=max||Xv||2
=max(vT X T Xv)
=λmax (X T X)
Do it inductively then X T Xvk+1 = λi vk+1 .
Example 8. Principal Component Analysis (P.C.A)
3 2 2
Given 2 3 −2 and x1 , x2 , x3 as three transpose column vectors of X.
1 1 0
(a) Generating the data with mean zero:
2 2 0
x = x1 +x32 +x3 = (2, 2, 0) = 2 2 0 and then X = X − x
2 2 0
2 1 2
(b)X T X = 1 2 −2 and det(λI − X T X) = λ(λ − 3)(λ − 9) =⇒ λ1 = 0, λ2 = 3, λ3 = 9
2 −2 8
Now let find out vi through (X T X − λI)vi = 0
Since v1 (X T X − 0I) = 0 =⇒ v1 =0 √
− 2
√2
Also v2 (X X − 3I) = 0 =⇒ v2 = 2
T
2
0
√2
−√6
and v3 (X T X − 9I) = 0 =⇒ v3 = 62
√
6
3 !
3 √0
(c) Now we have the S.V.D such that = and V = (v3 , v2 ) as we pick the maximum single value first
P
0 3
2 √ √
− 2 − 66
√2 √
After calculate out U = 2 −√66
6
0 3
0.1. DATA ANALYSIS 11
Proof. ■
Code 6.
Example 9.
12
σ(∗) : Rd → Rk , σ(x) = wT x + b
(where w ∈ Rd is the parameters we want to find out,x ∈ Rd is the data input, b ∈ R... b is bias )
Definition 21. Activation Function
An activation function is a non-linear mathematical function applied to the output of a neuron, it calculates the output of the
node based on its individual inputs and their weight.There are different kind of activation function such as sigmoid,RELU......
The following is the general look of activation function:
f : Rk → R : f (x) = σ(wT x + b)
g:neuron1} Figure 1:
0.1. DATA ANALYSIS 13
Prof Fan ask you a question:If we connect more neuron to become a network then it will be more powerful. How to ar-
range those neuron into networking?G2,G3,G4 which one is the best graph to connect network?
It is actually a OPEN QUESTION. Selecting Network Topology is open question to make best network in topology, maybe you
will want to study Graph Theory and Topology.
Figure 2: {fig:neuro
Geoffrey Hinton believe that DEEP is better so he will choose G4. However we need to consider the wide too. If you
need to train your model , how to make a equation calculate effectivity in training? You should do some research in ”Network
Topology”
(Actually Prof Fan invites one of the researcher of Network Topology to have a talk but I didn t attend...)
14
Observer that the objective function is a sum of squared residuals as we ’ve seen before the cost function L(w), that f is
nonlinear. Therefore this method is called nonlinear least square. Hence we need to solve the optimization problem below
(yi − f (xi ,w))2
minw L(w) = argw min
Pn
i=1 2
We can solve it by finding all critical point s and choose the minimum point. From first order optimality condition, the gradient
of the objective function at any minimum must be zero such that:
J is the Joacobian of F , f is linear in w, the gradient ∇w L(w) only depends on w in F(w) because the term ∇w f (xi ; w)
only depend on xi :
X T (y − Xw) = 0
X T y − X T Xw = 0
w = (X T X)−1 X T y
In general , f is nonlinear in w. However we may not derive a closed form solution for w.
f might not be convex so we cannot have additional assumption on f .
w might not be the global minima, might be saddle point , local minimum, or local maximum.
We need to approach the w with no closed form-solution!!!Therefore we have Optimization method–Gradient Descent.
Theorem 10. Critical Point and Gradient
If w∗ is a one of the critical points of f and f is continuously differentiable in a neighborhood of w∗ then ∇ f (w∗ ) = 0
0.1. DATA ANALYSIS 15
Figure 3: {fig:gradi
In optimization, we want to find global minimum of a function. We might find local minimum, local maximum, saddle point in
the process. We also need to consider the boundary point and non-differentiable point to find out global minimum. Therefore
we have to consider all critical points in our analysis. Therefore we need to find the set of all critical points which is simply the
set of points at which the gradient is zero.
∂f
∂w1 0
∂ f 0
∇ f = ∂w2 =
... ...
∂f 0
∂wd
Rather then finding the closed form solution , we want to find a method to creep toward to a local minimum. Therefore we have
the algorithm Gradient Descent, which is the key algorithm of Data Analysis and Machine Learning.
Theorem 11. Gradient Descent
Gradient Descent is an algorithm that iteratively takes small steps in the direction of steepest descent of the objective .
Imagine there is a computer or robot call CTP, he want to walk to the bottom of the mountain, what should he do? he need to
find out the steepest descent of the mountain so he can arrive the bottom faster.Let scaling αt depend on f so we can determine
the adaptive stepsize.
1.Backpropagation algorithm,
2.Feedforward neural network.
3. Construct a 2nd layer Neural Network,
4.Gaussian Process,
5.2018 research paper,
6.Neural Network Compression,
7.Signoid,
8.RELU Network express any C n function,
9.Hopfield Network,Energy Function.
Figure 4: {fig:neuro
σ(∗) : Rd → Rk , σ(x) = wT x + b
(where w ∈ Rd is the parameters we want to find out,x ∈ Rd is the data input, b ∈ R... b is bias )
Definition 23. Activation Function
An activation function is a non-linear mathematical function applied to the output of a neuron, it calculates the output of the
node based on its individual inputs and their weight.There are different kind of activation function such as sigmoid,RELU......
The following is the general look of activation function:
f : Rk → R : f (x) = σ(wT x + b)
18
Computation flows left to right , circles are nodes a.k.a neurons or units based on actual neurons in our brain.
Note that neurons are generated as layer, first one is called input layer, second one is called hidden layer, third one is called
output layer. Obviously, there can be tremendous layers in hidden part as hidden layer are hyperparameters to be chosen by
network designer. The dimensionality of input and output layer is determined by the function we want the network compute.
Every node in that layer is connected to every node in the previous layer, layer are fully connected!
Each edge in the graph has an associated weight, which is the strength of the connection from the input node in one layer
to the node in the next layer. Each node computes a weighted sum of its input, with these connection strengths being the
weights and then applies a nonlinear function which is variously referred to as the activation function or the nonlinearity.
Let wi denotes the weights and σi denotes the activation function of node i,
it computes the function x → σi (wTi , x)
Since the output of each layer is passed as input to the next layer, the function represented by the entire network can be written as
Note that in most layer, the activation function are same for each node with that layer, so it is ”the” scalar function σℓ : R → R
as the nonlinearity for layer ℓ, and apply it element wise
σℓ (x1 )
σℓ (x) = ...
σℓ (xn ℓ)
The principle exception here is the softmax function σ : Rk → R defined by
e xi
σ(x)i = Pk
ex j
which is often used to produce a discrete probability distribution over k classes.
j=1
Note that every entry of the softmax output depends on every entry of the input. More positive xi leads to larger σ(x)i . The
softmax function is used most commonly.
Information 8. Expressive Power
Expressive power is the repeated combination of nonlinearities that gives deep neural network their remarkable expressive
power. If we remove the activation function then x → WL WL−1 ...W2 W1 x = Wx
which is linear in its input! Moreover, the size of the smallest layer restricts the rank of W as
rank(W) ≤ minℓ∈{1,...,L} rank(Wℓ ) ≤ minℓ∈{0,...,L} nℓ
We would like to produce a function approximators. This means that given any continuous function , we can choose a net-
work in this class such that the output of the circuit can be made arbitrary close to the output of the given function for all given
inputs.
The piecewise-constant function are universal function approximateors, the activation is the step function:
σ(x) = 1 if x ≤ 0
σ(x) = 0 if x < 0
It turns out that that only one hidden layer h(x) is need for universal approximation
h(x) = nj=1 c j σ(a j + b j x)
P
where k is the number of hidden units.
0.1. DATA ANALYSIS 19
Now we design the node or neurons in hidden layer, in other work we simulate the nonlinearity (activation function) and
denote it as σ(x), sometime we use h(x) instead of σ(x) to point out it is in the hidden layer.
We connect 1 st layer’s node and 2 nd layer’s node with line. The connection contains the weight (strengths) of every node in
previous layer.The connection is a preactivation function ε(x) which represent the function of neuron– take the sum of input
multiplicate with weight and add bias
1 + 1
ε(x) =
Pk
i=1 (Wk xi bn )
Note that the upper index represents the number of layer, the lower index represents the number of element.
Also all layer are fully connected.
Then we can calculate the activation function or nonlinearity σ(x). We would like to change the input of activation func-
tion as σ as it avoids ambitious and be more precise, denote activation function as α(x). That is :
I know I didn t formulate the activation function σ(x), as there are many choice of activation function such as RELU func-
tion RELU(x) = max(0, x) or sigmoid function sigmoid(x) = 1+e1 −x ......
d-forward} Figure 5:
0.1. DATA ANALYSIS 21
Figure 6: {fig:deepu
22
∂L T
2.Compute δ[r] = ∂z[r]
= (z[r] − o) where o = W [r] a[r−1] + b[r]
3. For k = r − 1 to 1 do
{
4. Compute δ[k] = ∂z∂L[k] = (W [k+1] δ[k+1] )ReLU ′ (z j )
T
∂L
5.Compute ∂b[k+1]
= δ[k+1]
}
z = wT x + b
o = hw = ReLU(z)
L = 12 (y − o)2
∂L ∂L ∂o ∂L ∂o ∂z
By chain rule we have ∂wi = ∂o ∂wi = ∂o ∂z ∂wi = (o − y)ReLU ′ (z)xi
∂L ∂L ∂o ∂z
The key observation is that we reduce ∂w into ∂o , ∂z , ∂w
∀ j ∈ [1, ..., m]
T
[1]T
z j = w[1]
j x + b j where w j , b j ∈ R
[1] [1] d
a j = ReLU(z j )
a = [a1 ,T...am ]T
o = w[2]
j a + bj
[2]
L = 12 (y − o)2
∂L ∂L ∂z j ∂L ∂L ∂a j ∂L ∂L ∂o
∂w j = ∂z j ∂(w[1] )ℓ = ∂z j xℓ = ∂a j ∂z j xℓ = ′
∂a j ReLU (z j )xℓ = ′
∂o ∂a j ReLU (z j )xℓ = (o − y)(w[2] ) j ReLU ′ (z j )xi for j=1
j
∂L
∂w2 = (o − y)ReLU ′ (z j )xℓ
∂L T
∂W [k+1]
= δ[k+1] a[k]
∂L
∂b[k+1]
= δ[k+1]
24
a(1) = σ(z(1) )
- Output Layer:
z(2) = W(2) a(1) + b(2)
ŷ = σ(z(2) )
Example values for forward propagation are:
" # " #
0.1 0.2 0.1
W(1) = , b(1) =
0.3 0.4 0.1
h i
W(2) = 0.5 0.6 , b(2) = 0.2
" #
0.6
x=
0.9
Calculating forward propagation: 1. Hidden Layer:
σ(0.74)
" # " # " #
0.74 0.676
a(1) = σ( )= ≈
0.98 σ(0.98) 0.727
2. Output Layer:
h i "0.676#
z(2) = 0.5 0.6 + 0.2 = 0.5 · 0.676 + 0.6 · 0.727 + 0.2 ≈ 0.706
0.727
ŷ = σ(0.706) ≈ 0.669
The loss function is defined as:
1
L= (ŷ − y)2
2
Assuming the true label y = 1:
1 1
L= (0.669 − 1)2 = (−0.331)2 ≈ 0.0548
2 2
Calculating the gradients for backpropagation: For the output layer:
∂L
= ŷ − y = 0.669 − 1 = −0.331
∂z(2)
0.1. DATA ANALYSIS 25
∂L ∂L
" # " #
0.676 −0.224
= · a(1)
= −0.331 · ≈
∂W(2) ∂z(2) 0.727 −0.241
∂L ∂L
= (2) = −0.331
∂b(2) ∂z
For the hidden layer:
∂L ∂L
" # " #
0.5 −0.1655
= · W = −0.331 ·
(2)
=
∂a(1) ∂z(2) 0.6 −0.1986
The sigmoid derivative is:
σ′ (z) = σ(z)(1 − σ(z))
Calculating the derivative for the hidden layer:
σ(0.74)(1 − σ(0.74))
" # " # " #
0.676(1 − 0.676) 0.219
σ′ (z(1) ) ≈ ≈ ≈
σ(0.98)(1 − σ(0.98)) 0.727(1 − 0.727) 0.198
∂L ∂L
" # " # " #
−0.1655 0.219 −0.0363
= · σ′ (1)
(z ) = ⊙ ≈
∂z(1) ∂a(1) −0.1986 0.198 −0.0393
Calculating gradients for weights and biases in the hidden layer:
∂L ∂L
" # " # " #
−0.0363 0.6 −0.02178 −0.03207
= (1) · x = · ≈
∂W (1) ∂z −0.0393 0.9 −0.02358 −0.03537
∂L ∂L
" #
−0.0363
= =
∂b(1) ∂z(1) −0.0393
The gradients summary is: Output Layer:
∂L ∂L
" #
−0.224
≈ , = −0.331
∂W(2) −0.241 ∂b(2)
Hidden Layer:
∂L ∂L
" # " #
−0.02178 −0.03207 −0.0363
≈ , ≈
∂W(1) −0.02358 −0.03537 ∂b(1) −0.0393
Now, we perform weight updates using a learning rate η: Assuming η = 0.1: For the output layer:
∂L
" # h
h i −0.224 i
W updated = W − η ·
(2) (2)
≈ 0.5 0.6 − 0.1 · ≈ 0.5224 0.6241
∂W (2) −0.241
∂L
b(2) updated = b(2) − η · = 0.2 − 0.1 · (−0.331) ≈ 0.2331
∂b(2)
For the hidden layer:
∂L
" # " # " #
0.1 0.2 −0.02178 −0.03207 0.1022 0.2032
W (1)
updated = W (1)
−η· ≈ − 0.1 · ≈
∂W (1) 0.3 0.4 −0.02358 −0.03537 0.3024 0.4035
∂L
" # " # " #
0.1 −0.0363 0.1036
b (1)
updated = b (1)
− η · (1) ≈ − 0.1 · ≈
∂b 0.1 −0.0393 0.1039
The updated weights and biases are: Output Layer:
h i
W(2) ≈ 0.5224 0.6241 , b(2) ≈ 0.2331
∂L ∂L
" # " #
0.676 −0.224
= · a = −0.331 ·
(1)
≈
∂W(2) ∂z(2) 0.727 −0.241
∂L ∂L
= (2) = −0.331
∂b(2) ∂z
For the hidden layer, we first need the gradients of the loss with respect to the activation of the hidden layer:
∂L ∂L
" # " #
0.5 −0.1655
= · W (2)
= −0.331 · =
∂a(1) ∂z(2) 0.6 −0.1986
Next, we calculate the derivatives of the activation function. The sigmoid derivative is given by:
∂L ∂L
" # " # " #
−0.1655 0.219 −0.0363
= · σ (z ) =
′ (1)
⊙ ≈
∂z(1) ∂a(1) −0.1986 0.198 −0.0393
Finally, we compute the gradients for the weights and biases in the hidden layer:
∂L ∂L
" # " # " #
−0.0363 0.6 −0.02178 −0.03207
= · x = · ≈
∂W(1) ∂z(1) −0.0393 0.9 −0.02358 −0.03537
∂L ∂L
" #
−0.0363
= =
∂b(1) ∂z(1) −0.0393
The gradients summary is as follows: Output Layer:
∂L ∂L
" #
−0.224
≈ , = −0.331
∂W(2) −0.241 ∂b(2)
Hidden Layer:
∂L ∂L
" # " #
−0.02178 −0.03207 −0.0363
≈ , ≈
∂W(1) −0.02358 −0.03537 ∂b(1) −0.0393
Next, we perform weight updates using a learning rate η: Assuming η = 0.1: For the output layer:
∂L
" # h
h i −0.224 i
W(2) updated = W(2) − η · ≈ 0.5 0.6 − 0.1 · ≈ 0.5224 0.6241
∂W (2) −0.241
0.1. DATA ANALYSIS 27
∂L
b(2) updated = b(2) − η · = 0.2 − 0.1 · (−0.331) ≈ 0.2331
∂b(2)
For the hidden layer:
∂L
" # " # " #
0.1 0.2 −0.02178 −0.03207 0.1022 0.2032
W(1) updated = W(1) − η · ≈ − 0.1 · ≈
∂W(1) 0.3 0.4 −0.02358 −0.03537 0.3024 0.4035
∂L
" # " # " #
0.1 −0.0363 0.1036
b (1)
updated = b (1)
− η · (1) ≈ − 0.1 · ≈
∂b 0.1 −0.0393 0.1039
The updated weights and biases are: Output Layer:
h i
W(2) ≈ 0.5224 0.6241 , b(2) ≈ 0.2331
1 (x − µ)2
f (x) = √ exp( )
2πσ2 2σ2
If the layer of neural network is in the limit of infinite width,then the Central Limit Theorem1 implies that the function computed
by the neural network (NN) is a function drawn from a Gaussian process (GP).
Theorem 23. Universal Approximation Theorem
Two layer network with linear output can be uniformly approximate any continuous function given sufficient hidden units.
m
1X
lim y(x) = lim σ(< wi , x >)
m→∞ m→∞ m
i=1
Let M be a manifold embedded in a Euclidean space Rn . For a point x ∈ M, the tangent space T x M at x consists of all
possible directions in which one can tangentially pass through x on the manifold M.
The tangent kernel HT (x, y) is defined as a kernel function that measures the similarity between points x and y based on the
geometry of the manifold. Formally, it can be expressed as:
The neural tangent kernel is a mathematical construct that describes how the outputs of a neural network change with re-
spect to small changes in its parameters (weights). It is derived from the Taylor expansion of the network output around the
initialization of the weights.
When the weights of a neural network are initialized randomly, the network can be approximated as a linear function of its
parameters. The NTK captures this linear behavior in the vicinity of the initialization.
During training, especially in the regime where the network is wide (i.e., has a large number of neurons), the dynamics of
the training process can be understood in terms of the NTK. This means that the optimization of the neural network can be
analyzed as a linear regression problem in the space defined by the NTK.
The NTK framework provides insights into the effectiveness of gradient descent in training deep neural networks, helping
to explain why certain architectures perform well and how their training dynamics behave.
Researchers use NTKs to study generalization, convergence rates, and the behavior of various neural network architectures
during training.
∂ f (w(t)i , xi ) ∂ f (w(t) j , x j )
H(t)i j =< , >
∂wi ∂w j
Information 13. Paper 1,2
We will talk about 2 paper:
It points out tangent kernal in infinitely wide kernal is the inner product of weights.
maybe my note is wrong, hope genius can tell me what I need to know
30
Let a(t) = [ f (w(t)1 , x1 ), f (w(t)2 , x2 ), ..., f (w(t)n , xn )] be the network output,you can view it as vector contains arbitrary continu-
ous functions and y = [y1 , .., yn ] be the actual output of x1 , ...xn .
Theorem 25. Equivalence between wide neural net and kernal regression with NTK
We want to show that if dA → ∞, network be infinitely width then
∂ f (w(t)i , xi ) ∂ f (w(t) j , x j )
H(t)i j =< , >→ w[k] (x, x′ )
∂wi ∂w j
which means tangent kernal fixed into inner product of all parameter:
∂ f (w, x)
= b[h] (g[h−1] (x))T
∂w[h]
If h = l + 1 then b[h] (x) = 1. Otherwise b[h] (x) = D[n] (x)(W [h+1] )b[h+1] ∈ Rd×n where D[h] (x) = diag(σ( f [h] )(x)) Then:
∂ f (w(t)i , xi ) ∂ f (w(t) j , x j )
H(t)i j =< , >=< b[h] (g[h−1] (x))T , b[h] (g[h−1] (x′ ))T >
∂wi ∂w j
H(t)i j =< D[n] (x)(W [h+1] )b[h+1](x) , D[n] (x′ )(W [h+1] )b[h+1] >
H(t)i j =< D[n] (x)(W [h+1] )T , D[n] (x′ )(W [h+1] )T >< b[h+1] (x), b[h+1] (x′ ) >
x ≤ 1, x′ ≤ 1,
∂ f (w(t)i , xi ) ∂ f (w(t) j , x j )
< , > −w[k] (wi , w j ) < ε
∂wi ∂w j
Finally we have
∂ f (w(t)i , xi ) ∂ f (w(t) j , x j )
H(t)i j =< , >→ w[k] (x, x′ )
∂wi ∂w j
At each layer, we have weight matrix W ∈ Rn×n then it requires O(n2 ) time, we apply the decomposition technique with low
rank and eigenvalue decomposition (P.C.A) selecting the most important eigenvalue.
k
X
W= λi vi vTi
i=1
for any k < n and it is possible that W ∈ Rd1 ×d2 ......×dn . Therefore the O(n2 ) parameter become O(kn) parameter, which is
decomposition.
Theorem 27. Decomposition Technique
1.Pruning:
Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees
by removing sections of the tree that are non-critical and redundant to classify instances. Pruning reduces the complexity of the
final classifier, and hence improves predictive accuracy by the reduction of overfitting.
Zeroing out very small weights(not recommend)
L1 regulation
2.Distillation:
Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While
large models (such as very deep neural networks or ensembles of many models) have more knowledge capacity than small mod-
els, this capacity might not be fully utilized. It can be just as computationally expensive to evaluate a model even if it utilizes
little of its knowledge capacity. Knowledge distillation transfers knowledge from a large model to a smaller one without loss of
validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile
device).[1]
Given any output one hat label vector x = [0, 1, ..., 0]T ∈ Rn from Teacher , it send to student as v = [0, 0, 1, ..., 0]T ∈ R10 , it
is the training process and then student response teacher in less size vector. Inductively we will have the best student send the
smallest size vector
3.Decomposition
PCA (Principal Component Analysis) to reduce the number of features to fewer components
4.Quantization
Reduce the size of number type not Float 32 but Int 8. Storing integer much better the n floating number
W = Wmin + f loor(△ W−W
△
min
)
0.1. DATA ANALYSIS 33
subject to
rank S(b
p) ≤ r.
34
For any n-order continuous differentiable function f ∈ C n,d , there exists ReLU network R such that R can express any f
from C n,d such that sup x∈[0,1] R(x) − f (x) < ε
Note that
X X X X d
sup f − T = sup ( f ϕm ) − ( Pm ϕm ) = sup ( f − Pm )ϕm ≤ sup x f − Pm ≤ 2d maxm f − Pm ≤ 2d (N)−n ≤ ε
m m m m
n!
Consider
mi mi mi
xi − = σ(xi − ) − σ(−(xi − ))
N N N
Since product of x,y is :
x+y 2 x−y 2
Prod(x, y) = xy = ( ) −( )
2 2
PROOF BY OTHER PAPER x2 is :
m
X g s (x)
hm (x) = x −
i=1
22s
Since g is a sawtooth function such that g2 = g1 · g1 and gn = g1 ......g1 we use 3 ReLU neuron to approximate g.
Therefore we can use ReLU network to approximate ϕ and P, R → P
Hence
R→ f
36
Hopfield neural network is a recurrent neural network invented by John Hopfield in 1982. Hopfield network is a neural network
that combines storage system and binary system. It guarantees convergence to the local minimum, but convergence to the
wrong local minimum (local minimum) instead of the global minimum (global minimum) may also occur. Hopfield networks
also provide a model for simulating human memory.
Theorem 32. Hopeified Network
E(x) = min − X T WX
Let W = xxT as outer product where x = argmin − xT W x for any x ∈ R[n×1] , ||x|| = 1 We need to minimization point of the
Energy Function.(X T WX like gradient descent)
It associate initial point x′ with x(memory). It just like storing image into the energy function, it is imagine recovering method
or called associative memory.
xi represent the i-th image or signal. ∀images xi , x j are far from each other:
hi (t) = j wi j x j (t)
P
1, h(t) > 0
+ = =
x (t 1) sgn(h(t))
i
−1, h(t) < 0
E = i j wi j xi y j
P P
where hi (t) is the preactivation of the i-th neuron, xi is the output, sgn() is the activation function.
Note that E does not increase as long as wi j = w ji where w is symmetric. Also xi (t + 1) = sgn(h(t)) is the generalized gradient
descent.
Suppose at time t,xk is updated then complete the energy function change:
X X
E(t + 1) − E(t) = wik xi (t)[xk (t + 1) − xk (t)] − wk j x j (t)(xk (t + 1) − xk (t))
j i
X X X
= −2 wik xi (t)xk (t + 1) − 2 wk j x j (t)xk (t + 1) = −4xk (t + 1) [wik xi (t)] = −4sgn(hk (t))hk (t) < 0
j i j
m
X m
X
W= (Pm )T (P)m wi j = pm m
i pj
m=1
Hopfield Network really care about the ”states”. Let neuron be 2 states S i = {−1, 1} where the minimzation of S will be P
stands for pattern or perfect image or memory such that S ∗ = P when S ∗ is stable or mini
Pm Pmi Pmj
Let W = PT P = m m=1 Pi P j . Then wi j =
m m
P
m=1 n . Actually W is kind of memory capacity of n-neurons Hopfield Net-
work????? We want to ask that how many pattern can be store at m-neuron Hopfield Network
X
S (t + 1) = sgn( wi j s j (t))
j
1 X ′m ′m X m
i = sgn (
Pm P P P )
N j i j j j
1 X X m′ m′ m
i = sgn(Pi +
Pm m
P P P )
N m′ ,m j i i j
1 X X m′ m′ m m
i = sgn(1 +
Pm P P P P )
N m′ ,m j i i i j
" #
1, 0.5
Pm
i =1+ aim (Pm
i ∼ )
−1, 0.5
We want 1 + aim > 0 where aim > 0 and so aim ∼ N(0, m−1
N )
q
By Central Limit Theorem,define σ = m−1
N we have:
−∞
−x2
Z
1
P(1 + aim > 0) = √ exp( )
−1 2πσ2 2σ2
If m = 0.105N where m be small and N be large then P(1 + aim > 0) = 0.999
Since it follows the normal distribution then 100 neuron stores 10 pattern in the m, n relationship. How to improve the memory
size and follow Normal Distribution and we can find out memory surly.
38
Theorem 34. **Solution to the Capacity or Memory Complicity Limit of Hopfield Network
The standard model of associative memory [?] uses a system of N binary neurons, with values ±1. A configuration of all
the neurons is denoted by a vector i. The model stores K memories, denoted by ξiµ , which for the moment are also assumed to
be binary. The model is defined by an energy function, which is given by
N K
1X T
ξiµ ξµj ,
X
E= i T i j j, Ti j = (1)
2 i, j=1 µ=1
and a dynamical update rule that decreases the energy at every update. The basic problem is the following: when presented
with a new pattern, the network should respond with a stored memory which most closely resembles the input.
There has been a large amount of work in the community of statistical physicists investigating the capacity of this model,
which is the maximal number of memories that the network can store and reliably retrieve. It has been demonstrated [?, ?, ?]
that in case of random memories this maximal value is of the order of Kmax ≈ 0.14N. If one tries to store more patterns, several
neighboring memories in the configuration space will merge together, producing a ground state of the Hamiltonian (1), which
has nothing to do with any of the stored memories. By modifying the Hamiltonian (1) in a way that removes second-order
correlations between the stored memories, it is possible [?] to improve the capacity to Kmax = N.
The mathematical reason why the model (1) gets confused when many memories are stored is that several memories pro-
duce contributions to the energy which are of the same order. In other words, the energy decreases too slowly as the pattern
approaches a memory in the configuration space. In order to take care of this problem, consider a modification of the standard
energy
K
F ξiµ i
X
E= (2)
µ=1
In this formula, F(x) is some smooth function (summation over index i is assumed). The computational capabilities of the
model will be illustrated for two cases. First, when F(x) = xn (where n is an integer), which is referred to as a polynomial
energy function. Second, when F(x) is a rectified polynomial energy function
xn ,
x≥0
F(x) =
(3)
0,
x<0
In the case of the polynomial function with n = 2, the network reduces to the standard model of associative memory [?].
If n > 2, each term in (2) becomes sharper compared to the n = 2 case, thus more memories can be packed into the same
configuration space before cross-talk intervenes.
Having defined the energy function, one can derive an iterative update rule that leads to a decrease in energy. We use
asynchronous updates, flipping one unit at a time. The update rule is:
The argument of the sign function is the difference of two energies. One, for the configuration with all but the i-th units
clumped to their current states and the i-th unit in the “off” state. The other one for a similar configuration, but with the i-th
unit in the “on” state. This rule means that the system updates a unit, given the states of the rest of the network, in such a way
that the energy of the entire configuration decreases. For the case of the polynomial energy function, a very similar family of
models was considered in [?, ?, ?, ?, ?, ?]. The update rule in those models was based on the induced magnetic fields, however,
and not on the difference of energies. The two are slightly different due to the presence of self-coupling terms. Throughout this
paper, we use energy-based update rules.
How many memories can model (4) store and reliably retrieve? Consider the case of random patterns, so that each element
of the memories is equal to ±1 with equal probability. Imagine that the system is initialized in a state equal to one of the
memories (pattern number µ). One can derive a stability criterion, i.e. the upper bound on the number of memories such that
the network stays in that initial state. Define the energy difference between the initial state and the state with spin i flipped:
K
K
X ν µ X ν µ X
ν µ
X
ν µ
E= ξi ξi + ξ j ξ j − ξi ξi +
ξ j ξ j ,
ν=1 j,i ν=1 j,i
where the polynomial energy function is used. This quantity has a mean ⟨E⟩ = N n (N − 2)n ≈ 2n N n−1 , which comes from
the term with ν = µ, and a variance (in the limit of large N)
0.1. DATA ANALYSIS 39
The smoothness measure C f determine how well neural network can approximate the function
Fourier Analysis provides a precise way to measure the smoothness of function via its frequency spectrum.
Theorem 39. Universal Approximation Bound for Superpositions of Sigmoid
0.2 Reference
0.2.1 Prof Fenglei Fan MATH3320 2024 Lecture in CUHK(9-12)
Best professor on MATH3320!!!