Lecture 4
1. Linear Regression:
a. How to load data
b. Pre-process / clean data
c. How to choose features
d. Create model
e. Model evaluation metrics
i. MSE
ii. RMSE
iii. R2
iv. Adjusted R2
v. Value of the coefficients
vi. Standard error (SE) of the coefficients
vii. T-statistic
viii. P-value
ix. Null hypothesis
2. Supervised:
X: feature vector (x1, x2, …)
Y: label
Use X to predict Y
f(X): the function used to predict Y
f(X) = w0 + w1 * x1
True regression functions are never linear
Model:
Y = B0 + B1 * X + E
In Python, use the .fit() function to estimate the model from the data (find the
coefficients of f(x)), then substitute X into f(x) to predict
Y = B0 + B1 * X1
- Analytical: Closed form
Error = ∑ ei^2, choose the coefficients that give the lowest value
o Residual sum of squares (RSS) = e1^2 + e2^2 + … + en^2
o Least square approach
- Numerical:
o Gradient Descent (check below)
Linear time
Local Minima
Best weights / Coefficients
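A minimal sketch of the analytical (least squares) route for the one-feature model Y = B0 + B1 * X1. The function names and the toy numbers are made up for illustration; the formulas are the standard ones for simple linear regression (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)), which are the values that minimize RSS.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing RSS for y ~ b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sxy and Sxx: centered cross-product and centered sum of squares
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares: e1^2 + e2^2 + ... + en^2."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (invented for this example)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_ols(x, y)
print(b0, b1, rss(x, y, b0, b1))
```

Moving either coefficient away from the fitted value can only increase the RSS, which is what "lowest value" means here.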
3. Point Estimates:
- Sales = 10000 + (1.6) * (TV) + (2.9) * (Radio)
4. The least-squares coefficient estimate is computed as B = (X'X)^-1 X'Y
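A minimal NumPy sketch of the (X'X)^-1 X'Y formula. The helper name normal_equation and the toy data are assumptions for illustration. The code solves the linear system (X'X) b = X'Y rather than forming the inverse explicitly, which gives the same result and is usually more stable numerically.

```python
import numpy as np


def normal_equation(X, y):
    """Least-squares coefficients [B0, B1, ...] via the normal equations."""
    # Prepend a column of ones so the first coefficient is the intercept B0
    X_design = np.column_stack([np.ones(len(X)), X])
    # Solve (X'X) b = X'y; equivalent to b = (X'X)^-1 X'y
    return np.linalg.solve(X_design.T @ X_design, X_design.T @ y)


# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise (invented example)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

beta = normal_equation(X, y)
print(beta)  # recovers approximately [1, 2, 3]
```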
5. ChatGPT: too much data (billions), so the closed-form solution is impractical
6. Gradient Descent Approach:
- w(0) = initial value (guess)
- w(1) = w(0) – (Learning Rate) * d(error)/dw
- Point Estimate – Best Coefficients
- Standardize the dataset
o N(0 , 1)
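A minimal sketch of the update rule w(1) = w(0) - (Learning Rate) * d(error)/dw on a standardized feature. The error is taken to be the mean squared error, which is an assumption (the notes do not say which error function is used), and the function names, learning rate, iteration count and toy data are all invented for illustration.

```python
def standardize(values):
    """Rescale to mean 0 and standard deviation 1; also return mean and std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values], mean, std


def gradient_descent(x, y, learning_rate=0.1, n_iters=1000):
    """Fit y ~ w0 + w1 * x by repeatedly stepping against the MSE gradient."""
    w0, w1 = 0.0, 0.0  # w(0): initial guess
    n = len(x)
    for _ in range(n_iters):
        errors = [(w0 + w1 * xi) - yi for xi, yi in zip(x, y)]
        # Partial derivatives of mean squared error with respect to w0 and w1
        grad_w0 = 2.0 / n * sum(errors)
        grad_w1 = 2.0 / n * sum(e * xi for e, xi in zip(errors, x))
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


# Toy data (invented): price = 3 * size
size = [50.0, 80.0, 110.0, 140.0, 170.0]
price = [150.0, 240.0, 330.0, 420.0, 510.0]

z, size_mean, size_std = standardize(size)
w0, w1 = gradient_descent(z, price)
print(w0, w1)
```

Because the feature is standardized, w0 converges to the mean price and w1 to the price change per one standard deviation of size. Without standardizing, the same learning rate would diverge on the raw sizes.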
7. Exact Solution (Closed Form):
- Point Estimate
o Standard Error
8. Root mean square error (RMSE)
9. MAE
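A short sketch of the three error metrics: MSE is the mean of the squared errors, RMSE is its square root, and MAE is the mean of the absolute errors. The function names and the sample values are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: same units as y."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to large errors than RMSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented example values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

print(mse(y_true, y_pred))   # 1.5
print(rmse(y_true, y_pred))  # sqrt(1.5)
print(mae(y_true, y_pred))   # 1.0
```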
----------------------
Ex: House price prediction:
X1: Size
X2: Bathroom
X3: Bedroom
Mean and variance of the price of the house (predicted from x1, x2, x3)
Ex: Temperature
With temperatures, we may predict the mean, but it’s hard to predict the
variance
10. Predicted mean vs. actual mean
11. Predicted var vs. actual var
% Variance explained
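If I show the entire dataset to the model and then test on the same data:
Memorization (the model can score well without generalizing)
To prevent this, use cross-validation (hold out part of the data for testing)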
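A minimal sketch of splitting data so the model is never tested on rows it was trained on. It assumes k-fold cross-validation is the technique meant here; the helper name k_fold_indices and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    # Fit on train_idx only, then evaluate on test_idx
    print(sorted(train_idx), sorted(test_idx))
```

Averaging the test score over the k folds gives an estimate of how the model performs on data it has not memorized.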
Web for datasets: Auto MPG - UCI Machine Learning Repository
12. SGD:
- Evaluation Metrics:
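o 1. Score (R2) value: R2 = 1 - RSS/TSS
o RSS = (y1 – y1’)^2 + (y2 – y2’)^2 + … + (yn – yn’)^2, where yi’ is the prediction for yi
o TSS = (y1 – ȳ)^2 + (y2 – ȳ)^2 + … + (yn – ȳ)^2, where ȳ is the mean of y
o Typically 0 <= R2 <= 1 (closer to 1 is better)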
- Approximate approach
- Faster
- No guarantee of best solution
- Doesn’t give many evaluation metrics
- A good idea to standardize the data
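A short sketch of the score R2 = 1 - RSS/TSS from the evaluation metrics above. The function name r_squared and the sample values are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - RSS/TSS: fraction of the variance in y explained by the model."""
    y_mean = sum(y_true) / len(y_true)
    # RSS: squared distance between actual values and predictions
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # TSS: squared distance between actual values and their mean
    tss = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - rss / tss


# Invented example values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

print(r_squared(y_true, y_pred))  # RSS = 6, TSS = 20, so R2 is about 0.7
```

Predicting the mean for every row gives RSS = TSS and so R2 = 0; perfect predictions give R2 = 1.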
13. OLS:
- Closed form
- Exact solution
- Cubic time complexity
- Statsmodels
- Does not require standardizing the data
- Diagnostics:
o Point Estimate:
MedHouseValue = 0.0163 + 0.4416 * MedInc + …
o T-statistic = point estimate / std. error; the larger its absolute value, the stronger the evidence against the null hypothesis (coefficient = 0)
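A minimal NumPy sketch of where the point estimates, standard errors and t-statistics come from, using the usual OLS formulas (standard error = square root of the diagonal of sigma^2 * (X'X)^-1, with sigma^2 estimated as RSS / (n - p)). The function name ols_diagnostics and the synthetic data are assumptions for illustration; this is not the Statsmodels implementation and not the housing dataset from the lecture.

```python
import numpy as np


def ols_diagnostics(X, y):
    """Return (point estimates, standard errors, t-statistics) for OLS."""
    X_design = np.column_stack([np.ones(len(X)), X])  # add intercept column
    n, p = X_design.shape
    xtx_inv = np.linalg.inv(X_design.T @ X_design)
    beta = xtx_inv @ X_design.T @ y                   # point estimates
    residuals = y - X_design @ beta
    sigma2 = residuals @ residuals / (n - p)          # residual variance
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))      # SE of each coefficient
    t_stats = beta / std_err                          # point estimate / SE
    return beta, std_err, t_stats


# Synthetic data: the first feature matters (true coefficient 2),
# the second does not (true coefficient 0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

beta, std_err, t_stats = ols_diagnostics(X, y)
print(beta, std_err, t_stats)
```

The feature that truly drives y gets a t-statistic far from zero, while the irrelevant feature gets one close to zero, which is how the t-statistic separates useful coefficients from noise.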