Handbook
1. Linear Regression
Overview
Linear regression is a foundational technique in statistics and machine learning, used to model the
relationship between a dependent variable (target) and one or more independent variables
(predictors). The primary objective is to find the best-fitting line (or hyperplane, in the case of multiple
predictors) that represents this relationship.
Key Concepts:
Simple Linear Regression: This models the relationship between a dependent variable y and
one independent variable x using the equation:
y = β0 + β1 x + ϵ
Where:
β0 is the intercept, i.e., the predicted value of y when x = 0.
β1 is the coefficient (slope), which shows how much y changes for each unit change in x.
ϵ represents the error term, capturing the variability in y not explained by the linear
relationship with x.
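A minimal sketch of fitting a simple linear regression in Python (the toy data and variable names are illustrative, not from the handbook):

```python
import numpy as np

# Toy data with an approximately linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope: Cov(x, y) / Var(x)
beta0 = y.mean() - beta1 * x.mean()                  # intercept: line passes through the means
print(beta0, beta1)
```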
Multiple Linear Regression: When there are multiple independent variables x1 , x2 , … , xn , the
model is expressed as:
y = β0 + β1 x1 + β2 x2 + ⋯ + βn xn + ϵ
Here, the model estimates the relationship between y and each of the independent variables,
assuming each predictor contributes linearly and additively to y.
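The same idea extends to several predictors. Below is a small sketch that fits a multiple regression by least squares with NumPy (the synthetic data and true coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                              # two predictors x1, x2
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_design = np.column_stack([np.ones(len(X)), X])           # leading column of 1s for beta0
betas, *_ = np.linalg.lstsq(X_design, y, rcond=None)       # least-squares estimates
print(betas)                                               # approximately [1.0, 2.0, -3.0]
```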
Assumptions: Linear regression relies on certain assumptions that ensure the model produces
reliable predictions:
Linearity: The relationship between the independent variables and the dependent variable is
linear.
Independence: Observations are independent of each other, meaning the value of one
observation doesn't depend on another.
Homoscedasticity: The variance of the error terms is constant across all levels of the
independent variable(s).
Normality: The error terms are normally distributed, which is essential for hypothesis testing
and confidence intervals.
Cost Function (Mean Squared Error):
The cost function measures how well the model fits the data, and the goal is to minimize it. It is
calculated as:
J(β) = (1/m) Σ_{i=1}^{m} (hβ(x^(i)) − y^(i))²
Where:
m is the number of training examples.
hβ(x) is the hypothesis function, i.e., the model's prediction for x.
y^(i) is the actual value for the i-th training example.
Minimizing this function helps find the optimal parameters β that make the predictions as
accurate as possible.
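For concreteness, a short sketch of this cost in Python (the function name mse_cost is illustrative; X is assumed to carry a leading column of 1s so that the first coefficient acts as the intercept):

```python
import numpy as np

def mse_cost(beta, X, y):
    predictions = X @ beta           # h_beta(x^(i)) for every training example
    errors = predictions - y
    return np.mean(errors ** 2)      # (1/m) * sum of squared errors
```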
2. Types of Regression
Overview
Regression models are used for different kinds of relationships between variables. Various techniques
allow for more flexibility in dealing with different data patterns, especially when linearity doesn’t hold.
Linear Regression: As discussed earlier, this assumes a linear relationship between the
dependent and independent variables.
Polynomial Regression: Used when the relationship between the variables is nonlinear. It fits a
polynomial equation to the data. For example:
y = β0 + β1 x + β2 x² + ⋯ + ϵ
This can capture curvature in the data that a linear model might miss.
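A brief sketch of a degree-2 polynomial fit with NumPy (the data and degree are illustrative):

```python
import numpy as np

x = np.linspace(-3, 3, 50)
# Synthetic data with curvature plus a little noise
y = 1.0 + 0.5 * x + 2.0 * x**2 + np.random.default_rng(1).normal(scale=0.5, size=50)

coeffs = np.polyfit(x, y, deg=2)      # coefficients from highest to lowest degree
y_hat = np.polyval(coeffs, x)         # predictions from the fitted curve
```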
Ridge Regression: A variant of linear regression that applies L2 regularization. The cost
function includes a penalty term proportional to the square of the coefficients, helping prevent
overfitting by reducing the size of the coefficients:
J(β) = (1/m) Σ_{i=1}^{m} (hβ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} βj²
Here, λ is the regularization parameter that controls the magnitude of the penalty. As λ increases,
the model becomes simpler, but may underfit.
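A minimal sketch of ridge regression via its closed-form solution (names are illustrative; for brevity this toy version penalizes every coefficient, including the intercept, which a careful implementation would usually avoid):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # beta = (X^T X + lambda * I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```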
Lasso Regression: Similar to Ridge but applies L1 regularization, which encourages sparsity by
forcing some coefficients to zero. This can be useful for feature selection, as it effectively ignores
irrelevant features:
J(β) = (1/m) Σ_{i=1}^{m} (hβ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} |βj|
The L1 regularization helps to reduce the number of features the model uses, making it simpler
and easier to interpret.
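A short sketch using scikit-learn's Lasso (assuming scikit-learn is available; alpha plays the role of λ, and the synthetic data has only two informative features):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients are driven exactly to zero
```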
Elastic Net Regression: A combination of both L1 and L2 regularization. It balances between
Ridge and Lasso, making it particularly effective when there are many correlated features.
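A similar sketch with scikit-learn's ElasticNet, where l1_ratio controls the blend between the L1 and L2 penalties (the data and settings are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```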
Logistic Regression: Despite its name, it is used for classification tasks rather than regression. It
estimates the probability of a binary outcome (0 or 1) using the logistic function (sigmoid
function). The cost function in logistic regression is the negative log-likelihood (log loss) rather than squared error.
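A minimal sketch of the two pieces involved, the sigmoid and the log loss (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Maps a linear score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta, X, y):
    p = sigmoid(X @ beta)            # predicted probability of class 1
    # Negative log-likelihood of the observed 0/1 labels
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```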
3. Gradient Descent
Overview
Gradient Descent is an optimization technique used to find the minimum of a function. In machine
learning, it is used to minimize the cost function, adjusting model parameters to reduce the difference
between predicted and actual values.
Key Concepts:
Objective: The goal of gradient descent is to find the parameter values θ that minimize the cost
function J(θ). It does this by iteratively adjusting the parameters based on the gradient (the
partial derivatives) of the cost function.
Update Rule: The parameters are updated in the opposite direction of the gradient, using the
formula:
θ = θ − η ∇θ J(θ)
Where:
θ is the parameter vector (e.g., weights in a linear regression model).
η is the learning rate, a hyperparameter that determines the size of the steps taken toward
the minimum.
∇θ J(θ) is the gradient of the cost function with respect to θ.
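A minimal sketch of this update rule applied to linear regression with the MSE cost (names and defaults are illustrative; some texts fold the factor of 2 in the gradient into the learning rate):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        gradient = (2.0 / m) * X.T @ (X @ beta - y)   # gradient of the MSE cost
        beta = beta - eta * gradient                  # step opposite the gradient
    return beta
```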
Types of Gradient Descent:
Batch Gradient Descent: Uses the entire dataset to compute the gradient at each iteration.
This method is stable but can be computationally expensive with large datasets.
Stochastic Gradient Descent (SGD): Updates parameters using a single data point at a
time, making the process much faster but introducing more variance in the updates. This can
sometimes help escape local minima but may result in a noisy path to convergence.
Mini-batch Gradient Descent: A compromise between the above two, using a small batch of
data to compute the gradient at each iteration. It combines the benefits of both speed and
stability.
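A brief sketch of the mini-batch variant (batch size, number of epochs, and names are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, eta=0.01, n_epochs=50, batch_size=32):
    m, n = X.shape
    beta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(m)                    # shuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            gradient = (2.0 / len(idx)) * Xb.T @ (Xb @ beta - yb)
            beta = beta - eta * gradient              # update on the small batch
    return beta
```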
Convergence: The learning rate η plays a crucial role in how quickly and reliably gradient
descent converges. If the learning rate is too high, the algorithm may oscillate around the
minimum, while if it is too low, convergence may be very slow.
4. PageRank Algorithm
Overview
PageRank is an algorithm developed by Google's founders, Larry Page and Sergey Brin, to rank web pages based
on their importance. It uses the link structure of the web to evaluate the quality and quantity of links to a page,
which determines its rank in search results.
Key Concepts:
PageRank: A page's rank is determined by the number and quality of links pointing to it. Pages
with links from other high-ranking pages receive higher PageRank values.
Formula: The PageRank of a page P is calculated using the formula:
PageRank(P) = (1 − d) + d × Σ_{Q → P} ( PageRank(Q) / out-degree(Q) )
Where:
P is the page being ranked.
d is the damping factor (usually 0.85): the probability that a user keeps following links, while with
probability 1 − d the user jumps to a random page instead.
Q represents pages linking to P .
The out-degree of Q is the number of outgoing links on page Q; dividing by it splits Q's PageRank
evenly among the pages it links to.
Damping Factor: The damping factor is the probability that a web surfer keeps following links; its
complement (1 − d) is the chance of jumping to a random page. This random-jump component keeps the
iterative computation well behaved and prevents rank from piling up in pages or loops with no outgoing links.
Convergence: PageRank is typically computed iteratively, with each page's rank being updated
based on the ranks of the pages linking to it. This process continues until the ranks converge to
stable values.
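A minimal sketch of this iterative computation on a tiny hand-made link graph (the graph, tolerance, and names are illustrative):

```python
# Link graph represented as {page: [pages it links to]}
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    pages = list(links)
    rank = {p: 1.0 for p in pages}                     # initial rank for every page
    for _ in range(max_iter):
        new_rank = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - d) + d * incoming       # formula from this section
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank                            # ranks have converged
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```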
Summary:
Linear regression is ideal for modeling linear relationships but can be extended to nonlinear
cases with polynomial regression.
Gradient Descent is the backbone of many machine learning algorithms and is used to minimize
cost functions iteratively.
PageRank is a fundamental algorithm for ranking web pages, based on the link structure (the link
graph) of the web.
Types of regression (such as Ridge, Lasso, and Elastic Net) help manage overfitting and
improve model performance by adding regularization terms.