Module08 PolynomialRegressionSplineGAMs

The document discusses different non-linear regression methods such as polynomial regression, step functions, splines and basis functions. Polynomial regression extends linear regression by adding higher-order terms. Step functions fit a piecewise constant function by dividing the predictor range into regions. Splines are similar but join smoothly at region boundaries. Basis functions provide a general framework that includes these methods as special cases.


Polynomial Regression, Step Functions, Basis Functions, Splines, GAMs


Prof. Sayak Roychowdhury
Department of Industrial and Systems Engineering
Indian Institute of Technology Kharagpur
Reference
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Need of non-linear models
• Linear models perform poorly when the linearity assumption does not hold.
• Ridge, lasso, and principal components regression improve on least squares regression by reducing the variance of the coefficient estimates. But these models are still linear, so they perform poorly on non-linear problems.
• To overcome this we use polynomial regression, step functions, splines, local regression, and generalized additive models (GAMs).
Different Methods
• Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. E.g., a cubic regression uses three variables, $X$, $X^2$, and $X^3$, as predictors.
• Step functions cut the range of a variable into 𝐾 distinct regions in order to
produce a qualitative variable. This has the effect of fitting a piecewise
constant function.
• Regression splines are more flexible than polynomials and step functions, and are an extension of the two. They involve dividing the range of $X$ into $K$ distinct regions.
• Within each region, a polynomial function is fit to the data.
• However, these polynomials are constrained so that they join smoothly at the region
boundaries, or knots.
• Provided that the interval is divided into enough regions, this can produce an
extremely flexible fit.
Different Methods
• Smoothing splines are similar to regression splines, but arise in a slightly different situation: they result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
• Local regression is similar to splines, but differs in an important way.
The regions are allowed to overlap, and indeed they do so in a very
smooth way.
• Generalized additive models allow us to extend the methods above
to deal with multiple predictors.
Polynomial Regression
• Polynomial regression models the relationship between a dependent variable ($y$) and an independent variable ($x$) by adding powers of the predictor, up to degree $n$, as extra predictors. A polynomial regression model may look like the following:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_n X^n$$
• A degree greater than 3 or 4 is unusual, because the curve becomes overly flexible and can take on strange shapes.
• In polynomial regression the individual coefficients are of little interest; the overall fitted function is examined to assess the relationship between the predictor and the response.
• We either fix the degree $d$ at some reasonably low value, or use cross-validation to choose $d$.
• Polynomial regression imposes a global structure on the relationship: a single polynomial is fit over the entire range of $X$.
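The choice of $d$ by cross-validation can be sketched as follows. This is a minimal illustration on simulated data, not the lecture's own code: the data-generating function, the helper name `cv_mse`, and the use of 10 folds are all assumptions made for the example.

```r
# Simulated data (assumption): a cubic signal plus Gaussian noise.
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 1.5 * x^2 + 0.5 * x^3 + rnorm(n, sd = 1)
dat <- data.frame(x = x, y = y)

# k-fold cross-validated mean squared error for a polynomial of degree d.
# cv_mse is a hypothetical helper written for this sketch.
cv_mse <- function(d, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- sapply(seq_len(k), function(f) {
    train <- data[folds != f, ]
    test  <- data[folds == f, ]
    fit <- lm(y ~ poly(x, d), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(errs)
}

degrees <- 1:6
cv_errors <- sapply(degrees, cv_mse, data = dat)
best_d <- degrees[which.min(cv_errors)]   # degree with the smallest CV error
```

Plotting `cv_errors` against `degrees` typically shows a sharp drop up to the true degree followed by a flat region, which is why a low degree is usually preferred when several degrees give similar errors.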
Figure: Difference between simple and polynomial curve
Figure: Polynomial curve fitting for higher order
Polynomial Model for Regression
Structural Multicollinearity
High VIF
Structural Multicollinearity (after centering)
VIF improved after centering
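A small sketch of why centering helps. It uses simulated data and the fact that, with only two predictors, the variance inflation factor equals $1/(1 - r^2)$, where $r$ is their correlation. The variable names, the predictor range, and the helper `vif2` are assumptions for illustration.

```r
set.seed(1)
x  <- runif(100, 20, 80)   # a predictor whose values lie far from zero
xc <- x - mean(x)          # centered predictor

# VIF for a model with exactly two predictors: 1 / (1 - r^2).
# vif2 is a hypothetical helper for this sketch.
vif2 <- function(a, b) 1 / (1 - cor(a, b)^2)

vif_raw      <- vif2(x, x^2)     # X and X^2 are almost collinear
vif_centered <- vif2(xc, xc^2)   # centering removes most of the correlation
```

The collinearity is called structural because it is created by the model terms ($X$ and $X^2$) rather than by the data; centering changes the coefficients but not the fitted curve.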
Nested Sequence of Models
Probability Bands for Polynomial Model
The confidence bands are first calculated on the logit scale and then transformed to the probability scale, so all the upper and lower bound values lie between 0 and 1.
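The two-step construction of the bands can be sketched like this. The data are simulated and the variable names (`age`, `high`) and the approximate multiplier 2 for the band width are assumptions for the example, not taken from the slides.

```r
set.seed(1)
n <- 500
age <- runif(n, 18, 80)
p_true <- plogis(-3 + 0.12 * age - 0.0015 * age^2)   # assumed true probability
high <- rbinom(n, 1, p_true)                          # binary response

fit  <- glm(high ~ poly(age, 2), family = binomial)
grid <- data.frame(age = seq(18, 80, length.out = 50))

# predict() returns fits and standard errors on the link (logit) scale by default.
pred <- predict(fit, newdata = grid, se.fit = TRUE)

# Build the band on the logit scale, then map it to probabilities.
prob  <- plogis(pred$fit)
lower <- plogis(pred$fit - 2 * pred$se.fit)
upper <- plogis(pred$fit + 2 * pred$se.fit)
```

Computing `prob ± 2 * se` directly on the probability scale could give bounds below 0 or above 1; transforming after the band is built avoids that.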
Step Function
• Another way of creating transformations of a variable is to cut the variable into distinct regions.
• A step function cuts the range of $X$ into distinct regions in order to produce a qualitative variable, and fits a piecewise constant function.
• Step functions avoid imposing a global structure on a non-linear function: the range of $X$ is divided into bins and a different constant is fit in each bin.
• In greater detail, we create cutpoints $c_1, c_2, \dots, c_K$ in the range of $X$, and then construct $K + 1$ new variables.
Step Function Expression
• The step function is as given below:
$$C_0(X) = I(X < c_1),$$
$$C_1(X) = I(c_1 \le X < c_2),$$
$$C_2(X) = I(c_2 \le X < c_3),$$
$$\vdots$$
$$C_{K-1}(X) = I(c_{K-1} \le X < c_K),$$
$$C_K(X) = I(c_K \le X),$$
where $I(\cdot)$ is an indicator function that returns 1 if the condition is true and 0 otherwise.
• For any value of $X$, $C_0(X) + C_1(X) + \dots + C_K(X) = 1$, since $X$ must be in exactly one of the $K + 1$ intervals.
• The least squares fit becomes:
$$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i$$
$C_0(X)$ is left out because it is redundant with the intercept. When $X < c_1$, all of the predictors in the above equation are zero, so $\beta_0$ can be interpreted as the mean value of $Y$ for $X < c_1$.
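A minimal sketch of the step-function fit, assuming simulated data and illustrative cutpoints. `cut()` with `right = FALSE` produces intervals of the form $[c_k, c_{k+1})$, which matches the indicator definitions above.

```r
set.seed(1)
x <- runif(300, 18, 80)
# Assumed piecewise-constant signal plus noise.
y <- ifelse(x < 35, 90, ifelse(x < 50, 120, ifelse(x < 65, 115, 100))) +
  rnorm(300, sd = 10)

cuts   <- c(35, 50, 65)   # cutpoints c_1, c_2, c_3 (illustrative values)
region <- cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)

# lm() turns the factor into dummy variables C_1, ..., C_K;
# the first level (X < c_1) is absorbed into the intercept.
fit <- lm(y ~ region)
coef(fit)
```

The intercept equals the sample mean of `y` for `x < 35`, and each remaining coefficient is the difference between that region's mean and the mean of the first region.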
Step Function
Basis Function
• Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach. Instead of fitting a linear model in $X$, we fit the model
$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \beta_3 b_3(x_i) + \dots + \beta_K b_K(x_i) + \epsilon_i$$
• For polynomial regression, the basis functions are $b_j(x_i) = x_i^j$, and for piecewise constant functions they are $b_j(x_i) = I(c_j \le x_i < c_{j+1})$.
Regression splines
• Here the range of $X$ is divided into $K$ distinct regions and a separate polynomial function is fit in each region. These polynomials are constrained so that they join smoothly at the region boundaries.
• This is a class of basis functions that extends polynomial regression and piecewise constant regression.
• Piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of $X$. For example, a piecewise cubic polynomial fits
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i$$
• where the coefficients $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ differ in different parts of the range of $X$.
• The points where the coefficients change are called knots. A piecewise cubic with no knots is just a standard cubic polynomial.
Piecewise cubic polynomial
• A piecewise cubic polynomial with a single knot at a point $c$ takes the form
$$y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c \end{cases}$$
• Using more knots leads to a more flexible piecewise polynomial.
• The general definition of a degree-$d$ spline is that it is a piecewise degree-$d$ polynomial, with continuity in derivatives up to degree $d - 1$ at each knot.
• A linear spline is obtained by fitting a line in each region of the predictor space defined by the knots, requiring continuity at each knot.
Various piecewise polynomials fit to a subset of wage dataset
Fitting Cubic Spline
Spline with continuous 2nd derivatives at knots
Constraints
• A discontinuity at the knot is undesirable.
• We therefore add the constraints that the fitted function and both its first and second derivatives are continuous at the knots.
• This results in a smooth piecewise polynomial function.
The Spline Basis Representation
• Fitting a piecewise degree-$d$ polynomial under continuity constraints directly is cumbersome, so we use the basis model to represent a regression spline. A cubic spline with $K$ knots can be modelled as
$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$$
and fit with least squares.
• Start with the basis for a cubic polynomial, namely $x$, $x^2$, and $x^3$, and then add one truncated power basis function per knot. A truncated power basis function is defined as
$$h(x, \xi) = (x - \xi)^3_+ = \begin{cases} (x - \xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise} \end{cases}$$
where $\xi$ is the knot.
• Adding a term $\beta_4 h(x, \xi)$ to a cubic polynomial produces a discontinuity only in the third derivative at $\xi$; the function and its first and second derivatives remain continuous.
• A cubic spline with $K$ knots is therefore fit by least squares regression with an intercept and $3 + K$ predictors, of the form $X, X^2, X^3, h(X, \xi_1), \dots, h(X, \xi_K)$, where $\xi_1, \dots, \xi_K$ are the knots.
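The truncated power representation can be checked numerically with a short sketch. The data and knot locations below are assumptions; `bs()` from the `splines` package uses a different (B-spline) basis for the same space of cubic splines, so the two fits should give the same fitted values up to rounding.

```r
library(splines)  # only needed for the comparison with bs()

set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
knots <- c(2.5, 5, 7.5)          # illustrative knot locations xi_1, xi_2, xi_3

# Truncated power function h(x, xi) = (x - xi)^3_+
h <- function(x, xi) pmax(x - xi, 0)^3

# One truncated power column per knot.
H <- sapply(knots, function(xi) h(x, xi))

# Intercept + x, x^2, x^3 + K truncated power terms = 1 + 3 + K coefficients.
fit_tp <- lm(y ~ x + I(x^2) + I(x^3) + H)
fit_bs <- lm(y ~ bs(x, knots = knots))

max_diff <- max(abs(fitted(fit_tp) - fitted(fit_bs)))
```

In practice the B-spline basis is usually preferred because the truncated power columns can be strongly correlated, which makes the least squares problem numerically less stable.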
The Spline Basis Representation
❖Choosing the Number and Locations of the Knots
• A regression spline is most flexible in regions that contain many knots, because there the polynomial coefficients can change rapidly.
• One option is to place more knots where the function varies rapidly.
• Usually knots are placed uniformly, or at certain quantiles of the data.
• Software can place the knots automatically when the degrees of freedom are specified.
• Cross-validation can be used to choose the appropriate number of knots.
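As a small illustration of automatic knot placement, `bs()` from the `splines` package accepts `df` instead of `knots` and then places the interior knots at quantiles of the predictor. The simulated `age` variable is an assumption for this sketch.

```r
library(splines)

set.seed(1)
age <- runif(300, 18, 80)

# For a cubic spline without an intercept column, df = 6 means 6 - 3 = 3 interior knots.
B <- bs(age, df = 6)
knots_auto <- attr(B, "knots")

# The knots sit at the 25th, 50th and 75th percentiles of age.
knots_auto
quantile(age, c(0.25, 0.5, 0.75))
```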
Regression Spline and Polynomial Regression

• Splines introduce flexibility by increasing the number of knots but keeping the degree fixed.
• Polynomial regression uses a higher degree to increase flexibility.
• For splines, more knots can be placed where the function varies rapidly.
• The extra flexibility in the polynomial produces undesirable results at the boundaries, while the natural cubic spline still provides a reasonable fit to the data.
Natural Splines
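• A natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (where $X$ is smaller than the smallest knot or larger than the largest knot).
library(splines)  # provides ns()
library(ISLR)     # provides the Wage data
fit <- lm(wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)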
Smoothing splines
• Similar to regression splines, but the splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
• In fitting a smooth curve we look for a function $g(x)$ that makes $RSS = \sum_{i=1}^{n} (y_i - g(x_i))^2$ small.
• Without any constraint on $g(x)$, RSS can be made zero simply by choosing $g$ such that it interpolates all of the $y_i$.
• We require a function $g$ that makes RSS small, but is also smooth. A natural approach is to find the function $g$ that minimizes
$$L(g(x)) = \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$$
• where $\lambda$ is a nonnegative tuning parameter. The function $g$ that minimizes the above equation is known as a smoothing spline (Loss + Penalty).
• The second derivative of a function indicates its roughness, so the penalty term is large when $g$ is wiggly.
• The larger the $\lambda$, the smoother $g(x)$ is.
Smoothing splines
• The function $g(x)$ that minimizes $L(g(x))$ can be shown to have some special properties:
• It is a piecewise cubic polynomial with knots at the unique values of $x_1, \dots, x_n$, and continuous first and second derivatives at each knot.
• It is linear in the region outside of the extreme knots.
• In other words, the function $g(x)$ that minimizes $L(g(x))$ is a natural cubic spline with knots at each unique observation.
• However, it is a shrunken version of such a natural cubic spline, where the value of the tuning parameter $\lambda$ controls the level of shrinkage.
Smoothing Parameter 𝝀
• It is possible to show that as $\lambda$ increases from 0 to $\infty$, the effective degrees of freedom, which we write $df_\lambda$, decrease from $n$ to 2.
• Degrees of freedom usually refer to the number of free parameters, such as the number of coefficients fit in a polynomial or cubic spline.
• Effective degrees of freedom are defined to be the sum of the diagonal elements of the matrix $\mathbf{S}_\lambda$:
$$\hat{\mathbf{g}}_\lambda = \mathbf{S}_\lambda \mathbf{y} \quad \text{(vector of fitted values of the smoothing spline for a given } \lambda)$$
$$df_\lambda = \sum_{i=1}^{n} \{\mathbf{S}_\lambda\}_{ii} \quad \text{(effective degrees of freedom)}$$
• The value of $\lambda$ that makes the cross-validated RSS as small as possible is chosen.
Smoothing splines
• For smoothing splines, there is a knot at each observation, so the number and location of knots need not be chosen.
• The leave-one-out cross-validation (LOOCV) error can be computed very efficiently for smoothing splines, with essentially the same cost as computing a single fit, using the following formula:
$$RSS_{cv}(\lambda) = \sum_{i=1}^{n} \left( y_i - \hat{g}_\lambda^{(-i)}(x_i) \right)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{\mathbf{S}_\lambda\}_{ii}} \right]^2$$
• $\hat{g}_\lambda^{(-i)}(x_i)$ indicates the fitted value for this smoothing spline evaluated at $x_i$, where the fit uses all of the training observations except for the $i$th observation $(x_i, y_i)$. In contrast, $\hat{g}_\lambda(x_i)$ indicates the smoothing spline function fit to all of the training observations and evaluated at $x_i$.
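The two ways of choosing the amount of smoothing, and the LOOCV shortcut, can be sketched with `smooth.spline()` from base R. The simulated data are an assumption; `all.knots = TRUE` is set so that there is a knot at every observation, as in the description above.

```r
set.seed(1)
n <- 100
x <- sort(runif(n, 18, 80))                 # sorted, distinct predictor values
y <- 50 + 30 * sin(x / 12) + rnorm(n, sd = 10)

# (a) Fix the effective degrees of freedom; lambda is found to match df = 16.
fit_df <- smooth.spline(x, y, df = 16, all.knots = TRUE)

# (b) Let leave-one-out cross-validation choose lambda.
fit_cv <- smooth.spline(x, y, cv = TRUE, all.knots = TRUE)

# LOOCV shortcut from a single fit: residuals scaled by 1 - S_ii.
# fit_cv$lev holds the diagonal of the smoother matrix (same order as sorted x).
res    <- y - predict(fit_cv, x)$y
rss_cv <- sum((res / (1 - fit_cv$lev))^2)

c(df_fixed = fit_df$df, df_cv = fit_cv$df, rss_cv = rss_cv)
```

The reported `df` is the sum of the leverages, which is the trace definition of effective degrees of freedom given earlier.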
Fitting Smoothing Splines
Figure: Cubic spline. Legend: smoothing spline with df = 16; smoothing spline with df = 6.79 chosen by LOOCV.
Local regression
• Local regression is a different approach for fitting flexible non-linear functions, which involves computing the fit at a target point $x_0$ using only the nearby training observations.
• Local regression is sometimes referred to as a memory-based procedure, because like nearest-neighbours, we need all the training data each time we wish to compute a prediction.
• Choices to be made:
• How to define the weighting function $K$
• Form of regression (linear, quadratic, cubic)
• Span $s$ (plays the role of a tuning parameter; the smaller $s$, the more local the fit and the higher its variance)
Local regression
Figure: Blue: generating function $f(x)$; orange: estimates from local regression.
Local Regression Model Fitting
1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.
2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the above weights, by finding $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize
$$\sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2$$
4. The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
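The four steps above can be sketched as a small function. The function name `local_linear`, the tri-cube weighting, and the simulated data are assumptions for illustration; base R's `loess()` is the usual production implementation and differs in details.

```r
# Local linear fit at one target point x0 (hypothetical helper for this sketch).
# Assumes the k-th nearest distance is positive, i.e. not all neighbours equal x0.
local_linear <- function(x, y, x0, s = 0.3) {
  n <- length(x)
  k <- ceiling(s * n)                        # step 1: neighbourhood size k = s * n
  d <- abs(x - x0)
  h <- sort(d)[k]                            # distance to the k-th nearest point
  w <- ifelse(d <= h, (1 - (d / h)^3)^3, 0)  # step 2: tri-cube weights, zero at distance h
  fit <- lm(y ~ x, weights = w)              # step 3: weighted least squares
  unname(coef(fit)[1] + coef(fit)[2] * x0)   # step 4: fitted value at x0
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

grid  <- seq(0.5, 9.5, by = 0.5)
f_hat <- sapply(grid, function(g) local_linear(x, y, g, s = 0.3))
```

A separate weighted regression is solved for every target point, which is why the whole training set has to be kept around at prediction time.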
Local Regression Generalization
• In a setting with multiple features $X_1, X_2, \dots, X_p$, local regression can be generalized by fitting a multiple linear regression model that is global in some variables, but local in another, such as time.
• Local regression also generalizes nicely to fits that are local in a pair of variables $X_1$ and $X_2$, rather than one.
• Two-dimensional neighborhoods can be used to fit bivariate linear regression models using the observations that are near each target point in two-dimensional space.
• However, local regression can perform poorly if the number of predictors is much larger than about 3 or 4, because there will generally be very few training observations close to $x_0$.
• This is the same curse of dimensionality problem discussed for KNN.
Generalized additive models (GAMs)
• GAMs allow us to extend the methods above to deal with multiple predictors.
• Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.
• Just like linear models, GAMs can be applied with both quantitative and qualitative responses.
$$y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \epsilon_i$$
• This is an example of a GAM, where the linear component $\beta_j x_{ij}$ in the multiple linear regression model is replaced by a smooth non-linear function $f_j(x_{ij})$.
• It is called an additive model because we calculate a separate $f_j$ for each $X_j$, and then add together all of their contributions.
Generalized additive models (GAMs)
• In the regression setting a GAM has the form
$$E(Y \mid X_1, \dots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p)$$
• The $f_j(\cdot)$ are unspecified, smooth, non-parametric functions.
• Each function is fit using a scatterplot smoother (e.g. a cubic smoothing spline or a kernel smoother), and an algorithm estimates all $p$ functions simultaneously.
• An additive logistic regression model is represented by
$$\log\left( \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} \right) = \alpha + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p)$$
• The above model can also be extended to the generalized linear models, which include linear, logit, probit, gamma, negative-binomial, and log-linear models.
• Linear and other parametric forms can be mixed with the nonlinear terms, a necessity when some of the inputs are qualitative variables (factors).
Generalized additive models (GAMs)
• Additive models can replace linear models in other settings, e.g. the additive decomposition of a time series
$$Y_t = S_t + T_t + \epsilon_t$$
where $S_t$ is the seasonal component, $T_t$ is the trend, and $\epsilon_t$ is the error term.
Model fitting with GAMs
• The additive model has the form
$$Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \epsilon$$
where $E(\epsilon) = 0$.
• A penalized sum of squares, similar to the smoothing spline criterion, applies to this model:
$$PRSS(\alpha, f_1, \dots, f_p) = \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$$
where $\lambda_j \ge 0$ are the tuning parameters.
• The minimizer of the PRSS is an additive cubic spline model: each $f_j$ is a cubic spline in $X_j$, with knots at each of the unique values of $x_{ij}$, $i = 1, \dots, N$.
Backfitting Algorithm
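1. Initialize $\hat{\alpha} = \frac{1}{N} \sum_{i=1}^{N} y_i$, $\hat{f}_j \equiv 0$, $\forall i, j$.
2. For $j = 1$ to $p$, loop
$$\hat{f}_j \leftarrow \mathcal{S}_j \left[ \left\{ y_i - \hat{\alpha} - \sum_{k \ne j} \hat{f}_k(x_{ik}) \right\}_1^N \right]$$
$$\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{N} \sum_{i=1}^{N} \hat{f}_j(x_{ij})$$
until the change in $\hat{f}_j$ is smaller than some threshold.
• $\mathcal{S}_j$ is a cubic smoothing spline applied to the targets $\left\{ y_i - \hat{\alpha} - \sum_{k \ne j} \hat{f}_k(x_{ik}) \right\}_1^N$ to obtain new estimates of $\hat{f}_j$.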
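The backfitting loop can be sketched in a few lines of base R, using `smooth.spline()` as the smoother $\mathcal{S}_j$. The function name `backfit`, the fixed `df = 4` per term, the convergence settings, and the simulated data are all assumptions for the example; the `gam` package implements this more carefully.

```r
# Backfitting for an additive model (hypothetical helper for this sketch).
backfit <- function(X, y, df = 4, tol = 1e-6, max_iter = 50) {
  N <- nrow(X); p <- ncol(X)
  alpha <- mean(y)                    # step 1: alpha-hat = mean of y
  f <- matrix(0, N, p)                # step 1: all f_j start at zero
  for (iter in seq_len(max_iter)) {
    f_old <- f
    for (j in seq_len(p)) {
      # Partial residuals: remove the intercept and all other fitted functions.
      partial <- y - alpha - rowSums(f[, -j, drop = FALSE])
      sm <- smooth.spline(X[, j], partial, df = df)   # smoother S_j
      fj <- predict(sm, X[, j])$y
      f[, j] <- fj - mean(fj)         # re-center so each f_j averages to zero
    }
    if (max(abs(f - f_old)) < tol) break   # stop when the functions stabilize
  }
  list(alpha = alpha, f = f, iterations = iter)
}

set.seed(1)
N <- 200
X <- cbind(runif(N, -1, 1), runif(N, -1, 1))
y <- 1 + X[, 1]^2 + 2 * X[, 2] + rnorm(N, sd = 0.1)

fit <- backfit(X, y)
fitted_values <- fit$alpha + rowSums(fit$f)
```

Plotting `fit$f[, 1]` against `X[, 1]` should recover the quadratic shape, and `fit$f[, 2]` against `X[, 2]` the straight line, each up to a vertical shift absorbed by the intercept.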
Backfitting Algorithm
• The operation of the smoother $\mathcal{S}_j$ at the training points only can be represented by an $N \times N$ operator matrix $\mathbf{S}_j$.
• Then the degrees of freedom for the $j$th term are (approximately) computed as $df_j = \mathrm{trace}[\mathbf{S}_j] - 1$.
Advantages and Disadvantages
• Advantages:
1) GAMs allow us to fit a non-linear 𝑓𝑗 ​for each​𝑋𝑗 , so that we can automatically
model non-linear relationships that standard linear regression will miss.
2) The non-linear fits can potentially make more accurate predictions for the
response​𝑌.
3) Because the model is additive, we can examine the effect of each 𝑋𝑗 ​on 𝑌
individually while holding all of the other variables fixed.
4) The smoothness of the function 𝑓𝑗 for the variable 𝑋𝑗 can be summarized via
degrees of freedom.

Disadvantages:
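1) The model is restricted to be additive. With many variables, important interactions can be missed. To address this, interaction terms of the form $X_j \times X_k$ can be added manually, as in linear regression, or low-dimensional interaction functions $f_{jk}(X_j, X_k)$ can be fit with two-dimensional smoothers such as local regression.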
Regression with GAM
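library(ISLR)  # provides the Wage data
library(gam)   # provides gam() and the smoothing-spline term s()
gam1 <- gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
par(mfrow = c(1, 3))
plot(gam1, se = TRUE)  # one panel per term, with standard-error bands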
Logit with GAM
gam2 <- gam(I(wage>250) ~ s(age, df = 4)+ s(year, df = 4) + education,
data = Wage, family = binomial)
par(mfrow = c(1,3))
plot(gam2)
Kernel Density Estimation
• Kernel density estimation is a mathematical procedure for estimating the probability density function of a random variable.
• It is a non-parametric method that estimates the probability density function of a random variable using kernels as weights.
• Let $(x_1, x_2, \dots, x_n)$ be independent and identically distributed samples drawn from some univariate distribution with an unknown density $f$. The kernel density estimate of $f$ at any given point $x$ is
$$\hat{f}_\lambda(x) = \frac{1}{n} \sum_{i=1}^{n} K_\lambda(x - x_i)$$
$K$ is the kernel, a non-negative function, and $\lambda > 0$ is a smoothing parameter called the bandwidth.
• A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others.
• Kernel density estimation works by placing a kernel on each data point and averaging the kernels to obtain a smooth curve for the distribution.
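The estimator can be written out directly as an average of kernels. This sketch assumes a Gaussian kernel, so that $K_\lambda(u)$ is the normal density with standard deviation $\lambda$; the function name `kde`, the sample, and the bandwidth value are illustrative.

```r
# Gaussian kernel density estimate at the points x0 (hypothetical helper).
# K_lambda(u) is taken to be the N(0, lambda^2) density.
kde <- function(x0, x, lambda) {
  sapply(x0, function(u) mean(dnorm(u - x, sd = lambda)))
}

set.seed(1)
x <- rnorm(100)                 # sample from a standard normal
grid <- seq(-8, 8, by = 0.01)

f_hat <- kde(grid, x, lambda = 0.337)

# A density estimate should integrate to (approximately) one.
area <- sum(f_hat) * 0.01
```

Base R's `density(x, bw = 0.337)` computes the same kind of estimate on a grid using a faster binned approximation.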
Bandwidth Selection
• The most common optimality criterion used to select the bandwidth is the mean integrated squared error.
• In each of the kernels $K_\lambda$, $\lambda$ is a parameter that controls its width:
• For the Epanechnikov or tri-cube kernel with metric width, $\lambda$ is the radius of the support region.
• For the Gaussian kernel, $\lambda$ is the standard deviation.
• $\lambda$ is the number $k$ of nearest neighbors in $k$-nearest neighborhoods, often expressed as a fraction or span $k/N$ of the total training sample.
Figure: Kernel density estimation with different bandwidths. Red: KDE with $\lambda = 0.05$; black: KDE with $\lambda = 0.337$; green: KDE with $\lambda = 2$; grey: normal density with mean 0 and variance 1. Source: Wikipedia.
Kernel Smoothing
KNN and Kernel Smoothing
• The KNN average is computed as
$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$$
• Here $N_k(x)$ is the set of $k$ points nearest to $x$ in squared distance.
• Moving $x_0$ from left to right, the KNN average remains constant until a point $x_i$ to the right of $x_0$ becomes closer than the furthest point $x_{i'}$ in the neighborhood to the left of $x_0$, at which time $x_i$ replaces $x_{i'}$.
• This leads to a discontinuous $\hat{f}(x)$.
• Alternatively, assign weights that die off smoothly with distance from the target point.
Kernel Smoothing
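• Nadaraya-Watson kernel-weighted average:
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i) \, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
• Epanechnikov quadratic kernel:
$$K_\lambda(x_0, x) = D\left( \frac{|x - x_0|}{\lambda} \right)$$
where
$$D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$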
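The kernel-weighted average with the Epanechnikov kernel can be sketched as below. The function names, the simulated data, and the bandwidth $\lambda = 0.2$ are assumptions for illustration.

```r
# Epanechnikov kernel profile D(t).
epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

# Nadaraya-Watson estimate at each target point in x0 (hypothetical helper).
# Returns NaN at a target with no training point within distance lambda.
nw_smooth <- function(x0, x, y, lambda) {
  sapply(x0, function(u) {
    w <- epanechnikov(abs(x - u) / lambda)   # K_lambda(x0, x_i)
    sum(w * y) / sum(w)                      # weighted average of the y_i
  })
}

set.seed(1)
x <- seq(0, 1, length.out = 101)
y <- sin(4 * x) + rnorm(101, sd = 0.3)

f_hat <- nw_smooth(x, x, y, lambda = 0.2)
```

Because the weights fall smoothly to zero at distance $\lambda$, the fitted curve is continuous, unlike the KNN average.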
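Kernel Smoothing
• In $K_\lambda(x_0, x) = D\left( \frac{|x - x_0|}{\lambda} \right)$, $\lambda$ represents the width of the window: the larger the value of $\lambda$, the smoother the fit.
• An adaptive width function $h_\lambda(x_0)$ can also be used:
$$K_\lambda(x_0, x) = D\left( \frac{|x - x_0|}{h_\lambda(x_0)} \right)$$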
Other Popular Kernels
• Tri-cube kernel:
$$D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
• Gaussian kernel: $D(t) = \phi(t)$, where the standard deviation plays the role of the window size.
Local Linear Regression (revisit)
Local Linear Regression
• Locally weighted averages can be badly biased on the boundaries of the domain because of the asymmetry of the kernel in that region.
• By fitting straight lines rather than constants locally, we can remove this bias exactly to first order.
• Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:
$$\min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \beta(x_0) x_i \right]^2$$
• Estimate at $x_0$: $\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0) x_0$
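The boundary bias and its removal can be seen in a small noise-free sketch. The linear signal, the Epanechnikov kernel, the bandwidth, and the variable names are assumptions chosen to isolate the bias from the noise.

```r
# Epanechnikov kernel profile D(t).
epan <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

x <- seq(0, 1, length.out = 101)
y <- 2 + 3 * x              # noise-free linear signal, so any error is pure bias
x0 <- 0                     # target point on the left boundary
lambda <- 0.2
w <- epan(abs(x - x0) / lambda)

# Locally weighted average (local constant fit): all neighbours lie to the right
# of x0, so the average is pulled above the true value f(0) = 2.
f_const <- sum(w * y) / sum(w)

# Local linear fit: weighted least squares line, evaluated at x0.
fit <- lm(y ~ x, weights = w)
f_linear <- unname(coef(fit)[1] + coef(fit)[2] * x0)
```

Here `f_linear` reproduces the true value at the boundary, while `f_const` overshoots it because the kernel window is one-sided there.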
THANK YOU
