Support Vector Regression
Support Vector Regression
net/publication/300719035
CITATIONS READS
53 587
2 authors:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Mariette Awad on 30 July 2018.
Rooted in statistical learning or Vapnik-Chervonenkis (VC) theory, support vector machines (SVMs) are
well positioned to generalize on yet-to-be-seen data. The SVM concepts presented in Chapter 3 can be
generalized to become applicable to regression problems. As in classification, support vector regression
(SVR) is characterized by the use of kernels, sparse solution, and VC control of the margin and the number
of support vectors. Although less popular than SVM, SVR has been proven to be an effective tool in real-value
function estimation. As a supervised-learning approach, SVR trains using a symmetrical loss function,
which equally penalizes high and low misestimates. Using Vapnik’s e -insensitive approach, a flexible tube
of minimal radius is formed symmetrically around the estimated function, such that the absolute values
of errors less than a certain threshold e are ignored both above and below the estimate. In this manner,
points outside the tube are penalized, but those within the tube, either above or below the function, receive
no penalty. One of the main advantages of SVR is that its computational complexity does not depend on
the dimensionality of the input space. Additionally, it has excellent generalization capability, with high
prediction accuracy.
This chapter is designed to provide an overview of SVR and Bayesian regression. It also presents a case
study of a modified SVR applicable to circumstances in which it is critically necessary to eliminate or strictly
limit underestimating a function.
SVR Overview
The regression problem is a generalization of the classification problem, in which the model returns a
continuous-valued output, as opposed to an output from a finite set. In other words, a regression model
estimates a continuous-valued multivariate function.
SVMs solve binary classification problems by formulating them as convex optimization problems
(Vapnik 1998). The optimization problem entails finding the maximum margin separating the hyperplane,
while correctly classifying as many training points as possible. SVMs represent this optimal hyperplane with
support vectors. The sparse solution and good generalization of the SVM lend themselves to adaptation to
regression problems. SVM generalization to SVR is accomplished by introducing an e-insensitive region
around the function, called the e-tube. This tube reformulates the optimization problem to find the tube that
best approximates the continuous-valued function, while balancing model complexity and prediction error.
More specifically, SVR is formulated as an optimization problem by first defining a convex e-insensitive loss
function to be minimized and finding the flattest tube that contains most of the training instances. Hence, a
multiobjective function is constructed from the loss function and the geometrical properties of the tube.
67
Chapter 4 ■ Support Vector Regression
Then, the convex optimization, which has a unique solution, is solved, using appropriate numerical
optimization algorithms. The hyperplane is represented in terms of support vectors, which are training
samples that lie outside the boundary of the tube. As in SVM, the support vectors in SVR are the most
influential instances that affect the shape of the tube, and the training and test data are assumed to
be independent and identically distributed (iid), drawn from the same fixed but unknown probability
distribution function in a supervised-learning context.
∑ w j x j + b, y , b ∈ , x , w ∈ M (4-1)
M
y = f (x) = < w , x > + b = j =1
T
w x
f ( x ) = = w T x + b x , w ∈ M +1 (4-2)
b 1
SVR formulates this function approximation problem as an optimization problem that attempts to find
the narrowest tube centered around the surface, while minimizing the prediction error, that is, the distance
between the predicted and the desired outputs. The former condition produces the objective function in
Equation 4-3, where w is the magnitude of the normal vector to the surface that is being approximated:
1
w . (4-3)
2
minw
2
68
Chapter 4 ■ Support Vector Regression
To visualize how the magnitude of the weights can be interpreted as a measure of flatness, consider the
following example:
f ( x , w ) = å i = 1 wi x i , x Î , w Î M .
M
Here, M is the order of the polynomial used to approximate a function. As the magnitude of the vector w
increases, a greater number of wi are nonzero, resulting in higher-order solutions, as shown in Figure 4-2.
The horizontal line is a 0th-order polynomial solution and has a very large deviation from the desired
outputs, and thus, a large error. The linear function, a 1st-order polynomial, produces better approximations
for a portion of the data but still underfits the training data. The 6th-order solution produces the best
tradeoff between function flatness and prediction error. The highest-order solution has zero error but a
high complexity and will most likely overfit the solution on yet to be seen data. The magnitude of w acts as a
regularizing term and provides optimization problem control over the flatness of the solution.
The constraint is to minimize the error between the predicted value of the function for a given input
and the actual output. SVR adopts an e-insensitive loss function, penalizing predictions that are farther
than e from the desired output. The value of e determines the width of the tube; a smaller value indicates
a lower tolerance for error and also affects the number of support vectors and, consequently, the solution
sparsity. Intuitively, the latter can be visualized for Figure 4-1. If e is decreased, the boundary of the tube is
shifted inward. Therefore, more datapoints are around the boundary, which indicates more support vectors.
Similarly, increasing e will result in fewer points around the boundary.
Because it is less sensitive to noisy inputs, the e-insensitive region makes the model more robust. Several
loss functions can be adopted, including the linear, quadratic, and Huber e, as shown in Equations 4-4, 4-5,
and 4-6, respectively. As demonstrated in Figure 4-3, the Huber loss function is smoother than the linear
and quadratic functions, but it penalizes all deviations from the desired output, with greater penalty as the
error increases. The choice of loss function is influenced by a priori information about the noise distribution
affecting the data samples (Huber 1964), the model sparsity sought, and the training computational
complexity. The loss functions presented here are symmetrical and convex. Although asymmetrical loss
functions can be adopted to limit either underestimation or overestimation, the loss functions should be
convex to ensure that the optimization problem has a unique solution that can be found in a finite number of
steps. Throughout this chapter, the derivations will be based on the linear loss function of Equation 4-4.
69
Chapter 4 ■ Support Vector Regression
ïì 0 y - f ( x ,w ) £ e ;
Le ( y , f ( x ,w ) ) = í (4-4)
îï y - f ( x ,w ) - e otherwise ,
ì0
ï y - f ( x ,w ) £ e ; (4-5)
Le ( y , f ( x ,w ) ) = í
( )
2
y - f ( x ,w ) - e otherwise ,
îï
ì c 2
ïïc y - f ( x ,w ) - y - f ( x ,w ) > c
L ( y , f ( x ,w ) ) = í 2 (4-6)
1
ï y - f ( x ,w ) y - f ( x ,w ) £ c
2
ïî 2
Figure 4-3. Loss function types: (a) linear, (b) quadratic, and (c) Huber
Some researchers have proposed modification to loss functions to make them asymmetrical. Shim,
Yong, and Hwang (2011) used an asymmetrical e-insensitive loss function in support vector quantile
regression (SVQR) in an attempt to decrease the number of support vectors. The authors altered the
insensitivity according to the quantile and achieved a sparser model. Schabe (1991) proposed a
two-sided quadratic loss function and a quasi-quadratic s-loss function for Bayes parameter estimation,
and Norstrom (1996) replaced the quadratic loss function with an asymmetrical loss function to
derive a general class of functions that approach infinity near the origin for Bayesian risk analysis.
Nath and Bhattacharyya (2007) presented a maximum margin classifier that bounds misclassification
for each class differently, thus allowing for different tolerances levels. Lee, Hsieh, and Wang (2005)
reformulated the typical SVR approach into a nonconstrained problem, thereby only solving a system
of linear equations rather than a convex quadratic one. Pan and Pan (2006) compared three* different
loss functions for economic tolerance design: Taguchi’s quadratic loss function, inverted normal loss
function, and revised inverted normal loss function.
Adopting a soft-margin approach similar to that employed in SVM, slack variables x, x* can be added
to guard against outliers. These variables determine how many points can be tolerated outside the tube
illustrated in Figure 4-1.
Based on Equations 4-3 and 4-4, the optimization problem in Equation 4-7 is obtained; C is a
regularization—thus, a tuneable parameter that gives more weight to minimizing the flatness, or the error, for
this multiobjective optimization problem. For example, a larger C gives more weight to minimizing the error.
This constrained quadratic optimization problem can be solved by finding the Lagrangian (see Equation 4-8).
The Lagrange multipliers, or dual variables, are l, l*, a, a* and are nonnegative real numbers.
70
Chapter 4 ■ Support Vector Regression
1
min w 2 + C å i =1 xi + xi* ,
N
(4-7)
2
subject to
yi - w T xi £ e + xi* i = 1...N
w x i - yi £ e + xi
T
i = 1...N
xi ,x ³ 0 i = 1... N
i
*
1
(w ,x * ,x , l , l * ,a ,a * ) = w 2 +C å i =1 xi + xi* + å i =1a i* ( yi - w T xi - e - xi* )
N N
2 (4-8)
+ å i =1a i ( - yi + w T xi - e - xi ) - å i =1 lixi + li*xi*
N N
The minimum of Equation 4-8 is found by taking its partial derivatives with respect to the variables
and setting them equal to zero, based on the Karush-Kuhn-Tucker (KKT) conditions. The partial derivatives
with respect to the Lagrange multipliers return the constraints, which have to be less than or equal to zero,
as illustrated in Equation 4-9. The final KKT condition states that the product of the Lagrange multipliers
and the constraints is equal to zero (see Equation 4-10). The Lagrange multipliers that are equal to zero
correspond to data inside the tube, whereas the support vectors have nonzero-valued Lagrange multipliers.
The solution is written in terms of the support vector only—hence, the solution sparsity. The function
approximation is represented in Equation 4-12. By replacing Equation 4-9 in Equation 4-8, the dual form of
the optimization problem can be written as shown in Equation 4-13.
d
= w - å i =1 (a i* - a i ) xi = 0
N
dw
d
= C - li* - a i* = 0
dxi*
d
= C - li - a i = 0
dxi
d (4-9)
= å i =1 xi* £ 0
N
dli*
d
= å i =1 xi £ 0
N
dli
d
= yi - w T xi - e - xi* £ 0
da i*
d
= - yi + w T x i - e - xi £ 0
da i
a i ( - yi + w T x i - e - xi ) = 0
a i* ( yi - w T xi - e - xi* ) = 0 (4-10)
"i
li xi = 0 ,
li*xi* = 0
w = N SV (a * - a ) x
å i=1 i i i (4-11)
71
Chapter 4 ■ Support Vector Regression
1 N SV N SV *
max a ,a * - e å SV (a i + a i* ) + å SV (a i* - a i ) yi - å å (ai - ai ) (a *j - a j ) xiT x j ,
N N
(4-13)
i =1 i =1
2 j =1 i =1
subject to
å (a - a i ) = 0 , a i , a i* Î[0 ,C ]
N SV *
i =1 i
At the beginning of this section, the weights vector w was augmented with the scalar b, and the
derivation of the SVR’s mathematical formulation was carried out, disregarding the explicit computation of b
(see Equation 4-2). However, b could have been calculated from the KKT conditions, as shown next.
Training data that belong to the outside of the boundary of the tube will have nonzero ai or a i* ; they
cannot both be zero, because that would mean that the instance (xi, yi) belongs to the lower and upper
boundary, which is not possible. Therefore, the corresponding constraints will be satisfied with equality, as
demonstrated in Equation 4-14. Furthermore, because the point is not outside the tube, xi = 0 , leading to
the result in Equation 4-15 when a Î(0 ,C ) . Equation 4-16 computes b. Performing the same analysis for a i* ,
one gets Equations 4-17 and 4-18.
(4-14)
yi - w T xi - b - e - xi = 0
(4-15)
yi - w xi - b - e = 0
T
(4-16)
b = yi - w xi - e
T
(4-17)
- yi + w T xi - b - e = 0
(4-18)
b = - yi + w xi - e
T
Instead of using the KKT conditions, one could have also computed b, while solving the optimization
problem, using the interior-point method, which can converge to an optimal solution in logarithmic time by
navigating along the central path of the feasible region. The central path is determined by solving the primal
and dual optimization problems simultaneously.
72
Chapter 4 ■ Support Vector Regression
1 (4-19)
min w 2 +C å i =1 xi + xi* ,
N
subject to
yi - w T j ( xi ) £ e + xi* i = 1,..., N
w T j ( x i ) - y i £ e + xi i = 1,..., N
xi ,xi* ³ 0 i = 1, ..., N
N SV
w = å (a i - a i )j ( xi )
* (4-20)
i =1
N SV N SV
1 N SV N SV *
max a ,a * - e å (a i + a i ) + å (a i - a i ) yi -
*
i =1
*
i =1
å å (ai - ai ) (a *j - a j ) k ( xi , x j )
2 j =1 i =1
(4-21)
N SV
a i ,a i* Î [0 ,C ] ,i = 1,..., N SV , å (a i* - a i ) = 0
i =1
N SV
f ( x ) = å (a i* - a i ) k ( xi , x ) (4-22)
i =1
(4-23)
k ( xi , x ) = j ( xi ) .j ( x )
2.5
1.5
0.5
-0.5
-1
-1.5
-2
-2.5
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
73
Chapter 4 ■ Support Vector Regression
P (Y | w ,s 2 , X ) ~ ( Xw ,s 2 I ) (4-24)
Once the model has been specified, the model parameters’ posterior distributions can be estimated.
This is done by first assuming a prior distribution of the model parameters (see Equation 4-25). Given the
model variance and observations, the posterior distribution of the model parameters (which is Gaussian)
is as shown in Equation 4-26, with the mean computed in Equation 4-27, and the standard deviation scale
factor, in Equation 4-28. The mean is simply the Moore-Penrose pseudoinverse of the predictive variables
multiplied by the observations. Given some observations, the posterior probability of the model variance is
computed, and an inverse chi-squared distribution (see Equation 4-29), with n - k degrees of freedom and
a scale factor s2 (see Equation 4-30), is obtained. The scale factor is the error between the model’s predicted
output and an observation.
1
P (w,s ) µ
2 (4-25)
s2
P (Y | w ,s 2 , X ) P (w|s 2 )
P (w|s 2 ,Y ) = ~ (w E ,vws 2 ) (4-26)
P (Y |s 2 )
w E = ( X T X ) X T Y
-1 (4-27)
vw = ( X X )
T -1 (4-28)
P (Y |s 2 ) P (s 2 )
P (s |Y ) = ~ inv - 2 ( n - k , s 2 )
2 (4-29)
P (Y )
(Y - XwE ) (Y - XwE )
T
(4-30)
s2 =
n-k
The marginal posterior distribution of the model parameters, given the observations, is a multivariate
Student’s t-distribution, shown in Equation 4-31 and computed in Equation 4-32, with n - k degrees of
freedom, wE mean, and s2 scale factor, as P (w|s 2 ,Y ) has a normal distribution, and P (s 2 |Y ) has an inverse
chi-squared distribution.
(4-31)
P ( w |Y ) ~ t ( n - k , w E , s 2 )
74
Chapter 4 ■ Support Vector Regression
Given the model parameter probability distributions and a set of predictive variables Xp, the marginal
posterior predictive distribution Yp, which is a multivariate Student’s t-distribution (see Equation 4-33) can
be determined. The mean is computed in Equation 4-34, and the variance, in Equation 4-35. The predictive
distribution variance depends on the uncertainty in the observed data and the model parameters.
(
P (YP |Y ) ~ t n - k , E (Yp |Y ) ,var (Yp |s ,Y )
2
) (4-33)
E (Yp |Y ) = X pw E (4-34)
The concept of Bayesian regression is displayed in Figure 4-5, in which the sample input data available
during training would have been generated by a Gaussian distribution. If these instances represent their
population well, the regression model is expected to generalize well.
Figure 4-5. One-dimensional regression example illustrating the Gaussian conditional probability
distributions of the output on the input and model parameters
A generative approach models the joint probability distribution of the data and class labels p(x, Ck ),
based on the prior probability distributions of the class labels p(Ck ) and the likelihood probability
distribution p ( x|C k ). The joint distribution computes the posterior probability distributions p (C k |k ) ,
which will be used to map datapoints to class labels.
A discriminant approach directly computes the posterior probability distributions p (C k |x ) without
computing the joint probability distribution p(x, Ck ). A discriminant approach produces a mapping from
the datapoints to the class labels without computing probability distributions. Therefore, this approach
performs the inference and decision stages in one step.
75
Chapter 4 ■ Support Vector Regression
Advantages Disadvantages
Generative • Robust to outliers • Computationally demanding
• Can easily update decision model • Requires a lot of training data
• Allows combination of classifiers trained • Suffers from the curse of
on different types of data by applying dimensionality
probability rules
• Can improve prediction accuracy by
measuring confidence in classification based
on posterior distributions and not making
predictions when confidence is low
Discriminant • Computationally less demanding • Sensitive to noisy data and outliers
• Simple to implement • Requires retraining for any changes
in the decision model
Figure 4-6. (a) SVR and (b) ALB-SVR (Source: Intel, 2012)
76
Chapter 4 ■ Support Vector Regression
ALB-SVR uses the Huber insensitive loss function (Popov and Sautin 2008). This function is similar to
the e-insensitive loss function; however, it increases quadratically for small errors outside the e-bound but
below a certain threshold ¶ > e and then linearly beyond ¶ . This makes it robust with respect to outliers.
The Huber insensitive loss function is represented by:
ì0 if t - y £ e
ïï
L e¶HuberSVR ( t , y ) = í( t - y - e )
2
if e < t - y < ¶
ï
ïî( ¶ - e ) ( 2 t - y - ¶ - e ) if t - y ³ ¶.
ì0 if 0 ³ ( t - y ) £ e
ï
ïï( t - y ) if ( t - y ) < 0
2
L e¶HuberALB -SVR ( t , y ) = í
ï( (t - y ) - e )
2
if e < (t - y ) < ¶
ï
ïî( ¶ - e ) ( 2 t - y - ¶ - e ) if t - y ³ ¶.
é L + 1 L -2 ù
ê -å(a i - a i )ti + 2C å(ea i - a i )ú
- +2
max a + ,a - ê i =1 i =1
ú
ê 1 ú
ê + å(a i - a i )(a i - a i )xi × x j
+ - + -
ú
ë 2 i,j û
-C £ (a i+ - a i- ) £ C i = 1..L
L
å(a1
i
+
- a i- ) = 0.
1 L
Remp ( y ) = åLe -ALB-SVR (ti , yi ).
L i =1
å (y -t ) + å
iÎ( y -t )£e
e.
iÎ( y -t )>e
Validation: ALB-SVR was tested on a dataset used by David et al. (2010) and Stockman et al. (2010)
that consists of 17,765 samples of five attributes of memory activity counters, with the actual corresponding
power consumed in watts, as measured directly by a memory power riser. The memory power model
attributes are activity, read, write, CKE = high, and CKE = low. ALB-SVR was implemented with a modified
77
Chapter 4 ■ Support Vector Regression
version of LIBSVM (Chang and Lin 2011) for ALB-SVR. Simulation results (see Figures 4-7 – 4-9) took the
average of ten runs of threefold cross-validation of a radial basis function (RBF) kernel, with a combination
of grid search and heuristic experimentation to find the best metaparameters e, g, C+, and C–.
% % Out of
Type
Error Bound
Huber
insensitive 512 – 128 0.1 1.0e-06 1.03 67.07
SVR
Huber
insensitive 10,000,000 1,000 128 0.1 1.0e- 06 1.50 0.24
ALB-SVR
Figure 4-7. Comparative results of SVR versus ALB-SVR (Source: Intel, 2012)
Figure 4-8. Power estimates for running average power limit (RAPL) data with Huber insensitive SVR
(Source: Intel, 2012)
78
Chapter 4 ■ Support Vector Regression
Figure 4-9. Power estimates for RAPL data with Huber insensitive ALB-SVR (Source: Intel, 2012)
In SVR, support vectors are those points that lie outside the e-tube. The smaller the value of e, the more
points that lie outside the tube and hence the greater the number of support vectors. With ALB-SVR the
e -tube is cut in half, and the lower e -bound is dropped. Therefore, for the same g and e parameters, more
points lie outside the tube, and there are a larger number of support vectors. This means that the number of
support vectors is greater for ALB-SVR than for SVR. This increase in the number of support vectors indicates
that using ALB-SVR has some negative effects on the complexity of the estimating function. Although the
percentage relative error data set was higher (5.06 percent), this is acceptable, because the main purpose
was to reduce the number of underestimates and this was achieved.
References
Chang, Chih-Chung, and Chih-Jen Lin. “LIBSVM: A Library for Support Vector Machines,” in “Large-Scale
Machine Learning,” edited by C. Ling, special issue, ACM Transactions on Intelligent Systems and Technology 2,
no. 3 (2011). www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf.
David, Howard, Eugene Gorbatov, Ulf R. Hanebutte, Rahul Khanna, and Christian Le. “RAPL: Memory Power
Estimation and Capping.” In Proceedings of the 2010 ACM/IEEE International Symposium on Low-Power
Electronics and Design (ISLPED), August 18–20, 2010, Austin, TX, 189–194. Piscataway, NJ: Institute for Electrical
and Electronics Engineers, 2010.
Huber, Peter J. “Robust Estimation of a Location Parameter.” Annals of Mathematical Statistics 35, no. 1
(1964): 73–101.
Lee, Yuh-Jye, Wen-Feng Hsieh, and Chien-Ming Huang. “e-SSVR: A Smooth Support Vector Machine for
e-Insensitive Regression.” IEEE Transactions on Knowledge and Data Engineering 17, no. 5 (2005): 678–685.
79
Chapter 4 ■ Support Vector Regression
Nath, J. Saketha, and Chiranjib Bhattacharyya. “Maximum Margin Classifiers with Specified False Positive
and False Negative Error Rates.” In Proceedings of the Seventh SIAM International Conference on Data
Mining, April 26–28, 2007, Minneapolis, MN, 35–46. 2007. http://dblp.uni-trier.de/rec/bibtex/conf/
sdm/NathB07.
Norstrom, Jan Gerhard. “The Use of Precautionary Loss Functions in Risk Analysis.” IEEE Transactions on
Reliability 45, no. 3 (1996): 400–403.
Pan, Jeh-Nan, and Jianbiao Pan. “A Comparative Study of Various Loss Functions in the Economic
Tolerance Design.” In Proceedings of the 2006 IEEE International Conference on Management of Innovation
and Technology, June 21–23, 2006, Singapore, China, 783–787. Piscataway, NJ: Institute of Electrical and
Electronics Engineers, 2006.
Popov, A. A, and A. S. Sautin. “Loss Functions Analysis in Support Vector Regression,” 9th International
Conference on Actual Problems of Electronic Instrument Engineering, September 23–25, 2008, Novosibirsk,
Russia, 198. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2008.
Schabe, H. “Bayes Estimates Under Asymmetric Loss.” IEEE Transactions on Reliability 40, no. 1 (1991): 63–67.
Shim, Joo Yong, and Chang Ha Hwang. “Support Vector Quantile Regression Using Asymmetric e-Insensitive
Loss Function.” Communications for Statistical Applications and Methods 18, no. 2 (2011): 165–170.
Stockman, Melissa, Mariette Awad, and Rahul Khanna. “Asymmetrical and Lower Bounded Support Vector
Regression for Power Prediction.” Intel Technology Journal 16, no. 2 (2012a).
Stockman, Melissa, Mariette Awad, Rahul Khanna, Christian Le, Howard David, Eugene Gorbatov, and Ulf
R. Hanebutte. “A Novel Approach to Memory Power Estimation Using Machine Learning.” In Proceedings of
the 2010 International Conference on Energy Aware Computing (ICEAC), December 16–18, 2010, Cairo, Egypt,
1–3. Piscataway, NJ: Institute for Electrical and Electronics Engineers, 2010.
Stockman, Melissa, Randa S. El Ramli, Mariette Awad, and Rabih Jabr. “An Asymmetrical and Quadratic
Support Vector Regression Loss Function for Beirut Short Term Load Forecast.” In2012 IEEE International
Conference on Systems, Man, and Cybernetics (SMC), October 14–17, 2012, Seoul, Korea, 651–656. Piscataway,
NJ: Institute of Electrical and Electronics Engineers, 2012b.
Vapnik, Vladimir N. Statistical Learning Theory. New York: Wiley, 1998.
80