How to tune the RBF SVM hyperparameters?
An empirical evaluation of 18 search algorithms

arXiv:2008.11655v1 [cs.LG] 26 Aug 2020

Jacques Wainer 1* and Pablo Fonseca 2

1 Computing Institute
University of Campinas
Campinas, SP, 13083-852, Brazil

2 Facultad de Ciencias y Filosofía
Universidad Peruana Cayetano Heredia
Lima, Peru

* Corresponding author: wainer@ic.unicamp.br

Abstract

SVM with an RBF kernel is usually one of the best classification algorithms for most data sets, but it is important to tune the two hyperparameters C and γ to the data itself. In general, the selection of the hyperparameters is a non-convex optimization problem and thus many algorithms have been proposed to solve it, among them: grid search, random search, Bayesian optimization, simulated annealing, particle swarm optimization, Nelder-Mead, and others. There have also been proposals to decouple the selection of γ and C. We empirically compare 18 of these proposed search algorithms (with different parameterizations for a total of 47 combinations) on 115 real-life binary data sets. We find (among other things) that trees of Parzen estimators and particle swarm optimization select better hyperparameters with only a slight increase in computation time with respect to a grid search with the same number of evaluations. We also find that spending too much computational effort searching the hyperparameters will not likely result in better performance for future data, and that there are no significant differences among the different procedures to select the best set of hyperparameters when more than one is found by the search algorithms.

Keywords: Hyperparameters; SVM; grid search; random search; non-convex optimization algorithms

1 Introduction
Support vector machines (SVM), and in particular SVM with the RBF kernel, are a powerful classification technique, ranking among the best two or three classifiers for a large set of real-life datasets (Fernández-Delgado et al, 2014; Wainer, 2016).
SVM with RBF kernel requires one to select the value of two hyperparameters
C and γ before computing the SVM itself. There have been many proposals
in the literature on how to search for the best pair of hyperparameters for a
particular data set, among them general optimization procedures such as grid
search, random search, simulated annealing, Bayesian optimization, and many
others. The main goal of this paper is to compare many of these proposals on a
large number of real-life binary data sets. The 18 families of search algorithms
that are compared in this research are discussed in Section 2, including their
different parameterizations. In total, we tested 47 different combinations of
algorithms and parameterizations.
SVM, in general, is formulated as a minimization problem:
$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1)$$
$$\text{subject to } y_i (w \cdot \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

The parameter C is a regularization factor and it is one of the SVM hyperparameters: this constant must be set before solving the minimization problem. φ(x) is a non-linear transformation that takes the data into a high dimensional space, sometimes called a Reproducing Kernel Hilbert Space and sometimes called the feature space. Fortunately, φ(x) does not need to be computed explicitly. The dual formulation of the SVM minimization does not need φ(x) but only the inner product of the transformation of two data points, $\phi(x_i)^T \phi(x_j)$.
The dual form of the SVM minimization is:
$$\min_{\alpha} \;\; \frac{1}{2}\sum_{i,j} y_i \alpha_i y_j \alpha_j K(x_i, x_j) - \sum_i \alpha_i \qquad (2)$$
$$\text{subject to } \sum_i y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C$$
where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is a kernel function that represents the φ mapping.
The Gaussian Radial Basis Function (RBF) kernel is the chosen kernel function discussed in this paper, and it is expressed as:
$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2).$$

The constant γ is also an RBF SVM hyperparameter: a value that must be set before the optimization in Equation 2 is computed. To solve a particular classification problem, one must choose a particular pair of C and γ that is the most appropriate to that problem.
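As a concrete illustration, the sketch below fits an RBF SVM for one fixed pair (C, γ). It uses scikit-learn's SVC (a libSVM-based solver); the dataset and the hyperparameter values are arbitrary placeholders, not something taken from the paper.

```python
# A minimal sketch (not the paper's code): fit an RBF SVM for one fixed
# pair (C, gamma); dataset and values are placeholders.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

C, gamma = 2.0 ** 5, 2.0 ** -7             # one candidate pair of hyperparameters
clf = SVC(kernel="rbf", C=C, gamma=gamma)  # solves the dual problem (Equation 2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```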
Let us define $ac(G'|G, C, \gamma)$ as the accuracy of an RBF SVM on the data set $G'$ when it has been trained on the data set $G$ with hyperparameters C and γ.
Let us assume a given data set G as an i.i.d. sample from an unknown
distribution D. One would like to select the best pairs of hyperparameters as:

$$C_{best}, \gamma_{best} = \operatorname*{argmax}_{C, \gamma} \; E_{g \sim D}\left[ac(g|G, C, \gamma)\right] \qquad (3)$$
where $E_{g \sim D}$ is the expectation taken over a random variable g that represents future data with the same distribution as the data G.
One can consider the right side of Equation 3 as the definition of a function
of two parameters, C and γ, which we will call the true response surface of
the SVM for the data set G. We will represent the true response surface as
$trs_G(C, \gamma)$.
Of course, one does not have access to the new data g or the underlying distribution D, and thus one cannot compute $E_g[ac(g|G, C, \gamma)]$. One solution is to compute an estimate of that value using some form of resampling of the original data set G. Resampling will divide the data set G into one or more pairs of data sets $TR_i$ and $TE_i$, such that $TR_i \cap TE_i = \emptyset$ and $TR_i \cup TE_i \subseteq G$. K-fold cross-validation, bootstrapping, and leave-one-out are examples of resampling procedures.
The estimate of $E_g[ac(g|G, C_1, \gamma_1)]$ for some fixed $C_1$ and $\gamma_1$ is the mean of the accuracy of training the SVM (with $C_1$ and $\gamma_1$ as hyperparameters) on each of the $TR_i$ (the training set) and testing on the corresponding $TE_i$ (the test set). In this paper, we will use 5-fold cross-validation as the resampling procedure.
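A minimal sketch of this 5-fold estimate for one fixed pair (C1, γ1), again with scikit-learn and placeholder data:

```python
# Sketch: 5-fold cross-validation estimate of E_g[ac(g | G, C1, gamma1)]
# for one fixed pair of hyperparameters (placeholder values below).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
C1, gamma1 = 2.0 ** 3, 2.0 ** -5

# each fold trains on TR_i and tests on TE_i; the mean is the estimate
scores = cross_val_score(SVC(kernel="rbf", C=C1, gamma=gamma1), X, y, cv=5)
print("estimated accuracy:", np.mean(scores))
```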
We will define an approximation to the function $trs_G(C, \gamma)$ as:
$$\widehat{rs}(G, C, \gamma) = \widehat{rs}_G(C, \gamma) = \operatorname*{mean}_{i} \; ac(TE_i \mid TR_i, C, \gamma) \qquad (4)$$
For simplicity, we will call this the response surface (without the qualifier true) of the SVM for the data set G and for a particular resampling choice of pairs $\{TR_i, TE_i\}$.
Selecting the best hyperparameters for the data set G is finding the maximum of the response surface $\widehat{rs}_G$. Or, to be more consonant with the optimization terminology, one can think of minimizing the error surface $1 - \widehat{rs}_G$.
In general, the response surface is not concave (or the error surface is not convex) and thus there may be many local maxima, which requires optimization algorithms that are not just based on gradient ascent. There have been many proposals of algorithms that will find hyperparameters with higher values of $\widehat{rs}_G$. The first goal of this research is to compare many of those algorithms.
One must remember that the aim of hyperparameter search is to find a pair of hyperparameters that maximizes the true response surface $trs_G$, that is, the pair that will result in the classifier with the highest accuracy for future data. The $\widehat{rs}_G$ surface is an approximation of the $trs_G$ surface, and thus there will likely be differences between the values of $trs_G(C_1, \gamma_1)$ and $\widehat{rs}_G(C_1, \gamma_1)$. The second goal of this paper is to understand whether it is worth spending the computational effort on algorithms that will likely select better hyperparameters in $\widehat{rs}$ but may not achieve a better solution in $trs$. We will estimate the correlation between improving the selection in $\widehat{rs}$ and the expected gain in accuracy on future data.
The $\widehat{rs}_G$ is also a step function: small changes in C and γ will not change the accuracy of the resulting SVM on a finite set of data. Therefore, it is likely that any search algorithm for pairs of C and γ will produce a set of pairs of hyperparameters with the same value of $\widehat{rs}_G$.
If more than one pair of hyperparameters is selected, there is a reasonable argument to choose the pair with the smallest C and the smallest γ. For example, the tutorial page for SVM hyperparameters in sklearn (Sklearn, 2019) makes that argument. Both hyperparameters work as inverse regularization terms. A large C will place emphasis on lowering the number of support vectors, since each one of them contributes to the $\sum_{i=1}^{n} \xi_i$ cost in the optimization. A lower C will allow more support vectors, resulting in larger margins.
The γ parameter controls how fast the "influence" of a point decreases with distance. The kernel value for two points will decrease as γ increases. As γ increases, the decision surfaces become more "curvy" and fit more closely to the training data. A smaller γ will generate decision surfaces that are flatter, and thus a simpler model.
Thus, in the case of search algorithms that return a set of best hyperparameters, one has to perform a further selection procedure. We call this post-search selection. The third goal of this paper is to compare different post-search selection procedures regarding the accuracy on future data.
The fourth goal of the paper is to get some insight into the distributions of the best C and γ selected by the many algorithms, and whether these values are concentrated in some region of the hyperparameter space.

2 Hyperparameter search algorithms


Broadly speaking, there are two different approaches to the SVM hyperparameter search. The first approach considers the hyperparameter search as a general non-convex optimization problem and uses general algorithms to solve it. Among such general optimization algorithms are grid search, random search, uniform design (Fang et al, 2000), gradient-free minimization algorithms such as the Nelder-Mead simplex (Nelder and Mead, 1965), genetic algorithms, particle swarm optimization, ant colony optimization, Bayesian optimization (Pelikan et al, 1999; Snoek et al, 2012), and simulated annealing (Aarts and Korst, 1988). All these alternatives have been proposed as search algorithms for the SVM hyperparameters.
The second general approach considers the mathematical properties of the SVM formulation itself, and proposes particular search strategies to solve specifically the SVM hyperparameter optimization problem. This approach can be further divided into two broad directions. The first one defines smooth approximations to the estimated error of the classifier (usually the leave-one-out error) and uses gradient descent algorithms on these approximations (Keerthi et al, 2006; Keerthi, 2002). We will not test these algorithms in this research.
The second direction tries to decouple the choice of γ from the choice of C (the more frequent approach), so that γ is selected first (sometimes using a grid search, sometimes not) based on some measure of the data, and only then is C chosen (usually by grid search).

2.1 General optimization approaches


We will discuss some of the general optimization algorithms for searching for
the SVM hyperparameters.

• grid: In the grid search the $\log_2 C \times \log_2 \gamma$ space is divided into an equally spaced grid and only the pairs on the grid points are tested (a code sketch of such a grid-like search appears at the end of this subsection).
We used the following constraints on C and γ not only for the grid search but for all algorithms:
$$2^{-5} \le C \le 2^{15}, \qquad 2^{-15} \le \gamma \le 2^{3} \qquad (5)$$
which are the limits of the grid search suggested by the LibSVM package (Chang and Lin, 2011; Hsu et al, 2003), a very popular SVM solver. The grid search and all other algorithms perform the search in the $\log C \times \log \gamma$ space (and not in the $C \times \gamma$ space).
As we will discuss further in Section 3, we are interested in comparing the quality of the different algorithms when controlling for the number of times the optimizer evaluates $\widehat{rs}$ for some point in the $\log C \times \log \gamma$ space. We call this number N. Notice that since we are using 5-fold cross-validation to evaluate $\widehat{rs}$, the SVM learning and evaluation methods will be called 5*N times.
We tested the grid procedure for N = 25, N = 100 and N = 400, and in each case we divided both the log C and the log γ dimensions into an equal number of intervals (5, 10, and 20 respectively).
• ud: Uniform design is a space-filling method (Fang et al, 2000) used in the
design of computer experiments. It approximates a uniform distribution
over a space of parameters (or hyperparameters in our case), given the
number of points one needs to sample from that space. The sampling is
deterministic and it chooses points that minimize a discrepancy measure
in relation to the uniform distribution. Uniform design has been proposed
as a method to select SVM hyperparameters (Huang et al, 2007).
To generate the uniform design tables we used the unidTab function from
the mixtox package (Zhu, 2016). We tested ud with N = {25, 100, 400}.
• rand: N random points are uniformly sampled from the log C ×log γ space
within the bounds of Equation 5.
• normrand: Randomly select log C from a normal distribution with mean 5
and standard deviation 5 and log γ from a normal distribution with mean
−5 and standard deviation 5.

We tested the other algorithms discussed in this section on 115 datasets (see Section 3.1) and collected the C and γ selected by them. The distributions of the selected log C and log γ follow a Gaussian distribution (see Section 4.4). This prompted the realization that a random sampling using these Gaussian distributions could achieve, in general, better results than the rand search discussed above.
We tested normrand with N = {25, 100, 400}. But we must warn that the analysis of the results for this algorithm has a risk of overfitting: the algorithm was tuned using all the datasets available and we are measuring its quality on the same set of datasets.
• gridhier: Hierarchical grid. A grid search is performed to select the point x as the one with the highest accuracy. A second grid search, centered on x and bounded by the closest neighbours of x in the first grid, is then performed, and the highest-accuracy point among x and the points of the second search is selected.
We tested the hierarchical procedure for grid searches with 25 + 25, that is, with N = 25 in the first level followed by N = 25 in the second level, and with 100 + 100.
• udhier: Hierarchical uniform design. Similar to the gridhier, but the
uniform design search is performed at both levels. Tested with N = 25+25
and 100 + 100.
• nelder: The Nelder-Mead simplex algorithm (Nelder and Mead, 1965) is a well-known derivative-free optimization algorithm. The algorithm maintains a simplex of points in the parameter space (in our case of 2 dimensions the simplex is a triangle) and replaces the simplex point with the worst evaluation with a new point symmetric to it on the other side of the centroid of the remaining points. The algorithm converges when either the simplex is too small or when the improvement of the evaluation obtained by changing the worst simplex vertex is below a threshold. Nelder-Mead has been suggested as a method for selecting SVM hyperparameters (Cohen et al, 2005; Cawley and Talbot, 2007).
We used the neldermead function of the nloptr package (Ypma et al, 2014) with the initial point in the center of the log C × log γ space. There are many algorithms to generate an initial simplex from a single initialization point (Wessing, 2018); unfortunately, neither the nloptr R package nor the underlying nlopt C implementation documents which algorithm is used.
We used the bounded version of the Nelder-Mead algorithm with the constraints in Equation 5. We tested Nelder-Mead with N = {25, 100, 400}. The Nelder-Mead algorithm has other stopping criteria besides the number of evaluations. In the case of an early stop, we restarted the nelder algorithm at a random point in log C × log γ, setting the number of evaluations to the remaining balance of evaluations.

• bobyqa: Bound optimization by quadratic approximation (BOBYQA) (Powell, 2009) is a derivative-free optimization algorithm that approximates the response function by quadratic interpolation. We are not aware of any publication that proposed using any of the algorithms in this family to select SVM hyperparameters, but we feel it is an interesting alternative to the Nelder-Mead algorithm.
We used the bobyqa function from the nloptr package (Ypma et al, 2014), with the initial point in the center of the log C × log γ space. We tested bobyqa with N = {25, 100, 400}. Similarly to nelder, the algorithm will restart at some random point on early stopping.
• sa: Simulated annealing (Aarts and Korst, 1988) starts at a state and probabilistically accepts moving to a different random state given the difference of the evaluations. At "high temperature," at the beginning of the process, even states that worsen the evaluation are accepted with high probability. As the temperature lowers, the probability of accepting a bad transition also lowers, and at very low temperatures only new states that have better evaluation are accepted. Simulated annealing has been suggested as a method to select SVM hyperparameters (Lin et al, 2008a; Pai and Hong, 2005; Imbault and Lebart, 2004).
We used the GenSA function from the GenSA package (Y. Xiang et al, 2013). We used the default values of the simulated annealing for this function and stopped the iteration after N calls, for N = {25, 100, 400}.
• pso: Particle swarm optimization (Eberhart et al, 1995) considers a set of solutions to the optimization problem as a set of particles, which have a momentum (and thus will tend to move in the direction they have been moving before) but are also attracted to other particles that found good solutions to the optimization problem. Particle swarm optimization has been used to select the SVM hyperparameters (Lin et al, 2008b; Huang and Dun, 2008).
We used the psoptim function from the pso package (Bendtsen, 2012). We tested with N = {25, 100, 400}.
• cma: Covariance matrix adaptation evolution strategy (CMA-ES) (Hansen et al, 2003) is an evolution-inspired algorithm for general non-convex problems. CMA-ES maintains a set of K points in the search space with the corresponding function values. It then adjusts a multidimensional Gaussian distribution on the search space so that points with higher accuracy would have higher likelihood, and generates new points based on this distribution. The algorithm then keeps the best K among the old and the new points and repeats the process.
We used the cmaes package (Trautmann et al, 2011). We used the default size of the initial population (µ) and the default size of the subsequent generations (λ), and the number of generations was computed so that the search would not exceed the specified N = {100, 400}.

• bogp: Bayesian optimization (Pelikan et al, 1999; Snoek et al, 2012) models the response surface as a Gaussian process regression. This modeling allows the optimizer to evaluate not only which points should be queried because they are likely to result in a good evaluation, but also which points will improve its knowledge of the response function itself. Thus a Bayesian optimization model finds a particular balance between exploitation (of the points it believes will result in good evaluations) and exploration (of new points that will improve its estimate of the response surface itself).
Bayesian optimization has received a lot of attention recently, but our previous experience with it is that it is computationally costly. We therefore opted to use a fast, mostly compiled implementation of Bayesian optimization (as opposed to an R implementation). We used the Python RoBO implementation (Klein et al, 2017). We tested with N = {100, 400}.
• tpe: The Tree of Parzen Estimators (Bergstra et al, 2013) is a Bayesian-optimization-inspired algorithm, but instead of using Gaussian processes to model the distribution of values of the response surface, it models it as two Gaussian mixture models. TPE is usually much faster than Gaussian process based Bayesian optimization.
We used the HyperOpt Python implementation (hyp, accessed 2019) of TPE, with N = {100, 400}.
We call grid, ud, normrand, and rand grid-like search algorithms, because their evaluation points in the log C × log γ space are defined before any evaluation needs to be performed. We call gridhier and udhier the hierarchical algorithms, and we call nelder, bobyqa, sa, cma, pso, bogp, and tpe the general optimization algorithms.
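To make the baseline concrete, the sketch below implements a plain grid-like search over the log2 C × log2 γ space with the bounds of Equation 5, scoring each probe with the 5-fold estimate of the response surface (rand would instead sample the probe points uniformly in the same bounds). It assumes scikit-learn and placeholder data; it illustrates the idea and is not the code used in the experiments.

```python
# Sketch of a grid-like search over the log2(C) x log2(gamma) space, with the
# bounds of Equation 5 and a 5-fold estimate of the response surface rs_hat.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

log2C = np.linspace(-5, 15, 10)    # 10 x 10 grid, i.e. N = 100 probes
log2g = np.linspace(-15, 3, 10)

best = (-np.inf, None, None)
for C in 2.0 ** log2C:
    for g in 2.0 ** log2g:
        score = cross_val_score(SVC(kernel="rbf", C=C, gamma=g), X, y, cv=5).mean()
        if score > best[0]:
            best = (score, C, g)

print("best rs_hat = %.4f at C = %g, gamma = %g" % best)
```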

2.2 SVM specific algorithms


• svmpath: Hastie et al (Hastie et al, 2004) discuss an algorithm that, for a given γ, computes all values of C which cause a change in the support vector set, and thus may cause a change in the accuracy of the classifier. This is called the regularization path of an SVM. Furthermore, computing the regularization path has a computational cost similar to that of computing the SVM itself, and thus, in principle, one could use such an algorithm to search independently for the best hyperparameters. We performed a grid search on the γ value (with N evaluations); for each value of γ we computed the regularization path. The svmpath algorithm not only returns the C values but also the number of points in the support vector set for each C. We use this number as an approximation to the error rate of the SVM on the training set.
The svmpath algorithm only uses information from the training set and thus there is no need to use a test set to select the best set of hyperparameters. Therefore, it does not require the 5-fold cross-validation: one can use the whole training set and discover the combination of values of γ and C that has the lowest number of points in the support vector set. Unfortunately, there are usually many different C values with the same number of support vectors (for a fixed γ), since svmpath tracks changes in the support vector set and not just in its size (a point may get in as another gets out). In these cases, we selected a random C and γ among the ones with the minimal size of the support vector set.
We used the svmpath function from the svmpath package (Hastie, 2012)
and N ∈ {5, 10, 20}.
• asymp: Keerthi and Lin (Keerthi and Lin, 2003) argue that there should be a good solution to the hyperparameter selection for an RBF SVM on the line log γ = log Ĉ − log C in the log C × log γ space. Thus they propose that a linear SVM should be used to find the constant Ĉ, and then a 1D grid search on the line log γ = log Ĉ − log C to select the best C and γ for the RBF SVM.
The parametrization of this algorithm involves a pair of numbers: the number of grid points for the C used in the linear SVM problem, and the number of grid points on the log γ = log Ĉ − log C line to select the pair C and γ. We used 5+5, 10+10 and 20+20 points.
• dbtc: Sun et al (Sun et al, 2010) propose that the γ parameter should be chosen using a geometric criterion; in particular, it should be chosen by maximizing the distance between the centers of the two classes (in the feature space), which is called the distance between two classes (DBTC). DBTC is one of a family of techniques called internal metrics that use some metric on the training set (not on the test set) to select γ, such as the Xi-alpha bound (Joachims, 2000), generalized approximate cross-validation (Wahba et al, 2001), and maximal discrepancy (Anguita et al, 2003), among others. DBTC achieves significantly higher accuracy and is significantly faster than the other internal metrics (Duarte and Wainer, 2017).
In our solution, a grid in log γ is searched in order to maximize the DBTC. With the selected γ, a grid on log C is used to select the other hyperparameter. We used 5+5 and 10+10 evaluation points.
We also implemented a sampled version of DBTC, where a sample of 50% of the data points in both classes is used to compute the distance. We call this version sdbtc, after sampled DBTC, and tested it also with 5+5 and 10+10 evaluations.
• skl: The popular Python package scikit-learn (Pedregosa et al, 2011) and the popular SVM solver implementation libSVM (Chang and Lin, 2011) use as the default for the γ hyperparameter the inverse of the number of dimensions of the data. Sklearn also uses 1.0 as the default value for C. We do not know of any published evidence for that choice of the default value of C.
One must note the importance of default values for hyperparameters, since the less sophisticated user of popular packages usually relies on the default values in the hope that there is good empirical or theoretical evidence for that selection. In fact, the authors are unsophisticated users of many of the packages described above (pso, GenSA, cma, svmpath, and so on) and we did rely on the default values of those packages for the hyperparameters of the corresponding search algorithms. Thus, it is important to test whether the default values for SVM implementations are indeed reasonable choices.
We call skl1 the choice of using both default values, and sklN the choice of setting γ to the sklearn default and performing a grid search on C, with N = {5, 10, 20}.

• sigest: Caputo et al (Caputo et al, 2002) argue that γ could be derived from the distribution of the distances between the data points (in the original space). They argue that any 1/γ in the range between the 10th and the 90th percentile of the distribution of $\|x_i - x_j\|^2$ should be an acceptable value. We chose the median (50th percentile) of that distribution as a fixed γ and performed a grid search on log C. We used the sigest function of the kernlab package (Karatzoglou et al, 2004). The best γ is computed from the formula and a grid search on log C is performed to select the other hyperparameter. We used N = {5, 10, 20} for the C grid.
The sigest function is used in the R package kernlab to determine the default value of the γ parameter in its SVM implementation.
sigest, like the other algorithms discussed above that use information from the training set to define the value of γ (skl and dbtc), uses the whole training set to define the selected γ*. Using that γ*, the grid on the C value uses 5-fold cross-validation to select the best value of C.
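A rough sketch of a sigest-style selection, assuming the simplification of taking 1/γ as the median of the squared pairwise distances (kernlab's sigest works on a subsample and returns quantiles of this distribution, so this is only an approximation of that function):

```python
# Sketch of a sigest-style choice of gamma: fix 1/gamma to the median of the
# squared pairwise distances, then grid-search only C with that gamma fixed.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

gamma = 1.0 / np.median(pdist(X, "sqeuclidean"))    # gamma fixed from the data alone
scores = {C: cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
          for C in 2.0 ** np.linspace(-5, 15, 10)}  # grid on log C only
best_C = max(scores, key=scores.get)
print("gamma = %g, best C = %g" % (gamma, best_C))
```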

2.3 Post-search selection procedures


As discussed, there are reasonable arguments for selecting pairs with lower C and lower γ if more than one pair was found as the best set of hyperparameters. Let us assume that $B = \{\langle C_i, \gamma_i\rangle\}$ is the set of best hyperparameters found by a search procedure. To select the pair with the lowest C and the lowest γ, one has to decide which hyperparameter to minimize first. One can select the subset with the lowest C first, and among them, select the one with the lowest γ. Or one may perform the dual procedure, selecting first the minimal γ and then C.
It is also reasonable to assume that if the set B of solutions is a plateau on $\widehat{rs}_G$, a peak (in $trs_G$) would be somewhere in the middle of the plateau, and thus the mean of the $C_i$ and the mean of the $\gamma_i$ would also be a reasonable solution.
In this research, we will compare the following post-search selection procedures:
• minCg: First select the minimum value of the $C_i$ and then the minimum value of the $\gamma_i$ among the subset with the selected minimum $C_i$. Formally:
$$\hat{C} = \min\{C_i \mid \langle C_i, \gamma_i\rangle \in B\}, \qquad \hat{\gamma} = \min\{\gamma_j \mid \langle \hat{C}, \gamma_j\rangle \in B\} \qquad (6)$$
where $\hat{C}$ is the selected C and $\hat{\gamma}$ the selected γ.


• mingC: First select the minimum γi , and then the minimum Ci among the
subset with the selected γi .
• meanCg: The mean of the $\log C_i$ and the mean of the $\log \gamma_i$:
$$\log \hat{C} = \operatorname{mean}\{\log C_i \mid \langle C_i, \gamma_i\rangle \in B\}, \qquad \log \hat{\gamma} = \operatorname{mean}\{\log \gamma_i \mid \langle C_i, \gamma_i\rangle \in B\} \qquad (7)$$

• randCg: Select a random pair hCi , γi i from the set B

• maxCg: First select the maximum value of the Ci and then the maximum
value of the γi among the subset with the selected Ci .
• maxgC: First select the maximum γi , and then the maximum Ci among
the subset with the selected γi .

The last two procedures, maxCg and maxgC, are included only to test procedures that, by the arguments above, should be worse than the other ones.
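A small sketch of the main selection procedures, taking B as a list of (C, γ) pairs tied at the best value of the response surface (randCg and the max* variants are analogous):

```python
# Sketch of the post-search selection procedures over a set B of tied (C, gamma) pairs.
import numpy as np

def minCg(B):
    c = min(C for C, g in B)                    # smallest C first ...
    return c, min(g for C, g in B if C == c)    # ... then smallest gamma among those

def mingC(B):
    g = min(gm for C, gm in B)                  # smallest gamma first ...
    return min(C for C, gm in B if gm == g), g  # ... then smallest C among those

def meanCg(B):
    logC = np.mean([np.log(C) for C, g in B])   # means taken in log space (Equation 7)
    logg = np.mean([np.log(g) for C, g in B])
    return np.exp(logC), np.exp(logg)

B = [(1.0, 0.01), (1.0, 0.1), (8.0, 0.01)]      # toy set of equally good pairs
print(minCg(B), mingC(B), meanCg(B))
```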

3 Experimental protocol and statistical analysis


3.1 Datasets
We used 115 datasets from UCI, first processed by Fernández-Delgado et al (Fernández-Delgado et al, 2014) and further processed by us. In particular, each multi-class dataset was transformed into a binary problem.
For each of the original 115 UCI datasets (a code sketch of these steps follows the list):
• the categorical features were converted to an integer value: 1 for the arbitrary first categorical value, 2 for the second value, and so on;
• each feature was scaled to mean equal to 0 and standard deviation equal to 1;
• the multi-class datasets were converted to binary by assigning the most frequent class to 0, the second most frequent to 1, the third most frequent to 0, and so on.
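The sketch below illustrates these preprocessing steps with pandas; the target column name and the exact encoding details are assumptions, not the authors' actual preprocessing code.

```python
# Sketch of the dataset preprocessing: encode categoricals as integers,
# standardize each feature, and binarize the classes by frequency rank.
import numpy as np
import pandas as pd

def preprocess(df, target="class"):
    X = df.drop(columns=[target]).copy()
    for col in X.select_dtypes(exclude="number"):
        X[col] = pd.factorize(X[col])[0] + 1    # arbitrary 1st category -> 1, 2nd -> 2, ...
    X = (X - X.mean()) / X.std()                # mean 0, standard deviation 1
    # classes ordered by decreasing frequency: 1st, 3rd, ... -> 0; 2nd, 4th, ... -> 1
    order = df[target].value_counts().index
    y = df[target].map({c: i % 2 for i, c in enumerate(order)})
    return X.to_numpy(), y.to_numpy()
```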
The distributions of the size of the datasets, the number of features, and the ratio of the less frequent class are shown in the histograms in Figure 1.


Figure 1: Histograms of the size, number of features, and minority ratio of the
datasets used.

3.2 Experimental procedure


In general terms, for each dataset, we follow a nested cross-validation procedure, 5-fold inside a 2-fold. The internal cross-validation (5-fold) is used to select the hyperparameters (using the different search procedures) and the external 2-fold is used to estimate the accuracy of the classifier (with the different hyperparameters) on future data (Section 3.4).
Formally, each dataset $D_i$ is divided into two subsets $S^1_i$ and $S^2_i$ with the same size and proportion of classes (this is the external 2-fold). Let us also define $\hat{1} = 2$ and $\hat{2} = 1$, that is, the hat operator denotes the other subset.
Let us define $\theta(x, i, j)$ as the value of the hyperparameters that are selected when we use the search algorithm x in subset j of dataset i, that is:
$$\theta(x, i, j) = \operatorname*{argmax}_{C, \gamma \in x} \; \widehat{rs}(S^j_i, C, \gamma) \qquad (8)$$
where $C, \gamma \in x$ indicates that C and γ are selected from the ones "generated" by algorithm x. $\widehat{rs}(S^j_i, C, \gamma)$ is defined in Equation 4 and in this research we use 5-fold cross-validation to compute it.
In the case of multiple best pairs of hyperparameters, we select a random one from the set, that is, we follow the randCg post-search selection. There are two reasons for this option. First, some of the general search algorithms only return a single solution, so even if the algorithm found many equally good solutions, it seems to us that this unique returned solution is a random one among them. Second, as we will discuss in Section 4.3, there are no significant differences between all of the selection procedures, and thus any of them can be used to select among equally good pairs of hyperparameters.
Let us denote $\alpha(x, i, j)$ as
$$\alpha(x, i, j) = \widehat{rs}(S^j_i, \theta(x, i, j)) \qquad (9)$$
that is, the highest accuracy found in the response surface of $S^j_i$ by algorithm x, again using the 5-fold evaluation. That is, $\theta(x, i, j)$ is the pair of best hyperparameters selected by x on subset $S^j_i$ and $\alpha(x, i, j)$ is the value of $\widehat{rs}_{S^j_i}$ at that point.
Finally, let us denote $\beta(x, i, j)$ as:
$$\beta(x, i, j) = ac(S^{\hat{\jmath}}_i \mid S^j_i, \theta(x, i, j)) \qquad (10)$$
that is, the accuracy on the other subset ($S^{\hat{\jmath}}_i$) of an SVM trained on the first subset ($S^j_i$) with hyperparameters selected by algorithm x on the first subset.
For each subset $S^j_i$, we recorded the value of $\theta(x, i, j)$ and measured $\alpha(x, i, j)$ and $\beta(x, i, j)$; that is, for each subset we recorded the position (θ) and the height (α) of the highest point in $\widehat{rs}_{S^j_i}$, and the accuracy of testing the SVM trained on $S^j_i$, with the best hyperparameters discovered by x, on the similar dataset ($S^{\hat{\jmath}}_i$).
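The sketch below illustrates this nested procedure for a single dataset, with a small grid standing in for the search algorithm x; it assumes scikit-learn and placeholder data.

```python
# Schematic of the nested procedure: an outer stratified 2-fold split, with
# theta(x, i, j) chosen by an inner 5-fold search on each half.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
candidates = [(c, g) for c in 2.0 ** np.arange(-5, 16, 5)
              for g in 2.0 ** np.arange(-15, 4, 5)]   # stand-in for algorithm x

betas = []
outer = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train, test in outer.split(X, y):
    # alpha(x, i, j): the best inner 5-fold accuracy on subset S_i^j
    scored = [(cross_val_score(SVC(kernel="rbf", C=c, gamma=g),
                               X[train], y[train], cv=5).mean(), c, g)
              for c, g in candidates]
    alpha, C_best, g_best = max(scored)
    # beta(x, i, j): accuracy on the other subset with the selected hyperparameters
    model = SVC(kernel="rbf", C=C_best, gamma=g_best).fit(X[train], y[train])
    betas.append(model.score(X[test], y[test]))

print("mean beta over the two outer halves:", np.mean(betas))
```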

3.3 Accuracy gain


Let us define for each algorithm x and dataset i the mean of the best accuracy for both subsets:
$$ac(x, i) = \frac{\alpha(x, i, j) + \alpha(x, i, \hat{\jmath})}{2} \qquad (11)$$
This is a 2-fold estimate of the maximal accuracy achieved by the search algorithm x for dataset i.
We will consider the 10 × 10 grid search as the base/standard search algorithm. In fact, a popular guide to SVM (Hsu et al, 2003) suggests an 11 × 10 grid search, with C and γ limited by the constraints in Equation 5. We want to compare the algorithms described in Section 2 with the 10 × 10 grid search both in terms of the accuracy achieved and the execution time.
We will compute the accuracy gain of an algorithm x on dataset i in relation to the 10 × 10 grid search as:
$$accgain(x, i) = ac(x, i) - ac(grid100, i) \qquad (12)$$
For each algorithm x, we will compute the average accuracy gain (by averaging over all datasets) and the confidence interval for that mean. The confidence interval is computed using bootstrap with 5000 replicates.
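A sketch of such a bootstrap confidence interval for the mean accuracy gain (a percentile bootstrap; the exact bootstrap variant used in the paper may differ, and the gains array here is a random stand-in):

```python
# Sketch: percentile bootstrap (5000 replicates) for the mean accuracy gain
# of one algorithm over the datasets.
import numpy as np

def bootstrap_ci(gains, n_boot=5000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    means = [rng.choice(gains, size=len(gains), replace=True).mean()
             for _ in range(n_boot)]
    return gains.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

gains = np.random.default_rng(1).normal(0.002, 0.01, size=115)  # accgain(x, i), i = 1..115
print("mean = %.4f, 95%% CI = (%.4f, %.4f)" % bootstrap_ci(gains))
```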

3.4 Expected accuracy on future data


The expected accuracy of an SVM trained on a dataset $D_i$ with hyperparameters $\theta(x, i)$ is:
$$E_g\left[ac(g \mid G, \theta(x, i))\right]$$
A 2-fold estimate of that value is:
$$E_2(x, i) = \frac{ac(S^{\hat{\jmath}}_i \mid S^j_i, \theta(x, i, j)) + ac(S^j_i \mid S^{\hat{\jmath}}_i, \theta(x, i, \hat{\jmath}))}{2} = \frac{\beta(x, i, j) + \beta(x, i, \hat{\jmath})}{2} \qquad (13)$$
The accuracy gain (in relation to the 10×10 grid) for future data, for dataset $D_i$ and algorithm x, is
$$E_2(x, i) - E_2(grid100, i)$$
We will compute the accuracy gain on future data for all datasets and algorithms.

3.5 Time ratio


We measured the total execution time of each search algorithm when executing sequentially; we will not discuss parallelization in this research. We computed the ratio between the execution time of the search algorithm x on dataset i and the execution time of the base search, the 10 × 10 grid search. In all cases, the search algorithm ran sequentially. We computed the mean and the confidence interval of the mean time ratio using bootstrap with 5000 replicates.
The algorithms bogp and tpe were implemented in compiled code with a Python interface. For those two algorithms, we compared the execution time with a grid100 also implemented in Python, using sklearn.
Each combination of algorithm and dataset was allowed to run for at most 5 days; the combinations that exceeded that time limit were removed from both the accuracy gain and the time ratio analyses.

3.6 Post-search selection procedures


For the grid-like algorithms grid, ud, rand, and normrand, we collected all the best solutions for each subset. That is, we collected all solutions to:
$$\theta(x, i, j, k) = \operatorname*{argmax}_{C, \gamma \in x} \; \widehat{rs}(S^j_i, C, \gamma) \qquad (14)$$
where k is the index of each best solution.


We then selected a particular θ(x, i, j, k̂) using the different post-search se-
lection algorithms discussed in section 2.3. We then computed the accuracy on
the other subset of that choice of hyperparameters when tested on the other
subset of the dataset i, that is we computed

ac(S̂i |Sji , θ(x, i, j, k̂)

as an estimate of the accuracy on future data when using all of the selection
procedures.

3.7 Statistical analysis
The usual procedure for comparing different algorithms (in this case, the different search algorithms and the different post-search selection procedures) on different datasets is the one proposed by Demšar (Demšar, 2006): a broad comparison of the accuracy of all algorithms using the Friedman test (a non-parametric version of a repeated measures ANOVA) and, if its p-value is below the usual 0.05 (for 95% confidence), a pairwise comparison of all algorithms, the Nemenyi test. The comparisons with a p-value below 0.05 would be deemed significantly different (at 95% confidence).
For the post-search selection comparisons, we perform this procedure. We compared the expected accuracy gain on future data for each of the selection procedures (for the algorithms grid400, rand400, ud400, and normrand) to find out which procedure yields significantly better results for the accuracy on future data.
For the accuracy gain and time ratio, we will only compute the 95% confidence interval for those measures. We do not seek to make a broad claim that search algorithm A is "significantly better" than the other algorithms. One of the reasons is that very likely there will be no such algorithm: we are comparing 47 different combinations of algorithms and parameterizations, and it is unlikely that one of such combinations is so much better than the others that it can be statistically distinguished from all other 46 combinations with just 115 datasets/measurements. For that to be true the algorithm should be "very good" at finding the maximum of the $\widehat{rs}_G$ response surface. That is even less likely given that $\widehat{rs}_G$ is a step function, and thus "around" any very good γ and C point there is a neighborhood of equally good points (with the same value of $\widehat{rs}_G$). If the dataset is large, the neighbourhood may be small and maybe none of the other search algorithms found that region, but that is increasingly unlikely for smaller datasets.
Finally, we must also point out that one should not use the confidence intervals computed in this research as a substitute for hypothesis testing. When comparing two sets of values, confidence intervals can subsume hypothesis testing: if the 95% confidence intervals have no intersection, one can claim at 95% confidence that the two sets are significantly different (the reverse is not always true (Schenker and Gentleman, 2001)). But with multiple groups, as pointed out by Ludbrook (2000), one would have to include the effect of multiple comparisons in the confidence interval calculations. This was not done in this research.
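A sketch of this test sequence: the Friedman test is scipy's, and the Nemenyi post-hoc comes from the third-party scikit-posthocs package, which is an assumption rather than the implementation used in the paper (the analysis was done in R); the accuracy matrix is a random stand-in.

```python
# Sketch: Friedman test across procedures, followed by the Nemenyi post-hoc test.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp   # assumed third-party package for the Nemenyi test

rng = np.random.default_rng(0)
acc = rng.normal(0.85, 0.05, size=(115, 6))   # rows = datasets, columns = procedures

stat, p = friedmanchisquare(*acc.T)
print("Friedman: chi-squared = %.3f, p = %.3f" % (stat, p))
if p < 0.05:
    print(sp.posthoc_nemenyi_friedman(acc))   # pairwise p-values
```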

3.8 Reproducibility
The datasets used in the paper are available at https://figshare.com/s/d0b30e4ee58b3c9323fb as described in Wainer (2016). The program to run the different procedures and the different classifiers, the results of the multiple runs, and the R program to perform the statistical analysis described in this paper are available at https://figshare.com/s/35f990ae2b7f65062cca.

4 Results on the comparison of the search algorithms
Table 1 is the main result of the comparison of the search algorithms. The first column is the name of the search algorithm and the N parameterization. The next three columns are the mean accuracy gain and the low and high limits of the 95% confidence interval for the accuracy gain. The last three columns are the mean time ratio between the algorithm and the base search algorithm, and the low and high limits of the 95% confidence interval. The table is ordered by increasing mean accuracy gain, and the small gap in the table indicates the start of the range of algorithms which have a mean accuracy gain of 0 or greater.
Figure 2 displays the results of Table 1 in graphical form, relating the mean accuracy gain and the mean log10 time ratio for each combination of algorithm and parametrization.

                  Accuracy gain                 Time ratio
algorithm         mean     low      high        mean   low    high
svmpath5 -0.0388 -0.0483 -0.0312 0.15 0.12 0.20
asymp10 -0.0384 -0.0502 -0.0304 1.86 1.37 2.47
asymp20 -0.0346 -0.0458 -0.0268 3.15 2.33 4.16
skl1 -0.0336 -0.0416 -0.0274 0.02 0.01 0.02
asymp40 -0.0292 -0.0407 -0.0218 5.79 4.27 7.54
svmpath10 -0.0292 -0.0376 -0.0224 0.28 0.23 0.35
svmpath20 -0.0250 -0.0325 -0.0188 0.55 0.45 0.68
sdbtc5 -0.0206 -0.0262 -0.0161 0.07 0.06 0.08
skl5 -0.0179 -0.0223 -0.0144 0.06 0.05 0.07
dbtc5 -0.0176 -0.0227 -0.0139 0.15 0.13 0.17
sigest5 -0.0144 -0.0174 -0.0119 0.06 0.05 0.06
sdbtc10 -0.0127 -0.0164 -0.0100 0.11 0.10 0.12
dbtc10 -0.0111 -0.0145 -0.0086 0.22 0.19 0.24
sigest10 -0.0102 -0.0129 -0.0081 0.09 0.08 0.10
sdbtc20 -0.0101 -0.0138 -0.0076 0.19 0.17 0.21
skl10 -0.0098 -0.0136 -0.0054 0.10 0.08 0.11
skl20 -0.0094 -0.0138 -0.0057 0.16 0.14 0.18
sigest20 -0.0086 -0.0113 -0.0065 0.16 0.14 0.17
sa25 -0.0083 -0.0106 -0.0066 0.34 0.31 0.38
dbtc20 -0.0074 -0.0101 -0.0053 0.34 0.30 0.37
grid25 -0.0055 -0.0069 -0.0042 0.27 0.26 0.28
normrand25 -0.0043 -0.0062 -0.0029 0.26 0.25 0.29
rand25 -0.0041 -0.0053 -0.0030 0.23 0.22 0.24
ud25 -0.0039 -0.0052 -0.0026 0.31 0.30 0.32
cma100 -0.0026 -0.0053 -0.0005 0.98 0.84 1.15
pso25 -0.0024 -0.0037 -0.0011 0.28 0.26 0.31
bobyqa25 -0.0020 -0.0039 -0.0006 0.30 0.26 0.35
sa100 -0.0009 -0.0022 0.0001 1.23 1.15 1.33
bobyqa400 -0.0006 -0.0027 0.0007 4.27 3.70 4.82
bobyqa100 -0.0004 -0.0022 0.0008 1.14 1.00 1.30
nelder25 -0.0002 -0.0015 0.0018 0.33 0.29 0.38

gridhier50 0.0000 -0.0011 0.0013 0.54 0.51 0.61


nelder100 0.0007 -0.0006 0.0020 1.14 0.99 1.29
rand100 0.0007 -0.0004 0.0020 0.89 0.86 0.92
cma400 0.0008 -0.0017 0.0029 3.77 3.23 4.32
ud100 0.0009 -0.0001 0.0021 1.15 1.11 1.20
normrand100 0.0010 0.0001 0.0021 1.01 0.96 1.07
udhier50 0.0012 0.0001 0.0024 0.55 0.52 0.59
nelder400 0.0020 0.0006 0.0037 4.33 3.82 4.85
pso100 0.0027 0.0018 0.0039 1.09 0.99 1.19
udhier200 0.0030 0.0020 0.0044 2.09 1.96 2.23
tpe100 0.0033 0.0024 0.0045 1.12 0.97 1.29
sa400 0.0034 0.0025 0.0046 4.76 4.42 5.12
rand400 0.0036 0.0027 0.0048 3.51 3.40 3.60
ud400 0.0038 0.0029 0.0050 4.48 4.20 4.77
grid400 0.0038 0.0028 0.0059 3.74 3.65 3.83
bogp100 0.0039 0.0028 0.0053 36.08 24.43 50.69
gridhier200 0.0041 0.0033 0.0050 1.88 1.78 2.00
normrand400 0.0042 0.0032 0.0057 4.00 3.83 4.21
pso400 0.0055 0.0045 0.0070 4.26 3.83 4.74
tpe400 0.0057 0.0047 0.0070 3.19 2.82 3.62
bogp400 0.0059 0.0048 0.0074 461.25 307.12 647.53

Table 1: Accuracy gain and time ratio for the search algorithms

[Figure 2 here: scatter plot "Comparison of search algorithms", accgain vs. log10 time ratio, with algorithm types general-opt, grid-like, hierarchical, and svm-specific.]

Figure 2: The accuracy gain and log10 time ratio in relation to the 10x10 grid of
all the search algorithms. The continuous line is the local average for algorithms,
removing asymp, svmpath, and bogp.

4.1 Discussion of some of the search algorithms


As discussed, one cannot say that an algorithm "performs significantly worse" than the 10x10 grid using the confidence intervals alone, even in the cases where the confidence intervals do not include 0. Thus, we will only use the mean accuracy gain to make statements on whether an algorithm performs better or worse than the 10x10 grid, and the reader should use both the range of the interval and whether or not it contains 0 as a measure of the strength of that statement.
Two groups of algorithms are clearly distinguished from the others. The SVM-specific algorithms asymp and svmpath are clearly inferior to the other ones: they achieve a lower accuracy gain and are much more computationally costly than the other algorithms with a similar number of evaluations. On the other side, bogp (Bayesian optimization with a Gaussian process) is very costly: although bogp400 did achieve the highest accuracy gain, it runs 300 times slower than the baseline grid100.
Also, in general, as expected, algorithms with a larger number of probes to the response surface will perform better than algorithms with a lower number of probes, and will take correspondingly more computational time. But the more interesting aspect is to consider the continuous line in Figure 2 as the mean behavior of the search procedures (excluding svmpath, asymp, and bogp), and consider the ones that are above or below the "expected" behaviour. Among the algorithms with 100 and 400 probes, tpe and pso perform better than the grid-like algorithms, which perform better than the other general optimization procedures (sa, cma, nelder, and bobyqa). Among the grid-like algorithms, normrand seems slightly better than the others, but we must remind the reader that it is possible that the normrand results are due to a kind of "overfitting", because the algorithm's parameters (the means and standard deviations of the two distributions) were somewhat tuned to this particular set of datasets.
Among the algorithms with N=25, the derivative-free algorithms nelder and bobyqa also seem interesting choices. Unsurprisingly, the hierarchical algorithms perform above the mean in comparison to the other algorithms.
Finally, among the remaining SVM-specific algorithms, dbtc performs below the average, and the sampled version of dbtc is even worse. sigest seems to perform a little better than the skl algorithm.
In general terms, for a very fast search (1, 5 or 10 probes), one should use the sigest algorithm to select γ and perform a 5- or 10-point grid search for C. skl is maybe a simpler (implementation-wise) alternative, but there are some accuracy costs. dbtc is not a competitive alternative. For fast searches (20 to 25 probes) one should use nelder with N=25, or any of the grid-like algorithms with N=25. SVM-specific algorithms are not competitive with these options even at N=20.
For the conventional range on the number of evaluations (N=50 or N=100),
one should consider the hierarchical uniform design with N=50 which is faster
but performs similarly to the grid-like algorithms. For 100 evaluations one
should use either tpe or pso. If one is willing to pay the computational cost
of 200 to 400 evaluations, one should again use tpe or pso, a hierarchical grid
of 100+100, or the grid-like algorithms at N=400. Finally, among grid-like
algorithms, the normrand seems to perform slightly better than the others.
All algorithms limited to 200 and 400 evaluations performed better than the 10x10 grid search (grid100). Clearly, they are, on average, finding points a little higher in the response surface; the question is whether it is worth spending the extra effort in selecting the hyperparameters. We will discuss this question below.

4.2 Is it worth spending more effort searching for better hyperparameters?
This research, so far, has been discussing search algorithms to find maxima, or peaks, in the $\widehat{rs}_G$ surface. To remind the reader, $\widehat{rs}_G$ is a 5-fold estimate of $trs_G$, which is the true response surface of the SVM classifier when G is used as the learning set and the accuracy is the expectation when the testing data is sampled from the same distribution that G is sampled from. One really wants to find the best set of hyperparameters on $trs_G$, but one is searching for it on $\widehat{rs}_G$. The point is that, most likely, $\widehat{rs}_G$ is a noisy approximation of $trs_G$, and peaks in $\widehat{rs}_G$ may not correspond to peaks in $trs_G$. If that is the case, it may not be worth spending a lot of computational effort on the most costly search algorithms. We will provide two forms of evidence that this is the case.
The first evidence is that we do have another estimate of $trs_G$. As we discussed in Section 3.4, we measure what we called the expected accuracy on future data, which is the accuracy of training the SVM on a subset j of dataset i (with hyperparameters selected on subset j) and testing on the other subset ĵ (Equation 13). We will display the average gain in the expected accuracy on future data of each search algorithm in comparison to the 10x10 grid.
The second evidence will compare the peaks on two different $\widehat{rs}_G$ surfaces. The $\widehat{rs}_G$ is based on a 5-fold split of the known data G. If a different random generator seed is used, different folds will be created. We call each instance of running a search algorithm using a different random seed to generate different 5-folds a run. If the peaks in $\widehat{rs}_G$ are "grounded" on true peaks of $trs_G$, there should not be much of a difference between the best-selected sets of hyperparameters of the two runs.
We ran the grid400, ud400, rand400, and normrand400 algorithms on all datasets, with a different instance of the 5-fold cross-validation. The grid400 and the ud400 probe the same points in the log C × log γ space for both $\widehat{rs}_G$ surfaces, and the random seeds for the rand400 and normrand400 were also set to the same number for the two runs, so they also generate the same sequence of points to be probed in both instances. Therefore, for each of these algorithms, the same set of points was probed, but on two different $\widehat{rs}_G$ surfaces. We computed different measures of the distance between the two sets of best hyperparameters. We believe that the magnitude of the differences indicates that the maxima in the $\widehat{rs}_G$ surfaces for the two runs are not "grounded" on a real maximum of the $trs_G$ surface, but are indeed "noise" particular to each $\widehat{rs}_G$ surface.
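A sketch of how such comparison measures can be computed, assuming arrays of the probed points and of the accuracies obtained in each run; the measure definitions follow the descriptions below and are an interpretation, not the authors' code.

```python
# Sketch of the two-run comparison, assuming both runs evaluated the same
# probe points: acc1 and acc2 hold each run's 5-fold accuracies at those points.
import numpy as np

def compare_runs(points, acc1, acc2):
    """points: (N, 2) array of (log2 C, log2 gamma); acc1, acc2: accuracies per run."""
    b1, b2 = np.argmax(acc1), np.argmax(acc2)
    same_best = bool(np.all(points[b1] == points[b2]))
    rank_in_other = 1 + np.sum(acc2 > acc2[b1])      # rank of run 1's best in run 2
    dist = np.linalg.norm(points[b1] - points[b2])   # distance in log space
    gap_within = np.max(acc1) - np.sort(acc1)[-2]    # best minus second best (run 1)
    gap_across = abs(acc1[b1] - acc2[b2])            # best accuracies across runs
    return same_best, rank_in_other, dist, gap_within, gap_across
```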

4.2.1 Gains in the expected accuracy on future data


Table 2 displays the mean accuracy gain on future data in relation to grid100
and the 95% confidence interval on the mean. The algorithms are ordered in
increasing order of the accuracy gain.
Notice that all algorithms except for the top 10 (indicated by a small gap in the table) have confidence intervals that include 0. The fact that a confidence interval includes 0 is the reverse of the situation we warned the reader about in Section 3.7. There, we warned that if the confidence interval did not include 0, or if two intervals did not have an intersection, one could not conclude that the difference was significant because of the corrections needed for the multiple comparisons. Here, the fact that the confidence interval does include 0 indicates that the differences are not significant; the corrections for multiple comparisons would only enlarge the confidence intervals (Ludbrook, 2000).
Therefore, besides the first 10 SVM-specific algorithms in Table 2, none of the other search algorithms seems to be significantly different from grid100 in terms of accuracy on future data. The sa100 algorithm seems to present an anomaly: its confidence interval on the accuracy gain is also in the negative range, that is, it does not include 0. But, again, one cannot conclude that the algorithm is significantly worse than grid100. We believe that this result is spurious, in the sense that sa100 is not different from the other algorithms in the expected accuracy gain for future data; it was just an accident that its confidence interval does not include 0.
algorithm mean low high
svmpath5 -0.0271 -0.0370 -0.0196
asymp10 -0.0261 -0.0381 -0.0179
asymp20 -0.0232 -0.0349 -0.0155
asymp40 -0.0198 -0.0309 -0.0117
svmpath10 -0.0192 -0.0273 -0.0123
skl1 -0.0168 -0.0262 -0.0101
svmpath20 -0.0158 -0.0231 -0.0098
sdbtc5 -0.0063 -0.0105 -0.0027
dbtc5 -0.0056 -0.0094 -0.0025
sigest5 -0.0041 -0.0079 -0.0006

skl5 -0.0032 -0.0078 0.0017


sa100 -0.0028 -0.0059 -0.0004
skl10 -0.0025 -0.0071 0.0027
sa25 -0.0025 -0.0057 0.0007
dbtc10 -0.0023 -0.0057 0.0007
sdbtc10 -0.0022 -0.0057 0.0012
sdbtc20 -0.0018 -0.0052 0.0017
sigest10 -0.0016 -0.0055 0.0021
sigest20 -0.0015 -0.0055 0.0019
cma400 -0.0014 -0.0049 0.0024
rand100 -0.0013 -0.0038 0.0008
sa400 -0.0012 -0.0039 0.0010
cma100 -0.0011 -0.0048 0.0025
skl20 -0.0011 -0.0062 0.0043
dbtc20 -0.0010 -0.0041 0.0016
tpe100 -0.0010 -0.0036 0.0012
pso25 -0.0009 -0.0031 0.0016
bogp400 -0.0009 -0.0033 0.0012
ud400 -0.0007 -0.0032 0.0016
normrand400 -0.0007 -0.0031 0.0013
bogp100 -0.0006 -0.0034 0.0018
tpe400 -0.0006 -0.0031 0.0018
nelder25 -0.0005 -0.0029 0.0020
bobyqa400 -0.0004 -0.0027 0.0024
bobyqa100 -0.0004 -0.0026 0.0018
gridhier200 -0.0004 -0.0025 0.0014
grid25 -0.0003 -0.0022 0.0018
nelder400 -0.0000 -0.0022 0.0025
nelder100 0.0000 -0.0019 0.0024
gridhier50 0.0001 -0.0020 0.0021
udhier50 0.0002 -0.0016 0.0020
udhier200 0.0002 -0.0020 0.0027
bobyqa25 0.0004 -0.0018 0.0024
pso100 0.0004 -0.0018 0.0025
rand400 0.0004 -0.0022 0.0030
pso400 0.0005 -0.0015 0.0024
rand25 0.0005 -0.0020 0.0031
ud100 0.0006 -0.0020 0.0035
normrand100 0.0006 -0.0016 0.0028
ud25 0.0008 -0.0016 0.0028
grid400 0.0009 -0.0012 0.0031
normrand25 0.0010 -0.0011 0.0032

Table 2: Accuracy gain on future data

Figure 3 displays the gain in accuracy on future data for all search algorithms
and their log10 time ratio in relation to the 10x10 grid. The placement of the
search algorithms on the horizontal axis is the same as in Figure 2, for easier
comparison.
The important conclusion is that whatever gains each algorithm may have in relation to grid100 in finding the peaks in the $\widehat{rs}_G$ surface, those gains seem not to yield better accuracy on future data.

[Figure 3 here: scatter plot "Expected gain on future data", future accuracy gain vs. log10 time ratio, with algorithm types general-opt, grid-like, hierarchical, and svm-specific.]

Figure 3: The gain in accuracy on future data and log10 time ratio in relation
to the 10x10 grid of all the search algorithms.

4.2.2 Differences between two $\widehat{rs}_G$ surfaces

Table 3 reports the results of comparing two different $\widehat{rs}_G$ surfaces. The first line is the number of datasets for which there was only one peak (a single best point C′ and γ′) in at least one of the runs. The second line is the proportion among those datasets for which the two runs resulted in the same best selection. The third line is the average rank, in the other run, of the accuracy of the pair C′ and γ′ that was selected as the single best solution in one run. The fourth line is the median of the difference of accuracy between the best selection and the second-best selection in one run. The fifth measurement is the median of the difference in the accuracy of the best-selected solutions of the two runs. Finally, the last measurement is the average distance in the log C × log γ space between the best solutions of the two runs.
Let us inspect the results for the ud400 search algorithm. 45 datasets resulted in at least one run selecting a single best set of hyperparameters. Let us call $C_1$ and $\gamma_1$ the best selection for the first run and $C_2$ and $\gamma_2$ the best selection for the second run. $\widehat{rs}^1_1$ and $\widehat{rs}^1_2$ are the corresponding accuracies of the best (superscript 1) points for each run (subscript 1 and 2). $\widehat{rs}^2_1$ and $\widehat{rs}^2_2$ are the accuracies of the second-best (superscript 2) solutions for each run. For 9% of the datasets, both runs resulted in the same solution, that is, $C_1 = C_2$ and $\gamma_1 = \gamma_2$. The average rank of the accuracy of the point $C_2, \gamma_2$ in the first run and of the point $C_1, \gamma_1$ in the second run was 20.2.
Measurement                                                        ud400    grid400   rand400   normrand400
Number of datasets with a single best selection                     45       59        16        12
Proportion of datasets with the same best solution for both runs    9%       54%       6%        8%
Average rank of the best solution in one run, in the other run      20.2     1.92      29.7      24.4
Median difference between accuracy of the best and second-best
  selections in a single run                                        0.0009   0.0048    0.0013    0.0010
Median difference between the accuracy of the best solutions
  in both runs                                                      0.0089   0.0085    0.012     0.011
Average distance between best selections in both runs
  in the log C × log γ space                                        4.7      1.2       3.9       3.9

Table 3: Comparison of the selection of the best hyperparameters for two runs
of the ud400, grid400, rand400, and normrand400 algorithms. See text for the
explanation of the measurements.

That is, on average, the best point of one run ranks only as the 20th best point in the other run. This seems to be the strongest evidence that ud400 is mostly searching for peaks in the “noise” of the $\widehat{rs}_G$ surface: if the peaks in the accuracy were really peaks in the $trs_G(C, \gamma)$ surface, one would not expect such a large difference from one run to the other. The same evidence can be seen in the accuracy values. The median difference between $\widehat{rs}^1_1$ and $\widehat{rs}^2_1$ (and between $\widehat{rs}^1_2$ and $\widehat{rs}^2_2$), that is, between the best and second-best solutions of a single run, is $8.6 \times 10^{-4}$. The median difference between $\widehat{rs}^1_1$ and $\widehat{rs}^1_2$, that is, between the accuracies of the best solutions across runs, is ten times as large, $8.9 \times 10^{-3}$. Again, one would not expect such a difference between the two runs if the selected peaks were real. Finally, the average distance between $C_1, \gamma_1$ and $C_2, \gamma_2$ is 4.7 in log C × log γ space.
The numbers are similar for the rand400 and normrand400 algorithms, although for them fewer datasets produced a single best solution. The exception is the grid400 algorithm, which has a much higher rate of agreement of the best solution between the two runs, a much smaller gap between the within-run difference (best versus second best) and the between-run difference of the best accuracies, and a much smaller average distance between the best solutions of the two runs. We do not know how to explain this discrepancy between the grid search and the other algorithms.

4.3 Results on the different post-search selection procedures
There were 995 combinations of grid-like algorithms (grid, ud, rand, normrand) and subsets (half of each dataset) for which there was more than one best solution for the hyperparameters. For these combinations, the median number of tied solutions was 5, the mean was 24.57, the minimum 2, and the maximum 293.
Table 4 displays the mean rank of the accuracy on future data for each selection procedure. As one can see, there is very little difference between the ranks. In fact, the Friedman test on the accuracy on future data of the selection procedures results in a p-value of 0.42 (chi-squared = 4.948, df = 5), which indicates no significant differences among the selection procedures.

selection procedure mean rank


mingC 1.40
maxCg 1.41
randomCg 1.41
meanCg 1.41
minCg 1.42
maxgC 1.46

Table 4: The mean rank in terms of accuracy on future data on the post search
selection procedures

It could be the case that the non-significant differences are due to the large number of procedures being compared. Table 5 therefore compares only the “more reasonable” selection procedures: minCg, mingC, randomCg, and meanCg. The resulting Friedman test has a p-value of 0.80 (chi-squared = 1.0188, df = 3). Therefore, there is no evidence of differences among the post-search selection procedures.

selection procedure mean rank


mingC 1.35
randomCg 1.36
meanCg 1.36
minCg 1.38

Table 5: The mean rank in terms of accuracy on future data of the four more
reasonable post-search selection procedures

Finally, even if we restrict the analysis to cases with 6 or more tied solutions (which results in 464 combinations of grid-like algorithms and subsets), the p-value of the Friedman test for all 6 selection procedures is 0.63 (chi-squared = 3.4738, df = 5) and for the 4 selection procedures it is 0.56 (chi-squared = 2.0767, df = 3). So, there is ample evidence that, despite reasonable arguments for using one or another post-search selection procedure, there are no significant differences in accuracy on future data among them.
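To make these procedures concrete, the sketch below shows how such tie-breaking rules could be implemented, under our reading of the procedure names (for example, minCg = smallest C, ties broken by smallest γ); the exact definitions used in the experiments are the ones given earlier in the paper.

    import numpy as np

    def pick_among_ties(ties, rule, rng=None):
        # ties: array of shape (n, 2) with the tied (log C, log gamma) pairs that
        # share the best cross-validated accuracy.  The rule names follow our
        # reading of the paper's labels (an assumption, not the authors' code).
        ties = np.asarray(ties, dtype=float)
        rng = np.random.default_rng(0) if rng is None else rng
        if rule == "minCg":        # smallest C, then smallest gamma
            return ties[np.lexsort((ties[:, 1], ties[:, 0]))[0]]
        if rule == "mingC":        # smallest gamma, then smallest C
            return ties[np.lexsort((ties[:, 0], ties[:, 1]))[0]]
        if rule == "maxCg":        # largest C, then largest gamma
            return ties[np.lexsort((-ties[:, 1], -ties[:, 0]))[0]]
        if rule == "maxgC":        # largest gamma, then largest C
            return ties[np.lexsort((-ties[:, 0], -ties[:, 1]))[0]]
        if rule == "randomCg":     # a random tied solution
            return ties[rng.integers(len(ties))]
        if rule == "meanCg":       # centroid of the tied solutions
            return ties.mean(axis=0)
        raise ValueError("unknown rule: " + rule)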

4.4 Hyperparameters selected by the best algorithms
Figure 4 plots the best log C and log γ found by all the algorithms with 200 and 400 probes. The mean log C for the best algorithms, excluding normrand400, is 5.37 with a standard deviation of 5.33; the mean log γ is −5.56 with a standard deviation of 3.57. These numbers are the source of our suggestion for the normrand algorithm: a mean of 5 and a standard deviation of 5 for log C, and a mean of −5 and a standard deviation of 5 for log γ, are mnemonic approximations of those values.
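Purely as an illustration, a normrand-style search based on these mnemonic distributions might look like the sketch below; the base of the logarithm (base 2 here) and the 5-fold cross-validation are our assumptions, not necessarily the exact protocol of the experiments.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def normrand_search(X, y, n_probes=25, base=2.0, seed=0):
        # Sample (C, gamma) candidates with log C ~ N(5, 5) and
        # log gamma ~ N(-5, 5), and keep the candidate with the best
        # cross-validated accuracy.
        rng = np.random.default_rng(seed)
        log_c = rng.normal(5.0, 5.0, n_probes)
        log_g = rng.normal(-5.0, 5.0, n_probes)
        best, best_score = None, -np.inf
        for lc, lg in zip(log_c, log_g):
            score = cross_val_score(
                SVC(kernel="rbf", C=base ** lc, gamma=base ** lg),
                X, y, cv=5).mean()
            if score > best_score:
                best, best_score = (base ** lc, base ** lg), score
        return best, best_score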

[Figure 4 here: a scatter/density plot of the selected hyperparameters, with log C on the horizontal axis and log γ on the vertical axis; a box marks the constrained range of values.]

Figure 4: The density plot of the C and γ selected by the best algorithms. The box marks the constrained range of values.

5 Discussion
An important issue to discuss is whether the results of this research can be
generalized to any future dataset. The data sets used in this research are a large
sample of those available at UCI in 2014, collected by the authors of Fernández-
Delgado et al (2014). We believe they can be considered as an unbiased sample
of real-world data sets. But some classes of data sets are not represented in
this set: very large data sets and data sets with a high number of dimensions.
Fortunately, RBF SVMs are usually not applied to these types of data sets: SVM training is at least quadratic in the number of training examples and therefore does not scale well to large data sets, with the possible exception of online SVM solvers; and SVM with an RBF kernel is usually not used on high-dimensional problems, since in those cases a linear SVM is probably a better solution. Finally, all datasets are binary classification problems. We do not believe that the conclusions would be different for multi-class problems, where SVMs are used in one-vs-one or one-vs-all configurations.
The choice of search algorithms was necessarily ad hoc. We believe that this research covered most of the important proposals for SVM hyperparameter search algorithms. A somewhat unconventional inclusion was the uniform design (ud) search: we wanted to include at least one algorithm from the research area of Design of Computer Experiments, which has not been well represented in hyperparameter search research. The most notable absentees in this research are the many versions of sequential testing (Krueger et al, 2015).
Also, we must point out that we made no attempt to select hyperparameters
for the search procedures themselves. The more complex general optimization
algorithms such as pso, sa, cma have themselves hyperparameters: one has
to define a function that will reduce the temperature in sa; one has to define
the number of particles and the relative strength of the momentum and the
attractive force among the particles in pso; and so on. Although we cited research that proposed using general optimization algorithms such as pso, sa, and cma (Lin et al, 2008b; Huang and Dun, 2008; Lin et al, 2008a; Pai and Hong, 2005; Imbault and Lebart, 2004), we did not attempt to reproduce the algorithms' hyperparameter settings as proposed by those authors or to perform any search for them; we relied on the defaults of the implementations used. As we will discuss below, we do not think that a better choice of search
hyperparameters would have resulted in a better selection of the SVM hyperpa-
rameters. Similarly, we do not believe that the inclusion of some of the missing
search algorithms such as ant colony optimisation would have changed the main
conclusions of this research in any significant way.
The time measurement results are less generalizable: the results reflect the
2019 implementation in R of the search algorithms, and the relative speeds of
the different search algorithms may change as better and faster implementations
become available. Also, we must point out that the timing data refer to the
sequential execution of the search algorithms. Parallel execution is trivial for
grid-like algorithms and there is published research on parallel versions of pso,
sa, and bogp.
We are not aware of any published research that systematically compares different search algorithms for SVM, with the exception of Mantovani et al (2015), who compare, on 70 data sets, 6 different search algorithms: grid search, random search, default values for the hyperparameters, genetic algorithms (GA), estimation of distribution algorithms (EDA), and particle swarm optimization (pso). Unlike this research, they run experiments with up to 10000 evaluations, but they use a non-standard cross-validation to tune and to measure the accuracy of the SVM. What they call the S-CV (single cross-validation step) divides the data set into K subsets; a 2-fold split is used to train the SVM with the different combinations of hyperparameters, which are tested on one of the remaining folds. The last fold is used only once, to measure the accuracy of the best selection of hyperparameters. The process is repeated K times, changing the last two folds and using the remainder for training. We used the more traditional nested cross-validation: an external 2-fold CV to estimate the accuracy, with the selection of the hyperparameters performed by an internal 5-fold CV (one selection per external fold). We do not know if the simpler, less costly S-CV introduces bias into the evaluation, but we do know that nested CV is considered a more standard way of evaluating classifiers (Cawley and Talbot, 2010). They performed the standard significance tests on the 6 search algorithms and did not find any significant differences, except for the default settings, which performed worse than the other 5 algorithms.
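For reference, this nested protocol can be sketched as follows with scikit-learn (a simplified illustration; the experiments in this paper were run in R, and the inner grid below is only a placeholder for whichever search algorithm is being evaluated).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in dataset

    # Inner 5-fold CV selects (C, gamma); the grid below is just a placeholder
    # for the search algorithm under evaluation.
    param_grid = {"C": 2.0 ** np.arange(-5, 16, 2.0),
                  "gamma": 2.0 ** np.arange(-15, 4, 2.0)}
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner)

    # Outer 2-fold CV estimates the accuracy on future data of the whole
    # "search + fit" procedure (one hyperparameter selection per outer fold).
    outer = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    print(cross_val_score(search, X, y, cv=outer).mean())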
There has been some recent research on hyperparameter optimization in general (Hutter et al, 2011; Bergstra et al, 2011; Snoek et al, 2012; Bergstra and Bengio, 2012; Krueger et al, 2015; Hutter et al, 2019), but it tends to focus on the more challenging problem of searching high-dimensional spaces (Bergstra et al, 2013).
Finally, this research has some strong similarities to Wainer and Cawley (2017). There, the authors fixed the search procedure used to select the SVM hyperparameters (an 11x10 grid search) and varied the resampling procedure used to compute $\widehat{rs}_G$. That paper reaches a similar conclusion: it is not worth using very precise and costly resampling procedures for the selection of hyperparameters. In particular, the authors suggest using a 2- or 3-fold resampling to select the hyperparameters. Here, we find that it is not worth using costly search procedures. In both cases, it seems that efforts to find the maxima of $\widehat{rs}_G$ precisely, or to measure $\widehat{rs}_G$ precisely, are wasteful. The $trs_G$ surface does not seem to have narrow, high maxima: narrow in the sense that one would need both to measure the surface precisely and to spend a lot of search effort to find them, and high in the sense that the difference in accuracy between such a peak and the other, not-so-good points would be large enough to matter. $trs_G$ for SVMs seems to be smooth (at least near the maxima) and one should not expect that either measuring it more precisely or probing it many times will yield a “sweet spot” in accuracy. We can only speculate whether this is also true for other classification algorithms.

6 Conclusions
Regarding the main goal of this research, the comparison of different search
algorithms for the C and γ hyperparameters of an RBF SVM, we found that:
• more evaluations of the cross-validated accuracy will, in general, result in
the selection of hyperparameters that have higher cross-validated accuracy,
which is not surprising.
• if one is willing to allow many evaluations (100 or 400), the tree of Parzen estimators (Bergstra et al, 2013) and particle swarm optimization (Eberhart et al, 1995) seem to be the two algorithms that achieve a good balance between execution time and finding higher-accuracy points on the $\widehat{rs}_G$ surface.
• Bayesian optimization (Pelikan et al, 1999; Snoek et al, 2012) does find
high accuracy points but at a very high computational cost.

• all grid-like algorithms (grid, rand, ud, and normrand) seem to have similar performance, but the accuracy gains for normrand seem a little higher than for the other three.
• the hierarchical versions of grid and ud are, not surprisingly, good alter-
natives to the flat versions of the algorithms.

• at a lower number of evaluations (25), nelder and pso seem to be the best alternatives.
• at even lower numbers of evaluations (10 and 20), the default value of the γ parameter selected by sigest (Caputo et al, 2002) seems a better solution than the alternatives proposed by Sklearn and by the distance between the two classes. The sigest default value for 1/γ tested in this research is the median squared distance among a 50% sample of the data points (see the sketch after this list).
• the SVM-specific algorithms svmpath (Hastie et al, 2004) and asym (Keerthi
and Lin, 2003) are not competitive alternatives.
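The sketch below shows our reading of that median-distance default for γ; it is only an illustration, not the sigest implementation used in the experiments.

    import numpy as np
    from scipy.spatial.distance import pdist

    def default_gamma(X, frac=0.5, seed=0):
        # Median-distance heuristic in the spirit of sigest: 1/gamma is the
        # median squared Euclidean distance among a `frac` sample of the rows
        # of X (assumed already scaled to zero mean and unit variance).
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        idx = rng.choice(n, size=max(2, int(frac * n)), replace=False)
        return 1.0 / np.median(pdist(X[idx]) ** 2)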

Regarding the second goal of this research, effort spent finding hyperparameters with higher cross-validated accuracy does not seem to translate into better expected accuracy on future data. There is no important difference in (an estimate of) the expected accuracy on future data between the best-performing algorithms and grid100.
Regarding the third goal of this research, we could not find any significant difference between the different post-search selection procedures. That is, despite arguments that one should select a lower C and a lower γ when multiple best solutions are found, there is no evidence that this selection procedure is any better, in terms of accuracy on future data, than a random selection or the other procedures.
Regarding the fourth goal of this research, to a first approximation, the distribution of the best log C (for datasets whose features are scaled to zero mean and unit standard deviation) is a normal distribution with mean 5 and standard deviation 5, and the best log γ follows a normal distribution with mean −5 and standard deviation 5.
Finally, an important conclusion of this paper is that it is likely not worth spending too much computational time searching for better hyperparameters. The search algorithms are likely finding “noisy” peaks in the $\widehat{rs}_G$ surface which are not representative of “real” peaks in the $trs_G$ surface. If that is indeed the case, then one should not spend a lot of computational effort searching for the best set of hyperparameters, and our conclusion that pso and tpe are, in general, the best search algorithms for a high number of evaluations is only of theoretical interest. In this case, one should choose algorithms with a lower number of evaluations. Given that grid-like algorithms can be trivially parallelized, if the user has enough computational instances, we believe that a grid-like search with 25 or a few more evaluations is likely enough to find good hyperparameters for most problems.
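As a concrete illustration of this recommendation, a small parallel grid search could look like the sketch below (scikit-learn shown for convenience; the 5x5 layout and the grid limits are our illustrative choices, and they assume standardized features).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # stand-in dataset

    # A 5x5 grid = 25 evaluations, run in parallel across all available cores.
    param_grid = {"C": 2.0 ** np.linspace(-5, 15, 5),
                  "gamma": 2.0 ** np.linspace(-15, 3, 5)}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)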

References
Aarts E, Korst J (1988) Simulated annealing and Boltzmann machines. John
Wiley and Sons Inc.

Anguita D, Ridella S, Rivieccio F, Zunino R (2003) Hyperparameter design criteria for support vector classifiers. Neurocomputing 55(1–2):109–134
Bendtsen C (2012) pso: Particle swarm optimization. version 1.0.3
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization.
Journal of Machine Learning Research 13:281–305

Bergstra J, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp 2546–2554
Bergstra J, Yamins D, Cox DD (2013) Making a science of model search: Hy-
perparameter optimization in hundreds of dimensions for vision architectures.
Journal of Machine Learning Research 28:115–123
Caputo B, Sim K, Furesjo F, Smola A (2002) Appearance-based object recogni-
tion using SVMs: which kernel should I use? In: NIPS Workshop on statisti-
cal methods for computational experiments in visual processing and computer
vision.
Cawley G, Talbot N (2007) Preventing over-fitting during model selection via
bayesian regularisation of the hyper-parameters. Journal of Machine Learning
Research 8:841–861
Cawley GC, Talbot NL (2010) On over-fitting in model selection and subse-
quent selection bias in performance evaluation. Journal of Machine Learning
Research 11:2079–2107
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2(3):27

Cohen G, Ruch P, Hilario M (2005) Model selection for support vector classifiers
via direct simplex search. In: FLAIRS Conference, pp 431–435
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets.
Journal of Machine Learning Research 7:1–30

Duarte E, Wainer J (2017) Empirical comparison of cross-validation and internal
metrics for tuning SVM hyperparameters. Pattern Recognition Letters 88:6–
11
Eberhart R, Kennedy J, et al (1995) A new optimizer using particle swarm
theory. In: International Symposium on Micro Machine and Human Science,
vol 1, pp 39–43
Fang KT, Lin D, Winker P, Zhang Y (2000) Uniform design: theory and appli-
cation. Technometrics 42(3):237–248
Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need
hundreds of classifiers to solve real world classification problems? Journal of
Machine Learning Research 15:3133–3181
Hansen N, Müller SD, Koumoutsakos P (2003) Reducing the time complexity
of the derandomized evolution strategy with covariance matrix adaptation
(cma-es). Evolutionary Computation 11(1):1–18

Hastie T (2012) svmpath: the SVM Path algorithm. version 0.953


Hastie T, Rosset S, Tibshirani R, Zhu J (2004) The entire regularization path for
the support vector machine. Journal of Machine Learning Research 5:1391–
1415

Hsu CW, Chang CC, Lin CJ (2003) A practical guide to support vector classi-
fication. Tech. rep., Department of Computer Science National Taiwan Uni-
versity, updated: May 19, 2016
Huang CL, Dun JF (2008) A distributed PSO–SVM hybrid system with feature
selection and parameter optimization. Applied Soft Computing 8(4):1381–
1391
Huang CM, Lee YJ, Lin D, Huang SY (2007) Model selection for support vec-
tor machines via uniform design. Computational Statistics & Data Analysis
52(1):335–346
Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-based optimiza-
tion for general algorithm configuration. In: International Conference on
Learning and Intelligent Optimization, Springer, pp 507–523
Hutter F, Kotthoff L, Vanschoren J (2019) Automated Machine Learning.
Springer

Hyperopt: Distributed asynchronous hyper-parameter optimization. http://github.com/hyperopt/hyperopt
Imbault F, Lebart K (2004) A stochastic optimization approach for parameter
tuning of support vector machines. In: International Conference on Pattern
Recognition, IEEE, vol 4, pp 597–600

Joachims T (2000) The maximum-margin approach to learning text classifiers: Methods, theory, and algorithms. PhD thesis, Dortmund University
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab – an S4 package
for kernel methods in R. Journal of Statistical Software 11(9):1–20
Keerthi S (2002) Efficient tuning of SVM hyperparameters using radius/margin
bound and iterative algorithms. IEEE Transactions on Neural Networks
13(5):1225–1229
Keerthi S, Lin CJ (2003) Asymptotic behaviors of support vector machines with
Gaussian kernel. Neural computation 15(7):1667–1689
Keerthi S, Sindhwani V, Chapelle O (2006) An efficient method for gradient-
based adaptation of hyperparameters in SVM models. In: Advances in neural
information processing systems, pp 673–680
Klein A, Falkner S, Mansur N, Hutter F (2017) Robo: A flexible and robust
bayesian optimization framework in python. In: NIPS 2017 Bayesian Opti-
mization Workshop
Krueger T, Panknin D, Braun M (2015) Fast cross-validation via sequential
testing. Journal of Machine Learning Research 16:1103–1155
Lin SW, Lee ZJ, Chen SC, Tseng TY (2008a) Parameter determination of sup-
port vector machine and feature selection using simulated annealing approach.
Applied soft computing 8(4):1505–1512
Lin SW, Ying KC, Chen SC, Lee ZJ (2008b) Particle swarm optimization for
parameter determination and feature selection of support vector machines.
Expert Systems with Applications 35(4):1817–1824
Ludbrook J (2000) Multiple inferences using confidence intervals. Clinical and
Experimental Pharmacology and Physiology 27(3):212–215
Mantovani RG, Rossi AL, Vanschoren J, Bischl B, de Carvalho AC (2015) Effec-
tiveness of random search in SVM hyper-parameter tuning. In: International
Joint Conference on Neural Networks (IJCNN), pp 1–8
Nelder J, Mead R (1965) A simplex method for function minimization. The
Computer Journal 7(4):308–313
Pai PF, Hong WC (2005) Support vector machines with simulated annealing
algorithms in electricity load forecasting. Energy Conversion and Management
46(17):2669–2688
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel
M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau
D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research 12:2825–2830

Pelikan M, Goldberg DE, Cantú-Paz E (1999) BOA: The Bayesian optimization
algorithm. In: Annual Conference on Genetic and Evolutionary Computation,
pp 525–532
Powell M (2009) The BOBYQA algorithm for bound constrained optimization
without derivatives. Tech. Rep. Report NA2009/06, Department of Applied
Mathematics and Theoretical Physics, University of Cambridge
Schenker N, Gentleman J (2001) On judging the significance of differences by ex-
amining the overlap between confidence intervals. The American Statistician
55(3):182–186

Sklearn (2019) RBF SVM parameters. https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
Snoek J, Larochelle H, Adams RP (2012) Practical bayesian optimization of
machine learning algorithms. In: Advances in Neural Information Processing
Systems, pp 2951–2959

Sun J, Zheng C, Li X, Zhou Y (2010) Analysis of the distance between two classes for tuning SVM hyperparameters. IEEE Transactions on Neural Networks 21(2):305–318
Trautmann H, Mersmann O, Arnu D (2011) cmaes: Covariance matrix adapting
evolutionary strategy. version 1.0-11

Wahba G, Lin Y, Lee Y, Zhang H (2001) On the relation between the gacv
and Joachims’ xi-alpha method for tuning support vector machines, with
extensions to the non-standard case. Tech. Rep. 1039, Statistics Department
University of Wisconsin, Madison WI

Wainer J (2016) Comparison of 14 different families of classification algorithms on 115 binary datasets. ArXiv e-prints 1606.00930
Wainer J, Cawley G (2017) Empirical evaluation of resampling procedures
for optimising svm hyperparameters. Journal of Machine Learning Research
18(15):1–35

Wessing S (2018) Proper initialization is crucial for the Nelder–Mead simplex search. Optimization Letters pp 1–10
Xiang Y, Gubian S, Suomela B, Hoeng J (2013) Generalized simulated annealing for efficient global optimization: the GenSA package for R. The R Journal 5(1), June 2013

Ypma J, et al (2014) nloptr: R interface to NLopt. version 1.0.4


Zhu X (2016) mixtox: Curve fitting and mixture toxicity assessment. version
1.3

