Hyperparameter Search: An empirical evaluation of 18 search algorithms
arXiv:2008.11655v1 [cs.LG] 26 Aug 2020
Abstract
SVM with an RBF kernel is usually one of the best classification al-
gorithms for most data sets, but it is important to tune the two hyper-
parameters C and γ to the data itself. In general, the selection of the
hyperparameters is a non-convex optimization problem and thus many
algorithms have been proposed to solve it, among them: grid search, ran-
dom search, Bayesian optimization, simulated annealing, particle swarm
optimization, Nelder Mead, and others. There have also been proposals
to decouple the selection of γ and C. We empirically compare 18 of these
proposed search algorithms (with different parameterizations for a total of
47 combinations) on 115 real-life binary data sets. We find (among other
things) that trees of Parzen estimators and particle swarm optimization
select better hyperparameters with only a slight increase in computation
time with respect to a grid search with the same number of evaluations.
We also find that spending too much computational effort searching the
hyperparameters will not likely result in better performance for future
data and that there are no significant differences among the different pro-
cedures to select the best set of hyperparameters when more than one is
found by the search algorithms.
Keywords: Hyperparameters; SVM; grid search; random search; non-
convex optimization algorithms
1 Introduction
Support vector machines (SVM), and in particular SVM with the RBF kernel, are a powerful classification technique, ranking among the best two or three classifiers for a large set of real-life datasets (Fernández-Delgado et al, 2014; Wainer, 2016).
∗ Corresponding author: wainer@ic.unicamp.br
SVM with RBF kernel requires one to select the value of two hyperparameters
C and γ before computing the SVM itself. There have been many proposals
in the literature on how to search for the best pair of hyperparameters for a
particular data set, among them general optimization procedures such as grid
search, random search, simulated annealing, Bayesian optimization, and many
others. The main goal of this paper is to compare many of these proposals on a
large number of real-life binary data sets. The 18 families of search algorithms
that are compared in this research are discussed in Section 2, including their
different parameterizations. In total, we tested 47 different combinations of
algorithms and parameterizations.
SVM, in general, is formulated as a minimization problem:

    min (1/2) ‖w‖² + C ∑i ξi        (1)
    subject to yi (w · φ(xi) + b) ≥ 1 − ξi, with ξi ≥ 0, for i = 1, …, n
Let us assume a given data set G is an i.i.d. sample from an unknown distribution D. One would like to select the best pair of hyperparameters as:

    argmaxC,γ Eg∼D [ac(g|G, C, γ)]        (3)

where Eg∼D is the expectation taken over a random variable g that represents future data with the same distribution as the data G, and ac(g|G, C, γ) is the accuracy on g of the SVM trained on G with hyperparameters C and γ.
One can consider the right side of Equation 3 as the definition of a function
of two parameters, C and γ, which we will call the true response surface of
the SVM for the data set G. We will represent the true response surface as
trsG (C, γ).
Of course, one does not have access to the new data g or the underlying
distribution D, and thus one cannot compute Eg [ac(g|G, C, γ)]. One solution
is to compute an estimate of that value using some form of resampling of the
original data set G. Resampling will divide the data set G into one or more pairs
of data sets TRi and TEi, such that TRi ∩ TEi = ∅ and TRi ∪ TEi ⊆ G. K-fold
cross-validation, bootstrapping, and leave-one-out are examples of resampling
procedures.
The estimate of Eg[ac(g|G, C1, γ1)] for some fixed C1 and γ1 is the mean of the accuracy of training the SVM (with C1 and γ1 as hyperparameters) on each of the TRi (the training set) and testing on the corresponding TEi (the
test set). In this paper, we will use 5-fold cross-validation as the resampling
procedure.
We will define an approximation to the function trsG(C, γ) as:

    rŝ(G, C, γ) = rŝG(C, γ) = meanᵢ ac(TEi | TRi, C, γ)        (4)
For simplicity, we will call this the response surface (without the qualifier true) of the SVM for the data set G and for a particular resampling choice of pairs {TRi, TEi}.
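As a concrete illustration, here is a minimal sketch of this 5-fold estimate using scikit-learn. The function name rs_hat and the scikit-learn-based code are illustrative only and are not the implementation used in the experiments.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def rs_hat(X, y, C, gamma, n_splits=5, seed=0):
    """5-fold cross-validated accuracy of an RBF SVM at (C, gamma):
    an estimate of the response surface of Equation 4."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, test_idx in folds.split(X, y):
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        clf.fit(X[train_idx], y[train_idx])                       # train on TRi
        accuracies.append(clf.score(X[test_idx], y[test_idx]))    # test on TEi
    return float(np.mean(accuracies))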
Selecting the best hyperparameters for the data set G is finding the maximum of the response surface rŝG. Or, to be more consonant with the optimization terminology, one can think of minimizing the error surface 1 − rŝG.
In general, the response surface is not concave (or the error surface is not convex) and thus there may be many local maxima, which requires optimization algorithms that are not just based on gradient ascent. There have been many proposals of algorithms that will find hyperparameters with higher values of rŝG. The first goal of this research is to compare many of those algorithms.
One must remember that the aim of hyperparameter search is to find a pair of hyperparameters that maximizes the true response surface trsG, that is, the pair that will result in the classifier with the highest accuracy for future data. The rŝG surface is an approximation of the trsG surface, and thus there will likely be differences between the values of trsG(C1, γ1) and rŝG(C1, γ1). The second goal of this paper is to gain some understanding of whether it is worth spending the computational effort on algorithms that will likely select better hyperparameters in rŝG but may not achieve a better solution in trsG. We will estimate the correlation between improving the selection in rŝG and the expected gain in accuracy on future data.
The rŝG is also a step function: small changes in C and γ will not change the accuracy of the resulting SVM on a finite set of data. Therefore, it is likely that any search algorithm for pairs of C and γ will produce a set of pairs of hyperparameters with the same value of rŝG.
If more than one pair of hyperparameters is selected there is a reasonable ar-
gument to choose the pair with the smallest C and the smallest γ. For example,
the tutorial page for hyperparameters in SVM in sklearn (Sklearn, 2019) makes
that argument. Both hyperparameters work as inverse regularization terms. A
large C will place emphasis on lowering the number of support vectors, since each one of them contributes to the ∑i ξi cost in the optimization. A lower C will allow more support vectors, resulting in larger margins.
The γ parameter controls how fast the “influence” of a point decreases with
distance. The kernel value for two points will decrease as γ increases. As
γ increases, the decision surfaces become more “curvy” and fit the training data more closely. A smaller γ will generate decision surfaces that are flatter, and thus a simpler model.
Thus, in the case of search algorithms that return a set of best hyperparame-
ters, one has to perform a further selection procedure. We call this post-search
selection. The third goal of this paper is to compare different post-search selection procedures with regard to the accuracy on future data.
The fourth goal of the paper is to get some insight into the distributions of the best C and γ selected by the many algorithms, and in particular whether these values are concentrated in some region of the hyperparameter space.
The second direction tries to decouple the choice of γ from the choice of C
(the more frequent approach) so that γ is selected first (sometimes using a grid
search, sometimes not) based on some measure of the data, and only then the
C is chosen (usually by grid search).
• grid: In the grid search the log2 C ×log2 γ space is divided into an equally
spaced grid and only the pairs in the grid points are tested.
We used the following constraints on C and γ not only for the grid search
but for all algorithms:
    2⁻⁵ ≤ C ≤ 2¹⁵,   2⁻¹⁵ ≤ γ ≤ 2³        (5)
which are the limits of the grid search suggested by the LibSVM package
(Chang and Lin, 2011; Hsu et al, 2003), a very popular SVM solver. The
grid search and all other algorithms perform the search in the log C ×
log γ space (and not in the C × γ space).
As we will discuss further in Section 3, we are interested in comparing the quality of the different algorithms when controlling for the number of times the optimizer evaluates rŝ at some point in the log C × log γ space. We call this number N. Notice that since we are using 5-fold cross-validation to evaluate rŝ, the SVM learning and evaluation methods will be called 5*N times.
We tested the grid procedure for N = 25, N = 100, and N = 400, and in each case we divided both the log C and the log γ dimensions into an equal number of intervals (5, 10, and 20, respectively). A sketch of this grid-like search is given after this list of algorithms.
• ud: Uniform design is a space-filling method (Fang et al, 2000) used in the
design of computer experiments. It approximates a uniform distribution
over a space of parameters (or hyperparameters in our case), given the
number of points one needs to sample from that space. The sampling is
deterministic and it chooses points that minimize a discrepancy measure
in relation to the uniform distribution. Uniform design has been proposed
as a method to select SVM hyperparameters (Huang et al, 2007).
To generate the uniform design tables we used the unidTab function from
the mixtox package (Zhu, 2016). We tested ud with N = {25, 100, 400}.
• rand: N random points are uniformly sampled from the log C ×log γ space
within the bounds of Equation 5.
• normrand: Randomly select log C from a normal distribution with mean 5
and standard deviation 5 and log γ from a normal distribution with mean
−5 and standard deviation 5.
We tested the other algorithms discussed in this section on 115 datasets
(see Section 3.1) and collected the C and γ selected by them. The distri-
bution of the selected log C and log γ follows a Gaussian distribution (see Section 4.4). This prompted the realization that random sampling using
these Gaussian distributions could achieve, in general, better results than
the rand search discussed above.
We tested normrand with N = {25, 100, 400}. But we must warn that the analysis of the results for this algorithm has a risk of overfitting: the algorithm was tuned using all the available datasets, and we are measuring its quality on the same set of datasets.
• gridhier: Hierarchical grid. A grid search is performed to select the point
x as the one with the highest accuracy. A second grid search centered on x
and bounded by the closest neighbours of x in the first grid is performed,
and the highest accuracy point among x and the points of the second search is selected.
We tested the hierarchical grid procedure both with 25 + 25, that is, N = 25 in the first level followed by N = 25 in the second level, and with 100 + 100.
• udhier: Hierarchical uniform design. Similar to the gridhier, but the
uniform design search is performed at both levels. Tested with N = 25+25
and 100 + 100.
• nelder: The Nelder-Mead simplex algorithm (Nelder and Mead, 1965) is a well-known derivative-free optimization algorithm. The algorithm maintains a simplex of points in the parameter space (in our case of 2 dimensions the simplex is a triangle) and replaces the simplex point with the worst evaluation by a new point symmetric to it on the other side of the centroid of the remaining points. The algorithm converges when either the simplex is too small or the improvement obtained by replacing the worst simplex vertex is below a threshold. Nelder-Mead has been suggested as a method for selecting SVM hyperparameters (Cohen et al, 2005; Cawley and Talbot, 2007).
We used the neldermead function of the nloptr package (Ypma et al,
2014) with the initial point in the center of the log C × log γ space. There
are many algorithms to generate an initial simplex from a single initializa-
tion point (Wessing, 2018); unfortunately, neither the nloptr R package
nor the underlying nlopt C implementation documents which algorithm
is used.
We used the bounded version of the Nelder-Mead algorithm with the constraints in Equation 5. We tested Nelder-Mead with N = {25, 100, 400}. The Nelder-Mead algorithm has other stopping criteria besides the number of evaluations; in the case of an early stop, we restarted the nelder algorithm at a random point in the log C × log γ space, setting the number of evaluations to the remaining balance of evaluations.
• bobyqa: Bound optimization by quadratic approximation (BOBYQA)
(Powell, 2009) is a derivative-free optimization algorithm that approxi-
mates the response function by quadratic interpolation. We are not aware
of any publication that proposed using any of the algorithms in this family
to select SVM hyperparameters, but we feel it is an interesting alternative
to the Nelder-Mead algorithm.
We used the bobyqa function from the nloptr package (Ypma et al, 2014), with the initial point in the center of the log C × log γ space. We tested bobyqa with N = {25, 100, 400}. Similarly to nelder, the algorithm will restart at some random point upon early stopping.
• sa: Simulated annealing (Aarts and Korst, 1988) starts at a state and
probabilistically accepts moving to a different random state given the dif-
ference of the evaluations. At “high temperature,” at the beginning of
the process, even states that worsen the evaluation are accepted with high
probability. As the temperature lowers, the probability of accepting a bad
transition also lowers, and at very low temperatures only new states that
have better evaluation are accepted. Simulated annealing has been sug-
gested as a method to select SVM hyperparameters (Lin et al, 2008a; Pai
and Hong, 2005; Imbault and Lebart, 2004).
We used the GenSA function from the GenSA package (Y. Xiang et al, 2013).
We used the default values of the simulated annealing parameters for this function and stopped the iterations after N calls, for N = {25, 100, 400}.
• pso: Particle swarm optimization (Eberhart et al, 1995) considers a set
of solutions to the optimization problem as a set of particles, which have momentum (and thus tend to keep moving in the direction they have been moving) but are also attracted to other particles that have found good solutions to the optimization problem. Particle swarm optimization has been used to select the SVM hyperparameters (Lin et al, 2008b; Huang and Dun, 2008).
We used the psoptim function from the pso package (Bendtsen, 2012).
We tested with N = {25, 100, 400}.
• cma: Covariance matrix adaptation evolution strategy (CMA-ES) (Hansen et al, 2003) is an evolution-inspired algorithm for general non-convex problems. CMA-ES maintains a set of K points in the search space with the corresponding function values. It then adjusts a multidimensional Gaussian distribution on the search space so that points with higher accuracy have higher likelihood, and generates new points based on this distribution. The algorithm then keeps the best K among the old and the new points and repeats the process.
We used the cmaes package (Trautmann et al, 2011), with the default size of the initial population (µ) and the default number of points in the subsequent generations (λ); the number of generations was computed so that the search would not exceed the specified N = {100, 400}.
• bogp: Bayesian optimization (Pelikan et al, 1999; Snoek et al, 2012) mod-
els the response surface as a Gaussian process regression. This modeling
allows the optimizer to evaluate not only which data points should be
queried because they are likely to result in a good evaluation, but also which data points will improve its knowledge of the response function itself. Thus a Bayesian optimization model finds a particular balance
between exploitation (of the data points it believes will result in good eval-
uations) and exploration (of new data points that will improve its estimate
of the response surface itself).
Bayesian optimization has received a lot of attention recently, but our pre-
vious experience with it is that it is computationally costly. We, therefore,
opted to use a fast, mostly compiled implementation of Bayesian optimiza-
tion (as opposed to an R implementation). We used the Python RoBO
(Klein et al, 2017) implementation. We tested with N = {100, 400}.
• tpe: Tree of Parzen Estimators (Bergstra et al, 2013) is a Bayesian op-
timization inspired algorithm, but instead of using Gaussian processes to
model the distribution of values of the response surface, it models it as
two Gaussian mixture models. TPE is usually much faster than Gaussian
process based Bayesian optimization.
We used the HyperOpt python implementation (hyp, accessed 2019) of TPE, with N = {100, 400}; a sketch of a TPE search with HyperOpt is given after this list.
We call grid, ud, normrand, and rand grid-like search algorithms because the evaluation points in the log C × log γ space are defined before any evaluation needs to be performed. We call gridhier and udhier the hierarchical algorithms, and we call nelder, bobyqa, sa, cma, pso, bogp, and tpe general optimization algorithms.
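To make the grid-like procedures concrete, the following minimal sketch generates candidate points for grid and normrand and evaluates them with the rs_hat function sketched in the introduction. The function names, the base-2 logarithms, and the clipping of the normrand samples to the bounds of Equation 5 are assumptions of this sketch, not a description of the code used in the experiments.

import itertools
import numpy as np

def grid_candidates(n_per_axis):
    """Equally spaced grid in (log2 C, log2 gamma) within the bounds of Equation 5."""
    log2_C = np.linspace(-5, 15, n_per_axis)
    log2_gamma = np.linspace(-15, 3, n_per_axis)
    return [(2.0 ** lc, 2.0 ** lg) for lc, lg in itertools.product(log2_C, log2_gamma)]

def normrand_candidates(n, seed=0):
    """n pairs with log2 C ~ N(5, 5) and log2 gamma ~ N(-5, 5), clipped to the bounds."""
    rng = np.random.default_rng(seed)
    log2_C = np.clip(rng.normal(5.0, 5.0, n), -5, 15)
    log2_gamma = np.clip(rng.normal(-5.0, 5.0, n), -15, 3)
    return list(zip(2.0 ** log2_C, 2.0 ** log2_gamma))

def best_pair(X, y, candidates):
    """Evaluate every candidate (C, gamma) with the 5-fold estimate and keep the best."""
    return max(candidates, key=lambda cg: rs_hat(X, y, C=cg[0], gamma=cg[1]))

For example, best_pair(X, y, grid_candidates(10)) corresponds to the 10x10 grid (N = 100), and best_pair(X, y, normrand_candidates(100)) to normrand with N = 100.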
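Similarly, here is a minimal sketch of a TPE search with HyperOpt over the same bounds; the exact settings used in the experiments are not reproduced here, and the names are illustrative.

from hyperopt import Trials, fmin, hp, tpe

def tpe_search(X, y, n_evals=100):
    """TPE search in (log2 C, log2 gamma) space, minimizing the 5-fold error 1 - rs_hat."""
    space = {
        "log2_C": hp.uniform("log2_C", -5, 15),
        "log2_gamma": hp.uniform("log2_gamma", -15, 3),
    }
    def objective(params):
        return 1.0 - rs_hat(X, y, C=2.0 ** params["log2_C"],
                            gamma=2.0 ** params["log2_gamma"])
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=n_evals, trials=Trials())
    return 2.0 ** best["log2_C"], 2.0 ** best["log2_gamma"]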
- one can use the whole training set and discover the combination of values γ and C that has the lowest number of support vectors. Unfortunately, there are usually many different C values with the same number of support vectors (for a fixed γ), since svmpath tracks changes in the support vector set and not just its size (a point may enter as another leaves). In these cases, we selected a random C and γ among the ones with the minimal number of support vectors.
We used the svmpath function from the svmpath package (Hastie, 2012)
and N ∈ {5, 10, 20}.
• asymp: Keerthi and Lin (Keerthi and Lin, 2003) argue that there should be a good solution to the hyperparameter selection for an RBF SVM on the line log γ = log Ĉ − log C in the log C × log γ space. Thus they propose that a linear SVM be used to find the constant Ĉ, and then a 1D grid search on the line log γ = log Ĉ − log C be used to select the best C and γ for the RBF SVM.
The parametrization of this algorithm involves a pair of numbers: the
number of grid points for the C used in the linear SVM problem, and the
number of grid points in the log γ = log Ĉ − log C line to select the pair
C and γ. We used 5+5, 10+10 and 20+20 points.
• dbtc: Sun et al (Sun et al, 2010) propose that the γ parameter should be chosen using a geometric criterion; in particular, it should be chosen by maximizing the distance between the centers of the two classes in the feature space, which is called the distance between two classes (DBTC). DBTC is one of a family of techniques called internal metrics, which use some metric on the training set (not on the test set) to select γ, such as the Xi-alpha bound (Joachims, 2000), generalized approximate cross-validation (Wahba et al, 2001), and maximal discrepancy (Anguita et al, 2003), among others. DBTC achieves significantly higher accuracy and is significantly faster than the other internal metrics (Duarte and Wainer, 2017).
In our solution, a grid in log γ is searched in order to maximize the DBTC. With the selected γ, a grid on log C is used to select the other hyperparameter. We used 5+5 and 10+10 evaluation points (a sketch of the DBTC computation is given after this list).
We also implemented a sampled version of DBTC, where a sample of 50%
of data points on both classes is used to compute the distance. We call
this version sdbtc after sampled DBTC and tested it also on 5+5 and
10+10 evaluations.
• skl: The popular Python package scikit-learn (Pedregosa et al, 2011)
and the popular SVM solver implementation libSVM (Chang and Lin,
2011) use as the default for the γ hyperparameter the inverse of the number
of dimensions of the data. Sklearn also uses 1.0 as the default value for C.
We do not know of any published evidence for that choice of the default
value of C.
One must note the importance of default values for hyperparameters since
the less sophisticated user of popular packages usually relies on the default
values in the hope that there is good empirical or theoretical evidence for
that selection. In fact, the authors are unsophisticated users of many of the
packages described above (pso, GenSA, cma, svmpath, and so on) and we
did rely on the default values of those packages for the hyperparameters of
the corresponding search algorithm. Thus, it is important to test whether
the default values for SVM implementations are indeed reasonable choices.
We call skl1 the choice of using both default values, and sklN the choice of setting γ to the sklearn default and performing a grid search on C, with N = {5, 10, 20}.
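As a rough illustration of the DBTC criterion described above, the following sketch computes the squared distance between the two class centres in the RBF feature space using kernel values only, and picks γ on a grid. We assume this matches the criterion of Sun et al in spirit; the names and the grid size are illustrative.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def dbtc(X, y, gamma):
    """Squared distance between the class centres in the RBF feature space:
    mean K(+,+) + mean K(-,-) - 2 mean K(+,-)."""
    Xp, Xn = X[y == 1], X[y == 0]   # classes are assumed to be labelled 0 and 1
    return (rbf_kernel(Xp, Xp, gamma=gamma).mean()
            + rbf_kernel(Xn, Xn, gamma=gamma).mean()
            - 2.0 * rbf_kernel(Xp, Xn, gamma=gamma).mean())

def dbtc_gamma(X, y, n_grid=5):
    """Pick gamma on a log2 grid by maximizing DBTC; C is then chosen by a
    separate grid on log C using the cross-validated accuracy."""
    gammas = 2.0 ** np.linspace(-15, 3, n_grid)
    return max(gammas, key=lambda g: dbtc(X, y, g))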
• minCg: First selects the minimum value of the Ci and then the minimum value of the γi among the subset with the selected minimum Ci. Formally:

    Ĉ = min{Ci | ⟨Ci, γi⟩ ∈ B}
    γ̂ = min{γj | ⟨Ĉ, γj⟩ ∈ B}        (6)
• maxCg: First select the maximum value of the Ci and then the maximum
value of the γi among the subset with the selected Ci .
• maxgC: First select the maximum γi , and then the maximum Ci among
the subset with the selected γi .
The last two procedures, maxCg and maxgC, are there only to test procedures that our arguments above suggest should be worse than the other ones.
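As an illustration of the post-search selection procedures, here is a minimal sketch of minCg over a set B of equally good pairs; the names are illustrative, and the other procedures are analogous.

def min_C_g(best_pairs):
    """minCg: take the smallest C in B, then the smallest gamma among pairs with that C."""
    C_hat = min(C for C, _ in best_pairs)
    gamma_hat = min(g for C, g in best_pairs if C == C_hat)
    return C_hat, gamma_hat

# hypothetical set of equally good pairs
print(min_C_g([(2.0, 0.5), (2.0, 0.125), (8.0, 0.03125)]))   # -> (2.0, 0.125)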
• each feature was scaled to mean equal to 0 and standard deviation equal to 1.
• the multi-class datasets were converted to binary by assigning the most
frequent class to 0, the second most frequent to 1, the third most frequent
to 0, and so on.
The distribution of the size of the datasets, the number of features, and the
ratio of the less frequent class are shown in the histograms in Figure 1.
Figure 1: Histograms of the size, number of features, and minority ratio of the datasets used.
where C, γ ∈ x indicates that C and γ are selected from the ones “generated” by algorithm x. rŝ(Sji, C, γ) is defined in Equation 4, and in this research we use 5-fold cross-validation to compute it.
In the case of multiple best pairs of hyperparameters, we select a random
one from the set, that is, we follow the randCg post-search selection. There
are two reasons for this option. First, some of the general search algorithms
only return a single solution, so even if the algorithm found many equally good
solutions, it seems to us that this unique returned solution is a random one
among them. Second, as we will discuss in Section 4.3, there are no significant
differences between all of the selection procedures, and thus, any of them can
be used to select among equally good pairs of hyperparameters.
Let us denote α(x, i, j) as

    α(x, i, j) = rŝ(Sji, θ(x, i, j))        (9)

that is, the highest accuracy found in the response surface of Sji by algorithm x, again using the 5-fold evaluation. That is, θ(x, i, j) is the pair of best hyperparameters selected by x on subset Sji, and α(x, i, j) is the value of rŝ on Sji at that point.
Finally, let us denote β(x, i, j) as:
β(x, i, j) = ac(Ŝi |Sji , θ(x, i, j)) (10)
that is, the accuracy on the second subset (Ŝi ) of an SVM trained on the first
subset (Sji ) with hyperparameters selected by algorithm x on the first subset.
For each subset Sji, we recorded the value of θ(x, i, j) and measured α(x, i, j) and β(x, i, j); that is, for each subset we recorded the position (θ) and the height (α) of the highest point found in rŝ on Sji, and the accuracy of testing, on the similar dataset Ŝi, the SVM trained on Sji with the best hyperparameters discovered by x.
A 2-fold estimate of that value is:

    E2(x, i) = [ac(Ŝi | Sji, θ(x, i, j)) + ac(Sji | Ŝi, θ(x, i, ĵ))] / 2 = [β(x, i, j) + β(x, i, ĵ)] / 2        (13)
The accuracy gain (in relation to the 10x10 grid) for future data, for dataset
Di and algorithm x is
    E2(x, i) − E2(grid100, i)
We will compute the accuracy gain on future data for all datasets and algo-
rithms.
as an estimate of the accuracy on future data when using all of the selection
procedures.
3.7 Statistical analysis
The usual procedure for comparing different algorithms (in this case, the different search algorithms and the different post-search selection procedures) on different datasets is to follow the one proposed by Demšar (Demšar, 2006): a broad comparison of the accuracy of all algorithms using the Friedman test (a non-parametric version of a repeated measures ANOVA), and, if its p-value is below the usual 0.05 (for 95% confidence), a pairwise comparison of all algorithms using the Nemenyi test. Those comparisons with a p-value below 0.05 would be deemed significantly different (at 95% confidence).
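A minimal sketch of the first step of this procedure with SciPy follows; the data layout and names are ours, and the Nemenyi pairwise post-hoc test is not shown.

import numpy as np
from scipy.stats import friedmanchisquare

def friedman_compare(results):
    """Friedman test over a (n_datasets x n_algorithms) array of accuracies.
    If the p-value is below 0.05, a pairwise post-hoc (Nemenyi) test would follow."""
    columns = [results[:, j] for j in range(results.shape[1])]
    statistic, p_value = friedmanchisquare(*columns)
    return statistic, p_value

# hypothetical accuracies for four datasets and three procedures
acc = np.array([[0.91, 0.90, 0.92],
                [0.85, 0.86, 0.85],
                [0.78, 0.77, 0.79],
                [0.88, 0.88, 0.87]])
print(friedman_compare(acc))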
For the post-search selection comparisons, we perform this procedure. We compared the expected accuracy gain on future data for each of the selection procedures (for the algorithms grid400, rand400, ud400, and normrand400) to find out which procedure yields significantly better results for the accuracy on future data.
For the accuracy gain and time ratio, we will only compute the 95% confi-
dence interval for those measures. We do not seek to make a broad claim that
search algorithm A is “significantly better” than the other algorithms. One of
the reasons is that very likely there will be no such algorithm: we are com-
paring 47 different combinations of algorithms and parameterizations, and it is unlikely that one such combination is so much better than the others that it can be statistically distinguished from all other 46 algorithms with just 115 datasets/measurements. For that to be true, the algorithm would have to be "very good" at finding the maximum of the rŝG response surface. That is even less likely given that rŝG is a step function, and thus "around" any very good γ and C point there is a neighborhood of equally good points (with the same value of rŝG). If the dataset is large, the neighbourhood may be small and maybe none of the other search algorithms found that region, but that is increasingly unlikely for smaller datasets.
Finally, we must also point out that one should not use the confidence in-
tervals computed in this research as a substitute for hypothesis testing. When
comparing two sets of values, confidence intervals can subsume hypothesis test-
ing: if the 95% confidence intervals have no intersection, one can claim, at 95% confidence, that the two sets are significantly different (the reverse is not
always true (Schenker and Gentleman, 2001)). But with multiple groups, as
pointed out by Ludbrook (2000) one would have to include the effect of multi-
ple comparisons in the confidence interval calculations. This was not done in
this research.
3.8 Reproducibility
The datasets used in the paper are available at https://figshare.com/s/d0b30e4ee58b3c9323fb as described in Wainer (2016). The program to run
the different procedures and the different classifiers, the results of the multiple
runs, and the R program to perform the statistical analysis described in this
paper are available at https://figshare.com/s/35f990ae2b7f65062cca.
4 Results on the comparison of the search algorithms
Table 1 is the main result of the comparisons of the search algorithms. The first
column is the name of the search algorithm and the N parameterization. The
next three columns are the mean accuracy gain and the low and high limits of
the 95% confidence interval for the accuracy gain. The last three columns are
the mean time ratio between the algorithm and the base search algorithm, and
the low and high limits of the 95% confidence interval. The table is ordered by
increasing mean accuracy gain and the small gap in the table indicates the start
of the range of algorithms which have a mean accuracy gain of 0 or greater.
Figure 2 displays the results of Table 1 in graphical form, relating the mean accuracy gain and the mean log10 time ratio for each combination of algorithm and parametrization.
                     Accuracy gain                  Time ratio
algorithm        mean      low      high      mean     low     high
svmpath5      -0.0388  -0.0483   -0.0312      0.15    0.12     0.20
asymp10       -0.0384  -0.0502   -0.0304      1.86    1.37     2.47
asymp20       -0.0346  -0.0458   -0.0268      3.15    2.33     4.16
skl1          -0.0336  -0.0416   -0.0274      0.02    0.01     0.02
asymp40       -0.0292  -0.0407   -0.0218      5.79    4.27     7.54
svmpath10     -0.0292  -0.0376   -0.0224      0.28    0.23     0.35
svmpath20     -0.0250  -0.0325   -0.0188      0.55    0.45     0.68
sdbtc5        -0.0206  -0.0262   -0.0161      0.07    0.06     0.08
skl5          -0.0179  -0.0223   -0.0144      0.06    0.05     0.07
dbtc5         -0.0176  -0.0227   -0.0139      0.15    0.13     0.17
sigest5       -0.0144  -0.0174   -0.0119      0.06    0.05     0.06
sdbtc10       -0.0127  -0.0164   -0.0100      0.11    0.10     0.12
dbtc10        -0.0111  -0.0145   -0.0086      0.22    0.19     0.24
sigest10      -0.0102  -0.0129   -0.0081      0.09    0.08     0.10
sdbtc20       -0.0101  -0.0138   -0.0076      0.19    0.17     0.21
skl10         -0.0098  -0.0136   -0.0054      0.10    0.08     0.11
skl20         -0.0094  -0.0138   -0.0057      0.16    0.14     0.18
sigest20      -0.0086  -0.0113   -0.0065      0.16    0.14     0.17
sa25          -0.0083  -0.0106   -0.0066      0.34    0.31     0.38
dbtc20        -0.0074  -0.0101   -0.0053      0.34    0.30     0.37
grid25        -0.0055  -0.0069   -0.0042      0.27    0.26     0.28
normrand25    -0.0043  -0.0062   -0.0029      0.26    0.25     0.29
rand25        -0.0041  -0.0053   -0.0030      0.23    0.22     0.24
ud25          -0.0039  -0.0052   -0.0026      0.31    0.30     0.32
cma100        -0.0026  -0.0053   -0.0005      0.98    0.84     1.15
pso25         -0.0024  -0.0037   -0.0011      0.28    0.26     0.31
bobyqa25      -0.0020  -0.0039   -0.0006      0.30    0.26     0.35
sa100         -0.0009  -0.0022    0.0001      1.23    1.15     1.33
bobyqa400     -0.0006  -0.0027    0.0007      4.27    3.70     4.82
bobyqa100     -0.0004  -0.0022    0.0008      1.14    1.00     1.30
nelder25      -0.0002  -0.0015    0.0018      0.33    0.29     0.38
…
bogp400        0.0059   0.0048    0.0074    461.25  307.12   647.53

Table 1: Accuracy gain and time ratio for the search algorithms
Figure 2: The accuracy gain and log10 time ratio in relation to the 10x10 grid of all the search algorithms. The continuous line is the local average for algorithms, removing asymp, svmpath, and bogp.
Bayesian optimization (bogp) is costly: although bogp400 did achieve the highest accuracy gain, it runs 300 times slower than the baseline grid100.
Also, in general, as expected, algorithms with a larger number of probes to
the response surface will perform better than algorithms with a lower number of
probes and will take correspondingly more computational time. But the more
interesting aspect is to consider the continuous line in Figure 2 as the mean
behavior of the search procedures (excluding the svmpath, asymp, and bogp),
and consider the ones that are above or below the “expected” behaviour. Among
the algorithms with 100 and 400 probes, tpe and pso perform better than the
grid-like algorithms, which perform better than the other general optimization
procedures (sa, cma, nelder, and bobyqa). Among the grid-like algorithms, the
normrand seems slightly better than the others, but we must remind the reader that the normrand results may be due to a kind of “overfitting”, because the algorithm’s parameters (the mean and standard deviations of the two distributions) are somewhat tuned to this particular set of datasets.
Among the algorithms with N=25, the derivative-free algorithms nelder and bobyqa also seem to be interesting choices. Unsurprisingly, the hierarchical algorithms
perform above the mean in comparison to the other algorithms.
Finally, among the remaining SVM-specific algorithms, dbtc performs below
the average, and the sampled version of dbtc is even worse. sigest seems to
perform a little better than the skl algorithm.
In general terms, for a very fast search (1, 5, or 10 probes), one should use the sigest algorithm to select γ and perform a 5- or 10-point grid search for C. skl is perhaps a simpler (implementation-wise) alternative, but there are some accuracy costs. dbtc is not a competitive alternative. For fast searches (20 to 25 probes) one should use nelder with N=25, or any of the grid-like algorithms with N=25. SVM-specific algorithms are not competitive with these options even at N=20.
For the conventional range on the number of evaluations (N=50 or N=100),
one should consider the hierarchical uniform design with N=50 which is faster
but performs similarly to the grid-like algorithms. For 100 evaluations one
should use either tpe or pso. If one is willing to pay the computational cost
of 200 to 400 evaluations, one should again use tpe or pso, a hierarchical grid
of 100+100, or the grid-like algorithms at N=400. Finally, among grid-like
algorithms, the normrand seems to perform slightly better than the others.
All algorithms limited to 200 and 400 evaluations performed better than the
10x10 grid search (grid100). Clearly, they are, on average, finding points a
little higher in the response surface; the question is whether it is worth spending the extra effort in selecting the hyperparameters. We will discuss this question
below.
trsG, which is the true response surface of the SVM classifier when G is used as the learning set and the accuracy is the expectation when the testing data is sampled from the same distribution that G is sampled from. One really wants to find the best set of hyperparameters on trsG, but one is searching it on rŝG. The point is that, most likely, rŝG is a noisy approximation of trsG, and peaks in rŝG may not correspond to peaks in trsG. If that is the case, it may not be worth spending a lot of computational effort on the most costly search algorithms. We will provide two forms of evidence that this is the case.
The first evidence is that we do have another estimate of trsG. As we discussed in Section 3.4, we measure what we called the expected accuracy on future data, which is the accuracy of training the SVM on a subset j of dataset i (with hyperparameters selected on subset j) and testing on the other subset ĵ (Equation 13). We will display the average gain in the expected accuracy on future data for each search algorithm in comparison to the 10x10 grid.
The second evidence compares the peaks of two different rŝG surfaces. The rŝG is based on a 5-fold split of the known data G. If a different random generator seed is used, different folds would be created. We call each instance of running a search algorithm with a different random seed (and thus different 5-folds) a run. If the peaks in rŝG are “grounded” on true peaks of trsG, there should not be much of a difference between the best sets of hyperparameters selected in the two runs.
We ran the grid400, ud400, rand400, and normrand400 algorithms on all datasets with a different instance of the 5-fold cross-validation. The grid400 and the ud400 probe the same points in the log C × log γ space for both rŝG surfaces, and the random seeds for rand400 and normrand400 were also set to the same number for the two runs, so they also generate the same sequence of points to be probed in both instances. Therefore, for each of these algorithms, the same set of points was probed, but on two different rŝG surfaces. We computed different measures of the distance between the two sets of best hyperparameters. We believe that the magnitude of the differences indicates that the maxima in the rŝG surfaces of the two runs are not “grounded” on a real maximum of the trsG surface, but are indeed “noise” particular to each rŝG surface.
include the 0 indicates that the differences are not significant – the corrections
for multiple comparisons would only enlarge the confidence intervals (Ludbrook,
2000).
Therefore, besides the first 10 SVM-specific algorithms in Table 2, none of
the other search algorithms seems to be significantly different from grid100 in
terms of accuracy on future data. The sa100 algorithm seems to present an
anomaly, it also has its confidence interval on the accuracy gain in the negative
range, that is, it does not include the 0. But, again, one cannot conclude that
the algorithm is significantly worse than grid100. We believe that this result
is spurious, in the sense that sa100 is not different than the other algorithms
on the expected accuracy gain for future data, it was just an accident that its
confidence interval does not include the 0.
algorithm         mean      low      high
svmpath5       -0.0271  -0.0370   -0.0196
asymp10        -0.0261  -0.0381   -0.0179
asymp20        -0.0232  -0.0349   -0.0155
asymp40        -0.0198  -0.0309   -0.0117
svmpath10      -0.0192  -0.0273   -0.0123
skl1           -0.0168  -0.0262   -0.0101
svmpath20      -0.0158  -0.0231   -0.0098
sdbtc5         -0.0063  -0.0105   -0.0027
dbtc5          -0.0056  -0.0094   -0.0025
sigest5        -0.0041  -0.0079   -0.0006
…
sa400          -0.0012  -0.0039    0.0010
cma100         -0.0011  -0.0048    0.0025
skl20          -0.0011  -0.0062    0.0043
dbtc20         -0.0010  -0.0041    0.0016
tpe100         -0.0010  -0.0036    0.0012
pso25          -0.0009  -0.0031    0.0016
bogp400        -0.0009  -0.0033    0.0012
ud400          -0.0007  -0.0032    0.0016
normrand400    -0.0007  -0.0031    0.0013
bogp100        -0.0006  -0.0034    0.0018
tpe400         -0.0006  -0.0031    0.0018
nelder25       -0.0005  -0.0029    0.0020
bobyqa400      -0.0004  -0.0027    0.0024
bobyqa100      -0.0004  -0.0026    0.0018
gridhier200    -0.0004  -0.0025    0.0014
grid25         -0.0003  -0.0022    0.0018
nelder400      -0.0000  -0.0022    0.0025
nelder100       0.0000  -0.0019    0.0024
gridhier50      0.0001  -0.0020    0.0021
udhier50        0.0002  -0.0016    0.0020
udhier200       0.0002  -0.0020    0.0027
bobyqa25        0.0004  -0.0018    0.0024
pso100          0.0004  -0.0018    0.0025
rand400         0.0004  -0.0022    0.0030
pso400          0.0005  -0.0015    0.0024
rand25          0.0005  -0.0020    0.0031
ud100           0.0006  -0.0020    0.0035
normrand100     0.0006  -0.0016    0.0028
ud25            0.0008  -0.0016    0.0028
grid400         0.0009  -0.0012    0.0031
normrand25      0.0010  -0.0011    0.0032

Table 2: Gain in accuracy on future data (mean and 95% confidence interval) for the search algorithms, in relation to grid100
Figure 3 displays the gain in accuracy on future data for all search algorithms
and their log10 time ratio in relation to the 10x10 grid. The placement of the
search algorithms on the horizontal axis is the same as in Figure 2, for easier
comparison.
The important conclusion is that whatever gains each algorithm may have in relation to grid100 in finding the peaks in the rŝG surface, those gains seem not to yield better accuracy on future data.
Figure 3: The gain in accuracy on future data and log10 time ratio in relation to the 10x10 grid of all the search algorithms.
Measurement                                             ud400   grid400  rand400  normrand400
Number of datasets with a single best selection            45        59       16           12
Proportion of datasets with the same best solution
  for both runs                                             9%       54%       6%           8%
Average rank of the best solution in one run,
  in the other run                                        20.2      1.92     29.7         24.4
Median difference between accuracy of the best and
  second best selections in a single run                0.0009    0.0048   0.0013       0.0010
Median difference between the accuracy of the best
  solutions in both runs                                0.0089    0.0085    0.012        0.011
Average distance between best selection in both
  runs in log C × log γ space                              4.7       1.2      3.9          3.9

Table 3: Comparison of the selection of the best hyperparameters for two runs of the ud400, grid400, rand400, and normrand400 algorithms. See text for the explanation of the measurements.
point in one run ranks as the 20th best point in the other run. This seems the strongest evidence that ud400 is mostly searching for peaks in the “noise” of the rŝG surface: if the peaks in the accuracy were really peaks in the trsG(C, γ) surface, one would not expect such a large difference from one run to the other.
This evidence can also be seen by considering the accuracy values. The median difference between the best and the second best solutions of a single run is 8.6 × 10⁻⁴. The median difference between the accuracy of the best solutions across the two runs is ten times as large, 8.9 × 10⁻³. Again, one would not expect such a difference between the two runs. Finally, the average distance between the best (C, γ) of the two runs in the log C × log γ space is 4.7.
The numbers are similar for the rand400 and normrand400 algorithms, but
for them, there were fewer datasets that generated a single best solution. The
exception is the grid400 algorithm, which has a much higher rate of agreement of the best solution between the two runs, and therefore a much lower difference between the median difference between first and second-best in a single run, the difference of the best accuracy between runs, and the average distance between the best solutions of the two runs. We do not know how to explain this discrepancy of the grid search in relation to the others.
for the hyperparameters. For these combinations, the median number of solu-
tions was 5, the mean number of solutions was 24.57, the minimum 2, and the
maximum 293.
Table 4 displays the mean rank of the accuracy on future data for each
selection procedure. As one can see, there seems to be very little difference
between the ranks. In fact, the Friedman test on the accuracy on future data
of the selection procedures results in a p-value of 0.42 (chi-squared = 4.948, df
= 5) which shows that there are no significant differences among the selection
procedures.
Table 4: The mean rank in terms of accuracy on future data on the post search
selection procedures
It could be the case that the non-significant differences are due to a large
number of comparisons. Table 5 only compares the “more reasonable” selection procedures: minCg, mingC, randCg, and meanCg. The resulting Friedman test has a p-value of 0.80 (chi-squared = 1.0188, df = 3). Therefore, there is no
evidence of differences among the post-search selection procedures.
Table 5: The mean rank in terms of accuracy on future data of the four more
reasonable post-search selection procedures
4.4 Hyperparameters selected by the best algorithms
Figure 4 plots the best log C and log γ found by all the algorithms with 200 and
400 probes. The mean log C for the best algorithms without the normrand400
is 5.37 and the standard deviation is 5.33. The mean of the log γ is -5.56 and
the standard deviation is 3.57. These numbers are the source for our suggestion
of the normrand algorithm: a mean of 5 and a standard deviation of 5 for log C, and a mean of −5 and a standard deviation of 5 for log γ, are mnemonic approximations of those values.
Figure 4: The density plot of the C and γ selected by the best algorithms, in the log C × log γ space. The box shows the constrained values.
5 Discussion
An important issue to discuss is whether the results of this research can be
generalized to any future dataset. The data sets used in this research are a large
sample of those available at UCI in 2014, collected by the authors of Fernández-
Delgado et al (2014). We believe they can be considered as an unbiased sample
of real-world data sets. But some classes of data sets are not represented in
this set: very large data sets and data sets with a high number of dimensions.
Fortunately, RBF SVMs are usually not applied to these types of data sets: SVM training is at least quadratic in the number of training examples and therefore does not scale well to large data sets, with the possible exception of online SVM solvers; and SVM with the RBF kernel is usually not used on high-dimensionality problems, since for these cases the linear SVM is probably a better solution. Finally, all datasets are binary classification problems. We do not believe that the conclusions would be different for multi-class problems, where the SVMs are used in one-vs-one or one-vs-all configurations.
The choice of search algorithms was necessarily ad hoc. We believe that this research covered most of the important proposals for SVM hyperparameter search algorithms. A somewhat non-conventional inclusion was the uniform design (ud) search: we wanted to include at least one algorithm from the research area of Design of Computer Experiments, which has not been very well represented in the hyperparameter search research. The most notable absentees in this research are the many versions of sequential testing (Krueger et al, 2015).
Also, we must point out that we made no attempt to select hyperparameters
for the search procedures themselves. The more complex general optimization
algorithms such as pso, sa, cma have themselves hyperparameters: one has
to define a function that will reduce the temperature in sa; one has to define
the number of particles and the relative strength of the momentum and the
attractive force among the particles in pso; and so on. Although we cited research that proposed using general optimization algorithms such as pso, sa, and cma (Lin et al, 2008b; Huang and Dun, 2008; Lin et al, 2008a; Pai and Hong, 2005; Imbault and Lebart, 2004), we did not attempt to reproduce the algorithms' hyperparameters as proposed by the authors or to perform any search for those hyperparameters; we relied on the defaults of the implementations
used. As we will discuss below, we do not think that a better choice of search
hyperparameters would have resulted in a better selection of the SVM hyperpa-
rameters. Similarly, we do not believe that the inclusion of some of the missing
search algorithms such as ant colony optimization would have changed the main
conclusions of this research in any significant way.
The time measurement results are less generalizable: the results reflect the
2019 implementation in R of the search algorithms, and the relative speeds of
the different search algorithms may change as better and faster implementations
become available. Also, we must point out that the timing data refer to the
sequential execution of the search algorithms. Parallel execution is trivial for
grid-like algorithms and there is published research on parallel versions of pso,
sa, and bogp.
We are not aware of any published research that systematically compares different search algorithms for SVM, with the exception of Mantovani et al (2015), who compare six different search algorithms on 70 data sets: grid search, random search, default values for the hyperparameters, genetic algorithms (GA), estimation of distribution algorithms (EDA), and particle swarm optimization (pso). Differently from this research, they run experiments with up to 10000 evaluations, but they use a non-standard cross-validation to tune and measure the accuracy of the SVM. What they call the S-CV (single cross-validation step) divides the data set into K subsets, where they use a 2-fold to train with the
different combinations of hyperparameters, which are tested on one of the folds
remaining. The last fold is only used once to measure the accuracy of the best
selection of hyperparameters. The process is repeated K times, changing the last two folds and using the remainder in training. We used the more traditional nested cross-validation: the subsets are an external 2-fold CV (to compute the accuracy) and the selection of the hyperparameters is performed by the internal 5-fold CV (one selection per fold). We do not know if the simpler, less costly S-CV introduces bias into the evaluation, but we do know that nested CV
is considered a more standard way of evaluating many classifiers (Cawley and
Talbot, 2010). They performed the standard significance tests on the 6 search
algorithms and did not find any significant differences except for the default
settings which performed worse than the other 5 algorithms.
There has been some recent research on hyperparameter optimization in
general (Hutter et al, 2011; Bergstra et al, 2011; Snoek et al, 2012; Bergstra and
Bengio, 2012; Krueger et al, 2015; Hutter et al, 2019) but they tend to focus on
the more challenging problem of searching high dimensional spaces (Bergstra
et al, 2013).
Finally, this research has some strong similarities to (Wainer and Cawley,
2017). There the authors fixed the search procedure to select the SVM hyper-
parameters (an 11x10 grid search) and varied the resampling procedure used to compute rŝG. That paper reaches a similar conclusion: it is not worth using very precise and costly resampling procedures for the selection of hyperparameters; in particular, the authors suggest using a 2- or 3-fold resampling to select the hyperparameters. Here, we find that it is not worth using costly search procedures. In both cases, it seems that efforts to find the maxima in rŝG precisely, or to measure rŝG precisely, are wasteful. The trsG surface does not seem to have narrow, high maxima: narrow in the sense that it would be necessary not only to measure the surface precisely but also to spend a lot of search effort to find them, and high in the sense that the difference in accuracy between such a peak and the other, not so good, points would be large enough to matter. trsG for SVMs seems to be smooth (at least near the maxima), and one should not expect that either measuring it more precisely or probing it many times will yield a “sweet spot” in accuracy. We can
only speculate whether this might be true for other classification algorithms.
6 Conclusions
Regarding the main goal of this research, the comparison of different search
algorithms for the C and γ hyperparameters of an RBF SVM, we found that:
• more evaluations of the cross-validated accuracy will, in general, result in
the selection of hyperparameters that have higher cross-validated accuracy,
which is not surprising.
• if one is willing to allow many evaluations (100 or 400), tree of Parzen es-
timators (Bergstra et al, 2013) and particle swarm optimization (Eberhart
et al, 1995) seem to be the two algorithms that achieve a good balance between execution time and finding higher accuracy points in the rŝG surface.
• Bayesian optimization (Pelikan et al, 1999; Snoek et al, 2012) does find
high accuracy points but at a very high computational cost.
• all grid-like algorithms (grid, rand, ud, and normrand) seem to have similar performance, but the accuracy gains for normrand seem a little higher than for the other three.
• the hierarchical versions of grid and ud are, not surprisingly, good alter-
natives to the flat versions of the algorithms.
• at a lower number of evaluation points (25), nelder and pso seem to be the best alternatives.
• at an even lower number of evaluations (10 and 20), the default value of the γ parameter selected by sigest (Caputo et al, 2002) seems a better solution than the alternatives proposed by sklearn and by the distance between two classes. The sigest default value for 1/γ tested in this research is the median squared distance among a 50% sample of the data points (see the sketch after this list).
• the SVM-specific algorithms svmpath (Hastie et al, 2004) and asymp (Keerthi and Lin, 2003) are not competitive alternatives.
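As an illustration of the sigest default just mentioned (the median squared distance among a 50% sample of the points), here is a minimal sketch with names of our own.

import numpy as np
from scipy.spatial.distance import pdist

def sigest_gamma(X, frac=0.5, seed=0):
    """Heuristic gamma: 1/gamma is the median squared Euclidean distance
    among a random sample of half of the data points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max(2, int(frac * len(X))), replace=False)
    median_sq_dist = np.median(pdist(X[idx], metric="sqeuclidean"))
    return 1.0 / median_sq_dist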
Regarding the second goal of this research, efforts in finding better hyperparameters, with higher values of the cross-validated accuracy, seem not to be correlated with better expected accuracy on future data. There is no important difference in (an estimate of) the expected accuracy on future data between the best performing algorithms and grid100.
Regarding the third goal of this research, we could not find any significant
difference between the different post-search selection procedures. That is, de-
spite arguments that one should select a lower C and a lower γ when multiple best solutions are found, there is no evidence that this selection procedure is any better, in terms of accuracy on future data, than a random selection or the other procedures.
Regarding the fourth goal of this research, to a first approximation, the distribution of the best log C (for datasets whose features are scaled to zero mean and unit standard deviation) is a normal distribution with mean 5 and standard deviation 5, and the best log γ follows a normal distribution with mean −5 and standard deviation 5.
Finally, an important conclusion of this paper is that it is likely not worth spending too much computational time searching for better hyperparameters. It is likely that the search algorithms are finding “noisy” peaks in the rŝG surface which are not representative of “real” peaks in the trsG surface. If that is indeed the case, then one should not spend a lot of computational effort searching for the best set of hyperparameters, and our conclusions that pso and tpe are, in general, the best search algorithms for a high number of evaluations are only of theoretical interest. In this case, one should choose algorithms with a lower number of evaluations. Given that grid-like algorithms can be trivially parallelized, if the user has enough computational instances, we believe that a grid-like search with 25 or a few more evaluations is likely enough to find good hyperparameters for most problems.
References
Aarts E, Korst J (1988) Simulated annealing and Boltzmann machines. John
Wiley and Sons Inc.
Cohen G, Ruch P, Hilario M (2005) Model selection for support vector classifiers
via direct simplex search. In: FLAIRS Conference, pp 431–435
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets.
Journal of Machine Learning Research 7:1–30
Duarte E, Wainer J (2017) Empirical comparison of cross-validation and internal
metrics for tuning SVM hyperparameters. Pattern Recognition Letters 88:6–
11
Eberhart R, Kennedy J, et al (1995) A new optimizer using particle swarm
theory. In: International Symposium on Micro Machine and Human Science,
vol 1, pp 39–43
Fang KT, Lin D, Winker P, Zhang Y (2000) Uniform design: theory and appli-
cation. Technometrics 42(3):237–248
Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need
hundreds of classifiers to solve real world classification problems? Journal of
Machine Learning Research 15:3133–3181
Hansen N, Müller SD, Koumoutsakos P (2003) Reducing the time complexity
of the derandomized evolution strategy with covariance matrix adaptation
(cma-es). Evolutionary Computation 11(1):1–18
Hsu CW, Chang CC, Lin CJ (2003) A practical guide to support vector classi-
fication. Tech. rep., Department of Computer Science National Taiwan Uni-
versity, updated: May 19, 2016
Huang CL, Dun JF (2008) A distributed PSO–SVM hybrid system with feature
selection and parameter optimization. Applied Soft Computing 8(4):1381–
1391
Huang CM, Lee YJ, Lin D, Huang SY (2007) Model selection for support vec-
tor machines via uniform design. Computational Statistics & Data Analysis
52(1):335–346
Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-based optimiza-
tion for general algorithm configuration. In: International Conference on
Learning and Intelligent Optimization, Springer, pp 507–523
Hutter F, Kotthoff L, Vanschoren J (2019) Automated Machine Learning.
Springer
Joachims T (2000) The maximum-margin approach to learning text classifiers:
Methods, theory, and algorithms. PhD thesis, Doctoral dissertation, Dort-
mund University
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab – an S4 package
for kernel methods in R. Journal of Statistical Software 11(9):1–20
Keerthi S (2002) Efficient tuning of SVM hyperparameters using radius/margin
bound and iterative algorithms. IEEE Transactions on Neural Networks
13(5):1225–1229
Keerthi S, Lin CJ (2003) Asymptotic behaviors of support vector machines with
Gaussian kernel. Neural computation 15(7):1667–1689
Keerthi S, Sindhwani V, Chapelle O (2006) An efficient method for gradient-
based adaptation of hyperparameters in SVM models. In: Advances in neural
information processing systems, pp 673–680
Klein A, Falkner S, Mansur N, Hutter F (2017) Robo: A flexible and robust
bayesian optimization framework in python. In: NIPS 2017 Bayesian Opti-
mization Workshop
Krueger T, Panknin D, Braun M (2015) Fast cross-validation via sequential
testing. Journal of Machine Learning Research 16:1103–1155
Lin SW, Lee ZJ, Chen SC, Tseng TY (2008a) Parameter determination of sup-
port vector machine and feature selection using simulated annealing approach.
Applied soft computing 8(4):1505–1512
Lin SW, Ying KC, Chen SC, Lee ZJ (2008b) Particle swarm optimization for
parameter determination and feature selection of support vector machines.
Expert Systems with Applications 35(4):1817–1824
Ludbrook J (2000) Multiple inferences using confidence intervals. Clinical and
Experimental Pharmacology and Physiology 27(3):212–215
Mantovani RG, Rossi AL, Vanschoren J, Bischl B, de Carvalho AC (2015) Effec-
tiveness of random search in SVM hyper-parameter tuning. In: International
Joint Conference on Neural Networks (IJCNN), pp 1–8
Nelder J, Mead R (1965) A simplex method for function minimization. The
Computer Journal 7(4):308–313
Pai PF, Hong WC (2005) Support vector machines with simulated annealing
algorithms in electricity load forecasting. Energy Conversion and Management
46(17):2669–2688
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel
M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau
D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research 12:2825–2830
Pelikan M, Goldberg DE, Cantú-Paz E (1999) BOA: The Bayesian optimization
algorithm. In: Annual Conference on Genetic and Evolutionary Computation,
pp 525–532
Powell M (2009) The BOBYQA algorithm for bound constrained optimization
without derivatives. Tech. Rep. Report NA2009/06, Department of Applied
Mathematics and Theoretical Physics, University of Cambridge
Schenker N, Gentleman J (2001) On judging the significance of differences by ex-
amining the overlap between confidence intervals. The American Statistician
55(3):182–186
Wahba G, Lin Y, Lee Y, Zhang H (2001) On the relation between the gacv
and Joachims’ xi-alpha method for tuning support vector machines, with
extensions to the non-standard case. Tech. Rep. 1039, Statistics Department
University of Wisconsin, Madison WI