Bayesian Optimization Primer
Ian Dewancker IAN @ SIGOPT. COM
Michael McCourt MIKE @ SIGOPT. COM
Scott Clark SCOTT @ SIGOPT. COM
1. SigOpt acquisition functions (discussed in Section 3.3). Several
open source Bayesian optimization software packages ex-
SigOpt employs Bayesian optimization to help experts tune ist and many of their methods and techniques are incor-
machine learning models and simulations. Instead of re- porated into SigOpt, where applicable. Spearmint (Snoek
sorting to standard techniques like grid search, random et al., 2012; 2014b;a), Hyperopt (Bergstra et al., 2013b;a),
search, or manual tuning, Bayesian optimization efficiently SMAC (Hutter et al., 2011b;a; Falkner, 2014) and MOE
trades off exploration and exploitation of the parameter (Clark et al., 2014), are summarized in Table 1.
space to quickly guide the user into the configuration that
best optimizes some overall evaluation criterion (OEC) like
accuracy, AUC, or likelihood. In this short introduction Table 1. Bayesian Optimization Software
we introduce Bayesian optimization and several techniques
that SigOpt uses to optimize users models and simulations. S OFTWARE R EGR . M ODEL ACQ . F UNCTION
For applications and examples of SigOpt using Bayesian S PEARMINT G AUSSIAN P ROCESS E XP. I MPROV
optimization in real world problems please visit MOE G AUSSIAN P ROCESS E XP. I MPROV
https://sigopt.com/research. H YPEROPT T REE PARZEN E ST. E XP. I MPROV
SMAC R ANDOM F OREST E XP. I MPROV
2. Bayesian Optimization
Bayesian optimization is a powerful tool for optimizing ob- In addition to these methods, non-Bayesian alternatives
jective functions which are very costly or slow to evaluate for performing hyperparameter optimization include grid
(Martinez-Cantin et al., 2007; Brochu et al., 2010; Snoek search, random search (Bergstra & Bengio, 2012), and
et al., 2012). In particular, we consider problems where the particle swarm optimization (Kennedy & Eberhart, 1995).
maximum is sought for an expensive function f : X ! R, These non-Bayesian techniques are often used in practice
due to the administrative overhead and expertise required
xopt = arg max f (x), to get reasonable results from these, and other, open source
x2X
Bayesian optimization packages. SigOpt wraps a wide
swath of Bayesian Optimization research around a simple
within a domain X ⇢ Rd which is a bounding box (tensor API, allowing experts to quickly and easily tune their mod-
product of bounded and connected univariate domains). els and leverage these powerful techniques.
Hyperparameter optimization for machine learning mod-
els is of particular relevance as the computational costs for 3.1. Sequential Model-Based Optimization
evaluating model variations is high, d is typically small,
Sequential model-based optimization (SMBO) is a succinct
and hyperparameter gradients are typically not available.
formalism of Bayesian optimization and useful when dis-
It is worth noting that Bayesian optimization techniques cussing variations (Hutter et al., 2011b; Bergstra et al.,
can be effective in practice even if the underlying function 2011; Hoffman & Shahriari, 2014). In this section we will
f being optimized is stochastic, non-convex, or even non- use this formalism to contrast some of the methods em-
continuous. ployed by the open source techniques from Table 1, upon
which SigOpt draws. Typically, a probabilistic regression
3. Bayesian Optimization Methods model M is initialized using a small set of samples from
the domain X . Following this initialization phase, new lo-
Bayesian optimization methods (summarized effectively in cations within the domain are sequentially selected by op-
(Shahriari et al., 2015)) can be differentiated at a high level timizing an acquisition function S which uses the current
by their regression models (discussed in Section 3.2) and model as a cheap surrogate for the expensive objective f .
Bayesian Optimization Primer
Each suggested function evaluation produces an observed pute the predictive distribution, including
result yi = f (xi ); note that the observation may be ran-
T
dom either because f is random or because the observation k(x) = K(x, x1 ) · · · K(x, xi ) ,
process is subject to noise. This result is appended to the Kj,k = K(xj , xk ).
historical set D = {(x1 , y1 ), . . . , (xi , yi )}, which is used
to update the regression model for generating the next sug-
gestion. In practice there is often a quota on the total time Predictions follow a normal distribution, therefore we
or resources available, which imposes a limit T on the total know p(y | x, D) = N (y | µ̂, ˆ 2 ) (Fasshauer & McCourt,
number of function evaluations. Algorithm 1 encapsulates 2015), where, assuming µ(x) ⌘ 0,
this process. T
y = y1 ··· yi ,
µ̂ = k(x)T (K + 2
n I)
1
y,
Algorithm 1 Sequential Model-Based Optimization ˆ 2 = K(x, x) T
k(x) (K + 2
n I)
1
k(x).
Input: f , X , S, M
D I NIT S AMPLES(f, X ) Spearmint can be flexibly configured to either assume the
for i |D| to T do objective function is noiseless or attempt to estimate the
p(y | x, D) F IT M ODEL(M, D) process noise parameter n2 , while MOE allows for only a
xi arg maxx2X S(x, p(y | x, D)) manual specification of n2 for now.
yi f (xi ) . Expensive step
D D [ (xi , yi ) 3.2.2. R ANDOM F ORESTS
end for Another choice for the probabilistic regression model is
an ensemble of regression trees (Breiman et al., 1984;
Breiman, 2001). Random forests are used by the Sequen-
tial Model-based Algorithm Configuration (SMAC) library
The initialization sampling strategy is often an important (Hutter et al., 2011b). One common approach in con-
consideration in the SMBO formalism. Viable approaches structing the predictive distribution is to assume a Gaussian
include random, quasi-random, and Latin hypercube sam- N (y | µ̂, ˆ 2 ). The parameters µ̂ and ˆ may be chosen as
pling of the domain (Hoffman & Shahriari, 2014). the empirical mean and variance of the regression values,
r(x), from the set of regression trees B in the forest,
3.2. Probabilistic Regression Models ( M )
1 X
Various probabilistic regression models can be used in the µ̂ = r(x),
|B|
Bayesian optimization context (Eggensperger et al., 2013); r2B
1 X
however, to operate in the standard SMBO algorithm the ˆ2 = (r(x) µ̂)2 .
model must define a predictive distribution p(y | x, D). |B| 1
r2B
This distribution captures the uncertainty in the surrogate
reconstruction of the objective function.
An attractive property of regression trees is that they nat-
urally support domains with conditional variables (Hutter
3.2.1. G AUSSIAN P ROCESSES
et al., 2009; Bergstra et al., 2011), that is, variables that
Gaussian processes (GPs) (Rasmussen & Williams, 2006) only exist when another takes on a certain configuration
have become a standard surrogate for modeling objective or range. As an example, consider a variable that controls
functions in Bayesian optimization (Snoek et al., 2012; the selection between several machine learning algorithms.
Martinez-Cantin, 2014). In this setting, the function f is Each algorithm may have different parameters on different
assumed to be a realization of a GP with mean function µ domains; these algorithm parameters are conditional vari-
and covariance kernel K, f ⇠ GP(µ, K). ables, only active depending on the choice of algorithm
(Eggensperger et al., 2013). SMAC supports such condi-
The choice of kernel function K in particular can have a
tional variables, while the GP backed Spearmint and MOE
drastic effect on the quality of the surrogate reconstruc-
currently do not.
tion (Rasmussen & Williams, 2006). Automatic relevance
determination (ARD) kernels, also known as anisotropic
3.2.3. T REE PARZEN E STIMATORS
kernels, are common variants (Snoek et al., 2012; Duve-
naud, 2014). By default, Spearmint and MOE use the ARD Using a tree-structured Parzen estimator (TPE) (Bergstra
Matérn and ARD squared exponential kernels, respectively. et al., 2011; 2013b) deviates from the standard SMBO al-
These kernels define important quantities needed to com- gorithm in that it does not define a predictive distribution
Bayesian Optimization Primer
over the objective function; instead, it creates two hierar- behind a simple API, allowing users to quickly realize the
chical processes, `(x) and g(x) acting as generative mod- promise of Bayesian optimization for their problems.
els for all domain variables. These processes model the
Detailed comparisons between these and other methods can
domain variables when the objective function is below and
be found at https://sigopt.com/research and in a coming
above a specified quantile y ⇤ ,
ICML paper.
(
`(x), if y < y ⇤,
p(x | y, D) =
g(x), if y y ⇤. References
Bergstra, James and Bengio, Yoshua. Random search for
Univariate Parzen estimators (Duda et al., 2000; Murphy, hyper-parameter optimization. The Journal of Machine
2012) are organized in a tree structure, preserving any spec- Learning Research, 13(1):281–305, 2012.
ified conditional dependence and resulting in a fit per vari-
able for each process `(x), g(x). With these two distribu- Bergstra, James, Yamins, Dan, and Cox, David D.
tions, one can optimize a closed form term proportional to Hyperopt: A python library for optimizing the
expected improvement (Bergstra et al., 2011). hyperparameters of machine learning algorithms.
https://github.com/hyperopt/hyperopt,
While this tree structure preserves the specified conditional 2013a.
relationships, other variable interdependencies may not be
captured. Gaussian processes and random forests, in con- Bergstra, James, Yamins, Daniel, and Cox, David. Making
trast, model the objective function as dependent on the en- a science of model search: Hyperparameter optimiza-
tire joint variable configuration. One benefit to using TPE tion in hundreds of dimensions for vision architectures.
is that it naturally supports domains with specified condi- In Proceedings of The 30th International Conference on
tional variables. Machine Learning, pp. 115–123, 2013b.
Bergstra, James S, Bardenet, Rémi, Bengio, Yoshua, and
3.3. Acquisition Function ( S ) Kégl, Balázs. Algorithms for hyper-parameter optimiza-
Acquisition functions define a balance between exploring tion. In Advances in Neural Information Processing Sys-
new areas in the objective space and exploiting areas that tems, pp. 2546–2554, 2011.
are already known to have favorable values. Many suitable Breiman, Leo. Random forests. Machine learning, 45(1):
functions have been proposed (Jones et al., 1998; Brochu, 5–32, 2001.
2010; Hoffman et al., 2011), including recent work involv-
ing information theory (Hernández-Lobato et al., 2014). Breiman, Leo, Friedman, Jerome H, Olshen, Richard A,
and Stone, Charles J. Classification and regression trees.
The acquisition function used by each method discussed in
wadsworth. Belmont, CA, 1984.
this article is expected improvement (Jones et al., 1998).
Intuitively, it defines the nonnegative expected improve- Brochu, Eric. Interactive Bayesian Optimization. PhD
ment over the best previously observed objective value (de- thesis, The University of British Columbia (Vancouver,
noted fbest ) at a given location x, 2010.
Z 1
Brochu, Eric, Brochu, Tyson, and de Freitas, Nando. A
EI(x | D) = (y fbest ) p(y | x, D) dy.
fbest bayesian interactive optimization approach to procedu-
ral animation design. In Proceedings of the 2010 ACM
When the predictive distribution p(y | x, D) is Gaussian, SIGGRAPH/Eurographics Symposium on Computer An-
EI(x) has a convenient closed form (Jones et al., 1998). imation, pp. 103–112. Eurographics Association, 2010.
A complete Bayesian treatment of EI involves integrat-
ing over parameters of the probabilistic regression model Clark, Scott, Liu, Eric, Frazier, Peter, Wang, JiaLei, Oktay,
(Snoek et al., 2012), as Spearmint does; this contrasts with Deniz, and Vesdapunt, Norases. MOE: A global, black
MOE and SMAC, which fit only a single parameterization. box optimization engine for real world metric optimiza-
tion. https://github.com/Yelp/MOE, 2014.
4. Conclusion Duda, Richard O., Hart, Peter E., and Stork, David G. Pat-
tern Classification (2Nd Edition). Wiley-Interscience,
Bayesian optimization represents a powerful tool in helping 2000. ISBN 0471056693.
experts optimize their machine learning models and simu-
lations. Selecting the best Bayesian optimization technique Duvenaud, David. Automatic model construction with
for each problem of interest is often non-intuitive. SigOpt Gaussian processes. PhD thesis, University of Cam-
eliminates these issues by wrapping this powerful research bridge, 2014.
Bayesian Optimization Primer
Eggensperger, Katharina, Feurer, Matthias, Hutter, Frank, Martinez-Cantin, Ruben. Bayesopt: A bayesian optimiza-
Bergstra, James, Snoek, Jasper, Hoos, Holger, and tion library for nonlinear optimization, experimental de-
Leyton-Brown, Kevin. Towards an empirical foundation sign and bandits. The Journal of Machine Learning Re-
for assessing bayesian optimization of hyperparameters. search, 15(1):3735–3739, 2014.
In NIPS workshop on Bayesian Optimization in Theory
and Practice, 2013. Martinez-Cantin, Ruben, de Freitas, Nando, Doucet, Ar-
naud, and Castellanos, José A. Active policy learning
Falkner, Stefan. pysmac : sim- for robot planning and exploration under uncertainty. In
ple python interface to SMAC. Robotics: Science and Systems, pp. 321–328, 2007.
https://github.com/automl/pysmac, 2014. Murphy, Kevin P. Machine learning: a probabilistic per-
spective. MIT press, 2012.
Fasshauer, Gregory and McCourt, Michael. Kernel-based
Approximation Methods in MATLAB. World Scientific, Rasmussen, C. E. and Williams, C. K. I. Gaussian Pro-
2015. cesses for Machine Learning. the MIT Press, 2006.
Hernández-Lobato, José Miguel, Hoffman, Matthew W, Shahriari, Bobak, Swersky, Kevin, Wang, Ziyu, Adams,
and Ghahramani, Zoubin. Predictive entropy search for Ryan P., and de Freitas, Nando. Taking the human out
efficient global optimization of black-box functions. In of the loop: A review of bayesian optimization. Techni-
Advances in Neural Information Processing Systems, pp. cal report, Universities of Harvard, Oxford, Toronto, and
918–926, 2014. Google DeepMind, 2015.
Hoffman, Matthew D, Brochu, Eric, and de Freitas, Nando. Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P.
Portfolio allocation for bayesian optimization. In UAI, Practical bayesian optimization of machine learning al-
pp. 327–336. Citeseer, 2011. gorithms. In Advances in neural information processing
systems, pp. 2951–2959, 2012.
Hoffman, Matthew W and Shahriari, Bobak. Modular Snoek, Jasper, Swersky, Kevin, Larochelle, Hugo,
mechanisms for bayesian optimization. In NIPS Work- Gelbart, Michael, Adams, Ryan P, and Zemel,
shop on Bayesian Optimization, 2014. Richard S. Spearmint : Bayesian optimization soft-
ware. https://github.com/HIPS/Spearmint,
Hutter, Frank, Hoos, Holger H, Leyton-Brown, Kevin, and 2014a.
Stützle, Thomas. Paramils: an automatic algorithm con-
figuration framework. Journal of Artificial Intelligence Snoek, Jasper, Swersky, Kevin, Zemel, Richard S., and
Research, 36(1):267–306, 2009. Adams, Ryan Prescott. Input warping for bayesian op-
timization of non-stationary functions. In International
Hutter, Frank, Hoos, Holger, Leyton-Brown, Kevin, Conference on Machine Learning, 2014b.
Murphy, Kevin, and Ramage, Steve. SMAC:
Sequential model-based algorithm configuration.
http://www.cs.ubc.ca/labs/beta/Projects/SMAC/,
2011a.
Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin.
Sequential model-based optimization for general algo-
rithm configuration. In Learning and Intelligent Opti-
mization, pp. 507–523. Springer, 2011b.
Jones, Donald R, Schonlau, Matthias, and Welch,
William J. Efficient global optimization of expensive
black-box functions. Journal of Global optimization, 13
(4):455–492, 1998.
Kennedy, James and Eberhart, Russell C. Particle swarm
optimization. In Proceedings of the 1995 IEEE Inter-
national Conference on Neural Networks, volume 4, pp.
1942–1948, Perth, Australia, IEEE Service Center, Pis-
cataway, NJ, 1995.