Information-Geometric Optimization Algorithms
Abstract
We present a canonical way to turn any smooth parametric family of probability distributions
on an arbitrary search space 𝑋 into a continuous-time black-box optimization method on
𝑋, the information-geometric optimization (IGO) method. Invariance as a major design
principle keeps the number of arbitrary choices to a minimum. The resulting IGO flow is
the flow of an ordinary differential equation conducting the natural gradient ascent of an
adaptive, time-dependent transformation of the objective function. It makes no particular
assumptions on the objective function to be optimized.
The IGO method produces explicit IGO algorithms through time discretization. It
naturally recovers versions of known algorithms and offers a systematic way to derive new
ones. In continuous search spaces, IGO algorithms take a form related to natural evolution
strategies (NES). The cross-entropy method is recovered in a particular case with a large
time step, and can be extended into a smoothed, parametrization-independent maximum
likelihood update (IGO-ML). When applied to the family of Gaussian distributions on R𝑑 ,
the IGO framework recovers a version of the well-known CMA-ES algorithm and of xNES.
For the family of Bernoulli distributions on {0, 1}𝑑 , we recover the seminal PBIL algorithm
and cGA. For the distributions of restricted Boltzmann machines, we naturally obtain
a novel algorithm for discrete optimization on {0, 1}𝑑 . All these algorithms are natural
instances of, and unified under, the single information-geometric optimization framework.
The IGO method achieves, thanks to its intrinsic formulation, maximal invariance
properties: invariance under reparametrization of the search space 𝑋, under a change of
parameters of the probability distribution, and under increasing transformation of the
function to be optimized. The latter is achieved through an adaptive, quantile-based
formulation of the objective.
Theoretical considerations strongly suggest that IGO algorithms are essentially characterized by a minimal change of the distribution over time. Therefore they have minimal loss
in diversity through the course of optimization, provided the initial diversity is high. First
experiments using restricted Boltzmann machines confirm this insight. As a simple conse-
consequence, IGO seems to provide, from information theory, an elegant way to simultaneously
explore several valleys of a fitness landscape in a single run.
Keywords: black-box optimization, stochastic optimization, randomized optimization,
natural gradient, invariance, evolution strategy, information-geometric optimization
1. Introduction
Optimization problems are at the core of many disciplines. Given an objective function
𝑓 : 𝑋 → R, to be optimized on some space 𝑋, the goal of black-box optimization is to
find solutions 𝑥 ∈ 𝑋 with small (in the case of minimization) value 𝑓 (𝑥), using the least
number of calls to the function 𝑓 . In a black-box scenario, knowledge about the function
𝑓 is restricted to the handling of a device (e.g., a simulation code) that delivers the value
𝑓 (𝑥) for any input 𝑥 ∈ 𝑋. The search space 𝑋 may be finite, discrete infinite, or continuous.
However, optimization algorithms are often designed for a specific type of search space,
exploiting its specific structure.
One major design principle in general and in optimization in particular is related to
invariance. Invariance extends performance observed on a single function to an entire
associated invariance class, that is, it generalizes from a single problem to a class of problems.
Thus it hopefully provides better robustness w.r.t. changes in the presentation of a problem.
For continuous search spaces, invariance under translation of the coordinate system is
standard in optimization. Invariance under general affine-linear changes of the coordinates
has been—we believe—one of the keys to the success of the Nelder-Mead Downhill Simplex
method (Nelder and Mead, 1965), of quasi-Newton methods (Deuflhard, 2011) and of the
covariance matrix adaptation evolution strategy, CMA-ES (Hansen and Ostermeier, 2001;
Hansen and Auger, 2014). While these relate to transformations in the search space, another
important invariance concerns the application of monotonically increasing transformations
to 𝑓, so that it is indifferent whether the function 𝑓, 𝑓³ or 𝑓 · |𝑓|^(−2/3) is minimized. This
way some non-convex or non-smooth functions can be as “easily” optimized as convex ones.
Invariance under 𝑓 -transformation is not uncommon, e.g., for evolution strategies (Schwefel,
1995) or pattern search methods (Hooke and Jeeves, 1961; Torczon, 1997; Nelder and Mead,
1965); however it has not always been recognized as an attractive feature.
Many stochastic optimization methods have been proposed to tackle black-box optimiza-
tion. The underlying (often hidden) principle of these stochastic methods is to iteratively
update a probability distribution 𝑃𝜃 defined on 𝑋, parametrized by a set of parameters 𝜃.
At a given iteration, the distribution 𝑃𝜃 represents, loosely speaking, the current belief about
where solutions with the smallest values of the function 𝑓 may lie. Over time, 𝑃𝜃 is expected
to concentrate around the minima of 𝑓 . The update of the distribution involves querying
the function with points sampled from the current probability distribution 𝑃𝜃 . Although
implicit in the presentation of many stochastic optimization algorithms, this is the explicit
setting for the wide family of estimation of distribution algorithms (EDA) (Larranaga and
Lozano, 2002; Baluja and Caruana, 1995; Pelikan et al., 2002). Updates of the probability
distribution often rely on heuristics (nevertheless, Toussaint 2004 points out the possible interest of information geometry, which exploits the structure of probability distributions, for designing better-grounded heuristics). In addition, in the EDA setting we can distinguish two
theoretically founded approaches to update 𝑃𝜃 .
From this setting, information-geometric optimization (IGO) can be defined in a natural
way. At each (continuous) time 𝑡, we maintain a value 𝜃𝑡 of the parameter of the distribution.
The function 𝑓 to be optimized is transferred to the parameter space Θ by means of a
suitable time-dependent transformation based on the 𝑃𝜃𝑡 -levels of 𝑓 (Definition 3). The IGO
flow, introduced in Definition 4, follows the natural gradient of the expected value of this
function of 𝜃𝑡 in the parameter space Θ, where the natural gradient derives from the Fisher
information metric. The IGO flow is thus the flow of an ordinary differential equation in
space Θ. This continuous-time gradient flow is turned into a family of explicit IGO algorithms
by taking an Euler time discretization of the differential equation and approximating the
distribution 𝑃𝜃𝑡 by using samples. From the start, the IGO flow is invariant under strictly
increasing transformations of 𝑓 (Proposition 17); we also prove that the sampling procedure
is consistent (Theorem 6). IGO algorithms share their final algebraic form with the natural
evolution strategies (NES) introduced in the Gaussian setting (Wierstra et al., 2008; Sun
et al., 2009; Glasmachers et al., 2010; Wierstra et al., 2014); the latter are thus recovered
in the IGO framework as an Euler approximation to a well-defined flow, without heuristic
arguments.
The IGO method also has an equivalent description as an infinitesimal maximum likelihood
update (Theorem 10); this reveals a new property of the natural gradient and does not require
a smooth parametrization by 𝜃 anymore. This also establishes a specific link (Theorem 12)
between IGO, the natural gradient, and the cross-entropy method (de Boer et al., 2005).
When we instantiate IGO using the family of Gaussian distributions on R𝑑 , we naturally
obtain versions of the well-known covariance matrix adaptation evolution strategy (CMA-ES,
Hansen and Ostermeier 2001; Hansen and Kern 2004; Jastrebski and Arnold 2006) and of
natural evolution strategies. With Bernoulli measures on the discrete cube {0, 1}𝑑 , we recover
(Proposition 14) the well-known population-based incremental learning (PBIL, Baluja and
Caruana 1995; Baluja 1994) and the compact genetic algorithm (cGA, Harik et al. 1999); this
derivation of PBIL or cGA as a natural gradient ascent appears to be new, and emphasizes
the common ground between continuous and discrete optimization.
From the IGO framework, it is (theoretically) immediate to build new optimization
algorithms using more complex families of distributions than Gaussian or Bernoulli. As an
illustration, distributions associated with restricted Boltzmann machines (RBMs) provide
a new but natural algorithm for discrete optimization on {0, 1}𝑑 which is able to handle
dependencies between the bits (see also Berny 2002). The probability distributions associated
with RBMs are multimodal; combined with the specific information-theoretic properties of
IGO that guarantee minimal change in diversity over time, this allows IGO to reach multiple
optima at once in a natural way, at least in a simple experimental setup (Appendix B).
The IGO framework is built to achieve maximal invariance properties. First, the IGO
flow is invariant under reparametrization of the family of distributions 𝑃𝜃 , that is, it only
depends on 𝑃𝜃 and not on the way we write the parameter 𝜃 (Proposition 18). This
invariance under 𝜃-reparametrization is the main idea behind information geometry (Amari
and Nagaoka, 2000). For instance, for Gaussian measures it should not matter whether
we use the covariance matrix or its inverse or a Cholesky factor as the parameter. This
limits the influence of encoding choices on the behavior of the algorithm. Second, the IGO
flow is invariant under a change of coordinates in the search space 𝑋, provided that this
change of coordinates globally preserves the family of distributions 𝑃𝜃 (Proposition 19). For
instance, for Gaussian distributions on R𝑑 , this includes all affine changes of coordinates.
This means that the algorithm, apart from initialization, does not depend on the precise way
the data is presented. Last, the IGO flow and IGO algorithms are invariant under applying
a strictly increasing function to 𝑓 (Proposition 17). Contrary to previous formulations using
natural gradients (Wierstra et al., 2008; Glasmachers et al., 2010; Akimoto et al., 2010),
this invariance is achieved from the start. Such invariance properties mean that we deal
with intrinsic properties of the objects themselves, and not with the way we encode them
as collections of numbers in R𝑑 . It also means, most importantly, that we make a minimal
number of arbitrary choices.
In Section 2, we define the IGO flow and the IGO algorithm. We begin with standard
facts about the definition and basic properties of the natural gradient, and its connection
with Kullback–Leibler divergence and diversity. We then proceed to the detailed description
of the algorithm.
In Section 3, we state some first mathematical properties of IGO. These include monotone
improvement of the objective function, invariance properties, and the form of IGO for
exponential families of probability distributions.
In Section 4 we explain the theoretical relationships between IGO, maximum likelihood
estimates and the cross-entropy method. In particular, IGO is uniquely characterized by a
weighted log-likelihood maximization property (Theorem 10).
In Section 5, we apply the IGO framework to explicit families of probability distributions.
Several well-known optimization algorithms are recovered this way. These include PBIL
(Sec. 5.1) using Bernoulli distributions, and versions of CMA-ES and other evolutionary
algorithms such as EMNA and xNES (Sec. 5.2) using Gaussian distributions.
Appendix A discusses further aspects, perspectives and implementation matters of
the IGO framework. In Appendix B, we illustrate how IGO can be used to design new
optimization algorithms, by deriving the IGO algorithm associated with restricted Boltzmann
machines. We illustrate experimentally, on a simple bimodal example, the specific influence
of the Fisher information matrix on the optimization trajectories, and in particular on the
diversity of the optima obtained. Appendix C details a number of further mathematical
properties of IGO (such as its invariance properties or the case of noisy objective functions).
Appendix D contains the previously omitted proofs of the mathematical statements.
2. Algorithm Description
We now present the outline of the algorithm. Each step is described in more detail in the
sections below.
The IGO flow can be seen as an estimation of distribution algorithm: at each time 𝑡, we
maintain a probability distribution 𝑃𝜃𝑡 on the search space 𝑋, where 𝜃𝑡 ∈ Θ. The value of
𝜃𝑡 will evolve so that, over time, 𝑃𝜃𝑡 gives more weight to points 𝑥 with better values of the
function 𝑓 (𝑥) to optimize.
A straightforward way to proceed is to transfer 𝑓 from 𝑥-space to 𝜃-space: define a
function 𝐹 (𝜃) as the 𝑃𝜃 -average of 𝑓 and then do a gradient descent for 𝐹 (𝜃) in space Θ
(Berny, 2000a, 2002, 2000b; Gallagher and Frean, 2005). This way, 𝜃 will converge to a point
such that 𝑃𝜃 yields a good average value of 𝑓 . We depart from this approach in two ways:
∙ Instead of the vanilla gradient for 𝜃, we use the so-called natural gradient given by
the Fisher information matrix. This reflects the intrinsic geometry of the space of
probability distributions, as introduced by Rao and Jeffreys (Rao, 1945; Jeffreys, 1946)
and later elaborated upon by Amari and others (Amari and Nagaoka, 2000). This
provides invariance under reparametrization of 𝜃 and, importantly, minimizes the
change of diversity of 𝑃𝜃 .
∙ Instead of the expected value of 𝑓 under 𝑃𝜃, we optimize an adaptive, quantile-based rewriting 𝑊^𝑓_{𝜃𝑡} of 𝑓 (Definition 3), which makes the method invariant under strictly increasing transformations of 𝑓.
The algorithm is constructed in two steps: we first give an “ideal” version, namely, a
version in which time 𝑡 is continuous so that the evolution of 𝜃𝑡 is given by an ordinary
differential equation in Θ. Second, the actual algorithm is a time discretization using a finite
time step and Monte Carlo sampling instead of exact 𝑃𝜃 -averages.
Then one can check that 𝑧 is the normalized gradient of 𝑔 at 𝑦: 𝑧𝑖 = (𝜕𝑔/𝜕𝑦𝑖)/‖𝜕𝑔/𝜕𝑦‖. (This holds
only at points 𝑦 where the gradient of 𝑔 does not vanish.)
This shows that, for small 𝛿𝑡, the well-known gradient ascent of 𝑔 given by
𝑦𝑖^{𝑡+𝛿𝑡} = 𝑦𝑖^𝑡 + 𝛿𝑡 𝜕𝑔/𝜕𝑦𝑖

realizes the largest increase of the value of 𝑔, for a given step size ‖𝑦^{𝑡+𝛿𝑡} − 𝑦^𝑡‖.
The relation (1) depends on the choice of a norm ‖·‖ (the gradient of 𝑔 is given by 𝜕𝑔/𝜕𝑦𝑖 only in an orthonormal basis). If we use, instead of the standard metric ‖𝑦 − 𝑦′‖ = √(∑(𝑦𝑖 − 𝑦𝑖′)²) on R𝑑, a metric ‖𝑦 − 𝑦′‖_𝐴 = √(∑ 𝐴𝑖𝑗(𝑦𝑖 − 𝑦𝑖′)(𝑦𝑗 − 𝑦𝑗′)) defined by a positive definite matrix 𝐴𝑖𝑗, then the gradient of 𝑔 with respect to this metric is given by ∑_𝑗 (𝐴⁻¹)𝑖𝑗 𝜕𝑔/𝜕𝑦𝑗.
This follows from the textbook definition of gradients by 𝑔(𝑦 + 𝜀𝑧) = 𝑔(𝑦) + 𝜀⟨∇𝑔, 𝑧⟩𝐴 + 𝑂(𝜀2 )
with ⟨·, ·⟩𝐴 the scalar product associated with the matrix 𝐴𝑖𝑗 (Schwartz, 1992).
It is possible to write the analogue of (1) using the 𝐴-norm. We then find that the
gradient ascent associated with metric 𝐴 is given by
𝑦^{𝑡+𝛿𝑡} = 𝑦^𝑡 + 𝛿𝑡 𝐴⁻¹ 𝜕𝑔/𝜕𝑦

for small 𝛿𝑡 and maximizes the increment of 𝑔 for a given 𝐴-distance ‖𝑦^{𝑡+𝛿𝑡} − 𝑦^𝑡‖_𝐴: it realizes
the steepest 𝐴-ascent. Maybe this viewpoint clarifies the relationship between gradient and
metric: this steepest ascent property can actually be used as a definition of gradients.
In our setting we want to use a gradient ascent in the parameter space Θ of our
distributions 𝑃𝜃. The “vanilla” gradient 𝜕/𝜕𝜃𝑖 is associated with the metric ‖𝜃 − 𝜃′‖ = √(∑(𝜃𝑖 − 𝜃𝑖′)²) and clearly depends on the choice of parametrization 𝜃. Thus this metric, and
the direction pointed by this gradient, are not intrinsic, in the sense that they do not depend
only on the distribution 𝑃𝜃 . A metric depending on 𝜃 only through the distributions 𝑃𝜃 can
be defined as follows.
The Kullback–Leibler divergence between two distributions is

KL(𝑃𝜃′ || 𝑃𝜃) = ∫_𝑥 ln(𝑃𝜃′(d𝑥)/𝑃𝜃(d𝑥)) 𝑃𝜃′(d𝑥) .

Its expansion for small 𝛿𝜃 defines the Fisher information matrix 𝐼𝑖𝑗(𝜃):

KL(𝑃𝜃+𝛿𝜃 || 𝑃𝜃) = (1/2) ∑ 𝐼𝑖𝑗(𝜃) 𝛿𝜃𝑖 𝛿𝜃𝑗 + 𝑂(𝛿𝜃³) .   (2)
An equivalent definition of the Fisher information matrix is by the usual formulas (Cover and Thomas, 2006)

𝐼𝑖𝑗(𝜃) = ∫_𝑥 (𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃𝑖)(𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃𝑗) 𝑃𝜃(d𝑥) = − ∫_𝑥 (𝜕² ln 𝑃𝜃(𝑥)/𝜕𝜃𝑖 𝜕𝜃𝑗) 𝑃𝜃(d𝑥) .

The natural gradient of a function 𝑔 defined on Θ is then

(∇̃𝜃 𝑔)𝑖 = ∑_𝑗 (𝐼⁻¹(𝜃))𝑖𝑗 𝜕𝑔(𝜃)/𝜕𝜃𝑗

or more synthetically

∇̃𝜃 𝑔 = 𝐼⁻¹ 𝜕𝑔/𝜕𝜃 .
From now on, we will use ∇̃𝜃 to denote the natural gradient and 𝜕/𝜕𝜃 to denote the vanilla gradient.
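As a minimal numerical illustration (not part of the original text), the natural gradient can be obtained by solving a linear system with the Fisher matrix rather than inverting it; the toy Fisher matrix below is an assumed example.

```python
import numpy as np

def natural_gradient(fisher, vanilla_grad):
    # Natural gradient: solve I(theta) x = dg/dtheta instead of forming I^{-1} explicitly.
    return np.linalg.solve(fisher, vanilla_grad)

# Toy usage: a 2-parameter family with a diagonal Fisher matrix.
I = np.diag([10.0, 0.1])       # parameter 0 has a high impact on P_theta
g = np.array([1.0, 1.0])       # vanilla gradient of g(theta)
print(natural_gradient(I, g))  # [0.1, 10.0]: the sensitive parameter is updated more cautiously
```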
By construction, the natural gradient descent is intrinsic: it does not depend on the
chosen parametrization 𝜃 of 𝑃𝜃 , so that it makes sense to speak of the natural gradient
ascent of a function 𝑔(𝑃𝜃 ). The Fisher metric is essentially the only way to obtain this
property (Amari and Nagaoka, 2000, Section 2.4).
Given that the Fisher metric comes from the Kullback–Leibler divergence, the “shortest
path uphill” property of gradients mentioned above translates as follows (see also Amari
1998, Theorem 1):
1. Throughout the text we do not distinguish a probability distribution 𝑃 , seen as a measure, and its density
with respect to some unspecified reference measure d𝑥, and so will write indifferently 𝑃 (d𝑥) or 𝑃 (𝑥)d𝑥.
The measure-theoretic viewpoint allows for a unified treatment of the discrete and continuous case.
Proposition 1 The natural gradient ascent points in the direction 𝛿𝜃 achieving the largest
change of the objective function, for a given distance between 𝑃𝜃 and 𝑃𝜃+𝛿𝜃 in Kullback–
Leibler divergence. More precisely, let 𝑔 be a smooth function on the parameter space Θ. Let
𝜃 ∈ Θ be a point where ∇̃𝑔(𝜃) does not vanish. Then, if

𝛿𝜃 = ∇̃𝑔(𝜃) / ‖∇̃𝑔(𝜃)‖

is the direction of the natural gradient of 𝑔 (with ‖·‖ the Fisher norm), we have

𝛿𝜃 = lim_{𝜀→0} (1/𝜀) arg max_{𝛿𝜃′ : KL(𝑃𝜃+𝛿𝜃′ || 𝑃𝜃) ⩽ 𝜀²/2} 𝑔(𝜃 + 𝛿𝜃′) .
Here we have implicitly assumed that the parameter space Θ is such that no two points
𝜃 ∈ Θ define the same probability distribution, and the mapping 𝑃𝜃 ↦→ 𝜃 is smooth.
2.1.3 Why Use the Fisher Metric Gradient for Optimization? Relationship
to Diversity
The first reason for using the natural gradient is its reparametrization invariance, which
makes it the only gradient available in a general abstract setting (Amari and Nagaoka, 2000).
Practically, this invariance also limits the influence of encoding choices on the behavior of
the algorithm (Appendix C.1). The Fisher matrix can be also seen as an adaptive learning
rate for different components of the parameter vector 𝜃𝑖 : components 𝑖 with a high impact
on 𝑃𝜃 will be updated more cautiously.
Another advantage comes from the relationship with Kullback–Leibler distance in view
of the “shortest path uphill” (see also Amari 1998). To minimize the value of some function
𝑔(𝜃) defined on the parameter space Θ, the naive approach follows a gradient descent for 𝑔
using the “vanilla” gradient
𝜃𝑖^{𝑡+𝛿𝑡} = 𝜃𝑖^𝑡 + 𝛿𝑡 𝜕𝑔/𝜕𝜃𝑖

and, as explained above, this maximizes the increment of 𝑔 for a given increment ‖𝜃^{𝑡+𝛿𝑡} − 𝜃^𝑡‖.
On the other hand, the Fisher gradient

𝜃𝑖^{𝑡+𝛿𝑡} = 𝜃𝑖^𝑡 + 𝛿𝑡 ∑_𝑗 (𝐼⁻¹)𝑖𝑗 𝜕𝑔/𝜕𝜃𝑗

maximizes the increment of 𝑔 for a given Kullback–Leibler divergence between 𝑃𝜃𝑡 and 𝑃𝜃𝑡+𝛿𝑡.
(We have stated the proposition over a finite space to have a well-defined uniform
distribution. A short proof, together with the regularity conditions on 𝑔, is given in
Appendix D.)
So following the natural gradient of a function 𝑔, starting at or close to the uniform
distribution, amounts to optimizing the function 𝑔 with minimal loss of diversity, provided
the initial diversity is large. (This is valid, of course, only at the beginning; once one gets too
far from uniform, a better interpretation is minimal change of diversity.) On the other hand,
the vanilla gradient descent does not satisfy Proposition 2: it optimizes 𝑔 with minimal
change in the numerical values of the parameter 𝜃, which is of little interest.
So arguably this method realizes the best trade-off between optimization and loss
of diversity. (Though, as can be seen from the detailed algorithm description below,
maximization of diversity occurs only greedily at each step, and so there is no guarantee
that after a given time, IGO will provide the highest possible diversity for a given objective
function value.)
An experimental confirmation of the positive influence of the Fisher matrix on diversity
is given in Appendix B below.
The level functions 𝑞 : 𝑋 → [0, 1] reflect the probability to sample a better value than 𝑓(𝑥). They are monotone in 𝑓 (if 𝑓(𝑥₁) ⩽ 𝑓(𝑥₂) then 𝑞𝜃^<(𝑥₁) ⩽ 𝑞𝜃^<(𝑥₂), and likewise for 𝑞𝜃^⩽) and invariant under strictly increasing transformations of 𝑓.
A typical choice for 𝑤 is 𝑤(𝑞) = 1_{𝑞⩽𝑞₀} for some fixed value 𝑞₀, the selection quantile. In what follows, we suppose that a selection scheme (weighting scheme) 𝑤 has been chosen once and for all.
As desired, the definition of 𝑊^𝑓_𝜃 is invariant under a strictly increasing transformation of 𝑓. For instance, the 𝑃𝜃-median of 𝑓 gets remapped to 𝑤(1/2).
Note that E_{𝑥∼𝑃𝜃} 𝑊^𝑓_𝜃(𝑥) is always equal to ∫₀¹ 𝑤, independently of 𝑓 and 𝜃: indeed, by construction, the 𝑃𝜃-levels of 𝑓 are uniformly distributed over [0, 1] when 𝑥 is sampled from 𝑃𝜃. In the IGO method, the quantity to be maximized is the expected value of 𝑊^𝑓_{𝜃𝑡} under 𝑃𝜃, over 𝜃, where 𝜃𝑡 is fixed at a given step but will adapt over time.
Importantly, 𝑊𝜃𝑓 (𝑥) can be estimated in practice: indeed, the 𝑃𝜃 -𝑓 -levels Pr𝑥′ ∼𝑃𝜃 (𝑓 (𝑥′ ) <
𝑓 (𝑥)) can be estimated by taking samples of 𝑃𝜃 and ordering the samples according to
the value of 𝑓 (see below). The estimate remains invariant under strictly increasing 𝑓 -
transformations.
Using the log-likelihood trick ∇̃𝑃𝜃 = 𝑃𝜃 ∇̃ ln 𝑃𝜃 (assuming 𝑃𝜃 is smooth), we get an
equivalent expression of the update above as an integral under the current distribution 𝑃𝜃𝑡 ;
this is important for practical implementation. This leads to the following definition.
Definition 4 (IGO flow) The IGO flow is the set of continuous-time trajectories in space Θ, defined by the ordinary differential equation

d𝜃𝑡/d𝑡 = ∇̃𝜃 ∫ 𝑊^𝑓_{𝜃𝑡}(𝑥) 𝑃𝜃(𝑥) d𝑥   (8)
       = ∫ 𝑊^𝑓_{𝜃𝑡}(𝑥) (∇̃𝜃 𝑃𝜃(𝑥) / 𝑃𝜃𝑡(𝑥)) 𝑃𝜃𝑡(𝑥) d𝑥
       = ∫ 𝑊^𝑓_{𝜃𝑡}(𝑥) ∇̃𝜃 ln 𝑃𝜃(𝑥) 𝑃𝜃𝑡(d𝑥)   (9)
       = 𝐼⁻¹(𝜃𝑡) ∫ 𝑊^𝑓_{𝜃𝑡}(𝑥) (𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃) 𝑃𝜃𝑡(d𝑥) ,   (10)

where the gradients are taken at point 𝜃 = 𝜃𝑡, and 𝐼 is the Fisher information matrix.
Natural evolution strategies (NES, Wierstra et al. 2008; Glasmachers et al. 2010; Sun
et al. 2009; Wierstra et al. 2014) feature a related gradient descent with 𝑓 (𝑥) instead of
𝑊𝜃𝑓𝑡 (𝑥). The associated flow would read
d𝜃𝑡/d𝑡 = −∇̃𝜃 ∫ 𝑓(𝑥) 𝑃𝜃(d𝑥) ,   (11)
where the gradient is taken at 𝜃𝑡 (in the sequel when not explicitly stated, gradients in 𝜃
are taken at 𝜃 = 𝜃𝑡 ). However, in the end NESs always implement algorithms using sample
quantiles (via “nonlinear fitness shaping”), as if derived from the gradient ascent of 𝑊𝜃𝑓𝑡 (𝑥).
The update (9) is a weighted average of “intrinsic moves” increasing the log-likelihood of
some points. We can slightly rearrange the update as
d𝜃𝑡/d𝑡 = ∫ 𝑊^𝑓_{𝜃𝑡}(𝑥) ∇̃𝜃 ln 𝑃𝜃(𝑥) 𝑃𝜃𝑡(d𝑥)   (12)
       = ∇̃𝜃 ∫ 𝑊^𝑓_{𝜃𝑡}(𝑥) ln 𝑃𝜃(𝑥) 𝑃𝜃𝑡(d𝑥) ,   (13)

where, in (12), 𝑊^𝑓_{𝜃𝑡}(𝑥) is the preference weight, 𝑃𝜃𝑡(d𝑥) is the current sample distribution, ∇̃𝜃 ln 𝑃𝜃(𝑥) is the intrinsic move to reinforce 𝑥, and (13) is the gradient of a weighted log-likelihood,
which provides an interpretation for the IGO gradient flow as a gradient ascent optimization
of the weighted log-likelihood of the “good points” of the current distribution. In the sense
of Theorem 10 below, IGO is in fact the “best” way to increase this log-likelihood.
For exponential families of probability distributions, we will see later that the IGO flow
rewrites as a nice derivative-free expression (21).
The continuous-time IGO flow is turned into a practical IGO algorithm via the following approximations:
∙ the integral under 𝑃𝜃𝑡 is approximated using 𝑁 samples taken from 𝑃𝜃𝑡 ;
∙ the value 𝑊^𝑓_{𝜃𝑡} is approximated for each sample taken from 𝑃𝜃𝑡;
∙ the time derivative d𝜃𝑡/d𝑡 is approximated by a 𝛿𝑡 time increment: instead of the continuous-time IGO flow (8), we use its Euler approximation scheme 𝜃^{𝑡+𝛿𝑡} ≈ 𝜃^𝑡 + 𝛿𝑡 d𝜃𝑡/d𝑡,
so that the time 𝑡 of the flow is discretized with a step size 𝛿𝑡, which thus becomes
the learning rate of the algorithm. (See Corollary 21 for an interpretation of 𝛿𝑡 as a
number of bits of information introduced in the distribution 𝑃𝜃𝑡 at each step.)
We also assume that the Fisher information matrix 𝐼(𝜃) and 𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃 can be computed (see discussion below if 𝐼(𝜃) is unknown).
At each step, we draw 𝑁 samples 𝑥₁, . . . , 𝑥𝑁 under 𝑃𝜃𝑡. To approximate 𝑊^𝑓_{𝜃𝑡}, we rank the samples according to the value of 𝑓. Define rk(𝑥𝑖) = #{𝑗 | 𝑓(𝑥𝑗) < 𝑓(𝑥𝑖)} and let the estimated weight of sample 𝑥𝑖 be

𝑤̂𝑖 = (1/𝑁) 𝑤((rk(𝑥𝑖) + 1/2)/𝑁) ,   (14)

using the non-increasing selection scheme function 𝑤 introduced in Definition 3 above. (This is assuming there are no ties in our sample; in case several sample points have the same value of 𝑓, we define 𝑤̂𝑖 by averaging the above over all possible rankings of the ties².)
Then we can approximate the IGO flow as follows:

𝜃^{𝑡+𝛿𝑡} = 𝜃^𝑡 + 𝛿𝑡 𝐼⁻¹(𝜃^𝑡) ∑_{𝑖=1}^{𝑁} 𝑤̂𝑖 𝜕 ln 𝑃𝜃(𝑥𝑖)/𝜕𝜃 |_{𝜃=𝜃^𝑡}   (17)

where 𝑤̂𝑖 is the weight (14) obtained from the ranked values of the objective function 𝑓.
Equivalently one can fix the weights 𝑤𝑖 = (1/𝑁) 𝑤((𝑖 − 1/2)/𝑁) once and for all and rewrite the update as

𝜃^{𝑡+𝛿𝑡} = 𝜃^𝑡 + 𝛿𝑡 𝐼⁻¹(𝜃^𝑡) ∑_{𝑖=1}^{𝑁} 𝑤𝑖 𝜕 ln 𝑃𝜃(𝑥_{𝑖:𝑁})/𝜕𝜃 |_{𝜃=𝜃^𝑡}   (18)
with rk^<(𝑥𝑖) = #{𝑗 | 𝑓(𝑥𝑗) < 𝑓(𝑥𝑖)} and rk^⩽(𝑥𝑖) = #{𝑗 | 𝑓(𝑥𝑗) ⩽ 𝑓(𝑥𝑖)}.
where 𝑥_{𝑖:𝑁} denotes the sampled point ranked 𝑖th according to 𝑓, i.e. 𝑓(𝑥_{1:𝑁}) < . . . < 𝑓(𝑥_{𝑁:𝑁}) (assuming again there are no ties). Note that {𝑥_{𝑖:𝑁}} = {𝑥𝑖} and {𝑤𝑖} = {𝑤̂𝑖}.
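As an illustrative sketch (not from the original text), one step of the update (17) can be written for a generic family; sample, log_pdf_grad and fisher_matrix are hypothetical user-supplied functions, and the default selection scheme 𝑤(𝑞) = 1_{𝑞⩽1/2} is just an example.

```python
import numpy as np

def igo_step(theta, f, sample, log_pdf_grad, fisher_matrix, dt,
             w=lambda q: 1.0 if q <= 0.5 else 0.0):
    """One IGO update (17): theta <- theta + dt * I(theta)^{-1} sum_i w_hat_i d/dtheta ln P_theta(x_i)."""
    xs = sample(theta)                                   # N points drawn from P_theta
    N = len(xs)
    fvals = np.array([f(x) for x in xs])
    ranks = np.argsort(np.argsort(fvals))                # rk(x_i): number of strictly better points (no ties)
    w_hat = np.array([w((r + 0.5) / N) / N for r in ranks])   # weights (14)
    score_sum = sum(wi * np.asarray(log_pdf_grad(theta, x)) for wi, x in zip(w_hat, xs))
    return theta + dt * np.linalg.solve(fisher_matrix(theta), score_sum)
```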
As will be discussed in Section 5, this update applied to multivariate normal distributions
or Bernoulli measures allows us to neatly recover versions of some well-established algorithms,
in particular CMA-ES and PBIL. Actually, in the Gaussian context updates of the form
(17) have already been introduced (Glasmachers et al., 2010; Akimoto et al., 2010), though
not formally derived from a continuous-time flow with inverse quantiles.
When 𝑁 → ∞, the IGO algorithm using samples approximates the continuous-time IGO
gradient flow, see Theorem 6 below. Indeed, the IGO algorithm, with 𝑁 = ∞, is simply the
Euler approximation scheme for the ordinary differential equation defining the IGO flow (8).
The latter result thus provides a sound mathematical basis for currently used rank-based
updates.
When the Fisher information matrix 𝐼(𝜃) is not known analytically, it is possible to estimate it through its expression

𝐼𝑖𝑗(𝜃) = ∫_𝑥 (𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃𝑖)(𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃𝑗) 𝑃𝜃(d𝑥)

using 𝑃𝜃-Monte Carlo samples for 𝑥. These samples may or may not be the same as those used in the IGO update (17): in particular, it is possible to use as many Monte Carlo samples as necessary, since estimating 𝐼(𝜃) requires no calls to the objective function 𝑓.
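For illustration (not in the original text), such a Monte Carlo estimate of the Fisher matrix can be computed from score vectors; sample and score are hypothetical placeholders for drawing from 𝑃𝜃 and evaluating 𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃.

```python
import numpy as np

def fisher_monte_carlo(theta, sample, score, n_samples=1000):
    """Estimate I(theta) as the empirical mean of s(x) s(x)^T with s(x) = d/dtheta ln P_theta(x)."""
    xs = [sample(theta) for _ in range(n_samples)]   # samples from P_theta; no f-calls needed
    S = np.array([score(theta, x) for x in xs])      # shape (n_samples, dim Theta)
    return S.T @ S / n_samples
```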
Proposition 7 (Quantile improvement) Consider the IGO flow (8), with the weight 𝑤(𝑢) = 1_{𝑢⩽𝑞} where 0 < 𝑞 < 1 is fixed. Then the value of the 𝑞-quantile of 𝑓 improves over time: if 𝑡₁ ⩽ 𝑡₂ then 𝑄^𝑞_{𝑃𝜃𝑡₂}(𝑓) ⩽ 𝑄^𝑞_{𝑃𝜃𝑡₁}(𝑓). Here the 𝑞-quantile value 𝑄^𝑞_𝑃(𝑓) of 𝑓 under a probability distribution 𝑃 is defined as the largest number 𝑚 such that Pr_{𝑥∼𝑃}(𝑓(𝑥) > 𝑚) ⩾ 1 − 𝑞.
Assume moreover that the objective function 𝑓 has no plateau, i.e. for any 𝑣 ∈ R and any 𝜃 ∈ Θ we have Pr_{𝑥∼𝑃𝜃}(𝑓(𝑥) = 𝑣) = 0. Then for 𝑡₁ < 𝑡₂ either 𝜃^{𝑡₁} = 𝜃^{𝑡₂} or 𝑄^𝑞_{𝑃𝜃𝑡₂}(𝑓) < 𝑄^𝑞_{𝑃𝜃𝑡₁}(𝑓).
The proof is given in Appendix D, together with the necessary regularity assumptions.
Note that on a discrete search space, the objective function has only plateaus, and the
𝑞-quantile will evolve by successive jumps even as 𝜃 evolves continuously.
This property is proved here only for the IGO gradient flow (8) with 𝑁 = ∞ and
𝛿𝑡 → 0. For an IGO algorithm with finite 𝑁 , the dynamics is random and one cannot expect
monotonicity. Still, Theorem 6 ensures that, with high probability, trajectories of a large
enough finite population stay close to the infinite-population limit trajectory.
In Akimoto and Ollivier (2013) this result was extended to finite time steps instead of
infinitesimal 𝛿𝑡, using the IGO-ML framework from Section 4 below.
By plugging this into the definition of the IGO flow (10) we find:

d𝜃𝑡/d𝑡 = Cov_{𝑃𝜃}(𝑇, 𝑇)⁻¹ Cov_{𝑃𝜃}(𝑇, 𝑊^𝑓_{𝜃𝑡})   (21)
Note that the right-hand side does not involve derivatives w.r.t. 𝜃 any more. This result
makes it easy to simulate the IGO flow using, e.g., a Gibbs sampler for 𝑃𝜃 : both covariances
in (21) may be approximated by sampling, so that neither the Fisher matrix nor the gradient
term need to be known in advance, and no derivatives are involved.
The values of the variables 𝑇¯𝑖 = E𝑇𝑖 , namely the expected value of 𝑇𝑖 under the current
distribution, can often be used as an alternative parametrization for an exponential family
(e.g. for a one-dimensional Gaussian, these are the mean 𝜇 and the second moment 𝜇2 + 𝜎 2 ).
The IGO flow (9) may be rewritten using these variables, using the relation ∇̃𝜃𝑖 = 𝜕/𝜕𝑇̄𝑖 for the natural gradient of exponential families (Proposition 29 in Appendix D). One finds:
Proposition 9 With the same setting as in Proposition 8, the expectation variables 𝑇¯𝑖 =
E𝑃𝜃 𝑇𝑖 satisfy the following evolution equation under the IGO flow
d𝑇̄𝑖/d𝑡 = Cov(𝑇𝑖, 𝑊^𝑓_𝜃) = E(𝑇𝑖 𝑊^𝑓_𝜃) − 𝑇̄𝑖 E 𝑊^𝑓_𝜃 .   (22)
The proof is given in Appendix D, in the proof of Theorem 12. We shall further exploit
this fact in Section 4.
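The right-hand side of (22) can be estimated purely from samples, in the spirit of the derivative-free expression (21); the following sketch (not part of the original text) assumes that T returns the vector of sufficient statistics and that weights are estimated values of 𝑊^𝑓_𝜃 at the samples.

```python
import numpy as np

def expectation_parameter_velocity(xs, weights, T):
    """Estimate dT_bar_i/dt = E(T_i W) - T_bar_i E(W) from samples xs and weight values W at those samples."""
    Ts = np.array([T(x) for x in xs])        # sufficient statistics, shape (N, k)
    W = np.asarray(weights, dtype=float)     # estimated values of W^f_theta at the samples
    return Ts.T @ W / len(xs) - Ts.mean(axis=0) * W.mean()
```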
𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃𝑖 = 𝑈𝑖(𝑥) − E_{𝑃𝜃} 𝑈𝑖   (23)

where

𝑈𝑖(𝑥) = E_{𝑃𝜃}(𝑇𝑖(𝑥, ℎ) | 𝑥)

is the expectation of 𝑇𝑖(𝑥, ℎ) knowing 𝑥. Then the Fisher matrix is

𝐼𝑖𝑗(𝜃) = Cov_{𝑃𝜃}(𝑈𝑖, 𝑈𝑗) .   (24)
∙ By its very construction, the IGO flow is invariant under a number of transformations of
the original problem. First, replacing the objective function 𝑓 with a strictly increasing
function of 𝑓 does not change the IGO flow (Proposition 17 in Appendix C.1).
Second, changing the parameterization 𝜃 used for the family 𝑃𝜃 (e.g., letting 𝜃 be a
variance or a standard deviation) results in unchanged trajectories for the distributions
𝑃𝜃 under the IGO flow (Proposition 18 in Appendix C.1).
Finally, the IGO flow is insensitive to transformations of the original problem by a
change of variable in the search space 𝑋 itself, provided this transformation can be
reflected in the family of distributions 𝑃𝜃 (Proposition 19 in Appendix C.1); this covers,
for instance, optimizing 𝑓 (𝐴𝑥) instead of 𝑓 (𝑥) in R𝑑 using Gaussian distributions,
where 𝐴 is any invertible matrix.
These latter two invariances are specifically due to the natural gradient and are not
satisfied by a vanilla gradient descent. 𝑓 -invariance and 𝑋-invariance are directly
inherited from the IGO flow by IGO algorithms, but this is only approximately true of
𝜃-invariance, as discussed in Appendix C.1.
∙ The speed of the IGO flow is bounded by the variance of the weight function 𝑤 on [0; 1]
(Appendix C.2, Proposition 20). This implies that the parameter 𝛿𝑡 in the IGO flow is
not a meaningless variable but is related to the maximal number of bits introduced in
𝑃𝜃 at each step (Appendix C.2, Corollary 21).
∙ IGO algorithms still make sense when the objective function 𝑓 is noisy, that is, when
each call to 𝑓 returns a non-deterministic value. This can be accounted for without
changing the framework: we prove in Proposition 22 (Appendix C.3) that IGO for
noisy 𝑓 is equivalent to IGO for a non-noisy 𝑓˜ defined on a larger space 𝑋 × Ω with
distributions 𝑃˜𝜃 that are uniform over Ω. Consequently, theorems such as consistency
of sampling immediately transfer to the noisy case.
∙ The IGO flow can be computed explicitly in the simple case of linear functions on R𝑑
using Gaussian distributions, and its convergence can be proven for linear functions
on {0, 1}𝑑 with Bernoulli distributions (Appendix C.4).
not selective enough, and both theory and experiments confirm that for the Gaussian case, efficient optimization requires 𝑞 < 1/2 (see Section 5.2). According to Beyer (2001), on the sphere function 𝑓(𝑥) = ∑_𝑖 𝑥𝑖², the optimal 𝑞 is about 0.27 if sample size 𝑁 is not larger than the search space dimension 𝑑, and even smaller otherwise (Jebalia and Auger, 2010).
Second, replacing 𝑤 with 𝑤 + 𝑐 for some constant 𝑐 clearly has no influence on the IGO
continuous-time flow (7), since the gradient will cancel out the constant. However, this is
not the case for the update rule (18) with a finite sample of size 𝑁 .
Indeed, adding a constant 𝑐 to 𝑤 adds a quantity 𝑐 (1/𝑁) ∑ ∇̃𝜃 ln 𝑃𝜃(𝑥𝑖) to the update. In expectation, this quantity vanishes because the 𝑃𝜃-expected value of ∇̃𝜃 ln 𝑃𝜃 is 0 (because ∫(∇̃𝜃 ln 𝑃𝜃) 𝑃𝜃 = ∇̃ ∫ 𝑃𝜃 = ∇̃ 1 = 0). So adding a constant to 𝑤 does not change the expected value of the update, but it may change, e.g., its variance. The empirical average of ∇̃𝜃 ln 𝑃𝜃(𝑥𝑖) in the sample will be 𝑂(1/√𝑁). So translating the weights results in a 𝑂(1/√𝑁) change in the update. See also Section 4 in Sun et al. (2009).
Thus, one may be tempted to introduce a well chosen value of 𝑐 so as to reduce the
variance of the update. However, determining an optimal value for 𝑐 is difficult: the optimal
value minimizing the variance actually depends on possible correlations between ∇ ̃︀ 𝜃 ln 𝑃𝜃
and the function 𝑓 . The only general result is that one should shift 𝑤 such that 0 lies within
its range. Assuming independence, or dependence with enough symmetry, the optimal shift
is when the weights average to 0.
3.5.2 Complexity
The complexity of the IGO algorithm depends much on the computational cost model. In
optimization, it is fairly common to assume that the objective function 𝑓 is very costly
compared to any other calculations performed by the algorithm (Moré et al., 1981; Dolan
and Moré, 2002). Then the cost of IGO in terms of number of 𝑓 -calls is 𝑁 per iteration,
and the cost of using inverse quantiles and computing the natural gradient is negligible.
Setting the cost of 𝑓 aside, the complexity of the IGO algorithm depends mainly on
the computation of the (inverse) Fisher matrix. Assume an analytical expression for this
matrix is known. Then, with 𝑝 = dim Θ the number of parameters, the cost of storage of
the Fisher matrix is 𝑂(𝑝2 ) per iteration, and its inversion typically costs 𝑂(𝑝3 ) per iteration.
However, depending on the situation and on possible algebraic simplifications, strategies
exist to reduce this cost (e.g., Le Roux et al. 2007 in a learning context). For instance,
for CMA-ES the cost is 𝑂(𝑁 𝑝) (Suttorp et al., 2009). More generally, parametrization by
expectation parameters (see above), when available, may reduce the cost to 𝑂(𝑁 𝑝) as well.
If no analytical form of the Fisher matrix is known and Monte Carlo estimation is required,
then complexity depends on the particular situation at hand and is related to the best
sampling strategies available for a particular family of distributions. For Boltzmann machines,
for instance, a host of such strategies are available (Ackley et al., 1985; Salakhutdinov and
Murray, 2008; Salakhutdinov, 2009; Desjardins et al., 2010). Still, in such a situation, IGO
may be competitive if the objective function 𝑓 is costly.
3.5.4 Initialization
As with other distribution-based optimization algorithms, it is usually a good idea to initialize
in such a way as to cover a wide portion of the search space, i.e. 𝜃0 should be chosen so that
𝑃𝜃0 has large diversity. For IGO algorithms this is particularly effective, since, as explained
above, the natural gradient provides minimal change of diversity (greedily at each step) for
a given change in the objective function.
So if 𝑊 (𝑥) = 𝑊𝜃𝑓0 (𝑥) is the weight of the points according to quantilized 𝑓 -preferences,
the weighted maximum log-likelihood necessarily is the IGO flow (9) using the natural
gradient—or the IGO update (17) when using samples.
Thus the IGO flow is the unique flow that, continuously in time, slightly changes the
distribution to maximize the log-likelihood of points with good values of 𝑓 . (In addition,
IGO continuously updates the weight 𝑊𝜃𝑓𝑡 (𝑥) depending on 𝑓 and on the current distribution,
so that we keep optimizing.)
This theorem suggests a way to approximate the IGO flow by enforcing this interpretation
for a given non-infinitesimal step size 𝛿𝑡, as follows.
Definition 11 (IGO-ML algorithm) The IGO-ML algorithm with step size 𝛿𝑡 updates the value of the parameter 𝜃𝑡 according to

𝜃^{𝑡+𝛿𝑡} = arg max_𝜃 { (1 − 𝛿𝑡 ∑_𝑖 𝑤̂𝑖) ∫ ln 𝑃𝜃(𝑥) 𝑃𝜃𝑡(d𝑥) + 𝛿𝑡 ∑_𝑖 𝑤̂𝑖 ln 𝑃𝜃(𝑥𝑖) }   (30)

where 𝑥₁, . . . , 𝑥𝑁 are sample points drawn according to the distribution 𝑃𝜃𝑡, and 𝑤̂𝑖 is the weight (14) obtained from the ranked values of the objective function 𝑓.
𝑇𝑗* = ∑_𝑖 𝑤̂𝑖 𝑇𝑗(𝑥𝑖) .
Corollary 13 For exponential families, the standard CEM/ML update (31) coincides with
the IGO algorithm in parametrization 𝑇¯𝑗 with 𝛿𝑡 = 1.
Beware that the expectation parameters 𝑇¯𝑗 are not always the most obvious parameters
(Amari and Nagaoka, 2000, Section 3.5). For example, for 1-dimensional Gaussian distribu-
tions, the expectation parameters are the mean 𝜇 and the second moment 𝜇2 + 𝜎 2 , not the
mean and variance. When expressed back in terms of mean and variance, the update (34)
boils down to 𝜇 ← (1 − 𝛿𝑡)𝜇 + 𝛿𝑡𝜇* and 𝜎 2 ← (1 − 𝛿𝑡)𝜎 2 + 𝛿𝑡(𝜎 * )2 + 𝛿𝑡(1 − 𝛿𝑡)(𝜇* − 𝜇)2 ,
where 𝜇* and 𝜎 * denote the mean and standard deviation of the samples 𝑥𝑖 .
On the other hand, when using smoothed CEM with mean and variance as parameters,
the new variance is (1− 𝛿𝑡)𝜎 2 + 𝛿𝑡(𝜎 * )2 , which can be significantly smaller for 𝛿𝑡 ∈ (0, 1). This
proves, in passing, that the smoothed CEM update in other parametrizations is generally
not an IGO algorithm (because it can differ at first order in 𝛿𝑡).
The case of Gaussian distributions is further exemplified in Section 5.2 below: in partic-
ular, smoothed CEM in the (𝜇, 𝜎) parametrization almost invariably exhibits a reduction of
variance, often leading to premature convergence.
For these reasons we think that the IGO-ML algorithm is the sensible way to define an
interpolated ML estimate for 𝛿𝑡 < 1 in a parametrization-independent way (see however the
analysis of a critical 𝛿𝑡 in Section 5.2). In Appendix A we further discuss IGO and CEM
and sum up the differences and relative advantages.
Taking 𝛿𝑡 = 1 is a bold approximation choice: the “ideal” continuous-time IGO flow
itself, after time 1, does not coincide with the maximum likelihood update of the best points
in the sample. Since the maximum likelihood algorithm is known to converge prematurely
in some instances (Section 5.2), using the parametrization by expectation parameters with
large 𝛿𝑡 may not be desirable.
The considerable simplification of the IGO update in these coordinates reflects the duality
of coordinates 𝑇¯𝑖 and 𝜃𝑖 . More precisely, the natural gradient ascent w.r.t. the parameters
𝑇¯𝑖 is given by the vanilla gradient w.r.t. the parameters 𝜃𝑖 :
∇̃_{𝑇̄𝑖} = 𝜕/𝜕𝜃𝑖

(Proposition 29 in Appendix D).
For Bernoulli measures 𝑃𝜃(𝑥) = ∏_𝑖 𝜃𝑖^{𝑥𝑖}(1 − 𝜃𝑖)^{1−𝑥𝑖} on 𝑋 = {0, 1}^𝑑, we have

𝜕 ln 𝑃𝜃(𝑥)/𝜕𝜃𝑖 = 𝑥𝑖/𝜃𝑖 − (1 − 𝑥𝑖)/(1 − 𝜃𝑖) .
Let 𝑥₁, . . . , 𝑥𝑁 be 𝑁 samples at step 𝑡 with distribution 𝑃𝜃𝑡 and let 𝑥_{1:𝑁}, . . . , 𝑥_{𝑁:𝑁} be the samples ranked according to 𝑓 value. The natural gradient update (18) with Bernoulli measures is then

𝜃𝑖^{𝑡+𝛿𝑡} = 𝜃𝑖^𝑡 + 𝛿𝑡 𝜃𝑖^𝑡(1 − 𝜃𝑖^𝑡) ∑_{𝑗=1}^{𝑁} 𝑤𝑗 ([𝑥_{𝑗:𝑁}]𝑖/𝜃𝑖^𝑡 − (1 − [𝑥_{𝑗:𝑁}]𝑖)/(1 − 𝜃𝑖^𝑡))   (35)

where 𝑤𝑗 = 𝑤((𝑗 − 1/2)/𝑁)/𝑁 and [𝑦]𝑖 denotes the 𝑖th coordinate of 𝑦 ∈ 𝑋. The previous equation simplifies to

𝜃𝑖^{𝑡+𝛿𝑡} = 𝜃𝑖^𝑡 + 𝛿𝑡 ∑_{𝑗=1}^{𝑁} 𝑤𝑗 ([𝑥_{𝑗:𝑁}]𝑖 − 𝜃𝑖^𝑡) ,   (36)

or, denoting 𝑤̄ the sum of the weights ∑_{𝑗=1}^{𝑁} 𝑤𝑗,

𝜃𝑖^{𝑡+𝛿𝑡} = (1 − 𝑤̄ 𝛿𝑡) 𝜃𝑖^𝑡 + 𝛿𝑡 ∑_{𝑗=1}^{𝑁} 𝑤𝑗 [𝑥_{𝑗:𝑁}]𝑖 .   (37)
Proposition 14 The IGO algorithm on {0, 1}𝑑 using Bernoulli measures parametrized by 𝜃
as above, coincides with the compact Genetic Algorithm (cGA) when 𝑁 = 2 and 𝑤1 = −𝑤2 .
To better compare the update with the previous representation, note that 𝜃𝑖 = 1/(1 + exp(−𝜃̃𝑖)) and thus we can rewrite

𝜃̃𝑖^{𝑡+𝛿𝑡} = 𝜃̃𝑖^𝑡 + 𝛿𝑡 (1/(𝜃𝑖^𝑡(1 − 𝜃𝑖^𝑡))) ∑_{𝑗=1}^{𝑁} 𝑤𝑗 ([𝑥_{𝑗:𝑁}]𝑖 − 𝜃𝑖^𝑡) .   (40)
So the direction of the update is the same as before and is given by the proportion of
bits set to 0 or 1 in the best samples, compared to its expected value under the current
distribution. The magnitude of the update is different since the parameter 𝜃˜ ranges from
−∞ to +∞ instead of from 0 to 1. We did not find this algorithm in the literature.
These updates also illustrate the influence of setting the sum of weights to 0 or not
(Section 3.5). If, at some time, the first bit is equal to 1 both for a majority of good points
and for a majority of bad points, then the original PBIL will increase the probability of
setting the first bit to 1, which is counterintuitive. If the weights 𝑤𝑖 are chosen to sum to 0
this noise effect disappears; otherwise, it disappears only on average.
4. Note that the pseudocode for the algorithm in Baluja and Caruana (1995, Figure 4) is slightly erroneous
since it gives smaller weights to better individuals. The error can be fixed by updating the probability in
reversed order, looping from NUMBER_OF_VECTORS_TO_UPDATE_FROM to 1. This was confirmed by S. Baluja
in personal communication. We consider here the corrected version of the algorithm.
5.2.1 CMA-ES.
The rank-𝜇-update CMA-ES implements the equations⁵

𝑚^{𝑡+1} = 𝑚^𝑡 + 𝜂_m ∑_{𝑖=1}^{𝑁} 𝑤̂𝑖 (𝑥𝑖 − 𝑚^𝑡)   (41)

𝐶^{𝑡+1} = 𝐶^𝑡 + 𝜂_c ∑_{𝑖=1}^{𝑁} 𝑤̂𝑖 ((𝑥𝑖 − 𝑚^𝑡)(𝑥𝑖 − 𝑚^𝑡)ᵀ − 𝐶^𝑡)   (42)

where 𝑤̂𝑖 are the weights based on ranked 𝑓-values, see (14) and (17).
Proposition 15 The IGO update (17) for Gaussian distributions in the parametrization by
mean and covariance matrix (𝑚, 𝐶), coincides with the CMA-ES update equations (41) and
(42) with 𝜂c = 𝜂m .
This result is essentially due to Akimoto et al. (2010) and Glasmachers et al. (2010), who
showed that the CMA-ES update with 𝜂c = 𝜂m is a natural gradient update6 .
However, in deviation from the IGO algorithm, the learning rates 𝜂_m and 𝜂_c are assigned different values if 𝑁 ≪ dim Θ in CMA-ES⁷. Note that the Fisher information matrix is block-diagonal in 𝑚 and 𝐶 (Akimoto et al., 2010), so that application of the different learning rates and of the inverse Fisher matrix commute. Moreover, CMA-ES uses a path cumulation method to adjust the step sizes, which is not covered by the IGO framework (see Appendix A).

5. The CMA-ES implements these equations given the parameter setting 𝑐₁ = 0 and 𝑐_𝜎 = 0 (or 𝑑_𝜎 = ∞, see e.g. Hansen 2009), which disengages the effect of the rank-one update and of step size control, and therefore of both so-called evolution paths.
6. In these articles the result has been derived for 𝜃 ← 𝜃 + 𝜂 ∇̃𝜃 E_{𝑃𝜃} 𝑓, see (11), leading to 𝑓(𝑥𝑖) in place of 𝑤̂𝑖. No assumptions on 𝑓 have been used besides that it does not depend on 𝜃. Consequently, by replacing 𝑓 with 𝑊^𝑓_{𝜃𝑡}, where 𝜃𝑡 is fixed, the derivation holds equally well for 𝜃 ← 𝜃 + 𝜂 ∇̃𝜃 E_{𝑃𝜃} 𝑊^𝑓_{𝜃𝑡}.
7. Specifically, if the weights satisfy ∑|𝑤̂𝑖| = 1, then the settings are 𝜂_m = 1 and 𝜂_c ≈ 1/(𝑑² ∑ 𝑤̂𝑖²) (Hansen, 2006b).
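For reference, a direct transcription of the rank-𝜇 equations (41)–(42) into code (a sketch only; the full CMA-ES additionally uses step-size control and the rank-one update, which are outside this snippet):

```python
import numpy as np

def rank_mu_update(m, C, X, w_hat, eta_m=1.0, eta_c=1.0):
    """Equations (41)-(42): shift mean and covariance toward the weighted samples."""
    X = np.asarray(X)            # shape (N, d): sampled points
    w_hat = np.asarray(w_hat)    # rank-based weights w_hat_i, aligned with the rows of X
    D = X - m                    # x_i - m^t
    m_new = m + eta_m * (w_hat @ D)
    C_new = C + eta_c * sum(wi * (np.outer(di, di) - C) for wi, di in zip(w_hat, D))
    return m_new, C_new
```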
where ln is the logarithm of positive matrices. (Note that the current value 𝐶 𝑡 of 𝐶 is
represented as 𝑅 = 0.)
Then the IGO update (17) in the parametrization 𝜃 is as follows: the mean 𝑚 is updated
as in CMA-ES (41), and the parameter 𝑅 is updated as
𝑅 ← 𝛿𝑡 ∑_{𝑖=1}^{𝑁} 𝑤̂𝑖 (𝐴⁻¹(𝑥𝑖 − 𝑚)(𝑥𝑖 − 𝑚)ᵀ(𝐴ᵀ)⁻¹ − I_𝑑)   (44)

thus resulting in the same update as (43) (with 𝜂_c = 𝛿𝑡) for the covariance matrix: 𝐶 ← 𝐴 exp(𝑅)𝐴ᵀ.
Proof Indeed, by basic differential geometry, if parametrization 𝜃′ = 𝜙(𝜃) is used, the IGO
update for 𝜃′ is 𝐷𝜙(𝜃𝑡 ) applied to the IGO update for 𝜃, where 𝐷𝜙 is the differential of
𝜙. Here, given the update (42) for 𝐶, to find the update for 𝑅 we have to compute the
differential of the map 𝐶 ↦→ ln(𝐴−1 𝐶(𝐴T )−1 ) taken at 𝐶 = 𝐴𝐴T : for any matrix 𝑀 we
have ln(𝐴−1 (𝐴𝐴T + 𝜀𝑀 )(𝐴T )−1 ) = 𝜀 𝐴−1 𝑀 (𝐴T )−1 + 𝑂(𝜀2 ). So to find the update for the
variable 𝑅 we have to apply 𝐴−1 . . . (𝐴T )−1 to the update (42) for 𝐶.
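A sketch of the covariance update in this parametrization (not from the original text), taking 𝐴 to be a Cholesky factor of 𝐶 and using scipy's matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def exp_param_cov_update(m, C, X, w_hat, dt):
    """Update (44) followed by C <- A exp(R) A^T, with C = A A^T."""
    A = np.linalg.cholesky(C)                          # any A with C = A A^T works; Cholesky is one choice
    Y = np.linalg.solve(A, (np.asarray(X) - m).T).T    # rows y_i = A^{-1}(x_i - m)
    d = C.shape[0]
    R = dt * sum(wi * (np.outer(yi, yi) - np.eye(d)) for wi, yi in zip(np.asarray(w_hat), Y))
    return A @ expm(R) @ A.T
```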
by Theorem 6. Here, the weights 𝑤̂𝑖 are equal to 1/𝜇 for the 𝜇 best points (censored or elitist sample) and 0 otherwise. This 𝜃_maxLL maximizes the weighted log-likelihood of 𝑥₁, . . . , 𝑥𝑁; equivalently, it minimizes the cross-entropy and the Kullback–Leibler divergence to the distribution of the best 𝜇 samples⁸.
For Gaussian distributions, Equation (45) can be explicitly written in the form

𝑚^{𝑡+1} = 𝑚* = ∑_{𝑖=1}^{𝑁} 𝑤̂𝑖 𝑥𝑖   (46)

𝐶^{𝑡+1} = 𝐶* = ∑_{𝑖=1}^{𝑁} 𝑤̂𝑖 (𝑥𝑖 − 𝑚*)(𝑥𝑖 − 𝑚*)ᵀ   (47)
Equations (46) and (47) also define the simplest continuous domain EDA, the estimation
of multivariate normal algorithm (EMNAglobal , Larranaga and Lozano 2002). Interestingly,
(46) and (47) only differ from (41) and (42) (with 𝜂m = 𝜂c = 1) in that the new mean 𝑚𝑡+1
is used instead of 𝑚𝑡 in the covariance matrix update (Hansen, 2006b).
The smoothed CEM (32) in this parametrization thus writes
𝑚𝑡+𝛿𝑡 = (1 − 𝛿𝑡)𝑚𝑡 + 𝛿𝑡𝑚* (48)
5.2.6 Critical 𝛿𝑡
Let us assume that 𝜇 < 𝑁 weights are set to 𝑤̂𝑖 = 1/𝜇 and the remaining weights to zero, so that the selection ratio is 𝑞 = 𝜇/𝑁.
Figure 1: Critical 𝛿𝑡 versus selection truncation ratio 𝑞 for three values of 𝑗 in (50). With
𝛿𝑡 above the critical 𝛿𝑡, the variance decreases systematically when optimizing
a linear function, indicating failure of the algorithm. For CMA-ES/NES where
𝑗 = 0, the critical 𝛿𝑡 for 𝑞 < 0.5 is infinite.
Then there is a critical value of 𝛿𝑡 depending on this ratio 𝑞, such that above this critical
𝛿𝑡 the algorithms given by IGO-ML and smoothed CEM are prone to premature convergence.
Indeed, let 𝑓 be a linear function on R𝑑 , and consider the variance in the direction of the
gradient of 𝑓 . Assuming further 𝑁 → ∞ and 𝑞 ≤ 1/2, then the variance 𝐶 * of the elite
sample is smaller than the current variance 𝐶 𝑡 , by a constant factor. Depending on the
precise update for 𝐶 𝑡+1 , if 𝛿𝑡 is too large, the variance 𝐶 𝑡+1 is going to be smaller than 𝐶 𝑡 by
a constant factor as well. This implies that the algorithm is going to stall, i.e., the variance
will go to 0 before the optimum is reached. (On the other hand, the continuous-time IGO
flow corresponding to 𝛿𝑡 → 0 does not stall, see Appendix C.4.)
We now study the critical 𝛿𝑡 (in the limit 𝑁 → ∞) below which the algorithm does not stall (this is done similarly to the analysis in Appendix C.4 for linear functions). For IGO-ML (𝑗 = 1 in Eq. 50, or equivalently for the smoothed CEM in the expectation parameters (𝑚, 𝐶 + 𝑚𝑚ᵀ), see Section 4), the variance increases if and only if 𝛿𝑡 is smaller than the critical value 𝛿𝑡_crit = 𝑞𝑏 √(2𝜋) 𝑒^{𝑏²/2}, where 𝑏 is the percentile function of 𝑞, i.e. 𝑏 is such that 𝑞 = ∫_𝑏^∞ 𝑒^{−𝑥²/2}/√(2𝜋) d𝑥. This value 𝛿𝑡_crit is plotted as a solid line in Fig. 1. For 𝑗 = 2, 𝛿𝑡_crit is smaller, related to the above by 𝛿𝑡_crit ← √(1 + 𝛿𝑡_crit) − 1, and plotted as a dashed line in Fig. 1. For CEM (𝑗 = ∞), the critical 𝛿𝑡 is zero (reflecting the non-IGO behavior of CEM in this parametrization). For CMA-ES (𝑗 = 0), the critical 𝛿𝑡 is infinite for 𝑞 < 1/2. When the selection ratio 𝑞 is above 1/2, for all algorithms the critical 𝛿𝑡 becomes zero.
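The critical value is easy to evaluate numerically; the following sketch (assuming scipy is available) reproduces individual points of the solid (𝑗 = 1) and dashed (𝑗 = 2) curves of Fig. 1.

```python
import numpy as np
from scipy.stats import norm

def dt_critical(q, j=1):
    """dt_crit = q * b * sqrt(2*pi) * exp(b^2/2), with b the upper-q quantile of the standard normal."""
    # Only meaningful for q < 1/2 (b > 0); the text states that the critical dt is 0 for q >= 1/2.
    b = norm.isf(q)                                          # b is such that q = P(Z > b)
    dt1 = q * b * np.sqrt(2 * np.pi) * np.exp(b ** 2 / 2)    # j = 1 (IGO-ML)
    return dt1 if j == 1 else np.sqrt(1 + dt1) - 1           # j = 2 variant

print(dt_critical(0.27), dt_critical(0.27, j=2))
```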
families of distributions on the search space 𝑋 = {0, 1}𝑑 extending Bernoulli distributions to
allow for dependencies between the bits (see, e.g., Berny 2002 for Boltzmann machines used
in an optimization setting). The details are given in Appendix B. We chose this example for
two reasons.
First, it illustrates how to obtain a novel algorithm from a family of probability distri-
butions in the IGO framework, in a case when there is no fully explicit expression for the
Fisher information matrix.
Second, it illustrates the influence of the natural gradient on diversity and its relevance
to multimodal optimization. The RBM probability distributions on {0, 1}𝑑 are multimodal,
contrary to Bernoulli distributions. Thus, in principle, IGO on such distributions could reach
several optima simultaneously. This allows for a simple testing of the entropy-maximizing
property of the IGO flow (Proposition 2). Accordingly, for a given improvement on the
objective function, we would expect IGO to favor preserving the diversity of 𝑃𝜃 . Indeed
on a simple objective function with two optima, we find that the IGO flow converges to a
distribution putting weight on both optima, while a descent using the vanilla gradient always
only concentrates around one optimum. This might be relevant for multimodal optimization,
where simultaneous handling of multiple hypotheses is seen as useful for optimization (Sareni
and Krähenbühl, 1998; Das et al., 2011), or in situations in which several valleys appear
equally good at first but only one of them contains the true optimum.
We refer to Appendix B for a reminder about Boltzmann machines, for how IGO rolls
out in this case, and for the detailed experimental setup for studying the influence of IGO
in a multimodal situation.
∙ The IGO flow is singled out by its equivalent description as an infinitesimal weighted
maximum log-likelihood update (Theorem 10). In a particular parametrization and
with a step size of 1, IGO recovers the cross-entropy method (Corollary 13). This
reveals a connection between CEM and the natural gradient, and allows us to define
a new, fully parametrization-invariant smoothed maximum likelihood update, the
IGO-ML.
∙ Theoretical arguments suggest that the IGO flow minimizes the change of diversity in
the course of optimization. As diversity usually decreases in the course of optimization,
IGO algorithms tend to exhibit minimal diversity loss for the observed improvement.
In particular, starting with high diversity and using multimodal distributions may
allow simultaneous exploration of multiple optima of the objective function. Prelimi-
nary experiments with restricted Boltzmann machines confirm this effect in a simple
situation.
Thus, the IGO framework provides sound theoretical foundations to optimization algo-
rithms based on probability distributions. In particular, this viewpoint helps to bridge the
gap between continuous and discrete optimization.
The invariance properties, which reduce the number of arbitrary choices, together with
the relationship between natural gradient and diversity, may contribute to a theoretical
explanation of the good practical performance of those currently used algorithms, such as
CMA-ES, which can be interpreted as instantiations of IGO.
We hope that invariance properties will acquire in computer science the importance they
have in mathematics, where intrinsic thinking is the first step for abstract linear algebra
or differential geometry, and in modern physics, where the notions of invariance w.r.t. the
coordinate system and so-called gauge invariance play a central role.
Acknowledgments
The authors would like to thank Michèle Sebag for the acronym and for helpful comments.
We also thank Youhei Akimoto for helpful feedback and inspiration. Y. O. would like to
thank Cédric Villani and Bruno Sévennec for helpful discussions on the Fisher metric. A. A.
and N. H. would like to acknowledge the Dagstuhl Seminar No 10361 on the Theory of
Evolutionary Computation (http://www.dagstuhl.de/10361) for crucially inspiring this
paper, their work on natural gradients and beyond. This work was partially supported by
the ANR-2010-COSI-002 grant (SIMINOLE) of the French National Research Agency.
While the IGO flow is exactly 𝜃-invariant, for any practical implementation of an
IGO algorithm, a parametrization choice has to be made. Different parametrizations
lead to different algorithms and larger values of 𝛿𝑡 are likely to result in more differing
algorithms. Still, since all IGO algorithms approximate the IGO flow, two parametrizations
in combination with IGO will differ less than the same two parametrizations in combination
with another algorithm (such as the vanilla gradient or the smoothed CEM method)—at
least if the learning rate 𝛿𝑡 is not too large. The chosen parametrization becomes more
relevant as the step size 𝛿𝑡 increases.
On the other hand, natural evolution strategies have not emphasized 𝜃-invariance: the
chosen parametrization (Cholesky, exponential) has been considered as a defining feature.
We believe the term “natural evolution strategy” should rather be used independently of the
chosen parameterization, thereby referring to the usage of the natural gradient as the main
principle for the update of distribution parameters.
The cross-entropy method (CEM, de Boer et al. 2005) can be used to produce optimization
algorithms given a family of probability distributions on an arbitrary space, by performing a
jump to a maximum likelihood estimate of the parameters.
We have seen (Corollary 13) that the standard CEM is an IGO algorithm in a particular
parametrization, with a learning rate 𝛿𝑡 equal to 1. However, it is well-known, both
theoretically and experimentally (Branke et al., 2007; Hansen, 2006b; Wagner et al., 2004),
that standard CEM loses diversity too fast in many situations. The usual solution (de Boer
et al., 2005) is to reduce the learning rate (smoothed CEM, Equation 32), but this breaks
the reparametrization invariance of non-smoothed CEM.
On the other hand, the IGO flow can be seen as a maximum likelihood update with
infinitesimal learning rate (Theorem 10). This interpretation allows us to define a particular
IGO algorithm, the IGO-ML (Definition 11): it performs a maximum likelihood update with
an arbitrary learning rate, and keeps the reparametrization invariance. It coincides with
CEM when the learning rate is set to 1, but it differs from smoothed CEM by the exchange
of the order of argmax and averaging (compare Equations 30 and 32), and coincides with
the IGO flow for small learning rates. We argue that this new, fully invariant algorithm is
the conceptually better way to reduce the learning rate and achieve smoothing in CEM.
Standard CEM with rate 1 can lose diversity, yet is a particular case of an IGO algo-
rithm: this illustrates the fact that reasonable values of the learning rate 𝛿𝑡 depend on the
parametrization. We have studied this phenomenon in detail for various Gaussian IGO
algorithms (Section 5.2).
Why would a smaller learning rate perform better than a large one in an optimization
setting? It might seem more efficient to jump directly to the maximum likelihood estimate
of currently known good points, instead of performing an iterated gradient ascent towards
this maximum. Due to having only a limited number of samples, however, optimization
faces a “moving target”, contrary to a learning setting in which the example distribution is
often stationary. Currently known good points heavily depend on the current distribution
and are likely not to indicate the position at which the optimum lies, but, rather, the broad
direction in which the optimum is to be found. After each update, the next elite sample
points are going to be located somewhere new.
Dividing by the associated norms will yield the cosine cos 𝛼 of the angle between 𝛿𝜃𝑡 and
𝛿𝜃𝑡+𝛿𝑡 .
If this cosine is positive, the learning rate 𝛿𝑡 may be increased. If the cosine is negative,
the learning rate probably needs to be decreased. Various schemes for the change of 𝛿𝑡 can
be devised; for instance, inspired by step size updates commonly used in evolution strategies,
one can multiply 𝛿𝑡 by exp(𝛽(cos 𝛼)) or exp(𝛽 sign(cos 𝛼)), where 𝛽 ≈ min(𝑁/ dim Θ, 1/2).
As this cosine can be quite noisy, cumulation over several time steps might be advisable.
As before, this scheme is constructed to be robust w.r.t. reparametrization of 𝜃, thanks
to the use of the Fisher metric. However, for large learning rates 𝛿𝑡, in practice the
parametrization might well become relevant.
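A sketch of this adaptation rule (not part of the original text), measuring the angle between two consecutive natural-gradient steps in the Fisher metric; the cumulation mentioned above is omitted.

```python
import numpy as np

def adapt_dt(dt, step_prev, step_new, fisher, beta):
    """Multiply dt by exp(beta * cos(alpha)), with alpha the Fisher-metric angle between the two steps."""
    # beta is typically of order min(N / dim(Theta), 1/2), as suggested in the text.
    inner = step_prev @ fisher @ step_new
    norms = np.sqrt((step_prev @ fisher @ step_prev) * (step_new @ fisher @ step_new))
    return dt * np.exp(beta * inner / norms)
```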
For instance, this strongly suggests that if we have 𝛿𝑡 → 0 while 𝑁 is kept fixed in an
IGO algorithm, noise will disappear (compare Remark 2 in Akimoto et al. 2012).
Second, for large 𝑁, one expects the variance of the IGO update to scale like 1/𝑁, so that the noise term will scale like √(𝛿𝑡/𝑁). This formally suggests that, within reasonable
bounds, multiplying or dividing both 𝑁 and 𝛿𝑡 by the same factor should result in similar
behavior of the algorithm, so that for instance it should be reasonable to reset 𝛿𝑡 to 10𝛿𝑡/𝑁
and 𝑁 to 10. (Note that the cost in terms of 𝑓 -calls of these two algorithms is similar.)
This dependency is reflected in evolution strategies in several ways, provided 𝑁 is smaller
than the search space dimension. First, theoretical results for large search space dimension on
the sphere function, 𝑓 (𝑥) = ‖𝑥‖2 , indicate that the optimal step size 𝛿𝑡 for the mean vector
is proportional to 𝑁 , provided the weighting function 𝑤 is either truncation selection with a
fixed truncation ratio (Beyer, 2001) or optimal weights (Arnold, 2006). Second, the learning
rate 𝛿𝑡 of the covariance matrix in CMA-ES is chosen proportional to (∑_{i=1}^N ŵ_i)² / ∑_{i=1}^N ŵ_i², which is again proportional to 𝑁 (Hansen and Kern, 2004). For small enough 𝑁, the progress
per 𝑓 -call is then in both cases rather independent of the choice of 𝑁 .
These results suggest that in implementations, 𝑁 can be chosen rather freely, whereas
𝛿𝑡 will be set to const · 𝑁 . The constant is chosen small enough to ensure stability, but as
large as possible to maximize speed; still, 𝛿𝑡 ≤ 1 is another constraint for (very) large 𝑁 .
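In code, this heuristic amounts to a one-line rule (a sketch; the constant c is an assumption to be tuned for stability on the problem at hand):

```python
def default_igo_time_step(N, c=0.1):
    """Time step proportional to the sample size N, capped at 1 for very large N."""
    return min(c * N, 1.0)
```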
Figure 2: The RBM architecture with the observed (x) and latent (h) variables. In our
experiments, a single hidden unit was used.
two optima, and compare the behavior of the IGO flow with that of a vanilla gradient flow.
The latter always converges to a single optimum, even though RBMs allow for multimodal
distributions. On the other hand, IGO seems to always converge to both optima at once,
taking advantage of RBM multimodality.
Finally we interpret these observations by pointing out a non-obvious breach of symmetry between 0 and 1 on {0, 1}^𝑑 for the vanilla gradient of RBMs; the natural gradient automatically compensates for this.
𝑃_𝜃(x, h) = 𝑒^{−𝐸(x,h)} / ∑_{x′,h′} 𝑒^{−𝐸(x′,h′)} ,    𝑃_𝜃(x) = ∑_h 𝑃_𝜃(x, h),    (51)
(compare Section 3.3). The distribution is parametrized by 𝜃 = (a, b, w). The “biases” a
and b can be subsumed into the weights w by introducing variables 𝑥0 and ℎ0 always equal
to one, which we implicitly do from now on.
The hidden variable h introduces correlations between the components of x (see Figure 2).
For instance, with 𝑑ℎ = 1, the distribution on x is the sum of two Bernoulli distributions
with h acting as a “switch”.
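To make the "switch" picture concrete, the following sketch (ours, not from the article) enumerates the exact marginal 𝑃_𝜃(x) of a small RBM with a single hidden unit; conditioned on ℎ = 0 or ℎ = 1, the components 𝑥_𝑖 are independent Bernoulli variables, so the marginal is a mixture of two product-Bernoulli distributions.

```python
import itertools
import numpy as np

def rbm_marginal(a, b, w):
    """Exact marginal P_theta(x) for a small RBM with one hidden unit.
    a: visible biases (d,), b: hidden bias (scalar), w: weights (d,)."""
    d = len(a)
    xs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
    weight_h0 = np.exp(xs @ a)               # unnormalized weight of (x, h = 0)
    weight_h1 = np.exp(xs @ a + b + xs @ w)  # unnormalized weight of (x, h = 1)
    p = weight_h0 + weight_h1
    return xs, p / p.sum()

# Example with d = 3: h = 0 gives Bernoulli(sigmoid(a_i)) components,
# h = 1 gives Bernoulli(sigmoid(a_i + w_i)), mixed according to the law of h.
xs, p = rbm_marginal(a=np.zeros(3), b=0.0, w=np.array([2.0, 2.0, -2.0]))
```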
by the objective function values. However, these expressions are not explicit due to the
expectations under 𝑃𝜃 . Still, these expectations can be replaced with Monte Carlo sampling;
this is what we will do.
Actually, when applying IGO to such distributions to optimize an objective function
𝑓 (x), we have two choices. The first is to see the objective function 𝑓 (x) as a function of
(x, h) where h is a dummy variable; then we can use the IGO algorithm to optimize over
(x, h) using the distributions 𝑃𝜃 (x, h). A second possibility is to marginalize 𝑃𝜃 (x, h) over
the hidden units h as in (51), to define the distribution 𝑃𝜃 (x); then we can use the IGO
algorithm to optimize 𝑓 over x using 𝑃𝜃 (x).
These two approaches yield slightly different algorithms. The Fisher matrix for the
distributions 𝑃𝜃 (x, h) is given by (20) (exponential families) whereas the one for the distri-
butions 𝑃𝜃 (x) is given by (24) (exponential families with latent variables). For instance,
with 𝐼_{𝑤_{𝑖𝑗} 𝑤_{𝑖′𝑗′}} denoting the entry of the Fisher matrix corresponding to the components 𝑤_{𝑖𝑗} and 𝑤_{𝑖′𝑗′} of the parameter 𝜃, from (20) we get
𝐼_{𝑤_{𝑖𝑗} 𝑤_{𝑖′𝑗′}}(𝜃) = Cov_{𝑃_𝜃}(𝑥_𝑖 ℎ_𝑗, 𝑥_{𝑖′} ℎ_{𝑗′}) = E_{𝑃_𝜃}[𝑥_𝑖 ℎ_𝑗 𝑥_{𝑖′} ℎ_{𝑗′}] − E_{𝑃_𝜃}[𝑥_𝑖 ℎ_𝑗] E_{𝑃_𝜃}[𝑥_{𝑖′} ℎ_{𝑗′}]    (53)
whereas from (24) we get the same expression in which each ℎ_𝑗 is replaced with its expectation ℎ̄_𝑗 knowing x, namely ℎ̄_𝑗 = E_{𝑃_𝜃}[ℎ_𝑗 | x] = (1 + 𝑒^{−𝑏_𝑗 − ∑_𝑖 𝑥_𝑖 𝑤_{𝑖𝑗}})^{−1}, and likewise for ℎ_{𝑗′}.
Both versions were tested on a small instance and found to be viable. However the
version using (x, h) is numerically more stable and requires fewer samples to get a reliable
(in particular, invertible) estimate of the Fisher matrix, both in practice and theory10 . For
this reason we selected the first approach: we optimize 𝑓 as a function of (x, h) using IGO
for the probability distributions 𝑃𝜃 (x, h).
A practical IGO update is thus obtained from (21) by replacing the expectations with
Monte Carlo samples (x, h) from 𝑃𝜃 . A host of strategies are available to sample from RBMs
(Ackley et al., 1985; Salakhutdinov and Murray, 2008; Salakhutdinov, 2009; Desjardins et al.,
2010); we simply used Gibbs sampling (Hinton, 2002). So the IGO update reads
𝜃^{𝑡+𝛿𝑡} = 𝜃^𝑡 + 𝛿𝑡 Ĉov(𝑇, 𝑇)^{−1} Ĉov(𝑇, 𝑊^𝑓_{𝜃^𝑡})    (54)
where 𝑇 denotes the vector of all statistics 𝑇_{𝑖𝑗}(x, h) = 𝑥_𝑖 ℎ_𝑗 corresponding to the component 𝑤_{𝑖𝑗} of the parameter 𝜃, and where Ĉov denotes the empirical covariance over a set of Monte Carlo samples (x, h) taken from 𝑃_{𝜃^𝑡}(x, h). Thus Î = Ĉov(𝑇, 𝑇) is the estimated Fisher matrix, a matrix of size dim(𝜃)² as in (53). Ĉov(𝑇, 𝑊^𝑓_{𝜃^𝑡}) is a vector of size dim(𝜃) giving the correlation, in the Monte Carlo sample, between the statistics 𝑇_{𝑖𝑗} and the ranked values 𝑊^𝑓_{𝜃^𝑡} of the objective function of the points (x, h) in the sample: this vector is the sum of weighted log-probability derivatives in the IGO update (17), thanks to (19).
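For concreteness, a minimal Gibbs sampler for 𝑃_𝜃(x, h) could look as follows; this is a sketch under our own naming conventions (the article's experiments use Gibbs sampling but do not prescribe this code), built on the conditionals Pr(ℎ_𝑗 = 1 | x) = sigmoid(𝑏_𝑗 + ∑_𝑖 𝑥_𝑖 𝑤_{𝑖𝑗}) and Pr(𝑥_𝑖 = 1 | h) = sigmoid(𝑎_𝑖 + ∑_𝑗 𝑤_{𝑖𝑗} ℎ_𝑗).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def gibbs_sample_rbm(a, b, w, n_samples, n_steps=50, seed=None):
    """Approximate samples (x, h) from the RBM by alternating the
    conditional distributions of hidden and visible units."""
    rng = np.random.default_rng(seed)
    d, dh = w.shape
    x = (rng.random((n_samples, d)) < 0.5).astype(float)
    h = np.zeros((n_samples, dh))
    for _ in range(n_steps):
        h = (rng.random((n_samples, dh)) < sigmoid(b + x @ w)).astype(float)
        x = (rng.random((n_samples, d)) < sigmoid(a + h @ w.T)).astype(float)
    return x, h
```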
Different sample sizes 𝑁_Fish and 𝑁 can be used to evaluate Ĉov(𝑇, 𝑇) and Ĉov(𝑇, 𝑊^𝑓_{𝜃^𝑡}). The sample size for Ĉov(𝑇, 𝑊^𝑓_{𝜃^𝑡}) is just the IGO parameter 𝑁, the number of points on which the objective function has to be evaluated; it is typically kept small, especially if 𝑓 is costly. On the other hand it is important to obtain a reliable (in particular, invertible) estimate of the Fisher matrix: invertibility requires a number of samples at least, and ideally much larger than, dim(𝜃), because each point in the sample contributes a rank-1 term to the empirical covariance matrix Ĉov(𝑇, 𝑇). Increasing 𝑁_Fish does not require additional 𝑓-calls.

10. Indeed, if 𝐼_1(𝜃) is the Fisher matrix at 𝜃 in the first approach and 𝐼_2(𝜃) in the second approach, we always have 𝐼_1(𝜃) ≥ 𝐼_2(𝜃) in the sense of positive-definite matrices. This is because probability distributions on the pair (x, h) carry more information than their projections on x only, and so the Kullback–Leibler distances will always be larger. In particular, there exist values of 𝜃 for which the Fisher matrix 𝐼_2 is not invertible whereas 𝐼_1 is.
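A sketch of the resulting update (54) from such samples is given below; this is our own illustration, showing only the weight block of 𝜃 (the biases can be absorbed through the dummy units 𝑥_0 = ℎ_0 = 1 mentioned earlier), with a small ridge term added as a purely practical safeguard against a singular Fisher estimate.

```python
import numpy as np

def igo_rbm_weight_update(theta_w, x_fish, h_fish, x_eval, h_eval, weights, dt, ridge=1e-6):
    """One step of (54): theta <- theta + dt * Cov(T,T)^{-1} Cov(T, W^f).
    (x_fish, h_fish): N_Fish samples for the Fisher estimate Cov(T, T);
    (x_eval, h_eval, weights): N evaluated samples and selection weights W^f.
    theta_w is the flattened weight vector (d * d_h entries)."""
    def stats(x, h):
        # statistics T_ij(x, h) = x_i h_j, flattened to one vector per sample
        return (x[:, :, None] * h[:, None, :]).reshape(len(x), -1)
    T_fish = stats(x_fish, h_fish)
    T_eval = stats(x_eval, h_eval)
    T_c = T_fish - T_fish.mean(axis=0)
    fisher_hat = T_c.T @ T_c / len(T_c) + ridge * np.eye(T_c.shape[1])
    cov_tw = ((T_eval - T_eval.mean(axis=0)) * (weights - weights.mean())[:, None]).mean(axis=0)
    return theta_w + dt * np.linalg.solve(fisher_hat, cov_tw)
```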
which, as a function of x, has two optima, one at x = y and the other at its binary
complement x = ȳ. The value of the base point y was randomized for each independent run.
We ran both the IGO algorithm as described above, and a version using the vanilla
gradient instead of the natural gradient (that is, omitting the Fisher matrix in the IGO
update).
The dimension was 𝑑 = 40 and we used an RBM with only one latent variable (𝑑ℎ = 1).
Therefore dim(𝜃) = 81. We used a large sample size of 𝑁 = 10, 000 for Monte Carlo sampling,
so as to be close to the theoretical IGO flow behavior. We also tested a smaller, more
realistic sample size of 𝑁 = 10 (still keeping 𝑁Fish = 10, 000), with similar but noisier results.
The selection scheme (Section 2.2) was 𝑤(𝑞) = 1_{𝑞 ≤ 1/5} (cf. Rechenberg 1994), so that the best 20% of points in the sample are given weight 1 for the update.
The RBM was initialized so that at startup, the distribution 𝑃_{𝜃^0} is close to uniform on (x, h), in line with Proposition 2. Explicitly we set 𝑤_{𝑖𝑗} ← 𝒩(0, 1/(𝑑·𝑑_ℎ)) and then 𝑏_𝑗 ← −∑_𝑖 𝑤_{𝑖𝑗}/2 and 𝑎_𝑖 ← −∑_𝑗 𝑤_{𝑖𝑗}/2 + 𝒩(0, 0.01/𝑑²), which ensures a close-to-uniform initial distribution.
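In code, this initialization reads as follows (a sketch of the recipe just described, reading 𝒩(0, 𝑣) as mean zero and variance 𝑣):

```python
import numpy as np

def init_rbm_near_uniform(d, d_h, seed=None):
    """Initialize (a, b, w) so that the RBM distribution on (x, h) starts
    close to uniform, as described above."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(1.0 / (d * d_h)), size=(d, d_h))
    b = -w.sum(axis=0) / 2.0
    a = -w.sum(axis=1) / 2.0 + rng.normal(0.0, np.sqrt(0.01 / d**2), size=d)
    return a, b, w
```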
Full experimental details, including detailed setup and additional results, can be found in
a previous version of this article (Ollivier et al., 2011, Section 5). (In particular, IGO runs are
frozen when the estimated Fisher matrix becomes singular or unreliable.) The code used for
these experiments can be found at http://www.ludovicarnold.com/projects:igocode .
The results show that, most of the time, with IGO the distribution 𝑃𝜃𝑡 converges to a
distribution giving positive mass to both optima; on the other hand, over 300 independent
runs, the same algorithm using the vanilla gradient only converges to one optimum at the expense of the other, so that the vanilla gradient never exploited the possibility offered by
the RBM to create a bimodal probability distribution on {0, 1}𝑑 . Figure 3 shows ten random
runs (out of 300 in our experiments) of the two algorithms: for each of the two optima we
plot its distance to the nearest of the points drawn from 𝑃𝜃𝑡 , as a function of time 𝑡.11
11. Note that the value of 𝛿𝑡 is not directly comparable between the natural and vanilla gradients. Theory
suggests that at startup the IGO trajectory with 𝛿𝑡 is most comparable to the vanilla gradient trajectory
with 4𝛿𝑡, because from (53) most of the diagonal terms of the Fisher matrix are equal to 1/4 and
most off-diagonal terms are 0 at startup. The experiments confirm that this yields roughly comparable
convergence speeds.
[Figure 3 panels: “dist. to closest opt.” and “dist. to second opt.” for the natural gradient (𝛿𝑡 = 0.5) and the vanilla gradient (𝛿𝑡 = 2), plotted against gradient steps (0–100).]
Figure 3: Distance to the two optima during 10 IGO optimization runs and 10 vanilla
gradient runs. For each optimum, we plot its distance to the nearest point among
the samples from 𝑃𝜃𝑡 found at each step.
This is even clearer when looking at the hidden variable h. With 𝑑ℎ = 1 the two possible
values ℎ = 0 and ℎ = 1 can create a sum of two Bernoulli distributions on x. Figure 4
plots the average value of ℎ in the sample for IGO and for the vanilla gradient. As can be
seen, over IGO optimization the balance between the two modes is preserved, whereas using
vanilla gradient optimization the zero mode is lost.
[Figure 4 panels: h-statistics for the natural gradient (𝛿𝑡 = 1, 2) and for the vanilla gradient (𝛿𝑡 = 1, 2, 4, 8), plotted against gradient steps (0–100).]
Figure 4: Median over 300 runs of the average of ℎ in the sample, for IGO and vanilla
gradient optimization. Error bars indicate the 16th and 84th quantile over the
runs.
variables 𝑎𝑖 ← 𝑎𝑖 + 𝑤𝑖𝑗 , 𝑏𝑗 ← −𝑏𝑗 , 𝑤𝑖𝑗 ← −𝑤𝑖𝑗 does not preserve this Euclidean metric.
Thus, exchanging 0 and 1 for the hidden variables will result in two different vanilla gradient
ascents. The observed asymmetry on ℎ is a consequence of this implicit asymmetry.
Of course it is possible to use parametrizations for which the vanilla gradient will be
more symmetric: for instance, using −1/1 instead of 0/1 for the variables, or rewriting the
energy as
𝐸(x, h) = −∑_𝑖 𝐴_𝑖(𝑥_𝑖 − 1/2) − ∑_𝑗 𝐵_𝑗(ℎ_𝑗 − 1/2) − ∑_{𝑖,𝑗} 𝑊_{𝑖𝑗}(𝑥_𝑖 − 1/2)(ℎ_𝑗 − 1/2)    (56)
with “bias-free” parameters 𝐴_𝑖, 𝐵_𝑗, 𝑊_{𝑖𝑗} related to the usual parametrization by 𝑤_{𝑖𝑗} = 𝑊_{𝑖𝑗}, 𝑎_𝑖 = 𝐴_𝑖 − (1/2)∑_𝑗 𝑤_{𝑖𝑗}, and 𝑏_𝑗 = 𝐵_𝑗 − (1/2)∑_𝑖 𝑤_{𝑖𝑗}. The vanilla gradient might perform better in this parametrization.
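For reference, converting from the bias-free parameters to the usual ones is a direct transcription of these formulas (a sketch; the array shapes are our assumption: A of size 𝑑, B of size 𝑑_ℎ, W of size 𝑑 × 𝑑_ℎ):

```python
import numpy as np

def biasfree_to_standard(A, B, W):
    """Map the bias-free parameters (A, B, W) of (56) to the usual (a, b, w)."""
    w = np.array(W, dtype=float)
    a = np.asarray(A, dtype=float) - 0.5 * w.sum(axis=1)
    b = np.asarray(B, dtype=float) - 0.5 * w.sum(axis=0)
    return a, b, w
```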
However, we adopted the approach of using a family of probability distributions found
in the literature, with the parametrization commonly found in the literature. We then used
the vanilla gradient and the natural gradient on these distributions—and indeed the vanilla
gradient, or an approximation thereof, is routinely applied to RBMs in the literature to
optimize the log-likelihood of data (Hinton, 2002; Hinton et al., 2006; Bengio et al., 2007).
It was not obvious a priori (at least for us) that the vanilla gradient ascent favors ℎ = 1.
Since the first version of this article was written, this phenomenon has been recognized for
Boltzmann machines (Montavon and Müller, 2012).
This directly illustrates the specific influence of the chosen gradient (the two imple-
mentations only differ by the inclusion of the Fisher matrix): the natural gradient offers a
systematic way to recover symmetry from a non-symmetric gradient update.
Symmetry alone does not explain why IGO reaches the two optima simultaneously: a
symmetry-preserving stochastic algorithm could as well end up on either single optimum
with 50% probability in each run. The diversity-preserving property of IGO (Proposition 2)
offers a reasonable interpretation of why this does not happen.
For the algorithm with finite 𝑁 and 𝛿𝑡 > 0, invariance under reparametrization of 𝜃 is
only true approximately, in the limit when 𝛿𝑡 → 0. As mentioned above, the IGO update
(17), with 𝑁 = ∞, is simply the Euler approximation scheme for the ordinary differential
equation (8) defining the IGO flow. At each step, the Euler scheme is known to make an error
𝑂(𝛿𝑡 2 ) with respect to the true flow. This error actually depends on the parametrization of
𝜃.
So the IGO updates for different parametrizations coincide at first order in 𝛿𝑡, and may,
in general, differ by 𝑂(𝛿𝑡 2 ). For instance the difference between the CMA-ES and xNES
updates is indeed 𝑂(𝛿𝑡 2 ), see Section 5.2.
For comparison, using the vanilla gradient results in a divergence of 𝑂(𝛿𝑡) at each step
between different parametrizations, so this divergence could be of the same magnitude as
the steps themselves.
In that sense, one can say that IGO algorithms are “more parametrization-invariant”
than other algorithms. This stems from their origin as a discretization of the IGO flow.
However, if the map 𝜙 is affine then this phenomenon disappears: parametrizations that
differ by an affine map on 𝜃 yield the same IGO algorithm.
The next proposition states that, for example, if one uses a family of distributions on
R𝑑 which is invariant under affine transformations, then IGO algorithms optimize equally
well a function and its image under any affine transformation (up to an obvious change in
the initialization). This proposition generalizes the well-known corresponding property of
CMA-ES (Hansen and Auger, 2014, Proposition 9).
This invariance under 𝑋-transformations only holds provided the 𝑋-transformation
preserves the “shape” of the family of probability distributions 𝑃𝜃 , as follows.
Let us define, as usual, the image (pushforward) of a probability distribution 𝑃 by a
transformation 𝜙 : 𝑋 → 𝑋 as the probability distribution 𝑃 ′ such that 𝑃 ′ (𝑌 ) = 𝑃 (𝜙−1 (𝑌 ))
for any subset 𝑌 ⊂ 𝑋 (Schilling, 2005, Chapter 7). In the continuous domain, the density of
the new distribution 𝑃 ′ is obtained by the usual change of variable formula involving the
Jacobian of 𝜙 (Schilling, 2005, Chapter 15).
We say that a transformation 𝜙 : 𝑋 → 𝑋 globally preserves a family of probability
distributions (𝑃𝜃 ), if the image of any 𝑃𝜃 by 𝜙 is equal to some distribution 𝑃𝜃′ in the same
family, and if moreover the correspondence 𝜃 ↦→ 𝜃′ is locally a diffeomorphism.
Proposition 19 (𝑋-invariance) Let 𝜙 : 𝑋 → 𝑋 be a one-to-one transformation of the
search space which globally preserves the family of measures 𝑃𝜃 . Let 𝜃𝑡 be the IGO flow
trajectory for the optimization of function 𝑓 , initialized at 𝑃𝜃0 . Let (𝜃′ )𝑡 be the IGO flow
trajectory for optimization of 𝑓 ∘ 𝜙−1 , initialized at the image of 𝑃𝜃0 by 𝜙. Then 𝑃(𝜃′ )𝑡 is
the image of 𝑃𝜃𝑡 by 𝜙.
For the discretized algorithm with population size 𝑁 and step size 𝛿𝑡 > 0, the same is
true up to an error of 𝑂(𝛿𝑡 2 ) per iteration. This error disappears if the action of the map 𝜙
on Θ by pushforward is affine.
The latter case of affine transforms is well exemplified by CMA-ES: here, using the
variance and mean as the parametrization of Gaussians, the new mean and variance after
an affine transform of the search space are an affine function of the old mean and variance;
specifically, for the affine transformation 𝐴 : 𝑥 ↦→ 𝐴𝑥 + 𝑏 we have (𝑚, 𝐶) ↦→ (𝐴𝑚 + 𝑏, 𝐴𝐶𝐴T ).
Another example, on the discrete search space 𝑋 = {0, 1}𝑑 , is the exchange of 0 and 1: for
reasonable choices of the family 𝑃𝜃 , the IGO flow and IGO algorithms will be invariant
under such a change in the way the data is presented.
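This affine equivariance of the Gaussian parameters is easy to check numerically; the sketch below (with arbitrary matrices of our choosing) compares the empirical moments of pushed-forward samples with the transformed parameters (𝐴𝑚 + 𝑏, 𝐴𝐶𝐴^⊤).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))                  # arbitrary affine map x -> A x + b
b = rng.normal(size=d)
m = rng.normal(size=d)
C = np.eye(d) + 0.1 * np.ones((d, d))        # a valid covariance matrix

x = rng.multivariate_normal(m, C, size=200_000)
y = x @ A.T + b                              # pushforward of the samples
print(np.allclose(y.mean(axis=0), A @ m + b, atol=0.05))             # True
print(np.allclose(np.cov(y, rowvar=False), A @ C @ A.T, atol=0.1))   # True
```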
Corollary 21 Consider an IGO algorithm with selection scheme 𝑤, step size 𝛿𝑡 and sample size 𝑁. Then, with probability 1,
KL(𝑃_{𝜃^{𝑡+𝛿𝑡}} || 𝑃_{𝜃^𝑡}) / 𝛿𝑡² ≤ (1/2) Var_{[0,1]} 𝑤 + 𝑜(1)_{(𝛿𝑡,𝑁)→(0,∞)} .
For instance, with 𝑤(𝑞) = 1_{𝑞 ≤ 𝑞_0} and neglecting the error terms, an IGO algorithm introduces at most (1/2) 𝛿𝑡² 𝑞_0(1 − 𝑞_0) bits of information (in base 𝑒) per iteration into the probability distribution 𝑃_𝜃. The proof is given in Appendix D.
Thus, the time discretization parameter 𝛿𝑡 is not just an arbitrary variable: it has an
intrinsic interpretation related to a number of bits introduced at each step of the algorithm.
This kind of relationship suggests, more generally, to use the Kullback–Leibler divergence as
an external and objective way to measure learning rates in those optimization algorithms
which use probability distributions.
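As a small numerical illustration of the bound (ours): for truncation selection 𝑤(𝑞) = 1_{𝑞 ≤ 𝑞_0}, Var_{[0,1]} 𝑤 = 𝑞_0(1 − 𝑞_0), so the per-iteration KL divergence is at most (1/2) 𝛿𝑡² 𝑞_0(1 − 𝑞_0).

```python
q0, dt = 0.2, 0.1
var_w = q0 * (1 - q0)              # variance of the indicator 1_{q <= q0} on [0, 1]
kl_bound = 0.5 * dt**2 * var_w     # upper bound in nats (base e) per iteration
print(kl_bound)                    # 0.0008 for q0 = 0.2, dt = 0.1
```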
The result above is only an upper bound. Maximal speed can be achieved only if all “good”
points point in the same direction. If the various good points in the sample suggest moves
in inconsistent directions, then the IGO update will be much smaller. While non-consistent
moves are generally to be expected if 𝑁 < dim Θ, it may also be a sign that the signal is
noisy, or that the family of distributions 𝑃𝜃 is not well suited to the problem at hand and
should be enriched.
As an example, using a family of Gaussian distributions with unknown mean and fixed identity variance on R^𝑑, one checks that for the optimization of a linear function on R^𝑑, with the weight 𝑤(𝑢) = 1_{𝑢<1/2}, the IGO flow moves at constant speed 1/√(2𝜋) ≈ 0.4, whatever the dimension 𝑑. On a rapidly varying sinusoidal function, the moving speed will be much slower because there are “good” and “bad” points in all directions.
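The constant 1/√(2𝜋) is easy to verify by simulation (a sketch, assuming a fixed identity covariance so that the speed of the mean along the gradient direction is |E[𝒩 1_{𝒩<0}]|):

```python
import numpy as np

z = np.random.default_rng(1).standard_normal(1_000_000)
speed = abs(np.mean(z * (z < 0)))          # |E[N 1_{N<0}]| = 1/sqrt(2*pi)
print(speed, 1 / np.sqrt(2 * np.pi))       # both approximately 0.399
```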
This may suggest ways to design the selection scheme 𝑤 to achieve maximal speed in some
instances. Indeed, looking at the proof of the proposition, which involves a Cauchy–Schwarz
inequality, one can see that the maximal speed is achieved only if there is a linear relationship
between the weights 𝑊^𝑓_𝜃(𝑥) and the gradient ∇̃_𝜃 ln 𝑃_𝜃(𝑥). For instance, for the optimization of a linear function on R^𝑑 using Gaussian measures of known variance, the maximal speed will be achieved when the selection scheme 𝑤(𝑢) is the inverse of the Gaussian cumulative distribution function. (In particular, 𝑤(𝑢) tends to +∞ when 𝑢 → 0 and to −∞ when 𝑢 → 1.) This is in accordance with known results: the expected value of the 𝑖-th order statistic of 𝑁 standard Gaussian variates is the optimal ŵ_𝑖 value for Gaussians on the sphere function, 𝑓(𝑥) = ∑_𝑖 𝑥_𝑖², where 𝑑 → ∞ (Beyer, 2001; Arnold, 2006). For 𝑁 → ∞, this order
∙ The IGO algorithm (16), using a family of distributions 𝑃𝜃 on space 𝑋, applied to the
noisy function 𝑓 , and where the samples are ranked according to the random observed
value of 𝑓 (here we assume that, for each sample, the noise 𝜔 is independent from
everything else);
∙ The IGO algorithm on space 𝑋 ×[0, 1], using the family of distributions 𝑃˜𝜃 = 𝑃𝜃 ⊗𝑈[0,1] ,
applied to the deterministic function 𝑓˜. Here 𝑈[0,1] denotes the uniform law on [0, 1].
𝜃^{𝑡+𝛿𝑡} = 𝜃^𝑡 + 𝛿𝑡 𝐼^{−1}(𝜃^𝑡) ∑_{𝑖=1}^𝑁 ŵ_𝑖 ∂ln 𝑃_𝜃(𝑥_𝑖)/∂𝜃 |_{𝜃=𝜃^𝑡}
where each weight 𝑤 ̂︀𝑖 computed from (14) now incorporates noise from the objective
function because the rank of 𝑥𝑖 is computed on the random function, or equivalently
on the deterministic function 𝑓˜: rk(𝑥𝑖 ) = #{𝑗, 𝑓˜(𝑥𝑗 , 𝜔𝑗 ) < 𝑓˜(𝑥𝑖 , 𝜔𝑖 )}.
Consistency of sampling (Theorem 6) thus takes the following form through Proposi-
tion 22: When 𝑁 → ∞, the 𝑁 -sample IGO update rule on the noisy function 𝑓 converges
with probability 1 to the update rule
𝜃^{𝑡+𝛿𝑡} = 𝜃^𝑡 + 𝛿𝑡 ∇̃_𝜃 ∫_0^1 ∫ 𝑊^{𝑓̃}_{𝜃^𝑡}(𝑥, 𝜔) 𝑃_𝜃(d𝑥) d𝜔
        = 𝜃^𝑡 + 𝛿𝑡 ∇̃_𝜃 ∫ W̄^𝑓_{𝜃^𝑡}(𝑥) 𝑃_𝜃(d𝑥)    (57)
where W̄^𝑓_𝜃(𝑥) = E_𝜔 𝑊^{𝑓̃}_𝜃(𝑥, 𝜔). Thus when 𝑁 → ∞ the update becomes deterministic as the effects of noise get averaged, as could be expected.
Thus, applying the IGO algorithm to noisy objective functions as in Proposition 22
requires defining the IGO flow for noisy objective functions as
d𝜃^𝑡/d𝑡 = ∇̃_𝜃 ∫ W̄^𝑓_{𝜃^𝑡}(𝑥) 𝑃_𝜃(d𝑥)    (58)
where W̄^𝑓_𝜃(𝑥) = E_𝜔 𝑊^{𝑓̃}_𝜃(𝑥, 𝜔) is the average weight of 𝑥 over the values 𝑓(𝑥). Thus, the effects of noise remain visible even for 𝑁 → ∞, as W̄ is in general flatter than 𝑊.
Note that there is another possible way to define the IGO flow for noisy objective
functions. The flow (58) amounts to taking a point 𝑥, computing a (random) value 𝑓 (𝑥),
computing the level 𝑞𝜃< (𝑥) = Pr𝑥′ ∼𝑃𝜃 (𝑓 (𝑥′ ) < 𝑓 (𝑥)) of this random value 𝑓 (𝑥), applying
the selection scheme 𝑤, and then averaging. In this approach the value 𝑞𝜃< (𝑥) is a random
variable depending on the value of 𝑓 (𝑥). Alternatively, we could define the value 𝑞𝜃 (𝑥) by
applying Definition 3 unchanged, namely, 𝑞𝜃< (𝑥) = Pr𝑥′ ∼𝑃𝜃 (𝑓 (𝑥′ ) < 𝑓 (𝑥)) in which 𝑓 (𝑥) itself
is treated as a random variable under the probability Pr, so that 𝑞𝜃< (𝑥) is deterministic and
takes into account all possible values for 𝑓 (𝑥). Then we could apply the selection scheme
𝑤 to this value 𝑞𝜃< (𝑥) as in Definition 3 and define the IGO flow accordingly. The value of
𝑞𝜃< (𝑥) in this second version is the expected value of 𝑞𝜃< (𝑥) in the first version.
When the selection scheme 𝑤 is affine, these two approaches coincide; however this is not
the case in general, as the second version averages the 𝑞-values while the first version averages
the weights 𝑤(𝑞). The second version would be the 𝑁 → ∞ limit of a slightly more complex
algorithm using several evaluations of 𝑓 for each sample 𝑥𝑖 in order to compute noise-free
average ranks and quantiles. The first version has the advantage of leaving the algorithm
unchanged and of inheriting properties from the deterministic case via Proposition 22.
C.4 The IGO Flow for Linear Functions on {0, 1}𝑑 and R𝑑
In this section we take a closer look at the IGO differential equation solutions of (8) for some
simple examples of objective functions, for which it is possible to obtain exact information
about these IGO trajectories.
We start with the discrete search space 𝑋 = {0, 1}^𝑑 and linear functions (to be minimized) defined as 𝑓(𝑥) = 𝑐 − ∑_{𝑖=1}^𝑑 𝛼_𝑖 𝑥_𝑖 with 𝛼_𝑖 > 0. (So maximization of the classical onemax function 𝑓_onemax(𝑥) = ∑_{𝑖=1}^𝑑 𝑥_𝑖 is covered by setting 𝛼_𝑖 = 1.) The differential equation of
the IGO flow (8) for the Bernoulli measures 𝑃𝜃 (𝑥) = 𝑝𝜃1 (𝑥1 ) . . . 𝑝𝜃𝑑 (𝑥𝑑 ) defined on 𝑋 is the
𝛿𝑡 → 0 limit of the IGO-PBIL update (36):
d𝜃_𝑖^𝑡/d𝑡 = ∫ 𝑊^𝑓_{𝜃^𝑡}(𝑥)(𝑥_𝑖 − 𝜃_𝑖^𝑡) 𝑃_{𝜃^𝑡}(d𝑥) =: 𝑔_𝑖(𝜃^𝑡) .    (59)
Although finding the analytical solution of the differential equation (59) for any initial
condition seems a bit intricate, we show that the solution of (59) converges to (1, . . . , 1)
starting from any initial 𝜃 ∈ (0, 1]𝑑 . Note that starting with 𝜃𝑖 = 0 for some 𝑖 prevents IGO
(and PBIL) from sampling any 1 for component 𝑖, so that the components of 𝜃 that are
equal to 0 at startup will always stay so; in that case the IGO flow effectively works as in a
smaller-dimensional space.
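The behavior of the flow (59) can be observed numerically by an Euler discretization with a large sample size, using the rank-based weights Ŵ of Proposition 27; the following sketch (ours) does this for onemax with truncation selection and converges to (1, . . . , 1).

```python
import numpy as np

def igo_pbil_flow_onemax(d=10, N=2000, dt=0.05, steps=400, q0=0.5, seed=0):
    """Euler scheme for the flow (59) maximizing onemax (alpha_i = 1),
    with w(q) = 1_{q <= q0} and rank-based weights (ties broken arbitrarily
    rather than averaged as in Proposition 27)."""
    rng = np.random.default_rng(seed)
    theta = np.full(d, 0.5)
    for _ in range(steps):
        x = (rng.random((N, d)) < theta).astype(float)
        f = -x.sum(axis=1)                        # minimize f = c - sum(x_i)
        ranks = np.argsort(np.argsort(f))         # rank 0 = best sample
        w_hat = ((ranks + 0.5) / N <= q0).astype(float)
        theta += dt * (w_hat[:, None] * (x - theta)).mean(axis=0)
        theta = np.clip(theta, 0.0, 1.0)
    return theta

print(igo_pbil_flow_onemax())                     # entries close to 1
```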
To prove convergence to (1, . . . , 1), we establish the following result:
Lemma 23 Assume that the selection scheme 𝑤 : [0, 1] → R is bounded, that 𝑤 is a nonincreasing function and that there exist 𝑞_0, 𝑞_1 ∈ (0, 1) such that 𝑤(𝑞_0) ≠ 𝑤(𝑞_1). On 𝑓(𝑥) = 𝑐 − ∑_{𝑖=1}^𝑑 𝛼_𝑖 𝑥_𝑖, the solution of (59) satisfies ∑_{𝑖=1}^𝑑 𝛼_𝑖 d𝜃_𝑖^𝑡/d𝑡 ≥ 0; moreover ∑_{𝑖=1}^𝑑 𝛼_𝑖 d𝜃_𝑖^𝑡/d𝑡 = 0 if and only if 𝜃^𝑡 ∈ {0, 1}^𝑑.
Proof We want to prove that Cov[−𝑊^𝑓_{𝜃^𝑡}(𝑥), 𝑓(𝑥)] ≥ 0 with equality if and only if 𝜃^𝑡 ∈ {0, 1}^𝑑.
Let us define the random variable 𝑍 = 𝑓(𝑥) when 𝑥 ∼ 𝑃_{𝜃^𝑡}. Remark that 𝑊^𝑓_{𝜃^𝑡}(𝑥) = G(𝑍) where
G(𝑧) = (1 / (𝐹^≤_{𝜃^𝑡}(𝑧) − 𝐹^<_{𝜃^𝑡}(𝑧))) ∫_{𝐹^<_{𝜃^𝑡}(𝑧)}^{𝐹^≤_{𝜃^𝑡}(𝑧)} 𝑤(𝑞) d𝑞 ,    (61)
where 𝐹^≤_{𝜃^𝑡}(𝑧) = Pr_{𝑥′∼𝑃_{𝜃^𝑡}}(𝑓(𝑥′) ≤ 𝑧) and 𝐹^<_{𝜃^𝑡}(𝑧) = Pr_{𝑥′∼𝑃_{𝜃^𝑡}}(𝑓(𝑥′) < 𝑧). That is, G is the average of 𝑤 on the interval (𝐹^<_{𝜃^𝑡}(𝑧), 𝐹^≤_{𝜃^𝑡}(𝑧)). Since 𝑤 is nonincreasing, the function G is also nonincreasing. We have the following equality
Cov[−𝑊^𝑓_{𝜃^𝑡}(𝑥), 𝑓(𝑥)] = Cov[−G(𝑍), 𝑍]
Let us assume that one single 𝜃𝑖 belongs to (0, 1) and all the others are either 0 or 1. We
can assume without loss of generality that 𝜃1 ∈ (0, 1). Then 𝑓 (𝑥) takes (only) two distinct
values with positive probability; let 𝑧1 be the one associated with 𝑥1 = 1 and 𝑧2 with 𝑥1 = 0.
Then 𝑧1 < 𝑧2 thanks to the definition of 𝑓 and because 𝛼𝑖 > 0. Moreover, unravelling the
definitions we find G(𝑧_1) = (1/𝜃_1) ∫_0^{𝜃_1} 𝑤(𝑞) d𝑞 and G(𝑧_2) = (1/(1 − 𝜃_1)) ∫_{𝜃_1}^1 𝑤(𝑞) d𝑞.
The assumption states that there exist 𝑞_0, 𝑞_1 ∈ (0, 1) such that 𝑤(𝑞_0) ≠ 𝑤(𝑞_1). Assume without loss of generality that 𝑞_0 < 𝑞_1 and 𝑤(𝑞_0) > 𝑤(𝑞_1). Either 𝑤(𝑞_1) ≥ 𝑤(𝜃_1) or 𝑤(𝑞_1) < 𝑤(𝜃_1). In the first case, then 𝜃_1 > 𝑞_0 and
(1/𝜃_1) ∫_0^{𝜃_1} 𝑤(𝑞) d𝑞 = (1/𝜃_1) ( ∫_0^{𝑞_0} 𝑤(𝑞) d𝑞 + ∫_{𝑞_0}^{𝜃_1} 𝑤(𝑞) d𝑞 ) ≥ (1/𝜃_1) ( 𝑤(𝑞_0) 𝑞_0 + 𝑤(𝜃_1)(𝜃_1 − 𝑞_0) ) > 𝑤(𝜃_1) ,
Meanwhile, G(𝑧_2) = (1/(1 − 𝜃_1)) ∫_{𝜃_1}^1 𝑤(𝑞) d𝑞 ≤ 𝑤(𝜃_1) and thus
We are now ready to state the convergence of the trajectories of IGO-PBIL (and cGA)
update in the limit of 𝛿𝑡 → 0.
differential equation (59) is stable and for any initial condition 𝜃 ∈ (0; 1]𝑑 , the continuous-time
trajectory solving (59) converges to (1, . . . , 1).
Proof Remark first that if we start at 𝜃0 ∈ [𝜀, 1]𝑑 , then the solution of (59) stays in
[𝜀, 1]𝑑 . Indeed, the trajectory cannot go out of [0, 1]𝑑 and, in addition, for 𝜃 ∈ [𝜀, 1)𝑑 we
have 𝑔𝑖 (𝜃) > 0 so that the trajectory of (59) cannot go out of [𝜀, 1]𝑑 . The proof that
𝑔𝑖 (𝜃) > 0 for 𝜃 ∈ [𝜀, 1)𝑑 is similar to the proof that Cov[−𝑊𝜃𝑓𝑡 (𝑥), 𝑓 (𝑥)] > 0 in Lemma 24:
namely, we can write 𝑔_𝑖(𝜃) = ∫ E[𝑊^𝑓_𝜃(𝑥) | 𝑥_𝑖] (𝑥_𝑖 − 𝜃_𝑖) 𝑝_{𝜃_𝑖}(𝑥_𝑖) d𝑥_𝑖 = 𝜃_𝑖(1 − 𝜃_𝑖)[ℎ(1) − ℎ(0)] where ℎ(𝑥_𝑖) = E[𝑊^𝑓_𝜃(𝑥) | 𝑥_𝑖] = E[G(𝑓(𝑥)) | 𝑥_𝑖], where G is defined in (61). Then ℎ(1) > ℎ(0) comes from the increase of G when going to better function values.
Consider now on [𝜀, 1]^𝑑 the non-negative function 𝑉(𝜃) = ∑_{𝑖=1}^𝑑 𝛼_𝑖 − ∑_{𝑖=1}^𝑑 𝛼_𝑖 𝜃_𝑖. Then 𝑉*(𝜃) := ∇𝑉(𝜃) · 𝑔(𝜃) ≤ 0, and 𝑉*(𝜃) equals zero only for 𝜃 ∈ {0, 1}^𝑑 according to Lemma 23. Hence for 𝜃 ∈ [𝜀, 1]^𝑑, the function 𝑉 is a Lyapunov function (Khalil, 1996; Agarwal and O’Regan, 2008), which is minimal at (1, . . . , 1) and such that 𝑉*(𝜃) < 0 except at (1, . . . , 1).
Therefore the trajectory of (59) will converge to (1, . . . , 1) as 𝑡 goes to infinity (the proof
is similar to that of (Khalil, 1996, Theorem 4.1)). Given that this holds for any 𝜀 > 0, we
can conclude that the trajectory of (59) converges to (1, . . . , 1) starting from any 𝜃 ∈ (0, 1]𝑑 .
d𝑚^𝑡/d𝑡 = ∫_{R^𝑑} 𝑊^𝑓_{𝜃^𝑡}(𝑥) (𝑥 − 𝑚^𝑡) 𝑃_{𝒩(𝑚^𝑡,(𝜎^𝑡)² 𝐼_𝑑)}(𝑥) d𝑥    (63)
d𝜎̃^𝑡/d𝑡 = ∫_{R^𝑑} (1/(2𝑑)) { ∑_{𝑖=1}^𝑑 [ ((𝑥_𝑖 − 𝑚_𝑖^𝑡)/𝜎^𝑡)² − 1 ] } 𝑊^𝑓_{𝜃^𝑡}(𝑥) 𝑃_{𝒩(𝑚^𝑡,(𝜎^𝑡)² 𝐼_𝑑)}(𝑥) d𝑥    (64)
d𝑚^𝑡/d𝑡 = 𝜎^𝑡 E[ 𝑊^𝑓_{𝜃^𝑡}(𝑚^𝑡 + 𝜎^𝑡 𝒩) 𝒩 ]    (65)
d𝜎̃^𝑡/d𝑡 = E[ (1/2) ( ‖𝒩‖²/𝑑 − 1 ) 𝑊^𝑓_{𝜃^𝑡}(𝑚^𝑡 + 𝜎^𝑡 𝒩) ] .    (66)
Let us analyze the solution of the previous system on linear functions. Without loss of
generality (thanks to invariance) we can consider the linear function 𝑓 (𝑥) = 𝑥1 . We have
with ℱ the cumulative distribution of a standard normal distribution, and 𝒩1 the first
component of 𝒩 . The differential equation thus simplifies into
d𝑚^𝑡/d𝑡 = 𝜎^𝑡 ( E[𝑤(ℱ(𝒩_1)) 𝒩_1], 0, . . . , 0 )^⊤    (69)
d𝜎̃^𝑡/d𝑡 = (1/(2𝑑)) E[ (|𝒩_1|² − 1) 𝑤(ℱ(𝒩_1)) ] .    (70)
Consider, for example, the truncation selection function, i.e. 𝑤(𝑞) = 1_{𝑞 ≤ 𝑞_0} where 𝑞_0 ∈
(0, 1). (Within the IGO algorithm, this results in so-called intermediate recombination of
the 𝑞0 × 𝑁 best samples.) We find that
d𝑚_1^𝑡/d𝑡 = 𝜎^𝑡 E[𝒩_1 1_{𝒩_1 ≤ ℱ^{−1}(𝑞_0)}] =: 𝜎^𝑡 𝛽    (71)
d𝜎̃^𝑡/d𝑡 = (1/(2𝑑)) ( ∫_0^{𝑞_0} ℱ^{−1}(𝑢)² d𝑢 − 𝑞_0 ) =: 𝛼 .    (72)
The solution of the IGO flow for the linear function 𝑓(𝑥) = 𝑥_1 is thus given by
𝑚_1^𝑡 = 𝑚_1^0 + (𝜎^0 𝛽 / 𝛼) exp(𝛼𝑡)    (73)
𝜎^𝑡 = 𝜎^0 exp(𝛼𝑡) .    (74)
The coefficient 𝛽 is negative for any 𝑞0 < 1. The coefficient 𝛼 is positive if and only if
𝑞0 < 1/2 by a simple calculus argument12 ; this corresponds to selecting less than half of the
sampled points in an ES. In this case the step size 𝜎 𝑡 grows exponentially fast to infinity
and the mean vector moves along the gradient direction towards minus ∞ at the same
rate. If more than half of the points are selected, 𝑞0 > 1/2, the step size will decrease to
zero exponentially fast and the mean vector will get stuck (compare also Grahl et al. 2005;
Hansen 2006a; Pošík 2008).
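The constants 𝛽 and 𝛼 of (71)–(72) can be estimated by simple Monte Carlo (a sketch, assuming scipy is available for the normal quantile function); the sign change of 𝛼 at 𝑞_0 = 1/2 is directly visible.

```python
import numpy as np
from scipy.stats import norm

def alpha_beta(q0, d=10, n=2_000_000, seed=0):
    """Monte Carlo estimates of beta = E[N 1_{N <= F^{-1}(q0)}] and
    alpha = (1/(2d)) E[(N^2 - 1) 1_{N <= F^{-1}(q0)}], as in (71)-(72)."""
    z = np.random.default_rng(seed).standard_normal(n)
    sel = z <= norm.ppf(q0)
    beta = np.mean(z * sel)
    alpha = np.mean((z**2 - 1) * sel) / (2 * d)
    return alpha, beta

for q0 in (0.2, 0.4, 0.6, 0.8):
    a, b = alpha_beta(q0)
    print(q0, round(a, 4), round(b, 4))   # alpha > 0 exactly when q0 < 1/2
```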
For an analysis of the solutions of the system of differential equations (65) and (66)
on more complex functions, namely convex-quadratic functions and twice continuously
differentiable functions, we refer to Akimoto et al. (2012).
12. Indeed 𝛼 = (1/(2𝑑√(2𝜋))) ∫_{−∞}^{ℱ^{−1}(𝑞_0)} (𝑥² − 1) exp(−𝑥²/2) d𝑥 = (1/(2𝑑√(2𝜋))) 𝑔(ℱ^{−1}(𝑞_0)) where 𝑔(𝑦) = ∫_{−∞}^𝑦 (𝑥² − 1) exp(−𝑥²/2) d𝑥. Using 𝑔(0) = 0 and lim_{𝑦→±∞} 𝑔(𝑦) = 0, and studying the sign of 𝑔′(𝑦), we find that 𝑔 is positive for 𝑦 < 0 and negative for 𝑦 > 0. Since ℱ^{−1}(𝑞_0) < 0 if and only if 𝑞_0 < 1/2, we find that 𝛼 = (1/(2𝑑√(2𝜋))) 𝑔(ℱ^{−1}(𝑞_0)) is positive if and only if 𝑞_0 < 1/2.
Appendix D. Proofs
This final appendix provides longer proofs of propositions and theorems of the paper.
∇𝑓(𝑥) = lim_{𝜀→0} (1/𝜀) arg max_ℎ { 𝑓(𝑥 + ℎ) − (1/(2𝜀)) ‖ℎ‖² }
arg max_𝑦 { 𝑓(𝑦) − (1/(2𝜀)) ‖𝑦 − 𝑥‖² } = 𝑥 + 𝜀∇𝑓(𝑥) + 𝑂(𝜀²)
when 𝜀 → 0.
Proposition 2 follows by taking the Fisher information metric at 𝜃0 for the metric ‖ · ‖2 ,
and using the relation KL(𝑃 || 𝑄) = − Ent(𝑃 ) + log #𝑋 if 𝑄 is the uniform distribution on
a finite space 𝑋.
Proposition 27 Let 𝜃 ∈ Θ. Assume that the derivative ∂ln 𝑃_𝜃(𝑥)/∂𝜃 exists for 𝑃_𝜃-almost all 𝑥 ∈ 𝑋 and that E_{𝑃_𝜃} |∂ln 𝑃_𝜃(𝑥)/∂𝜃|² < +∞. Assume that the function 𝑤 is non-increasing and bounded.
Let (𝑥𝑖 )𝑖∈N be a sequence of independent samples of 𝑃𝜃 . Then with probability 1, as
𝑁 → ∞ we have
(1/𝑁) ∑_{𝑖=1}^𝑁 Ŵ^𝑓(𝑥_𝑖) ∂ln 𝑃_𝜃(𝑥_𝑖)/∂𝜃 → ∫ 𝑊^𝑓_𝜃(𝑥) ∂ln 𝑃_𝜃(𝑥)/∂𝜃 𝑃_𝜃(d𝑥)
where
Ŵ^𝑓(𝑥_𝑖) = 𝑤( (rk_𝑁(𝑥_𝑖) + 1/2) / 𝑁 )
with rk_𝑁(𝑥_𝑖) = #{1 ≤ 𝑗 ≤ 𝑁, 𝑓(𝑥_𝑗) < 𝑓(𝑥_𝑖)}. (When there are 𝑓-ties in the sample, Ŵ^𝑓(𝑥_𝑖) is defined as the average of 𝑤((𝑟 + 1/2)/𝑁) over the possible rankings 𝑟 of 𝑥_𝑖.)
Proof Let 𝑔 : 𝑋 → R be any function with E_{𝑃_𝜃} 𝑔² < ∞. We will show that (1/𝑁) ∑ Ŵ^𝑓(𝑥_𝑖) 𝑔(𝑥_𝑖) → ∫ 𝑊^𝑓_𝜃(𝑥) 𝑔(𝑥) 𝑃_𝜃(d𝑥). Applying this with 𝑔 equal to the components of ∂ln 𝑃_𝜃(𝑥)/∂𝜃 will yield the result.
Let us decompose
(1/𝑁) ∑ Ŵ^𝑓(𝑥_𝑖) 𝑔(𝑥_𝑖) = (1/𝑁) ∑ 𝑊^𝑓_𝜃(𝑥_𝑖) 𝑔(𝑥_𝑖) + (1/𝑁) ∑ (Ŵ^𝑓(𝑥_𝑖) − 𝑊^𝑓_𝜃(𝑥_𝑖)) 𝑔(𝑥_𝑖).
Each summand in the first term involves only one sample 𝑥_𝑖 (contrary to Ŵ^𝑓(𝑥_𝑖), which depends on the whole sample). So by the strong law of large numbers, almost surely (1/𝑁) ∑ 𝑊^𝑓_𝜃(𝑥_𝑖) 𝑔(𝑥_𝑖) converges to ∫ 𝑊^𝑓_𝜃(𝑥) 𝑔(𝑥) 𝑃_𝜃(d𝑥). So we have to show that the second term converges to 0 almost surely.
By the Cauchy–Schwarz inequality, we have
| (1/𝑁) ∑ (Ŵ^𝑓(𝑥_𝑖) − 𝑊^𝑓_𝜃(𝑥_𝑖)) 𝑔(𝑥_𝑖) |² ≤ ( (1/𝑁) ∑ (Ŵ^𝑓(𝑥_𝑖) − 𝑊^𝑓_𝜃(𝑥_𝑖))² ) ( (1/𝑁) ∑ 𝑔(𝑥_𝑖)² )
By the strong law of large numbers, the second term (1/𝑁) ∑ 𝑔(𝑥_𝑖)² converges to E_{𝑃_𝜃} 𝑔² almost surely.
Ŵ^𝑓(𝑥_𝑖) = (1 / (𝑟_𝑖^≤ − 𝑟_𝑖^<)) ∑_{𝑘=𝑟_𝑖^<}^{𝑟_𝑖^≤ − 1} 𝑤((𝑘 + 1/2)/𝑁)
and moreover 𝑊^𝑓_𝜃(𝑥_𝑖) = 𝑤(𝑞_𝑖^<) if 𝑞_𝑖^< = 𝑞_𝑖^≤, or 𝑊^𝑓_𝜃(𝑥_𝑖) = (1/(𝑞_𝑖^≤ − 𝑞_𝑖^<)) ∫_{𝑞_𝑖^<}^{𝑞_𝑖^≤} 𝑤 otherwise.
The Glivenko–Cantelli theorem (Billingsley, 1995, Theorem 20.6) implies that sup_𝑖 |𝑞_𝑖^≤ − 𝑟_𝑖^≤/𝑁| tends to 0 almost surely, and likewise for sup_𝑖 |𝑞_𝑖^< − 𝑟_𝑖^</𝑁|. So let 𝑁 be large enough so that these errors are bounded by 𝜀.
Since 𝑤 is non-increasing, we have 𝑤(𝑞_𝑖^<) ≤ 𝑤(𝑟_𝑖^</𝑁 − 𝜀). In the case 𝑞_𝑖^< ≠ 𝑞_𝑖^≤, we decompose the interval [𝑞_𝑖^<; 𝑞_𝑖^≤] into (𝑟_𝑖^≤ − 𝑟_𝑖^<) subintervals. The average of 𝑤 over each such subinterval is compared to a term in the sum defining Ŵ^𝑓(𝑥_𝑖): since 𝑤 is non-increasing, the average of 𝑤 over the 𝑘-th subinterval is at most 𝑤((𝑟_𝑖^< + 𝑘)/𝑁 − 𝜀). So we get
𝑊^𝑓_𝜃(𝑥_𝑖) ≤ (1/(𝑟_𝑖^≤ − 𝑟_𝑖^<)) ∑_{𝑘=𝑟_𝑖^<}^{𝑟_𝑖^≤ − 1} 𝑤(𝑘/𝑁 − 𝜀)
so that
𝑊^𝑓_𝜃(𝑥_𝑖) − Ŵ^𝑓(𝑥_𝑖) ≤ (1/(𝑟_𝑖^≤ − 𝑟_𝑖^<)) ∑_{𝑘=𝑟_𝑖^<}^{𝑟_𝑖^≤ − 1} (𝑤(𝑘/𝑁 − 𝜀) − 𝑤((𝑘 + 1/2)/𝑁)).
Let us sum over 𝑖, remembering that there are (𝑟_𝑖^≤ − 𝑟_𝑖^<) values of 𝑗 for which 𝑓(𝑥_𝑗) = 𝑓(𝑥_𝑖). Taking the positive part, we get
(1/𝑁) ∑_{𝑖=1}^𝑁 [𝑊^𝑓_𝜃(𝑥_𝑖) − Ŵ^𝑓(𝑥_𝑖)]_+ ≤ (1/𝑁) ∑_{𝑘=0}^{𝑁−1} (𝑤(𝑘/𝑁 − 𝜀) − 𝑤((𝑘 + 1/2)/𝑁)).
and
(1/𝑁) ∑_{𝑘=0}^{𝑁−1} 𝑤((𝑘 + 1/2)/𝑁) ≥ ∫_{1/2𝑁}^{1+1/2𝑁} 𝑤
(we implicitly extend the range of 𝑤 so that 𝑤(𝑞) = 𝑤(0) for 𝑞 < 0 and likewise for 𝑞 > 1).
So we have
(1/𝑁) ∑_{𝑖=1}^𝑁 [𝑊^𝑓_𝜃(𝑥_𝑖) − Ŵ^𝑓(𝑥_𝑖)]_+ ≤ ∫_{−𝜀−1/𝑁}^{1/2𝑁} 𝑤 − ∫_{1−𝜀−1/𝑁}^{1+1/2𝑁} 𝑤 ≤ (2𝜀 + 3/𝑁) 𝐵
on the three sets 𝑓(𝑥) < 𝑚, 𝑓(𝑥) = 𝑚 and 𝑓(𝑥) > 𝑚, we get that 𝑔_𝑡(𝜃) = Pr_{𝑥∼𝑃_𝜃}(𝑓(𝑥) < 𝑚) + Pr_{𝑥∼𝑃_𝜃}(𝑓(𝑥) = 𝑚) (𝑞 − 𝑝^−)/𝑝_𝑚. In particular, 𝑔_𝑡(𝜃^𝑡) = 𝑞.
Since we follow a gradient ascent of 𝑔𝑡 , for 𝛿𝑡 small enough we have 𝑔𝑡 (𝜃𝑡+𝛿𝑡 ) > 𝑔𝑡 (𝜃𝑡 )
unless the gradient vanishes. If the gradient vanishes we have 𝜃𝑡+𝛿𝑡 = 𝜃𝑡 and the quantiles
are the same. Otherwise we get 𝑔𝑡 (𝜃𝑡+𝛿𝑡 ) > 𝑔𝑡 (𝜃𝑡 ) = 𝑞.
Since (𝑞 − 𝑝^−)/𝑝_𝑚 ≤ ((𝑝^− + 𝑝_𝑚) − 𝑝^−)/𝑝_𝑚 = 1, we have 𝑔_𝑡(𝜃) ≤ Pr_{𝑥∼𝑃_𝜃}(𝑓(𝑥) < 𝑚) + Pr_{𝑥∼𝑃_𝜃}(𝑓(𝑥) = 𝑚) = Pr_{𝑥∼𝑃_𝜃}(𝑓(𝑥) ≤ 𝑚).
So Pr_{𝑥∼𝑃_{𝜃^{𝑡+𝛿𝑡}}}(𝑓(𝑥) ≤ 𝑚) ≥ 𝑔_𝑡(𝜃^{𝑡+𝛿𝑡}) > 𝑞. This implies, by definition, that the 𝑞-quantile value of 𝑃_{𝜃^{𝑡+𝛿𝑡}} is at most 𝑚. Moreover, if the objective function has no plateau then Pr_{𝑥∼𝑃_{𝜃^{𝑡+𝛿𝑡}}}(𝑓(𝑥) = 𝑚) = 0 and so Pr_{𝑥∼𝑃_{𝜃^{𝑡+𝛿𝑡}}}(𝑓(𝑥) < 𝑚) > 𝑞, which implies that the 𝑞-quantile of 𝑃_{𝜃^{𝑡+𝛿𝑡}} is strictly less than 𝑚.
Then we have
𝜃 = arg max { ∫ ln 𝑃_𝜃(𝑥) 𝑃_{𝜃_0}(d𝑥) + 𝜀 ∫ ln 𝑃_𝜃(𝑥) (𝑊(𝑥) − 1) 𝑃_{𝜃_0}(d𝑥) }
  = arg max { ∫ ln 𝑃_𝜃(𝑥) 𝑃_{𝜃_0}(d𝑥) − ∫ ln 𝑃_{𝜃_0}(𝑥) 𝑃_{𝜃_0}(d𝑥) + 𝜀 ∫ ln 𝑃_𝜃(𝑥) (𝑊(𝑥) − 1) 𝑃_{𝜃_0}(d𝑥) }
When 𝜀 → 0, the first term exceeds the second one if KL(𝑃𝜃0 || 𝑃𝜃 ) is too large (because
𝑊 is bounded), and so KL(𝑃𝜃0 || 𝑃𝜃 ) tends to 0. So we can assume that 𝜃 is close to 𝜃0 .
When 𝜃 = 𝜃_0 + 𝛿𝜃 is close to 𝜃_0, we have
KL(𝑃_{𝜃_0} || 𝑃_𝜃) = (1/2) ∑ 𝐼_{𝑖𝑗}(𝜃_0) 𝛿𝜃_𝑖 𝛿𝜃_𝑗 + 𝑂(𝛿𝜃³)
with 𝐼𝑖𝑗 (𝜃0 ) the Fisher matrix at 𝜃0 . (This actually holds both for KL(𝑃𝜃0 || 𝑃𝜃 ) and
KL(𝑃𝜃 || 𝑃𝜃0 ).)
Thus, we can apply Lemma 26 using the Fisher metric ∑ 𝐼_{𝑖𝑗}(𝜃_0) 𝛿𝜃_𝑖 𝛿𝜃_𝑗, and working on a small neighborhood of 𝜃_0 in 𝜃-space (which can be identified with R^{dim Θ}). The lemma states that the argmax above is attained at
𝜃 = 𝜃_0 + 𝜀 ∇̃_𝜃 ∫ ln 𝑃_𝜃(𝑥) (𝑊(𝑥) − 1) 𝑃_{𝜃_0}(d𝑥)
up to 𝑂(𝜀²), with ∇̃ the natural gradient.
Finally, the gradient cancels the constant −1 because ∫ (∇̃ ln 𝑃_𝜃) 𝑃_{𝜃_0} = 0 at 𝜃 = 𝜃_0. This proves Theorem 10.
∇̃_{T̄_𝑖} ln 𝑃(𝑥).
Proposition 28 Let 𝛿𝜃𝑖 and 𝛿 ′ 𝜃𝑖 be two small variations of the parameters 𝜃𝑖 . Let 𝛿𝑃 (𝑥)
and 𝛿 ′ 𝑃 (𝑥) be the resulting variations of the probability distribution 𝑃 , and 𝛿 𝑇¯𝑖 and 𝛿 ′ 𝑇¯𝑖 the
resulting variations of 𝑇¯𝑖 . Then the scalar product, in Fisher information metric, between
the tangent vectors 𝛿𝑃 and 𝛿′𝑃, is
⟨𝛿𝑃, 𝛿′𝑃⟩ = ∑_𝑖 𝛿T̄_𝑖 𝛿′𝜃_𝑖    (75)
         = ∑_𝑖 𝛿𝜃_𝑖 𝛿′T̄_𝑖 .    (76)
Proof First, let us prove that ∂T̄_𝑖/∂𝜃_𝑗 = 𝐼_{𝑖𝑗}, the Fisher matrix for 𝜃. Indeed, ∂T̄_𝑖/∂𝜃_𝑗 = ∫_𝑥 𝑇_𝑖(𝑥) ∂𝑃(𝑥)/∂𝜃_𝑗 = ∫_𝑥 𝑇_𝑖(𝑥) 𝑃(𝑥) ∂ln 𝑃(𝑥)/∂𝜃_𝑗 = ∫_𝑥 𝑇_𝑖(𝑥) 𝑃(𝑥)(𝑇_𝑗(𝑥) − T̄_𝑗) by (19); this is equal to Cov(𝑇_𝑖, 𝑇_𝑗), which is 𝐼_{𝑖𝑗} by (20).
Then, by definition of the Fisher metric we have ⟨𝛿𝑃, 𝛿′𝑃⟩ = ∑_{𝑖,𝑗} 𝐼_{𝑖𝑗} 𝛿𝜃_𝑖 𝛿′𝜃_𝑗, but ∑_𝑗 𝐼_{𝑖𝑗} 𝛿′𝜃_𝑗 is equal to 𝛿′T̄_𝑖 by the above, because 𝛿′T̄_𝑖 = ∑_𝑗 (∂T̄_𝑖/∂𝜃_𝑗) 𝛿′𝜃_𝑗. Thus we find ⟨𝛿𝑃, 𝛿′𝑃⟩ = ∑_𝑖 𝛿𝜃_𝑖 𝛿′T̄_𝑖 as needed.
for 𝛿𝜃. Then (75) is proved along the same lines as (76).
Back to the proof of Theorem 12. We can now compute the desired terms:
∇̃_{T̄_𝑖} ln 𝑃(𝑥) = ∂ln 𝑃(𝑥)/∂𝜃_𝑖 = 𝑇_𝑖(𝑥) − T̄_𝑖
by (19). This proves the first statement (34) in Theorem 12 about the form of the IGO update in these parameters.
The other statements follow easily from this together with the additional fact (33) that, for any set of (positive or negative) weights 𝑎_𝑖 with ∑ 𝑎_𝑖 = 1, the value 𝑇* = ∑_𝑖 𝑎_𝑖 𝑇(𝑥_𝑖) maximizes ∑_𝑖 𝑎_𝑖 ln 𝑃(𝑥_𝑖).
‖𝑣‖ = sup_{𝑤, ‖𝑤‖≤1} (𝑣 · 𝑤)
and in particular
so that
E((𝑤 · 𝑋)²) = E( (∑_𝑖 𝑤_𝑖 𝑋_𝑖)(∑_𝑗 𝑤_𝑗 𝑋_𝑗) ) = ∑_𝑖 ∑_𝑗 E(𝑤_𝑖 𝑋_𝑖 𝑤_𝑗 𝑋_𝑗) = ∑_𝑖 ∑_𝑗 𝑤_𝑖 𝑤_𝑗 E(𝑋_𝑖 𝑋_𝑗) = ∑_𝑖 ∑_𝑗 𝑤_𝑖 𝑤_𝑗 𝐶_{𝑖𝑗}
with 𝐶_{𝑖𝑗} the covariance matrix of 𝑋. The latter expression is the scalar product (𝑤 · 𝐶𝑤). Since 𝐶 is a symmetric positive-semidefinite matrix, (𝑤 · 𝐶𝑤) is at most 𝜆‖𝑤‖² with 𝜆 the largest eigenvalue of 𝐶.
For the IGO flow we have d𝜃^𝑡/d𝑡 = E_{𝑥∼𝑃_𝜃} 𝑊^𝑓_𝜃(𝑥) ∇̃_𝜃 ln 𝑃_𝜃(𝑥).
So applying the lemma, we get that the norm of d𝜃^𝑡/d𝑡 is at most √(𝜆 Var_{𝑥∼𝑃_𝜃} 𝑊^𝑓_𝜃(𝑥)) where 𝜆 is the largest eigenvalue of the covariance matrix of ∇̃_𝜃 ln 𝑃_𝜃(𝑥) (expressed in a coordinate system where the Fisher matrix at the current point 𝜃 is the identity).
By construction of the quantiles, we have Var_{𝑥∼𝑃_𝜃} 𝑊^𝑓_𝜃(𝑥) ≤ Var_{[0,1]} 𝑤 (with equality unless there are ties). Indeed, for a given 𝑥, let 𝒰 be a uniform random variable in [0, 1] independent from 𝑥 and define the random variable 𝑄 = 𝑞^<(𝑥) + (𝑞^≤(𝑥) − 𝑞^<(𝑥)) 𝒰. Then 𝑄 is uniformly distributed between the upper and lower quantiles 𝑞^≤(𝑥) and 𝑞^<(𝑥) and thus we can rewrite 𝑊^𝑓_𝜃(𝑥) as E(𝑤(𝑄)|𝑥). By the Jensen inequality we have Var 𝑊^𝑓_𝜃(𝑥) = Var E(𝑤(𝑄)|𝑥) ≤ Var 𝑤(𝑄). In addition when 𝑥 is taken under 𝑃_𝜃, 𝑄 is uniformly distributed in [0, 1] and thus Var 𝑤(𝑄) = Var_{[0,1]} 𝑤, i.e., Var_{𝑥∼𝑃_𝜃} 𝑊^𝑓_𝜃(𝑥) ≤ Var_{[0,1]} 𝑤.
Besides, consider the tangent space in Θ-space at point 𝜃^𝑡, and let us choose an orthonormal basis in this tangent space for the Fisher metric. Then, in this basis we have ∇̃_𝑖 ln 𝑃_𝜃(𝑥) =
where 𝑥𝑖 ∼ 𝑃𝜃 and 𝑤^𝑖 is according to (14) where ranking is applied to the values 𝑓˜(𝑥𝑖 , 𝜔𝑖 ),
with 𝜔𝑖 uniform variables in [0, 1] independent from 𝑥𝑖 and from each other.
On the other hand, for the IGO algorithm using 𝑃𝜃 ⊗𝑈[0,1] and applied to the deterministic
function 𝑓˜, 𝑤^𝑖 is computed using the ranking according to the 𝑓˜ values of the sampled points
𝑥
˜𝑖 = (𝑥𝑖 , 𝜔𝑖 ), and thus coincides with the one in (77).
Besides,
∂_𝜃 ln (𝑃_𝜃 ⊗ 𝑈_{[0,1]})(𝑥̃_𝑖) = ∂_𝜃 ln 𝑃_𝜃(𝑥_𝑖) + ∂_𝜃 ln 𝑈_{[0,1]}(𝜔_𝑖), where the second term is 0,
and thus, both the vanilla gradients and the Fisher matrix 𝐼 (given by the tensor square
of the vanilla gradients) coincide. This proves that the IGO algorithm update on space
𝑋 × [0, 1], using the family of distributions 𝑃˜𝜃 = 𝑃𝜃 ⊗ 𝑈[0,1] , applied to the deterministic
function 𝑓˜, coincides with (77).
References
D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines.
Cognitive Science, 9(1):147–169, 1985.
Solving from Nature - PPSN XI, volume 6238 of Lecture Notes in Computer Science, pages
154–163. Springer, 2010.
S.-I. Amari. Natural gradient works efficiently in learning. Neural Comput., 10:251–276,
February 1998. ISSN 0899-7667. doi: 10.1162/089976698300017746. URL http://portal.
acm.org/citation.cfm?id=287476.287477.
S.-I. Amari and H. Nagaoka. Methods of information geometry, volume 191 of Translations
of Mathematical Monographs. American Mathematical Society, Providence, RI, 2000.
ISBN 0-8218-0531-2. Translated from the 1993 Japanese original by Daishi Harada.
S. Baluja. Population based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University, 1994.
S. Baluja and R. Caruana. Removing the genetics from the standard genetic algorithm. In
Proceedings of ICML’95, pages 38–46, 1995.
Y. Bengio, A.C. Courville, and P. Vincent. Unsupervised feature learning and deep learning:
A review and new perspectives. ArXiv preprints, arXiv:1206.5538, 2012.
A. Berny. An adaptive scheme for real function optimization acting as a selection operator. In
Combinations of Evolutionary Computation and Neural Networks, 2000 IEEE Symposium
on, pages 140 –149, 2000b. doi: 10.1109/ECNN.2000.886229.
H.-G. Beyer. The Theory of Evolution Strategies. Natural Computing Series. Springer-Verlag,
2001.
H.-G. Beyer and H.-P. Schwefel. Evolution strategies—a comprehensive introduction. Natural
computing, 1(1):3–52, 2002. ISSN 1567-7818.
J. Branke, C. Lode, and J.L. Shapiro. Addressing sampling errors and diversity loss in UMDA.
In Proceedings of the 9th annual conference on Genetic and evolutionary computation,
pages 508–515. ACM, 2007.
T.M. Cover and J.A. Thomas. Elements of information theory. Wiley-Interscience [John
Wiley & Sons], Hoboken, NJ, second edition, 2006. ISBN 978-0-471-24195-9; 0-471-24195-4.
S. Das, S. Maity, B.-Y. Qu, and P.N. Suganthan. Real-parameter evolutionary multimodal
optimization - a survey of the state-of-the-art. Swarm and Evolutionary Computation, 1
(2):71–88, 2011.
P.-T. de Boer, D.P. Kroese, S. Mannor, and R.Y. Rubinstein. A tutorial on the cross-entropy
method. Annals OR, 134(1):19–67, 2005.
P. Deuflhard. Newton methods for nonlinear problems: affine invariance and adaptive
algorithms, volume 35. Springer, 2011.
E.D. Dolan and J.J. Moré. Benchmarking optimization software with performance profiles.
Mathematical programming, 91(2):201–213, 2002.
N. Hansen. The CMA evolution strategy: a comparing review. In J.A. Lozano, P. Larranaga,
I. Inza, and E. Bengoetxea, editors, Towards a new evolutionary computation. Advances
on estimation of distribution algorithms, pages 75–102. Springer, 2006b.
N. Hansen and A. Auger. Principled design of continuous stochastic search: From theory to
practice. In Y. Borenstein and A. Moraglio, editors, Theory and Principled Methods for
the Design of Metaheuristics, Natural Computing Series, pages 145–180. Springer Berlin
Heidelberg, 2014. ISBN 978-3-642-33205-0. doi: 10.1007/978-3-642-33206-7_8. URL
http://dx.doi.org/10.1007/978-3-642-33206-7_8.
N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions.
In X. Yao et al., editors, Parallel Problem Solving from Nature PPSN VIII, volume 3242
of LNCS, pages 282–291. Springer, 2004.
N. Hansen, S.D. Müller, and P. Koumoutsakos. Reducing the time complexity of the deran-
domized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary
Computation, 11(1):1–18, 2003. ISSN 1063-6560.
G.R. Harik, F.G. Lobo, and D.E. Goldberg. The compact genetic algorithm. Evolutionary
Computation, IEEE Transactions on, 3(4):287–297, 1999.
G.E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
R. Hooke and T.A. Jeeves. “Direct search” solution of numerical and statistical problems.
Journal of the ACM, 8:212–229, 1961.
G.A. Jastrebski and D.V. Arnold. Improving evolution strategies through active covariance
matrix adaptation. In Evolutionary Computation, 2006. CEC 2006. IEEE Congress on,
pages 2814–2821. IEEE, 2006. ISBN 0780394879.
H. Jeffreys. An invariant form for the prior probability in estimation problems. Proc. Roy.
Soc. London. Ser. A., 186:453–461, 1946. ISSN 0962-8444.
H.K. Khalil. Nonlinear Systems. Nonlinear Systems. Prentice-Hall, Inc., second edition,
1996.
P.E. Kloeden and E. Platen. Numerical solution of stochastic differential equations, volume 23
of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1992. ISBN 3-540-
54062-8.
S. Kullback. Information theory and statistics. Dover Publications Inc., Mineola, NY, 1997.
ISBN 0-486-69684-7. Reprint of the second (1968) edition.
P. Larranaga and J.A. Lozano. Estimation of distribution algorithms: A new tool for
evolutionary computation. Springer Netherlands, 2002. ISBN 0792374665.
N. Le Roux, P.-A. Manzagol, and Y. Bengio. Topmoumoute online natural gradient algorithm.
In NIPS, 2007.
G. Montavon and K.-R. Müller. Deep boltzmann machines and the centering trick. In
G. Montavon, G.B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade
(2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 621–637. Springer,
2012. ISBN 978-3-642-35288-1.
J.J. Moré, B.S. Garbow, and K.E. Hillstrom. Testing unconstrained optimization software.
ACM Transactions on Mathematical Software (TOMS), 7(1):17–41, 1981.
J.A. Nelder and R. Mead. A simplex method for function minimization. The Computer
Journal, pages 308–313, 1965.
M. Pelikan, D.E. Goldberg, and F.G. Lobo. A survey of optimization by building and
using probabilistic models. Computational optimization and applications, 21(1):5–20, 2002.
ISSN 0926-6003.
P. Pošík. Preventing premature convergence in a simple eda via global step size setting. In
Parallel Problem Solving from Nature–PPSN X, pages 549–558. Springer, 2008.
C.R. Rao. Information and the accuracy attainable in the estimation of statistical parameters.
Bull. Calcutta Math. Soc., 37:81–91, 1945. ISSN 0008-0659.
R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space
complexity. In G. Rudolph, T. Jansen, S. Lucas, C. Polini, and N. Beume, editors,
Proceedings of Parallel Problem Solving from Nature (PPSN X), volume 5199 of Lecture
Notes in Computer Science, pages 296–305. Springer, 2008.
R.Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1:127–190, 1999. ISSN 1387-5841. URL http://dx.doi.org/10.1023/A:1010091220143.
R.Y. Rubinstein and D.P. Kroese. The cross-entropy method: a unified approach to combi-
natorial optimization, Monte-Carlo simulation, and machine learning. Springer-Verlag
New York Inc, 2004. ISBN 038721240X.
B. Sareni and L. Krähenbühl. Fitness sharing and niching methods revisited. IEEE Trans.
Evolutionary Computation, 2(3):97–106, 1998.
T. Schaul, T. Glasmachers, and J. Schmidhuber. High dimensions and heavy tails for
natural evolution strategies. In Proceedings of the 13th annual conference on Genetic and
evolutionary computation, GECCO ’11, pages 845–852, New York, NY, USA, 2011. ACM.
ISBN 978-1-4503-0557-0. doi: 10.1145/2001576.2001692. URL http://doi.acm.org/10.
1145/2001576.2001692.
R.L. Schilling. Measures, integrals and martingales. Cambridge University Press, New York,
2005. ISBN 978-0-521-61525-9; 0-521-61525-9. doi: 10.1017/CBO9780511810886. URL
http://dx.doi.org/10.1017/CBO9780511810886.
L. Schwartz. Analyse. II, volume 43 of Collection Enseignement des Sciences [Collection: The
Teaching of Science]. Hermann, Paris, 1992. Calcul différentiel et équations différentielles,
With the collaboration of K. Zizi.
F. Silva and L. Almeida. Acceleration techniques for the backpropagation algorithm. Neural
Networks, pages 110–119, 1990.
T. Suttorp, N. Hansen, and C. Igel. Efficient covariance matrix update for variable metric
evolution strategies. Machine Learning, 75(2):167–197, 2009.
D. Whitley. The genitor algorithm and selection pressure: Why rank-based allocation of
reproductive trials is best. In Proceedings of the third international conference on Genetic
algorithms, pages 116–121, 1989.
E. Zhou and J. Hu. Gradient-based adaptive stochastic search for non-differentiable opti-
mization. IEEE Transactions on Automatic Control, 59(7):1818–1832, July 2014. ISSN
0018-9286. doi: 10.1109/TAC.2014.2310052.