
PHYSICAL REVIEW RESEARCH 5, 013122 (2023)

Resetting in stochastic optimal control

Benjamin De Bruyne and Francesco Mori


LPTMS, CNRS, Université Paris-Sud, Université Paris-Saclay, 91405 Orsay, France

(Received 11 May 2022; revised 6 July 2022; accepted 30 October 2022; published 16 February 2023)

“When in a difficult situation, it is sometimes better to give up and start all over again.” While this empirical
truth has been regularly observed in a wide range of circumstances, quantifying the effectiveness of such a
heuristic strategy remains an open challenge. In this paper, we combine the notions of optimal control and
stochastic resetting to address this problem. The emerging analytical framework allows one not only to measure
the performance of a given restarting policy, but also to obtain the optimal strategy for a wide class of dynamical
systems. We apply our technique to a system with a final reward and show that the reward value must be larger
than a critical threshold for resetting to be effective. Our approach, analogous to the celebrated Hamilton-Jacobi-
Bellman paradigm, provides the basis for the investigation of realistic restarting strategies across disciplines. As
an application, we show that the framework can be applied to an epidemic model to predict the optimal lockdown
policy.

DOI: 10.1103/PhysRevResearch.5.013122


Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

Finding the optimal strategy to operate a complex system is a longstanding problem and has attracted a lot of attention in recent decades. Since the seminal works of Pontryagin [1] and Bellman [2], optimal control theory has received renewed interest due to its applications in a wide range of contexts, such as artificial intelligence [3] and finance [4]. In a typical setting, optimal control considers a system whose state at time t can be represented by a d-dimensional vector x(t). For instance, the state x(t) could correspond to the degrees of freedom of an autonomous robot or the asset values in a financial portfolio. The system typically evolves in time following a deterministic law, e.g., the laws of motion for mechanical systems or the law of supply and demand for financial markets. Oftentimes, the mathematical modeling of these laws is prohibitively expensive and one introduces a stochastic contribution to account for the missing information on the environment. Given the laws of motion, optimal control aims to operate the system in the best possible way by using an external control, e.g., actuators for robots or market orders in finance.

One of the simplest ways to describe analytically the evolution in time of the system x(t) is a first-order differential equation of the form ẋ(t) = f(x, t). This law is often a simplified description of the system and a source of Gaussian white noise η(t) is introduced to capture the fluctuations around the deterministic model. In addition, the external control on the system is usually modeled as a drift u(x, t). Summing up these contributions, the full mathematical description of the system is given by

ẋ(t) = f(x, t) + √(2D) η(t) + u(x, t),    (1)

where 2D is the strength of the noise. The external control u(x, t) can be tuned to achieve a given goal, e.g., performing a task for a robot or generating profits in finance. Of course, controlling the system will generate operating costs, such as electrical consumption or transaction fees. Optimally controlling the system requires balancing a trade-off between high rewards, measured over time by a function R(x, t), and low operating costs, often taken to be proportional to u²(x, t). To be precise, for a system located at position x at time t, the reward in a small time interval dt is R(x, t) dt and the cost is u²(x, t) dt/2.

In principle, solving this optimization problem is intrinsically difficult due to the high dimensionality of the space of solutions. Remarkably, Bellman introduced a general way to solve this problem, known as dynamical programming, which consists in breaking down the optimization into simpler subproblems in a recursive manner such that the present action is taken to maximize the future outcome. In doing so, the key quantity to keep track of is the optimal payoff J(x, t), defined as the expected payoff for an optimally controlled system located at x at time t. Using this framework, one can show that the optimal control is simply given by u*(x, t) = ∇_x J(x, t), driving the system towards the direction in which the payoff increases the most. The optimal payoff J(x, t) satisfies the celebrated Hamilton-Jacobi-Bellman (HJB) equation [5],

−∂_t J = D Δ_x J + f · ∇_x J + (1/2)(∇_x J)² + R,    (2)

where Δ_x and ∇_x are, respectively, the Laplacian and the gradient operators. Here, for convenience, we dropped the x and t dependence in the functions in Eq. (2). The quadratic term (∇_x J)² renders this equation nonlinear and difficult to solve for arbitrary reward functions. Nevertheless, there exist few analytically solvable cases. For instance, in the case of d = 1, where f(x, t) = 0 and R(x, t) = −α(x − x_f)² δ(t − t_f)/2, the optimal control has the simple form u*(x, t) = −α(x − x_f)/[1 + α(t_f − t)], which, in the limit α → ∞, is reminiscent of the effective force to generate bridge Brownian motion [6]. This optimal control continuously drives the system to maximize the final reward by arriving close to the target x_f at time t_f. In more realistic systems, one has to rely on numerical methods to solve Eq. (2) [7].
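As a minimal illustration of such a numerical treatment, the Python sketch below integrates Eq. (2) backward in time by explicit finite differences for the solvable case d = 1, f = 0, R(x, t) = −α(x − x_f)² δ(t − t_f)/2, and compares the resulting control u* = ∂_x J with the closed form quoted above. The grid sizes, parameter values, and the crude treatment of the box edges are arbitrary illustrative choices and are not taken from the paper or from Ref. [7].

import numpy as np

# Backward explicit finite-difference integration of the 1D HJB equation (2):
#   -dJ/dt = D * J_xx + (1/2) * (J_x)^2,  terminal condition J(x, tf) = -alpha*(x - xf)^2 / 2.
# All parameter values below are illustrative choices.
D, alpha, xf, tf = 1.0, 10.0, 1.0, 1.0
x = np.linspace(-4, 4, 801)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / D            # explicit scheme: dt must satisfy dt <= dx^2/(2D)
J = -0.5 * alpha * (x - xf)**2  # terminal payoff from R(x,t) = -alpha*(x-xf)^2 * delta(t-tf)/2

t = tf
while t > 0:
    Jx = np.gradient(J, dx)                     # gradient of the payoff
    Jxx = (np.roll(J, -1) - 2*J + np.roll(J, 1)) / dx**2
    Jxx[0], Jxx[-1] = Jxx[1], Jxx[-2]           # crude no-flux treatment of the box edges
    J = J + dt * (D * Jxx + 0.5 * Jx**2)        # step backward in time
    t -= dt

# Compare the numerical optimal control u* = dJ/dx with the exact formula quoted above.
u_num = np.gradient(J, dx)
u_exact = -alpha * (x - xf) / (1 + alpha * (tf - t))
i = np.argmin(np.abs(x - 0.5))                  # probe point x = 0.5, away from the edges
print(u_num[i], u_exact[i])                     # the two values should be close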


Ideas from optimal control have also been proven successful in different areas of physics [8–10]. Moreover, stochastic optimal control has been applied to a variety of systems, such as supply-chain planning [11], swarms of drones [12], and fluctuating interfaces [13]. These systems all have in common that the optimal control can be orchestrated as a coordinated sequence of infinitesimal local changes. However, numerous systems do not fall in this class and require global changes to be optimally managed. A timely example of such circumstances is the Covid-19 crisis, during which the main control policies have been global measures such as regional and national lockdowns. Other instances arise in the context of search processes, both in the cases of computer algorithms [14] and time-sensitive rescue missions [15]. In the latter situations, a common and widely observed strategy is to stop and resume the operations from a predefined location. Such situations particularly arise in the context of search processes [16,17], chemical reactions [18], and catastrophes in population dynamics [19,20]. Unfortunately, the HJB framework is not well suited to study such resetting protocols. Indeed, resetting is known to be quite different from a local force and exhibits interesting features, including out-of-equilibrium dynamics [21–23], dynamical phase transitions [24–26], and nontrivial first-passage properties [16,17] (for a recent review, see [27]). This observation naturally called into question the existence of an analytical framework to devise the optimal resetting control policy.

In this paper, we combine the notions of stochastic resetting and optimal control into resetting optimal control, which provides a natural framework to operate systems through stochastic restarts. Our goal is not to provide an accurate description of a specific system, but rather to consider a minimal model to explore resetting in optimal control. To model resetting policies, we exchange the control force u(x, t) for a resetting rate r(x, t). In a small time interval dt, the state of the system is reset to a predefined location x_res with probability p = r(x, t)dt and evolves freely otherwise. In sum, the dynamical system evolves according to

x(t + dt) = { x_res with prob. p,  x(t) + f(x, t)dt + √(2D) η dt with prob. 1 − p },    (3)

where the subscript "res" in x_res stands for "resetting location." Similarly to the HJB framework, we aim to find the optimal resetting rate r, as a function of x and t, that balances the trade-off between high rewards, measured over time by the function R(x, t), and low operating costs, which depend on r(x, t). To mathematically pose the optimization problem, we naturally extend the HJB paradigm and define the following payoff functional:

F_{x0,t}[r] = ⟨ ∫_t^{t_f} dτ [R(x(τ), τ) − c(x(τ), τ) r(x(τ), τ)] ⟩_{x0},    (4)

where c(x, τ) is the cost associated to resetting, t_f is the time horizon up to which the system is controlled, and the symbol ⟨·⟩_{x0} indicates the average over all stochastic trajectories starting from x0 at time t and evolving according to Eq. (3). Note that the payoff F is a functional and depends on the full function r.

Remarkably, we find that the optimal resetting policy r* that maximizes F is bang-bang, resulting in an impulsive control strategy [5]: the system is reset with probability one if its state is outside of a time-dependent domain Ω(t) and evolves freely otherwise:

r*(x, t) dt = { 0 if x ∈ Ω(t),  1 if x ∉ Ω(t) }.    (5)

The domain Ω(t) evolves according to

Ω(t) = {x : J(x, t) ≥ J(x_res, t) − c(x, t)},    (6)

where the optimal payoff function J(x, t) = max_r F_{x,t}[r] is the solution of the differential equation

−∂_t J = D Δ_x J + f · ∇_x J + R,   x ∈ Ω(t).    (7)

Equation (7) must be solved jointly with Eq. (6) starting from the final condition J(x, t_f) = 0 and evolving backward in time with the Neumann boundary condition ∇_x J(x, t) · n(x) = 0, where n(x) is the normal unit vector to the boundary. Outside of the domain Ω(t), the solution is given by J(x, t) = J(x_res, t) − c(x, t). The definition of the domain Ω(t) in Eq. (6) has a clear interpretation: at any given time, the optimal policy is to restart the system if its expected payoff is less than the one at the resetting location minus the cost incurred for a restart. The emerging framework outlined in Eqs. (5)–(7) is the main result of this paper and provides a general method to optimally control stochastic systems through restarts. The derivation of this result is presented in Appendix A.

Going from Eq. (4) to Eq. (7), we have reduced a functional optimization problem to a partial differential equation, which is often easier to solve. Note, however, that the mathematical problem in Eqs. (6) and (7) is of a special kind as the evolution of the domain of definition Ω(t) is coupled to the solution J(x, t) of the differential equation. This kind of equation belongs to a class known as Stefan problems [28], which often arise in the field of heat transfer, where one studies the evolution of an interface between two phases, e.g., ice and water on a freezing lake. In this context, one must solve the heat equation for the temperature profile with an interface between the water and ice phases, which moves according to the temperature gradient. The interface is therefore to be considered as an additional unknown function, which must be jointly solved with the differential equation. To draw an analogy with our case, the optimal payoff function J(x, t) plays the role of the temperature and the boundary of the domain Ω(t) corresponds to the water-ice interface. Note, however, that the two Stefan problems have different boundary conditions. The domain Ω(t) can be obtained by solving the Stefan problem in Eqs. (6) and (7) numerically. This can be achieved, for instance, by using an Euler explicit scheme with a space-time discretization and updating the domain Ω(t) at each time step according to Eq. (6). The domain Ω(t) is illustrated in Fig. 1 for the case of a one-dimensional random walk with a final reward. Such a situation corresponds to rewarding the system by a finite amount for arriving at some target location at the final time, while penalizing it with a unit cost for each resetting event. This setting is further discussed in the next paragraph, where we illustrate our framework with various examples.
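As an illustration of the explicit scheme described above (a simplified sketch, not the authors' code), the following Python fragment integrates the Fig. 1 setting backward in time: free diffusion with D = 1, resetting to x_res = 0 at unit cost, and the final reward α δ(x − x_f) δ(t − t_f) smoothed into a narrow Gaussian. At each step the payoff is diffused according to Eq. (7) and the resetting option J ≥ J(x_res, t) − c of Eq. (6) is enforced by a simple projection; this stands in for the full moving-boundary treatment (the Neumann condition at the interface is not imposed explicitly). Grid sizes and the smoothing width are arbitrary choices.

import numpy as np

# Illustrative explicit space-time discretization of the moving-boundary problem (6)-(7)
# for the setting of Fig. 1. The final delta reward is smoothed into a narrow Gaussian,
# and the resetting option is enforced by projection at each backward time step.
D, alpha, xf, tf, c = 1.0, 10.0, 1.0, 1.0, 1.0
x = np.linspace(-3, 3, 601)
dx = x[1] - x[0]
dt = 0.25 * dx**2 / D
eps = 0.05                                        # width of the smoothed delta reward
J = alpha * np.exp(-(x - xf)**2 / (2 * eps**2)) / np.sqrt(2 * np.pi * eps**2)
i0 = np.argmin(np.abs(x))                         # grid index of the resetting location x_res = 0

boundaries = []                                   # extent of Omega(t) at each backward step
t = tf
while t > 0:
    Jxx = (np.roll(J, -1) - 2*J + np.roll(J, 1)) / dx**2
    Jxx[0], Jxx[-1] = Jxx[1], Jxx[-2]             # crude treatment of the box edges
    J = J + dt * D * Jxx                          # free evolution inside Omega(t), Eq. (7)
    J = np.maximum(J, J[i0] - c)                  # resetting option of Eq. (6): J >= J(x_res,t) - c
    inside = J > J[i0] - c                        # current estimate of the domain Omega(t)
    boundaries.append((t, x[inside].min(), x[inside].max()))
    t -= dt

# 'boundaries' traces the region where it is optimal NOT to reset (white region of Fig. 1).

For the parameters above, the no-reset region should remain the whole grid close to t_f and start shrinking only at a finite backward time, in qualitative agreement with Fig. 1 and with the onset time τ* discussed below.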
FIG. 1. Space-time illustration of the optimal resetting policy for a one-dimensional random walk (green line) to reach the target location x_f = 1 (filled circle) exactly at time t_f = 1. A reward α is received upon reaching the location x_f at the final time, while a unit cost is incurred upon resetting (red dashed arrows) to the origin x_res = 0 (gray dashed line). The optimal strategy that maximizes the expected payoff is to reset upon touching the blue region and to evolve freely in the white region, which we denote as Ω(t) in the text. The domain Ω(t) is obtained by numerical integration of the Stefan problem in Eqs. (6) and (7), with R(x, t) = αδ(t − 1)δ(x − 1), f(x, t) = 0, α = 10, and D = 1. The boundary of the domain Ω(t) guides the particle to the location x_f at time t_f, while avoiding resetting as much as possible. The shape of Ω(t) depends nontrivially on the reward value α. Further explanations for this shape are given in the text.

FIG. 2. The optimal expected payoff J(x) for a one-dimensional random walk starting at x for the discounted reward R(x) = −αx² and cost C(x) = c and for different values of the cost-reward ratio v = c/α. The continuous lines correspond to the exact result in Eq. (9), while the symbols correspond to numerical simulations performed with β = D = 1. The optimal strategy is to reset for |x| > u(v), where u(v) is defined in the text. For |x| < u(v), the system is let free to evolve and its payoff depends continuously on x.

The Stefan problem in Eq. (7) is the analog of the HJB equation (2) for a resetting control. Despite the moving boundary, Eqs. (6) and (7) have a linear dependence in J. One might, therefore, wonder if exactly solvable models exist within this framework. Interestingly, we have found a time-independent formulation in the infinite time horizon limit t_f → ∞ which allows for exact analytical solutions to be found. This is achieved by considering discounted rewards and costs of the form R(x, t) = e^{−βt} R(x) and c(x, t) = e^{−βt} C(x), where β > 0 is the discount rate. Accordingly, we also consider the drift to be time independent, f(x, t) = f(x). Discounted payoffs are common in the control theory literature [29] and capture situations in which the effect of the payoff decays over a typical timescale 1/β. Such effect is, for instance, observed in financial contexts, where β is related to interest rates and is used to compare future and present benefits. Using the ansatz J(x, t) = e^{−βt} J(x), we find that Eq. (7) becomes a time-independent ordinary differential equation of the form

βJ = D Δ_x J + f · ∇_x J + R,   x ∈ Ω,    (8)

where the domain Ω = {x : J(x) ≥ J(x_res) − C(x)} is also independent of time. This equation can be explicitly solved in the absence of external forces by choosing a quadratic reward R(x) = −αx² and a constant resetting cost C(x) = c to the origin x_res = 0. Note that R(x) ≤ 0 and is maximized at x = 0, rewarding the system for being close to the origin.

Solving Eq. (8), we obtain, for β = D = 1, the exact expression for the optimal payoff,

J(x) = α[−2 − x² + 2u(v) cosh(x)/sinh(u(v))],   x ∈ Ω,    (9)

where u(v) is the boundary of the symmetric domain Ω, i.e.,

Ω = {x : |x| < u(v)}.    (10)

The boundary u(v) is the unique positive solution of the transcendental equation v − u²(v) + 2u(v) tanh[u(v)/2] = 0, where v = c/α is the cost-reward ratio. The optimal strategy thus corresponds to resetting the system if |x| > u(v). When v ≪ 1, the cost of resetting is much smaller than the reward, therefore the boundary is close to the origin, u(v) ∼ √2 (3v)^{1/4}, allowing the state of the system to remain close to the optimal location x = 0. On the other hand, when v ≫ 1, the cost of resetting is much larger than the running cost and the boundary is set far away from the origin, u(v) ∼ √v + 1. Beyond the optimal resetting policy, our approach predicts the function J(x), measuring the expected payoff upon starting from x and following the optimal strategy. In Fig. 2, J(x) is shown for |x| < u(v) and for various values of the cost-reward ratio v. As a function of x, J(x) has a symmetric bell shape centered around the origin, where the reward is maximal. As |x| increases, J(x) decreases since the reward decreases and the resetting boundary comes closer. Note that in this special case, the optimal policy can also be recovered within the framework of first-passage resetting [30,31], as shown in Appendix B. This is only true for this particular example, where the control variable is a scalar.
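As a quick numerical companion to Eqs. (9) and (10) (a sketch, not code from the paper), the transcendental equation for u(v) can be solved with a standard root finder and the payoff evaluated directly; the two asymptotic forms quoted above serve as a sanity check. Parameter values are illustrative.

import numpy as np
from scipy.optimize import brentq

# Solve v - u^2 + 2*u*tanh(u/2) = 0 for the resetting boundary u(v) and evaluate
# the optimal payoff of Eq. (9) with beta = D = 1. Small numerical check only.
def boundary(v):
    g = lambda u: v - u**2 + 2 * u * np.tanh(u / 2)
    return brentq(g, 1e-9, 10 + np.sqrt(v) + 1)    # the positive root lies below sqrt(v)+1 plus a margin

def J_opt(x, v, alpha=1.0):
    u = boundary(v)
    if abs(x) >= u:                                # outside Omega the payoff is J(0) - c, with c = v*alpha
        return J_opt(0.0, v, alpha) - v * alpha
    return alpha * (-2 - x**2 + 2 * u * np.cosh(x) / np.sinh(u))

for v in (0.1, 1.0, 10.0):
    u = boundary(v)
    # compare with the two asymptotics u ~ sqrt(2)*(3v)^(1/4) and u ~ sqrt(v) + 1
    print(v, u, np.sqrt(2) * (3 * v)**0.25, np.sqrt(v) + 1, J_opt(0.0, v))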
Our previous example focused on the case of an infinite time horizon, corresponding to t_f → ∞. We now investigate the effect of a finite time horizon. One of the simplest settings where such effect can be studied is the case of a one-dimensional random walk with a Dirac delta final reward R(x, t) = α δ(x − x_f)δ(t − t_f), with α > 0, and a constant resetting cost c(x, t) = c. Such parameters correspond to rewarding the system by a finite amount α for arriving at the target position x_f at time t_f, while penalizing it with a constant cost c for each resetting event. Note that the system needs to arrive at the location x_f exactly at time t_f to get the reward, while earlier visits do not provide any additional benefits (as for a rendezvous, where one needs to be at the right place at the right time). The δ reward function has to be understood as the continuum limit of a Gaussian function whose width tends to zero [32]. Before presenting the optimal policy for this problem, it is instructive to consider two limiting cases. For α → 0, one should never reset as the reward is not worth the resetting cost. On the other hand, for α → ∞, the cost of resetting becomes negligible and the optimal strategy is to reset if restarting would bring the system closer to x_f, independently of time. Interestingly, we observe that the crossover between these two regimes occurs as a sharp transition at the critical value α = α_c, where α_c = x_f c √(2πe) ≈ 4.13273 x_f c, which we predict analytically (see Appendix C). For α < α_c, the optimal policy is to never reset, corresponding to Ω(t) = ℝ for all 0 ≤ t ≤ t_f. The situation is more subtle for α > α_c, where resetting is only favorable in a specific time window before t_f. To describe this window, it is convenient to introduce the backward time τ = t_f − t. We find that no boundary is present for τ < τ*, where τ* is the smallest positive solution of the transcendental equation α e^{−x_f²/(4Dτ*)} = c √(4πDτ*). At τ = τ*, a boundary appears and one must resort to numerical integration techniques to find the solution for τ > τ* (see Fig. 1). We observe numerically that the boundary evolves with τ, i.e., backward in time, in a nonmonotonic way and eventually disappears. This optimal policy can be understood as follows. Close to t_f, where τ < τ*, it is unlikely for the system to reach the target location x_f from the origin in the remaining time. Thus, it is not convenient to reset. On the other hand, for very early times, it is not yet necessary to reset since, after resetting, the system would typically evolve away from the target.
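The critical reward and the onset time of the boundary can be evaluated numerically from the two relations above. The short sketch below (not from the paper, with illustrative parameter values) does so with a bracketing root finder; the bracket endpoint x_f²/(2D) is the α → α_c limiting value of τ* given in Eq. (C4), below which the smallest root lies for α > α_c.

import numpy as np
from scipy.optimize import brentq

# Critical reward alpha_c and onset time tau* of the resetting boundary for the
# Dirac-delta final reward (main text and Appendix C). Illustrative values only.
c, xf, D = 1.0, 1.0, 1.0
alpha_c = c * xf * np.sqrt(2 * np.pi * np.e)      # ~ 4.13273 * xf * c
alpha = 10.0                                      # must exceed alpha_c for a boundary to exist

# tau* is the smallest positive root of alpha*exp(-xf^2/(4 D tau)) - c*sqrt(4 pi D tau) = 0.
h = lambda tau: alpha * np.exp(-xf**2 / (4 * D * tau)) - c * np.sqrt(4 * np.pi * D * tau)
tau_star = brentq(h, 1e-6, xf**2 / (2 * D))       # bracket below the alpha -> alpha_c limit xf^2/(2D)
print(alpha_c, tau_star)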
Our framework can be easily generalized to other resetting dynamics. For instance, in light of the recent experiments of stochastic resetting on colloidal particles [33–35], it is relevant to consider the case where the new state of the system x′ after a resetting event is drawn at random from some probability distribution P_R(x′|x), which can eventually depend on the current state x of the system. In practice, it is physically impossible to reset the particles to a fixed location since resetting is performed by using an optic trap. In this case, our main results in Eqs. (5) and (7) remain valid, while the definition of the domain Ω(t) is modified as

Ω(t) = {x : J(x, t) ≥ ∫ dx′ [J(x′, t) P_R(x′|x)] − c(x, t)}.    (11)

Note that the case of a fixed resetting location x_res corresponds to P_R(x′|x) = δ(x′ − x_res). Similarly, one can extend the framework to describe situations in which a finite amount of time is required for each resetting event.

Finally, we illustrate the generality of our framework by applying it to the problem of finding the optimal lockdown policy to navigate a pandemic. We do not aim to provide an accurate description of a specific pandemic, but rather to illustrate the generality of our framework and to gain interesting insights. We consider a population of N individuals and we model the evolution of the epidemic with the susceptible-infectious-recovered (SIR) model, which is one of the simplest compartmental models in epidemiology [36]. Previous works have approached this problem by continuously controlling the infection rate [37]. However, in real situations, one cannot adapt the restrictions in a continuous way and our framework is well suited to describe discontinuous policies, such as lockdowns. In Appendix D, we show that our technique can be directly applied to this problem to find the optimal time to impose a lockdown.

In sum, we combined optimal control and stochastic resetting to address the effectiveness of restarting policies. The emerging framework, contained in Eqs. (6) and (7), provides a unifying paradigm to obtain the optimal resetting strategy for a wide class of dynamical systems. Our method can be generalized to discrete-time systems and quadratic costs associated with resetting (see Appendix E). Furthermore, it is a simple exercise to include a continuous control policy in addition to resetting to account for realistic systems. In addition, one would need to explore ways to solve the moving boundary problem in high dimensions, which might require approximation schemes. It would be interesting to investigate extensions to optimal stopping problems and to study cost functions that are first-passage functionals [38], for instance where the time horizon is a first-passage time. This would be particularly relevant in the context of search processes.

We warmly thank F. Aguirre-Lopez, S. N. Majumdar, S. Redner, A. Rosso, G. Schehr, E. Trizac, and L. Zadnik for fruitful comments and feedback. This work was partially supported by the Luxembourg National Research Fund (FNR) (App. No. 14548297).

APPENDIX A: DERIVATION OF EQS. (6) AND (7)

In this Appendix, we derive our main result given in Eqs. (6) and (7) of the main text. This derivation is based on a dynamical programming argument (also known as a backward approach). We consider evolving the system from time t to t + dt. According to the equations of motion in Eq. (3) of the main text, the state of the system either (i) evolves from x to x + f(x, t)dt + √(2D) η(t)dt with probability 1 − r(x, t)dt or (ii) is reset to position x_res with probability r(x, t)dt. In the time interval dt, the payoff in Eq. (4) of the main text changes by the amount R(x, t)dt − c(x, t)r(x, t)dt. For the subsequent evolution of the process from t + dt to t_f, the new initial value is either x + f(x, t)dt + √(2D) η(t)dt in the former case or x_res in the latter case. Following this argument and using the notation F[r | x, t] ≡ F_{x,t}[r], we obtain

F[r | x, t] = R(x, t)dt − c(x, t)r(x, t)dt + ⟨F[r | x + f(x, t)dt + √(2D) η(t)dt, t + dt]⟩ [1 − r(x, t)dt] + F[r | x_res, t + dt] r(x, t)dt,    (A1)

where ⟨·⟩ is an average over the noise realizations η(t). The goal is now to maximize both sides over the full function r. Note that the functional F on the right-hand side of Eq. (A1) does not depend on the value of the function r(x, t) at time t, given that its initial time is t + dt. We can therefore first maximize over the function r(x, τ) for τ ∈ [t + dt, t_f] and then over the value r(x, t) at time t, which gives

−∂_t J(x, t) = R(x, t) + D Δ_x J(x, t) + f(x, t) · ∇_x J(x, t) + max_{r(x,t)} {r(x, t)[J(x_res, t) − c(x, t) − J(x, t)]},    (A2)

where we used the definition of J(x, t), expanded to first order in dt, and averaged over η(t). We are left with maximizing a linear function of the scalar variable r(x, t). Given that r(x, t) is a Poissonian resetting rate, it must be positive and less than 1/dt since r(x, t)dt is a probability. This immediately gives the expression for the bang-bang optimal control policy r*(x, t) given in Eq. (5) in the main text along with the domain Ω(t). Plugging this back into Eq. (A2), we obtain the moving boundary problem introduced in the main text.

The Neumann boundary condition ∇_x J(x, t) · n(x) = 0 can be obtained upon a careful analysis of Eq. (A1) close to the boundary. We start from Eq. (A1) and write

F[r | x, t] = R(x, t)dt − c(x, t)r(x, t)dt + ⟨F[r | x + f(x, t)dt + √(2D) η(t)dt, t + dt]⟩ [1 − r(x, t)dt] + F[r | x_res, t + dt] r(x, t)dt.    (A3)

We set x ∈ ∂Ω(t) and we expand Eq. (A3) to the first nontrivial order in dt. The main difference with the case x ∈ Ω(t), discussed above, is that the linear term in η(t) does not vanish, as the gradient of the payoff functional with respect to x vanishes outside of the domain Ω(t). We obtain

0 = √(2D) ⟨∇_x F[r | x, t] · η(t)dt 1[x + √(2D) η(t)dt ∈ Ω(t)]⟩,    (A4)

where 1 is the indicator function and the average over η(t) is of the order of O(√dt). By decomposing η(t) over the parallel and perpendicular components of the normal direction n(x) to the boundary at x, Eq. (A4) immediately gives the boundary condition ∇_x F[r | x, t] · n(x) = 0, which is, by definition, also valid for J.

APPENDIX B: FIRST-PASSAGE RESETTING IN A FINITE INTERVAL

In this Appendix, we show that the optimal policy in the special case of discounted payoffs can be obtained within the framework of first-passage resetting. We consider a one-dimensional diffusive particle with diffusion coefficient D whose position x belongs to the finite interval [−L, L] with L > 0. We assume that when the particle reaches any of the interval boundaries, it is instantaneously reset to the origin x_res = 0. We assume that at the initial time, the particle is at position x0, with −L < x0 < L. In analogy with the discounted payoff in the main text, we associate a discounted reward R(x, t) = e^{−βt} R(x) and a discounted cost of resetting, c(x, t) = e^{−βt} C(x), to the trajectory of the particle. Considering an infinite time horizon, the average payoff is therefore given by

J(x0|L) = ⟨ ∫_0^∞ dτ e^{−βτ} R(x(τ)) − Σ_{i=1}^∞ e^{−βt_i} C(x(t_i)) ⟩_{x0},    (B1)

where t_i is the time of the ith resetting event and the average is computed over all trajectories {x(τ)} with x(0) = x0. In particular, for R(x) = −αx² and C(x) = c, we obtain

J(x0|L) = −α ∫_0^∞ dτ e^{−βτ} ⟨x(τ)²⟩_{x0} − c ⟨ Σ_{i=1}^∞ e^{−βt_i} ⟩_{x0}.    (B2)

The two average quantities can be computed by standard renewal techniques and one obtains

J(x0|L) = −α [ 2D/β² + x0²/β − (L²/β) cosh(x0√(β/D)) / (cosh(L√(β/D)) − 1) ] − c cosh(x0√(β/D)) / (cosh(L√(β/D)) − 1).    (B3)

Upon maximizing over L and setting β = D = 1, we recover the expression in Eqs. (9) and (10) in the main text. The optimal payoff in Eq. (9) in the main text is shown as a function of the cost-reward ratio in Fig. 3. The optimal payoff decreases monotonically as the cost-reward ratio is increased.

FIG. 3. The rescaled optimal expected payoff J(x) for a one-dimensional random walk starting at x = 0 for the discounted reward R(x) = −αx² and cost C(x) = c as a function of the cost-reward ratio v = c/α.
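Equation (B3) can also be checked against a direct simulation of the resetting dynamics. The following sketch (not from the paper; parameter values are arbitrary) estimates the discounted payoff by Monte Carlo and compares it with the closed form; the two numbers should agree up to sampling noise and time-discretization bias.

import numpy as np

# Monte Carlo check of the discounted payoff in Eq. (B3) for first-passage resetting
# in [-L, L]. Parameter values are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
D = beta = 1.0
alpha, c, L, x0 = 1.0, 0.5, 1.5, 0.3
dt, t_max, n_traj = 1e-3, 15.0, 5000              # exp(-beta*t_max) ~ 3e-7: horizon effectively infinite

x = np.full(n_traj, x0)
payoff = np.zeros(n_traj)
for k in range(int(t_max / dt)):
    w = np.exp(-beta * k * dt)
    payoff += w * (-alpha * x**2) * dt            # running discounted reward R(x) = -alpha x^2
    x += np.sqrt(2 * D * dt) * rng.standard_normal(n_traj)
    out = np.abs(x) >= L                          # first passage to either boundary
    payoff[out] -= c * w                          # discounted resetting cost
    x[out] = 0.0                                  # instantaneous reset to the origin

s = np.sqrt(beta / D)
ratio = np.cosh(s * x0) / (np.cosh(s * L) - 1.0)
J_exact = -alpha * (2*D/beta**2 + x0**2/beta - (L**2/beta) * ratio) - c * ratio
print(payoff.mean(), J_exact)                     # should agree within sampling/discretization error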
APPENDIX C: DIRAC δ FINAL REWARD

In this Appendix, we consider the case of a Dirac δ final reward in d = 1, corresponding to R(x, t) = αδ(x − x_f)δ(t − t_f), where α > 0 is the magnitude of the reward and x_f is the target location. We also assume that f(x, t) = 0 and c(x, t) = c > 0, and we consider the optimal payoff in the time-reversed dynamics, defined as I(x, τ) = J(x, t_f − τ). It is easy to show that I(x, τ) satisfies the diffusion equation,

∂_τ I(x, τ) = D ∂_xx I(x, τ),    (C1)

with initial condition I(x, τ = 0) = αδ(x − x_f). Assuming that no boundary appears for small τ (to be verified a posteriori), the optimal payoff function is the usual Gaussian profile,

I(x, τ) = α e^{−(x − x_f)²/(4Dτ)} / √(4πDτ).    (C2)

A boundary appears at time τ when the condition I(x, τ) < I(0, τ) − c is verified for the first time for some value of x. Using Eq. (C2), this condition can be rewritten, for |x − x_f| > x_f, as

α > c √(4πDτ) / [ e^{−x_f²/(4Dτ)} − e^{−(x − x_f)²/(4Dτ)} ].    (C3)

Minimizing the right-hand side of Eq. (C3) over x and τ, we obtain α > c x_f √(2eπ). Thus, for α < α_c = c x_f √(2eπ), no boundary appears and the cost function is given by Eq. (C2) for any τ. On the other hand, for α > α_c, two boundaries appear at time τ*, which is the smallest solution of α = c √(4πDτ*) e^{x_f²/(4Dτ*)}. Thus, for τ < τ*, the cost function is given by Eq. (C2), while it is hard to determine analytically for τ > τ*. We obtain numerically the boundary for τ > τ* (see Fig. 1 of the main text). Note that at τ = τ*, the condition in Eq. (C3) is only verified for x → ±∞, meaning that the two boundaries start from infinity at the critical time. The critical time τ* is shown in Fig. 4 as a function of α. The asymptotic behaviors of τ* as a function of α are given by

τ* ≃ { x_f²/(2D) for α → α_c,  x_f²/[4D ln(α)] for α → ∞ }.    (C4)

FIG. 4. Scaled time Dτ*/x_0² at which the barrier first appears as a function of α/(c x_0). The continuous blue line corresponds to the smallest positive solution of α = c √(4πDτ) e^{x_f²/(4Dτ)}, while the dashed red line corresponds to the critical value α_c = √(2πe) c x_0. For α < α_c, no barrier is present.

APPENDIX D: APPLICATION TO LOCKDOWN POLICIES

In this Appendix, we apply the framework developed in the main text to the problem of finding the optimal lockdown policy to navigate a pandemic. To describe the propagation of an epidemic within a population, we employ the SIR model. We consider a population of N individuals, divided into three groups (S, I, R), where S, I, and R are, respectively, the number of susceptible, infected, or recovered individuals. Note that by definition, S + I + R = N and therefore a given state of the system is completely specified by the vector x = (S, I). We assume that new infections occur at a rate βSI, while infected individuals recover at a rate γI. In other words, the system evolves as a Markov jump process with rates

w[(S, I) → (S − 1, I + 1)] = βSI    (D1)

and

w[(S, I) → (S, I − 1)] = γI,    (D2)

where w[x → x′] indicates the rate of the transition from x to x′.
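To make these dynamics concrete, the following minimal Gillespie-type simulation (a sketch, not the authors' code) propagates the jump process of Eqs. (D1) and (D2). Lockdowns are modelled as resets (S, I) → (S, αI), anticipating the definition given below; the fixed trigger level used here is an arbitrary illustrative policy, not the optimal boundary computed in this Appendix, and all parameter values are illustrative.

import numpy as np

# Gillespie simulation of the SIR jump process of Eqs. (D1)-(D2), with lockdowns
# modelled as resets (S, I) -> (S, alpha*I) triggered by a fixed (non-optimal) threshold.
rng = np.random.default_rng(1)
N, beta, gamma, tf = 200, 5.0 / 200, 1.0, 1.0
alpha_lock, I_lock = 0.5, 80                       # lockdown strength and arbitrary trigger level
t, S, I, lockdowns = 0.0, N - 5, 5, 0

while t < tf and I > 0:
    rate_inf, rate_rec = beta * S * I, gamma * I   # rates of Eqs. (D1) and (D2)
    total = rate_inf + rate_rec
    t += rng.exponential(1.0 / total)              # waiting time to the next event
    if t >= tf:
        break
    if rng.random() < rate_inf / total:
        S, I = S - 1, I + 1                        # infection: (S, I) -> (S - 1, I + 1)
    else:
        I -= 1                                     # recovery:  (S, I) -> (S, I - 1)
    if I > I_lock:                                 # lockdown reset (S, I) -> (S, alpha*I)
        I = int(alpha_lock * I)
        lockdowns += 1

print(S, I, lockdowns)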
During a real pandemic, national governments are usually presented with conflicting objectives. Indeed, the rapid spread of the disease carries heavy public-health costs, in particular when the number of infected individuals that require treatment exceeds the hospital alert level of a country. This spread can be countered by imposing lockdowns, which, however, have an impact on the economy of a country.

To describe the public-health cost, we introduce the (negative) reward function (see Fig. 5),

R((S, I), t) = −aI − [b(I − I_c)] θ(I − I_c),    (D3)

where I_c represents the hospital alert level, a > 0 is the cost per infected person when the hospitals are not saturated, and b > 0 is the excess cost when the hospitals are saturated. Here, θ(z) is the Heaviside theta function. We describe lockdowns by allowing the possibility for governments to decrease the number of infected people by a constant fraction, without changing the number of susceptible people. In other words, the effect of a lockdown is to reset the system from state x = (S, I) to state x′ = (S, αI), where 0 < α < 1. We assume that each lockdown comes with a fixed cost c > 0, due to the negative socioeconomic impact. The goal is then to find the optimal resetting (or lockdown) policy r*(x, t) which maximizes the payoff function defined in Eq. (4) of the main text, in order to control the system up to a given time horizon t_f.

It is easy to employ our framework in this case and to show that the optimal payoff function J((S, I), t) evolves according to

−∂_t J((S, I), t) = R((S, I), t) + βSI [J((S − 1, I + 1), t) − J((S, I), t)] + γI [J((S, I − 1), t) − J((S, I), t)],    (D4)

for (S, I) ∈ Ω(t), and

J((S, I), t) = J((S, αI), t) − c,    (D5)

for (S, I) ∉ Ω(t). Here the domain Ω(t) is defined as

Ω(t) = {x : J((S, I), t) ≥ J((S, αI), t) − c}.    (D6)

We employ our framework numerically and obtain the optimal policy in the (S, I) plane, as shown in Fig. 6. As can be seen in this figure, the policy is to never reset if the number of susceptible individuals is larger than a critical value. Then, as the number of susceptible individuals is lowered, the optimal policy is to reset if the number of infected individuals is larger than a threshold which increases as the number of susceptible individuals is lowered. The optimal domain Ω(t) guides the epidemic below the hospital alert level, while avoiding lockdowns as much as possible. An optimally managed epidemic trajectory is shown in Fig. 7.

FIG. 5. Reward function R((S, I), t) in (D3) describing the public-health cost in the context of the optimal lockdown policy problem. The reward is a piecewise linear function whose slope increases above the hospital alert level threshold I_c.

FIG. 6. Phase-space illustration of the optimal lockdown policy for a SIR epidemic model with infection rate β and recovery rate γ. Since the total number N of individuals is fixed, only configurations below the black dashed line, where S + I ≤ N, are allowed. We consider a (negative) reward R((S, I), t) = −aI − [b(I − I_a)] θ(I − I_a), where I_a is the hospital alert level (horizontal dashed line) and θ(x) is the Heaviside step function. A fixed cost c is incurred upon resetting from the state (S, I) to the state (S, αI), which describes the effects of a lockdown. The optimal strategy that maximizes the expected payoff is to reset upon touching the blue region and to evolve freely in the white region, which we denote as Ω(t) in the text. The domain Ω(t) is obtained by numerical integration of the Stefan problem presented in (D4). The boundary of the domain Ω(t) guides the epidemic below the hospital alert level, while avoiding lockdowns as much as possible. The figure corresponds to the choice of parameters t = 0, a = 2, b = 10, α = 0.5, N = 200, I_a = 0.3 N, β = 5/N, t_f = 1, and γ = 1. An optimally controlled epidemic trajectory is shown in Fig. 7.

FIG. 7. Epidemic evolution in the SIR model controlled by an optimal lockdown policy according to the reward function given in (D3) and a fixed cost c for resetting. The red dashed line corresponds to the lockdown event. Since the total number N of individuals is fixed, only configurations below the black dashed line, where S + I ≤ N, are allowed. The horizontal dashed line describes the hospital alert level I_c. The figure corresponds to the choice of parameters a = 2, b = 10, α = 0.5, N = 200, I_c = 0.3 N, β = 5/N, γ = 1, and t_f = 1, which are the same as those used in Fig. 6.

APPENDIX E: GENERALIZATION TO A QUADRATIC RESETTING COST

In this Appendix, we generalize our framework to the case in which a quadratic cost is associated to resetting, with the following payoff functional:

F[r | x0, t] = ⟨ ∫_t^{t_f} dτ [ R(x(τ), τ) − (1/2) c(x(τ), τ) r(x(τ), τ)² ] ⟩_{x0},    (E1)

with c(x, τ) ≥ 0. Following the procedure outlined in the main text, we obtain the moving boundary problem,

−∂_t J(x, t) = D Δ_x J(x, t) + f(x, t) · ∇_x J(x, t) + R(x, t),   x ∈ Ω(t),
−∂_t J(x, t) = D Δ_x J(x, t) + f(x, t) · ∇_x J(x, t) + R(x, t) + [J(x, t) − J(x_res, t)]² / [2 c(x, t)],   x ∉ Ω(t),    (E2)

where

Ω(t) = {x : J(x, t) ≥ J(x_res, t)}.    (E3)

The differential equation (E2) must be solved by imposing the continuity of the solution and its derivative at the boundary of Ω(t). The optimal resetting rate r*(x, t) is no longer bang-bang and is given by

r*(x, t) = { 0 for x ∈ Ω(t),  [J(x_res, t) − J(x, t)] / c(x, t) for x ∉ Ω(t) }.    (E4)
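The maximization that leads to (E2) and (E4) can be spelled out explicitly (this intermediate step is not written out above but follows directly from the argument of Appendix A). During a time step dt, the controller now pays c(x, t) r² dt/2 and gains r dt [J(x_res, t) − J(x, t)] from the resetting option, so one maximizes over r ≥ 0 the concave quadratic expression

−(1/2) c(x, t) r² + r [J(x_res, t) − J(x, t)],

whose maximizer is r* = [J(x_res, t) − J(x, t)]/c(x, t) when J(x_res, t) > J(x, t) and r* = 0 otherwise, which is Eq. (E4). Substituting r* back into the Bellman equation produces the additional term [J(x, t) − J(x_res, t)]²/[2 c(x, t)] appearing in the second line of (E2).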

APPENDIX F: GENERALIZATION TO DISCRETE-TIME SYSTEMS

In this Appendix, we generalize our framework to the case of discrete-time systems whose state x_n evolves according to the Markov rule,

x_n = { x_res with probability p_n(x_{n−1}),  x_{n−1} + f_n(x_{n−1}) + η_n with probability 1 − p_n(x_{n−1}) },    (F1)

where the η_n are independently and identically distributed random variables drawn from a probability distribution q(η) and f_n(x) is an external force. The distribution q(η) is arbitrary and includes, for instance, the case of fat-tailed distributions. The payoff functional in Eq. (4) of the main text straightforwardly generalizes to

F({p_m, ..., p_n} | x_m, m) = ⟨ Σ_{l=m}^{n} [ R(x_l, l) − c(x_l, l) p_l(x_l) ] ⟩_{x_m},    (F2)

where the control is the sequence of resetting probabilities {p_m, ..., p_n}, and the average on the right-hand side is taken over all system trajectories starting from x_m at step m. Following a similar approach as in the previous sections, we obtain that the optimal policy is given by

p_m(x) = { 0 if x ∈ Ω_m,  1 if x ∉ Ω_m },    (F3)

where Ω_m ⊆ ℝ^d is a time-dependent domain. The corresponding discrete moving boundary problem for the optimal payoff function, J(x, m) = max_{p_m, ..., p_n} F[{p_m, ..., p_n} | x, m], is given by

J(x, m) = R(x, m) + max { J(x_res, m + 1) − c(x, m), ∫_{−∞}^{∞} dη q(η) J[x + f_m(x) + η, m + 1] },    (F4)

with

Ω_m = { x : ∫_{−∞}^{∞} dη q(η) J[x + f_m(x) + η, m + 1] ≥ J(x_res, m + 1) − c(x, m) }.    (F5)
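The recursion (F4)–(F5) is straightforward to iterate numerically. The sketch below (not from the paper) performs the backward dynamic programming on a one-dimensional grid with Gaussian noise, no drift, a quadratic running reward and a constant resetting cost; all parameter values are illustrative assumptions.

import numpy as np

# Backward dynamic programming for the discrete-time recursion (F4)-(F5) on a grid,
# with Gaussian noise q(eta), f_m = 0, a quadratic running reward and a constant cost.
n_steps, x_res, c, alpha, sigma = 50, 0.0, 0.5, 0.1, 1.0
x = np.linspace(-10, 10, 401)
eta = np.linspace(-4 * sigma, 4 * sigma, 161)
w = np.exp(-eta**2 / (2 * sigma**2))
w /= w.sum()                                       # quadrature weights for q(eta) d(eta)

R = -alpha * x**2                                  # running reward R(x, m), taken time independent here
J = np.zeros_like(x)                               # payoff beyond the horizon, J(x, n+1) = 0
i_res = np.argmin(np.abs(x - x_res))
for m in range(n_steps, 0, -1):
    # expected payoff of NOT resetting: sum over eta of q(eta) * J(x + eta, m+1)
    # (np.interp clamps points that fall outside the grid to the edge values)
    J_free = np.array([(w * np.interp(xi + eta, x, J)).sum() for xi in x])
    J_reset = J[i_res] - c                         # resetting: pay c, restart from x_res, Eq. (F4)
    inside = J_free >= J_reset                     # domain Omega_m of Eq. (F5)
    J = R + np.maximum(J_free, J_reset)            # Bellman update, Eq. (F4)

print(x[inside].min(), x[inside].max())            # resetting is optimal outside this interval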

[1] L. Pontryagin, V. Boltyanskii, R. Gamkrelidze, and E. Mishchenko, The Mathematical Theory of Optimal Processes (Interscience, New York, 1962).
[2] R. Bellman and R. E. Kalaba, Dynamic Programming and Modern Control Theory (Academic Press, New York, 1965).
[3] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, Englewood Cliffs, NJ, 2003).
[4] H. Pham, Continuous-time Stochastic Control and Optimization with Financial Applications (Springer Science & Business Media, Berlin, 2009).
[5] R. Stengel, Optimal Control and Estimation (Dover, New York, 1993).
[6] S. N. Majumdar and H. Orland, Effective Langevin equations for constrained stochastic processes, J. Stat. Mech. (2015) P06039.
[7] H. J. Kappen, Linear theory for control of nonlinear stochastic systems, Phys. Rev. Lett. 95, 200201 (2005).
[8] T. Schmiedl and U. Seifert, Optimal finite-time processes in stochastic thermodynamics, Phys. Rev. Lett. 98, 108301 (2007).
[9] L. Bertini, A. De Sole, D. Gabrielli, G. Jona-Lasinio, and C. Landim, Macroscopic fluctuation theory, Rev. Mod. Phys. 87, 593 (2015).
[10] U. Boscain, M. Sigalotti, and D. Sugny, Introduction to the Pontryagin maximum principle for quantum optimal control, PRX Quantum 2, 030203 (2021).
[11] A. Dolgui, D. Ivanov, S. P. Sethi, and B. Sokolov, Scheduling in production, supply chain and Industry 4.0 systems by optimal control: fundamentals, state-of-the-art and applications, Intl. J. Product. Res. 57, 411 (2019).
[12] V. Gómez, S. Thijssen, A. Symington, S. Hailes, and H. J. Kappen, Real-time stochastic optimal control for multi-agent quadrotor systems, in Twenty-Sixth International Conference on Automated Planning and Scheduling, edited by A. Coles, A. Coles, S. Edelkamp, D. Magazzeni, and S. Sanner (AAAI Press, Palo Alto, CA, 2016).
[13] K. Khanin, S. Nechaev, G. Oshanin, A. Sobolevski, and O. Vasilyev, Ballistic deposition patterns beneath a growing KPZ interface, Phys. Rev. E 82, 061107 (2010).
[14] A. Montanari and R. Zecchina, Optimizing searches via rare events, Phys. Rev. Lett. 88, 178701 (2002).
[15] M. F. Shlesinger, Search research, Nature (London) 443, 281 (2006).
[16] M. R. Evans and S. N. Majumdar, Diffusion with stochastic resetting, Phys. Rev. Lett. 106, 160601 (2011).
[17] M. R. Evans and S. N. Majumdar, Diffusion with optimal resetting, J. Phys. A: Math. Theor. 44, 435001 (2011).
[18] S. Reuveni, M. Urbakh, and J. Klafter, Role of substrate unbinding in Michaelis-Menten enzymatic reactions, Proc. Natl. Acad. Sci. USA 111, 4391 (2014).
[19] A. Economou and D. Fakinos, A continuous-time Markov chain under the influence of a regulating point process and applications in stochastic models with catastrophes, Eur. J. Operat. Res. 149, 625 (2003).
[20] P. Visco, R. J. Allen, S. N. Majumdar, and M. R. Evans, Switching and growth for microbial populations in catastrophic responsive environments, Biophys. J. 98, 1099 (2010).
[21] V. Méndez and D. Campos, Transport properties and first-arrival statistics of random motion with stochastic reset times, Phys. Rev. E 93, 022106 (2016).
[22] S. Eule and J. J. Metzger, Nonequilibrium steady states of stochastic processes with intermittent resetting, New J. Phys. 18, 033006 (2016).
[23] F. Mori, S. N. Majumdar, and G. Schehr, Distribution of the time of the maximum for stationary processes, Europhys. Lett. 135, 30003 (2021).
[24] L. Kuśmierz, S. N. Majumdar, S. Sabhapandit, and G. Schehr, First order transition for the optimal search time of Lévy flights with resetting, Phys. Rev. Lett. 113, 220602 (2014).
[25] S. N. Majumdar, S. Sabhapandit, and G. Schehr, Dynamical transition in the temporal relaxation of stochastic processes under resetting, Phys. Rev. E 91, 052131 (2015).
[26] D. Campos and V. Méndez, Phase transitions in optimal search times: How random walkers should combine resetting and flight scales, Phys. Rev. E 92, 062115 (2015).
[27] M. R. Evans, S. N. Majumdar, and G. Schehr, Stochastic resetting and applications, J. Phys. A: Math. Theor. 53, 193001 (2020).
[28] J. Crank, Free and Moving Boundary Problems (Oxford University Press, New York, 1984).
[29] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Viscosity Solutions (Springer Science & Business Media, New York, 2006).
[30] B. De Bruyne, J. Randon-Furling, and S. Redner, Optimization in first-passage resetting, Phys. Rev. Lett. 125, 050602 (2020).
[31] B. De Bruyne, J. Randon-Furling, and S. Redner, Optimization and growth in first-passage resetting, J. Stat. Mech. (2021) 013203.
[32] This is the same way as for the standard deviation of the white noise.
[33] O. Tal-Friedman, A. Pal, A. Sekhon, S. Reuveni, and Y. Roichman, Experimental realization of diffusion with stochastic resetting, J. Phys. Chem. Lett. 11, 7350 (2020).
[34] B. Besga, A. Bovon, A. Petrosyan, S. N. Majumdar, and S. Ciliberto, Optimal mean first-passage time for a Brownian searcher subjected to resetting: Experimental and theoretical results, Phys. Rev. Res. 2, 032029(R) (2020).
[35] B. Besga, F. Faisant, A. Petrosyan, S. Ciliberto, and S. N. Majumdar, Dynamical phase transition in the first-passage probability of a Brownian motion, Phys. Rev. E 104, L012102 (2021).
[36] W. Kermack and A. G. McKendrick, A contribution to the mathematical theory of epidemics, Proc. R. Soc. London A 115, 700 (1927).
[37] J. Samuel and S. Sinha, Optimal control in pandemics, Phys. Rev. E 103, L010301 (2021).
[38] S. N. Majumdar, Persistence in nonequilibrium systems, Curr. Sci. 89, 2076 (2005).

