Downloaded from orbit.dtu.dk on: Mar 03, 2019

Dynamic Asset Allocation - Identifying Regime Shifts in Financial Time Series to Build
Robust Portfolios

Nystrup, Peter

Publication date:
2018

Document Version
Publisher's PDF, also known as Version of record

Link back to DTU Orbit

Citation (APA):
Nystrup, P. (2018). Dynamic Asset Allocation - Identifying Regime Shifts in Financial Time Series to Build
Robust Portfolios. DTU Compute. DTU Compute PHD-2017, Vol. 465

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright
owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately
and investigate your claim.
Ph.D. Thesis
Doctor of Philosophy

Dynamic Asset Allocation


Identifying Regime Shifts in Financial Time Series to Build
Robust Portfolios

Peter Nystrup

Kongens Lyngby
November 2017
Technical University of Denmark
Department of Applied Mathematics and Computer Science
Richard Petersens Plads, Building 324
2800 Kongens Lyngby, Denmark
Phone +45 4525 3031
compute@compute.dtu.dk
www.compute.dtu.dk
Short Contents

Short Contents i

Abstract iii

Resumé v

Preface vii

Acknowledgments ix

Publications xi

Acronyms xiii

Contents xv

1 Introduction 1

2 Contribution 9

3 Conclusion 23

References 27

A Stylized facts of financial time series and hidden Markov models in continuous time 37

B Long memory of financial time series and hidden Markov models with time-varying parameters 59

C Regime-based versus static asset allocation: Letting the data speak 85

D Dynamic allocation or diversification: A regime-based approach to multiple assets 99

E Detecting change points in VIX and S&P 500: A new approach to dynamic asset allocation 119

F Greedy Gaussian segmentation of multivariate time series 141

G Dynamic portfolio optimization across hidden market regimes 173

H Multi-period trading via convex optimization 203

I Multi-period portfolio selection with drawdown control 263


Abstract

Long-term investors can often bear the risk of outsized market movements or tail
events more easily than the average investor; for bearing this risk, they hope to
earn significant excess returns. Rebalancing periodically to a fixed benchmark
allocation, however, is not the way to do this. In the presence of time-varying in-
vestment opportunities, portfolio weights should be adjusted as new information
arrives to take advantage of favorable regimes and reduce potential drawdowns.
This thesis contributes to a better understanding of financial markets’ behavior
in the form of a model-based framework for dynamic asset allocation.
Regime-switching models can match financial markets’ tendency to change their
behavior abruptly and the phenomenon that the new behavior often persists for
several periods after a change. Regime shifts lead to time-varying parameters
and, in addition, the parameters within the regimes and the transition proba-
bilities change over time. Using recursive and adaptive estimation techniques
to capture this, we are able to better reproduce the volatility persistence that
dynamic asset allocation benefits from. With this approach it is sufficient to
distinguish between two regimes in stock returns in order for it to be profitable
to change asset allocation based solely on the inferred regimes, both in a single-
and multi-asset universe.
We advocate the use of model predictive control for translating forecasts into a
dynamic strategy and controlling drawdowns by solving a multi-period optimiza-
tion problem. We implement this based on forecasts from a multivariate hidden
Markov model with time-varying parameters. Our results show that a substan-
tial amount of value can be added by adjusting the asset allocation to the current
market conditions, rather than rebalancing periodically to a static benchmark.
By proposing a practical approach to drawdown control, we demonstrate the
theoretical link to dynamic asset allocation and the importance of identifying
and acting on regime shifts in order to limit losses and build robust portfolios.
Keywords: Risk management; Regime switching; Adaptive estimation; Fore-
casting; Model predictive control; Portfolio optimization; Drawdown control.
Resumé

Long-term investors can often bear the risk of large market movements or extreme events better than the average investor. As compensation for taking on this risk, they hope to earn substantial excess returns. Periodic rebalancing to a static benchmark allocation, however, is not the way to achieve this. In the presence of time-varying investment opportunities, portfolio weights should be adjusted as new information becomes available, in order to take advantage of favorable regimes and reduce potential losses. This thesis contributes to a better understanding of the behavior of financial markets in the form of a model-based foundation for dynamic asset allocation.

Regime-switching models can reproduce financial markets' tendency to change behavior abruptly and the phenomenon that the new behavior often persists long after a shift. Regime shifts lead to time-varying parameters, but the parameters within the regimes and the transition probabilities also change over time. By using recursive and adaptive estimation methods to capture this, we are better able to reproduce the volatility persistence that dynamic asset allocation exploits. With this approach it is sufficient to distinguish between two regimes in stock returns for it to be profitable to change asset allocation based solely on the inferred regimes, in universes consisting of a single asset or multiple assets.

We advocate the use of model predictive control for translating predictions into a dynamic strategy and controlling losses by solving a multi-period optimization problem. We implement this based on predictions from a hidden Markov model with time-varying parameters. Our results show that substantial value can be added by adapting the asset allocation to current market conditions rather than rebalancing periodically to a static benchmark. By proposing a practical approach to drawdown control, we demonstrate the theoretical link to dynamic asset allocation and the importance of identifying and acting on regime shifts in order to limit losses and build robust portfolios.

Keywords: Risk management; Regime switching; Adaptive estimation; Forecasting; Model predictive control; Portfolio optimization; Drawdown control.
Preface

This thesis was prepared in the Department of Applied Mathematics and Com-
puter Science at Technical University of Denmark in partial fulfillment of the
requirements for the degree of Doctor of Philosophy (Ph.D.) in Engineering. It
consists of a collection of nine papers written during the course of my Ph.D.
study.
The research study was carried out in collaboration with Danish pension fund
Sampension and Lund University in Sweden, with support from Innovation Fund
Denmark under Grant No. 4135-00077B. As part of the study, I spent the sum-
mer of 2016 as a visiting student researcher in the Department of Electrical
Engineering at Stanford University in California.
In addition to a handful of visits to Stanford, the financial support from Sampen-
sion and Innovation Fund Denmark has allowed me to participate in conferences
and seminars in London, Frankfurt, Rennes, Aarhus, Paris, New York, Austin,
Cairns, and Brussels.
This thesis deals with different aspects of mathematical modeling of the stylized
behavior of financial returns using regime-switching models with the aim of
developing a model-based framework for dynamic asset allocation. The idea for
the study emanated from my Master’s thesis (Nystrup 2014), which established
the potential for model-driven, regime-based asset allocation. In continuation
of this work we initially focused on adaptive estimation of regime-switching
models. Before pursuing with a multi-period optimization approach based on
model predictive control, demonstrating the connection between dynamic asset
allocation and drawdown control, we tested strategies where the asset allocation
was fully determined by the identified market regime.

Copenhagen, November 2017

Peter Nystrup
Acknowledgments

My first acknowledgment is to my academic supervisors, Professor Henrik Mad-


sen from Technical University of Denmark and Professor Erik Lindström from
Lund University, for many interesting discussions over the years. Henrik has a
great ability to combine theoretical insight with a hands-on approach to mathe-
matical modeling, forecasting, and control. His vast experience makes him able
to draw parallels to previous applications in many different areas. Erik has
played an important role with his enthusiasm and knowledge of quantitative fi-
nance. He has always made time to provide feedback and to research new ideas
that could contribute to the study.
I want to thank my two industrial supervisors, Head of Investment Analysis
Bo William Hansen and Chief Investment Officer Henrik Olejasz Larsen, along
with the rest of my colleagues from Sampension, for their contribution. Bo
and Henrik have shown a lot of interest in my work since the beginning of my
Master’s thesis and have encouraged me to continue working on the topic of
dynamic asset allocation. I have benefitted from their background in economics
and practical experience with financial markets and asset allocation. This study
would not have been possible without the strong support from Bo and Henrik.
I also want to thank Professor Stephen Boyd and his research group at Stanford
University for their collaboration. Stephen challenged me from the first day
we met to view everything as an optimization problem. He has a great ability
to translate a profound mathematical understanding into practical solutions to
pertinent problems. Thanks to Stephen, my visits to Stanford have been very
rewarding.
I am grateful to my parents and my fiancée Jenn for their love and support
and extensive help with proofreading. Thank you also to the anonymous re-
viewers at various journals and conference participants, whose comments and
suggestions have led to improvements of the individual publications. Finally, the
financial support from Sampension and Innovation Fund Denmark is gratefully
acknowledged.
Publications

Paper A

Nystrup, P., H. Madsen, and E. Lindström. “Stylised facts of financial time series and hidden Markov models in continuous time.” Quantitative Finance, vol. 15, no. 9 (2015b), pp. 1531–1541.

Paper B

Nystrup, P., H. Madsen, and E. Lindström. “Long memory of financial time series and hidden Markov models with time-varying parameters.” Journal of Forecasting, vol. 36, no. 8 (2017b), pp. 989–1002.

Paper C

Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Regime-based versus static asset allocation: Letting the data speak.” Journal of Portfolio Management, vol. 42, no. 1 (2015a), pp. 103–109.

Paper D

Nystrup, P., B. W. Hansen, H. O. Larsen, H. Madsen, and E. Lindström. “Dynamic allocation or diversification: A regime-based approach to multiple assets.” Journal of Portfolio Management, vol. 44, no. 2 (2017a), pp. 62–73.

Paper E

Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Detecting change points in VIX and S&P 500: A new approach to dynamic asset allocation.” Journal of Asset Management, vol. 17, no. 5 (2016), pp. 361–374.

Paper F

Hallac, D., P. Nystrup, and S. Boyd. “Greedy Gaussian segmentation of multivariate time series.” Advances in Data Analysis and Classification (2018), forthcoming.

Paper G

Nystrup, P., H. Madsen, and E. Lindström. “Dynamic portfolio optimization across hidden market regimes.” Quantitative Finance, vol. 18, no. 1 (2018b), pp. 83–95.

Paper H

Boyd, S., E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth. “Multi-period trading via convex optimization.” Foundations and Trends in Optimization, vol. 3, no. 1 (2017), pp. 1–76.

Paper I

Nystrup, P., S. Boyd, E. Lindström, and H. Madsen. “Multi-period portfolio selection with drawdown control.” Annals of Operations Research (2018a), forthcoming.
Acronyms

ACF Autocorrelation function
AIC Akaike’s information criterion
AR Annualized return
AT Annual turnover
BIC Bayesian information criterion
CPPI Constant-proportion portfolio insurance
CR Calmar ratio
CTHMM Continuous-time hidden Markov model
DAA Dynamic asset allocation
DM Developed market
EM Emerging market
EM Expectation–maximization
ETF Exchange-traded fund
EWMA Exponentially-weighted moving average
FM Fixed mix
GARCH Generalized autoregressive conditional heteroskedasticity
GAS Generalized autoregressive score
GGS Greedy Gaussian segmentation
HMM Hidden Markov model

HSMM Hidden semi-Markov model
IR Information ratio
LASSO Least absolute shrinkage and selection operator
LLO Leveraged long-only
LO Long only
LS Long–short
MDD Maximum drawdown
ML Maximum likelihood
MLE Maximum-likelihood estimate
MPC Model predictive control
MPO Multi-period optimization
NLP Natural language processing
OBPI Option-based portfolio insurance
RBAA Regime-based asset allocation
SAA Strategic asset allocation
SD Standard deviation
SGM Segmented Gaussian model
SPO Single-period optimization
SR Sharpe ratio
TAA Tactical asset allocation
VIX Chicago Board Options Exchange’s Volatility Index
Contents

Short Contents i

Abstract iii

Resumé v

Preface vii

Acknowledgments ix

Publications xi

Acronyms xiii

Contents xv

1 Introduction 1
1.1 Regime shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Stylized facts of financial returns . . . . . . . . . . . . . . . . . . 3
1.3 Time-varying parameters . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Dynamic portfolio optimization . . . . . . . . . . . . . . . . . . . 5
1.5 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Contribution 9
2.A Stylized facts of financial time series and hidden Markov models
in continuous time . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.B Long memory of financial time series and hidden Markov models
with time-varying parameters . . . . . . . . . . . . . . . . . . . . 10
2.C Regime-based versus static asset allocation: Letting the data speak 12
2.D Dynamic allocation or diversification: A regime-based approach
to multiple assets . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.E Detecting change points in VIX and S&P 500: A new approach
to dynamic asset allocation . . . . . . . . . . . . . . . . . . . . . 14
2.F Greedy Gaussian segmentation of multivariate time series . . . . 16
2.G Dynamic portfolio optimization across hidden market regimes . . 17
2.H Multi-period trading via convex optimization . . . . . . . . . . . 18
2.I Multi-period portfolio selection with drawdown control . . . . . . 19

3 Conclusion 23
3.1 Commercial perspectives . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

References 27

A Stylized facts of financial time series and hidden Markov models in continuous time 37
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2 Hidden Markov models in discrete time . . . . . . . . . . . . . . 41
3 Hidden Markov models in continuous time . . . . . . . . . . . . . 43
4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A Parameter estimates . . . . . . . . . . . . . . . . . . . . . . . . . 55
B FTSE results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

B Long memory of financial time series and hidden Markov models with time-varying parameters 59
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2 The hidden Markov model . . . . . . . . . . . . . . . . . . . . . . 63
3 Long memory and regime switching . . . . . . . . . . . . . . . . . 65
4 Adaptive parameter estimation . . . . . . . . . . . . . . . . . . . 67
5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

C Regime-based versus static asset allocation: Letting the data speak 85
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2 Letting the data speak . . . . . . . . . . . . . . . . . . . . . . . . 88
3 The hidden Markov model . . . . . . . . . . . . . . . . . . . . . . 90
4 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . 95
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

D Dynamic allocation or diversification: A regime-based approach to multiple assets 99
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2 Asset universe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3 The hidden Markov model . . . . . . . . . . . . . . . . . . . . . . 107
4 State inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

E Detecting change points in VIX and S&P 500: A new approach to dynamic asset allocation 119
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2 VIX and S&P 500 . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3 Change-point detection . . . . . . . . . . . . . . . . . . . . . . . 125
4 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . 137
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

F Greedy Gaussian segmentation of multivariate time series 141


1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2 Problem setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
3 Greedy Gaussian segmentation . . . . . . . . . . . . . . . . . . . 150
4 Validation and parameter selection . . . . . . . . . . . . . . . . . 152
5 Variations and extensions . . . . . . . . . . . . . . . . . . . . . . 153
6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

G Dynamic portfolio optimization across hidden market regimes 173


1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
2 The hidden Markov model . . . . . . . . . . . . . . . . . . . . . . 177
3 Dynamic portfolio optimization . . . . . . . . . . . . . . . . . . . 182
4 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

H Multi-period trading via convex optimization 203


1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
2 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4 Single-period optimization . . . . . . . . . . . . . . . . . . . . . . 220
5 Multi-period optimization . . . . . . . . . . . . . . . . . . . . . . 237
6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

I Multi-period portfolio selection with drawdown control 263


1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
2 Multi-period portfolio selection . . . . . . . . . . . . . . . . . . . 267
3 Data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
4 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
CHAPTER 1
Introduction

Asset allocation is the most important determinant of portfolio performance
(Brinson et al. 1986, Ibbotson and Kaplan 2000). In the presence of time-
varying investment opportunities, portfolio weights should be adjusted as new
information arrives (Bansal et al. 2004). Traditional strategic approaches rather
seek to develop robust, static portfolios that optimize efficiency across a range of
scenarios (Sheikh and Sun 2012). Strategic asset allocation (SAA) is long term
in nature and based on long-term views of asset class performance (Dahlquist
and Harvey 2001, Campbell and Viceira 2002). Even if the SAA is reconsidered
on an annual basis, it is unlikely to change significantly, as long as the purpose
remains “all-weather” efficiency.
The purpose of dynamic asset allocation (DAA) is to take advantage of favor-
able regimes and reduce potential drawdowns during adverse regimes. DAA is,
by definition, more restricted than SAA in terms of the size of the investment
opportunity set, because it is difficult to invest dynamically in illiquid assets
such as private real estate, private equity, infrastructure, etc. This is worth
mentioning, given that illiquid alternatives have become a larger part of institu-
tional investors’ portfolios in recent years. Strategic investors, such as pension
funds, that invest with a long time horizon typically face constraints on the size
of possible deviations from their benchmark allocation; even so, they can still
benefit from reacting to significant regime shifts (Kritzman et al. 2012).
The 2008 financial crisis clearly showed that diversification is not sufficient to
avoid large drawdowns. Diversification fails, when needed the most, because cor-
relations between risky assets tend to strengthen during times of crisis (see, e.g.,
Pedersen 2009, Ibragimov et al. 2011). Large drawdowns challenge investors’
financial and psychological tolerance and lead to fund redemption and firing of
portfolio managers. By engaging in DAA it is possible to generate protection
against drawdowns. In fact, portfolio insurance can be regarded as the most
general form of DAA, as argued by Goltz et al. (2008).1
DAA can be profitable even if markets are efficient. A market is said to be
efficient if prices in the market fully reflect all available information. When this
condition is satisfied, market participants cannot earn a riskless profit on the
basis of available information—in other words, there are no arbitrage opportu-
nities (Fama 1970, Pedersen 2015). Trends can be present in efficient markets if
the equilibrium expected return changes over time. The existence of a business
cycle, where the expected rate of return on capital changes over time, is one
example. When the business cycle is not a deterministic phenomenon, asset
prices need not follow a random walk with a constant or deterministic trend
(Levich 2001).
This first chapter introduces the concept of DAA and a distinction between
rule- and optimization-based approaches. This includes the stylized facts that
are exploited and the significance of time-varying parameters. The purpose is
to motivate the importance of identifying and acting on regime shifts in order
to limit losses and build robust portfolios. At the end of the chapter, a list of
research questions is posed and the remainder of the thesis is outlined.

1.1 Regime shifts


Regime shifts, some of which can be recurring (recessions versus expansions)
and some of which can be permanent (structural breaks), are prevalent across
a wide range of financial markets and in the behavior of many macro variables
(Ang and Timmermann 2012). Observed regimes in financial markets are re-
lated to the phases of the business cycle (Campbell 1999, Cochrane 2005). The
link is complex and difficult to exploit for investment purposes, due to the low
frequency and large lag in the availability of data related to the business cycle.
In this thesis, the focus is on readily available market data, instead of attempting
to establish the link to the business cycle. If financial markets are efficient, the
outlook for the economy should be reflected in asset prices to the extent that
it can be predicted (Siegel 1991). Interestingly, most studies on DAA consider
monthly rather than daily returns. A monthly data frequency might make sense
for SAA, but not when studying DAA. Even if the investor does not expect to trade
every day, the option should be there. In an optimization-based framework there
are more efficient ways to reduce turnover than lowering the data frequency.

1 It is known from Merton's (1973) replicating argument interpretation of the Black and Scholes (1973) formula that nonlinear payoffs based on an underlying asset can be replicated by dynamic trading in the underlying asset and a risk-free asset.

Ang and Bekaert (2002) were among the first to consider the impact of regime
shifts on asset allocation. They showed that regime-based asset allocation
(RBAA) can add value over rebalancing to static weights and reduce drawdowns.
Numerous studies have followed, including Ang and Bekaert (2004), Bauer et al.
(2004), Ammann and Verhofen (2006), Guidolin and Timmermann (2007, 2008),
Bulla et al. (2011), Guidolin and Ria (2011), Kritzman et al. (2012), Bae et al.
(2014). Only Kritzman et al. (2012) forecasted regimes in important drivers of
asset returns rather than identifying regimes directly in asset returns based on a
model with a fixed number of recurring regimes. The majority have considered
dynamic allocation to stocks in combination with bonds and/or a risk-free asset,
often involving larger changes in allocation than most investors are willing to
or allowed to implement. The focus on stocks is natural, since portfolio risk is
typically dominated by stock market risk (see, e.g., Goyal et al. 2015, Bass et al.
2017). In a multi-asset setting, however, it is essential whether regime shifts
occur synchronously across asset classes.
It is important to emphasize that regime shifts are an abstraction. Abrupt,
seemingly discontinuous changes can occur in systems with no switches (Preis
and Stanley 2010); for example, complex systems subject to reinforcing feedback
(Gârleanu and Pedersen 2007). A model with a fixed number of recurring regimes,
as discussed in the next sections, is only one of many possible ways to
model or capture abrupt changes; for example, adaptive estimation can be used
to track variations in (the parameters of) a system (see, e.g., Gustafsson 2000),
change-point detection can be used to identify significant changes (see, e.g., Ross
et al. 2011), or stochastic differential equations can be used to model nonlinear
variability at multiple levels (see, e.g., Lindström et al. 2015).

1.2 Stylized facts of financial returns


The hidden Markov model (HMM) is a popular choice for inferring the state
of financial markets. In an HMM, the distribution that generates an observa-
tion depends on the state of an unobserved Markov chain (Cappé et al. 2005,
Frühwirth-Schnatter 2006, Zucchini and MacDonald 2009).2 It is a black-box
model, but the inferred states can often be linked to phases of the business cycle
(Guidolin and Timmermann 2007, Ang and Timmermann 2012). The possibility
of interpreting the states combined with the model’s ability to reproduce stylized
facts of financial returns is part of the reason why it has become increasingly
popular.
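
As a minimal illustration of the mechanism, the following Python sketch simulates daily returns from a two-state Gaussian HMM. The transition probabilities, conditional means, and volatilities are illustrative placeholders, not estimates from any of the papers.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative two-state HMM: state 0 is calm (low volatility, positive drift),
    # state 1 is turbulent (high volatility, negative drift).
    P = np.array([[0.99, 0.01],       # daily transition probabilities
                  [0.03, 0.97]])
    mu = np.array([0.0005, -0.0010])  # conditional means of daily log-returns
    sigma = np.array([0.008, 0.020])  # conditional standard deviations

    T = 5000
    states = np.zeros(T, dtype=int)
    returns = np.zeros(T)
    returns[0] = rng.normal(mu[0], sigma[0])
    for t in range(1, T):
        states[t] = rng.choice(2, p=P[states[t - 1]])
        returns[t] = rng.normal(mu[states[t]], sigma[states[t]])

    # The persistence of the Markov chain induces volatility clustering in the
    # simulated returns, one of the stylized facts discussed below.
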
Rydén et al. (1998) showed the ability of the HMM to reproduce most of the
stylized facts of daily return series introduced by Granger and Ding (1995a,b).
The one stylized fact that could not be reproduced by an HMM was the positive,
significant, and slowly decaying autocorrelation function (ACF) of absolute and
squared daily returns, which is of great importance, for example, in financial
risk management.

2 See paper A for an introduction to hidden Markov models.

Figure 1.1: Autocorrelation functions for daily log-returns of the S&P 500 index and their squared values from 1928 to 2016. The dashed lines are the boundaries of approximate 95% confidence intervals under the null hypothesis of independence (Madsen 2008).

First noted by Mandelbrot (1963), the volatility of asset prices forms clusters,
as large price movements tend to be followed by large price movements and vice
versa. Daily returns do not have the long-memory property themselves, only
their absolute and squared values do, as can be seen from figure 1.1. Malmsten
and Teräsvirta (2010) argued that the very slow decay rate of autocorrelations
should not be considered a stylized fact, since the decay rate in shorter subseries,
on average, is substantially faster and roughly exponential.
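
The autocorrelation functions in figure 1.1 are straightforward to compute. The sketch below does so with plain numpy for a toy return series with regime-switching volatility; real index returns would simply replace the simulated array r.

    import numpy as np

    def sample_acf(x, max_lag):
        """Sample autocorrelation function for lags 1..max_lag."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        denom = float(np.dot(x, x))
        return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

    # Toy stand-in for daily log-returns: a crude two-regime volatility process.
    rng = np.random.default_rng(1)
    regime = np.cumsum(rng.random(10_000) < 0.01) % 2   # switches on average every 100 days
    sigma = np.where(regime == 0, 0.008, 0.020)
    r = rng.normal(0.0, sigma)

    acf_r = sample_acf(r, 500)          # close to zero at all lags
    acf_r2 = sample_acf(r ** 2, 500)    # positive and slowly decaying
    conf_band = 1.96 / np.sqrt(len(r))  # approximate 95% band under independence
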

In order to improve its ability to reproduce the long memory, subsequent pa-
pers have extended the Gaussian HMM studied by Rydén et al. (1998). Bulla
and Bulla (2006) considered hidden semi-Markov models (HSMMs) with other
sojourn-time distributions than the memoryless geometric distribution and Bulla
(2011) considered HMMs with other conditional distributions than the Gaussian
distribution. The types of sojourn-time and conditional distributions are highly
dependent on the number of regimes. For example, Langrock and Zucchini
(2011) showed how an HMM can be structured to fit any sojourn-time distri-
bution with arbitrary precision by mapping multiple latent states to the same
output state.

Guidolin (2011a,b) found in his review of the literature on applications of
Markov-switching models in empirical finance that roughly half the studies se-
lected this model based on economic motivations rather than statistical reason-
ing. Further, half the studies did not consider it a possibility that the number
of regimes could exceed two, and there was an overweight of studies based on
Gaussian mixtures in which the underlying Markov chain was assumed to be
time homogeneous.

1.3 Time-varying parameters


Regime shifts in the data-generating process lead to time-varying parameters,
but the parameters within the regimes and the transition probabilities also
change over time. It is hardly realistic that the parameters of the conditional
distributions can only jump with constant probability between a fixed number
of constant values. Rydén et al. (1998) found that the parameters as well as
the optimal number of states have changed considerably over time. Addition-
ally, Bulla (2011) found that the optimal choice of conditional distributions has
changed over time.
Attempts to capture this dynamic with a stationary model result in very complex
models with a large number of parameters that need to be estimated (see, e.g., Calvet
and Fisher 2004, Song 2014, Augustyniak et al. 2016). A better alternative
to a complex stationary model is to take into account the development of the
parameters in the underlying model. This is crucial in order to minimize the
difference between in- and out-of-sample forecasting performance (Dacco and
Satchell 1999).
The time variation can be either parameter or observation driven. In parameter-
driven models, the parameters are stochastic processes with their own source of
error. A model for the parameter changes in the form of a hierarchical model
can include exogenous explanatory variables (see, e.g., Ang and Bekaert 2004).
Alternatively, the time variation can be observation driven based on the gradient
of the likelihood function—i.e., the score function. This estimation approach was
an important part of my Master’s thesis (Nystrup 2014).
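
To make the observation-driven idea concrete, the sketch below applies a generic score-driven (GAS-type) recursion to a time-varying variance under a conditional Gaussian density; with this particular scaling the update coincides with a GARCH(1,1)-type recursion. It illustrates the principle only and is not the HMM estimator developed later in the thesis.

    import numpy as np

    def score_driven_variance(y, omega=1e-6, alpha=0.05, beta=0.90):
        """Score-driven recursion for a time-varying variance f_t. The scaled score
        of the Gaussian log-likelihood with respect to f_t is y_t**2 - f_t, so the
        parameter moves in the direction that locally increases the predictive
        likelihood. Parameter values are illustrative."""
        f = np.empty(len(y))
        f[0] = np.var(y)
        for t in range(len(y) - 1):
            scaled_score = y[t] ** 2 - f[t]
            f[t + 1] = omega + beta * f[t] + alpha * scaled_score
        return f
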
Recursive and adaptive techniques for estimating models with time-varying
parameters are well known in the engineering literature (see, e.g., Ljung and
Söderström 1983). Although there exist earlier examples of applications to
econometric models (e.g., Madsen et al. 1999), these techniques have attracted
renewed attention in the econometrics literature following the paper by Creal
et al. (2013). Several recursive and adaptive estimation techniques have been
proposed specifically for HMMs (see Khreich et al. 2012, and references therein).
Without increasing the number of parameters, it is possible to reproduce much
more complex dynamics by allowing the parameters to be time varying. The time
variation has a significant impact on the optimal choice of sojourn-time and
conditional distributions. Although the regimes are not really recurring when
the parameters are allowed to change within the regimes, the number of regimes
remains a determining factor.

1.4 Dynamic portfolio optimization


DAA aims to benefit from volatility persistence, since risk-adjusted returns, on
average, are substantially lower during turbulent periods, irrespective of the
source of turbulence (Kritzman and Li 2010, Moreira and Muir 2017). The
negative correlation between volatility and returns is sometimes explained by
changes in attitudes toward risk; because high-volatility regimes are associated
with increased risk aversion and reduced risk capacity, a high-volatility environ-
ment is likely to be accompanied by falling asset prices.3
DAA can be divided into rule- and optimization-based approaches, respectively.
Examples of heuristic, rule-based approaches include scaling the proportion of
stocks inversely proportionally to forecasted, realized, or unexpected volatility
and switching between risky and risk-free assets based on whether the volatility
is above or below a threshold (see, e.g., Zakamulin 2014, Moreira and Muir 2017).
RBAA is a subcategory of rule-based approaches, where the asset allocation
depends on the inferred regime.
Fornaciari and Grillenzoni (2017) proposed to estimate the parameters of the
model and the decision rule simultaneously and recursively based on a profitabil-
ity criterion. Optimizing the parameters of the decision rule, however, does not
guarantee that the decision rule is optimal for the problem at hand. Testing
many different specifications in order to find a decision rule with good perfor-
mance increases the risk of inferior performance out of sample. Further, it can
be argued that a static decision rule is hardly optimal if the underlying model
used for regime inference is time varying.
DAA is a multi-period problem, yet it is often approximated by a sequence of
myopic, single-period optimizations, which makes it impossible to properly ac-
count for the consequences of trading, constraints, time-varying forecasts, etc.
Following Mossin (1968), Samuelson (1969), and Merton (1969), the literature
on multi-period portfolio selection is predominantly based on dynamic program-
ming, which properly takes into account the idea of recourse and updated infor-
mation available as a sequence of trades is chosen (see Gârleanu and Pedersen
2013, and references therein). Unfortunately, actually carrying out dynamic
programming for trade selection is impractical, except for some very special or
small cases, due to the curse of dimensionality (Bellman 1956, Boyd et al. 2014).
Herzog et al. (2007) and Boyd et al. (2014) proposed to solve the stochastic
control problem of DAA using model predictive control (MPC). The idea is to
control a portfolio based on forecasts of asset returns and relevant parameters.
Every day a decision is made whether or not to change the current portfolio
allocation, knowing that the decision will be reconsidered the next day with
new input. Possible benefits from changing allocation are traded off against
risks and costs.
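
A stripped-down sketch of a single MPC step, written with CVXPY (the optimization library used in this study), is given below. The horizon length, risk-aversion and trading-cost parameters, and the long-only, fully invested constraints are illustrative choices, not the specification used in papers G through I.

    import cvxpy as cp
    import numpy as np

    def mpc_step(w0, mu_hat, Sigma_hat, H=5, gamma_risk=5.0, gamma_trade=1.0):
        """One model predictive control step for a long-only, fully invested portfolio.
        w0:        current weights, shape (n,)
        mu_hat:    forecast mean returns over the horizon, shape (H, n)
        Sigma_hat: forecast return covariance, shape (n, n), held fixed over the horizon
        Only the first planned trade is returned; the problem is re-solved when
        new forecasts become available."""
        n = len(w0)
        w = cp.Variable((H + 1, n))
        objective = 0
        constraints = [w[0] == w0]
        for t in range(H):
            trade = w[t + 1] - w[t]
            objective += (mu_hat[t] @ w[t + 1]
                          - gamma_risk * cp.quad_form(w[t + 1], Sigma_hat)
                          - gamma_trade * cp.norm1(trade))   # proportional transaction cost
            constraints += [cp.sum(w[t + 1]) == 1, w[t + 1] >= 0]
        cp.Problem(cp.Maximize(objective), constraints).solve()
        return w.value[1] - w0
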
MPC constitutes a promising alternative to the static decision rules that dom-
inate the literature on RBAA following Ang and Bekaert (2002). It offers an
optimal implementation conditional on the choice of hyperparameters, which
all have a clear interpretation as transaction costs, holding costs, etc. Further,
there are computational advantages to using MPC in cases when forecasts are
updated every time a new observation becomes available, since the optimal
control actions are reconsidered anyway. If asset returns are forecasted using a
regime model, significant allocation changes will still occur in response to regime
shifts.

3 Increased risk aversion (a behavioral explanation) and reduced risk capacity (an institutional explanation) are difficult to distinguish in data. Both effects have support (e.g., Cohn et al. 2015, Brunnermeier and Pedersen 2009).
MPC is a mathematically tractable approach to multi-period portfolio optimiza-
tion. For this approach to be useful in practical applications, it is crucial that
the problem can be quickly solved. It does not matter whether it has a closed-
form solution. Similarly, it should be easy to handle all costs and constraints
that are practically important in portfolio management. Following Markowitz
(1952), the goal is to find an optimal tradeoff between risk and return, where
risk can be variance, maximum drawdown, or something else. It is essential
that the risk function measures the risk that the investor is concerned with and
that its minimization is feasible, not what kind of utility function justifies it or
what assumptions are implicitly made about the return distribution (Markowitz
2014).
The mean–variance criterion is the most commonly used in portfolio selection
(Kolm et al. 2014). Unlike this quadratic measure, alternatives exist that only
penalize down-side risk; popular choices are value-at-risk and conditional value-
at-risk focused on a small fraction of the worst possible outcomes (see, e.g.,
Rockafellar and Uryasev 2000, Bertsimas et al. 2004). However, risk measures
based on a narrow part of the distribution are more susceptible to forecast un-
certainty, causing portfolios resulting from their minimization to be less robust
(Lim et al. 2011, Downing et al. 2015). A better alternative for investors con-
cerned with tail risk is drawdown control, which, unlike tail-risk minimization,
prevents a portfolio from losing more than a given limit.
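
For reference, drawdown is the loss relative to the running maximum of the portfolio value; a minimal sketch:

    import numpy as np

    def drawdown(wealth):
        """Drawdown path and maximum drawdown of a series of portfolio values."""
        wealth = np.asarray(wealth, dtype=float)
        running_max = np.maximum.accumulate(wealth)
        dd = 1.0 - wealth / running_max
        return dd, dd.max()

    # A drawdown-control strategy keeps dd below a chosen limit, for example 10%,
    # by scaling down the allocation to risky assets as the limit is approached.
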

1.5 Research questions

1. Previous studies have achieved a better reproduction of financial returns'
stylized behavior by considering other sojourn-time and conditional distributions
than the Gaussian HMM. Can this be explained by the parameters
changing over time, and is it possible to achieve a better reproduction by
taking this into account?
2. How many regimes are needed, and are they recurring?
3. Can the returns from other asset classes/risk factors be described by a
multivariate regime-switching model?
4. Previous studies on RBAA have obtained significantly better results in
sample than out of sample. Is it possible to increase the robustness by
basing a dynamic strategy on a model with time-varying parameters in
order to obtain similarly good results out of sample?
5. Is it profitable to change asset allocation based solely on identified regimes?
6. How do model predictions optimally translate into a DAA strategy?
7. In continuation of the previous question, what role do transaction and
holding costs play?
8. Is it possible to realize higher risk-adjusted returns by including other
asset classes than stocks and cash?

In this study, we use R for statistical modeling and forecasting (R Core Team
2017). For optimization, we use Python and the library CVXPY (Diamond and
Boyd 2016).

1.6 Thesis outline


This thesis consists of a collection of nine papers written during the course of my
Ph.D. study. Each paper is included as a separate appendix: paper A through
paper I. Before that, in the next chapter, I will go through each paper individu-
ally and summarize its main findings and contribution in order to emphasize the
common thread through the study. Chapter 3 concludes the study by providing
answers to the research questions posed in section 1.5.
CHAPTER 2
Contribution

In this chapter, I will go through each paper individually and summarize its
main findings and contribution. I will describe the process that led to each
paper, including some of the work that did not get included in the final version.
In addition, I will discuss possibilities for future work. The chapter assumes that
an introduction to the different topics and the existing literature has already
been given; thus, it is highly recommended to read chapter 1 before delving into
this chapter.
Research, and publication of research results in particular, is a highly nonlinear
process. The papers are, therefore, not arranged in chronological order based
on when the work was done or the time of publication. Rather, I have arranged
them in order to emphasize the common thread in the work.

2.A Stylized facts of financial time series and hidden Markov models in continuous time
Published in Quantitative Finance
Subsequent papers have extended the Gaussian HMM studied by Rydén et al.
(1998) in order to improve its ability to reproduce the slow decay of the autocorre-
lation function that is characteristic for long series of squared and absolute daily
returns. Paper A contributes to this literature by proposing a continuous-time
formulation as a flexible alternative to the dominating discrete-time models.
A major limitation of discrete-time HMMs and HSMMs is the quadratic increase
in the number of parameters with the number of states. This limitation does
not apply to continuous-time HMMs, as it can reasonably be assumed that the
only possible transitions in an infinitesimally short time interval are to the neigh-
boring states. This assumption leads to a linear rather than quadratic increase
in the number of parameters with the number of states and, consequently, a
significant reduction in the number of parameters for higher-order models.
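
A small sketch makes the parameter count explicit: with the neighbor-only (tridiagonal) structure, an m-state continuous-time chain requires 2(m-1) transition rates, whereas a general generator has m(m-1) off-diagonal entries. The rates below are placeholders, not estimates from the paper.

    import numpy as np
    from scipy.linalg import expm

    def tridiagonal_generator(up_rates, down_rates):
        """Generator matrix of a continuous-time Markov chain in which the only
        possible transitions are to neighboring states."""
        m = len(up_rates) + 1
        Q = np.zeros((m, m))
        for i in range(m - 1):
            Q[i, i + 1] = up_rates[i]      # rate of moving to the next state up
            Q[i + 1, i] = down_rates[i]    # rate of moving to the next state down
        np.fill_diagonal(Q, -Q.sum(axis=1))
        return Q

    # Four states with placeholder rates; P_dt is the implied transition
    # probability matrix over one observation interval.
    Q = tridiagonal_generator(up_rates=[0.02, 0.05, 0.10], down_rates=[0.03, 0.08, 0.20])
    P_dt = expm(Q * 1.0)
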
We find that the possibility to increase the number of states leads to a better
fit to both the distributional and temporal properties of daily log-returns of
the S&P 500 and FTSE indices from 1993 to 2013. Specifically, we find that
a continuous-time HMM with four states provides a better fit to the autocorre-
lation function of the squared returns than discrete-time, two- and three-state
HMMs and HSMMs with conditional Gaussian and t distributions (see, e.g., fig-
ure 4 on page 48). This is also the case when restraining the impact of outliers
in order to reduce the amount of noise in the empirical autocorrelation function
(see, e.g., figure 7 on page 50).
In the paper, we emphasize the possibility of approximating any positive-valued
sojourn-time distribution with arbitrary precision by introducing dummy states
that are indistinguishable from one or more of the original states.4 We, however,
do not find any indication that the memoryless property of the sojourn-time dis-
tribution is inconsistent with the long-memory property of the squared returns.
There are other advantages of a continuous-time formulation that we do not
explore in the paper; most importantly the possibility to incorporate temporal
inhomogeneity without a dramatic increase in the number of parameters and
the flexibility to use data that is not (assumed to be) equidistantly sampled.

2.B Long memory of financial time series and hidden Markov models with time-varying parameters
Published in the Journal of Forecasting
Paper B contributes to the same strand of literature proposing extensions to the
Gaussian HMM in order to improve its ability to reproduce the long memory
of volatility. As an alternative to increasing the model complexity, for example
by adding more states or considering other sojourn-time or conditional distri-
butions, we propose a recursive maximum-likelihood estimation approach that
allows for the parameters of the estimated model to be time varying. Adaptiv-
ity in time is achieved with exponential forgetting of past observations. It is
implicitly assumed that the time scales of the regime changes and the variations
of the model parameters are separable.
This fundamentally different approach was an important part of my Master’s
thesis (Nystrup 2014). The time variation is observation driven based on the
score function of the predictive likelihood function. A disadvantage of HMMs
is that the score function must consider all previous observations and cannot
reasonably be approximated by the score function of the latest observation, as
is often done for other models (Khreich et al. 2012). This leads to a significant
increase in computational complexity.

4 This is also possible in discrete time, as shown by Langrock and Zucchini (2011).
We find that the parameters of the estimated models vary significantly over time
(see, e.g., figure 5 on page 76), in agreement with the findings by Rydén et al.
(1998) and Bulla (2011). By taking this variation into account, we show that
a Gaussian HMM with only two states is able to reproduce the long memory
of squared daily returns of the S&P 500 index from 1928 to 2014 (figure 6 on
page 77). Rydén et al. (1998) considered this stylized fact to be the most difficult
fact to reproduce with an HMM. The adaptively-estimated model provides a
particularly good fit to the autocorrelation function of the squared daily returns
when the impact of exceptional observations is reduced.
Fast adaptation to parameter changes leads to improved one-step density fore-
casts. The choice of memory length affects the parameter estimates and can be
viewed as a tradeoff between bias and variance. A shorter memory yields a faster
adaptation to changes but a more noisy estimate, as fewer observations are used
for the estimation. We find that by using exponential forgetting, the effective
memory length can be reduced compared to when using a sliding-window ap-
proach. Exponential forgetting is also more meaningful, because it assigns most
weight to the most recent observations and, at the same time, does not exclude
observations from the estimation from one time step to the next. It can be seen
as a continuous mixture of sliding windows of different lengths.
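
The difference between the two weighting schemes can be made concrete with a short sketch; the forgetting factor below is an illustrative value, not the one selected in the paper.

    import numpy as np

    lam = 0.995                    # forgetting factor (illustrative)
    n_eff = 1.0 / (1.0 - lam)      # effective memory length, here 200 observations

    k = np.arange(1000)                       # lag 0 is the most recent observation
    w_exponential = (1 - lam) * lam ** k      # exponential-forgetting weights, sum to about 1
    w_window = np.where(k < int(n_eff), 1.0 / n_eff, 0.0)   # sliding window of the same length

    # The exponential weights emphasize recent observations without ever dropping
    # an observation abruptly between time steps, whereas the sliding window gives
    # equal weight to the last n_eff observations and zero weight beyond that.
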
A two-state HMM with a t-distribution in the high-variance state has the high-
est predictive log-likelihood (see table 8 on page 79). When leaving out the 20
observations that make the most negative contributions to the log-likelihood,
however, the adaptively-estimated Gaussian HMM outperforms the other mod-
els and estimation approaches considered. This is less than 0.1% of the total
number of observations. Thus, while the t-distribution is a better fit to the most
exceptional observations, the adaptively-estimated Gaussian HMM provides the
best one-step-ahead density forecasts for the remainder of the sample. We show
that the forecasting performance can be further improved using local smoothing
to forecast the parameter variations. The largest improvement is obtained for
the 20 observations that are most difficult to forecast.
The adaptively-estimated HMM combines abrupt regime changes with smooth
variations in the underlying parameters. We have spent some time looking into
variable forgetting (Fortescue et al. 1981, Uosaki et al. 1996, Cooper and Worden
2000), which is a natural extension of the fixed-parameter forgetting approach
proposed in this paper. The idea is to reduce the memory following a decay in
forecasting accuracy in order to quickly adapt the model parameters to a new
setting. On the contrary, if the forecasting error is small, then the memory is
increased in order to get stable estimates. The method of Fortescue et al. (1981)
is not directly applicable to a regime-switching model, because of the need to dis-
tinguish between parameter changes caused by a regime change and parameter
changes in the underlying model—that is, the forecasting performance naturally
changes whenever the regime changes.
We have also contemplated the possibility of formulating a model for the pa-
rameter changes in the form of a hierarchical model, possibly including relevant
exogenous variables (see, e.g., Ang and Bekaert 2004). The proposed method
for estimating the time variation of the parameters is an important step toward
the identification of a hierarchical model structure. Another possibility for fu-
ture work is to further develop the proposed estimation approach and compare
it to approaches that have been proposed in areas other than finance (see Khre-
ich et al. 2012, and references therein); for example, it would be interesting to
explore the impact of adding a regularization term to the likelihood objective as
a means of introducing prior information (see, e.g., Johansen 1997, Pinson and
Madsen 2012).

2.C Regime-based versus static asset allocation: Letting the data speak
Published in The Journal of Portfolio Management
Paper C presents an application of the recursive, adaptive estimation approach
and the model from the previous paper to regime-based asset allocation. We
focus on the Gaussian HMM with two regimes for several reasons: first, because
we know from the previous paper that this model is able to reproduce the long
memory of volatility; second, because increasing the number of regimes makes
it more difficult to distinguish between them out of sample; and third, because
more regimes lead to more transitions between regimes and, consequently, higher
portfolio turnover and transaction costs.
The adaptive estimation approach is a new contribution to the literature on
RBAA. Previously, Bulla et al. (2011) were the only ones to apply an HMM
with time-varying parameters to asset allocation, but they used a simple sliding-
window estimation approach. The adaptive, online approach ensures robustness
in order to prevent large differences between in- and out-of-sample performance,
as observed in some previous studies. Instead of the median filter used by Bulla
et al. (2011), we apply a filter based on the inferred probability to decode the
hidden market regimes.
We identify regimes in daily returns of a global stock index and, based on this,
allocate the entire portfolio between the stock index in the low-volatility regime
and a global government bond index in the high-volatility regime. We do this
over a 20-year period from 1994 to 2013 and assume a one-day delay in the
implementation of allocation changes. The identified regimes seem intuitive
when looking at the log-returns at the bottom of figure 4 on page 93, although
the result is different from what would be expected if the regimes were based
on a business cycle indicator.
The performance of the RBAA strategy is good compared to that of a static
portfolio with the same average asset allocation, but the difference is mainly
due to the financial crisis in 2008. The break-even transaction cost is more than
200 basis points per one-way transaction. In other words, a simple switching
rule based on the identified regimes alone is profitable. A long–short strategy
based on the stock index is not as successful.
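
The switching rule itself can be expressed in a few lines. The sketch below backtests it with a one-day implementation delay and a proportional cost charged per allocation switch; the cost level is a placeholder, and the inputs are assumed to be daily simple returns together with a boolean low-volatility regime indicator.

    import numpy as np

    def regime_switch_backtest(stock_ret, bond_ret, in_low_vol, cost_per_switch=0.001):
        """Hold the stock index while the inferred regime is low volatility and the
        bond index otherwise, with a one-day delay and a cost per switch."""
        signal = np.roll(np.asarray(in_low_vol, dtype=bool), 1)   # one-day delay
        signal[0] = signal[1]                                     # no prior signal on the first day
        strategy_ret = np.where(signal, stock_ret, bond_ret)
        switches = np.abs(np.diff(signal.astype(float)))          # 1.0 on days the allocation changes
        strategy_ret[1:] -= switches * cost_per_switch
        return strategy_ret
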

2.D Dynamic allocation or diversification: A regime-based approach to multiple assets
Published in The Journal of Portfolio Management
Paper D is an extension of the approach from the previous paper to a multi-
asset portfolio. The potential benefit from taking large positions in a few assets
at a time comes at the cost of reduced diversification. In order to analyze this
tradeoff, we compare the performance of RBAA to a static benchmark in a more
comprehensive asset universe, because the potential for diversification is limited
by the size of the asset universe. This analysis is the main contribution of the
paper.
The market indices included in the study are carefully chosen to mimic the
major liquid asset classes typically considered by an institutional investor. These
include developed market (DM) and emerging market (EM) stocks and high-
yield bonds, listed DM real estate, gold, oil, corporate bonds, inflation-linked
bonds, and government bonds. The data period is 1997 through 2015.
We first considered deriving and modeling a few central risk factors in order to
reduce the complexity of the modeling task and develop a scalable approach—
not to uncover a hidden potential for diversification (see Cocoma et al. 2017).
We analyzed the correlation structure and its principal components and did not
find any stationary structure. Both the eigenvalues and eigenvectors change over
time, and the changes are larger than what can be explained by measurement
noise (Fenn et al. 2011, Allez and Bouchaud 2012).
Instead, we define risk–on and risk–off portfolios based on each asset class's corre-
lation with DM stocks. We choose the benchmark allocation to mimic a 60/40
long-only SAA portfolio of an institutional investor. This is a much simpler
approach than carrying out an actual portfolio optimization, which should in-
crease its robustness. We use the same model and estimation approach as in the
previous paper (which originates from paper B) to switch between the risk–on
and risk–off portfolios based on regimes in the daily returns of DM stocks.
A second contribution of the paper is to introduce a new way of decoding hidden
market regimes that is based on Narasimhan et al. (2006). The essence of the
algorithm is that the initial state becomes increasingly certain as more obser-
vations are included in the sequence and the latency increases. It accumulates
evidence online until the certainty estimate reaches a threshold, after which the
identified initial state is outputted and the algorithm proceeds to estimate the
next state in the sequence. Despite being very intuitive, the algorithm has never
been applied in studies of RBAA.
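
A simplified sketch of the evidence-accumulation idea is given below; it tracks the posterior probability of the first undecoded state as more observations arrive and emits that state once the probability exceeds a threshold. It illustrates the principle only and is not the exact algorithm of Narasimhan et al. (2006) or the filter used in the paper.

    import numpy as np
    from scipy.stats import norm

    def decode_initial_state(y, pi, P, mu, sigma, threshold=0.99):
        """Online decoding of the first undecoded state of a Gaussian HMM.
        y:  observations from the first undecoded time point onward
        pi: probability distribution of the first undecoded state
        P, mu, sigma: transition matrix and conditional Gaussian parameters
        Returns the decoded state and the latency (number of extra observations
        needed before the certainty threshold was reached)."""
        # joint[i, j] is proportional to P(first state = i, current state = j, data so far)
        joint = np.diag(pi * norm.pdf(y[0], mu, sigma))
        for latency, obs in enumerate(y[1:], start=1):
            joint = joint @ (P * norm.pdf(obs, mu, sigma))  # propagate one step, weight by new observation
            joint /= joint.sum()                            # renormalize to avoid underflow
            marginal = joint.sum(axis=1)
            if marginal.max() >= threshold:
                return int(marginal.argmax()), latency
        marginal = joint.sum(axis=1)
        return int(marginal.argmax()), len(y) - 1
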
We find that the annualized standard deviation is minimized when half of the
portfolio is allocated to the RBAA strategy, and the risk-adjusted return is
maximized when the allocation to RBAA reaches 80% (see figure 6 on page 112).
Compared to the 60/40 SAA portfolio, the RBAA portfolios have higher risk-
adjusted returns and suffer much smaller maximum drawdowns. Compared
to the median-filter approach used by Bulla et al. (2011), the online decoding
approach leads to a higher risk-adjusted return and a lower turnover.

2.E Detecting change points in VIX and S&P 500: A new approach to dynamic asset allocation
Published in the Journal of Asset Management
The starting point for paper E is a belief that a fixed number of recurring regimes
is too simplistic a model to represent the dynamics of financial returns, although
the regimes are not really recurring when the underlying parameters change
over time. Its contribution is to propose a new approach to DAA that is based
on detection of change points without fitting a model with a fixed number of
regimes to the data, without estimating any parameters, and without assuming
a specific distribution of the data.
Most traditional approaches to change detection assume that the distributional
form of the data is known before and after a change with only the parameters
being unknown (see, e.g., Page 1954, Roberts 1959, Siegmund and Venkatraman
1995, Gustafsson 2000). However, this assumption rarely holds in sequential
applications. Typically, there is no prior knowledge of the true distribution
or assumptions made about the distribution may be incorrect; for example,
assuming the data is Gaussian will cause occasional large values to be interpreted
as change points, even though they should more correctly be classified as extreme
values (Ross 2013).
We apply the nonparametric (distribution-free) change-detection approach of
Ross et al. (2011), which does not assume that anything is known about the
distribution of the data before monitoring begins. We examine whether to test
for location, scale or more general distributional changes and whether changes
in Chicago Board Options Exchange’s Volatility Index (VIX) or change points
detected in daily returns of the S&P 500 index from 1990 to 2015 provide the
most profitable signal.
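As a rough illustration of the distribution-free flavor of such tests, the sketch below scans a window of observations for a change in scale using a Mood-type rank statistic. The threshold value and function name are assumptions made here for illustration, and the sequential procedure of Ross et al. (2011) additionally controls the rate of false alarms through the average run length.

```python
import numpy as np

def mood_change_point(x, h=3.0):
    """Scan a window for a change in scale (dispersion). For every candidate
    split k, the Mood rank statistic of the first k observations is
    standardized; the split with the largest absolute value is flagged as a
    change point if it exceeds the threshold h (ties are ignored for
    simplicity, and h is purely illustrative)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = x.argsort().argsort() + 1            # ranks 1..n
    dev2 = (ranks - (n + 1) / 2.0) ** 2          # squared rank deviations
    stats = []
    for k in range(2, n - 1):
        m = dev2[:k].sum()                       # Mood statistic of first k obs.
        mean = k * (n * n - 1) / 12.0
        var = k * (n - k) * (n + 1) * (n * n - 4) / 180.0
        stats.append(abs(m - mean) / np.sqrt(var))
    stats = np.array(stats)
    k_best = int(stats.argmax()) + 2
    return (k_best, stats.max()) if stats.max() > h else (None, stats.max())
```

Applied to a window of VIX log-returns, a large standardized statistic points to a shift in the scale (volatility) of the series.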

We consider the VIX because of its forward-looking nature. It essentially offers a


market-determined, forward-looking estimate of one-month stock market volatil-
ity and is regarded as an indicator of market stress (Whaley 2000). It imbeds
many properties of other macroeconomic and financial uncertainty measures, as
shown by Racicot and Théoret (2016). Moreover, there is a significantly nega-
tive relationship between changes in the VIX and contemporaneous returns of
the S&P 500 index, as depicted in figure 1 on page 123.

When a change point is detected, the only knowledge of the new regime is the
observations encountered between the change point and the time of detection.
Unlike the previous two papers, where the asset allocation was determined by
the inferred regime, there is no natural strategy for changing allocation when a
change point is detected. We find that simple switching strategies perform better
than strategies where the allocation to the stock index is a linear function of the
estimated volatility, despite the fact that the volatility assumes many different
levels in the detected regimes.

The best performing strategy is a switching strategy that is fully invested in


the S&P 500 index in the low-volatility state and cash in the high-volatility
state, based on detected changes in the scale parameter of log-returns of the
VIX. This strategy outperforms both the S&P 500 index and a static portfolio
with the same average allocation to the stock index in terms of both realized
and risk-adjusted return, and it has a significantly lower tail risk (see figure 12
on page 136). Due to the assumption of zero interest on cash positions, there
is no other source of performance than the index. We show that a similar risk-
adjusted return can be obtained by selling short-term VIX futures instead of
buying the S&P 500 index in the low-volatility state.

It is left for future research to generalize the nonparametric change-detection


approach of Ross et al. (2011) to a multivariate setting. Change detection is
a compelling alternative to a regime-switching model, because it makes fewer
assumptions about the data. It captures the same abrupt regime changes, but
provides a more flexible representation of the dynamics of financial returns. The
disadvantage is that there is no prior knowledge of the new dynamics when a
change point is detected. After a change point, when important allocation
decisions are made based on forecasts of future dynamics, this is done based on
very limited information.
2.F Greedy Gaussian segmentation of multivariate time series
To appear in Advances in Data Analysis and Classification
Before discussing its contribution, I will describe the work that led to paper F.
We began by adapting the ℓ1 trend filtering method proposed by Kim et al.
(2009) to produce estimates that are piecewise constant, making it suited to
analyzing time series that are subject to level shifts. A regularization parameter
was used to control the tradeoff between smoothness (or number of changes) of
the estimated level and size of the residuals, somewhat similar to the average run
length in change detection (see, e.g., Gustafsson 2000). It can be thought of as an
optimization-based approach to change detection and time series segmentation.
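A minimal sketch of this piecewise-constant formulation, written with CVXPY, is shown below; the ℓ1 penalty on first differences corresponds to the regularization parameter described above, and the function name is illustrative.

```python
import cvxpy as cp

def piecewise_constant_level(y, lam):
    """Total-variation (piecewise-constant) level estimation:
    minimize ||y - x||_2^2 + lam * ||D x||_1, where D takes first differences.
    A larger lam yields fewer level shifts; a smaller lam yields more."""
    x = cp.Variable(len(y))
    objective = cp.sum_squares(y - x) + lam * cp.norm1(cp.diff(x))
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value
```

Sweeping the regularization parameter traces out the tradeoff between the number of detected level shifts and the size of the residuals, and reweighting the ℓ1 penalty, as in Candès et al. (2008), can sharpen the detected shifts.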
We tested this on daily log-returns of the S&P 500 and their squared values and
experimented with using iterated reweighting to improve the results (Candès
et al. 2008). We found that there was much more information in the volatility
than in the returns themselves, in agreement with the finding in the previous
paper.
Kim et al. (2009) described how the basic ℓ1 trend (level) estimation method
can be generalized to handle multivariate time series data by using the sum
of the ℓ2 norms of the second (first) differences as the measure of smoothness.
Compared to estimating levels separately in each time series, this formulation
couples together changes in the levels of individual entries at the same time
index, so the level component found tends to show simultaneous changes.5 The
multivariate time series could be simultaneous returns of the FTSE and S&P
500 indices or returns of the S&P 500 and their squared values.
A difficulty with this method is scaling the relative importance of changes in
mean and volatility or one time series relative to another; however, this is exactly
what the Gaussian density does. Further, by assuming a multivariate Gaussian
distribution, we can benefit from information in the correlation structure when
identifying regimes (Münnix et al. 2012, Chetalova et al. 2015). Similar to the ℓ1
and ℓ2 trend filtering problems, Gaussian segmentation is a convex optimization
problem, yet there is no available software that can solve it.
We formulate this as a covariance-regularized maximum-likelihood problem,
which can be reduced to a combinatorial optimization problem of searching
over the possible breakpoints. This problem is in general difficult to solve glob-
ally. The contribution of paper F is to propose an efficient heuristic method
that approximately solves it, and always yields a locally optimal choice, in the
sense that no change of any one breakpoint improves the objective. Our imple-
mentation is made available in a Python software package.6

5 The idea behind this penalty is used in the group lasso (Yuan and Lin 2006).
6 Available at https://github.com/cvxgrp/GGS.
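A stripped-down version of the greedy idea is sketched below: breakpoints are added one at a time, each time choosing the split that most increases the regularized Gaussian log-likelihood. The regularization form, parameter values, and brute-force search are simplifications for illustration; the released package also adjusts existing breakpoints until no single change improves the objective and is far more efficient.

```python
import numpy as np

def segment_loglik(X, lam):
    """Regularized Gaussian log-likelihood of one segment (up to constants)."""
    T, n = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + lam * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * T * logdet

def greedy_segmentation(X, n_breaks, lam=1e-4):
    """Add `n_breaks` breakpoints greedily, assuming segments stay long enough
    for the covariance estimates to be meaningful."""
    breaks = [0, len(X)]
    for _ in range(n_breaks):
        best_gain, best_t = -np.inf, None
        for a, b in zip(breaks[:-1], breaks[1:]):
            base = segment_loglik(X[a:b], lam)
            for t in range(a + 2, b - 2):        # candidate split points
                gain = (segment_loglik(X[a:t], lam)
                        + segment_loglik(X[t:b], lam) - base)
                if gain > best_gain:
                    best_gain, best_t = gain, t
        breaks = sorted(breaks + [best_t])
    return breaks
```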
We illustrate the approach with four different examples. First, we do a small
example with daily log-returns from 1997 to 2015 for a stock, bond, and oil
price index, where we plot the means, variances, and correlations in figure 4 on
page 159. This is a great alternative to using a sliding window to illustrate the
dynamics. Second, we emphasize the scalability of our method by considering
a larger example with the 309 companies in the S&P 500 index that have been
publicly listed for the entire 19-year period from before. Our method is quite
efficient and easily scales to problems with vectors of dimension over 1,000 and
time series of arbitrary length. Third, we look at an example from the field of
natural language processing to illustrate how our method can be applied to a
different type of dataset. Our method correctly identifies both the breakpoint
locations and the general topic of each segment in data obtained by concate-
nating the introductions from five different Wikipedia articles (see figure 9 on
page 164). Fourth, we show that our method is more robust than a left-to-right
HMM using a synthetic example where observations are generated from a known
sequence of segments.

2.G Dynamic portfolio optimization across hidden market regimes
Published in Quantitative Finance

Paper G contributes to the literature on dynamic asset allocation by combining


forecasts from an HMM with a multi-period optimization approach to asset
allocation based on model predictive control. It is a promising alternative to
the static decision rules that dominate the literature, including paper C and
paper D (see also Ang and Bekaert 2004, Guidolin and Timmermann 2007,
Bulla et al. 2011, Kritzman et al. 2012). Fornaciari and Grillenzoni (2017)
proposed to dynamically optimize the parameters of the decision rule, but this
led to problems with overfitting and poor performance. MPC, on the other hand,
offers an optimal implementation, conditional on the choice of hyperparameters,
which all have a clear interpretation as transaction costs, holding costs, etc.

The benefit of MPC for multi-period portfolio optimization was proposed by


Herzog et al. (2007), Meindl and Primbs (2008), and Boyd et al. (2014). The
idea is to control a portfolio based on forecasts of asset returns and relevant
parameters. Every day a decision is made whether or not to change the current
portfolio allocation, knowing that the decision will be reconsidered the next
day with new input. Possible benefits from changing allocation are traded off
against risks and costs. A sequence of trades is planned, but only the first one
is executed. There are computational advantages to using MPC in cases when
estimates of future return statistics are updated every time a new observation
becomes available, since the optimal control actions are reconsidered anyway.
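The sketch below illustrates the receding-horizon structure for a long-only portfolio: a sequence of weights is planned over the forecast horizon, but only the first trade is returned. The constraint set, cost terms, and parameter names are simplified assumptions, not the exact formulation used in the paper.

```python
import cvxpy as cp

def mpc_trade(w, r_hat, Sigma_hat, gamma=5.0, kappa=1e-3):
    """Plan H periods of portfolio weights and return only the first trade.

    w          current weights (n,), summing to one
    r_hat      forecast mean returns, shape (H, n)
    Sigma_hat  forecast covariances, shape (H, n, n), assumed positive semidefinite
    gamma      risk-aversion parameter
    kappa      proportional transaction-cost parameter
    """
    H, n = r_hat.shape
    W = cp.Variable((H, n))
    prev, objective, constraints = w, 0, []
    for t in range(H):
        wt = W[t]
        objective += (r_hat[t] @ wt
                      - gamma * cp.quad_form(wt, Sigma_hat[t])
                      - kappa * cp.norm1(wt - prev))
        constraints += [cp.sum(wt) == 1, wt >= 0]   # fully invested, long only
        prev = wt
    cp.Problem(cp.Maximize(objective), constraints).solve()
    return W.value[0] - w     # only the first planned trade is executed
```

At the next time step the forecasts are refreshed and the whole plan is recomputed, which is why only the first trade ever needs to be carried out.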
The implementation is based on forecasts from the same HMM with time-varying
parameters estimated using the recursive, adaptive approach that was presented
in paper B and applied in paper C and paper D. It is the first time that we use
this model for forecasting returns and not just distinguishing between market
regimes. Although change detection, trend filtering, and Gaussian segmentation
are great tools for analysis, we find them to be less useful for forecasting.
We test this on various major stock market indices, one at a time, using daily
data from 1984 to 2015. Cash positions are assumed to be risk-free and yield zero
interest; hence, the only source of performance is the stock indices. The MPC
approach realizes a higher risk-adjusted return than a buy-and-hold investment
in five out of six indices—regardless of whether risk is measured by standard
deviation or maximum drawdown—and has a higher annualized return than
the underlying index in four out of six cases, cf. table 10 on page 197. We
experiment with the impact of an additional ℓ1 trading penalty and find that it
increases the robustness of the approach.

2.H Multi-period trading via convex optimization


Published in Foundations and Trends in Optimization
Paper H presents a continuation of the MPC idea from Boyd et al. (2014) that
was utilized in the previous paper. We consider a basic model of multi-period
trading, which can be used to evaluate the performance of a trading strategy.
We describe a framework for single-period optimization, where the trades in
each period are found by solving a convex optimization problem that trades
off expected return, risk, transaction and holding cost. We then describe a
multi-period version of the trading method, where optimization is used to plan
a sequence of trades, with only the first one executed, using estimates of future
quantities that are unknown when the trades are chosen.
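For the single-period problem, this trade-off can be written directly as a convex program. The sketch below is a bare-bones version with one transaction-cost term and one holding-cost term; the self-financing condition is simplified to trades summing to zero, and all names and values are illustrative rather than the paper's formulation.

```python
import cvxpy as cp

def single_period_trades(w, r_hat, Sigma_hat, gamma=5.0, kappa=1e-3, s=5e-4):
    """Choose trades z that maximize expected return net of risk, a
    proportional transaction cost, and a holding cost on short positions."""
    n = len(w)
    z = cp.Variable(n)
    w_new = w + z
    ret = r_hat @ w_new                        # expected portfolio return
    risk = gamma * cp.quad_form(w_new, Sigma_hat)
    tcost = kappa * cp.norm1(z)                # proportional transaction cost
    hcost = s * cp.sum(cp.neg(w_new))          # borrow cost on short positions
    prob = cp.Problem(cp.Maximize(ret - risk - tcost - hcost),
                      [cp.sum(z) == 0])        # simplified self-financing
    prob.solve()
    return z.value
```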
Our contribution is to describe the single- and multi-period methods in one
simple framework, giving a careful description of the development and the ap-
proximations made, and discussing how convex optimization can be used in
multi-period trading. The methods can be thought of as good ways to exploit
forecasts, no matter how they are made. It is a better alternative to the more or
less ad-hoc trading strategies that are sometimes used to evaluate new return-
prediction models.
Our focus is not on theoretical issues, but on practical ones that arise in multi-
period trading. The advantages of a multi-period approach include the ability to
properly model transaction and holding costs and take into account differences
in short- and long-term forecasts. We develop a basic dynamic model of trading
that describes how a portfolio and associated cash account change over time,
due to trading, investment gains, and various costs associated with trading
and holding portfolios. We present realistic models of transaction and holding
costs along with examples of many different constraints that arise in practical
investment management and can easily be included.
The optimization-based trading methods we describe are practical and reliable
when the problems to be solved are convex (Boyd and Vandenberghe 2004).
Vast increases in computing power have changed how optimization can be used
in investing. In particular, it is now possible to quickly run full-blown, multi-
period optimizations and search over hyperparameters in backtests. Real-world
single-period convex problems with thousands of assets can be solved using
generic algorithms in well under a second, which is critical for evaluating a
proposed algorithm with historical or simulated data, for many values of the
parameters in the method.
At the end of the paper we discuss computation times as well as possibilities for
speedup for a few examples based on daily data for the constituents of the S&P
500 index from 2012 through 2016. The examples illustrate the usefulness of
the approach and the potential for exploiting the power of convex optimization
in multi-period trading and portfolio optimization.
Having established a well-functioning framework for utilizing and assessing the
value of multi-step forecasts, it is left for future research to develop these. Many
of the ideas and methods described in the paper are implemented in a companion
open-source Python package.7 Hopefully this will be useful in future research—
in academia as well as in industry—when evaluating the performance of return-
prediction models.

2.I Multi-period portfolio selection with drawdown control
To appear in Annals of Operations Research
Paper I implements a specific case of the methods from the previous paper, with
an additional mode that controls for drawdown by adjusting the risk aversion
based on realized drawdown. By proposing a practical approach to drawdown
control, we provide evidence of the theoretical link between dynamic asset allo-
cation and drawdown control. We disprove the common conception that mean–
variance optimization cannot be consistent with a focus on controlling tail risk.
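One way to picture the mechanism is a risk-aversion parameter that grows as the realized drawdown approaches a tolerated maximum, as in the hypothetical helper below; the functional form, the 15% limit, and the exponent are assumptions for illustration only.

```python
def adjusted_risk_aversion(gamma0, value, high_water_mark, max_dd=0.15, power=2.0):
    """Scale the base risk aversion up as the realized drawdown approaches a
    tolerated maximum, so the optimizer de-risks before the limit is hit."""
    drawdown = 1.0 - value / high_water_mark       # realized drawdown in [0, 1)
    headroom = max(1e-6, 1.0 - drawdown / max_dd)  # remaining drawdown budget
    return gamma0 / headroom ** power
```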
We demonstrate the approach using data from 1999 through 2016 for the same
ten indices that we considered in paper D.8 These indices are chosen to mimic
the major liquid asset classes typically considered by an institutional investor.
Compared to paper G, it is an extension from one to ten assets. In connection
with paper D, we attempted to derive and model central risk factors without
success, so this time we focus on a multivariate model of all ten indices.
7 Available at https://github.com/cvxgrp/cvxportfolio.
8 The only difference is that the LBMA Gold Price is substituted for the S&P GSCI Gold index.
We contemplated an approach similar to Fortescue et al. (1981) with a memory
length that is adjusted based on forecasting accuracy, as an alternative to using
regime shifts to capture abrupt parameter changes. This would overcome the
issue with distinguishing between parameter changes caused by a regime shift
and parameter changes in the underlying model. Taking this idea one step fur-
ther, the memory could depend on portfolio performance rather than forecasting
accuracy, as in closed-loop control (Bertsekas 1995). A disadvantage is that the
forecasts become constant, in the absence of a model describing the dynamics.
Although the approach is conceptually intriguing, we found that it does not
compete with a regime-switching model.
We instead base our implementation on forecasts from a multivariate hidden
Markov model with time-varying parameters. We estimate the model using the
online, adaptive version of the expectation–maximization algorithm proposed by
Stenger et al. (2001). Contrary to the recursive, adaptive estimation approach
from paper B, where it is necessary to consider past observations when evaluat-
ing the score function, the algorithm of Stenger et al. (2001) is truly online. It
is much faster and easily scales to higher dimensions (Khreich et al. 2012). We
find that two regimes are sufficient, after testing models with two, three, and four
regimes.
The usual issues when estimating a high-dimensional covariance matrix also arise
in the context of HMMs, causing unstable estimates of the transition matrix
and of the hidden states. In fact, the problem is even more pronounced, as
some regimes are seldom visited, leaving a very small sample for estimating the
covariance matrix. When we fit a multivariate HMM to all indices at once using
in-sample training data, the state sequence has low persistence and frequent
switches, leading to excessive portfolio turnover and poor results. In addition to
applying a Stein-type shrinkage estimator to the estimated covariance matrix in
each regime, as proposed by Fiecas et al. (2017), we find that it is necessary to
estimate the state sequence based solely on two stock indices. We still estimate
the mean vector and covariance matrix in each state based on all the indices.
This approach is inspired by paper D and supported by the finding of meaningful
common breakpoints across stocks, bonds, and oil in paper F. In addition to
regularizing the estimated covariance matrices, we find that it is necessary to
include transaction and holding costs and constraints in order to regularize the
portfolio optimization problem and reduce the risk due to estimation error.
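The covariance regularization can be illustrated by a simple linear shrinkage toward a diagonal target; the fixed intensity below is an assumption, whereas Fiecas et al. (2017) derive the intensity from the data.

```python
import numpy as np

def shrink_covariance(S, alpha=0.2):
    """Shrink a regime's sample covariance matrix toward its diagonal:
    (1 - alpha) * S + alpha * diag(S), pulling noisy off-diagonal
    correlation estimates toward zero."""
    return (1.0 - alpha) * S + alpha * np.diag(np.diag(S))
```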
Our testing shows that, even if they knew the future returns when choosing
their benchmark, investors who insist on rebalancing to a static, diversified
benchmark portfolio could not have outperformed the dynamic approach net of
transaction costs over the 18-year test period in terms of risk-adjusted return.
This is regardless of whether risk is measured by standard deviation or maximum
drawdown (figure 7 on page 290). The dynamic approach also outperforms
an equally-weighted portfolio and a fixed-mix portfolio with the same average
allocation. The outperformance happens continually over the 18-year period
(figure 6 on page 288). The combination of leverage and drawdown control is
particularly successful, as it is possible to increase the excess return by several
hundred basis points without suffering a larger maximum drawdown (figure 8(b)
on page 291).
In summary, the MPC approach to multi-period portfolio optimization com-
bined with the proposed approach to drawdown control is practical, efficient,
and leads to impressive results. In addition to improving the multi-period fore-
casts, it would be interesting in future research to compare the deterministic
MPC approach with stochastic MPC based on scenarios. The starting point
for the comparison could be the multivariate HMM proposed in this paper. By
generating scenarios, rather than summarizing the forecast distribution by its
first two moments at every time step, it is possible to account for both the
interdependence structure of prediction errors and the predictive distributions
(Pinson et al. 2009). The entire forecast distribution could also be taken into
account by using stochastic programming (see, e.g., Ali et al. 2015).
CHAPTER 3
Conclusion

Long-term investors can often bear the risk of outsized market movements or tail
events more easily than the average investor; for bearing this risk, they hope to
earn significant excess returns. Our results show that rebalancing periodically
to a fixed benchmark allocation is not the way to do this. Regime shifts that
present a big challenge to traditional, static approaches to asset allocation pose
a large potential for dynamic approaches. A substantial amount of value can be
added by adjusting the asset allocation to the current market conditions, rather
than rebalancing periodically to a static benchmark.
Regime shifts lead to time-varying parameters and, in addition, we found that
the parameters within the regimes and the transition probabilities change over
time. By taking this into account, we were able to better reproduce the long
memory of squared daily returns that DAA benefits from. When applying an
adaptive estimation approach to allow for time-varying parameters, we found
that a two-state Gaussian HMM was able to reproduce this long memory. There
was no need for other sojourn-time or conditional distributions. The time vari-
ation was observation driven based on the gradient of the likelihood function.
Furthermore, fast adaptation to parameter changes led to improved one-step
density forecasts.
We showed that by using a continuous-time formulation it is possible to increase
the number of states with a linear rather than quadratic increase in the number
of parameters and thereby obtain a better fit to the distributional and temporal
properties of daily returns. Meanwhile, when allowing for time-varying parame-
ters, two regimes were sufficient, though the time-varying behavior meant that
the regimes were not actually recurring.
We found that it is profitable to change asset allocation based solely on regimes
identified in stock returns. This included changing allocation between stocks
and bonds and switching between predefined risk–on and risk–off portfolios to
sustain a level of diversification. In both cases, a regime-based approach led to
higher risk-adjusted returns and a lower maximum drawdown compared to a
static portfolio. It was not a problem to obtain good results out of sample when
using the two-state Gaussian HMM with time-varying parameters for regime
inference.
The nonrecurring regimes led us to consider alternatives to the application of
HMMs to identify regime shifts, such as nonparametric change detection and
trend filtering. We even developed a greedy algorithm for segmenting multi-
variate Gaussian time series, applicable to everything from text data to high-
dimensional return series. Using this algorithm we were able to benefit from
information in the correlation structure when identifying regimes in past re-
turns. Although change detection, trend filtering, and Gaussian segmentation
are great tools for analysis, we found them to be less useful for forecasting.
Expanding the univariate HMM to a multivariate model was a challenging task.
In order to reduce the complexity of the modeling task, we considered deriving
and modeling central risk factors, but when we analyzed the correlation struc-
ture and its principal components, we did not find any stationary structure. We
switched from the gradient-based estimation approach to an online version of
the expectation–maximization algorithm that is significantly faster and easily
scalable to high dimensions.
The usual issues when estimating a high-dimensional covariance matrix based
on a limited number of observations also arose in the context of HMMs, caus-
ing unstable estimates of the transition matrix and of the hidden states. If the
returns were perfectly Gaussian, then increasing the dimension should enhance
the regime inference, since more information is available. As the multivariate
Gaussian distribution was only an approximation, increasing the dimension in-
troduced more outliers and led to a state sequence with low persistence and
frequent switches, resulting in excessive portfolio turnover and poor results. We
obtained better results by estimating the state sequence based solely on two
stock indices, while still estimating the mean vector and covariance matrix in
each state based on all the indices.
We showed how forecasts can be optimally translated into a DAA strategy using
MPC. By taking advantage of powerful computers combined with advances in
convex optimization, we developed a framework based on MPC that makes it fea-
sible to solve multi-period portfolio optimization problems with large numbers
of assets and search over hyperparameters in backtests. This was a significant
breakthrough compared to analytical solutions based on dynamic programming
that are infeasible in most practical applications due to the curse of dimension-
ality. We found that transaction and holding costs play an important role in
terms of regularizing the optimization problem and reducing the risk due to
estimation error.
By allowing for investment in a portfolio of assets rather than only stocks and
cash, it was possible to achieve a higher risk-adjusted return and a more steady
outperformance over time relative to a fixed-weight benchmark. We imple-
mented the MPC approach based on forecasts from the multivariate HMM with
time-varying parameters. Our testing showed that, even if they knew the future
returns when choosing their benchmark, investors who insisted on rebalancing
to a static, diversified benchmark portfolio could not have outperformed the dy-
namic approach net of transaction costs in terms of risk-adjusted return. This
was regardless of whether risk was measured by standard deviation or maximum
drawdown.
We showed how an optimization approach based on MPC can be used to con-
trol drawdowns, with little or no loss of mean–variance efficiency, by adjusting
the risk aversion based on realized drawdown. DAA, even without drawdown
control, led to a significantly lower maximum drawdown compared to a fixed-
weight portfolio, regardless of whether the approach was rule or optimization
based. The combination of leverage and drawdown control was particularly suc-
cessful, as it was possible to increase the excess return by several hundred basis
points without suffering a larger maximum drawdown.

3.1 Commercial perspectives


DAA is a highly relevant topic, since the majority of assets, for example in
pension funds, are still being managed using primarily static benchmarks. In-
stitutional investors are increasingly looking to incorporate some element of
dynamic decision-making within portfolios, both for return enhancement and as
a risk-management tool. The two financial crises during the 2000s have exposed
the weaknesses of a static asset allocation and contributed to the rising inter-
est in dynamic and regime-based approaches. The recency of events means that
large amounts of data containing turbulent periods are available for backtesting.
Avoiding large drawdowns is valuable in itself—even for investors with a very
long investment horizon—and, in addition, provides an opportunity for tak-
ing more risk at other, more favorable times, without increasing the overall
risk. Hence, better risk management can be directly converted into higher (risk-
adjusted) returns.
This study underpins the demand for dynamic approaches with better possibil-
ity for adjusting risk and asset allocation over time, in order to benefit from
time-varying risk premia by continually allocating investments toward the best
risk–return tradeoff, concurrently with developments in the business cycle and
financial markets. It contributes to a better understanding of financial markets’
behavior in the form of a model-based framework for DAA.
The commercial value of the study extends beyond the specific methods that
have been developed. The demonstration of the advantages of DAA and its
connection to drawdown control should give rise to a fundamental reform of the
approach to asset allocation. This could potentially contribute to better risk
management within financial institutions.

3.2 Future work


Having established a well-functioning framework for assessing the value of multi-
period forecasts, it is left for future research to develop these. There is potential
for further developing recursive and adaptive estimation techniques for regime-
switching models, for example by exploring the impact of adding a regularization
term reflecting prior information to the likelihood objective. One possible way
to obtain a better description of the time-varying behavior of the parameters
could be to allow different forgetting factors for each parameter or consider
more advanced state-, time-, or data-dependent forgetting. Another way could
be to formulate a model for the parameter changes in the form of a hierarchical
model, possibly including exogenous explanatory variables. This could be a
hierarchical stochastic-differential-equation model with continuous rather than
discrete regime shifts, where the parameters themselves are stochastic processes.
Although we did not have success with risk-factor-based forecasting, this ap-
proach still has appealing properties in terms of scalability. If the main im-
portance is to distinguish between risk–on and risk–off regimes, as our results
suggest, it may suffice to consider stock returns. Conversely, it is worth ex-
ploring in more detail whether information from other asset classes, correlations
(within and between asset classes), economic variables, interest rates, or other
possible indicators can be included to improve regime inference.
Given the promising results of the MPC approach to portfolio optimization and
drawdown control, it is worth further developing this approach and considering
integrating estimation and control. For example, the forgetting parameter used
in estimation could be adjusted based on forecasting accuracy or even portfo-
lio performance. As an alternative to allocation at asset-class level, the MPC
approach could be utilized for dynamic allocation to risk premia, such as value
and momentum, or various quantitative strategies.
Stochastic MPC is another open route for future research. Instead of replacing
all future, unknown quantities by their forecasted values, in order to turn the
stochastic control problem into a deterministic optimization problem, scenarios
could be generated to represent the uncertainty in the forecasts. Alternatively,
the entire distribution could be taken into account by using stochastic program-
ming. It is not a given, however, that the increase in computational complexity
would be offset by better results.
References

Ali, A., J. Z. Kolter, S. Diamond, and S. Boyd. “Disciplined convex stochastic


programming: A new framework for stochastic optimization.” In Proceedings
of the 31st Conference on Uncertainty in Artificial Intelligence (2015), pp.
62–71.
Allez, R. and J. P. Bouchaud. “Eigenvector dynamics: General theory and some
applications.” Physical Review E, vol. 86, no. 4 (2012), p. 046202.
Ammann, M. and M. Verhofen. “The effect of market regimes on style allocation.”
Financial Markets and Portfolio Management, vol. 20, no. 3 (2006), pp. 309–
337.
Ang, A. and G. Bekaert. “International asset allocation with regime shifts.”
Review of Financial Studies, vol. 15, no. 4 (2002), pp. 1137–1187.
Ang, A. and G. Bekaert. “How regimes affect asset allocation.” Financial Ana-
lysts Journal, vol. 60, no. 2 (2004), pp. 86–99.
Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual
Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.
Augustyniak, M., L. Bauwens, and A. Dufays. “A new approach to volatility
modeling: the high-dimensional Markov model.” Tech. Rep. 2016042, Univer-
sité catholique de Louvain (2016).
Bae, G. I., W. C. Kim, and J. M. Mulvey. “Dynamic asset allocation for varied
financial markets under regime switching framework.” European Journal of
Operational Research, vol. 234, no. 2 (2014), pp. 450–458.
Bansal, R., M. Dahlquist, and C. R. Harvey. “Dynamic trading strategies and


portfolio choice.” Working Paper 10820, National Bureau of Economic Re-
search (2004).

Bass, R., S. Gladstone, and A. Ang. “Total portfolio factor, not just asset,
allocation.” Journal of Portfolio Management, vol. 43, no. 5 (2017), pp. 38–
53.

Bauer, R., R. Haerden, and R. Molenaar. “Asset allocation in stable and unstable
times.” Journal of Investing, vol. 13, no. 3 (2004), pp. 72–80.

Bellman, R. E. “Dynamic programming and Lagrange multipliers.” Proceedings


of the National Academy of Sciences, vol. 42, no. 10 (1956), pp. 767–769.

Bertsekas, D. P. Dynamic Programming and Optimal Control. Athena Scientific:


Belmont (1995).

Bertsimas, D., G. J. Lauprete, and A. Samarov. “Shortfall as a risk measure:


properties, optimization and applications.” Journal of Economic Dynamics &
Control, vol. 28, no. 7 (2004), pp. 1353–1381.

Black, F. and M. Scholes. “The pricing of options and corporate liabilities.”


Journal of Political Economy, vol. 81, no. 3 (1973), pp. 637–654.

Boyd, S., M. T. Mueller, B. O’Donoghue, and Y. Wang. “Performance bounds


and suboptimal policies for multi-period investment.” Foundations and Trends
in Optimization, vol. 1, no. 1 (2014), pp. 1–72.

Boyd, S. and L. Vandenberghe. Convex Optimization. Cambridge University


Press: New York (2004).

Brinson, G. P., L. R. Hood, and G. L. Beebower. “Determinants of portfolio


performance.” Financial Analysts Journal, vol. 42, no. 4 (1986), pp. 39–44.

Brunnermeier, M. K. and L. H. Pedersen. “Market liquidity and funding liquid-


ity.” Review of Financial studies, vol. 22, no. 6 (2009), pp. 2201–2238.

Bulla, J. “Hidden Markov models with t components. Increased persistence and


other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.

Bulla, J. and I. Bulla. “Stylized facts of financial time series and hidden semi-
Markov models.” Computational Statistics & Data Analysis, vol. 51, no. 4
(2006), pp. 2192–2209.

Bulla, J., S. Mergner, I. Bulla, A. Sesboüé, and C. Chesneau. “Markov-switching


asset allocation: Do profitable strategies exist?” Journal of Asset Manage-
ment, vol. 12, no. 5 (2011), pp. 310–321.
Calvet, L. E. and A. J. Fisher. “How to forecast long-run volatility: Regime


switching and the estimation of multifractal processes.” Journal of Financial
Econometrics, vol. 2, no. 1 (2004), pp. 49–83.
Campbell, J. Y. “Asset prices, consumption, and the business cycle.” In Hand-
book of Macroeconomics, edited by J. B. Taylor and M. Woodford, vol. 1C,
chap. 19. Elsevier: Amsterdam (1999), pp. 1231–1303.
Campbell, J. Y. and L. M. Viceira. Strategic Asset Allocation: Portfolio Choice
for Long-Term Investors. Oxford University Press: New York (2002).
Candès, E. J., M. B. Wakin, and S. Boyd. “Enhancing sparsity by reweighted ℓ1
minimization.” Journal of Fourier Analysis and Applications, vol. 14, no. 5–6
(2008), pp. 877–905.
Cappé, O., E. Moulines, and T. Rydén. Inference in Hidden Markov Models.
Springer: New York (2005).
Chetalova, D., R. Schäfer, and T. Guhr. “Zooming into market states.” Journal
of Statistical Mechanics: Theory and Experiment, vol. 2015, no. 1 (2015), p.
P01029.
Cochrane, J. H. “Financial markets and the real economy.” Foundations and
Trends in Finance, vol. 1, no. 1 (2005), pp. 1–101.
Cocoma, P., M. Czasonis, M. Kritzman, and D. Turkington. “Facts about fac-
tors.” Journal of Portfolio Management, vol. 43, no. 5 (2017), pp. 55–65.
Cohn, A., J. Engelmann, E. Fehr, and M. A. Maréchal. “Evidence for counter-
cyclical risk aversion: an experiment with financial professionals.” American
Economic Review, vol. 105, no. 2 (2015), pp. 860–885.
Cooper, J. E. and K. Worden. “On-line physical parameter estimation with adap-
tive forgetting factors.” Mechanical Systems and Signal Processing, vol. 14,
no. 5 (2000), pp. 705–730.
Creal, D., S. J. Koopman, and A. Lucas. “Generalized autoregressive score mod-
els with applications.” Journal of Applied Econometrics, vol. 28, no. 5 (2013),
pp. 777–795.
Dacco, R. and S. Satchell. “Why do regime-switching models forecast so badly?”
Journal of Forecasting, vol. 18, no. 1 (1999), pp. 1–16.
Dahlquist, M. and C. R. Harvey. “Global tactical asset allocation.” Emerging
Markets Quarterly, vol. 5, no. 1 (2001), pp. 6–14.
Diamond, S. and S. Boyd. “CVXPY: A Python-embedded modeling language for
convex optimization.” Journal of Machine Learning Research, vol. 17, no. 83
(2016), pp. 1–5.
Downing, C., A. Madhavan, A. Ulitsky, and A. Singh. “Portfolio construction


and tail risk.” Journal of Portfolio Management, vol. 42, no. 1 (2015), pp.
85–102.

Fama, E. F. “Efficient capital markets: A review of theory and empirical work.”


Journal of Finance, vol. 25, no. 2 (1970), pp. 383–417.

Fenn, D. J., M. A. Porter, S. Williams, M. McDonald, N. F. Johnson, and N. S.


Jones. “Temporal evolution of financial-market correlations.” Physical Review
E, vol. 84, no. 2 (2011), p. 026109.

Fiecas, M., J. Franke, R. von Sachs, and J. T. Kamgaing. “Shrinkage estimation


for multivariate hidden Markov models.” Journal of the American Statistical
Association, vol. 112, no. 517 (2017), pp. 424–435.

Fornaciari, M. and C. Grillenzoni. “Evaluation of on-line trading systems:


Markov-switching vs time-varying parameter models.” Decision Support Sys-
tems, vol. 93 (2017), pp. 51–61.

Fortescue, T. R., L. S. Kershenbaum, and B. E. Ydstie. “Implementation of self-


tuning regulators with variable forgetting factors.” Automatica, vol. 17, no. 6
(1981), pp. 831–835.

Frühwirth-Schnatter, S. Finite Mixture and Markov Switching Models. Springer:


New York (2006).

Gârleanu, N. and L. H. Pedersen. “Liquidity and risk management.” American


Economic Review, vol. 97, no. 2 (2007), pp. 193–197.

Gârleanu, N. and L. H. Pedersen. “Dynamic trading with predictable returns and


transaction costs.” Journal of Finance, vol. 68, no. 6 (2013), pp. 2309–2340.

Goltz, F., L. Martellini, and K. D. Simsek. “Optimal static allocation decisions


in the presence of portfolio insurance.” Journal of Investment Management,
vol. 6, no. 2 (2008), pp. 37–56.

Goyal, A., A. Ilmanen, and D. Kabiller. “Bad habits and good practices.” Journal
of Portfolio Management, vol. 41, no. 4 (2015), pp. 97–107.

Granger, C. W. J. and Z. Ding. “Some properties of absolute return: An alter-


native measure of risk.” Annales D’Economie Et Statistique, vol. 40 (1995a),
pp. 67–92.

Granger, C. W. J. and Z. Ding. “Stylized facts on the temporal and distribu-


tional properties of daily data from speculative markets.” Unpublished paper,
Department of Economics, University of California, San Diego (1995b).
Guidolin, M. “Markov switching in portfolio choice and asset pricing models: A


survey.” In Missing Data Methods: Time-Series Methods and Applications,
edited by D. M. Drukker, vol. 27b of Advances in Econometrics. Emerald
Group Publishing: Bingley (2011a), pp. 87–178.
Guidolin, M. “Markov switching models in empirical finance.” In Missing Data
Methods: Time-Series Methods and Applications, edited by D. M. Drukker,
vol. 27b of Advances in Econometrics. Emerald Group Publishing: Bingley
(2011b), pp. 1–86.
Guidolin, M. and F. Ria. “Regime shifts in mean–variance efficient frontiers:
Some international evidence.” Journal of Asset Management, vol. 12, no. 5
(2011), pp. 322–349.
Guidolin, M. and A. Timmermann. “Asset allocation under multivariate regime
switching.” Journal of Economic Dynamics and Control, vol. 31, no. 11 (2007),
pp. 3503–3544.
Guidolin, M. and A. Timmermann. “International asset allocation under regime
switching, skew, and kurtosis preferences.” Review of Financial Studies,
vol. 21, no. 2 (2008), pp. 855–901.
Gustafsson, F. Adaptive Filtering and Change Detection. Wiley: West Sussex
(2000).
Herzog, F., G. Dondi, and H. P. Geering. “Stochastic model predictive control
and portfolio optimization.” International Journal of Theoretical and Applied
Finance, vol. 10, no. 2 (2007), pp. 203–233.
Ibbotson, R. G. and P. D. Kaplan. “Does asset allocation policy explain 40, 90,
or 100 percent of performance?” Financial Analysts Journal, vol. 56, no. 1
(2000), pp. 26–33.
Ibragimov, R., D. Jaffee, and J. Walden. “Diversification disasters.” Journal of
Financial Economics, vol. 99, no. 2 (2011), pp. 333–348.
Johansen, T. A. “On Tikhonov regularization, bias and variance in nonlinear
system identification.” Automatica, vol. 33, no. 3 (1997), pp. 441–446.
Khreich, W., E. Granger, A. Miri, and R. Sabourin. “A survey of techniques
for incremental learning of HMM parameters.” Information Sciences, vol. 197
(2012), pp. 105–130.
Kim, S. J., K. Koh, S. Boyd, and D. Gorinevsky. “ℓ1 trend filtering.” SIAM
Review, vol. 51, no. 2 (2009), pp. 339–360.
Kolm, P., R. Tütüncü, and F. Fabozzi. “60 years of portfolio optimization:
Practical challenges and current trends.” European Journal of Operational
Research, vol. 234, no. 2 (2014), pp. 356–371.
32 Conclusion

Kritzman, M. and Y. Li. “Skulls, financial turbulence, and risk management.”


Financial Analysts Journal, vol. 66, no. 5 (2010), pp. 30–41.
Kritzman, M., S. Page, and D. Turkington. “Regime shifts: Implications for
dynamic strategies.” Financial Analysts Journal, vol. 68, no. 3 (2012), pp.
22–39.
Langrock, R. and W. Zucchini. “Hidden Markov models with arbitrary state
dwell-time distributions.” Computational Statistics & Data Analysis, vol. 55,
no. 1 (2011), pp. 715–724.
Levich, R. M. International Financial Markets: Prices and Policies. McGraw–
Hill: New York, 2nd ed. (2001).
Lim, A. E., J. G. Shanthikumar, and G. Y. Vahn. “Conditional value-at-risk in
portfolio optimization: Coherent but fragile.” Operations Research Letters,
vol. 39, no. 3 (2011), pp. 163–171.
Lindström, E., H. Madsen, and J. N. Nielsen. Statistics for Finance. Chapman
& Hall: London (2015).
Ljung, L. and T. Söderström. Theory and Practice of Recursive Identification.
MIT Press: Cambridge (1983).
Madsen, H. Time Series Analysis. Chapman & Hall: London (2008).
Madsen, H., J. N. Nielsen, E. Lindström, M. Baadsgaard, and J. Holst. “Statis-
tics in finance.” Lund University (1999). Lecture notes.
Malmsten, H. and T. Teräsvirta. “Stylized facts of financial time series and
three popular models of volatility.” European Journal of Pure and Applied
Mathematics, vol. 3, no. 3 (2010), pp. 443–477.
Mandelbrot, B. “The variation of certain speculative prices.” Journal of Business,
vol. 36, no. 4 (1963), pp. 394–419.
Markowitz, H. “Portfolio selection.” Journal of Finance, vol. 7, no. 1 (1952), pp.
77–91.
Markowitz, H. “Mean–variance approximations to expected utility.” European
Journal of Operational Research, vol. 234, no. 2 (2014), pp. 346–355.
Meindl, P. J. and J. A. Primbs. “Dynamic hedging of single and multi-
dimensional options with transaction costs: a generalized utility maximization
approach.” Quantitative Finance, vol. 8, no. 3 (2008), pp. 299–312.
Merton, R. C. “Lifetime portfolio selection under uncertainty: The continuous-
time case.” Review of Economics and Statistics, vol. 51, no. 3 (1969), pp.
247–257.
Merton, R. C. “Theory of rational option pricing.” Bell Journal of Economics


and Management Science, vol. 4, no. 1 (1973), pp. 141–183.

Moreira, A. and T. Muir. “Volatility-managed portfolios.” Journal of Finance,


vol. 72, no. 4 (2017), pp. 1611–1644.

Mossin, J. “Optimal multiperiod portfolio policies.” Journal of Business, vol. 41,


no. 2 (1968), pp. 215–229.

Münnix, M. C., T. Shimada, R. Schaefer, F. Leyvraz, T. H. Seligman, T. Guhr,


and H. E. Stanley. “Identifying states of a financial market.” Scientific Reports,
vol. 2, no. 1 (2012), p. 644.

Narasimhan, M., P. Viola, and M. Shilman. “Online decoding of Markov models


under latency constraints.” In Proceedings of the 23rd International Confer-
ence on Machine Learning (2006), pp. 657–664.

Nystrup, P. Regime-Based Asset Allocation: Do Profitable Strategies Exist?


Master’s thesis, Technical University of Denmark (2014).

Page, E. S. “Continuous inspection schemes.” Biometrika, vol. 41, no. 1–2 (1954),
pp. 100–115.

Pedersen, L. H. “When everyone runs for the exit.” International Journal of


Central Banking, vol. 5, no. 4 (2009), pp. 177–199.

Pedersen, L. H. Efficiently inefficient: how smart money invests and market


prices are determined. Princeton University Press: Princeton (2015).

Pinson, P. and H. Madsen. “Adaptive modelling and forecasting of offshore wind


power fluctuations with Markov-switching autoregressive models.” Journal of
Forecasting, vol. 31, no. 4 (2012), pp. 281–313.

Pinson, P., H. Madsen, H. A. Nielsen, G. Papaefthymiou, and B. Klöckl. “From


probabilistic forecasts to statistical scenarios of short-term wind power pro-
duction.” Wind Energy, vol. 12, no. 1 (2009), pp. 51–62.

Preis, T. and H. E. Stanley. “Switching phenomena in a system with no switches.”


Journal of Statistical Physics, vol. 438, no. 1–3 (2010), pp. 431–446.

R Core Team. R: A Language and Environment for Statistical Computing. R


Foundation for Statistical Computing, Vienna, Austria (2017). URL https:
//www.R-project.org/.

Racicot, F. É. and R. Théoret. “Macroeconomic shocks, forward-looking dynam-


ics, and the behavior of hedge funds.” Journal of Banking & Finance, vol. 62
(2016), pp. 41–61.
Roberts, S. W. “Control chart tests based on geometric moving averages.” Tech-


nometrics, vol. 1, no. 3 (1959), pp. 239–250.

Rockafellar, R. T. and S. Uryasev. “Optimization of conditional value-at-risk.”


Journal of Risk, vol. 2, no. 3 (2000), pp. 21–42.

Ross, G. J. “Modelling financial volatility in the presence of abrupt changes.”


Physica A: Statistical Mechanics and its Applications, vol. 392, no. 2 (2013),
pp. 350–360.

Ross, G. J., D. K. Tasoulis, and N. M. Adams. “Nonparametric monitoring of


data streams for changes in location and scale.” Technometrics, vol. 53, no. 4
(2011), pp. 379–389.

Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.

Samuelson, P. A. “Lifetime portfolio selection by dynamic stochastic program-


ming.” Review of Economics and Statistics, vol. 51, no. 3 (1969), pp. 239–246.

Sheikh, A. Z. and J. Sun. “Regime change: Implications of macroeconomic shifts


on asset class and portfolio performance.” Journal of Investing, vol. 21, no. 3
(2012), pp. 36–54.

Siegel, J. J. “Does it pay stock investors to forecast the business cycle?” Journal
of Portfolio Management, vol. 18, no. 1 (1991), pp. 27–34.

Siegmund, D. and E. S. Venkatraman. “Using the generalized likelihood ra-


tio statistic for sequential detection of a change-point.” Annals of Statistics,
vol. 23, no. 1 (1995), pp. 255–271.

Song, Y. “Modelling regime switching and structural breaks with an infinite


hidden Markov model.” Journal of Applied Econometrics, vol. 29, no. 5 (2014),
pp. 825–842.

Stenger, B., V. Ramesh, N. Paragios, F. Coetzee, and J. M. Buhmann. “Topol-


ogy free hidden Markov models: Application to background modeling.” In
Proceedings of the Eighth IEEE International Conference on Computer Vi-
sion, vol. 1 (2001), pp. 294–301.

Uosaki, K., M. Yotsuya, and T. Hatanaka. “Adaptive identification of non-


stationary systems with multiple forgetting factors.” In Proceedings of the
35th IEEE Conference on Decision and Control (1996), pp. 851–856.

Whaley, R. E. “The investor fear gauge.” Journal of Portfolio Management,


vol. 26, no. 3 (2000), pp. 12–17.
Yuan, M. and Y. Lin. “Model selection and estimation in regression with grouped
variables.” Journal of the Royal Statistical Society. Series B (Methodological),
vol. 68, no. 1 (2006), pp. 49–67.
Zakamulin, V. “Dynamic asset allocation strategies based on unexpected volatil-
ity.” Journal of Alternative Investments, vol. 16, no. 4 (2014), pp. 37–50.
Zucchini, W. and I. L. MacDonald. Hidden Markov Models for Time Series: An
Introduction Using R. Chapman & Hall: London, 2nd ed. (2009).
PAPER A
Originally published in Quantitative Finance

Stylized facts of financial time series and hidden Markov models in continuous time

Peter Nystrup, Henrik Madsen, and Erik Lindström

Abstract

Hidden Markov models are often applied in quantitative finance to capture


the stylized facts of financial returns. They are usually discrete-time mod-
els and the number of states rarely exceeds two because of the quadratic
increase in the number of parameters with the number of states. This paper
presents an extension to continuous time where it is possible to increase the
number of states with a linear rather than quadratic growth in the number
of parameters. The possibility of increasing the number of states leads to a
better fit to both the distributional and temporal properties of daily returns.

Keywords: Hidden Markov models; Continuous time; Daily returns; Leptokurtosis; Volatility clustering; Long memory.

1 Introduction
The normal distribution is well-known as being a poor fit to most financial
returns. Mixtures of normal distributions provide a much better fit as they are
able to reproduce both the skewness and leptokurtosis often observed (Cont
2001). Markov switching mixture models, also referred to as hidden Markov
models (HMMs), are a natural extension in order to also capture the temporal
properties of financial returns. In an HMM, the distribution that generates
an observation depends on the state of an underlying and unobserved Markov
chain.
The ability of an HMM to reproduce most of the stylized facts of daily return
series introduced by Granger and Ding (1995b,a) was illustrated by Rydén et al.
(1998). They found that the one stylized fact that cannot be reproduced by an
HMM is the slow decay of the autocorrelation function (ACF) of squared daily
returns, which is of great importance, for instance, in financial risk management.
According to Bulla and Bulla (2006), the lack of flexibility of an HMM to model
this temporal higher-order dependence can be explained by the implicit assump-
tion of geometrically distributed sojourn times in the hidden states. Silvestrov
and Stenberg (2004), among others, argued that the memoryless property of the
geometric distribution is inadequate from an empirical perspective, although it
is consistent with the no-arbitrage principle.
Bulla and Bulla (2006) considered hidden semi-Markov models (HSMMs) in
which the sojourn-time distribution is modeled explicitly for each hidden state
so that the Markov property is transferred to the imbedded first-order Markov
chain. They showed that HSMMs with negative binomial sojourn-time distri-
butions are able to reproduce most of the stylized facts comparably well, and
often better, than the HMM. Specifically, they found HSMMs to reproduce the
long-memory property of squared daily returns much better than HMMs. They,
however, did not consider the complicated problem of selecting the most appro-
priate sojourn-time distributions, and, following the approach by Rydén et al.
(1998), they only considered models with two hidden states.
Bulla (2011) later showed that HMMs with t-distributed components reproduce
most of the stylized facts as well or better than the Gaussian HMM, at the
same time as increasing the persistence of the visited states and the robustness
to outliers. Bulla (2011) also found that models with three states provided a
better fit than models with two states.
Many different stylized facts have been established for financial returns. See,
for example, Granger and Ding (1995b,a), Granger et al. (2000), Cont (2001),
Malmsten and Teräsvirta (2010). This paper focuses on the stylized facts relat-
ing to the long memory of the ACF and examines the importance of the number
of hidden states on the ability to fit the slowly decaying ACF of squared daily
returns. An extension of HMMs to continuous time is presented as a flexible
alternative to the discrete-time models.
Two hidden states are found to be too few to reproduce the slowly decaying
ACF as well as the observed skewness and leptokurtosis. A major limitation
of discrete-time HMMs and HSMMs is the quadratic increase in the number of
parameters with the number of states. This limitation does not apply to HMMs
in continuous time, as it can reasonably be assumed that the only possible
transitions in an infinitesimally short time interval are to the neighboring states.
This assumption leads to a linear rather than quadratic growth in the number
of parameters with the number of states, and consequently, a significant re-
duction in the number of parameters for higher-order models. With the added
flexibility, the number of states can be considered a parameter that needs to be
estimated.1 In addition, it is possible to incorporate temporal inhomogeneity
without a dramatic increase in the number of parameters using a continuous-
time formulation.
Section 2 gives an introduction to the main theory relating to HMMs and
HSMMs. Section 3 introduces HMMs where the underlying Markov chain is

a continuous-time Markov chain. Section 4 contains a description of the data
analyzed. The empirical results are reported in section 5 and section 6 concludes.
All parameter estimates can be found in appendix A. The results of the analysis
of the FTSE 100 index can be found in appendix B.

1 See Cappé et al. (2005) for a perspective on order estimation.

2 Hidden Markov models in discrete time


In a hidden Markov model, the probability distribution that generates an obser-
vation depends on the state of an underlying and unobserved Markov process.
HMMs are a particular kind of dependent mixture and are therefore also referred
to as Markov switching mixture models.
A sequence of discrete random variables {St : t ∈ N} is said to be a Markov
chain if, for all t ∈ N, it satisfies the Markov property:

$$\Pr\left(S_{t+1} \mid S_t, \ldots, S_1\right) = \Pr\left(S_{t+1} \mid S_t\right). \tag{1}$$

The conditional probabilities Pr ( Su+t = j| Su = i) = γij (t) are called transition


probabilities. The Markov chain is said to be homogeneous if the transition
probabilities are independent of u, otherwise inhomogeneous.
A Markov chain with transition probability matrix Γ (t) = {γij (t)} has sta-
tionary distribution π if πΓ = π and π1 = 1. The Markov chain is termed
stationary if π = δ, where δ is the initial distribution, that is δi = Pr (S1 = i).
If the Markov chain {St } has m states, then {Xt : t ∈ N} is called an m-state
HMM. With X (t) and S (t) representing the sequence of values from time 1 to
time t, the simplest model of this kind can be summarized by
$$\Pr\left(S_t \mid S^{(t-1)}\right) = \Pr\left(S_t \mid S_{t-1}\right), \quad t = 2, 3, \ldots, \tag{2a}$$
$$\Pr\left(X_t \mid X^{(t-1)}, S^{(t)}\right) = \Pr\left(X_t \mid S_t\right), \quad t \in \mathbb{N}. \tag{2b}$$

When the current state St is known, the distribution of Xt depends only on


St . This causes the autocorrelation of Xt to be strongly dependent on the
persistence of St .
A specific observation can usually arise from more than one state as the support
of the conditional distributions overlaps. Therefore, the unobserved state pro-
cess {St } is not directly observable through the observation process {Xt }, but
can only be estimated.
As an example, consider the two-state model with Gaussian conditional distri-
butions:
$$X_t = \mu_{S_t} + \varepsilon_{S_t}, \qquad \varepsilon_{S_t} \sim N\left(0, \sigma_{S_t}^2\right),$$
where
$$\mu_{S_t} = \begin{cases} \mu_1, & \text{if } S_t = 1,\\ \mu_2, & \text{if } S_t = 2, \end{cases} \qquad \sigma_{S_t}^2 = \begin{cases} \sigma_1^2, & \text{if } S_t = 1,\\ \sigma_2^2, & \text{if } S_t = 2, \end{cases} \qquad \text{and} \qquad \Gamma = \begin{bmatrix} 1 - \gamma_{12} & \gamma_{12}\\ \gamma_{21} & 1 - \gamma_{21} \end{bmatrix}.$$

For this model, the value of the autocorrelation function at lag k is
$$\rho_{X_t}(k \mid \theta) = \frac{\pi_1 (1 - \pi_1) (\mu_1 - \mu_2)^2}{\sigma^2}\, \lambda^k,$$
and the autocorrelation function for the squared process is
$$\rho_{X_t^2}(k \mid \theta) = \frac{\pi_1 (1 - \pi_1) \left(\mu_1^2 - \mu_2^2 + \sigma_1^2 - \sigma_2^2\right)^2}{\mathrm{E}\left[X_t^4 \mid \theta\right] - \mathrm{E}\left[X_t^2 \mid \theta\right]^2}\, \lambda^k,$$

where $\theta$ denotes the model parameters, $\sigma^2 = \mathrm{Var}[X_t \mid \theta]$ is the unconditional
variance, and $\lambda = \gamma_{11} + \gamma_{22} - 1$ is the second largest eigenvalue of $\Gamma$ (Frühwirth-

variance, and λ = γ11 + γ22 − 1 is the second largest eigenvalue of Γ (Frühwirth-
Schnatter 2006). It is evident from these expressions, as noted by Rydén et al.
(1998), that HMMs can only reproduce an exponentially-decaying autocorrela-
tion structure.
The ACF of the first-order process becomes zero if the means are equal, whereas
persistence in the squared process can be induced either by a difference in the
mean values, as for a mixed effects model, or by a difference in the variances
across the states. In both cases, the persistence increases with the combined
persistence of the states as measured by λ (Ang and Timmermann 2012).
The parameters of an HMM are usually estimated using the maximum-likelihood
method. Under the assumption that the observations are conditionally independent
given the states, the likelihood is given by
\[
L_T(\theta) = \Pr\big(X^{(T)} = x^{(T)} \mid \theta\big) = \delta P(x_1)\,\Gamma P(x_2) \cdots \Gamma P(x_T)\,\mathbf{1}, \tag{3}
\]

where P (x) is a diagonal matrix with the state-dependent conditional densities


pi (x) = Pr ( Xt = x| St = i), i ∈ {1, 2, . . . , m}, as entries. The conditional dis-
tribution of Xt may be either discrete or continuous, univariate or multivariate.
In mixtures of continuous distributions, the likelihood can be unbounded in the
vicinity of certain parameter combinations.2
The likelihood function of an HMM is in general a complicated function of the
parameters with several local maxima. The two most popular approaches to
maximizing the likelihood are direct numerical maximization and the Baum–
Welch algorithm, a special case of the expectation–maximization (EM) algo-
rithm (Cappé et al. 2005, Zucchini and MacDonald 2009). All discrete-time
2 If the conditional distribution is normal, then the likelihood can be made arbitrarily large by

setting the mean equal to one of the observations and letting the conditional variance tend to zero
(Frühwirth-Schnatter 2006).

models are estimated using the R-package hsmm due to Bulla et al. (2010) that
implements the EM algorithm in the version presented by Guédon (2003).
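For concreteness, the likelihood (3) can be evaluated with the standard scaled forward recursion; the sketch below is a minimal Python illustration with assumed two-state parameters and synthetic data, not the estimation code used in this paper (which relies on the R-package hsmm).

```python
# A sketch of evaluating the HMM likelihood (3) with the scaled forward recursion.
# Parameter values and the data series are placeholders, not estimates from the paper.
import numpy as np
from scipy.stats import norm

def hmm_loglik(x, delta, Gamma, mu, sigma):
    """Log of delta P(x1) Gamma P(x2) ... Gamma P(xT) 1 for a Gaussian HMM."""
    # State-dependent densities p_i(x_t) for every state i and observation t
    dens = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    phi = delta * dens[0]
    c = phi.sum()
    loglik = np.log(c)
    phi /= c
    for t in range(1, len(x)):
        phi = (phi @ Gamma) * dens[t]
        c = phi.sum()
        loglik += np.log(c)
        phi /= c
    return loglik

# Illustrative two-state parameters (assumed values)
delta = np.array([0.5, 0.5])
Gamma = np.array([[0.99, 0.01], [0.02, 0.98]])
mu = np.array([0.0005, -0.0007])
sigma = np.array([0.008, 0.020])
x = np.random.default_rng(1).normal(0, 0.01, size=1000)
print(hmm_loglik(x, delta, Gamma, mu, sigma))
```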
In an HMM, the sojourn times are implicitly assumed to be geometrically dis-
tributed:
\[
\Pr(\text{staying } t \text{ time steps in state } i) = \gamma_{ii}^{\,t-1}(1 - \gamma_{ii}). \tag{4}
\]

The geometric distribution is memoryless, implying that the time until the next
transition out of the current state is independent of the time spent in the state.

Hidden semi-Markov models


If the assumption of geometrically distributed sojourn times is unsuitable, then
hidden semi-Markov models can be applied. HMMs and HSMMs differ only
in the way that the state process is defined. In an HSMM, the sojourn-time
distribution is modeled explicitly for each state i:

\[
d_i(u) = \Pr\big(S_{t+u+1} \neq i,\, S_{t+u-v} = i,\, v = 0, \ldots, u-2 \mid S_{t+1} = i,\, S_t \neq i\big) \tag{5}
\]
and the transition probabilities are defined as
\[
\gamma_{ij} = \Pr\big(S_{t+1} = j \mid S_{t+1} \neq i,\, S_t = i\big) \tag{6}
\]
for each $i \neq j$, with $\gamma_{ii} = 0$ and $\sum_j \gamma_{ij} = 1$.
The conditional independence assumption for the observation process is similar
to a simple HMM, but the semi-Markov chain associated with an HSMM does
not have the Markov property at each time t. This property is transferred to
the imbedded, first-order Markov chain—that is, the sequence of visited states
(Bulla and Bulla 2006). In other words, the future states are only conditionally
independent of the past states when the process changes state.
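A minimal sketch of how an HSMM differs in simulation terms is given below: the sojourn time in each state is drawn explicitly (here from a negative binomial distribution, the choice used later in the paper), and the embedded chain never remains in the current state. All numerical values are assumptions for illustration only.

```python
# A sketch of simulating from a two-state Gaussian HSMM with explicitly modeled
# sojourn times; all parameter values below are assumed, not estimates from the paper.
import numpy as np

rng = np.random.default_rng(2)

# Embedded transition matrix: zero diagonal, rows sum to one (trivial for two states)
Gamma_embedded = np.array([[0.0, 1.0],
                           [1.0, 0.0]])
mu = np.array([0.0005, -0.0007])
sigma = np.array([0.008, 0.020])
size_r = np.array([1.5, 0.8])    # negative binomial size parameters (assumed)
prob_p = np.array([0.02, 0.03])  # negative binomial probability parameters (assumed)

T = 10_000
x = np.empty(T)
t, state = 0, 0
while t < T:
    # Sojourn time in the current state, shifted so that it is at least one step
    d = 1 + rng.negative_binomial(size_r[state], prob_p[state])
    d = min(d, T - t)
    x[t:t + d] = rng.normal(mu[state], sigma[state], size=d)
    t += d
    # Next state drawn from the embedded chain (never the current state)
    state = rng.choice(2, p=Gamma_embedded[state])
```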

3 Hidden Markov models in continuous time


In a continuous-time Markov chain, transitions can occur at any time rather
than at discrete and equidistant time points. There is no smallest time step and
the quantities of interest are the transition probabilities

pij (∆t) = Pr ( S (t + ∆t) = j| S (t) = i) (7)

as ∆t → 0. Clearly, pij (0) = 0 for different states i and j, and it can be shown
that under certain regularity conditions

\[
\lim_{t \to 0} P(t) = I. \tag{8}
\]

Assuming that pij(∆t) is differentiable at 0, the transition rates are defined as
\[
p'_{ij}(0) = \lim_{\Delta t \to 0} \frac{p_{ij}(\Delta t) - p_{ij}(0)}{\Delta t}
           = \lim_{\Delta t \to 0} \frac{\Pr\big(S(t + \Delta t) = j \mid S(t) = i\big)}{\Delta t}
           = q_{ij} \tag{9}
\]
with the additional definition $q_{ii} = q_i = -\sum_{j \neq i} q_{ij}$. The transition intensity
matrix Q = {qij} has nonnegative off-diagonal elements qij, nonpositive diagonal
elements qi, and all rows sum to zero.
The stationary distribution π, if it exists, is found by solving the system of equations
\[
\begin{cases} \pi Q = 0 \\ \pi \mathbf{1} = 1. \end{cases} \tag{10}
\]
If it has a strictly positive solution (all elements in π are strictly positive), then
the stationary distribution exists and is independent of the initial distribution.
The matrix of transition probabilities P (t) = {pij (t)} can be found as the
solution to Kolmogorov’s differential equation
\[
\frac{dP(t)}{dt} = P(t)\,Q \tag{11}
\]
with the initial condition P(0) = I. The solution is
\[
P(t) = e^{Qt} P(0) = e^{Qt}. \tag{12}
\]

When the process enters state i, it remains there according to an exponential


distribution with parameter −qi > 0 before it instantly jumps to another state
j ̸= i with probability −qij /qi . A continuous-time Markov chain is fully charac-
terized by its initial distribution δ and the transition intensity matrix Q.
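These quantities are straightforward to compute numerically. The sketch below evaluates P(t) = e^{Qt} with a matrix exponential and solves (10) for the stationary distribution, using an assumed three-state intensity matrix rather than one estimated in this paper.

```python
# A sketch of the continuous-time quantities: P(t) = exp(Qt) via the matrix exponential,
# and the stationary distribution from pi Q = 0, pi 1 = 1. Q is an assumed example.
import numpy as np
from scipy.linalg import expm

Q = np.array([[-0.02,  0.02,  0.00],
              [ 0.01, -0.03,  0.02],
              [ 0.00,  0.05, -0.05]])

P1 = expm(Q * 1.0)           # one-step (e.g., one-day) transition probabilities
P5 = expm(Q * 5.0)           # five-step transition probabilities

# Stationary distribution: solve the stacked system [Q' ; 1'] pi = [0 ; 1]
A = np.vstack([Q.T, np.ones(3)])
b = np.concatenate([np.zeros(3), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(P1.sum(axis=1))        # every row of P(t) sums to one
print(pi @ Q)                # approximately zero at stationarity
```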
It follows that the transition intensity matrix Q can in principle be found by
taking the logarithm of the one-step transition probability matrix

\[
P(t) = e^{Qt} \;\Rightarrow\; Q = \log P(1). \tag{13}
\]

Computing the logarithm of a matrix with many elements that are close to zero
or zero is not a trivial operation. Instead, an intuitive estimate of the transition
rate qi can be based on the discrete transition probability γii as

q̃i = − log γ̂ii . (14)

This estimate does not take into account that the process might change from a
given state and back within the sampling interval. Thus, this simple estimate

will underestimate qi , but the error will be small for qi ≪ 1 (Madsen et al.
1985).
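The two estimates can be compared directly, as in the sketch below, where the one-step transition probability matrix is an assumed example: for small transition rates, the diagonal of log P(1) is close to log γ̂ii, in line with the simple estimate (14).

```python
# A sketch comparing the matrix-logarithm estimate Q = log P(1) with the simple
# element-wise estimate q_i = -log(gamma_ii); the one-step matrix is an assumed example.
import numpy as np
from scipy.linalg import logm

P1 = np.array([[0.990, 0.008, 0.002],
               [0.015, 0.975, 0.010],
               [0.005, 0.030, 0.965]])

Q_full = logm(P1)                      # generator implied by the one-step matrix
q_simple = -np.log(np.diag(P1))        # crude rate-of-leaving estimates, eq. (14)

print(np.real_if_close(np.diag(Q_full)))  # close to -q_simple when rates are small
print(-q_simple)
```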

The exponential distribution is memoryless just like its discrete analogue, the
geometric distribution. By introducing dummy states that are indistinguishable
from one or more of the original states, it is possible to allow for nonexponentially-
distributed sojourn times (see, e.g., Madsen et al. 1985, Iversen et al. 2013). The
sojourn-time distribution will then be a mixture of exponential distributions,
which is a phase-type distribution, and the Markov property will be transferred
to the imbedded Markov chain, as for the HSMM. Phase-type distributions can
be used to approximate any positive-valued distribution with arbitrary precision
(see Nielsen 2013, for details). Similarly, Langrock and Zucchini (2011) showed
how a discrete-time HMM can be structured to fit any sojourn-time distribution
with arbitrary precision.

It is often convenient to assume that in a short time interval ∆t, the only possible
transitions are to the neighboring states:

\[
\left.
\begin{aligned}
p_{ij}(\Delta t) &= o(\Delta t), \quad |i - j| \geq 2, \\
p_{ii}(\Delta t) &= 1 - q_i \Delta t + o(\Delta t), \\
p_{i,i-1}(\Delta t) &= w_i q_i \Delta t + o(\Delta t), \\
p_{i,i+1}(\Delta t) &= (1 - w_i)\, q_i \Delta t + o(\Delta t),
\end{aligned}
\right\} \quad i \in \{1, 2, \ldots, m\}, \tag{15}
\]
where $\lim_{\Delta t \to 0} o(\Delta t)/\Delta t = 0$. The notation includes transitions from state 1 to
state m and reverse with the definition that state 0 = state m and state (m + 1) =
state 1.

It should be noted that even though the process cannot go directly from state
i to state i + 2 without passing through state i + 1, there is no limit to how fast
such a transition can occur.

Under this assumption, the matrix of transition intensities has the structure
\[
Q = \begin{bmatrix}
-q_1 & (1-w_1)\,q_1 & 0 & \cdots & 0 & w_1 q_1 \\
w_2 q_2 & -q_2 & (1-w_2)\,q_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
(1-w_m)\,q_m & 0 & 0 & \cdots & w_m q_m & -q_m
\end{bmatrix}. \tag{16}
\]
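A small sketch of how such a structured intensity matrix can be assembled (with circular nearest-neighbor transitions, so that state m + 1 = state 1) is given below; the rates q_i and weights w_i are assumed values.

```python
# A sketch of building the structured intensity matrix (16) with transitions only to
# neighboring states (circularly, so state m+1 = state 1); q and w are assumed values.
import numpy as np

def neighbor_intensity_matrix(q, w):
    """q[i] > 0 is the rate of leaving state i; w[i] is the probability of moving to state i-1."""
    m = len(q)
    Q = np.zeros((m, m))
    for i in range(m):
        Q[i, i] = -q[i]
        Q[i, (i - 1) % m] = w[i] * q[i]        # transition to state i-1 (circular)
        Q[i, (i + 1) % m] = (1 - w[i]) * q[i]  # transition to state i+1 (circular)
    return Q

q = np.array([0.02, 0.03, 0.05, 0.04])   # assumed exit rates
w = np.array([0.0, 0.6, 0.7, 1.0])        # assumed weights
Q = neighbor_intensity_matrix(q, w)
print(Q.sum(axis=1))   # every row sums to zero by construction
```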

The number of parameters increases linearly with the number of states. Thus,
a continuous-time Markov chain yields a parameter reduction over its discrete-
time analogue if the number of states exceeds three. The higher the number
of states, the larger the reduction. In addition, it is possible to incorporate

inhomogeneity without a dramatic increase in the number of parameters using


splines, harmonic functions, or similar.3
Another advantage of a continuous-time formulation is the flexibility to use
data with any sampling interval as the data is not assumed to be equidistantly
sampled. In a discrete-time model, weekends and bank holidays are ignored so
that the trading days are aggregated, meaning that Friday is followed by Monday
in a normal week. Using a continuous-time model, it is possible to model the
sampling times and thereby recognize that there is a longer time span between
Friday and Monday. The so-called weekend effect in returns has been studied
empirically for decades (see, e.g., French 1980, Rogalski 1984, Asai and McAleer
2007). There are two main effects: first, returns are higher on Fridays and
lower on Mondays than on other days; second, the variance is larger
on Fridays and lower on Mondays. There are several plausible explanations for
this, but all of them are more complicated to model than just treating Saturdays
and Sundays as missing observations. Below, the observations are assumed to
be equidistantly sampled in order to facilitate a comparison to the discrete-time
models using model selection criteria.
The continuous-time hidden Markov models (CTHMMs) are estimated using
the R-package msm due to Jackson (2011) that is based on direct numerical
maximization of the likelihood function.

4 Data
The data analyzed is daily log-returns of the S&P 500 and the FTSE 100 total
return index covering the period from 23 July 1993 to 22 July 2013. The log-
returns are calculated using rt = log (Pt ) − log (Pt−1 ), where Pt is the closing
price of the index on day t and log is the natural logarithm. The focus will be
on the log-returns of the S&P 500 index as the analysis of the FTSE 100 index
showed similar results. The results of the analysis of the FTSE returns can be
found in appendix B.
The 5,040 log-returns of the S&P 500 index are shown in figure 1. The volatility
is seen to form clusters as large price movements tend to be followed by large
price movements and vice versa. Volatility clustering is a consequence of the
persistence of the ACF of the squared returns (Cont 2001).
The first four moments of the daily log-returns are shown in table 2 together with
approximate 95% confidence intervals based on bootstrapping 500,000 series of
length 5,040 from the log-returns with replacement. The confidence intervals
for the mean and skewness are very wide, whereas the estimates of the stan-
dard deviation and kurtosis are more certain. The distribution is left skew and

3 See Iversen et al. (2013) for an example of the use of splines to reduce the number of parameters

in an inhomogeneous Markov model.


Figure 1: Daily log-returns of the S&P 500 total return index and a kernel estimate of the density of the standardized daily log-returns together with the density function for the standard normal distribution.

Mean Std. deviation Skewness Kurtosis JB


0.00034 0.0121 -0.24 11.3 14372
[0.00001; 0.00068] [0.0116; 0.0127] [−0.75; 0.30] [8.5; 14.1] [6356; 25803]

Table 2: First four moments of the S&P 500 log-returns and the Jarque–Bera test
statistic together with bootstrapped 95% confidence intervals.

leptokurtic with an excess kurtosis of 8.3 compared to the normal distribution.


The Jarque–Bera test statistic4 rejects the normal distribution at a 0.1% level
of significance.
The excess kurtosis is evident from the plot of the density function in figure 1.
There is too much mass centered right around the mean and in the tails com-
pared to the normal distribution. There are 81 observations that deviate more
than three standard deviations from the mean compared to an expectation of
14 if the returns were normally distributed.
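These descriptive statistics are simple to reproduce for any price series; the sketch below computes log-returns, the first four moments, the Jarque–Bera statistic from the footnote, and the number of exceedances beyond three standard deviations. The price series here is synthetic, so the numbers are placeholders rather than the values reported in this section.

```python
# A sketch of the descriptive statistics: log-returns, the first four moments, the
# Jarque-Bera statistic, and the count of 3-sigma exceedances. The price series is
# synthetic; with real S&P 500 closing prices the same code applies.
import numpy as np

rng = np.random.default_rng(3)
prices = 100 * np.exp(np.cumsum(rng.standard_t(df=4, size=5041) * 0.01))
r = np.diff(np.log(prices))                      # r_t = log(P_t) - log(P_{t-1})

T = len(r)
mean, std = r.mean(), r.std(ddof=0)
skew = np.mean((r - mean)**3) / std**3
kurt = np.mean((r - mean)**4) / std**4
jb = T * (skew**2 / 6 + (kurt - 3)**2 / 24)      # Jarque-Bera test statistic

n_extreme = np.sum(np.abs(r - mean) > 3 * std)   # observations beyond 3 std. deviations
print(mean, std, skew, kurt, jb, n_extreme)
```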
Figure 3 shows the ACF of the absolute returns raised to different positive
powers. It is a stylized fact that autocorrelations of positive powers of absolute
returns are highest at power one. This is called the Taylor effect. The results
generally agree with the Taylor effect although the effect is not clear-cut at the
lowest lags.
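The Taylor effect can be checked for any return series r by computing the sample ACF of |r_t|^θ for several powers θ, as in the following short sketch; the series below is a synthetic placeholder rather than the S&P 500 returns.

```python
# A sketch of the Taylor-effect check: the sample ACF of |r_t|^theta for several powers
# theta, averaged over the first ten lags; r is a placeholder return series.
import numpy as np

r = np.random.default_rng(7).standard_t(df=4, size=5040) * 0.01

def sample_acf(z, k):
    z = z - z.mean()
    return np.dot(z[:-k], z[k:]) / np.dot(z, z)

for theta in (0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75):
    acf10 = np.mean([sample_acf(np.abs(r)**theta, k) for k in range(1, 11)])
    print(theta, round(acf10, 4))
```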

5 Empirical results
The empirical autocorrelation function of the squared log-returns is shown in
figure 4 together with ACFs of simulated squared returns from the fitted models.
4 The Jarque–Bera test statistic is defined as $JB = T\left(\frac{\text{Skewness}^2}{6} + \frac{(\text{Kurtosis}-3)^2}{24}\right)$, where T is the number of observations.
Figure 3: The empirical autocorrelation function of the absolute log-returns of the S&P 500 total return index raised to different positive powers.

Figure 4: Empirical autocorrelation function of the squared log-returns at lag 1–100 together with simulated autocorrelation functions for the fitted models.

Of the two-state models, the HMM with normal conditional distributions is


seen to be the best fit at the lowest lags, whereas the HSMM with normal
conditional distributions is the best fit from lag 40 and upwards. The HSMM
with t components is very persistent, but at too low a level, and it provides a
poor fit overall.
As the HSMM with normal conditional distributions provides the best fit at the
highest lags, this is also the model that best reproduces the stylized fact relating
to the persistence of the ACF. This is true when looking at two-state models,
as concluded by Bulla and Bulla (2006), but a much better fit can be obtained
by increasing the number of states to three.
The HSMM with normal conditional distributions is also a better fit than the
HMM when looking at the three-state models in figure 4, as the ACF for the
HMM decays too fast. The ACF for the HSMM with conditional t distributions
is again at too low a level.
The fit of a CTHMM with three states with normal conditional distributions is
similar to that of the three-state HMM. This appears from the mean squared

                    Original data              Outlier-corrected data
Model           MSE×10³    WMSE×10³          MSE×10³    WMSE×10³
HMMN (2)          7.9         4.6              13.1        9.7
HSMMN (2)         9.1         3.6              14.0        8.0
HSMMt (2)        15.7         4.6              15.1        6.5
HMMN (3)          4.4         3.2               6.1        6.7
HSMMN (3)         3.3         2.0               4.1        4.2
HSMMt (3)         9.2         3.4               5.1        4.0
CTHMMN (3)        4.2         3.2               6.1        6.5
HMMN (4)          4.0         2.2               3.4        3.3
HSMMN (4)         2.4         1.0               1.8        2.0
HSMMt (4)         3.4         1.1               1.4        1.2
CTHMMN (4)        1.9         1.3               1.3        1.6

Table 5: Mean squared error and weighted mean squared error of the autocorrelation
function of the squared returns and the outlier-corrected squared returns for the fitted
models.

error and the weighted mean squared error of the ACF of the squared returns
for the fitted models in table 5. The weighted mean squared error reweights
the error at lag k by $0.95^{100-k}$ to increase the influence of higher-order lags
following the approach by Bulla and Bulla (2006).
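As a sketch of how the two criteria can be computed from an empirical and a model-implied ACF over lags 1–100 (with weight $0.95^{100-k}$ at lag k): the normalization of the weighted criterion by the sum of the weights is an assumption, and the two ACF vectors below are placeholders.

```python
# A sketch of the (weighted) mean squared error between an empirical ACF and a
# model-implied ACF over lags 1-100; both ACF vectors are placeholders, and the
# normalization of the weighted criterion is an assumption.
import numpy as np

lags = np.arange(1, 101)
acf_empirical = 0.25 * 0.97**lags          # placeholder for the sample ACF
acf_model = 0.22 * 0.96**lags              # placeholder for a simulated/model ACF

err2 = (acf_empirical - acf_model)**2
weights = 0.95**(100 - lags)
mse = err2.mean()
wmse = np.sum(weights * err2) / np.sum(weights)
print(mse, wmse)
```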

A CTHMM with four states with normal conditional distributions is seen to


provide a better fit to the ACF of the squared returns than the three-state
HSMM with normal conditional distributions. This observation is supported by
the computed mean squared errors and weighted mean squared errors.

The first four moments of the log-returns are shown in table 6 together with
the estimated moments for the fitted models based on 500,000 Monte Carlo
simulations. Two states with normal conditional distributions are not enough
to adequately capture the excess kurtosis of the log-returns. The two-state
model with conditional t distributions is able to reproduce the excess kurtosis,
but this model was not a good fit to the ACF.

The three-state models all provide a reasonable fit to the empirical moments.


The kurtosis is still a little too low, with the exception of the HSMM with
t components. The four-state models all provide a good fit to the empirical
moments.

5.1 Correcting for outliers


Figure 7 shows the empirical ACF of the squared outlier-corrected log-returns
together with the ACFs of the squared outlier-corrected simulated log-returns

Model Mean Std. dev. Skewness Kurtosis


rt 0.00034 0.0121 -0.24 11.3
[0.00001; 0.00068] [0.0116; 0.0127] [−0.75; 0.30] [8.5; 14.1]
HMMN (2) 0.00034 0.0122 -0.17 5.6
HSMMN (2) 0.00035 0.0121 -0.24 6.5
HSMMt (2) 0.00042 0.0122 -0.14 12.0
HMMN (3) 0.00036 0.0120 -0.18 8.2
HSMMN (3) 0.00032 0.0121 -0.26 8.4
HSMMt (3) 0.00037 0.0123 -0.19 14.0
CTHMMN (3) 0.00025 0.0120 -0.20 8.7
HMMN (4) 0.00033 0.0122 -0.34 10.3
HSMMN (4) 0.00033 0.0122 -0.30 10.5
HSMMt (4) 0.00035 0.0125 -0.33 11.5
CTHMMN (4) 0.00037 0.0124 -0.30 10.7

Table 6: First four moments of the log-returns together with bootstrapped 95% confi-
dence intervals and simulated moments for the fitted models.

Figure 7: Empirical autocorrelation function of the squared outlier-corrected log-returns at lag 1–100 together with autocorrelation functions of the squared outlier-corrected simulated log-returns for the fitted models.

for the fitted models. Following the approach by Granger and Ding (1995a),
values outside the interval $\bar{r}_t \pm 4\hat{\sigma}$ are set equal to the nearest boundary.
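The outlier correction itself amounts to clipping the return series at these boundaries, as in this short sketch (r is a placeholder series):

```python
# A sketch of the outlier correction: returns outside mean +/- 4 std. deviations are
# set to the nearest boundary before the ACF is recomputed; r is any return series.
import numpy as np

r = np.random.default_rng(4).standard_t(df=3, size=5040) * 0.01
lower, upper = r.mean() - 4 * r.std(), r.mean() + 4 * r.std()
r_corrected = np.clip(r, lower, upper)
```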
Restraining the impact of outliers reduces the amount of noise in the empirical
ACF significantly. The noise reduction reveals a weekly variation that could
suggest the need for an inhomogeneous, yet continuous, Markov model. The
flexibility of a continuous-time model would be necessary to incorporate inho-
mogeneity without a dramatic increase in the number of parameters.
The conclusions regarding the fit of the different models to the empirical ACF are
still valid when looking at the outlier-corrected returns. The outperformance
of the HSMMs relative to the HMMs at the high lags is even more apparent

Model No. of parameters Log-lik AIC BIC


HMMN (2) 7 15984 -31954 -31908
HSMMN (2) 9 16062 -32107 -32048
HSMMt (2) 11 16137 -32251 -32180
HMMN (3) 14 16214 -32400 -32309
HSMMN (3) 17 16227 -32419 -32308
HSMMt (3) 20 16245 -32449 -32319
CTHMMN (3) 13 16209 -32391 -32306
HMMN (4) 23 16262 -32478 -32328
HSMMN (4) 27 16273 -32492 -32316
HSMMt (4) 31 16284 -32505 -32303
CTHMMN (4) 19 16256 -32474 -32350

Table 8: Model selection based on the Akaike information criterion and the Bayesian
information criterion.

when looking at the outlier-corrected data. What is also more apparent is the
outperformance of the four-state CTHMM with normal conditional distributions
relative to the three-state HSMMs. The ACF for the four-state CTHMM still
decays too fast from lag 40 and onwards, but it clearly provides a better fit than
the HSMMs with a similar number of parameters.

5.2 Model selection


Model selection involves both the choice of an appropriate number of states
and the choice between competing state-dependent distributions. Likelihood-
ratio tests cannot be applied to models with different numbers of states as
these are not hierarchically nested. Instead, penalized likelihood criteria can be
used to select the model that is estimated to be closest to the ’true’ model, as
suggested by Zucchini and MacDonald (2009). A disadvantage is that model
selection criteria provide no information about the confidence in the selected
model relative to others.
A four-state HSMM fits the data as well as the four-state CTHMM with normal
conditional distributions, but it has 8 or 12 more parameters (see table 8). There-
fore, the four-state CTHMM is preferred to both the three and the four-state
HSMMs according to the Bayesian information criterion5 . Akaike’s information
criterion6 selects the four-state HSMM with t components as it puts less em-
phasis on the number of parameters. However, various simulation studies have
shown that AIC tends to select models with too many states (Bacci et al. 2014).

5 The Bayesian information criterion is defined as BIC = −2 log L + p log T , where T is the

number of observations and p is the number of parameters.


6 The Akaike information criterion is defined as AIC = −2 log L + 2p.
Figure 9: Autocorrelation function of the absolute value of 500,000 returns simulated from the estimated four-state CTHMM raised to different positive powers.

The parameter estimates are more uncertain the higher the number of states
because of the quadratic increase in the number of parameters for the discrete-
time models. Rydén et al. (1998) investigated HMMs with two and three states
and found that the three-state models were “less similar to each other” and
that “the estimation results seemed heavily dependent on outlying observations”.
That was also the reason why Bulla and Bulla (2006) only considered HSMMs
with two states. There is a strong preference for models with fewer parameters
as a four-state HSMM with over 30 parameters is likely to be overfitting the
data.
It is problematic to fit a five-state CTHMM to the log-returns. The likelihood
function appears to be highly multimodal and it is easy to find several local
maxima by using different starting values. This indicates that the model is
overfitting the data. It was not possible to find a five-state CTHMM with a
lower BIC-value than the four-state CTHMM.
The ability of the four-state CTHMM to capture the Taylor effect is illustrated
in figure 9. The ACF of the absolute value of 500,000 simulated returns raised
to different positive powers is seen to be highest at power one.

6 Conclusion
HSMMs were found to be better at reproducing the slowly decaying ACF of
squared daily returns of the S&P 500 and the FTSE 100 total return index
than HMMs when looking at two and three-state models in agreement with the
finding by Bulla and Bulla (2006). A much better fit to the slowly decaying ACF
and the empirical moments was obtained by increasing the number of hidden
states from two to three.
An extension to continuous time was presented and it was shown that a CTHMM
with four states provides a better fit than the discrete-time models with three

states with a similar number of parameters and, even more so, after restraining
the impact of outliers. There was no indication that the memoryless property
of the sojourn-time distribution is inconsistent with the long-memory property
of the squared returns.
Different models were preferred by the different selection criteria, but the four-
state CTHMM with normal conditional distributions was selected by the Bayesian
information criterion that was believed to be the most reliable. Finally, it was
argued that the four-state CTHMM is preferred to the four-state HSMMs due to
the significantly lower number of parameters resulting from the continuous-time
formulation that makes the model less likely to be overfitting the data.

References
Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual
Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.

Asai, M. and M. McAleer. “Non-trading day effects in asymmetric conditional


and stochastic volatility models.” Econometrics Journal, vol. 10, no. 1 (2007),
pp. 113–123.

Bacci, S., S. Pandolfi, and F. Pennoni. “A comparison of some criteria for states
selection in the latent Markov model for longitudinal data.” Advances in Data
Analysis and Classification, vol. 8, no. 2 (2014), pp. 125–145.

Bulla, J. “Hidden Markov models with t components. Increased persistence and


other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.

Bulla, J. and I. Bulla. “Stylized facts of financial time series and hidden semi-
Markov models.” Computational Statistics & Data Analysis, vol. 51, no. 4
(2006), pp. 2192–2209.

Bulla, J., I. Bulla, and O. Nenadić. “hsmm—an R package for analyzing hidden
semi-Markov models.” Computational Statistics & Data Analysis, vol. 54,
no. 3 (2010), pp. 611–619.

Cappé, O., E. Moulines, and T. Rydén. Inference in Hidden Markov Models.


Springer: New York (2005).

Cont, R. “Empirical properties of asset returns: stylized facts and statistical


issues.” Quantitative Finance, vol. 1, no. 2 (2001), pp. 223–236.

French, K. R. “Stock returns and the weekend effect.” Journal of Financial


Economics, vol. 8, no. 1 (1980), pp. 55–69.

Frühwirth-Schnatter, S. Finite Mixture and Markov Switching Models. Springer:


New York (2006).

Granger, C. W. J. and Z. Ding. “Some properties of absolute return: An alter-


native measure of risk.” Annales D’Economie Et Statistique, vol. 40 (1995a),
pp. 67–92.

Granger, C. W. J. and Z. Ding. “Stylized facts on the temporal and distribu-


tional properties of daily data from speculative markets.” Unpublished paper,
Department of Economics, University of California, San Diego (1995b).

Granger, C. W. J., S. Spear, and Z. Ding. “Stylized facts on the temporal and
distributional properties of absolute returns: An update.” In Proceedings of
the Hong Kong International Workshop on Statistics in Finance. Imperial
College Press: London (2000), pp. 97–120.

Guédon, Y. “Estimating hidden semi-Markov chains from discrete sequences.”


Journal of Computational and Graphical Statistics, vol. 12, no. 3 (2003), pp.
604–639.

Iversen, E. B., J. K. Møller, J. M. Morales, and H. Madsen. “Inhomogeneous


Markov models for describing driving patterns.” IMM-Technical Report-
2013 02, Technical University of Denmark (2013).

Jackson, C. H. “Multi-state models for panel data: The msm package for R.”
Journal of Statistical Software, vol. 38, no. 8 (2011), pp. 1–29.

Langrock, R. and W. Zucchini. “Hidden Markov models with arbitrary state


dwell-time distributions.” Computational Statistics & Data Analysis, vol. 55,
no. 1 (2011), pp. 715–724.

Madsen, H., H. Spliid, and P. Thyregod. “Markov models in discrete and con-
tinuous time for hourly observations of cloud cover.” Journal of Climate and
Applied Meteorology, vol. 24, no. 7 (1985), pp. 629–639.

Malmsten, H. and T. Teräsvirta. “Stylized facts of financial time series and


three popular models of volatility.” European Journal of Pure and Applied
Mathematics, vol. 3, no. 3 (2010), pp. 443–477.

Nielsen, B. F. Matrix Analytic Methods in Applied Probability with a View


towards Engineering Applications. Doctoral thesis, Technical University of
Denmark (2013).

Rogalski, R. J. “New findings regarding day-of-the-week returns over trading


and non-trading periods: A note.” Journal of Finance, vol. 39, no. 5 (1984),
pp. 1603–1614.

Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.

Silvestrov, D. and F. Stenberg. “A pricing process with stochastic volatility


controlled by a semi-Markov process.” Communications in Statistics-Theory
and Methods, vol. 33, no. 3 (2004), pp. 591–608.
Zucchini, W. and I. L. MacDonald. Hidden Markov Models for Time Series: An
Introduction Using R. Chapman & Hall: London, 2nd ed. (2009).

A Parameter estimates

m Γ µ × 104 σ 2 × 104 δ
0.990 0.010 8.3 0.52 1.0
(0.002) (1.1) (0.01) (0.2)
2
0.021 0.979 −6.9 3.47 0.0
(0.004) (4.9) (0.14)

0.982 0.018 0.000 10.1 0.32 1.0


(0.004) (0.001) (1.3) (0.01) (0.1)
0.015 0.978 0.006 0.5 1.28 0.0
3
(0.003) (0.002) (2.5) (0.04) (0.1)
0.000 0.030 0.970 −12.4 7.08 0.0
(0.003) (0.010) (12.8) (0.50)

0.979 0.021 0.000 0.000 10.8 0.29 1.0


(0.005) (0.006) (0.00) (1.5) (0.01) (0.2)
0.020 0.970 0.009 0.001 3.3 0.96 0.0
(0.005) (0.003) (0.001) (2.7) (0.05) (0.2)
4
0.000 0.017 0.976 0.007 −2.9 2.39 0.0
(0.000) (0.006) (0.004) (5.7) (0.14) (0.1)
0.000 0.000 0.049 0.951 −29.9 12.16 0.0
(0.000) (0.002) (0.026) (31.2) (1.73)

Table 10: Parameter estimates for the fitted m-state HMMs with normal conditional
distributions together with bootstrapped standard errors based on 250 simulations of the
model.

m Γ 1−p r × 10 µ × 104 σ 2 × 104 δ


0 1 0.994 0.4 9.0 0.48 1.0
(0.002) (0.1) (1.3) (0.01) (0.4)
2
1 0 0.975 0.5 −10.8 4.00 0.0
(0.010) (0.2) (5.7) (0.18)

0 1.000 0.000 0.993 0.3 10.5 0.31 1.0


(0.000) (0.003) (0.1) (1.3) (0.01) (0.3)
0.972 0 0.028 0.945 1.7 −1.0 1.51 0.0
3
(0.013) (0.037) (1.7) (3.0) (0.07) (0.3)
0.000 1.000 0 0.983 5.5 −14.3 7.26 0.0
(0.000) (0.052) (33.0) (12.7) (0.54)

0 1.000 0.000 0.000 0.985 0.5 11.6 0.25 1.0


(0.000) (0.000) (0.006) (0.5) (1.4) (0.01) (0.3)
0.970 0 0.022 0.008 0.965 1.0 2.1 1.05 0.0
(0.022) (0.020) (0.024) (0.7) (2.8) (0.06) (0.2)
4
0.000 0.675 0 0.325 0.937 31.2 −2.4 2.38 0.0
(0.000) (0.161) (0.124) (124.8) (5.2) (0.17) (0.2)
0.000 0.000 1.000 0 0.977 4.2 −30.6 11.81 0.0
(0.000) (0.000) (0.098) (68.9) (31.8) (2.26)

Table 11: Parameter estimates for the fitted m-state HSMMs with normal conditional
distributions together with bootstrapped standard errors based on 250 simulations of the
model. p and r are the parameters of the negative binomial sojourn-time distribution.

m Γ 1−p r × 10 µ × 104 σ 2 × 104 t δ


0 1 0.997 0.2 9.5 0.34 7.2 1.0
(0.002) (0.1) (1.2) (0.02) (1.2) (0.4)
2
1 0 0.984 0.5 −6.0 2.10 5.6 0.0
(0.008) (0.2) (4.8) (0.17) (1.0)

0 1.000 0.000 0.990 7.4 10.5 0.25 6.7 1.0


(0.042) (0.009) (7.8) (1.1) (0.01) (1.4) (0.3)
0.630 0 0.370 0.979 10.5 1.3 1.16 22.4 0.0
3
(0.144) (0.047) (22.2) (2.7) (0.07) (467.9) (0.3)
0.000 1.000 0 0.983 5.2 −12.2 4.96 7.2 0.0
(0.045) (0.055) (28.5) (13.1) (0.75) (7.4)

0 1.000 0.000 0.000 0.987 8.9 10.6 0.23 6.8 1.0


(0.000) (0.000) (0.013) (10.2) (1.4) (0.01) (1.5) (0.2)
0.610 0 0.296 0.094 0.957 18.6 4.1 0.86 24.8 0.0
(0.103) (0.106) (0.052) (34.2) (2.2) (0.05) (16.4) (0.2)
4
0.000 0.724 0 0.276 0.931 34.3 −1.9 2.22 49.0 0.0
(0.000) (0.148) (0.115) (115.9) (5.2) (0.16) (98.2) (0.0)
0.000 0.000 1.000 0 0.981 3.9 −29.3 9.56 13.8 0.0
(0.000) (0.000) (0.110) (61.9) (31.6) (2.44) (11968)

Table 12: Parameter estimates for the fitted m-state HSMMs with Student t conditional
distributions together with bootstrapped standard errors based on 250 simulations of the
model. p and r are the parameters of the negative binomial sojourn-time distribution
and t is the degrees of freedom for the conditional t distributions.

m Q µ × 104 σ 2 × 104 δ
−0.014 0.014 0 0 10.6 0.32 1.0
(0.003) (1.4) (0.01)
0 −0.020 0.020 0 0.8 1.29 0.0
(0.003) (2.5) (0.03)
3
0.048 0 −0.068 0.020 0.8 1.29 0.0
(0.019)
0.005 0.019 0 −0.024 −14.6 7.12 0.0
(0.003) (0.003) (12.1) (0.26)

−0.018 0.017 0 0.001 11.1 0.29 1.0


(0.005) (0.001) (1.6) (0.01)
0.015 −0.020 0.005 0 3.6 0.95 0.0
(0.004) (0.002) (2.6) (0.03)
4
0 0.010 −0.015 0.005 −3.2 2.39 0.0
(0.003) (0.002) (5.2) (0.10)
0.005 0 0.029 −0.034 −29.2 12.29 0.0
(0.005) (0.013) (25.9) (0.82)

Table 13: Parameter estimates for the fitted m-state CTHMMs with normal conditional
distributions together with approximate standard errors based on the Hessian. The three-
state model has a dummy state as the second and the third state are indistinguishable.
No standard errors are given for the initial distributions as the Hessian is unreliable for
this purpose.

B FTSE results

Figure 14: Empirical autocorrelation function of the squared FTSE 100 log-returns at lag 1–100 together with simulated autocorrelation functions for the fitted models.

Model Mean Std. dev. Skewness Kurtosis


rt 0.00031 0.0118 -0.16 8.9
[−0.00002; 0.00063] [0.0113; 0.0123] [−0.56; 0.24] [7.0; 10.9]
HMMN (2) 0.00026 0.0119 -0.16 5.4
HSMMN (2) 0.00032 0.0115 -0.18 6.0
HSMMt (2) 0.00038 0.0116 -0.19 8.5
HMMN (3) 0.00028 0.0116 -0.21 7.2
HSMMN (3) 0.00026 0.0118 -0.23 7.1
HSMMt (3) 0.00037 0.0118 -0.15 9.1
CTHMMN (4) 0.00026 0.0123 -0.33 8.0

Table 15: First four moments of the FTSE 100 log-returns together with bootstrapped
95% confidence intervals and simulated moments for the fitted models.

Figure 16: Empirical autocorrelation function of the squared outlier-corrected FTSE 100 log-returns at lag 1–100 together with autocorrelation functions of the squared outlier-corrected simulated log-returns for the fitted models.

Model No. of parameters Log-lik AIC BIC


HMMN (2) 7 16054 -32093 -32047
HSMMN (2) 9 16093 -32167 -32108
HSMMt (2) 11 16125 -32227 -32156
HMMN (3) 14 16220 -32413 -32321
HSMMN (3) 17 16224 -32414 -32303
HSMMt (3) 20 16235 -32429 -32299
CTHMMN (4) 19 16252 -32466 -32342

Table 17: Model selection based on the Akaike information criterion and the Bayesian
information criterion.
PAPER B
Originally published in the Journal of Forecasting

Long memory of financial time series and hidden


Markov models with time-varying parameters

Peter Nystrup, Henrik Madsen, and Erik Lindström

Abstract

Hidden Markov models are often used to model daily returns and to infer
the hidden state of financial markets. Previous studies have found that the
estimated models change over time, but the implications of the time-varying
behavior have not been thoroughly examined. This paper presents an adap-
tive estimation approach that allows for the parameters of the estimated
models to be time varying. It is shown that a two-state Gaussian hidden
Markov model with time-varying parameters is able to reproduce the long
memory of squared daily returns that was previously believed to be the most
difficult fact to reproduce with a hidden Markov model. Capturing the time-
varying behavior of the parameters also leads to improved one-step density
forecasts. Finally, it is shown that the forecasting performance of the es-
timated models can be further improved using local smoothing to forecast
the parameter variations.

Keywords: Hidden Markov models; Daily returns; Long memory; Adaptive


estimation; Time-varying parameters.

1 Introduction
Many different stylized facts have been established for financial returns (see,
e.g., Granger and Ding 1995a,b, Granger et al. 2000, Cont 2001, Malmsten and
Teräsvirta 2010). Rydén et al. (1998) showed the ability of a hidden Markov
model (HMM) to reproduce most of the stylized facts of daily return series in-
troduced by Granger and Ding (1995a,b). In an HMM, the distribution that
generates an observation depends on the state of an unobserved Markov chain.
Rydén et al. (1998) found that the one stylized fact that could not be repro-
duced by an HMM was the slow decay of the autocorrelation function (ACF) of
squared and absolute daily returns, which is of great importance in financial risk
management. The daily returns do not have the long-memory property them-
selves, only their squared and absolute values do. Rydén et al. (1998) considered
this stylized fact to be the most difficult to reproduce with an HMM.

According to Bulla and Bulla (2006), the lack of flexibility of an HMM to model
this temporal higher-order dependence can be explained by the implicit assump-
tion of geometrically distributed sojourn times in the hidden states. This led
them to consider hidden semi-Markov models (HSMMs) in which the sojourn-
time distribution is modeled explicitly for each hidden state so that the Markov
property is transferred to the imbedded first-order Markov chain. They found
that an HSMM with negative-binomially distributed sojourn times was better
than the HMM at reproducing the long-memory property of squared daily re-
turns.
Bulla (2011) later showed that HMMs with t-distributed components reproduce
most of the stylized facts as well or better than the Gaussian HMM at the
same time as increasing the persistence of the visited states and the robustness
to outliers. Bulla (2011) also found that models with three states provide a
better fit than models with two states. In Nystrup et al. (2015b), an extension
to continuous time was presented and it was shown that a continuous-time
Gaussian HMM with four states provides a better fit than discrete-time models
with three states with a similar number of parameters.
The data analyzed in this paper is daily returns of the S&P 500 stock index from
1928 to 2014. It is the same time series that was studied in the majority of the
above-mentioned studies, just extended through the end of 2014. Granger and
Ding (1995a) divided the full sample into ten subsamples of 1,700 observations,
corresponding to a little less than seven years, as they believed it was likely that
with such a long time span there could have been structural shifts in the data-
generating process. Using the same approach, Rydén et al. (1998) and Bulla
(2011) found that the estimated HMMs, including the number of states and the
type of conditional distributions, changed considerably between the subsamples.
HMMs are popular for inferring the hidden state of financial markets and sev-
eral studies have shown the profitability of dynamic asset allocation strategies
based on this class of models (see, e.g., Bulla et al. 2011, Nystrup et al. 2015a).
The profitability of those strategies is directly related to the persistence of the
volatility. Therefore, it is relevant to explore in depth the importance of the
time-varying behavior for the models’ ability to reproduce the long memory and
forecast future returns. Failure to account for the time-varying behavior of the
estimated models is likely part of the reason why regime-switching models often
get outperformed by a simple random walk model when used for out-of-sample
forecasting, as discussed by Dacco and Satchell (1999).
In this paper, an adaptive estimation approach that allows for the parameters
of the estimated models to be changing over time is presented as an alternative
to fixed-length forgetting. After all, it is unlikely that the parameters change
after exactly 1,700 observations. The time variation is observation driven based
on the score function of the predictive likelihood function, which is related to
the generalized autoregressive score (GAS) model of Creal et al. (2013).

In agreement with the findings by Rydén et al. (1998) and Bulla (2011), the
parameters of the estimated models are found to vary significantly throughout
the data period. As a consequence of the time-varying transition probabilities,
the sojourn-time distribution is not the memoryless geometric distribution. A
two-state Gaussian HMM with time-varying parameters is shown to reproduce
the long memory of the squared daily returns. Faster adaption to the parameter
changes improves both the fit to the ACF of the squared returns and the one-step
density forecasts. Using local smoothing to forecast the parameter variations, it
is possible to further improve the density forecasts. Finally, the need for a third
state or a conditional t-distribution in the high-variance state to capture the
full extent of excess kurtosis is discussed in light of the nonstationary behavior
of the estimated models.
Section 2 gives an introduction to the HMM. Section 3 discusses the relation
between long memory and regime switching. In section 4, a method for adaptive
parameter estimation is outlined. Section 5 contains a description of the data.
The results are presented in section 6 and section 7 concludes.

2 The hidden Markov model


In a hidden Markov model, the probability distribution that generates an ob-
servation depends on the state of an underlying and unobserved Markov pro-
cess. An HMM is a particular kind of dependent mixture and is therefore also
referred to as a Markov-switching mixture model. General references to the
subject include Cappé et al. (2005), Frühwirth-Schnatter (2006), and Zucchini
and MacDonald (2009).
A sequence of discrete random variables {St : t ∈ N} is said to be a first-order
Markov chain if, for all t ∈ N, it satisfies the Markov property:

Pr ( St+1 | St , . . . , S1 ) = Pr ( St+1 | St ) . (1)

The conditional probabilities Pr ( Su+t = j| Su = i) = γij (t) are called transition


probabilities. The Markov chain is said to be homogeneous if the transition
probabilities are independent of u, and inhomogeneous otherwise.
If the Markov chain {St } has m states, then the bivariate stochastic process
{(St , Xt )} is called an m-state HMM. With S (t) and X (t) representing the
values from time 1 to time t, the simplest model of this kind can be summarized
by
\[
\Pr\big(S_t \mid S^{(t-1)}\big) = \Pr\big(S_t \mid S_{t-1}\big), \qquad t = 2, 3, \ldots, \tag{2a}
\]
\[
\Pr\big(X_t \mid X^{(t-1)}, S^{(t)}\big) = \Pr\big(X_t \mid S_t\big), \qquad t \in \mathbb{N}. \tag{2b}
\]

Hence, when the current state St is known, the distribution of Xt depends only
on St . This causes the autocorrelation of {Xt } to be strongly dependent on the
persistence of {St }.
An HMM is a state-space model with finite state space, where (2a) is the state
equation and (2b) is the observation equation. A specific observation can usually
arise from more than one state as the support of the conditional distributions
overlaps. Therefore, the unobserved state process {St } is not directly observable
through the observation process {Xt }, but can only be estimated.
As an example, consider the two-state model with Gaussian conditional densities:
\[
X_t = \mu_{S_t} + \varepsilon_{S_t}, \qquad \varepsilon_{S_t} \sim N\big(0, \sigma^2_{S_t}\big),
\]
where
\[
\mu_{S_t} = \begin{cases} \mu_1, & \text{if } S_t = 1,\\ \mu_2, & \text{if } S_t = 2, \end{cases}
\qquad
\sigma^2_{S_t} = \begin{cases} \sigma_1^2, & \text{if } S_t = 1,\\ \sigma_2^2, & \text{if } S_t = 2, \end{cases}
\qquad \text{and} \qquad
\Gamma = \begin{bmatrix} 1-\gamma_{12} & \gamma_{12} \\ \gamma_{21} & 1-\gamma_{21} \end{bmatrix}.
\]

For this model, the value of the autocorrelation function at lag k is
\[
\rho_{X_t}(k \mid \theta) = \frac{\pi_1 (1-\pi_1)(\mu_1 - \mu_2)^2}{\sigma^2}\,\lambda^k \tag{3}
\]
and the autocorrelation function for the squared process is
\[
\rho_{X_t^2}(k \mid \theta) = \frac{\pi_1 (1-\pi_1)\big(\mu_1^2 - \mu_2^2 + \sigma_1^2 - \sigma_2^2\big)^2}{\mathrm{E}\big[X_t^4 \mid \theta\big] - \mathrm{E}\big[X_t^2 \mid \theta\big]^2}\,\lambda^k, \tag{4}
\]

where π1 is the stationary probability of state one and λ = γ11 + γ22 − 1 is the
second largest eigenvalue of Γ (Frühwirth-Schnatter 2006).1 It is evident from
these expressions, as noted by Rydén et al. (1998), that an HMM with constant
parameters can only reproduce an exponentially-decaying autocorrelation struc-
ture. The ACF of the first-order process becomes zero if the mean values are
equal whereas persistence in the squared process can be induced either by a
difference in the means or by a difference in the variances. In both cases the
persistence increases with the combined persistence of the states as measured
by λ.

1 The other eigenvalue of Γ is λ = 1.



The sojourn times are implicitly assumed to be geometrically distributed:


\[
\Pr(\text{staying } t \text{ time steps in state } i) = \gamma_{ii}^{\,t-1}(1 - \gamma_{ii}). \tag{5}
\]

The geometric distribution is memoryless, implying that the time until the next
transition out of the current state is independent of the time spent in the state.
Langrock and Zucchini (2011) showed how an HMM can be structured to fit any
sojourn-time distribution with arbitrary precision by mapping multiple latent
states to the same output state. The distribution of sojourn times is then
a mixture of geometric distributions, which is a phase-type distribution, and
the Markov property is transferred to the imbedded Markov chain as in an
HSMM.2 Phase-type distributions can be used to approximate any positive-
valued distribution with arbitrary precision (Nielsen 2013). Similarly, time-
varying transition probabilities will lead to nongeometrically distributed sojourn
times. Thus, an estimation approach that allows the transition probabilities to
be changing over time has the flexibility to fit sojourn-time distributions other
than the geometric.

3 Long memory and regime switching


Granger and Teräsvirta (1999) and Gourieroux and Jasiak (2001) showed how
simple nonlinear time series models with infrequent regime switching can gener-
ate a long-memory effect in the autocorrelation function. Around the same time,
Diebold and Inoue (2001) showed analytically how stochastic regime switching
is easily confused with long memory. They specifically showed that under the
assumption that the persistence of the states converges to one as a function of
the sample size, the variances of partial sums of a Markov-switching process
will match those of a fractionally integrated process. This led Baek et al. (2014)
to question the relevance of the HMM for long memory as they found common
estimators of the long-memory parameter to be extremely biased when applied
to data generated by the Markov-switching model of Diebold and Inoue (2001).
Baek et al. (2014) argued that the HMM should be viewed as a short-memory
model with some long-memory features rather than a long-memory model.
Gourieroux and Jasiak (2001) emphasized that the distinction between a short-
memory model with long-memory features and a long-memory model has im-
portant practical implications, for example, when the model is used for making
predictions. If a fractional model is retained, the predictions should be based
on a long history of the observed series. If, on the other hand, a short-memory

2 In an HSMM, the sojourn-time distribution is modeled explicitly for each state. The conditional

independence assumption for the observation process is similar to a simple HMM, but the Markov
property is transferred to the imbedded first-order Markov chain—that is, the sequence of visited
states (Bulla and Bulla 2006).

(regime-switching) model with long-memory features is selected, then the pre-


dictions should be based on only the most recent observations.

Several studies have documented how structural changes in the unconditional


variance can cause long-range dependence in the volatility and integrated gener-
alized autoregressive conditional heteroskedasticity (GARCH) effects (see, e.g.,
Mikosch and Stărică 2004, and references therein). The GARCH model of En-
gle (1982) and Bollerslev (1986) has been extended in various ways since its
introduction in an effort to capture long-range dependencies in economic time
series (see, e.g., Baillie et al. 1996, Bauwens et al. 2014). Stărică and Granger
(2005) identified intervals of homogeneity where the nonstationary behavior of
the S&P 500 series can be approximated by a stationary model. They found
that the most appropriate is a simple model with no linear dependence but with
significant changes in the mean and variance of the time series. On the inter-
vals of homogeneity, the data is approximately a white noise process. Their
results indicate the time-varying unconditional variance as the main source of
nonstationarity in the S&P 500 series.

Calvet and Fisher (2004) showed that a multifrequency regime-switching model


is able to generate substantial outliers and capture both the low-frequency
regime shifts that cause abrupt volatility changes and the smooth autoregressive
volatility transitions at mid-range frequencies without including GARCH com-
ponents or heavy-tailed conditional distributions. The multifrequency regime-
switching model reproduces the long memory of the volatility by having a com-
ponent with a duration of the same order as the sample size. With two possible
values for the volatility the number of states increases at the rate 2f , where f
is the number of switching frequencies. Thus there are over 1,000 states when
f = 10.

It was already known that a Markov chain with a countably-infinite state space
can have the long-memory property (see Granger and Teräsvirta 1999, and
references therein). The model proposed in this paper is much simpler and,
consequently, less likely to be overfitted out of sample or in an on-line application
like adaptive forecasting. The model is a simple Gaussian HMM with parameters
that are time varying in a nonparametric way. This approach, in principle,
allows for an infinite number of states, but the number of parameters that
has to be estimated remains unchanged compared to an HMM with constant
parameters. The time variation is observation driven based on the score function
of the predictive model density. This is similar to the GAS model of Creal et al.
(2013), but the time variation is not limited to the transition probabilities as in
the study by Bazzi et al. (2017).

4 Adaptive parameter estimation


The parameters of an HMM are often estimated using the maximum-likelihood
(ML) method. The likelihood function of an HMM is, in general, a complicated
function of the parameters with several local maxima and in mixtures of con-
tinuous distributions the likelihood can be unbounded in the vicinity of certain
parameter combinations.3 The two most popular approaches to maximizing the
likelihood are direct numerical maximization and the Baum–Welch algorithm,
a special case of the expectation–maximization (EM) algorithm (Baum et al.
1970, Dempster et al. 1977).
When maximizing the likelihood, every observation is usually assumed to be
of equal importance no matter how long the sample period is. This approach
works well when the sample period is short and the underlying process does not
change over time. The time-varying behavior of the parameters documented in
previous studies (Rydén et al. 1998, Bulla 2011), however, calls for an adaptive
approach that assigns more weight to the most recent observations while keeping
in mind past patterns at a reduced confidence.
As pointed out by Cappé et al. (2005), it is possible to evaluate derivatives of
the likelihood function with respect to the parameters for virtually any model
that the EM algorithm can be applied to. This is, for example, true when the
standard (Cramer-type) regularity conditions for the ML estimator hold because
the maximizing quantities in the M-step are derived based on the derivatives
of the likelihood function. As a consequence, instead of resorting to a specific
algorithm such as the EM algorithm, the likelihood can be maximized using
gradient-based optimization methods. Lystig and Hughes (2002) described an
algorithm for exact computation of the score vector and the observed informa-
tion matrix in HMMs that can be performed in a single pass through the data.
The algorithm was derived from the forward–backward algorithm.
The reason for exploring gradient-based methods is the flexibility to make the
estimator recursive and adaptive.4 The estimation of the parameters through a
maximization of the conditional log-likelihood function can be done recursively
using the estimator


\[
\hat{\theta}_t = \arg\max_{\theta} \sum_{n=1}^{t} w_n \log \Pr\big(X_n \mid X^{(n-1)}, \theta\big) = \arg\max_{\theta} \tilde{\ell}_t(\theta) \tag{6}
\]

with wn = 1. The recursive estimator can be made adaptive by introducing a


different weighting. A popular choice is to use exponential weights $w_n = \lambda^{t-n}$,
3 If, for example, the conditional distribution is Gaussian then the likelihood can be made

arbitrarily large by setting the mean equal to one of the observations and letting the conditional
variance tend to zero (Frühwirth-Schnatter 2006).
4 See Khreich et al. (2012) for a survey of techniques for incremental learning of HMM parame-

ters.

where 0 < λ < 1 is the forgetting factor (Madsen 2008). The speed of adaption
is then determined by the effective memory length
\[
N_{\text{eff}} = \frac{1}{1-\lambda}. \tag{7}
\]

Maximizing the second-order Taylor expansion of ℓ̃t (θ) around θ̂t−1 with respect
to θ and defining the solution as the estimator θ̂t leads to
\[
\hat{\theta}_t = \hat{\theta}_{t-1} - \Big[\nabla_{\theta\theta}\, \tilde{\ell}_t\big(\hat{\theta}_{t-1}\big)\Big]^{-1} \nabla_{\theta}\, \tilde{\ell}_t\big(\hat{\theta}_{t-1}\big). \tag{8}
\]

This is equivalent to a specific case of the GAS model of Creal et al. (2013).
Using the estimator (8) it is possible to reach quadratic convergence whereas
the GAS model in general converges only linearly (see Cappé et al. 2005).
For HMMs, the score function must consider the previous observations and can-
not reasonably be approximated by the score function of the latest observation,
as it is often done for other models (Khreich et al. 2012). In order to compute
the weighted score function the algorithm of Lystig and Hughes (2002) has to
be run for each iteration and the contribution of each observation has to be
weighted. Experimentation suggests that with an effective memory length of
250 observations, it is necessary to compute the contribution of the last 2,500
observations to get a satisfactory approximation of the weighted score function.
This leads to a significant increase in computational complexity.
Approximating the Hessian by
\[
\begin{aligned}
\nabla_{\theta\theta}\, \tilde{\ell}_t\big(\hat{\theta}_{t-1}\big)
&= \nabla_{\theta\theta} \sum_{n=1}^{t} \lambda^{t-n} \log \Pr\big(X_n \mid X^{(n-1)}, \hat{\theta}_{t-1}\big) \\
&= \sum_{n=1}^{t} \lambda^{t-n}\, \nabla_{\theta\theta} \log \Pr\big(X_n \mid X^{(n-1)}, \hat{\theta}_{t-1}\big) \\
&\approx \sum_{n=1}^{t} \lambda^{t-n} \Big(-I_t\big(\hat{\theta}_{t-1}\big)\Big)
 = -\frac{1-\lambda^t}{1-\lambda}\, I_t\big(\hat{\theta}_{t-1}\big),
\end{aligned} \tag{9}
\]
leads to the recursive, adaptive estimator
\[
\hat{\theta}_t \approx \hat{\theta}_{t-1} + \frac{A}{\min(t, N_{\text{eff}})} \Big[I_t\big(\hat{\theta}_{t-1}\big)\Big]^{-1} \nabla_{\theta}\, \tilde{\ell}_t\big(\hat{\theta}_{t-1}\big), \tag{10}
\]

where the tuning constant A can be adjusted to increase or decrease the speed of
convergence without changing the effective memory length. In order to improve

the clarity, the fraction $\frac{1-\lambda}{1-\lambda^t}$ was replaced by $\frac{1}{\min(t, N_{\text{eff}})}$, where $N_{\text{eff}}$ is the effective
memory length (7). The two fractions share the property that they decrease
toward a constant when t increases. It is necessary to apply a transformation to
all constrained parameters for the estimator (10) to converge. Further, in order
to avoid very large initial steps, it is often advisable to start the estimation at
a t0 > 0.
The Fisher information can be updated recursively using the identity E [∇θθ ℓt ] =
−E [∇θ ℓt ∇θ ℓ′t ]:

\[
\begin{aligned}
I_t\big(\hat{\theta}\big) &= \frac{1}{t} \sum_{n=1}^{t} \nabla_{\theta} \ell_n\big(\hat{\theta}\big)\, \nabla_{\theta} \ell_n\big(\hat{\theta}\big)' \\
&= \frac{t-1}{t} \left\{ \frac{1}{t-1} \sum_{n=1}^{t-1} \nabla_{\theta} \ell_n\big(\hat{\theta}\big)\, \nabla_{\theta} \ell_n\big(\hat{\theta}\big)' \right\}
 + \frac{1}{t}\, \nabla_{\theta} \log \Pr\big(X_t \mid X^{(t-1)}, \hat{\theta}\big)\, \nabla_{\theta} \log \Pr\big(X_t \mid X^{(t-1)}, \hat{\theta}\big)' \\
&= I_{t-1}\big(\hat{\theta}\big) + \frac{1}{t} \Big[ \nabla_{\theta} \log \Pr\big(X_t \mid X^{(t-1)}, \hat{\theta}\big)\, \nabla_{\theta} \log \Pr\big(X_t \mid X^{(t-1)}, \hat{\theta}\big)' - I_{t-1}\big(\hat{\theta}\big) \Big].
\end{aligned} \tag{11}
\]

Further, the matrix inversion lemma is applicable, since the estimator (10) only
makes use of the inverse of the Fisher information. The diagonal elements of
the inverse of the Fisher information provide uncertainties of the parameter
estimates as a by-product of the algorithm.5
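To illustrate the recursions (10) and (11) in a simplified setting, the sketch below tracks the mean and (log) standard deviation of an independent Gaussian observation model, approximating the weighted score by the score of the latest observation, as the text notes is common outside HMMs. It is not the HMM implementation of this paper; the data, forgetting factor, tuning constant, and burn-in are assumptions.

```python
# A simplified, non-HMM sketch of the score-driven recursive update (10)-(11):
# adaptive tracking of a Gaussian mean and log standard deviation, with the weighted
# score approximated by the latest observation's score. All settings are assumptions.
import numpy as np

rng = np.random.default_rng(5)
# Data with a variance change halfway through (assumed scenario)
x = np.concatenate([rng.normal(0, 1, 2000), rng.normal(0, 2, 2000)])

lam, A = 0.996, 1.0                       # forgetting factor and tuning constant (assumed)
n_eff = 1 / (1 - lam)                     # effective memory length, eq. (7)
t0 = 25                                   # burn-in before parameter updates (see text)
theta = np.array([0.0, 0.0])              # theta = (mu, log sigma), transformed scale
I = np.eye(2)                             # running Fisher information estimate

path = np.empty((len(x), 2))
for t, xt in enumerate(x, start=1):
    mu, sigma = theta[0], np.exp(theta[1])
    z = (xt - mu) / sigma
    score = np.array([z / sigma, z**2 - 1.0])          # d log p / d(mu, log sigma)
    I += (np.outer(score, score) - I) / t              # recursion (11)
    if t > t0:
        step = A / min(t, n_eff)
        theta = theta + step * np.linalg.solve(I, score)   # update (10)
    path[t - 1] = theta[0], np.exp(theta[1])
```

In this setting the tracked standard deviation should drift from about 1 toward 2 after the change point, with a lag governed by the effective memory length.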

5 Data
The data analyzed is 21,851 daily log-returns of the S&P 500 index covering the
period from 1928 to 2014.6 The log-returns are calculated using rt = log (Pt ) −
log (Pt−1 ), where Pt is the closing price of the index on day t and log is the
natural logarithm. It is evident from the plot of the log-returns shown in figure 1
that the data includes a larger number of exceptional observations. The Global
Financial Crisis of 2007–2008 does not stand out when compared to the Great

5 If some of the parameters are on or near the boundary of their parameter space, which is often

the case in HMMs, the use of the Hessian to compute standard errors is unreliable. Moreover, the
conditions for asymptotic normality of the ML estimator are often violated, thus making confidence
intervals based on the computed standard errors unreliable. In those cases confidence intervals
based on the profile likelihood function or bootstrapping provide a better approximation.
6 The definition of the index was changed twice during the period. In 1957, the S&P 90 was

expanded to 500 stocks and became the S&P 500 index. The 500 stocks contained exactly 425
industrials, 25 railroads, and 50 utility firms. This requirement was relaxed in 1988.
Figure 1: Daily log-returns of the S&P 500 index and the density of the standardized daily log-returns together with the density function for the standard normal distribution and a t-distribution.

Depression of 1929–1933 and the period around Black Monday in October 1987.
The tendency for the volatility to form clusters as large price movements are
followed by large price movements and vice versa is a stylized fact (see, e.g.,
Cont 2001, Lindström et al. 2015).
The excess kurtosis is evident from the plot of the density function shown in
figure 1. There is too much mass centered right around the mean and in the tails
compared to the normal distribution. The t-distribution with 2.53 degrees of
freedom is a much better fit to the unconditional distribution of the log-returns.
There are 169 observations that deviate more than four standard deviations
from the mean compared to an expectation of 1.4 observations if the returns
were normally distributed. The most extreme is the −22.9% log-return on
Black Monday, which deviates more than 19 standard deviations from the sample
mean.
In figure 2, the sample autocorrelation function of the squared log-returns is
shown together with the ACF of the squared, outlier-corrected log-returns (the
two top panels). The dashed lines are the upper boundary of an approximate
95% confidence interval under the null hypothesis of independence (Madsen
2008). To analyze the impact of outliers, values outside the interval $\bar{r}_t \pm 4\hat{\sigma}$,
where $\bar{r}_t$ and $\hat{\sigma}$ are the estimated sample mean and standard deviation, are set
equal to the nearest boundary following the approach by Granger and Ding
(1995a). According to Granger et al. (2000), the choice of four standard devi-
ations was arbitrary, but experimentation suggested that the results were not
substantially altered if the value was changed somewhat. The long memory
of the squared returns is evident from both plots although the large number
of exceptional observations greatly reduces the magnitude of the ACF of the
unadjusted squared returns (see Chan 1995).
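The outlier correction and the ACF computation described above are simple to reproduce. The sketch below is illustrative (not the code used in the paper): it clips the returns at the chosen number of sample standard deviations before computing the autocorrelation of the squared series.

```python
import numpy as np

def acf_squared(returns, n_lags=500, clip_sd=None):
    """ACF of squared returns, optionally after limiting outliers to
    mean +/- clip_sd sample standard deviations (clip_sd=4 in the text)."""
    r = np.asarray(returns, dtype=float)
    if clip_sd is not None:
        lo, hi = r.mean() - clip_sd * r.std(), r.mean() + clip_sd * r.std()
        r = np.clip(r, lo, hi)
    x = r ** 2
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, n_lags + 1)])
```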
The persistence of the ACF of the squared returns is, to some extent, a conse-
quence of the volatility clustering seen in figure 1, but the significantly positive

Figure 2: The top panel shows the autocorrelation function of the squared log-returns
at lag 1–500. The middle panel shows the autocorrelation function of the squared, outlier-
corrected log-returns. The bottom panel shows the average autocorrelation of the squared
log-returns for subsamples of 1,700 observations.

level at lags well over 100 is more likely caused by the data-generating process
being nonstationary. This is supported by the third panel of figure 2, which
shows the average autocorrelation of the squared returns for 12 subsamples of
1,700 observations. The decay of the autocorrelations in the subseries is, on
average, substantially faster than in the full series and roughly exponential,
as concluded by Malmsten and Teräsvirta (2010). Stărică and Granger (2005)
reached the same conclusion based on scaling the absolute log-returns with the
time-varying unconditional variance.

6 Results
Rather than dividing the full sample into subsamples of 1,700 observations, as
done by Granger and Ding (1995a), Rydén et al. (1998), and Bulla (2011),
the parameters of a two-state HMM with conditional normal distributions are
estimated using a rolling window of 1,700 trading days. By using a rolling
window it is possible to get an idea of the evolution of the model parameters
over time. The result is shown in figure 3, where the dashed lines are the
maximum-likelihood estimate (MLE) for the full sample and the gray areas
are bootstrapped 95% confidence intervals based on re-estimating the model to
1,000 simulated series of 1,700 observations.
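A rolling-window re-estimation of this kind can be sketched as follows. This is illustrative only: it uses the third-party hmmlearn package rather than the estimation code behind the paper, re-estimates on a monthly grid to keep the computation light, and orders the states by variance so that state 1 is always the low-variance state.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party package, assumed available

def rolling_hmm_parameters(returns, window=1700, step=21):
    """Fit a two-state Gaussian HMM on a rolling window of daily log-returns
    and record the parameter paths (means, variances, self-transition probs)."""
    paths = []
    for end in range(window, len(returns) + 1, step):
        X = np.asarray(returns[end - window:end], dtype=float).reshape(-1, 1)
        model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=200)
        model.fit(X)
        order = np.argsort(model.covars_.ravel())   # state 1 = low variance
        paths.append({
            "mu": model.means_.ravel()[order],
            "sigma2": model.covars_.ravel()[order],
            "gamma_ii": np.diag(model.transmat_)[order],
        })
    return paths
```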
It is evident that the parameters are far from constant as noted by Rydén et al.
(1998) and Bulla (2011). The variation of the variances and the transition
probabilities far exceeds the likely range of variation as indicated by the 95%
confidence intervals. This implies that the sojourn-time distribution is not mem-
oryless. The first ten years after World War II, in particular, stand out as the
probability of staying in the same state was extraordinarily low. The impact of
the extreme returns in 1987 on the parameters of the high-variance state also
stands out. One of the drawbacks of using fixed-length forgetting appears from
the estimated variance in the second state; the variance spikes on Black Mon-
day in 1987 and does not return to its pre-1987 level until 1,700 days later, thus
making the length of the rolling window evident from the figure.
The substantial variation of the parameters could indicate that the regime-
switching model is unsuitable for the S&P 500 series, but its ability to reproduce
the stylized facts suggests otherwise (see Rydén et al. 1998, Bulla 2011, Nystrup
et al. 2015b). It could also be an indication that the model is misspecified, i.e.,
that it has too few states or a wrong kind of conditional distributions. Rydén
et al. (1998) found that in some periods there was a need for a third so-called
outlier state with a low unconditional probability. Adding a third state, how-
ever, does not lead to smaller variations, which suggests that the need for a
third state was a consequence of the lack of adaption of the parameters. The
addition of a conditional t-distribution in the high-variance state, on the other
hand, dampens the variations as shown in figure 4. The parameters are still
not constant, but the variations are smaller especially around 1987. It is only
Figure 3: Parameters of a two-state Gaussian HMM estimated using a rolling window of 1,700 trading days. The dashed lines are the MLE for the full series and the gray areas are bootstrapped 95% confidence intervals.
Figure 4: Parameters of a two-state HMM with a conditional t-distribution in the high-variance state estimated using a rolling window of 1,700 trading days. The dashed lines are the MLE for the full series and the gray areas are bootstrapped 95% confidence intervals.
6 Results 75

the degrees of freedom of the t-distribution that change dramatically from a


maximum of 29 in 1978 to a minimum of 2.5 in 1990.
The choice of window length affects the parameter estimates and can be viewed
as a tradeoff between bias and variance. A shorter window yields a faster adap-
tion to changes but a more noisy estimate as fewer observations are used for
the estimation. The large fluctuations in the parameter values could be a result
of the window length being too short, but the parameter values do not stabi-
lize even if the window length is increased to 2,500 days. Instead, there is a
strong incentive to reduce the window length to secure a faster adaption to the
nonstationary behavior of the data-generating process. If the window length is
reduced to 1,000 days, then the degrees of freedom of the t-distribution exceed
100 throughout a large part of the period, meaning that the distribution in the
high-variance state is effectively normal. This suggests that the t-distribution,
to some extent, is a compensation for too slow an adaption of the parameters.
It cannot be ruled out that adding a couple hundred more states or switching
frequencies, as proposed by Calvet and Fisher (2004), would stabilize the pa-
rameters, but it would also make the model more likely to be overfitting out
of sample and impossible to use for state inference. Given the nonstationary
behavior of the data-generating process, it is reasonable to assume that a model
with nonconstant parameters is needed.
Figure 5 shows the parameters of the two-state HMM with conditional normal
distributions estimated using the adaptive estimation approach outlined in sec-
tion 4 with an effective memory length Neff = 250.7 The dashed lines show
the MLE for the full series and the gray areas are approximate 95% confidence
intervals based on the profile likelihood functions. The width of the confidence
intervals is seen to spike whenever there are large movements in the parameter
estimates.
Compared to figure 3, the variations are larger as a result of the shorter mem-
ory length. Using exponential forgetting, the effective memory length can be
reduced compared to when using fixed-length forgetting without increasing the
sensitivity to noise to an undesirable level. Exponential forgetting is also more
meaningful as it assigns most weight to the newest observations and, at the same
time, observations are not excluded from the estimation from one day to the
next. This leads to smoother parameter estimates. The optimal memory length
depends on the intended application and is not necessarily constant over time;
however, 250 days is close to being the minimum length that offers a reasonable
tradeoff between speed of adaption and sensitivity to noise.

7 The estimation was started at t0 = 250 in order to avoid very large initial steps with the tuning constant A = 1.25. Numerical optimization of the weighted likelihood function was used to find initial values for the parameters and the Hessian.
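Estimator (10) is given in section 4 of the paper and is not reproduced here. Purely to illustrate how the ingredients fit together (the tuning constant A, the effective memory length Neff, the inverse information, and the one-step score), a generic constant-gain recursive score update has the following form; the exact update used in the paper may differ.

```python
import numpy as np

def recursive_score_update(theta, I_inv, score, A=1.25, n_eff=250):
    """Generic constant-gain recursive maximum-likelihood step: move the
    parameters in the direction of the information-scaled score of the newest
    observation, with the step size governed by A and n_eff."""
    return np.asarray(theta) + (A / n_eff) * (np.asarray(I_inv) @ np.asarray(score))
```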
Figure 5: Parameters of a two-state Gaussian HMM estimated adaptively using an effective memory length Neff = 250. The dashed lines are the MLE for the full series and the gray areas are approximate 95% confidence intervals based on the profile likelihood functions.
Figure 6: The top panel shows the autocorrelation function of the squared log-returns at lag 1–500 together with the average autocorrelation of squared log-returns simulated using the different estimated models (HMM_N^{RW=1700}, HMM_{Nt}^{RW=1700}, and HMM_N^{Neff=250}). The test sample does not include the first 1,700 observations. In the bottom panel the impact of outliers is limited to four standard deviations.

6.1 Reproducing the long memory


The top panel of figure 6 shows the ACF of the squared log-returns together with
the average autocorrelation of squared log-returns based on 100 datasets simu-
lated using the estimated parameters shown in figures 3–5. The test sample does
not include the first 1,700 observations in order to facilitate a comparison with
the models estimated using a rolling window. The later starting point causes
the ACF to be significantly lower than in figure 2, supporting the hypothesis
that the long memory is caused by nonstationarity.
The autocorrelation of the simulated data is, on average, higher than the sample autocorrelation for all three models. The adaptively-estimated Gaussian HMM (HMM_N^{Neff=250}) appears to have the right shape, but it is the HMM with a conditional t-distribution (HMM_{Nt}^{RW=1700}) that provides the best fit to the ACF.
The fact that the ACFs of the squared log-returns of the Gaussian HMMs exceed
that of the data is an indication that the tails of the Gaussian models are too
short compared to the empirical distribution.
When constraining the impact of outliers to r̄t ± 4σ̂, as done in the second panel

Model                  Test sample                            Outlier-corrected
                       MSE(1:250) ×10³   MSE(251:500) ×10³    MSE(1:250) ×10³   MSE(251:500) ×10³
HMM_N^{RW=1700}        3.85              2.99                 0.36              1.84
HMM_{Nt}^{RW=1700}     0.39              0.47                 0.32              1.48
HMM_N^{Neff=250}       2.28              0.62                 0.17              0.27

Table 7: Mean squared error for the ACF of the squared log-returns and the outlier-
corrected, squared log-returns at lag 1–250 and 251–500 for the estimated models.

of figure 6, the level of autocorrelation in the simulated data is similar to the


empirical data. The adaptively-estimated Gaussian HMM provides a good fit
to the sample ACF of the squared, outlier-corrected log-returns, while the ACFs
of the two models that were estimated using a rolling window are too persistent.
The difference in the fit to the ACF of the squared, outlier-corrected log-returns
is largest at the highest lags. This observation is supported by the computed
mean squared errors for lag 1–250 and 251–500 summarized in table 7. The
result when using a threshold of eight instead of four standard deviations is
very similar and therefore not shown. With this (equally arbitrary) threshold
the model that includes a t-distribution becomes a worse fit at the lowest lags
while the adaptively-estimated Gaussian HMM still provides the best fit.

6.2 Comparing one-step density forecasts


Table 8 compares the predictive log-likelihood of the different two-state models
for the full test sample and when leaving out the 20 most negative contributions
to the log-likelihood. This is to separate the impact of the most exceptional
observations, similar to the approach used for the ACF. The 20 observations
are not the ones furthest away from the sample mean and they are also not the
same for all the models, but there is a large overlap. As HMMs are often used
for day-to-day state identification in an on-line setting, the focus is on one-step
forecasts.
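For reference, a single contribution to the one-step predictive log-likelihood of a two-state Gaussian HMM follows directly from the forward-filtered state probabilities. The sketch below is illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def one_step_predictive_loglik(r_next, filtered_probs, Gamma, mu, sigma):
    """log Pr(r_{t+1} | r_1, ..., r_t) for a two-state Gaussian HMM.
    filtered_probs: Pr(X_t = i | r_1, ..., r_t); Gamma: transition matrix;
    mu, sigma: state-wise means and standard deviations."""
    pred_probs = np.asarray(filtered_probs) @ np.asarray(Gamma)
    dens = norm.pdf(r_next, loc=np.asarray(mu), scale=np.asarray(sigma))
    return float(np.log(pred_probs @ dens))
```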
For the full test sample, which does not include the first 1,700 observations, the
HMM with a t-distribution in the high-variance state has the highest predic-
tive log-likelihood. When leaving out the 20 observations that make the most
negative contributions to the log-likelihood, the adaptively-estimated Gaussian
HMM outperforms all the other models. This is less than 0.1% of the total num-
ber of observations. In fact, the adaptively-estimated Gaussian HMM outper-
forms the other models when removing only the observation from Black Monday.
Thus, while the t-distribution is a better fit to the most exceptional observations,
the adaptively-estimated Gaussian HMM provides the best one-step-ahead den-
sity forecasts for the remainder of the sample.
The predictive log-likelihood of both the Gaussian and the combined Gaussian-t
model increases when using a rolling rather than expanding window for the esti-

Model                  Predictive log-likelihood
                       Full test sample   Leaving out 20 observations
HMM_N^{expanding}      66,962             67,093
HMM_{Nt}^{expanding}   67,319             67,386
HMM_N^{RW=1700}        67,391             67,739
HMM_{Nt}^{RW=1700}     67,757             67,860
HMM_N^{Neff=250}       67,678             68,034

Table 8: Predictive log-likelihood of the different two-state models for the full test
sample and when leaving out the 20 observations with the most negative contributions.

mation. This is not surprising given the nonstationarity of the data-generating


process and it clearly shows the need to consider nonstationary models. The
difference when leaving out the 20 most negative contributions to the predictive
log-likelihood is small when using an expanding window because the tails be-
come very heavy in order to compensate for the lack of adaption of the model
parameters. This leads to a poor average predictive performance.
Based on the results shown in table 8, it is natural to wonder whether the
performance of the combined Gaussian-t model could be improved by reducing
the effective memory length using the adaptive estimation method. This has
been attempted, but the advantage of a conditional t-distribution disappears
when the memory length is reduced and the degrees of freedom increase. The
added uncertainty of the degrees of freedom parameter, which is very sensitive
to outliers, leads to more noisy estimates of the mean and scale parameters and
a lower predictive likelihood when the memory length is reduced.
To summarize, a shorter memory length of the parameters leads to a good fit
to the sample ACF of the squared log-returns when constraining the impact of
outliers. A shorter memory also improves the one-step density forecasts with
the adaptively-estimated Gaussian HMM being the overall best when leaving
out the 20 observations that were most difficult to forecast. Even when the
memory length of the parameters is reduced considerably, however, conditional
normal distributions provide a poor fit to those very exceptional observations.

6.3 Improving density forecasts with local smoothing


Given that a faster adaption of the parameters leads to improved density fore-
casts, it should be possible to further improve the density forecasts by improving
the parameter forecasts. Recall that the adaptively-estimated parameters are
found as solutions to (6) for distinct values of t. The observations up to and
including time t are used when estimating θt , which is then used for making infer-
ence about Xt+1 . In other words, the parameters are assumed to stay constant
from time t to time t + 1.

Model                  Predictive log-likelihood
                       Full test sample   Leaving out 20 observations
HMM_N^{RW=1700}        67,408             67,742
HMM_{Nt}^{RW=1700}     67,771             67,872
HMM_N^{Neff=250}       67,723             68,058

Table 9: Predictive log-likelihood of the estimated models when using cubic smoothing
splines to forecast the parameters.

If the effective memory length is sufficiently short, the approximation of θt as a


constant vector near t is good. However, this would imply that a relatively low
number of observations is used to estimate θt, resulting in a noisy estimate. Conversely, a large bias may occur if the effective memory is long. Joensen
et al. (2000) showed that if the parameter variations are smooth, then locally
to t the elements of θt are better approximated by local polynomials than local
constants.
The parameter variations displayed in figure 5 appear to be smooth, especially
the variations in the conditional variances. The idea of Joensen et al. (2000) is
implemented with more flexible cubic smoothing splines rather than polynomials.
Table 9 summarizes the predictive log-likelihood of the two-state models when
using cubic smoothing splines to forecast the parameters.
By fitting a cubic smoothing spline to the last nine parameter estimates and
then using the fitted spline to forecast the parameters at the next time step, it
is possible to increase the predictive log-likelihood of the adaptively-estimated
Gaussian HMM from 67,678 to 67,723 for the full test sample. When leaving
out the 20 most negative contributions, the predictive log-likelihood is improved
from 68,034 to 68,058. Thus, the largest improvement is obtained for the 20
observations that are most difficult to forecast. For the two models that were
estimated using a rolling window the improvement in forecasting performance
is smaller. This is not surprising given that the parameter variations appeared
less smooth when using a rolling window for the estimation.
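A minimal sketch of this local-smoothing step is given below, assuming SciPy's UnivariateSpline as the smoothing-spline implementation (not necessarily what was used for the results in table 9); the smoothing factor has to be tuned to the scale of the parameter series.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_forecast(param_history, n_last=9, smoothing=None):
    """Fit a cubic smoothing spline to the n_last most recent adaptive parameter
    estimates and evaluate it one step beyond the sample as the forecast."""
    y = np.asarray(param_history[-n_last:], dtype=float)
    x = np.arange(len(y), dtype=float)
    spline = UnivariateSpline(x, y, k=3, s=smoothing)  # s controls the smoothing
    return float(spline(len(y)))                       # extrapolate one step ahead
```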
Modeling the driving forces behind the variations is beyond the scope of this
paper, but the improvement in forecasting performance that can be obtained
by using simple smoothing splines to forecast the parameters suggests there is
great potential for a hierarchical model.

7 Conclusion
By applying an adaptive estimation method that allowed for observation-driven
time variation of the parameters it was possible to reproduce the long memory
that is characteristic for long series of squared daily returns with a two-state
Gaussian hidden Markov model. The transition probabilities were found to be

time varying, implying that the sojourn-time distribution was not the memory-
less geometric distribution.

The adaptive estimation approach meant that the effective memory length could
be reduced compared to when using fixed-length forgetting, thereby allowing a
faster adaption to changes and a better reproduction of the current parameter
values. This led to an improved fit to the autocorrelation function of the squared
log-returns and better one-step density forecasts. A third state or a conditional t-distribution in the high-variance state may be necessary to capture the full extent of excess kurtosis in a few periods. However, the long memory needed to justify a third state or a long-tailed conditional t-distribution is not consistent with the fast adaption of the parameters, which led to the best fit to the long memory of the squared log-returns and, with the exception of the most extreme observations, the best one-step density forecasts.

The presented results emphasize the importance of taking into account the non-
stationary behavior of the data-generating process. The longer the data period,
the more important this is. The outlined adaptive estimation method can be
applied to other models than regime-switching models and for other purposes
than density forecasting. Within financial risk management, for example, pos-
sible applications include value-at-risk estimation and dynamic asset allocation
based on out-of-sample state identification.

A better description of the time-varying behavior of the parameters is an open


route for future work that can be pursued in various ways. One way would be to
allow different forgetting factors for each parameter or consider more advanced
state-, time-, or data-dependent forgetting. Another way would be to formulate
a model for the parameter changes in the form of a hierarchical model possibly
including relevant exogenous variables. The proposed method for estimating the
time variation of the parameters is an important step toward the identification
of a hierarchical model structure.

References
Baek, C., N. Fortuna, and V. Pipiras. “Can Markov switching model generate
long memory?” Economics Letters, vol. 124, no. 1 (2014), pp. 117–121.

Baillie, R. T., T. Bollerslev, and H. O. Mikkelsen. “Fractionally integrated gener-


alized autoregressive conditional heteroskedasticity.” Journal of Econometrics,
vol. 74, no. 1 (1996), pp. 3–30.

Baum, L. E., T. Petrie, G. Soules, and N. Weiss. “A maximization technique oc-


curring in the statistical analysis of probabilistic functions of Markov chains.”
Annals of Mathematical Statistics, vol. 41, no. 1 (1970), pp. 164–171.

Bauwens, L., A. Dufays, and J. V. Rombouts. “Marginal likelihood for Markov-


switching and change-point GARCH models.” Journal of Econometrics, vol.
178, no. 3 (2014), pp. 508–522.

Bazzi, M., F. Blasques, S. J. Koopman, and A. Lucas. “Time varying transition


probabilities for Markov regime switching models.” Journal of Time Series
Analysis, vol. 38, no. 3 (2017), pp. 458–478.

Bollerslev, T. “Generalized autoregressive conditional heteroskedasticity.” Jour-


nal of Econometrics, vol. 31, no. 3 (1986), pp. 307–327.

Bulla, J. “Hidden Markov models with t components. Increased persistence and


other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.

Bulla, J. and I. Bulla. “Stylized facts of financial time series and hidden semi-
Markov models.” Computational Statistics & Data Analysis, vol. 51, no. 4
(2006), pp. 2192–2209.

Bulla, J., S. Mergner, I. Bulla, A. Sesboüé, and C. Chesneau. “Markov-switching


asset allocation: Do profitable strategies exist?” Journal of Asset Manage-
ment, vol. 12, no. 5 (2011), pp. 310–321.

Calvet, L. E. and A. J. Fisher. “How to forecast long-run volatility: Regime


switching and the estimation of multifractal processes.” Journal of Financial
Econometrics, vol. 2, no. 1 (2004), pp. 49–83.

Cappé, O., E. Moulines, and T. Rydén. Inference in Hidden Markov Models.


Springer: New York (2005).

Chan, W. S. “Understanding the effect of time series outliers on sample autocor-


relations.” Test, vol. 4, no. 1 (1995), pp. 179–186.

Cont, R. “Empirical properties of asset returns: stylized facts and statistical


issues.” Quantitative Finance, vol. 1, no. 2 (2001), pp. 223–236.

Creal, D., S. J. Koopman, and A. Lucas. “Generalized autoregressive score mod-


els with applications.” Journal of Applied Econometrics, vol. 28, no. 5 (2013),
pp. 777–795.

Dacco, R. and S. Satchell. “Why do regime-switching models forecast so badly?”


Journal of Forecasting, vol. 18, no. 1 (1999), pp. 1–16.

Dempster, A. P., N. M. Laird, and D. B. Rubin. “Maximum likelihood from


incomplete data via the EM algorithm.” Journal of the Royal Statistical
Society. Series B (Methodological), vol. 39, no. 1 (1977), pp. 1–38.

Diebold, F. X. and A. Inoue. “Long memory and regime switching.” Journal of


Econometrics, vol. 105, no. 1 (2001), pp. 131–159.

Engle, R. F. “Autoregressive conditional heteroscedasticity with estimates of the


variance of United Kingdom inflation.” Econometrica, vol. 50, no. 4 (1982),
pp. 987–1007.

Frühwirth-Schnatter, S. Finite Mixture and Markov Switching Models. Springer:


New York (2006).

Gourieroux, C. and J. Jasiak. “Memory and infrequent breaks.” Economics


Letters, vol. 70, no. 1 (2001), pp. 29–41.

Granger, C. W. J. and Z. Ding. “Some properties of absolute return: An alter-


native measure of risk.” Annales D’Economie Et Statistique, vol. 40 (1995a),
pp. 67–92.

Granger, C. W. J. and Z. Ding. “Stylized facts on the temporal and distribu-


tional properties of daily data from speculative markets.” Unpublished paper,
Department of Economics, University of California, San Diego (1995b).

Granger, C. W. J., S. Spear, and Z. Ding. “Stylized facts on the temporal and
distributional properties of absolute returns: An update.” In Proceedings of
the Hong Kong International Workshop on Statistics in Finance. Imperial
College Press: London (2000), pp. 97–120.

Granger, C. W. J. and T. Teräsvirta. “A simple nonlinear time series model


with misleading linear properties.” Economics Letters, vol. 62, no. 2 (1999),
pp. 161–165.

Joensen, A., H. Madsen, H. A. Nielsen, and T. S. Nielsen. “Tracking time-varying


parameters with local regression.” Automatica, vol. 36, no. 8 (2000), pp. 1199–
1204.

Khreich, W., E. Granger, A. Miri, and R. Sabourin. “A survey of techniques


for incremental learning of HMM parameters.” Information Sciences, vol. 197
(2012), pp. 105–130.

Langrock, R. and W. Zucchini. “Hidden Markov models with arbitrary state


dwell-time distributions.” Computational Statistics & Data Analysis, vol. 55,
no. 1 (2011), pp. 715–724.

Lindström, E., H. Madsen, and J. N. Nielsen. Statistics for Finance. Chapman


& Hall: London (2015).

Lystig, T. C. and J. P. Hughes. “Exact computation of the observed information


matrix for hidden Markov models.” Journal of Computational and Graphical
Statistics, vol. 11, no. 3 (2002), pp. 678–689.

Madsen, H. Time Series Analysis. Chapman & Hall: London (2008).



Malmsten, H. and T. Teräsvirta. “Stylized facts of financial time series and


three popular models of volatility.” European Journal of Pure and Applied
Mathematics, vol. 3, no. 3 (2010), pp. 443–477.
Mikosch, T. and C. Stărică. “Nonstationarities in financial time series, the
long-range dependence, and the IGARCH effects.” Review of Economics and
Statistics, vol. 86, no. 1 (2004), pp. 378–390.
Nielsen, B. F. Matrix Analytic Methods in Applied Probability with a View
towards Engineering Applications. Doctoral thesis, Technical University of
Denmark (2013).
Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Regime-based ver-
sus static asset allocation: Letting the data speak.” Journal of Portfolio
Management, vol. 42, no. 1 (2015a), pp. 103–109.
Nystrup, P., H. Madsen, and E. Lindström. “Stylised facts of financial time
series and hidden Markov models in continuous time.” Quantitative Finance,
vol. 15, no. 9 (2015b), pp. 1531–1541.
Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.
Stărică, C. and C. W. J. Granger. “Nonstationarities in stock returns.” Review
of Economics and Statistics, vol. 87, no. 3 (2005), pp. 503–522.
Zucchini, W. and I. L. MacDonald. Hidden Markov Models for Time Series: An
Introduction Using R. Chapman & Hall: London, 2nd ed. (2009).
PAPER C
Originally published in The Journal of Portfolio Management

Regime-based versus static asset allocation:


Letting the data speak

Peter Nystrup, Bo William Hansen, Henrik Madsen,


and Erik Lindström

Abstract

Regime shifts present a big challenge to traditional strategic asset allocation.


This paper investigates whether regime-based asset allocation can effectively
respond to changes in financial regimes at the portfolio level in an effort to
provide better long-term results when compared to more static approaches.
The regime-based approach is centered around a regime-switching model
with time-varying parameters that can match the tendency of financial mar-
kets to change their behavior abruptly and the phenomenon that the new
behavior often persists for several periods after a change. In an asset uni-
verse consisting of a global stock index and a global government bond index,
it is shown that, even without any level of forecasting skill, it may not be
optimal to hold a static portfolio.

Keywords: Dynamic asset allocation; Regime shifts; Hidden Markov models;


Time-varying parameters; Daily returns; Volatility clustering.

1 Introduction
The behavior of financial markets changes abruptly. Although some changes
may be transitory, the new behavior often persists for several periods after
a change. The mean, volatility, and correlation patterns in stock returns, for
example, changed dramatically at the start of, and continued through the global
financial crisis of 2007–2008. Similar regime changes, some of which can be
recurring (recessions versus expansions) and some of which can be permanent
(structural breaks), are prevalent across a wide range of financial markets and
in the behavior of many macro variables (Ang and Timmermann 2012).
Observed regimes in financial markets are related to the phases of the business
cycle (see, e.g., Campbell 1999, Cochrane 2005). The link is complex and diffi-
cult to exploit for investment purposes due to the large lag in the availability of
business cycle related data. Our intention is to let the data speak by focusing

on readily available market data instead of attempting to establish the link to


the business cycle.

Regime changes present a big challenge to traditional strategic asset alloca-


tion (SAA). In the presence of time-varying investment opportunities, portfolio
weights should be adjusted as new information arrives. Traditional SAA ap-
proaches seek to develop static “all-weather” portfolios that optimize efficiency
across a range of economic scenarios. However, if economic conditions are per-
sistent and strongly linked to asset class performance, then a dynamic strategy
should add value over static weights (Sheikh and Sun 2012). The purpose of
a regime-based strategy is to take advantage of favorable economic regimes as
well as withstand adverse economic regimes and reduce potential drawdowns.

Regime-based investing is distinct from tactical asset allocation (TAA). While


the latter is shorter term, higher frequency (i.e., weekly or monthly), and driven
primarily by valuation considerations, regime-based investing targets a longer
time horizon (i.e., a year or more) and is driven by changing economic fun-
damentals. A regime-based approach has the flexibility to adapt to changing
economic conditions within a benchmark-based investment policy which can in-
volve more than one rebalancing within a year. It straddles a middle ground
between strategic and tactical (Sheikh and Sun 2012).

2 Letting the data speak

This paper examines whether regime-based asset allocation (RBAA) can effec-
tively respond to financial regimes in an effort to provide better long-term re-
sults when compared to static approaches. Dopfel (2010) showed the potential
outperformance of a RBAA strategy assuming complete information about the
prevailing regime and future regime shifts is available. Dopfel (2010), however,
concluded that an investor who does not possess exceptional forecasting skill
is better off holding a static portfolio that is hedged against the uncertainty
associated with regime shifts.

This conclusion contrasts the large amount of studies that have documented
the profitability of dynamic asset allocation (DAA) strategies based on regime-
switching models (see, e.g., Ang and Bekaert 2002, 2004, Guidolin and Timmer-
mann 2007, Bulla et al. 2011, Kritzman et al. 2012). The profitability should be
accepted with caution as not all the studies account for transaction costs when
comparing the performance of dynamic and static strategies. This is impor-
tant as frequent rebalancing can offset the potential excess return of a dynamic
strategy. Furthermore, the in-sample performance generally exceeds the out-of-
sample performance—if the strategies are at all tested out-of-sample.

Inspired by the apparent profitability of regime-switching strategies this paper challenges the conclusion made by Dopfel (2010) by letting the data speak. In an investment universe consisting of a global stock index (MSCI ACWI)1 and a global government bond index (JPM GBI)2, the performance of a RBAA strategy is compared to a strategy based on rebalancing to static weights. The development of the two indices over the 20-year data period is depicted in figure 1.

Figure 1: Investment universe.
The intention is to identify regimes in the stock returns using a regime-switching
model and let the asset allocation depend on the identified regime. The focus
on modeling the stock returns is natural as portfolio risk is typically dominated
by stock market risk. In addition, the stock markets generally lead the economy
(Siegel 1991). The goal is not to predict regime shifts or future market move-
ments but to identify when a regime shift has occurred and then benefit from the
persistence of equilibrium returns and volatilities. The regime-switching process
can be interpreted as a momentum process when it is more likely to continue in
the same state rather than transition to another state (Ang and Bekaert 2002).

1 The MSCI All Country World Index, denominated in USD, captures large and mid cap repre-

sentation across 23 Developed Market and 21 Emerging Market countries. The difference compared
to the more well-known MSCI World Index is the weight on EM countries. The data prior to 1999,
where the total return index began, has been reconstructed based on the price index by adding the
average daily net dividend return over the period from 1999 to 2013 of 0.007% to the price returns.
2 The global J.P. Morgan Government Bond Index measures the performance across 13 developed

fixed income bond markets hedged to USD. The constituents are selected from all government bonds,
excluding floating rate notes, perpetuals, bonds targeted at the domestic market for tax purposes,
and bonds with less than one year to maturity. The index had a modified duration of 6.8 at the
end of 2013.

Figure 2 shows the daily log-returns of the stock index.3 The volatility forms clusters as large price movements tend to be followed by large price movements and vice versa, as noted by Mandelbrot (1963).4 The RBAA strategy aims to exploit this persistence of the volatility as risk-adjusted returns, on average, are substantially lower during turbulent periods, irrespective of the source of turbulence, as shown by Kritzman and Li (2010). The purpose is not to outline the optimal strategy but rather to discuss the profitability of a RBAA approach.

Figure 2: Volatility clustering in the daily log-returns.

3 The hidden Markov model


Imagine knowing a person’s heart rate. While the person is sleeping, a low
average heart rate with low volatility is observed. When the person wakes up,
there is a sudden rise in the average level of the heart rate and its volatility.
Without actually seeing the person, it can reasonably be concluded whether he
or she is awake or sleeping, that is which state he or she is in.
The heart rate of a financial market is its returns. The use of hidden Markov
models (HMMs) to infer the state of financial markets has gained popularity
over the last decade. The HMM is a black-box model, but the inferred states
can often be linked to phases of the business cycle (see, e.g., Guidolin and
Timmermann 2007). The possibility of interpreting the states combined with
the model’s ability to reproduce stylized facts of financial returns is part of the
reason why the HMM has become increasingly popular.
In a hidden Markov model, the probability distribution that generates an ob-
servation depends on the state of an unobserved Markov chain. A sequence of
discrete random variables {Xt : t ∈ N} is said to be a first-order Markov chain
if, for all t ∈ N, it satisfies the Markov property:
Pr ( Xt+1 | Xt , . . . , X1 ) = Pr ( Xt+1 | Xt ) . (1)
The conditional probabilities Pr ( Xt+1 = j| Xt = i) = γij are called transition
probabilities.
3 The log-returns are calculated using r = log (P ) − log (P
t t t−1 ), where Pt is the closing price
of the index on day t and log is the natural logarithm.
4 A quantitative manifestation of this fact is that while returns themselves are uncorrelated,

absolute and squared returns display a positive, significant and slowly decaying autocorrelation
function.

As an example, consider the two-state model with Gaussian conditional distributions:

$$Y_t \mid X_t \sim N\!\left(\mu_{X_t}, \sigma^2_{X_t}\right),$$

where

$$\mu_{X_t} = \begin{cases}\mu_1, & \text{if } X_t = 1,\\ \mu_2, & \text{if } X_t = 2,\end{cases} \qquad \sigma^2_{X_t} = \begin{cases}\sigma_1^2, & \text{if } X_t = 1,\\ \sigma_2^2, & \text{if } X_t = 2,\end{cases} \qquad \text{and} \qquad \Gamma = \begin{bmatrix}1-\gamma_{12} & \gamma_{12}\\ \gamma_{21} & 1-\gamma_{21}\end{bmatrix}.$$

When the current state Xt is known, the distribution of Yt depends only on Xt. The sojourn times are implicitly assumed to be geometrically distributed:

$$\Pr(\text{'staying } t \text{ time steps in state } i\text{'}) = \gamma_{ii}^{\,t-1}(1-\gamma_{ii}). \tag{2}$$

The geometric distribution is memoryless, implying that the time until the next
transition out of the current state is independent of the time spent in the state.
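The model is straightforward to simulate from, which also illustrates the geometric sojourn times in equation (2): the expected time spent in state i is 1/(1 − γii). The parameter values below are purely hypothetical and are not estimates from this article.

```python
import numpy as np

def simulate_hmm(T, mu, sigma, Gamma, seed=None):
    """Simulate T returns from a two-state Gaussian hidden Markov model."""
    rng = np.random.default_rng(seed)
    states = np.empty(T, dtype=int)
    states[0] = 0
    for t in range(1, T):
        states[t] = rng.choice(2, p=Gamma[states[t - 1]])
    return rng.normal(mu[states], sigma[states]), states

# Hypothetical parameters: a persistent low-volatility state 1 and a less
# persistent high-volatility state 2 (expected sojourns of 100 and 33 days).
mu = np.array([0.0006, -0.0008])
sigma = np.array([0.007, 0.018])
Gamma = np.array([[0.99, 0.01], [0.03, 0.97]])
returns, states = simulate_hmm(5000, mu, sigma, Gamma)
```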
HMMs can match financial markets’ tendency to change their behavior abruptly
and the phenomenon that the new behavior often persists for several periods
after a change. They are well suited to capture the stylized behavior of many
financial series including volatility clustering and leptokurtosis, as shown by
Rydén et al. (1998). Subsequent papers have extended the classical Gaussian
HMM by considering other sojourn-time distributions than the memoryless geo-
metric distribution (Bulla and Bulla 2006), other conditional distributions than
the Gaussian distribution (Bulla 2011), and a continuous-time formulation as
an alternative to the dominating discrete-time models (Nystrup et al. 2015b).
In Nystrup et al. (2017b) it was found that the need to consider other sojourn-time distributions and other conditional distributions can be eliminated by adapting to the time-varying behavior of the data process.
The parameters of an HMM are usually estimated using the maximum-likelihood
method. Every observation is assumed to be of equal importance no matter how
long the sample period is. This approach works well when the sample period is
short and the underlying process does not change over time. The time-varying
behavior of the parameters documented in previous studies (Rydén et al. 1998,
Bulla 2011, Nystrup et al. 2017b) calls for an adaptive approach that assigns
more weight to the most recent observations while keeping in mind the past
patterns at a reduced confidence.
In Nystrup et al. (2017b), an adaptive estimation approach based on weighting
the observations with exponentially-decreasing weights, in other words using
exponential forgetting, was outlined. The same approach will be pursued in
this article. The regime-switching model is still a two-state HMM with Gaussian
conditional distributions, but one that adapts to the time-varying behavior of
the underlying process in an effort to produce more robust state estimates.
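To illustrate the idea of exponential forgetting (a sketch under simplifying assumptions, not the estimation code behind this article), the objective can be written as a sum of one-step conditional log-likelihoods from the forward filter, each multiplied by an exponentially decaying weight with effective memory length Neff:

```python
import numpy as np
from scipy.stats import norm

def weighted_loglik(returns, mu, sigma, Gamma, delta, n_eff=250):
    """Exponentially weighted log-likelihood of a two-state Gaussian HMM.
    The conditional log-likelihood of observation n gets weight lam**(T-1-n),
    lam = 1 - 1/n_eff, so the newest observations count the most."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    Gamma = np.asarray(Gamma, dtype=float)
    lam = 1.0 - 1.0 / n_eff
    T = len(returns)
    weights = lam ** np.arange(T - 1, -1, -1)
    alpha = np.asarray(delta, dtype=float)        # initial state distribution
    wll = 0.0
    for n, r in enumerate(returns):
        pred = alpha @ Gamma if n > 0 else alpha  # predicted state probabilities
        dens = norm.pdf(r, loc=mu, scale=sigma)   # state-conditional densities
        cond = pred @ dens                        # Pr(r_n | r_1, ..., r_{n-1})
        wll += weights[n] * np.log(cond)
        alpha = pred * dens / cond                # filtered state probabilities
    return wll
```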

Index/Strategy AR SD SR MDD
Bond index (JPM GBI) 0.060 0.03 1.90 0.05
Stock index (MSCI ACWI) 0.069 0.18 0.38 0.58
Static Portfolio 0.064 0.09 0.72 0.32
Stocks–Bonds 0.114 0.09 1.23 0.13
Long–Short 0.096 0.18 0.52 0.44

Table 3: Performance of the indices and the strategies from 1996–2014.

4 Empirical results
The testing is done one day at a time in a live-sample setting to make it as
realistic as possible. The model is fitted to the first t observations assigning most
weight to the most recent observations. Based on the estimated parameters the
probabilities that on day t the market was in the high- and low-volatility states, respectively, are calculated along with the predicted state probabilities for day
t + 1. As the states are highly persistent (γii ≫ 0.5), the state that is predicted
to be most likely on day t + 1 will be the same that was identified to be most
likely on day t. If the state that is predicted to be most likely on day t + 1 is
different from the state that the asset allocation on day t is based on and the
confidence in the prediction is above 95%5 , then the allocation is changed based
on the closing price at day t + 1, i.e., there is assumed to be a one-day delay
in the implementation. Otherwise the asset allocation remains unchanged. The
log-return on day t + 1 is then included in the sample, the model is re-estimated,
and the state probabilities are calculated based on the new parameters. This
procedure is repeated sequentially by including the observations, one at a time,
from 1 January 1996 all the way through the sample.6
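The switching rule itself fits in a few lines. The sketch below is illustrative rather than the authors' implementation: it takes the predicted state probabilities for day t + 1 and the state the current allocation is based on, and returns the state to base the allocation on after the one-day implementation delay.

```python
import numpy as np

def next_allocation_state(pred_probs, current_state, threshold=0.95):
    """Switch only if the most likely predicted state differs from the state the
    current allocation is based on and its probability exceeds the threshold."""
    new_state = int(np.argmax(pred_probs))
    if new_state != current_state and pred_probs[new_state] > threshold:
        return new_state          # rebalance at the close of day t + 1
    return current_state          # otherwise keep the current allocation
```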
The performance of two regime-based strategies (Stocks–Bonds and Long–Short)
is compared to the performance of the two indices and a static portfolio with a
fixed allocation of 49% to stocks in table 3. The first strategy is fully invested
in the stock index in the low-volatility state and the bond index in the high-
volatility state. The average allocation to the stock index over the period was
49%. The second strategy is long the stock index in the low-volatility state and
short the stock index in the high-volatility state. Figure 4 shows the development
of the strategies and the indices. In the shaded periods the allocation was based
on being in the high-volatility state.
The identified regimes seem intuitive when looking at the log-returns at the
bottom of figure 4. There have been a total of 16 regime changes over the
5 If the confidence threshold is changed to 85% or 90% there will be more regime changes, but

the results will only change somewhat. For a given level of transaction costs there exists an optimal
threshold that balances the cost of rebalancing with the cost of not reacting to regime shifts or
delaying the reaction.
6 The log-returns from the two years prior to 1996 are used for initialization.
Figure 4: Development of the strategies and indices across the inferred regimes.

18-year period. The length of the identified regimes varies considerably from
a few weeks up to six years. This is different from what would be expected if
the regimes were based on a business cycle indicator. There appears to be large
differences in the level of volatility within the six-year high-volatility regime
beginning in 1998 that includes both the build-up and the burst of the dot-com
bubble. This suggests that the market states were less persistent around this
time.

The bond index has realized the highest Sharpe ratio (SR) with an annualized
return (AR) of 6.0% and an adjusted annualized standard deviation (SD) of only
3%. The reported SDs have been adjusted for autocorrelation using the proce-

Index/Strategy AR SD SR MDD
Bond index (JPM GBI) 0.058 0.03 1.89 0.05
Stock index (MSCI ACWI) 0.111 0.17 0.66 0.50
Static Portfolio 0.084 0.08 1.04 0.21
Stocks–Bonds 0.115 0.09 1.22 0.13
Long–Short 0.063 0.17 0.38 0.44

Table 5: Performance of the indices and the strategies when excluding year 2008.

dure outlined by Kinlaw et al. (2015).7 The data period has been characterized
by falling interest rates and low inflation leading to a strong performance for
bonds. It is unlikely that the environment for bonds will be as favorable going
forward.
The Stocks–Bonds strategy has realized the second highest SR of 1.23. The
annualized SD is the same as for the static portfolio (that has the same aver-
age exposure to the stock index), but the realized return is higher as long as
transaction costs do not exceed 239 basis points per one-way transaction. This
is when ignoring the costs associated with rebalancing to static weights, thus
the break-even transaction cost is higher than 239 basis points. In addition, the
maximum drawdown8 (MDD) for the Stocks–Bonds strategy is much smaller
when compared to the static strategy.
The Long–Short strategy has been less profitable, but it still outperforms the
stock index when transaction costs are less than 130 basis points per one-way
transaction. The outperformance essentially happened during the financial crisis
in 2008. The Long–Short strategy has a lower tail risk compared to the stock
index, but the risk of the strategy underperforming the index going forward
seems to be real.
This observation is confirmed in table 5, where the performance of the indices
and the regime-based strategies are summarized, when excluding year 2008 from
the sample. Although it still has a lower tail risk compared to the stock index,
the Long–Short strategy then underperforms the stock index. The performance
of the bond index and the Stocks–Bonds strategy barely change, whereas the
AR and SR of the stock index and the static portfolio increase. The average
allocation to the stock index when excluding year 2008 was 52%, which equals
the fixed allocation to stocks in the static portfolio in table 5. The realized
return of the Stocks–Bonds strategy still exceeds that of the static portfolio as
long as transaction costs do not exceed 140 basis points per one-way transaction.
7 The adjustment for autocorrelation leads to slightly higher standard deviations and correspond-

ingly lower Sharpe ratios. The adjustment is not important to the conclusions drawn as the indices
and strategies all displayed fairly similar, low amounts of autocorrelation.
8 The maximum drawdown is the largest relative decline from a historical peak in the index

value.
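For completeness, the maximum drawdown reported in tables 3 and 5 can be computed from a series of index or strategy values as in the small sketch below (illustrative, not the authors' code):

```python
import numpy as np

def max_drawdown(values):
    """Largest relative decline from a running peak in the value series."""
    v = np.asarray(values, dtype=float)
    running_peak = np.maximum.accumulate(v)
    return float(np.max(1.0 - v / running_peak))
```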

5 Summary and discussion


Our results indicate that it does not require any level of forecasting skill for
a RBAA strategy to be more profitable than a static strategy. The testing
was done one day at a time in a live-sample setting to make it as realistic as
possible. Our approach was based on identifying regimes in market returns using
a regime-switching model with time-varying parameters. As the parameters are
updated every day the same approach should work in other time periods as
well. The results are robust as they are based on available market data with no
assumptions about equilibrium returns, volatilities, or correlations. Additionally,
it might be possible to improve the performance by including economic variables,
interest rates, investor sentiment surveys, or other possible indicators.

The outperformance of the strategy that switched between stocks and bonds
appeared to be most reliable as it did not just happen during one year, as was
the case with the Long–Short strategy. Although the break-even transaction
cost was more than 239 basis points (140 basis points when excluding year
2008), the strategies will only remain profitable if the persistence of the market
states remains high. Volatility clustering is not a new phenomenon, but it can
occur at different levels of market persistence.

The tested strategies may be based on larger changes in allocation than most
investors are willing to and/or allowed to implement. Suppose your benchmark
is a 50–50 allocation to stocks and bonds and that you are allowed to vary it
between a 60–40 and a 40–60 allocation. For the Stocks–Bonds strategy, this
is equivalent to allocating 80% of the portfolio to a static 50–50 portfolio and
the remaining 20% to the regime-based strategy. The excess return that can be
obtained is then 20% of the excess return that could be obtained by allocating
the entire portfolio to the regime-based strategy.

Our results have important implications for portfolio managers with a medium
to long-term investment horizon. Even without any level of forecasting skill it
may not be optimal to hold a static portfolio. With some level of forecasting
skill the potential outperformance of a RBAA strategy could be substantial. It
is definitely worth considering a more dynamic approach to asset allocation if
not only to reduce the tail risk.

References
Ang, A. and G. Bekaert. “International asset allocation with regime shifts.”
Review of Financial Studies, vol. 15, no. 4 (2002), pp. 1137–1187.

Ang, A. and G. Bekaert. “How regimes affect asset allocation.” Financial Ana-
lysts Journal, vol. 60, no. 2 (2004), pp. 86–99.

Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual


Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.
Bulla, J. “Hidden Markov models with t components. Increased persistence and
other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.
Bulla, J. and I. Bulla. “Stylized facts of financial time series and hidden semi-
Markov models.” Computational Statistics & Data Analysis, vol. 51, no. 4
(2006), pp. 2192–2209.
Bulla, J., S. Mergner, I. Bulla, A. Sesboüé, and C. Chesneau. “Markov-switching
asset allocation: Do profitable strategies exist?” Journal of Asset Manage-
ment, vol. 12, no. 5 (2011), pp. 310–321.
Campbell, J. Y. “Asset prices, consumption, and the business cycle.” In Hand-
book of Macroeconomics, edited by J. B. Taylor and M. Woodford, vol. 1C,
chap. 19. Elsevier: Amsterdam (1999), pp. 1231–1303.
Cochrane, J. H. “Financial markets and the real economy.” Foundations and
Trends in Finance, vol. 1, no. 1 (2005), pp. 1–101.
Dopfel, F. E. “Designing the new policy portfolio: A smart, but humble ap-
proach.” Journal of Portfolio Management, vol. 37, no. 1 (2010), p. 43.
Guidolin, M. and A. Timmermann. “Asset allocation under multivariate regime
switching.” Journal of Economic Dynamics and Control, vol. 31, no. 11 (2007),
pp. 3503–3544.
Kinlaw, W., M. Kritzman, and D. Turkington. “The divergence of high- and low-
frequency estimation: Implications for performance measurement.” Journal
of Portfolio Management, vol. 41, no. 3 (2015), pp. 14–21.
Kritzman, M. and Y. Li. “Skulls, financial turbulence, and risk management.”
Financial Analysts Journal, vol. 66, no. 5 (2010), pp. 30–41.
Kritzman, M., S. Page, and D. Turkington. “Regime shifts: Implications for
dynamic strategies.” Financial Analysts Journal, vol. 68, no. 3 (2012), pp.
22–39.
Mandelbrot, B. “The variation of certain speculative prices.” Journal of Business,
vol. 36, no. 4 (1963), pp. 394–419.
Nystrup, P., H. Madsen, and E. Lindström. “Stylised facts of financial time
series and hidden Markov models in continuous time.” Quantitative Finance,
vol. 15, no. 9 (2015b), pp. 1531–1541.
Nystrup, P., H. Madsen, and E. Lindström. “Long memory of financial time
series and hidden Markov models with time-varying parameters.” Journal of
Forecasting, vol. 36, no. 8 (2017b), pp. 989–1002.

Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.
Sheikh, A. Z. and J. Sun. “Regime change: Implications of macroeconomic shifts
on asset class and portfolio performance.” Journal of Investing, vol. 21, no. 3
(2012), pp. 36–54.
Siegel, J. J. “Does it pay stock investors to forecast the business cycle?” Journal
of Portfolio Management, vol. 18, no. 1 (1991), pp. 27–34.
PAPER D
Originally published in The Journal of Portfolio Management

Dynamic allocation or diversification:


A regime-based approach to multiple assets

Peter Nystrup, Bo William Hansen, Henrik Olejasz Larsen,


Henrik Madsen, and Erik Lindström

Abstract

This article investigates whether regime-based asset allocation can effec-


tively respond to changes in financial regimes at the portfolio level in an
effort to provide better long-term results when compared to a static 60/40
benchmark. The potential benefit from taking large positions in a few assets
at a time comes at the cost of reduced diversification. This tradeoff is ana-
lyzed in a multi-asset universe with great potential for static diversification.
The regime-based approach is centered around a regime-switching model
with time-varying parameters that can match financial markets’ behavior,
and a new, more intuitive way of inferring the hidden market regimes. The
empirical results show that regime-based asset allocation is profitable, even
when compared to a diversified benchmark portfolio. The results are robust,
as they are based on available market data with no assumptions about fore-
casting skills.

Keywords: Dynamic asset allocation; Regime shifts; Hidden Markov model;


Time-varying parameters; Daily returns; Volatility clustering.

1 Introduction
Regime changes present a big challenge to traditional strategic asset allocation
(SAA) approaches seeking to develop static “all-weather” portfolios that opti-
mize efficiency across a range of economic scenarios. If economic conditions are
persistent and strongly linked to asset class performance, then a dynamic strat-
egy should add value over rebalancing to static weights, as argued by Sheikh
and Sun (2012).
Within the last 15 years, a large number of studies have examined the profitabil-
ity of regime-based asset allocation (RBAA). RBAA is distinct from tactical
asset allocation, which relies on forecasting, in that it is based on reacting to
observed changes in market conditions. The purpose of RBAA is not to pre-
dict regime changes or future market movements, but to identify when a regime

change has occurred and then benefit from the persistence of equilibrium re-
turns, volatilities, and correlations to take advantage of favorable regimes and
reduce potential drawdowns.
The hidden Markov model (HMM) is a popular choice for inferring the state of
financial markets. Ang and Bekaert (2002, 2004), Guidolin and Timmermann
(2007), Bulla et al. (2011), Kritzman et al. (2012), and Nystrup et al. (2015a)
have all found RBAA approaches based on HMMs to be profitable. Ang and
Bekaert (2002, 2004) and Guidolin and Timmermann (2007), however, did not
account for transaction costs, which is important because frequent rebalancing
can offset a dynamic strategy’s potential excess return.
All of the aforementioned studies considered dynamic allocation to stocks in
combination with bonds and/or a risk-free asset, often involving larger changes
in allocation than most investors are willing to or allowed to implement. The
potential benefit from taking large positions in a few assets at a time comes at
the cost of reduced diversification. The benefits of diversification include lower
downside risk and higher risk-adjusted returns. In order to analyze this tradeoff,
the performance of RBAA has to be compared to a static benchmark using a
more comprehensive asset universe, because the potential for diversification is
limited by the size of the asset universe.
Dynamic asset allocation is, by definition, more restricted than SAA in terms of
the size of the investment opportunity set, since it is difficult to invest dynami-
cally in illiquid assets such as private real estate, private equity, infrastructure,
timber, etc. This is worth mentioning, given that illiquid alternatives have be-
come a larger part of institutional investors’ portfolios in recent years. Although
restricted to the universe of liquid assets, there are more opportunities than just
stocks and government bonds.
Regime-based approaches are very popular, inter alia, because of the link to the
phases of the business cycle. As argued in Nystrup et al. (2015a), the link is
complex and difficult to exploit for investment purposes due to the large lag in
the availability of data related to the business cycle. Therefore, in this article,
the focus will be on readily available market data, instead of attempting to
establish the link to the business cycle.
The underlying two-state HMM with time-varying parameters is the same as
in Nystrup et al. (2015a). However, this study includes more asset classes, and
a new, more intuitive way of inferring the hidden market states based on an
online version of the Viterbi algorithm is introduced.1 It is examined whether
RBAA can effectively respond to financial regimes in an effort to provide better
long-term results when compared to a diversified, fixed-weight benchmark, with
emphasis on the tradeoff between dynamic allocation and diversification.
1 An online algorithm processes its input observation-by-observation in a sequential fashion,

without having the entire input sequence available from the start.
Figure 1: Development of the ten indices over the 19-year data period. The legends are sorted according to the index values at the end of 2015.

2 Asset universe
The asset universe considered in this paper consists of developed (DM) and
emerging market (EM) stocks, listed real estate, DM and EM high-yield bonds,
gold, oil, corporate bonds, inflation-linked bonds, and government bonds.2 All
indices measure the total net return in USD with a total of 4,944 daily closing
prices per index covering the period from 1997 through 2015.
Figure 1 shows the development of the ten indices over the 19-year data period.
It is evident that there are large differences in the asset classes’ behavior, which
is why diversification works over the long run. The Global Financial Crisis of
2007–2008 stands out, in that respect, as the majority of the indices suffered
large losses in this period.
Table 2 summarizes the annualized return, standard deviation, Sharpe ratio, and
maximum drawdown3 of the indices. To ensure that the performance comparison
2 The ten indices are MSCI World, MSCI Emerging Markets, FTSE EPRA/NAREIT Developed Real Estate, BofA Merrill Lynch U.S. High Yield, Barclays Emerging Markets High Yield, S&P GSCI Crude Oil (funded futures roll), S&P GSCI Gold (funded futures roll), Barclays U.S. Aggregate Corporate Bonds, Barclays World Inflation-Linked Bonds (hedged to USD), and J.P. Morgan Global Government Bonds (hedged to USD).
3 The maximum drawdown is the largest relative decline from a historical peak in the index value.
Index                                         Annualized return   Standard deviation   Sharpe ratio   Maximum drawdown
1. MSCI World (stocks)                        0.061               0.18                 0.34           0.57
2. MSCI EM (stocks)                           0.052               0.29                 0.18           0.65
3. FTSE/EPRA REIT (real estate)               0.075               0.22                 0.34           0.72
4. High-Yield Bonds (credit)                  0.064               0.12                 0.56           0.35
5. EM High-Yield Bonds (credit)               0.093               0.12                 0.75           0.36
6. S&P GSCI Crude Oil WTI (commodity)        -0.032               0.43                -0.07           0.91
7. S&P GSCI Gold (commodity)                  0.054               0.16                 0.34           0.46
8. Corporate Bonds Inv Grade (fixed income)   0.060               0.06                 1.07           0.16
9. Inflation-Linked Bonds (fixed income)      0.059               0.04                 1.40           0.10
10. JPM Global GBI (fixed income)             0.054               0.03                 1.75           0.05

Table 2: Performance of the ten indices over the 19-year data period.

is not distorted by autocorrelation in the daily returns, the reported standard deviations have been adjusted for autocorrelation using the procedure outlined by Kinlaw et al. (2015).4 The differences in performance are substantial. Out of the ten indices, the oil price index has been both the lowest- and the highest-valued index at some point during the 19-year period. It is the only index that has had a negative
return. The EM high-yield bond index finished at the highest value, while the
fixed income indices realized the highest Sharpe ratios and, at the same time,
suffered the smallest drawdowns. Fixed income benefited from falling interest
rates over the considered period.
The differences in Sharpe ratios are too large for a diversified portfolio to be
able to outperform a portfolio with an overweight of fixed income. There should
not be a strong preference, ex ante, for portfolios that overweight fixed income,
since the environment for bonds is unlikely to remain as favorable in the coming
years. It is, therefore, important to ensure that portfolios have the same average
allocation to fixed income before comparing their performance. Alternatively,
the mean values of the indices could be adjusted so that they all have the same
Sharpe ratio.
Figure 3 shows the correlations between the ten indices estimated from the
daily returns over the 19-year data period. The indices are divided into two
groups based on whether they are positively or negatively correlated with DM
stocks. Gold stands out as having a very low correlation with the other assets.
Investment-grade corporate bonds could have been labeled as credit rather than
fixed income, but the index appears to be strongly correlated with inflation-
linked bonds and government bonds. High-yield bonds, on the other hand, are
more strongly correlated with stocks than government bonds.

4 The adjustment leads to the reported standard deviations being higher than had they been annualized under the assumption of independence, as most of the indices display positive autocorrelation. The adjustment had a large impact on the standard deviation of the high-yield bond index that went from 0.04 to 0.12.
Figure 3: Correlations between the ten indices over the 19-year data period. The size of each circle illustrates the absolute value and the shading indicates the numerical value of the correlation. The indices are divided into two groups based on their correlation with DM stocks.

2.1 Volatility regimes


The regime detection will focus on the log-returns of the MSCI World index, because portfolio risk is typically dominated by stock market risk (see, e.g., Goyal et al. 2015). In addition, the stock markets generally lead the economy (Siegel 1991). Figure 4 shows the log-returns of the MSCI World index.5 The volatility forms clusters, as large price movements tend to be followed by large price movements and vice versa, as noted by Mandelbrot (1963).6

Figure 4: Volatility clustering in the log-returns of the MSCI World index.
5 The log-returns are calculated using rt = log (Pt ) − log (Pt−1 ), where Pt is the closing price
of the index on day t and log is the natural logarithm.
6 A quantitative manifestation of this fact is that while returns themselves are uncorrelated, absolute and squared returns display a positive, significant, and slowly decaying autocorrelation function.

RBAA aims to exploit this persistence of the volatility, since risk-adjusted re-
turns, on average, are substantially lower during turbulent periods, irrespective
of the source of turbulence, as shown by Kritzman and Li (2010). The negative
correlation between volatility and returns is sometimes explained by changes
in attitudes toward risk; because high-volatility regimes are associated with in-
creased risk aversion and reduced risk capacity, a high-volatility environment is
likely to be accompanied by falling asset prices.7
The intention is to identify high- and low-volatility regimes in the stock returns
using a regime-switching model and let the asset allocation depend on the iden-
tified regime. The purpose is not to outline the optimal strategy, but rather to
discuss the potential profitability of RBAA in a comprehensive asset universe.
It appeared from figure 1 that the turning points are not exactly the same for all asset classes; however, it is left for future research to show whether the results presented in this article can be improved by including information from the other asset classes in the regime-detection process.

2.2 A 60/40 benchmark


The first column in table 5 outlines the weight of the indices in a 60/40 portfolio.
The weights of stocks, real estate, credit, and commodities sum to 60%, and the weights of corporate, inflation-linked, and government bonds sum to 40%. This
portfolio will serve as a benchmark, though the results are not sensitive to the
specific choice of benchmark allocation.8 The weights have not been optimized,
but are chosen to mimic a 60/40 long-only SAA portfolio of an institutional
investor, in order to make the study as realistic as possible.9 The allocation
to high-yield bonds is considered part of the 60% allocation to stocks, because
they are more strongly correlated with stocks than government bonds, confer
figure 3.
In table 5, the ten indices’ weights in the risk–on and risk–off RBAA portfolios
are compared to their weights in the 60/40 benchmark portfolio when a fraction
p = 0.5 of the portfolio is allocated to the RBAA strategy. Risk–on means
that in low-volatility regimes the weight of the risky assets (everything that is

7 Increased risk aversion (a behavioral explanation) and reduced risk capacity (an institutional

explanation) are difficult to distinguish in data. Both effects have support (e.g., Cohn et al. 2015,
Brunnermeier and Pedersen 2009).
8 Another possible benchmark portfolio is the 1/N portfolio that assigns equal weight to all

assets. This agnostic portfolio has in many cases proved to be a difficult benchmark to beat (see
DeMiguel et al. 2009b), but it has realized a slightly lower Sharpe ratio than the proposed SAA
portfolio over the considered period.
9 The proposed benchmark allocation is within the range of variation of the average asset al-

location of pension plans across the major OECD countries according to OECD Global Pension
Statistics.
Index                                         SAA      Risk–on   Risk–off
1. MSCI World (stocks)                        25.0%    33.3%     12.5%
2. MSCI EM (stocks)                            5.0%     6.7%      2.5%
3. FTSE/EPRA REIT (real estate)               10.0%    13.3%      5.0%
4. High-Yield Bonds (credit)                   5.0%     6.7%      2.5%
5. EM High-Yield Bonds (credit)                5.0%     6.7%      2.5%
6. S&P GSCI Crude Oil WTI (commodity)          5.0%     6.7%      2.5%
7. S&P GSCI Gold (commodity)                   5.0%     6.7%      2.5%
Risky assets in total                         60%      80%       30%
8. Corporate Bonds Inv Grade (fixed income)   10.0%     5.0%     17.5%
9. Inflation-Linked Bonds (fixed income)      10.0%     5.0%     17.5%
10. JPM Global GBI (fixed income)             20.0%    10.0%     35.0%
Fixed income in total                         40%      20%       70%

Table 5: The ten indices' weights in the risk–on and risk–off RBAA portfolios when p = 0.5, compared to their weights in the benchmark SAA portfolio.

positively correlated with DM stocks) is increased above 60% and the weight of
fixed income is decreased below 40%. Risk–off is the opposite.10
The indices’ weights are increased and decreased in proportion to their relative
weights in the benchmark portfolio. The larger the percentage of the portfolio
allocated to the RBAA strategy, the more the weights are increased and de-
creased. When p = 0.5, the weight of government bonds, for example, is 10% in
the risk–on portfolio and 35% in the risk–off portfolio, compared to 20% in the
SAA portfolio. Adjusting the weight of the risky assets relative to fixed income,
rather than adjusting only the weights of stocks and government bonds, ensures
a minimum level of diversification even when p = 1.
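As a minimal sketch, the proportional scaling of the two blocks can be written in a few lines of Python; the array of benchmark weights is taken from table 5, while the function name and the regime labels are illustrative.

import numpy as np

# Benchmark (SAA) weights from table 5; the first seven indices form the
# risky block and the last three the fixed-income block.
w_saa = np.array([0.25, 0.05, 0.10, 0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.20])
risky = slice(0, 7)
fixed_income = slice(7, 10)

def rbaa_weights(p, regime):
    # Fully regime-dependent weights: the active block is scaled to sum to one.
    w_regime = np.zeros_like(w_saa)
    if regime == "risk-on":
        w_regime[risky] = w_saa[risky] / w_saa[risky].sum()
    else:
        w_regime[fixed_income] = w_saa[fixed_income] / w_saa[fixed_income].sum()
    # A fraction p follows the regime-based strategy; the rest stays at the SAA weights.
    return p * w_regime + (1 - p) * w_saa

# With p = 0.5, government bonds get a weight of 10% in the risk-on portfolio
# and 35% in the risk-off portfolio, matching the numbers quoted in the text.
print(rbaa_weights(0.5, "risk-on")[-1], rbaa_weights(0.5, "risk-off")[-1])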

3 The hidden Markov model


Imagine a market that is either in a bullish or a bearish regime. When the
market is in the bullish regime, the average return is positive and the volatility
is low. When the market is in the bearish regime, the average return is negative
and the volatility is high. Although the market regime can never be observed,
it can reasonably be concluded based on the returns whether it is a bull or a
bear market—that is, which state the market is in.
The use of HMMs to infer the state of financial markets has gained popularity
over the last decade. The HMM is a black-box model, but the inferred states
can often be linked to phases of the business cycle (see, e.g., Guidolin and
Timmermann 2007). The possibility of interpreting the states, combined with
the model’s ability to reproduce stylized facts of financial returns, is part of the
reason that HMMs have become increasingly popular.
10 The weights in the RBAA portfolio depend on the percentage p of the portfolio that is allocated to the RBAA strategy and the regime X: w_RBAA(p, X) = p · w_risk–on/off(X) + (1 − p) · w_SAA. The regime-dependent portfolio weights are w_risk–on/off(X) = (25, 5, 10, 5, 5, 5, 5, 0, 0, 0)^T/60 · 1_risk–on(X) + (0, 0, 0, 0, 0, 0, 0, 10, 10, 20)^T/40 · 1_risk–off(X), where the indicator function 1_risk–on(X) ≡ 1 if X = risk–on and 1_risk–on(X) ≡ 0 if X = risk–off.
In a hidden Markov model, the probability distribution that generates an observation depends on the state of an unobserved Markov chain. A sequence of discrete random variables $\{X_t : t \in \mathbb{N}\}$ is said to be a first-order Markov chain if, for all $t \in \mathbb{N}$, it satisfies the Markov property:

$$\Pr\left(X_{t+1} \mid X_t, \ldots, X_1\right) = \Pr\left(X_{t+1} \mid X_t\right). \qquad (1)$$

The conditional probabilities $\Pr\left(X_{t+1} = j \mid X_t = i\right) = \gamma_{ij}$ are called transition probabilities.

As an example, consider the two-state model with Gaussian conditional distributions:

$$Y_t \mid X_t \sim N\!\left(\mu_{X_t}, \sigma^2_{X_t}\right),$$

where

$$\mu_{X_t} = \begin{cases} \mu_1, & \text{if } X_t = 1,\\ \mu_2, & \text{if } X_t = 2, \end{cases} \qquad \sigma^2_{X_t} = \begin{cases} \sigma_1^2, & \text{if } X_t = 1,\\ \sigma_2^2, & \text{if } X_t = 2, \end{cases} \qquad \text{and} \qquad \Gamma = \begin{bmatrix} 1-\gamma_{12} & \gamma_{12} \\ \gamma_{21} & 1-\gamma_{21} \end{bmatrix}.$$

When the current state Xt is known, the distribution of Yt is given—that is, the
distribution of Yt depends only on Xt .
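For illustration, such a two-state Gaussian HMM can be simulated in a few lines of Python; the parameter values below are purely illustrative and are not estimates from the data.

import numpy as np

rng = np.random.default_rng(seed=1)

mu = np.array([0.0005, -0.0005])      # state-dependent means
sigma = np.array([0.008, 0.020])      # state-dependent standard deviations
Gamma = np.array([[0.99, 0.01],       # transition probability matrix
                  [0.02, 0.98]])

n = 1000
X = np.zeros(n, dtype=int)            # hidden state sequence, starting in state 0
Y = np.zeros(n)                       # observed returns
Y[0] = rng.normal(mu[X[0]], sigma[X[0]])
for t in range(1, n):
    X[t] = rng.choice(2, p=Gamma[X[t - 1]])     # Markov chain transition
    Y[t] = rng.normal(mu[X[t]], sigma[X[t]])    # conditional Gaussian draw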
The sojourn times are implicitly assumed to be geometrically distributed:

$$\Pr(\text{staying } t \text{ time steps in state } i) = \gamma_{ii}^{\,t-1}\left(1 - \gamma_{ii}\right). \qquad (2)$$

The geometric distribution is memoryless, implying that the time until the next
transition out of the current state is independent of the time spent in the state.
HMMs can match the tendency of financial markets to change their behavior
abruptly and the phenomenon that the new behavior often persists for several
periods after a change (Ang and Timmermann 2012). They are well suited
to capture the stylized behavior of many financial series including volatility
clustering and leptokurtosis, as shown by Rydén et al. (1998).
Subsequent articles have extended the classical Gaussian HMM by considering
other sojourn-time distributions than the memoryless geometric distribution
(Bulla and Bulla 2006), other conditional distributions than the Gaussian dis-
tribution (Bulla 2011), and a continuous-time formulation as an alternative to
the dominating discrete-time models (Nystrup et al. 2015b). In Nystrup et al.
(2017b) it was found that the need to consider other sojourn-time distributions
and other conditional distributions can be eliminated by adapting to the time-
varying behavior of the underlying data process.

Parameter estimation
The parameters of an HMM are usually estimated using the maximum-likelihood
method. Every observation is assumed to be of equal importance, no matter how
long the sample period is. This approach works well when the sample period is
short and the underlying process does not change over time. The time-varying
behavior of the parameters documented in previous studies (Rydén et al. 1998,
Bulla 2011, Nystrup et al. 2017b) calls for an adaptive approach that assigns
more weight to the most recent observations while keeping in mind past patterns
at a reduced confidence.
In Nystrup et al. (2017b), an adaptive estimation approach based on weighting
the observations with exponentially-decreasing weights—in other words, using
exponential forgetting—was outlined. The same estimation approach was used
in Nystrup et al. (2015a) and will be used in this article. The regime-switching
model is still a two-state HMM with Gaussian conditional distributions, but one
that adapts to the time-varying behavior of the underlying process in an effort
to produce more robust state estimates.
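A minimal sketch of the exponentially-decreasing weights is given below; the asymptotic memory length of 520 observations is the value used in the empirical study, whereas the normalization and the suggested use in a weighted likelihood are illustrative assumptions.

import numpy as np

memory_length = 520                  # asymptotic memory length in observations
lam = 1 - 1 / memory_length          # corresponding forgetting factor

def forgetting_weights(n):
    # Exponentially-decreasing weights; the most recent observation receives
    # the largest weight and the weights are normalized to sum to one.
    w = lam ** np.arange(n - 1, -1, -1)
    return w / w.sum()

# These weights could, for example, multiply the individual observations'
# log-likelihood contributions in an adaptive (weighted) EM estimation.
weights = forgetting_weights(2000)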

4 State inference
Once the parameters of the hidden Markov model have been estimated, the hid-
den states can be inferred. The most likely sequence of states can be computed
efficiently using the algorithm of Viterbi (1967). The entire output sequence
{Yt : t ∈ 1, 2, . . . , T } must be observed before the state for any time step can be
generated. A widely used approach is to break the input sequence into fixed-size
windows and apply the Viterbi algorithm to each window. Larger windows lead
to higher accuracy but result in higher latency.
Narasimhan et al. (2006) proposed an online step algorithm that makes it pos-
sible to dynamically trade off latency for expected accuracy, without having to
choose a fixed window size up front. The essence of their algorithm is that the
initial state becomes increasingly certain as more observations are included in
the sequence and the latency increases. Once the certainty estimate reaches a
dynamically-computed threshold, the identified initial state is outputted and
the algorithm proceeds to estimate the next state in the sequence. The algo-
rithm is shown in pseudo code in algorithm 1. Despite being very intuitive, the
algorithm has never been applied in studies of RBAA.

Choice of threshold
Using the online step algorithm, the tradeoff between accuracy and latency can
be made explicit by letting the threshold be a decreasing function of the latency.
It can be argued, though, that this is not desirable in the present application;
if the delay in classifying an observation is large, then part of the reallocation
premium has been missed, and incurring unnecessary transaction costs would
only make it worse.
Instead, a constant confidence threshold of 1 − 1/ T ≈ 0.9998, where T is the
number of observations, is chosen. This threshold is not comparable to the
Algorithm 1: Online step algorithm.

t = T = 1
while T ≤ the number of observations
    calculate the probabilities Pr ( X_t = i | Y_1 , Y_2 , . . . , Y_T ) for all states i at time t ≤ T
        based on knowledge of the first T observations
    if max_i Pr ( X_t = i | Y_1 , Y_2 , . . . , Y_T ) > threshold
        classify state X_t = arg max_i Pr ( X_t = i | Y_1 , Y_2 , . . . , Y_T )
        t = t + 1
        T = max ( t , T )
    else
        T = T + 1
    endif
endwhile

95% threshold applied in Nystrup et al. (2015a), where an observation was classified as belonging to the current regime unless it immediately and with 95% confidence could be classified as belonging to the other regime. In the algorithm proposed by Narasimhan et al. (2006), an observation is not classified until enough observations have been gathered that the confidence requirement is met.
With a constant threshold of 0.9998, the median delay in classifying an observa-
tion is 7 days (not trading days). The median delay in detecting regime changes
is 25 days. By lowering the confidence threshold, the delay can be reduced, but
this also leads to detection of more (spurious) regime changes and increased
transaction costs.
For a given level of transaction costs, there is an optimal threshold that balances
the cost of rebalancing with the cost of not reacting to regime changes or delaying
the reaction. A confidence threshold that corresponds to an expectation of one
misclassification for the entire sample may be too conservative. Attempting to
find the optimal threshold, however, would introduce a backtesting bias.
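To make Algorithm 1 concrete, a minimal Python sketch for a two-state Gaussian HMM is given below. The parameters are held fixed at illustrative values, whereas the empirical study re-estimates them adaptively after every new observation; the smoothed state probabilities are obtained with a standard scaled forward–backward recursion, recomputed from scratch at each step for clarity rather than efficiency.

import numpy as np
from scipy.stats import norm

# Illustrative, fixed parameters of a two-state Gaussian HMM.
mu = np.array([0.0005, -0.0005])
sigma = np.array([0.008, 0.020])
Gamma = np.array([[0.99, 0.01], [0.02, 0.98]])
delta = np.array([0.5, 0.5])                      # initial state distribution

def smoothed_probs(y, t):
    # Pr(X_t = i | Y_1, ..., Y_T) for T = len(y), via scaled forward-backward.
    T = len(y)
    dens = norm.pdf(np.asarray(y)[:, None], mu, sigma)
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = delta * dens[0]
    alpha[0] /= alpha[0].sum()
    for s in range(1, T):
        alpha[s] = (alpha[s - 1] @ Gamma) * dens[s]
        alpha[s] /= alpha[s].sum()                # rescale to avoid underflow
    for s in range(T - 2, -1, -1):
        beta[s] = Gamma @ (dens[s + 1] * beta[s + 1])
        beta[s] /= beta[s].sum()
    p = alpha[t] * beta[t]
    return p / p.sum()

def online_step(y, threshold=0.9998):
    # Sequential classification following Algorithm 1 (0-indexed version).
    states, t, T = [], 0, 1
    while T <= len(y):
        p = smoothed_probs(y[:T], t)
        if p.max() > threshold:
            states.append(int(p.argmax()))        # classify the state at time t
            t += 1
            T = max(t + 1, T)                     # keep t within the observed sample
        else:
            T += 1                                # gather one more observation
    return states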

5 Empirical results
The testing is done one day at a time in a live-sample setting to make it as
realistic as possible. The model is fitted to the first t observations of the MSCI
World index, assigning most weight to the most recent observations.11 Based on
the estimated parameters, the probability that on day t the market was in the
high- and low-volatility states, respectively, is estimated.12 The asset allocation
remains unchanged if the certainty of the estimate does not exceed the threshold
11 An asymptotic memory length of 520 days is used when estimating the parameters (see Nystrup
et al. 2017b).
12 As the states are highly persistent (γ_ii ≫ 0.5), the most likely state on day t + 1 will be the same state that was estimated to be most likely on day t.
of 0.9998. The closing price on day t + 1 is then included in the sample, the
model is re-estimated, and the state probabilities for day t are estimated based
on knowledge of the closing price on day t + 1. This procedure is repeated
sequentially by including the observations, one at a time, until the certainty of
the estimate exceeds the threshold and the state on day t can be classified.
Once the certainty exceeds the threshold and the state on day t has been clas-
sified based on knowledge of the first T observations, the asset allocation can
be updated. If the estimated state on day t is different from the state that the
current asset allocation is based on, then the allocation is changed based on
the closing price at day T + 1—that is, there is assumed to be a one-day delay
in the implementation. Otherwise the asset allocation remains unchanged. If,
based on knowledge of the first T observations, the states on day t and t + 1 can
be classified simultaneously, then it is the state on day t + 1 that determines
whether the asset allocation is changed.
When as many observations as possible have been classified based on knowledge
of the first T observations, then the closing price on day T + 1 is included in the
sample and the model is re-estimated. It is then the next day in the sequence
that has to be classified—for instance, t + 2, if, based on knowledge of the first
T observations, the states on day t and t + 1 were classified. This procedure is
repeated sequentially by including the observations, one at a time, from January
1, 1998 all the way through the sample.13 The portfolio is rebalanced only when
the allocation changes from risk–on to risk–off or vice versa.

13 The log-returns from 1997 are used for initialization.



5.1 Dynamic allocation or diversification

Figure 6 shows the annualized return, standard deviation, and Sharpe ratio as a function of the percentage p of the portfolio that is allocated to the RBAA strategy; p = 0 corresponds to the benchmark 60/40 SAA portfolio rebalanced at the change points. In terms of the rebalancing frequency of the benchmark portfolio, it turns out that annual rebalancing would have been better than quarterly or monthly. This is a lower frequency than institutional investors typically use and indicates the presence of momentum in the asset returns. Rebalancing at the change points, however, performed almost as well as annual rebalancing. By rebalancing the benchmark portfolio at the same points in time as the allocation of the RBAA portfolios changes, the timing has no impact on the relative performance.

The annualized return of the RBAA portfolio increases linearly as a function of p. The annualized standard deviation (adjusted for autocorrelation) is minimized when p is around 0.5, and the Sharpe ratio is maximized when p is approximately 0.8. Although this is when there are no transaction costs, adding 10 basis point transaction costs does not change the shape of the graphs. The value of p that maximizes the Sharpe ratio would be much smaller if it was only the weight of DM stocks and government bonds that was adjusted dynamically, because the decline in diversification would be steeper. In summary, RBAA increases portfolio return, decreases portfolio risk, and, thus, leads to increased risk-adjusted returns.

Figure 6: Annualized return, standard deviation, and Sharpe ratio as a function of the percentage of the portfolio that is allocated to the regime-based strategy.

Figure 7: Development of RBAA strategy with p = 0.5 compared to 60/40 SAA portfolio and the MSCI World index across the inferred regimes. In the shaded, high-volatility periods, the allocation was risk–off.

5.2 Performance across the inferred regimes


In figure 7, the development of the RBAA strategy with p = 0.5 is compared to
the 60/40 SAA portfolio and the MSCI World index over the period from 1998
to 2015. In the shaded periods, the allocation was risk–off. The 60/40 portfolio
is in fact a suitable benchmark for this period, as it turns out that the RBAA
strategy was risk–on 61% of the time—that is, the two portfolios had almost
the same average allocation to fixed income.
The inferred regimes seem intuitive when looking at the log-returns of the MSCI
World index in the lower panel of figure 7. A total of 34 regime changes are
                        RBAA(p = 0.5)   RBAA(p = 0.8)   SAA
Annualized return       0.076           0.081           0.067
Standard deviation      0.088           0.089           0.096
Sharpe ratio            0.87            0.90            0.70
Maximum drawdown        0.17            0.18            0.30
Calmar ratio            0.46            0.46            0.22
Annual turnover         0.92            1.47            0.05

Table 8: Performance of RBAA portfolios compared to 60/40 SAA portfolio with rebalancing at the change points.

detected over the 18-year period. The length of the identified regimes varies
considerably from a few weeks up to four years, which is different from what
would be expected if the regimes were based on a business-cycle indicator.

The RBAA strategy with p = 0.5 outperforms both the 60/40 benchmark and
the MSCI World index, when there are no transaction costs. The outperfor-
mance relative to the benchmark portfolio begins in 2003 and then slowly accu-
mulates all the way through the crisis in 2008. Part of the outperformance is
lost in the first half of 2009 when the market rebounds and the RBAA portfolio
is still risk–off. Once the allocation is changed to risk–on in the second half of
2009, the RBAA strategy again starts to outperform the benchmark. In 2015,
however, the RBAA strategy underperformed the SAA portfolio.

In table 8, the performance of the RBAA portfolio with p = 0.5 and p = 0.8, respectively, is compared to the 60/40 SAA portfolio rebalanced at the
change points over the period 1998–2015. Recall from figure 6 that p = 0.5
was the percentage that minimized the standard deviation and the Sharpe ratio
was maximized around p = 0.8. Although the annual turnover of the RBAA
strategies is much higher than that of the benchmark portfolio, the Sharpe ratios
exceed that of the SAA portfolio as long as transaction costs do not exceed 79
and 60 basis points per one-way transaction, respectively.

In addition, the maximum drawdown of the RBAA portfolios is much smaller than that of the SAA portfolio. This means that the Calmar ratio, which is the annualized return divided by the maximum drawdown, is more than twice as high for the RBAA portfolios as for the SAA portfolio.
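For reference, the reported measures can be computed as sketched below, assuming a numpy array of daily simple returns, 252 trading days per year, and a zero risk-free rate; the autocorrelation adjustment of the standard deviation described in the asset-universe section is not included.

import numpy as np

def performance_summary(returns, periods_per_year=252):
    wealth = np.cumprod(1 + returns)                          # cumulative portfolio value
    years = len(returns) / periods_per_year
    ar = wealth[-1] ** (1 / years) - 1                        # annualized return
    sd = returns.std(ddof=1) * np.sqrt(periods_per_year)      # annualized standard deviation
    drawdown = 1 - wealth / np.maximum.accumulate(wealth)     # decline from historical peak
    mdd = drawdown.max()                                      # maximum drawdown
    return {"annualized return": ar,
            "standard deviation": sd,
            "Sharpe ratio": ar / sd,
            "maximum drawdown": mdd,
            "Calmar ratio": ar / mdd}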

5.3 Comparison with other approaches


In table 9, the performance of the RBAA portfolio with p = 0.5 and the SAA
portfolio is compared with two other RBAA approaches. The first is based on
the median filter that Bulla et al. (2011) applied to filter the state probabilities
instead of the online step algorithm. The inferred state on day t is the median
                        RBAA(p = 0.5)   SAA     Median filter   EWMA
Annualized return       0.076           0.067   0.079           0.055
Standard deviation      0.088           0.096   0.103           0.094
Sharpe ratio            0.87            0.70    0.76            0.59
Maximum drawdown        0.17            0.30    0.21            0.23
Calmar ratio            0.46            0.22    0.38            0.25
Annual turnover         0.92            0.05    1.06            3.02

Table 9: Performance of RBAA portfolio and 60/40 SAA portfolio with rebalancing at the change points compared to a median filter and an EWMA approach. A percentage p = 0.5 of the portfolio was allocated to the RBAA strategy in all three cases.

of the most likely states over the last 21 days.14 The Sharpe ratio is almost
the same regardless of whether the number of days is 5 or 21 (one week or one
month), but the annual turnover is much higher when fewer days are considered.
The average allocation of the median-filter approach is 71% risk–on and 29%
risk–off, meaning that the 60/40 SAA portfolio is not a suitable benchmark for
this approach. Despite the lower average exposure to fixed income, the median-
filter approach realizes a higher Sharpe ratio than that of the SAA portfolio. It
relies on the same HMM as the RBAA portfolio to distinguish between market
regimes, but realizes a lower Sharpe ratio with a higher turnover.
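A minimal sketch of the median filter in footnote 14, assuming the sequence of most likely states (coded as integers) has already been computed from the HMM:

import numpy as np

def median_filter(most_likely_states, window=21):
    # The filtered state on day t is the integer part of the median of the
    # most likely states over the last `window` days (21 days, about one month).
    x = np.asarray(most_likely_states)
    filtered = np.empty(len(x), dtype=int)
    for t in range(len(x)):
        filtered[t] = int(np.median(x[max(0, t - window + 1):t + 1]))
    return filtered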
The second strategy is based on an exponentially-weighted moving average
(EWMA) of the standard deviation of the MSCI World index. When the EWMA
rises above its 60% quantile, the allocation is changed to risk–off, and when it
falls below its 60% quantile, the allocation is changed to risk–on. By design, the
average allocation of the RBAA strategy based on the EWMA is 60/40. The an-
nual turnover is highly dependent on the memory length of the EWMA. Table 9
shows the result when using an effective memory length of 21 days. This value
yields a reasonable tradeoff between Sharpe ratio (before transaction costs) and
turnover.
The basic idea of the EWMA approach—to reduce risk exposure whenever
higher volatility has been observed for a while—is the same as for the other
RBAA approaches; however, it realizes a lower Sharpe ratio than does the SAA
portfolio. The comparison emphasizes the value of the HMM for distinguishing
between market regimes and the online step algorithm for filtering the regime
probabilities.

14 The inferred state on day t is $\hat{X}^{f}_t = \left[\operatorname{median}\left(\hat{X}_{t-20}, \hat{X}_{t-19}, \ldots, \hat{X}_t\right)\right]$, where $[\cdot]$ maps every number to its integer part and $\hat{X}_t = \arg\max_i \Pr\left(X_t = i \mid Y_1, Y_2, \ldots, Y_t\right)$.
6 Conclusion
The empirical results showed that regime-based asset allocation is profitable,
even when compared to a diversified benchmark portfolio in a multi-asset uni-
verse. The proposed strategy was based on adjusting the weight of risky assets
relative to safe assets (fixed income) to maintain a minimum level of diversifica-
tion in all regimes.

The results are robust, because they are based on available market data with no
assumptions about equilibrium returns, volatilities, correlations, or the ability to
forecast their future values. As the parameters of the hidden Markov model used
to identify the regimes were updated every day, the same approach should work
in other time periods as well. It will remain a possibility for future research
to try to improve the performance by including information from other asset
classes, economic variables, interest rates, investor sentiment surveys, or other
possible indicators.

The benchmark portfolio was chosen to mimic a 60/40 long-only SAA portfolio
of an institutional investor to make the comparison as realistic as possible. The
performance of RBAA was analyzed as a function of the percentage of the
portfolio that was allocated dynamically. In order to minimize the portfolio
standard deviation, 50% had to be allocated to the regime-based strategy. This
corresponded to an 80/20 allocation in the low-volatility regimes and a 30/70
allocation in the high-volatility regimes. The lower standard deviation combined
with a higher return led to an improved Sharpe ratio compared to the static,
fixed-weight benchmark. Even more remarkable was the improvement in the
ratio of average return to maximum drawdown, as this ratio more than doubled
when allocating half of the portfolio dynamically.

The results have important implications for portfolio managers with a medium-
to long-term investment horizon. The percentage of a multi-asset portfolio that can advantageously be allocated dynamically is strongly dependent on the effectiveness of the regime-detection process. Rebalancing to a static benchmark is
not optimal, however, when market regimes are persistent. It is definitely worth
considering a more dynamic approach to asset allocation, if only to reduce the
tail risk.

References
Ang, A. and G. Bekaert. “International asset allocation with regime shifts.”
Review of Financial Studies, vol. 15, no. 4 (2002), pp. 1137–1187.

Ang, A. and G. Bekaert. “How regimes affect asset allocation.” Financial Ana-
lysts Journal, vol. 60, no. 2 (2004), pp. 86–99.
Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.
Brunnermeier, M. K. and L. H. Pedersen. “Market liquidity and funding liquid-
ity.” Review of Financial studies, vol. 22, no. 6 (2009), pp. 2201–2238.
Bulla, J. “Hidden Markov models with t components. Increased persistence and
other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.
Bulla, J. and I. Bulla. “Stylized facts of financial time series and hidden semi-
Markov models.” Computational Statistics & Data Analysis, vol. 51, no. 4
(2006), pp. 2192–2209.
Bulla, J., S. Mergner, I. Bulla, A. Sesboüé, and C. Chesneau. “Markov-switching
asset allocation: Do profitable strategies exist?” Journal of Asset Manage-
ment, vol. 12, no. 5 (2011), pp. 310–321.
Cohn, A., J. Engelmann, E. Fehr, and M. A. Maréchal. “Evidence for counter-
cyclical risk aversion: an experiment with financial professionals.” American
Economic Review, vol. 105, no. 2 (2015), pp. 860–885.
DeMiguel, V., L. Garlappi, and R. Uppal. “Optimal versus naive diversification:
How inefficient is the 1/N portfolio strategy?” Review of Financial Studies,
vol. 22, no. 5 (2009b), pp. 1915–1953.
Goyal, A., A. Ilmanen, and D. Kabiller. “Bad habits and good practices.” Journal
of Portfolio Management, vol. 41, no. 4 (2015), pp. 97–107.
Guidolin, M. and A. Timmermann. “Asset allocation under multivariate regime
switching.” Journal of Economic Dynamics and Control, vol. 31, no. 11 (2007),
pp. 3503–3544.
Kinlaw, W., M. Kritzman, and D. Turkington. “The divergence of high- and low-
frequency estimation: Implications for performance measurement.” Journal
of Portfolio Management, vol. 41, no. 3 (2015), pp. 14–21.
Kritzman, M. and Y. Li. “Skulls, financial turbulence, and risk management.”
Financial Analysts Journal, vol. 66, no. 5 (2010), pp. 30–41.
Kritzman, M., S. Page, and D. Turkington. “Regime shifts: Implications for
dynamic strategies.” Financial Analysts Journal, vol. 68, no. 3 (2012), pp.
22–39.
Mandelbrot, B. “The variation of certain speculative prices.” Journal of Business,
vol. 36, no. 4 (1963), pp. 394–419.
Narasimhan, M., P. Viola, and M. Shilman. “Online decoding of Markov models
under latency constraints.” In Proceedings of the 23rd International Confer-
ence on Machine Learning (2006), pp. 657–664.
Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Regime-based versus static asset allocation: Letting the data speak.” Journal of Portfolio Management, vol. 42, no. 1 (2015a), pp. 103–109.
Nystrup, P., H. Madsen, and E. Lindström. “Stylised facts of financial time
series and hidden Markov models in continuous time.” Quantitative Finance,
vol. 15, no. 9 (2015b), pp. 1531–1541.
Nystrup, P., H. Madsen, and E. Lindström. “Long memory of financial time
series and hidden Markov models with time-varying parameters.” Journal of
Forecasting, vol. 36, no. 8 (2017b), pp. 989–1002.
Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.
Sheikh, A. Z. and J. Sun. “Regime change: Implications of macroeconomic shifts
on asset class and portfolio performance.” Journal of Investing, vol. 21, no. 3
(2012), pp. 36–54.
Siegel, J. J. “Does it pay stock investors to forecast the business cycle?” Journal
of Portfolio Management, vol. 18, no. 1 (1991), pp. 27–34.
Viterbi, A. J. “Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm.” IEEE Transactions on Information Theory,
vol. 13, no. 2 (1967), pp. 260–269.
PAPER E
Originally published in the Journal of Asset Management

Detecting change points in VIX and S&P 500:


A new approach to dynamic asset allocation

Peter Nystrup, Bo William Hansen, Henrik Madsen,


and Erik Lindström

Abstract

The purpose of dynamic asset allocation is to overcome the challenge that


changing market conditions present to traditional strategic asset allocation
by adjusting portfolio weights to take advantage of favorable conditions and
reduce potential drawdowns. This article proposes a new approach to dy-
namic asset allocation that is based on detection of change points without fit-
ting a model with a fixed number of regimes to the data, without estimating
any parameters, and without assuming a specific distribution of the data. It
is examined whether dynamic asset allocation is most profitable when based
on changes in the CBOE Volatility Index (VIX) or change points detected
in daily returns of the S&P 500 index. In an asset universe consisting of
the S&P 500 index and cash, it is shown that a dynamic strategy based on
detected change points significantly improves the Sharpe ratio and reduces
the drawdown risk when compared to a static, fixed-weight benchmark.

Keywords: Regime changes; Change-point detection; Dynamic asset alloca-


tion; Volatility regimes; Daily returns; Nonparametric statistics.

1 Introduction
The financial crisis of 2007–2008 resulted in large losses for most static port-
folios, leading to increased interest in dynamic approaches to asset allocation.
The market ructions during the beginning of 2016 are a recent example of how
abruptly the behavior of financial markets can change. Although some changes
may be transitory, the new behavior often persists for several months after a
change (Ang and Timmermann 2012). The mean, volatility, and correlation
patterns in stock returns, for example, changed dramatically at the start of,
and persisted through the crisis of 2007–2008.
Changes in market dynamics present a big challenge to traditional strategic
asset allocation (SAA) that seeks to develop static “all-weather” portfolios that
optimize efficiency across a range of scenarios. The purpose of dynamic asset


allocation (DAA) is to take advantage of favorable market conditions and reduce
potential drawdowns by adjusting portfolio weights as new information arrives
(Sheikh and Sun 2012). DAA is distinct from tactical asset allocation (TAA).
While the latter relies on forecasting, DAA is based on reacting to changes in
market conditions. The goal of DAA is not to predict change points or future
market movements, but to identify when a regime shift has occurred and then
benefit from persistence of equilibrium returns and volatilities.
Some approaches to DAA exploit the relationship between observed regimes in
financial markets and the phases of the business cycle while other approaches
are based solely on market data. Regime-switching models, such as the hidden
Markov model (HMM), are a popular choice for modeling the hidden state of
financial markets because they are able to reproduce stylized facts of financial
returns, including volatility clustering and leptokurtosis (see, e.g., Rydén et al.
1998, Nystrup et al. 2017b). Furthermore, the inferred states can often be linked
to phases of the business cycle (see Guidolin and Timmermann 2007). Several
studies have shown the profitability of DAA strategies based on regime-switching
models (see, e.g., Guidolin and Timmermann 2007, Bulla et al. 2011, Kritzman
et al. 2012, Nystrup et al. 2015a).
A DAA strategy has two components: a method for detecting change points and
a strategy for changing the portfolio when a change point has been detected. In the case of a strategy based on a regime-switching model, the two components fuse, as the asset allocation is typically fully determined by the inferred regime. The pivotal question then becomes how many regimes are needed and whether the regimes can be assumed to be stationary. Assuming a fixed number of regimes (typically between two and four) based on economic motivations, as has often been done in the literature (see Guidolin 2011b), is unlikely to be optimal.
The approach taken in this article is very different. It is based on detection
of change points without fitting a model with a fixed number of regimes to
the data, without estimating any parameters, and without assuming a specific
distribution of the data. The change points will not necessarily correspond to
turning points in the business cycle. When a change point has been detected, the
only knowledge of the new regime is the observations encountered between the
change point and the time of detection. The portfolio adjustment is determined
independently of the change-point detection.
This article presents a new approach to DAA. It examines whether DAA based
on nonparametric change-point detection can provide better long-term results
when compared to static, fixed-weight portfolios. The focus will be on stock
returns as portfolio risk is typically dominated by stock market risk. It will
be examined whether to test for location, scale, or more general distributional
changes; and whether the CBOE Volatility Index (VIX) or past returns of the
S&P 500 index yield the best proxy for the prevailing regime. Finally, it will be
examined whether DAA is most profitable when based on changes in the VIX
or change points detected in daily returns of the S&P 500 index. The VIX is
considered due to its forward-looking nature which will be discussed in the next
section before introducing change-point detection and the empirical results in
the following sections.

2 VIX and S&P 500


The CBOE Volatility Index (VIX), introduced by the Chicago Board Options
Exchange (CBOE) in 1993, is an important benchmark for stock market volatil-
ity. The VIX estimates expected market volatility by averaging weighted prices
of 30-calendar-day S&P 500 options over a range of strike prices. It consid-
ers a model-free estimator of implied volatility and, thus, does not depend on
any particular option pricing framework.1 The VIX essentially offers a market-
determined, forward-looking estimate of one-month stock market volatility and
is regarded as an indicator of market stress; the higher the VIX, the greater the
fear (Whaley 2000).
There is a significantly negative relationship between changes in the VIX and contemporaneous returns of the S&P 500 index as depicted in figure 1. The negative correlation between stock returns and implied volatility captures the leverage effect first discussed by Black (1976). Racicot and Théoret (2016) found that the VIX imbeds many properties of other macroeconomic and financial uncertainty measures such as the growth of industrial production, the growth of consumer credit, long-term interest rates, term spreads, etc.

Figure 1: The leverage effect illustrated by contemporaneous log-returns of the VIX and the S&P 500 index.
In figure 2, the VIX is compared to the realized volatility of the S&P 500 index
one-month ahead.2 It is evident that the VIX has had a persistent bias over
realized volatility. The VIX is not a pure forecast; it is the price of volatility
and, as such, includes a risk premium that varies over time. A negative risk
premium for volatility partly explains the bias.

1 The details of the computations are available at http://www.cboe.com/micro/vix/vixwhite.pdf.
2 The realized volatility is calculated, in accordance with the methodology used by S&P Dow Jones Indices, as $RV_t = 100 \times \sqrt{12 \sum_{i=t-20}^{t} \log\left(P_i / P_{i-1}\right)^2}$, where $P_t$ is the closing price of the index on day t and log is the natural logarithm.
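For illustration, the realized-volatility formula in footnote 2 can be computed as sketched below, assuming prices is a pandas Series of daily closing prices.

import numpy as np
import pandas as pd

def realized_volatility(prices: pd.Series) -> pd.Series:
    log_returns = np.log(prices / prices.shift(1))
    # Sum of squared log-returns over the 21-day window ending at day t,
    # annualized with the factor 12 and expressed in percent.
    return 100 * np.sqrt(12 * (log_returns ** 2).rolling(21).sum())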
From figure 2, it appears that spikes in realized volatility are followed by spikes in the VIX. Andersen et al. (2007) showed that monthly volatility forecasts based on past realized volatility are more efficient than the VIX model-free implied volatility. It may, however, still be the case that the VIX provides earlier signals about change points than past returns of the S&P 500 index.

Figure 2: The VIX and the realized volatility of S&P 500 one-month ahead. Only one observation per month is shown.

The data analyzed is 6,485 daily log-returns of the VIX and the S&P 500 index covering the period from January 1990 through September 2015.
Figure 3 shows the two indices3 together with their log-returns.4 The volatility
forms clusters as large movements tend to be followed by large movements and
vice versa, as noted by Mandelbrot (1963).5 This is most pronounced in the
log-returns of the S&P 500 index, although it can be seen in the VIX series as
well.
The volatility of the VIX appears to be higher when the level of the VIX is high.
Persistence in the volatility of the VIX corresponds to persistence in the kurtosis
of the S&P 500 returns, as the second moment of the VIX corresponds to the
fourth moment of the S&P 500 returns. DAA aims to exploit this persistence
of the volatility, as risk-adjusted returns, on average, are substantially lower
during turbulent periods, irrespective of the source of turbulence, as shown by
Kritzman and Li (2010).
The log-return series are characterized by a large number of exceptional observa-
tions. Table 4 shows the first four moments of the log-returns of the two indices.
The moments of the two return series are quite different, but the kurtosis is well
above three for both series implying that the tails of the unconditional distribu-
tions are heavier than those of the Gaussian distribution. Assuming the data is
Gaussian would cause occasional large values to be interpreted as change points,
even though they should more correctly be classified as outliers, as discussed by
Ross (2013).

3 The S&P 500 index has been rescaled to start at 100 on January 2, 1990.
4 The log-returns are calculated using $r_t = \log\left(P_t / P_{t-1}\right)$.
5 A quantitative manifestation of this fact is that while returns themselves are uncorrelated, absolute and squared returns display a positive, significant, and slowly decaying autocorrelation function.
Figure 3: The VIX, the S&P 500 index, and their daily log-returns.

                     VIX        S&P 500
Mean                 0.000054   0.00026
Standard deviation   0.063      0.011
Skewness             0.69       −0.24
Kurtosis             7.2        11.7

Table 4: First four moments of the log-returns.

3 Change-point detection
A frequently used analogy for financial returns in the case of regime-switching
models is that of a person’s heart rate. While the person sleeps, a low average
heart rate with low volatility is observed. When the person wakes up, there is a
sudden rise in the heart rate’s average level and its volatility. Without actually
seeing the person, it can reasonably be concluded based on observations of the
heart rate whether he or she is awake or sleeping. Rather than distinguishing
between a fixed number of states, such as awake or sleeping, the aim of change-
point detection is to detect when the distribution of the heart rate changes.
Change-detection problems, where the goal is to monitor for distributional shifts
in a sequence of time-ordered observations, arise in many diverse areas. They
have been studied extensively within the field of statistical process control where
the goal is to monitor the quality characteristics of an industrial process in order to detect and diagnose faults.
The task is to detect whether a data sequence contains a change point. If no
change point exists, the observations are assumed to be identically distributed.
If a change point exists at time τ , then the observations are distributed as:
$$X_i \sim \begin{cases} F_0 & \text{if } i < \tau, \\ F_1 & \text{if } i \geq \tau. \end{cases} \qquad (1)$$

In other words, the variables are assumed iid with some distribution $F_0$ before the change point at t = τ and iid with a different distribution $F_1$ after. The
location of the change point is unknown, and the problem is to detect it as soon
as possible. The change-point methodology can also be applied to sequences
that are not iid between change points, by first modeling the data sequence
in a way that yields iid one-step-ahead forecast residuals and then performing
change detection on these (Ross et al. 2011).
Most traditional approaches to change detection assume that the distributional
form of the data is known before and after the change with only the parame-
ters being unknown. Classical methods for this problem include the CUSUM
method (Page 1954), exponentially-weighted moving-average charts (Roberts
1959), and generalized likelihood ratio tests (Siegmund and Venkatraman 1995).
This assumption, however, rarely holds in sequential applications. Typically,
there is no prior knowledge of the true distribution, or the assumptions made about the distribution may be incorrect.

A nonparametric approach
The nonparametric (distribution-free) change detection approach applied in this
article is based on Ross et al. (2011) and the implementation by Ross (2015).
It does not assume that anything is known about the distribution of the data
before monitoring begins; it is thus an example of a self-starting technique. The
main advantage of the approach being self-starting is that it can be deployed
out-of-the-box without the need to estimate parameters of the data distribution
from a reference sample prior to monitoring.
In a typical setting, a sequence of observations x1 , x2 , . . . are received from the
random variables X1 , X2 , . . .. The distribution of Xi is given by (1), conditional
on the change point τ which the task is to detect. Suppose t points from the
sequence have been observed. For any fixed k < t the hypothesis that a change
point occurred at the kth observation can be written as
$$H_0: \; X_i \sim F_0 \;\; \forall i, \qquad H_1: \; X_i \sim \begin{cases} F_0 & \text{if } i < k, \\ F_1 & \text{if } i \geq k. \end{cases}$$
A two-sample hypothesis test can be used to test for a change point at k. Let
Dk,t be an appropriately chosen test statistic. For example, if the change is
assumed to take the form of a shift in location and the data is assumed to
be Gaussian, then Dk,t will be the statistic associated with the usual t-test.
If Dk,t > hk,t for some appropriately chosen threshold hk,t , then a change is
detected at location k. As no information is available concerning the location
of the change point, Dk,t has to be evaluated at all values of 1 < k < t. If

$$D_{\max,t} = \max_{k} D_{k,t}, \qquad (2)$$

then the hypothesis that no change has occurred before the tth observation is
rejected if Dmax,t > ht for some threshold ht . The estimate τ̂ of the change
point is then the value of k which maximizes Dk,t . This formulation provides a
general method for nonsequential change-point detection on a fixed size dataset.
It can also be applied in the case where points are sequentially arriving over
time, by repeatedly recomputing the test statistic as each new observation is
received.
When the task is to detect a change in a data sequence where no informa-
tion is available regarding the pre- or post-change distribution, the approach
of Ross et al. (2011) is to replace Dk,t with a nonparametric two-sample test
statistic that can detect arbitrary changes in a distribution. The algorithm
would proceed as above, with this statistic evaluated at every time point, and
the maximum value being compared to a threshold ht . In many situations a
more powerful test can be found by restricting attention to the case when the
pre-change distribution F0 undergoes a change in either location or scale:

• Location Shift: F1 (x) = F0 (x + δ).

• Scale Shift: F1 (x) = F0 (δx).

This corresponds to a change in either the mean or volatility of financial returns.


Although it is slightly more restricted than testing for arbitrary changes, in
practice any change in F0 is likely to cause a shift in location or scale, and thus
can be detected.
Nonparametric change detection is typically perceived to be less powerful than
parametric change detection, but Ross et al. (2011) showed that, although para-
metric Gaussian change detection using the t- and F -test outperforms the non-
parametric methods when detecting larger sized changes in the parameters of
a Gaussian distribution, the difference in performance is not excessive, and the
nonparametric tests actually outperform for smaller sized changes. For heavy-
tailed data the nonparametric methods significantly outperform the parametric
Gaussian methods.
Of the nonparametric test statistics that were considered for this article, the
tests that—in agreement with Ross et al. (2011)—were found to be most pow-
erful are the Mann–Whitney test for changes in location (Mann and Whitney
1947), the Mood test for changes in scale (Mood 1954), and the Lepage test for
joint monitoring of changes in location and scale (Lepage 1971). The Cramér–
von Mises and Kolmogorov–Smirnov tests (see Ross and Adams 2012) were also
considered, but they were found to be slower at detecting distributional changes
when compared to the Lepage test.
The mentioned tests use only the rank of the observations.6 The Mood test
(Mood 1954), for example, is based on the observation that if n = nA + nB
points are spread over two samples A and B, then assuming no tied ranks, the
expected rank of each point under the null hypothesis that both samples are
identically distributed is (n + 1) /2. The Mood test uses a test statistic which
measures the extent to which the rank of each observation deviates from the
expected value:

$$M' = \sum_{x_i \in A} \left( r(x_i) - (n+1)/2 \right)^2, \qquad (3)$$

where r (xi ) denotes the rank of xi in the pooled sample.


The distribution of the Mood statistic is independent of the distribution of the
underlying random variables with mean and variance
$$\mu_{M'} = n_A\left(n^2 - 1\right)/12, \qquad \sigma^2_{M'} = n_A n_B (n+1)\left(n^2 - 4\right)/180.$$

Taking the absolute value of the standardized test statistic

$$M = \left| \left(M' - \mu_{M'}\right) / \sigma_{M'} \right| \qquad (4)$$

ensures that both increases and decreases in scale can be detected.
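A minimal Python sketch of the standardized Mood statistic in equations (3)–(4) and of its maximization over candidate change points as in equation (2) is given below; the decision thresholds, which in the article are calibrated to a target average run length following Ross et al. (2011), are not reproduced.

import numpy as np
from scipy.stats import rankdata

def mood_statistic(x, k):
    # |standardized Mood statistic| for the split A = x[:k], B = x[k:].
    n = len(x)
    n_a, n_b = k, n - k
    r = rankdata(x)                                  # ranks in the pooled sample
    m = np.sum((r[:k] - (n + 1) / 2) ** 2)           # equation (3)
    mean = n_a * (n ** 2 - 1) / 12
    var = n_a * n_b * (n + 1) * (n ** 2 - 4) / 180
    return abs((m - mean) / np.sqrt(var))            # equation (4)

def max_mood(x):
    # Maximum over all candidate change points, cf. equation (2).
    ks = range(2, len(x) - 1)
    stats = [mood_statistic(x, k) for k in ks]
    best = int(np.argmax(stats))
    return stats[best], list(ks)[best]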

4 Empirical results
The empirical testing shows that the best result is obtained by focusing on shifts
in the scale parameter. It is not surprising that there is little information in the
mean values (see, e.g., Merton 1980). Testing for more general distributional
changes leads to approximately the same change points being identified with a
larger detection delay compared to when focusing on shifts in scale. This finding
applies to both the VIX and the S&P 500 return series.

6 The rank of the ith observation at time t is defined as $r(x_i) = \sum_{j \neq i}^{t} I\left(x_i \geq x_j\right)$, where I is the indicator function.
In order to avoid a large number of false alarms, the expected time between false positive detections, also referred to as the average run length, is set to 10,000. The choice of average run length is a tradeoff between false positive detections and the delay in detecting changes. A total of 27 change points are detected in each of the two series when using the Mood test for changes in scale (Mood 1954) as implemented by Ross (2015).

Figure 5: Comparison of the change points detected in the VIX and the S&P 500 log-returns.

The detected change points are compared in figure 5. Every second regime is shaded to make it easier to identify the change points. There is no further information in the shading. The detected change points seem very intuitive when shown together with the S&P 500 returns due to the distinct volatility clusters, while the change points detected in the log-returns of the VIX seem more intuitive when shown together with the index. The distance between the change points varies considerably from a few days up to six years. This is different from what would be expected if the change points were based on a business cycle indicator.
Although it is interesting in itself to analyze the change points, it is the times
of detection that really matter. It is not possible to conclude whether changes
in the VIX or past returns on the S&P 500 index yield the earliest signal about
change points as not all the detected change points coincide. The next step is,
therefore, to implement and test a dynamic asset allocation strategy based on
the detected change points in order to determine which of the indices provides
the most profitable signals.

4.1 Test procedure


The testing is done one day at a time in a live-sample setting to make it as
realistic as possible. The first 21 observations, corresponding to one month, are
used to determine the initial allocation. From that point onwards it is tested
each day whether a change point has occurred. If a change point is detected
to have happened on day τ < t after the log-return of the index on day t has
been added to the sample, then the observations from the time of the change

point τ until the time of the detection t are used to estimate the volatility in
the new regime. If, based on the new estimate of the volatility, the allocation to
the stock index changes, then the new allocation is implemented at the closing
of day t + 1, i.e., there is assumed to be a one-day delay in the implementation.
The asset allocation remains unchanged otherwise. The closing level on day
t + 1 is then included in the sample and it is tested whether a new change point
has occurred. The portfolio is not rebalanced until the next change point is
detected.
Upon the detection of a change point, the volatility is estimated as the square
root of an exponentially-weighted moving average of the past variance

EWMA_t = \lambda EWMA_{t-1} + (1 - \lambda) r_t^2.    (5)

The forgetting parameter λ is set to 0.95, corresponding to an effective memory


length of 20 trading days or roughly one month. This EWMA is found to be a
better forecast of future, one-month volatility than the VIX.
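
A minimal Python sketch of this volatility estimate is shown below; the initialization of the recursion and the annualization factor of 252 trading days are assumptions made for illustration:

import numpy as np

def ewma_volatility(returns, lam=0.95, trading_days=252):
    # Annualized volatility from an EWMA of squared returns, cf. equation (5).
    ewma = returns[0] ** 2            # assumed initialization of the recursion
    for r in returns[1:]:
        ewma = lam * ewma + (1 - lam) * r ** 2
    return np.sqrt(trading_days * ewma)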
The tested strategies are fairly simple as the purpose is not to outline the
optimal strategy but rather to discuss the profitability of a DAA approach based
on change-point detection. Assuming the average long-term stock volatility is
20%, a simple strategy is to allocate 50% to stocks and 50% to cash when the
volatility is 20%. If, instead, the volatility is 10%, then the entire portfolio
is allocated to stocks. If the volatility is 30%, then the entire portfolio is
allocated to cash. In figure 6, this simple (linear) long-only strategy with no
leverage is shown as the dashed line. The dotted line corresponds to the same
allocation function with leverage and short-selling allowed. This strategy is
referred to as Long–Short to distinguish it from the Long-only strategy. These
strategies are likely not optimal, but attempting to optimize them may introduce
a significant back-testing bias.

Figure 6: Two simple strategies where the allocation to S&P 500 is a function
of the annualized volatility.
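
In code, the two allocation rules can be sketched as follows; the linear form is inferred from the description above (100% in stocks at 10% volatility, 50% at 20%, and 0% at 30%), so the exact function used may differ:

def allocation_long_short(vol):
    # Linear allocation to the stock index as a function of annualized volatility.
    return 1.5 - 5.0 * vol

def allocation_long_only(vol):
    # Same rule with the weight capped between 0 and 1 (no leverage or short-selling).
    return min(1.0, max(0.0, allocation_long_short(vol)))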

4.2 Performance of dynamic asset allocation strategies


S&P 500 change points. In table 7, the performance of the two dynamic
strategies based on the S&P 500 change points is compared to the performance
of the S&P 500 index and a static, fixed-weight portfolio. The static portfolio
I is rebalanced daily to have a fixed allocation of 61% to the S&P 500 index
which equals the average allocation of the Long-only strategy to the stock index

Index/Strategy AR SD SR MDD
S&P 500 0.071 0.18 0.39 0.57
Long-only strategy 0.056 0.09 0.62 0.31
Static portfolio I 0.047 0.11 0.43 0.39
Long–Short strategy 0.057 0.16 0.36 0.47

Table 7: Performance of two dynamic strategies based on the S&P 500 change points.
The table summarizes the annualized return, standard deviation, Sharpe ratio, and max-
imum drawdown for the S&P 500 index, the two dynamic strategies based on the S&P
500 change points, and a static portfolio that has the same average exposure (61%) to
the S&P 500 index as the Long-only strategy.

over the period from February 1990 through September 2015. The remaining
39% are allocated to cash which is assumed to yield zero interest.
The Long-only strategy has the highest Sharpe ratio (SR) with an annualized re-
turn (AR) of 5.6% and an annualized standard deviation (SD) of 9%.7 Although
the realized return is lower than for the S&P 500 index, the SR is significantly
higher. The Long–Short strategy has the same AR as the Long-only strategy
but a higher SD and MDD, and, consequently, a lower SR.
The Long-only strategy has a lower SD and maximum drawdown8 (MDD) than
the static portfolio I (that has the same average exposure to the S&P 500 index),
but the realized return is higher as long as transaction costs do not exceed 188
basis points per one-way transaction. This is when ignoring the costs associated
with rebalancing to static weights, so the break-even transaction cost is higher
than 188 basis points.
Figure 8 shows the development of the Long-only strategy, the static portfolio
I, and the S&P 500 index. The shaded areas show the Long-only strategy’s
allocation to the stock index. The times of detection of the change points that
result in allocation changes are visible from the plot. Based on the allocation
changes, the volatility forecasts can be inferred. There appear to be many
different levels of volatility. The choice of not rebalancing the dynamic portfolio
between change points tilts it towards a momentum strategy, but the effect
appears to be moderate. In case of a short position, thought of as implemented
using a futures contract, the effect of not rebalancing is the opposite, as the size
of the short position increases when the index goes up and decreases when the
index goes down.
Most of the Long-only strategy’s outperformance relative to the static, fixed-
weight portfolio occurred during the financial crisis in 2008. It is clear why the
7 Neither of the portfolio return series displays enough autocorrelation that it is necessary to
adjust the annualized standard deviations.
8 The maximum drawdown is the largest relative decline from a historical peak in the index
value.
Figure 8: Development of the Long-only strategy based on the S&P 500 change points
compared to a static, fixed-weight portfolio and the S&P 500 index (right axis). The
shaded areas show the Long-only strategy's allocation to the S&P 500 index (left axis).
The legends are sorted according to the final index value.

Long–Short strategy is not profitable as the Long-only strategy is primarily fully


allocated to cash around the peak in 2000 and on the way out of the crisis in
2008, when the market rebounded.

VIX change points. Table 9 summarizes the performance of the same strate-
gies when based on change points detected by analyzing the daily changes in
the VIX. The static portfolio II has a fixed allocation of 64% to the S&P 500
index which equals the average allocation of the Long-only strategy to the stock
index over the period.
The Long-only strategy has a higher AR and SR and a lower MDD than before.
The AR is still lower than that of the index, but the difference is smaller. The
AR of the Long-only strategy exceeds that of the static portfolio II as long as
transaction costs do not exceed 372 basis points per one-way transaction.
The performance of the Long–Short strategy is similar to the index in terms
of SR and it has a lower MDD. This is a significant improvement compared to
when based on the S&P 500 change points, but it is still less profitable than the
Long-only strategy.

Index/Strategy AR SD SR MDD
S&P 500 0.071 0.18 0.39 0.57
Long-only strategy 0.062 0.10 0.64 0.24
Static portfolio II 0.049 0.12 0.42 0.40
Long–Short strategy 0.059 0.15 0.40 0.37

Table 9: Performance of two dynamic strategies based on the VIX change points. The
table summarizes the annualized return, standard deviation, Sharpe ratio, and maximum
drawdown for the S&P 500 index, the two dynamic strategies based on the VIX change
points, and a static portfolio that has the same average exposure (64%) to the S&P 500
index as the Long-only strategy.
Figure 10: Development of the Long-only strategy based on the VIX change points
compared to a static, fixed-weight portfolio and the S&P 500 index (right axis). The
shaded areas show the Long-only strategy's allocation to the S&P 500 index (left axis).
The legends are sorted according to the final index value.

Figure 10 shows the development of the Long-only strategy, the static portfolio
II, and the S&P 500 index. It was evident from figure 5 that not all the change
points coincide between the two series, and by comparing figure 8 and
figure 10 it appears that this is true for the detection times as well.

Compared to figure 8, the Long-only strategy based on the VIX change points
performs better in the years leading up to the burst of the dot–com bubble in
year 2000. The allocation to S&P 500 is significantly reduced in 1998, around
the same time as the allocation was reduced to zero when based on the
S&P 500 change points, but the allocation is then increased to about 60% in
1999. This allocation is retained all the way to the peak in year 2000. Towards
the peak in 2008, the Long-only strategy performs worse when based on the
VIX change points. The allocation to S&P 500 is reduced for the first time
already at the beginning of 2007, but it is not reduced all the way to zero until
November 2008. In figure 8, the allocation was reduced to zero in September
2008, only a few days after Lehman Brothers filed for bankruptcy.
The superior performance of the Long-only strategy compared to the static port-
folio II increases steadily towards the peak in year 2000, through the subsequent
downturn, and towards the peak in year 2008. The difference in performance is
then reduced over the following three years during the market rebound before it
again increases. The outperformance is built up gradually through the 25-year
period and does not come from the financial crisis in 2008; on the contrary, the
gap in performance compared to the static portfolio shrank considerably in the
wake of the crisis.

4.3 Performance of switching strategies


Although the tested strategies are fairly simple, they can be made even simpler
by only allowing one asset in the portfolio at a time. Rather than having the
allocation to the S&P 500 index be a linear function of the forecasted volatility,
the allocation is either 100% or 0% in the long-only case and 100% or −100% in
the long–short case depending on whether the forecasted volatility is above or
below 20%. This threshold is chosen based on the assumption that the average
long-term stock volatility is 20%, which is an arbitrary choice. These simple
switching strategies are similar to a regime-switching approach.
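
In code, the switching rules reduce to a threshold comparison, as in this sketch:

def switching_allocation(vol_forecast, threshold=0.20, long_short=False):
    # All-or-nothing allocation: fully invested below the volatility threshold,
    # otherwise in cash (Long-only) or short the index (Long-Short).
    if vol_forecast < threshold:
        return 1.0
    return -1.0 if long_short else 0.0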
The testing of the switching strategies is carried out one day at a time in a
live-sample setting exactly as before. The focus will be on the change points
detected by analyzing the daily changes in the VIX, since the DAA strategies
already tested were most profitable when based on these. Past returns of the
S&P 500 index are still used to estimate the volatility upon the detection of a
change point.
Table 11 summarizes the performance of the switching strategies based on the
VIX change points. The static portfolio III is rebalanced daily to have a fixed
allocation of 66% to the S&P 500 index, which equals the average allocation of
the Long-only switching strategy to the stock index over the period.
The performance of the Long-only switching strategy is notable. Not only does
it have a much higher SR and a significantly lower MDD than the S&P 500
index, it also has a higher realized return as long as transaction costs do not
exceed 93 basis points per one-way transaction. This is under the assumption

Index/Strategy AR SD SR MDD
S&P 500 0.071 0.18 0.39 0.57
Long-only switching strategy 0.075 0.11 0.68 0.20
Static portfolio III 0.051 0.12 0.42 0.41
Long–Short switching strategy 0.074 0.15 0.50 0.44

Table 11: Performance of two switching strategies based on the VIX change points. The
table summarizes the annualized return, standard deviation, Sharpe ratio, and maximum
drawdown for the S&P 500 index, the two switching strategies based on the VIX change
points, and a static portfolio that has the same average exposure (66%) to the S&P 500
index as the Long-only switching strategy.

that there is no interest on cash. The realized return of the Long-only switching
strategy exceeds that of the static portfolio III as long as transaction costs do
not exceed 627 basis points per one-way transaction and, at the same time, the
MDD is less than half.
The Long–Short switching strategy is only rebalanced when the position changes
from long to short or from short to long. This strategy also has a higher AR
and a lower SD and MDD compared to the S&P 500 index. The SR of 0.50 is
higher than that of the index, but it does not compare to that of the Long-only
switching strategy.
Figure 12 shows the development of the Long-only switching strategy, the static
portfolio III, and the S&P 500 index. The allocation is 100% whenever it was
above 50% in figure 10 and 0% whenever it was below 50%.
Compared to the index, the switching strategy falls behind on the way towards
the peak in year 2000, then it gets ahead during the subsequent downturn and
retains its lead until the peak in year 2008 when the index catches up. The
switching strategy gets far ahead during the crash in 2008 and the difference
compared to the index is at its largest at the market trough in 2009. The lead
almost diminishes during the turbulent second half of 2011, but the switching
strategy retains a small lead all the way to the end of the sample. It is clear
that if the big loss in 2008 was removed from the sample, the dynamic strategies
would not outperform the index in terms of absolute return.
Compared to the static, fixed-weight portfolio, the Long-only switching strategy
gradually extends its lead, with the exception of a few short periods, from the
beginning of the sample to the market trough in 2009. The following three
years the static portfolio regains some of the loss before the difference is again
extended from 2012 to the end of the sample. The Long-only switching strategy
is far in front of the static portfolio throughout the sample as evidenced by the
break-even transaction cost of 627 basis points.
It is easy to get the impression from figure 12 that the applied change-point
method is faster at detecting increases in the volatility than decreases, as the
Figure 12: Development of the Long-only switching strategy based on the VIX change
points compared to a static, fixed-weight portfolio and the S&P 500 index (right axis).
The shaded areas show the Long-only strategy's allocation to the S&P 500 index (left
axis). The legends are sorted according to the final index value.

switching strategy is much better at timing the downturns than the rebounds.
The same conclusion was reached in Nystrup et al. (2015a) based on regimes
inferred with an HMM with time-varying parameters. From a comparison with
figure 10 it appears that the volatility remains high during the beginning of the
rebound as the allocation is increased (to a level below 50%) sooner than it can
be seen from figure 12. It is expected that gradual drift is harder to detect than
abrupt changes.

4.4 Trading the VIX


Given the strong performance of the Long-only switching strategy based on the
VIX change points, it is natural to wonder whether the performance can be
replicated and possibly improved by trading the VIX itself. Instead of buying
the S&P 500 when a change point is detected and the volatility is estimated to
be below 20%, the replicating strategy would be to sell short-term VIX futures
(see Ilmanen 2012, Whaley 2013, Simon and Campasano 2014).
As long as the volatility remains low, a short position will accrue a roll yield as
the futures price converges to the spot price, reflecting the gap between realized

Index/Strategy AR SD SR MDD
S&P 500 0.044 0.21 0.21 0.57
Long-only switching strategy 0.045 0.09 0.47 0.18
VIX short-term futures inverse ER 0.117 0.64 0.18 0.92
Switching futures strategy 0.179 0.31 0.59 0.40

Table 13: Performance of two switching strategies based on the VIX change points. The
table summarizes the annualized return, standard deviation, Sharpe ratio, and maximum
drawdown for the S&P 500 index, the Long-only switching strategy, the S&P 500 VIX
short-term futures inverse daily excess return index, and the switching futures strategy
over the period from December 20, 2005 through September 30, 2015.

and implied volatility.9 When a new change point is detected and the volatility
is estimated to be above 20%, the short position is terminated.
In table 13, the performance of the switching futures strategy is compared to
the S&P 500 VIX short-term futures inverse daily excess return index, the Long-
only switching strategy, and the S&P 500 index over the period from December
20, 2005 through September 30, 2015.10
The return from selling short-term VIX futures is more than three times as high
as the return of the S&P 500 index and the Long-only strategy that switches
in and out of S&P 500, but the strategy is also more volatile. The SR of
the futures strategy is higher than that of the Long-only switching strategy
when transaction costs are ignored. After accounting for transaction costs, the
difference is most likely small, since the futures strategy involves more trading
from rolling the short position (Whaley 2013).

5 Summary and discussion


The new approach to DAA presented in this article was based on sequential
hypothesis testing to detect change points without fitting a model with a fixed
number of regimes to the data, without estimating any parameters, and without
making any assumptions about the distribution of the data. This approach is
very robust given that it is not based on a model or any assumptions about the
data that can change going forward. It is a useful tool for dividing financial
time series into regimes without making any assumptions about the number of
regimes or the distribution of the data within the regimes.
Daily returns of both the VIX and the S&P 500 index were considered as input to
the change-point detection. The VIX was considered due to its forward-looking
9 The VIX futures curve has historically been in contango most of the time, especially when the
VIX was below 20, as documented by Simon and Campasano (2014).
10 This is the longest history available for the S&P 500 VIX short-term futures inverse daily

excess return index (SPVXSPI) provided by S&P Dow Jones Indices.


nature. It was not possible to conclude which of the two series provided
the earliest warning about change points, as not all the detected change points
coincided. However, the testing did show that DAA based on the VIX change
points was most profitable.
Simple switching strategies performed better than strategies where the alloca-
tion to the stock index was a linear function of the estimated volatility despite
the fact that the volatility assumed many different levels in the detected regimes.
The best performing strategy was a switching strategy that was fully invested in
the S&P 500 index in the low-volatility state and cash in the high-volatility state.
This strategy outperformed both the S&P 500 index and a static portfolio with
the same average allocation to the stock index both in terms of Sharpe ratio and
realized return and it had a significantly lower tail risk. Due to the assumption
of zero interest on cash positions, there was no other source of performance than
the index. A similar Sharpe ratio could be obtained by selling short-term VIX
futures instead of buying the S&P 500 index in the low-volatility state.
The analysis focused on the S&P 500 price index because of its link to the VIX.
If the price index was replaced by the total return version of the S&P 500 index
and the return on cash was assumed to be the daily risk-free rate rather than
zero, then the realized returns would be higher, but the relative performance of
the strategies and the break-even transaction costs would be almost the same.
The tested strategies may be based on larger changes in allocation than most
investors are willing to and/or allowed to implement. The excess return that
can be obtained will simply be proportional to the fraction of the portfolio that
is allocated to the dynamic strategy.
The presented results have important implications for portfolio managers with a
medium to long-term investment horizon. Even without any level of forecasting
skill it is not optimal to hold a static, fixed-weight portfolio. This new robust ap-
proach to DAA has the potential to improve (risk-adjusted) returns and reduce
tail risk compared to traditional static SAA.

References
Andersen, T. G., P. H. Frederiksen, and A. D. Staal. “The information content of
realized volatility forecasts.” Working paper, Northwestern University (2007).

Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual


Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.

Black, F. “Studies of stock price volatility changes.” In Proceedings of the 1976


Meetings of the American Statistical Association, Business and Economics
Statistics Section (1976), pp. 177–181.

Bulla, J., S. Mergner, I. Bulla, A. Sesboüé, and C. Chesneau. “Markov-switching


asset allocation: Do profitable strategies exist?” Journal of Asset Manage-
ment, vol. 12, no. 5 (2011), pp. 310–321.

Guidolin, M. “Markov switching models in empirical finance.” In Missing Data


Methods: Time-Series Methods and Applications, edited by D. M. Drukker,
vol. 27b of Advances in Econometrics. Emerald Group Publishing: Bingley
(2011b), pp. 1–86.

Guidolin, M. and A. Timmermann. “Asset allocation under multivariate regime


switching.” Journal of Economic Dynamics and Control, vol. 31, no. 11 (2007),
pp. 3503–3544.

Ilmanen, A. “Do financial markets reward buying or selling insurance and lottery
tickets?” Financial Analysts Journal, vol. 68, no. 5 (2012), pp. 26–36.

Kritzman, M. and Y. Li. “Skulls, financial turbulence, and risk management.”


Financial Analysts Journal, vol. 66, no. 5 (2010), pp. 30–41.

Kritzman, M., S. Page, and D. Turkington. “Regime shifts: Implications for


dynamic strategies.” Financial Analysts Journal, vol. 68, no. 3 (2012), pp.
22–39.

Lepage, Y. “A combination of Wilcoxon’s and Ansari–Bradley’s statistics.”


Biometrika, vol. 58, no. 1 (1971), pp. 213–217.

Mandelbrot, B. “The variation of certain speculative prices.” Journal of Business,


vol. 36, no. 4 (1963), pp. 394–419.

Mann, H. B. and D. R. Whitney. “On a test of whether one of two random


variables is stochastically larger than the other.” Annals of Mathematical
Statistics, vol. 18, no. 1 (1947), pp. 50–60.

Merton, R. C. “On estimating the expected return on the market: An ex-


ploratory investigation.” Journal of Financial Economics, vol. 8, no. 4 (1980),
pp. 323–361.

Mood, A. M. “On the asymptotic efficiency of certain nonparametric two-sample


tests.” Annals of Mathematical Statistics, vol. 25, no. 3 (1954), pp. 514–522.

Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Regime-based ver-


sus static asset allocation: Letting the data speak.” Journal of Portfolio
Management, vol. 42, no. 1 (2015a), pp. 103–109.

Nystrup, P., H. Madsen, and E. Lindström. “Long memory of financial time


series and hidden Markov models with time-varying parameters.” Journal of
Forecasting, vol. 36, no. 8 (2017b), pp. 989–1002.

Page, E. S. “Continuous inspection schemes.” Biometrika, vol. 41, no. 1–2 (1954),
pp. 100–115.
Racicot, F. É. and R. Théoret. “Macroeconomic shocks, forward-looking dynam-
ics, and the behavior of hedge funds.” Journal of Banking & Finance, vol. 62
(2016), pp. 41–61.
Roberts, S. W. “Control chart tests based on geometric moving averages.” Tech-
nometrics, vol. 1, no. 3 (1959), pp. 239–250.
Ross, G. J. “Modelling financial volatility in the presence of abrupt changes.”
Physica A: Statistical Mechanics and its Applications, vol. 392, no. 2 (2013),
pp. 350–360.
Ross, G. J. “Parametric and nonparametric sequential change detection in R:
The cpm package.” Journal of Statistical Software, vol. 66, no. 3 (2015), pp.
1–20.
Ross, G. J. and N. M. Adams. “Two nonparametric control charts for detecting
arbitrary distribution changes.” Journal of Quality Technology, vol. 44, no. 2
(2012), p. 102.
Ross, G. J., D. K. Tasoulis, and N. M. Adams. “Nonparametric monitoring of
data streams for changes in location and scale.” Technometrics, vol. 53, no. 4
(2011), pp. 379–389.
Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.
Sheikh, A. Z. and J. Sun. “Regime change: Implications of macroeconomic shifts
on asset class and portfolio performance.” Journal of Investing, vol. 21, no. 3
(2012), pp. 36–54.
Siegmund, D. and E. S. Venkatraman. “Using the generalized likelihood ra-
tio statistic for sequential detection of a change-point.” Annals of Statistics,
vol. 23, no. 1 (1995), pp. 255–271.
Simon, D. P. and J. Campasano. “The VIX futures basis: Evidence and trading
strategies.” Journal of Derivatives, vol. 21, no. 3 (2014), pp. 54–69.
Whaley, R. E. “The investor fear gauge.” Journal of Portfolio Management,
vol. 26, no. 3 (2000), pp. 12–17.
Whaley, R. E. “Trading volatility: At what cost?” Journal of Portfolio Manage-
ment, vol. 40, no. 1 (2013), pp. 95–108.
PAPER F
To appear in Advances in Data Analysis and Classification

Greedy Gaussian segmentation
of multivariate time series

David Hallac, Peter Nystrup, and Stephen Boyd

Abstract

We consider the problem of breaking a multivariate (vector) time series


into segments over which the data is well explained as independent samples
from a Gaussian distribution. We formulate this as a covariance-regularized
maximum likelihood problem, which can be reduced to a combinatorial op-
timization problem of searching over the possible breakpoints, or segment
boundaries. This problem can be solved using dynamic programming, with
complexity that grows with the square of the time series length. We pro-
pose a heuristic method that approximately solves the problem in linear
time with respect to this length, and always yields a locally optimal choice,
in the sense that no change of any one breakpoint improves the objective.
Our method, which we call greedy Gaussian segmentation (GGS), easily
scales to problems with vectors of dimension over 1,000 and time series of
arbitrary length. We discuss methods that can be used to validate such a
model using data, and also to automatically choose appropriate values of
the two hyperparameters in the method. Finally, we illustrate our GGS
approach on financial time series and Wikipedia text data.

Keywords: Time series analysis; Change-point detection; Financial regimes;


Text segmentation; Covariance regularization; Greedy algorithms.

1 Introduction
Many applications, including weather measurements (Xu 2002), car sensors (Hal-
lac et al. 2016), and financial returns (Nystrup et al. 2017a), contain long se-
quences of multivariate time series data. With datasets such as these, there are
many benefits to partitioning the time series into segments, where each segment
can be explained by as simple a model as possible. Partitioning can be used
for denoising (Abonyi et al. 2005), anomaly detection (Rajagopalan and Ray
2006), regime-change identification (Nystrup et al. 2016), and more. Breaking
a large dataset down into smaller, simpler components is also a key aspect of
many unsupervised learning algorithms (Hastie et al. 2009, chapter 14).
In this paper, we analyze the time series partitioning problem by formulating
it as a covariance-regularized likelihood maximization problem, where the data
in each segment can be explained as independent samples from a multivariate


Gaussian distribution. We propose an efficient heuristic, which we call the greedy
Gaussian segmentation (GGS) algorithm, that approximately finds the optimal
breakpoints using a greedy homotopy approach based on the number of segments
(Zangwill and Garcia 1981). The memory usage of the algorithm is a modest
multiple of the memory used to represent the original data, and the time com-
plexity is linear in the number of observations, with significant opportunities for
exploiting parallelism. Our method is able to scale to arbitrarily long time series
and multivariate vectors of dimension over 1,000. We also discuss several exten-
sions of this approach, including a streaming algorithm for real-time partitioning,
as well as a method of validating the model and selecting optimal values of the
hyperparameters. Last, we implement the GGS algorithm in a Python soft-
ware package GGS, available online at https://github.com/cvxgrp/GGS, and
apply it to various financial time series and Wikipedia text data to illustrate
our method’s accuracy, scalability, and interpretability.

1.1 Related work


This work relates to recent advancements in both optimization and time series
segmentation. Many variants of our problem have been studied in several con-
texts, including Bayesian change-point detection (Booth and Smith 1982, Lee
1998, Son and Kim 2005, Cheon and Kim 2010, Bauwens and Rombouts 2012),
change-point detection based on hypothesis testing (Crosier 1988, Venter and
Steel 1996, De Gooijer 2006, Galeano and Wied 2014, Li 2015a), mixture mod-
els (Verbeek et al. 2003, Abonyi et al. 2005, Picard et al. 2011, Samé et al.
2011), hidden Markov models and the Viterbi algorithm (Rydén et al. 1998, Ge
and Smyth 2001, Bulla 2011, Hu et al. 2015, Nystrup et al. 2017b), and convex
segmentation (Katz and Crammer 2015), all trying to find breakpoints in time
series data.

The different methods make different assumptions about the data (see Esling
and Agon 2012, for a comprehensive survey). GGS assumes that, in each seg-
ment, the mean and covariance are constant and unrelated to the means and
covariances in all other segments. This differs from ergodic hidden Markov mod-
els, which implicitly assume that the underlying segments will repeat themselves,
with some structure to when the transitions are likely to occur. In a left-to-right
hidden Markov model (Bakis 1976, Cappé et al. 2005), though, additional con-
straints are imposed to ensure non-repeatability of segments, similar to GGS.
Alternatively, trend filtering problems (Kim et al. 2009) assume that neighbor-
ing segments have similar statistical parameters; when a transition occurs, the
new parameters are not too far from the previous ones. Other models have tried
to solve the problem of change-point detection when the number of breakpoints
is unknown (Basseville and Nikiforov 1993, Chouakria-Douzal 2003), including
in streaming settings (Guralnik and Srivastava 1999, Gustafsson 2000).

GGS uses a straightforward approach based on the maximum likelihood of the


data (we address how to incorporate many of these alternative assumptions in
section 5). In real world contexts, deciding on which approach to use depends
entirely on the underlying structure of the data; a reasonable choice of method
can be determined via cross-validation of the various models. Our work is novel
in that it allows for an extremely scalable greedy algorithm to detect breakpoints
in multivariate time series. That is, GGS is able to solve much larger problems
than many of these other methods, both in terms of vector dimension and the
length of the time series. Additionally, its robustness allows GGS to be used as
a black-box method which can automatically determine an appropriate number
of breakpoints, as well as the model parameters within each segment, using
cross-validation.

Our greedy algorithm is based on a top-down approach to segmentation (Dou-


glas and Peucker 1973), though there has also been related work using bottom-
up methods (Keogh et al. 2004). While our algorithm does achieve a locally
optimal solution, we note that it is possible to solve for the global optimum us-
ing dynamic programming (Bellman 1961, Fragkou et al. 2004, Kehagias et al.
2006). However, these globally optimal approaches have complexities that grow
with the square of the time series length, whereas our heuristic method scales
linearly with the time series length. Our model approximates ℓ1 /ℓ2 trend filter-
ing problems (Kim et al. 2009, Wahlberg et al. 2011, 2012), which typically use
a penalty based on the fused group lasso (Tibshirani et al. 2005, Bleakley and
Vert 2011) to couple together the model parameters at adjacent times. How-
ever, these models are unable to scale up to the sizes we are aiming for, so we
develop a fast heuristic, similar to an ℓ0 penalty (Candès et al. 2008), where
each breakpoint splits the time series into two independent problems. To ensure
robustness, we rely on covariance-regularized regression to avoid errors when
there are more dimensions than samples in a segment (Witten and Tibshirani
2009).

1.2 Outline

The rest of this paper is structured as follows. In section 2, we formally define


our optimization problem. In section 3, we explain the GGS algorithm for
approximately solving the problem in a scalable way. In section 4, we describe
a validation process for choosing the two hyperparameters in our model. We
then examine in section 5 several extensions of this approach which allow us to
apply our algorithm to new types of problems. Finally, we apply GGS to several
real-world financial and Wikipedia datasets, as well as a synthetic example, in
section 6.

2 Problem setup
2.1 Segmented Gaussian model
We consider a given time series x1 , . . . , xT ∈ Rn . (The times t = 1, . . . , T need
not be uniformly spaced in real time; all that matters in our model and method
is that they are ordered.) We will assume that the xt ’s are independent samples
with xt ∼ N (µt , Σt ), where the mean µt and covariance Σt only change at
K ≪ T breakpoints b1 , . . . , bK . These breakpoints divide the given T samples
into K + 1 segments; in each segment, the xt ’s are generated from the same
multivariate normal distribution. Our goal is to determine K, the breakpoints
b1 , . . . , bK , and the means and covariances

µ(1) , . . . , µ(K+1) , Σ(1) , . . . , Σ(K+1)

in the K + 1 segments between the breakpoints.

Introducing breakpoints b0 and bK+1 , the breakpoints must satisfy

1 = b0 < b1 < · · · < bK < bK+1 = T + 1,

and the means and covariances are given by

(µt , Σt ) = (µ(i) , Σ(i) ), bi−1 ≤ t < bi , i = 1, . . . , K + 1.

(The subscript t denotes time t; the superscript (i) and subscript on b denotes
segment i.)
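
For illustration, the following Python sketch draws a synthetic sample x_1, . . . , x_T from an SGM with given breakpoints and segment parameters (a hypothetical helper, not part of the GGS package):

import numpy as np

def sample_sgm(T, breakpoints, mus, Sigmas, seed=0):
    # Draw x_1, ..., x_T from a segmented Gaussian model. breakpoints holds
    # b_1, ..., b_K; b_0 = 1 and b_{K+1} = T + 1 are added implicitly.
    rng = np.random.default_rng(seed)
    bounds = [1] + list(breakpoints) + [T + 1]
    segments = []
    for i in range(len(bounds) - 1):
        length = bounds[i + 1] - bounds[i]
        segments.append(rng.multivariate_normal(mus[i], Sigmas[i], size=length))
    return np.vstack(segments)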

We refer to this parametrized distribution of x1 , . . . , xT as the segmented Gaus-


sian model (SGM). The log-likelihood of the data x1 , . . . , xT under this model
is given by

\ell(b, \mu, \Sigma) = \sum_{t=1}^{T} \left( -\frac{1}{2} (x_t - \mu_t)^T \Sigma_t^{-1} (x_t - \mu_t) - \frac{1}{2} \log\det \Sigma_t - \frac{n}{2} \log(2\pi) \right)

                     = \sum_{i=1}^{K+1} \sum_{t=b_{i-1}}^{b_i - 1} \left( -\frac{1}{2} (x_t - \mu^{(i)})^T (\Sigma^{(i)})^{-1} (x_t - \mu^{(i)}) - \frac{1}{2} \log\det \Sigma^{(i)} - \frac{n}{2} \log(2\pi) \right)

                     = \sum_{i=1}^{K+1} \ell^{(i)}(b_{i-1}, b_i, \mu^{(i)}, \Sigma^{(i)}),

where

\ell^{(i)}(b_{i-1}, b_i, \mu^{(i)}, \Sigma^{(i)}) = \sum_{t=b_{i-1}}^{b_i - 1} \left( -\frac{1}{2} (x_t - \mu^{(i)})^T (\Sigma^{(i)})^{-1} (x_t - \mu^{(i)}) - \frac{1}{2} \log\det \Sigma^{(i)} - \frac{n}{2} \log(2\pi) \right)

                                                  = -\frac{1}{2} \sum_{t=b_{i-1}}^{b_i - 1} (x_t - \mu^{(i)})^T (\Sigma^{(i)})^{-1} (x_t - \mu^{(i)}) - \frac{b_i - b_{i-1}}{2} \left( \log\det \Sigma^{(i)} + n \log(2\pi) \right)

is the contribution from the i’th segment. Here we use the notation b =
(b1 , . . . , bK ), µ = (µ(1) , . . . , µ(K+1) ), and Σ = (Σ(1) , . . . , Σ(K+1) ), for the pa-
rameters in the SGM. In all the expressions above we define log det Σ as −∞ if
Σ is singular, i.e., not positive definite. Note that bi − bi−1 is the length of the
i’th segment.

2.2 Regularized maximum-likelihood estimation


We will choose the model parameters by maximizing the covariance-regularized
log-likelihood for a given value of K, the number of breakpoints. We regularize
the covariance to avoid errors when there are more dimensions than samples
in a segment, a well-known problem in high-dimensional settings (Huang et al.
2006, Bickel and Levina 2008, Witten and Tibshirani 2009). Thus we choose b,
µ, and Σ to maximize the regularized log-likelihood


K+1
ϕ(b, µ, Σ) = ℓ(b, µ, Σ) − λ Tr(Σ(i) )−1
i=1
(1)
∑(
K+1 )
(i) −1
= ℓ (bi−1 , bi , µ , Σ ) − λTr(Σ )
(i) (i) (i)
,
i=1

where λ ≥ 0 is a regularization parameter, with K fixed. (We discuss the choice


of the hyperparameters λ and K in section 4.) This is a mixed combinatorial
and continuous optimization problem since it involves a search over the \binom{T-1}{K}
possible choices of the breakpoints b1 , . . . , bK , as well as the parameters µ and Σ.
For λ = 0, this reduces to maximum-likelihood estimation, but we will assume
henceforth that λ > 0. This implies that we will only consider positive definite
(invertible) estimated covariance matrices.
If the breakpoints b are fixed, the regularized maximum-likelihood problem has
a simple analytical solution. The optimal value of the i’th segment mean is the
empirical mean over the segment,

\mu^{(i)} = \frac{1}{b_i - b_{i-1}} \sum_{t=b_{i-1}}^{b_i - 1} x_t,    (2)

and the optimal value of the i'th segment covariance is

\Sigma^{(i)} = S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I,    (3)

where S^{(i)} is the empirical covariance over the segment:

S^{(i)} = \frac{1}{b_i - b_{i-1}} \sum_{t=b_{i-1}}^{b_i - 1} (x_t - \mu^{(i)})(x_t - \mu^{(i)})^T.

Note that the empirical covariance S (i) can be singular, for example when bi −
bi−1 < n, but for λ > 0 (which we assume), Σ(i) is always positive definite. Thus,
for any fixed choice of breakpoints b, the mean and covariance parameters that
maximize the regularized log-likelihood (1) are given by (2) and (3), respectively.
The optimal value of the covariance (3) is similar to a Stein-type shrinkage
estimator (Ledoit and Wolf 2004).
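
In code, these estimates for a single segment can be computed as in the following sketch (0-indexed arrays instead of the 1-indexed notation above):

import numpy as np

def segment_parameters(x, start, end, lam):
    # Regularized ML estimates for the segment x[start:end], cf. (2) and (3).
    seg = x[start:end]
    m, n = seg.shape                      # segment length and dimension
    mu = seg.mean(axis=0)                 # empirical mean, equation (2)
    centered = seg - mu
    S = centered.T @ centered / m         # empirical covariance
    Sigma = S + (lam / m) * np.eye(n)     # regularized covariance, equation (3)
    return mu, Sigma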
Using these optimal values of the mean and covariance parameters, the regular-
ized log-likelihood (1) can be expressed in terms of b alone, as

\phi(b) = C - \frac{1}{2} \sum_{i=1}^{K+1} \left( (b_i - b_{i-1}) \log\det\left(S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I\right) - \lambda \, \mathrm{Tr}\left(S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I\right)^{-1} \right)

        = C + \sum_{i=1}^{K+1} \psi(b_{i-1}, b_i),

where C = -(Tn/2)(\log(2\pi) + 1) is a constant that does not depend on b, and

\psi(b_{i-1}, b_i) = -\frac{1}{2} \left( (b_i - b_{i-1}) \log\det\left(S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I\right) - \lambda \, \mathrm{Tr}\left(S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I\right)^{-1} \right).

(Note that S^{(i)} depends on b_{i-1} and b_i.) Without regularization, i.e., with λ = 0,
we have

\psi(b_{i-1}, b_i) = -\frac{1}{2} (b_i - b_{i-1}) \log\det S^{(i)}.

More generally, we have reduced the regularized maximum-likelihood-estimation


problem, for fixed values of K and λ, to the purely combinatorial problem
maximize   -\frac{1}{2} \sum_{i=1}^{K+1} \left( (b_i - b_{i-1}) \log\det\left(S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I\right) - \lambda \, \mathrm{Tr}\left(S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I\right)^{-1} \right),    (4)

where the variable to be chosen is the collection of breakpoints b = (b_1, \ldots, b_K).
These can take \binom{T-1}{K} possible values. Note that the breakpoints b_i appear in the
objective of (4) both explicitly and implicitly, through the empirical covariance
matrices S (i) , which depend on the breakpoints.

Efficiently computing the objective. For future reference, we mention how


the objective in (4) can be computed given b. We first compute the empirical
covariance matrices S (i) , which costs order T n2 flops. This step can be carried
out in parallel, on up to K + 1 processors. The storage required to store these
matrices is order Kn2 doubles. For comparison, the storage required for the
original problem data is T n. Since we typically have Kn ≤ T , i.e., the average
segment length is at least n, the storage of S (i) is no more than the storage of
the original data.
For each segment i = 1, . . . , K + 1, we carry out the following steps (again,
possibly in parallel) to evaluate ψ(bi−1 , bi ). We first carry out the Cholesky
factorization

L L^T = S^{(i)} + \frac{\lambda}{b_i - b_{i-1}} I,

where L is lower triangular with positive diagonal entries, which costs order
n^3 flops. The log-determinant term can be computed in order n flops, as
2 \sum_{i=1}^{n} \log(L_{ii}), and the trace term in order n^3 flops, as \|L^{-1}\|_F^2. The overall
complexity of evaluating the objective is order T n^2 + K n^3 flops, and this
all complexity of evaluating the objective is order T n2 + Kn3 flops, and this
can be easily parallelized into K + 1 independent tasks. While we make no
assumptions about T , n, and K (other than K < T ), the two terms are equal
in order when T = Kn, which means that the average segment length is on
the order of n, the vector dimension. This is the threshold at which the empiri-
cal covariance matrices (can) become nonsingular, though in most applications,
useful values of K are much smaller, which means the first term dominates (in
order). With the assumption that the average segment length is at least n, the
overall complexity of evaluating the objective is T n2 .
As an example, we might expect a serial implementation for a data set with
T = 1000 and n = 100 to require on the order of 0.01 seconds to evaluate
the objective, using the very conservative estimate of 1Gflop/sec for computer
speed.
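
Concretely, ψ(b_{i-1}, b_i) can be evaluated for one segment as in the following Python sketch, which computes the log-determinant and trace terms from the Cholesky factor exactly as written above (the helper is illustrative, not part of the GGS package):

import numpy as np

def segment_psi(x_seg, lam):
    # Evaluate psi for the m x n segment x_seg via a Cholesky factorization.
    m, n = x_seg.shape
    centered = x_seg - x_seg.mean(axis=0)
    S = centered.T @ centered / m
    L = np.linalg.cholesky(S + (lam / m) * np.eye(n))
    logdet = 2 * np.sum(np.log(np.diag(L)))       # log det via the Cholesky factor
    trace_inv = np.sum(np.linalg.inv(L) ** 2)     # ||L^{-1}||_F^2 = Tr((LL^T)^{-1})
    return -0.5 * (m * logdet - lam * trace_inv)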

Globally optimal solution. The problem (4) can be solved globally by dy-
namic programming (Bellman 1961, Fragkou et al. 2004, Kehagias et al. 2006).
We take as states the set of pairs (bi−1 , bi ), with bi−1 < bi , so the state space
has cardinality T (T − 1)/2. We consider the selection of a sequence of K states,
with the state transition constraint that (p, q) must be succeeded by a state of
the form (q, r). The complexity of this dynamic programming method is n^3 K T^2.
Our interest, however, is in a method for large T , so we instead seek a heuristic
method for solving (4) approximately, but with linear complexity in the time
series length T .

Our method. In section 3, we describe a heuristic method for approximately


solving problem (4). The method is not guaranteed to find the globally optimal
choice of breakpoints, but it does find breakpoints with high (if not always high-
est) objective value, and the ones it finds are 1-OPT, meaning that no change
of any one breakpoint can increase the objective. The storage requirements of
the method are on the order of the storage required to evaluate the objective,
and the computational cost is typically smaller than a few hundred evaluations
of the objective function.

3 Greedy Gaussian segmentation


In this section we describe a greedy algorithm for fitting an SGM to data, which
we call greedy Gaussian segmentation. GGS computes an approximate solution
of (4) in a scalable way, in each iteration adding one breakpoint and then ad-
justing all the breakpoints to (approximately) maximize the objective. In the
literature on time series segmentation, this is similar to the standard “top-down”
approach (Keogh et al. 2004).

3.1 Split subroutine


The main building block of our algorithm is the Split subroutine. The function
Split(bi−1 , bi ) takes segment i and finds the t that maximizes ψ(bi−1 , t) + ψ(t, bi )
over all values of t between bi−1 and bi . (We assume that bi −bi−1 > 1; otherwise
we cannot split the i’th segment into two segments.) The time t = Split(bi−1 , bi )
is the optimal place to add a breakpoint between bi−1 and bi . The value of
ψ(bi−1 , t) + ψ(t, bi ) − ψ(bi−1 , bi ) is the increase in the objective if we add a new
breakpoint at t. This is highest when we choose t = Split(bi−1 , bi ). Due to
the regularization term, it is possible for this maximum increase to be negative,
which means that adding any breakpoint between bi−1 and bi actually decreases
the objective. The Split subroutine is summarized in algorithm 1.
In Split, line 3, updating the empirical mean and covariance of the left and right
segments resulting from adding a breakpoint at t is done in a recursive setting

Algorithm 1: Splitting a single interval into two separate segments.


Input: xbi−1 , . . . , xbi , along with empirical mean µ and covariance Σ.
1: initialize µleft = 0, µright = µ, Σleft = λI, Σright = Σ + λI.
2: for t = bi−1 + 1, . . . , bi − 1 do
3: Update µleft , µright , Σleft , Σright .
4: Calculate ψt = ψ(bi−1 , t) + ψ(t, bi ).
5: end for
6: return The t which maximizes ψt and the value of ψt − ψ(bi−1 , bi ) for that t.

Algorithm 2: Greedy Gaussian segmentation.


Input: x1 , . . . , xT , K max .
1: initialize b0 = 1, b1 = T + 1.
2: for K = 0, . . . , K max -1 do
AddNewBreakpoint:
3: for i = 1, . . . , K + 1 do
4: (ti , ψincrease ) = Split(bi−1 , bi ).
5: end for
6: if All ψincrease ’s are negative and K > 0 then
7: return (b1 , . . . , bK ).
8: else if All ψincrease ’s are negative then
9: return ().
10: end if
11: Add a new breakpoint at the ti with the largest corresponding value of ψincrease .
12: Relabel the breakpoints so that 1 = b0 < b1 < · · · < bK+1 < bK+2 = T + 1.
AdjustBreakpoints:
13: repeat
14: for i = 1, . . . , K do
15: (ti , ℓincrease ) = Split(bi−1 , bi+1 ).
16: If ti ̸= bi , set bi = ti .
17: end for
18: until Stationary.
19: end for
20: return (b1 , . . . , bK ).

in order n2 flops (Welford 1962). Line 4, evaluating ψt requires order n3 flops,


which dominates. The total cost of running Split is order (bi − bi−1 )n3 .
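
A sketch of one such recursive update, in the spirit of Welford's method, is given below (the exact bookkeeping in the GGS implementation may differ):

import numpy as np

def add_observation(mu, scatter, count, x):
    # Update the running mean and scatter matrix (scatter / count is the
    # empirical covariance) when one observation x is added; order n^2 flops.
    count += 1
    delta = x - mu
    mu = mu + delta / count
    scatter = scatter + np.outer(delta, x - mu)   # rank-one update
    return mu, scatter, count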

3.2 GGS algorithm


We can use the Split subroutine to develop a simple greedy method for finding
good choices of K breakpoints, for K = 1, . . . , K max , by alternating between
adding a new breakpoint to the current set of breakpoints, and then adjusting
the positions of all breakpoints until the result is 1-OPT, i.e., no change of
any one breakpoint improves the objective. This GGS approach is outlined in
algorithm 2.

In line 2, we loop over the addition of new breakpoints, adding exactly one new
breakpoint each iteration. Thus, the algorithm finds good sets of breakpoints, for
K = 1, . . . , K max , unless it quits early in line 6. This occurs when the addition of
any new breakpoint will decrease the objective. In AdjustBreakpoints, we loop
over the current segmentation and adjust each breakpoint alone to maximize the
objective. In this step the objective can either increase or stay the same, and we
repeat until the current choice of breakpoints is 1-OPT. In AdjustBreakpoints,
there is no need to call Split(bi−1 , bi+1 ) more than once if the arguments have
not changed.
The outer loop over K must be run serially, since in each iteration we start with
the breakpoints from the previous iteration. Lines 3 and 4 (in AddNewBreak-
point) can be run in parallel over the K + 1 segments. We can also parallelize
AdjustBreakpoints, by alternately adjusting the even and odd breakpoints (each
of which can be parallelized) until stationarity. GGS requires storage on the
order of Kn2 numbers. As already mentioned, this is typically the same order
as, or less than, the storage required for the original data.
Ignoring opportunities for parallelization, running iteration K of GGS requires
order KLn3 T flops, where L is the average number of iterations required in Ad-
justBreakpoints. When parallelized, the complexity drops to Ln3 T flops. While
we do not know an upper bound on L, we have observed empirically that it is
modest when K is not too large; that is, AdjustBreakpoints runs just a few outer
loops over the breakpoints. Summing from K = 1 to K = K max , and assuming
L is a constant, gives a complexity of order (K max )2 n3 T without paralleliza-
tion, or K max n3 T with parallelization. In contrast, the dynamic programming
method (Bellman 1961, Fragkou et al. 2004, Kehagias et al. 2006) requires order
K max n3 T 2 flops.

4 Validation and parameter selection


Our GGS method has just two hyperparameters: λ, which controls the amount
of covariance regularization, and K max , the maximum number of breakpoints.
In applications where the reason for segmentation is to identify interesting times
where the statistics of the data change, K (and λ) might be chosen by hand, or
by aesthetic or other considerations, such as whether the segmentation identifies
known or suspected times when something changed. The hyperparameter values
can also be chosen by a more principled method, such as Bayesian or Akaike
information criterion (Hastie et al. 2009, chapter 7). In this section, we describe
a simple method of selecting the hyperparameters through out-of-sample or cross
validation. We first describe the basic idea with 10:1 out-of-sample validation.
We remove 10% of the data at random, leaving us with 0.9T remaining samples.
The 10% of samples are our test set, and the remaining samples are the training
set, which we use to fit our model. We choose some reasonable value for K max ,
such as K max = (T /n)/3 (which corresponds to the average segment length 3n)
or a much smaller number when T /n is large. For multiple values of λ, typically
logarithmically spaced over a wide range, we run the GGS algorithm. This gives
us one SGM for each value of λ and each value of K. For each of these SGMs,
we note the log-likelihood on the training data, and also on the test data. (It is
convenient to divide each of these by the number of data points, so they become
the average log-likelihood per sample. In this way the numbers for the training
and test sets can be compared.) To calculate the log-likelihood on the test set,
we simply evaluate
\ell(x_t) = -\frac{1}{2} (x_t - \mu^{(i)})^T (\Sigma^{(i)})^{-1} (x_t - \mu^{(i)}) - \frac{1}{2} \log\det \Sigma^{(i)} - \frac{n}{2} \log(2\pi),

if t falls in the i'th segment of the model. The overall test set log-likelihood is
then defined, on a test set X, as \frac{1}{|X|} \sum_{x_t \in X} \ell(x_t). Note that when t is the time
index of a sample in the test set, it cannot be a breakpoint of the model, since
the model was developed using the data in the training set.
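
A Python sketch of this evaluation, given fitted segment parameters and breakpoints, might look as follows (the helper is illustrative and not part of the GGS package):

import numpy as np

def average_test_log_likelihood(x_test, t_test, breakpoints, mus, Sigmas):
    # Average log-likelihood of held-out samples under a fitted SGM.
    total = 0.0
    for x, t in zip(x_test, t_test):
        i = np.searchsorted(breakpoints, t, side='right')  # segment containing time t
        diff = x - mus[i]
        _, logdet = np.linalg.slogdet(Sigmas[i])
        total += (-0.5 * diff @ np.linalg.solve(Sigmas[i], diff)
                  - 0.5 * logdet - 0.5 * len(x) * np.log(2 * np.pi))
    return total / len(x_test)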
We then apply standard principles of validation. If for a particular SGM (found
by GGS with a particular value of λ and K) the average log-likelihood on the
training and test sets is similar, we conclude the model is not overfit, and there-
fore a reasonable candidate. Among candidate models, we then choose one that
has a high value of average log-likelihood. If many models have reasonably high
average log-likelihood, we choose one with a small value of K and a large value
of λ. (In the former case to get the simplest model that explains the data, and
in the latter case to get the least sensitive model that explains the data.)
Standard cross-validation is an extension of out-of-sample validation that can
give us even more confidence in a proposed SGM. In cross-validation we divide
the original data into ten equal size ‘folds’ of randomly chosen samples, and
carry out out-of-sample validation ten times, with each fold as the test set. If
the results are reasonably consistent across the folds, both in terms of training
and test average log-likelihood and the breakpoints themselves, we can have
confidence that the SGM fits the data.

5 Variations and extensions


The basic model and method can be extended in many ways, several of which
we describe here.

Warm-start. GGS builds SGMs by increasing K, starting from K = 0. It can


also be used in warm-start mode, meaning we start the algorithm from a given
choice of initial breakpoints. As an extreme version, we can start with a random
set of K breakpoints, and then run AdjustBreakpoints until we have a 1-OPT
solution. The main benefit of a warm start is that it allows for a significant
computational speedup. Whereas a (parallelized) GGS algorithm has a runtime


of O(K max T n3 ), this warm-start method takes only O(T n3 ), since it can skip
the first K max − 1 steps of algorithm 2. However, as we will show in section 6.2,
this speedup comes with a tradeoff, as the solution accuracy tends to drop when
running GGS in warm-start mode as compared to the original algorithm.

Backtracking. In GGS, we add one breakpoint per iteration. While we adjust


the previous breakpoints found, we never remove a breakpoint. One variation
is to occasionally remove a breakpoint. This can be done using a subroutine
called Combine. This function evaluates, for each breakpoint, the decrease in
objective value if that breakpoint is removed. In a backtracking step, we remove
the breakpoint that decreases the objective the least; we can then adjust the
remaining breakpoints and continue with the GGS algorithm by adding a new
breakpoint. (If we end up adding the breakpoint we removed back in, nothing
has been achieved.) We also note that backtracking allows for GGS to be solved
by a bottom-up method (Keogh et al. 2004, Borenstein and Ullman 2008). We
do so by starting with T − 1 breakpoints and continually backtracking until only
K breakpoints remain.

Streaming. We can deploy GGS when the data is streaming. We maintain a


memory of the last M samples and run GGS on this data set. We could do this
from scratch as each new data point or group of data points arrives, complete
with selection of the hyperparameters and validation. Another option is to fix λ
and K, and then run GGS in warm-start mode, which means that we keep the
previous breakpoints (shifted appropriately), and then run AdjustBreakpoints
from this starting point (as well as AddBreakpoint if a breakpoint has fallen off
our memory).
In streaming mode, the GGS algorithm provides an estimate of the statistics of
future time samples, namely, the mean and covariance in the SGM in the most
recent segment.

Multiple samples at the same time. Our approach can easily incorporate
the case where we have more than one data vector for any given time t. We
simply change the sums over each segment for the empirical mean and covariance
to include any data samples in the given time range.

Cyclic data. In cyclic data, the times t are interpreted modulo T , so xT


and x1 are adjacent. A good example is a vector time series that represents
daily measurements over multiple years; we simply map all measurements to
t = 1, . . . , 365 (ignoring leap years), and modify the model and method to be
cyclic. The only subtlety here arises in choosing the first breakpoint, since one
breakpoint does not split a cyclic set of times into two segments. Evidently
we need two breakpoints to split a cyclic set of times into two segments. We
modify GGS by arbitrarily choosing a first breakpoint, and then running as


usual, including the ‘wrap-around’ segment as a segment. Thus, the first step
chooses the second breakpoint, which splits the cyclic data into two segments.
The AdjustBreakpoints method now adjusts both the chosen breakpoint and
the arbitrarily chosen original one.

Regularization across time. In our current model, the estimates on either


side of a breakpoint are independent of each other. We can, however, carry
out a post-processing step to shrink models on either side of each breakpoint
towards each other. We can do this by fixing the breakpoints and then adjusting
the continuous model parameters to minimize our original objective minus a
regularization term that penalizes deviations of (Σ(i) , µ(i) ) from (Σ(i−1) , µ(i−1) ).

Non-Gaussian data. Our segmented Gaussian model and associated regu-


larized maximum-likelihood problem (4) can be generalized to other statistical
models. The problem is tractable, at least in theory, when the associated regu-
larized maximum-likelihood problem is convex. In this case we can compute the
optimal parameters over a segment by solving a convex optimization problem,
whereas in the SGM we have an analytical solution in terms of the empirical
mean and covariance. Thus we can segment Poisson or Bernoulli data, or even
heterogeneous exponential family distributions (Lee and Hastie 2015, Tansey
et al. 2015).

6 Experiments
In this section, we describe our implementation of GGS, and the results of some
numerical experiments to illustrate the model and the method.

6.1 Implementation
We have implemented GGS as a Python package GGS available at

https://github.com/cvxgrp/GGS.

GGS is capable of carrying out full ten-fold cross-validation to help users choose
values of the hyperparameters. GGS uses NumPy for the numerical computations
and the multiprocessing package to carry out the algorithm in parallel for
different cross-validation folds for a single λ. (The current implementation does
not support parallelism over the segments of a single fold, and the advantages
of parallelism will only be seen when GGS is run on a computer with multiple
cores.)

Figure 1: Cumulative returns over the 19-year period for a stock, oil, and bond index.

6.2 Financial indices


In financial markets, regime changes have been shown to have important impli-
cations for asset class and portfolio performance (Ang and Timmermann 2012,
Sheikh and Sun 2012, Nystrup et al. 2015a, 2017a). We start with a small ex-
ample with n = 3, where we can visualize and plot all entries of the segment
parameters µ(i) and Σ(i) .

Dataset description. Our dataset consists of 19 years of daily returns, from


January 1997 to December 2015, for n = 3 indices for stocks, oil, and gov-
ernment bonds: MSCI World, S&P GSCI Crude Oil, and J.P. Morgan Global
Government Bonds. We use log-return data, i.e., the difference between the
logarithms of consecutive end-of-day prices. The time series length is T = 4943. Cu-
mulative returns for the three indices are shown in figure 1. We can clearly see
multiple ‘regimes’ in the return time series, although the individual behaviors
of the three indices are quite different.
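For concreteness, the log-returns can be computed from a table of end-of-day index levels as sketched below; the file name and column layout are hypothetical, and the result is stacked into the n × T layout used throughout this section.

```python
import numpy as np
import pandas as pd

# Hypothetical CSV with one column of end-of-day levels per index
prices = pd.read_csv("index_levels.csv", index_col=0, parse_dates=True)

# Daily log-returns: difference of the logarithms of consecutive closing levels
log_returns = np.log(prices).diff().dropna()

# Stack into an n x T data matrix (one row per index, one column per day)
data = log_returns.to_numpy().T
print(data.shape)  # expected to be (3, 4943) for the sample described above
```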

Running GGS. We run GGS on the data with K max = 30 and λ = 10−4 .
Figure 2 shows the value of the objective as a function of K, i.e., the objective
value in each iteration of GGS. We see a sharp increase in the objective value
up to around K = 8 or K = 10—our first hint that a choice in this range would
be reasonable. For this example n is very small, so the computation time is
dominated by Python overhead. Still, our single-threaded GGS solver took less

Figure 2: Objective value ϕ(b) as a function of the number of breakpoints for λ = 10−4 .

than 30 seconds to compute these 30 models on a standard laptop with a 1.7


GHz Intel i7 processor. The average number of passes through the data for the
breakpoint adjustments was under two.

Cross-validation. We next use ten-fold cross-validation to determine reason-


able values for K and λ. We plot the average log-likelihood over the ten folds as
a function of K in figure 3 for various values of λ. When λ is large, the curves
stop before K = K max , because GGS terminates early. These plots clearly
show that increasing K above ten does not increase the average log-likelihood
in the test set; and moreover past this point the log-likelihood on the test and
training sets begin to diverge, meaning the model is overfit. Though figure 3
only goes up to K = 30, we find that for values of K above around 60, the
log-likelihood begins to drop significantly. Furthermore, we see that values of
λ up to λ = 10−4 yield roughly the same high log-likelihood. This suggests
that choices of K = 10 and λ = 10−4 are reasonable, aligning with our general
preference for models which are simple (small K) and not too sensitive to noise
(large λ). Cross-validation also reveals that the choice of breakpoint locations
is very stable for these values of K and λ, across the ten folds.

Results. Figure 4 shows the model obtained by GGS with λ = 10−4 and
K = 10. We plot the covariance matrix by showing the square root of the
diagonal entries (i.e., the volatilities) and the three correlations as a function
of t. During the financial crisis in 2008, the mean returns of stocks and oil

Figure 3: Average training- and test-set log-likelihood during ten-fold cross-validation for various λ’s and across all values of K ≤ 30. Panels: (a) λ = 10−6, (b) λ = 10−5, (c) λ = 10−4, (d) λ = 10−3.

were very negative and volatility was high. The stock market and the oil price
were almost uncorrelated before 2008, but have been positively correlated since
then. It is interesting to see how the correlation between stocks and bonds has
varied over time: it was strongly positive in 1997 and very negative in 1998,
in 2002, and in the five years from mid-2007 to mid-2012. The sudden shift in
this correlation between 1997 and 1998 is why GGS yields two relatively short
segments in the [1997, 1999] window, rather than breaking up a longer segment
(such as [1999, 2002], where the correlation structure is more homogeneous). The
extent of these variations would be difficult to capture using a sliding window;
the window would have to be very short, which would lead to noisy estimates.
The segmentation approach yields a more interpretable partitioning with no
dependence on a (prespecified, fixed) window length.

Approaches to risk modeling (Alexander 2000) and portfolio optimization (Par-



Figure 4: Segmented Gaussian model obtained with λ = 10−4 and K = 10.

tovi and Caputo 2004, Meucci 2009) based on principal component analysis are
questionable, when volatilities and correlations are changing as significantly as
is the case in figure 4 (see also Fenn et al. 2011). We plot the cumulative index
returns along with the chosen breakpoints in figure 5. We can clearly see natural
segments and boundaries, for example the Russian default in 1998 and the 2008
financial crisis.

Comparison with random warm-start. We fix the hyperparameters K =


10 and λ = 10−4 , and attempt to find a better SGM using warm-start with
random breakpoints. This step is not needed; we carry it out to demonstrate
that while GGS does not find the model that globally maximizes the objective,
it is effective. We run 10,000 warm-start random initial breakpoint computa-
tions, running AdjustBreakpoints until the model is 1-OPT and computing the
objective found in each case. (In this case the number of passes over the data
set far exceeds two, the typical number in GGS.) The complementary CDF of
the objective for these 10,000 computations is shown in figure 6, as well as the
objective values found by GGS for K = 8 through K = 11. We see that the
random initializations can sometimes lead to very poor results: over 50% of the
simulations, even though they are locally optimal, have smaller objectives than
the K = 9 step of GGS. On the other hand, the random initializations do find
some SGMs with objective slightly exceeding the one found by GGS, demon-
strating that GGS did not find the globally optimal set of breakpoints. These

Figure 5: Cumulative returns with vertical bars at the model breakpoints.

SGMs have similar breakpoints, and similar cross-validated log-likelihood, as


the one found by GGS. As a practical matter, these SGMs are no better than
the one found by GGS. There are two advantages of GGS over the random
search: first, it is much faster; and second, it finds models for a range of values
of K, which is useful before we select the value of K to use.

6.3 Large-scale financial example


Dataset description. We next look at a larger example to emphasize the
scalability of GGS. We look at all companies currently in the S&P 500 index
that have been publicly listed for the entire 19-year period (from
1997 to 2015), which leaves 309 companies. Note that there are slightly fewer
trading days for the S&P 500 each year than the global indices, since the S&P
500 does not trade during US holidays, while the global indices still move. The
19-year dataset yields a 309 × 4782 data matrix. We take daily log-returns for
these stocks and run the GGS algorithm to detect relevant breakpoints.

GGS scalability. We run GGS on this much larger dataset up to K max = 10.
Our serial implementation of the GGS algorithm, on the same 1.7 GHz Intel
i7 processor, took 36 minutes, where AdjustBreakpoint took an average of 3.5
passes through the data at each K. Note that this aligns very closely with our
predicted runtime from section 3.2, which was estimated as (K max )2 LT n.

Figure 6: Empirical complementary CDF of ϕ(b) for 10,000 randomly initialized results
for K = 10 and λ = 10−4 .

Cross-validation. We run ten-fold cross-validation to find good values of the


hyperparameters K and λ. The average log-likelihood of the test and training
sets are displayed in figure 7. From the results, we can see that the log-likelihood
is maximized at a much smaller value of K, indicating fewer breakpoints. This
is in part because, with n = 309, we need more samples in each segment to
get an accurate estimate of the 309 × 309 covariance matrix, as opposed to the
3 × 3 covariance in the smaller example. Our cross-validation results suggest
choosing K = 3 and λ = 5 × 10−2 , and as in the small example, the results are
very stable near these values.

Results. We plot the mean, standard deviation, and cumulative return of a


uniform, buy-and-hold portfolio (i.e., investing $1 into each of the 309 stocks
in 1997). The results are shown in figure 8. Note that there is a selection bias
in the dataset, since these companies all remained in the S&P 500 in 2016, and
the total return is 8× over the 19-year period. Like before, the 2008 financial
crisis stands out. It is the only segment with a negative mean value. The
partitioning seems intuitively right. The first, highly volatile, segment includes
both the build-up and burst of the dot–com bubble. The second segment is
the bull market that led to the financial crisis in 2008. The third segment is
the financial crisis and the fourth segment is the market rally that followed the
crisis. These breakpoints were also found in the multi-asset example in figure 4
and figure 5.

Figure 7: Average training- and test-set log-likelihood during ten-fold cross-validation for various λ’s for the 309-stock example. Panels: (a) λ = 5 × 10−3, (b) λ = 10−2, (c) λ = 5 × 10−2, (d) λ = 10−1. Note that not all λ’s go all the way up to K = 10 because our algorithm stops when it determines that it will no longer benefit from adding an additional split.

6.4 Wikipedia text data


We examine an example from the field of natural language processing (NLP) to
illustrate how GGS can be applied to a very different type of dataset, beyond
traditional time series examples.

Dataset description. We look at text data from English-language Wikipedia.


We obtain our data by concatenating the introductions, i.e., the text that pre-
cedes the Table of Contents section on the Wikipedia webpage, of five separate
articles, with titles George Clooney, Botany, Julius Caesar, Jazz, and Denmark.
Here, the “time series” consists of the sequence of words from these five articles
in order. After basic preprocessing (removing words that do not appear at least

Figure 8: Mean, standard deviation, and cumulative return for a uniform portfolio with
K = 3 and λ = 5 × 10−2 .

five times and in multiple articles), our dataset consists of 1282 words, with each
article contributing between 224 and 286 words. We then convert each word
into a 300-dimensional vector using a pretrained Word2Vec embedding of three
million unique words (or short phrases), trained on the Google News dataset of
approximately 100 billion words, available at

https://code.google.com/archive/p/word2vec/.

This leaves us with a 300 × 1282 data matrix. Our hope is that GGS can
detect the breakpoints between the five concatenated articles, based solely on
the change in mean and covariance of the vectors associated with the words in
our vector series.

GGS results. We run GGS to split the data into five segments—i.e., K = 4—
and use cross-validation to select λ = 10−3 . (We note, however, that this
example is quite robust to the selection of λ, and any value from 10−6 to 10−3
yields the exact same breakpoint locations.) We plot the results in figure 9,

Figure 9: Actual and GGS-predicted breakpoints for the concatenation of the five
Wikipedia articles, along with the predicted most similar word to the mean of each GGS
segment.

which shows GGS achieving a near-perfect split of the five articles. Figure 9 also
shows a representative word (or short phrase) in the Google News dataset that
is among the top five “most similar” words, out of the entire three million word
corpus, to the average (mean) of each GGS segment, as measured by cosine
similarity. We see that GGS correctly identifies both the breakpoint locations
and the general topic of each segment.

6.5 Comparison with left-to-right HMM on synthetic data


We next analyze a synthetic example where observations are generated from a
given sequence of segments. This provides a known ground truth, allowing us
to compare GGS with a common baseline, a left-to-right hidden Markov model
(HMM) (Bakis 1976, Cappé et al. 2005). Left-to-right HMMs, like GGS, split the
data into non-repeatable segments, where each segment is defined by a Gaussian
distribution. The HMMs in this experiment are implemented using the rarhsmm
library (Xu and Liu 2017), which includes the same shrinkage estimator for the
covariance matrices used in GGS (see section 2.2). The shrinkage estimator
results in more reliable estimates not only of the covariance matrices but also
of the transition matrix and the hidden states (Fiecas et al. 2017).

(a) Objective vs. breakpoints for λ = 10. (b) Average training and test log-likelihood.

Figure 10: GGS correctly identifies that there are ten underlying segments in the data
(from the kink in the plots at K = 9).

Data set description. We start by generating ten random covariance matri-


ces. We do so by setting Σ_i = A^(i) A^(i)T , i = 1, . . . , 10, where A^(i) ∈ R^(25×25) is
a random matrix where each element A^(i)_{j,k} was generated independently from
the standard normal distribution. Our synthetic data set then has ten ground
truth segments (or K = 9 breakpoints), where segment i has zero mean and
covariance Σi . Each segment is of length 100 (so the total time series has 1,000
observations). Each of the 100 readings per segment is sampled independently
from the given distribution. Thus, our final data set consists of a 25 × 1000 data
matrix, consisting of ten independent segments, each of length 100.
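The construction is straightforward to reproduce; the sketch below generates one such data set (the random seed is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, n_segments, seg_length = 25, 10, 100

segments = []
for _ in range(n_segments):
    A = rng.standard_normal((n, n))   # random 25 x 25 matrix A^(i)
    Sigma = A @ A.T                   # random covariance Sigma_i = A^(i) A^(i)T
    # 100 iid draws from N(0, Sigma_i) for this segment
    segments.append(rng.multivariate_normal(np.zeros(n), Sigma, size=seg_length))

data = np.concatenate(segments).T     # 25 x 1000 data matrix
print(data.shape)                     # (25, 1000)
```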

Results. We run both GGS and the left-to-right HMM on this data set. For
GGS, we immediately notice a kink in the objective at K = 9, as shown in
figure 10(a), indicating that the data should be split into K + 1 = 10 segments.
We use cross-validation to choose an appropriate value of λ, which yields λ = 10.
We plot the training- and test-set log-likelihoods for λ = 10 in figure 10(b).
(Similar to the Wikipedia text example, though, the breakpoint locations are
relatively robust to our selection of λ. GGS returns identical breakpoints for
any λ between 10−3 and 103 ). For this value of λ (and thus for the whole range
of λ between 10−3 and 103 ), we split the data perfectly, identifying the nine
breakpoints at their exact locations.
Left-to-right HMMs have various methods for determining the number of seg-
ments, such as AIC or BIC. Here we instead simply use the correct number of
segments, and initialize the transition matrix as its true value. Note that this
is the best-case scenario for the left-to-right HMM. Even with this advantage,
the left-to-right HMM struggles to properly split the time series. Whereas GGS

correctly identifies the breakpoints as [100, 200, 300, 400, 500, 600, 700, 800, 900],
the left-to-right HMM gets at least one breakpoint completely wrong (and splits
the data at, for example, [100, 200, 300, 400, 500, 600, 663, 700, 800]).

These results are consistent. In fact, when this experiment was repeated 100
times (with different randomly generated data), GGS identified the correct
breakpoints every single time. We also note that GGS is robust to n (the
dimension of the data), K (the number of breakpoints), and T /(K + 1) (the av-
erage segment length), perfectly splitting the data at the exact breakpoints for
all tests across at least one order of magnitude in each of these three parameters.
On the other hand, the 100 left-to-right HMM experiments correctly labeled on
average just 7.46 of the nine true breakpoints (and never more than eight). In
these HMM experiments, instead of ten segments of length 100, the shortest
segment had an average length of 26, and the longest segment had an average
length of 200. Additionally, the left-to-right HMM struggles as the parameters
change, doing significantly worse when K increases and when T /K is small
compared to n (though formal analysis of the robustness of left-to-right HMMs
is outside the scope of this paper). This comes as no surprise, because finding
the global maximum among all local maxima of the likelihood function for an
HMM with many states is known to be a difficult problem (Cappé et al. 2005,
chapter 1.4). Therefore, as shown by these experiments, GGS appears to
outperform left-to-right HMMs in this setting.

7 Summary
We have analyzed the problem of breaking a multivariate time series into seg-
ments, where the data in each segment could be modeled as independent samples
from a multivariate Gaussian distribution. Our greedy Gaussian segmentation
(GGS) algorithm is able to approximately maximize the covariance-regularized
log-likelihood in an efficient manner, easily scaling to vectors with dimension
over 1,000 and time series of any length. Examples on both small and large
data sets yielded useful insights. Our implementation, available at https:
//github.com/cvxgrp/GGS, can be used to solve problems in a variety of ap-
plications. For example, the regularized parameter estimates obtained by GGS
could be used as inputs to portfolio optimization, where correlations between
different assets play an important role when determining optimal holdings.

References
Abonyi, J., B. Feil, S. Nemeth, and P. Arva. “Modified Gath–Geva clustering
for fuzzy segmentation of multivariate time-series.” Fuzzy Sets and Systems,
vol. 149, no. 1 (2005), pp. 39–56.

Alexander, C. “A primer on the orthogonal GARCH model.” (2000). Unpub-


lished manuscript, ISMA Center, University of Reading, U.K.

Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual


Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.

Bakis, R. “Continuous speech recognition via centisecond acoustic states.” Jour-


nal of the Acoustical Society of America, vol. 59, no. S1 (1976), p. S97.

Basseville, M. and I. V. Nikiforov. Detection of Abrupt Changes: Theory and


Application, vol. 104. Prentice Hall: Englewood Cliffs (1993).

Bauwens, L. and J. Rombouts. “On marginal likelihood computation in change-


point models.” Computational Statistics & Data Analysis, vol. 56, no. 11
(2012), pp. 3415–3429.

Bellman, R. E. “On the approximation of curves by line segments using dynamic


programming.” Communications of the ACM, vol. 4, no. 6 (1961), p. 284.

Bickel, P. J. and E. Levina. “Regularized estimation of large covariance matrices.”


Annals of Statistics, vol. 36, no. 1 (2008), pp. 199–227.

Bleakley, K. and J. P. Vert. “The group fused lasso for multiple change-point
detection.” arXiv preprint arXiv:1106.4199 (2011).

Booth, N. B. and A. F. M. Smith. “A Bayesian approach to retrospective identi-


fication of change-points.” Journal of Econometrics, vol. 19, no. 1 (1982), pp.
7–22.

Borenstein, E. and S. Ullman. “Combined top-down/bottom-up segmentation.”


IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30,
no. 12 (2008), pp. 2109–2125.

Bulla, J. “Hidden Markov models with t components. Increased persistence and


other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.

Candès, E. J., M. B. Wakin, and S. Boyd. “Enhancing sparsity by reweighted ℓ1


minimization.” Journal of Fourier Analysis and Applications, vol. 14, no. 5–6
(2008), pp. 877–905.

Cappé, O., E. Moulines, and T. Rydén. Inference in Hidden Markov Models.


Springer: New York (2005).

Cheon, S. and J. Kim. “Multiple change-point detection of multivariate mean


vectors with the Bayesian approach.” Computational Statistics & Data Anal-
ysis, vol. 54, no. 2 (2010), pp. 406–415.

Chouakria-Douzal, A. “Compression technique preserving correlations of a mul-


tivariate temporal sequence.” In Advances in Intelligent Data Analysis V,
edited by M. R. Berthold, H.-J. Lenz, E. Bradley, R. Kruse, and C. Borgelt,
vol. 2810 of Lecture Notes in Computer Science. Springer: Berlin (2003), pp.
566–577.

Crosier, R. B. “Multivariate generalizations of cumulative sum quality-control


schemes.” Technometrics, vol. 30, no. 3 (1988), pp. 291–303.

De Gooijer, J. “Detecting change-points in multidimensional stochastic pro-


cesses.” Computational Statistics & Data Analysis, vol. 51, no. 3 (2006), pp.
1892–1903.

Douglas, D. H. and T. K. Peucker. “Algorithms for the reduction of the number


of points required to represent a digitized line or its caricature.” Cartographica:
The International Journal for Geographic Information and Geovisualization,
vol. 10, no. 2 (1973), pp. 112–122.

Esling, P. and C. Agon. “Time-series data mining.” ACM Computing Surveys,


vol. 45, no. 1 (2012), p. 12.

Fenn, D. J., M. A. Porter, S. Williams, M. McDonald, N. F. Johnson, and N. S.


Jones. “Temporal evolution of financial-market correlations.” Physical Review
E, vol. 84, no. 2 (2011), p. 026109.

Fiecas, M., J. Franke, R. von Sachs, and J. T. Kamgaing. “Shrinkage estimation


for multivariate hidden Markov models.” Journal of the American Statistical
Association, vol. 112, no. 517 (2017), pp. 424–435.

Fragkou, P., V. Petridis, and A. Kehagias. “A dynamic programming algorithm


for linear text segmentation.” Journal of Intelligent Information Systems,
vol. 23, no. 2 (2004), pp. 179–197.

Galeano, P. and D. Wied. “Multiple break detection in the correlation struc-


ture of random variables.” Computational Statistics & Data Analysis, vol. 76
(2014), pp. 262–282.

Ge, X. and P. Smyth. “Segmental semi-Markov models for endpoint detection


in plasma etching.” IEEE Transactions on Semiconductor Engineering, vol.
259 (2001), pp. 201–209.

Guralnik, V. and J. Srivastava. “Event detection from time series data.” In Pro-
ceedings of the Fifth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (1999), pp. 33–42.

Gustafsson, F. Adaptive Filtering and Change Detection. Wiley: West Sussex


(2000).

Hallac, D., A. Sharang, R. Stahmann, A. Lamprecht, M. Huber, M. Roehder,


R. Sosič, and J. Leskovec. “Driver identification using automobile sensor data
from a single turn.” In IEEE 19th International Conference on Intelligent
Transport Systems (2016), pp. 953–958.

Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.


Springer: New York, 2nd ed. (2009).

Hu, B., T. Rakthanmanon, Y. Hao, S. Evans, S. Lonardi, and E. Keogh. “Using


the minimum description length to discover the intrinsic cardinality and di-
mensionality of time series.” Data Mining and Knowledge Discovery, vol. 29,
no. 2 (2015), pp. 358–399.

Huang, J. Z., N. Liu, M. Pourahmadi, and L. Liu. “Covariance matrix selection


and estimation via penalised normal likelihood.” Biometrika, vol. 93, no. 1
(2006), pp. 85–98.

Katz, I. and K. Crammer. “Outlier-robust convex segmentation.” In Proceedings


of the 29th National Conference on Artificial Intelligence, vol. 4 (2015), pp.
2701–2707.

Kehagias, A., E. Nidelkou, and V. Petridis. “A dynamic programming segmen-


tation procedure for hydrological and environmental time series.” Stochastic
Environmental Research and Risk Assessment, vol. 20, no. 1–2 (2006), pp.
77–94.

Keogh, E., S. Chu, D. Hart, and M. Pazzani. “Segmenting time series: A survey
and novel approach.” In Data Mining in Time Series Databases, edited by
M. Last, A. Kandel, and H. Bunke, vol. 57 of Series in Machine Perception
and Artificial Intelligence, chap. 1. World Scientific: Singapore (2004).

Kim, S. J., K. Koh, S. Boyd, and D. Gorinevsky. “ℓ1 trend filtering.” SIAM
Review, vol. 51, no. 2 (2009), pp. 339–360.

Ledoit, O. and M. Wolf. “A well-conditioned estimator for large-dimensional


covariance matrices.” Journal of Multivariate Analysis, vol. 88, no. 2 (2004),
pp. 365–411.

Lee, C. B. “Bayesian analysis of a change-point in exponential families with


applications.” Computational Statistics & Data Analysis, vol. 27, no. 2 (1998),
pp. 195–208.

Lee, J. and T. Hastie. “Learning the structure of mixed graphical models.”


Journal of Computational and Graphical Statistics, vol. 24, no. 1 (2015), pp.
230–253.

Li, J. “Nonparametric multivariate statistical process control charts: a hypoth-


esis testing-based approach.” Journal of Nonparametric Statistics, vol. 27,
no. 3 (2015a), pp. 383–400.

Meucci, A. “Managing diversification.” Risk, vol. 22, no. 5 (2009), pp. 74–79.

Nystrup, P., B. W. Hansen, H. O. Larsen, H. Madsen, and E. Lindström. “Dy-


namic allocation or diversification: A regime-based approach to multiple as-
sets.” Journal of Portfolio Management, vol. 44, no. 2 (2017a), pp. 62–73.

Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Regime-based ver-


sus static asset allocation: Letting the data speak.” Journal of Portfolio
Management, vol. 42, no. 1 (2015a), pp. 103–109.

Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Detecting change


points in VIX and S&P 500: A new approach to dynamic asset allocation.”
Journal of Asset Management, vol. 17, no. 5 (2016), pp. 361–374.

Nystrup, P., H. Madsen, and E. Lindström. “Long memory of financial time


series and hidden Markov models with time-varying parameters.” Journal of
Forecasting, vol. 36, no. 8 (2017b), pp. 989–1002.

Partovi, M. H. and M. Caputo. “Principal portfolios: Recasting the efficient


frontier.” Economics Bulletin, vol. 7, no. 3 (2004), pp. 1–10.

Picard, F., É. Lebarbier, E. Budinská, and S. Robin. “Joint segmentation of


multivariate Gaussian processes using mixed linear models.” Computational
Statistics & Data Analysis, vol. 55, no. 2 (2011), pp. 1160–1170.

Rajagopalan, V. and A. Ray. “Symbolic time series analysis via wavelet-based


partitioning.” Signal Processing, vol. 86, no. 11 (2006), pp. 3309–3320.

Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.

Samé, A., F. Chamroukhi, G. Govaert, and P. Aknin. “Model-based clustering


and segmentation of time series with changes in regime.” Advances in Data
Analysis and Classification, vol. 5, no. 4 (2011), pp. 301–321.

Sheikh, A. Z. and J. Sun. “Regime change: Implications of macroeconomic shifts


on asset class and portfolio performance.” Journal of Investing, vol. 21, no. 3
(2012), pp. 36–54.

Son, Y. S. and S. Kim. “Bayesian single change point detection in a sequence


of multivariate normal observations.” Statistics, vol. 39, no. 5 (2005), pp.
373–387.

Tansey, W., O. H. M. Padilla, A. S. Suggala, and P. Ravikumar. “Vector-space


Markov random fields via exponential families.” In Proceedings of the 32nd
International Conference on Machine Learning, vol. 1 (2015), pp. 684–692.
Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight. “Sparsity and
smoothness via the fused lasso.” Journal of the Royal Statistical Society:
Series B (Statistical Methodology), vol. 67, no. 1 (2005), pp. 91–108.
Venter, J. H. and S. J. Steel. “Finding multiple abrupt change points.” Compu-
tational Statistics & Data Analysis, vol. 22, no. 5 (1996), pp. 481–504.
Verbeek, J., N. Vlassis, and B. Kröse. “Efficient greedy learning of Gaussian
mixture models.” Neural Computation, vol. 15, no. 2 (2003), pp. 469–485.
Wahlberg, B., S. Boyd, M. Annergren, and Y. Wang. “An ADMM algorithm for
a class of total variation regularized estimation problems.” IFAC Proceedings
Volumes, vol. 45, no. 16 (2012), pp. 83–88.
Wahlberg, B., C. Rojas, and M. Annergren. “On ℓ1 mean and variance filtering.”
In Proceedings of the Forty Fifth Asilomar Conference on Signals, Systems
and Computers (2011), pp. 1913–1916.
Welford, B. P. “Note on a method for calculating corrected sums of squares and
products.” Technometrics, vol. 4, no. 3 (1962), pp. 419–420.
Witten, D. and R. Tibshirani. “Covariance-regularized regression and classifica-
tion for high dimensional problems.” Journal of the Royal Statistical Society:
Series B (Statistical Methodology), vol. 71, no. 3 (2009), pp. 615–636.
Xu, N. “A survey of sensor network applications.” IEEE Communications Mag-
azine, vol. 40, no. 8 (2002), pp. 102–114.
Xu, Z. and Y. Liu. “Regularized autoregressive hidden semi Markov model.”
https://github.com/cran/rarhsmm (2017).
Zangwill, W. I. and C. B. Garcia. Pathways to solutions, fixed points, and equi-
libria. Prentice Hall: Englewood Cliffs (1981).
PAPER G
Originally published in Quantitative Finance

Dynamic portfolio optimization


across hidden market regimes

Peter Nystrup, Henrik Madsen, and Erik Lindström

Abstract

Regime-based asset allocation has been shown to add value over rebalancing
to static weights and, in particular, reduce potential drawdowns by reacting
to changes in market conditions. The predominant approach in previous
studies has been to specify in advance a static decision rule for changing
the allocation based on the state of financial markets or the economy. In
this article, model predictive control (MPC) is used to dynamically opti-
mize a portfolio based on forecasts of the mean and variance of financial
returns from a hidden Markov model with time-varying parameters. There
are computational advantages to using MPC when estimates of future re-
turns are updated every time a new observation becomes available, since the
optimal control actions are reconsidered anyway. MPC outperforms a static
decision rule for changing the allocation and realizes both a higher return
and a significantly lower risk than a buy-and-hold investment in various
major stock market indices. This is after accounting for transaction costs,
with a one-day delay in the implementation of allocation changes, and with
zero-interest cash as the only alternative to the stock indices. Imposing a
trading penalty that reduces the number of trades is found to increase the
robustness of the approach.

Keywords: Mean–variance optimization; Model predictive control; Hidden


Markov model; Adaptive estimation; Forecasting.

1 Introduction
The objective of portfolio optimization is to find an optimal tradeoff between
risk and return over a fixed planning horizon. Traditionally, investors decide
on a strategic asset allocation (SAA) based on a single-period optimization,
inspired by the mean–variance framework of Markowitz (1952). The purpose
is to develop a static, “all-weather” portfolio that optimizes efficiency across
a range of economic scenarios. Even if the SAA is reconsidered on an annual
basis, it is unlikely to change significantly, as long as the purpose is “all-weather”
efficiency.

In the presence of time-varying investment opportunities, portfolio weights
should be adjusted as new information arrives to take advantage of favorable eco-
nomic regimes and withstand adverse regimes (Sheikh and Sun 2012). The
abrupt regime changes that financial markets tend to undergo present a big
challenge to traditional SAA. Although some changes may be transitory, the
new behavior often persists for several periods after a change (Ang and Tim-
mermann 2012).
Regime-based asset allocation (RBAA) has indeed been shown to add value over
rebalancing to static weights and, in particular, reduce potential drawdowns by
reacting to changes in market conditions (see Ang and Bekaert 2004, Guidolin
and Timmermann 2007, Bulla et al. 2011, Kritzman et al. 2012, Nystrup et al.
2015a, 2017a). The predominant approach is to specify in advance a static
decision rule for changing the allocation based on the state of financial markets
or the economy.
The parameters of the decision rule can be optimized in sample, but it does
not guarantee that the decision rule is optimal for the problem at hand. A
disadvantage is, therefore, that a large number of different specifications might
have to be tried, in order to find a decision rule with good performance. Testing
many different specifications increases the risk of inferior performance out of
sample. Further, it can be argued that a static decision rule is hardly optimal
when the underlying model used for regime inference is time varying, as in Bulla
et al. (2011) and Nystrup et al. (2015a, 2017a).
An alternative approach is to dynamically optimize the portfolio based on the
inferred regime probabilities and parameters taking into account transaction
costs, risk aversion, and possibly other constraints. Herzog et al. (2007) and
Boyd et al. (2014) proposed to use model predictive control (MPC) to solve
this constrained, stochastic control problem. In MPC, a statistical model of the
process is used to predict its future evolution and choose the best control action.
The great strength of MPC is the capability to solve control problems under
constraints in a computationally feasible manner. Even so, it is commonly
assumed that asset prices can be described by a linear factor model with constant
variance and that there are no transaction costs in order to derive analytical
expressions for when the allocation should be changed (see, e.g., Herzog et al.
2007, Costa and Araujo 2008, Calafiore 2008, 2009). This limits the practical
impact of the results. Transaction costs are important when comparing the
performance of static and dynamic strategies, because frequent rebalancing can
offset the potential excess return of a dynamic strategy. Moreover, transaction
costs stabilize the optimization problem (Brodie et al. 2009, Ho et al. 2015).
In this article, asset returns are modeled by a two-state hidden Markov model
(HMM) with time-varying parameters, similar to the model considered in Nys-
trup et al. (2015a, 2017b,a). From a statistical perspective, the HMM is a
more realistic description of asset price dynamics than a linear factor model

with constant variance. It is well suited to capture the stylized behavior of


financial series, including volatility clustering, leptokurtosis, and time-varying
correlations (see, e.g., Rydén et al. 1998, Ang and Timmermann 2012). From an
economic perspective, the HMM can describe the abrupt changes in market con-
ditions and investment opportunities that arise due to changes in risk aversion
and structural changes in the state of the economy.
Instead of a static decision rule for changing the portfolio based on the inferred
regime, MPC is used to dynamically optimize the portfolio based on forecasted
means and variances. Using an HMM, the forecasts are mean-reverting and
only change when the regime probabilities change. Thus, the allocation is still
determined indirectly by the inferred regime. MPC, however, is applicable to
forecasts from any type of model. The impact of transaction costs and risk
aversion is analyzed in a live-sample setting using available market data. MPC
is compared with previous approaches to RBAA under realistic assumptions
about transaction costs and implementation.
The article is structured as follows: section 2 introduces the HMM, its esti-
mation, and use for forecasting. Section 3 is concerned with dynamic portfolio
optimization and MPC. The empirical results are presented in section 4. Finally,
section 5 concludes.

2 The hidden Markov model


The HMM is a popular choice for inferring the hidden state of financial markets.
It can match the tendency of financial markets to change their behavior abruptly
and the phenomenon that the new behavior often persists for several periods
after a change (Ang and Timmermann 2012). In addition, it is well suited
to capture the stylized behavior of many financial series including volatility
clustering and leptokurtosis, as shown by Rydén et al. (1998).
In an HMM, the probability distribution that generates an observation depends
on the state of an unobserved Markov chain. A sequence of discrete random
variables {St : t ∈ N} is said to be a first-order Markov chain if, for all t ∈ N, it
satisfies the Markov property:
Pr ( St+1 | St , . . . , S1 ) = Pr ( St+1 | St ) . (1)
The conditional probabilities Pr ( St+1 = j| St = i) = γij are called transition
probabilities. A Markov chain with transition probability matrix Γ = {γij } has
stationary distribution π, if π T Γ = π T and 1T π = 1, where 1 is a column
vector with all entries one. The Markov chain is said to be stationary if δ = π,
where δ is the initial distribution, i.e., δi = Pr (S1 = i).
As an example, consider the two-state model with Gaussian conditional distributions:

Y_t | S_t ∼ N( μ_{S_t} , σ²_{S_t} ),

where

μ_{S_t} = μ_1 if S_t = 1 and μ_2 if S_t = 2,    σ²_{S_t} = σ_1² if S_t = 1 and σ_2² if S_t = 2,

and

Γ = [ 1 − γ_12    γ_12
      γ_21        1 − γ_21 ].

When the current state St is known, the distribution of Yt depends only on St ,


and not on previous states or observations.
The sojourn times are implicitly assumed to be geometrically distributed:
Pr (’staying t time steps in state i’) = γ_ii^{t−1} (1 − γ_ii).    (2)

The geometric distribution is memoryless, implying that the time until the next
transition out of the current state is independent of the time spent in the state.
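With γ_ii = 0.99, for example, the sojourn times in state i follow a geometric distribution with an expected length of 1/(1 − γ_ii) = 100 time steps.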
In order to improve its fit to the distributional and temporal properties of daily
returns, the Gaussian HMM has been extended by considering other sojourn-
time distributions than the memoryless geometric distribution (Bulla and Bulla
2006), other conditional distributions than the Gaussian distribution (Bulla
2011), and a continuous-time formulation as an alternative to the dominating
discrete-time models (Nystrup et al. 2015b). As an alternative to increasing
the model complexity, Nystrup et al. (2017b) obtained good results using an
adaptive estimation approach that allowed for time variation in the parameters
of a two-state Gaussian HMM. This approach was adopted in Nystrup et al.
(2015a, 2017a) and will be adopted in this article as well.

2.1 Adaptive parameter estimation


The parameters of an HMM are usually estimated using the maximum-likelihood
method. The two most popular approaches to maximizing the likelihood are
direct numerical maximization and the Baum–Welch algorithm, a special case
of the expectation–maximization (EM) algorithm (Baum et al. 1970, Dempster
et al. 1977).
Every observation is assumed to be of equal importance, no matter how long the
sample period is. This approach works well when the sample period is short and
the underlying process does not change over time. The time-varying behavior of
the parameters documented in previous studies (Rydén et al. 1998, Bulla 2011,
Nystrup et al. 2017b), however, calls for an adaptive approach that assigns more
weight to the most recent observations, while keeping in mind past patterns at
a reduced confidence.
As pointed out by Cappé et al. (2005), it is possible to evaluate derivatives of the
likelihood function with respect to the parameters for virtually any model that
the EM algorithm can be applied to. As a consequence, instead of resorting to
a specific algorithm such as the EM algorithm, the likelihood can be maximized
using gradient-based methods. Lystig and Hughes (2002) described an algorithm

for exact computation of the score vector and the observed information matrix in
HMMs that can be performed in a single pass through the data. Their algorithm
was derived from the forward–backward algorithm.
The reason for exploring gradient-based methods is the flexibility to make the
estimator recursive and adaptive.1 The estimation of the parameters through a
maximization of the conditional log-likelihood function can be done online using
the estimator

θ̂_t = arg max_θ ∑_{n=1}^{t} w_n log Pr( Y_n | Y_{n−1} , . . . , Y_1 , θ ) = arg max_θ l̃_t(θ)    (3)

with w_n = 1.² The online estimator can be made adaptive by introducing a


different weighting. A popular choice is to use exponential weights wn = f t−n ,
where 0 < f < 1 is the forgetting factor (Parkum et al. 1992, Kulhavỳ and
Zarrop 1993). The speed of adaption is then determined by the effective memory
length

N_eff = 1 / (1 − f).    (4)
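As an example, a forgetting factor of f = 0.998 corresponds to an effective memory of N_eff = 1/(1 − 0.998) = 500 observations, i.e., roughly two years of daily returns.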

Maximizing the second-order Taylor expansion of ˜lt (θ) around θ̂t−1 with respect
to θ and defining the solution as the estimator θ̂t leads to
θ̂_t = θ̂_{t−1} − [ ∇_θθ l̃_t(θ̂_{t−1}) ]^{−1} ∇_θ l̃_t(θ̂_{t−1}).    (5)

This is equivalent to a specific case of the generalized autoregressive score (GAS)


model of Creal et al. (2013). Using the estimator (5) it is possible to reach
quadratic convergence, whereas the GAS model, in general, converges only lin-
early (see Cappé et al. 2005).
Scaling by the Hessian in (5) is equivalent to scaling by the variance of the score
function, because the expectation of the score is zero. The variance of the score
function is known as the Fisher information
I_t(θ) = E[ −∇_θθ l_t ] = E[ ∇_θ l_t ∇_θ l_t^T ].    (6)

Approximating the Hessian by the Fisher information leads to the recursive,


adaptive estimator
θ̂_t ≈ θ̂_{t−1} + A [ I_t(θ̂_{t−1}) ]^{−1} ∇_θ l̃_t(θ̂_{t−1}).    (7)

1 See Khreich et al. (2012) for a survey of techniques for incremental learning of HMM parame-
ters.
2 An online estimator processes its input observation-by-observation in a sequential fashion,

without having the entire input sequence available from the start.

The tuning constant A can be adjusted to increase or decrease the speed of con-
vergence without changing the effective memory length, although it is common
to choose A ≈ 1/Neff . The inverse of the Fisher information can be updated
recursively using (6) and the matrix inversion lemma. It is necessary to apply a
transformation to all constrained parameters for the estimator (7) to converge,
and it is advisable to start the estimation at t > 1 to avoid large initial steps.
The time variation of the parameters is observation driven based on the score
of the likelihood function. Although the parameters are stochastic, they are
perfectly predictable given the past observations. This is contrary to parameter-
driven models, in which the parameters are stochastic processes with their own
source of error. No prior knowledge is assumed about the parameters, and no
attempt is made to identify the drivers of the variations (see, e.g., Brennan et al.
1997).
The use of the score function for updating the parameters is intuitive, as it
defines the steepest ascent direction for improving the model’s local fit in terms
of the likelihood at time t given the current parameter values (Creal et al. 2013).
For HMMs, the score function must consider the previous observations and
cannot reasonably be approximated by the score function of the most recent
observation, as it is often done for other models (Khreich et al. 2012). This
leads to a significant increase in computational complexity. In order to compute
the weighted score function, the algorithm of Lystig and Hughes (2002) has to
be run for each iteration and the contribution of each observation has to be
weighted.
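As a schematic illustration of one update of the estimator (7), the sketch below assumes a user-supplied routine, here called weighted_score_and_info, that returns the exponentially weighted score vector and Fisher information at the previous estimate (for example via the Lystig and Hughes recursion); the function name, the tuning constant, and the use of an unconstrained working scale for the parameters are assumptions of this sketch, not part of the original implementation.

```python
import numpy as np

def adaptive_step(theta_prev, observations, weighted_score_and_info, A=1 / 500):
    """One recursive, adaptive parameter update in the spirit of equation (7).

    theta_prev              : previous estimate on an unconstrained working scale
    observations            : data available up to and including time t
    weighted_score_and_info : routine returning the exponentially weighted score
                              vector and Fisher information matrix at theta_prev
    A                       : tuning constant, typically about 1/N_eff
    """
    score, info = weighted_score_and_info(theta_prev, observations)
    # theta_t ~ theta_{t-1} + A * I_t(theta_{t-1})^{-1} * score
    return theta_prev + A * np.linalg.solve(info, score)
```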

2.2 Forecasting
The first step toward calculating the forecast distributions is to estimate the
current state probabilities given the past observations and parameters. The
vector of state probabilities is
α^T_{T|T} = δ^T P_1(y_1) ∏_{t=2}^{T} Γ_t P_t(y_t) / ( δ^T P_1(y_1) ( ∏_{t=2}^{T} Γ_t P_t(y_t) ) 1 ) ,    (8)

where the i’th entry is ( α_{T|T} )_i = Pr( S_T = i | Y_T , . . . , Y_1 ) and P_t(y_t) is a diagonal matrix with the conditional densities p_i(y_t) = Pr( Y_t = y_t | S_t = i, θ_t ) as entries.
Once the current state probabilities are estimated, the state probabilities k steps
ahead can be calculated by multiplying α T |T with the transition probability
matrix k times:
α^T_{T+k|T} = α^T_{T|T} Γ_T^k .    (9)
The parameters are assumed to stay constant in the absence of a model describ-
ing their evolution.
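In code, the propagation in (9) is a single matrix power applied to the filtered state probabilities; the transition matrix and probabilities below are illustrative numbers only.

```python
import numpy as np

def forecast_state_probabilities(alpha_filtered, Gamma, k):
    """k-step-ahead state probabilities, cf. equation (9)."""
    return alpha_filtered @ np.linalg.matrix_power(Gamma, k)

Gamma = np.array([[0.99, 0.01],
                  [0.03, 0.97]])   # illustrative transition probabilities
alpha = np.array([0.9, 0.1])       # illustrative filtered state probabilities
print(forecast_state_probabilities(alpha, Gamma, k=10))
```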

The density forecast is the average of the state-dependent conditional densities


weighted by the forecasted state probabilities. When the conditional distribu-
tions are distinct normal distributions, the forecast distribution will be a nonnor-
mal mixture (Frühwirth-Schnatter 2006).3 Using Monte Carlo simulation, Boyd
et al. (2014) showed that the results of dynamic portfolio optimization are not
particularly sensitive to higher-order moments. Consequently, for the present
application, only the first and second moment of the forecast distribution are
considered.
The first two moments of a mixture distribution are

m
µ= µi αi (10)
i=1
∑m
( )
σ2 = µ2i + σi2 αi − µ2 (11)
i=1

with αi denoting the weights—that is, the forecasted state probabilities.


Before calculating the moments of the mixture distribution, the conditional
means and variances of the returns are calculated based on the moments of
the log-returns. Within each state, the returns rt are assumed to be iid with
log-normal distribution
log(1 + r_t) ∼ N( μ, σ² ),

where µ and σ 2 are the mean and variance of the log-returns. Thus, the mean
and variance of the returns are given by
E[r_t] = exp( μ + σ²/2 ) − 1    (12)

Var[r_t] = ( exp(σ²) − 1 ) exp( 2μ + σ² ).    (13)
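Combining (10)–(13), the forecasted mean and variance of the simple returns follow directly from the state-conditional log-return parameters and the forecasted state probabilities; the parameter values in the sketch below are purely illustrative.

```python
import numpy as np

def forecast_return_moments(mu_log, sigma2_log, state_probs):
    """Mean and variance of simple returns under the forecasted mixture."""
    # State-conditional moments of simple returns, equations (12)-(13)
    mean_r = np.exp(mu_log + sigma2_log / 2) - 1
    var_r = (np.exp(sigma2_log) - 1) * np.exp(2 * mu_log + sigma2_log)
    # Mixture moments with the state probabilities as weights, (10)-(11)
    mu = np.sum(mean_r * state_probs)
    sigma2 = np.sum((mean_r**2 + var_r) * state_probs) - mu**2
    return mu, sigma2

# Illustrative two-state example: a calm state and a volatile state
print(forecast_return_moments(np.array([0.0006, -0.0010]),
                              np.array([0.00005, 0.00060]),
                              np.array([0.8, 0.2])))
```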

The forecasted mean and variance will be mean-reverting, as the forecast horizon
extends and the state probabilities converge to the stationary distribution of the
Markov chain. The rate of convergence is determined by the size of the second
largest eigenvalue of the transition probability matrix, which for a two-state
Markov chain is |λ2 | = |γ11 + γ22 − 1|. The more persistent the states are—or
equivalently, the larger the size of λ2 —the lower the rate of convergence.

3 The nonnormality can easily be captured by generating scenarios using the current state prob-

abilities as initial probabilities.



3 Dynamic portfolio optimization


The approach outlined in the previous section can be applied online to forecast
the mean and variance of returns at discrete horizons. The forecasted means
and variances are inputs to a multi-period portfolio optimization. Every day a
decision has to be made whether or not to change the current portfolio allocation,
knowing that the decision will be reconsidered the next day with new input. In
the risk-neutral case, the possible gain or saving from changing allocation has
to exceed the costs involved.
The planning horizon should at least be long enough to reach the stationary
distribution of the underlying Markov chain, whereafter the forecast does not
change. In principle, the forecast horizon should be infinitely long, but in reality
no one has an infinite horizon.
There is a limit to how far ahead in time it is meaningful to make predictions.
For sufficiently long horizons, it is not possible to make better predictions than
the long-term mean and variance, which is also the reason that the forecasted
mean and variance converge to their stationary values, when the forecast horizon
extends. Thus, looking only a limited number of steps into the future is not just
an approximation necessary to make the optimization problem computationally
feasible, it also seems perfectly reasonable.
The formulation of the dynamic portfolio optimization problem as a stochas-
tic control problem is inspired by Boyd et al. (2014). However, the objective
function and the way transaction costs are handled are significantly different.
Boyd et al. (2014) assumed that infinite amounts of cash could be entered into
the portfolio on any given day. Although it would be possible to constrain the
amount of cash that can be entered into the portfolio per day, it is for most
purposes more realistic to assume that only a finite amount of cash is available
initially. At any later point in time, the amount of cash available depends on
the portfolio’s development and transaction costs incurred.

3.1 Stochastic control formulation


Let ht ∈ Rn denote the portfolio holdings at time t, where (ht )i is the dollar
value of asset i at the end of day t, with (ht )i < 0 meaning a short position in
asset i. Assets can be bought and sold at the end of each day. Let ut ∈ Rn
denote the dollar values of the trades, with (ut )i > 0 meaning that asset i is
bought at the end of day t.
The post-trade portfolio is defined as

h⁺_t = h_t + u_t ,    t = 0, . . . , T − 1,    (14)

which is also the portfolio at the beginning of day t + 1. The total value of
the portfolio before trading is Vt = 1T ht and the total value of the post-trade

portfolio is Vt+ = 1T h+ t ≤ Vt . Working with holdings ht , rather than weights


ht /Vt , simplifies the notation.
The portfolio is assumed to be self-financing with transaction costs proportional
to the total trade volume, that is,

1T ut + κT |ut | = 0, t = 0, . . . , T − 1, (15)

where κ is a vector of commission rates and the absolute value is elementwise.


The constraint states that −1T ut , which is the total gross proceeds from sales
minus the total gross proceeds from purchases, equals κT |ut |, the total transac-
tion cost for purchases and sales.
For optimization purposes, the constraint (15) is replaced by the convex relax-
ation
1T ut + κT |ut | ≤ 0, t = 0, . . . , T − 1, (16)
which allows the possibility of discarding money.
The post-trade portfolio is held until the end of the next day. The (pre-trade) portfolio at the end of the next day is given by

h_{t+1} = (1 + r_{t+1}) ∘ h⁺_t ,    t = 0, . . . , T − 1,

where r_{t+1} ∈ R^n is the vector of asset returns from day t to day t + 1 and ∘ denotes Hadamard (elementwise) multiplication of vectors. As illustrated in figure 1, the dynamics are linear, but unknown at time t.

Figure 1: Timeline of portfolio dynamics.
The returns rt are random variables with mean and covariance
E[r_t] = r̄_t ,    E[ (r_t − r̄_t)(r_t − r̄_t)^T ] = Σ_t ,    t = 1, . . . , T.

The trades are determined in each period by a trading policy ϕt : Rn → Rn :

ut = ϕt (ht ) , t = 0, . . . , T − 1.

Let Ct ⊆ Rn denote the post-trade portfolio constraint set. Since Ct is nonempty,


it follows that for any value of h_t , there exists a u_t for which

h⁺_t = h_t + u_t ∈ C_t .

Explicit constraints are imposed only on the post-trade portfolio h+ t , because


this can be controlled by buying and selling (i.e., through ut ), whereas the pre-
trade portfolio ht is determined by the random return rt in the previous period
and, therefore, not directly controllable.
The portfolio may be subject to constraints on the post-trade holdings, such as
minimum and maximum allowed holdings for each asset:

hmin
t ≤ h+ max
t ≤ ht , (17)

where the inequalities are elementwise and hmint and hmax


t are given vectors of
minimum and maximum asset holdings in dollars. For a long-only portfolio
with no short positions allowed hmin = 0. Position limits can also be expressed
relative to the total portfolio value, for example,

yt+ ≤ Vt+ Htmax , (18)

with Htmax ∈ Rn .
The overall objective is to maximize
J = E[ V_T − ∑_{t=0}^{T−1} ψ_t(h_t , u_t) ] ,    (19)

where the expectation is over the sequence of returns r1 , . . . , rT conditional on


all past observations, VT = 1T hT is the terminal value of the portfolio, and
ψt : Rn × Rn → R is a cost, with units of dollars, for period t. This is a
stochastic control problem with linear dynamics and convex objective function,
ensuring the existence of a unique solution (Boyd and Vandenberghe 2004). The
data for the problem is the distribution of rt , the stage costs ψt , and the initial
portfolio h0 .

Risk-averse control
The traditional risk-adjustment charge is proportional to the variance of the next
period portfolio value given the current post-trade portfolio, which corresponds
to

ψ_t(h_t , u_t) = γ Var[ V_{t+1} | h⁺_t ] / V⁺_t = γ (h⁺_t)^T Σ_{t+1} h⁺_t / V⁺_t ,    (20)
where γ ≥ 0 is a unitless risk-aversion parameter. To ensure that the tradeoff
between terminal value and variance does not depend on the portfolio value,
the variance is scaled by the post-trade value Vt+ .
If the returns are independent, then the sum of the variances is the variance
of the terminal value. In that case, the objective function (19) with the risk-
adjustment charge (20) is equivalent to the mean–variance criterion of Markowitz

(1952).4 It is a special case of expected utility maximization with a quadratic


utility function. While the utility approach was theoretically justified by von
Neumann and Morgenstern (1953), in practice few, if any, investors know their
utility functions; nor do the functions which financial engineers and economists
find analytically convenient necessarily represent a particular investor’s attitude
toward risk and return (Dai et al. 2010a). The mean–variance criterion remains
the most commonly used in portfolio optimization (Meucci 2005).
As an alternative to including a risk penalty in the objective function, Boyd et al.
(2014) proposed to constrain the portfolio standard deviation to a fraction of the
portfolio value. In a single-period setting, the two formulations are equivalent.
A risk limit might be preferable, because it is easier to quantify than a risk-
aversion parameter. In a multi-period setting, however, a constraint on the
portfolio standard deviation leads to excessive trading and inferior performance.
The resulting constant-risk portfolio does not consider the attractiveness of the
risk-return tradeoff. In order to maximize the return, the portfolio will be right
at the risk limit most of the time, which leads to forced trading, whenever the
volatility forecast increases unexpectedly.

Trading aversion
Boyd et al. (2014) gave examples of other convex constraints and cost terms
that arise in practical investment problems and can easily be included. One
option is to include a penalty for trading

ψt (ut ) = ρT |ut | , (21)

where ρ is a vector of trading-aversion parameters. This could reflect a con-


servatism toward trading, for example, due to the uncertainty related to the
parameter estimates and forecasts. Inflating the transaction cost κ in (15)–(16)
would have the same effect. In order to distinguish it from the actual transac-
tion cost, the trading penalty (21) is instead included in the objective function
(19), similarly to the variance penalty (20).
It is well known that estimation errors can cause mean–variance optimized port-
folios to perform poorly (Michaud 1989, DeMiguel et al. 2009b). Brodie et al.
(2009) reformulated the mean–variance optimization problem as a constrained,
least-squares regression problem. Imposing a trading penalty (21) that is pro-
portional to the trade volume is a convex relaxation of constraining the number
of trades. Similar to the least absolute shrinkage and selection operator (lasso)
in regression analysis (Tibshirani 1996), this ℓ1 penalty regularizes the optimiza-
tion problem and reduces the risk due to estimation errors (Ho et al. 2015).

4 Everything is scaled by Vt .

3.2 Model predictive control


MPC, also referred to as receding horizon control or rolling horizon planning,
is widely used in some industries; primarily for systems with slow dynamics
such as energy systems, chemical plants, and supply chains (see, e.g., Bemporad
2006), but it is also used, for example, for steering autonomous vehicles (see,
e.g., Falcone et al. 2007). MPC typically works very well in practice, even for
short horizons.
MPC is based on the simple idea that in order to determine ut , all future (un-
known) returns are replaced by their forecasted mean values r̂τ , τ = t+1, . . . , T.
This turns the stochastic control problem into a deterministic optimization prob-
lem
maximize    V_T − ∑_{τ=t}^{T−1} ψ_τ(h_τ , u_τ)                                   (22)
subject to  h_{τ+1} = (1 + r̂_{τ+1}) ∘ (h_τ + u_τ) ,    τ = t, . . . , T − 1
with variables ht+1 , . . . , hT and ut , . . . , uT −1 . Note that ht is not a variable, but
the (known) current portfolio holdings.
Solving this convex optimization problem yields an optimal sequence of trades
u⋆t , . . . , u⋆T −1 . This sequence is a plan for future trades over the remaining trad-
ing horizon under the highly unrealistic assumption that future returns will be
equal to their forecasted values. An alternative is to forecast the unconditional
distribution and generate a number of scenarios, but this is computationally
much more challenging.
The MPC policy takes ϕMPC (ht ) = u⋆t , that is, only the first trade in the planned
sequence of trades is executed. At the next step, the process is repeated, starting
from the new portfolio ht+1 . In the case of a mean–variance objective function,
Herzog et al. (2007) showed that future asset allocation decisions do not depend
on the trajectory of the portfolio, but solely on the current tradeoff between
satisfying the constraints and maximizing the objective. As emphasized by
Boyd et al. (2014), there are computational advantages to using MPC in cases
when estimates of future return statistics are updated online. In this case, the
expected returns r̄t are simply replaced with the most recent return estimates.
MPC for stochastic systems is a suboptimal control strategy. However, it uses
new information advantageously and is better than pure open-loop control (Her-
zog et al. 2007). The open-loop policy would be to execute the entire sequence
of trades u⋆t , . . . , u⋆T −1 based on the initial portfolio without recourse. Using
Monte Carlo simulation, Boyd et al. (2014) showed that, in any practical sense,
the MPC policy is optimal.

Truncated MPC
The MPC policy described in (22) plans a sequence of trades for the full time
interval t, . . . , T . A common variation is to look a limited number of steps, K,

Algorithm 1: MPC approach to dynamic portfolio optimization.

1. Update HMM parameters based on the most recent returns
2. Forecast the mean and variance K steps into the future
3. Compute an optimal sequence of trades u⋆_t, . . . , u⋆_{t+K−1} based on the current portfolio
4. Execute the first trade, u⋆_t, in the sequence and return to step 1

into the future. At each time t the optimization problem is


    maximize    V^term_{t+K}(h_{t+K}) − Σ_{τ=t}^{t+K−1} ψ_τ(h_τ, u_τ)                (23)
    subject to  h_{τ+1} = (1 + r̂_{τ+1}) ◦ (h_τ + u_τ),   τ = t, . . . , t + K − 1

with variables h_{t+1}, . . . , h_{t+K} and u_t, . . . , u_{t+K−1}. Here K is the number of
steps of look-ahead and V^term_{t+K} is the terminal value.
If the terminal value is appropriately chosen, then the truncated MPC policy
is exactly the same as the full look-ahead policy. If K is large enough to reach
the stationary distribution of the underlying Markov chain, then V^term_{t+K}(h_{t+K})
can be replaced by V_{t+K} = 1^T h_{t+K}, since the risk–return tradeoff does not
change after this point. If transaction costs are very high, then the choice of
K can affect the result, even when K is large enough to reach the stationary
distribution of the underlying Markov chain. However, the choice should reflect
how far ahead in time it is meaningful to make predictions.
Algorithm 1 summarizes the four steps in the MPC approach to solving the
dynamic portfolio optimization problem. At time t, a new measurement is
obtained, which is used to update the parameters of the HMM and forecast
the mean and variance K steps into the future. The next step is to compute
an optimal sequence of trades u⋆_t, . . . , u⋆_{t+K−1} based on the current portfolio
and the forecasts. Only the first trade in the sequence is executed before a
new measurement is obtained and the procedure is repeated. Computing the
optimal sequence of trades for K = 100 by solving the optimization problem
(23) takes less than 18 milliseconds using CVXPY (Diamond and Boyd 2016)
with the open-source solver ECOS (Domahidi et al. 2013).
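
As an illustration of how step 3 can be set up, the following CVXPY sketch solves a small instance of (23) under simplifying assumptions: one stock index plus zero-interest cash, a linear transaction-cost term, a variance penalty scaled by the initial portfolio value, and a self-financing constraint that leaves costs out of the budget. It is not the code behind the reported timings, and all numbers (h0, r_hat, kappa, gamma, sigma2) are placeholders.

import cvxpy as cp
import numpy as np

n, K = 2, 100                            # stock index plus cash, 100 steps of look-ahead
h0 = np.array([50.0, 50.0])              # current holdings in dollars (placeholder)
v0 = h0.sum()
r_hat = np.zeros((K, n))
r_hat[:, 0] = 0.0003                     # forecasted daily stock return, cash at zero
kappa, gamma, sigma2 = 0.001, 2.0, 1e-4  # cost, risk aversion, forecasted variance

h = cp.Variable((K + 1, n))
u = cp.Variable((K, n))
cost, constr = 0, [h[0] == h0]
for t in range(K):
    cost += (kappa * cp.abs(u[t, 0])
             + gamma * sigma2 * cp.square(h[t, 0] + u[t, 0]) / v0)
    constr += [cp.sum(u[t]) == 0,        # self-financing, costs left out of the budget
               h[t] + u[t] >= 0,         # long-only, no cash borrowing
               h[t + 1] == cp.multiply(1 + r_hat[t], h[t] + u[t])]
cp.Problem(cp.Maximize(cp.sum(h[K]) - cost), constr).solve()
print(np.round(u.value[0], 2))           # only the first trade in the plan is executed

Only the first row of u would be traded; at the next step the parameters and forecasts are refreshed and the problem is re-solved, which is the receding-horizon idea described above.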

4 Empirical results
4.1 Data
The asset universe considered consists of various major stock market indices
and cash. Cash positions are assumed to be risk-free and yield zero interest;
Figure 2: MSCI World Total Return Index and its daily log-returns.

hence, the only source of performance is the stock indices. The stock indices
are considered one at a time. By only considering one risky asset and one risk-
free asset, correlations can be disregarded. It is natural to focus on a stock
index, since portfolio risk is typically dominated by stock market risk (see, e.g.,
Goyal et al. 2015). Previous studies on RBAA have also focused on stocks and
cash, sometimes in combination with bonds (Ang and Bekaert 2004, Guidolin
and Timmermann 2007, Bulla et al. 2011, Kritzman et al. 2012, Nystrup et al.
2015a).
In the first subsections, the data analyzed is 4,943 daily log-returns of the MSCI
World Total Return Index covering the period from 1997 through 2015.5 Then
in section 4.6, the analysis is repeated for S&P 500, TOPIX, DAX, FTSE, and
MSCI EM. Figure 2 shows the MSCI World index and its daily log-returns over
the 19-year data period. The volatility forms clusters, as large price movements
tend to be followed by large price movements and vice versa, as noted by Man-
delbrot (1963).6 RBAA aims to exploit this persistence of the volatility, since
risk-adjusted returns, on average, are substantially lower during turbulent pe-
riods, irrespective of the source of turbulence, as shown by Kritzman and Li
(2010).
Similar to previous studies, the regime detection will focus on the log-returns
of the stock indices. Observed regimes in financial markets are related to the
phases of the business cycle (Campbell 1999, Cochrane 2005). As argued in
Nystrup et al. (2015a), the link is complex and difficult to exploit for investment
purposes due to the large lag in the availability of data related to the business
cycle. Besides, stock markets generally lead the economy (Siegel 1991). The

5 The log-returns are calculated using r_t = log(P_t) − log(P_{t−1}), where P_t is the closing price of the index on day t and log is the natural logarithm.
6 A quantitative manifestation of this fact is that while returns themselves are uncorrelated, absolute and squared returns display a positive, significant, and slowly decaying autocorrelation function.

Figure 3: Parameters (μ_1, μ_2, σ_1², σ_2², γ_11, and γ_22) of a two-state Gaussian HMM estimated adaptively using an effective memory length of N_eff = 260 days.

focus is, therefore, on readily available market data instead of attempting to establish the link to the business cycle.

4.2 HMM parameters


Figure 3 shows the result of applying the adaptive estimator (10) on the daily
log-returns of the MSCI World index to estimate the parameters of a two-state
Gaussian HMM. An effective memory length of Neff = 260 days was used with
the tuning constant A = 1/Neff . The first 260 observations were used for initial-
ization.
The choice of memory length affects the parameter estimates and can be viewed

as a tradeoff between bias and variance. A shorter window yields a faster adap-
tion to changes, but larger variance of the estimates, as fewer observations are
included in the estimation. In Nystrup et al. (2017b), an effective memory
length of about one year was found to give the best forecasts. In agreement
with this, forecasts based on a memory length of 260 days are found to give
good results when used as inputs for MPC of an investment portfolio.
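
For intuition only, the following sketch shows the kind of exponential forgetting that an effective memory length implies; it is not the adaptive estimator (10) used in this article, and the state probability supplied to the update is simply assumed to be given.

import numpy as np

def update(mean, var, r, state_prob, n_eff=260):
    # One forgetting step for the running mean and variance of one state;
    # observations likely generated by the other state are down-weighted.
    lam = 1.0 - 1.0 / n_eff              # forgetting factor for N_eff = 260 days
    step = (1.0 - lam) * state_prob
    new_mean = mean + step * (r - mean)
    new_var = var + step * ((r - mean) ** 2 - var)
    return new_mean, new_var

mean, var = 0.0005, 1e-4                 # placeholder starting values
for r in np.random.default_rng(0).normal(0.0003, 0.01, 260):
    mean, var = update(mean, var, r, state_prob=0.9)
print(mean, var)

A larger n_eff corresponds to a longer memory and smoother, more slowly adapting estimates, which is the bias-variance tradeoff discussed above.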
Similar to the finding in previous studies (Rydén et al. 1998, Bulla 2011, Nystrup
et al. 2017b), the HMM parameters are seen to fluctuate a lot over the 19-year
data period. State one is the most persistent, has the lowest variance, and—
most of the time—has a positive mean. State two has a much higher variance
and a negative mean value most of the time. The period in the early 2000s,
when the mean return in both states was negative, exposes a weakness to having
predefined rules for changing the portfolio based on the regime; regardless of the
regime, an allocation to stocks would lose money in those years.
The probabilities of staying in the current state, γ11 and γ22 , appear to be
less than 0.99 the majority of the time. A probability of 0.99 would imply an
expected sojourn time of 1/ (1 − 0.99) = 100 days. The number of steps of
look-ahead, K, is, therefore, chosen to be 100.7 This is found to be a sufficiently
large number in that a further increase does not affect the results.

4.3 Optimal thresholds for changing allocation


The optimal thresholds for changing allocation depend on the specific parameter
values. Using MPC, the current parameter values are taken into account when
deciding whether to change the portfolio, instead of having a static decision rule
for changing the allocation based on the inferred regime. As an example, figure 4
shows the optimal thresholds for changing allocation for different levels of risk
aversion and transaction costs for a long-only portfolio based on the parameter
values on January 2, 2012. The thresholds are not necessarily optimal ex post.
Based on the parameter values on January 2, 2012, a risk-neutral investor with
γ = 0 that can trade at zero cost should be fully invested in stocks when the
probability of currently being in state one (the state with low variance and
positive mean) exceeds 0.55. Whenever the probability falls below 0.55, the
risk-neutral investor should sell all stocks. The risk-neutral investor is always
fully allocated to either stocks or cash.
A risk-averse investor with γ = 2 should be fully invested in stocks when the
probability of currently being in state one exceeds 0.86 and—similar to a risk-
neutral investor—fully allocated to cash when the probability is below 0.55. In
between, the portfolio is a mix of stocks and cash.
7 The number of steps of look-ahead, K, could also be chosen based on the maximum absolute value of the second largest eigenvalue of the transition probability matrix, |λ_2| = |1 − γ_11 − γ_22|, which is just below 0.99.

Figure 4: Optimal thresholds for changing allocation for different levels of risk aversion, γ, and transaction costs, κ, based on the parameter values on January 2, 2012. (Panels: MPC(γ = 0, κ = 0), MPC(γ = 2, κ = 0), and MPC(γ = 0, κ = 0.02), each showing the allocation as a function of Pr(S_t = 1).)

A risk-neutral investor that can trade stocks at a cost of κ = 0.02 per transaction
should buy stocks when the probability of currently being in state one exceeds
0.75 and sell all stocks when the probability falls below 0.17. In between, the
risk-neutral investor should keep the initial portfolio. The thresholds are not
symmetric, as 0.75 ̸= 1 − 0.17.
Institutional investors can trade stocks at a much lower cost than 2%, but this
high value clearly illustrates that the presence of transaction costs leads to a
no-trade zone. Based on the parameter values at the end of 2015, for example, a
risk-neutral investor that can trade at a cost of 2% should never sell stocks, even
if the probability of currently being in the good state is zero, because of the low
persistence of the bad state and the not excessively negative mean value. The
higher the transaction costs, the larger the no-trade zone. Dai et al. (2010b)
derived similar results in a continuous-time framework.
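
The thresholds in figure 4 can be traced out numerically. The sketch below is a simplification of the MPC problem, not the procedure used for the figure: it sweeps the current probability of state one, forecasts the mean return K steps ahead from the HMM, and applies a risk-neutral buy/sell/hold rule against a transaction cost. The HMM parameters (mu, Gamma) and the cost are placeholders, not the January 2, 2012 estimates.

import numpy as np

mu = np.array([0.0006, -0.001])          # state-dependent mean returns (placeholders)
Gamma = np.array([[0.99, 0.01],
                  [0.03, 0.97]])         # transition probability matrix (placeholder)
K, kappa = 100, 0.02                     # look-ahead and transaction cost

def cumulative_forecast(p1):
    # Expected cumulative return over the next K days given Pr(S_t = 1) = p1.
    probs, total = np.array([p1, 1.0 - p1]), 0.0
    for _ in range(K):
        probs = probs @ Gamma            # k-step-ahead state probabilities
        total += probs @ mu              # k-step-ahead mean return
    return total

grid = np.linspace(0.0, 1.0, 101)
decision = ["buy" if cumulative_forecast(p) > kappa
            else "sell" if cumulative_forecast(p) < -kappa
            else "hold" for p in grid]
print("buy above Pr(state 1) of about",
      grid[decision.index("buy")] if "buy" in decision else None)

As in the figure, a positive transaction cost opens up a no-trade zone between the buy and sell thresholds, and the thresholds move with the current parameter estimates.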

4.4 Comparison of MPC results


Table 5 summarizes the performance of MPC with and without risk (20) and
trading (21) aversion for K = 100 steps of look-ahead. Short positions were
not allowed. Transaction costs of κ = 0.001 (10 basis points per transaction)
have been deducted in all three cases. This is a realistic cost for an institutional
investor.
The first approach, MPC(γ = 0, ρ = 0), is pure return maximization with no
risk or trading penalty (except for the 10 basis point transaction cost). This
approach yields the highest annualized return (AR) despite having an annual
turnover (AT) of 4.61. An AT of 4.61 means that the entire portfolio is shifted
from stocks to cash or vice versa, on average, 4.61 times per year.
Introducing a risk penalty by setting γ = 2 increases the AT from 4.61 to
5.30 and leads to a lower standard deviation (SD), lower maximum drawdown

MPC(γ = 0, ρ = 0) MPC(γ = 2, ρ = 0) MPC(γ = 0, ρ = 0.02)


Annualized return 0.076 0.062 0.070
Standard deviation 0.12 0.10 0.11
Sharpe ratio 0.65 0.61 0.67
Maximum drawdown 0.26 0.21 0.23
Calmar ratio 0.29 0.29 0.31
Annual turnover 4.61 5.30 1.17

Table 5: Performance of MPC with and without risk and trading aversion.

(MDD)8 , and lower AR. The Calmar ratio (CR)9 is unaffected. To ensure that
the performance comparison is not distorted by autocorrelation in the daily
returns, the reported SDs have been adjusted for autocorrelation using the pro-
cedure outlined by Kinlaw et al. (2015). As γ increases and more emphasis is
put on the variance forecast, the Sharpe ratio (SR)10 deteriorates. This could
indicate that the mean changes faster than the variance; however, the variance
is crucial when distinguishing between market regimes (Nystrup et al. 2016).
Basing the allocation decision on both the mean and variance forecasts adds
another source of estimation error. This should be further explored in a future
study encompassing more assets.
The third approach, MPC(γ = 0, ρ = 0.02), is return maximization with a 2%
trading penalty on top of the 10 basis point transaction cost. The trading
penalty (21) with ρ = 0.02 is included in the objective function when deter-
mining the optimal sequence of trades in step 3 of algorithm 1, but the actual
transaction cost applied is still κ = 0.001. This subjective trading penalty leads
to a significantly lower AT and a slightly lower AR compared to the unpenalized
case, while the SR and CR are roughly unchanged.
Figure 6 shows the transactions in the MSCI World index relative to the portfolio
value at the time for the three approaches. Introducing a risk penalty by setting
γ = 2 leads to more frequent trading and a higher AT compared to pure return
maximization. A trading penalty, on the other hand, leads to significantly fewer
trades. The trading penalty appears to be effective at reducing the number of
trades that are reversed within a short timespan and may, therefore, be preferred
in some applications.

8 The maximum drawdown is the largest relative decline from a historical peak in the index value.
9 The Calmar ratio is the annualized return divided by the maximum drawdown.
10 The Sharpe ratio is the annualized return divided by the standard deviation adjusted for autocorrelation.
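
For reference, the small helper below (not part of the article) shows how these summary statistics can be computed from a daily return series; the autocorrelation adjustment of Kinlaw et al. (2015) applied to the reported SDs is omitted, and the risk-free rate is taken to be zero, as elsewhere in this article.

import numpy as np

def summary(returns, periods_per_year=252):
    growth = np.cumprod(1.0 + returns)               # cumulative value of one unit
    years = len(returns) / periods_per_year
    ar = growth[-1] ** (1.0 / years) - 1.0           # annualized return
    sd = returns.std(ddof=1) * np.sqrt(periods_per_year)
    mdd = np.max(1.0 - growth / np.maximum.accumulate(growth))
    return {"AR": ar, "SD": sd, "SR": ar / sd, "MDD": mdd, "CR": ar / mdd}

rng = np.random.default_rng(1)
print(summary(rng.normal(0.0003, 0.01, 252 * 19)))   # placeholder return series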

Figure 6: Trades in the MSCI World index relative to the portfolio value at the time based on MPC with and without risk and trading aversion. (Panels: MPC(γ = 0, κ = 0.001, ρ = 0), MPC(γ = 2, κ = 0.001, ρ = 0), and MPC(γ = 0, κ = 0.001, ρ = 0.02), each showing u_t/V_t over 1998–2016.)

MPC(γ = 0, ρ = 0) MPC(γ = 0, ρ = 0.02) MSCI World RBAA SAA


Annualized return 0.065 0.069 0.056 0.046 0.041
Standard deviation 0.12 0.11 0.18 0.12 0.13
Sharpe ratio 0.56 0.63 0.30 0.38 0.32
Maximum drawdown 0.26 0.25 0.57 0.34 0.44
Calmar ratio 0.25 0.28 0.10 0.13 0.09
Annual turnover 4.16 1.17 0.00 1.94 0.09

Table 8: Performance of MPC with and without a trading penalty compared to the
MSCI World index, rule-based RBAA, and SAA, when allocation changes are subject to
a one-day delay.

4.5 Comparison with rule-based approach when allocation changes are delayed
The results presented in the previous subsection are based on the assumption that it is possible to trade at the closing price after it is known and the parameters and forecasts have been updated. It is often more realistic to assume that allocation changes cannot be implemented until the end of the next day, as illustrated in figure 7. To ensure that the long-only constraint is still satisfied at all times, trading decisions have to be implemented as fractions of the holding.11

Figure 7: Timeline of portfolio dynamics when trades are delayed by one day.
In table 8, the performance of the risk-neutral MPC approach with and without
a trading penalty is reported when allocation changes are subject to a one-
day delay. Transaction costs of 10 basis points have been deducted from the
reported results. The AR of MPC with no risk or trading penalty is about one
percentage point lower when allocation changes are delayed, in spite of the AT
being lowered from 4.61 to 4.16.
Imposing a trading penalty actually increases the AR when allocation changes
are subject to a one-day delay. The AR, SR, and CR of the risk-neutral MPC
approach with trading aversion are almost unchanged compared to when there
is no delay. This suggests that a trading penalty increases the robustness of the
MPC approach, similarly to what it does in a single-period setting (Brodie et al.
2009, Ho et al. 2015). The delay has no impact on the AT of the penalized MPC
approach that is still substantially lower than for the unpenalized approach.

11 If a decision is made on day t to sell $80 worth of stocks out of a total holding of $100, then 80% of the stocks are sold on day t + 1, regardless of their value.



In table 8, the performance of the risk-neutral MPC approaches is compared with the MSCI World index, a rule-based RBAA approach, and an SAA portfolio
that is rebalanced monthly to a fixed allocation of 69% stocks and 31% cash,
which equals the average allocation of the unpenalized MPC approach.
The rule-based RBAA approach is the same as in Nystrup et al. (2017a). Like
the risk-neutral MPC approaches, it is either fully allocated to stocks or cash.
The allocation is changed when the probability that a regime change has oc-
curred exceeds a threshold of 0.9998. The underlying HMM is estimated using
(7) with an effective memory length of two years, since this was found to give
better results. A similar memory length was used in Nystrup et al. (2015a,
2017a).
The risk-neutral MPC approaches have realized the highest AR and have out-
performed the MSCI World index that has a significantly higher SD and MDD.
This is under the assumption that cash positions yield zero interest. Further,
with approximately the same SD and a lower MDD than the RBAA and SAA
portfolios, the MPC approaches have realized a substantially higher SR and CR.
RBAA outperforms both SAA and the index in terms of SR and CR. Its AT of
1.94 is higher than that of the penalized MPC approach, but less than half of
that of the unpenalized MPC approach. The performance of RBAA relative to
the index is not as convincing as when there are other alternatives to invest in
than zero-interest cash (see Nystrup et al. 2015a, 2017a).
In figure 9, the performance of MPC with no trading or risk penalty is compared
to the rule-based RBAA strategy and the MSCI World index when allocation
changes are subject to a one-day delay. In the shaded periods, the MPC (top
half) and the RBAA portfolios (bottom half) were fully allocated to cash. The
allocations are different from what would be expected if the regimes were based
on a business cycle indicator.
The MPC portfolio performs better than the RBAA portfolio during the build-
up and burst of the dot-com bubble around the year 2000. It is fully allocated
to stocks most of the time leading up to the peak and fully allocated to cash
throughout the downturn. The MPC portfolio stays fully allocated to cash
throughout the downturn, because the mean value in both states is negative
in this period, cf. figure 3. The RBAA portfolio times the subsequent rebound
better and does well by staying fully allocated to cash throughout the crash
in 2008. The MPC portfolio times the rebound in 2009 better and gradually
extends its lead over the following years.
The MPC portfolio is slightly behind the MSCI World index at the peak in
year 2000, but then moves ahead of the index during the downturn. The lead
is maintained in the following years leading up to the crash in 2008, during
which the lead is significantly extended. Part of the lead is lost during the
market rebound in the first half of 2009, before the MPC allocation is shifted

Figure 9: Performance of the return-maximizing MPC approach compared with rule-based RBAA and the MSCI World index. In the shaded periods, the MPC and the RBAA portfolios (top and bottom half, respectively) were fully allocated to cash. (Series shown: MPC(γ = 0, κ = 0.001, ρ = 0), RBAA(κ = 0.001), and MSCI World TR.)

to stocks. At the end of the sample, the performance gap is substantial. The
outperformance relative to the index comes from the two major downturns, but
this is hardly surprising, since there is no other source of return than the index
itself. In risk-adjusted terms, the outperformance is conclusive.

4.6 Application to other indices


Table 10 summarizes the results from applying MPC with no risk or trading
penalty (i.e., MPC(γ = 0, ρ = 0)) to various major stock market indices. For
the MSCI World index, the numbers are the same that were reported in table 5.
Recall that the testing period for the MSCI World index spans 1998 through
2015. For S&P 500, TOPIX, DAX, and FTSE, the data period includes 1984
through 2015. The first two years are used for initialization, leaving 30 years
for testing. For the MSCI EM index, daily data is only available from 1988 and
onwards, thus the testing period is four years shorter. All indices are net total
return.
The MPC approach realizes a higher SR and CR than a buy-and-hold investment
(as summarized in parentheses) in five out of six indices, with FTSE being
the only exception. The best performance relative to the underlying index is

MSCI World S&P500 TOPIX DAX FTSE MSCI EM


Annualized return 0.076 (0.056) 0.110 (0.108) 0.044 (0.025) 0.066 (0.073) 0.051 (0.078) 0.119 (0.072)
Standard deviation 0.12 (0.18) 0.14 (0.16) 0.17 (0.24) 0.16 (0.22) 0.14 (0.15) 0.18 (0.27)
Sharpe ratio 0.65 (0.30) 0.79 (0.68) 0.26 (0.10) 0.41 (0.33) 0.36 (0.54) 0.66 (0.27)
Maximum drawdown 0.26 (0.57) 0.36 (0.55) 0.48 (0.72) 0.39 (0.73) 0.36 (0.48) 0.38 (0.65)
Calmar ratio 0.29 (0.10) 0.30 (0.20) 0.09 (0.03) 0.17 (0.10) 0.14 (0.16) 0.31 (0.11)
Annual turnover 4.61 (0.00) 3.29 (0.00) 4.69 (0.00) 5.89 (0.00) 4.59 (0.00) 8.18 (0.00)

Table 10: Performance of MPC with no risk or trading penalty when applied to various
major stock indices with no delay in allocation changes. The numbers in parentheses are
the summary statistics for a buy-and-hold investment in the respective indices.

obtained for MSCI World, TOPIX, and MSCI EM. For all indices except DAX
and FTSE, the AR of the MPC approach is higher than that of the underlying
index despite high ATs—which could easily be reduced by introducing a trading
penalty. In all six cases, the SD and MDD are lower than those of the underlying
index. The results in table 10 show that the MPC approach in combination with
the adaptively-estimated HMM has worked well for multiple major stock indices
across different time periods. By introducing a trading penalty and calibrating
its level to each index individually, it would be possible to further improve the
results.

5 Conclusion
This article has shown the strength of using MPC for dynamic portfolio opti-
mization in combination with an online method for forecasting the mean and
variance of financial returns. There were computational advantages to using
MPC in cases when estimates of future return statistics were updated every
time a new observation became available, since the optimal control actions
were reconsidered anyway.
Based on forecasts from an adaptively-estimated HMM, the MPC approach re-
alized a higher return and a significantly lower risk than a buy-and-hold invest-
ment in various major stock indices. This was after accounting for transaction
costs. Imposing an additional trading penalty increased the robustness, by re-
ducing the number of trades, and improved the performance, when allocation
changes were subject to a delay. MPC also outperformed RBAA based on a
static decision rule for changing the portfolio. The performance of rule-based
RBAA has been stronger in previous studies, where there were more investment
opportunities than stocks and zero-interest cash. Thus, there is potential for
using MPC for optimal control of multi-asset portfolios.
To keep things simple and illustrate the strength of the approach, the focus of
this article was on stocks and cash, but it naturally extends to a multi-asset
portfolio. Another possibility for future work would be to specify a model for
the parameter changes, possibly including relevant explanatory variables, in an

attempt to improve the forecasts and take the stochasticity into account in the
portfolio optimization.

References
Ang, A. and G. Bekaert. “How regimes affect asset allocation.” Financial Ana-
lysts Journal, vol. 60, no. 2 (2004), pp. 86–99.

Ang, A. and A. Timmermann. “Regime changes and financial markets.” Annual


Review of Financial Economics, vol. 4, no. 1 (2012), pp. 313–337.

Baum, L. E., T. Petrie, G. Soules, and N. Weiss. “A maximization technique oc-


curring in the statistical analysis of probabilistic functions of Markov chains.”
Annals of Mathematical Statistics, vol. 41, no. 1 (1970), pp. 164–171.

Bemporad, A. “Model predictive control design: New trends and tools.” In


Proceedings of the 45th IEEE Conference on Decision and Control (2006), pp.
6678–6683.

Boyd, S., M. T. Mueller, B. O’Donoghue, and Y. Wang. “Performance bounds


and suboptimal policies for multi-period investment.” Foundations and Trends
in Optimization, vol. 1, no. 1 (2014), pp. 1–72.

Boyd, S. and L. Vandenberghe. Convex Optimization. Cambridge University


Press: New York (2004).

Brennan, M. J., E. S. Schwartz, and R. Lagnado. “Strategic asset allocation.”


Journal of Economic Dynamics and Control, vol. 21, no. 8–9 (1997), pp.
1377–1403.

Brodie, J., I. Daubechies, C. D. Mol, D. Giannone, and I. Loris. “Sparse and


stable Markowitz portfolios.” Proceedings of the National Academy of Sciences
of the United States of America, vol. 106, no. 30 (2009), pp. 12267–12272.

Bulla, J. “Hidden Markov models with t components. Increased persistence and


other aspects.” Quantitative Finance, vol. 11, no. 3 (2011), pp. 459–475.

Bulla, J. and I. Bulla. “Stylized facts of financial time series and hidden semi-
Markov models.” Computational Statistics & Data Analysis, vol. 51, no. 4
(2006), pp. 2192–2209.

Bulla, J., S. Mergner, I. Bulla, A. Sesboüé, and C. Chesneau. “Markov-switching


asset allocation: Do profitable strategies exist?” Journal of Asset Manage-
ment, vol. 12, no. 5 (2011), pp. 310–321.

Calafiore, G. C. “Multi-period portfolio optimization with linear control policies.”


Automatica, vol. 44, no. 10 (2008), pp. 2463–2473.

Calafiore, G. C. “An affine control method for optimal dynamic asset allocation
with transaction costs.” SIAM Journal on Control and Optimization, vol. 48,
no. 4 (2009), pp. 2254–2274.
Campbell, J. Y. “Asset prices, consumption, and the business cycle.” In Hand-
book of Macroeconomics, edited by J. B. Taylor and M. Woodford, vol. 1C,
chap. 19. Elsevier: Amsterdam (1999), pp. 1231–1303.
Cappé, O., E. Moulines, and T. Rydén. Inference in Hidden Markov Models.
Springer: New York (2005).
Cochrane, J. H. “Financial markets and the real economy.” Foundations and
Trends in Finance, vol. 1, no. 1 (2005), pp. 1–101.
Costa, O. L. and M. V. Araujo. “A generalized multi-period mean–variance port-
folio optimization with Markov switching parameters.” Automatica, vol. 44,
no. 10 (2008), pp. 2487–2497.
Creal, D., S. J. Koopman, and A. Lucas. “Generalized autoregressive score mod-
els with applications.” Journal of Applied Econometrics, vol. 28, no. 5 (2013),
pp. 777–795.
Dai, M., Z. Q. Xu, and X. Y. Zhou. “Continuous-time Markowitz’s model with
transaction costs.” SIAM Journal on Financial Mathematics, vol. 1, no. 1
(2010a), pp. 96–125.
Dai, M., Q. Zhang, and Q. J. Zhu. “Trend following trading under a regime
switching model.” SIAM Journal on Financial Mathematics, vol. 1, no. 1
(2010b), pp. 780–810.
DeMiguel, V., L. Garlappi, and R. Uppal. “Optimal versus naive diversification:
How inefficient is the 1/N portfolio strategy?” Review of Financial Studies,
vol. 22, no. 5 (2009b), pp. 1915–1953.
Dempster, A. P., N. M. Laird, and D. B. Rubin. “Maximum likelihood from
incomplete data via the EM algorithm.” Journal of the Royal Statistical
Society. Series B (Methodological), vol. 39, no. 1 (1977), pp. 1–38.
Diamond, S. and S. Boyd. “CVXPY: A Python-embedded modeling language for
convex optimization.” Journal of Machine Learning Research, vol. 17, no. 83
(2016), pp. 1–5.
Domahidi, A., E. Chu, and S. Boyd. “ECOS: An SOCP solver for embedded
systems.” In Proceedings of the 12th European Control Conference (2013), pp.
3071–3076.
Falcone, P., F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat. “Predictive
active steering control for autonomous vehicle systems.” IEEE Transactions
on control systems technology, vol. 15, no. 3 (2007), pp. 566–580.

Frühwirth-Schnatter, S. Finite Mixture and Markov Switching Models. Springer:


New York (2006).

Goyal, A., A. Ilmanen, and D. Kabiller. “Bad habits and good practices.” Journal
of Portfolio Management, vol. 41, no. 4 (2015), pp. 97–107.

Guidolin, M. and A. Timmermann. “Asset allocation under multivariate regime


switching.” Journal of Economic Dynamics and Control, vol. 31, no. 11 (2007),
pp. 3503–3544.

Herzog, F., G. Dondi, and H. P. Geering. “Stochastic model predictive control


and portfolio optimization.” International Journal of Theoretical and Applied
Finance, vol. 10, no. 2 (2007), pp. 203–233.

Ho, M., Z. Sun, and J. Xin. “Weighted elastic net penalized mean–variance
portfolio design and computation.” SIAM Journal on Financial Mathematics,
vol. 6, no. 1 (2015), pp. 1220–1244.

Khreich, W., E. Granger, A. Miri, and R. Sabourin. “A survey of techniques


for incremental learning of HMM parameters.” Information Sciences, vol. 197
(2012), pp. 105–130.

Kinlaw, W., M. Kritzman, and D. Turkington. “The divergence of high- and low-
frequency estimation: Implications for performance measurement.” Journal
of Portfolio Management, vol. 41, no. 3 (2015), pp. 14–21.

Kritzman, M. and Y. Li. “Skulls, financial turbulence, and risk management.”


Financial Analysts Journal, vol. 66, no. 5 (2010), pp. 30–41.

Kritzman, M., S. Page, and D. Turkington. “Regime shifts: Implications for


dynamic strategies.” Financial Analysts Journal, vol. 68, no. 3 (2012), pp.
22–39.

Kulhavỳ, R. and M. B. Zarrop. “On a general concept of forgetting.” Interna-


tional Journal of Control, vol. 58, no. 4 (1993), pp. 905–924.

Lystig, T. C. and J. P. Hughes. “Exact computation of the observed information


matrix for hidden Markov models.” Journal of Computational and Graphical
Statistics, vol. 11, no. 3 (2002), pp. 678–689.

Mandelbrot, B. “The variation of certain speculative prices.” Journal of Business,


vol. 36, no. 4 (1963), pp. 394–419.

Markowitz, H. “Portfolio selection.” Journal of Finance, vol. 7, no. 1 (1952), pp.


77–91.

Meucci, A. Risk and Asset Allocation. Springer: Berlin (2005).



Michaud, R. O. “The Markowitz optimization Enigma: Is ’optimized’ optimal?”


Financial Analysts Journal, vol. 45, no. 1 (1989), pp. 31–42.
Nystrup, P., B. W. Hansen, H. O. Larsen, H. Madsen, and E. Lindström. “Dy-
namic allocation or diversification: A regime-based approach to multiple as-
sets.” Journal of Portfolio Management, vol. 44, no. 2 (2017a), pp. 62–73.
Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Regime-based ver-
sus static asset allocation: Letting the data speak.” Journal of Portfolio
Management, vol. 42, no. 1 (2015a), pp. 103–109.
Nystrup, P., B. W. Hansen, H. Madsen, and E. Lindström. “Detecting change
points in VIX and S&P 500: A new approach to dynamic asset allocation.”
Journal of Asset Management, vol. 17, no. 5 (2016), pp. 361–374.
Nystrup, P., H. Madsen, and E. Lindström. “Stylised facts of financial time
series and hidden Markov models in continuous time.” Quantitative Finance,
vol. 15, no. 9 (2015b), pp. 1531–1541.
Nystrup, P., H. Madsen, and E. Lindström. “Long memory of financial time
series and hidden Markov models with time-varying parameters.” Journal of
Forecasting, vol. 36, no. 8 (2017b), pp. 989–1002.
Parkum, J. E., N. K. Poulsen, and J. Holst. “Recursive forgetting algorithms.”
International Journal of Control, vol. 55, no. 1 (1992), pp. 109–128.
Rydén, T., T. Teräsvirta, and S. Åsbrink. “Stylized facts of daily return series
and the hidden Markov model.” Journal of Applied Econometrics, vol. 13,
no. 3 (1998), pp. 217–244.
Sheikh, A. Z. and J. Sun. “Regime change: Implications of macroeconomic shifts
on asset class and portfolio performance.” Journal of Investing, vol. 21, no. 3
(2012), pp. 36–54.
Siegel, J. J. “Does it pay stock investors to forecast the business cycle?” Journal
of Portfolio Management, vol. 18, no. 1 (1991), pp. 27–34.
Tibshirani, R. “Regression shrinkage and selection via the lasso.” Journal of the
Royal Statistical Society. Series B (Methodological), vol. 58, no. 1 (1996), pp.
267–288.
von Neumann, J. and O. Morgenstern. Theory of Games and Economic Behavior.
Princeton University Press: Princeton, 3rd ed. (1953).
PAPER H
Originally published in Foundations and Trends in Optimization

Multi-period trading via convex optimization

Stephen Boyd, Enzo Busseti, Steven Diamond, Ronald N. Kahn,


Kwangmoo Koh, Peter Nystrup, and Jan Speth

Abstract

We consider a basic model of multi-period trading, which can be used to


evaluate the performance of a trading strategy. We describe a framework
for single-period optimization, where the trades in each period are found
by solving a convex optimization problem that trades off expected return,
risk, transaction cost, and holding cost such as the borrowing cost for short-
ing assets. We then describe a multi-period version of the trading method,
where optimization is used to plan a sequence of trades, with only the first
one executed, using estimates of future quantities that are unknown when
the trades are chosen. The single-period method traces back to Markowitz;
the multi-period methods trace back to model predictive control. Our con-
tribution is to describe the single- and multi-period methods in one simple
framework, giving a clear description of the development and the approx-
imations made. In this paper we do not address a critical component in
a trading algorithm: the predictions or forecasts of future quantities. The
methods we describe in this paper can be thought of as good ways to ex-
ploit predictions, no matter how they are made. We have also developed a
companion open-source software library that implements many of the ideas
and methods described in the paper.

Keywords: Optimization; Model predictive control.

1 Introduction
Single- and multi-period portfolio selection. Markowitz (1952) was the
first to formulate the choice of an investment portfolio as an optimization prob-
lem trading off risk and return. Traditionally, this was done independently of
the cost associated with trading, which can be significant when trades are made
over multiple periods (Kolm et al. 2014). Goldsmith (1976) was among the
first to consider the effect of transaction cost on portfolio selection in a single-
period setting. It is possible to include many other costs and constraints in a
single-period optimization formulation for portfolio selection (Lobo et al. 2007,
Moallemi and Sağlam 2017).

In multi-period portfolio selection, the portfolio selection problem is to choose a sequence of trades to carry out over a set of periods. There has been much re-
search on this topic since the work of Samuelson (1969) and Merton (1969, 1971).
Constantinides (1979) extended Samuelson’s discrete-time formulation to prob-
lems with proportional transaction costs. Davis and Norman (1990) and Dumas
and Luciano (1991) derived similar results for the continuous-time formulation.
Transaction costs, constraints, and time-varying forecasts are more naturally
dealt with in a multi-period setting. Following Samuelson (1969) and Merton
(1969, 1971), the literature on multi-period portfolio selection is predominantly
based on dynamic programming (Bellman 1956, Bertsekas 1995), which properly
takes into account the idea of recourse and updated information available as the
sequence of trades is chosen (see Gârleanu and Pedersen 2013, and references
therein). Unfortunately, actually carrying out dynamic programming for trade
selection is impractical, except for some very special or small cases, due to the
‘curse of dimensionality’ (Powell 2007, Boyd et al. 2014). As a consequence,
most studies include only a very limited number of assets and simple objectives
and constraints. A large literature studies multi-period portfolio selection in the
absence of transaction cost (see, e.g., Campbell and Viceira 2002, and references
therein); in this special case, dynamic programming is tractable.
For practical implementation, various approximations of the dynamic program-
ming approach are often used, such as approximate dynamic programming, or
even simpler formulations that generalize the single-period formulations to multi-
period optimization problems (Boyd et al. 2014). We will focus on these simple
multi-period methods in this paper. While these simplified approaches can be
criticized for only approximating the full dynamic programming trading pol-
icy, the performance loss is likely very small in practical problems, for several
reasons. Boyd et al. (2014) developed a numerical bounding method that quan-
tifies the loss of optimality when using a simplified approach, and found it to
be very small in numerical examples. But in fact, the dynamic programming
formulation is itself an approximation, based on assumptions (like independent
or identically distributed returns) that need not hold well in practice, so the
idea of an ‘optimal strategy’ itself should be regarded with some suspicion.

Why now? What is different now, compared to 10, 20, or 30 years ago, is
vastly more powerful computers, better algorithms, specification languages for
optimization, and access to much more data. These developments have changed
how we can use optimization in multi-period investing. In particular, we can
now quickly run full-blown optimization, run multi-period optimization, and
search over hyperparameters in backtests. We can run end-to-end analyses,
indeed many at a time in parallel. Earlier generations of investment researchers,
relying on computers much less powerful than today, relied more on separate
models and analyses to estimate parameter values, and tested signals using
simplified (usually unconstrained) optimization.

Goal. In this tutorial paper we consider multi-period investment and trading.


Our goal is to describe a simple model that takes into account the main practical
issues that arise, and several simple and practical frameworks based on solving
convex optimization problems (Boyd and Vandenberghe 2004) that determine
the trades to make. We describe the approximations made, and briefly discuss
how the methods can be used in practice. Our methods do not give a complete
trading system, since we leave a critical part unspecified: forecasting future
returns, volumes, volatilities, and other important quantities (see, e.g., Grinold
and Kahn 2000). This paper describes good practical methods that can be used
to trade, given forecasts.
The optimization-based trading methods we describe are practical and reliable
when the problems to be solved are convex. Real-world single-period convex
problems with thousands of assets can be solved using generic algorithms in
well under a second, which is critical for evaluating a proposed algorithm with
historical or simulated data, for many values of the parameters in the method.

Outline. We start in section 2 by describing a simple model of multi-period


trading, taking into account returns, trading costs, holding costs, and (some)
corporate actions. This model allows us to carry out simulation, used for what-if
analyses, to see what would have happened under different conditions, or with
a different trading strategy. The data in simulation can be realized past data
(in a backtest) or simulated data that did not occur, but could have occurred
(in a what-if simulation), or data chosen to be particularly challenging (in a
stress-test). In section 3 we review several common metrics used to evaluate
(realized or simulated) trading performance, such as active return and risk with
respect to a benchmark.
We then turn to optimization-based trading strategies. In section 4 we describe
single-period optimization (SPO), a simple but effective framework for trading
based on optimizing the portfolio performance over a single period. In section 5
we consider multi-period optimization (MPO), where the trades are chosen by
solving an optimization problem that covers multiple periods in the future.

Contribution. Most of the material that appears in this paper has appeared
before, in other papers, books, or EE364A, the Stanford course on convex op-
timization. Our contribution is to collect in one place the basic definitions, a
careful description of the model, and discussion of how convex optimization can
be used in multi-period trading, all in a common notation and framework. Our
goal is not to survey all the work done in this and related areas, but rather to
give a unified, self-contained treatment. Our focus is not on theoretical issues,
but on practical ones that arise in multi-period trading. To further this goal,
we have developed an accompanying open-source software library implemented
in Python, and available at

https://github.com/cvxgrp/cvxportfolio.

Target audience. We assume that the reader has a background in the basic
ideas of quantitative portfolio selection, trading, and finance, as described, for
example, in the books by Grinold and Kahn (2000), Meucci (2005), or Narang
(2013). We also assume that the reader has seen some basic mathematical
optimization, specifically convex optimization (Boyd and Vandenberghe 2004).
The reader certainly does not need to know more than the very basic ideas of
convex optimization, for example the overview material covered in chapter 1 of
Boyd and Vandenberghe (2004). In a nutshell, our target reader is a quantitative
trader, or someone who works with or for, or employs, one.

2 The model
In this chapter, we set the notation and give some detail of our simplified model
of multi-period trading. We develop our basic dynamic model of trading, which
tells us how a portfolio and associated cash account change over time, due to
trading, investment gains, and various costs associated with trading and holding
portfolios. The model developed in this chapter is independent of any method
for choosing or evaluating the trades or portfolio strategy, and independent of
any method used to evaluate the performance of the trading.

2.1 Portfolio asset and cash holdings


Portfolio. We consider a portfolio of holdings in n assets, plus a cash account,
over a finite time horizon, which is divided into discrete time periods labeled
t = 1, . . . , T . These time periods need not be uniformly spaced in real time or be
of equal length; for example, when they represent trading days, the periods are
one (calendar) day during the week and three (calendar) days over a weekend.
We use the label t to refer to both a point in time, the beginning of time period
t, as well as the time interval from time t to t + 1. The time period in our model
is arbitrary, and could be daily, weekly, or one hour intervals, for example. We
will occasionally give examples where the time indexes trading days, but the
same notation and model apply to any other time period.
Our investments will be in a universe of n assets, along with an associated
cash account. We let ht ∈ Rn+1 denote the portfolio (or vector of positions or
holdings) at the beginning of time period t, where (ht )i is the dollar value of
asset i at the beginning of time period t, with (ht )i < 0 meaning a short position
in asset i, for i = 1, . . . , n. The portfolio is long-only when the asset holdings
are all nonnegative, i.e., (ht )i ≥ 0 for i = 1, . . . , n.
The value of (ht )n+1 is the cash balance, with (ht )n+1 < 0 meaning that money
is owed (or borrowed). The dollar value for the assets is determined using the
reference prices pt ∈ Rn+ , defined as the average of the bid and ask prices at the

beginning of time period t. When (ht )n+1 = 0, the portfolio is fully invested,
meaning that we hold (or owe) zero cash, and all our holdings (long and short)
are in assets.

Total value, exposure, and leverage. The total value (or net asset value,
NAV) vt of the portfolio, in dollars, at time t is vt = 1T ht , where 1 is the vector
with all entries one. (This is not quite the amount of cash the portfolio would
yield on liquidation, due to transaction costs, discussed below.) Throughout
this paper we will assume that vt > 0, i.e., the total portfolio value is positive.
The vector
(ht )1:n = ((ht )1 , . . . , (ht )n )
gives the asset holdings. The gross exposure can be expressed as

∥(ht )1:n ∥1 = |(ht )1 | + · · · + |(ht )n |,

the sum of the absolute values of the asset positions. The leverage of the port-
folio is the gross exposure divided by the value, ∥(ht )1:n ∥1 /vt . (Several other
definitions of leverage are also used, such as half the quantity above.) The
leverage of a fully invested long-only portfolio is one.

Weights. We will also describe the portfolio using weights or fractions of total
value. The weights (or weight vector) wt ∈ Rn+1 associated with the portfolio
ht are defined as wt = ht /vt . (Recall our assumption that vt > 0.) By definition
the weights sum to one, 1T wt = 1, and are unitless. The weight (wt )n+1 is the
fraction of the total portfolio value held in cash. The weights are all nonnegative
when the asset positions are long and the cash balance is nonnegative. The dollar
value holdings vector can be expressed in terms of the weights as ht = vt wt . The
leverage of the portfolio can be expressed in terms of the weights as ∥w1:n ∥1 ,
the ℓ1 norm of the asset weights.
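
A tiny numeric illustration of these definitions, with made-up holdings:

import numpy as np

h = np.array([150.0, -50.0, 80.0, 20.0])  # three assets plus cash, in dollars
v = h.sum()                               # total value v = 1'h, here 200
w = h / v                                 # weights, which sum to one
leverage = np.abs(w[:-1]).sum()           # l1 norm of asset weights, 1.4 here
print(v, w, leverage)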

2.2 Trades
Trade vector. In our simplified model we assume that all trading—i.e., buy-
ing and selling of assets—occurs at the beginning of each time period. (In reality
the trades would likely be spread over at least some part of the period.) We let
ut ∈ Rn denote the dollar values of the trades, at the current price: (ut )i > 0
means we buy asset i and (ut )i < 0 means we sell asset i, at the beginning of
time period t, for i = 1, . . . , n. The number (ut )n+1 is the amount we put into
the cash account (or take out, if it is negative). The vector zt = ut /vt gives the
trades normalized by the total value. Like the weight vector wt , it is unitless.

Post-trade portfolio. The post-trade portfolio is denoted

    h^+_t = h_t + u_t,    t = 1, . . . , T.

This is the portfolio in time period t immediately after trading. The post-trade
portfolio value is v^+_t = 1^T h^+_t. The change in total portfolio value from the
trades is given by

    v^+_t − v_t = 1^T h^+_t − 1^T h_t = 1^T u_t.

The vector (ut )1:n ∈ Rn is the set of (non-cash) asset trades. Half its ℓ1 norm
∥(ut )1:n ∥1 /2 is the turnover (in dollars) in period t. This is often expressed as
a percentage of total value, as ∥(ut )1:n ∥1 /(2vt ) = ∥z1:n ∥1 /2.
We can express the post-trade portfolio, normalized by the portfolio value, in
terms of the weights wt = ht /vt and normalized trades as

    h^+_t / v_t = w_t + z_t.

Note that this normalized quantity does not necessarily add up to one.

2.3 Transaction cost


The trading incurs a trading or transaction cost (in dollars), which we denote as ϕ^trade_t(u_t), where ϕ^trade_t : R^{n+1} → R is the (dollar) transaction-cost function. We will assume that ϕ^trade_t does not depend on (u_t)_{n+1}, i.e., there is no transaction cost associated with the cash account. To emphasize this we will sometimes write the transaction cost as ϕ^trade_t((u_t)_{1:n}). We assume that ϕ^trade_t(0) = 0, i.e., there is no transaction cost when we do not trade. While ϕ^trade_t(u_t) is typically nonnegative, it can be negative in some cases, discussed below. We assume that the transaction-cost function ϕ^trade_t is separable, which means it has the form

    ϕ^trade_t(x) = Σ_{i=1}^{n} (ϕ^trade_t)_i(x_i),

i.e., the transaction cost breaks into a sum of transaction costs associated with the individual assets. We refer to (ϕ^trade_t)_i, which is a function from R into R, as the transaction-cost function for asset i, period t. We note that some authors
have used models of transaction cost which are not separable, for example the
quadratic dynamic model of Grinold (2006).

A generic transaction-cost model. A reasonable, generic model for the scalar transaction-cost functions (ϕ^trade_t)_i is

    x ↦ a|x| + bσ |x|^{3/2} / V^{1/2} + cx,    (1)
where a, b, σ, V , and c are real numbers described below, and x is a dollar
trade amount (Grinold and Kahn 2000). The number a is one half the bid-ask
spread for the asset at the beginning of the time period, expressed as a fraction
of the asset price (and so is unitless). We can also include in this term broker

commissions or fees which are a linear function of the number of shares (or dollar
value) bought or sold. The number b is a positive constant with unit inverse
dollars. The number V is the total market volume traded for the asset in the
time period, expressed in dollar value, so |x|^{3/2}/V^{1/2} has units of dollars. The
number σ is the corresponding price volatility (standard deviation) over recent
time periods, in dollars. According to a standard rule of thumb, trading one
day’s volume moves the price by about one day’s volatility, which suggests that
the value of the number b is around one. (In practice, however, the value of b
is determined by fitting the model above to data on realized transaction costs.)
The number c is used to create asymmetry in the transaction-cost function.
When c = 0, the transaction cost is the same for buying and selling; it is a
function of |x|. When c > 0, it is cheaper to sell than to buy the asset, which
generally occurs in a market where the buyers are providing more liquidity
than the sellers (e.g., if the book is not balanced in a limit order exchange).
The asymmetry in transaction cost can also be used to model price movement
during trade execution. Negative transaction cost can occur when |c| > |a|. The
constants in the transaction-cost model (1) vary with asset, and with trading
period, i.e., they are indexed by i and t. This 3/2-power transaction-cost model
is widely known and employed by practitioners.

We are not aware of empirical tests of the specific transaction-cost model (1),
but several references describe and validate similar models (Lillo et al. 2003,
Moro et al. 2009, Bershova and Rakhlin 2013, Gomes and Waelbroeck 2015). In
particular, these empirical works suggest that transaction cost grows (approxi-
mately) with the 3/2 power of transaction size.
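
The scalar model (1) is easy to implement directly. The sketch below uses the common simplification in which σ is expressed as a fractional per-period volatility and b is a dimensionless constant near one; a, b, sigma, V, and c are placeholder values, not fitted estimates.

import numpy as np

def transaction_cost(x, a=0.0005, b=1.0, sigma=0.02, V=1e6, c=0.0):
    # Dollar cost of a dollar trade x in one asset over one period, per (1).
    return a * abs(x) + b * sigma * abs(x) ** 1.5 / np.sqrt(V) + c * x

for x in (1e4, 1e5, 1e6):                 # trading 1%, 10%, and 100% of the volume V
    print(f"trade {x:>11,.0f}: cost {transaction_cost(x):>9,.0f}")

Consistent with the rule of thumb above, trading the full period volume in this example pushes the all-in cost to roughly one period's volatility times the amount traded.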

Normalized transaction cost. The transaction-cost model (1) is in dollars.


We can normalize it by vt , the total portfolio value, and express it in terms of
zi , the normalized trade of asset i, resulting in the function (with t suppressed
for simplicity)

    a_i|z_i| + b_i σ_i |z_i|^{3/2} / (V_i/v)^{1/2} + c_i z_i.    (2)

The only difference with (1) is that we use the normalized asset volume Vi /v
instead of the dollar volume Vi . This shows that the same transaction-cost
formula can be used to express the dollar transaction cost as a function of the
dollar trade, with the volume denoted in dollars, or the normalized transaction
cost as a function of the normalized trade, with the volume normalized by the
portfolio value.

With some abuse of notation, we will write the normalized transaction cost in period t as ϕ^trade_t(z_t). When the argument to the transaction-cost function is normalized, we use the version where asset volume is also normalized. The normalized transaction cost ϕ^trade_t(z_t) depends on the portfolio value v_t, as well

as the current values of the other parameters, but we suppress this dependence
to lighten the notation.

Other transaction-cost models. Other transaction-cost models can be used.


Common variants include a piecewise linear model, or adding a term that is
quadratic in the trade value zi (Almgren and Chriss 2001, Grinold 2006, Gâr-
leanu and Pedersen 2013). Almost all of these are convex functions. An example
of a transaction-cost term that is not convex is a fixed fee for any nonzero trad-
ing in an asset. For simulation, however, the transaction-cost function can be
arbitrary.

2.4 Holding cost


We will hold the post-trade portfolio h^+_t over the t'th period. This will incur a holding-based cost (in dollars) ϕ^hold_t(h^+_t), where ϕ^hold_t : R^{n+1} → R is the holding-cost function. Like transaction cost, it is typically nonnegative, but it can also be negative in certain cases, discussed below. The holding cost can include a factor related to the length of the period; for example, if our periods are trading days, but holding costs are assessed on all days (including weekends and holidays), the Friday holding cost might be multiplied by three. For simplicity, we will assume that the holding-cost function does not depend on the post-trade cash balance (h^+_t)_{n+1}.

A basic holding-cost model includes a charge for borrowing assets when going
short, which has the form

    ϕ^hold_t(h^+_t) = s_t^T (h^+_t)_−,    (3)

where (st )i ≥ 0 is the borrowing fee, in period t, for shorting asset i, and
(z)− = max{−z, 0} denotes the negative part of a number z. This is the fee
for shorting the post-trade assets, over the investment period, and here we are
paying this fee in advance, at the beginning of the period. Our assumption that
the holding cost does not depend on the cash account requires (st )n+1 = 0. But
we can include a cash borrow cost if needed, in which case (st )n+1 > 0. This
is the premium for borrowing, and not the interest rate, which is included in
another part of our model, discussed below.
The holding cost (3), normalized by portfolio value, can be expressed in terms
of weights and normalized trades as

    ϕ^hold_t(h^+_t)/v_t = s_t^T (w_t + z_t)_−.    (4)

As with the transaction cost, with some abuse of notation we use the same function symbol to denote the normalized holding cost, writing the quantity above as ϕ^hold_t(w_t + z_t). (For the particular form of holding cost described above, there is no abuse of notation since ϕ^hold_t is the same when expressed in dollars or normalized form.)
More complex holding-cost functions arise, for example when the assets include
ETFs (exchange-traded funds). A long position incurs a fee proportional to hi ;
when we hold a short position, we earn the same fee. This is readily modeled
as a linear term in the holding cost. (We can in addition have a standard fee
for shorting.) This leads to a holding cost of the form

    ϕ^hold_t(w_t + z_t) = s_t^T (w_t + z_t)_− + f_t^T (w_t + z_t),

where ft is a vector with (ft )i representing the per-period management fee for
asset i, when asset i is an ETF.
Even more complex holding-cost models can be used. One example is a piece-
wise linear model for the borrowing cost, which increases the marginal borrow
charge rate when the short position exceeds some threshold. These more general
holding-cost functions are almost always convex. For simulation, however, the
holding-cost function can be arbitrary.
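
A sketch of the basic borrow-cost model (3), with placeholder fees and holdings:

import numpy as np

def holding_cost(h_plus, s):
    # Dollar holding cost s^T (h+)_- ; s holds per-period borrow fees.
    return s @ np.maximum(-h_plus, 0.0)

h_plus = np.array([120.0, -40.0, 0.0])    # two assets plus cash, post-trade
s = np.array([0.0005, 0.0010, 0.0])       # no borrow charge on the cash account
print(holding_cost(h_plus, s))            # 0.0010 * 40 = 0.04 dollars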

2.5 Self-financing condition


We assume that no external cash is put into or taken out of the portfolio, and
that the trading and holding costs are paid from the cash account at the begin-
ning of the period. This self-financing condition can be expressed as

    1^T u_t + ϕ^trade_t(u_t) + ϕ^hold_t(h^+_t) = 0.    (5)

Here −1^T u_t is the total cash out of the portfolio from the trades; (5) says that this cash out must balance the cash cost incurred, i.e., the transaction cost plus the holding cost. The self-financing condition implies v^+_t = v_t − ϕ^trade_t(u_t) − ϕ^hold_t(h^+_t), i.e., the post-trade value is the pre-trade value minus the transaction and holding costs.
The self-financing condition (5) connects the cash trade amount (ut )n+1 to the
asset trades, (ut )1:n , by
    (u_t)_{n+1} = −(1^T (u_t)_{1:n} + ϕ^trade_t((u_t)_{1:n}) + ϕ^hold_t((h_t + u_t)_{1:n})).    (6)

Here we use the assumption that the transaction and holding costs do not depend
on the n + 1 (cash) component by explicitly writing the argument as the first
n components, i.e., those associated with the (non-cash) assets. The formula
(6) shows that if we are given the trade values for the non-cash assets, i.e.,
(ut )1:n , we can find the cash trade value (ut )n+1 that satisfies the self-financing
condition (5).
We mention here a subtlety that will come up later. A trading algorithm chooses
the asset trades (u_t)_{1:n} before the transaction-cost function ϕ^trade_t and (possibly)
the holding-cost function ϕ^hold_t are known. The trading algorithm must use

t are known. The trading algorithm must use
estimates of these functions to make its choice of trades. The formula (6) gives
the cash trade amount that is realized.

Normalized self-financing. By dividing the dollar self-financing condition


(5) by the portfolio value vt , we can express the self-financing condition in terms
of weights and normalized trades as

1^T z_t + ϕ^trade_t(v_t z_t)/v_t + ϕ^hold_t(v_t (w_t + z_t))/v_t = 0,

where we use u_t = v_t z_t and h^+_t = v_t (w_t + z_t), and the cost functions above are
the dollar value versions. Expressing the costs in terms of normalized values we
get
1^T z_t + ϕ^trade_t(z_t) + ϕ^hold_t(w_t + z_t) = 0,    (7)
where here the costs are the normalized versions.
As in the dollar version, and assuming that the costs do not depend on the cash
values, we can express the cash trade value (zt )n+1 in terms of the non-cash
asset trade values (zt )1:n as
(z_t)_{n+1} = −( 1^T (z_t)_{1:n} + ϕ^trade_t((z_t)_{1:n}) + ϕ^hold_t((w_t + z_t)_{1:n}) ).    (8)
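
As a sketch of how (8) is used, the cash trade follows mechanically once the non-cash
trades and the (estimated or realized) cost functions are known; trade_cost and
hold_cost below are placeholders for ϕ^trade_t and ϕ^hold_t, not library functions.

    import numpy as np

    def cash_trade(z_assets, w_assets, trade_cost, hold_cost):
        """Normalized cash trade (z_t)_{n+1} implied by the self-financing condition (8)."""
        return -(np.sum(z_assets)
                 + trade_cost(z_assets)              # cost of the trades (z_t)_{1:n}
                 + hold_cost(w_assets + z_assets))   # cost of the post-trade holdings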

2.6 Investment
The post-trade portfolio and cash are invested for one period, until the beginning
of the next time period. The portfolio at the next time period is given by

h_{t+1} = h^+_t + r_t ◦ h^+_t = (1 + r_t) ◦ h^+_t,    t = 1, . . . , T − 1,

where rt ∈ Rn+1 is the vector of asset and cash returns from period t to period
t + 1 and ◦ denotes Hadamard (elementwise) multiplication of vectors. The
return of asset i over period t is defined as

(r_t)_i = ((p_{t+1})_i − (p_t)_i) / (p_t)_i,    i = 1, . . . , n,
the fractional increase in the asset price over the investment period. We assume
here that the prices and returns are adjusted to include the effects of stock splits
and dividends. We will assume that the prices are nonnegative, so 1 + rt ≥ 0
(where the inequality means elementwise). We mention an alternative to our
definition above, the log-return,

log((p_{t+1})_i / (p_t)_i) = log(1 + (r_t)_i),    i = 1, . . . , n.
For returns that are small compared to one, the log-return is very close to the
return defined above.

The number (rt )n+1 is the return to cash, i.e., the risk-free interest rate. In
the simple model, the cash interest rate is the same for cash deposits and loans.
We can also include a premium for borrowing cash (say) in the holding-cost
function, by taking (st )n+1 > 0 in (3). When the asset trades (ut )1:n are chosen,
the asset returns (rt )1:n are not known. It is reasonable to assume that the cash
interest rate (rt )n+1 is known.

Next period portfolio value. For future reference we work out some useful
formulas for the next period portfolio value. We have

v_{t+1} = 1^T h_{t+1}
        = (1 + r_t)^T h^+_t
        = v_t + r_t^T h_t + (1 + r_t)^T u_t
        = v_t + r_t^T h_t + r_t^T u_t − ϕ^trade_t(u_t) − ϕ^hold_t(h^+_t).

Portfolio return. The portfolio realized return in period t is defined as

R^p_t = (v_{t+1} − v_t)/v_t,

the fractional increase in portfolio value over the period. It can be expressed as

R^p_t = r_t^T w_t + r_t^T z_t − ϕ^trade_t(z_t) − ϕ^hold_t(w_t + z_t).    (9)

This is easily interpreted. The portfolio return over period t consists of four
parts:

• r_t^T w_t is the portfolio return without trades or holding cost,
• r_t^T z_t is the return on the trades,
• −ϕ^trade_t(z_t) is the transaction cost, and
• −ϕ^hold_t(w_t + z_t) is the holding cost.

Next period weights. We can derive a formula for the next period weights
w_{t+1} in terms of the current weights w_t, the normalized trades z_t, and the
return r_t, using the equations above. Simple algebra gives

w_{t+1} = (1/(1 + R^p_t)) (1 + r_t) ◦ (w_t + z_t).    (10)

By definition, we have 1^T w_{t+1} = 1. This complicated formula reduces to
w_{t+1} = w_t + z_t when r_t = 0. We note for future use that when the per-period
returns are small compared to one, we have w_{t+1} ≈ w_t + z_t.
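
For concreteness, a minimal sketch of how (9) and (10) combine into a one-period
update of the normalized quantities; the cost arguments are assumed to be the
normalized cost functions defined in this section, and the helper name is ours.

    import numpy as np

    def step(w_t, z_t, r_t, trade_cost, hold_cost):
        """Realized portfolio return (9) and next-period weights (10)."""
        R_p = (r_t @ w_t + r_t @ z_t
               - trade_cost(z_t) - hold_cost(w_t + z_t))
        w_next = (1 + r_t) * (w_t + z_t) / (1 + R_p)
        return R_p, w_next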

2.7 Aspects not modeled


We list here some aspects of real trading that our model ignores, and discuss
some approaches to handle them if needed.

External cash. Our self-financing condition (5) assumes that no external


cash enters or leaves the portfolio. We can easily include external deposits and
withdrawals of cash by replacing the right-hand side of (5) with the external
cash put into the account (which is positive for cash deposited into the account
and negative for cash withdrawn).

Dividends. Dividends are usually included in the asset return, which implic-
itly means they are re-invested. Alternatively we can include cash dividends
from assets in the holding cost, by adding the term −d_t^T h_t, where d_t is the
vector of dividend rates (in dollars per dollar of the asset held) in period t. In
other words, we can treat cash dividends as negative holding costs.

Non-instant trading. Our model assumes all trades are carried out instantly
at the beginning of each investment period, but the trades are really executed
over some fraction of the period. This can be modeled using the linear term in
the transaction cost, which can account for the movement of the price during
the execution. We can also change the dynamics equation
ht+1 = (1 + rt ) ◦ (ht + ut )
to
ht+1 = (1 + rt ) ◦ ht + (1 − θt /2)(1 + rt ) ◦ ut ,
where θt is the fraction of the period over which the trades occur. In this
modification, we do not get the full period return on the trades when θt > 0,
since we are moving into the position as the price moves.
The simplest method to handle non-instant trading is to use a shorter period.
For example, if we are interested in daily trading, but the trades are carried out
over the whole trading day and we wish to model this effect, we can move to an
hourly model.

Imperfect execution. Here we distinguish between u^req_t, the requested trade,


and ut , the actual realized trade (Perold 1988). In a backtest simulation we
might assume that some (very small) fraction of the requested trades are only
partially completed.

Multi-period price impact. This is the effect of a large order in one period
affecting the asset price in future periods (Almgren and Chriss 2001, Obizhaeva
and Wang 2013). In our model the transaction cost is only a function of the
current period trade vector, not previous ones.

Trade settlement. In trade settlement we keep track of cash from trades one
day and two days ago (in daily simulation), as well as the usual (unencumbered)
cash account which includes all cash from trades that occurred three or more
days ago, which have already settled. Shorting expenses come from the unen-
cumbered cash, and trade-related cash moves immediately into the one day ago
category (for daily trading).

Merger/acquisition. In a certain period one company buys another, convert-


ing the shares of the acquired company into shares of the acquiring company at
some rate. This modifies the asset holdings update. In a cash buyout, positions
in the acquired company are converted to cash.

Bankruptcy or dissolution. The holdings in an asset are reduced to zero,


possibly with a cash payout.

Trading freeze. A similar action is a trading freeze, where in some time


periods an asset cannot be bought, or sold, or both.

2.8 Simulation
Our model can be used to simulate the evolution of a portfolio over the peri-
ods t = 1, . . . , T . This requires the following data, when the standard model
described above is used. (If more general transaction- or holding-cost functions
are used, any data required for them is also needed.)

• Starting portfolio and cash account values, h1 ∈ Rn+1 .

• Asset trade vectors (ut )1:n . The cash trade value (ut )n+1 is determined
from the self-financing condition by (6).

• Transaction-cost model parameters at ∈ Rn , bt ∈ Rn , ct ∈ Rn , σt ∈ Rn ,


and Vt ∈ Rn .

• Shorting rates st ∈ Rn .

• Returns rt ∈ Rn+1 .

• Cash dividend rates dt ∈ Rn , if they are not included in the returns.
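
Given these data, a bare-bones simulator might look like the following sketch; it
assumes the trades, returns, and dollar cost functions are supplied as indicated,
uses the self-financing condition (6) for the cash trade, and ignores all of the
aspects listed in section 2.7. The names are illustrative, not from any library.

    import numpy as np

    def simulate(h1, u_assets, r, trade_cost, hold_cost):
        """Simulate dollar holdings over T periods.

        h1: initial holdings (length n+1, last entry is cash)
        u_assets: T x n array of non-cash dollar trades
        r: T x (n+1) array of asset and cash returns
        trade_cost, hold_cost: dollar cost functions of the non-cash components
        """
        h, values = h1.astype(float).copy(), []
        for t in range(len(u_assets)):
            h_plus = h + np.append(u_assets[t], 0.0)
            # cash trade from the self-financing condition (6)
            u_cash = -(u_assets[t].sum() + trade_cost(u_assets[t])
                       + hold_cost(h_plus[:-1]))
            h_plus[-1] = h[-1] + u_cash
            h = (1 + r[t]) * h_plus          # invest for one period
            values.append(h.sum())
        return np.array(values)              # portfolio value after each period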

Backtest. In a backtest the values would be past realized values, with (ut )1:n
the trades proposed by the trading algorithm being tested. Such a test estimates
what the evolution of the portfolio would have been with different trades or a
different trading algorithm. The simulation determines the portfolio and cash
account values over the simulation period, from which other metrics, described

in section 3 below, can be computed. As a simple example, we can compare the


performance of rebalancing to a given target portfolio daily, weekly, or quarterly.

A simple but informative backtest is to simulate the portfolio evolution using


the actual trades that were executed in a portfolio. We can then compare
the actual and simulated or predicted portfolio holdings and total value over
some time period. The true and simulated portfolio values will not be identical,
since our model relies on estimates of transaction and holding costs, assumes
instantaneous trade execution, and so on.

What-if simulations. In a what-if simulation, we change the data used to


carry out the simulation, i.e., returns, volumes, and so on. The values used are
ones that (presumably) could have occurred. This can be used to stress-test a
trading algorithm, by using data that did not occur, but would have been very
challenging.

Adding uncertainty in simulations. Any simulation of portfolio evolution


relies on models of transaction and holding costs, which in turn depend on pa-
rameters. These parameters are not known exactly, and in any case, the models
are not exactly correct. So the question arises, to what extent should we trust
our simulations? One simple way to check this is to carry out multiple simula-
tions, where we randomly perturb the model parameters by reasonable amounts.
For example, we might vary the daily volumes from their true (realized) values
by 10% each day. If simulation with parameters that are perturbed by reason-
able amounts yields divergent results, we know that (unfortunately) we cannot
trust the simulations.

3 Metrics
Several generic performance metrics can be used to evaluate the portfolio per-
formance.

3.1 Absolute metrics


We first consider metrics that measure the growth of portfolio value in absolute
terms, not in comparison to a benchmark portfolio or the risk-free rate.

Return and growth rate. The average realized return over periods t = 1, . . . , T is

R^p = (1/T) ∑_{t=1}^T R^p_t.

An alternative measure of return is the growth rate (or log-return) of the portfolio
in period t, defined as

Gpt = log(vt+1 /vt ) = log(1 + Rtp ).

The average growth rate of the portfolio is the average value of Gpt over the
periods t = 1, . . . , T . For per-period returns that are small compared to one
(which is almost always the case in practice) Gpt is very close to Rtp .
The return and growth rates given above are per-period. For interpretability
they are typically annualized (Bacon 2008): return and growth rates are multi-
plied by P , where P is the number of periods in one year. (For periods that are
trading days, we have P ≈ 250.)

Volatility and risk. The realized volatility (Black 1976) is the standard de-
viation of the portfolio return time series,
σ^p = ( (1/T) ∑_{t=1}^T (R^p_t − R^p)^2 )^{1/2}.

(This is the maximum-likelihood estimate; for an unbiased estimate we replace


1/T with 1/(T − 1)). The square of the volatility is the quadratic risk. When
Rtp are small (in comparison to 1), a good approximation of the quadratic risk
is the second moment of the return,

(σ^p)^2 ≈ (1/T) ∑_{t=1}^T (R^p_t)^2.

The volatility and quadratic risk given above are per-period. For interpretability
they are typically annualized. To get the annualized values we multiply volatility
by √P, and quadratic risk by P. (This scaling is based on the idea that the
returns in different periods are independent random variables.)

3.2 Metrics relative to a benchmark


Benchmark weights. It is common to measure the portfolio performance
against a benchmark, given as a set of weights wtb ∈ Rn+1 , which are fractions
of the assets (including cash), and satisfy 1T wtb = 1. We will assume the
benchmark weights are nonnegative, i.e., the entries in wtb are nonnegative. The
benchmark weight wtb = en+1 (the unit vector with value 0 for all entries except
the last, which has value 1) represents the cash, or risk-free, benchmark. More
commonly the benchmark consists of a particular set of assets with weights
proportional to their capitalization. The benchmark return in period t is Rtb =
rtT wtb . (When the benchmark is cash, this is the risk-free interest rate (rt )n+1 .)

Active and excess return. The active return (Sharpe 1991, Grinold and
Kahn 2000) (of the portfolio, with respect to a benchmark) is given by

Rta = Rtp − Rtb .

In the special case when the benchmark consists of cash (so that the benchmark
return is the risk-free rate) this is known as excess return, denoted

Rte = Rtp − (rt )n+1 .

We define the average active return Ra , relative to the benchmark, as the average
of Rta . We have

R^a_t = R^p_t − R^b_t
      = r_t^T (w_t − w^b_t) + r_t^T z_t − ϕ^trade_t(z_t) − ϕ^hold_t(w_t + z_t).

Note that if zt = 0 and wt = wtb , i.e., we hold the benchmark weights and do
not trade, the active return is zero. (This relies on the assumption that the
benchmark weights are nonnegative, so ϕ^hold_t(w^b_t) = 0.)

Active risk. The standard deviation of Rta , denoted σ a , is the risk relative to
the benchmark, or active risk. When the benchmark is cash, this is the excess
risk σ e . When the risk-free interest rate is constant, this is the same as the risk
σp .

Information and Sharpe ratio. The (realized) information ratio (IR) of the
portfolio relative to a benchmark is the average of the active returns Ra over
the standard deviation of the active returns σ a (Grinold and Kahn 2000),

IR = Ra /σ a .

In the special case of a cash benchmark this is known as Sharpe ratio (SR)
(Sharpe 1966, 1994)
SR = Re /σ e .
Both IR and SR are typically given using the annualized values of the return
and risk (Bacon 2008).
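
As an illustration, these metrics are easily computed from a simulated return
series; a minimal sketch with the annualization convention P ≈ 250 used above
(the function and argument names are ours):

    import numpy as np

    def metrics(R_p, R_b, periods_per_year=250):
        """Annualized return, volatility, and information ratio.

        R_p: per-period realized portfolio returns; R_b: benchmark returns
        (use the risk-free rate to obtain excess return and the Sharpe ratio).
        """
        active = R_p - R_b
        ann_return = periods_per_year * R_p.mean()
        ann_vol = np.sqrt(periods_per_year) * R_p.std()
        ir = np.sqrt(periods_per_year) * active.mean() / active.std()
        return ann_return, ann_vol, ir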

4 Single-period optimization
In this section we consider optimization-based trading strategies where at the
beginning of period t, using all the data available, we determine the asset por-
tion of the current trade vector (ut )1:n (or the normalized asset trades (zt )1:n ).
The cash component of the trade vector (zt )n+1 is then determined by the self-
financing equation (8), once we know the realized costs. We formulate this as

a convex optimization problem, which takes into account the portfolio perfor-
mance over one period, the constraints on the portfolio, and investment risk
(described below). The idea goes back to Markowitz (1952), who was the first
to formulate the choice of a portfolio as an optimization problem. (We will
consider multi-period optimization in the next section.)
When we choose (zt )1:n , we do not know rt and the other market parameters
(and therefore the transaction-cost function ϕ^trade_t), so instead we must rely
on estimates of these quantities and functions. We will denote an estimate
of the quantity or function Z, made at the beginning of period t (i.e., when
we choose (z_t)_{1:n}), as Ẑ. For example, ϕ̂^trade_t is our estimate of the current
period transaction-cost function (which depends on the market volume and other
parameters, which are predicted or estimated). The most important quantity
that we estimate is the return over the current period rt , which we denote as
r̂t . (Return forecasts are sometimes called signals.) If we adopt a stochastic
model of returns and other quantities, Ẑ could be the conditional expectation
of Z, given all data that is available at the beginning of period t, when the asset
trades are chosen.
Before proceeding we note that most of the effort in developing a good trading
algorithm goes into forming the estimates or forecasts, especially of the return
rt (Campbell et al. 1997, Grinold and Kahn 2000). In this paper, however, we
consider the estimates as given. Thus we focus on the question, given a set of
estimates, what is a good way to trade based on them? Even though we do
not focus on how the estimates should be constructed, the ideas in this paper
are useful in the development of estimates, since the value of a set of estimates
can depend considerably on how they are exploited, i.e., how the estimates are
turned into trades. To properly assess the value of a proposed set of estimates
or forecasts, we must evaluate them using a realistic simulation with a good
trading algorithm.
We write our estimated portfolio return as

R̂^p_t = r̂_t^T w_t + r̂_t^T z_t − ϕ̂^trade_t(z_t) − ϕ̂^hold_t(w_t + z_t),

which is (9), with the unknown return rt replaced with the estimate r̂t . The
estimated active return is

R̂^a_t = r̂_t^T (w_t − w^b_t) + r̂_t^T z_t − ϕ̂^trade_t(z_t) − ϕ̂^hold_t(w_t + z_t).

Each of these consists of a term that does not depend on the trades, plus

r̂_t^T z_t − ϕ̂^trade_t(z_t) − ϕ̂^hold_t(w_t + z_t),    (11)

the return on the trades minus the transaction and holding costs.

4.1 Risk–return optimization


In a basic optimization-based trading strategy, we determine the normalized
asset trades zt by solving the optimization problem

maximize    R̂^p_t − γ_t ψ_t(w_t + z_t)
subject to  z_t ∈ Z_t,  w_t + z_t ∈ W_t,                                    (12)
            1^T z_t + ϕ̂^trade_t(z_t) + ϕ̂^hold_t(w_t + z_t) = 0,

with variable zt . Here ψt : Rn+1 → R is a risk function, described below, and


γt > 0 is the risk-aversion parameter. The objective in (12) is called the risk-
adjusted estimated return. The sets Zt and Wt are the trading and holdings
constraint sets, respectively, also described in more detail below. The current
portfolio weight wt is known, i.e., a parameter, in the problem (12). The risk
function, constraint sets, and estimated transaction and holding costs can all
depend on the portfolio value vt , but we suppress this dependence to keep the
notation light.
To optimize performance against the risk-free interest rate or a benchmark port-
folio, we replace R̂tp in (12) with R̂te or R̂ta . By (11), these all have the form of
a constant that does not depend on zt , plus

r̂_t^T z_t − ϕ̂^trade_t(z_t) − ϕ̂^hold_t(w_t + z_t).

So in all three cases we get the same trades by solving the problem

maximize    r̂_t^T z_t − ϕ̂^trade_t(z_t) − ϕ̂^hold_t(w_t + z_t) − γ_t ψ_t(w_t + z_t)
subject to  z_t ∈ Z_t,  w_t + z_t ∈ W_t,                                    (13)
            1^T z_t + ϕ̂^trade_t(z_t) + ϕ̂^hold_t(w_t + z_t) = 0,

with variable zt . (We will see later that the risk functions are not the same for
absolute, excess, and active return.) The objective has four terms: the first is
the estimated return for the trades, the second is the estimated transaction cost,
the third term is the holding cost of the post-trade portfolio, and the last is the
risk of the post-trade portfolio. Note that the first two depend on the trades zt
and the last two depend on the post-trade portfolio wt + zt . (Similarly, the first
constraint depends on the trades, and the second on the post-trade portfolio.)

Estimated versus realized transaction and holding costs. The asset


trades we choose are given by (zt )1:n = (zt⋆ )1:n , where zt⋆ is optimal for (13).
In dollar terms, the asset trades are (ut )1:n = vt (zt⋆ )1:n . The true normalized
cash trade value (zt )n+1 is found by the self-financing condition (8) from the
non-cash asset trades (zt⋆ )1:n and the realized costs. This is not (in general)
the same as (zt⋆ )n+1 , the normalized cash trade value found by solving the
optimization problem (13). The quantity (zt )n+1 is the normalized cash trade

value with the realized costs, while (zt⋆ )n+1 is the normalized cash trade value
with the estimated costs.
The (small) discrepancy between the realized cash trade value (zt )n+1 and the
planned or estimated cash trade value (zt⋆ )n+1 has an implication for the post-
trade holding constraint wt + zt⋆ ∈ Wt . When we solve (13) we require that the
post-trade portfolio with the estimated cash balance satisfies the constraints,
which is not quite the same as requiring that the post-trade portfolio with the
realized cash balance satisfies the constraints. The discrepancy is typically very
small, since our estimation errors for the transaction cost are typically small
compared to the true transactions costs, which in turn are small compared
to the total portfolio value. But it should be remembered that the realized
post-trade portfolio wt + zt can (slightly) violate the constraints since we only
constrain the estimated post-trade portfolio wt + zt⋆ to satisfy the constraints.
(Assuming perfect trade execution, constraints relating to the asset portion of
the post-trade portfolio (wt + zt⋆ )1:n will hold exactly.)

Simplifying the self-financing constraint. We can simplify problem (13)


by replacing the self-financing constraint
1^T z_t + ϕ̂^trade_t(z_t) + ϕ̂^hold_t(w_t + z_t) = 0
with the constraint 1T zt = 0. In all practical cases, the cost terms are small
compared to the total portfolio value, so the approximation is good. At first
glance it appears that by using the simplified constraint 1T z = 0 in the opti-
mization problem, we are essentially ignoring the transaction and holding costs,
which would not produce good results. But we still take the transaction and
holding costs into account in the objective.
With this approximation we obtain the simplified problem

maximize    r̂_t^T z_t − ϕ̂^trade_t(z_t) − ϕ̂^hold_t(w_t + z_t) − γ_t ψ_t(w_t + z_t)    (14)
subject to  1^T z_t = 0,  z_t ∈ Z_t,  w_t + z_t ∈ W_t.

The solution zt⋆ to the simplified problem slightly over-estimates the realized
cash trade (zt )n+1 , and therefore the post-trade cash balance (wt + zt )n+1 . The
cost functions used in optimization are only estimates of what the realized values
will be; in most practical cases this estimation error is much larger than the ap-
proximation introduced with the simplification 1T zt = 0. One small advantage
(that will be useful in the multi-period trading case) is that in the optimization
problem (14), wt + zt is a bona fide set of weights, i.e., 1T (wt + zt ) = 1; whereas
in (13), 1T (wt + zt ) is (typically) slightly less than one.
We can re-write the problem (14) in terms of the variable wt+1 = wt + zt , which
we interpret as the post-trade portfolio weights:
maximize    r̂_t^T w_{t+1} − ϕ̂^trade_t(w_{t+1} − w_t) − ϕ̂^hold_t(w_{t+1}) − γ_t ψ_t(w_{t+1})    (15)
subject to  1^T w_{t+1} = 1,  w_{t+1} − w_t ∈ Z_t,  w_{t+1} ∈ W_t,

with variable wt+1 .
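
To make this concrete, the following is a minimal CVXPY sketch of problem (15)
with a quadratic risk, a linear transaction-cost term, the short-borrow holding
cost (3), and a long-only constraint; all data values (r_hat, Sigma, a, s,
gamma_risk) are illustrative placeholders, not recommendations.

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    n = 10                                   # non-cash assets; the last entry is cash
    w = np.append(np.ones(n) / n, 0.0)       # current weights, summing to one
    r_hat = 1e-3 * np.random.randn(n + 1)    # return forecast
    Sigma = 1e-4 * np.eye(n + 1)             # risk model (cash row/column zero in practice)
    a = 5e-4 * np.ones(n)                    # linear transaction-cost coefficients
    s = 1e-4 * np.ones(n)                    # borrow fees
    gamma_risk = 5.0

    w_next = cp.Variable(n + 1)              # post-trade weights, the variable in (15)
    z = w_next - w                           # normalized trades

    objective = cp.Maximize(r_hat @ w_next
                            - a @ cp.abs(z[:n])          # transaction cost
                            - s @ cp.pos(-w_next[:n])    # holding cost (3)
                            - gamma_risk * cp.quad_form(w_next, Sigma))
    constraints = [cp.sum(w_next) == 1, w_next[:n] >= 0]  # fully invested, long only
    cp.Problem(objective, constraints).solve()
    print(np.round(w_next.value, 3))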

4.2 Risk measures


The risk measure ψt in (13) or (14) is traditionally an estimate of the variance
of the return, using a stochastic model of the returns (Markowitz 1952, Kolm
et al. 2014), but it can be any function that measures our perceived risk of
holding a portfolio (Frittelli and Gianin 2002). We first describe the traditional
risk measures.

Absolute risk. Under the assumption that the returns rt are stochastic, with
covariance matrix Σt ∈ R(n+1)×(n+1) , the variance of Rtp is given by

var[Rtp ] = (wt + zt )T Σt (wt + zt ).

This gives the traditional quadratic risk measure for period t,

ψt (x) = xT Σt x.

It must be emphasized that Σt is an estimate of the return covariance under the


assumption that the returns are stochastic. It is usually assumed that the cash
return (risk-free interest rate) (rt )n+1 is known, in which case the last row and
column of Σt are zero.

Active risk. With the assumption that rt is stochastic with covariance Σt ,


the variance of the active return Rta is

var[Rta ] = (wt + zt − wtb )T Σt (wt + zt − wtb ).

This gives the traditional quadratic active risk measure

ψt (x) = (x − wtb )T Σt (x − wtb ).

When the benchmark is cash, this reduces to xT Σt x, the absolute risk, since the
last row and column of Σt are zero. In the sequel we will work with the active
risk, which reduces to the absolute or excess risk when the benchmark is cash.

Risk-aversion parameter. The risk-aversion parameter γt in (13) or (14) is


used to scale the relative importance of the estimated return and the estimated
risk. Here we describe how the particular value γt = 1/2 arises in an approx-
imation of maximizing expected growth rate, neglecting costs. Assuming that
the returns rt are independent samples from a distribution, and w is fixed, the
portfolio return Rtp = wT rt is a (scalar) random variable. The weight vector
that maximizes the expected portfolio growth rate E[log(1 + Rtp )] (subject to
1T w = 1, w ≥ 0) is called the Kelly optimal portfolio or log-optimal portfolio

(Kelly 1956, Busseti et al. 2016). Using the quadratic approximation of the
logarithm log(1 + a) ≈ a − (1/2)a^2 we obtain

E[log(1 + R^p_t)] ≈ E[R^p_t − (1/2)(R^p_t)^2] = µ^T w − (1/2) w^T (Σ + µµ^T) w,

where µ = E[rt ] and Σ = E[(rt −µ)(rt −µ)T ] are the mean and covariance of the
return rt . Assuming that the term µµT is small compared to Σ (which is the case
for realistic daily returns and covariance), the expected growth rate can be well
approximated as µT w − (1/2)wT Σw. So the choice of risk-aversion parameter
γt = 1/2 in the single-period optimization problems (13) or (14) corresponds to
approximately maximizing growth rate, i.e., Kelly optimal trading. In practice
it is found that Kelly optimal portfolios tend to have too much risk (Busseti
et al. 2016), so we expect that useful values of the risk-aversion parameter γt
are bigger than 1/2.

Factor model. When the number of assets n is large, the covariance estimate
Σt is typically specified as a low rank (‘factor’) component, plus a diagonal
matrix,
Σ_t = F_t Σ^f_t F_t^T + D_t,
which is called a factor model (for quadratic risk). Here F_t ∈ R^{(n+1)×k} is
the factor-loading matrix, Σ^f_t ∈ R^{k×k} is an estimate of the covariance of F_t^T r_t
(the vector of factor returns), and Dt ∈ R(n+1)×(n+1) is a nonnegative diagonal
matrix.
The number of factors k is much less than n (Chan et al. 1999) (typically, tens
versus thousands). Each entry (Ft )ij is the loading (or exposure) of asset i to
factor j. Factors can represent economic concepts such as industrial sectors,
exposure to specific countries, accounting measures, and so on. For example,
a technology factor would have loadings of 1 for technology assets and 0 for
assets in other industries. But the factor-loading matrices can be found using
many other methods, for example by a purely data-driven analysis. The matrix
Dt accounts for the additional variance in individual asset returns beyond that
predicted by the factor model, known as the idiosyncratic risk.
When a factor model is used in the problems (13) or (14), it can offer a very
substantial increase in the speed of solution (Perold 1984, Boyd and Vanden-
berghe 2004). Provided the problem is formulated in such a way that the solver
can exploit the factor model, the computational complexity drops from O(n3 )
to O(nk 2 ) flops, for a savings of O((n/k)2 ). The speedup can be substantial
when (as is typical) n is on the order of thousands and k on the order of tens.
(Computational issues are discussed in more detail in section 4.7.)
We now mention some less traditional risk functions that can be very useful in
practice.

Transformed risk. We can apply a nonlinear transformation to the usual


quadratic risk,
ψt (x) = φ((x − wtb )T Σt (x − wtb )),
where φ : R → R is a nondecreasing function. (It should also be convex, to
keep the optimization problem tractable, as we will discuss below.) This allows
us to shape our aversion to different levels of quadratic risk. For example, we
can take φ(x) = (x − a)+ . In this case the transformed risk assesses no cost for
quadratic risk levels up to a. This can be useful to hit a target risk level, or to
be maximally aggressive in seeking returns, up to the risk threshold a. Another
option is φ(x) = exp(x/η), where η > 0 is a parameter. This assesses a strong
cost to risks substantially larger than η, and is closely related to risk aversion
used in stochastic optimization.
The solution of the optimization problem (13) with transformed risk is the same
as the solution with the traditional risk function, but with a different value of
the risk-aversion parameter. So we can think of transformed risk aversion as a
method to automatically tune the risk-aversion parameter, increasing it as the
risk increases.

Worst-case quadratic risk. We now move beyond the traditional quadratic


risk to create a risk function that is more robust to unpredicted changes in
market conditions. We define the worst-case risk for portfolio x as
ψ_t(x) = max_{i=1,...,M} (x − w^b_t)^T Σ^{(i)}_t (x − w^b_t).

Here Σ(i) , i = 1, . . . , M , are M given covariance matrices; we refer to i as the


scenario. We can motivate the worst-case risk by imagining that the returns
are generated from one of M distributions, with covariances Σ(i) depending on
which scenario occurs. In each period, we do not know, and do not attempt to
predict, which scenario will occur. The worst-case risk is the largest risk under
the M scenarios.
If we estimate the probabilities of occurrence of the scenarios, and weight the
scenario covariance matrices by these probabilities, we end up back with a single
quadratic risk measure, the weighted sum of the scenario covariances. It is
critical that we combine them using the maximum, and not a weighted sum.
(Although other nonlinear combining functions would also work.) We should
think of the scenarios as describing situations that could arise, but that we
cannot or do not attempt to predict.
The scenario covariances Σ(i) can be found by many reasonable methods. They
can be empirical covariances estimated from realized (past) returns conditioned
on the scenario, for example, high or low market volatility, high or low interest
rates, high or low oil prices, and so on (Meucci 2010). They could be an analyst’s
best guess for what the asset covariance would be in a situation that could occur.
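
A sketch of the worst-case risk in CVXPY: the pointwise maximum of the M scenario
quadratics is convex and can be used directly as ψ_t; the scenario covariances
below are arbitrary placeholders.

    import cvxpy as cp
    import numpy as np

    n, M = 5, 3
    Sigmas = [1e-4 * (i + 1) * np.eye(n) for i in range(M)]   # scenario covariances
    w_b = np.ones(n) / n                                       # benchmark weights

    x = cp.Variable(n)
    worst_case_risk = cp.maximum(*[cp.quad_form(x - w_b, S) for S in Sigmas])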

4.3 Forecast-error risk


The risk measures considered above attempt to model the period to period
variation in asset returns, and the associated period to period variation in the
portfolio return they induce. In this section we consider terms that take into
account errors in our prediction of return and covariance. (The same ideas can
be applied to other parameters that we estimate, like volume.) Estimation errors
can significantly impact the resulting portfolio weights, resulting in poor out-
of-sample performance (Jorion 1985, Michaud 1989, Chopra and Ziemba 1993,
Kan and Zhou 2007, DeMiguel et al. 2009b, Fabozzi et al. 2010, Kolm et al.
2014, Bailey et al. 2017).

Return-forecast-error risk. We assume our forecasts of the return vector


r̂ are uncertain: any forecast r̂ + δ with |δ| ≤ ρ and ρ ∈ Rn is possible and
consistent with what we know. In other words, ρ is a vector of uncertainties
on our return prediction r̂. If we are confident in our (nominal) forecast of
the return of asset i, we take ρi small; conversely large ρi means that we are
not very confident in our forecast. The uncertainty in return forecast is readily
interpreted when annualized; for example, our uncertain return forecast for an
asset might be described as 6% ± 2%, meaning any forecast return between 4%
and 8% is possible.
The post-trade estimated return is then (r̂t + δt )T (wt + zt ); we define the min-
imum of this over |δ| ≤ ρ as the worst-case return forecast. It is easy to see
what the worst-case value of δ is: if we hold a long position, the return (for
that asset) should take its minimum allowed value r̂_i − ρ_i; if we hold a short position,
it should take its maximum allowed value r̂_i + ρ_i. The worst-case return forecast
has the value

R̂twc = r̂tT (wt + zt − wtb ) − ρT |wt + zt − wtb |.

The first term here is our original estimate (including the constant terms we
neglect in (13) and (14)); the second term (which is always nonpositive) is the
worst possible value of our estimated active return over the allowed values of δ.
It is a risk associated with forecast uncertainty. This gives

ψt (x) = ρT |x − wtb |. (16)

(This would typically be added to a traditional quadratic risk measure.) This


term is a weighted ℓ1 norm of the deviation from the weights, and encourages
weights that deviate sparsely from the benchmark, i.e., weights with some or
many entries equal to those of the benchmark (Tibshirani 1996, Fastrich et al.
2015, Ho et al. 2015, Li 2015b).

Covariance-forecast-error risk. In a similar way we can add a term that


corresponds to risk of errors in forecasting the covariance matrix in a traditional

quadratic risk model. As an example, suppose that we are given a nominal


covariance matrix Σ, and consider the perturbed covariance matrix

Σpert = Σ + ∆,

where ∆ is a symmetric perturbation matrix with


|∆_ij| ≤ κ (Σ_ii Σ_jj)^{1/2},    (17)

where κ ∈ [0, 1) is a parameter. This perturbation model means that the di-
agonal entries of covariance can change by the fraction κ; ignoring the change
in the diagonal entries, the asset correlations can change by up to (roughly) κ.
The value of κ depends on our confidence in the covariance matrix; reasonable
values are κ = 0.02, 0.05, or more.
With v = x − wtb , the maximum (worst-case) value of the quadratic risk over
this set of perturbations is given by

max_{|∆_ij| ≤ κ(Σ_ii Σ_jj)^{1/2}} v^T (Σ^pert) v
    = max_{|∆_ij| ≤ κ(Σ_ii Σ_jj)^{1/2}} v^T (Σ + ∆) v
    = v^T Σ v + max_{|∆_ij| ≤ κ(Σ_ii Σ_jj)^{1/2}} ∑_{ij} v_i v_j ∆_ij
    = v^T Σ v + κ ∑_{ij} |v_i v_j| (Σ_ii Σ_jj)^{1/2}
    = v^T Σ v + κ ( ∑_i Σ_ii^{1/2} |v_i| )^2.

This shows that the worst-case covariance, over all perturbed covariance matri-
ces consistent with our risk forecast error assumption (17), is given by
ψ_t(x) = (x − w^b_t)^T Σ (x − w^b_t) + κ ( σ^T |x − w^b_t| )^2,    (18)

where σ = (Σ_11^{1/2}, . . . , Σ_nn^{1/2}) is the vector of asset volatilities. The first term is
the usual quadratic risk with the nominal covariance matrix; the second term
can be interpreted as risk associated with covariance forecasting error (Ho et al.
2015, Li 2015b). It is the square of a weighted ℓ1 norm of the deviation of the
weights from the benchmark. (With cash benchmark, this directly penalizes
large leverage.)
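
Both robustness terms are simple to add to a nominal quadratic risk; a sketch with
placeholder data, where rho and kappa are the uncertainty parameters introduced above:

    import cvxpy as cp
    import numpy as np

    n = 5
    Sigma = 1e-4 * np.eye(n)              # nominal covariance (illustrative)
    sigma = np.sqrt(np.diag(Sigma))       # asset volatilities
    w_b = np.ones(n) / n                  # benchmark weights
    rho = 1e-4 * np.ones(n)               # return-forecast uncertainty
    kappa = 0.05                          # covariance-forecast uncertainty

    x = cp.Variable(n)
    v = x - w_b
    risk = (cp.quad_form(v, Sigma)
            + rho @ cp.abs(v)                         # return-forecast-error term (16)
            + kappa * cp.square(sigma @ cp.abs(v)))   # covariance-forecast-error term (18)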

4.4 Holding constraints


Holding constraints restrict our choice of normalized post-trade portfolio wt +zt .
They may be surrogates for constraints on wt+1 , which we cannot constrain
directly since it depends on the unknown returns. Usually returns are small

and wt+1 is close to wt + zt , so constraints on wt + zt are good approximations


for constraints on wt+1 . Some types of constraints always hold exactly for wt+1
when they hold for wt + zt .
Holding constraints may be mandatory, imposed by law or the investor, or dis-
cretionary, included to avoid certain undesirable portfolios. We discuss common
holding constraints below. Depending on the specific situation, each of these
constraints could be imposed on the active holdings wt + zt − wtb instead of the
absolute holdings wt + zt , which we use here for notational simplicity.

Long only. This constraint requires that only long asset positions are held,

wt + zt ≥ 0.

If only the assets must be long, this becomes (wt + zt )1:n ≥ 0. When a long-only
constraint is imposed on the post-trade weight wt + zt , it automatically holds
on the next period value (1 + rt ) ◦ (ht + zt ), since 1 + rt ≥ 0.

Leverage constraint. The leverage can be limited with the constraint

∥(wt + zt )1:n ∥1 ≤ Lmax ,

which requires the post-trade portfolio leverage to not exceed Lmax . (Note that
the leverage of the next period portfolio can be slightly larger than Lmax , due
to the returns over the period.)

Limits relative to asset capitalization. Holdings are commonly limited so


that the investor does not own too large a portion of the company total value.
Let Ct denote the vector of asset capitalization, in dollars. The constraint

(w_t + z_t)_{1:n} ≤ δ ◦ C_t /v_t,

where δ ≥ 0 is a vector of fraction limits, and / is interpreted elementwise, limits


the long post-trade position in asset i to be no more than the fraction δi of the
capitalization. We can impose a similar limit on short positions, relative to asset
capitalization, total outstanding short value, or some combination.

Limits relative to portfolio. We can limit our holdings in each asset to lie
between a minimum and a maximum fraction of the portfolio value,

−wmin ≤ wt + zt ≤ wmax ,

where wmin and wmax are nonnegative vectors of the maximum short and long
allowed fractions, respectively. For example, with wmax = wmin = (0.05)1, we
are not allowed to hold more than 5% of the portfolio value in any one asset,
long or short.

Minimum cash balance. Often the cash balance must stay above a minimum
dollar threshold cmin (which can be negative). We express a minimum cash
balance as the constraint

(wt + zt )n+1 ≥ cmin /vt .

This constraint can be slightly violated by the realized values, due to our error
in estimation of the costs.

No-hold constraints. A no-hold constraint on asset i forbids holding a posi-


tion in asset i, i.e.,
(wt + zt )i = 0.

β-neutrality. A β-neutral portfolio is one whose return Rp is uncorrelated


with the benchmark return Rb , according to our estimate Σt of cov[rt ]. The
constraint that wt + zt be β neutral takes the form

(wtb )T Σt (wt + zt ) = 0.

Factor neutrality. In the factor covariance model, the estimated portfolio


risk σ^F_i due to factor i is given by

(σ^F_i)^2 = (w_t + z_t)^T (F_t)_i (Σ^f_t)_ii (F_t)_i^T (w_t + z_t).

The constraint that the portfolio be neutral to factor i means that σ^F_i = 0, which
occurs when
(F_t)_i^T (w_t + z_t) = 0.

Stress constraints. Stress constraints protect the portfolio against unex-


pected changes in market conditions. Consider scenarios 1, . . . , K, each rep-
resenting a market shock event such as a sudden change in oil prices, a general
reversal in momentum, or a collapse in real estate prices. Each scenario i has
an associated (estimated) return ci . The ci could be based on past occurrences
of scenario i or predicted by analysts if scenario i has never occurred before.
Stress constraints take the form

cTi (wt + zt ) ≥ Rmin ,

i.e., the portfolio return in scenario i is above Rmin . (Typically Rmin is negative;
here we are limiting the decrease in portfolio value should scenario i actually
occur.) Stress constraints are related to chance constraints such as value at risk
in the sense that they restrict the probability of large losses due to shocks.

Liquidation-loss constraint. We can bound the loss of value incurred by


liquidating the portfolio over T liq periods. A constraint on liquidation loss will
deter the optimizer from investing in illiquid assets. We model liquidation as
the transaction cost to trade h+ over T liq periods. If we use the transaction-cost
estimate ϕ̂ for all periods, the optimal schedule is to trade (wt + zt )/T liq each
period. The constraint that the liquidation loss is no more than the fraction δ
of the portfolio value is given by
T^liq ϕ̂^trade_t((w_t + z_t)/T^liq) ≤ δ.
(For optimization against a benchmark, we replace this with the cost to trade
the portfolio to the benchmark over T liq periods.)

Concentration limit. As an example of a non-traditional constraint, we con-


sider a concentration limit, which requires that no more than a given fraction
ω of the portfolio value can be held in some given fraction (or just a specific
number K) of assets. This can be written as

∑_{i=1}^K (w_t + z_t)_[i] ≤ ω,

where the notation a[i] refers to the i’th largest element of the vector a. The
left-hand side is the sum of the K largest post-trade positions. For example,
with K = 20 and ω = 0.4, this constraint prohibits holding more than 40% of
the total value in any 20 assets. (It is not well known that this constraint is
convex, and indeed, easily handled; see Boyd and Vandenberghe (2004, section
3.2.3). It is easily extended to the case where K is not an integer.)
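
In CVXPY the sum of the K largest entries is available as an atom, so the constraint
is one line (a sketch with illustrative numbers):

    import cvxpy as cp

    n, K, omega = 50, 20, 0.4
    w_plus = cp.Variable(n)
    # at most 40% of the portfolio value in any 20 assets
    concentration_limit = cp.sum_largest(w_plus, K) <= omega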

4.5 Trading constraints


Trading constraints restrict the choice of normalized trades zt . Constraints on
the non-cash trades (zt )1:n are exact (since we assume that our trades are exe-
cuted in full), while constraints on the cash trade (zt )n+1 are approximate, due
to our estimation of the costs. As with holding constraints, trading constraints
may be mandatory or discretionary.

Turnover limit. The turnover of a portfolio in period t is given by ∥(zt )1:n ∥1 /2.
It is common to limit the turnover to a fraction δ (of portfolio value), i.e.,
∥(zt )1:n ∥1 /2 ≤ δ.

Limits relative to trading volume. Trades in non-cash assets may be re-


stricted to a certain fraction δ of the current period market volume Vt (estimate),
|(zt )1:n | ≤ δ(Vt /vt ),
where the division on the right-hand side means elementwise.

No-buy, sell, or trade restriction. A no-buy restriction on asset i imposes


the constraint
(zt )i ≤ 0,
while a no-sell restriction imposes the constraint
(zt )i ≥ 0.
A no-trade restriction imposes both a no-buy and no-sell restriction.

4.6 Soft constraints


Any of the constraints on holdings or transactions can be made soft, which
means that they are not strictly enforced. We explain this in a general setting. For a
vector equality constraint h(x) = 0 on the variable or expression x, we replace
it with a term subtracted from the objective of the form γ∥h(x)∥1 , where γ > 0 is the
priority of the constraint. (We can generalize this to γ T |h(x)|, with γ a vector,
to give different priorities to the different components of h(x).)
In a similar way we can replace an inequality constraint h(x) ≤ 0 with a term,
subtracted from the objective, of the form γ T (h(x))+ , where γ > 0 is a vector
of priorities. Replacing the hard constraints with these penalty terms results in
soft constraints. For large enough values of the priorities, the constraints hold
exactly; for smaller values, the constraints are (roughly speaking) violated only
when they need to be.
As an example, we can convert a set of factor-neutrality constraints FtT (wt +
zt ) = 0 to soft constraints, by subtracting a term γ∥FtT (wt + zt )∥1 from the
objective, where γ > 0 is the priority. For larger values of γ factor neutrality
FtT (wt + zt ) = 0 will hold (exactly, when possible); for smaller values some
factor exposures can become nonzero, depending on other objective terms and
constraints.
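
A sketch of this soft factor-neutrality penalty, with made-up data and a simplified
objective standing in for (14):

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    n, k, gamma = 20, 5, 0.1
    F = np.random.randn(n, k)            # factor loadings (illustrative)
    r_hat = 1e-3 * np.random.randn(n)    # return forecast (illustrative)

    w_plus = cp.Variable(n)
    # hard constraint would be F.T @ w_plus == 0; the soft version subtracts a penalty
    objective = cp.Maximize(r_hat @ w_plus - gamma * cp.norm1(F.T @ w_plus))
    constraints = [cp.sum(w_plus) == 1, w_plus >= 0]
    cp.Problem(objective, constraints).solve()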

4.7 Convexity
The portfolio optimization problem (13) can be solved quickly and reliably using
readily available software so long as the problem is convex. This requires that the
risk and estimated transaction- and holding-cost functions are convex, and the
trade and holding constraint sets are convex. All the functions and constraints
discussed above are convex, except for the self-financing constraint
1^T z_t + ϕ̂^trade_t(z_t) + ϕ̂^hold_t(w_t + z_t) = 0,
which must be relaxed to the inequality
1^T z_t + ϕ̂^trade_t(z_t) + ϕ̂^hold_t(w_t + z_t) ≤ 0.
The inequality will be tight at the optimum of (13). Alternatively, the self-
financing constraint can be replaced with the simplified version 1T zt = 0 as in
problem (14).

Solution times. The SPO problems described above, except for the multi-
covariance risk model, can be solved using standard interior-point methods
(Nesterov and Nemirovskii 1994) with a complexity O(nk 2 ) flops, where n is
the number of assets and k is the number of factors. (Without the factor model,
we replace k with n.) The coefficient in front is on the order of 100, which
includes the interior-point iteration count and other computation. This should
be the case even for complex leverage constraints, the 3/2-power transaction
cost, limits on trading and holding, and so on.
This means that a typical current single core (or thread) of a processor can solve
an SPO problem with 1,500 assets and 50 factors in under one half second (based
conservatively on a computation speed of 1G flop/sec). This is more than fast
enough to use the methods to carry out trading with periods on the order of
a second. But the speed is still very important even when the trading is daily,
in order to carry out backtesting. For daily trading, one year of backtesting,
around 250 trading days, can be carried out in a few minutes or less. A generic
32 core computer, running 64 threads, can carry out a backtest on five years of
data, with 64 different choices of parameters (see below), in under 10 minutes.
This involves solving 80,000 convex optimization problems. All of these times
scale linearly with the number of assets, and quadratically with the number of
factors. For a problem with, say, 4,500 assets and 100 factors, the computation
times would be around 12× longer. Our estimates are conservatively based
on a computation speed of 1G flop/sec; for these or larger problems multi-
threaded optimized linear algebra routines can achieve 100G flop/sec, making
the backtesting 100× faster.
We mention one special case that can be solved much faster. If the objective
is quadratic, which means that the risk and costs are quadratic functions, and
the only constraints are linear equality constraints (e.g., factor neutrality), the
problem can be solved with the same O(nk 2 ) complexity, but the coefficient in
front is closer to 2, around 50 times faster than using an interior-point method.
Custom solvers, or solvers targeted to specific platforms like GPUs, can solve
SPO problems much faster (O’Donoghue et al. 2016). For example, the first or-
der operator-splitting method implemented in POGS (Fougner and Boyd 2018)
running on a GPU can solve extremely large SPO problems. POGS can solve a
problem with 100,000 assets and 1,000 factors (which is much larger than any
practical problem) in a few seconds or less. At the other extreme, code genera-
tion systems like CVXGEN (Mattingley and Boyd 2012) can solve smaller SPO
problems with stunning speed; for example, a problem with 30 assets in well
under one millisecond.

Problem specification. New frameworks for convex optimization such as


CVX (Fougner and Boyd 2018), CVXPY (Diamond and Boyd 2016), and Con-
vex.jl (Udell et al. 2014), based on the idea of disciplined convex programming

(Grant et al. 2006), make it very easy to specify and modify the SPO problem
in just a handful of lines of easy to understand code. These frameworks make it
easy to experiment with non-standard trading and holding constraints, or risk
and cost functions.

Nonconvexity. The presence of nonconvex constraints or terms in the op-


timization problem greatly complicates its solution, making its solution time
much longer, and sometimes very much longer. This may not be a problem in
the production trading engine that determines one trade per day, or per hour.
But nonconvexity makes backtesting much slower at the least, and in many
cases simply impractical. This greatly reduces the effectiveness of the whole
optimization-based approach. For this reason, nonconvex constraints or terms
should be strenuously avoided.

Nonconvex constraints generally arise only when someone who does not under-
stand this adds a reasonable sounding constraint, unaware of the trouble he or
she is making. As an example, consider imposing a minimum trade condition,
which states that if (zt )i is nonzero, it must satisfy |(zt )i | ≥ ϵ, where ϵ > 0.
This constraint seems reasonable enough, but makes the problem nonconvex. If
the intention was to achieve sparse trading, or to avoid many very small trades,
this can be accomplished (in a far better way) using convex constraints or cost
terms.

Other examples of nonconvex constraints (that should be avoided) include limits


on the number of assets held, minimum values of nonzero holdings, or restricting
trades to be integer numbers of share lots, or restricting the total number of
assets we can trade. The requirement that we must trade integer numbers of
shares is also nonconvex, but irrelevant for any practical portfolio. The error
induced by rounding our trade lists (which contain real numbers) to an integer
number of shares is negligible for reasonably sized portfolios.

While nonconvex constraints and objective terms should be avoided, and are
generally not needed, it is possible to handle many of them using simple powerful
heuristics, such as solving a relaxation, fixing the nonconvex terms, and then
solving the convex problem again (Diamond et al. 2018). As a simple example
of this approach, consider the minimum nonzero trade requirement |(zt )i | ≥ ϵ
for (zt )i ̸= 0. We first solve the SPO problem without this constraint, finding
a solution z̃. We use this tentative trade vector to determine which entries of z
will be zero, negative, or positive (i.e., which assets we hold, sell, or buy). We
now impose these sign constraints on the trade vector: we require (zt )i = 0 if
(z̃t )i = 0, (zt )i ≥ 0 if (z̃t )i > 0, and (zt )i ≤ 0 if (z̃t )i < 0. We solve the SPO
again, with these sign constraints, and the minimum-trade constraints as well,
which are now linear, and therefore convex. This simple method will work very
well in practice.
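
The following sketch illustrates this two-pass heuristic for a minimum nonzero trade
size, with made-up data; it only shows the fix-the-signs-and-re-solve pattern, not a
production implementation.

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    n, eps = 10, 0.005                      # minimum nonzero trade size
    w = np.ones(n) / n                      # current weights
    r_hat = 1e-3 * np.random.randn(n)       # return forecast (illustrative)
    Sigma = 1e-4 * np.eye(n)

    z = cp.Variable(n)
    objective = cp.Maximize(r_hat @ (w + z) - 5.0 * cp.quad_form(w + z, Sigma))
    base = [cp.sum(z) == 0, w + z >= 0]

    # Pass 1: solve without the (nonconvex) minimum-trade requirement.
    cp.Problem(objective, base).solve()
    signs = np.sign(np.round(z.value, 6))

    # Pass 2: fix the trade signs; the minimum-trade constraints become linear.
    extra = [z[i] == 0 if s == 0 else (z[i] >= eps if s > 0 else z[i] <= -eps)
             for i, s in enumerate(signs)]
    cp.Problem(objective, base + extra).solve()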

As another example, suppose that we are limited to make at most K nonzero


trades in any given period. A very simple scheme, based on convex optimization,
will work extremely well. First we solve the problem ignoring the limit, and
possibly with an additional ℓ1 transaction cost added in, to discourage trading.
We take this trade list and find the K largest trades (buy or sell). We then
add the constraint to our problem that we will only trade these assets, and we
solve the portfolio optimization problem again, using only these trades. As in
the example described above, this approach will yield extremely good, if not
optimal, trades. This approximation will have no effect on the real metrics of
interest, i.e., the portfolio performance.
There is generally no need to solve the nonconvex problem globally, since this
greatly increases the solve time and delivers no practical benefit in terms of
trading performance. The best method for handling nonconvex problems in
portfolio optimization is to avoid them.

4.8 Using single-period optimization


The idea. In this section we briefly explain, at a high level, how the SPO
trading algorithm is used in practice. We do not discuss what is perhaps the
most critical part, the return (and other parameter) estimates and forecasts.
Instead, we assume the forecasts are given, and focus on how to use SPO to
exploit them.
In SPO, the parameters that appear in the transaction and holding costs can
be inspired or motivated by our estimates of what their true values will be, but
it is better to think of them as ‘knobs’ that we turn to achieve trading behavior
that we like (see, e.g., Jagannathan and Ma 2003, Cornuejols and Tütüncü 2006,
DeMiguel et al. 2009a, Li 2015b), as verified by backtesting, what-if simulation,
and stress-testing.
As a crude but important example, we can scale the entire transaction-cost func-
tion ϕ^trade_t by a trading-aversion factor γ^trade. (The name emphasizes the analogy
with the risk-aversion parameter, which scales the risk term in the objective.)
Increasing the trading-aversion parameter will deter trading or reduce turnover;
decreasing it will increase trading and turnover. We can even think of 1/γ trade
as the number of periods over which we will amortize the transaction cost we
incur (Grinold 2006). As a more sophisticated example, the transaction-cost
parameters at , meant to model bid-ask spread, can be scaled up or down. If we
increase them, the trades become more sparse, i.e., there are many periods in
which we do not trade each asset. If we scale the 3/2-power term, we encourage
or discourage large trades. Indeed, we could add a quadratic transaction term to
the SPO problem, not because we think it is a good model of transaction costs,
but to discourage large trades even more than the 3/2-power term does. Any
SPO variation, such as scaling certain terms, or adding new ones, is assessed by
backtesting and stress-testing.

The same ideas apply to the holding cost. We can scale the holding-cost rates by
a positive holdings-aversion parameter γ hold to encourage, or discourage, hold-
ing positions that incur holding costs, such as short positions. If the holding
cost reflects the cost of holding short positions, the parameter γ hold scales our
aversion to holding short positions. We can modify the holding cost by adding a
quadratic term of the short positions κT (wt + zt )2− , (with the square interpreted
elementwise and κ ≥ 0), not because our actual borrow-cost rates increase with
large short positions, but to send the message to the SPO algorithm that we
wish to avoid holding large short positions.
As another example, we can add a liquidation loss term to the holding cost, with
a scale factor to control its effect. We add this term not because we intend to
liquidate the portfolio, but to avoid building up large positions in illiquid assets.
By increasing the scale factor for the liquidation loss term, we discourage the
SPO algorithm from holding illiquid positions.

Trade-, hold-, and risk-aversion parameters. The discussion above sug-


gests that we modify the objective in (14) with scaling parameters for transaction
and holding costs, in addition to the traditional risk-aversion parameter, which
yields the SPO problem
maximize    r̂_t^T z_t − γ^trade_t ϕ̂^trade_t(z_t) − γ^hold_t ϕ̂^hold_t(w_t + z_t) − γ^risk_t ψ_t(w_t + z_t)    (19)
subject to  1^T z_t = 0,  z_t ∈ Z_t,  w_t + z_t ∈ W_t,

where γttrade , γthold , and γtrisk are positive parameters used to scale the respective
costs. These parameters are sometimes called hyperparameters, which empha-
sizes the analogy to the hyperparameters used when fitting statistical models to
data. The hyperparameters are ‘knobs’ that we ‘turn’ (i.e., choose or change) to
obtain good performance, which we evaluate by backtesting. We can have even
more than three hyperparameters, which scale individual terms in the holding
and transaction costs. The choice of hyperparameters can greatly affect the per-
formance of the SPO method. They should be chosen using backtesting, what-if
testing, and stress-testing.
This style for using SPO is similar to how optimization is used in many other
applied areas, for example control systems or machine learning. In machine
learning, for example, the goal is to find a model that makes good predictions
on new data. Most methods for constructing a model use optimization to min-
imize a so-called loss function, which penalizes not fitting the observed data,
plus a regularizer, which penalizes model sensitivity or complexity. Each of
these functions is inspired by a (simplistic) theoretical model of how the data
were generated. But the final choice of these functions, and the (hyperparam-
eter) scale factor between them, is done by out-of-sample validation or cross

validation, i.e., testing the model on data it has not seen (Hastie et al. 2009).
For general discussion of how convex optimization is used in this spirit, in ap-
plications such as control or estimation, see Boyd and Vandenberghe (2004).

Judging value of forecasts. In this paper we do not consider forecasts,


which of course are critically important in trading. The most basic test of a
new proposed return estimate or forecast is that it does well predicting returns.
This is typically judged using a simple model that evaluates Sharpe ratio or
information ratio, implicitly ignoring all portfolio constraints and costs. If a
forecast fails these simple SR or IR tests, it is unlikely to be useful in a trading
algorithm.
But the true value of a proposed estimate or forecast in the context of multi-
period trading can be very different from what is suggested by the simple SR or
IR prediction tests, due to costs, portfolio constraints, and other issues. A new
proposed forecast should be judged in the context of the portfolio constraints,
other forecasts (say, of volume), transaction costs, holding costs, trading con-
straints, and choice of parameters such as risk aversion. This can be done using
simulation, carrying out backtests, what-if simulations, and stress-tests, in each
case varying the parameters to achieve the best performance. The result of this
testing is that the forecast might be less valuable (the usual case) or more valu-
able (the less usual case) than it appeared from the simple SR and IR tests. One
consequence of this is that the true value of a forecast can depend considerably
on the type and size of the portfolio being traded; for example, a forecast could
be very valuable for a small long-short portfolio with modest leverage, and much
less valuable for a large long-only portfolio.

5 Multi-period optimization
5.1 Motivation
In this chapter we discuss optimization-based strategies that consider informa-
tion about multiple periods when choosing trades for the current period. Before
delving into the details, we should consider what we hope to gain over the
single-period approach. Predicting the returns for the current period is difficult
enough. Why attempt to forecast returns in future periods?
One reason is to better account for transaction costs. In the absence of trans-
action cost (and other limitations on trading), a greedy strategy that only con-
siders one period at a time is optimal, since performance for the current period
does not depend on previous holdings. However, in any realistic model current
holdings strongly affect whether a return prediction can be profitably acted on.
We should therefore consider whether the trades we make in the current period
put us in a good or bad position to trade in future periods. While this idea can

be incorporated into single-period optimization, it is more naturally handled in multi-period optimization.
For example, suppose our single-period optimization-based strategy tells us to
go very long in a rarely traded asset. We may not want to make the trade
because we know that unwinding the position will incur large transaction costs.
The single-period problem models the cost of moving into the position, but not
the cost of moving out of it. To model the fact that we will over time revert
positions towards the benchmark, and thus must eventually sell the positions
we buy, we need to model time beyond the current period. (One standard trick
in single-period optimization is to double the transaction cost, which is then
called the round-trip cost.)
Another advantage of multi-period optimization is that it naturally handles
multiple, possibly conflicting return estimates on different time scales (see, e.g.,
Gârleanu and Pedersen 2013, Nystrup et al. 2018b). As an example, suppose
we predict that a return will be positive over a short period, but over a longer
period it will be negative. The first prediction might be relevant for only a
day, while the second for a month or longer. In a single-period optimization
framework, it is not clear how to account for the different time scales when
blending the return predictions. Combining the two predictions would likely
cancel them, or have us move according to whichever prediction is larger. But
the resulting behavior could be quite non-optimal. If the trading cost is high,
taking no action is likely the right choice, since we will have to reverse any trade
based on the fast prediction as we follow the slow prediction in future periods. If
the trading cost is low, however, the right choice is to follow the fast prediction,
since unwinding the position is cheap. This behavior falls naturally out of a
multi-period optimization, but is difficult to capture in a single-period problem.
There are many other situations where predictions over multiple periods, as
opposed to just the current period, can be taken advantage of in multi-period
optimization. We describe a few of them here.

• Signal decay and time-varying return predictions. Generalizing the discussion above on fast versus slow signals, we may assign an exponential
decay-rate to every return prediction signal. (This can be estimated his-
torically, for example, by fitting an auto-regressive model to the signal
values.) Then it is easy to compute return estimates at any time scale.
The decay in prediction accuracy is also called mean-reversion or alpha
decay (see, e.g., Campbell et al. 1997, Grinold 2006, Gârleanu and Peder-
sen 2013). (See the sketch after this list.)
• Known future changes in volatility or risk. If we know that a future event
will increase the risk, we may want to exit some of the risky positions in
advance. In MPO, trading towards a lower risk position starts well before
the increase in risk, trading it off with the transaction costs. In SPO,
(larger) trading to a lower risk position occurs only once the risk has
increased, leading to larger transaction costs. Conversely, known periods
of low risk can be exploited as well.
• Changing constraints over multiple periods. As an example, assume we
want to de-leverage the portfolio over multiple periods, i.e., reduce the
leverage constraint Lmax over some number of periods to a lower value. If
we use a multi-period optimization framework we will likely incur lower
trading cost than by some ad-hoc approach, while still exploiting our re-
turns predictions.
• Known future changes in liquidity or volume. Future volume or volatility
predictions can be exploited for transaction-cost optimization, for example
by delaying some trades until they will be cheaper. Market volumes Vt
have much better predictability than market returns.
• Setting up, shutting down, or transferring a portfolio. These transitions
can all be handled naturally by MPO, with a combination of constraints
and objective terms changing over time.
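As a small illustration of the signal-decay bullet above, the following sketch turns a single current return signal into forecasts at several future horizons using an assumed per-period decay factor. The signal value and decay rate are made-up numbers, and estimating the decay historically (for example by fitting an AR(1) model to the signal) is not shown.

import numpy as np

signal_today = 0.002   # assumed current return signal (next-period forecast)
decay = 0.7            # assumed per-period exponential decay of the signal

# Forecast attributable to the signal k periods ahead: decay**(k-1) * signal_today.
horizons = np.arange(1, 11)
per_period_forecasts = decay ** (horizons - 1) * signal_today

# Cumulative return the signal predicts over the first H of those periods.
cumulative_forecast = np.cumsum(per_period_forecasts)
print(per_period_forecasts[:3], cumulative_forecast[-1])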

5.2 Multi-period optimization


In multi-period optimization, we choose the current trade vector zt by solving
an optimization problem over a planning horizon that extends H periods into
the future,
t, t + 1, . . . , t + H − 1.
(Single-period optimization corresponds to the case H = 1.)
Many quantities at times t, t + 1, . . . , t + H − 1 are unknown at time t, when
the optimization problem is solved and the asset trades are chosen, so as in the
single-period case, we will estimate them. For any quantity or function Z, we
let Ẑτ |t denote our estimate of Zτ given all information available to us at the
beginning of period t. (Presumably τ ≥ t; otherwise we can take Ẑτ |t = Zτ , the
realized value of Z at time τ .) For example, r̂t|t is the estimate made at time
t of the return at time t (which we denoted r̂t in the section on single-period
optimization); r̂t+2|t is the estimate made at time t of the return at time t + 2.
We can develop a multi-period optimization problem starting from (13). Let

zt , zt+1 , . . . , zt+H−1

denote our sequence of planned trades over the horizon. A natural objective is
the total risk-adjusted return over the horizon,

Σ_{τ=t}^{t+H−1} ( r̂_{τ|t}^T (w_τ + z_τ) − γ_τ ψ_τ(w_τ + z_τ) − ϕ̂_τ^hold(w_τ + z_τ) − ϕ̂_τ^trade(z_τ) ).

(This expression drops a constant that does not depend on the trades, and han-
dles absolute or active return.) In this expression, wt is known, but wt+1 , . . . , wt+H
are not, since they depend on the trades zt , . . . , zt+H−1 (which we will choose)
and the unknown returns, via the dynamics equation (10),

w_{t+1} = (1/(1 + R_t^p)) (1 + r_t) ∘ (w_t + z_t),

which propagates the current weight vector to the next one, given the trading
and return vectors. (This true dynamics equation ensures that if 1T wt = 1, we
have 1T wt+1 = 1.)
In adding the risk terms γ_τ ψ_τ(w_τ + z_τ) in this objective, we are implicitly relying
on the idea that the returns are independent random variables, so the variance
of the sum is the sum of the variances. We can also interpret γτ ψτ (wτ + zτ ) as
cost terms that discourage us from holding certain portfolios.

Simplifying the dynamics. We now make a simplifying approximation: for the purpose of propagating w_t and z_t to w_{t+1} in our planning exercise, we
will assume Rtp = 0 and rt = 0 (i.e., that the one period returns are small
compared to one). This results in the much simpler dynamics equation wt+1 =
wt + zt . With this approximation, we must add the constraints 1T zt = 0 to
ensure that the weights in our planning exercise add to one, i.e., 1T wτ = 1,
τ = t + 1, . . . , t + H. So we will impose the constraints

1T zτ = 0, τ = t + 1, . . . , t + H − 1.

The current portfolio weights wt are given, and satisfy 1T wt = 1; we get that
1T wτ = 1 for τ = t + 1, . . . , t + H due to the constraints. (Implications of the
dynamics simplification are discussed below.)

Multi-period optimization problem. With the dynamics simplification we arrive at the MPO problem
maximize    Σ_{τ=t}^{t+H−1} ( r̂_{τ|t}^T (w_τ + z_τ) − γ_τ ψ_τ(w_τ + z_τ) − ϕ̂_τ^hold(w_τ + z_τ) − ϕ̂_τ^trade(z_τ) )        (20)
subject to  1^T z_τ = 0,  z_τ ∈ Z_τ,  w_τ + z_τ ∈ W_τ,
            w_{τ+1} = w_τ + z_τ,   τ = t, . . . , t + H − 1,

with variables zt , zt+1 , . . . , zt+H−1 and wt+1 , . . . , wt+H . Note that wt is not a
variable, but the (known) current portfolio weights. When H = 1, the multi-
period problem reduces to the simplified single-period problem (14). (We can ignore the constant r̂_{t|t}^T w_t, which does not depend on the variables, that appears in (20) but not (14).)

Using w_{τ+1} = w_τ + z_τ we can eliminate the trading variables z_τ to obtain the equivalent problem
maximize    Σ_{τ=t+1}^{t+H} ( r̂_{τ|t}^T w_τ − γ_τ ψ_τ(w_τ) − ϕ̂_τ^hold(w_τ) − ϕ̂_τ^trade(w_τ − w_{τ−1}) )        (21)
subject to  1^T w_τ = 1,  w_τ − w_{τ−1} ∈ Z_τ,  w_τ ∈ W_τ,   τ = t + 1, . . . , t + H,

with variables wt+1 , . . . , wt+H , the planned weights over the next H periods.
This is the multi-period analog of (15).
Both MPO formulations (20) and (21) are convex optimization problems, pro-
vided the transaction cost, holding cost, risk functions, and trading and holding
constraints are all convex.
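To make the structure of (21) concrete, here is a minimal CVXPY sketch of the multi-period problem with a single quadratic risk model, a linear transaction-cost term, and a leverage limit standing in for the sets Z_τ and W_τ. All numerical inputs (the forecasts, risk model, cost coefficient, and limits) are placeholders, and the sketch is separate from the CVXPortfolio implementation described in section 6.

import cvxpy as cp
import numpy as np

n, H = 10, 5                        # non-cash assets, planning horizon
r_hat = np.zeros((H, n + 1))        # placeholder forecasts, row tau is r_hat_{t+tau|t}
Sigma = 1e-4 * np.eye(n + 1)        # placeholder risk model, same for all periods
gamma_risk, gamma_trade, a = 5.0, 1.0, 5e-4
w_init = np.ones(n + 1) / (n + 1)   # current weights w_t, summing to one

W = cp.Variable((H + 1, n + 1))     # planned weights w_t, ..., w_{t+H}
objective, constraints = 0, [W[0] == w_init]
for tau in range(1, H + 1):
    w, z = W[tau], W[tau] - W[tau - 1]          # planned weight and implied trade
    objective += (r_hat[tau - 1] @ w
                  - gamma_risk * cp.quad_form(w, Sigma)
                  - gamma_trade * a * cp.norm(z[:n], 1))
    constraints += [cp.sum(w) == 1,             # weights add to one
                    cp.norm(w[:n], 1) <= 3]     # leverage limit
prob = cp.Problem(cp.Maximize(objective), constraints)
prob.solve()
z_current = W.value[1] - w_init     # only the first planned trade is executed

Only z_current would be carried out; at the next period the problem is re-formed with updated forecasts and solved again.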

Interpretation of MPO. The MPO problems (20) or (21) can be interpreted as follows. The variables constitute a trading plan, i.e., a set of trades to be
executed over the next H periods. Solving (20) or (21) is forming a trading
plan, based on forecasts of critical quantities over the planning horizon, and
some simplifying assumptions. We do not intend to execute this sequence of
trades, except for the first one zt . It is reasonable to ask then why we optimize
over the future trades zt+1 , . . . , zt+H−1 , since we do not intend to execute them.
The answer is simple: we optimize over them as part of a planning exercise,
just to be sure we don’t carry out any trades now (i.e., zt ) that will put us
in a bad position in the future. The idea of carrying out a planning exercise,
but only executing the current action, occurs and is used in many fields, such as
automatic control (where it is called model predictive control, MPC, or receding
horizon control) (Kwon and Han 2005, Bemporad 2006, Mattingley et al. 2011),
supply chain optimization (Cho et al. 2003), and others. Applications of MPC
in finance include Herzog et al. (2007), Boyd et al. (2014), Bemporad et al.
(2014), Busseti and Boyd (2015), Nystrup et al. (2018b).

About the dynamics simplification. Before proceeding let us discuss the simplification of the dynamics equation, where we replace the exact weight update

w_{t+1} = (1/(1 + R_t^p)) (1 + r_t) ∘ (w_t + z_t)
with the simplified version wt+1 = wt + zt , by assuming that rt = 0. At first
glance it appears to be a gross simplification, but this assumption is only made
for the purpose of propagating the portfolio forward in our planning process; we
do take the returns into account in the first term of our objective. We are thus
neglecting second-order terms, and we cannot be too far off if the per period
returns are small compared to one.

In a similar way, adding the constraints 1^T z_τ = 0 for τ = t + 1, . . . , t + H − 1 suggests that we are ignoring the transaction and holding costs, since if z_τ were a realized trade we would have 1^T z_τ = −ϕ_τ^trade(z_τ) − ϕ_τ^hold(w_τ + z_τ). As above, this
assumption is only made for the purpose of propagating our portfolio forward
in our planning exercise; we do take the costs into account in the objective.

Terminal constraints. In MPO, with a reasonably long horizon, we can add a terminal (equality) constraint, which requires the final planned weight to take
some specific value, wt+H = wterm . A reasonable choice for the terminal portfolio
weight is (our estimate of) the benchmark weight wb at period t + H.
For optimization of absolute or excess return, the terminal weight would be cash,
i.e., wterm = en+1 . This means that our planning exercise should finish with the
portfolio all cash. This does not mean we intend to liquidate the portfolio in H
periods; rather, it means we should carry out our planning as if this were the
case. This will keep us from making the mistake of moving into what appears, in
terms of our returns predictions, to be an attractive position that is, however,
expensive to unwind. For optimization relative to a benchmark, the natural
terminal constraint is to be in the (predicted) benchmark.
Note that adding a terminal constraint reduces the number of variables. We
solve the problem (20), but with wt+H a given constant, not a variable. The ini-
tial weight wt is also a given constant; the intermediate weights wt+1 , . . . , wt+H−1
are variables.

5.3 Computation
The MPO problem (21) has Hn variables. In general the complexity of a convex
optimization increases as the cube of the number of variables, but in this case
the special structure of the problem can be exploited so that the computational
effort grows linearly in H, the horizon. Thus, solving the MPO problem (21)
should be a factor H slower than solving the SPO problem (15). For modest
H (say, a few tens), this is not a problem. But for H = 100 (say) solving the
MPO problem can be very challenging. Distributed methods based on ADMM
(Boyd et al. 2011, 2014) can be used to solve the MPO problem using multiple
processors.
In most cases we can solve the MPO problem in production. The issue is back-
testing, since we must solve the problem many times, and with many variations
of the parameters.

5.4 How MPO is used


All of the general ideas about how SPO is used apply to MPO as well; for
example, we consider the parameters in the MPO problem as knobs that we
adjust to achieve good performance under backtest and stress-test. In MPO, we

must provide forecasts of each quantity for each period over the next H periods.
This can be done using sophisticated forecasts, with possibly different forecasts
for each period, or in a very simple way, with predictions that are constant.

5.5 Multi-scale optimization


MPO trading requires estimates of all relevant quantities, like returns, trans-
action costs, and risks, over H trading periods into the future. In this section
we describe a simplification of MPO that requires fewer predictions, as well as
less computation to carry out the optimization required in each period. We still
create a plan for trades and weights over the next H periods, but we assume
that trades take place only a few times over the horizon; in other time periods
the planned portfolio is maintained with no trading. This preserves the idea
that we have recourse; but it greatly simplifies the problem (20). We describe
the idea for three trades, taken in the short term, medium term, and long term,
and an additional trade at the end to satisfy a terminal constraint wt+H = wb .
Specifically we add the constraint that in (20), trading (i.e., zτ ̸= 0) only occurs
at specific periods in the future, for
τ = t,  τ = t + T^med,  τ = t + T^long,  τ = t + H − 1,
where
1 < T^med < T^long < H − 1.
We interpret z^short = z_t as our short term trade, z^med = z_{t+T^med} as our medium term trade, and z^long = z_{t+T^long} as our long term trade, in our trading plan. The final nonzero trade z_{t+H−1} is determined by the terminal constraint.
For example, we might take T^med = 5 and T^long = 21, with H = 100. If the periods represent days, we plan to trade now (short term), in a week (medium term), and in a month (longer term); in 99 days, we trade to the benchmark. The
only variables we have are the short, medium, and long term trades, and the
associated weights, given by
w^short = w_t + z^short,   w^med = w^short + z^med,   w^long = w^med + z^long.
To determine the trades to make, we solve (20) with all other zτ set to zero,
and using the weights given above. This results in an optimization problem
with the same form as (20), but with only three variables each for trading and
weights, and three terms in the objective, plus an additional term that represents
the transaction cost associated with the final trade to the benchmark at time
t + H − 1.
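The following sketch shows one way the three-trade plan above could be set up in CVXPY. It assumes a single constant risk model and a linear transaction cost, and it approximates the risk over each segment by scaling the per-period risk by the segment length; the specific numbers are placeholders and the formulation is only an illustration, not the CVXPortfolio implementation.

import cvxpy as cp
import numpy as np

n, H, T_med, T_long = 10, 100, 5, 21
r_hat = np.zeros((H, n + 1))             # placeholder per-period return forecasts
Sigma = 1e-4 * np.eye(n + 1)             # placeholder risk model
gamma_risk, a = 5.0, 5e-4                # risk aversion, linear t-cost coefficient
w_t = np.ones(n + 1) / (n + 1)           # current weights
w_term = np.zeros(n + 1); w_term[-1] = 1 # terminal weight: all cash

z_short, z_med, z_long = (cp.Variable(n + 1) for _ in range(3))
w_short = w_t + z_short                  # held over periods t, ..., t+T_med-1
w_med = w_short + z_med                  # held over periods t+T_med, ..., t+T_long-1
w_long = w_med + z_long                  # held over periods t+T_long, ..., t+H-2
z_final = w_term - w_long                # forced by the terminal constraint

objective = -a * cp.norm(z_final[:n], 1) # cost of the final trade to w_term
constraints = []
for w, z, start, end in [(w_short, z_short, 0, T_med),
                         (w_med, z_med, T_med, T_long),
                         (w_long, z_long, T_long, H - 1)]:
    objective += (r_hat[start:end].sum(axis=0) @ w
                  - (end - start) * gamma_risk * cp.quad_form(w, Sigma)
                  - a * cp.norm(z[:n], 1))
    constraints += [cp.sum(z) == 0, cp.norm(w[:n], 1) <= 3]
prob = cp.Problem(cp.Maximize(objective), constraints)
prob.solve()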

6 Implementation
We have developed an open-source Python package CVXPortfolio (Busseti et al.
2017) that implements the portfolio simulation and optimization concepts dis-
cussed in the paper. The package relies on Pandas (McKinney 2012) for man-
aging data. Pandas implements structured data types as in-memory databases
(similar to R dataframes) and provides a rich API for accessing and manipulating
them. Through Pandas, it is easy to couple our package with database backends.
The package uses the convex optimization modeling framework CVXPY (Dia-
mond and Boyd 2016) to construct and solve portfolio optimization problems.
The package provides an object-oriented framework with classes representing
return, risk measures, transaction costs, holding constraints, trading constraints,
etc. Single-period and multi-period optimization models are constructed from
instances of these classes. Each instance generates CVXPY expressions and
constraints for any given period t, making it easy to combine the instances into
a single convex model. In section 7 we give some simple numerical examples
that use CVXPortfolio.

6.1 Components
We briefly review the major classes in the software package. Implementing
additional classes, such as novel policies or risk measures, is straightforward.

Return estimates. Instances of the ReturnsForecast class generate a return estimate r̂_t for period t using only information available at that period.
The simplest ReturnsForecast instance wraps a Pandas dataframe with return
estimates for each period:

r_hat = ReturnsForecast(return_estimates_dataframe)

Multiple ReturnsForecast instances can be blended into a linear combination.

Risk measures. Instances of a risk measure class, contained in the risks submodule, generate a convex cost representing a risk measure at a given period
t. For example, the FullSigma class generates the cost (wt + zt )T Σt (wt + zt )
where Σt ∈ R(n+1)×(n+1) is an explicit matrix, whereas the FactorModel class
generates the cost with a factor model of Σt . Any risk measure can be switched
to absolute or active risk and weighted with a risk-aversion parameter. The
package provides all the risk measures discussed in section 4.2.

Costs. Instances of the TcostModel and HcostModel classes generate transaction- and holding-cost estimates, respectively. The same classes work both for mod-
eling costs in a portfolio optimization problem and calculating realized costs in
a trading simulation. Cost objects can also be used to express other objective
terms like soft constraints.

Constraints. The package provides classes representing each of the constraints discussed in section 4.4 and section 4.5. For example, instances of the class
LeverageLimit generate a leverage limit constraint that can vary by period.
Constraint objects can be converted into soft constraints, which are cost ob-
jects.

Policies. Instances of a policy class take holdings wt and value vt and output
trades zt using information available in period t. Single-period optimization
policies are constructed using the SinglePeriodOpt class. The constructor
takes a ReturnsForecast, a list of costs, including risk models (multiplied by
their coefficients), and constraints. For example, the following code snippet
constructs a SPO policy:

spo_policy = SinglePeriodOpt(r_hat,
[gamma_risk*factor_risk,
gamma_trade*tcost_model,
gamma_hold*hcost_model],
[leverage_limit])

Multi-period optimization policies are constructed similarly. The package also provides classes for simple policies such as periodic rebalancing.

Simulator. The MarketSimulator class is used to run trading simulations, or backtests. Instances are constructed with historical returns and other market
data, as well as transaction- and holding-cost models. Given a MarketSimulator
instance market_sim, a backtest is run by calling the run_backtest method
with an initial portfolio, policy, and start and end periods:

backtest_results = market_sim.run_backtest(init_portfolio,
policy,
start_t, end_t)

Multiple backtests can be run in parallel with different conditions. The backtest
results include all the metrics discussed in section 3.

7 Examples
In this section we present simple numerical examples illustrating the ideas
developed above, all carried out using CVXPortfolio and open-source mar-
ket data (and some approximations where no open source data is available).
The code for these is available at http://github.com/cvxgrp/cvxportfolio/
tree/master/examples. Given our approximations, and other short-comings
of our simulations that we will mention below, the particular numerical results
we show should not be taken too seriously. But the simulations are good enough
for us to illustrate real phenomena, such as the critical role transaction costs
can play, or how important hyperparameter search can be.

7.1 Data for simulation


We work with a period of five years, from January 2012 through December 2016,
on the components of the S&P 500 index as of December 2016. We select the
ones continuously traded in the period. (By doing this we introduce survivorship
bias (Elton et al. 1996).) We collect open-source market data from Quandl
(2016). The data consists of realized daily market returns rt (computed using
closing prices) and volumes Vt . We use the federal reserve overnight rate for
the cash return. Following Almgren (2009) we approximate the daily volatility
with a simple estimator, (σ_t)_i = | log(p_t^open)_i − log(p_t^close)_i |, where (p_t^open)_i and (p_t^close)_i are the open and close prices for asset i in period t. We could not find
open-source data for the bid-ask spread, so we used the value a_t = 0.05% (five basis points) for all assets and periods. As holding costs we use s_t = 0.01% (one basis point) for all assets and periods. We chose standard values for the other parameters of the transaction- and holding-cost models: b_t = 1, c_t = 0, d_t = 0 for all assets and periods.
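As an illustration of the volatility proxy above, the following sketch computes (σ_t)_i from open and close prices; the price data here are synthetic stand-ins, not the Quandl data used in the paper.

import numpy as np
import pandas as pd

# Synthetic close prices for three assets (stand-in for the Quandl data).
dates = pd.date_range("2012-01-03", periods=1000, freq="B")
rng = np.random.default_rng(0)
close_prices = pd.DataFrame(
    100 * np.exp(np.cumsum(0.01 * rng.standard_normal((1000, 3)), axis=0)),
    index=dates, columns=["AAA", "BBB", "CCC"])
open_prices = close_prices.shift(1).bfill()   # crude placeholder for open prices

# Daily volatility estimate used in the transaction-cost model:
# (sigma_t)_i = |log(open_t)_i - log(close_t)_i|.
sigma = (np.log(open_prices) - np.log(close_prices)).abs()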

7.2 Portfolio simulation


To illustrate backtest portfolio simulation, we consider a portfolio that is meant
to track the uniform portfolio benchmark, which has weight wb = (1/n, 0),
i.e., equal fraction of value in all non-cash assets. This is not a particularly
interesting or good benchmark portfolio; we use it only as a simple example to
illustrate the effects of transaction costs. The portfolio starts with w1 = wb , and
due to asset returns drifts from this weight vector. We periodically rebalance,
which means using trade vector zt = wb − wt . For other values of t (i.e., the
periods in which we do not rebalance) we have zt = 0.

We carry out six backtest simulations for each of two initial portfolio values,
$100M and $10B. The six simulations vary in rebalancing frequency: daily,
weekly, monthly, quarterly, annually, or never (also called ‘hold’ or ‘buy-and-
hold’). For each simulation we give the portfolio active return Ra and ac-
tive risk σ a (defined in section 3.2), the annualized average transaction cost
∑T ∑T
t=1 ∥(zt )1:n ∥1 /2.
250 trade
T t=1 ϕt (zt ), and the annualized average turnover 250
T

Table 1 shows the results. (The active return is also included for completeness.)
We observe that transaction cost depends on the total value of the portfolio, as
expected, and that the choice of rebalancing frequency trades off transaction
cost and active risk. (The active risk is not exactly zero when rebalancing
daily because of the variability of the transaction cost, which is included in the

Initial value   Rebalancing frequency   Active return   Active risk   Trans. cost   Turnover
$100M           Daily                   −0.07%          0.00%         0.07%         220.53%
                Weekly                  −0.07%          0.09%         0.04%         105.67%
                Monthly                 −0.12%          0.21%         0.02%          52.71%
                Quarterly               −0.11%          0.35%         0.01%          29.98%
                Annually                −0.10%          0.63%         0.01%          12.54%
                Hold                    −0.36%          1.53%         0.00%           0.00%
$10B            Daily                   −0.25%          0.01%         0.25%         220.53%
                Weekly                  −0.19%          0.09%         0.16%         105.67%
                Monthly                 −0.20%          0.21%         0.10%          52.71%
                Quarterly               −0.17%          0.35%         0.07%          29.99%
                Annually                −0.13%          0.63%         0.04%          12.54%
                Hold                    −0.36%          1.53%         0.00%           0.00%

Table 1: Portfolio simulation results with different initial value and different rebalancing frequencies. All values are annualized.

portfolio return.) Figure 2 shows, separately for the two portfolio sizes, the
active risk versus the transaction cost.

7.3 Single-period optimization


In this section we show a simple example of the single-period optimization model
developed in section 4. The portfolio starts with total value v1 = $100M and
allocation equal to the uniform portfolio w1 = (1/n, 0). We impose a lever-
age constraint of Lmax = 3. This simulation uses the market data defined in
section 7.1. The forecasts and risk model used in the SPO are described below.

Risk model. Proprietary risk models, e.g., from MSCI (formerly Barra), are
widely used. Here we use a simple factor risk model estimated from past realized
returns, using a similar procedure to Almgren (2009). We estimate it on the first
day of each month, and use it for the rest of the month. Let t be an estimation
time period, and t − M^risk the time period two years before. Consider the second moment of the window of realized returns Σ^exp = (1/M^risk) Σ_{τ=t−M^risk}^{t−1} r_τ r_τ^T, and its eigenvalue decomposition Σ^exp = Σ_{i=1}^{n} λ_i q_i q_i^T, where the eigenvalues λ_i are in descending order. Our factor risk model is

F = [q_1 · · · q_k],   Σ^f = diag(λ_1, . . . , λ_k),   D = Σ_{i=k+1}^{n} λ_i diag(q_i) diag(q_i),

with k = 15. (The diagonal matrix D is chosen so the factor model F Σ^f F^T + D and the empirical second moment Σ^exp have the same diagonal elements.)
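A minimal sketch of this factor-model construction, with a synthetic window of returns in place of the realized returns used in the paper:

import numpy as np

rng = np.random.default_rng(0)
M_risk, n, k = 500, 30, 15
returns_window = 0.01 * rng.standard_normal((M_risk, n))   # placeholder r_tau's

# Empirical second moment of the two-year window.
Sigma_exp = returns_window.T @ returns_window / M_risk

# Eigenvalue decomposition, eigenvalues sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(Sigma_exp)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Factor loadings, factor covariance, and diagonal term built from the remaining
# eigenvalues so the diagonals of F Sigma_f F^T + D and Sigma_exp agree.
F = eigvecs[:, :k]
Sigma_f = np.diag(eigvals[:k])
D = np.diag(eigvecs[:, k:] ** 2 @ eigvals[k:])

assert np.allclose(np.diag(F @ Sigma_f @ F.T + D), np.diag(Sigma_exp))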

Figure 2: Active risk versus transaction cost, for the two initial portfolio sizes. The
points on the lines correspond to rebalancing frequencies.

Return forecasts. The risk-free interest rates are known exactly, (r̂t )n+1 =
(rt )n+1 for all t. Return forecasts for the non-cash assets are always proprietary.
They are generated using many methods, ranging from analyst predictions to
sophisticated machine learning techniques, based on a variety of data feeds and
sources. For these examples we generate simulated return forecasts by adding
zero-mean noise to the realized returns and then rescaling, to obtain return
estimates that would (approximately) minimize mean squared error. Of course
this is not a real return forecast, since it uses the actual realized return; but our
purpose here is only to illustrate the ideas and methods.
For all t the return estimates for non-cash assets are

(r̂t )1:n = α ((rt )1:n + ϵt ) , (22)

where ϵt ∼ N (0, σϵ2 I) are independent. We use noise variance σϵ2 = 0.02, so the
noise components have standard deviation around 14%, around a factor of ten
larger than the standard deviation of the realized returns. The scale factor α
is chosen to minimize the mean squared error E[((r̂_t)_{1:n} − (r_t)_{1:n})^2], if we think of r_t as a random variable with variance σ_r^2, i.e., α = σ_r^2/(σ_r^2 + σ_ϵ^2). We use the typical value σ_r^2 = 0.0005, i.e., a realized return standard deviation of around 2%, so α = 0.024. Our typical return forecast is on the order of ±0.3%. This corresponds to an information ratio √α ≈ 0.15, which is on the high end of
what might be expected in practice (Grinold and Kahn 2000).

With this level of noise and scaling, our return forecasts have an accuracy on
the order of what we might expect from a proprietary forecast. For example,
across all the assets and all days, the sign of predicted return agrees with the
sign of the real return around 54% of the times.
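The simulated forecasts (22) are straightforward to reproduce; the sketch below uses synthetic realized returns as a stand-in and checks the sign-agreement rate mentioned above.

import numpy as np

rng = np.random.default_rng(0)
T, n = 1250, 30
realized = 0.02 * rng.standard_normal((T, n))        # stand-in for realized returns

sigma_eps_sq = 0.02                                  # noise variance from the text
sigma_r_sq = 0.0005                                  # assumed return variance
alpha = sigma_r_sq / (sigma_r_sq + sigma_eps_sq)     # about 0.024

# Simulated forecast (22): scaled sum of the realized return and zero-mean noise.
noise = np.sqrt(sigma_eps_sq) * rng.standard_normal((T, n))
r_hat = alpha * (realized + noise)

# Fraction of asset-days where the forecast sign matches the realized sign.
hit_rate = np.mean(np.sign(r_hat) == np.sign(realized))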

Volume and volatility forecasts. We use simple estimates of total market volumes and daily volatilities (used in the transaction-cost model), as moving averages of the realized values with a window of length ten. For example, the volume forecast at time period t and asset i is (V̂_t)_i = (1/10) Σ_{τ=1}^{10} (V_{t−τ})_i.
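A sketch of the ten-day moving-average forecast, using synthetic volumes as a stand-in for the realized market data:

import numpy as np
import pandas as pd

dates = pd.date_range("2012-01-03", periods=250, freq="B")
rng = np.random.default_rng(0)
volumes = pd.DataFrame(rng.uniform(1e6, 5e6, size=(250, 3)),
                       index=dates, columns=["AAA", "BBB", "CCC"])

# Forecast at period t is the average of the ten previous realized values:
# (V_hat_t)_i = (1/10) * sum_{tau=1..10} (V_{t-tau})_i.
# rolling(10).mean() at t averages V_{t-9}, ..., V_t, so shift by one period.
V_hat = volumes.rolling(window=10).mean().shift(1)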

SPO backtests. We carry out multiple backtest simulations over the whole
period, varying the risk-aversion parameter γ risk , the trading-aversion param-
eter γ trade , and the holding-cost multiplier γ hold (all defined and discussed in
section 4.8). We first perform a coarse grid search in the hyperparameter space,
testing all combinations of

γ^risk = 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000,
γ^trade = 1, 2, 5, 10, 20,
γ^hold = 1,

a total of 45 backtest simulations. (Logarithmic spacing is common in hyperparameter searches.)
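The coarse grid search amounts to looping over all 45 hyperparameter combinations and running one backtest per combination; a sketch is below. The helper run_one_backtest is hypothetical and stands in for constructing an SPO policy and calling MarketSimulator.run_backtest as in section 6; here it is stubbed out so the sketch runs on its own.

import itertools

def run_one_backtest(gamma_risk, gamma_trade, gamma_hold):
    # Hypothetical helper: build the SPO policy with these hyperparameters,
    # run the backtest, and return (excess return, excess risk).
    return 0.0, 0.0   # stub

gamma_risks = [0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]
gamma_trades = [1, 2, 5, 10, 20]
gamma_holds = [1]

results = [(g_r, g_t, g_h, *run_one_backtest(g_r, g_t, g_h))
           for g_r, g_t, g_h in itertools.product(gamma_risks, gamma_trades,
                                                  gamma_holds)]
# 9 * 5 * 1 = 45 independent backtests; they can be run in parallel.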

Figure 3 shows mean excess portfolio return Re versus excess volatility σ e (de-
fined in section 3.2), for these combinations of parameters. For each value of
γ trade , we connect with a line the points corresponding to the different values
of γ risk , obtaining a risk–return tradeoff curve for that choice of γ trade and γ hold .
These show the expected tradeoff between mean return and risk. We see that
the choice of trading-aversion parameter is critical: for some values of γ trade the
results are so poor that the resulting curve does not even fit in the plotting area.
Values of γ trade around five seem to give the best results.

We then perform a fine hyperparameter search, focusing on trade-aversion parameter values around five,

γ trade = 4, 5, 6, 7, 8,

and the same values of γ risk and γ hold . Figure 4 shows the resulting curves of
excess return versus excess risk. A value around γ trade = 6 seems to be best.

For our last set of simulations we use a finer range of risk-aversion parameters,
focus on an even narrower range of the trading-aversion parameter, and also

Figure 3: SPO example, coarse hyperparameter grid search. (Some curves do not fit in
the plot.)

Figure 4: SPO example, fine hyperparameter grid search.



Figure 5: SPO example, grid search over 510 hyperparameter combinations. The line
connects the Pareto optimal points.

vary the hold-aversion parameter. We test all combinations of

γ^risk = 0.1, 0.178, 0.316, 0.562, 1, 2, 3, 6, 10, 18, 32, 56, 100, 178, 316, 562, 1000,
γ^trade = 5.5, 6, 6.5, 7, 7.5, 8,
γ^hold = 0.1, 1, 10, 100, 1000,

a total of 510 backtest simulations. The results are plotted in figure 5 as points
in the risk–return plane. The Pareto optimal points, i.e., those with the lowest
risk for a given level of return, are connected by a line. Table 6 lists a selection
of the Pareto optimal points, giving the associated hyperparameter values.
From this curve and table we can make some interesting observations. The first
is that we do substantially better with large values of the holding-cost multiplier
parameter compared to γ hold = 1, even though the actual holding cost (used by
the simulator to update the portfolio each day) is very small, one basis point.
This is a good example of regularization in SPO; our large holding-cost multiplier
parameter tells the SPO algorithm to avoid short positions, and the result is
that the overall portfolio performance is better.
It is hardly surprising that the risk-aversion parameter varies over this selection
of Pareto optimal points; after all, this is the parameter most directly related

γ^risk    γ^trade   γ^hold   Excess return   Excess risk
1000.00   8.0       100       1.33%           0.39%
 562.00   6.0       100       2.49%           0.74%
 316.00   7.0       100       2.98%           1.02%
1000.00   7.5        10       4.64%           1.22%
 562.00   8.0        10       5.31%           1.56%
 316.00   7.5        10       6.53%           2.27%
 316.00   6.5        10       6.88%           2.61%
 178.00   6.5        10       8.04%           3.20%
 100.00   8.0        10       8.26%           3.32%
  32.00   7.0        10      12.35%           5.43%
  18.00   6.5       0.1      14.96%           7.32%
   6.00   7.5        10      18.51%          10.44%
   2.00   6.5        10      23.40%          13.87%
   0.32   6.5        10      26.79%          17.50%
   0.18   7.0        10      28.16%          19.30%

Table 6: SPO example, selection of Pareto optimal points (ordered by increasing risk and return).

to the risk–return tradeoff. One surprise is that the value of the hold-aversion
hyperparameter varies considerably as well.
In practice, we would backtest many more combinations of these three hyper-
parameters. Indeed we would also carry out backtests varying combinations
of other parameters in the SPO algorithm, for example the leverage, or the in-
dividual terms in transaction-cost functions. In addition, we would carry out
stress-tests and other what-if simulations, to get an idea of how our SPO al-
gorithm might perform in other, or more stressful, market conditions. (This
would be especially appropriate given our choice of backtest date range, which
was entirely a bull market.) Since these backtests can be carried out in parallel,
there is no reason to not carry out a large number of them.

7.4 Multi-period optimization


In this section we show the simplest possible example of the multi-period op-
timization model developed in section 5, using planning horizon H = 2. This
means that in each time period the MPO algorithm plans both current day and
next day trades, and then executes only the current day trades. As a practical
matter, we would not expect a great performance improvement over SPO using
a planning horizon of H = 2 days compared to SPO, which uses H = 1 day.
Our point here is to demonstrate that it is different.
The simulations are carried out using the market data described in section 7.1.

The portfolio starts with total value v1 = $100M and uniform allocation w1 =
(1/n, 0). We impose a leverage constraint of Lmax = 3. The risk model is the
same one used in the SPO example. The volume and volatility estimates (for
both the current and next period) are also the same as those used in the SPO
example.

Return forecasts. We use the same return forecast we generated for the
previous example, but at every time period we provide both the forecast for the
current time period and the one for the next:
r̂t|t = r̂t , r̂t+1|t = r̂t+1 ,
where r̂t and r̂t+1 are the same ones used in the SPO example, given in (22).
The MPO trading algorithm thus sees each return forecast twice, r̂t+1 = r̂t+1|t =
r̂t+1|t+1 , i.e., today’s forecast of tomorrow’s return is the same as tomorrow’s
forecast of tomorrow’s return.
As in the SPO case, this is clearly not a practical forecast, since it uses the
realized return. In addition, in a real setting the return forecast would be
updated at every time period, so that r̂t+1|t ̸= r̂t+1|t+1 . Our goal in choosing
these simulated return forecasts is to have ones that are similar to the ones used
in the SPO example, in order to compare the results of the two optimization
procedures.

Backtests. We carry out multiple backtest simulations varying the parameters γ^risk, γ^trade, and γ^hold. We first perform a coarse grid search in the hyperpa-
rameter space, with the same parameters as in the SPO example. We test all
combinations of
γ^risk = 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000,
γ^trade = 1, 2, 5, 10, 20,
γ^hold = 1,
a total of 45 backtest simulations.
The results are shown in figure 7, where we plot mean excess portfolio return
Re versus excess risk σ e . For some trading-aversion parameter values the results
were so bad that they did not fit in the plotting area.
We then perform a more accurate hyperparameter search using a finer range
for γ risk , focusing on the values around γ trade = 10, and also varying the hold-
aversion parameter. We test all combinations of
γ^risk = 1, 2, 3, 6, 10, 18, 32, 56, 100, 178, 316, 562, 1000,
γ^trade = 7, 8, 9, 10, 11, 12,
γ^hold = 0.1, 1, 10, 100, 1000,

Figure 7: MPO example, coarse hyperparameter grid search.

for a total of 390 backtest simulations. The results are plotted in figure 8 as
points in the risk–return plane. The Pareto optimal points are connected by a
line.
Finally we compare the results obtained with the SPO and MPO examples.
Figure 9 shows the Pareto optimal frontiers for both cases. We see that the MPO
method has a substantial advantage over the SPO method, mostly explained by
the advantage of a forecast for tomorrow’s, as well as today’s, return.

7.5 Simulation time


Here we give some rough idea of the computation time required to carry out the
simulation examples shown above, focusing on the SPO case. The backtest sim-
ulation is single-threaded, so multiple backtests can be carried out on separate
threads.
Figure 10 gives the breakdown of execution time for a backtest, showing the
time taken for each step of simulation, broken down into the simulator, the
numerical solver, and the rest of the policy (data management and CVXPY
manipulations). We can see that simulating one day takes around 0.25 seconds,
so a backtest over five years takes around five minutes. The bulk of this (around
0.15 seconds) is the optimization carried out each day. The simulator time is,
as expected, negligible.

Figure 8: MPO example, grid search over 390 hyperparameter combinations. The line
connects the Pareto optimal points.

Figure 9: Pareto optimal frontiers for SPO and MPO.



Figure 10: Execution time for each day for one SPO backtest.

We carried out the multiple backtests using a 32 core machine that can execute
64 threads simultaneously. Carrying out 510 backtests, which entails solving
around a half million convex optimization problems, thus takes around thirty
minutes. (In fact, it takes a bit longer, due to system overhead.)
We close by making a few comments about these optimization times. First, they
can be considerably reduced by avoiding the 3/2-power transaction-cost terms,
which slow the optimizer. By replacing these terms with square transaction-
cost terms, we can obtain a speedup of more than a factor of two. Replacing
the default generic solver ECOS (Domahidi et al. 2013) used in CVXPY with
a custom solver, such as one based on operator-splitting methods (Boyd et al.
2011), would result in an even more substantial speedup.

References
Almgren, R. “High frequency volatility.” Available at http://cims.nyu.edu/
~almgren/timeseries/notes7.pdf (2009).

Almgren, R. and N. Chriss. “Optimal execution of portfolio transactions.” Jour-


nal of Risk, vol. 3, no. 2 (2001), pp. 5–39.

Bacon, C. R. Practical Portfolio Performance Measurement and Attribution.


Wiley: West Sussex, 2nd ed. (2008).

Bailey, D. H., J. M. Borwein, M. L. de Prado, and Q. J. Zhu. “The probability of


backtest overfitting.” Journal of Computational Finance, vol. 20, no. 4 (2017),
pp. 39–69.
Bellman, R. E. “Dynamic programming and Lagrange multipliers.” Proceedings
of the National Academy of Sciences, vol. 42, no. 10 (1956), pp. 767–769.
Bemporad, A. “Model predictive control design: New trends and tools.” In
Proceedings of the 45th IEEE Conference on Decision and Control (2006), pp.
6678–6683.
Bemporad, A., L. Bellucci, and T. Gabbriellini. “Dynamic option hedging via
stochastic model predictive control based on scenario simulation.” Quantita-
tive Finance, vol. 14, no. 10 (2014), pp. 1739–1751.
Bershova, N. and D. Rakhlin. “The non-linear market impact of large trades:
evidence from buy-side order flow.” Quantitative Finance, vol. 13, no. 11
(2013), pp. 1759–1778.
Bertsekas, D. P. Dynamic Programming and Optimal Control. Athena Scientific:
Belmont (1995).
Black, F. “Studies of stock price volatility changes.” In Proceedings of the 1976
Meetings of the American Statistical Association, Business and Economics
Statistics Section (1976), pp. 177–181.
Boyd, S., M. T. Mueller, B. O’Donoghue, and Y. Wang. “Performance bounds
and suboptimal policies for multi-period investment.” Foundations and Trends
in Optimization, vol. 1, no. 1 (2014), pp. 1–72.
Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed optimiza-
tion and statistical learning via the alternating direction method of multipli-
ers.” Foundations and Trends in Machine Learning, vol. 3, no. 1 (2011), pp.
1–122.
Boyd, S. and L. Vandenberghe. Convex Optimization. Cambridge University
Press: New York (2004).
Busseti, E. and S. Boyd. “Volume weighted average price optimal execution.”
(2015).
Busseti, E., S. Diamond, S. Boyd, and BlackRock. “CVXPortfolio.” (2017).
Available at https://github.com/cvxgrp/cvxportfolio.
Busseti, E., E. K. Ryu, and S. Boyd. “Risk-constrained Kelly gambling.” Journal
of Investing, vol. 25, no. 3 (2016), pp. 118–134.
Campbell, J. Y., A. W. Lo, and A. C. MacKinlay. The Econometrics of Financial
Markets. Princeton University Press: Princeton (1997).

Campbell, J. Y. and L. M. Viceira. Strategic Asset Allocation: Portfolio Choice


for Long-Term Investors. Oxford University Press: New York (2002).

Chan, L. K. C., J. Karceski, and J. Lakonishok. “On portfolio optimization:


Forecasting covariances and choosing the risk model.” Review of Financial
Studies, vol. 12, no. 5 (1999), pp. 937–974.
Cho, E. G., K. A. Thoney, T. J. Hodgson, and R. E. King. “Rolling horizon
scheduling of multi-factory supply chains.” In Proceedings of the 2003 Winter
Simulation Conference (2003), pp. 1409–1416.
Chopra, V. K. and W. T. Ziemba. “The effect of errors in means, variances, and
covariances on optimal portfolio choice.” Journal of Portfolio Management,
vol. 19, no. 2 (1993), pp. 6–11.

Constantinides, G. M. “Multiperiod consumption and investment behavior with


convex transactions costs.” Management Science, vol. 25, no. 11 (1979), pp.
1127–1137.
Cornuejols, G. and R. Tütüncü. Optimization Methods in Finance. Cambridge
University Press: New York (2006).

Davis, M. H. A. and A. R. Norman. “Portfolio selection with transaction costs.”


Mathematics of Operations Research, vol. 15, no. 4 (1990), pp. 676–713.
DeMiguel, V., L. Garlappi, F. Nogales, and R. Uppal. “A generalized approach
to portfolio optimization: Improving performance by constraining portfolio
norms.” Management Science, vol. 55, no. 5 (2009a), pp. 798–812.

DeMiguel, V., L. Garlappi, and R. Uppal. “Optimal versus naive diversification:


How inefficient is the 1/N portfolio strategy?” Review of Financial Studies,
vol. 22, no. 5 (2009b), pp. 1915–1953.
Diamond, S. and S. Boyd. “CVXPY: A Python-embedded modeling language for
convex optimization.” Journal of Machine Learning Research, vol. 17, no. 83
(2016), pp. 1–5.
Diamond, S., R. Takapoui, and S. Boyd. “A general system for heuristic min-
imization of convex functions over non-convex sets.” Optimization Methods
and Software, vol. 33, no. 1 (2018), pp. 165–193.

Domahidi, A., E. Chu, and S. Boyd. “ECOS: An SOCP solver for embedded
systems.” In Proceedings of the 12th European Control Conference (2013), pp.
3071–3076.
Dumas, B. and E. Luciano. “An exact solution to a dynamic portfolio choice
problem under transactions costs.” Journal of Finance, vol. 46, no. 2 (1991),
pp. 577–595.

Elton, E. J., M. J. Gruber, and C. R. Blake. “Survivor bias and mutual fund
performance.” Review of Financial Studies, vol. 9, no. 4 (1996), pp. 1097–
1120.
Fabozzi, F. J., D. Huang, and G. Zhou. “Robust portfolios: contributions from
operations research and finance.” Annals of Operations Research, vol. 176,
no. 1 (2010), pp. 191–220.
Fastrich, B., S. Paterlini, and P. Winker. “Constructing optimal sparse portfolios
using regularization methods.” Computational Management Science, vol. 12,
no. 3 (2015), pp. 417–434.
Fougner, C. and S. Boyd. “Parameter selection and pre-conditioning for a graph
form solver.” In Emerging Applications of Control and System Theory, edited
by R. Tempo, S. Yurkovich, and P. Misra, chap. 4. Springer: Cham (2018),
pp. 41–61.
Frittelli, M. and E. R. Gianin. “Putting order in risk measures.” Journal of
Banking & Finance, vol. 26, no. 7 (2002), pp. 1473–1486.
Gârleanu, N. and L. H. Pedersen. “Dynamic trading with predictable returns and
transaction costs.” Journal of Finance, vol. 68, no. 6 (2013), pp. 2309–2340.

Goldsmith, D. “Transactions costs and the theory of portfolio selection.” Journal


of Finance, vol. 31, no. 4 (1976), pp. 1127–1139.
Gomes, C. and H. Waelbroeck. “Is market impact a measure of the informa-
tion value of trades? Market response to liquidity vs. informed metaorders.”
Quantitative Finance, vol. 15, no. 5 (2015), pp. 773–793.
Grant, M., S. Boyd, and Y. Ye. “Disciplined convex programming.” In Global Op-
timization: From Theory to Implementation, edited by L. Liberti and N. Mac-
ulan, vol. 84 of Nonconvex Optimization and Its Applications. Springer: New
York (2006), pp. 155–210.

Grinold, R. C. “A dynamic model of portfolio management.” Journal of Invest-


ment Management, vol. 4, no. 2 (2006), pp. 5–22.
Grinold, R. C. and R. N. Kahn. Active Portfolio Management: A Quantitative
Approach for Providing Superior Returns and Controlling Risk. McGraw–Hill:
New York, 2nd ed. (2000).

Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.


Springer: New York, 2nd ed. (2009).
Herzog, F., G. Dondi, and H. P. Geering. “Stochastic model predictive control
and portfolio optimization.” International Journal of Theoretical and Applied
Finance, vol. 10, no. 2 (2007), pp. 203–233.

Ho, M., Z. Sun, and J. Xin. “Weighted elastic net penalized mean–variance
portfolio design and computation.” SIAM Journal on Financial Mathematics,
vol. 6, no. 1 (2015), pp. 1220–1244.

Jagannathan, R. and T. Ma. “Risk reduction in large portfolios: Why imposing


the wrong constraints helps.” Journal of Finance, vol. 58, no. 4 (2003), pp.
1651–1683.

Jorion, P. “International portfolio diversification with estimation risk.” Journal


of Business, vol. 58, no. 3 (1985), pp. 259–278.

Kan, R. and G. Zhou. “Optimal portfolio choice with parameter uncertainty.”


Journal of Financial and Quantitative Analysis, vol. 42, no. 3 (2007), pp.
621–656.

Kelly, J. L., Jr. “A new interpretation of information rate.” IRE Transactions


on Information Theory, vol. 2, no. 3 (1956), pp. 185–189.

Kolm, P., R. Tütüncü, and F. Fabozzi. “60 years of portfolio optimization:


Practical challenges and current trends.” European Journal of Operational
Research, vol. 234, no. 2 (2014), pp. 356–371.

Kwon, W. H. and S. H. Han. Receding Horizon Control: Model Predictive Control


for State Models. Springer: London (2005).

Li, J. “Sparse and stable portfolio selection with parameter uncertainty.” Journal
of Business & Economic Statistics, vol. 33, no. 3 (2015b), pp. 381–392.

Lillo, F., J. D. Farmer, and R. N. Mantegna. “Master curve for price-impact


function.” Nature, vol. 421, no. 6919 (2003), p. 129.

Lobo, M. S., M. Fazel, and S. Boyd. “Portfolio optimization with linear and fixed
transaction costs.” Annals of Operations Research, vol. 152, no. 1 (2007), pp.
341–365.

Markowitz, H. “Portfolio selection.” Journal of Finance, vol. 7, no. 1 (1952), pp.


77–91.

Mattingley, J. and S. Boyd. “CVXGEN: A code generator for embedded convex


optimization.” Optimization and Engineering, vol. 13, no. 1 (2012), pp. 1–27.

Mattingley, J., Y. Wang, and S. Boyd. “Receding horizon control: Automatic


generation of high-speed solvers.” IEEE Control Systems Magazine, vol. 31,
no. 3 (2011), pp. 52–65.

McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy,
and IPython. O’Reilly Media: Sebastopol (2012).

Merton, R. C. “Lifetime portfolio selection under uncertainty: The continuous-


time case.” Review of Economics and Statistics, vol. 51, no. 3 (1969), pp.
247–257.

Merton, R. C. “Optimum consumption and portfolio rules in a continuous-time


model.” Journal of Economic Theory, vol. 3, no. 4 (1971), pp. 373–413.

Meucci, A. Risk and Asset Allocation. Springer: Berlin (2005).

Meucci, A. “Historical scenarios with fully flexible probabilities.” GARP Risk


Professional (2010), pp. 47–51.

Michaud, R. O. “The Markowitz optimization Enigma: Is ’optimized’ optimal?”


Financial Analysts Journal, vol. 45, no. 1 (1989), pp. 31–42.

Moallemi, C. C. and M. Sağlam. “Dynamic portfolio choice with linear rebal-


ancing rules.” Journal of Financial and Quantitative Analysis, vol. 52, no. 3
(2017), pp. 1247–1278.

Moro, E., J. Vicente, L. G. Moyano, A. Gerig, J. D. Farmer, G. Vaglica, F. Lillo,


and R. N. Mantegna. “Market impact and trading profile of hidden orders in
stock markets.” Physical Review E, vol. 80, no. 6 (2009), p. 066102.

Narang, R. K. Inside the Black Box: A Simple Guide to Quantitative and High
Frequency Trading. Wiley: Hoboken, 2nd ed. (2013).

Nesterov, Y. and A. Nemirovskii. Interior-point Polynomial Algorithms in Con-


vex Programming. SIAM: Philadelphia (1994).

Nystrup, P., H. Madsen, and E. Lindström. “Dynamic portfolio optimization


across hidden market regimes.” Quantitative Finance, vol. 18, no. 1 (2018b),
pp. 83–95.

Obizhaeva, A. A. and J. Wang. “Optimal trading strategy and supply/demand


dynamics.” Journal of Financial Markets, vol. 16, no. 1 (2013), pp. 1–32.

O’Donoghue, B., E. Chu, N. Parikh, and S. Boyd. “Conic optimization via opera-
tor splitting and homogeneous self-dual embedding.” Journal of Optimization
Theory and Applications, vol. 169, no. 3 (2016), pp. 1042–1068.

Perold, A. F. “Large-scale portfolio optimization.” Management Science, vol. 30,


no. 10 (1984), pp. 1143–1160.

Perold, A. F. “The implementation shortfall: Paper versus reality.” Journal of


Portfolio Management, vol. 14, no. 3 (1988), pp. 4–9.

Powell, W. B. Approximate Dynamic Programming: Solving the Curses of Di-


mensionality. Wiley: Hoboken (2007).

Quandl. “WIKI end-of-day data.” (2016). Available at https://www.quandl.


com/data/WIKI.
Samuelson, P. A. “Lifetime portfolio selection by dynamic stochastic program-
ming.” Review of Economics and Statistics, vol. 51, no. 3 (1969), pp. 239–246.
Sharpe, W. F. “Mutual fund performance.” Journal of Business, vol. 39, no. 1
(1966), pp. 119–138.
Sharpe, W. F. “The arithmetic of active management.” Financial Analysts
Journal, vol. 47, no. 1 (1991), pp. 7–9.
Sharpe, W. F. “The Sharpe ratio.” Journal of Portfolio Management, vol. 21,
no. 1 (1994), pp. 49–58.
Tibshirani, R. “Regression shrinkage and selection via the lasso.” Journal of the
Royal Statistical Society. Series B (Methodological), vol. 58, no. 1 (1996), pp.
267–288.
Udell, M., K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. “Convex
optimization in Julia.” In First Workshop for High Performance Technical
Computing in Dynamic Languages (2014), pp. 18–28.
PAPER I
To appear in Annals of Operations Research

Multi-period portfolio selection with drawdown control

Peter Nystrup, Stephen Boyd, Erik Lindström, and Henrik Madsen

Abstract

In this article, model predictive control is used to dynamically optimize an investment portfolio and control drawdowns. The control is based on
multi-period forecasts of the mean and covariance of financial returns from a
multivariate hidden Markov model with time-varying parameters. There are
computational advantages to using model predictive control when estimates
of future returns are updated every time new observations become avail-
able, because the optimal control actions are reconsidered anyway. Transac-
tion and holding costs are discussed as a means to address estimation error
and regularize the optimization problem. The proposed approach to multi-
period portfolio selection is tested out of sample over two decades based on
available market indices chosen to mimic the major liquid asset classes typ-
ically considered by institutional investors. By adjusting the risk aversion
based on realized drawdown, it successfully controls drawdowns with little
or no sacrifice of mean–variance efficiency. Using leverage it is possible to
further increase the return without increasing the maximum drawdown.

Keywords: Risk management; Maximum drawdown; Dynamic asset allocation; Model predictive control; Regime switching; Forecasting.

1 Introduction
Financial risk management is about spending a risk budget in the most efficient
way. Generally speaking, two different approaches exist. The first approach
consists of diversification, that is, reducing risk through optimal asset allocation
on the basis of imperfectly correlated assets. The second approach consists of
hedging, that is, reducing risk by giving up the potential for gain or by paying
a premium to retain some potential for gain. The latter is also referred to as
insurance, which is hedging only when needed.
The 2008 financial crisis clearly showed that diversification is not sufficient to
avoid large drawdowns (Nystrup et al. 2017a). Diversification fails, when needed
the most, because correlations between risky assets tend to strengthen during
times of crisis (see, e.g., Pedersen 2009, Ibragimov et al. 2011). Large draw-
downs challenge investors’ financial and psychological tolerance and lead to fund
redemption and firing of portfolio managers. Thus, a reasonably low maximum drawdown (MDD) is critical to the success of any portfolio. As pointed out by
Zhou and Zhu (2010), drawdowns of similar magnitude to the 2008 financial
crisis are more likely than a “once-in-a-century” event. Yet, if focusing on tail
events when constructing a portfolio, the portfolio will tend to underperform
over time (Lim et al. 2011, Ilmanen 2012, Downing et al. 2015).
As argued by Goltz et al. (2008), portfolio insurance can be regarded as the
most general form of dynamic—as opposed to static—asset allocation. It is
known from Merton’s (1973) replicating-argument interpretation of the Black
and Scholes (1973) formula that nonlinear payoffs based on an underlying asset
can be replicated by dynamic trading in the underlying asset and a risk-free asset.
As a result, investors willing and able to engage in dynamic asset allocation
(DAA) can generate the most basic form of risk management possible, which
encompasses both static diversification and dynamic hedging (Goltz et al. 2008).
Although DAA is a multi-period problem, it is often approximated by a se-
quence of myopic, single-period optimizations, thus making it impossible to
properly account for the consequences of trading, constraints, time-varying fore-
casts, etc. Following Mossin (1968), Samuelson (1969), and Merton (1969), the
literature on multi-period portfolio selection is predominantly based on dynamic
programming, which properly takes into account the idea of recourse and up-
dated information available as a sequence of trades is chosen (see Gârleanu and
Pedersen 2013, Cui et al. 2014, and references therein). Unfortunately, actually
carrying out dynamic programming for trade selection is impractical, except for
some very special or small cases, due to the “curse of dimensionality” (Bellman
1956, Boyd et al. 2014). As a consequence, most studies include only a limited
number of assets and simple objectives and constraints (Mei et al. 2016).
The opportunity to select portfolio constituents from a large universe of assets
corresponds with a large potential to diversify risk. Exploiting such potential
can be difficult, however, as the presence of error increases when the number
of assets increases relative to the number of observations, often resulting in
worse out-of-sample performance (see, e.g., Brodie et al. 2009, Fastrich et al.
2015). Transaction and holding costs not only have great practical importance
but are also a means to address estimation error and regularize the optimization
problem.
Multi-period investment problems taking into account the stochastic nature of
financial markets are usually solved in practice by scenario approximations of
stochastic programming models, which is computationally challenging (see, e.g.,
Dantzig and Infanger 1993, Mulvey and Shetty 2004, Gülpınar and Rustem 2007,
Pınar 2007, Zenios 2007). Herzog et al. (2007) proposed the benefit of model
predictive control (MPC) for multi-period portfolio selection (see also Meindl
and Primbs 2008, Bemporad et al. 2014, Boyd et al. 2014). The idea is to
control a portfolio based on forecasts of asset returns and relevant parameters.
It is an intuitive approach with potential in practical applications, because it is computationally fast. This makes it feasible to consider large numbers of assets
and impose important constraints and costs (see Boyd et al. 2017).
This article implements a specific case of the methods of Boyd et al. (2017), with
an additional mode that controls for drawdown by adjusting the risk aversion
based on realized drawdown. The proposed approach to drawdown control is
a practical solution to an important investment problem and demonstrates the
theoretical link to DAA. A second contribution is the empirical implementation
based on available market indices chosen to mimic the major liquid asset classes
typically considered by institutional investors. The testing shows that the MPC
approach works well in practice and indeed makes it computationally feasible
to solve realistic multi-period portfolio optimization problems and search over
hyperparameters in backtests. When combined with drawdown control and use
of leverage, it is possible to increase returns substantially without increasing the
MDD.
The implementation is based on forecasts from a multivariate hidden Markov
model (HMM) with time-varying parameters, which is a third contribution. The
combination of an adaptive forecasting method and MPC is a flexible framework
for incorporating new information into a portfolio, as it becomes available. Com-
pared to Nystrup et al. (2018b), it is an extension from a single- to a multi-asset
universe, which requires a different estimation approach. The HMM could be
replaced by another return-prediction model, as model estimation and forecast-
ing are treated separately from portfolio selection. Obviously, the better the
forecasts, the more value can be added. The choice of an HMM is motivated
by numerous studies showing that DAA based on regime-switching models can
add value over rebalancing to static weights and, in particular, reduce poten-
tial drawdowns (Ang and Bekaert 2004, Guidolin and Timmermann 2007, Bulla
et al. 2011, Kritzman et al. 2012, Bae et al. 2014, Nystrup et al. 2015a, 2017a,
2018b).
The article is structured as follows: section 2 outlines the MPC approach to
multi-period portfolio selection with drawdown control. Section 3 describes
the HMM, its estimation, and use for forecasting. The empirical results are
presented in section 4. Finally, section 5 concludes.

2 Multi-period portfolio selection


Multi-period portfolio selection has been a well-established research field since the work
of Mossin (1968), Samuelson (1969), and Merton (1969). Since then, it has been well
understood that short-term portfolio optimization can be very different from
long-term portfolio optimization. For sufficiently long horizons, however, it is
not possible to make better predictions than the long-term average. Hence, it
is really about choosing a sequence of trades to carry out over the next days

and weeks (Gârleanu and Pedersen 2013, Boyd et al. 2017). Looking only a
limited number of steps into the future is not just an approximation necessary to
make the optimization problem computationally feasible; it also seems perfectly
reasonable.
Recent work has shown the importance of the frequency of the input estimates
to the portfolio optimization being consistent with the time-horizon that per-
formance is evaluated over (Kinlaw et al. 2014, 2015, Chaudhuri and Lo 2016).
Even for long-term investors, though, performance is evaluated continually. The
problem is that risk premiums and covariances do not remain invariant over long
periods. In a single-period setting, the only way of taking this time variation
into account is by blending short- and long-term estimates or the resulting allo-
cations together, which is not optimal. In a multi-period framework, differences
in short- and long-term forecasts as well as trading and holding costs can be
properly modeled. Multi-period optimization, naturally, leads to a dynamic
strategy.

2.1 Stochastic control formulation


The formulation of the multi-period portfolio selection problem as a stochastic
control problem is based on Boyd et al. (2017). Every day a decision has to
be made whether or not to change the current portfolio, knowing that the deci-
sion will be reconsidered the next day with new input. Possible benefits from
changing allocation should be traded off against risks and costs.
Let wt ∈ Rn+1 denote the portfolio weights at time t, where (wt )i is the fraction
of the total portfolio value Vt invested in asset i, with (wt )i < 0 meaning a short
position in asset i. It is assumed that the portfolio value is positive. The weight
(wt )n+1 is the fraction of the total portfolio value held in cash, i.e., the risk-free
asset. By definition, the weights sum to one, 1T wt = 1, where 1 is a column
vector with all entries one, and are unitless.
A natural objective is to maximize the present value of future, risk-adjusted
expected returns less transaction and holding costs over the investment horizon
T,
\[
\mathrm{E}\left[ \sum_{t=0}^{T-1} \left( \eta^{t+1} \left( r_{t+1}^{T} w_{t+1} - \gamma_{t+1} \psi_{t+1}(w_{t+1}) \right) - \eta^{t} \left( \phi^{\mathrm{trade}}_{t}(w_{t+1} - w_{t}) + \phi^{\mathrm{hold}}_{t}(w_{t+1}) \right) \right) \right], \tag{1}
\]
where the expectation is over the sequence of returns r_1, . . . , r_T ∈ R^{n+1} conditional on all past observations, ψ_t : R^{n+1} → R is a risk function (described in section 2.3), γ_t is a risk-aversion parameter used to scale the relative importance of risk and return, φ_t^trade : R^{n+1} → R is a transaction-cost function (described in section 2.5), φ_t^hold : R^{n+1} → R is a holding-cost function (described in section 2.5), and η ∈ (0, 1) is a discount factor (typically equal to the inverse of one plus the risk-free rate).

2.2 Model predictive control


MPC is based on the simple idea that in order to determine the trades to
make, all future (unknown) quantities are replaced by their forecasted values
over a planning horizon H. For example, the future returns are replaced by their
forecasted mean values µ̂τ |t , τ = t + 1, . . . , t + H, where µ̂τ |t is the forecast made
at time t of the return at time τ . This turns the stochastic control problem into
a deterministic optimization problem:
∑t+H ( T
τ =t+1 µ̂τ |t wτ − ϕ̂τ |t (wτ − wτ −1 )
trade
maximize
)
− ϕ̂hold (wτ ) − γτ ψ̂τ |t (wτ ) (2)
τ |t
subject to 1T wτ = 1, τ = t + 1, . . . , t + H,

with variables wt+1 , . . . , wt+H (see Boyd et al. 2017, for a detailed derivation).
Note that wt is not a variable, but the known, current portfolio weights. In
formulation (2), ϕ̂trade and ϕ̂hold can be estimates of actual transaction- and
holding-cost functions or arbitrary functions found to give good performance in
backtest (see section 2.5).

Suboptimal control
Solving the optimization problem (2) yields an optimal sequence of weights
w⋆_{t+1}, . . . , w⋆_{t+H}. The differences of this sequence constitute a plan for future trades over
the planning horizon H under the highly unrealistic assumption that all future
(unknown) quantities will be equal to their forecasted values. Only the first
trade w⋆_{t+1} − w_t in the planned sequence of trades is executed. At the next step,
the process is repeated, starting from the new portfolio w_{t+1}. The planning
horizon H can typically be much shorter than the investment horizon T, without
it affecting the solution. This is why discounting is ignored in formulation (2)
compared to (1).
In the case of a mean–variance objective function, Herzog et al. (2007) showed
that future asset allocation decisions do not depend on the trajectory of the
portfolio, but solely on the current tradeoff between satisfying the constraints
and maximizing the objective. MPC for stochastic systems is a suboptimal
control strategy; however, it uses new information advantageously and is better
than pure open-loop control. The open-loop policy would be to execute the
entire sequence of trades based on the initial portfolio without recourse.
While the MPC approach can be criticized for only approximating the full dynamic
programming trading policy, the performance loss is likely very small in
practical problems. Boyd et al. (2014) developed a numerical bounding method
that quantifies the loss of optimality when using simplified approaches, such as
MPC, and found it to be very small in numerical examples. In fact, the dynamic
programming formulation is itself an approximation, based on assumptions, such as
independent and identically distributed returns, that need not hold well
in practice, so the idea of an “optimal strategy” itself should be regarded with
some suspicion (Boyd et al. 2017).

Algorithm 1: MPC approach to multi-period portfolio selection.
1. Update model parameters based on the most recent observation
2. Forecast future values of all unknown quantities H steps into the future
3. Compute the optimal sequence of weights w⋆_{t+1}, . . . , w⋆_{t+H} based on the current portfolio w_t
4. Execute the first trade w⋆_{t+1} − w_t and return to step 1

Computation

Algorithm 1 summarizes the four steps in the MPC approach to multi-period


portfolio selection. There are computational advantages to using MPC in cases
when estimates of future return statistics are updated every time a new ob-
servation becomes available, since the optimal control actions are reconsidered
anyway.

Formulation (2) is a convex optimization problem, provided the risk function


and the transaction and holding costs and constraints are convex (Boyd and
Vandenberghe 2004). Computing the optimal sequence of trades for H = 15
with n = 10 assets by solving the optimization problem (2) with the risk function
and transaction and holding costs and constraints described in section 2.3 and
section 2.5, respectively, takes less than 0.02 seconds using CVXPY (Diamond
and Boyd 2016) with the open-source solver ECOS (Domahidi et al. 2013) on a
standard Windows laptop.
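To make the formulation concrete, the following sketch sets up one simplified, long-only instance of problem (2) in CVXPY with the quadratic risk term (3), an ℓ1 trading penalty, and a squared ℓ2 holding penalty. The function name is hypothetical, the default parameter values follow the in-sample choices reported in section 4.2, and the forecasted covariance matrices are assumed to be positive semidefinite; the actual implementation includes further costs and constraints.

```python
import cvxpy as cp
import numpy as np

def plan_trades(w_now, mu_hat, Sigma_hat, gamma=5.0, kappa1=0.004,
                rho2=0.0005, w_max=0.4, H=15):
    """Solve a simplified instance of problem (2); return the first planned trade.

    w_now:     current weights, length n+1 (last entry is cash)
    mu_hat:    forecasted returns, shape (H, n+1)
    Sigma_hat: forecasted covariances, shape (H, n+1, n+1)
    """
    n1 = len(w_now)
    W = cp.Variable((H, n1))               # planned weights w_{t+1}, ..., w_{t+H}
    objective, constraints, w_prev = 0, [], w_now
    for tau in range(H):
        w = W[tau]
        objective += mu_hat[tau] @ w                          # expected return
        objective -= gamma * cp.quad_form(w, Sigma_hat[tau])  # risk term, cf. (3)
        objective -= kappa1 * cp.norm1(w[:-1] - w_prev[:-1])  # trading penalty, cf. (7)
        objective -= rho2 * cp.sum_squares(w[:-1])            # holding penalty, cf. (9)
        constraints += [cp.sum(w) == 1, w >= 0, w[:-1] <= w_max]  # long only, max holding
        w_prev = w
    cp.Problem(cp.Maximize(objective), constraints).solve()
    return W.value[0] - w_now              # only the first trade is executed
```

In a backtest, a function of this kind would be called once per trading day with updated forecasts, and only the returned first trade would be executed, as in Algorithm 1.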

Using a custom solver, or a code generator such as CVXGEN (Mattingley and


Boyd 2012), would result in an even faster solution time. These solvers are more
than fast enough to run in real-time. The practical advantage of the high speed
is the ability to carry out a large number of backtests quickly. For example
at 0.02 seconds per solve, each year of a backtest with daily trading can be
carried out in around five seconds. In one hour, a 32-core machine can carry
out five-year backtests with 4,000 different combinations of hyperparameters.

2.3 Risk-averse control


The traditional risk-adjustment charge is proportional to the variance of the
portfolio return given the portfolio weights, which corresponds to
\[ \psi_t(w_t) = w_t^{T} \Sigma_t w_t. \tag{3} \]
Note that Σt is an estimate of the return covariance under the assumption that
the returns are stochastic. It can be interpreted as a cost term that discourages
holding portfolios with high variance.
Objective function (1) with risk function (3) corresponds to mean–variance pref-
erences over the changes in portfolio value in each time period (net of the risk-
free return). If the returns are independent random variables, then the objec-
tive is equivalent to the mean–variance criterion of Markowitz (1952).1 It is a
special case of expected utility maximization with a quadratic utility function.
While the utility approach was theoretically justified by von Neumann and Mor-
genstern (1953), in practice few, if any, investors know their utility functions;
nor do the functions which financial engineers and economists find analytically
convenient necessarily represent a particular investor’s attitude toward risk and
return (Dai et al. 2010a, Markowitz 2014). The mean–variance criterion remains
the most commonly used in portfolio selection (Kolm et al. 2014).
There is keen interest in other risk measures beyond the quadratic risk (3), for
many good reasons (see, e.g., Zenios 2007, Scutellà and Recchia 2013). Many of
these are convex and thus would work in this framework. A popular alternative
is expected shortfall, also known as conditional value-at-risk, defined as the
expected loss in the worst q% of cases. It is a coherent measure of risk and a
convex function of the portfolio weights (Artzner et al. 1999, Rockafellar and
Uryasev 2000, Bertsimas et al. 2004). Unlike the quadratic measure (3), it
only penalizes down-side risk.2 In practice, portfolios constructed to minimize
expected shortfall often realize a higher shortfall out of sample than minimum-
variance portfolios because of forecast uncertainty (Lim et al. 2011, Stoyanov
et al. 2012, Downing et al. 2015). The lower the quantile level q, the larger
the uncertainty. For investors concerned with tail risk, drawdown control is an
appealing alternative since it, unlike expected-shortfall optimization, prevents
a portfolio from losing more than a given limit.

1 When H = 1, the multi-period problem (2) with the risk function (3) reduces to the single-period mean–variance problem studied by Markowitz (1952).
2 If the underlying return distribution is Gaussian with known parameters, then the portfolio that minimizes expected shortfall for a given expected return is equivalent to the portfolio that minimizes variance with the same expected return (Rockafellar and Uryasev 2000).

2.4 Drawdown control

A portfolio is often subject to a maximum drawdown constraint, meaning that,
at each point in time, it cannot lose more than a fixed percentage of the maximum
value it has achieved up to that time. Recall that Vt denotes the portfolio
value at time t. If the maximum value achieved in the past—sometimes referred
to as a high-water mark—is
\[ M_t = \max_{\tau \le t} V_\tau, \tag{4} \]

then the drawdown at time t is defined as

\[ D_t = 1 - \frac{V_t}{M_t}. \tag{5} \]
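As a minimal illustration, the high-water mark (4) and the drawdown (5) can be computed from a series of portfolio values as follows; the values are made up for the example.

```python
import numpy as np

values = np.array([100.0, 104.0, 101.0, 97.0, 103.0])  # example portfolio values V_t
high_water = np.maximum.accumulate(values)              # M_t in (4)
drawdown = 1 - values / high_water                      # D_t in (5)
```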

Controlling drawdown through DAA may appear similar to the constant-pro-


portion portfolio insurance (CPPI) policy introduced by Black and Jones (1987),
Black and Perold (1992). However, they considered the problem of portfolio
selection under the constraint that the portfolio value never falls below a fixed
floor, rather than a fixed fraction of its maximum-to-date. The CPPI procedure
dynamically allocates total assets to a risky asset in proportion to a multiple
of the difference between the portfolio value and the desired protective floor.
This produces an effect similar to owning a put option (under the assumption
that it is possible to trade continuously when asset prices fall), which is the idea
behind option-based portfolio insurance (OBPI), proposed by Leland (1980),
Rubinstein and Leland (1981).
Grossman and Zhou (1993) were first to study portfolio selection under the con-
straint that the portfolio value never falls below a fixed fraction of its maximum-
to-date. They extended the CPPI policy of Black and Jones (1987), Black and
Perold (1992) to a stochastic floor in a frictionless financial market comprised
of a risky asset with random-walk return dynamics and a risk-free asset with
constant return. They showed that, for constant relative risk aversion utility
functions, the optimal allocation to risky assets at time t is in proportion to the
cushion Dmax − Dt , where Dmax ∈ (0, 1) is the maximum acceptable drawdown.
This is implemented by adjusting the risk-aversion parameter in response to
changes to the cushion.
Let γ0 be the risk aversion when the drawdown Dt = 0, i.e., when Vt = Mt . This
is the initial risk aversion, since V0 = M0 , and it is the minimum risk aversion
at any later point in time, because the drawdown can never be negative. When
Dt = Dmax , then the allocation to risky assets should be zero, meaning that the
risk aversion should be infinite. This leads to
\[ \gamma_t = \gamma_0\, \frac{D_{\max}}{D_{\max} - D_t}. \tag{6} \]

In practice, the cushion in the denominator is replaced by max (Dmax − Dt , ϵ),


where ϵ is some small number, to avoid division by zero or negative numbers in
case the drawdown limit is breached. Moreover, γτ is only adjusted based on the
realized drawdown, which means keeping γτ = γt for τ = t + 1, . . . , t + H when
solving (2). Note that it is straightforward to implement another relationship
between γt and γ0 than (6).
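A minimal sketch of the resulting risk-aversion adjustment, including the ϵ floor on the cushion, could look as follows; the function name and default values are illustrative.

```python
def adjusted_risk_aversion(gamma0, drawdown, d_max=0.1, eps=1e-6):
    """Scale the risk aversion with the realized drawdown, cf. (6)."""
    return gamma0 * d_max / max(d_max - drawdown, eps)

# e.g., gamma0 = 5 and a realized drawdown of 0.05 double the risk aversion to 10
```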

Drawdown control is a reactive mechanism that seeks to limit losses as they


evolve (Pedersen 2015). It will, by construction, increase risk aversion in the
domain of losses, implying a path-dependent utility function (see, e.g., Dohi and
Osaki 1993). If the drawdown gets too close to the limit, it can be impossible to
escape it (depending on the risk-free rate). The lower the drawdown limit Dmax
and initial risk-aversion parameter γ0 , the larger the risk of getting trapped at
the limit. In practice, a portfolio manager that gets trapped at a drawdown limit
will need to contact the client or the board to get a new limit—or a dismissal.

2.5 Forecast-error risk


Data-driven portfolio optimization involves estimated statistics that are subject
to estimation errors (Merton 1980). Practitioners tend to trust history for input
estimation, because it is objective, interpretable, and available, but the nonsta-
tionary nature of financial returns limits the number of relevant observations
obtainable. As a result, the benefits of diversification often are more than offset
by estimation errors (Jorion 1985, Michaud 1989, Black and Litterman 1992,
Broadie 1993, Chopra and Ziemba 1993, Garlappi et al. 2006, Kan and Zhou
2007, Ardia et al. 2017). Including transaction and holding costs and constrain-
ing portfolio weights are ways to regularize the optimization problem and reduce
the risk due to estimation errors.

Transaction costs
Transaction costs are important when comparing the performance of dynamic
and static strategies, as frequent trading can offset a dynamic strategy’s poten-
tial excess return. In order to regularize the optimization problem and reduce
the risk of trading too much, a penalty for trading,
\[ \phi^{\mathrm{trade}}_t(w_t - w_{t-1}) = \kappa_1^{T}\,|w_t - w_{t-1}| + \kappa_2^{T}\,(w_t - w_{t-1})^2, \tag{7} \]

should be included in the objective function, where κ1 and κ2 are vectors of


penalty factors and the absolute and squared value are elementwise. This could
reflect actual transaction costs or a conservatism toward trading, for example,
due to the uncertainty related to the parameter estimates and forecasts.
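For reference, a direct evaluation of the penalty (7) for a given trade could be written as follows (a sketch; the function name is illustrative).

```python
import numpy as np

def trading_penalty(w_new, w_old, kappa1, kappa2):
    """Weighted elastic-net trading penalty, cf. (7); all inputs are length-(n+1) arrays."""
    dw = w_new - w_old
    return kappa1 @ np.abs(dw) + kappa2 @ dw**2
```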

The weighted elastic-net penalty (7) is a convex combination of ℓ1 - and squared


ℓ2 -norm penalties. It reduces the number of trades like the ℓ1 penalty and the
size of trades like the squared ℓ2 penalty. The ℓ1 penalty is similar to the stan-
dard proportional transaction cost and is a convex relaxation of constraining the
number of trades. The squared ℓ2 penalty is used to model price impact (Alm-

gren and Chriss 2001, Boyd et al. 2017); it shrinks together trades in correlated
assets and splits trades over multiple days.3
Many alternative formulations are possible. Popular models of transaction costs
3/2
include |wt − wt−1 | , which is another convex function, possibly scaled by the
asset standard deviations and volumes (Grinold and Kahn 2000, Boyd et al.
2017). Grinold (2006) and Gârleanu and Pedersen (2013) argued for a cost
T
of the type (wt − wt−1 ) Σt (wt − wt−1 )—closely related to the risk-adjustment
charge (3)—which captures the increased cost of trading when volatility rises.

Holding costs
Holding the portfolio wt over the t’th period can incur a holding-based cost.
A basic holding-cost model includes a charge for borrowing assets when going
short, which has the form

\[ \phi^{\mathrm{hold}}_t(w_t) = s_t^{T}\,(w_t)_-, \tag{8} \]

where (st )i ≥ 0 is the borrowing fee for shorting asset i in period t, and (w)− =
max {−w, 0} denotes the negative part of (the elements of) w. This is a fee
for shorting the assets over one investment period. A cash borrow cost can
easily be included if needed, in which case (st )n+1 > 0. This is the premium
for borrowing, and not the interest rate. When short positions are implemented
using futures, the holding cost is (at least) equal to the risk-free rate.
Another option is to include a holding cost similar to

\[ \phi^{\mathrm{hold}}_t(w_t) = \rho_1^{T}\,|w_t| + \rho_2^{T}\,w_t^2, \tag{9} \]

where ρ1 and ρ2 are vectors of penalty factors and the absolute and squared value
are elementwise. For sufficiently large holding costs (8) and (9), the portfolio will
be long only, because the weights always sum to one (see (2)). Hence, including
holding costs is a means of controlling portfolio leverage.
The weighted elastic-net penalty (9) can be justified by reformulating the mean–
variance criterion as a robust optimization problem (Ho et al. 2015, Boyd et al.
2017). It reduces the number of holdings like the ℓ1 penalty and the size of
holdings like the squared ℓ2 penalty. The ℓ1 penalty is a convex relaxation of
constraining the number of holdings. It can be regarded as a shrinkage estimator
of the expected return (Stein 1956, Fabozzi et al. 2010). The squared ℓ2 penalty
shrinks together holdings in correlated assets; it corresponds to adding a diag-
onal matrix to the forecasted covariance matrix in (3), similar to a Stein-type
shrinkage estimator (Ledoit and Wolf 2004).
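A corresponding sketch of the holding costs combines the shorting fee (8) with the optional elastic-net term (9); the function name and the optional arguments are illustrative.

```python
import numpy as np

def holding_penalty(w, s=None, rho1=None, rho2=None):
    """Holding costs: shorting fee (8) plus an optional elastic-net term (9)."""
    cost = 0.0
    if s is not None:
        cost += s @ np.maximum(-w, 0)   # borrowing fee on short positions, cf. (8)
    if rho1 is not None:
        cost += rho1 @ np.abs(w)        # l1 term of (9)
    if rho2 is not None:
        cost += rho2 @ w**2             # squared l2 term of (9)
    return cost
```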

3 Price impact is the price movement against the trader that tends to occur when a large order

is executed.

Constraints
Another way to improve the out-of-sample performance is to impose constraints
on the portfolio weights, which is equivalent to shrinking the covariance matrix
(Jagannathan and Ma 2003, Ledoit and Wolf 2003, 2004, DeMiguel et al. 2009a,
Li 2015b). Different constraints correspond to different prior beliefs about the
asset weights. The portfolio may be subject to constraints on the asset weights,
such as minimum and maximum allowed positions for each asset:
− wmin ≤ wt ≤ wmax , (10)
where the inequalities are elementwise and wmin and wmax are nonnegative vec-
tors of the maximum short and long allowed fractions, respectively. A long-only
portfolio corresponds to wmin = 0.
Portfolio leverage can be limited with a constraint
∥(wt )1:n ∥1 ≤ Lmax , (11)
which requires the leverage to not exceed Lmax . Refer to Boyd et al. (2017) for
examples of many other convex holding and trading costs and constraints that
arise in practical investment problems and can easily be included.
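In CVXPY, the position bounds (10) and the leverage limit (11) can be expressed directly as constraints; the bounds below are example values only.

```python
import cvxpy as cp

n = 10                               # number of risky assets; the last entry is cash
w = cp.Variable(n + 1)
w_min, w_max, L_max = 0.4, 0.4, 2.0  # example bounds
constraints = [cp.sum(w) == 1,
               w[:n] >= -w_min,           # maximum short position per asset, cf. (10)
               w[:n] <= w_max,            # maximum long position per asset, cf. (10)
               cp.norm1(w[:n]) <= L_max]  # leverage limit, cf. (11)
```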

3 Data model
The volatility of asset prices forms clusters, as large price movements tend to
be followed by large price movements and vice versa, as noted by Mandelbrot
(1963).4 The choice of a regime-switching model aims to exploit this persistence
of the volatility, since risk-adjusted returns, on average, are substantially lower
during turbulent periods, irrespective of the source of turbulence (Fleming et al.
2001, Kritzman and Li 2010, Moreira and Muir 2017).
Clustering asset returns into time periods with similar behavior is different from
other types of clustering, such as k-means, due to the time dependence (Dias
et al. 2015). In machine learning, the task of inferring a function to describe
a hidden structure from unlabeled data is called unsupervised learning. The
data is unlabeled, because the regimes are unobservable. When the transition
between different regimes is controlled by a Markov chain, the regime-switching
model is called a hidden Markov model.
The HMM is a popular choice for inferring the hidden state of financial mar-
kets, because it is well suited to capture the stylized behavior of many financial
time series including volatility clustering and leptokurtosis, as shown by Ry-
dén et al. (1998). In addition, it can match the tendency of financial markets
to change their behavior abruptly and the phenomenon that the new behavior
often persists for several periods after a change (Ang and Timmermann 2012).

4 A quantitative manifestation of this fact is that while returns themselves are uncorrelated, absolute and squared returns display a positive, significant, and slowly decaying autocorrelation function.

3.1 The hidden Markov model


In an HMM, the probability distribution that generates an observation depends
on the state of an unobserved Markov chain. A sequence of discrete random
variables {st : t ∈ N} is said to be a first-order Markov chain if, for all t ∈ N, it
satisfies the Markov property:
Pr ( st+1 | s1 , . . . , st ) = Pr ( st+1 | st ) .
The conditional probabilities Pr ( st+1 = j| st = i) = γij are called transition
probabilities. A Markov chain with transition probability matrix Γ = {γij } has
stationary distribution π, if π T Γ = π T and 1T π = 1.
Future (excess) returns and covariances are forecasted using a model with mul-
tivariate Gaussian conditional distributions:
ot | st ∼ N (µst , Σst ) .
When the current state st is known, the distribution of the observation ot de-
pends only on st and not on previous states or observations. The sojourn times
are implicitly assumed to be geometrically distributed, implying that the time
until the next transition out of the current state is independent of the time spent
in the state.

3.2 Estimation
Using the online version of the expectation–maximization algorithm proposed
by Stenger et al. (2001), estimates of the model parameters are updated after
each sample value.5 The idea is that forward variables αt are updated in every
step. These variables give the probability of observing o1 , . . . , ot and being in
state i ∈ S at time t:
(αt )i = Pr (st = i, o1 , . . . , ot ) , i ∈ S.
In the first step, the forward variables are set to
(α1 )i = (δ)i Pr ( o1 | s1 = i) , i ∈ S,
where δ is the initial state distribution, i.e., (δ)i = Pr (s1 = i).
With every observation, the α values are updated by summing the probabilities
over all possible paths which end in the new state j ∈ S:
\[ (\alpha_t)_j = \Bigl[ \sum_{i \in S} (\alpha_{t-1})_i\, \gamma_{ij} \Bigr] \Pr(o_t \mid s_t = j), \qquad j \in S. \]

5 See also the survey by Khreich et al. (2012).



The filtering probability of being in a particular state i ∈ S at time t, given the


observations, is
\[ (\xi_t)_i = \Pr(s_t = i \mid o_1, \ldots, o_t) = \frac{\Pr(s_t = i, o_1, \ldots, o_t)}{\Pr(o_1, \ldots, o_t)} = \frac{(\alpha_t)_i}{1^{T} \alpha_t}. \]

The probability of a certain state transition i to j, given the observations, is

(ζt )ij = Pr ( st−1 = i, st = j| o1 , . . . , ot )


Pr (st−1 = i, o1 , . . . , ot−1 ) Pr ( st = j| st−1 = i) Pr ( ot | st = j)
=
Pr (o1 , . . . , ot )
(αt−1 )i γij Pr ( ot | st = j)
= .
1T αt

These formulas provide the re-estimation scheme. At every time step t, the
probabilities ξt and ζt are computed and used to update the model parameters
(∀i, j ∈ S):
\[
\hat{\gamma}_{ij}^{t} = \frac{\sum_{\tau=2}^{t} \Pr(s_{\tau-1} = i, s_\tau = j \mid o_1, \ldots, o_\tau)}{\sum_{\tau=2}^{t} (\xi_\tau)_i}
= \frac{\sum_{\tau=2}^{t-1} (\xi_\tau)_i}{\sum_{\tau=2}^{t} (\xi_\tau)_i}\, \hat{\gamma}_{ij}^{t-1} + \frac{(\zeta_t)_{ij}}{\sum_{\tau=2}^{t} (\xi_\tau)_i} \tag{12}
\]
\[
\hat{\mu}_i^{t} = \frac{\sum_{\tau=1}^{t} (\xi_\tau)_i\, o_\tau}{\sum_{\tau=1}^{t} (\xi_\tau)_i}
= \frac{\sum_{\tau=1}^{t-1} (\xi_\tau)_i}{\sum_{\tau=1}^{t} (\xi_\tau)_i}\, \hat{\mu}_i^{t-1} + \frac{(\xi_t)_i\, o_t}{\sum_{\tau=1}^{t} (\xi_\tau)_i} \tag{13}
\]
\[
\hat{\Sigma}_i^{t} = \frac{\sum_{\tau=1}^{t} (\xi_\tau)_i\, (o_\tau - \hat{\mu}_i^{t})(o_\tau - \hat{\mu}_i^{t})^{T}}{\sum_{\tau=1}^{t} (\xi_\tau)_i}
= \frac{\sum_{\tau=1}^{t-1} (\xi_\tau)_i}{\sum_{\tau=1}^{t} (\xi_\tau)_i}\, \hat{\Sigma}_i^{t-1} + \frac{(\xi_t)_i\, (o_t - \hat{\mu}_i^{t})(o_t - \hat{\mu}_i^{t})^{T}}{\sum_{\tau=1}^{t} (\xi_\tau)_i}. \tag{14}
\]

The sums in these equations are computed by storing the values and adding
the new terms at each time step. This can be seen as continually updating the
sufficient statistics, which are used to compute the new parameters.

Exponential forgetting
A problem with this method is that all values from t = 1 to the current time
instant are used to compute the sufficient statistics. If the initial parameter
values are far away from the true values, this will slow down the convergence
process. Moreover, nonstationary data are not well handled. As a solution
to these problems, Stenger et al. (2001) proposed to compute the sufficient
statistics using exponential forgetting, by which estimates prior in time receive
less weight.

The idea is to replace the sums in the re-estimation formulas (12)–(14) by variables
which are updated recursively. For example, the term \( \sum_{\tau=1}^{t} \xi_\tau \) is replaced
by variables \( S_t^{\xi} \) which are updated as
\[ S_t^{\xi} = \lambda S_{t-1}^{\xi} + (1 - \lambda)\, \xi_t, \]

where λ ∈ (0, 1) is the forgetting factor. This approach discounts old obser-
vations exponentially, such that an observation that is τ samples old carries
a weight that is equal to λτ times the weight of the most recent observation.
Hence, the effective memory length is T eff = 1/ (1 − λ).
Exponential forgetting is a natural choice when parameters are believed to follow
a random walk (Smidl and Gustafsson 2012). The choice of memory length is
a tradeoff between adaptivity to parameter changes and sensitivity to noise. In
order to reduce estimation noise, exponential forgetting is typically unable to
capture abrupt changes. In an HMM, however, the mean and covariance are free
to jump from one state to another at every time step—or instantaneously, if a
continuous-time model is employed (Nystrup et al. 2015b)—even when the time
variation of the underlying parameters is assumed to be smooth. In this way,
the adaptively-estimated HMM combines abrupt changes and smooth variations
(Nystrup et al. 2017b).
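The following sketch shows one step of the normalized forward recursion together with the exponentially forgotten update of the state means, cf. (13); the transition-matrix and covariance updates follow the same pattern. The function name, the dictionary of sufficient statistics, and the default forgetting factor (corresponding to a 130-day effective memory) are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def online_step(xi_prev, o_t, Gamma, mus, Sigmas, stats, lam=1 - 1 / 130):
    """One online update with exponential forgetting (a sketch)."""
    K = len(mus)
    b = np.array([multivariate_normal.pdf(o_t, mus[k], Sigmas[k]) for k in range(K)])
    xi = (xi_prev @ Gamma) * b   # forward recursion on normalized forward variables
    xi /= xi.sum()               # filtering probabilities xi_t

    # exponentially forgotten sufficient statistics, S_t = lam * S_{t-1} + (1 - lam) * x_t
    stats["S_xi"] = lam * stats["S_xi"] + (1 - lam) * xi
    stats["S_o"] = lam * stats["S_o"] + (1 - lam) * xi[:, None] * o_t
    mus_new = stats["S_o"] / stats["S_xi"][:, None]   # updated state means, cf. (13)
    return xi, mus_new
```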

Shrinkage estimation
The usual issues when estimating a high-dimensional covariance matrix also arise
in the context of HMMs, causing unstable estimates of the transition matrix and
of the hidden states, as shown by Fiecas et al. (2017). In fact, the problem is
even more pronounced, as some regimes could be seldom visited, in which case
the effective sample size for estimating the covariance matrix will be very small.
Furthermore, when applying exponential forgetting, the sample size is bounded
by the effective memory length.
One possible solution, as proposed by Fiecas et al. (2017), is to apply a Stein-
type shrinkage estimator
\[ \hat{\Sigma}_i^{\mathrm{shrink}} = (1 - \nu_i)\, \hat{\Sigma}_i + \nu_i\, \operatorname{tr}\bigl(\hat{\Sigma}_i\bigr)\, n^{-1} I_n, \tag{15} \]

where νi ∈ [0, 1] is the shrinkage factor and In is the n × n identity matrix.
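In code, the shrinkage in (15) amounts to a one-line adjustment of each state's covariance estimate (a sketch).

```python
import numpy as np

def shrink_covariance(Sigma_hat, nu):
    """Stein-type shrinkage toward a scaled identity matrix, cf. (15)."""
    n = Sigma_hat.shape[0]
    return (1 - nu) * Sigma_hat + nu * np.trace(Sigma_hat) / n * np.eye(n)
```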


In order to further stabilize the state classification, it can be necessary to con-
sider only a subset of the indices when estimating the state probabilities (see
section 4.2).

3.3 Forecasting
The first step toward calculating the forecast distribution is to estimate the
current state probabilities given the past observations and parameters. This

is the ξt that is estimated as part of the online algorithm. Once the current
state probabilities are estimated, the state probabilities h steps ahead can be
forecasted by multiplying the state estimate ξˆt|t with the transition probability
matrix h times:
\[ \hat{\xi}_{t+h|t}^{T} = \hat{\xi}_{t|t}^{T}\, \Gamma_t^{h}. \tag{16} \]
The parameters are assumed to stay constant in the absence of a model describ-
ing their evolution.
The density forecast is the average of the state-dependent conditional densi-
ties weighted by the forecasted state probabilities. When the conditional dis-
tributions are distinct Gaussian distributions, the forecast distribution will be
a mixture with non-Gaussian distribution (Frühwirth-Schnatter 2006). Using
Monte Carlo simulation, Boyd et al. (2014) found that the results of dynamic
portfolio optimization are not particularly sensitive to higher-order moments.
For the present application, only the first and second moment of the forecast
distribution are considered.
The first two unconditional moments of a multivariate mixture distribution are
\[ \mu = \sum_{i \in S} (\xi)_i\, \mu_i, \tag{17} \]
\[ \Sigma = \sum_{i \in S} (\xi)_i\, \Sigma_i + \sum_{i \in S} (\xi)_i\, (\mu_i - \mu)(\mu_i - \mu)^{T}, \tag{18} \]

with (ξ)i denoting the weights, that is, the forecasted state probabilities.
Before calculating the unconditional moments of the mixture distribution, the
conditional means and covariances of the returns rt are calculated based on the
estimated moments of the log-returns. Within each state, the log-returns are
assumed to be independent and identically distributed with Gaussian distribu-
tion:
\[ \log(1 + r_t) \sim N\bigl(\mu^{\log}_{s_t}, \Sigma^{\log}_{s_t}\bigr), \]
where μ^log_{s_t} and Σ^log_{s_t} are the conditional mean and covariance of the log-returns.
Thus, the conditional mean and covariance of the returns r_t are given by
\[ (\mu_s)_i = \exp\Bigl\{ \bigl(\mu^{\log}_s\bigr)_i + \tfrac{1}{2}\bigl(\Sigma^{\log}_s\bigr)_{ii} \Bigr\} - 1, \tag{19} \]
\[ (\Sigma_s)_{ij} = \exp\Bigl\{ \bigl(\mu^{\log}_s\bigr)_i + \bigl(\mu^{\log}_s\bigr)_j + \tfrac{1}{2}\Bigl( \bigl(\Sigma^{\log}_s\bigr)_{ii} + \bigl(\Sigma^{\log}_s\bigr)_{jj} \Bigr) \Bigr\} \cdot \Bigl( \exp\Bigl\{ \bigl(\Sigma^{\log}_s\bigr)_{ij} \Bigr\} - 1 \Bigr). \tag{20} \]

Note that i and j in (19) and (20) refer to elements of the conditional mean and
covariance, i.e., specific assets, whereas s refers to a state.
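The forecasting step can be summarized in a single function that combines (16)–(20); the function and variable names are illustrative.

```python
import numpy as np

def forecast_moments(xi_t, Gamma, mu_log, Sigma_log, h):
    """Forecast the return mean and covariance h steps ahead, cf. (16)-(20).

    xi_t: filtering probabilities (length K); mu_log, Sigma_log: per-state
    log-return means (K, n) and covariances (K, n, n).
    """
    xi_h = xi_t @ np.linalg.matrix_power(Gamma, h)         # state forecast (16)
    K, n = mu_log.shape
    mus, Sigmas = [], []
    for s in range(K):
        d = np.diag(Sigma_log[s])
        mus.append(np.exp(mu_log[s] + 0.5 * d) - 1)        # conditional means, cf. (19)
        scale = np.exp(np.add.outer(mu_log[s], mu_log[s]) + 0.5 * np.add.outer(d, d))
        Sigmas.append(scale * (np.exp(Sigma_log[s]) - 1))  # conditional covariances, cf. (20)
    mu = sum(xi_h[s] * mus[s] for s in range(K))           # mixture mean (17)
    Sigma = sum(xi_h[s] * Sigmas[s] for s in range(K))     # first term of (18)
    Sigma += sum(xi_h[s] * np.outer(mus[s] - mu, mus[s] - mu) for s in range(K))
    return mu, Sigma
```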

The forecasted mean and covariance will be mean-reverting as the forecast hori-
zon extends and the state probabilities converge to the stationary distribution
of the Markov chain. The more persistent the states are, the slower the rate of
convergence.

4 Empirical results
The empirical testing is divided into two parts. The purpose of the in-sample
training is to determine the optimal number of regimes, memory length in the
estimation, shrinkage factors, and values of the hyper-parameters in the MPC
problem (2). In the out-of-sample test, the performance of the MPC approach
to multi-period portfolio selection with drawdown control is evaluated for the
particular choice of hyper-parameters and compared to various benchmarks.

4.1 Data
In sample
The choice of time period is a tradeoff between historical data availability and
asset universe coverage. The in-sample asset universe consists of developed
market (DM) and emerging market (EM) stocks, listed DM real estate, DM
high-yield bonds, gold, oil, corporate bonds, and U.S. government bonds.6 All
indices measure the total net return in USD with a total of 2,316 daily closing
prices per index covering the period from 1990 through 1998.7 The first two
years are used for initialization and the last seven years are used for training.
This is only a subset of the indices considered in the out-of-sample test, as
historical data is not available for EM high-yield bonds and inflation-linked
bonds. Furthermore, U.S. government bonds are a substitute for the Citi G7
government-bond index in sample.

Out of sample
The asset universe considered in the out-of-sample test consists of DM and EM
stocks, listed DM real estate, DM and EM high-yield bonds, gold, oil, corporate
bonds, inflation-linked bonds, and government bonds.8 All indices measure the
total net return in USD with a total of 5,185 daily closing prices per index
covering the period from 1997 through 2016.9 The first two years are used for
initialization and the last 18 years are used for out-of-sample testing.

6 The eight indices are MSCI World, MSCI Emerging Markets, FTSE EPRA/NAREIT Developed Real Estate, BofA Merrill Lynch U.S. High Yield, S&P GSCI Crude Oil (funded futures roll), LBMA Gold Price, Barclays U.S. Aggregate Corporate Bonds, and Bloomberg Barclays U.S. Government Bonds.
7 Days on which more than half of the indices had zero price change (27 days in total) have been removed. In the few months where only monthly prices are available for DM high-yield bonds, linear interpolation with Gaussian noise has been used to fill the gaps.
8 The ten indices are MSCI World, MSCI Emerging Markets, FTSE EPRA/NAREIT Developed Real Estate, BofA Merrill Lynch U.S. High Yield, Barclays Emerging Markets High Yield, S&P GSCI Crude Oil (funded futures roll), LBMA Gold Price, Barclays U.S. Aggregate Corporate Bonds, Barclays World Inflation-Linked Bonds (hedged to USD), and Citi G7 Government Bonds (hedged to USD).

Figure 1: Development of the ten indices over the 20-year out-of-sample period. [Chart omitted; the series shown are EM HY bonds, real estate, DM HY bonds, DM stocks, IFL bonds, CORP bonds, gold, EM stocks, GVT bonds, and oil, indexed from 1997 to 2017.]
Figure 1 shows the ten indices’ development over the 20-year out-of-sample
period. There are large differences in the asset classes’ behavior. The financial
crisis in 2008 stands out, in that respect, as the majority of the indices suffered
large losses in this period.
Table 2 summarizes the indices’ annualized excess return, excess risk, Sharpe
ratio (SR)10 , maximum drawdown11 , and Calmar ratio (CR)12 . The risk-free rate
is assumed to be the daily equivalent of the yield on a one-month U.S. treasury
bill. The reported excess risks have been adjusted for autocorrelation using the
procedure outlined by Kinlaw et al. (2014, 2015).13

9 Days on which more than half of the indices had zero price change (19 days in total) have been removed.
10 The Sharpe ratio is the excess return divided by the excess risk (Sharpe 1966, 1994).
11 The maximum drawdown is the largest relative decline from a historical peak in the index value, as defined in section 2.4.
12 The Calmar ratio is the annualized excess return divided by the maximum drawdown.
13 The adjustment leads to the reported excess risks being higher than had they been annualized under the assumption of independence, as most of the indices display positive autocorrelation. The largest impact was on the excess risk of EM stocks that went from 0.20 to 0.28 and the excess risk of DM high-yield bonds that went from 0.05 to 0.12.

Index                      Excess return   Excess risk   Sharpe ratio   Maximum drawdown   Calmar ratio
1. DM stocks 0.042 0.18 0.24 0.57 0.07
2. EM stocks 0.035 0.28 0.12 0.65 0.05
3. Real estate 0.054 0.22 0.24 0.72 0.07
4. DM high-yield bonds 0.050 0.12 0.42 0.35 0.14
5. EM high-yield bonds 0.077 0.13 0.61 0.36 0.21
6. Oil -0.046 0.42 -0.11 0.94 -0.05
7. Gold 0.038 0.16 0.23 0.45 0.09
8. Corporate bonds 0.040 0.06 0.68 0.16 0.25
9. Inflation-linked bonds 0.041 0.04 0.99 0.10 0.40
10. Government bonds 0.032 0.03 1.17 0.05 0.65

Table 2: Annualized performance of the ten indices over the 20-year out-of-sample
period in excess of the risk-free rate.

The differences in performance are substantial. The oil price index is the only
index that has had a negative excess return. The EM high-yield bond index
realized the highest excess return while inflation-linked and government bonds
realized the highest Sharpe and Calmar ratios. Fixed income benefited from
falling interest rates over the considered period.
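For reference, the summary statistics reported in this section can be computed from a series of daily excess returns roughly as follows; this sketch annualizes under an independence assumption and omits the autocorrelation adjustment of Kinlaw et al. (2014, 2015).

```python
import numpy as np

def performance_stats(excess_returns, periods_per_year=252):
    """Annualized excess return and risk, Sharpe ratio, MDD, and Calmar ratio (a sketch)."""
    r = np.asarray(excess_returns)
    ann_ret = r.mean() * periods_per_year
    ann_risk = r.std(ddof=1) * np.sqrt(periods_per_year)
    value = np.cumprod(1 + r)
    mdd = np.max(1 - value / np.maximum.accumulate(value))
    return {"excess return": ann_ret, "excess risk": ann_risk,
            "Sharpe ratio": ann_ret / ann_risk,
            "maximum drawdown": mdd, "Calmar ratio": ann_ret / mdd}
```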

4.2 In-sample training


In the in-sample training, the risk-aversion parameter is fixed at γ = 5 and
portfolio performance is evaluated in terms of SR, excess return, and annual
turnover. The choice of γ = 5 results in portfolios with an excess risk similar
to that of the equally-weighted 1/n portfolio (in section 4.4 results are shown
for a range of values of the risk-aversion parameter). Training is carried out
solely for a long-only (LO) portfolio with no leverage. Realized transaction
costs, including bid–ask spread, are assumed to be 10 basis points, and there is
no transaction cost associated with the risk-free asset. The assets are assumed
to be liquid enough compared to the total portfolio value that price impact can
be ignored. Further, it is assumed that there are no holding costs.
Many of the hyper-parameters are mutually dependent, which makes the in-
sample training more challenging. For example, if the MPC planning horizon
is doubled, transaction costs also have to be doubled in order to maintain an
approximately similar turnover.14 In addition, the optimal values of the MPC
hyper-parameters depend on the choice of forecasting model.

14 See Grinold (2006) and Boyd et al. (2017) for more on amortization of transaction and holding costs.

To simplify the training task, it is divided into two steps. First, reasonable
values of the MPC parameters are chosen and then used when testing different
forecasting models. Second, the optimal MPC hyper-parameters are found for
the selected forecasting model. A final check is done to ensure that the model
is still optimal for that choice of MPC hyper-parameters.

Forecasting model

Number of regimes and indices. At first, a multivariate HMM is fitted to


all indices at once. This results in a state sequence with low persistence and
frequent switches, leading to excessive portfolio turnover and poor results. This
is surprising given the large number of studies showing the value of DAA based
on regime-switching models, in particular Nystrup et al. (2017a) who used a
univariate HMM of daily stock returns to switch between predefined risk–on
and risk–off multi-asset portfolios. Inspired by this approach, the states are
instead estimated based on the two stock indices (DM and EM). The mean
vector and covariance matrix in each state is still estimated based on all indices,
but the underlying state is estimated solely based on the two stock indices.
This leads to a more persistent state sequence with fewer switches and better
portfolio performance. There is no benefit to including additional indices in the
state estimation, as it increases the uncertainty. The stock indices appear to be
sufficient in order to capture important changes in risk and return. Models with
two, three, and four regimes are tested. There is no benefit in going from two
to three regimes and it is very hard to distinguish between four regimes out of
sample.

Effective memory length. Effective memory lengths of T eff = 65, 130, 260,
520 days are tested. The shorter the memory length used in the estimation, the
higher the risk of having states with no visits and, consequently, probabilities
converging to zero and never recovering. This happens with memory lengths
shorter than 100 days. The more regimes, the longer the optimal memory length.
With only two regimes, 130 days appear to be optimal.

Shrinkage factors. The shorter the memory length, the higher the optimal
shrinkage factor. Shrinkage factors of νi = 0.1, 0.2, . . . , 0.5 are tried in each
of the two regimes. A shrinkage factor of 0.2 in the most frequent regime and
0.4 in the least visited regime performs best. The use of shrinkage significantly
improves the results, although they are not overly sensitive to the specific choice
of shrinkage factor within the tested range.

MPC parameters
Planning horizon. Planning horizons of H = 10, 15, . . . , 30 days are tested.
10 days are found to be too few, while it appears that there is no benefit in
going beyond 15 days.

Maximum holding constraint. Maximum holding constraints wmax = 0.2,


0.3, . . . , 0.5 are tried. A maximum holding constraint ensures a minimum level
of diversification, but with wmax = 0.2 there is limited possibility for deviating
from the equally-weighted portfolio. As a compromise, a value of wmax = 0.4 is
selected.

Transaction costs. Transaction costs (κ1 )1:n = 0.0005, 0.001, . . . , 0.0055 are
tested, while there is no transaction cost associated with the risk-free asset, i.e.,
(κ1 )n+1 = 0. The term κT1 |wt − wt−1 | is very effective at reducing portfolio
turnover. When this penalty is included, there is no additional benefit from
2
including a second term κT2 (wt − wt−1 ) . This squared term reduces the size of
trades, but it appears that it simply means that trades are split over multiple
days and therefore delayed. This is not beneficial given the assumption that
there is no realized price-impact cost. The value (κ1 )1:n = 0.004 is selected.

Holding costs. Holding costs ρ2 = 0, 0.0005, . . . , 0.002 are tested. The hold-
ing cost ρT2 wt2 has a similar effect as the weight constraint (10): it encourages
diversification and reduces the risk due to uncertainty in the covariance forecasts.
Increasing ρ2 leads to a more diversified and stable portfolio. If (ρ2 )n+1 = 0,
this will at the same time increase the allocation to cash, which is undesirable.
The value ρ2 = 0.0005 is selected. There is no benefit to including an ℓ1 term
ρT1 |wt |, which leads to a more sparse portfolio.

4.3 Out-of-sample test results for γ0 = 5


Below, the performance of the MPC approach is evaluated for the above choice
of hyper-parameters and compared to various benchmarks. First, results when
γ0 = 5 are reported, and then in section 4.4 results are analyzed for a range of
values of γ0 . In all cases it is assumed that assets can be bought and sold at the
end of each trading day, subject to a 10 basis point transaction cost, and the
fee for shorting assets is assumed to be equal to the risk-free rate. It is assumed
that there are no price-impact or holding costs.

Allocations
Figure 3 and figure 4 show the asset weights over time for a long-only and a
long–short (LS) portfolio and for a leveraged long-only (LLO) portfolio with and
without drawdown control, respectively. The cost and weight parameters not
mentioned in the figure captions are equal to zero. The portfolios always include
multiple assets at a time due to the imposed maximum holding (wmax)1:n = 0.4.
The allocations change quite a bit over the test period, especially in the LS
portfolio.

Figure 3: Asset weights over time for a long-only and a long–short portfolio. (a) γ = 5, (κ1)1:n = 0.004, ρ2 = 0.0005, (wmax)1:n = 0.4, (wmax)n+1 = 1. (b) γ = 5, (κ1)1:n = 0.004, ρ2 = 0.0005, (wmin)1:n = (wmax)1:n = 0.4, (wmin)n+1 = (wmax)n+1 = 1, Lmax = 2. [Stacked asset-weight charts omitted.]

Leverage is primarily used between 2003 and mid-2006 and again from 2010 until
mid-2013. With the exception of these two periods, the four portfolios include
holdings in the risk-free asset most of the time in addition to some short positions
in the LS portfolio. The impact of drawdown control on the allocation is most
evident during the 2008 crisis, where the LLO portfolio subject to drawdown
control is fully allocated to cash.

Figure 4: Asset weights over time for a leveraged long-only portfolio with and without drawdown control. (a) γ = 5, (κ1)1:n = 0.004, ρ2 = 0.0005, (wmax)1:n = 0.4, (wmin)n+1 = (wmax)n+1 = 1. (b) γ0 = 5, (κ1)1:n = 0.004, ρ2 = 0.0005, (wmax)1:n = 0.4, (wmin)n+1 = (wmax)n+1 = 1, Dmax = 0.1. [Stacked asset-weight charts omitted.]

                     LO     LLO    LLO (Dmax = 0.1)   LS     FM     1/n
Excess return        0.10   0.13   0.11               0.12   0.06   0.06
Excess risk          0.11   0.12   0.11               0.12   0.12   0.11
Sharpe ratio         0.97   1.01   1.00               1.01   0.51   0.52
Maximum drawdown     0.19   0.19   0.10               0.23   0.38   0.37
Calmar ratio         0.56   0.65   1.07               0.54   0.16   0.16
Annual turnover      2.93   3.22   3.24               6.75   0.16   0.16

Table 5: Annualized performance of MPC portfolios with γ0 = 5 compared to fixed mix and 1/n.

Performance compared to fixed mix and 1/n


In table 5, the MPC portfolios’ annualized performance in excess of the risk-free
rate when γ0 = 5 is compared to a fixed-mix (FM) portfolio and an equally-
weighted (1/n) portfolio. The FM portfolio is rebalanced monthly to the average
allocation of the LO portfolio over the entire 18-year test period. This means
that the LO and the FM portfolios have the same average allocation. Thus,
differences in performance can only be attributed to timing and transaction
costs. The 1/n portfolio is rebalanced monthly to an equal allocation across all
risky assets. The performance of FM and 1/n is fairly similar.
The LO portfolio’s excess return is 445 basis points higher than that of the FM
portfolio. This, combined with a slightly lower excess risk, leads to a SR of 0.97
compared to 0.51. DAA, even without drawdown control, leads to a MDD of
0.19 compared to the FM portfolio’s 0.38. This leads to a CR that is more than
three times as high (0.56 compared to 0.16). The LO portfolio’s annual turnover
of 2.93 is a lot higher than that of the FM portfolio, but the reported results
are net of transaction costs of 10 basis points per one-way transaction.
Allowing a maximum leverage of Lmax = 2 leads to a higher turnover and a
slightly higher excess return, excess risk, and SR when γ0 = 5. The LLO
portfolio’s MDD is the same as that of the LO portfolio. This leads to a CR of
0.65 compared to the LO portfolio’s 0.56.
By allowing leverage and imposing a maximum acceptable drawdown of Dmax =
0.1, the CR can be further improved. The reduction in MDD more than offsets
the loss of excess return, leading to a CR of 1.07. Note that the imposition of
a drawdown limit only leads to a slightly higher turnover.
The LS portfolio’s SR and CR are similar to those of the LO portfolio. Its annual
turnover of 6.75 is by far the highest of any of the portfolios. This could easily
be reduced, though, by imposing a higher trading penalty in the optimization.

Figure 6: Performance over time of MPC portfolios with γ0 = 5 compared to an equally-weighted portfolio. [Chart omitted; the series shown are LLO, LS, LLO with Dmax = 0.1, LO, and 1/n, indexed on a log scale from 1999 to 2017.]

Another way to reduce turnover is to reconsider the allocation less frequently.


Weekly rather than daily adjustments reduce the turnover by more than one,
but lead to a slightly lower return and higher risk.15 The savings in terms of
transaction costs are not enough to compensate for the lost opportunities. The
Calmar ratio deteriorates more so than the Sharpe ratio, because drawdown
control does not work as well when allowing less frequent allocation changes. In
addition, portfolio constraints can be violated. Yet, the Sharpe ratio is relatively
stable, which indicates that the regime-switching approach is robust.
Figure 6 shows the value of the portfolios from table 5 over time on a log scale.
The MPC portfolios outperform the equally-weighted portfolio throughout the
18-year period. The leveraged portfolios, in particular, have benefited from the
bull market from 2003 until 2008 and again after the financial crisis. The MPC
portfolios lost value in 2008, but they lost much less than the 1/n portfolio.
None of the portfolios have gained much value in 2014 and 2015.

15 Note that all hyperparameters were selected in sample based on a daily update frequency

(section 4.2). When these parameters are used with a lower update frequency, as expected, the
results are worse.

4.4 Drawdown control results


Long only
Figure 7 shows the annualized excess return net of transaction costs as a function
of (a) annualized excess risk and (b) maximum drawdown for different values
of γ0 and Dmax for a long-only portfolio. For comparison, the ex-post mean–
variance efficient frontier and the 1/n portfolio are shown. Note that the risk
of the 1/n portfolio could be changed by allocating part of the portfolio to the
risk-free asset. The ex-post efficient frontier shows the maximum excess return
obtainable for a given excess risk for a fixed-mix, long-only portfolio subject
to the maximum holding constraint (wmax )1:n = 0.4, conditional on knowing
the returns beforehand. It is referred to as a no-regret frontier (Bell 1982). It
more or less overlaps with the ex-post mean–MDD efficient frontier in both risk
spaces; therefore, only the former is shown.
The dynamic frontiers are clearly superior to the static, no-regret frontier. This
is impressive considering that the no-regret frontier is constructed in hindsight
and, thus, not obtainable in practice. In other words, even if they knew future
returns when choosing their benchmark, investors who insist on rebalancing
to a static, diversified benchmark could not have outperformed the dynamic
strategies net of transaction costs over the 18-year test period in terms of SR nor
CR. The opportunity for DAA significantly expands the investment opportunity
set; even so, this is a noteworthy result.
The 1/n portfolio is inefficient regardless of whether risk is measured by stan-
dard deviation or MDD. This is no surprise given that it is based on a naive
prior assumption of equal returns, risks, and correlations across all assets. Yet,
equally-weighted portfolios are often found to outperform mean–variance opti-
mized portfolios out of sample (DeMiguel et al. 2009b, López de Prado 2016).
This suggests that the no-regret frontier would likely be closer to the 1/n port-
folio than to the dynamic frontiers, had it not benefited from hindsight.
Looking at the frontiers with and without drawdown control in figure 7(a), it
appears that drawdown control can be implemented with little loss of mean–
variance efficiency. By increasing the risk-aversion parameter as the drawdown
approaches the maximum acceptable drawdown Dmax , a larger fraction of the
portfolio is allocated to the risk-free asset, cf. figure 4(b). Except for the trans-
action costs involved, this does not lead to a worse SR per se, but reduced
risk-taking in periods with above-average SRs would. This is clearly not the
case. Drawdown control simply leads to a higher average risk aversion.
From figure 7(b) it can be seen that the drawdown limit is breached—although
not by much—when γ0 = 1. The success of the proposed approach to draw-
down control is not very sensitive to the choice of initial risk-aversion parameter
γ0 . Essentially, any value γ0 ≥ 3 will work with a drawdown limit as tight as
Dmax = 0.1. Drawdown control is more sensitive to the allocation-update fre-
quency, since optimal drawdown control requires continuous trading. Yet, daily
allocation updates are sufficient for it to work for reasonable values of γ0 and
Dmax.

Figure 7: Efficient frontiers for different values of Dmax compared to a no-regret frontier and an equally-weighted portfolio, when no leverage is allowed. The points, from right to left, correspond to γ0 = 1, 3, 5, 10, 15, 25. Panel (a) plots annualized excess return against annualized excess risk; panel (b) plots annualized excess return against maximum drawdown. [Charts omitted.]

Figure 8: Efficient frontiers for different values of Dmax compared to a no-regret frontier and an equally-weighted portfolio, when leverage and short-positions are allowed. The points, from right to left, correspond to γ0 = 1, 3, 5, 10, 15, 25. Panel (a) plots annualized excess return against annualized excess risk; panel (b) plots annualized excess return against maximum drawdown. [Charts omitted.]

Long–short
Figure 8 shows the annualized excess return net of transaction costs as a function
of (a) annualized excess risk and (b) maximum drawdown for different values
of γ0 and Dmax for LS and LLO portfolios. For comparison, the ex-post mean–
variance efficient frontier and the 1/n portfolio are shown. The ex-post efficient
frontier gives the maximum excess return obtainable for a given excess risk
for a fixed-mix, long–short portfolio subject to the same holding and leverage
constraints, (wmin)1:n = (wmax)1:n = 0.4 and Lmax = 2, conditional on knowing
the returns beforehand.
In figure 8(a), the possibility of using leverage or taking short positions extends
the efficient frontier. Leverage can be applied to increase risk while maintaining
diversification, rather than concentrating the portfolio in a few assets. This
reduces the gap between the dynamic frontiers and the no-regret frontier.
In figure 8(b), the difference between the dynamic frontiers and the no-regret
frontier is still substantial. Again, the ex-post mean–variance efficient frontier
more or less overlaps with the ex-post mean–MDD efficient frontier; therefore,
only the former is shown. By taking a dynamic approach, the maximum draw-
down can be reduced by 0.25, while maintaining the same excess return.
The combination of leverage and drawdown control is powerful. Compared to
figure 7(b), it is possible to increase the excess return by several hundred basis
points without suffering a larger MDD by combining the use of leverage with
drawdown control. The possible excess return is bounded by the drawdown
limit. Seeking excess return beyond this boundary by removing the drawdown
limit and lowering γ0 comes at the cost of a significantly increased MDD. This
is true regardless of whether leverage can be applied.

5 Conclusion
By adjusting the risk aversion based on realized drawdown, the proposed ap-
proach to multi-period portfolio selection based on MPC successfully controlled
drawdowns with little or no sacrifice of mean–variance efficiency. The empirical
testing showed that performance could be significantly improved by reducing
realized risk and MDD using this dynamic approach. In fact, even if they knew
future returns when choosing their benchmark, investors who insisted on re-
balancing to a static benchmark allocation could not have outperformed the
dynamic approach net of transaction costs over the 18-year out-of-sample test
period. The combination of leverage and drawdown control was particularly
successful, as it was possible to increase the excess return by several hundred basis points without suffering a larger MDD.
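A minimal sketch of what such a drawdown-dependent risk-aversion rule could look like is given below. The specific functional form, which scales γ0 by the remaining distance to the drawdown limit, is an illustrative assumption and not necessarily the exact rule used above.

```python
# Illustrative (assumed) rule: increase risk aversion as the realized
# drawdown approaches the limit D_max, so the optimizer de-risks before
# the limit is breached.
def adjusted_risk_aversion(gamma0, drawdown, D_max, eps=1e-6):
    remaining = max(D_max - drawdown, eps)  # distance left to the drawdown limit
    return gamma0 * D_max / remaining       # equals gamma0 when drawdown is zero
```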
The MPC approach to multi-period portfolio selection has potential in practical applications because it is computationally fast. This makes it feasible to consider a large universe of assets and to implement important constraints and costs. When combined with an adaptive forecasting method, it provides a flexible framework for incorporating new information into a portfolio as it becomes available. This should be useful in future research when evaluating the performance of return-prediction models.
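To make the computational point concrete, the sketch below states a single rebalancing step of an MPC-style problem as a convex program in cvxpy: trade from the current weights toward a mean–variance optimal allocation, net of a linear transaction-cost penalty and subject to holding and leverage constraints. The one-period planning horizon, the cost rate kappa, and the forecast inputs mu and Sigma are simplifying assumptions; the multi-period formulation plans over a horizon of future periods.

```python
# A minimal, single-period sketch of an MPC-style rebalancing step (the
# multi-period formulation optimizes over a horizon of future periods).
# mu, Sigma, kappa and the bounds are hypothetical forecast inputs and
# parameters.
import cvxpy as cp

def mpc_step(w_prev, mu, Sigma, gamma, kappa=0.001, w_max=0.4, L_max=2.0):
    n = len(w_prev)
    w = cp.Variable(n)                           # post-trade weights
    trade_cost = kappa * cp.norm(w - w_prev, 1)  # linear transaction costs
    objective = cp.Maximize(mu @ w
                            - gamma * cp.quad_form(w, Sigma)
                            - trade_cost)
    constraints = [w >= -w_max, w <= w_max,      # holding constraints
                   cp.norm(w, 1) <= L_max]       # leverage constraint
    cp.Problem(objective, constraints).solve()
    return w.value
```

Because each step is a small convex program, it can be re-solved whenever new forecasts arrive, which is what makes frequent re-optimization over a large asset universe practical.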

