Cluster-Specific Predictions with Multi-Task Gaussian Processes

Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey
Abstract
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-
task learning, clustering, and prediction for multiple functional data. This procedure acts
as a model-based clustering method for functional data as well as a learning step for sub-
sequent predictions for new tasks. The model is instantiated as a mixture of multi-task
GPs with common mean processes. A variational EM algorithm is derived for dealing
with the optimisation of the hyper-parameters along with the hyper-posteriors’ estimation
of latent variables and processes. We establish explicit formulas for integrating the mean
processes and the latent clustering variables within a predictive distribution, accounting
for uncertainty in both aspects. This distribution is defined as a mixture of cluster-specific
GP predictions, which enhances the performance when dealing with group-structured data.
The model handles irregular grids of observations and offers different hypotheses on the
covariance structure for sharing additional information across tasks. The performances on
both clustering and prediction tasks are assessed through various simulated scenarios and
real data sets. The overall algorithm, called MagmaClust, is publicly available as an R
package.
Keywords: Gaussian processes mixture, curve clustering, multi-task learning, variational
EM, cluster-specific predictions
1. Introduction
The classic context of regression aims at inferring the underlying mapping function asso-
ciating input to output data. In a probabilistic framework, some strategies assume that
this function is drawn from a prior Gaussian process (GP). According to Rasmussen and
Williams (2006), a Gaussian process can be defined as a collection of random variables
(indexed over a continuum), any finite number of which having a joint Gaussian distribu-
tion. From this definition, we may enforce some properties for the target function solely
by characterising the mean and covariance parameters of the process. Since GPs are an
example of kernel methods, a broad range of assumptions can be expressed through the
definition of the covariance function. We refer to Duvenaud (2014) for a comprehensive
review. Despite undeniable advantages, natural implementations for GPs scale cubically
with the number of data points, which constitutes a major drawback in many applications.
Thereby, the early literature focused on deriving tractable approximations to mitigate this
problem (Schwaighofer et al., 2004; Snelson and Ghahramani, 2005; Titsias, 2009; Hensman
et al., 2013a). Subsequent reviews (Quiñonero-Candela et al., 2007; Bauer et al., 2016) have
also provided standardised formulations and comparisons on this topic. Besides, several ap-
proximations have been proposed (Neal, 1997) and implemented (Rasmussen and Nickisch,
2010; Vanhatalo et al., 2013) for tackling the issue of non-Gaussian errors and adapting
GPs to a broad variety of likelihoods. Since a GP corresponds to a probability distribution
over a functional space, alternate approaches for modelling functional data (Ramsay and
Silverman, 2005) should also be mentioned, in particular for our clustering purpose.
Unsupervised learning of functional data, also called curve clustering, focuses on the
definition of sub-groups of curves, making sense according to an appropriate measure of
similarity. When dealing with functional data, the concept of basis functions expansion is
of paramount importance for defining smooth and manageable representations of the data.
Such a representation allows the adaptation of multivariate methods such as k-means, in
combination with B-splines bases for instance (Abraham et al., 2003), to handle curve clus-
tering problems. Different bases can be used, such as Fourier or wavelets (Giacofci et al.,
2013), according to the context and the nature of the signal. Besides, model-based cluster-
ing methods aim at defining probabilistic techniques for this task, and many approaches
(Jiang and Serban, 2012; Jacques and Preda, 2013) have been proposed in this sense for
the past decade. In particular, the algorithms funHDDC (Bouveyron and Jacques, 2011)
and funFEM (Bouveyron et al., 2015) establish a mixture model where representative co-
efficients of the curves are supposed to come from cluster-specific Gaussian distributions.
Furthermore, the authors in Schmutz et al. (2020) introduced an extension to the case of
multivariate functional data. A comprehensive review (Jacques and Preda, 2014) has been
proposed to discuss and compare the major approaches of this active research area. We
can also mention recent works leveraging generalised Bayesian predictors and PAC-Bayes
for learning and clustering streams of data (Li et al., 2018; Li and Guedj, 2021), which
later inspired a work on clustering (Cohen-Addad et al., 2021). In line with the previous
methods that simultaneously analyse multiple curves, we also introduce a framework that
takes advantage of similarities between resembling data.
The multi-task paradigm consists in using data from several tasks (also called batches or
individuals) to improve the learning or predictive capacities compared to an isolated model.
It was introduced by Caruana (1997) and then adapted in many fields of machine learning.
An initial GP adaptation (Schwaighofer et al., 2004) came as a hierarchical Bayesian model
using an expectation-maximisation (EM) algorithm for learning, and a similar approach
can be found in Shi et al. (2005). Another hierarchical formulation using a GP to model
the mean parameter of another GP was later proposed in Hensman et al. (2013b). Such
modelling assumptions resemble those of the present paper, although the strategies used
for learning and prediction largely differ. Meanwhile, Yu et al. (2005) offered an extensive
study of the relationships between the linear model and GPs to develop a multi-task formu-
lation. More recently, the expression multi-task GP has been coined by Bonilla et al. (2008)
for referring to a covariance structure involving inputs and tasks in two separate matrices.
Some further developments on this approach were proposed (Hayashi et al., 2012; Rakitsch
et al., 2013; Chen et al., 2020) and we can also mention the work of Swersky et al. (2013)
on Bayesian hyper-parameter optimisation in such models. Generally, as presented in the
review Álvarez et al. (2012) which favours the term multi-output GP, all these frameworks
can be expressed as specific cases of the linear model of coregionalisation (LMC) introduced
by Goovaerts (1997) in the field of geostatistics. Finally, let us emphasise the algorithm
Magma from Leroy et al. (2022) that recently proposed a different multi-task paradigm for
GPs, by transferring information through a latent mean process rather than the covariance
structure, an intuition that partially appeared before in Shi et al. (2007). This approach
offers enhanced performances in forecasting while scaling linearly in the number of tasks, a cost noticeably lower than that of the previous multi-output methods, which generally bear a cubic complexity. However, the assumption of a unique mean process might happen to be
too restrictive and could benefit from a more general formulation. For instance, Shi and
Wang (2008) proposed an idea close to our following model by introducing a curve clus-
tering component to a hybrid splines-GPs multi-task framework, although their approach
does not fully account for uncertainty and cannot handle irregular measurements. More-
over, no implementation has been released for their algorithm, and by deriving a unified
multi-task GP framework that is more general, we aim at offering to practitioners a pow-
erful tool for tackling the simultaneous clustering and prediction of multiple functional data.
Figure 1: Time series representing the evolution of performances in 100m freestyle competitions between 10 and 20 years for 5 different swimmers, differentiated by colours.
2. Modelling
Before diving into modelling considerations, let us provide a motivational example used
throughout the article to illustrate the kind of problems we expect to tackle.
2.1 Motivation
Assume that we have observed results of swimming competitions for thousands of athletes
from 10 to 20 years old. In order to improve talent detection, one might be interested in
using those data to forecast future performances for new young swimmers (e.g. observed only between 10 and 14 years old). Such a data set is composed of thousands of irregular age-performance time series, where each swimmer has a specific number and specific locations of
their data points, as illustrated in Figure 1. These examples come from a real data set, thor-
oughly described in Section 6, which was originally the applicative motivation for developing
methods described in the present paper. The following multi-task GPs framework is tai-
lored to simultaneously allocate individuals into clusters while learning model parameters,
and then provide probabilistic predictions for future performances of any young swimmer,
by sharing information between resembling individuals through common mean trends.
2.2 Notation
To remain consistent with this illustrative example, we refer throughout to the input vari-
ables as timestamps and use the term individual as a synonym of batch or task. However,
although the temporal formulation helps intuition, the present framework still applies to the
wide range of inputs one can usually think of in GP applications. As we suppose the data
set to be composed of point-wise observations from multiple functions, the set of all indices
is denoted by I ⊂ N, which in particular contains {1, . . . , M }, the indices of the observed
individuals (i.e. the training set). Since the input values are defined over a continuum, let
us name T this input space (we can assume T ⊂ R here for simplicity). Moreover, since the
following model is defined for clustering purposes, the set of indices K = {1, . . . , K} refers
to the K different groups of individuals. For the sake of concision, the notation is shortened
as follows: for any object x, {xi }i = {x1 , . . . , xM } and {xk }k = {x1 , . . . , xK }.
• $t_i = \{t_i^1, \dots, t_i^{N_i}\}$, the set of timestamps for the $i$-th individual,
Let us stress that the input values may vary both in number and location among individuals; we speak of a common grid of timestamps when $t_i = t, \forall i \in \mathcal{I}$, and of an uncommon grid otherwise. Besides, in order to define a GP mixture model, a latent binary random vector $Z_i = (Z_{i1}, \dots, Z_{iK})^\top$ needs to be associated with each individual, indicating to which cluster it belongs. Namely, if the $i$-th individual comes from the $k$-th cluster, then $Z_{ik} = 1$ and 0 otherwise. Moreover, we assume these latent variables to come from the same multinomial distribution: $Z_i \sim \mathcal{M}(1, \pi), \forall i \in \mathcal{I}$, with a vector of mixing proportions $\pi = (\pi_1, \dots, \pi_K)^\top$ such that $\sum_{k=1}^{K} \pi_k = 1$.
• µk (·) ∼ GP(mk (·), cγk (·, ·)) is the common mean process of the k-th cluster,
• fi (·) ∼ GP(0, ξθi (·, ·)) is the specific process of the i-th individual,
Figure 2: Graphical model of the dependencies in MagmaClust, displaying the hyper-parameters (mk, γk, π, θi, σi²), the latent mean processes µk (∀k ∈ K), the latent clustering variables Zi, the individual-specific processes fi, the error terms εi, and the observed outputs yi (∀i ∈ I).
This general model depends upon several mean and covariance parameters, fixed as modelling choices, and upon a set of hyper-parameters to be estimated, $\Theta = \{\{\gamma_k\}_k, \{\theta_i\}_i, \{\sigma_i^2\}_i, \pi\}$. One can note that we assume here the error term to be individual-specific, although we could also assume it to be cluster-specific and thus indexed by k. Such a choice would result in a valid model since the upcoming developments remain tractable if we substitute εk for εi everywhere, and associate σk² I with cγk (·, ·) instead of ξθi (·, ·). A discussion about additionally available assumptions on the covariance structures follows in Section 2.4. In this paper, we seek an estimation for Θ, whereas the other objects are pre-specified in the model. For instance, the prior mean mk (·) is usually set to zero but could also integrate expert knowledge if available. Furthermore, we assume the mean processes {µk (·)}k, the individual-specific processes {fi (·)}i, the error terms {εi (·)}i, and the clustering variables {Zi}i to be mutually independent.
We display a graphical model in Figure 2 to highlight the relationships between the different components. From these hypotheses, we can naturally integrate out fi and derive the
conditional prior distribution of yi (·), providing a hierarchical formulation for the model:
$$y_i(\cdot) \mid \{Z_{ik} = 1, \mu_k(\cdot)\} \sim \mathcal{GP}\big(\mu_k(\cdot), \psi_{\theta_i, \sigma_i^2}(\cdot, \cdot)\big), \quad \forall i \in \mathcal{I}, \forall k \in \mathcal{K}.$$
As a consequence, the output processes {yi (·) | {Zi }i , {µk (·)}k }i are also independent
(conditionally to the latent variables) from one another. Although this model is expressed
in terms of infinite-dimensional GPs, we proceed to the inference using finite-dimensional
sets of observations {ti , yi }i . Therefore, we can write the joint conditional likelihood of the
model (conditioning on the inputs is omitted throughout the paper for clarity):
$$\begin{aligned} p(\{y_i\}_i \mid \{Z_i\}_i, \{\mu_k(t)\}_k, \{\theta_i\}_i, \{\sigma_i^2\}_i) &= \prod_{i=1}^{M} p(y_i \mid Z_i, \{\mu_k(t_i)\}_k, \theta_i, \sigma_i^2) \\ &= \prod_{i=1}^{M} \prod_{k=1}^{K} p(y_i \mid Z_{ik} = 1, \mu_k(t_i), \theta_i, \sigma_i^2)^{Z_{ik}} \\ &= \prod_{i=1}^{M} \prod_{k=1}^{K} \mathcal{N}\big(y_i; \mu_k(t_i), \Psi_{\theta_i, \sigma_i^2}^{t_i}\big)^{Z_{ik}}, \end{aligned}$$
where, $\forall i \in \mathcal{I}$, $\Psi_{\theta_i, \sigma_i^2}^{t_i} = \psi_{\theta_i, \sigma_i^2}(t_i, t_i) = \big[\psi_{\theta_i, \sigma_i^2}(u, v)\big]_{u, v \in t_i}$ is an $N_i \times N_i$ covariance matrix.
The mean processes being common to all individuals in a cluster, we need to evaluate their
prior distributions on the pooled grid of timestamps t:
$$p(\{\mu_k(t)\}_k \mid \{\gamma_k\}_k) = \prod_{k=1}^{K} p(\mu_k(t) \mid \gamma_k) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t); m_k(t), C_{\gamma_k}^{t}\big),$$
where $C_{\gamma_k}^{t} = c_{\gamma_k}(t, t) = \big[c_{\gamma_k}(u, v)\big]_{u, v \in t}$ is an $N \times N$ covariance matrix. Finally, the joint distribution of the clustering latent variables also factorises over the individuals:
distribution of the clustering latent variables also factorises over the individuals:
$$p(\{Z_i\}_i \mid \pi) = \prod_{i=1}^{M} p(Z_i \mid \pi) = \prod_{i=1}^{M} \mathcal{M}(Z_i; 1, \pi) = \prod_{i=1}^{M} \prod_{k=1}^{K} \pi_k^{Z_{ik}}.$$
From all these expressions, the complete-data likelihood of the model can be derived:
$$\begin{aligned} p(\{y_i\}_i, \{Z_i\}_i, \{\mu_k(t)\}_k \mid \Theta) &= p(\{\mu_k(t)\}_k \mid \{\gamma_k\}_k) \prod_{i=1}^{M} p(y_i \mid Z_i, \{\mu_k(t_i)\}_k, \theta_i, \sigma_i^2)\, p(Z_i \mid \pi) \\ &= \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t); m_k(t), C_{\gamma_k}^{t}\big) \prod_{i=1}^{M} \prod_{k=1}^{K} \Big[ \pi_k\, \mathcal{N}\big(y_i; \mu_k(t_i), \Psi_{\theta_i, \sigma_i^2}^{t_i}\big) \Big]^{Z_{ik}}. \end{aligned}$$
In the algorithm Magma (Leroy et al., 2022), the multi-task aspect is mainly supported
by the mean process, although the model also allows information sharing among individuals
through the covariance structure. These two aspects being constructed independently, we
could think of the model as potentially double multi-task, both in mean and covariance.
More precisely, if we assume $\{\theta_i, \sigma_i^2\} = \{\theta_0, \sigma_0^2\}, \forall i \in \mathcal{I}$, then all $f_i$ are assumed to be different realisations of the same GP, and thus all individuals contribute to the es-
timation of the common hyper-parameters. Hence, such an assumption that may appear
restrictive at first glance actually offers a valuable way to share common patterns between
tasks. Furthermore, the same kind of hypothesis can be proposed at the cluster level with
{γk }k = γ0 , ∀k ∈ K. In this case, we would assume that all the clusters’ mean processes
{µk }k share the same covariance structure. This property would indicate that the patterns,
                              θ0 = θi, ∀i ∈ I           θi ≠ θj, ∀i ≠ j
                              Notation   Nb of HPs      Notation   Nb of HPs
 γ0 = γk, ∀k ∈ K              H00        2              H0i        M + 1
 γk ≠ γl, ∀k ≠ l              Hk0        K + 1          Hki        M + K

Table 1: Summary of the 4 available assumptions on the hyper-parameters, with their respective shortening notation and the associated number of sets of hyper-parameters (HPs) to optimise.
or the variations of the curves, are expected to be roughly identical from one cluster to
another and that the differentiation would be mainly due to the mean values. Conversely,
different covariance structures across kernels offer additional flexibility for the clusters to
differ both in position and in trend, smoothness, or any property that could be coded in a
kernel. Speaking rather loosely, we may think of these different settings as a trade-off be-
tween flexibility and information sharing, or as a choice between an individual or collective
modelling of the covariance. Overall, our algorithm provides 4 different settings, offering a
rather wide range of assumptions for an adequate adaptation to different applicative situa-
tions. Note that the computational considerations are also of paramount importance when
it comes to optimising a likelihood over a potentially high number of parameters. Hence, we display in Table 1 a summary of the 4 different settings, providing a shortening notation along with the associated number of hyper-parameters (or sets of hyper-parameters in the case of θi and γk) that are required to be learnt in practice.
3. Inference
Although a fully Bayesian point-of-view could be taken on the learning procedure by defin-
ing prior distributions of the hyper-parameters and directly use an MCMC algorithm (Ras-
mussen and Williams, 2006; Yang et al., 2016) for approximate inference on the posteri-
ors, this approach remains computationally challenging in practice. Conversely, variational
methods have proved to be highly efficient to conduct inference in difficult GP problems
(Titsias, 2009; Hensman et al., 2013a) and may apply in our context as well. By introduc-
ing an adequate independence assumption, we are able to derive a variational formulation
leading to analytical approximations for the true hyper-posterior distributions of the latent
variables. Then, these hyper-posterior updates allow the computation of a lower bound of
the true log-likelihood, thereby specifying the E step of the VEM algorithm (Attias, 2000)
that conducts the overall inference. In the M step, we then maximise this lower bound with respect to the hyper-parameters to provide their estimates. By iterating over these two steps until convergence (pseudo-code in Algorithm 1), the
procedure is proved to reach local optima of the lower bound (Boyd and Vandenberghe,
2004), usually in a few iterations. Regarding the previous illustrative example, this learn-
ing step allocates swimmers presenting similar progression patterns into dedicated clusters
while simultaneously estimating the underlying mean trend of those clusters (E step) and
optimising all hyper-parameters controlling each specific covariance structure (M step). For
the sake of clarity, the shorthand Z = {Zi }i and µ = {µk (t)}k is used in this section when
referring to the corresponding set of latent variables.
For any variational distribution $q(Z, \mu)$, the marginal log-likelihood decomposes as
$$\log p(\{y_i\}_i \mid \Theta) = \mathrm{KL}(q \,\|\, p) + \mathcal{L}(q; \Theta), \quad (2)$$
with:
$$\mathrm{KL}(q \,\|\, p) = \int\!\!\int q(Z, \mu) \log \frac{q(Z, \mu)}{p(Z, \mu \mid \{y_i\}_i, \Theta)} \, \mathrm{d}Z \, \mathrm{d}\mu,$$
$$\mathcal{L}(q; \Theta) = - \int\!\!\int q(Z, \mu) \log \frac{q(Z, \mu)}{p(Z, \mu, \{y_i\}_i \mid \Theta)} \, \mathrm{d}Z \, \mathrm{d}\mu.$$
Therefore, we expressed the intractable log-likelihood of the model by introducing the
Kullback-Leibler (KL) divergence between the approximation q(Z, µ) and the corresponding
true distribution p(Z, µ | {yi }i , Θ). The right-hand term L(q; Θ) in (2) defines a so-called
lower bound for log p({yi }i | Θ) since a KL divergence is nonnegative by definition. This
lower bound depends both upon the approximate distribution q(·) and the hyper-parameters
Θ, while remaining tractable under adequate assumptions. By maximising L(q; Θ) alterna-
tively with respect to both quantities, optima for the hyper-parameters shall be reached.
To achieve such a procedure, the following factorisation is assumed for the approximated
distribution:
q(Z, µ) = qZ (Z)qµ (µ).
Colloquially, we could say that we impose exactly the independence property that was lacking to compute explicit hyper-posterior distributions. Such a condition restricts the family of distribu-
tions from which we choose q(·), and we now seek approximations within this family that
are as close as possible to the true hyper-posteriors.
3.1.1 E step
In the expectation step (E step) of the VEM algorithm, the lower bound of the marginal
likelihood L(q; Θ) is maximised with respect to the distribution q(·), considering that ini-
tial or previously estimated values for Θ b are available. Making use of the factorised form
previously assumed, we can derive analytical expressions for the optimal distributions over
$q_Z(Z)$ and $q_\mu(\mu)$. As computing each distribution involves taking an expectation with respect to the other one, this suggests an iterative procedure in which either the initialisation or a previous estimate serves in the current optimisation step. Therefore, two
propositions are introduced below, respectively detailing the exact derivation of the optimal
distributions qbZ (Z) and qbµ (µ) (all proofs are deferred to the corresponding Section 8).
Proposition 1 Assume that the hyper-parameters $\hat{\Theta}$ and the variational distribution $\hat{q}_\mu(\mu) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t); \hat{m}_k(t), \hat{C}_k^{t}\big)$ are known. The optimal variational approximation $\hat{q}_Z(Z)$ of the true hyper-posterior $p(Z \mid \{y_i\}_i, \hat{\Theta})$ factorises as a product of multinomial distributions:
$$\hat{q}_Z(Z) = \prod_{i=1}^{M} \mathcal{M}\big(Z_i; 1, \tau_i = (\tau_{i1}, \dots, \tau_{iK})^\top\big), \quad (3)$$
where:
$$\tau_{ik} = \frac{\hat{\pi}_k\, \mathcal{N}\big(y_i; \hat{m}_k(t_i), \Psi_{\hat{\theta}_i, \hat{\sigma}_i^2}^{t_i}\big) \exp\Big(-\frac{1}{2} \mathrm{tr}\big( (\Psi_{\hat{\theta}_i, \hat{\sigma}_i^2}^{t_i})^{-1} \hat{C}_k^{t_i} \big)\Big)}{\sum_{l=1}^{K} \hat{\pi}_l\, \mathcal{N}\big(y_i; \hat{m}_l(t_i), \Psi_{\hat{\theta}_i, \hat{\sigma}_i^2}^{t_i}\big) \exp\Big(-\frac{1}{2} \mathrm{tr}\big( (\Psi_{\hat{\theta}_i, \hat{\sigma}_i^2}^{t_i})^{-1} \hat{C}_l^{t_i} \big)\Big)}, \quad \forall i \in \mathcal{I}, \forall k \in \mathcal{K}. \quad (4)$$
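In practice, equation (4) amounts to evaluating K Gaussian densities and trace terms; a hedged R sketch (the function and argument names are ours, not the package API) computing these responsibilities in log-space for one individual could look as follows, where m_k and C_k are lists of hyper-posterior means and covariances restricted to t_i, Psi_i the individual covariance matrix, and pi_k the mixing proportions.

```r
# Sketch of the responsibilities tau_ik of equation (4) for one individual i,
# using the log-sum-exp trick to avoid numerical underflow (names are ours).
library(mvtnorm)

compute_tau_i <- function(y_i, m_k, C_k, Psi_i, pi_k) {
  K <- length(pi_k)
  log_w <- vapply(seq_len(K), function(k) {
    log(pi_k[k]) +
      dmvnorm(y_i, mean = m_k[[k]], sigma = Psi_i, log = TRUE) -
      0.5 * sum(diag(solve(Psi_i, C_k[[k]])))   # trace correction of (4)
  }, numeric(1))
  w <- exp(log_w - max(log_w))
  w / sum(w)
}
```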
Proposition 2 Assume that the hyper-parameters $\hat{\Theta}$ and the variational distribution $\hat{q}_Z(Z) = \prod_{i=1}^{M} \mathcal{M}(Z_i; 1, \tau_i)$ are known. The optimal variational approximation $\hat{q}_\mu(\mu)$ of the true hyper-posterior $p(\mu \mid \{y_i\}_i, \hat{\Theta})$ factorises as a product of multivariate Gaussian distributions:
$$\hat{q}_\mu(\mu) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t); \hat{m}_k(t), \hat{C}_k^{t}\big), \quad (5)$$
with:
• $\hat{C}_k^{t} = \Big( (C_{\hat{\gamma}_k}^{t})^{-1} + \sum_{i=1}^{M} \tau_{ik} \tilde{\Psi}_i^{-1} \Big)^{-1}, \forall k \in \mathcal{K}$,
• $\hat{m}_k(t) = \hat{C}_k^{t} \Big( (C_{\hat{\gamma}_k}^{t})^{-1} m_k(t) + \sum_{i=1}^{M} \tau_{ik} \tilde{\Psi}_i^{-1} \tilde{y}_i \Big), \forall k \in \mathcal{K}$,
• $\tilde{\Psi}_i = \big[ \mathbb{1}_{[t, t' \in t_i]} \times \psi_{\hat{\theta}_i, \hat{\sigma}_i^2}(t, t') \big]_{t, t' \in t}$ (an $N \times N$ matrix).
Notice that the forced factorisation we assumed between Z and µ for approximation
purpose additionally offers an induced independence between individuals as indicated by
the factorisation in (3), and between clusters in (5).
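To make these updates concrete, a possible R sketch for a single cluster is given below, assuming for simplicity a common grid of timestamps so that each Ψ̃i reduces to the directly invertible matrix Ψ^t of the corresponding individual; all names are ours and the zero-padding required for uncommon grids is omitted.

```r
# Hyper-posterior updates of Proposition 2 for cluster k under a common grid
# (simplifying assumption for readability; names are ours, not the package API).
update_mu_k <- function(C_prior, m0, Psi_list, y_list, tau_k) {
  precision <- solve(C_prior)                       # (C^t_{gamma_k})^{-1}
  weighted  <- precision %*% m0
  for (i in seq_along(Psi_list)) {
    P_inv     <- solve(Psi_list[[i]])
    precision <- precision + tau_k[i] * P_inv
    weighted  <- weighted + tau_k[i] * P_inv %*% y_list[[i]]
  }
  C_hat <- solve(precision)                         # hyper-posterior covariance
  list(m_hat = as.vector(C_hat %*% weighted),       # hyper-posterior mean
       C_hat = C_hat)
}
```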
3.1.2 M step
At this point, we have fixed an estimation for q(·) in the lower bound that shall serve to
handle the maximisation of L(b q , Θ) with respect to the hyper-parameters. This maximisa-
tion step (M step) depends on the initial assumptions on the generative model (Section 2.4),
resulting in four different versions for the VEM algorithm (the E step is common to all of
them, the branching point is here).
Proposition 3 Assume the variational distributions $\hat{q}_Z(Z) = \prod_{i=1}^{M} \mathcal{M}(Z_i; 1, \tau_i)$ and $\hat{q}_\mu(\mu) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t); \hat{m}_k(t), \hat{C}_k^{t}\big)$ to be known. For a set $\Theta = \{\{\gamma_k\}_k, \{\theta_i\}_i, \{\sigma_i^2\}_i, \pi\}$, the optimal values are given by:
$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} \, \mathbb{E}_{\{Z, \mu\}}\big[ \log p(\{y_i\}_i, Z, \mu \mid \Theta) \big],$$
where $\mathbb{E}_{\{Z, \mu\}}$ indicates an expectation taken with respect to $\hat{q}_\mu(\mu)$ and $\hat{q}_Z(Z)$. In particular, optimal values for $\pi$ can be computed explicitly with:
$$\hat{\pi}_k = \frac{1}{M} \sum_{i=1}^{M} \tau_{ik}, \quad \forall k \in \mathcal{K}.$$
The remaining hyper-parameters are estimated by numerically solving the following maximisation problems, according to the situation. Let us note:
$$\mathcal{L}_k(x; m, S) = \log \mathcal{N}(x; m, S) - \frac{1}{2} \mathrm{tr}\big( \hat{C}_k^{t} S^{-1} \big),$$
$$\mathcal{L}_i(x; m, S) = \sum_{k=1}^{K} \tau_{ik} \Big( \log \mathcal{N}(x; m, S) - \frac{1}{2} \mathrm{tr}\big( \hat{C}_k^{t_i} S^{-1} \big) \Big).$$
Then, according to the hypothesis on the covariance structures:
• $\hat{\gamma}_0 = \operatorname*{argmax}_{\gamma_0} \sum_{k=1}^{K} \mathcal{L}_k\big(\hat{m}_k(t); m_k(t), C_{\gamma_0}^{t}\big)$, or $\hat{\gamma}_k = \operatorname*{argmax}_{\gamma_k} \mathcal{L}_k\big(\hat{m}_k(t); m_k(t), C_{\gamma_k}^{t}\big), \forall k \in \mathcal{K}$,
• $(\hat{\theta}_0, \hat{\sigma}_0^2) = \operatorname*{argmax}_{\theta_0, \sigma_0^2} \sum_{i=1}^{M} \mathcal{L}_i\big(y_i; \hat{m}_k(t_i), \Psi_{\theta_0, \sigma_0^2}^{t_i}\big)$, or $(\hat{\theta}_i, \hat{\sigma}_i^2) = \operatorname*{argmax}_{\theta_i, \sigma_i^2} \mathcal{L}_i\big(y_i; \hat{m}_k(t_i), \Psi_{\theta_i, \sigma_i^2}^{t_i}\big), \forall i \in \mathcal{I}$.
Let us stress that, for each sub-case, explicit gradients are available for the functions to
maximise, facilitating the optimisation process with gradient-based methods (Hestenes and
Stiefel, 1952; Bengio, 2000). The current version of our code implements those gradients and
makes use of them within the L-BFGS-B algorithm (Nocedal, 1980; Morales and Nocedal,
2011) devoted to the numerical maximisation. As previously discussed, the hypothesis Hki
necessitates to learn M + K sets of hyper-parameters. However, we notice in Proposition 3
that the factorised forms defined as the sum of a Gaussian log-likelihoods and trace terms
offer a way to operate the maximisations in parallel on simple functions. Conversely, for
the hypothesis H00 , only 2 sets of hyper-parameters need to be optimised, namely γ0 , and
{θ0 , σ02 }. The small number of functions to maximise is explained by the fact that they are
defined as larger sums over all individuals (respectively all clusters). Moreover, this context
highlights a multi-task pattern in covariance structures, since each individual (respectively
cluster) contributes to the learning of shared hyper-parameters. In practice, H00 is far easier
to manage, and we generally reach robust optima in a few iterations. On the contrary, the
settings with many hyper-parameters to learn, using mechanically less data for each, may
lead more often to computational burden or pathological results. The remaining hypotheses,
H0i and Hk0 , are somehow middle ground situations between the two extremes and might
be used as a compromise according to the problem being dealt with.
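To illustrate how such a numerical maximisation might be carried out under H00, a hedged R sketch of the corresponding objective and its optimisation through optim() with the L-BFGS-B method is given below. It reuses the eq_kernel helper sketched earlier, assumes a common grid for readability, omits the analytical gradients mentioned above, and all names are ours rather than the package's.

```r
# Sketch of the H00 M-step objective: sum over individuals of L_i at shared
# hyper-parameters theta_0 = (v, l) and sigma_0^2, log-parametrised to enforce
# positivity. tau: M x K responsibilities; m_hat, C_hat: Proposition 2 outputs
# per cluster, all on a common grid t_grid. Names are ours.
library(mvtnorm)

neg_objective <- function(par, t_grid, y_list, tau, m_hat, C_hat) {
  v <- exp(par[1]); l <- exp(par[2]); sigma2 <- exp(par[3])
  Psi     <- eq_kernel(t_grid, t_grid, v, l) + sigma2 * diag(length(t_grid))
  Psi_inv <- solve(Psi)
  total <- 0
  for (i in seq_along(y_list)) {
    for (k in seq_len(ncol(tau))) {
      total <- total + tau[i, k] *
        (dmvnorm(y_list[[i]], mean = m_hat[[k]], sigma = Psi, log = TRUE) -
           0.5 * sum(diag(Psi_inv %*% C_hat[[k]])))
    }
  }
  -total  # optim() minimises, so return the negative objective
}

# Given E-step outputs (y_list, tau, m_hat, C_hat) on the grid t_grid:
# fit <- optim(par = log(c(1, 1, 0.1)), fn = neg_objective, method = "L-BFGS-B",
#              t_grid = t_grid, y_list = y_list, tau = tau,
#              m_hat = m_hat, C_hat = C_hat)
```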
3.2 Initialisation
Below some modelling choices are discussed, in particular the initialisation of some quanti-
ties involved in the VEM algorithm:
• {mk (·)}k ; the mean functions from the hyper-prior distributions of the associated
mean processes {µk (·)}k . As it may be difficult to pre-specify meaningful values in
the absence of external or expert knowledge, these values are often set to 0, although it remains possible to integrate prior information into the model through them. However, as exhibited in Proposition 2, the influence of {mk (·)}k in hyper-posterior computations decreases rapidly as M grows in a multi-task framework.
• {γk}k, {θi}i and {σi²}i; the kernel hyper-parameters. We already discussed that the
form itself of kernels has to be chosen as well, but once set, we would advise initiating
{γk }k and {θi }i with close and reasonable values whenever possible. As usual in
GP models, nearly singular covariance matrices and numerical instability may occur
for pathological initialisations, in particular for the hypotheses, like Hki , with many
hyper-parameters to learn. This behaviour frequently occurs in the GP framework,
and one way to handle this issue is to add a so-called jitter term (Bernardo et al.,
1998) on the diagonal of the ill-defined covariance matrices.
• {τik}i,k; the estimated individual membership probabilities (or π; the prior vector of clusters' proportions). Both quantities are valid initialisations depending on whether we start the VEM iterations with an E step or an M step. If we only want to set the initial proportions of each cluster in the absence of additional information, we may merely specify π and start with an E step. Otherwise, if we insert the results from a previous clustering algorithm as an initialisation, the probabilities τik for each individual and cluster can be fully specified before proceeding to an M step (or to the computation of $\hat{q}_\mu(\mu)$ and then the M step).
Let us finally stress that the convergence (to local maxima) of VEM algorithms partly
depends on these initialisations. Different strategies have been proposed in the literature to
manage this issue, among which simulated annealing (Ueda and Nakano, 1998) or repeated
short runs (Biernacki et al., 2003).
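To make the clustering-based initialisation mentioned above concrete, one possible R sketch converting a preliminary k-means partition into initial probabilities τik is given below, assuming the curves can be evaluated or interpolated on a shared grid; the helper is ours, not the package API.

```r
# Convert a hard k-means partition of the curves into initial probabilities
# tau_ik (one-hot rows). y_interp: M x N matrix of curves on a common grid.
init_tau_kmeans <- function(y_interp, K) {
  cl  <- stats::kmeans(y_interp, centers = K, nstart = 10)$cluster
  tau <- matrix(0, nrow = nrow(y_interp), ncol = K)
  tau[cbind(seq_len(nrow(y_interp)), cl)] <- 1
  tau
}
```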
3.3 Pseudocode
The overall algorithm is called MagmaClust (as an extension of the algorithm Magma to
cluster-specific mean GPs) and we provide below the pseudo-code summarising the inference
procedure. The corresponding R code is available at https://github.com/ArthurLeroy/
MAGMAclust.
Algorithm 1 MagmaClust: Variational EM algorithm
Initialise $\{m_k(t)\}_k$, $\Theta = \{\{\gamma_k\}_k, \{\theta_i\}_i, \{\sigma_i^2\}_i\}$ and $\{\tau_i^{\mathrm{ini}}\}_i$ (or $\pi$).
while not converged do
    E step: Optimise $\mathcal{L}(q; \hat{\Theta})$ w.r.t. $q(\cdot)$:
        $\hat{q}_Z(Z) = \prod_{i=1}^{M} \mathcal{M}(Z_i; 1, \tau_i)$,
        $\hat{q}_\mu(\mu) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t); \hat{m}_k(t), \hat{C}_k^{t}\big)$.
    M step: Optimise $\mathcal{L}(\hat{q}; \Theta)$ w.r.t. $\Theta$:
        $\hat{\Theta} = \operatorname*{argmax}_{\Theta} \mathbb{E}_{\{Z, \mu\}}\big[ \log p(\{y_i\}_i, Z, \mu \mid \Theta) \big]$.
end while
return $\hat{\Theta}$, $\{\tau_i\}_i$, $\{\hat{m}_k(t), \hat{C}_k^{t}\}_k$.
Proposition 4 A variational BIC (VBIC), accounting for the number of hyper-parameters to estimate, can be derived from the optimised lower bound in order to select the number of clusters K:
$$\begin{aligned} \mathrm{VBIC} = & \sum_{i=1}^{M} \sum_{k=1}^{K} \tau_{ik} \bigg[ \log \mathcal{N}\big(y_i; \hat{m}_k(t_i), \Psi_{\hat{\theta}_i, \hat{\sigma}_i^2}^{t_i}\big) - \frac{1}{2} \mathrm{tr}\big( \hat{C}_k^{t_i} (\Psi_{\hat{\theta}_i, \hat{\sigma}_i^2}^{t_i})^{-1} \big) + \log \frac{\hat{\pi}_k}{\tau_{ik}} \bigg] \\ & + \sum_{k=1}^{K} \bigg[ \log \mathcal{N}\big(\hat{m}_k(t); m_k(t), C_{\hat{\gamma}_k}^{t}\big) - \frac{1}{2} \mathrm{tr}\big( \hat{C}_k^{t} (C_{\hat{\gamma}_k}^{t})^{-1} \big) + \frac{1}{2} \Big( \log \big|\hat{C}_k^{t}\big| + N \log 2\pi + N \Big) \bigg] \\ & - \frac{\alpha_i + \alpha_k + (K - 1)}{2} \log M. \end{aligned}$$
Let us mention that the numbers αi and αk in the penalty term vary according to the
considered modelling hypothesis (H00 , Hk0 , H0i or Hki ), see Section 2.4 for details.
4. Prediction
At this point, we consider that the inference on the model is completed, since the training data set of observed individuals {yi}i enabled us to estimate the desired hyper-parameters and the distributions of latent variables. For the sake of concision, we thus omit conditioning on $\hat{\Theta}$ in the notation of the sequel. Recalling our illustrative example, we would have used competition results over a long period from thousands of swimmers for
training the model, and we now expect to make predictions of future performances for any
young athlete in the early stages of their career. Therefore, let us assume the partial obser-
vation of a new individual, denoted by the index ∗, for whom we collected a few data points
y∗ (t∗ ) at timestamps t∗ . Defining a multi-task GPs mixture prediction consists in seeking
an analytical distribution p(y∗ (·) | y∗ (t∗ ), {yi }i ), according to the information brought by:
its own observations; the training data set; the cluster structure among individuals. As we
aim at studying the output values $y_*(\cdot)$ at arbitrarily chosen timestamps, say $t^p$ (the index $p$ stands for prediction), we propose a new notation for the pooled vector of timestamps $t^p_* = \begin{pmatrix} t^p \\ t_* \end{pmatrix}$. This vector serves as a working grid on which the different distributions involved
in the prediction procedure are evaluated. In the absence of external restrictions, we would
strongly advise to include the observed timestamps of all training individuals, t, within tp∗ ,
since evaluating the processes at these locations allows for sharing information across tasks.
Otherwise, any data points defined on timestamps outside of the working grid would be
discarded from the multi-task aspect of the model. In particular, if tp∗ = t, we may even
use directly the variational distribution qµ (µ) computed in the VEM algorithm, and thus
skip one step of the prediction procedure that is described below. Throughout the section,
we aim at defining a probabilistic prediction for this new individual, accounting for the in-
formation of all training data {yi }i . To this end, we manipulate several distributions of the
            tp∗ = t              tp∗ ≠ t
 H00        2-3bis-4-5           1-2-3bis-4-5
 Hk0        2-3bis-4-5           1-2-3bis-4-5
 H0i        2-3-4-5              1-2-3-4-5
 Hki        2-3-4-5              1-2-3-4-5

Table 2: Summary of the different steps to perform in the prediction procedure, according to the model assumptions and the target grid of timestamps.
type p(· | {yi}i) and refer to them with the adjective multi-task. In addition to highlight-
ing the information-sharing aspect across individuals, this term allows us to distinguish the
role of {yi }i from the one of the newly observed data y∗ (t∗ ), which are now the reference
data for establishing if a distribution is called a prior or a posterior. Deriving a predictive
distribution in our multi-task GP framework requires to complete the following steps.
1. Compute the hyper-posterior approximation of {µk (·)}k at tp∗: q̂µ ({µk (tp∗)}k),
2. Deduce the multi-task prior distribution p(y∗ (tp∗) | Z∗, {yi}i) (Proposition 5),
3. Compute the new hyper-parameters {θ∗, σ∗²} and p(Z∗ | y∗ (t∗), {yi}i) via an EM,
3bis. Assign θ∗ = θ0, σ∗² = σ0² and compute directly p(Z∗ | y∗ (t∗), {yi}i),
4. Compute the cluster-specific multi-task posterior distributions p(y∗ (tp) | Z∗k = 1, y∗ (t∗), {yi}i), ∀k ∈ K,
5. Deduce the multi-task GPs mixture prediction: p(y∗ (tp) | y∗ (t∗), {yi}i).
We already discussed the influence of the initial modelling hypotheses on the overall
procedure. Hence, Table 2 displays a quick reminder helping to keep track of
which steps need to be performed in each context.
Following step 1, the variational approximation of the mean processes' hyper-posterior is recomputed on the working grid as $\hat{q}_\mu(\{\mu_k(t^p_*)\}_k) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k(t^p_*); \hat{m}_k(t^p_*), \hat{C}_k^{t^p_*}\big)$, with:
• $\hat{C}_k^{t^p_*} = \Big( (C_{\hat{\gamma}_k}^{t^p_*})^{-1} + \sum_{i=1}^{M} \tau_{ik} \tilde{\Psi}_i^{-1} \Big)^{-1}, \forall k \in \mathcal{K}$,
• $\hat{m}_k(t^p_*) = \hat{C}_k^{t^p_*} \Big( (C_{\hat{\gamma}_k}^{t^p_*})^{-1} m_k(t^p_*) + \sum_{i=1}^{M} \tau_{ik} \tilde{\Psi}_i^{-1} \tilde{y}_i \Big), \forall k \in \mathcal{K}$.
We acknowledge that the subsequent analytical developments partly rely on this variational approximate distribution $\hat{q}_\mu(\{\mu_k(t^p_*)\}_k)$, and may thus be considered, in a sense, as approximated as well. However, this quantity provides a valuable closed-form expression that we can substitute for the true hyper-posterior in Proposition 5 below, and we may regularly substitute one for the other in the sequel, depending on which is handier in context. Once the mean processes' distributions are recomputed on the working grid, their underlying influence can be directly plugged into a marginalised multi-task prior over $y_*(t^p_*)$ by integrating out the $\{\mu_k(t^p_*)\}_k$. As the mean processes vanish, the new individual's outputs $y_*(t^p_*)$ directly depend upon the training data set $\{y_i\}_i$, as highlighted in the proposition below.
Proposition 5 For a set of timestamps $t^p_*$, the multi-task prior distribution of $y_*$ knowing its clustering latent variable is given by:
$$p(y_*(t^p_*) \mid Z_*, \{y_i\}_i) = \prod_{k=1}^{K} \mathcal{N}\big(y_*(t^p_*); \hat{m}_k(t^p_*), \hat{C}_k^{t^p_*} + \Psi_{\theta_*, \sigma_*^2}^{t^p_*}\big)^{Z_{*k}}. \quad (6)$$
Proof Let us recall that, conditionally to their mean process, the individuals are independent of one another. Then, for all $k \in \mathcal{K}$, we have:
$$\begin{aligned} p(y_*(t^p_*) \mid Z_{*k} = 1, \{y_i\}_i) &= \int p(y_*(t^p_*), \mu_k(t^p_*) \mid Z_{*k} = 1, \{y_i\}_i) \, \mathrm{d}\mu_k(t^p_*) \\ &= \int p(y_*(t^p_*) \mid \mu_k(t^p_*), Z_{*k} = 1) \underbrace{p(\mu_k(t^p_*) \mid Z_{*k} = 1, \{y_i\}_i)}_{\approx\, \hat{q}_\mu(\mu_k(t^p_*))} \, \mathrm{d}\mu_k(t^p_*) \\ &\approx \int \mathcal{N}\big(y_*(t^p_*); \mu_k(t^p_*), \Psi_{\theta_*, \sigma_*^2}^{t^p_*}\big) \, \mathcal{N}\big(\mu_k(t^p_*); \hat{m}_k(t^p_*), \hat{C}_k^{t^p_*}\big) \, \mathrm{d}\mu_k(t^p_*) \\ &= \mathcal{N}\big(y_*(t^p_*); \hat{m}_k(t^p_*), \hat{C}_k^{t^p_*} + \Psi_{\theta_*, \sigma_*^2}^{t^p_*}\big). \end{aligned}$$
The final line is obtained by remarking that such a convolution of Gaussian distributions
remains Gaussian as well (Bishop, 2006, Chapter 2), and we refer to Leroy et al. (2022) for
the detailed calculus in this exact context. Therefore, we finally get:
$$p(y_*(t^p_*) \mid Z_*, \{y_i\}_i) = \prod_{k=1}^{K} p(y_*(t^p_*) \mid Z_{*k} = 1, \{y_i\}_i)^{Z_{*k}} = \prod_{k=1}^{K} \mathcal{N}\big(y_*(t^p_*); \hat{m}_k(t^p_*), \hat{C}_k^{t^p_*} + \Psi_{\theta_*, \sigma_*^2}^{t^p_*}\big)^{Z_{*k}}.$$
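The Gaussian convolution identity used in the proof can be verified numerically; the short R sketch below (not from the paper, values arbitrary) compares a Monte Carlo estimate of the integral with the closed form on a 3-point grid, reusing the eq_kernel helper sketched earlier.

```r
# Monte Carlo check: integrating N(y; mu, Psi) against N(mu; m, C) over mu
# should match N(y; m, C + Psi).
library(mvtnorm)
set.seed(1)

m   <- c(0, 1, 2)
C   <- eq_kernel(1:3, 1:3, v = 1, l = 1)
Psi <- eq_kernel(1:3, 1:3, v = 0.5, l = 2) + 0.1 * diag(3)
y   <- c(0.2, 0.8, 2.5)

mu_samples <- rmvnorm(1e5, mean = m, sigma = C)
mc  <- mean(apply(mu_samples, 1, function(mu) dmvnorm(y, mean = mu, sigma = Psi)))
ref <- dmvnorm(y, mean = m, sigma = C + Psi)
c(monte_carlo = mc, closed_form = ref)  # the two values should roughly agree
```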
4.3.1 E step
In the E step, hyper-parameter estimates are assumed to be known. Recalling that the latent clustering variable $Z_*$ is independent from the training data $\{y_i\}_i$, the multi-task hyper-posterior distribution maintains an explicit derivation:
$$\begin{aligned} p(Z_* \mid y_*(t_*), \{y_i\}_i, \hat{\theta}_*, \hat{\sigma}_*^2, \hat{\pi}) &\propto p(y_*(t_*) \mid Z_*, \{y_i\}_i, \hat{\theta}_*, \hat{\sigma}_*^2) \, p(Z_* \mid \hat{\pi}) \\ &\propto \prod_{k=1}^{K} \mathcal{N}\big(y_*(t_*); \hat{m}_k(t_*), \hat{C}_k^{t_*} + \Psi_{\hat{\theta}_*, \hat{\sigma}_*^2}^{t_*}\big)^{Z_{*k}} \prod_{l=1}^{K} \hat{\pi}_l^{Z_{*l}} \\ &\propto \prod_{k=1}^{K} \Big[ \hat{\pi}_k \, \mathcal{N}\big(y_*(t_*); \hat{m}_k(t_*), \hat{C}_k^{t_*} + \Psi_{\hat{\theta}_*, \hat{\sigma}_*^2}^{t_*}\big) \Big]^{Z_{*k}}. \end{aligned}$$
By inspection, we recognise the form of a multinomial distribution and thus retrieve the corresponding normalisation constant to deduce:
$$p(Z_* \mid y_*(t_*), \{y_i\}_i, \hat{\theta}_*, \hat{\sigma}_*^2, \hat{\pi}) = \mathcal{M}\big(Z_*; 1, \tau_* = (\tau_{*1}, \dots, \tau_{*K})^\top\big), \quad (7)$$
with:
$$\tau_{*k} = \frac{\hat{\pi}_k \, \mathcal{N}\big(y_*(t_*); \hat{m}_k(t_*), \hat{C}_k^{t_*} + \Psi_{\hat{\theta}_*, \hat{\sigma}_*^2}^{t_*}\big)}{\sum_{l=1}^{K} \hat{\pi}_l \, \mathcal{N}\big(y_*(t_*); \hat{m}_l(t_*), \hat{C}_l^{t_*} + \Psi_{\hat{\theta}_*, \hat{\sigma}_*^2}^{t_*}\big)}, \quad \forall k \in \mathcal{K}. \quad (8)$$
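As equation (8) only involves K Gaussian densities at the new observations, it can be implemented in a few lines; a hedged R sketch (names ours) follows.

```r
# Membership probabilities (8) for a new, partially observed individual:
# m_k_star / C_k_star: hyper-posterior means and covariances at t_star,
# Psi_star: covariance at t_star for the new individual's hyper-parameters,
# pi_hat: mixing proportions. Computed in log-space for stability.
library(mvtnorm)

compute_tau_star <- function(y_star, m_k_star, C_k_star, Psi_star, pi_hat) {
  log_w <- vapply(seq_along(pi_hat), function(k) {
    log(pi_hat[k]) +
      dmvnorm(y_star, mean = m_k_star[[k]],
              sigma = C_k_star[[k]] + Psi_star, log = TRUE)
  }, numeric(1))
  w <- exp(log_w - max(log_w))
  w / sum(w)
}
```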
4.3.2 M step
Assuming the value of $\tau_*$ to be known, we may derive optimal values for the hyper-parameters of the new individual through the following maximisation:
$$\{\hat{\theta}_*, \hat{\sigma}_*^2\} = \operatorname*{argmax}_{\theta_*, \sigma_*^2} \, \mathbb{E}_{Z_*}\big[ \log p(y_*(t_*), Z_* \mid \{y_i\}_i, \theta_*, \sigma_*^2, \hat{\pi}) \big].$$
3bis. In the case where the hyper-parameters are supposed to be common across individuals (H00 or Hk0), there is no need for additional optimisation since we already have $\hat{\theta}_* = \hat{\theta}_0$ and $\hat{\sigma}_*^2 = \hat{\sigma}_0^2$ by definition. However, the probabilities $\tau_*$ of lying in each cluster still need to be computed for the new individual, which is handled by directly using the expression (8) from the E step.
3ter. Conversely, even if hyper-parameters for each individual are supposed to be different (H0i or Hki), it remains possible to avoid the implementation of an EM algorithm by setting $\tau_* = \hat{\pi}$. Such an assumption intuitively expresses that we would guess the membership probabilities of each cluster from the previously estimated mixing proportions, without taking the new individual's observations into account. Although we would not recommend this choice for getting optimal results, it still seems worth a mention for applications with a compelling need to avoid the EM's extra computations during the prediction process.
Evaluating the multi-task prior (6) on the working grid $t^p_*$ and splitting it into the target and observed blocks yields, for each $k \in \mathcal{K}$:
$$p\bigg( \begin{pmatrix} y_*(t^p) \\ y_*(t_*) \end{pmatrix} \,\Big|\, Z_{*k} = 1, \{y_i\}_i \bigg) = \mathcal{N}\bigg( \begin{pmatrix} \hat{m}_k(t^p) \\ \hat{m}_k(t_*) \end{pmatrix}, \begin{pmatrix} \Gamma_k^{t^p, t^p} & \Gamma_k^{t^p, t_*} \\ \Gamma_k^{t_*, t^p} & \Gamma_k^{t_*, t_*} \end{pmatrix} \bigg),$$
where $\Gamma_k^{t^p, t^p} = \hat{C}_k^{t^p} + \Psi_{\theta_*, \sigma_*^2}^{t^p}$ and likewise for the other blocks of the matrices. Therefore, recalling that conditioning on the sub-vector of observed values $y_*(t_*)$ maintains a Gaussian distribution (Bishop, 2006; Rasmussen and Williams, 2006), we can derive the multi-task posterior distribution for each latent cluster:
$$p(y_*(t^p) \mid Z_{*k} = 1, y_*(t_*), \{y_i\}_i) = \mathcal{N}\big(y_*(t^p); \hat{\mu}_{*k}(t^p), \hat{\Gamma}_{*k}^{t^p}\big), \quad \forall k \in \mathcal{K}, \quad (9)$$
where:
• $\hat{\mu}_{*k}(t^p) = \hat{m}_k(t^p) + \Gamma_k^{t^p, t_*} \big(\Gamma_k^{t_*, t_*}\big)^{-1} \big(y_*(t_*) - \hat{m}_k(t_*)\big), \forall k \in \mathcal{K}$,
• $\hat{\Gamma}_{*k}^{t^p} = \Gamma_k^{t^p, t^p} - \Gamma_k^{t^p, t_*} \big(\Gamma_k^{t_*, t_*}\big)^{-1} \Gamma_k^{t_*, t^p}, \forall k \in \mathcal{K}$.
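This is the standard Gaussian conditioning formula; a small R sketch applying it to the blocks of Γ is given below (names ours; in practice one would prefer a Cholesky-based solve for numerical stability).

```r
# Cluster-specific posterior of (9): condition the joint Gaussian over
# (y(t_p), y(t_star)) on the observed values. G_pp, G_po, G_oo denote the
# blocks Gamma^{t_p,t_p}, Gamma^{t_p,t_star}, Gamma^{t_star,t_star}.
condition_gaussian <- function(m_pred, m_obs, G_pp, G_po, G_oo, y_obs) {
  gain <- G_po %*% solve(G_oo)       # Gamma^{t_p,t_*} (Gamma^{t_*,t_*})^{-1}
  list(mean = as.vector(m_pred + gain %*% (y_obs - m_obs)),
       cov  = G_pp - gain %*% t(G_po))
}
```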
Proposition 6 The multi-task GPs mixture posterior distribution for $y_*(t^p)$ has the following form:
$$p(y_*(t^p) \mid y_*(t_*), \{y_i\}_i) = \sum_{k=1}^{K} \tau_{*k} \, \mathcal{N}\big(y_*(t^p); \hat{\mu}_{*k}(t^p), \hat{\Gamma}_{*k}^{t^p}\big).$$
Proof Taking advantage of (9) and the multi-task hyper-posterior distribution of $Z_*$ as computed in (7), it is straightforward to integrate out the latent clustering variable:
$$\begin{aligned} p(y_*(t^p) \mid y_*(t_*), \{y_i\}_i) &= \sum_{Z_*} p(y_*(t^p), Z_* \mid y_*(t_*), \{y_i\}_i) \\ &= \sum_{Z_*} p(y_*(t^p) \mid Z_*, y_*(t_*), \{y_i\}_i) \, p(Z_* \mid y_*(t_*), \{y_i\}_i) \\ &= \sum_{Z_*} \prod_{k=1}^{K} \Big[ \tau_{*k} \, p(y_*(t^p) \mid Z_{*k} = 1, y_*(t_*), \{y_i\}_i) \Big]^{Z_{*k}} \\ &= \sum_{Z_*} \prod_{k=1}^{K} \Big[ \tau_{*k} \, \mathcal{N}\big(y_*(t^p); \hat{\mu}_{*k}(t^p), \hat{\Gamma}_{*k}^{t^p}\big) \Big]^{Z_{*k}} \\ &= \sum_{k=1}^{K} \tau_{*k} \, \mathcal{N}\big(y_*(t^p); \hat{\mu}_{*k}(t^p), \hat{\Gamma}_{*k}^{t^p}\big), \end{aligned}$$
where we recall for the transition to the last line that $Z_{*k} = 1$ if the $*$-th individual belongs to the $k$-th cluster and $Z_{*k} = 0$ otherwise. Hence, summing a product with only one non-zero exponent over all possible combinations of $Z_*$ is equivalent to merely summing over the values of $k$, and the variable $Z_{*k}$ simply vanishes.
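Operationally, Proposition 6 only requires combining the K cluster-specific posteriors; a brief R sketch (names ours, not the package API) computing the mixture predictive mean and the joint density of a candidate trajectory over the prediction grid is shown below.

```r
# GP mixture prediction of Proposition 6: predictive mean and density of a
# candidate vector y_cand, given the cluster-specific posterior means
# (mu_list), covariances (cov_list) and membership probabilities tau_star.
library(mvtnorm)

mixture_mean <- function(tau_star, mu_list) {
  Reduce(`+`, Map(`*`, tau_star, mu_list))
}

mixture_density <- function(y_cand, tau_star, mu_list, cov_list) {
  Reduce(`+`, lapply(seq_along(tau_star), function(k) {
    tau_star[k] * dmvnorm(y_cand, mean = mu_list[[k]], sigma = cov_list[[k]])
  }))
}
```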
5. Computational Complexity

In the learning step, each iteration of the VEM algorithm requires the inversion of the M individual covariance matrices (of sizes Ni × Ni) and of the K covariance matrices of the mean processes (of size N × N), resulting in a complexity roughly in O(M × Ni³ + K × N³). For a large number of individuals and small common grids of timestamps for instance, the left-hand term would control the com-
plexity, and clustering’s additional cost would be negligible. Conversely, for a relatively low
number of individuals or a large size N for the pooled grid of timestamps, the right-hand
term becomes the primary burden, and the computing time increases proportionally to the
number of clusters compared to the original Magma algorithm.
During the prediction step, the re-computation of {µk (·)}k ’s variational distributions
implies K inversions of covariance matrices with dimensions depending on the size of the
prediction grid tp∗ . In practice though, if we fix a fine grid of target timestamps in advance,
this operation can be assimilated to the learning step. In this case, the prediction complexity
remains at most in the same order as the usual learning for a single-task GP, that is
O(K × N∗3 ) (this corresponds to the estimation of the new individual’s hyper-parameters,
and would decrease to O(K × N∗2 ) for Hk0 or H00 ). We shall mention that the definition of
a fine grid is generally desirable only for low-dimensional applications since we may quickly
reach running time or memory limits as the input’s dimension grows. In many contexts,
most of the time-consuming learning steps can be performed in advance, and the immediate
prediction cost for each new individual is negligible in comparison (generally comparable to
a single-task GP prediction).
6. Experiments
The present section is dedicated to the evaluation of MagmaClust on both synthetic and
real data sets. The performance of the algorithm is assessed in regards to its clustering and
forecast abilities. To this purpose, we introduce below the simulation scheme generating the
synthetic data along with the measures used to compare our method to alternatives quanti-
tatively. Throughout, the exponentiated quadratic (EQ) kernel, as defined in Equation (1),
serves as covariance structure for both generating data and modelling. The manipulation
of more sophisticated kernels remains a topic beyond the scope of the present paper, and
the EQ proposes a fair common ground for comparison between methods. Thereby, each
kernel introduced in the sequel is associated with two hyper-parameters. Namely, v ∈ R+
represents a variance term whereas ℓ ∈ R+ specifies the lengthscale. The synthetic data sets
are generated following the general procedure below, with minor modifications according to
the model assumptions H00 , Hk0 , H0i or Hki :
1. Define a random working grid t ⊂ [ 0, 10 ] of N = 200 timestamps to study M = 50
individuals, scattered into K clusters,
2. Draw the prior mean functions for {µk (·)}k : mk (t) = at + b, ∀t ∈ t, ∀k ∈ K, where
a ∈ [ −2, 2 ] and b ∈ [ 20, 30 ],
3. Draw uniformly the hyper-parameters of {µk (·)}k's kernels: γk = {vγk, ℓγk}, ∀k ∈ K, where vγk ∈ [1, e³] and ℓγk ∈ [1, e¹] (or γ0 = {vγ0, ℓγ0}),
This procedure offers data sets for both the individuals {ti , yi }i and the underlying mean
processes {t, µk (t)}k . In the context of prediction, a new individual is generated according
to the same scheme, although its first 20 data points are assumed to be observed while the remaining 10 serve as testing values. While it may be argued that this 20-10 split is somewhat arbitrary, a more detailed analysis with varying numbers of observed points in Leroy et al. (2022) revealed a low effect on the global evaluation. Unless otherwise stated,
we fix the number of clusters to be K ∗ = 3 and the model assumption to be H00 for gener-
ating the data. Let us recall that we provided a variational-BIC formula in Proposition 4 to
select an appropriate number of clusters K from data. Therefore, this measure is evaluated
in following experiments and used for model selection purposes in the real-life application.
Besides, the adjusted rand index (ARI) (Hubert and Arabie, 1985) is used as a measure
of adequacy for comparison between the groups obtained through the clustering procedure
and the true clusters that generated the data. More specifically, the ARI is defined by
counting the proportions of matching pairs between groups, and a value of 1 represents a
perfect correspondence. One can note that the ARI still applies when it comes to evaluating
clustering partitions with different numbers of clusters. On the matter of prediction, the
mean square error (MSE) between predicted means and the true values offers a measure
of the average forecast performance. Formally, we define the MSE in prediction on the 10
testing points for the new individual as:
$$\mathrm{MSE} = \frac{1}{10} \sum_{u=21}^{30} \big( y_*^{\mathrm{pred}}(t_*^u) - y_*^{\mathrm{true}}(t_*^u) \big)^2.$$
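For reference, the corresponding evaluation helpers can be written in a couple of lines of R (ours, not the paper's code; the ARI relies on the mclust package).

```r
# MSE on the held-out points and ARI between estimated and true partitions
mse <- function(y_pred, y_true) mean((y_pred - y_true)^2)

# requires install.packages("mclust")
ari <- function(estimated, truth) mclust::adjustedRandIndex(estimated, truth)
```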
Figure 3 provides a comparison on the same data set between a classical GP regression
(top), the multi-task GP algorithm Magma (middle), and the multi-task GPs mixture
approach MagmaClust (bottom). On each sub-graph, the plain purple line represents the
mean parameter from the predictive distribution, and the pink shaded area covers the CI95 .
The dashed lines stand for the multi-task prior mean functions {m b k (·)}k resulting from the
estimation of the mean processes. The points in black are the observations for the new
individual ∗, whereas the red points constitute the true target values to forecast. Moreover,
the colourful background points depict the data of the training individuals, which we colour
according to their true cluster in MagmaClust displays (bottom). For the sake of fairness,
the prior mean for standard GP has been set as the average of training data points. As
expected, the single GP regression provides an adequate fit close to the data points before
quickly converging to the prior value when lacking information. Conversely, Magma takes
advantage of its multi-task component to share knowledge across individuals by estimating a
more relevant mean process. However, this unique mean process appears unable to account
for the clear group structure, although adequately recovering the dispersion of the data.
In the case of MagmaClust, we display the cluster-specific prediction (9) for the most probable group instead of the GPs mixture prediction, since $\max_k(\tau_{*k}) = 1$ in this example. The model selection method based on maximum VBIC values correctly retrieved the true number of clusters K = 3. To illustrate the training procedure dynamics in this example, we
provide on Figure 4 a tracking, for each iteration of the VEM algorithm until convergence, of
the evidence lower bound (ELBO) and the corresponding ARI between predicted and true
clusters. As the ELBO increases, we can notice that the ARI also improves to finally reach
1 at convergence, which means we fully retrieved the true clusters at the end of training.
Although this simple example only depicts a single-run instance, it provides an accurate
intuition on the general behaviour of MagmaClust. In practice, the algorithm quickly
improves the clustering structure and associated mean processes during the first two steps,
and generally converges in a handful of iterations.
Overall, this illustrative example highlights the benefit we can get from considering
group-structured similarities between individuals in GP predictions. Notice that our method
offers both a significant improvement in mean prediction and a narrowed uncertainty around
this value. Additionally, we display on Figure 5 the specific predictions according to the two
remaining clusters (although associated with 0 probabilities). We remark that the predic-
tions move towards the cluster-specific mean processes as soon as the observations become
too distant. In this idealised example, we displayed Gaussian predictive distributions for
convenience though, in general, a Gaussian mixture is rarely unimodal. Therefore, we propose in Figure 6 another example with a higher variance and overlapping groups, where the VBIC still provides the correct number of clusters. While the ARI between predicted and true clusters was equal to 1 (perfect match) in the previous example, it now decreases to 0.78. Moreover, the vector of membership probabilities associated with Figure 6 for the predicted individual happens to be: τ∗ = (0.95, 0.05, 0). The left-hand
graph provides an illustration of the predictive mean, acquired from the multi-task GPs
mixture distribution described in Proposition 6. We may notice that this curve lies very
Figure 3: Prediction curves (purple) with associated 95% credible intervals (pink) from GP
regression (top), Magma (middle) and MagmaClust (bottom). The dashed
lines represent the mean parameters from the mean processes estimates. Observed
data points are in black, testing data points are in red. Backward points are the
observations from the training data set, coloured relatively to individuals (middle)
or clusters (bottom).
Figure 4: Left: Values of the evidence lower bound (ELBO) for successive iterations of
the VEM algorithm during MagmaClust training. Right: The corresponding
ARI values between predicted and true clusters for the same iterations. In this
illustrative example, the algorithm reached convergence after 7 iterations only.
Figure 5: Cluster-specific prediction curves (purple) with associated 95% credible intervals
(pink) from MagmaClust, for two unlikely clusters. The dashed lines represent
the mean parameters from the mean processes estimates. Observed data points
are in black, testing data points are in red. Backward points are the observations
from the training data set, coloured by clusters.
Figure 6: Left: GPs mixture mean prediction curve (blue) from MagmaClust. Right:
heatmap of probabilities for the GPs mixture predictive distribution from Mag-
maClust. The dashed lines represent the mean parameters from the mean pro-
cesses estimates. Observed data points are in black, testing data points are in
red. Backward points are the observations from the training data set, coloured
by clusters.
close to one cluster’s mean although not completely overlapping it, because of the τ∗k = 0.05
probability for another cluster, which slightly pulls the prediction onto its own mean. Be-
sides, the right-hand graph of Figure 6 proposes a representation of the multi-task GPs
mixture distribution as a heatmap of probabilities for the location of our predictions. This
way, we can display, even in this multi-modal context, a thorough visual quantification for
both the dispersion of the predicted values and the confidence we may grant to each of them.
Figure 7: Left: fuzzy case. Right: well-separated case. Curves of the simulated underly-
ing mean processes, coloured by clusters. The dashed lines represent the mean
parameters from the mean processes estimates. Backward points are the obser-
vations from the training data set, coloured by clusters.
The clustering performances of MagmaClust are compared to those of alternative curve clustering methods: the B-splines expansion associated with a k-means algorithm proposed in Abraham et al. (2003),
funHDDC (Bouveyron and Jacques, 2011), and funFEM (Bouveyron et al., 2015). The two
latter methods were introduced to handle curve clustering problems, by taking advantage of
a functional latent mixture modelling, and demonstrated their ability in several applications
(Schmutz et al., 2020). A naive multivariate k-means is used as initialisation for funHDDC,
funFEM, and MagmaClust. We propose on Figure 8 an evaluation of each algorithm in
terms of ARI over 100 data sets, simulated from various schemes. First, the procedure
detailed in Section 6 is applied for each of the 4 different hypotheses (Hki , Hk0 , H0i , H00 )
to generate data in accordance with our modelling assumptions. Additionally, we propose
an alternative simulation scheme, inspired by Bouveyron et al. (2015), to compare perfor-
mances on data sets that are not tailored for our method. We name this procedure Scheme
A, and each of the 100 data sets is made of 50 curves, generated randomly, allocated into
4 clusters, and observed at 30 common time points such that:
where U ∼ U([0, 1]) and ε ∼ N (0, 0.05). In all situations, MagmaClust seems to outper-
form the alternatives. In particular, our approach provides ARI values consistently close to
1 and a lower variance in all contexts. Furthermore, while performances of the other meth-
ods are expected to deteriorate because of additional smoothing procedures in the case of
Figure 8: Adjusted rand index (ARI) values between the true clusters and the partitions
estimated by k-means, funHDDC, funFEM, and MagmaClust. The value of K
is set to the true number of clusters for all methods. The ARI is computed on
100 data sets for each assumption Hki , Hk0 , H0i , H00 , and Scheme A.
Figure 9: Adjusted Rand Index values between the true clusters and the partitions esti-
mated by MagmaClust with respect to the values of K used as setting. The
ARI is computed on the same 100 data sets for each value of K. (3∗ : the true
number of clusters for all data sets)
                    Selected K
              M = 50                        M = 100
True K∗    1    2    3    4    5    6    1    2    3    4    5    6
1        100    0    0    0    0    0  100    0    0    0    0    0
2         10   90    0    0    0    0    2   96    2    0    0    0
3          0   22   74    2    2    0    2   14   84    0    0    0
4          4   10   16   58   10    2    2   10    8   80    0    0
5          0   10   16   20   52    2    0    0   14   18   68    0

Table 3: Percentage of data sets for which the true number of clusters is K∗ and the number of clusters selected by the VBIC is K. A total of 50 data sets were simulated for different values K∗ = 1, . . . , 5, and MagmaClust was tested on each with varying settings K = 1, . . . , 6, to retain the configuration that reaches the highest VBIC value. The data sets are composed of M = 50 (left) or M = 100 (right) individuals with N = 30 common timestamps, under the hypothesis H00.
irregular grids, MagmaClust would run the same without any change. On another as-
pect, Figure 9 provides some insights into the robustness of MagmaClust to an incorrect
setting of K, the number of clusters. For 100 data sets with a true value K ∗ = 3, the ARI
has been computed between the true partitions and the ones estimated by MagmaClust
initialised with different settings K = 2, . . . , 10. Except for K = 2 where the low number
of clusters prevents getting enough matching pairs by definition, we may notice rela-
tively unaffected performances as K increases. Despite a non-negligible variance in results,
the partitions remain consistent overall, and the clustering performances of MagmaClust
seem pretty robust to misspecification of K.
To remain on the matter of clusters’ number, the model selection abilities of the proposed
VBIC (Proposition 4) maximisation procedure are assessed on simulated data. For different
true numbers of clusters, 50 data sets were simulated according to the previous scheme, and
MagmaClust was run successively on each with different settings of K. The setting that
reaches the highest value of VBIC is selected as the best model. The percentages of selection
for each true K ∗ and the corresponding values K retained through VBIC are reported in
Table 3. The procedure seems to adequately select the number of clusters in most contexts,
with results that deteriorate as K grows and a tendency to over-penalise, which is a classical
behaviour with BIC in practice (Weakliem, 2016). As the typical BIC formula relies on
asymptotic approximations, we also ran the simulation for different numbers of individuals
M = 50 and M = 100. It may be noticed that the VBIC performs better as M increases,
as expected. Such a property appears reassuring since the following real-data applications
involve data sets gathering around M = 10³ individuals.
Table 4: Average (sd) values of MSE, W CIC95 , training and prediction times (in secs) on
100 runs for different numbers of clusters as setting for MagmaClust. (3∗ : the
true number of clusters for all data sets)
Table 5: Average (sd) values of MSE, W CIC95 , training and prediction times (in secs) for
GP, SM LMC, MOSM, Magma and MagmaClust over 100 simulated testing
sets.
For comparison, we also consider multi-output GP approaches, which define covariance structures modelling correlations both between inputs and outputs (Álvarez et al., 2012). The Multi-Output Gaussian
Process Tool Kit (MOGPTK) (de Wolff et al., 2021) is a Python package that implements
the main multi-output covariance kernels from literature in a unified framework. We relied
on this package to run experiments, first applying the proposed combination (SM LCM) of
the spectral mixture (SM) (Wilson and Adams, 2013) and the linear model of coregional-
isation (LMC) (Goovaerts, 1997), which is the historical and very general formulation for
multi-output kernels. Additionally, we made use of the more recent multi-Output spectral
mixture (MOSM) algorithm (Parra and Tobar, 2017), which is defined as the default method
in the package. For the sake of fair comparison, the L-BFGS-B algorithm (Nocedal, 1980;
Morales and Nocedal, 2011) is the optimisation procedure used in all competing methods.
Regarding both mean prediction and uncertainty quantification, our approach outperforms
the alternatives. In terms of MSE, MagmaClust takes advantage of its multiple mean pro-
cesses to achieve an order-of-magnitude improvement compared to the best competitors
(Magma and SM LMC). As expected, the simple GP regression performs rather poorly and
surprisingly MOSM results are even worse, as in practice, its current implementation seems
to reach pathological cases during training most of the time. Additionally, the empirical
quantification of uncertainty of MagmaClust appears very convincing since there are on
average exactly 95% of the observations lying within the weighted CI95 , as expected.
In accordance with the theoretical complexity, the increase in training times displayed
in Table 5 remains roughly proportional to the value of K (we recall that MagmaClust
assumes K = 3 here, compared to Magma which is K = 1). In contrast, the multi-
output GPs methods (SM LMC and MOSM) multiply tenfold both training and prediction
times even in these reasonable experiments (50 individuals and 30 timestamps). This cost
comes from the cubic complexity in the number of tasks (due to size-augmented covariance
matrices) that results in a massive computational burden as M increases. As a consequence,
in the following real data applications, where thousands of individuals are considered, these
methods would merely be unusable in practice.
                         Women                      Men
                         MSE         WCIC95         MSE        WCIC95
Magma                    4.9 (6.6)   94.3 (14.2)    3 (3.3)    96.9 (9.3)
MagmaClust - K = 2       4.5 (6.2)   94.3 (14.1)    2.8 (3.2)  96.2 (10.1)
MagmaClust - K = 3       4.3 (6)     94.4 (14.1)    2.8 (3.1)  96.1 (10.4)
MagmaClust - K = 4       4.2 (5.9)   94.4 (14.1)    2.8 (3)    96.2 (10)
MagmaClust - K = 5       4.2 (5.9)   94.4 (14.1)    2.7 (3)    96.1 (10.1)
MagmaClust - K = 6       4.1 (5.7)   94.3 (14.1)    2.8 (3)    96.3 (9.8)

Table 6: Average (sd) values of MSE and WCIC95 for MagmaClust with K = 1, . . . , 6
on the French swimmer testing data sets.
The following real-data applications address the forecasting of swimmers' performances, the
prediction of children's weight during childhood, and missing data reconstruction in CO2 emissions. The common aspect in all
these problems lies in the presence of time series collected from multiple sources, possibly
with irregular measurements, for which we expect to provide accurate forecasts by taking
advantage of shared information and clustered structures in the data. For all experiments,
the individuals (or countries for CO2) are split into training and testing sets (in proportions
60%–40%). In the absence of expert knowledge, the prior mean functions {mk(·)}k are set
to be constant equal to 0. The hypothesis H00 is specified along with random initialisations
of the hyper-parameters. The hyper-parameters, the mean processes and the clusters'
membership probabilities are learnt on the training data set. Then, the data points of each
testing individual are split for evaluation purposes between observed values (the first 60%) and
testing values (the remaining 40%). Therefore, each new process y∗(·) associated with a test
individual is assumed to be partially observed, and its testing values are used to compute
MSE and WCIC95 for the predictions. As measuring clustering performances directly in
real-life applications is a vain effort (Kleinberg, 2002), the results are provided for several
values of K, from K = 1 (i.e. Magma) to K = 6, to evaluate the effect of clustered-data
structures on predictions. In all the following experiments, predictive performances tend to
reach a plateau as we increase the number of clusters, and no substantial improvement is
noticeable for K ≥ 7.
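As an illustration of this evaluation protocol, the sketch below splits one testing individual's points and computes the two metrics. The predictive mean and standard deviation at the held-out timestamps are assumed to be provided by the fitted model, and the Gaussian-interval approximation of the weighted CI95 used here is a simplification of the exact mixture-based definition of WCIC95.

    # Sketch of the per-individual evaluation (illustrative assumptions, base R only).
    # 'df' contains one testing individual's points, ordered by timestamp, with columns t and y;
    # 'pred_mean' and 'pred_sd' are the model's predictive means and standard deviations
    # at the held-out timestamps.
    evaluate_individual <- function(df, pred_mean, pred_sd) {
      n_obs <- floor(0.6 * nrow(df))              # first 60% of points are observed
      test  <- df[(n_obs + 1):nrow(df), ]         # remaining 40% are held out
      mse   <- mean((test$y - pred_mean)^2)
      in_ci <- abs(test$y - pred_mean) <= qnorm(0.975) * pred_sd
      c(MSE = mse, CIC95 = 100 * mean(in_ci))     # coverage reported in percent
    }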
As previously presented in Section 2.1, the 100m freestyle swimming data sets initially pro-
posed in Leroy et al. (2018) and Leroy et al. (2022) are analysed below in the new light of
MagmaClust. The data sets contain results from 1731 women and 7876 men, members
of the French swimming federation, with an average of 22.2 data points per woman
(min = 15, max = 61) and 12 data points per man (min = 5, max = 57). In the fol-
lowing, the age of the i-th swimmer is considered as the input variable (timestamp t) and the
performance (in seconds) on a 100m freestyle as the output (yi(t)). The analysis focuses
on the youth period, from 10 to 20 years, where the progression is the most noticeable.
Let us recall that we aim at modelling a curve of progression from competition results for
each individual in order to forecast their future performances. We expect MagmaClust
to provide relevant predictions by taking advantage of both its multi-task feature and the
clustered structure of the data, previously highlighted in Leroy et al. (2018).

Figure 10: Top: women data set. Bottom: men data set. Heatmap of probabilities
obtained through the GPs mixture predictive distribution of MagmaClust
with K = 5 for a random illustrative swimmer. The dashed lines represent
the mean parameters from the mean processes' estimates. Observed data points
are in black, testing data points are in yellow. Backward points are a sample
of observations from the training data set, coloured according to their most
probable cluster.

As exhibited by
Table 6, MagmaClust offers excellent performances on both data sets and slightly im-
proves upon Magma predictions as we increase the number of clusters. Values of both MSE and
WCIC95 appear satisfactory in all cases, and cluster-specific predictions provide additional
accuracy, though one may fairly argue that the gain remains moderate overall. One of the
reasons is highlighted in Figure 10, where we display illustrative predictions
for a random man and woman when K = 5. Although we can notice clear distinctions
between the different mean curves at young ages, these differences tend to decrease after-
wards, as adults' performances lie in narrow intervals, especially with regard to the overall
signal-to-noise ratio. Nevertheless, MagmaClust provides several additional insights into
this problem compared to Magma.
First, the clusters offer interesting results to assess the profile of young swimmers and to
determine the individuals they most resemble. In particular, it is also possible to
differentiate the future evolutions associated with each cluster, along with their probabilities of
occurring (we do not display all the cluster-specific predictions here for the sake of concision).
Second, our method leads to tighter predictive distributions in terms of uncertainty.
Compared to Magma, which uses all training data identically, MagmaClust somehow
discards superfluous information, through the weights τ∗k, to only retain the most
relevant cluster-specific mean processes.
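The way these weights act on the final prediction can be illustrated with a minimal sketch: for any mixture, the predictive mean is the τ-weighted average of the cluster-specific predictive means, which is the mechanism at play here. The numerical values below are made up for illustration.

    # Sketch of combining cluster-specific predictions through the weights tau_*k.
    # 'tau' holds the membership probabilities of the new individual,
    # 'means' is a K x T matrix of cluster-specific predictive means on the target grid.
    mixture_mean <- function(tau, means) {
      as.vector(t(tau) %*% means)   # weighted average: sum_k tau_*k * mean_k(t)
    }
    tau   <- c(0.7, 0.2, 0.1)                                    # hypothetical weights
    means <- rbind(c(60, 57, 55), c(65, 63, 62), c(58, 54, 52))  # hypothetical cluster means
    mixture_mean(tau, means)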
In contrast with the previous application, we now study individual time series that are
highly similar at young ages before diverging while growing older. This data set (collected
through the GUSTO program, see https://www.gusto.sg/) corresponds to a weight follow-
up of 342 children from Singapore at 11 timestamps between birth and 72 months. In this
experiment, the goal is to predict the weight of a child at {24, 36, 48, 60, 72} months, us-
ing their observed weights at {0, 3, 6, 9, 12, 18} months and data from the training individuals.
Since the weight differences between toddlers remain small until 18 months, providing ac-
curate long-term forecasts while clear morphological differences emerge seems particularly
challenging at first sight. However, MagmaClust still achieves impressive performances
in this context, as highlighted by Table 7. Once again, the accuracy of predictions seems to
slightly improve as we increase the number of clusters. In this application, recovering an
adequate cluster for a child is essential to anticipate which future weight pattern is the most
likely, and assuming that more clusters exist appears to help in identifying specific patterns
more precisely. This being said, each new cluster increases the overall complexity of the
model, and whether an additional cluster is worth the relative gain in accuracy
remains a practitioner's choice. For instance, the VBIC measure we proposed in Section 3.4
indicates K = 3 as the optimal number of clusters for this data set (keep in mind that
this criterion maximises a penalised ELBO and tells us nothing regarding predictive
abilities). Using this value for K allows us to display in Figure 11 the behaviour of Mag-
maClust predictions for a random testing child. Notice that the prediction remains
accurate throughout, even though the testing points (in yellow) are not close to any of the
mean processes.

                         MSE          WCIC95
Magma                    5.35 (9.48)  93.3 (16.7)
MagmaClust - K = 2       4.94 (8.98)  95.1 (15.2)
MagmaClust - K = 3       5.01 (9.67)  94.7 (16)
MagmaClust - K = 4       4.90 (9)     94.8 (16.1)
MagmaClust - K = 5       4.95 (9.18)  94.7 (16.1)
MagmaClust - K = 6       4.85 (9.8)   94.9 (16.1)

Table 7: Average (sd) values of MSE and WCIC95 for MagmaClust with K = 1, . . . , 6
on the children's weight testing data set.

Figure 11: Heatmap of probabilities obtained through the GPs mixture predictive distri-
bution of MagmaClust with K = 3 for a random illustrative child. The
dashed lines represent the mean parameters from the mean processes' estimates.
Observed data points are in black, testing data points are in yellow. Backward
points are observations from the training data set, coloured according to their
most probable cluster.

This particularity comes from the learning of cluster weights τ∗k at young
ages, where the algorithm identifies that this child belonged in nearly equal proportions to
the clusters K1 (in red) and K2 (in blue). Therefore, the multi-task GPs mixture posterior
distribution defined in Proposition 6 yields a weighted trade-off between the two mean
processes that remarkably predicts the true weight values of the child for almost 5 years.
Although this nice behaviour on a single example does not prevent pathological cases from
occurring in general, we know from Table 7 that those remain rare in practice, and that the
computed credible intervals encompass the true data with the correct degree of uncertainty.

Figure 12: Time series representing the evolution of CO2 emissions per capita for an illus-
trative sample of 16 countries, differentiated by colours.
                         MSE          WCIC95
Magma                    34.9 (89.5)  92.9 (21.6)
MagmaClust - K = 2       28.9 (62.3)  90.8 (24)
MagmaClust - K = 3       19.6 (49.1)  92.5 (18.3)
MagmaClust - K = 4       15.4 (33.4)  93.7 (17.8)
MagmaClust - K = 5       14 (28.7)    94.1 (17.1)
MagmaClust - K = 6       14.2 (29.3)  93.4 (18)

Table 8: Average (sd) values of MSE and WCIC95 for MagmaClust with K = 1, . . . , 6
on the CO2 emissions testing data set.
On this data set, MagmaClust also demonstrates an ability to automatically recover patterns that seem logical (although such
an assessment remains highly subjective). For instance, when setting K = 5, one cluster of 11 countries gathered the
United Kingdom, the United States, Russia, France and Canada, among others. Another cluster
only counted Kuwait and Qatar as members, while the largest regrouped around 90 coun-
tries, mainly from Africa, South Asia and South America, for which the CO2 emissions per
capita have generally remained very low over the past centuries. More generally, we report
the prediction performances on this data set in Table 8. Notice that increasing the number
of clusters dramatically improves accuracy in this context, as it seems clear from Figure 12
that some countries present atypical patterns and should be treated in separate clusters to
reach satisfactory predictions of the missing values. Overall, these abilities to take advantage
of group-structured data, even with irregular measurements, and to provide probabilistic
statements highlight once more the interest of MagmaClust for tackling various machine
learning problems in both supervised and unsupervised contexts.
7. Discussion
We introduced a novel framework to jointly handle clustering and regression with a multi-
task GPs mixture model. This approach, called MagmaClust, extends the previous al-
gorithm Magma (Leroy et al., 2022) to deal with group-structured data more efficiently.
The method provides new insights on the matter of GP regression by introducing cluster-
specific modelling and predictions while remaining efficiently tractable through the use of
variational approximations for inference. Moreover, this nonparametric probabilistic frame-
work accounts for uncertainty in both the clustering and the predictive aspects, which
remains notable in the machine learning literature. We demonstrated the practical
efficiency of MagmaClust on both synthetic and real data sets, where it outperformed the
alternatives, particularly in group-structured contexts. Even though the main purpose of
our method remains prediction, its clustering performances, compared with state-of-the-art
functional clustering algorithms, also deserve to be highlighted.
The literature on sparse GP methods (Titsias, 2009; Bauer et al., 2016) also makes use of variational inference,
both to select pseudo-inputs and to learn the hyper-parameters of GP models. Therefore, an in-
teresting extension could come from simultaneously computing the {µk(·)}k hyper-posteriors
and pseudo-inputs in MagmaClust, allowing for a sparse approximation of the highest-
dimensional objects in our model. Besides, several additional features would be worth in-
vestigating in future work, such as the extension to non-Gaussian likelihoods or enabling
online updates in the learning procedure.
8. Proofs
As a prerequisite, let us introduce an intermediate result that will be used several times in
the proofs below.
Lemma 7 Let X ∈ R^N be a random Gaussian vector, X ∼ N(m, K), and let E_X and V_X denote the
expectation and variance taken with respect to this distribution. Additionally, let b ∈ R^N
be an arbitrary vector and S an N × N covariance matrix. Then:
\[
\mathbb{E}_X\big[\, (X - b)^{\top} S^{-1} (X - b) \,\big] \;=\; (m - b)^{\top} S^{-1} (m - b) \;+\; \operatorname{tr}\!\big(S^{-1} K\big).
\]
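As a quick numerical sanity check of this identity (an illustrative sketch relying on base R and the MASS package, not part of the formal argument):

    # Monte Carlo check of Lemma 7: E[(X - b)' S^{-1} (X - b)] vs its closed form.
    set.seed(1)
    N <- 4
    m <- rnorm(N)
    K <- crossprod(matrix(rnorm(N * N), N)) + diag(N)   # covariance of X
    S <- crossprod(matrix(rnorm(N * N), N)) + diag(N)   # arbitrary covariance matrix
    b <- rnorm(N)
    X <- MASS::mvrnorm(2e5, mu = m, Sigma = K)           # samples from N(m, K)
    S_inv <- solve(S)
    D <- sweep(X, 2, b)                                  # rows are (x - b)'
    empirical   <- mean(rowSums((D %*% S_inv) * D))
    theoretical <- drop(t(m - b) %*% S_inv %*% (m - b)) + sum(diag(S_inv %*% K))
    c(empirical = empirical, theoretical = theoretical)  # the two values nearly coincide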
Proof
\[
= \sum_{i=1}^{M} \sum_{k=1}^{K} Z_{ik} \left[ \log \widehat{\pi}_k - \frac{1}{2} \log \Big| \Psi_{\widehat{\theta}_i, \widehat{\sigma}_i^2}^{t_i} \Big| - \frac{1}{2} \mathbb{E}_{\mu_k}\Big[ (y_i - \mu_k(t_i))^{\top} \big(\Psi_{\widehat{\theta}_i, \widehat{\sigma}_i^2}^{t_i}\big)^{-1} (y_i - \mu_k(t_i)) \Big] \right] + C_3 \qquad (10)
\]
\[
= \sum_{i=1}^{M} \sum_{k=1}^{K} Z_{ik} \log \tau_{ik} ,
\]
\[
+ \sum_{i=1}^{M} \tau_{ik}\, (y_i - \mu_k(t_i))^{\top} \big(\Psi_{\widehat{\theta}_i, \widehat{\sigma}_i^2}^{t_i}\big)^{-1} (y_i - \mu_k(t_i)) \Bigg] + C_3 .
\]
If we regroup the scalar coefficient τik with the covariance matrix \(\big(\Psi_{\widehat{\theta}_i, \widehat{\sigma}_i^2}^{t_i}\big)^{-1}\), we sim-
ply recognise two quadratic terms of Gaussian likelihoods in the variables µk(·), although
evaluated on different sets of timestamps t and ti. By taking some writing cautions and
expanding the vector-matrix products entirely, it has been proved in Leroy et al. (2022)
that this expression factorises with respect to µk(t), simply by expanding the vectors yi
(into the zero-padded vectors ỹi) and the matrices \(\Psi_{\widehat{\theta}_i, \widehat{\sigma}_i^2}^{t_i}\) with zeros, ∀t ∈ t, t ∉ ti. Namely, we can notice that:
• ∀i ∈ I, \(\widetilde{\Psi}_i = \big[\, \mathbb{1}_{[t, t' \in t_i]} \times \psi_{\widehat{\theta}_i, \widehat{\sigma}_i^2}(t, t') \,\big]_{t, t' \in t}\), an N × N matrix.
Therefore:
\[
\log \widehat{q}_{\mu}(\mu) = \sum_{k=1}^{K} \left[ -\frac{1}{2}\, \mu_k(t)^{\top} \left( \big(C_{\widehat{\gamma}_k}^{t}\big)^{-1} + \sum_{i=1}^{M} \tau_{ik}\, \widetilde{\Psi}_i^{-1} \right) \mu_k(t) + \mu_k(t)^{\top} \left( \big(C_{\widehat{\gamma}_k}^{t}\big)^{-1} m_k(t) + \sum_{i=1}^{M} \tau_{ik}\, \widetilde{\Psi}_i^{-1}\, \widetilde{y}_i \right) \right] + C_4 ,
\]
with:
• \(\widehat{C}_k^{t} = \left( \big(C_{\widehat{\gamma}_k}^{t}\big)^{-1} + \sum_{i=1}^{M} \tau_{ik}\, \widetilde{\Psi}_i^{-1} \right)^{-1}\), ∀k ∈ K,
• \(\widehat{m}_k(t) = \widehat{C}_k^{t} \left( \big(C_{\widehat{\gamma}_k}^{t}\big)^{-1} m_k(t) + \sum_{i=1}^{M} \tau_{ik}\, \widetilde{\Psi}_i^{-1}\, \widetilde{y}_i \right)\), ∀k ∈ K.
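The two update formulas above translate directly into code. The sketch below is a naive, un-optimised transcription (not the MagmaClustR implementation), where the zero-padded inverse matrices Ψ̃i^{-1} — that is, the inverse of Ψ^{ti} on individual i's observed timestamps and zeros elsewhere — and the padded vectors ỹi are assumed to be precomputed.

    # Sketch of the hyper-posterior updates for one cluster k (illustrative only).
    # C_prior: N x N prior covariance C^t_{gamma_k};  m_prior: prior mean m_k(t);
    # tau_k: length-M vector of responsibilities tau_{ik};
    # Psi_inv_tilde, y_tilde: lists of the zero-padded inverse covariances and observations.
    update_cluster <- function(C_prior, m_prior, tau_k, Psi_inv_tilde, y_tilde) {
      C_prior_inv <- solve(C_prior)
      precision <- C_prior_inv + Reduce(`+`, Map(`*`, tau_k, Psi_inv_tilde))
      C_hat <- solve(precision)
      weighted_obs <- Reduce(`+`, Map(function(tau, P, y) tau * (P %*% y),
                                      tau_k, Psi_inv_tilde, y_tilde))
      m_hat <- C_hat %*% (C_prior_inv %*% m_prior + weighted_obs)
      list(C_hat = C_hat, m_hat = as.vector(m_hat))
    }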
Moreover, we can develop the formulation of the lower bound by expressing the integrals as
expectations, namely E_{Z,µ}. Recalling the analytical expression of the complete-data likelihood:
\[
\mathcal{L}(\widehat{q}; \Theta) = \mathbb{E}_{Z,\mu}\big[ \log p(\{y_i\}_i, Z, \mu \mid \Theta) \big] - \underbrace{\mathbb{E}_{Z,\mu}\big[ \log \widehat{q}_{Z,\mu}(Z, \mu) \big]}_{\text{constant w.r.t. } \Theta}
\]
\[
= \mathbb{E}_{Z,\mu}\left[ \log \prod_{k=1}^{K} \left\{ \mathcal{N}\big(\mu_k(t); m_k(t), C_{\gamma_k}^{t}\big) \prod_{i=1}^{M} \Big[ \pi_k \, \mathcal{N}\big(y_i; \mu_k(t_i), \Psi_{\theta_i,\sigma_i^2}^{t_i}\big) \Big]^{Z_{ik}} \right\} \right] + C_1
\]
\[
= \sum_{k=1}^{K} \Bigg[ -\frac{1}{2} \Big( \log \big| C_{\gamma_k}^{t} \big| + \mathbb{E}_{\mu}\Big[ (\mu_k(t) - m_k(t))^{\top} \big(C_{\gamma_k}^{t}\big)^{-1} (\mu_k(t) - m_k(t)) \Big] \Big)
- \frac{1}{2} \sum_{i=1}^{M} \mathbb{E}_{Z,\mu}\Big[ Z_{ik} \Big( \log \big| \Psi_{\theta_i,\sigma_i^2}^{t_i} \big| + (y_i - \mu_k(t_i))^{\top} \big(\Psi_{\theta_i,\sigma_i^2}^{t_i}\big)^{-1} (y_i - \mu_k(t_i)) \Big) \Big]
+ \sum_{i=1}^{M} \mathbb{E}_{Z}[Z_{ik}] \log \pi_k \Bigg] + C_2
\]
\[
= \sum_{k=1}^{K} \Bigg[ -\frac{1}{2} \Big( \log \big| C_{\gamma_k}^{t} \big| + (\widehat{m}_k(t) - m_k(t))^{\top} \big(C_{\gamma_k}^{t}\big)^{-1} (\widehat{m}_k(t) - m_k(t)) + \operatorname{tr}\Big( \widehat{C}_k^{t} \big(C_{\gamma_k}^{t}\big)^{-1} \Big) \Big)
- \frac{1}{2} \sum_{i=1}^{M} \tau_{ik} \Big( \log \big| \Psi_{\theta_i,\sigma_i^2}^{t_i} \big| + (y_i - \widehat{m}_k(t_i))^{\top} \big(\Psi_{\theta_i,\sigma_i^2}^{t_i}\big)^{-1} (y_i - \widehat{m}_k(t_i)) + \operatorname{tr}\Big( \widehat{C}_k^{t_i} \big(\Psi_{\theta_i,\sigma_i^2}^{t_i}\big)^{-1} \Big) \Big)
+ \sum_{i=1}^{M} \tau_{ik} \log \pi_k \Bigg] + C_2,
\]
where we made use of Lemma 7 twice (on the first and second terms) to obtain the last equality. By
reorganising the terms on the second line, we can derive another formulation of this lower
bound that allows for better management of the computational resources. For information, we
give this expression below, since it is the quantity implemented in the current version of the
MagmaClust code:
\[
\mathcal{L}(\widehat{q}; \Theta) = -\frac{1}{2} \sum_{k=1}^{K} \Big[ \log \big| C_{\gamma_k}^{t} \big| + (\widehat{m}_k(t) - m_k(t))^{\top} \big(C_{\gamma_k}^{t}\big)^{-1} (\widehat{m}_k(t) - m_k(t)) + \operatorname{tr}\Big( \widehat{C}_k^{t} \big(C_{\gamma_k}^{t}\big)^{-1} \Big) \Big]
\]
\[
\quad - \frac{1}{2} \sum_{i=1}^{M} \Bigg[ \log \big| \Psi_{\theta_i,\sigma_i^2}^{t_i} \big| + y_i^{\top} \big(\Psi_{\theta_i,\sigma_i^2}^{t_i}\big)^{-1} y_i - 2\, y_i^{\top} \big(\Psi_{\theta_i,\sigma_i^2}^{t_i}\big)^{-1} \sum_{k=1}^{K} \tau_{ik}\, \widehat{m}_k(t_i) + \operatorname{tr}\Bigg( \big(\Psi_{\theta_i,\sigma_i^2}^{t_i}\big)^{-1} \sum_{k=1}^{K} \tau_{ik} \Big( \widehat{m}_k(t_i)\, \widehat{m}_k(t_i)^{\top} + \widehat{C}_k^{t_i} \Big) \Bigg) \Bigg]
\]
\[
\quad + \sum_{k=1}^{K} \sum_{i=1}^{M} \tau_{ik} \log \pi_k + C_2 .
\]
Regardless of the expression we choose for the following, we can notice that we expressed
the lower bound L(q; Θ) as a sum in which the hyper-parameters {γk}k, {{θi}i, {σi²}i} and π
appear in separate terms. Hence, the resulting maximisation procedures are independent
of each other. First, we focus on the simplest term, which concerns π and for which we have an
analytical update equation. Since there is a constraint on the sum, \(\sum_{k=1}^{K} \pi_k = 1\), we first need
to introduce a Lagrange multiplier in the expression to maximise:
\[
\lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) + \mathcal{L}(q; \Theta). \qquad (12)
\]
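For completeness, since only the beginning of this derivation is reproduced here, the remaining computation is standard; assuming that the only term of L(q; Θ) depending on π is \(\sum_{k}\sum_{i}\tau_{ik}\log\pi_k\), setting the derivative to zero gives:
\[
\frac{\partial}{\partial \pi_k}\left[ \lambda \Big( \sum_{l=1}^{K} \pi_l - 1 \Big) + \sum_{l=1}^{K} \sum_{i=1}^{M} \tau_{il} \log \pi_l \right] = \lambda + \frac{1}{\pi_k} \sum_{i=1}^{M} \tau_{ik} = 0
\;\Longrightarrow\; \pi_k = -\frac{1}{\lambda} \sum_{i=1}^{M} \tau_{ik},
\]
and combining the constraint \(\sum_{k} \pi_k = 1\) with \(\sum_{k} \tau_{ik} = 1\) yields \(\lambda = -M\), hence the usual mixture update \(\widehat{\pi}_k = \frac{1}{M} \sum_{i=1}^{M} \tau_{ik}\).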
\[
\mathcal{L}(\widehat{q}; \Theta) = \mathbb{E}_{Z,\mu}\big[ \log p(\{y_i\}_i, Z, \mu \mid \Theta) - \log \widehat{q}_{Z,\mu}(Z, \mu) \big]
= \mathbb{E}_{Z,\mu}\Big[ \log p\big(\{y_i\}_i \mid Z, \mu, \{\theta_i\}_i, \{\sigma_i^2\}_i\big) \Big] + \mathbb{E}_{Z}\big[ \log p(Z \mid \pi) \big] + \cdots
\]
Acknowledgments
Significant parts of this work have been carried out while Arthur Leroy was affiliated with
MAP5, Université Paris Cité, CNRS, UMR 8145, and Department of Computer Science,
The University of Sheffield. The authors warmly thank the French Swimming Federation
for collecting data and sharing insights on the analysis, as well as Ai Ling Teh and Dennis
Wang for providing data from the GUSTO project. The study is supported by the National
Research Foundation (NRF) under the Open Fund-Large Collaborative Grant (OF-LCG;
MOH-000504) administered by the Singapore Ministry of Health’s National Medical Re-
search Council (NMRC) and the Agency for Science, Technology and Research (A*STAR).
In RIE2025, GUSTO is supported by funding from the NRF’s Human Health and Potential
(HHP) Domain, under the Human Potential Programme. Benjamin Guedj acknowledges
partial support by the U.S. Army Research Laboratory and the U.S. Army Research Of-
fice, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences
Research Council (grant EP/R013616/1), and by the French National Agency for Research
(grants ANR-18-CE40-0016-01 & ANR-18-CE23-0015-02).
Data availability
The synthetic data, trained models and results are available at https://github.com/
ArthurLeroy/MAGMAclust/tree/master/Simulations. The real data sets, associated trained
models and results are available at https://github.com/ArthurLeroy/MAGMAclust/tree/
master/Real_Data_Study.
Code availability
The current version of the R package implementing MagmaClust is available on the CRAN
and at https://github.com/ArthurLeroy/MagmaClustR.
References
C. Abraham, P. A. Cornillon, E. Matzner-Løber, and N. Molinari. Unsupervised Curve
Clustering using B-Splines. Scandinavian Journal of Statistics, 30(3):581–595, Sept. 2003.
ISSN 1467-9469. doi: 10.1111/1467-9469.00350.
M. Bauer, M. van der Wilk, and C. E. Rasmussen. Understanding probabilistic sparse Gaussian
process approximations. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, Advances in Neural Information Processing Systems 29, pages 1533–1541. Curran
Associates, Inc., 2016.
J. Bernardo, J. Berger, A. Dawid, and A. Smith. Regression and classification using Gaus-
sian process priors. Bayesian statistics, 6:475, 1998.
C. Biernacki, G. Celeux, and G. Govaert. Choosing starting values for the EM al-
gorithm for getting the highest likelihood in multivariate Gaussian mixture models.
Computational Statistics & Data Analysis, 41(3):561–575, 2003. ISSN 0167-9473. doi:
10.1016/S0167-9473(02)00163-9.
C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statis-
tics. Springer, New York, 2006. ISBN 978-0-387-31073-2.
C. Bouveyron, E. Côme, and J. Jacques. The discriminative functional mixture model for
a comparative analysis of bike sharing systems. The Annals of Applied Statistics, 9(4):
1726–1760, 2015.
R. Caruana. Multitask Learning. Machine Learning, 28(1):41–75, July 1997. ISSN 1573-
0565. doi: 10.1023/A:1007379606734.
J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings
of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, page
282–290, Arlington, Virginia, USA, 2013a. AUAI Press.
M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems.
Journal of research of the National Bureau of Standards, 49(6):409–436, 1952.
J. Jacques and C. Preda. Funclust: A curves clustering method using functional ran-
dom variables density approximation. Neurocomputing, 112:164–171, July 2013. ISSN
09252312. doi: 10.1016/j.neucom.2012.11.042.
J. Jacques and C. Preda. Functional data clustering: A survey. Advances in Data Analysis
and Classification, 8(3):231–255, Sept. 2014. ISSN 1862-5347, 1862-5355. doi: 10.1007/
s11634-013-0158-y.
H. Jiang and N. Serban. Clustering Random Curves Under Spatial Interdependence With
Application to Service Accessibility. Technometrics, 54(2):108–119, May 2012. ISSN
0040-1706, 1537-2723. doi: 10.1080/00401706.2012.657106.
A. Leroy, A. Marc, O. Dupas, J. L. Rey, and S. Gey. Functional Data Analysis in Sport
Science: Example of Swimmers’ Progression Curves Clustering. Applied Sciences, 8(10):
1766, Oct. 2018. doi: 10.3390/app8101766.
A. Leroy, P. Latouche, B. Guedj, and S. Gey. Magma: Inference and prediction using
multi-task Gaussian processes with common mean. Machine Learning, 111(5):1821–1849,
2022.
R. M. Neal. Monte Carlo Implementation of Gaussian Process Models for Bayesian Regres-
sion and Classification. arXiv:physics/9701026, Jan. 1997.
G. Parra and F. Tobar. Spectral mixture kernels for multi-output Gaussian processes. Ad-
vances in Neural Information Processing Systems, 30, 2017.
C. E. Rasmussen and H. Nickisch. Gaussian processes for machine learning (GPML) toolbox.
The Journal of Machine Learning Research, 11:3011–3015, 2010.
A. Schwaighofer, V. Tresp, and K. Yu. Learning Gaussian process kernels via hierarchical
Bayes. In Advances in Neural Information Processing Systems, volume 17, 2004.
J. Q. Shi and B. Wang. Curve prediction and clustering with mixtures of Gaussian process
functional regression models. Statistics and Computing, 18(3):267–283, 2008. ISSN 0960-
3174, 1573-1375. doi: 10.1007/s11222-008-9055-1.
A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation.
In S. Dasgupta and D. McAllester, editors, Proceedings of Machine Learning Research,
volume 28, pages 1067–1075, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In
Proceedings of the 22nd International Conference on Machine Learning, pages 1012–1019,
2005. doi: 10.1145/1102351.1102479.