Nonstationary and Sparsely-Correlated Multioutput Gaussian Process with Spike-and-Slab Prior

Published Online:https://doi.org/10.1287/ijds.2023.0022

Abstract

Multioutput Gaussian process (MGP) is commonly used as a transfer learning method to leverage information among multiple outputs. A key advantage of MGP is providing uncertainty quantification for prediction, which is highly important for subsequent decision-making tasks. However, traditional MGP may not be sufficiently flexible to handle multivariate data with dynamic characteristics, particularly when dealing with complex temporal correlations. Additionally, because some outputs may lack correlation, transferring information among them may lead to negative transfer. To address these issues, this study proposes a nonstationary MGP model that can capture both the dynamic and sparse correlation among outputs. Specifically, the covariance functions of MGP are constructed using convolutions of time-varying kernel functions. Then a dynamic spike-and-slab prior is placed on correlation parameters to automatically decide which sources are informative to the target output in the training process. An expectation-maximization (EM) algorithm is proposed for efficient model fitting. Both numerical studies and real cases demonstrate its efficacy in capturing dynamic and sparse correlation structure and mitigating negative transfer for high-dimensional time-series data modeling. A mountain-car reinforcement learning case highlights that transferring knowledge from source expertise can enhance the efficiency of the target decision-making process. Our method holds promise for application in more complex multitask decision-making challenges within nonstationary environments in the future.

History: Rema Padman served as the senior editor for this article.

Funding: This work was supported by NSFC [Grants NSFC-72171003, NSFC-71932006, and NSFC-72101147].

Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.4010696.v1 and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2023.0022).

1. Introduction

Gaussian process (GP) provides an elegant and flexible Bayesian nonparametric framework for modeling nonlinear mappings (Williams and Rasmussen 2006). Characterized solely by mean and covariance functions, it is capable of capturing complex input-output relationships, as well as measuring prediction uncertainty, which is critical for decision making. As a result, GP has been widely applied in various fields, such as Bayesian optimization (Frazier 2018), experiment design (Gramacy 2020), and product quality monitoring (Zhang et al. 2016). However, standard GP is designed for only a single output, which limits its use in multioutput or multitask scenarios arising in various fields, such as Bayesian optimization (Shen et al. 2023), a traffic network (Rodrigues et al. 2019), and a computer simulation emulator (Fricker et al. 2013). Consequently, multioutput Gaussian process (MGP) has been gaining increasing attention from researchers and has emerged as an important member in the vast family of transfer learning (Pan and Yang 2009) and multitask learning (Zhang and Yang 2021) methods.

Stationary GP and MGP models are commonly used with covariance function depending only on the distance among data points. However, this invariance to input space translation makes them unsuitable for nonstationary environments, where the data characteristics vary across the input domain (Williams and Rasmussen 2006). This phenomenon is quite common in time-series data. For instance, in the energy field, the mean of power consumption in a household is different in every season (Paun et al. 2023). In clinical studies, sepsis is very likely to cause changes in the cross-correlation among vital signs in the early onset (Fairchild et al. 2017). In kinesiology, the cooperation patterns of human joints vary across different gestures: for example, both hands move jointly in a “shoot” action but separately in a “throw” action (Xu et al. 2022). In such cases, nonstationary models, which allow all or a subset of parameters to vary, are generally more appropriate. Modeling and capturing such a structural change are important in subsequent decision-making tasks, such as identifying the risk of disease and providing medical care (Paun et al. 2023).

Mainly two kinds of methods have been proposed to capture the dynamic characteristics of the nonstationary data. The first category assumes that the parameters are the same within local regions but different across regions. For example, a Bayesian tree-based GP (Gramacy and Lee 2008) uses a tree structure to partition the input space of a computer simulation emulator. Another method, called jump GP, cuts the input space into several segments to model piece-wise continuous functions (Park 2022). This model is optimized using the expectation-minimization (EM) algorithm or variational inference. Further, a clustering-based GP (Heaton et al. 2017) partitions the spatial data into groups by calculating a cluster dissimilarity and constructs stationary GPs for each group of data. A space-partitioning-based GP is further extended to the active learning area to accelerate the design of experiments with heterogeneous systems (Lee et al. 2023). However, these methods are not suitable for data with gradually changing characteristics. To address this issue, the methods in the second category abandon the locally stationary assumption. They allow all or some parameters to be input/time-dependent, and model those parameters by additional GPs. For instance, nonstationary GP introduced in Heinonen et al. (2016) and Paun et al. (2023) applies GP priors on the amplitude and length-scale parameters of a square-exponential covariance function. In addition, based on these single-output nonstationary GPs, researchers have explored MGP models for multivariate data with dynamic characteristics. For example, a nonstationary MGP is established to model the varying correlation between vital signals, where a GP prior is imposed on the time-dependent correlation matrix (Meng et al. 2021). However, in this state-of-the-art MGP model, the GP prior has no shrinkage effect, which does not encourage a sparse estimation of cross-correlation among multiple outputs.

Pursuing sparse estimation of cross-correlation is rooted in negative information transfer, which is another critical challenge when using MGP. Transfer learning is very promising for leveraging information to a data-insufficient domain, called a target domain, from other correlated domains, called the source domain. However, not all the data from the source domain are necessarily correlated with the target domain. If knowledge is transferred from uncorrelated domains, it may reduce the performance of target learning, known as negative transfer (Pan and Yang 2009, Yoon and Li 2018). For example, the motion signal of a specific human joint may only be correlated with a subset of the other joints. In order to recover the joint’s motion information by borrowing information from others, it is necessary to detect which joints share similar moving trajectories with the target joint. Therefore, it is crucial that researchers or engineers can make the best choice on using which sources to transfer information.

Negative transfer exists widely in transfer learning, often stemming from the excessive inclusion of source data. To handle this issue, one straightforward approach is to measure the relatedness of each source to the target and choose the most closely related one for information transfer. For example, the method proposed in Ye and Dai (2021) takes Jensen-Shannon (JS) divergence as a criterion and selects the source with the least divergence for knowledge transfer. However, such a method only takes the pairwise transferability into account and ignores the global structure between the target and the sources. And this choice is made independently on the specific model before training, which is far from an optimal decision. An alternative approach is the regularized MGP (Wang et al. 2022), which jointly models all outputs and selects informative sources during the training process. However, all these approaches assume that the source-target cross-correlation is fixed in the time space, and thus cannot model the dynamic and sparse structure among multiple outputs.

To this end, we propose a nonstationary MGP model to capture the varying characteristics of data and mitigate the negative transfer simultaneously. Specifically, we focus on modeling the dynamic and sparse correlations between the sources and the target. In the proposed framework, we first construct a convolution-process-based MGP for transfer learning, whose covariance function parameters are allowed to vary in the time space. We then apply a spike-and-slab prior to the parameters that are related to the sparse correlation between the sources and the target. The slab part mainly accounts for smoothly changing or constant correlation parameters, whereas the spike part is responsible for shrinking some parameters to zero, thereby removing the corresponding uninformative sources. To the best of our knowledge, this is the first research on MGP that simultaneously handles dynamic relationships and negative transfer. Our contributions can be summarized as follows:

  1. A novel nonstationary MGP model is established using the convolution of latent processes and time-dependent kernel functions, which is suitable for modeling multiple outputs with varying characteristics.

  2. A dynamic spike-and-slab prior is applied to capture the temporal and sparse correlations among outputs, deciding from which sources to transfer information to the target.

  3. The mixture weight of the spike-and-slab priors is automatically adjusted during the training process using an EM-based optimization algorithm, which can effectively prevent placing shrinkage effects on nonzero elements.

The rest of this paper is organized as follows. In Section 2, we revisit the related literature and the static MGP model. Section 3 presents the proposed nonstationary MGP and an efficient EM algorithm for model training. In Section 4, we evaluate the effectiveness of our model on simulated data. In Section 5, we perform one time-series analysis case on human gesture data (Fothergill et al. 2012) and one control policy optimization case on the mountain-car problem (Moore 1990). In Section 6, we conclude the paper with a discussion.

2. Preliminaries

In this section, we first review research that is related to our work. We then introduce the static MGP based on the convolution process, which has been widely applied in various areas because of its flexibility (Boyle and Frean 2004, Kontar et al. 2018, Hu and Wang 2021).

2.1. Related Work

To deal with nonstationary data, a natural extension of GPs is to release the restriction that the parameters of the covariance functions are invariant throughout the input space. Most of the existing approaches either encourage the parameters to be constant in a local area and construct a piece-wise model (the first category, e.g., Gramacy and Lee 2008, Heaton et al. 2017, Park 2022, Lee et al. 2023), or allow them to vary at each point and model them using other GPs (the second category, e.g., Paciorek and Schervish 2003, Heinonen et al. 2016, Paun et al. 2023). The methods in the second category have a similar structure to that of a two-layer Deep Gaussian Process (Deep GP), where the input is first transformed by the first GP layer into a latent input, and then fed into the second GP layer to obtain the output (Damianou and Lawrence 2013, AlBahar et al. 2022, Ko and Kim 2022). However, the parameters of Deep GP are stationary, which differs from the second category where the covariance parameters are dynamic.

Nonstationary GPs mainly focus on modeling the dynamic mean, smoothness, and amplitude parameters. With regards to MGP, dynamic correlation is another key characteristic that needs to be considered. In the classical Linear Model of Coregionalization (LMC), each output is a linear combination of several latent Gaussian processes, and the covariance matrix is modeled by a Kronecker product of a correlation matrix and a single GP’s covariance matrix (Goulard and Voltz 1992). The existing nonstationary MGPs are mainly extensions of the classical LMC model. For example, the approach in Gelfand et al. (2004) allows the correlation matrix to vary with inputs to model the dynamic relationship among outputs. In Meng et al. (2021), a nonstationary MGP combines the time-varying correlation (across outputs) and smoothness (within each output) together. However, as extensions of the traditional LMC, these methods also suffer from the limitation that all outputs possess the same covariance structure. More flexible MGPs are proposed by constructing each output through the convolution process and modeling them with individual parameters (Boyle and Frean 2004). However, these approaches are for stationary data. Furthermore, all existing approaches fail to capture a sparse correlation structure in a nonstationary environment.

Spatial-temporal modeling of nonstationary data is closely related to our work. In comparison with normal time-series modeling, spatial-temporal analysis requires modeling the spatial correlation to enhance the prediction accuracy. A large number of spatial-temporal models have been investigated, such as the spatial-temporal autoregressive integrated moving average method (ST-ARIMA) (Stathopoulos and Karlaftis 2003), spatial-temporal k-nearest neighbors (ST-KNN) (Xia et al. 2016), spatial-temporal random fields (Christakos 2012, Yun et al. 2022), and spatial-temporal deep neural networks (Wang et al. 2020b, Wen et al. 2023). Based on the aforementioned methods, a number of recent works try to extend them to handle nonstationary spatial-temporal data. One popular and efficient solution is utilizing some change detection algorithm to partition the time-series into several stationary periods, and then applying the stationary model for each period, for example, an ST-KNN with a wrapped K-means partition algorithm (Cheng et al. 2021) or an autoregressive model coupled with a block-fused-Lasso change-point detection algorithm (Bai et al. 2022). Besides partitioning the time-series into stationary parts, the method proposed by Shand and Li (2017) maps the nonstationary space-time process into a high-dimensional stationary process through augmenting new dimensions. Another type of nonstationary spatial-temporal model is Bayesian random fields with nonstationary space-time kernels (Garg et al. 2012, Ton et al. 2018, Wang et al. 2020a, Zou et al. 2023), whose hyper-parameters change over time or location. Deep learning methods are also explored on nonstationary data recently, such as nonstationary recurrent neural networks (Rangapuram et al. 2018, Liu et al. 2020), long short-term memory networks (Wang et al. 2019), and transformer-based networks (Liu et al. 2022, Wen et al. 2023). In contrast to the spatial-temporal model, MGP does not impose a restriction that the source outputs must be sampled during the same period as the target outputs. Furthermore, it does not depend on spatial distance to establish correlations among outputs.

It is important to mention that we use a dynamic spike-and-slab prior in our model. The classical spike-and-slab prior is a Bayesian selection approach (George and McCulloch 1993) that has been used for feature selection in (generalized) linear models, additive models, and Gaussian processes (Ishwaran and Rao 2005, Scheipl et al. 2012, Dance and Paige 2022). With this prior, smaller parameters tend to be more influenced by the spike prior to reach zero, whereas the larger ones are mainly dominated by the slab part and bear little shrinkage. However, the classical spike-and-slab prior cannot account for the modeling of dynamic and sparse correlation parameters in our model. Therefore, we propose to extend this prior to a dynamic version. Although current works have explored the dynamic variable selection for the varying-coefficient linear models (Kalli and Griffin 2014, Huber et al. 2021, Rockova and McAlinn 2021), no work utilizes a dynamic spike-and-slab prior to model the dynamic and sparse correlation among outputs in a nonstationary MGP.

2.2. Static MGP Based on Convolution Process

Consider a set of m outputs fi:XR,i=1,,m, where X is a d-dimensional input domain applied to all outputs. Suppose that the observation yi is accompanied with independent and identically distributed (i.i.d.) noise ϵiN(0,σi2), that is,

yi(x)=fi(x)+ϵi,
where xRd is the input. Denote the ni observed data for the ith output as Di={Xi,yi}, where Xi=(xi,1,,xi,ni)T and yi=(yi,1,,yi,ni)T are the collections of input points and associated observations, respectively. Let the total number of observations be represented by N=ini for m outputs. Denote the data of all outputs as D={X,y}, where X=(X1T,,XmT)TRN×d and y=(y1T,,ymT)TRN. In an MGP model, the observation vector y follows a joint Gaussian distribution:
y|XN(0,K),(1)
where K=K(X,X)RN×N is a block-partitioned covariance matrix. The (i,i)-th block of K, Ki,i=covi,if(Xi,Xi)+τi,iσi2IRni×ni represents the covariance matrix between the output i and output i (τi,i equals to one if i=i, and zero otherwise). The function covi,if(x,x) measures the covariance between fi(x) and fi(x). In the covariance matrix K, the cross-covariance block Ki,i(ii) is the most important part to realize information transfer, as it models the correlation between different outputs.

As the convolution of a GP and a smoothing kernel is still a GP, we can construct each output fi through convolving a group of shared latent processes {zj(x)}j=1h and kernel functions {gji(x)}j=1h in the following way (Boyle and Frean 2004, Alvarez and Lawrence 2011):

fi(x)=j=1hαjigji(x)zj(x)=j=1hαjigji(xu)zj(u)du,(2)
where * represents the convolution operation, αji is the amplitude parameter, and h is the number of shared latent processes. Usually, {zj(x)}j=1h are independent white Gaussian noise processes with cov(z(x),z(x))=δ(xx), where δ(·) is the Dirac delta function. Thus, the covariance function can be derived as
covi,if(x,x)=cov[fi(x),fi(x)]=j=1hcov{αjigji(x)zj(x),αjigji(x)zj(x)}=j=1hαjiαjigji(u)gji(uv)du,(3)
where v=xx. In such a way, the covariance between fi(x) and fi(x) is dependent on their difference xx, the amplitude parameters, and the hyper-parameters in kernels gji and gji. Compared with the classical LMC model fi(x)=j=1hαjiqj(x), where qj(x) is a latent GP with covariance kj(x,x). The convolution-process-based MGP is more flexible than LMC, as it does not restrict all outputs to having the same autocovariance pattern.

At a new point x*, the posterior distribution of yi(x*) given data {X,y} is

yi(x*)|X,yN(μ(x*),Σ(x*)),(4)
where the predictive mean μ(x*) and variance Σ(x*) can be expressed as
μ(x*)=K*TK1y,(5)
Σ(x*)=coviif(x*,x*)+σi2K*TK1K*,(6)
where K*T=[covi1f(x*,X1)T,,covimf(x*,Xm)T] is the covariance between the new point x* and all observed data. From the posterior distribution, we can find that the covariance function plays a crucial role in prediction. For instance, the predicted mean is the linear combination of output data, where the weight is decided by the covariance matrix. However, in the static MGP, the covariance between two data points depends solely on their distance and does not change dynamically. Additionally, some outputs may be uncorrelated with others; therefore, the estimated covariance matrix should possess a sparse structure to avoid negative transfer between the uncorrelated outputs. In the following section, we will propose a novel nonstationary MGP to simultaneously address both problems.

3. Model Development

We propose a nonstationary MGP model for transfer learning that can capture sparse source-target correlation in a dynamic environment. Specifically, we assume that the correlations between the target and each source vary over time. Besides, some sources may not be related to the target during certain time periods. Under such a circumstance, a spike-and-slab prior is utilized to model the varying and sparse correlation structure.

3.1. The Proposed Model

The structure of our hierarchical model is illustrated in Figure 1. The first layer constructs outputs through the convolution of time-dependent kernel functions and latent white Gaussian noise processes, and the second layer consists of priors on function parameters designed to encourage desired properties, such as smoothness and sparsity.

Figure 1. The Graphical Structure of Nonstationary MGP
Notes. Latent processes and kernel functions are shown on a gray background, whereas the parameters are shown on a white background. The parameters’ priors are shown in rectangles.

3.1.1. Dynamic MGP.

In this subsection, we introduce the major part of the proposed nonstationary MGP, which corresponds to the first layer in Figure 1. To simplify the notation, we slightly abuse the notation used in the previous section. Specifically, we take the first m − 1 outputs fi:XR,i=1,,m1 as the sources, and the last one fm:XR as the target. We still assume that the observation yi is accompanied with the i.i.d. measurement noise ϵiN(0,σi2). Let I={1,2,,m} be the index set of all outputs, and IS=I/{m} contain the indices of all sources. For simplicity yet without loss of generality, we assume the source and target data are sampled at the time t{1,2,,n}, that is, Xi=(xi,1,,xi,t,,xi,n)T and yi=(yi,1,,yi,t,..,yi,n)T. In Online Appendix A, we will show a more general case where each output is observed only at a subset of time stamps.

Based on the above assumption, our dynamic MGP model is formulated as

yi(xt)=fi(xt)+ϵi=αii,tgii,t(xt)zi(xt)+ϵi,iISym(xt)=fm(xt)+ϵm=jIαjm,tgjm,t(xt)zj(xt)+ϵm,(7)
where αii,t and αjm,t are time-varying amplitude parameters, gii,t(x) and gjm,t(x) are time-varying kernel functions, and {zi(x)}i=1m are latent white Gaussian noise processes independent of each other. This model is highly flexible as various types of kernel functions can be utilized. We choose to employ a Gaussian kernel which is widely used because of its flexibility (Boyle and Frean 2004, Alvarez and Lawrence 2011). The kernel is given by
gij,t(x)=(2π)d4|θij,t|14exp(12xTθij,t1x),(8)
where θij,t is a diagonal matrix representing the length-scale for each input feature. More importantly, such a Gaussian kernel can yield closed-form covariance functions through the convolution operation in Equation (3) (Paciorek and Schervish 2003, Wang et al. 2020a).

This flexible model allows each source to have its own kernel gii, thereby allowing for heterogeneity among the sources. In order to transfer knowledge from the sources to the target, the target is connected to {zi}i=1m1 though the kernel function gim,t. Regarding the parameters, {αii,t,θii,t}i=1m are responsible for the nonstationary behavior within each output, whereas {αim,t,θim,t}i=1m1 capture the dynamic correlation between target and sources. More specifically, the amplitude parameter αim,t controls the knowledge transfer. For example, if αim,t=0, then fi will not transfer information to fm at time t.

As the latent processes are independent of each other, the covariance matrix among sources is zero-valued. Therefore, the covariance matrix can be repartitioned as

K=(K1,10K1,m0Km1,m1Km1,mK1,mTKm1,mTKm,m)=(K(ss)K(sm)K(sm)TKmm),(9)
where the (i,i)-th block Ki,i=covi,if(Xi,Xi)+τi,iσi2I, the block-diagonal matrix K(ss) represents the covariance of source outputs, and K(sm) represents the cross-covariance between the sources and the target. Based on Equation (3), we can obtain covariance functions for the proposed nonstationary MGP model, as shown below:
coviif(xt,xt)=αii,tαii,t|θii,t|14|θii,t|14|θii,t+θii,t|12exp[12(xtxt)T(θii,t+θii,t)1(xtxt)],(10a)
covimf(xt,xt)=αii,tαim,t|θii,t|14|θim,t|14|θii,t+θim,t|12exp[12(xtxt)T(θii,t+θim,t)1(xtxt)],(10b)
covmmf(xt,xt)=jIαjm,tαjm,t|θjm,t|14|θjm,t|14|θjm,t+θjm,t|12exp[12(xtxt)T(θjm,t+θjm,t)1(xtxt)],(10c)
where iIS. Equations (10a)–(10c) represent the covariance within the sources, between the sources and the target, and within the target, respectively. To ensure the positivity of those hyper-parameters, we utilize a soft-plus transformation for them (Heinonen et al. 2016): αij,t=log[1+exp(α˜ij,t)],θij,t=log[1+exp(θ˜ij,t)], where α˜ij,t,θ˜ij,t are underlying parameters to estimate, whose range is [,]. The proposed covariance functions can be viewed as an extension of the nonstationary kernels (Paciorek and Schervish 2003) from the single-output to the multioutput case. Specifically, the autocovariance of each source in Equation (10a) is the same as the covariance for a single-output nonstationary GP. From the cross-covariance between each source and the target, we can clearly see that the amplitude parameter αim,t controls whether the cross-correlation is zero or not. The validity of the proposed covariance functions is outlined in Proposition 1:

Proposition 1.

The proposed nonstationary MGP covariance matrix in Equation (9) is positive-definite, that is, y0,

yTKy>0.

The proof is provided in Online Appendix B.

Based on Equation (9), the joint distribution of all sources and the target is expressed as

(y(s)ym)|XN([00],[K(ss)K(sm)K(sm)TKmm]),(11)
where y(s) is the collection of all source data. For notational convenience, we partition the parameters in a similar way, α(s)={α(s),t}t=1n,θ(s)={θ(s),t}t=1n,σ(s) ={σi}i=1m1,αm={αm,t}t=1n,θm={θm,t}t=1n, where α(s),t={αii,t}i=1m1,θ(s),t={θii,t}i=1m1,αm,t={αim,t}i=1m, and θm,t={θim,t}i=1m. Furthermore, we denote the collection of all parameters as Φ={Φ(s),Φm}, where Φ(s)={α(s),θ(s),σ(s)} and Φm={αm,θm,σm}.

In the proposed model, the most important and challenging task is to estimate those time-varying parameters. If no restriction is applied, model training may suffer from a serious overfitting problem. To address this issue, Gaussian processes are typically employed to model the kernel parameters, for example, log(αt)GP(0,kα(t,t)),log(θt)GP(0,kθ(t,t)), where kα,kθ are covariance functions for the amplitude and length-scale parameters, respectively. Although this technique can force the parameters to vary smoothly and reduce overfitting, it cannot model a sparse correlation between the sources and the target. Consequently, this approach cannot avoid the negative transfer caused by unrelated sources.

3.1.2. Dynamic Spike-and-Slab Prior.

The classical spike-and-slab prior (George and McCulloch 1993, Dance and Paige 2022) only handles the shrinkage of one single parameter and cannot model the smooth-varying parameters. To take both the dynamic and sparse properties of correlation into account, we propose a dynamic spike-and-slab prior placing on αm:

αim,t|γi,t,αim,t1(1γi,t)pspike(αim,t)+γi,tpslab(αim,t|αim,t1)γi,t|ηBern(η),(12)
where γi,t{0,1} is a binary sparse indicator for αim,t following a Bernoulli distribution, pspike(αim,t) is a zero-mean spike prior pushing the parameter to zero, pslab(αt,im|αt1,im) is a slab prior connecting αim,t1 and αim,t, and η is a prior weight between the spike and slab. If there is no prior information regarding the weight, we can set η to 0.5. The spike-and-slab prior is shown in the second layer of the graphical structure in Figure 1. As for all the other parameters, we do not force them to possess sparsity, so only the slab prior is placed on them to control the smoothness, that is,
αii,t|αii,t1pslab(αii,t|αii,t1),θii,t|θii,t1pslab(θii,t|θii,t1),θim,t|θim,t1pslab(θim,t|θim,t1).(13)

Note that θij,t is a diagonal matrix, so that the slab prior is placed on its d diagonal elements independently, that is, pslab(θii,t|θii,t1)=l=1dpslab({θii,t}l|{θii,t1}l). By using the conditional distributions as priors, we can control the change of amplitude or smoothness from the previous time step to the current one, for example, from αii,t1 to αii,t. Compared with the dynamic spike-and-slab prior used in linear models (Rockova and McAlinn 2021), our method does not constrain the slab prior to be a stable autoregressive process. Besides, we use a simpler but more flexible prior for γi,t, whereas the work in Rockova and McAlinn (2021) uses a prior conditional on the coefficients of the linear model.

The spike prior is responsible for shrinking the parameters to zero and cutting down the information transfer channel to the target. Common choices for this prior include point mass distribution, Laplace distribution, and Gaussian distribution with a small variance. Considering the shrinkage performance and optimization convenience, we choose the Laplace distribution as the spike prior, that is,

pspike(αim,t)=12ν0exp(|αim,t|ν0),(14)
where ν0 is the length-scale for Laplace distribution. If we maximize the log-likelihood function to optimize the model, this prior will become an L1 norm penalty and have the ability to shrink parameters. The slab prior encourages the smoothness of parameter change. In this work, we consider two types of slab priors. The first one is a hard slab prior,
pslabhard(αim,t|αim,t1)=12ν1exp(|αim,tαim,t1|ν1),(15)
which encourages αim,t to remain constant in a continuous period, approximating a piece-wise model. In the second one, the parameters are allowed to change smoothly,
pslabsoft(αim,t|αim,t1)=12πν1exp((αim,tραim,t1)22ν1),(16)
where ν1 is variance of Gaussian distribution, and ρ<1 is an autoregressive coefficient. A similar smoothing approach can also be found in Hu and Wang (2021). These two slab priors make the current parameter value exactly or roughly concentrated around the previous value. Typically, we set ν0 to be much smaller than ν1 in the soft slab prior (or ν1 in the hard slab prior) to make the two priors more separable and to put more penalty on sparsity. Besides, the values of αim at multiple time steps before t can be included in the soft slab prior, for example, pslabsoft(αim,t|αim,t1,αim,t2,). We choose the simplest form pslabsoft(αim,t|αim,t1) because of its wide application and robust performance in practice.

At time t, η can be interpreted as the prior probability that αim,t belongs to the slab process. It influences the strength of the shrinkage effect that αim,t bears. Ideally, for nonzero αim,t, the posterior mean of γi,t should be close to one so that αim,t is barely impacted by the spike prior. In the optimization algorithm developed in the next subsection, we will show that the estimated mean of γi,t is automatically adjusted based on the estimated αim,t to avoid shrinking nonzero elements. This makes our method superior to traditional Lasso methods where the sparse penalty weights are identical for zero and nonzero parameters.

Finally, based on the above discussion, the whole hierarchical model of the proposed nonstationary MGP can be expressed as follows:

y(s),ym|Φ(s),ΦmN(0,K)αii,t|αii,t1pslab(αii,t|αii,t1),iIS,θii,t|θii,t1pslab(θii,t|θii,t1),iIS,αim,t|γi,t,αim,t1(1γi,t)pspike(αim,t)+γi,tpslab(αim,t|αim,t1),iI,γi,t|ηBern(η),iI,θim,t|θim,t1pslab(θim,t|θim,t1),iI.(17)

3.2. Expectation-Maximization-Based Optimization Algorithm

The widely used algorithm for a Bayesian model is Markov chain Monte Carlo (MCMC) sampling, but it is computationally inefficient for the proposed nonstationary model with considerable time-varying parameters. Therefore, we develop an efficient EM algorithm. Instead of directly maximizing the posterior p(Φ|y)=p(Φ(s),Φm|y(s),ym), we proceed iteratively in terms of the complete log posterior logp(Φ,γ|y), where the binary parameters γ are treated as “missing data.” Because this function is not observable, in the Expectation-step (E-step), we calculate its conditional expectation given the observed data and the current estimated parameters. Then in the Maximization-step (M-step), we maximize the expected complete log-posterior with respect to Φ. More precisely, the E-step and M-step at the (k+1)th iteration can be expressed as

Estep:Q(Φ|Φ(k))=Eγ|Φ(k),y{logp(Φ,γ|y)},Mstep:Φ(k+1)=arg maxΦ{Q(Φ|Φ(k))},(18)
where Eγ|Φ(k),y(·) is the conditional expectation on posterior of γ, and Φ(k) is the optimized parameters at the kth iteration. For simplicity, we use Eγ(·) to denote Eγ|Φ(k),y(·).

Based on Bayes’ theorem and the property of multivariate normal distribution, the expectation of logp(Φ,γ|y) can be as follows (derivation details can be found in Online Appendix C):

Eγ{logp(Φ,γ|y)}=12{y(s)TK(ss)1y(s)+log|K(ss)|+(ymμ)TΣ1(ymμ)+log|Σ|}+i=1m1t=2n[logpslab(θii,t|θii,t1)+logpslab(αii,t|αii,t1)]+i=1mt=2n[logpslab(θim,t|θim,t1)+(1Eγγi,t)logpspike(αim,t)+Eγγi,tlogpslab(αim,t|αim,t1)]+const.,(19)
where μ=K(sm)TK(ss)1y(s) is the conditional mean of target given the sources and Σ=KmmK(sm)TK(ss)1K(sm) is the conditional covariance.

In the E-step, because γ is only dependent on Φm, the posterior of γi,t is calculated as

p(γi,t|Φm(k))=p(αim,t(k)|γi,t)p(γi,t)p(αim,t(k))p(αim,t(k)|γi,t)p(γi,t)=[(1γi,t)pspike(αim,t(k))+γi,tpslab(αim,t(k)|αim,t1(k))]·ηγi,t(1η)(1γi,t).(20)

Then the conditional expectation of γi,t can be updated as

Eγγi,t=η·pslab(αim,t(k)|αim,t1(k))(1η)·pspike(αim,t(k))+ηpslab(αim,t(k)|αim,t1(k)).(21)

The posterior mean Eγγi,t can be interpreted as the posterior probability of classifying the current parameter αim,t into a slab process as opposed to a spike process, based on the past value αim,t1. For example, we set η to 0.5 as a noninformative prior, and take a small ν0 (e.g., 0.01) for the spike prior and a large ν1 (e.g., 0.1) for the soft slab prior. Based on Equation (21), supposing that (αim,tαim,t1)2 is small and |αim,t| is large, the expectation Eγγi,t will tend toward one, indicating that αim,t is more likely from the slab prior. On the other hand, if |αim,t| is small, Eγγi,t is close to zero, enforcing strong shrinkage on αim,t.

In the M-step, we can optimize the objective function in Equation (19) with various gradient ascent methods, such as (stochastic) ADAM, L-BFGS, etc. (Zhu et al. 1997, Kingma and Ba 2015). This objective function is actually a standard Gaussian process log-likelihood with additional regularization terms. The regularization terms penalize the difference between parameters at successive time points and shrink the amplitude parameters to facilitate source selection. The weights of the regularization terms are modulated by the expectation of γ. Ideally, for nonzero αim,t,Eγγi,t will equal to one, so no shrinkage effect will be placed on it. In other words, the strength of sparse penalty is automatically adjusted through Equation (21), and this adjustment has explicit statistical interpretability.

In our case studies, we find the algorithm converges rapidly, for example, achieving convergence after only five iterations. The whole algorithm is summarized in Algorithm 1, where an ADAM method (Kingma and Ba 2015) is utilized in the M-step. Note that the large parameter space of Φ poses a challenge in identifying the modes of the full posterior. To speed up the convergence, we propose to initialize the source parameter Φ(s) by maximizing the sum of sources’ marginal log-likelihood and source parameter prior:

maxΦ(s)logp(y(s)|Φ(s))+logp(Φ(s)).(22)

For target parameters, we find simple random initialization can achieve satisfactory performance in experiments.

3.3. Computational Challenge

There are three main computational challenges that we need to address. The first one is the calculation of integration for a convolution kernel. To avoid an intractable integration for the covariance, we utilize the Gaussian kernel in Equation (8) and derive closed-form covariance functions in Equations (10a)–(10c).

The second challenge is calculating the inverse of the covariance matrix, which is a critical issue for all GPs. The computational complexity of Algorithm 1 is approximately O(i=1mni3), where ni is the number of data points for each output. In comparison, the computational cost for traditional MGP is O((i=1mni)3) with the same size of data and nonseparable covariance. The main computational load of our model is in calculating the inverse of K(ss) and Σ in the source marginal distribution logp(y(s)|Φ(s)) and the target conditional distribution logp(ym|Φm,y(s),Φ(s)), respectively, in Equation (19). Because all the latent processes are independent of each other, K(ss) is a block-diagonal matrix and the complexity is reduced from O(i=1m1ni3) to O(i=1m1ni3). The calculation of the inverse of Σ is O(nm3). Therefore, the overall computational complexity is reduced to O(i=1mni3).

Finally, it is a hard task to estimate a considerable number of time-varying parameters. Therefore, we develop the EM-based algorithm to fit the model rather than using a sampling method. Based on the results of a nonstationary linear model (Rockova and McAlinn 2021), the MCMC and EM algorithm lead to very close prediction errors, but the running time of MCMC is about 10 times longer than that of the EM.

Algorithm 1

(The Optimization Algorithm for the Nonstationary MGP Model)

Input: Data {Xi,yi}i=1m, ν0, ν1, ρ, η.

  • 1: Set starting value: Φ(s)(1),Φm(0).

  • 2: Initialization: obtain Φ(s)(0) through Equation (22).

  • 3: for kout iterations do

  • 4:  E-step given Φ(k): Update {Eγγi,t}i=1,t=2m,n through Equation (21).

  • 5:  M-step given Eγγi,t:

  • 6:  for kin epochs do

  • 7:   Calculate K(ss),K(sm), and Kmm according to Equation (10).

  • 8:   Calculate the value and gradient of objective function in Equation (19).

  • 9:   Obtain Φ(k+1) using the ADAM method.

  • 10:  end for

  • 11: end for

3.4. Tuning Parameter Selection

The tuning parameters for our model are ν0, ν1 (for the hard slab prior), and ν2 (for the soft slab prior). Here, because the key of our method is selecting the most informative sources, we propose to maximize the following criterion:

B(ν)=Nlogp(y|X)log(N)cν(αm),(23)
where N is the number of data, logp(y|X) is the log-likelihood for both the source and the target outputs, and cν(αm) is the number of nonzero elements in αm given ν. A similar criterion is proposed in Zhang et al. (2021). Note that ν={ν0,ν1} for the hard slab prior and ν={ν0,ν2} for the soft slab prior. This criterion is similar to the Bayesian Information Criterion (BIC). The first term tends to select a more complex model with a larger likelihood for all outputs, whereas the second term prefers simpler models where fewer sources are chosen to transfer information to the source. To reduce the computation, we first determine the ratios r1=ν1/ν0 and r2=ν2/ν0 to make the spike-and-slab priors separable (Rockova and McAlinn 2021). Then we design a two-dimensional search grid for (ν1,ν1/r1) with the hard slab or (ν2,ν2/r1) with the soft slab. The optimal value of ν is searched over the two-dimensional grid. For example, we set r1{5,10} and (ν1,ν0){(1/5,1/25),(1/5,1/50),(1/10,1/50),(1/10,1/100),(1/15,1/75),(1/15,1/150)} for a hard slab prior.

3.5. Model Prediction

Because the EM algorithm only estimates the value of parameters at the observed time points, given a new input xt* of interest, we first need to estimate the target parameter αm,t* and θm,t* at the new time point t*, then derive the predictive distribution of ym(xt*).

3.5.1. Forecasting.

In the forecasting task, t*>n. The estimated value of θim,t* is

θim,t*={θim,n,forhardslabprior,ρt*nθim,n,forsoftslabprior,
which is actually the mode of pslab(θim,t*|θim,n). As for αim,t*, if Eγγi,n0.5, we consider it from the slab process and estimate it using the same method as θim,t*. Otherwise, it is classified to a spike process and shrunk to zero (Ročková and George 2014).

3.5.2. Recovery.

In the recovery task, the target data are unobserved at some specific time points and we aim to recover the missing data, that is, 1<t*<n. Define tbe and taf to be the nearest observed time points before and after t*, respectively. Denote the nearest observation time to t* as tnear, that is,

tnear=arg mint{tbe,taf}|tt*|.

As the parameters before and after t* are already optimized by the EM algorithm, the estimation of θim,t* becomes

θim,t*={θim,tnear,forhardslabprior,LSE(θim,tbe,θim,taf),forsoftslabprior,
where LSE(·) represents a least-square estimation for autoregressive process introduced in Choong et al. (2009). We also let αim,t*=0, if Eγγi,tnear<0.5. Otherwise, we estimate its value in the same way as θim,t*.

Then, given the parameter αm,t* and θm,t*, and the new input point xt*, the joint distribution of ym(xt*) and observations ym can be expressed as

(ymym(xt*))N([μμt*],[ΣΣt*Σt*TΣt*,t*]),(24)
where μt*=K(s*)TK(ss)1y(s),Σt*=Km*K(sm)TK(ss)1K(s*), and Σt*,t*=Kt*,t*K(s*)TK(ss)1K(s*). In the above equations, K(s*) is the cross-covariance matrix of the sources and the new input xt*,K(m*) is the covariance of the target observation and the new point, and Kt*,t* is the variance at xt*. Then, the posterior distribution of ym(xt*) can be derived as
ym(xt*)N(μt*+Σt*TΣ1(ymμ),Σt*,t*Σt*TΣ1Σt*).(25)

4. Numerical Study

In this section, we verify the effectiveness of the proposed nonstationary MGP with the spike-and-slab prior (denoted as DMGP-SS) using synthetic data. In Section 4.1, we briefly describe the general settings for the numerical study and benchmark methods. In Section 4.2, we introduce the design of simulation cases, where the cross-correlation of the sources and the target is dynamic and sparse. In Section 4.3, we demonstrate our model’s capability in detecting the underlying correlation pattern as well as improving target prediction performance on the synthetic data.

4.1. General Settings

Similar to the assumption made in the model development section, we generate m sequences consisting of m − 1 sources and one target, sampled at the same time stamps. The input space is simply time. To investigate the source selection capability of DMGP-SS, we assign only mt<m1 sources to be correlated with the target at time t. Besides, a source will remain either correlated or uncorrelated continuously for a certain period of time.

For comparison, we consider three benchmarks:

  1. GP. The target is modeled using a single-output GP, with a squared-exponential covariance function.

  2. MGP-L1. It is a state-of-art static method introduced in Wang et al. (2022). MGP-L1 models the target and sources in one MGP model, with the same covariance structure as in Equation (9). The scaling parameters {αim}i=1m1 are penalized by L1 term to achieve source selection. The regularized log-likelihood of this model is

    logF=12yTK1y12log|K|λi=1m1|αim|const.,
    where K is calculated using the static covariance functions in Wang et al. (2022) and λ is a tuning parameter.

  3. DMGP-GP. This is a state-of-the-art nonstationary MGP model, which constructs an LMC model for all outputs and assumes the hyper-parameters follow other GPs (Meng et al. 2021). More details can be found in Online Appendix D. In this model, the covariance for the m outputs is

    cov[y(xt),y(xt)]=AtAtTk(xt,xt)+diag{σi},
    where AtAtTRm×m is the correlation matrix of m outputs, and diag{σi} is the diagonal matrix with {σi}i=1m. In this study, we focus on the correlation between the sources and target, which corresponds to the last column of AtAtT (except the last element (AtAtT)mm), that is, (AtAtT)0:m1,m.

All methods are implemented in Python 3.8 and executed on a computer with an Intel® Core™ i5-7400 CPU with 3.00 GHz and 16 GB RAM. Both GP and MGP-L1 are implemented using GPflow (Matthews et al. 2017). DMGP-GP is implemented using TensorFlow and optimized with ADAM (Kingma and Ba 2015), with a maximum iteration limit of 500. The EM algorithm for DMGP-SS is also based on TensorFlow, and we use stochastic ADAM with four batches in the M-step. We set the kout and kin in Algorithm 1 to 5 and 400, respectively.

For MGP-L1, the weight of L1 penalty λ is a tuning parameter. For DMGP-GP, we use square-exponential functions for kα and kθ. For simplicity, we apply the same tuning parameters for both kernels, that is, the amplitude α# and length-scale θ#. Those parameters are tuned by cross-validation. In the case of DMGP-SS, the prior sparsity parameter η is set to 0.5. We repeat each case 50 times and report the prediction performance through averaging the results.

4.2. Simulation Cases

The main objective of this section is to demonstrate the effectiveness of our method in capturing the nonstationary and sparse cross-correlation between the sources and the target. For simplicity, we hold the other characteristics constant over time, for example, the smoothness of each output. Specifically, we design two simulation cases with different cross-correlation patterns. The first case involves a piece-wise constant cross-correlation, whereas the second case has a smoothly changing correlation. In each case, the input data {xt}t=1130 are evenly spaced in [1,130]. The observed data are generated from sine functions with measurement noise ϵtN(0,0.32).

  • Case 1. In this case, we define four kinds of source functions:

    Y1(xt)=3sin(πxt/20+e1)+ϵt,Y2(xt)=2sin(2πxt/20+e2)exp[0.5(xt%401)]+ϵt,Y3(xt)=3sin(4πxt/20+e3)+ϵt,Y4(xt)=2sin(5πxt/20+e4)+ϵt,
    where eiN(0,0.22) is a random phase to make the sampled outputs different from each other, and “%” represents the reminder operation. The term exp[0.5(xt%401)] is used to deviate the shape of Y2 from the standard sine function. In each experiment, we generate 4k sources through sampling each kind of function k times, that is, m=4k+1. Specifically, the sources {yi+4j|0jk1} are sampled from Yi. Then, we define the dynamic target output as
    fm(xt)=a1,tsin(πxt/20)+a2,tsin(2πxt/20)+a3,tsin(4πxt/20)+a4,tsin(5πxt/20).

    In this case, we simulate a piece-wise constant cross-correlation by setting

    a1,t=(2+2a1)It<40,a2,t=(2+2a2)I40t<80+(1+a2)I80t130,a3,t=(1+a3)I80t130,a4,t=0.

    Therefore, there are three segments, [0,40),[40,80), and [80,130]. Only the first, the second, and the second and third sources are correlated with the target in the three periods, respectively. The other sources remain uncorrelated with the target at all times.

  • Case 2. Compared with Case 1, we only change the coefficients {ai,t}i=13 into smoothly changing ones in this case. Specifically, we let them vary along sine-cosine trajectories,

    a1,t=[(2+a1)cos(πt/120)+0.5]It<40,a2,t=[(2+a2)sin(πt/120π/6)+0.5]I40t<130,a3,t=[(2+a3)sin(πt/120π/2)+0.5]I80t<130.

In all cases, we set k = 1, 4 to generate 4 and 16 source outputs for each case. In order to compare the prediction performance of different methods, we randomly remove three short sequences of length 10 from [10,30],[50,70], and [90,110] respectively. These 30 data points are treated as missing data, whereas the others are used as training data.

4.3. Simulation Results

To begin with, we demonstrate that the proposed DMGP-SS is capable of capturing the dynamic and sparse correlation between the sources and the target. Figure 2 illustrates the estimated {αim}i=14 for MGP-L1, the {αim,t}i=14 for DMGP-SS, and the estimated (AtAtT)1:4,m for DMGP-GP in Cases 1 and 2 with four sources. Figure 3 visualizes the sources and target prediction results in Case 1 with four sources. DMGP-SS is implemented with the hard and soft slab priors for Cases 1 and 2, respectively. Note that the values of ai,t and αim,t are not identical, because ai,t is a linear combination weight rather than the correlation parameter in MGP.

Figure 2. Dynamic Correlation Detection Results with Four Sources
Notes. (a) Case 1; (b) Case 2. The first column is the true ai,t, the second and fourth columns are the estimated αm for MGP-L1 and DMGP-SS, respectively, and the third column shows the estimated (AtAtT)0:m1,m for DMGP-GP.
Figure 3. (Color online) Visualization of Prediction Results in Case 1 (k = 1)
Note. The shaded region represents the 99% confidence interval.

Overall, DMGP-SS successfully recovers the true dynamic structural sparsity of correlation shown in the first column of Figure 2. Firstly, DMGP-SS tracks closely the periods of correlation between each source and the target, achieving a dynamic selection of sources. Because the target’s characteristics do not change abruptly (as shown in Figure 3), it is reasonable that the estimated correlation change points are about 10 time steps before or after the designed change times. Second, the hard slab prior encourages nearly piece-wise correlation, whereas the soft slab prior allows smoothly changing correlation. Because of the appropriate selection of sources at different times in Figure 3, the proposed DMGP-SS achieves precise prediction with the lowest prediction uncertainty. This highly improves the confidence of decision making when using the recovered series. The difference on confidence interval of three MGP models is because the uncorrelated sources “poison” the correlation structure and decrease the influence of the truly correlated sources, resulting in a lower value of variance reduction term Σt,*TΣ1Σt,* in posterior prediction Equation (25).

In contrast, another nonstationary method, DMGP-GP, fails to estimate a sparse structure in the source-target correlation because the GP prior on parameters lacks the shrinkage effect. The proposed DMGP-SS addresses this problem through combining the smooth slab prior and the shrinking spike prior. Regarding MGP-L1, although it can estimate a sparse structure of αm, the estimated values of nonzero parameters are constant over time and cannot reflect the change of correlation. As a result, it performs even worse than GP in recovering the target output in Figure 3.

Another advantage of DMGP-SS is the adaptive adjustment of the spike-and-slab combination weight, Eγγi,t. In our settings, ν01 in the spike prior is much larger than ν11 in the hard slab prior (or ν11 in the soft slab prior), to put more penalty on the correlation sparsity. For example, we set ν0=0.02 and ν1=0.1 in Case 1. However, this sparse penalty does not cause significant bias on nonzero αim,t because of the automatically adjusted Eγγi,t in the EM algorithm. Specifically, we initialize it with 0.99 to barely shrink parameters at the beginning. Then Eγγi,t is updated in the E-step of the EM algorithm. Figure 4 shows its estimated value after five iterations. For the correlated sources (e.g., the first source during t[0,50]), their corresponding Eγγi,t is very close to one, so they bear a negligible shrinkage effect from the spike prior. In contrast, for the uncorrelated sources (e.g., the first source after t = 50), Eγγi,t is approximately 0.2, implying a substantial shrinkage effect. The value 0.2 can be derived based on Equation (21). For consecutive zero elements, pslab=η(2ν1)1, and pspike=(1η)(2ν0)1, resulting in Eγγi,tν11/ν01.

Figure 4. The Estimated Eγγ1:4 for DMGP-SS in Case 1

Table 1 summarizes the results of 40 repetitions for each case. DMGP-SS outperforms all the other methods in both cases. Further, the increase of source number does not affect the prediction accuracy, demonstrating its remarkable robustness in dealing with moderate data. MGP-L1 exhibits slightly higher prediction accuracy than GP. The static covariance structure limits its ability to transfer accurate information at all times. Under some circumstances, this limitation will cause a negative transfer effect on the learning of target (for example, the result shown in Figure 3). DMGP-GP has a better performance on target prediction than GP and MGP-L1, because of the ability to model dynamic correlation. However, it does not achieve the same prediction accuracy as DMGP-SS in these cases. On the one hand, it lacks the ability to exclude the impact of uncorrelated sources. On the other hand, DMGP-GP is an extension of LMC and uses the same function k(xt,xt) to model the autocovariance of every output. This feature makes it unsuitable for our cases where the source covariance functions have four kinds of length-scales. Nevertheless, the proposed DMGP-SS models each source with separate kernels and latent functions, highly increasing its flexibility.

Table

Table 1. Prediction Error for Cases 1 and 2

Table 1. Prediction Error for Cases 1 and 2

OutputsCase 1Case 2
GPMGP-L1DMGP-GPDMGP-SSGPMGP-L1DMGP-GPDMGP-SS
50.990.890.760.561.121.040.850.66
(0.26)(0.21)(0.23)(0.13)(0.27)(0.24)(0.26)(0.17)
170.950.910.750.501.151.020.780.64
(0.25)(0.21)(0.19)(0.15)(0.26)(0.21)(0.17)(0.16)


Note. Values in parentheses are standard deviations.

Table 2 lists the computational time of the four methods in Case 1. Between the two nonstationary methods, DMGP-SS requires much less computation time than DMGP-GP. This exactly verifies the analysis in Section 3.3 that our model can save more time than the traditional nonstationary MGP method, because of a block-sparse covariance matrix. In fact, our model is scalable for larger sizes of data, which is described in Online Appendix E.

Table

Table 2. Computational Time for the Different Methods in Case 1

Table 2. Computational Time for the Different Methods in Case 1

OutputsnGPMGP-L1DMGP-GPDMGP-SS
51300.138.613073
171300.10262,100300

5. Case Study

In this section, we apply DMGP-SS to two cases: human movement signal modeling and control policy optimization. In the first case, these signals are recorded by sensors attached to different joints, such as hands and feet. As the movement of joints is dynamically correlated across different gestures (Xu et al. 2022), it is reasonable to utilize a nonstationary method to capture the varying correlation and leverage information among the signals of joints. Regarding the control policy iteration, we study a classical reinforcement learning (RL) problem, the mountain-car problem. We evaluate the performance of DMGP-SS on leveraging knowledge between different systems when the environment is nonstationary.

5.1. Movement Signal Modeling

5.1.1. Data Description.

We use the MSRC-12 gesture data set (Fothergill et al. 2012) consisting of sequences of human skeletal body movement. Twenty sensors are distributed across the body, and each sensor can record three-dimensional coordinates of joint positions. The motion signal has a sampling frequency of 30 Hz and an accuracy of ±2cm. The data set comprises 12 different gestures performed by 30 participants. The experiment involves each participant starting from a standing position, performing one gesture, and returning to the standing state. This process repeats several times continuously within each sample.

To demonstrate the effectiveness of DMGP-SS, we connect the instances of three gestures (“Goggles,” “Shoot,” and “Throw”) performed by the same individual. Figure 5(a) shows the snapshots of the standing position and the selected gestures. In the first two gestures, the participant stretches both arms in front of him to perform searching or shooting motions. In the third gesture, the participant only uses his left arm to make an overarm throwing movement. In these gestures, the main acting joints are hands, wrists, and elbows, where the trajectories of hands and wrists are almost identical. Therefore, we select the movement signals of four joints (left wrist, left elbow, right wrist, and right elbow) as 12 outputs. We choose the z-coordinate movement of the left elbow as a target output, while using the remaining 11 movement signals as sources.

Figure 5. (Color online) Connection of the Instances of Three Gestures Performed by the Same Individual
Notes. The snapshots of “Stand” (initial state), “Goggles” (Put on vision goggles), “Shoot” (shoot a pistol), and “Throw” (throw an object). The 20 joints are represented by circles. We mark out the four selected joints for our study, where “L” and “R” represent “left side” and “right side,” respectively, and “E” and “W” represent the elbow and wrist joints, respectively. (b) The 12 movement signals of the selected joints in the three gestures’ data, where the red signal is taken as the target signal and the others are the source signals. In the label of the vertical axis, “x,” “y,” and “z” represent different coordinates. Each signal has a vertical range of [2,2], with the horizontal ticks representing time indices.

We select two 120-frame-long instances for each gesture and down-sample each instance to 30 frames. Therefore, there are 180 points for each output. To eliminate the difference in initial coordinate values, we reset the initial three-dimensional coordinate value to (0, 0, 2) across different recordings. Additionally, we rescale all outputs to [2,2]. Figure 5(b) displays the 12 outputs and the change of joints’ positions.

5.1.2. Results.

Intuitively, the cross-correlation between the source and target signals should remain constant for a single gesture, so a hard slab prior is used in DMGP-SS. All other settings for this case are identical to those used in the simulation studies. We still simulate consecutive data missing for the target. From the 60-point-long time-series of each gesture, we randomly remove a sequence of length 10 as missing data.

First of all, Figure 6(a) shows the estimated correlation between the sources and the target. MGP-L1 selects six sources (the third “R-W-z,” the fourth “R-E-x,” the sixth “R-E-z,” the seventh “L-W-x,” the ninth “L-W-z,” and the 11th “L-E-y”) as the correlated signals for the whole time period, in which the ninth source has the strongest correlation. DMGP-SS selects the sixth source (“R-E-z”) and the ninth source (“L-W-z”) as correlated sources when 0t120 and 120t180, respectively. DMGP-GP does not provide a sparse estimation of cross-correlation. Among them, the proposed DMGP-SS accurately captures the underlying physical movements of each gesture. In the “Goggles” and “Shoot” gestures, both elbows have almost the same trajectory. In the “throw” gesture, the left elbow’s movement is highly correlated with that of the left wrist. These findings align well with the signals shown in Figure 5(b). On the contrary, limited by the static correlation structure, MGP-L1 can only select some sources and force them to be correlated with the target all the time, but such signals do not exist in the dynamic environment. For DMGP-GP, the results cannot provide us an intuitive understanding of the joints’ relationship.

Figure 6. (Color online) Estimated Correlation Between the Sources and the Target and Recovered Target Signal in One Experiment
Notes. (a) Correlation estimation results for the data of three gestures. The first and third rows are the estimated αm for MGP-L1 and DMGP-SS, respectively, and the second row shows the estimated {AtAtT}1:11,12 for DMGP-GP. (b) Recovery results of four methods for the movement signal of the right wrist. The shaded region represents the 99% confidence interval.

Figure 6(b) displays the recovered target signal in one experiment. Notably, DMGP-SS accurately recovers the missing data with high precision. Besides, it has the least uncertainty because it can select the most correlated sources at each time and such a high correlation improves the confidence on prediction results. Conversely, the predictions of MGP-L1 and DMGP-GP display an obvious bias in the first gap and higher uncertainty for all gaps. Besides, similar to the results of numerical studies, the proposed DMGP-SS also gives predictions with the lowest uncertainty, which significantly improves the confidence of decision making.

We further repeat the experiments 36 times. Table 3 compares the prediction accuracy of four methods in terms of both the mean absolute error (MAE) and the continuous-ranked-probability score (CRPS). The CRPS measure is a widely used metric to evaluate the probabilistic forecast for a real-valued outcome (Hersbach 2000):

CRPS=ntest1i=1ntest[Φ(y^i)1y^iyi)]2dy^i,

Table

Table 3. Prediction Error for Real Case

Table 3. Prediction Error for Real Case

GPMGP-L1DMGP-GPDMGP-SS
MAE-mean0.430.220.500.18
MAE-std.(0.12)(0.10)(0.25)(0.08)
CRPS-mean0.300.160.410.15


Note. The values in parentheses are standard deviations.

where y^i is the predicted output and Φ is the cumulative density function (CDF) of the predicted distribution. A low CRPS value suggests that the predicted posteriors are concentrated around the true value, indicating a high probabilistic forecast performance. As expected, DMGP-SS outperforms the other methods because of its ability to capture the dynamic and sparse correlation accurately. MGP-L1 performs better than GP, benefiting from the information borrowed from the other sources. However, it cannot model a dynamic correlation, resulting in lower prediction accuracy than DMGP-SS. Regarding DMGP-GP, although it captures the change of correlations between the target and some sources, its prediction accuracy is even lower than GP. This result may be attributed to the fact that nonsparse correlations lead to potential negative transfer effects. Besides, because the sources’ smoothness is heterogeneous, it is inappropriate to use the same autocovariance function to model all outputs.

5.2. Control Policy Optimization

In reinforcement learning problems, there are five important elements for the agent: state s, action a, reward r, value V, and policy W. Starting from one state s, the agent takes an action a, transitions into a new state u, and gets an immediate reward r(u) from the environment. In general, a reinforcement learning framework includes two primary procedures:

  1. Model estimation: estimating the state transition function U(s, a) based on the observed transition samples [(s,a),u].

  2. Policy iteration: iteratively estimating the state value V(s) (the long-term expected reward) for each state and improving the control policy W(s).

More details can be found in the works on GP-based reinforcement learning (Kuss and Rasmussen 2003, Verstraeten et al. 2020).

In this study, we employ the well-known reinforcement learning problem, the mountain-car problem, to demonstrate the application of our model in decision making. Figure 7 illustrates such a problem. A car begins in a valley and aims to reach a goal position of the rightmost hill. Because of the steep slope of the hill, the car cannot directly accelerate to the goal position. Instead, it needs to drive up the opposite side, turn back, and accelerate to reach the goal position. In the system, the agent state is described by the position spos[1.2,1.0] and the velocity svel[0.07,0.07] of the car, that is, s=(spos,svel). The agent action is a horizontal force a[1,1]. The car starts at the state sinit=(sinitvel,sinitvel) with sinitpos[0.6,0.5] and sinitvel=0, aiming to reach the goal state sgoal=(0.45,0). The reward function is the probability density function of N(sgoal,diag{0.052,0.00352}). The dynamic equation of this system is approximated by

stvel=st1vel+P·at1G·cos(3·st1pos)stpos=st1pos+stvel,(26)

Figure 7. (Color online) Illustration of the Mountain-Car Problem

where P is the horizon power unit and G is the vertical force unit.

We consider the control problem involving a car in a nonstationary environment with limited transition samples. During t[0,20), the target car runs in an environment with (P,G)=(0.009,0.0015) and we get 20 random samples. However, at t = 20, an unknown change occurs, altering the environment factors to (P,G)=(0.0011,0.0026). We sampled another 20 random samples during t[20,40). After t = 40, we start to control the car to reach the goal position and the environment does not change any more. Given that those samples are too limited to build an accurate transition model, we transfer information from two historical source data sets, each consisting of 200 samples from stationary environments with (P,G)=(0.01,0.0015) and (P,G)=(0.001,0.0025), respectively. We can see that the target environment is close to the first source environment before t = 20, and is close to the second one after that.

Algorithm 2 summarizes the workflow of reinforcement learning, where we employ the DMGP-SS as a transition model.

Algorithm 2

(Control Policy Optimization for Mountain-Car Case)

Input: Source transition samples {(si,n,ai,n),ui,n}i,n=12,200, target transition samples {(s3,t,a3,t),u3,t}t=140, reward function {r(s)}, 256 support points Ssupp.

  • 1: Initialize: policy W(s) random policy, value Vsupp=r(Ssupp).

  • 2: Fit two DMGP-SS models for the state transition using both the source and the target transition samples, one for the position state and the other for the velocity state.

  • 3: Fit a GP for the state value model V(s) using {Ssupp,Vsupp}.

  • 4: Improve policy W(s) and value model V(s) iteratively based on the state-transition model until V(Ssupp) converges.

  • 5: Execute the optimized policy starting at the state sinit:

For simplicity, we model the source transition data with stationary GPs in DMGP-SS. The sources and the target are expressed as

ui(s,a)=αiigii(s,a)zi(s,a)+ϵi,iIS,um(s,a)=jIαjm,tgjm,t(s,a)zj(s,a)+ϵm.(27)

Here, u is the position or velocity state to which the agent transitioned from a state s after taking an action a. After fitting this model to both the target and source samples, we train a GP for the value model V(s) with even-spaced support points Ssupp and rewards r(Ssupp). Based on the transition model U(s,a), we iteratively improve the policy W(s) and value model V(s) until the prediction of V(s) converges. Details on the policy improvement procedure can be found in Kuss and Rasmussen (2003) and Verstraeten et al. (2020). Although we focus on the offline setting in this case, it is worth noting that this framework can be readily extended to accommodate an online setting, wherein the training sample consists of the visited states and the transition models are updated every few steps.

We choose two reinforcement learning benchmark methods (Kuss and Rasmussen 2003, Verstraeten et al. 2020) with the stationary GP and MGP as the state transition model, respectively. The maximum execution steps are 600 for each method. In Table 4, we report the mean of absolute distances to the goal state for three methods. The DMGP-SS-based control policy has the shortest average distance to the goal state in 600 steps. Specifically, Figure 8 compares the position of the car controlled by the three policies. With DMGP-SS as the transition model, the car reaches the goal position and stays there after about 250 time steps. However, the other methods cannot find a good policy to reach the goal position within 600 moves, because their stationary transition models cannot account for the environment change and make the policy iteration hard to converge.

Table

Table 4. Predictive Accuracy on State Transition and the Mean of Absolute Distances to the Goal State in the Mountain-Car Case

Table 4. Predictive Accuracy on State Transition and the Mean of Absolute Distances to the Goal State in the Mountain-Car Case

MetricRL-GPRL-MGPRL-DMGP-SS
The mean of absolute distances to the goal position0.980.830.27
Predictive MAE on velocity transition (103)3.85.80.51
Predictive MAE on position transition (103)3.64.98.2
Figure 8. (Color online) The Control Path of the Three Methods
Notes. (a) RL-GP, (b) RL-MGP, and (c) RL-DMGP-SS. The blue points mark the positions every 50 steps.

In Table 4, we further compare the predictive MAE on state transition for three methods. The proposed method has a significantly lower prediction error on velocity state transition than the other methods, because it can capture the change of environment and transfer information from the related sources. Figure 9 illustrates the estimated αm from DMGP-SS trained on the velocity transition data. We can find that the proposed method successfully finds that the correlation between the sources and the target changes at t = 20. Therefore, during the policy improvement stage, it can leverage information from the similar source (the second one) and avoid the negative transfer from the uncorrelated source (the first one). Regarding the prediction error on position transition, although RL-DMGP-SS has a higher MAE than the other benchmarks, the difference is minor considering the position range [1.2,0.6]. Therefore, our method can provide the best control policy.

Figure 9. The Estimated Correlation Parameter αm from DMGP-SS on Velocity Transition

6. Conclusion

This paper proposes a flexible nonstationary multioutput Gaussian process for modeling multivariate data in transfer learning. The novelty of our approach lies in its ability to capture dynamic and sparse cross-correlations between sources and targets. We achieve this by allowing correlation parameters to follow a spike-and-slab prior, where the slab prior ensures correlation variation over time, and the spike prior encourages parameters to shrink to zero, eliminating negative transfer effects from uncorrelated sources. The ratio of these two priors is automatically adjusted in the proposed EM algorithm, preventing the shrinkage effect on nonzero correlation parameters. Through the experiments on simulation and a human gesture data set, we demonstrate that our method is well suited for both capturing nonstationary characteristics and mitigating negative transfer.

The proposed data-driven method provides a powerful tool for researchers and engineers to select the most informative sources to transfer knowledge. Except for high-dimensional time-series modeling, our approach could also find applications in both sequential decision making and change-point detection. For instance, transfer learning has arisen to handle the critical challenge of sparse feedbacks in RL, a popular framework for solving sequential decision-making problems. However, negative transfer is still a notable challenge for multitask reinforcement learning and it is risky to naively share information across all tasks (Yang et al. 2020, Zhu et al. 2023). Therefore, we embedded our model with the offline RL transfer framework to automatically select correlated sources to share knowledge in a nonstationary environment (Padakandla et al. 2020). In the future, we can further develop our method to account for an online RL task. Besides that, the estimated dynamic correlation among outputs can help us to understand the structure of time-series even with some missing data, as well as to detect some structural change points for subsequent decision making.

Many extensions are possible for our model. Although the proposed methodology is flexible enough to capture complex and dynamic correlations, scaling it up to large data sets is computationally challenging. Possible solutions include utilizing a sparse approximation for the covariance matrix or developing a more efficient optimizing algorithm.

In addition, the proposed EM optimization algorithm solely provides estimated values of the model parameters without incorporating any uncertainty measurement. To address this issue, we can use variational inference to approximate the true posterior distribution of parameters, thereby capturing the inherent uncertainty associated with these parameters. The approximated uncertainty would be propagated into the prediction at new points.

References

  • AlBahar A, Kim I, Wang X, Yue X (2022) Physics-constrained Bayesian optimization for optimal actuators placement in composite structures assembly. IEEE Trans. Automation Sci. Engrg. 20(4):2772–2783.Google Scholar
  • Alvarez MA, Lawrence ND (2011) Computationally efficient convolved multiple output gaussian processes. J. Machine Learn. Res. 12:1459–1500.Google Scholar
  • Bai Y, Safikhani A, Michailidis G (2022) Hybrid modeling of regional Covid-19 transmission dynamics in the U.S. IEEE J. Selected Topics Signal Processing 16(2):261–275.Google Scholar
  • Boyle P, Frean M (2004) Dependent Gaussian processes. Saul L, Weiss Y, Bottou B, eds. Advances in Neural Information Processing Systems, vol. 17 (MIR Press, Cambridge, MA), 217–224.Google Scholar
  • Cheng S, Lu F, Peng P (2021) Short-term traffic forecasting by mining the non-stationarity of spatiotemporal patterns. IEEE Trans. Intelligent Transportation Systems 22(10):6365–6383.Google Scholar
  • Choong MK, Charbit M, Yan H (2009) Autoregressive-model-based missing value estimation for DNA microarray time series data. IEEE Trans. Inform. Tech. Biomedicine 13(1):131–137.Google Scholar
  • Christakos G (2012) Random Field Models in Earth Sciences (Courier Corporation, North Chelmsford, MA).Google Scholar
  • Damianou A, Lawrence ND (2013) Deep Gaussian processes. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence Statist., Proc. Machine Learn. Res., vol. 31 (JMLR.org, Scottsdale, AZ), 207–215.Google Scholar
  • Dance H, Paige B (2022) Fast and scalable spike and slab variable selection in high-dimensional Gaussian processes. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist., Proc. Machine Learn. Res., vol. 151 (JMLR.org), 7976–8002.Google Scholar
  • Fairchild KD, Lake DE, Kattwinkel J, Moorman JR, Bateman DA, Grieve PG, Isler JR, Sahni R (2017) Vital signs and their cross-correlation in sepsis and NEC: A study of 1,065 very-low-birth-weight infants in two NICUs. Pediatric Res. 81(2):315–321.Google Scholar
  • Fothergill S, Mentis H, Kohli P, Nowozin S (2012) Instructing people for training gestural interactive systems. Konstan JA, Chi H, Hook K, eds. Proc. SIGCHI Conf. Human Factors Comput. Systems (Association for Computing Machinery, New York), 1737–1746.Google Scholar
  • Frazier PI (2018) Bayesian optimization. Gel E, Ntaimo L, eds. Recent Advances in Optimization and Modeling of Contemporary Problems. Tutorials in Operations Research (INFORMS, Catonsville, MD), 255–278.LinkGoogle Scholar
  • Fricker TE, Oakley JE, Urban NM (2013) Multivariate Gaussian process emulators with nonseparable covariance structures. Technometrics 55(1):47–56.Google Scholar
  • Garg S, Singh A, Ramos F (2012) Learning non-stationary space-time models for environmental monitoring. Proc. 26th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 288–294.Google Scholar
  • Gelfand AE, Schmidt AM, Banerjee S, Sirmans C (2004) Nonstationary multivariate process modeling through spatially varying coregionalization. Test 13(2):263–312.Google Scholar
  • George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88(423):881–889.Google Scholar
  • Goulard M, Voltz M (1992) Linear coregionalization model: Tools for estimation and choice of cross-variogram matrix. Math. Geology 24(3):269–286.Google Scholar
  • Gramacy RB (2020) Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences (CRC Press, Boca Raton, FL).Google Scholar
  • Gramacy RB, Lee HKH (2008) Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103(483):1119–1130.Google Scholar
  • Heaton MJ, Christensen WF, Terres MA (2017) Nonstationary Gaussian process models using spatial hierarchical clustering from finite differences. Technometrics 59(1):93–101.Google Scholar
  • Heinonen M, Mannerström H, Rousu J, Kaski S, Lähdesmäki H (2016) Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. Gretton A, Robert CC, eds. Proc. 19th Internat. Conf. Artificial Intelligence Statist., Proc. Machine Learn. Res., vol. 51 (JMLR.org, Cadiz, Spain), 732–740.Google Scholar
  • Hersbach H (2000) Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather Forecasting 15(5):559–570.Google Scholar
  • Hu Z, Wang C (2021) Nonlinear online multioutput Gaussian process for multistream data informatics. IEEE Trans. Indust. Informatics 18(6):3885–3893.Google Scholar
  • Huber F, Koop G, Onorante L (2021) Inducing sparsity and shrinkage in time-varying parameter models. J. Bus. Econom. Statist. 39(3):669–683.Google Scholar
  • Ishwaran H, Rao JS (2005) Spike and slab variable selection: Frequentist and Bayesian strategies. Ann. Statist. 33(2):730–773.Google Scholar
  • Kalli M, Griffin JE (2014) Time-varying sparsity in dynamic regression models. J. Econometrics 178(2):779–793.Google Scholar
  • Kingma DP, Ba JL (2015) Adam: A method for stochastic optimization. Bengio Y, LeCun Y, eds. Proc. 3rd Internat. Conf. Learn. Representation (ICLR, San Diego).Google Scholar
  • Ko J, Kim H (2022) Deep Gaussian process models for integrating multifidelity experiments with nonstationary relationships. IISE Trans. 54(7):686–698.Google Scholar
  • Kontar R, Zhou S, Sankavaram C, Du X, Zhang Y (2018) Nonparametric modeling and prognosis of condition monitoring signals using multivariate gaussian convolution processes. Technometrics 60(4):484–496.Google Scholar
  • Kuss M, Rasmussen C (2003) Gaussian processes in reinforcement learning. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems (NIPS 2003), vol. 16 (Whistler, British Columbia, Canada), 751–758.Google Scholar
  • Lee C, Wang K, Wu J, Cai W, Yue X (2023) Partitioned active learning for heterogeneous systems. J. Comput. Inform. Sci. Engrg. 23(4):041009-1–041009-11.Google Scholar
  • Liu Y, Gong C, Yang L, Chen Y (2020) DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Systems Appl. 143:113082-1–113082-12.Google Scholar
  • Liu Y, Wu H, Wang J, Long M (2022) Non-stationary transformers: Exploring the stationarity in time series forecasting. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Advances in Neural Information Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 9881–9893.Google Scholar
  • Matthews AGDG, van der Wilk M, Nickson T, Fujii K, Boukouvalas A, León-Villagrá P, Ghahramani Z, Hensman J (2017) GPflow: A Gaussian process library using TensorFlow. J. Machine Learn. Res. 18(40):1–6.Google Scholar
  • Meng R, Soper B, Lee HK, Liu VX, Greene JD, Ray P (2021) Nonstationary multivariate Gaussian processes for electronic health records. J. Biomedical Informatics 117:103698-1–103698-11.Google Scholar
  • Moore AW (1990) Efficient memory-based learning for robot control. Technical Report Number 209, University of Cambridge, Cambridge, UK.Google Scholar
  • Paciorek C, Schervish M (2003) Nonstationary covariance functions for Gaussian process regression. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems, vol. 16 (MIT Press, Cambridge, MA), 273–280.Google Scholar
  • Padakandla S, KJ P, Bhatnagar S (2020) Reinforcement learning algorithm for non-stationary environments. Appl. Intelligence 50(11):3590–3606.Google Scholar
  • Pan SJ, Yang Q (2009) A survey on transfer learning. IEEE Trans. Knowledge Data Engrg. 22(10):1345–1359.Google Scholar
  • Park C (2022) Jump Gaussian process model for estimating piecewise continuous regression functions. J. Machine Learn. Res. 23(278):1–37.Google Scholar
  • Paun I, Husmeier D, Torney CJ (2023) Stochastic variational inference for scalable non-stationary gaussian process regression. Statist. Comput. 33(2):44-1–44-21.Google Scholar
  • Rangapuram SS, Seeger MW, Gasthaus J, Stella L, Wang Y, Januschowski T (2018) Deep state space models for time series forecasting. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY), 7796–7805.Google Scholar
  • Ročková V, George EI (2014) EMVS: The EM approach to Bayesian variable selection. J. Amer. Statist. Assoc. 109(506):828–846.Google Scholar
  • Rockova V, McAlinn K (2021) Dynamic variable selection with spike-and-slab process priors. Bayesian Anal. 16(1):233–269.Google Scholar
  • Rodrigues F, Henrickson K, Pereira FC (2019) Multi-output Gaussian processes for crowdsourced traffic data imputation. IEEE Trans. Intelligent Transportation Systems 20(2):594–603.Google Scholar
  • Scheipl F, Fahrmeir L, Kneib T (2012) Spike-and-slab priors for function selection in structured additive regression models. J. Amer. Statist. Assoc. 107(500):1518–1532.Google Scholar
  • Shand L, Li B (2017) Modeling nonstationarity in space and time. Biometrics 73(3):759–768.Google Scholar
  • Shen B, Gnanasambandam R, Wang R, Kong ZJ (2023) Multi-task Gaussian process upper confidence bound for hyperparameter tuning and its application for simulation studies of additive manufacturing. IISE Trans. 55(5):496–508.Google Scholar
  • Stathopoulos A, Karlaftis MG (2003) A multivariate state space approach for urban traffic flow modeling and prediction. Transportation Res. Part C Emerging Tech. 11(2):121–135.Google Scholar
  • Ton JF, Flaxman S, Sejdinovic D, Bhatt S (2018) Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statist. 28:59–78.Google Scholar
  • Verstraeten T, Libin PJ, Nowé A (2020) Fleet control using coregionalized Gaussian process policy iteration. De Giacomo G, Catala A, Dilkina B, Milano M, Barro S, Bugarin A, Lang J, eds. Proc. 24th Eur. Conf. Artificial Intelligence ECAI 2020 (IOS Press, Amsterdam), 1571–1578.Google Scholar
  • Wang S, Cao J, Philip SY (2020b) Deep learning for spatio-temporal data mining: A survey. IEEE Trans. Knowledge Data Engrg. 34(8):3681–3700.Google Scholar
  • Wang K, Hamelijnck O, Damoulas T, Steel M (2020a) Non-separable non-stationary random fields. Daume H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., Proc. Machine Learn. Res., vol. 119 (JMLR.org), 9887–9897.Google Scholar
  • Wang X, Wang C, Song X, Kirby L, Wu J (2022) Regularized multi-output Gaussian convolution process with domain adaptation. IEEE Trans. Pattern Anal. Machine Intelligence 45(5):6142–6156.Google Scholar
  • Wang Y, Zhang J, Zhu H, Long M, Wang J, Yu PS (2019) Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. Davis L, Torr P, Zhu SC, eds. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE Computer Society, Washington, DC), 9154–9162.Google Scholar
  • Wen Q, Zhou T, Zhang C, Chen W, Ma Z, Yan J, Sun L (2023) Transformers in time series: A survey. Proc. 32nd Internat. Joint Conf. Artificial Intelligence (IJCAI, Macao, SAR), 6778–6786.Google Scholar
  • Williams CK, Rasmussen CE (2006) Gaussian Processes for Machine Learning, vol. 2 (MIT Press, Cambridge, MA).Google Scholar
  • Xia D, Wang B, Li H, Li Y, Zhang Z (2016) A distributed spatial–temporal weighted model on MapReduce for short-term traffic flow forecasting. Neurocomputing 179:246–263.Google Scholar
  • Xu R, Wu J, Yue X, Li Y (2022) Online structural change-point detection of high-dimensional streaming data via dynamic sparse subspace learning. Technometrics 65(1):19–32.Google Scholar
  • Yang R, Xu H, Wu Y, Wang X (2020) Multi-task reinforcement learning with soft modularization. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. NIPS ‘20 Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 4767–4777.Google Scholar
  • Ye R, Dai Q (2021) Implementing transfer learning across different datasets for time series forecasting. Pattern Recognition 109:106717-1–106717-12.Google Scholar
  • Yoon H, Li J (2018) A novel positive transfer learning approach for telemonitoring of Parkinson’s disease. IEEE Trans. Automation Sci. Engrg. 16(1):180–191.Google Scholar
  • Yun S, Zhang X, Li B (2022) Detection of local differences in spatial characteristics between two spatiotemporal random fields. J. Amer. Statist. Assoc. 117(537):291–306.Google Scholar
  • Zhang Y, Yang Q (2021) A survey on multi-task learning. IEEE Trans. Knowledge Data Engrg. 34(12):5586–5609.Google Scholar
  • Zhang L, Wang K, Chen N (2016) Monitoring wafers’ geometric quality using an additive gaussian process model. IIE Trans. 48(1):1–15.Google Scholar
  • Zhang C, Yan H, Lee S, Shi J (2021) Dynamic multivariate functional data modeling via sparse subspace learning. Technometrics 63(3):370–383.Google Scholar
  • Zhu C, Byrd RH, Lu P, Nocedal J (1997) Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Software 23(4):550–560.Google Scholar
  • Zhu Z, Lin K, Jain AK, Zhou J (2023) Transfer learning in deep reinforcement learning: A survey. IEEE Trans. Pattern Anal. Machine Intelligence 45(11):13344–13362.Google Scholar
  • Zou Z, Careem M, Dutta A, Thawdar N (2023) Joint spatio-temporal precoding for practical non-stationary wireless channels. IEEE Trans. Comm. 71(4):2396–2409.Google Scholar